There are already some excellent answers to this question, but I want to answer why the standard error is what it is, why we use p=0.5 as the worst case, and how the standard error varies with n.
Suppose we take a poll of just one voter, let's call him or her voter 1, and ask "will you vote for the Purple Party?" We can code the answer as 1 for "yes" and 0 for "no". Let's say that the probability of a "yes" is $p$. We now have a binary random variable $X_1$ which is 1 with probability $p$ and 0 with probability $1-p$. We say that $X_1$ is a Bernoulli variable with probability of success $p$, which we can write $X_1 \sim \text{Bernoulli}(p)$. The expected, or mean, value of $X_1$ is given by $E(X_1) = \sum_x x P(X_1 = x)$ where we sum over all possible outcomes $x$ of $X_1$. But there are only two outcomes, 0 with probability $1-p$ and 1 with probability $p$, so the sum is just $E(X_1) = 0(1-p) + 1(p) = p$. Stop and think. This actually looks completely reasonable - if there is a 30% chance of voter 1 supporting the Purple Party, and we've coded the variable to be 1 if they say "yes" and 0 if they say "no", then we'd expect $X_1$ to be 0.3 on average.
Let's think about what happens when we square $X_1$. If $X_1 = 0$ then $X_1^2 = 0$ and if $X_1 = 1$ then $X_1^2 = 1$. So in fact $X_1^2 = X_1$ in either case. Since they are the same, they must have the same expected value, so $E(X_1^2) = p$. This gives me an easy way of calculating the variance of a Bernoulli variable: I use $\text{Var}(X_1) = E(X_1^2) - E(X_1)^2 = p - p^2 = p(1-p)$ and so the standard deviation is $\sigma_{X_1} = \sqrt{p(1-p)}$.
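As a quick check of the mean and variance just derived, here is a minimal Python sketch that sums over the two outcomes directly. The function name `bernoulli_moments` is my own invention for illustration, and the value 0.3 is just the example probability used above.

```python
def bernoulli_moments(p):
    """Return (mean, variance) of a Bernoulli(p) variable.

    Hypothetical helper: it sums over the two possible outcomes,
    exactly as in the derivation, rather than using a library.
    """
    outcomes = [(0, 1 - p), (1, p)]  # (value, probability) pairs
    mean = sum(x * prob for x, prob in outcomes)          # E(X)
    mean_sq = sum(x ** 2 * prob for x, prob in outcomes)  # E(X^2)
    return mean, mean_sq - mean ** 2                      # Var = E(X^2) - E(X)^2


mean, var = bernoulli_moments(0.3)
print(mean, var)  # should agree with p and p(1-p), up to floating-point rounding
```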
Obviously I want to talk to other voters - let's call them voter 2, voter 3, through to voter $n$. Let's assume they all have the same probability $p$ of supporting the Purple Party. Now we have $n$ Bernoulli variables, $X_1$, $X_2$ through to $X_n$, with each $X_i \sim \text{Bernoulli}(p)$ for $i$ from 1 to $n$. They all have the same mean, $p$, and variance, $p(1-p)$.
I'd like to find how many people in my sample said "yes", and to do that I can just add up all the $X_i$. I'll write $X = \sum_{i=1}^{n} X_i$. I can calculate the mean or expected value of $X$ by using the rule that $E(X+Y) = E(X) + E(Y)$ if those expectations exist, and extending that to $E(X_1 + X_2 + \ldots + X_n) = E(X_1) + E(X_2) + \ldots + E(X_n)$. But I am adding up $n$ of those expectations, and each is $p$, so I get in total that $E(X) = np$. Stop and think. If I poll 200 people and each has a 30% chance of saying they support the Purple Party, of course I'd expect 0.3 x 200 = 60 people to say "yes". So the $np$ formula looks right. Less "obvious" is how to handle the variance.
There is a rule that says

$$\text{Var}(X_1 + X_2 + \ldots + X_n) = \text{Var}(X_1) + \text{Var}(X_2) + \ldots + \text{Var}(X_n)$$

but I can only use it if my random variables are independent of each other. So fine, let's make that assumption, and by a similar logic to before I can see that $\text{Var}(X) = np(1-p)$. If a variable $X$ is the sum of $n$ independent Bernoulli trials, with identical probability of success $p$, then we say that $X$ has a binomial distribution, $X \sim \text{Binomial}(n, p)$. We have just shown that the mean of such a binomial distribution is $np$ and the variance is $np(1-p)$.
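If you'd rather see those two formulas than take them on trust, a small simulation can help. The sketch below is only an illustration under my own assumptions: it uses Python's standard `random` module with a fixed seed, and the poll size (200), probability (0.3) and number of repeated polls (5000) are arbitrary choices.

```python
import random
import statistics


def simulate_yes_counts(n, p, trials, seed=0):
    """Simulate `trials` polls of n independent voters, each saying "yes"
    with probability p, and return the number of "yes" answers in each poll.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]


n, p = 200, 0.3
counts = simulate_yes_counts(n, p, trials=5000)

sim_mean = statistics.mean(counts)
sim_var = statistics.pvariance(counts)

# Compare simulated values with the theoretical np and np(1-p).
print("mean:    ", sim_mean, "vs", n * p)
print("variance:", sim_var, "vs", n * p * (1 - p))
```

The simulated figures will not match exactly, since they are themselves estimates, but they should land close to 60 and 42.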
Our original problem was how to estimate $p$ from the sample. The sensible way to define our estimator is $\hat{p} = X/n$. For instance, if 64 out of our sample of 200 people said "yes", we'd estimate that 64/200 = 0.32 = 32% of people say they support the Purple Party. You can see that $\hat{p}$ is a "scaled-down" version of our total number of yes-voters, $X$. That means it is still a random variable, but no longer follows the binomial distribution. We can find its mean and variance, because when we scale a random variable by a constant factor $k$ it obeys the following rules: $E(kX) = kE(X)$ (so the mean scales by the same factor $k$) and $\text{Var}(kX) = k^2 \text{Var}(X)$. Note how variance scales by $k^2$. That makes sense when you know that, in general, the variance is measured in the square of whatever units the variable is measured in: not so applicable here, but if our random variable had been a height in cm then the variance would be in cm$^2$, which scales differently - if you double lengths, you quadruple area.
Here our scale factor is $\frac{1}{n}$. This gives us $E(\hat{p}) = \frac{1}{n}E(X) = \frac{np}{n} = p$. This is great! On average, our estimator $\hat{p}$ is exactly what it "should" be, the true (or population) probability that a random voter says that they will vote for the Purple Party. We say that our estimator is unbiased. But while it is correct on average, sometimes it will be too low, and sometimes too high. We can see just how wrong it is likely to be by looking at its variance. $\text{Var}(\hat{p}) = \frac{1}{n^2}\text{Var}(X) = \frac{np(1-p)}{n^2} = \frac{p(1-p)}{n}$. The standard deviation is the square root, $\sqrt{\frac{p(1-p)}{n}}$, and because it gives us a grasp of how badly our estimator will be off (it is effectively a root mean square error, a way of calculating the average error that treats positive and negative errors as equally bad, by squaring them before averaging out), it is usually called the standard error. A good rule of thumb, which works well for large samples and which can be dealt with more rigorously using the famous Central Limit Theorem, is that most of the time (about 95%) the estimate will be wrong by less than two standard errors.
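The "two standard errors" rule of thumb is easy to test by simulation. The following is a rough sketch rather than anything rigorous: the helper names, the seed, and the choice of 200 voters with a true proportion of 0.3 are all my own assumptions for illustration.

```python
import math
import random


def standard_error(p, n):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


def coverage_within_two_se(p, n, trials, seed=1):
    """Fraction of simulated polls whose estimate lands within two
    standard errors of the true proportion p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    se = standard_error(p, n)
    hits = 0
    for _ in range(trials):
        p_hat = sum(rng.random() < p for _ in range(n)) / n  # one poll's estimate
        if abs(p_hat - p) < 2 * se:
            hits += 1
    return hits / trials


coverage = coverage_within_two_se(p=0.3, n=200, trials=5000)
print("standard error:", standard_error(0.3, 200))
print("share of polls within two standard errors:", coverage)
```

For a sample of this size the share should come out in the region of 95%, though not exactly, because the binomial distribution is discrete and the normal approximation is only approximate.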
Since it appears in the denominator of the fraction, higher values of n - bigger samples - make the standard error smaller. That is great news, as if I want a small standard error I just make the sample size big enough. The bad news is that n is inside a square root, so if I quadruple the sample size, I will only halve the standard error. Very small standard errors are going to involve very very large, hence expensive, samples. There's another problem: if I want to target a particular standard error, say 1%, then I need to know what value of p to use in my calculation. I might use historic values if I have past polling data, but I would like to prepare for the worst possible case. Which value of p is most problematic? A graph is instructive.
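To make the square-root penalty concrete, here is a minimal sketch that tabulates the standard error as the sample size is repeatedly quadrupled. The helper name and the particular sample sizes are illustrative assumptions on my part, and I have used $p = 0.5$ purely as an example value.

```python
import math


def standard_error(p, n):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


# Each sample size is four times the previous one.
sample_sizes = [100, 400, 1600, 6400]
errors = [standard_error(0.5, n) for n in sample_sizes]

for n, se in zip(sample_sizes, errors):
    print(f"n = {n:5d}  standard error = {se:.5f}")

# Ratio of each standard error to the previous one; each should be about 0.5.
ratios = [later / earlier for earlier, later in zip(errors, errors[1:])]
print(ratios)
```

Going from 100 to 6400 respondents is a 64-fold increase in cost, yet it only cuts the standard error by a factor of 8.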
The worst-case (highest) standard error will occur when p=0.5. To prove that I could use calculus, but some high school algebra will do the trick, so long as I know how to "complete the square".
$$\sqrt{p(1-p)} = \sqrt{p - p^2} = \sqrt{\frac{1}{4} - \left(p^2 - p + \frac{1}{4}\right)} = \sqrt{\frac{1}{4} - \left(p - \frac{1}{2}\right)^2}$$
The expression in the brackets is squared, so it will always return a zero or positive answer, which then gets taken away from a quarter. In the worst case (largest standard error) as little as possible gets taken away. I know the least that can be subtracted is zero, and that will occur when $p - \frac{1}{2} = 0$, so when $p = \frac{1}{2}$. The upshot of this is that I get bigger standard errors when trying to estimate support for e.g. political parties near 50% of the vote, and lower standard errors for estimating support for propositions which are substantially more or substantially less popular than that. In fact the symmetry of my graph and equation shows me that I would get the same standard error for my estimates of support of the Purple Party, whether they had 30% popular support or 70%.
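If you prefer a numerical check to algebra, a crude grid search gives the same answer. This is a sketch under my own assumptions: the grid spacing of 0.01 is arbitrary, and `sd_single_voter` is a name I made up for $\sqrt{p(1-p)}$.

```python
import math


def sd_single_voter(p):
    """sqrt(p(1-p)): the standard error up to the 1/sqrt(n) factor."""
    return math.sqrt(p * (1 - p))


# Evaluate on a grid of proportions from 0.00 to 1.00 in steps of 0.01.
grid = [i / 100 for i in range(101)]
worst_p = max(grid, key=sd_single_voter)

print("worst-case p on the grid:", worst_p)
print("value there:", sd_single_voter(worst_p))

# Symmetry about p = 0.5: 30% and 70% support give the same value.
print(sd_single_voter(0.3), sd_single_voter(0.7))
```

A grid search only confirms the result at the points it tries, of course; the completed square is what proves it for every $p$.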
So how many people do I need to poll to keep the standard error below 1%? This would mean that, the vast majority of the time, my estimate will be within 2% of the correct proportion. I now know that the worst case standard error is $\sqrt{\frac{0.25}{n}} = \frac{0.5}{\sqrt{n}} < 0.01$ which gives me $\sqrt{n} > 50$ and so $n > 2500$. That would explain why you see polling figures in the thousands.
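The same rearrangement works for any target standard error and any assumed $p$: solving $\sqrt{p(1-p)/n} \le \text{target}$ for $n$ gives $n \ge p(1-p)/\text{target}^2$. Here is a minimal sketch of that calculation; the function name and the default of $p = 0.5$ are my own choices, not a standard API.

```python
import math


def required_sample_size(target_se, p=0.5):
    """Smallest whole n with sqrt(p(1-p)/n) at or below target_se,
    up to floating-point rounding. Defaults to the worst case p = 0.5."""
    return math.ceil(p * (1 - p) / target_se ** 2)


def standard_error(p, n):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)


n_worst = required_sample_size(0.01)            # worst case, p = 0.5
n_historic = required_sample_size(0.01, p=0.3)  # if past polls suggested about 30%

print("worst case:", n_worst, "-> standard error", standard_error(0.5, n_worst))
print("p = 0.3:   ", n_historic, "-> standard error", standard_error(0.3, n_historic))
```

Planning for $p = 0.5$ asks for somewhat more respondents than planning for $p = 0.3$ would, which is the price of being prepared for the worst case.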
In reality, a low standard error is not a guarantee of a good estimate. Many problems in polling are of a practical rather than theoretical nature. For instance, I assumed that the sample was of random voters, each with the same probability $p$, but taking a "random" sample in real life is fraught with difficulty. You might try telephone or online polling - but not only does not everybody have a phone or internet access, but those who don't may have very different demographics (and voting intentions) from those who do. To avoid introducing bias to their results, polling firms actually do all kinds of complicated weighting of their samples, not the simple average $\frac{\sum X_i}{n}$ that I took. Also, people lie to pollsters! The different ways that pollsters have compensated for this possibility are, obviously, controversial. You can see a variety of approaches in how polling firms have dealt with the so-called Shy Tory Factor in the UK. One method of correction involved looking at how people voted in the past to judge how plausible their claimed voting intention is, but it turns out that even when they're not lying, many voters simply fail to remember their electoral history. When you've got this stuff going on, there's frankly very little point getting the "standard error" down to 0.00001%.
To finish, here are some graphs showing how the required sample size - according to my simplistic analysis - is influenced by the desired standard error, and how bad the "worst case" value of $p = 0.5$ is compared to the more amenable proportions. Remember that the curve for $p = 0.7$ would be identical to the one for $p = 0.3$ due to the symmetry of the earlier graph of $\sqrt{p(1-p)}$.
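The relationship those graphs describe can also be tabulated. The sketch below is my own illustration, using the same simplistic formula $n \ge p(1-p)/\text{SE}^2$; the particular target standard errors and proportions are arbitrary choices, not values read off the graphs.

```python
import math


def required_sample_size(p, target_se):
    """Smallest whole n with sqrt(p(1-p)/n) at or below target_se,
    up to floating-point rounding."""
    return math.ceil(p * (1 - p) / target_se ** 2)


target_errors = [0.05, 0.02, 0.01]  # desired standard errors (illustrative)
proportions = [0.1, 0.3, 0.5]       # 0.7 and 0.9 mirror 0.3 and 0.1

# table[se][p] = required sample size for that target and proportion
table = {
    se: {p: required_sample_size(p, se) for p in proportions}
    for se in target_errors
}

for se, row in table.items():
    cells = ", ".join(f"p={p}: n={n}" for p, n in row.items())
    print(f"target SE {se:.2f} -> {cells}")
```

Reading across a row shows how much more demanding $p = 0.5$ is than the other proportions; reading down a column shows how quickly the required sample grows as the target standard error shrinks.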