Harry Enten noted a few weeks ago that Biden is doing worse than Clinton among Hispanic voters. I wanted to use this as a pretext to investigate some of the probability-theoretic aspects of polling.
We'll specifically be building a series of toy models, progressively exploring aspects of polling a composite heterogeneous demographic like Hispanic Americans. These are idealizations which can be developed further into more accurate models, but even the approximations inform us about aspects seldom considered.
The basic game plan (the tl;dr version): we treat sampling without replacement as a hypergeometric distribution, then estimate the number of "successes" in the whole population using both the maximum likelihood point estimate and a Bayesian conjugate prior. In section 2, we move on to consider multiple hypergeometric distributions pooled together (analogous to striped and solid balls of different colors) and apply the same methods to this more general setting.
Assumption 1. The polls are ideal, representative samples of the populations.
A follow-up post will discuss sampling methods and how they affect poll results.
Base Case
The floor model of polling amounts to treating respondents as balls in an urn. There are three types of balls (red, white, and blue, for America... err, I mean, for Republican, undecided, and Democrat leaning voters) and we assume the balls do not change color. For our interests, Biden's polling, there are B blue balls out of N balls in total (blue and non-blue alike).
We sample n balls without replacement, meaning: we take a ball out of the urn, inspect its color, make note of it, then set the ball aside (we do not return it to the urn or replace it), and repeat this process n times. We then report b, the total number of blue balls drawn from the urn.
Exercise 1. What is the probability of drawing \(b\) blue balls given \(N, B, n\)?
There are a number of ways to derive the solution. Counting equally likely outcomes, there are \[\binom{N}{n} = \frac{N!}{n!(N-n)!}\tag{1a}\] ways to draw n balls from the urn containing N balls without putting the sampled balls back. (We read the left-hand side of Eq (1a) as "N choose n".) This is the denominator of the probability.
The numerator is the product of the number of ways to draw b blue balls from the B available, and the number of ways to draw \(n-b\) non-blue balls from the population of \(N-B\) non-blue balls. This gives us the answer \[\Pr(b\mid N, B, n) = \frac{\binom{B}{b}\binom{N-B}{n-b}}{\binom{N}{n}}.\tag{1b}\] Readers familiar with probability will recognize this as the probability mass function for the Hypergeometric distribution.
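As a quick sanity check on Eq (1b), here is a minimal sketch that evaluates the probability mass function using only Python's standard library. The function name `hypergeom_pmf` and the toy urn (20 balls, 8 of them blue, 5 draws) are illustrative choices, not something taken from the polls.

```python
from math import comb


def hypergeom_pmf(b: int, N: int, B: int, n: int) -> float:
    """Pr(b | N, B, n) from Eq (1b): b blue balls in n draws without replacement."""
    # Outside the support the probability is zero (and math.comb rejects negatives).
    if b < max(0, n - (N - B)) or b > min(n, B):
        return 0.0
    return comb(B, b) * comb(N - B, n - b) / comb(N, n)


# Toy urn (hypothetical numbers): 20 balls, 8 blue, draw 5.
N, B, n = 20, 8, 5
pmf = [hypergeom_pmf(b, N, B, n) for b in range(n + 1)]
for b, prob in enumerate(pmf):
    print(f"Pr(b={b}) = {prob:.4f}")
print("total probability:", sum(pmf))
```

The probabilities over all possible values of b should add up to 1, which is a cheap way to catch an off-by-one in the binomial coefficients.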
Exercise 2. What are the expected value and variance of the hypergeometric distribution?
This is a problem the reader should work out on their own. One trick is to introduce \(p=B/N\) as the proportion of the population which is blue. For \(N\to\infty\) with \(p\) fixed, the variance of a hypergeometrically distributed random variable \(X \sim \operatorname{Hypergeometric}(N, B, n)\) should approach the binomial variance \(\operatorname{Var}[X]\to np(1-p)\).
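Without giving away the closed form, the limiting claim can be checked numerically. The sketch below computes the mean and variance directly from the probability mass function and compares the variance against \(np(1-p)\) as \(N\) grows with \(p\) held fixed. The helper name `hypergeom_moments` and the choice \(n=10\), \(p=0.6\) are assumptions made for illustration.

```python
from math import comb


def hypergeom_moments(N: int, B: int, n: int) -> tuple[float, float]:
    """Mean and variance of Hypergeometric(N, B, n), summed directly from the pmf."""
    denom = comb(N, n)
    lo, hi = max(0, n - (N - B)), min(n, B)
    probs = {b: comb(B, b) * comb(N - B, n - b) / denom for b in range(lo, hi + 1)}
    mean = sum(b * q for b, q in probs.items())
    var = sum((b - mean) ** 2 * q for b, q in probs.items())
    return mean, var


n = 10
binomial_var = n * 0.6 * 0.4  # np(1-p) with p = 0.6
for N in (20, 200, 2000, 20000):
    B = 6 * N // 10  # keeps p = B/N = 0.6 exactly
    mean, var = hypergeom_moments(N, B, n)
    print(f"N={N:>6}  mean={mean:.4f}  var={var:.4f}  np(1-p)={binomial_var:.4f}")
```

The printed variances creep up toward the binomial value from below, which hints at the shape of the answer to the exercise.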
Exercise 3. How can we estimate \(B\) given \(N\), \(n\), and an observed \(b\)?
One way to solve this is to find the value of \(B\) which maximizes the probability \(\Pr(b\mid N, B, n)\), treating it as a function of \(B\) with \(b\) fixed at its observed value. This gives us approximately \(B\approx (N/n)b\).
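To make the "approximately" concrete: the standard exact maximizer for the hypergeometric likelihood is \(\lfloor (N+1)b/n \rfloor\), which follows from checking when the ratio of consecutive likelihoods drops below 1. The sketch below compares that closed form against a brute-force search on a small urn. The function names and the small test urn are hypothetical; the large example uses the figures from Exercise 4 of this post.

```python
from math import comb


def mle_closed_form(N: int, n: int, b: int) -> int:
    """Value of B maximizing Pr(b | N, B, n): floor((N + 1) * b / n)."""
    # Clamp so that at least n - b non-blue balls remain in the urn.
    return min(N - (n - b), (N + 1) * b // n)


def mle_brute_force(N: int, n: int, b: int) -> int:
    """Scan every feasible B; only sensible for small N."""
    feasible = range(b, N - (n - b) + 1)
    # The denominator comb(N, n) does not depend on B, so it can be dropped.
    return max(feasible, key=lambda B: comb(B, b) * comb(N - B, n - b))


print(mle_brute_force(50, 10, 3), mle_closed_form(50, 10, 3))
print(mle_closed_form(40_000_000, 273, 161), round(40_000_000 * 161 / 273))
```

When \((N+1)b/n\) happens to be an integer, two neighboring values of \(B\) tie for the maximum; the closed form above reports the larger of the two.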
Bayesian Estimates
A more Bayesian approach would use the conjugate prior for the Hypergeometric distribution, i.e., the Beta-Binomial distribution: we describe the unknown count as a random variable \(B\sim\operatorname{BetaBin}(N, \alpha, \beta)\), where \(\alpha\) and \(\beta\) encode our prior belief about the relative frequency of blue and non-blue balls respectively (an uninformed guess would be \(\alpha=\beta=1\)).
After observing \(b\) blue balls in a sample of \(n\) draws, the estimate is updated to \(B-b\sim\operatorname{BetaBin}(N-n, \alpha+b, \beta+n-b)\). As more samples come in, we tally up the number of blue balls seen in all of them and treat the total as if it came from a single sample. (That's the property of being a conjugate prior.)
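The update rule can be verified by brute force on a small urn: multiply the Beta-Binomial prior by the hypergeometric likelihood for every possible \(B\), normalize, and compare with the closed-form posterior. This is only a sketch; the function names and the small test urn (30 balls, 8 drawn, 5 blue) are made up for illustration.

```python
from math import comb, exp, lgamma


def log_beta(x: float, y: float) -> float:
    """Logarithm of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def betabin_pmf(k: int, m: int, alpha: float, beta: float) -> float:
    """Pr(K = k) for K ~ BetaBin(m, alpha, beta)."""
    return comb(m, k) * exp(log_beta(k + alpha, m - k + beta) - log_beta(alpha, beta))


def conjugacy_gap(N: int, n: int, b: int, alpha: float, beta: float) -> float:
    """Largest difference between the brute-force and closed-form posteriors."""
    # Brute force: prior times hypergeometric likelihood, then normalize.
    weights = [
        betabin_pmf(B, N, alpha, beta) * comb(B, b) * comb(N - B, n - b) / comb(N, n)
        for B in range(N + 1)
    ]
    total = sum(weights)
    brute = [w / total for w in weights]
    # Closed form: B - b ~ BetaBin(N - n, alpha + b, beta + n - b).
    closed = [
        betabin_pmf(B - b, N - n, alpha + b, beta + n - b)
        if b <= B <= N - (n - b) else 0.0
        for B in range(N + 1)
    ]
    return max(abs(x - y) for x, y in zip(brute, closed))


print(conjugacy_gap(30, 8, 5, 1.0, 1.0))   # uniform prior
print(conjugacy_gap(30, 8, 5, 2.0, 3.5))   # an arbitrary informative prior
```

Both gaps should come out at the level of floating-point noise.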
Exercise 4. Given \(N = 40\times 10^{6}\), \(n = 273\), and \(b = 161\), estimate \(B\).
These numbers are not pulled out of thin air (though it might seem that way). Roughly 2/3 of any demographic group is of voting age, and there are approximately 60m Hispanic Americans (according to the ACS 1-year estimate, table B03001), which means about 40m are of voting age. In a recent poll, the New York Times found 87+74 = 161 Hispanics supporting Biden out of 273 Hispanics polled.
In a separate-though-related poll, the New York Times reported that 64% of Hispanics support Biden.
Using the first set of polls from battleground states, we can produce the following estimate for the number of Biden-supporting Hispanic Americans (with the quantiles at 5%, 95%, and the expected value indicated with vertical lines):
Estimates using the New York Times Battleground polls and N = 40m. The quantiles are at 21,619,245 and 25,530,281 for the 5% and 95% respectively, with the expected value at 23,589,744.
Note this corresponds to 58.97% ± 4.9% support. The wide margins stem from a lack of data. Still, we can make an informative statement from this little data and crude model: Harry Enten noted Clinton led by 61% to 23% among Hispanics pre-election, and 61% lies within the credible interval we just constructed. In other words, Biden is doing alright among Hispanic voters, unless the few polls we have used were skewed or biased.
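For readers who want to reproduce the headline figures, here is a rough sketch. It leans on two assumptions that are mine rather than the post's: since \(N\) is enormous compared to \(n\), the proportion \(B/N\) is treated as approximately Beta-distributed, and the 90% interval is taken from a normal approximation to that Beta distribution instead of exact quantiles. Note also that 58.97% equals 161/273, which corresponds to pseudo-counts \(\alpha=b\), \(\beta=n-b\), the same parametrization as \(\operatorname{BetaBin}(N,b,n-b)\) in Exercise 8. The function name `beta_summary` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def beta_summary(a: float, b: float, level: float = 0.90) -> tuple[float, float]:
    """Mean and approximate half-width of a central interval for Beta(a, b)."""
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for a 90% interval
    return mean, z * sd  # normal approximation, not exact Beta quantiles


N = 40_000_000
for name, supporters in (("Biden", 161), ("Trump", 75)):
    mean, half = beta_summary(supporters, 273 - supporters)
    print(f"{name}: {mean:.2%} ± {half:.1%}, i.e. roughly {mean * N:,.0f} people")
```

With so few respondents the normal approximation is crude in the tails, so the endpoints it produces will not match exact Beta-Binomial quantiles digit for digit.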
Exercise 5. Perform the same analysis with Trump's numbers among Hispanics. Trump has \(t=75\) respondents supporting him and \(n-t=198\) not supporting him. [Spoiler: 27.47% ± 4.4%]
Heterogeneity
This model fails for the simple reason that people are not balls in an urn. No demographic is a homogeneous blob, so how can we start to introduce heterogeneity?
One source of data that helps is a poll Telemundo conducted back in March 2020 specifically concerning the Latino vote in Florida and Arizona. We can now examine how Cuban-Americans poll compared to other Hispanic Americans.
Let's isolate the interesting aspects from a probabilistic perspective. We will consider balls of k different colors, identical physically except for their color, which have counts \(N_{1}, \dots, N_{k}\) (there are \(N_{j}\) balls of color \(j\)). Suppose further that for each color \(j\) there are \(K_{j}\in\{0,1,\dots,N_{j}\}\) balls which are "striped" and \(N_{j}-K_{j}\) balls which are "solid" (think: billiards). The striped balls are analogous to our candidate's supporters; the solid balls do not support our candidate.
We sometimes have information about the number of striped balls drawn of each color, though more often polls report the number of striped balls drawn without reference to color.
The meta-question guiding us here (i.e., the really interesting thing we're using to guide the exercises and worked examples) is this: when a multivariate hypergeometric distribution ("many colored balls") is collapsed into a univariate hypergeometric distribution ("solids and stripes"), under what circumstances does the underlying multivariate structure distort the univariate picture?
For concrete numbers, consider the following table:
Color | Number Striped | Number Solid | Total Number Balls (N) |
---|---|---|---|
\(C_{1}\) | 64 | 175 | 1,575,667 |
\(C_{2}\) | 112 | 26 | 3,860,969 |
\(C_{3}\) | 374 | 125 | 24,657,774 |
\(C_{4}\) | 161 | 112 | ??? |
Total | 711 | 438 | 39,842,421 |
If we run an exact Binomial test of each sample against \(p\approx 711/1149\), only \(C_{4}\) fails to differ significantly at the \(\alpha = 0.05\) level (the p-values for the first three colors are on the order of \(10^{-10}\) or so, for those curious). The keen reader will realize this is because \(C_{4}\) is just the data from Exercise 4, i.e., a pooled sample.
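Here is one way such a test could be run with nothing but the standard library. It is a sketch under an assumption: the two-sided p-value is defined as the total probability of all outcomes no more likely than the observed one, which is the convention used by R's `binom.test`; the post does not say which convention was used, so the exact p-values may differ from the ones quoted. The function names are hypothetical.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Pr(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binom_test(k: int, n: int, p: float) -> float:
    """Two-sided exact p-value: mass of outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    # Small relative tolerance so floating-point ties count as "equally likely".
    cutoff = observed * (1 + 1e-7)
    tail = sum(q for q in (binom_pmf(i, n, p) for i in range(n + 1)) if q <= cutoff)
    return min(1.0, tail)


# (striped, solid) counts from the table above.
samples = {"C1": (64, 175), "C2": (112, 26), "C3": (374, 125), "C4": (161, 112)}
p0 = 711 / 1149
p_values = {
    color: exact_binom_test(striped, striped + solid, p0)
    for color, (striped, solid) in samples.items()
}
for color, p_value in p_values.items():
    verdict = "differs" if p_value < 0.05 else "does not differ"
    print(f"{color}: p-value = {p_value:.3g} ({verdict} at the 0.05 level)")
```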
The index of dispersion for this data, i.e., the ratio of the sample variance of the striped counts to their sample mean, is approximately 105.1228 ≫ 1, which indicates the data is quite overdispersed. None of this should be surprising, since the samples come from a quite heterogeneous population.
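The arithmetic behind that figure is short enough to spell out. This sketch uses the four striped counts from the table and the unbiased (n − 1) sample variance, which is what reproduces the quoted value; the variable names are my own.

```python
from statistics import mean, variance

# Striped counts for C1..C4 from the table above.
striped = [64, 112, 374, 161]

# statistics.variance is the sample variance with the n - 1 denominator.
index_of_dispersion = variance(striped) / mean(striped)
print(f"mean = {mean(striped)}, variance = {variance(striped):.2f}")
print(f"index of dispersion = {index_of_dispersion:.4f}")
```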
Exercise 6. How can we estimate the number of striped balls \(K_{j}\) for each color \(j\)?
Isn't this a repeat of Exercise 3, but with slightly different data? Arguably, yes. (That's why it was Exercise 3: it's now a tool in our toolkit!) The maximum likelihood estimator suggests that 61.88% of all balls are striped. Let's see if the Bayesian approach will be as informative with the samples drawn.
If we tried to estimate the proportion of the colors which are striped with 90% credible intervals, Bayesian methods tell us for \(C_{1}\) we'd expect 26.77% ± 4.5%, for \(C_{2}\) we'd expect 81.16% ± 5.4% striped, and for \(C_{3}\) we'd expect between 69.% and 78.1% (centered around 74.95%) striped. We can plot the densities for the proportions \(C_{1}\) in red, \(C_{2}\) in blue, \(C_{3}\) in green:
Estimates using the figures given above, divided through by the total number of balls to estimate the proportion of balls striped.
The maximum likelihood estimates tell us \(C_{1}\) has about 26.78% striped, \(C_{2}\) has about 81.1% striped, and \(C_{3}\) has 74.95% striped. These differ from Bayesian estimates around the 8th digit after the decimal place.
Exercise 7. How do these Bayesian credibility intervals compare to the confidence intervals around the most likely number of striped balls?
Exercise 8. Using the values of \(b=711\), \(n-b=438\), and \(N=39842421\), compute the credibility interval for the conjugate prior \(B\sim\operatorname{BetaBin}(N,b,n-b)\). How does it compare to our results from the first part of this blog entry? How does it compare to the estimates for individual subpopulations? [Answer: the credibility interval for \(B/N\) with 90% of the probability is about 61.88% ± 2.35%]
An observation that we can make: balls colored \(C_{1}\) seem to be significantly different from the other balls, and bundling them in with the others skews the overall estimate.
Concluding Remarks
So far, we have assumed a perfect sampling method, and have attempted to extract information from the samples given. We've used one basic trick. Working through the exercises drills it into the reader's mind.
Next time, we will discuss sampling methods (including nonresponse error) and how they impact the sample reported. Although this affects polling results, viewed differently, it serves as a model for voter turnout as well.
If there were more time...
This post has gone on long enough. Had I more time, I would have discussed posterior predictive checks of the Bayesian models we've set up. (Because that's as important as washing your hands after using the bathroom.)
I would have also liked to examine Clinton's polling numbers among Hispanics using these methods. This would give us some benchmarks to work against.
Further, as noted, not everyone polled will vote. Investigating "likely voter models" would be another fertile ground for discussion.
There are a number of straightforward questions we can ask about using the hypergeometric distribution for polling results, and how it modifies the calculations we made related to the polling results. An example:
Homework. Given a hypergeometrically distributed random variable \(X\sim\operatorname{Hypergeometric}(N, B, n)\) with \(p=B/N\) show the variance \(\operatorname{Var}[X] = np(1-p)(\text{something})\). The puzzle is to see how this extra factor (the parenthetic "something") impacts the naive margin-of-error calculation for polls.