
Monday, July 27, 2020

Kalman Filtering Polls

We will discuss Kalman filters applied to opinion polling. They've been applied quite successfully to robotics (and other fields), so people playing with Arduino kits tend to be curious about them; I hate to disappoint, but we will focus particularly on polls.

Kalman filters are the optimal way to combine polls. The "filter" should be thought of as filtering out the noise from the signal.

We will briefly review the state space formalism, then discuss polling framed in this formalism. Once we rephrase polls in state space variables, we will walk through the Kalman filter itself.

State Space Models

We have n candidates and we're interested in the support among the electorate for each candidate. This is described by a vector \(\vec{x}_{k}\) on day \(k\). As we may suspect, support changes day-to-day, which is described by the equation \[ \vec{x}_{k+1} = F_{k}\vec{x}_{k} + \omega_{k} \tag{1a}\] where \(\omega_{k}\sim MVN(0, Q_{k})\) is white noise and \(F_{k}\) is the state evolution operator.

Polls are released, which measure a sample of the electorate. We are informed about the support in terms of how many respondents support each candidate at the time of the survey, denoted \(\vec{z}_{k}\), and it obeys the equation \[ \vec{z}_{k} = H_{k}\vec{x}_{k} + \nu_{k} \tag{1b}\] where \(H_{k}\) is the sampling operator, and \(\nu_{k}\sim MVN(0, R_{k})\) is the sampling error.
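To make Eqs (1a) and (1b) concrete, here is a minimal simulation sketch (NumPy assumed; the candidate ordering, noise scales, and initial support vector are all hypothetical illustration values, not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)

dim = 4                          # hypothetical: Dem, Rep, third party, undecided
days = 30
F = np.eye(dim)                  # random-walk state evolution operator
H = np.eye(dim)                  # identity sampling operator, for this sketch only
Q = 25.0 * np.eye(dim)           # hypothetical process-noise covariance
R = 100.0 * np.eye(dim)          # hypothetical sampling-error covariance

x = np.array([1500.0, 1400.0, 400.0, 300.0])   # hypothetical initial support
states, observations = [], []
for k in range(days):
    # Eq (1a): latent support evolves with white noise omega_k ~ MVN(0, Q)
    x = F @ x + rng.multivariate_normal(np.zeros(dim), Q)
    # Eq (1b): a poll observes the state with sampling error nu_k ~ MVN(0, R)
    z = H @ x + rng.multivariate_normal(np.zeros(dim), R)
    states.append(x.copy())
    observations.append(z)

states, observations = np.array(states), np.array(observations)
```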

State space models, for our interests, work in this general framework. What questions could we ask here? What problems could we solve here?

  1. Filtering problem: from observations \(z_{1}\), ..., \(z_{k}\), estimate the current unobserved/latent state \(x_{k}\)
  2. Smoothing problem: given past, present, and future observations \(z_{1}\), ..., \(z_{k}\), ..., \(z_{N}\) estimate past unobserved/latent states \(x_{1}\), ..., \(x_{k-1}\)
  3. Prediction problem: given observations \(z_{1}\), ..., \(z_{k}\), estimate future unobserved/latent states \(x_{k+t}\) for some \(t\gt0\)
This list is not exhaustive; Laine reviews more questions we could try to answer. We could also try to estimate the covariance matrices, for example, in addition to smoothing.

For our interests, we can estimate the covariance matrices \(R_{k}\) and \(Q_{k}\) for the sampling error and white noise using the normal approximation to the multinomial distributions. We add to it, though, some extra margin for variation: we add a diagonal matrix of the form \(0.25 z_{cr}^{2}nI\) where \(I\) is the identity matrix, \(z_{cr}\approx 1.96\) is the 95% critical value for the standard normal distribution, and n is the sample size of the poll.
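As a sketch of this estimate (the helper name and poll counts are my own illustration; NumPy assumed), the multinomial normal approximation plus the extra diagonal margin looks like:

```python
import numpy as np

Z_CR = 1.96   # 95% critical value for the standard normal distribution

def poll_covariance(z):
    """Normal approximation to the multinomial covariance of a poll's counts,
    plus the extra diagonal margin 0.25 * z_cr^2 * n * I described above."""
    z = np.asarray(z, dtype=float)
    n = z.sum()                      # sample size of the poll
    p = z / n                        # observed proportions
    multinomial = n * (np.diag(p) - np.outer(p, p))
    return multinomial + 0.25 * Z_CR**2 * n * np.eye(len(z))

R = poll_covariance([430, 400, 70, 100])   # hypothetical poll counts
```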

Kalman Filter

We will stick a pin in determining the covariance matrices \(Q_{k}\) and \(R_{k}\) (it will be discussed later) and simply assume they can be determined. In a similar vein, we will assume our polls give us the number of respondents supporting the Democratic candidate, the Republican candidate, the third party candidate(s), and the undecided respondents. The problem is to estimate \(x_{k}\) using \(z_{1}\), ..., \(z_{k}\) for each k.

The basic structure of the Kalman filter amounts to "guess and check", or using more idiomatic phrasing, "predict and update".

Predict

We begin with an estimate using previous observations \[ \widehat{x}_{k|k-1} = F_{k}\widehat{x}_{k-1|k-1} \tag{2a} \] where the subscript notation for estimates should be read as "time of estimate | using observations up to this time". Hence \(\widehat{x}_{k|k-1}\) uses observations \(z_{1}\), ..., \(z_{k-1}\) to estimate \(x_{k}\). This is the first stab at an estimate.

Now, this is a realization of a multivariate normally distributed random variable. There's a corresponding covariance matrix for this estimate, which is also unknown. We can estimate it using the previous estimates and observations as \[ P_{k|k-1} = F_{k}P_{k-1|k-1}F_{k}^{T} + Q_{k}. \tag{2b} \] Along similar lines, we need to estimate the covariance matrix for \(z_{k}\) as \[ S_{k} = H_{k}P_{k|k-1}H_{k}^{T} + R_{k}. \tag{2c} \] Initially this seemed quite random to me, but the reader familiar with covariance of random variables could see it from Eq (1b) and (2b).

None of these predicted estimates use the new observation \(z_{k}\). Instead, they are derived using estimates obtained from previous observations.

Update

The next step incorporates a new observation \(z_{k}\) to improve our estimates. Our estimate has some error; its residual is \[ \widetilde{y}_{k} = z_{k} - H_{k}\widehat{x}_{k|k-1} \tag{3a} \] which tells us how far off our predicted estimates are from the new observation. We will update our estimates with the goal of minimizing the residuals. This is done as \[ \widehat{x}_{k|k} = \widehat{x}_{k|k-1} + K_{k}\widetilde{y}_{k} \tag{3b} \] for some "Kalman gain" matrix \(K_{k}\). Minimizing the sum of squares of the residuals gives us the formula \[ K_{k} = P_{k|k-1}H_{k}^{T}S_{k}^{-1} \tag{3c} \] which we'll derive later.

The updated covariance matrix for the estimates can similarly be derived, in Joseph form for numerical computational convenience, as \[ P_{k|k} = (I-K_{k}H_{k})P_{k|k-1}(I-K_{k}H_{k})^{T} + K_{k}R_{k}K_{k}^{T}. \tag{3d} \] When we use gain matrices other than the Kalman gain, Eq (3c), this updated covariance equation still holds. But for the Kalman gain matrix, we can simplify it further (useful for mathematics, but not so useful on the computer).

Combining it all together, the fitted residuals are now \[ \widetilde{y}_{k|k} = z_{k} - H_{k}\widehat{x}_{k|k} \tag{3e} \] which, for Kalman filters, is minimized. Assuming the model described by Eqs (1) is accurate and we have correct initial values \(P_{0|0}\) and \(\widehat{x}_{0|0}\), then Eq (3e) ensures we have the optimal estimates possible (in the sense that it minimizes the sum of squares of these residuals).
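The predict and update equations collect naturally into a single step; here is a minimal sketch (NumPy assumed, with a naive matrix inverse rather than a linear solver):

```python
import numpy as np

def kalman_step(x_prev, P_prev, z, F, H, Q, R):
    """One predict/update cycle, following Eqs (2a)-(3d)."""
    # Predict
    x_pred = F @ x_prev                                # Eq (2a)
    P_pred = F @ P_prev @ F.T + Q                      # Eq (2b)
    S = H @ P_pred @ H.T + R                           # Eq (2c)
    # Update
    y = z - H @ x_pred                                 # Eq (3a): residual
    K = P_pred @ H.T @ np.linalg.inv(S)                # Eq (3c): Kalman gain
    x_new = x_pred + K @ y                             # Eq (3b)
    I = np.eye(len(x_pred))
    # Eq (3d): Joseph form, numerically safer than the simplified update
    P_new = (I - K @ H) @ P_pred @ (I - K @ H).T + K @ R @ K.T
    return x_new, P_new
```

As a one-dimensional sanity check: starting from \(\widehat{x}_{0|0}=0\), \(P_{0|0}=1\), with \(F=H=Q=R=1\) and observing \(z=2\), the step yields \(\widehat{x}_{1|1}=4/3\) and \(P_{1|1}=2/3\).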

Invariants

If the model is accurate, and if the initial values \(\widehat{x}_{0|0}\) and \(P_{0|0}\) reflect the initial state, then we have the following invariants \[ \mathbb{E}[x_{k}-\widehat{x}_{k|k}] = \mathbb{E}[x_{k}-\widehat{x}_{k|k-1}] = 0\tag{4a} \] and \[ \mathbb{E}[\widetilde{y}_{k}] = 0.\tag{4b}\] These encode the fact that we're accurately estimating the state, and the residuals are just noise. We have similar invariants on the covariance matrices, like \[ \operatorname{cov}[x_{k}-\widehat{x}_{k|k}] = P_{k|k} \tag{4c} \] and \[ \operatorname{cov}[x_{k}-\widehat{x}_{k|k-1}] = P_{k|k-1} \tag{4d} \] as well as the residual covariance \[ \operatorname{cov}[\widetilde{y}_{k}] = S_{k}. \tag{4e} \] These just encode the assertions that we're estimating the state and covariance matrices "correctly".

Where did all this come from?

The short answer is from minimizing the sum of squares of residuals, and from the definition of covariance in Eqs (4c) and (4d). From Eq (4c) we derive Joseph's form of covariance Eq (3d).

Kalman gain may be derived by minimizing the square of residuals \(\mathbb{E}[\|x_{k}-\widehat{x}_{k|k}\|^{2}]\). The trick here is to realize \begin{align} \mathbb{E}[\|x_{k}-\widehat{x}_{k|k}\|^{2}] &= \mathbb{E}[\operatorname{tr}\bigl((x_{k}-\widehat{x}_{k|k})(x_{k}-\widehat{x}_{k|k})^{T}\bigr)]\\ &= \operatorname{tr}(\operatorname{Cov}[x_{k} - \widehat{x}_{k|k}]) \tag{5a} \end{align} which, using (4c), amounts to minimizing the trace of the covariance matrix \(P_{k|k}\). As is usual in calculus, we use the derivative test, so this occurs when its derivative with respect to the unknown quantity is zero. We're looking for the Kalman gain (as it is the unknown), we then have \[ \frac{\partial}{\partial K_{k}}\operatorname{tr}(P_{k|k}) = -2(H_{k}P_{k|k-1})^{T} + 2K_{k}S_{k} = 0\tag{5b} \] which we solve for \(K_{k}\). This gives us Eq (3c).
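We can also check numerically that the gain from Eq (3c) minimizes the trace of \(P_{k|k}\): perturbing the optimal gain never decreases the trace of the Joseph-form covariance. (The matrices below are arbitrary illustrative values; NumPy assumed.)

```python
import numpy as np

rng = np.random.default_rng(1)

P_pred = np.array([[4.0, 1.0],
                   [1.0, 3.0]])           # P_{k|k-1}, arbitrary SPD example
H = np.eye(2)
R = np.diag([2.0, 1.0])
S = H @ P_pred @ H.T + R                  # Eq (2c)
K_opt = P_pred @ H.T @ np.linalg.inv(S)   # Eq (3c)

def joseph(K):
    """Eq (3d): updated covariance for an arbitrary gain K."""
    I = np.eye(2)
    return (I - K @ H) @ P_pred @ (I - K @ H).T + K @ R @ K.T

best = np.trace(joseph(K_opt))
for _ in range(200):
    K = K_opt + 0.1 * rng.standard_normal((2, 2))
    assert np.trace(joseph(K)) >= best - 1e-9   # no other gain does better
```

Since the trace is a convex quadratic in \(K_{k}\), the derivative test in Eq (5b) really does locate the global minimum, which is what the perturbation loop confirms.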

Determine Initial Values and Parameters

I briefly mentioned how to estimate the covariance matrices for the noise terms appearing in Eq (1). It's basically the covariance matrices for multinomial distributions, perturbed by sampling margin of error. For \(R_{k}\), we use the poll data to estimate the multinomial covariance; for \(Q_{k}\), we use the \(\widehat{x}_{k-1|k-1}\) estimates.

The initial guess for \(\widehat{x}_{0|0}\) could be estimated in any number of ways: it could be from party registration data, or from the last presidential election, or just from the gut. I use a vector proportional to the last estimate from the previous election, whose components are integers that sum to 3600.

As for the initial covariance matrix \(P_{0|0}\), I use a modified multinomial distribution covariance matrix derived from the initial guess for \(\widehat{x}_{0|0}\). The changes are: multiply the entire matrix by a factor of 12, the diagonal components by an additional factor of 5, and change the sign on the covariance components of third-party/undecided off-diagonal components (under the hypothesis that third party voters are undecided, and vice-versa). Multiplying by these extra factors reflects the uncertainty in these estimates (it'd correspond to an approximate margin-of-error of ±12%).

For polling data, this amazingly enough works. If we were building a robot or driverless car, then we should use some rigorous and sound principles for estimating the initial state...I mean, if you don't want cars to drive off cliffs or into buildings.

Partial Pooling and Updating the Estimates

The only quirks left are (1) translating polling data into the \(\vec{z}_{k}\) vectors, (2) handling multiple polls on the same day, (3) rescaling \(\vec{x}_{k}\). Let me deal with these in turn.

1. Translating poll data. I'm given the percentage of respondents favoring the Democrat (e.g., "43" instead of "0.43"), the percentage favoring the Republican, the percentage undecided. I lump the remaining percentage (100 - dem - rep - undecided) into a hypothetical "third party" candidate. Using these percentages, I translate them into fractions (e.g., "43" [which stands for "43%"] becomes "0.43") then multiply by the sample size. These form the components of the \(\vec{z}_{k}\) vector.
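A sketch of this translation (the function name and the dem/rep/third/undecided ordering are my own conventions):

```python
def poll_to_z(dem_pct, rep_pct, und_pct, sample_size):
    """Turn reported percentages into the components of z_k.

    The remainder (100 - dem - rep - undecided) is lumped into a
    hypothetical "third party" candidate, as described above.
    """
    third_pct = 100 - dem_pct - rep_pct - und_pct
    return [round(p / 100 * sample_size)
            for p in (dem_pct, rep_pct, third_pct, und_pct)]

z = poll_to_z(43, 40, 10, 1000)   # hypothetical poll
```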

2. Pooling the Polls. Following Jackman's work, I take the precision for a poll to be \(\tau_{k} = 2n/z_{cr}\) where \(z_{cr}\approx 1.96\) is the 95% critical value for the standard normal distribution, and n is the sample size for the poll. When multiple polls are released on the same day, we take the weighted mean of the polls, weighted by their precisions.

2.1. What date do we assign a poll anyways? Polls usually take place over several days, so what date do we assign it? There's probably a clever way to do this, but the standard way is to take the midpoint between the start and end dates for a poll. The simplest way is to use the end date, which is what I've chosen to do for the moment.
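The same-day pooling in item 2 can be sketched as follows (the function name is mine, and the precision convention follows the formula above):

```python
import numpy as np

Z_CR = 1.96

def pool_same_day(polls):
    """Precision-weighted mean of several polls released the same day.

    `polls` is a list of (z_vector, sample_size) pairs; each poll's
    precision is tau = 2 n / z_cr, per the convention above.
    """
    weights = np.array([2 * n / Z_CR for _, n in polls])
    vectors = np.array([np.asarray(z, dtype=float) for z, _ in polls])
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

# two hypothetical same-day polls (counts, sample size); weights end up 2:1
pooled = pool_same_day([([430, 400, 70, 100], 1000),
                        ([215, 230, 30, 25], 500)])
```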

3. Rescale the estimates. The product \(H_{k}\widehat{x}_{k|k-1}\) needs to be of the same scale as \(z_{k}\). This gives us \(H_{k} = (\|\vec{z}_{k}\|_{1}/\|\widehat{x}_{k|k-1}\|_{1})I\) using the L1-norm (the sum of absolute values of the components of a vector). The \(\vec{x}_{k}\) is proportional to the total electorate, but is not equal to it.
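The rescaling in item 3 is nearly a one-liner (NumPy assumed; the totals below are hypothetical):

```python
import numpy as np

def sampling_operator(z, x_pred):
    """H_k = (||z_k||_1 / ||x_{k|k-1}||_1) I, matching the poll's scale."""
    scale = np.abs(z).sum() / np.abs(x_pred).sum()
    return scale * np.eye(len(x_pred))

H = sampling_operator(np.array([430.0, 400.0, 70.0, 100.0]),     # poll totals 1000
                      np.array([1500.0, 1400.0, 400.0, 300.0]))  # state totals 3600
```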

At the end, when the dust settles, we get an estimate \(\widehat{x}_{k|k}\) which we interpret as a "true sample estimate" reflective of the electorate. It's an optimal reflective sample.

What can I do with it?

Once we've done this, we get a sequence of optimal estimates of support among the electorate. We can use this to estimate the support for the candidate in a "nowcast" by taking a vector \(\vec{c}\) which, written as a sum of unit vectors, would be for a Democratic candidate \[ \vec{c} = \vec{e}_{dem} -\vec{e}_{rep} + c_{t}\vec{e}_{third} + c_{u}\vec{e}_{undecided} \] where \(c_{t}\) is the fraction of third party voters who ultimately change their mind and vote for the Democratic candidate minus the fraction of third party voters who ultimately vote for the Republican, and likewise for \(c_{u}\) the difference of undecideds who vote Democrat minus those who vote Republican. Implicit in this is the assumption that either the Democratic candidate or the Republican candidate win the election.

We can then construct a univariate normal distribution \[ X \sim\mathcal{N}(\widehat{x}_{k|k}\cdot\vec{c}, \sigma^{2}=\vec{c}\cdot P_{k|k}\vec{c}) \] which describes the margin of victory for the Democratic candidate. The nowcast amounts to computing the probability \[ \Pr(X\gt 0) = p_{dem} \] to forecast the probability the Democratic candidate will win the election, as measured by the current support as estimated by the polls.
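A sketch of this nowcast computation (the state ordering dem/rep/third/undecided and all numbers are hypothetical; only NumPy and the standard library are used):

```python
import math
import numpy as np

def nowcast(x_est, P_est, c_third=0.0, c_undecided=0.0):
    """Pr(Democratic margin of victory > 0) for state estimate x_est.

    c_third and c_undecided are the net fractions of third-party and
    undecided voters breaking for the Democrat.
    """
    c = np.array([1.0, -1.0, c_third, c_undecided])
    mean = float(c @ x_est)                  # x_hat . c
    sd = math.sqrt(float(c @ P_est @ c))     # sqrt(c . P c)
    # Pr(X > 0) for X ~ N(mean, sd^2), via the error function
    return 0.5 * (1.0 + math.erf(mean / (sd * math.sqrt(2.0))))

p_dem = nowcast(np.array([1850.0, 1750.0, 200.0, 300.0]),
                np.diag([900.0, 900.0, 100.0, 100.0]))
```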

There's still sensitivity analysis to do with this, but it's actually a surprisingly accurate way to forecast an election if one has sufficient state-level polling data in the competitive states. (It predicted Wisconsin, Pennsylvania, and Michigan would all go for Trump in 2016, for example, a week before the election.)

Code

I've written some code implementing a Kalman filter from polling data for US presidential elections. It's rough around the edges, but it does what one would expect and hope.
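My code is not reproduced here, but the overall loop looks roughly like the following sketch (identity \(F\), the rescaling \(H\) from above, and a crude diagonal \(R\) are all simplifying assumptions for illustration, not the exact choices in my implementation):

```python
import numpy as np

def filter_polls(x0, P0, polls, Q):
    """Run the predict/update cycle over a sequence of poll count-vectors."""
    x, P = np.asarray(x0, float), np.asarray(P0, float)
    history = []
    for z in polls:
        z = np.asarray(z, float)
        P_pred = P + Q                                   # Eq (2b) with F = I
        H = (z.sum() / x.sum()) * np.eye(len(x))         # rescaling operator
        R = np.diag(np.maximum(z, 1.0))                  # crude sampling covariance
        S = H @ P_pred @ H.T + R                         # Eq (2c)
        K = P_pred @ H.T @ np.linalg.inv(S)              # Eq (3c)
        x = x + K @ (z - H @ x)                          # Eqs (3a), (3b)
        I = np.eye(len(x))
        P = (I - K @ H) @ P_pred @ (I - K @ H).T + K @ R @ K.T   # Eq (3d)
        history.append(x.copy())
    return history

history = filter_polls([1500.0, 1400.0, 400.0, 300.0], 100.0 * np.eye(4),
                       [[430, 400, 70, 100], [500, 480, 60, 110]],
                       25.0 * np.eye(4))
```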

Concluding Remarks

The topic of "state space models" is huge and vast, and this is by no means exhaustive. We've just looked at one particular filter. Even for the Kalman filter, we could add more bells and whistles: we could factor in how public opinion changes when polling data is released (because polls are not always published the day after they finish), factor in methodology (phone interviews vs internet polls, likely voters vs registered voters vs adults), house effects, etc.

We could also revisit assumptions made, like treating polls as one-day events, to try to model the situation better.

We could leave the covariance matrices as unknowns to be estimated, which greatly complicates things, but ostensibly could be done with MCMC methods. If we wanted to continue working in this domain, using state space models, then MCMC methods prove fruitful for Bayesian approaches.

References

  1. Eric Benhamou, "Kalman filter demystified: from intuition to probabilistic graphical model to real case in financial markets". arXiv:1811.11618
  2. Joao Tovar Jalles, "Structural Time Series Models and the Kalman Filter: a concise review". Manuscript dated 2009, Eprint
  3. Simon Jackman, "Pooling the polls over an election campaign". Eprint
  4. Marko Laine, "Introduction to Dynamic Linear Models for Time Series Analysis". arXiv:1903.11309, 18 pages.
  5. Renato Zanetti and Kyle J. DeMars, "Joseph Formulation of Unscented and Quadrature Filters with Application to Consider States". Eprint, for further discussion on variations of Kalman gain, etc.

Monday, June 29, 2020

Polling as Sampling from Urns

Last time we discussed a simple model of polling where each demographic was a different color of ball in an urn, with supporters of a candidate as striped balls and non-supporters as solid balls. We saw with data from Hispanic demographics that overdispersion was present and diluted support for Biden.

But we assumed the polls which formed the basis of our reasoning were a perfect representative sample. I think we will begin to tease this apart in this post. I don't think I'll discuss sampling methods yet.

Instead, in this post, I'll start with several urns with different quantities of striped and solid balls. Even using ideal sampling methods, the techniques produce a wide range of estimates of the proportion of striped balls to the rest of the urn.

Toy Problem

Consider 4 urns, A, C, F, and T. We have balls with three colors (red, green, blue) which are either striped or solid. We want to know how many solid balls are in each urn from a given sample. But we perform this backwards: we begin with knowing the number of striped and solid balls in each urn, and consider different sampling methods.

Urn A    Number Solid        Number Striped    Total        % of Balls
Green    962,036             357,294           1,319,330    96.12%
Blue     34,147              7,532             41,679       3.04%
Red      3,348               8,242             11,590       0.84%
Total    999,531 (72.8%)     373,068           1,372,599

Urn C    Number Solid        Number Striped    Total        % of Balls
Green    6,628,592           1,910,704         8,539,296    97.48%
Blue     130,711             20,448            151,159      1.72%
Red      19,519              49,726            69,245       0.79%
Total    6,778,822 (77.4%)   1,980,878         8,759,700

Urn F    Number Solid        Number Striped    Total        % of Balls
Green    364,392             126,652           491,044      21.01%
Blue     598,611             193,014           791,625      33.88%
Red      276,758             777,166           1,053,924    45.11%
Total    1,239,761 (53.0%)   1,096,832         2,336,593

Urn T    Number Solid        Number Striped    Total        % of Balls
Green    4,723,236           1,716,632         6,439,868    96.91%
Blue     110,395             32,782            143,177      2.15%
Red      18,830              43,645            62,475       0.94%
Total    4,852,461 (73.0%)   1,793,059         6,645,520

Data derived from the ACS 1-year estimate, B03001, for Arizona, California, Florida, and Texas. The counts of striped and solid balls were randomly generated based on crude Bayesian estimates from several recent FOX News and New York Times polls.

There are a total of 13,870,575 solid balls (72.566%) and 5,243,837 (27.434%) striped balls.

Exercise 1. If we had a hypothetical urn with 5,243,837 striped balls and 13,870,575 solid balls, and if we sampled without replacement n balls, what's the expected number of striped balls drawn? What's the interval containing 75% of the probability distribution about the mean for different sample sizes \(n=75,100,125\)?

This requires being more precise about what we're really interested in finding. The expected number of striped balls in a sample of n balls would be \(0.27434n\). This is unambiguous. (The median value would be, for \(n=100\), 27 striped balls.)

Constructing the interval can be done thus: the upper bound of k for \(\Pr(X\leq k|M,N,n=100)\approx 7/8\) empirically is about \(k\approx32\), and for the lower bound \(\Pr(X\leq k|M,N,n=100)\approx 1/8\) is about \(k\approx 22\). This would give us an interval of 27 ± 5.

We could also consider constructing an interval similar to the "highest posterior density interval". Intuitively, we plot the probability density function, then take a horizontal line tangent to the mode (peak of the PDF). We begin lowering the horizontal line until the area under the probability density between the intersection points equals 75%. This produces a slightly different value since the distribution is not symmetric, though for the sample sizes we are considering the differences would not be appreciable.
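The quantile construction for Exercise 1 can be reproduced with SciPy's hypergeometric distribution (SciPy assumed available):

```python
from scipy.stats import hypergeom

N = 13_870_575 + 5_243_837   # total balls in the hypothetical pooled urn
K = 5_243_837                # striped balls
n = 100                      # sample size

X = hypergeom(N, K, n)       # scipy's argument order: (population, successes, draws)
mean = X.mean()              # expected striped balls, = 0.27434 * n
lo = int(X.ppf(1 / 8))       # lower edge of the central 75% interval
hi = int(X.ppf(7 / 8))       # upper edge
```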

Exercise 2. If we drew 4 samples (each sampled without replacement) with each sample consisting of n balls, what's the expected number of striped balls drawn? What's the interval containing 75% of the probability distribution about the mean for \(n=75,100,125\)?

Constructing the intervals of expected striped balls drawn for \(n=100\) samples: urns A and T have interval 27 ± 5 striped balls, urn C has interval \(18\leq k\leq27\), and urn F has an interval \(41\leq k\leq53\).

Observe that urn C has an interval with fewer striped balls than the hypothetical pooled urn, whereas urn F has its interval centered at nearly double the hypothetical pooled urn's, and urns A and T coincide with the hypothetical pooled urn.

Exercise 3. If we sampled proportional to the number of balls in each urn (e.g., \(2336593/(5243837 + 13870575) \approx 12.22\%\) of the sample are drawn from urn F) and we sampled without replacement, what's the expected number of striped balls drawn? What's the interval containing 75% of the probability distribution about the mean for \(n=75,100,125\)? What's the expected number of balls of each color drawn?

This is several independent sampling problems, with \(n_{A} = nN_{A}/N\approx0.07n\), \(n_{C} = nN_{C}/N\approx 0.45n\), \(n_{F} = nN_{F}/N\approx 0.12n\), and \(n_{T} = nN_{T}/N\approx 0.35n\). The expected number of striped balls in the sample would be the sum of the expected number from a sample from respective urns with respective sizes. But some simple algebra shows this is just the expected value of the hypothetical pooled urn from exercise 1.

Since the wording suggests we move from urn to urn, taking a specific sample from each one independently, the intervals are computed independently, and the sum of the lower-bounds gives us the lower-bound for the resulting sample (and similarly for the upper-bound).

Exercise 4. What if each urn gets sampled equally? So a quarter of the sample is drawn from urn A, a quarter from urn C, etc. What is the expected value of striped balls appearing in the sample? What is the 75% interval for \(n=100\)?

This gives us an expected 30.92905 striped balls in the sample. The interval with 75% probability consists of \(20\leq k\leq41\) striped balls; the 50% interval gives us 30 ± 6 striped balls.
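The expected value in Exercise 4 can be checked directly, since the hypergeometric mean is just \(nK/N\) for each urn (totals taken from the tables above):

```python
# (total balls, striped balls) for each urn, from the tables above
urns = {
    "A": (1_372_599, 373_068),
    "C": (8_759_700, 1_980_878),
    "F": (2_336_593, 1_096_832),
    "T": (6_645_520, 1_793_059),
}

# 25 balls drawn without replacement from each urn; the hypergeometric
# mean n * K / N is unaffected by the finite-population correction
expected = sum(25 * striped / total for total, striped in urns.values())
```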

Exercise 5. What if each urn gets sampled in this manner: we first fix a desired sample size \(n\geq100\). We want to randomly sample without replacement so we get at least \((R/N)n\) red balls (where there are \(R\) red balls and \(N\) total balls in the urn initially), at least \((G/N)n\) green balls (where \(G\) is the initial number of green balls in the urn) and at least \((B/N)n\) blue balls (with \(B\) the initial number of blue balls in the urn). We will have our sample consist of \(g\) green balls, \(r\) red balls, and \(b\) blue balls where \(r+g+b\geq n\). (A) What is the expected sample size for each urn for \(n=150\)? (B) What is the expected number of striped balls for each color? (C) [Open ended] Can we apply some set of weightings to better reflect the urn's composition?

We first compute how many balls we want, at minimum, drawn for each urn:

  • Urn A's sample needs at least 144 green balls, 5 blue balls, and 1 red ball
  • Urn C's sample needs at least 146 green balls, 3 blue balls, and 1 red ball
  • Urn F's sample needs at least 32 green balls, 51 blue balls, and 68 red balls
  • Urn T's sample needs at least 145 green balls, 3 blue balls, and 1 red ball
The exact numbers may differ due to rounding concerns, or desire for sampling particular subpopulations.

If we consider the stopping condition to be drawing s balls of a certain color (of which the urn has K), then we can compute the expected number of draws \(k+s\) needed using the negative hypergeometric distribution. The expected number of draws would be \[ n \approx s + s\frac{N-K}{K+1} = s\frac{N+1}{K+1}. \] This formula differs from a naive reading of the Wikipedia page, because their "K" is our "N-K".
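As a quick sketch of this formula (the function name is mine):

```python
def expected_draws(N, K, s):
    """Expected draws (without replacement) to obtain s balls of a color
    represented K times among N total: s + s(N-K)/(K+1) = s(N+1)/(K+1)."""
    return s * (N + 1) / (K + 1)

# urn A: expect roughly 118 draws before seeing a single red ball
draws_for_one_red = expected_draws(1_372_599, 11_590, 1)
```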

However, if we want our sample to contain the desired minimum with probability \(\alpha\), we need \(n\) large enough that \(\Pr(X\geq s\mid N, K, n)\geq \alpha\). Just brute forcing this, we find urn A needs \(n=354\), urn C needs \(n=378\), urn F requires \(n=194\), and finally urn T needs \(n=318\).

For, e.g., urn A, this has the unfortunate side effect of producing somewhere between double and triple as many green and blue balls as needed. So how do we handle this? We want to avoid discarding information (as a rule of thumb in life, but especially in statistics), so we may want to take the ratio of striped green balls to solid green balls, then multiply by the desired sample size for green balls (144). There are other possibilities, but this is the quickest for us.

Numerically, we find the sample produced for each urn contains (with the 90% confidence interval for the number of solid balls of each color in parentheses):

  • Urn A's sample has 340 green balls (234–261), 11 blue balls (7–11), and 3 red balls (0–2)
  • Urn C's sample has 369 green balls (273–299), 7 blue balls (4–7), and 3 red balls (0–2)
  • Urn F's sample has 41 green balls (26–35), 66 blue balls (44–55), and 87 red balls (16–30)
  • Urn T's sample has 309 green balls (214–239), 7 blue balls (3–7), and 3 red balls (0–2)

If we were to normalize the overcounted balls, then we end up with the estimates:

  • Urn A's weighted sample has between 99 and 111 striped green balls, 2 or 3 striped blue balls, and at most 1 striped red ball, for a weighted total of somewhere between 102 and 115 striped balls (68%–76.67% striped)
  • Urn C's weighted sample has between 108 and 118 striped green balls, 2 (well, between 1.75 and 2) striped blue balls, and at most 1 striped red ball, for a weighted total of somewhere between 111 and 121 striped balls (74%–80%)
  • Urn F's weighted sample has between 20 and 27 striped green balls, 34 to 42.5 striped blue balls, and 12.5 to 23 striped red balls, for a weighted total somewhere between 66 and 93 striped balls (44%–62%)
  • Urn T's weighted sample reports somewhere between 100 and 112 striped green balls, 1 to 3 striped blue balls, and at most 1 striped red ball, for a weighted total of 101 to 116 striped balls (67%–77%)

But if we were given this data, working backwards, we would end up with radically different estimates for the population size. Urn F has a "margin of error" of about ±9%, which would be huge. The reported margin of error (at 90% confidence) would be at most 6.75%, though. The reported margin of error underestimates the actual range of variability the polls could report.

On the other hand, for urn T, we see the reported margin-of-error would be 6.315% (at 90% confidence) whereas its estimates have a ± 5% margin. In this case, the margin of error reported over-estimates the interval width.

Observation 1. The number of striped balls drawn using these different sampling methods produce different estimates for the total number of striped balls in each of the urns.

Observation 2. For urns with a high proportion of striped balls, the margin of error decreases; low estimates of striped balls from decently sized samples should therefore be believed.

Homework 1. Given the range of estimates for each of these sampling methods, produce plots estimating the number of striped balls in each urn.

A concluding remark: the reader may object at the sample size of 150 being too small. This is a valid criticism, but when you examine the polling crosstabs, it's not uncommon to find 150 Hispanics polled. The guiding question I have writing this series of posts is whether we can extract anything meaningful from the polling results.

Saturday, June 27, 2020

This one strange trick helps Hispanic support of Biden

Harry Enten noted a few weeks ago how Biden is doing worse than Clinton among Hispanic voters. I wanted to use this as a pretense to investigate some of the probability theoretic aspects of polling.

We'll specifically be building a series of toy models, progressively exploring aspects of polling a composite heterogeneous demographic like Hispanic Americans. These are idealizations which can be developed further into more accurate models, but even the approximations inform us about aspects seldom considered.

The basic game plan (the tl;dr version): we consider sampling without replacement as a hypergeometric distribution, then estimate the population of "successes" out of all trials using the maximum likelihood point-estimate and Bayesian conjugate prior. In section 2, we move on to consider multiple hypergeometric distributions pooled together (analogous to striped and solid balls of different colors) and apply the same methods to this more general setting.

Assumption 1. The polls are ideal, representative samples of the populations.

We will have a follow up post discussing sampling methods, and how it affects poll results.

Base Case

The floor model of polling amounts to treating respondents as balls in an urn. There are three types of balls (red, white, and blue, for America—err, I mean, for Republican, undecided, and Democrat leaning voters) and we assume the balls do not change color. For our interests, Biden's polling, there are B blue balls and N all balls (blue and non-blue alike).

We sample without replacement n balls, meaning: we take a ball out of the urn, inspect its color, make note of it, then set the ball aside (we do not return it to the urn or replace it), and repeat this process n times. The total number of blue balls drawn from the urn, b, will be reported.

Exercise 1. What is the probability of drawing \(b\) blue balls given \(N, B, n\)?

There are a number of ways to derive the solution. In frequentist terms, there are \[\binom{N}{n} = \frac{N!}{n!(N-n)!}\tag{1a}\] ways to draw n balls from the urn containing N balls without putting the sampled balls back. (We read the left-hand side of Eq (1a) as "N choose n".) This is the denominator of the probability.

The numerator is the product of the number of ways to draw b balls from a possible B cases, and similarly the number of ways to draw \(n-b\) non-blue balls from the population of \(N-B\) non-blue balls. This gives us the answer \[\Pr(b\mid N, B, n) = \frac{\binom{B}{b}\binom{N-B}{n-b}}{\binom{N}{n}}.\tag{1b}\] Readers familiar with probability recognize this as the probability mass function for the Hypergeometric distribution.
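Eq (1b) translates directly into code with the standard library's `math.comb`:

```python
from math import comb

def hypergeom_pmf(N, B, n, b):
    """Pr(b | N, B, n) from Eq (1b): the probability that a sample of n
    balls drawn without replacement contains exactly b blue balls."""
    return comb(B, b) * comb(N - B, n - b) / comb(N, n)

p = hypergeom_pmf(10, 4, 3, 1)   # tiny urn: 4 blue of 10, draw 3
```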

Exercise 2. What is the expected value and variance for the hypergeometric distribution?

This is a problem the reader should work out on their own. One trick is to introduce \(p=B/N\) as the proportion of the population which is blue. For \(N\to\infty\) with \(p\) fixed, the variance of a hypergeometrically distributed random variable \(X \sim \operatorname{Hypergeometric}(N, B, n)\) should approach the binomial variance \(\operatorname{Var}[X]\to np(1-p)\).

Exercise 3. How can we estimate \(B\) given \(N\), \(n\), and an observed \(b\)?

One way to solve this is to consider the value of \(B\) which maximizes the probability \(\Pr(b\mid N, B, n)\). This would give us approximately \(B\approx (N/n)b\).
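For example, with a poll of 273 respondents finding 161 supporters out of a population of roughly 40 million (numbers which anticipate Exercise 4 below), this point estimate is a one-liner:

```python
N, n, b = 40_000_000, 273, 161
B_hat = round((N / n) * b)   # maximum-likelihood style point estimate of B
```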

Bayesian Estimates

A more Bayesian approach would use the conjugate prior for the Hypergeometric distribution, i.e., the Beta-Binomial distribution to describe a random variable \(B\sim\operatorname{BetaBin}(N, \alpha, \beta)\) for some initial prior of the relative frequency of blue balls \(\alpha\) and non-blue balls \(\beta\) (an uninformed guess would be \(\alpha=\beta=1\)).

After observing a sample of \(n\) draws produce \(b\) blue balls, the estimate is updated to \(B-b\sim\operatorname{BetaBin}(N-n, \alpha+b, \beta+n-b)\). As more samples are done, we tally up the number of blue balls seen in all of them, and treat it as if it were a single sample. (That's the property of being a conjugate prior.)
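SciPy ships this conjugate update as `scipy.stats.betabinom`; here is a sketch with a flat prior (note the posterior describes \(B-b\), the blue balls still in the urn, so we add \(b\) back to get the mean of \(B\)):

```python
from scipy.stats import betabinom

N, n, b = 40_000_000, 273, 161   # population, sample size, blue balls observed
alpha, beta = 1, 1               # flat (uninformed) prior

# Posterior over the unseen blue balls B - b
posterior = betabinom(N - n, alpha + b, beta + n - b)
B_mean = posterior.mean() + b    # posterior mean of B itself
```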

Exercise 4. Given \(N = 40\times 10^{6}\), \(n = 273\), and \(b = 161\), estimate \(B\).

These numbers are not pulled out of thin air (though it might seem that way). Roughly 2/3 of any demographic group is voting age, and there are approximately 60m Hispanic Americans (according to the ACS 1-year estimate, B03001), which means about 40m are voting age. In a recent poll, the New York Times found 87 + 74 = 161 Hispanics supporting Biden out of 273 Hispanics polled.

In a separate-though-related poll, the New York Times reported among Hispanics 64% support Biden.

Using the first set of polls from battleground states, we can produce the following estimate for the number of Biden supporting Hispanic Americans (with the quantiles at 5%, 95%, and the expected value indicated with vertical lines):

Estimates using the New York Times Battleground polls and N = 40m. The quantiles are at 21,619,245 and 25,530,281 for the 5% and 95% respectively, with the expected value at 23,589,744.

Note this corresponds to 58.97% ± 4.9% support. The wide margins stem from a lack of data. We can make an informative statement from this little data and crude model: Harry Enten noted Clinton led by 61% to 23% among Hispanics pre-election, but that lies within the credible interval we just constructed. In other words, Biden is doing alright among the Hispanics, unless the few polls we have used were skewed or biased.

Exercise 5. Perform the same analysis with Trump's numbers among Hispanics. Trump has \(t=75\) respondents supporting him and \(n-t=198\) not supporting him. [Spoiler: 27.47% ± 4.4%]

Heterogeneity

This model fails for the simple reason that people are not balls in an urn. No demographic is a homogeneous blob, so how can we start to introduce heterogeneity?

Some data to help us is a poll Telemundo conducted back in March 2020 specifically concerning the Latino vote in Florida and Arizona. We can now examine how Cuban-Americans poll compared to other Hispanic Americans.

Let's isolate the interesting aspects from a probabilistic perspective. We will consider k different colored balls, identical physically except for their color, which have counts \(N_{1}, \dots, N_{k}\) (there are \(N_{j}\) balls of color \(j\)). Suppose further that for each color \(j\) there are \(K_{j}\in\{0,1,\dots,N_{j}\}\) balls which are "striped" and \(N_{j}-K_{j}\) balls which are "solid" (think: billiards). The striped balls are analogous to our candidate's supporters, the solid balls do not support our candidate.

We sometimes have information about the number of striped balls of each color drawn in a sample, though more often polls report only the total number of striped balls drawn, without reference to color.

The meta-question guiding us here (i.e., the really interesting thing we're trying to use to guide constructing exercises and worked examples) is: when we have a multivariate hypergeometric distribution ("many colored balls") which can be collapsed into a hypergeometric distribution ("solids and stripes"), under what circumstances does the multivariate structure distort the univariate distribution?

For concrete numbers, consider the following table:

Color      Number Striped  Number Solid  Total Number of Balls (N)
\(C_{1}\)  64              175           1,575,667
\(C_{2}\)  112             26            3,860,969
\(C_{3}\)  374             125           24,657,774
\(C_{4}\)  161             112           ???
Total      711             438           39,842,421

If we test each color's sample against \(p\approx 711/1149\) with an exact Binomial test, only \(C_{4}\) fails to be significantly different at \(\alpha = 0.05\) (the p-values for the first three colors are on the order of \(10^{-6}\) or smaller, for those curious). The keen reader will realize this is because \(C_{4}\)'s numbers are those from exercise 4, i.e., a pooled sample.
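These tests are quick to reproduce. A sketch (not the original computation) using scipy's exact binomial test:

```python
from scipy.stats import binomtest

# Striped / total counts per color, read off the table above
counts = {"C1": (64, 239), "C2": (112, 138), "C3": (374, 499), "C4": (161, 273)}
p0 = 711 / 1149  # pooled proportion of striped balls

pvals = {}
for color, (k, n) in counts.items():
    pvals[color] = binomtest(k, n, p0).pvalue
    print(color, f"p-value = {pvals[color]:.3g}")
```

Only \(C_{4}\)'s p-value clears \(\alpha = 0.05\); the other three are vanishingly small.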

The index of dispersion for this data (the ratio of the sample variance of the striped counts to their sample mean) is approximately 105.1228 ≫ 1, which indicates the data is quite overdispersed. None of this should be surprising, since it comes from samples of a quite heterogeneous population.
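The dispersion figure is a one-liner to verify:

```python
import statistics

striped = [64, 112, 374, 161]  # striped counts per color, from the table
D = statistics.variance(striped) / statistics.mean(striped)
print(f"index of dispersion ~ {D:.4f}")  # ~105.1228
```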

Exercise 6. How can we estimate the number of striped balls \(K_{j}\) for each color \(j\)?

Isn't this a repeat of exercise 3, but with slightly different data? Arguably, yes. (That's why it was an exercise: it's now a tool in our toolkit!) The maximum likelihood estimator suggests that 61.88% of all balls are striped. Let's see if the Bayesian approach is as informative with the samples drawn.

If we tried to estimate the proportion of each color which is striped with 90% credible intervals, Bayesian methods tell us for \(C_{1}\) we'd expect 26.77% ± 4.5% striped, for \(C_{2}\) we'd expect 81.16% ± 5.4% striped, and for \(C_{3}\) we'd expect between 71.8% and 78.1% (centered around 74.95%) striped. We can plot the densities for the proportions, \(C_{1}\) in red, \(C_{2}\) in blue, \(C_{3}\) in green:

Estimates using the figures given above, divided through by the total number of balls to estimate the proportion of balls striped.
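These intervals can be reproduced with a short sketch, assuming (as in the earlier update) a Beta(striped, solid) posterior for each color; the exact quantiles depend mildly on the choice of prior:

```python
from scipy.stats import beta

# Posterior Beta(striped, solid) per color, matching the counts in the table
posts = {"C1": beta(64, 175), "C2": beta(112, 26), "C3": beta(374, 125)}
for color, post in posts.items():
    lo, hi = post.ppf([0.05, 0.95])
    print(f"{color}: mean {post.mean():.2%}, 90% interval ({lo:.1%}, {hi:.1%})")
```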

The maximum likelihood estimates tell us \(C_{1}\) has about 26.78% striped, \(C_{2}\) about 81.16% striped, and \(C_{3}\) about 74.95% striped. These differ from the Bayesian estimates only around the 8th digit after the decimal point.

Exercise 7. How do these Bayesian credibility intervals compare to the confidence intervals around the most likely number of striped balls?

Exercise 8. Using the values of \(b=711\), \(n-b=438\), and \(N=39842421\), compute the credibility interval for the conjugate prior \(B\sim\operatorname{BetaBin}(N,b,n-b)\). How does it compare to our results from the first part of this blog entry? How does it compare to the estimates for individual subpopulations? [Answer: the credibility interval for \(B/N\) with 90% of the probability is about 61.88% ± 2.35%]
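The bracketed answer to Exercise 8 has a quick sanity check: for \(N\) this large, \(B/N\) is essentially \(\operatorname{Beta}(b, n-b)\) distributed. A sketch (approximating the Beta-Binomial rather than computing it exactly):

```python
from scipy.stats import beta

b, nb = 711, 438
post = beta(b, nb)  # B/N for N >> n is approximately Beta(b, n - b)
lo, hi = post.ppf([0.05, 0.95])
print(f"mean {post.mean():.2%}, 90% interval ({lo:.2%}, {hi:.2%})")
```

The mean lands on 61.88% and the interval half-width on about 2.35%, matching the spoiler.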

An observation we can make: balls colored \(C_{1}\) appear to be significantly different from the other balls, and bundling them in with the others skews the overall estimate.

Concluding Remarks

So far, we have assumed a perfect sampling method and have attempted to extract information from the samples given. We've used one basic trick; working through the exercises drills it into the reader's mind.

Next time, we will discuss sampling methods (including nonresponse error) and how they impact the sample reported. Although nonresponse primarily impacts polling results, viewed differently it serves as a model for voter turnout as well.

If there were more time...

This post has gone on long enough. Had I more time, I would have discussed posterior predictive checking the Bayesian models we've set up. (Because that's as important as washing your hands after using the bathroom.)

I would have also liked to examine Clinton's polling numbers among Hispanics using these methods. This would give us some benchmarks to work against.

Further, as noted, not everyone polled will vote. It would be another fertile ground for discussion to investigate "likely voter models".

There are a number of straightforward questions we can ask about using the hypergeometric distribution for polling results, and how it modifies the calculations we made related to the polling results. An example:

Homework. Given a hypergeometrically distributed random variable \(X\sim\operatorname{Hypergeometric}(N, B, n)\) with \(p=B/N\) show the variance \(\operatorname{Var}[X] = np(1-p)(\text{something})\). The puzzle is to see how this extra factor (the parenthetic "something") impacts the naive margin-of-error calculation for polls.

Thursday, May 28, 2020

Polling: Assorted Notes

These are my random notes on polling. I don't expect anything revolutionary to be contained here; I'm just hoping to consolidate them in one place.

Topics:

  • Margin of error
  • Likely voters
  • Averaging polls
  • Conducted by phone or by internet (think about, add later)

Margin of Error

Big idea. When we conduct a poll, we ask a subset of the population for their responses. There is some error when we extrapolate the results from this sample to the population as a whole. By "error" I mean "the estimates will be off by a few percentage points"; I do not mean the extrapolation is invalid.

The "off by a few percentage points" is called the sampling error. We can estimate it, and the estimate is referred to as the "margin of error."

Mathematical Details

Intuition: The margin of error for a poll of n respondents (out of a population of N individuals) asked a question is the half-width of the confidence interval of the response.

For a "large enough sample" on a binary question, we have a binomially distributed sample and can use the normal approximation. We then choose some level of confidence \(\gamma\) and determine a z-score \(z_{\gamma}\) using the quantile function of the Normal distribution, which tells us how many standard deviations wide the confidence interval needs to be. We approximate the standard deviation by the "standard error", which in turn is approximated by \(\sqrt{s^{2}/n}\), the square root of the sample variance of the responses divided by the sample size.

This is relatively unenlightening, and there are technical matters which (I think) are contentious (at least, from a Bayesian perspective). It's also really hard to interpret the margin of error correctly: it's easy to misread it as "95% probability the true value lies in this interval", whereas it's really saying "if we repeated this poll a large number of times, 95% of those polls would result in a confidence interval containing the population parameter".

Puzzle MOE1. Is there a better Bayesian replacement for the margin of error for a given poll? Presumably credibility intervals, but is there a quick way to get it without heavy computation?

Heuristic. The 95%-confidence margin of error for a binary question on a survey of n respondents may be approximated as \(1/\sqrt{n}\).

This is because the margin of error would be bounded (i.e., less than or equal to) the case where the true probability (proportion of "yes" responses) is 1/2, which produces \(moe = z_{0.95}\sqrt{0.5(1-0.5)/n} \approx 1.96\times 0.5/\sqrt{n}\leq 1/\sqrt{n}\).

Coincidentally, if we used Bayesian reasoning, and estimated the posterior distribution of the proportion of the population who would answer "yes" using a Beta distribution updated with the survey data, then the width of the 95% interval is also decently approximated by \(1/\sqrt{n}\). (Using \(2\sqrt{\operatorname{var}[\theta]}\) gives approximately the same result, but \(1/\sqrt{n}\) is for pessimists like me.)
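The heuristic and both approximations are easy to compare numerically. A sketch (the function name `moe` is mine):

```python
import math
from scipy.stats import beta, norm

def moe(n, p=0.5, conf=0.95):
    # classical margin of error: z * sqrt(p(1 - p)/n)
    z = norm.ppf(0.5 + conf / 2)
    return z * math.sqrt(p * (1 - p) / n)

n = 1000
heuristic = 1 / math.sqrt(n)
# Bayesian: Beta posterior after n/2 "yes" answers out of n, flat prior
lo, hi = beta(1 + n // 2, 1 + n // 2).interval(0.95)
half_width = (hi - lo) / 2
print(f"moe = {moe(n):.4f}, 1/sqrt(n) = {heuristic:.4f}, "
      f"Beta half-width = {half_width:.4f}")
```

For n = 1000 all three land near 3.1%, with \(1/\sqrt{n}\) slightly on the pessimistic side, as claimed.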

Nonresponse error

One difficulty to note is if someone being polled by phone...hangs up before completing the survey. (Or, if in person, walks away from the questioner, or whatever.) If this happens sufficiently frequently, it impacts the reliability of the poll, and really increases the margin of error of the poll.

For many years, the response rate was viewed as a measure of the poll's quality. This heuristic is hard to validate.

We don't have an adequate way to digest polls with a high incompletion rate, nor even a definition of what qualifies as a "high incompletion rate".

Puzzle MOE2. Can we have some approximate formula relating the nonresponse rate to the poll quality?

Likely Voters

Some polling firms ask questions to gauge whether the respondent is a likely voter or not. What does this mean? Not every registered voter votes. We'd like to filter out the nonvoters; what's left are generically referred to as "likely voters". The exact statistical models sometimes remain undisclosed: they're the "secret sauce" of polling firms. (Gallup is a notable exception.)

There was some work done by Pew suggesting the likely voter model works fairly well, but can be improved if the respondent's voter history were known (and improved further with some magical machine learning algorithms).

The quality of a poll improves when it reports the results from likely voters, though this is far more costly to the polling firm.

Puzzle LV1. Is there some statistical way to infer how the reliability improves when a poll surveys likely voters as opposed to registered voters?

Poll Aggregation

This is the fancy term used for "combining polls". Let's consider some real data I just took from RealClearPolitics:

Poll               Date         Sample   MOE  Biden  Trump  Margin
Economist/YouGov   5/23 - 5/26  1157 RV  3.4  45     42     Biden +3
FOX News           5/17 - 5/20  1207 RV  3.0  48     40     Biden +8
Rasmussen Reports  5/18 - 5/19  1000 LV  3.0  48     43     Biden +5
CNBC               5/15 - 5/17  1424 LV  2.6  48     45     Biden +3
Quinnipiac         5/14 - 5/18  1323 RV  2.7  50     39     Biden +11
The Hill/HarrisX   5/13 - 5/14  950 RV   3.2  42     41     Biden +1
Harvard-Harris     5/13 - 5/14  1854 RV  2.0  53     47     Biden +6

There are a variety of ways to go about it. The most dangerous way is what RealClearPolitics does: just take the average of responses. For example, take the column of respondents favoring Biden, then take its average (which R tells me is 47.71429%). For Trump, the simple average is 42.42857%. Together, this sums to 90.14286% (only one poll, Harvard-Harris, has Biden and Trump sum to 100% support).

We don't have any way to gauge the margin of error of this estimate, though, and we don't reward larger polls any more than smaller polls.

If we took the weighted mean (weighted by the sample size), Biden would receive 48.30791% and Trump 42.80864% with the weighted mean response at 91.11655% favoring one or the other.
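Both averages are quick to reproduce with numpy, reading the toplines off the table above:

```python
import numpy as np

# Sample sizes and toplines from the RealClearPolitics table above
sizes = np.array([1157, 1207, 1000, 1424, 1323, 950, 1854])
biden = np.array([45, 48, 48, 48, 50, 42, 53])
trump = np.array([42, 40, 43, 45, 39, 41, 47])

print(f"simple:   Biden {biden.mean():.5f}%, Trump {trump.mean():.5f}%")
print(f"weighted: Biden {np.average(biden, weights=sizes):.5f}%, "
      f"Trump {np.average(trump, weights=sizes):.5f}%")
```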

We can further adjust the weights, rewarding likely voter polls (or penalizing all others; for example, weight registered voter polls in proportion to the fraction of registered voters who turned out to vote in the last presidential election, something like 0.58).

The margin of error is all too frequently misinterpreted. It's probably better not to contrive some composite margin-of-error.


Monday, June 24, 2019

How is my candidate doing in the polls?

Given the profusion of polls, it is difficult to accurately gauge how well a given candidate is doing. A simple average of poll numbers won't adequately capture momentum (if such a concept exists), and a few averages (say, one of polls done in the past week, another of polls done in the past month) are difficult to parse. We want one, single, simple number.

We fix the candidate we're interested in, and we have polls \(P_{n+1}\) and \(P_{n}\) released at times \(t_{n}\lt t_{n+1}\). Ideally, we should be able to truncate the N polls to the last k without "much loss".

We could take a moving average, something like \[M_{n+1} = \alpha(t_{n}, t_{n+1}) P_{n+1} + (1 - \alpha(t_{n}, t_{n+1}))M_{n}\tag{1}\] where \(P_{n}\) refers to the nth most recent poll released on the date \(t_{n}\), with the initial condition \(M_{1} = P_{1}\) and the function \[\alpha(t_{n}, t_{n+1}) = 1 - \exp\left(-\frac{|t_{n+1} - t_{n}|}{30 W}\right)\tag{2}\] where W is the average of the intervals between polls, and the difference in dates is measured in days. The 30 in the denominator of the exponent reflects the 30 days in a month.

Exercise 1. Show (1) α takes values between 0 and 1, (2) the larger the α, the quicker the average "forgets" older data, and (3) the longer the gap between polls, the faster older data is forgotten. [What happens for regularly released polling data? Say a new poll is released weekly; what does α look like?]
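Equations (1) and (2) translate into a few lines of code. A sketch (the helper names are mine; polls are taken as (day, support) pairs, oldest first, with at least two polls):

```python
import math

def alpha(dt_days, W):
    # Eq (2): weight given to the newest poll; W = average gap between polls (days)
    return 1.0 - math.exp(-abs(dt_days) / (30.0 * W))

def moving_average(polls):
    # Eq (1) with the initial condition M_1 = P_1
    days = [d for d, _ in polls]
    gaps = [b - a for a, b in zip(days, days[1:])]
    W = sum(gaps) / len(gaps)
    M = polls[0][1]
    for (d0, _), (d1, p1) in zip(polls, polls[1:]):
        a = alpha(d1 - d0, W)
        M = a * p1 + (1 - a) * M
    return M
```

For weekly polls, α = 1 − exp(−1/30) ≈ 0.033, so the average forgets old polls quite slowly.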

Weighing Pollsters

If we knew about poll quality, we could add it in as another factor. Suppose we had a function \[Q\colon \mathrm{Polls}\to (0,1)\tag{3}\] which assigns each poll a quality (higher quality polls are nearer to 1). (The codomain is a little ambiguous: we have it here as \(0\lt Q(P)\lt 1\), but either inequality could be weakened to a "less than or equal to" condition, giving \(0\leq Q(P)\lt 1\), \(0\lt Q(P)\leq 1\), or even \(0\leq Q(P)\leq 1\).) Then we could modify our function in Eq (2) to something like \[ \begin{split} \tilde{\alpha}(t_{n}, t_{n+1}, P_{n+1}) &= Q(P_{n+1})\cdot\alpha(t_{n},t_{n+1}) \\ &= Q(P_{n+1})\cdot\left(1 - \exp\left(-\frac{|t_{n+1} - t_{n}|}{30 W}\right)\right)\end{split}\tag{4}\] which penalizes "worse polls", limiting their influence on the moving average. (Since worse polls have smaller Q values, they yield smaller \(\tilde{\alpha}\), so the moving average weights them less.)

One lazy way to go about this is to use pollster ratings from FiveThirtyEight, discard "F" rated polls, then take the moving average with \(Q(-)\) the familiar grading scheme used in the United States. (Or, more precisely, the midpoint of the interval for the grade.)

Letter grade Percentage Q-value
A+ 97–100% 0.985
A 93–96% 0.945
A− 90–92% 0.91
B+ 87–89% 0.88
B 83–86% 0.845
B− 80–82% 0.81
C+ 77–79% 0.78
C 73–76% 0.745
C− 70–72% 0.71
D+ 67–69% 0.68
D 63–66% 0.645
D− 60–62% 0.61

The other "natural" choices include (a) equidistant spacing in the interval (0, 1] so D- is given the value \(1/13\) all the way to A+ given \(13/13\), or (b) the roots of an orthogonal family of polynomials defined on the interval [0, 1].
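Plugged into Eq (4), the midpoint scheme might look like the following sketch (the function and dictionary names are mine, and the dict keys use ASCII hyphens for the minus grades):

```python
import math

# Midpoint Q-values from the letter-grade table above
Q = {"A+": 0.985, "A": 0.945, "A-": 0.91, "B+": 0.88, "B": 0.845, "B-": 0.81,
     "C+": 0.78, "C": 0.745, "C-": 0.71, "D+": 0.68, "D": 0.645, "D-": 0.61}

def tilde_alpha(dt_days, W, grade):
    # Eq (4); F-rated polls are discarded before ever reaching this function
    return Q[grade] * (1.0 - math.exp(-abs(dt_days) / (30.0 * W)))
```

Note that \(\widetilde{\alpha}\leq Q(P)\) always, so a poll's grade caps how much it can move the average.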

Exercise 2. How do the different possible choices of Q-values affect the running average? [Hint: using the table above, is \(\widetilde{\alpha}\leq 0.61\) an upper bound? Consider different scenarios: good poll numbers from bad polls, bad numbers from good polls.]

Exercise 3. If we assign \(Q(\mathrm{F}) = 0\) as opposed to discarding F-scored polls, how does that affect the weighted running average?

Some computed examples are available on github, but they're what you'd expect.