We will discuss Kalman filters applied to opinion polling. They've been applied quite successfully to robotics (and other fields), so people playing with Arduino kits tend to be curious about them; I hate to disappoint, but we will focus particularly on polls.
Kalman filters are the optimal way to combine polls, in the precise sense that (under the model assumptions below) they minimize the mean squared error of the estimate. The "filter" should be thought of as filtering the noise out of the signal.
We will briefly review the state space formalism, then discuss polling framed in this formalism. Once we rephrase polls in state space variables, we will apply the Kalman filter to track the candidates' support over the campaign.
State Space Models
We have n candidates, and we're interested in the support among the electorate for each candidate. This is described by a vector \(\vec{x}_{k}\) on day \(k\). As we may suspect, support changes day-to-day, which is described by the equation \[ \vec{x}_{k+1} = F_{k}\vec{x}_{k} + \omega_{k} \tag{1a}\] where \(\omega_{k}\sim MVN(0, Q_{k})\) is white noise and \(F_{k}\) is the state evolution operator.
Polls are released, which measure a sample of the electorate. We are informed about the support in terms of how many respondents support each candidate at the time of the survey, denoted \(\vec{z}_{k}\), and it obeys the equation \[ \vec{z}_{k} = H_{k}\vec{x}_{k} + \nu_{k} \tag{1b}\] where \(H_{k}\) is the sampling operator, and \(\nu_{k}\sim MVN(0, R_{k})\) is the sampling error.
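To make Eqs (1a) and (1b) concrete, here is a minimal sketch in Python/NumPy simulating the model. The four components (Democrat, Republican, third party, undecided) and every matrix value here are hypothetical placeholders, not estimates from real polls.

```python
import numpy as np

rng = np.random.default_rng(2016)

def simulate(x0, F, Q, H, R, days):
    """Simulate the state space model of Eqs (1a)-(1b)."""
    xs, zs = [np.asarray(x0, dtype=float)], []
    for _ in range(days):
        # Eq (1a): the state drifts with white noise omega_k ~ MVN(0, Q)
        x = F @ xs[-1] + rng.multivariate_normal(np.zeros(len(x0)), Q)
        # Eq (1b): the poll observes the state with error nu_k ~ MVN(0, R)
        zs.append(H @ x + rng.multivariate_normal(np.zeros(len(x0)), R))
        xs.append(x)
    return np.array(xs), np.array(zs)

# Hypothetical setup: (dem, rep, third, undecided) support proportions,
# random-walk evolution (F = I), small drift and sampling noise.
F = H = np.eye(4)
Q, R = 1e-4 * np.eye(4), 1e-3 * np.eye(4)
xs, zs = simulate([0.45, 0.44, 0.05, 0.06], F, Q, H, R, days=30)
```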
State space models, for our interests, work in this general framework. What questions could we ask here? What problems could we solve here?
- Filtering problem: from observations \(z_{1}\), ..., \(z_{k}\), estimate the current unobserved/latent state \(x_{k}\)
- Smoothing problem: given past, present, and future observations \(z_{1}\), ..., \(z_{k}\), ..., \(z_{N}\) estimate past unobserved/latent states \(x_{1}\), ..., \(x_{k-1}\)
- Prediction problem: given observations \(z_{1}\), ..., \(z_{k}\), estimate future unobserved/latent states \(x_{k+t}\) for some \(t\gt0\)
For our interests, we can estimate the covariance matrices \(R_{k}\) and \(Q_{k}\) for the sampling error and white noise using the normal approximation to the multinomial distribution. We add some extra margin for variation, though: a diagonal matrix of the form \(0.25 z_{cr}^{2}nI\), where \(I\) is the identity matrix, \(z_{cr}\approx 1.96\) is the 95% critical value for the standard normal distribution, and n is the sample size of the poll.
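Here is a sketch of that estimate, assuming the support proportions sum to 1 and using the component ordering above; treat it as illustrative rather than canonical.

```python
import numpy as np

Z_CR = 1.959964  # 95% critical value of the standard normal

def noise_covariance(p, n):
    """Normal approximation to the multinomial covariance for a poll of
    size n over support proportions p, plus the extra diagonal margin
    0.25 * z_cr**2 * n * I described above."""
    p = np.asarray(p, dtype=float)
    multinomial = n * (np.diag(p) - np.outer(p, p))
    margin = 0.25 * Z_CR**2 * n * np.eye(len(p))
    return multinomial + margin
```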
Kalman Filter
We will stick a pin in determining the covariance matrices \(Q_{k}\) and \(R_{k}\) (they will be discussed later) and, for now, just assume they can be determined. In a similar vein, we will assume our polls give us the number of respondents supporting the Democratic candidate, the Republican candidate, the third-party candidate(s), and the undecided respondents. The problem is to estimate \(x_{k}\) using \(z_{1}\), ..., \(z_{k}\) for each k.
The basic structure of the Kalman filter amounts to "guess and check", or in more idiomatic phrasing, "predict and update".
Predict
We begin with an estimate using previous observations \[ \widehat{x}_{k|k-1} = F_{k}\widehat{x}_{k-1|k-1}. \tag{2a} \] The notation of the subscripts for estimates should be read as "time of estimate | using observations up to this time". Hence \(\widehat{x}_{k|k-1}\) uses observations \(z_{1}\), ..., \(z_{k-1}\) to estimate \(x_{k}\). This is the first stab at an estimate.
Now, this is a realization of a multivariate normally distributed random variable. There's a corresponding covariance matrix for this estimate, which is also unknown. We can estimate it using the previous estimates and observations as \[ P_{k|k-1} = F_{k}P_{k-1|k-1}F_{k}^{T} + Q_{k}. \tag{2b} \] Along similar lines, we need to estimate the covariance matrix for \(z_{k}\) as \[ S_{k} = H_{k}P_{k|k-1}H_{k}^{T} + R_{k}. \tag{2c} \] Initially this seemed quite random to me, but the reader familiar with the covariance of random variables can see it follows from Eqs (1b) and (2b).
None of these predicted estimates use the new observation \(z_{k}\). Instead, they are derived using estimates obtained from previous observations.
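In code, the predict step is a direct transcription of Eqs (2a)-(2c); a minimal sketch (the function and variable names are my own):

```python
import numpy as np

def predict(x_prev, P_prev, F, Q, H, R):
    """Predict step: propagate the estimate and its covariance one day
    forward, before looking at the new poll."""
    x_pred = F @ x_prev                # Eq (2a): predicted state
    P_pred = F @ P_prev @ F.T + Q      # Eq (2b): predicted covariance
    S = H @ P_pred @ H.T + R           # Eq (2c): observation covariance
    return x_pred, P_pred, S
```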
Update
The next step incorporates a new observation \(z_{k}\) to improve our estimates. Our estimate has some error; its residual is \[ \widetilde{y}_{k} = z_{k} - H_{k}\widehat{x}_{k|k-1} \tag{3a} \] which tells us how far off our predicted estimate is from the new observation. We will update our estimates with the goal of minimizing the residuals. This is done as \[ \widehat{x}_{k|k} = \widehat{x}_{k|k-1} + K_{k}\widetilde{y}_{k} \tag{3b} \] for some "Kalman gain" matrix \(K_{k}\). Minimizing the sum of squares of the residuals gives us the formula \[ K_{k} = P_{k|k-1}H_{k}^{T}S_{k}^{-1} \tag{3c} \] which we'll derive later.
The updated covariance matrix for the estimates can similarly be derived, in Joseph form for numerical convenience, as \[ P_{k|k} = (I-K_{k}H_{k})P_{k|k-1}(I-K_{k}H_{k})^{T} + K_{k}R_{k}K_{k}^{T}. \tag{3d} \] When we use gain matrices other than the Kalman gain, Eq (3c), this updated covariance equation still holds. But for the Kalman gain matrix, we can simplify it further (useful for mathematics, but not so useful on the computer).
Combining it all together, the fitted residuals are now \[ \widetilde{y}_{k|k} = z_{k} - H_{k}\widehat{x}_{k|k} \tag{3e} \] which the Kalman filter minimizes. Assuming the model described by Eqs (1) is accurate, and that we have correct initial values \(P_{0|0}\) and \(\widehat{x}_{0|0}\), our estimates are then the best possible, in the sense that they minimize the sum of squares of these residuals.
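The update step is likewise a transcription of Eqs (3a)-(3d); another minimal sketch with my own naming (for larger systems one would solve the linear system rather than invert \(S_{k}\) directly):

```python
import numpy as np

def update(x_pred, P_pred, S, z, H, R):
    """Update step: fold the new observation z into the estimate."""
    y = z - H @ x_pred                            # Eq (3a): residual
    K = P_pred @ H.T @ np.linalg.inv(S)           # Eq (3c): Kalman gain
    x_new = x_pred + K @ y                        # Eq (3b): updated state
    I_KH = np.eye(len(x_pred)) - K @ H
    P_new = I_KH @ P_pred @ I_KH.T + K @ R @ K.T  # Eq (3d): Joseph form
    return x_new, P_new
```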
Invariants
If the model is accurate, and if the initial values \(\widehat{x}_{0|0}\) and \(P_{0|0}\) reflect the initial state, then we have the following invariants \[ \mathbb{E}[x_{k}-\widehat{x}_{k|k}] = \mathbb{E}[x_{k}-\widehat{x}_{k|k-1}] = 0\tag{4a} \] and \[ \mathbb{E}[\widetilde{y}_{k}] = 0.\tag{4b}\] These encode the fact that we're accurately estimating the state, and that the residuals are just noise. We have similar invariants on the covariance matrices, like \[ \operatorname{cov}[x_{k}-\widehat{x}_{k|k}] = P_{k|k} \tag{4c} \] and \[ \operatorname{cov}[x_{k}-\widehat{x}_{k|k-1}] = P_{k|k-1} \tag{4d} \] as well as the residual covariance \[ \operatorname{cov}[\widetilde{y}_{k}] = S_{k}. \tag{4e} \] These just encode the assertion that we're estimating the state and covariance matrices "correctly".
Where did all this come from?
The short answer: from minimizing the sum of squares of the residuals, and from the definition of covariance in Eqs (4c) and (4d). From Eq (4c) we derive Joseph's form of the covariance, Eq (3d).
The Kalman gain may be derived by minimizing the expected square of the residuals \(\mathbb{E}[\|x_{k}-\widehat{x}_{k|k}\|^{2}]\). The trick here is to realize \begin{align} \mathbb{E}[\|x_{k}-\widehat{x}_{k|k}\|^{2}] &= \mathbb{E}[\operatorname{tr}\bigl((x_{k}-\widehat{x}_{k|k})(x_{k}-\widehat{x}_{k|k})^{T}\bigr)]\\ &= \operatorname{tr}(\operatorname{Cov}[x_{k} - \widehat{x}_{k|k}]) \tag{5a} \end{align} which, using (4c), amounts to minimizing the trace of the covariance matrix \(P_{k|k}\). As is usual in calculus, we use the derivative test: the minimum occurs when the derivative with respect to the unknown quantity, here the Kalman gain, is zero. We then have \[ \frac{\partial}{\partial K_{k}}\operatorname{tr}(P_{k|k}) = -2(H_{k}P_{k|k-1})^{T} + 2K_{k}S_{k} = 0\tag{5b} \] which we solve for \(K_{k}\). This gives us Eq (3c).
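To spell out the intermediate step: expanding the Joseph form Eq (3d) and substituting Eq (2c) gives \[ P_{k|k} = P_{k|k-1} - K_{k}H_{k}P_{k|k-1} - P_{k|k-1}H_{k}^{T}K_{k}^{T} + K_{k}S_{k}K_{k}^{T} \] and taking the trace, then differentiating using the identities \(\partial\operatorname{tr}(K_{k}A)/\partial K_{k} = A^{T}\) and \(\partial\operatorname{tr}(K_{k}S_{k}K_{k}^{T})/\partial K_{k} = 2K_{k}S_{k}\) (valid since \(S_{k}\) is symmetric), recovers Eq (5b) term by term.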
Determining Initial Values and Parameters
I briefly mentioned how to estimate the covariance matrices for the noise terms appearing in Eq (1). It's basically the covariance matrices for multinomial distributions, perturbed by the sampling margin of error. For \(R_{k}\), we use the poll data to estimate the multinomial covariance; for \(Q_{k}\), we use the \(\widehat{x}_{k-1|k-1}\) estimates.
The initial guess for \(\widehat{x}_{0|0}\) could be estimated in any number of ways: it could be from party registration data, or from the last presidential election, or just from the gut. I use a vector proportional to the last estimate from the previous election, whose components are integers that sum to 3600.
As for the initial covariance matrix \(P_{0|0}\), I use a modified multinomial covariance matrix derived from the initial guess for \(\widehat{x}_{0|0}\). The changes are: multiply the entire matrix by a factor of 12, multiply the diagonal components by an additional factor of 5, and flip the sign of the third-party/undecided off-diagonal components (under the hypothesis that third-party voters are undecided, and vice-versa). Multiplying by these extra factors reflects the uncertainty in these estimates (it'd correspond to an approximate margin-of-error of ±12%).
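As a sketch of that recipe (the component ordering and the helper name are my own; the factors 12 and 5 are the ones quoted above):

```python
import numpy as np

def initial_covariance(x0):
    """Build P_{0|0} from the initial guess x0 = (dem, rep, third, und)."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.sum()                             # e.g. 3600
    p = x0 / n
    P = n * (np.diag(p) - np.outer(p, p))    # multinomial covariance
    P *= 12.0                                # inflate the overall uncertainty
    P[np.diag_indices_from(P)] *= 5.0        # inflate the variances further
    THIRD, UND = 2, 3                        # assumed component ordering
    P[THIRD, UND] *= -1.0                    # third-party voters are "really"
    P[UND, THIRD] *= -1.0                    # undecided, and vice-versa
    return P

P0 = initial_covariance([1650, 1600, 150, 200])  # components sum to 3600
```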
For polling data, this amazingly enough works. If we were building a robot or driverless car, then we should use some rigorous and sound principles for estimating the initial state...I mean, if you don't want cars to drive off cliffs or into buildings.
Partial Pooling and Updating the Estimates
The only quirks left are (1) translating polling data into the \(\vec{z}_{k}\) vectors, (2) handling multiple polls released on the same day, and (3) rescaling \(\vec{x}_{k}\). Let me deal with these in turn; a code sketch covering all three follows the list.
1. Translating poll data. I'm given the percentage of respondents favoring the Democrat (e.g., "43" instead of "0.43"), the percentage favoring the Republican, the percentage undecided. I lump the remaining percentage (100 - dem - rep - undecided) into a hypothetical "third party" candidate. Using these percentages, I translate them into fractions (e.g., "43" [which stands for "43%"] becomes "0.43") then multiply by the sample size. These form the components of the \(\vec{z}_{k}\) vector.
2. Pooling the Polls. Following Jackman's work, I find the precision for the poll as \(\tau_{k} = 2n/z_{cr}\) where \(z_{cr}\approx 1.96\) is the 95% critical value for the standard normal distribution, and n is the sample size for the poll. When multiple polls are released on the same day, we take the weighted mean of the polls, weighted by their precisions.
2.1. What date do we assign a poll, anyways? Polls usually take place over several days, so what date do we assign one? There's probably a clever way to do this, but the standard way is to take the midpoint between the start and end dates of the poll. The simplest way is to use the end date, which is what I've chosen to do for the moment.
3. Rescale the estimates. The \(H_{k}\widehat{x}_{k|k-1}\) needs to be on the same scale as \(z_{k}\). This gives us \(H_{k} = (\|\vec{z}_{k}\|_{1}/\|\widehat{x}_{k|k-1}\|_{1})I\), using the L1-norm (the sum of the absolute values of a vector's components). The \(\vec{x}_{k}\) is proportional to the total electorate, but is not equal to it.
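Here is a minimal sketch of all three quirks, with my own function names and the (dem, rep, third, undecided) ordering assumed throughout:

```python
import numpy as np

Z_CR = 1.959964  # 95% critical value of the standard normal

def poll_to_z(dem_pct, rep_pct, und_pct, sample_size):
    """Quirk 1: turn reported percentages into a count vector z_k,
    lumping the remainder into a hypothetical third party."""
    third_pct = 100.0 - dem_pct - rep_pct - und_pct
    p = np.array([dem_pct, rep_pct, third_pct, und_pct]) / 100.0
    return p * sample_size

def pool_polls(zs, sample_sizes):
    """Quirk 2: precision-weighted mean of same-day polls, with the
    precision tau_k = 2n/z_cr as above."""
    taus = 2.0 * np.asarray(sample_sizes, dtype=float) / Z_CR
    return (taus[:, None] * np.asarray(zs)).sum(axis=0) / taus.sum()

def sampling_operator(z, x_pred):
    """Quirk 3: H_k rescales the state estimate to the poll's scale,
    using the L1 norm."""
    scale = np.abs(z).sum() / np.abs(x_pred).sum()
    return scale * np.eye(len(x_pred))
```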
At the end, when the dust settles, we get an estimate \(\widehat{x}_{k|k}\) which we interpret as a "true sample estimate" reflective of the electorate. It's an optimal reflective sample.
What can I do with it?
Once we've done this, we get a sequence of optimal estimates of support among the electorate. We can use this to estimate the support for the candidates in a "nowcast" by taking a vector \(\vec{c}\) which, written as a sum of unit vectors, would be for a Democratic candidate \[ \vec{c} = \vec{e}_{dem} -\vec{e}_{rep} + c_{t}\vec{e}_{third} + c_{u}\vec{e}_{undecided} \] where \(c_{t}\) is the fraction of third-party voters who ultimately change their mind and vote for the Democratic candidate minus the fraction who ultimately vote for the Republican, and likewise \(c_{u}\) is the difference of undecideds who vote Democrat minus those who vote Republican. Implicit in this is the assumption that either the Democratic candidate or the Republican candidate wins the election.
We can then construct a univariate normal distribution \[ X \sim\mathcal{N}(\vec{\mu}\cdot\vec{c}, \sigma^{2}=\vec{c}\cdot P_{k|k}\vec{c}) \] where \(\vec{\mu} = \widehat{x}_{k|k}\), which describes the margin of victory for the Democratic candidate. The nowcast amounts to computing the probability \[ \Pr(X\gt 0) = p_{dem} \] to forecast the probability the Democratic candidate will win the election, as measured by the current support as estimated by the polls.
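A sketch of the nowcast computation (the swing parameters c_third and c_und are whatever you believe about third-party and undecided voters; zero is the agnostic choice):

```python
import numpy as np
from scipy.stats import norm

def nowcast(x_est, P_est, c_third=0.0, c_und=0.0):
    """Pr(Democratic margin > 0) from the current filtered estimate,
    with components ordered (dem, rep, third, undecided)."""
    c = np.array([1.0, -1.0, c_third, c_und])
    mu = c @ x_est                   # expected margin of victory
    sigma = np.sqrt(c @ P_est @ c)   # its standard deviation
    return norm.sf(0.0, loc=mu, scale=sigma)  # Pr(X > 0)
```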
There's still sensitivity analysis to do with this, but it's actually a surprisingly accurate way to forecast an election if one has sufficient state-level polling data in the competitive states. (It predicted Wisconsin, Pennsylvania, and Michigan would all go for Trump in 2016, for example, a week before the election.)
Code
I've written some code implementing a Kalman filter from polling data for US presidential elections. It's rough around the edges, but it does what one would expect and hope.
Concluding Remarks
The topic of "state space models" is huge and vast, and this is by no means exhaustive. We've just looked at one particular filter. Even for the Kalman filter, we could add more bells and whistles: we could factor in how public opinion changes when polling data is released (because polls are not always published the day after they finish), factor in methodology (phone interviews vs internet polls, likely voters vs registered voters vs adults), house effects, etc.
We could also revisit assumptions made, like treating polls as one-day events, to try to model the situation better.
We could leave the covariance matrices as unknowns to be estimated, which greatly complicates things but could ostensibly be done with MCMC methods. If we wanted to continue working in this domain with state space models, MCMC methods prove fruitful for Bayesian approaches.
References
- Eric Benhamou, "Kalman filter demystified: from intuition to probabilistic graphical model to real case in financial markets". arXiv:1811.11618
- João Tovar Jalles, "Structural Time Series Models and the Kalman Filter: a concise review". Manuscript dated 2009. Eprint
- Simon Jackman, "Pooling the polls over an election campaign". Eprint
- Marko Laine, "Introduction to Dynamic Linear Models for Time Series Analysis". arXiv:1903.11309, 18 pages.
- Renato Zanetti and Kyle J. DeMars, "Joseph Formulation of Unscented and Quadrature Filters with Application to Consider States". Eprint, for further discussion on variations of Kalman gain, etc.