Tuesday, September 1, 2020

Geometric Mean for Combining Probabilities, addendum

We've noted before the geometric mean is best when combining several different forecasts together into one. Today I'd like to discuss how to do this specifically for election forecasts.

If we naively try to combine the forecasts that, e.g., Biden will win Nevada, using the forecasts as of August 30th:

Forecaster	Pr(Biden)	Pr(Trump)
DecisionDeskHQ.com	77.8%	22.2%
The Economist	89%	11%
FiveThirtyEight	77%	23%
JHKForecasts.com	86.1%	13.9%
OurProgress.org	79%	21%
PluralVote.com	76.6%	24.4%
ReedForecasts.com	62.5%	37.5%

We would obtain \[ \Pr(Biden) = 77.87021\%\tag{1a} \] but we would also find \[ \Pr(Trump) = 20.33664\%\tag{1b} \] ...which sums to 98.20685%, which is odd. These probabilities should always sum to 100%, what gives?

The solution is to first transform the probabilities into odds. Then take the geometric mean of the odds, and finally transform back to probabilities.

Why does this work? Well, as odds, a forecast would look like \[ O(\text{Biden wins NV}) = \frac{\Pr(Biden)}{\Pr(Trump)} = \frac{N_{\text{Biden}}}{N_{\text{Trump}}}. \tag{2}\] Taking the geometric mean of the odds gives us better approximations for the ratio of frequencies, which could then be transformed back into probabilities. We can obtain the probability of Biden winning Nevada from the odds by \[ \Pr(Biden) = \frac{O(Biden)}{1 + O(Biden)} \tag{3}\] and for Trump we could note \[ O(Trump) = 1/O(Biden) \] then find \[ \Pr(Trump) = \frac{O(Trump)}{1 + O(Trump)} = \frac{1}{1 + O(Biden)}. \tag{4}\] Hence we find adding Eq (3) to Eq (4) that \[ \Pr(Biden) + \Pr(Trump) = 1 \tag{5} \] probabilities sum to 100%, as expected and desired.

Applied to our forecasts, we find the odds given as:

Forecaster	Odds(Biden)
DecisionDeskHQ.com	3.504505
The Economist	8.090909
FiveThirtyEight	3.347826
JHKForecasts.com	6.194245
OurProgress.org	3.761905
PluralVote.com	3.273504
ReedForecasts.com	1.666667

The geometric mean of these odds is approximately 3.829059, hence a probability of Biden winning Nevada approximately 79.29203% and Trump has a 20.70797% chance of winning Nevada. This makes a difference for Biden of about 2%, whilst negligible improvement for Trump.

Puzzle/Homework. Consider the case of, say, a primary with several candidates. Suppose we have multiple forecasters make predictions for each candidate to win the primary. How can we generalize our method to handle this case?

Monday, July 27, 2020

Kalman Filtering Polls

We will discuss Kalman filters applied to opinion polling. They've been applied quite successfully to robotics (and other fields), so people playing with Arduino kits tend to be curious about them; I hate to disappoint, but we will focus particularly on polls.

Kalman filters are the optimal way to combine polls. The "filter" should be thought of as filtering out the noise from the signal.

We will briefly review state space formalism, then discuss polling framed in this formalism. Once we rephrase polls in state space variables, we will

State Space Models

We have n candidates and we're interested in the support among the electorate for each candidate. This is described by a vector $\vec{x}_{k}$ on day $k$. As we may suspect, support changes day-to-day, which would be described by the equation \[ \vec{x}_{k+1} = F_{k}\vec{x}_{k} + \omega_{k} \tag{1a}\] where $\omega_{k}\sim MVN(0, Q_{k})$ is white noise, $F_{k}$ is the state evolution operator.

Polls are released, which measure a sample of the electorate. We are informed about the support in terms of how many respondents support each candidate at the time of the survey, denoted $\vec{z}_{k}$, and it obeys the equation \[ \vec{z}_{k} = H_{k}\vec{x}_{k} + \nu_{k} \tag{1b}\] where $H_{k}$ is the sampling operator, and $\nu_{k}\sim MVN(0, R_{k})$ is the sampling error.

State space models, for our interests, work in this general framework. What questions could we ask here? What problems could we solve here?

Filtering problem: from observations $z_{1}$, ..., $z_{k}$, estimate the current unobserved/latent state $x_{k}$
Smoothing problem: given past, present, and future observations $z_{1}$, ..., $z_{k}$, ..., $z_{N}$ estimate past unobserved/latent states $x_{1}$, ..., $x_{k-1}$
Prediction problem: given observations $z_{1}$, ..., $z_{k}$, estimate future unobserved/latent states $x_{k+t}$ for some $t\gt0$

This is not exhaustive, Laine reviews more questions we could try to answer. We could try to also estimate the covariance matrices, for example, in addition to smoothing.

For our interests, we can estimate the covariance matrices $R_{k}$ and $Q_{k}$ for the sampling error and white noise using the normal approximation to the multinomial distributions. We add to it, though, some extra margin for variation: we add a diagonal matrix of the form $0.25 z_{cr}^{2}nI$ where $I$ is the identity matrix, $z_{cr}\approx 1.96$ is the 95% critical value for the standard normal distribution, and n is the sample size of the poll.

Kalman Filter

We will stick-a-pin in determining the covariance matrices $Q_{k}$ and $R_{k}$, it will be discussed later, and we'll just assume it can be determined. In a similar vein, we will assume our polls give us the number of respondents supporting the Democratic candidate, the Republican candidate, the third party candidate(s), and the undecided respondents. The problem is to estimate $x_{k}$ using $z_{1}$, ..., $z_{k}$ for each k.

The basic structure to Kalman filter amounts to "guess and check", or using more idiomatic phrasing "Predict and Update".

Predict

We begin with an estimate using previous observations \[ \widehat{x}_{k|k-1} = F_{k}\widehat{x}_{k-1|k-1} \tag{2a} \] the notation of the subscripts for estimates should be read as "time of estimate | using observations up to this time". Hence $\widehat{x}_{k|k-1}$ uses observations $z_{1}$, ..., $z_{k-1}$ to estimate $x_{k}$. This is the first stab at an estimate.

Now, this is a realization of a multivariate normally distributed random variable. There's a corresponding covariance matrix for this estimate, which is also unknown. We can estimate it using the previous estimates and observations as \[ P_{k|k-1} = F_{k}P_{k-1|k-1}F_{k}^{T} + Q_{k}. \tag{2b} \] Along similar lines, we need to estimate the covariance matrix for $z_{k}$ as \[ S_{k} = H_{k}P_{k|k-1}H_{k}^{T} + R_{k}. \tag{2c} \] Initially this seemed quite random to me, but the reader familiar with covariance of random variables could see it from Eq (1b) and (2b).

None of these predicted estimates use the new observation $z_{k}$. Instead, they are derived using estimates obtained from previous observations.

Update

The next step incorporates a new observation $z_{k}$ to improve our estimates. Our estimates have some error, its residuals are \[ \widetilde{y}_{k} = z_{k} - H_{k}\widehat{x}_{k|k-1} \tag{3a} \] which tells us how far off our predicted estimates are from the new observation. We will update our estimates with the goal of minimizing the residuals. This is done as \[ \widehat{x}_{k|k} = \widehat{x}_{k|k} + K_{k}\widetilde{y}_{k} \tag{3b} \] for some "Kalman gain" matrix $K_{k}$. Minimizing the sum of squares of the residuals give us the formula \[ K_{k} = P_{k|k-1}H_{k}^{T}S_{k}^{-1} \tag{3c} \] which we'll derive later.

The updated covariance matrix for the estimates can similarly derived, in Joseph form for numerical computational convenience, as \[ P_{k|k} = (I-K_{k}H_{k})P_{k|k-1}(I-K_{k}H_{k})^{T} + K_{k}R_{k}K_{k}^{T}. \tag{3d} \] When we use gain matrices other than the Kalman gain, Eq (3c), this updated covariance equation still holds. But for the Kalman gain matrix, we can simplify it further (useful for mathematics, but not so useful on the computer).

Combining it all together, the fitted residuals are now \[ \widetilde{y}_{k|k} = z_{k} - H_{k}\widehat{x}_{k|k} \tag{3e} \] which, for Kalman filters, is minimized. Assuming the model described by Eqs (1) is accurate and we have correct initial values $P_{0|0}$ and $\widehat{x}_{0|0}$, then Eq (3e) ensures we have the optimal estimates possible (in the sense that it minimizes the sum of squares of these residuals).

Invariants

If the model is accurate, and if the initial values $\widehat{x}_{0|0}$ and $P_{0|0}$ reflect the initial state, then we have the following invariants \[ \mathbb{E}[x_{k}-\widehat{x}_{k|k}] = \mathbb{E}[x_{k}-\widehat{x}_{k|k-1}] = 0\tag{4a} \] and \[ \mathbb{E}[\widetilde{y}_{k}] = 0.\tag{4b}\] These encode the fact that we're accurately estimating the state, and the residuals are just noise. We have similar invariants on the covariance matrices, like \[ \operatorname{cov}[x_{k}-\widehat{x}_{k|k}] = P_{k|k} \tag{4c} \] and \[ \operatorname{cov}[x_{k}-\widehat{x}_{k|k-1}] = P_{k|k-1} \tag{4d} \] as well as the residual covariance \[ \operatorname{cov}[\widetilde{y}_{k}] = S_{k}. \tag{4e} \] The just encode the assertions we're estimating the state and covariance matrices "correctly".

Where did all this come from?

The short answer is from minimizing the sum of squares of residuals, and from the definition of covariance in Eqs (4c) and (4d). From Eq (4c) we derive Joseph's form of covariance Eq (3d).

Kalman gain may be derived by minimizing the square of residuals $\mathbb{E}[\|x_{k}-\widehat{x}_{k|k}\|^{2}]$. The trick here is to realize \begin{align} \mathbb{E}[\|x_{k}-\widehat{x}_{k|k}\|^{2}] &= \mathbb{E}[\operatorname{tr}\bigl((x_{k}-\widehat{x}_{k|k})(x_{k}-\widehat{x}_{k|k})^{T}\bigr)]\\ &= \operatorname{tr}(\operatorname{Cov}[x_{k} - \widehat{x}_{k|k}]) \tag{5a} \end{align} which, using (4c), amounts to minimizing the trace of the covariance matrix $P_{k|k}$. As is usual in calculus, we use the derivative test, so this occurs when its derivative with respect to the unknown quantity is zero. We're looking for the Kalman gain (as it is the unknown), we then have \[ \frac{\partial}{\partial K_{k}}\operatorname{tr}(P_{k|k}) = -2(H_{k}P_{k|k-1})^{T} + 2K_{k}S_{k} = 0\tag{5b} \] which we solve for $K_{k}$. This gives us Eq (3c).

Determine Initial Values and Parameters

I briefly mentioned how to estimate the covariance matrices for the noise terms appearing in Eq (1). It's basically the covariance matrices for multinomial distributions, perturbed by sampling margin of error. For $R_{k}$, we use the poll data to estimate the multinomial covariance; for $Q_{k}$, we use the $\widehat{x}_{k-1|k-1}$ estimates.

The initial guess for $\widehat{x}_{0|0}$ could be estimated in any number of ways: it could be from party registration data, or from the last presidential election, or just from the gut. I use a vector proportional to the last estimate from the previous election, whose components are integers that sum to 3600.

As for the initial covariance matrix $P_{0|0}$, I use a modified multinomial distribution covariance matrix derived from the initial guess for $\widehat{x}_{0|0}$. The changes are: multiply the entire matrix by a factor of 12, the diagonal components by an additional factor of 5, and change the sign on the covariance components of third-party/undecided off-diagonal components (under the hypothesis that third party voters are undecided, and vice-versa). Multiplying by these extra factors reflects the uncertainty in these estimates (it'd correspond to an approximate margin-of-error of ±12%).

For polling data, this amazingly enough works. If we were building a robot or driverless car, then we should use some rigorous and sound principles for estimating the initial state...I mean, if you don't want cars to drive off cliffs or into buildings.

Partial Pooling and Updating the Estimates

The only quirks left are (1) translating polling data into the $\vec{z}_{k}$ vectors, (2) handling multiple polls on the same day, (3) rescaling $\vec{x}_{k}$. Let me deal with these in turn.

1. Translating poll data. I'm given the percentage of respondents favoring the Democrat (e.g., "43" instead of "0.43"), the percentage favoring the Republican, the percentage undecided. I lump the remaining percentage (100 - dem - rep - undecided) into a hypothetical "third party" candidate. Using these percentages, I translate them into fractions (e.g., "43" [which stands for "43%"] becomes "0.43") then multiply by the sample size. These form the components of the $\vec{z}_{k}$ vector.

2. Pooling the Polls. Following Jackman's work, I find the precision for the poll as $\tau_{k} = 2n/z_{cr}$ where $z_{cr}\approx 1.96$ is the critical value for the 95% standard normal distribution, and n is the sample size for the poll. When multiple polls are released on the same day, we take the weighted-mean of the polls, weighted by the precision of the polls.

2.1. What date do we assign a poll anyways? Polls usually take place over several days, so what date do we assign it? There's probably a clever way to do this, but the standard way is to take the midpoint between the start and end dates for a poll. The simplest way is to use the end date, which is what I've chosen to do for the moment.

3. Recale the estimates. The $H_{k}\vec{x}_{k|k-1}$ needs to be of the same scale as $z_{k}$. This gives us $H_{k} = \|\vec{z}_{k}\|_{1}/\|\widehat{x}_{k|k-1}\|_{1}I$ using the L1-norm (sum of absolute values of components of vectors). The $\vec{x}_{k}$ is proportional to the total electorate, but is not equal to it.

At the end, when the dust settles, we get an estimate $\widehat{x}_{k|k}$ which we interpret as a "true sample estimate" reflective of the electorate. It's an optimal reflective sample.

What can I do with it?

Once we've done this, we get a sequence of optimal estimates of support among the electorate. We can use this to estimate the support for the candidate in a "nowcast" by taking a vector $\vec{c}$ which, written as a sum of unit vectors, would be for a Democratic candidate \[ \vec{c} = \vec{e}_{dem} -\vec{e}_{rep} + c_{t}\vec{e}_{third} + c_{u}\vec{e}_{undecided} \] where $c_{t}$ is the fraction of third party voters who ultimately change their mind and vote for the Democratic candidate minus the fraction of third party voters who ultimately vote for the Republican, and likewise for $c_{u}$ the difference of undecideds who vote Democrat minus those who vote Republican. Implicit in this is the assumption that either the Democratic candidate or the Republican candidate win the election.

We can then construct a univariate normal distribution \[ X \sim\mathcal{N}(\vec{\mu}\cdot\vec{c}, \sigma^{2}=\vec{c}\cdot P_{k|k}\vec{c}) \] which describes the margin of victory for the Democratic candidate. The nowcast amounts to computing the probability \[ \Pr(X\gt 0) = p_{dem} \] to forecast the probability the Democratic candidate will win the election, as measured by the current support as estimated by the polls.

There's still sensitivity analysis to do with this, but it's actually a surprisingly accurate way to forecast an election if one has sufficient state-level polling data in the competitive states. (It predicted Wisconsin, Pennsylvania, and Michigan would all go for Trump in 2016, for example, a week before the election.)

Code

I've written some code implementing a Kalman filter from polling data for US presidential elections. It's rough around the edges, but it does what one would expect and hope.

Concluding Remarks

The topic of "state space models" is huge and vast, and this is by no means exhaustive. We've just looked at one particular filter. Even for the Kalman filter, we could add more bells and whistles: we could factor in how the public opinions changes when polling data is released (because polls are not always published the day after they finish), factor in methodology (phone interview vs internet polls, likely voters vs registered voters vs adults), house effects, etc.

We could also revisit assumptions made, like treating polls as one-day events, to try to model the situation better.

We could leave the covariance matrices as unknowns to be estimated, which greatly complicates things, but ostensibly could be done with MCMC methods. If we wanted to continue working in this domain, using state space models, then MCMC methods prove fruitful for Bayesian approaches.

References

Eric Benhamou, "Kalman filter demystified: from intuition to probabilistic graphical model to real case in financial markets". arXiv:1811.11618
Joao Tovar Jalles, "Structural Time Series Models and the Kalman Filter: a concise review". Manuscript dated 2009, Eprint
Simon Jackman, "Pooling the polls over an election campaign". Eprint
Marko Laine, "Introduction to Dynamic Linear Models for Time Series Analysis". arXiv:1903.11309, 18 pages.
Renato Zanetti and Kyle J. DeMars, "Joseph Formulation of Unscented and Quadrature Filters with Application to Consider States". Eprint, for further discussion on variations of Kalman gain, etc.

Friday, July 24, 2020

Measuring a region's partisanship: a survey

When we say things like, The South is solidly red...what does that mean? How do we measure it?

There are a variety of statistics ("indices"?) to gauge the magnitude of partisanship for the constituency of a region.

Cook PVI

Cook PVI score tells us the party lean for a congressional district (or state), measured as "R+n" or "D+n" or "EVEN", where n is a natural number greater than zero (e.g., 1, 2, 3, 4, 5, etc.). This is basically is done as follows in pidgin code:

if (current_president_party == "D") {
    local_d_avg = (local_d_percent(last_election) +
                     local_d_percent(last_election - 4))/2
    national_d_avg = (national_d_percent(last_election) +
                        national_d_percent(last_election - 4))/2
    
    if (round(local_d_avg) > round(national_d_avg)) {
        return "D+"+round(local_d_avg - national_d_avg);
    } else if (round(local_d_avg) < round(national_d_avg)) {
        return "R+"+round(national_d_avg - local_d_avg);
    } else {
        return "EVEN";
    }
} else if (current_president_party == "R") {
    local_r_avg = (local_r_percent(last_election) +
                     local_r_percent(last_election - 4))/2
    national_r_avg = (national_r_percent(last_election) +
                        national_r_percent(last_election - 4))/2
    
    if (round(local_r_avg) > round(national_r_avg)) {
        return "R+"+round(local_r_avg - national_r_avg)
    } else if (round(local_r_avg) < round(national_r_avg)) {
        return "D+"+round(national_r_avg - local_r_avg)
    } else {
        return "EVEN";
    }
} // other parties handled similarly

The "region" is either a congressional district or an entire state, but ostensibly could be arbitrary (e.g., counties could get its own PVI).

One problem with this is the reliance on voting for a single candidate for determining partisanship or political leaning. But there is a different political culture between a region that (1) votes for the incumbent president and the opposition party for congress, and (2) "down ballot" voters.

Another line of criticism takes issue with working with 2 elections at a time. But really the Cook PVI is just the deviation of the simple moving average for a region compared to the simple moving average for the nation. This smooths out some of the fluctuating "noise".

Partisan Propensity Index

FiveThirtyEight's PPI is explained as:

Thus, Partisan Propensity Index, or PPI, is defined as the percentage chance that the Democrat would have won and open-seat race for Congress in a particular district given the conditions present, on average, between 2002 and 2008 (a period which conveniently featured two good cycles for Republicans and two good ones for Democrats). It contains only two variables: the Presidential vote share in the district and the percentage of households there with incomes below $25,000.

This is computed by using a logistic regression (apparently trained on "1" for Democratic winners, "0" for Republican winners).

One reservation I have, as I had articulated for the Cook PVI, is it turns out there are a lot of people who vote one party for President and the opposition party for congress. At present, I am not convinced the PPI avoids this problem.

The last reservation that comes to mind is the reliance on an economic indicator. This is nothing serious, per se, but could produce spurious results in an economic collapse or a sudden migration.

The prudent reader will realize that both Cook PVI and the PPI both are using federal election data, i.e., results for sending candidates to the White House or to Capitol Hill. Both measures ignore state-level data (e.g., for the gubernatorial race, or for state legislature races).

Ranney Index

Ranney Index, which measures (over a specified time period) the average of:

the proportion of seats held by the Democrats in the state's lower legislature,
the proportion of seats held by the Democrats in the state's upper legislature,
the Democratic gubernatorial candidate's percentage in the election,
the proportion of time the Democrats held both (a) majorities in both chambers of the state legislature, and (b) the governorship.

Since each of these factors are themselves between 0 and 1, the Ranney index will be between 0 (super Republican) and 1 (super Democratic).

The folded Ranney index, computed as 1 − abs(Ranney − 0.5), tells us how competitive the state. For states which are competitive (nearer 0.5) or solidly partisan (nearer 1.0).

Another variant is to use a moving average of the Ranney Index computed for each election, with the "time interval" being 2 gubernatorial elections. This mirrors the simple moving average scheme used in the Cook PVI.

This has the exact opposite problem as the Cook PVI and the PPI statistics, namely an exclusive focus on state-level elections while ignoring federal elections.

Majority Party Index

James W. Ceaser and Robert P. Saldin's A New Measure of Party Strength (2005) present a different metric, the Majority Party Index (MPI). The MPI is, like the Ranney index, a weighted average, but split evenly between federal elections and state-level elections.

The first component of the MPI and the first nationally based measure is the two-party vote for president in each state's most recent presidential election. Thus in 2000, the value is based on the election result of that year. For 2002 the same value is entered because there was no presidential election in 2002. This factor accounts for 25 percent of the total index value for each state.

The second MPI component, also at the national level, is the two-party vote in each state's two most recent United States Senate elections and accounts for 12.5 percent of the total index value. To take an example, Idaho's U.S. Senate value is calculated by averaging the results from the 1998 and 2002 Senate elections. By taking both of the Senate elections into account, despite the time lag on the former, the MPI attempts to reflect partisan voting from year to year without over-emphasizing current partisan swings. By including a Senate result every year, it evens out results among different states.

The third component, and the final national measure, is the total two-party average of all votes in each state's biennial U.S. House elections. Virginia's value, for example, is obtained by adding Republican and Democratic votes in all congressional districts and calculating each party's percentage of this total. This method provides a more accurate measure of the overall state partisan choice than would be obtained by averaging the two-party percentages of each district because districts differ in population and turnout. In addition, it reduces the impact of uncontested seats. This component of the MPI accounts for 12.5 percent of each state's total score.

The fourth component of the MPI, and the first state-level measure, is the two-party vote percentage in each state's most recent gubernatorial election. This component accounts for 25 percent of each state's total.

The fifth measure, also at the state level, is the two-party percentage of all seats in each state's Senate. The Major Party Index value is determined in this case by dividing the number of Republican seats by the sum of Democratic and Republican seats. There were two reasons in this case for using seats (as opposed to votes) as the basis of calculation: vote totals for state legislative electlons are difficult to find, and many of these races are uncontested (a large number of uncontested seats skews the two-party vote totals). This component is weighted as 12.5 percent of the total index.

The final component of the MPI, and the third state-level measure, is the two-party percentage of seats in state houses. This score is calculated in the same way as that for the state senate and is worth 12.5 percent of the total index value.

Nebraska has a nonpartisan, unicameral state legislature. What we do for them is omit the state legislature factors, then reweigh everything accordingly. (The president factor is weighed 1/3, the house and senate factors each weighed 1/6, and the governor is weighed 1/3?)

Thus the formula is:

MPI = ((Most recent 2-Party Republican Presidential Vote)*0.25) + 
    ((Average of Two Most Recent Republican 2-Party Votes for US Senate)*O.125) + 
    ((Republican 2-Party Percent of all US House Votes)*O.125) +
    ((Most Recent 2-Party Republican Vote for Governor)*0.25) +
    ((2-Party Republican Percentage of Seats in the State Senate)*O.125) +
    ((2-Party Republican Percentage of Seats in the State House)*O.125).

We could, for ease, rescale this to be between −1 for Democratic states to +1 for Republican states.

The only problem with this approach, and this stems from my own laziness, is there is no good data source for Governor election data. Or, at least, none that I could find.

Also, the turnover for legislator resignation is far higher in state legislatures. (They resign in disgrace far more frequently than federal legislators.) It's unclear to me how to handle the situation when the replacements belong to different parties.

Equally unclear to me is how special elections are handled for the federal offices. When a senator dies or resigns, some states have a special election for replacing the senator. Do we include special election percentages in the MPI?

Concluding Remarks

It would be useful to compute the Cook PVI for the states, as well as the MPI for the states, to see if these correlate or not. Similarly if there's any correlation between the Cook PVI and the Ranney index, that would be interesting to investigate.

We could ostensibly restrict the region size to counties (or county-equivalents) and use vote data to estimate indices at that level. Although counties range from 100 (rural Texas) to 10 million (Los Angeles), the intent is to apply the formulas to different sub-regions of states, to see how they compare within a state.

(This was more or less a summary I've had lying around for over a year, and I thought I ought to quit polishing it and just publish the thing.)

Thursday, July 23, 2020

Asymptotic Multinomial Distribution

The binomial distribution $\operatorname{Bin}(n,p)$ is asymptotically normal $\mathcal{N}(\mu=np, \sigma^{2}=np(1-p))$ provided $p$ is not "too close" to zero or one. There are a variety of heuristics about when this holds (e.g., $n\geq30$, etc.), but I'd like to discuss what happens with a multinomial distribution for large n.

A reasonable check (or consistency condition) is when there are $k=2$ outcomes in our multinomial distribution, we recover the familiar Binomial result: we should have a univariate normal distribution. This suggests a k-nomial distribution is asymptotically a $(k-1)$-variate normal distribution.

With this in mind, we would expect the asymptotic relation to be of the form \[ \operatorname{Multinom}(n, \vec{p}) \approx\mathcal{MVN}(\vec{\mu}=n\vec{p}, \Sigma = nC) \tag{1} \] where $C$ is the covariance matrix for the multinomial distribution. To be fully explicit for my "future self", the components of the covariance matrix are \[ (C)_{ij} = p_{i}(\delta_{ij}-p_{j}). \tag{2} \] The problem with this conjecture is the multivariate normal is $k$-dimensional and degenerate (which could be seen by multiplying the covariance matrix by the vector with all k entries equal to 1).

Our consistency condition suggests discarding one component of the vector $n\vec{p}$ and the corresponding row and column of the covariance matrix $\Sigma = nC$. Conventions usually drop the last component of the mean vector as well as the last row and column of the covariance matrix.

Also note, the condition for the binomial distribution that "p is not too close to 1 or 0" also has a counter-part with the multinomial distribution. This is that the vector $\vec{p}$ is not near the boundary of the n-simplex (i.e., no single component is "near 1 or 0").

There's nothing deep here. This is just a note to myself, since I find myself using this quite a bit when looking at state-space methods.

Saturday, July 18, 2020

The Economist's 2020 Forecast

Every presidential election, forecasts crop up. Some are qualitative, others are quantitative, some scientific, others humorous (e.g., "tossup bot"). Danger lies when either humorous forecasts are mistaken as serious, or when scientific forecasts err in its approach (by accident or by bad design).

I was struck reading this article on forecasting US presidential races, specifically the passage:

[Modelers] will be rolling out their predictions for the first time this year, and they are intent on avoiding mistakes from past election cycles. Morris, the Economist’s forecaster, is one of those entering the field. He has called previous, error-prone predictions “lying to people” and “editorial malpractice.” “We should learn from that,” he says.

(Emphasis added) I agree. But without model checking, structured code, model checking, some degree of design by contract, model checking, unit testing, model checking, reproducibility, etc., how are we to avoid such "editorial malpractice"?

The Economist has been proudly advertising its forecast for the 2020 presidential election. On July 1st, they predicted Biden would win with 90% certainty. Whenever a model, any model, has at least 90% certainty in an outcome, I have a tendency to become suspicious. Unlike most psephological prognostications, The Economist has put the source code online (relevant commit 980c703). With all this, we can try to dispel any misgivings concerning the possibility of this model giving us error-prone predictions...right?

This post will focus mostly about the mathematics and mechanics underpinning The Economist's model. We will iteratively explain how it works in successive refinement, from very broad strokes to more mundane details.

How their model works

I couldn't easily get The Economist's code working (a glaring red flag), so I studied the code and the papers it was based upon. The intellectual history behind the model is rather convoluted, spread across more than a few papers. Instead of tracing this history, I'll give you an overview of the steps, then discuss each step in detail. The model forecasts how support for a candidate changes over time assuming we know the outcome ahead of time.

The high-level description could be summed up in three steps:

Step 1: Tell me the outcome of the election.

Step 2: Fit the trajectory of candidate support over time to match the polling data such that the fitted trajectory results in this desired outcome from step 1.

Step 3: Call this a forecast.

Yes, I know, the input to the model in step 1 is normally what we would expect the forecast to tell us. But hidden away in the paper trail of citations, this model is based on Linzer's work. Linzer cites a paper from Gelman and King from 1993 arguing it's easy to predict the outcome of an election. We've discussed this before: it is easy to game forecasting elections to appear like a superforcaster.

Some may object this is a caricature of the actual model itself. But as Linzer (2013) himself notes (section 3.1):

As shown by Gelman and King (1993), although historical model-based forecasts can help predict where voters’ preferences end up on Election Day, it is not known in advance what path they will take to get there.

This unexamined critical assumption lacks justification. Unsurprisingly, quite a bit has happened in the past 27 years...there's little reason to believe such appeals to authority. If you too doubt this assumption is sound, then any predictions made from a model thus premised is equally as suspect.

In pictures

Step 1, you give me the outcome. We plot the given outcome ("prior") against time:

We transform the y-axis to use log-odds scale rather than probability, for purely technical reasons. We drop a normal distribution around the auxiliary model's "prior probability" which reflects our confidence in the model's estimates. We then randomly pick an outcome according to this probability distribution (the red dot) which will 68% of the time be within a sigma of the prior (red shaded region):

Then we perform a random walk backwards in time until we get to the most recent poll.1Why backwards? Strauss argued in his paper Florida or Ohio? Forecasting Presidential State Outcomes Using Reverse Random Walks this circumvents projection bias, which tends to underestimate the potential magnitude of future shocks. This technique has been more or less accepted, conveyed only through folklore. Since this is a random walk, we shade the zone where the walk would "likely be":

(The random walk is exaggerated, as most of the figures are, for the sake of conveying the intuition. The walk adjusts the trajectory by less than ~0.003 per day, barely perceptible graphically.)

Once we have polling data, we just take this region and bend it around the polling data transformed to log-odds quantities (the "x" marks):

The fitting is handled through Markov Chain Monte Carlo methods, though I suspect there's probably some maximum likelihood method to get similar results.

Now we need to transform our red region back to probability space:

Note: the forecast does not adjust the final prediction (red endpoint) beyond 2% when polling data is added. There are some technical caveats to this, but this is true for the choice of parameters The Economist employs for 2016 and presumably 2020 (2012 narrowly avoids being overly-restrictive). For this reason I explain step 1 as "tell me the outcome" and step 2 works out how to get there. Adding polling data does not impact the forecast on election day.2I discovered this accidentally by trying to make the 2016 forecast reproducible. The skeptical reader can verify this by changing the RUN_DATE and supplying to rstan a fixed seed (and subsets of a fixed initial values). Fixing the initial values, and the seed supplied to rstan produced identical forecasts on election day. Adding more polling data does not substantively adjust the final predictions in the states, but changing the initial guesses will. I suspected this when I compared the forecasts made at the convention as opposed to on election day: to my surprise, they were shifted by less than 2%. The only substantial source of altering the final prediction stems from precisely what was fixed: (1) randomly selecting a different final prediction (red endpoint), and (2) randomly sampling these initial values (which accounts for a fraction of a difference compared to randomly changing the red endpoint).

And we pretend this is a forecast. We do this many times, then average over the paths taken, which will produce a diagram like this. Of course, if the normal distribution has a small standard deviation (is "tightly peaked" around the prior prediction), then the outcome will hug even tighter to the prior prediction...and polling data loses its pulling power. The Economist's normal distribution is alarmingly narrow for 2016, according to Linzer's work. This means that the red dot forecast will be nearly always around the horizontal line for the auxiliary model's forecast, i.e., amounts to small fluctuations around an auxiliary model. Very likely, with the data supplied, it appears The Economist is using an overly narrow normal distribution for its 2020 forecasts (similar to its 2016 forecasts), which means it's window-dressing for the prior prediction.

If this seems like it's still a caricature, you can see for yourself with how The Economist retrodicted the 2016 election.

Step 1: Tell me the Outcome

How do we tell the model what the outcome is? The chain of papers appeals to "the fundamentals" (from economic indicators and similar "macro quantities", we can forecast the outcome). Let this sink in: we abdicate forecasting to an inferior model, to which we will fit data around. The literature and The Economist use the Abramowitz time-for-change mode, a simple linear regression in R pseudo-code:

Incumbent share of popular vote ~ (june approval rating) + (Q2 gdp)

This is then adjusted by how well the Democratic (or Republican) candidate performed in each state to provide rough estimates for each state's election outcome. The Economist claims to use some "secret recipe" on their How do we do it? page, and it does seem like they are using some slight variant (despite having Abramowitz's model in their codebase).

Linzer notes this is an estimate, and we should accommodate a suitably large deviation around these estimates. We do this by taking normally distributed random variables: \[\beta_{iJ}\sim\mathcal{N}(\operatorname{logit}(h_{i}),\sigma_{i}) \tag{1}\] Where "$h_{i}$" is the state-level estimates we just constructed, "$\sigma_{i}$" is a suitably large standard deviation (Linzer warns against anything smaller than $1/\sqrt{20}\approx 0.2236$), and the "$\beta_{iJ}$" are just reparametrized estimates on the log-odds scale intuitively representing the vote-share for, say, the Democratic candidate. (The Economist uses parameters $\sigma_{i}^{2}\approx 0.03920$ or equivalently $1/\sigma_{i}^{2}\approx25.51$, which overfits the outcomes given in this step. Linzer introduces parameter "$\tau_{i}=1/\sigma_{i}^{2}$" and warns explicitly "Values of $\tau_{i}\gt20$ [i.e., $1/\sigma_{i}^{2}\gt 20$ or $\sigma_{i}^{2}\lt 0.05$] generate credible intervals that are misleadingly narrow." This is not adequately discussed or analysed by The Economist.)

We should be quite explicit stressing the auxiliary model producing the prior forecast uses "the fundamentals", a family of crude models which work under fairly stable circumstances. The unasked question remains is this valid? Even assessing if the assumptions of the fundamentals holds at present, even that, is not discussed or entertained by The Economist. While experiencing a once-in-a-century pandemic, and the greatest economic catastrophe since the Great Depression, I don't think that's quite "stable".3Given the circumstances, the Abramowitz model underpinning The Economist's forecast fails to produce sensible values. For example, it predicts Trump will win −20% of the vote in DC and merely 34% of the vote in Mississippi (the lowest amount a Republican presidential candidate received since 1968, when Gov Wallace won Mississippi on a third party ticket). This should be a red flag, if only because no candidate could receive a negative share of the votes.

The logic here may seem circular: we forecast the future because we don't know what will happen. If we need to know the outcome to perform the forecast, then we wouldn't need to perform a forecast (we'd know what will happen anyways). On the other hand, if we don't know the outcome, then we can't forecast with this approach.

Bayesian analysis avoids this by trying many different outcomes to see if the forecasting method is "stable": if we change the parameters (or swap out the prior), then the forecast varies accordingly. Or, it should. As it stands now, it's impossible to do this analysis with The Economist's code without rewriting it entirely (owing to the fact that the code is a ball-of-mud). Needless to say, doing any form of model checking is intractable with The Economist's model given its code's current state.

Step 2: "Fit" the Model...to whatever we want

The difference between different members of this family of models lies in how we handle $\beta_{i,j}$. For example, Kremp expands it out at time $t_{j}$ and state $s_{i}$ (poor choice of notation) as \[ \beta[s_{i}, t_{j}] = \mu_{a}[t_{j}] + \mu_{b}[s_{i},t_{j}] + \begin{bmatrix}\mbox{pollster}\\\mbox{effect}\end{bmatrix}(poll_{k}) + \dots\tag{2} \] decomposing the term as a sum of nation-level polling effects $\mu_{a}$, state-level effects $\mu_{b}$, pollster effects (Kremp's $\mu_{c}$), as well as the "house effect" for pollsters, error terms, and so on. The Economist refines this further, and improves upon how polls impact this forecast. But then every family member uses Markov Chain Monte Carlo to fit polling data to match the results given in step 1. It's all kabuki theater around this outcome. The ingenuity of The Economist's model sadly amounts to an obfuscatory facade.

Remark. We should note the fitting of 2008 is highly questionable. When we forecast the outcome, we adjust the national forecast to state-level results using the previous election. So for 2008, we use our time-for-change estimates of the proportion of the popular vote Obama would receive, and then we adjust this based on how Kerry over-performed expectations in 2004. The Economist doesn't do this: they adjust based on 2008 results. I can't stress how questionable this is.

Although Linzer's original paper examined the 2008 election in detail, that doesn't alleviate the problem that The Economist used it as a metric for calibrating their model's goodness-of-fit. It allows the forecaster to claim to have greater accuracy than the model actually possesses. This is bad and if using 2004 data in the 2008 calibration produces worse results, then this is, in Morris's words, "malpractice".

Here's the problem in graphic detail. With historic data, this is the plot of predictions The Economist would purportedly have produced (the back horizontal lines are the prior predicted, the very light horizontal gray is the 50% mark):

Observe the predictions land within a normal distribution with standard deviation ~2.5% of the prior forecast (the black horizontal line). The entire complicated algorithm adjusts the initial prediction slightly based on polling data. (In fact, none of the forecasts differ in outcome compared to the initial prediction, including Ohio.) Now consider the hypothetical situation where Obama had a net −15% approval rating in June and the US experienced −25% Q2 GDP growth. The prior forecast projects a dim prospect for hypothetical Obama's re-election bid:

This isn't to say that the prediction is immune to polling data, but observe the prediction on election day is drawn from the top 5% of the normal distribution around the initial prediction [solid black line]. From a bad prior, we get a bad prediction (e.g., falsely predicting Obama would lose Wisconsin, Pennsylvania, Ohio, New Hampshire, Iowa, and Florida). This is mitigated if we have a sufficiently $\sigma^{2}\gt0.05$ as this hypothetical demonstrates.

Also worth reiterating, in 2012, the normal distribution around the prior is wider than in 2016: when the same normal distribution is used, the error increases considerably. We can perform the calculations over again to exaggerate the effect:

Take particular note how closely the prediction hugs the guess. Of course, we don't know ex ante how closely the guessed initial prediction matches the outcome, so it's foolish to make claims like, "This model predicts Biden with 91% probability will win in November, therefore it's nothing like 2016." From a bad crow, a bad egg.

How did this perform in 2016?

Another glaring red flag should be this model's performance in 2016. Pierre Kremp implemented this model for 2016, and forecasted a Clinton victory with 90% probability. The Economist's modifications didn't fare much better, a modest improvement at the cost of drastic overfitting.

Is overfitting really that bad? The germane XKCD illustrates its dangers.

The play-by-play forecasts for multiple snapshots across time. Initially, when there's no data (say, on April 1, 2016), it's very nearly the auxiliary model's forecast:

Note: each state has its own differently sized normal distribution (we can be far more confident about, say, California's results than we could about a perennially close state like Ohio or Florida).

Now, by the time of the first convention July 17, 2016, we have more data. How does the forecast do? Was it correct so far? Well, we plot out the same states' predictions with the same parameters:

Barring random noise from slightly different starting points on election day, there's no change (just a very tiny fluctuating random walk) between the last poll and election day. There's no forecast, just an initial guess around the auxiliary model's forecast.

We can then compare the initial forecast to this intermediate forecast to the final forecast:

Compare Wisconsin in this snapshot to the previous two snapshots, and you realize how badly off the forecast was until the day of the election. Even then, The Economist mispredicts what a simple Kalman filter would recognize even a week before the election: Clinton loses Wisconsin.

The reader can observe, compared to the hypothetical Obama 2012 scenario where Q2 GDP growth was −25% and net approval rating −15%, the Clinton snapshots remain remarkably tight to the prior prediction. Why is this? Because the $\sigma^{2}\lt0.05$ for the Clinton model, but $\sigma^{2}\gt0.05$ for the Obama model. This is the effect of overfitting: the forecast sticks too closely to the prior until just before the election.

Although difficult to observe close to election day (given how cramped the plots are), the forecast doesn't change on election day more than a percentage point or two: it changes the day prior, sometimes drastically. We could add more polling data, but the only way for the forecast to substantially change is for the random number generator to change the endpoint, or for the Markov Chain Monte Carlo library to use a different seed parameter (different parameters for the random number generator).

Conclusion

When The Economist makes boastful claims like

Mr Comey himself confessed to being so sure of the outcome of the contest that he took unprecedented steps against one candidate (which may have ended up costing her the election). But the statistical model The Economist built to predict presidential elections would not have been so shocked. Run retroactively on the last cycle, it would have given Mr Trump a 27% chance of winning the contest on election day. In July of 2016 it would have given him a 30% shot.

And no one else was skeptical of a Clinton victory in July 2016? We can recall FiveThirtyEight gave Clinton a more modest 49.9% chance of winning in July compared to Trump's 50.1%; and on election day, comparable odds as The Economist claims. FiveThirtyEight had a "worse" Brier score than The Economist (the only metric they're willing to advertise), but The Economist had the worse forecast in July. Inexcusably worse, The Economist begins by assuming the outcome, then fitting the data around that end. What of that "editorial malpractice" that, as Mr Morris of The Economist urged, "We should learn from"?

We should really treat predictions from The Economist's model with the same gravity as a horoscope. The Economist should take a far more measured and humble perspective with its "forecast". It is more than a little ironic that the magazine which makes such confident claims on the basis of a questionable model once wrote, Humility is the most important virtue among forecasters. Although this election may appear to be an easy forecast — Vice President Biden's lead over President Trump appears large in the polls (routinely double digits) except the undecideds routinely poll in double digits (sound familiar from 2016?)4Remember, when looking at the lead one candidate has over another in a poll ("Biden has a +15% lead"), that poll's margin of error should be doubled, due to the rules of arithmetic of Gaussian distributed random variables. This implies the undecideds plus twice the reported margin of error is routinely equal to or greater than the margin Biden leads Trump. This is an underappreciated point, an eerie parallel to 2016. — we may soon learn The Economist exists to make astrologers look professional.

Tuesday, July 14, 2020

What are Swing States?

Puzzle: What is the criteria of a swing state?

The notion of a swing state has been introduced by journalists, first seemingly used by Brody's news article "Carter, Reagan camps focusing on suburbs in the swing states" in the Washington Post (dated Sep 28, 1980). Its meaning changes with every news article.1 The Economist, Road to 270 It is still, technically, a swing state (meaning that the average polling margin is less than 5%)...

Daily Kos's Swing State Project defined it as ...any state where the margin of victory [of combined past two elections] was 6 points or less. This was probably the least rigorously defined source.

The Hill When Swing States Stop Swinging By “swing states,” I am referring to the nine or ten perennial campaign battleground states that have, since 1988, been decided by less than five percentage points in the majority of elections, are usually bellwethers mirroring the national vote in those elections, and which have flipped between Democrat and Republican victories.

The Cook Political report has treated any state with PVI scores between D+5 to R+5 as "swing states", at least according to a few random articles I read but could not recall.

Swing Voters and Elastic States [FiveThirtyEight.com] notes The states that are traditionally thought of as swing states — meaning that they are close to the national average in their partisan orientation — ...

Stacey Hunter Hecht and David Schultz note in their introduction to Presidential Swing States: Why Only Ten Matter (2015) that political scientists haven't studied the notion of a "swing state", hence it is left to journalists to define the term. Instead political scientists use the term "competitive elections" or "competitive states".2 Schultz and Hect note in the preface (to the 2015 edition) four criteria which they use to designate a swing state:

First, it is a competitive state in the political science meaning of the term. Other scholars have suggested a 10 percentage point margin or less in the last election as competitive (Glaeser and Ward). Instead, this book employs a five percentage point standard as the first decisional criteria. [...]

Second, in a nod to the bellwether designation in the media, this book assesses whether a state has sided with the winner in presidential elections through the period [of the past 7 elections]. By that, over the last seven elections, how many times has a state's popular vote...matched that of the final result in the presidential election [i.e., the state was won by the president-elect]. [...]

The third criteria for being a swing state is the incidence of flipping during the past seven presidential contests. This really speaks to the swing in the label swing state—that is how often does the state change with regard to its vote for president?

Finally, using data from Koza et al., this book examines the number of post-convention events events there were in 2012 [the election year in question]. (xxx)

If we treat "which party system we're in" as a bus waiting problem in between N = 9 elections before party realignment, it turns out we expect to be in N(1 − e⁻¹) + e⁻¹ ≈ 6.05696 previous elections until we're at a realigning election, which is a theoretical justification for the second and third criteria's usage of 7 previous elections. This is actually foreshadowing another topic I'm driving at in a series of posts ("party systems" as a periodization of US political history: sensible or meaningless), so stick a pin in this aside.

The hope is that swing states coincide with Bellwether states, i.e., states which correlate with the outcome of the election as a whole. A Bellwether state is one which usually reflects the election's outcome: whoever wins the state usually wins the presidential election. Ohio is currently the textbook example of a bellwether.

Predicted Swing States in 2020

Or, "Shut up and give me predictions!" Based on the campaign events in 2016 and the margins of victory, our predictions are enumerated in the table below. One column uses the classic logistic regression with the unweighted alternations between parties, the next uses a Bayesian logistic regression, the third uses a classic logistic regression with a weighted mean (rewarding more recent swings more than swings a while ago), and the fourth column uses a Bayesian logistic regression and the weighted swings. The only real difference is in the edge-cases, which are nearly swing states (like Georgia, Colorado, and Iowa). I highlight instances where the probability forecasted is at least 50%

State	Logistic	Bayes	Weighted Logistic	Weighted Bayes
Florida	0.9993956	0.9964374	0.9999655	0.9990531
Pennsylvania	0.9652903	0.9537852	0.9606524	0.9444615
North Carolina	0.9555401	0.9294878	0.9241776	0.8904109
New Hampshire	0.9434254	0.8744334	0.9738157	0.9209954
Michigan	0.8596214	0.8681566	0.8005915	0.8393148
Ohio	0.8336188	0.7632081	0.7310860	0.6594543
Nevada	0.8301466	0.7199140	0.8601073	0.7600979
Wisconsin	0.7514187	0.7907851	0.6308183	0.7402993
Arizona	0.6563595	0.6751866	0.9660922	0.9102213
Colorado	0.6162296	0.5055186	0.9027512	0.7521946
Iowa	0.3770038	0.3817862	0.1045918	0.1867988
Virginia	0.3452857	0.3377292	0.2091488	0.2761821
Minnesota	0.2586809	0.3289624	0.2130948	0.3467333
Georgia	0.1617802	0.2894286	0.5362766	0.5598381
Maine	0.1596644	0.2286243	0.1134700	0.2283813
New Mexico	0.1158958	0.1205112	0.0389905	0.0753137
Texas	0.0082930	0.0376663	0.0027493	0.0239994

Statistical Model

Lets try to consider Schultz and Hect's criteria as inputs into a logistic regression:

m = margin of victory for the [previous] President [in the given state] in the previous presidential election
b = bellwether input = number of times state went to the winner of the election (over the past 7 previous elections), turns out not significant in the classic logistic regression;
s = state swing-iness = min(number of time state went D in past 7 elections, number of times state went R in past 7 elections)
v = number of post-convention events in the current presidential cycle

There is some degree of freedom here with choice of input signals. For the bellwhether input, we could have chosen the fraction of times the state went to the election's winner, as opposed to the count of times this occurred. We could also use a running mean, to penalize (for example) Indiana in 2020 which went to the Republican candidate in 6 of the past 7 presidential elections (it went to Obama in 2008, which seems like a fluke rather than swinginess).

Why Logistic Regression? We are trying to classify states as either "swing" (= 1) or "stable" (= 0). The logistic regression will accomplish this and give us information in the coefficients for the inputs as log-odds.

The alternative that immediately springs to mind would be the probit model, which would work if we standardized the inputs via the z-transform. There is no compelling a priori reason to do so, but it is a perfectly valid thing to do. It just makes interpreting coefficients harder. (It may be worth revisiting this later, to see if a probit model is worth while.)

Multicollinearity. It may be tempting to throw everything into a logistic regression and see what happens but we must test for multicollinearity among the input signals. That is to say, we need the input variables to be uncorrelated. How to measure this? Simply if the variance inflation factor (VIF) is greater than 2.50 or 3 for any input variables, then we should grow concerned.3There are three cases when multicollinearity is unworrying: (1) high VIF variables are control variables, low VIF variables are input variables; (2) when high VIF variables are products of input variables; (3) when high VIF variables are categorical variables or product of 3, or more product variables.

Multicollinearity tends to cause overfitting, which is bad for prediction...which is precisely what we're interested in doing! Fortunately, there is no multicollinearity: the VIG scores are routinely around 1.1, though for the Bayesian logistic regression on unweighted swings has a score of 1.608148 for the number of post-convention campaign events, and 1.571879 for the number of swings. These aren't terrible, though worth keeping an eye on.

Predictive or Postdictive? We should be careful with the inputs to make sure we can have predictions about which states are swing states. To be clear, when predicting if a state is a swing state in a given election, we use the margin-of-victory, a bellwether dummy variable (1 if it went to the winner and 0 otherwise), and number of post-convention events all from the previous election. This lets us construct a logistic regression trained on past data, then predict which states are "swing states" in 2020 from the 2016 data.

Training Data. We use results reported in JS Hill, E Rodriquez, AE Wooden, "Stump speeches and road trips: The impact of state campaign appearances in presidential elections" for determining swing states in 2000, 2004, and 2008. For campaign events, we consulted a variety of sources. There's some freedom in whether we count the appearances of just the presidential candidate, or if we include the vice presidential candidate; and independently whether we consider all campaign appearances or post-convention appearances.

Ostensibly, if we disagree with their assessments (and it is fairly subjective what to include or exclude), then we would get a slightly different estimate.

Testing 2012 Elections

When applied to 2012, we have the following table of swing states:

state	Unweighted	Bayes	weighted Predict	weighted Bayes
Florida	0.9620333	0.9283174	0.9689775	0.9295346
Missouri	0.9342639	0.8611949	0.9945827	0.9668182
Ohio	0.9330508	0.8862916	0.9888160	0.9526306
North Carolina	0.9162987	0.8348147	0.8095200	0.7615408
Indiana	0.8679019	0.7702609	0.6951285	0.6684250
Montana	0.6741770	0.5723402	0.9696912	0.8823920
Colorado	0.3780801	0.3857023	0.7783322	0.6267094
Georgia	0.3267437	0.2887727	0.8431760	0.6549777
Virginia	0.2263317	0.3512313	0.0971042	0.2399934
New Hampshire	0.2212433	0.2562481	0.4795003	0.4063095
Arizona	0.0964162	0.1039871	0.0715052	0.1014729
Iowa	0.0826769	0.1450948	0.0155189	0.0617803
Nevada	0.0643182	0.0968483	0.1186013	0.1386136
Pennsylvania	0.0592556	0.1298407	0.2179990	0.2636838

Testing 2016

Remember, we're retrodicting the probability a state will be a swing-state in 2016 based off of the number of post-campaign events held in 2012, the number of swings, if it was a bellwether, and the previous (2012) margin of victory. Since there were some surprises (e.g., the Rust belt), we should remember in 2012 there were few post-convention events in, say, Michigan. I have put in bold states which now seem like they should've been treated as swing states, but at the time were not. I have also italicized the states which at the time were treated as swing states, but don't seem to qualify looking back.

state	Unweighted	Bayes	weighted Logistic	weighted Bayes
Florida	0.9971685	0.9885786	0.9997810	0.9968004
Ohio	0.9960023	0.9866616	0.9968622	0.9847068
Iowa	0.8315618	0.7730072	0.6044177	0.6155665
North Carolina	0.7312868	0.7398799	0.5124623	0.6338570
Virginia	0.6873719	0.6128545	0.5994374	0.5711363
Colorado	0.6148295	0.5011042	0.9014906	0.7451717
New Hampshire	0.4536818	0.3772813	0.4851559	0.4198070
Nevada	0.3254291	0.2793178	0.2640912	0.2670547
Wisconsin	0.1599034	0.2773650	0.0535469	0.1673131
Pennsylvania	0.1590733	0.2847362	0.0543191	0.1812992
Arizona	0.1124913	0.1824953	0.5033183	0.4367127
Georgia	0.0418023	0.1079614	0.1573142	0.2413978
New Mexico	0.0411491	0.0523271	0.0102234	0.0281328
Indiana	0.0402132	0.0848593	0.0061927	0.0328337
Missouri	0.0198727	0.0605405	0.0120657	0.0521976
Michigan	0.0198235	0.0602409	0.0036850	0.0267344

Code. All the code doing these calculations are available.

References

Stacey Hunter Hecht and David Schultz(eds), Presidential Swing States: Why Only Ten Matter. Lexington Books, 2015.
David Schultz and Rafael Jacob, Presidential Swing States. Second ed., Lexington Books, 2018.

Data Sources

Thomas M. Holbrook, "Did the Whistle-Stop Campaign Matter?". PS: Political Science and Politics 35, No. 1 (2002) pp. 59–66 Eprint. (For Truman vs Dewey)
Scott L. Althaus, Peter F. Nardulli, Daron R. Shaw, "Candidate Appearances in Presidential Elections, 1972-2000". Political Communication 19, no.1 (2002) pp.49–72 Eprint.
JS Hill, E Rodriquez, AE Wooden, "Stump speeches and road trips: The impact of state campaign appearances in presidential elections". PS: Political Science & Politics 43, no.2 (2010) pp. 243–254. Eprint. (For data by state for the 2000, 2004, 2008 elections.)
Democracy in Action gives data "by state" for 2000, 2004, 2008, 2012, 2016.
MIT Election Lab for state-level Presidential election results, 1976–2016.