Saturday, July 18, 2020

The Economist's 2020 Forecast

Every presidential election, forecasts crop up. Some are qualitative, others quantitative; some scientific, others humorous (e.g., "tossup bot"). Danger lies in humorous forecasts being mistaken for serious ones, or in scientific forecasts erring in their approach (by accident or by bad design).

I was struck reading this article on forecasting US presidential races, specifically the passage:

[Modelers] will be rolling out their predictions for the first time this year, and they are intent on avoiding mistakes from past election cycles. Morris, the Economist’s forecaster, is one of those entering the field. He has called previous, error-prone predictions “lying to people” and “editorial malpractice.” “We should learn from that,” he says.
(Emphasis added) I agree. But without model checking, structured code, model checking, some degree of design by contract, model checking, unit testing, model checking, reproducibility, etc., how are we to avoid such "editorial malpractice"?

The Economist has been proudly advertising its forecast for the 2020 presidential election. On July 1st, they predicted Biden would win with 90% certainty. Whenever a model, any model, has at least 90% certainty in an outcome, I have a tendency to become suspicious. Unlike most psephological prognostications, The Economist has put the source code online (relevant commit 980c703). With all this, we can try to dispel any misgivings concerning the possibility of this model giving us error-prone predictions...right?

This post will focus mostly on the mathematics and mechanics underpinning The Economist's model. We will explain how it works through successive refinement, from very broad strokes to more mundane details.

How their model works

I couldn't easily get The Economist's code working (a glaring red flag), so I studied the code and the papers it was based upon. The intellectual history behind the model is rather convoluted, spread across more than a few papers. Instead of tracing this history, I'll give you an overview of the steps, then discuss each step in detail. The model forecasts how support for a candidate changes over time assuming we know the outcome ahead of time.

The high-level description could be summed up in three steps:

Step 1: Tell me the outcome of the election.

Step 2: Fit the trajectory of candidate support over time to match the polling data such that the fitted trajectory results in this desired outcome from step 1.

Step 3: Call this a forecast.

Yes, I know, the input to the model in step 1 is normally what we would expect the forecast to tell us. But hidden away in the paper trail of citations, this model is based on Linzer's work. Linzer cites a 1993 paper by Gelman and King arguing it's easy to predict the outcome of an election. We've discussed this before: it is easy to game forecasting elections to appear like a superforecaster.

Some may object this is a caricature of the actual model itself. But as Linzer (2013) himself notes (section 3.1):

As shown by Gelman and King (1993), although historical model-based forecasts can help predict where voters’ preferences end up on Election Day, it is not known in advance what path they will take to get there.
This critical assumption goes unexamined and lacks justification. Unsurprisingly, quite a bit has happened in the past 27 years...there's little reason to accept such appeals to authority. If you too doubt this assumption is sound, then any predictions made from a model thus premised are equally suspect.

In pictures

Step 1, you give me the outcome. We plot the given outcome ("prior") against time:

We transform the y-axis to use a log-odds scale rather than probability, for purely technical reasons. We place a normal distribution around the auxiliary model's "prior probability", whose width reflects our confidence in the model's estimates. We then randomly pick an outcome according to this probability distribution (the red dot), which will lie within one sigma of the prior (the red shaded region) 68% of the time:
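To make step 1 concrete, here is a minimal R sketch of drawing that red dot; the 48% prior share and the sigma of 0.05 are made-up numbers for illustration, not The Economist's actual parameters:

# Hypothetical prior from the auxiliary ("fundamentals") model.
prior_share <- 0.48                 # prior two-party vote share for the candidate
sigma       <- 0.05                 # standard deviation on the log-odds scale

# Work on the log-odds scale: qlogis() is logit(), plogis() is its inverse.
prior_logit <- qlogis(prior_share)

# The "red dot": a randomly drawn election-day outcome around the prior.
set.seed(42)
outcome_logit <- rnorm(1, mean = prior_logit, sd = sigma)
plogis(outcome_logit)               # the drawn outcome, back in probability space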

Then we perform a random walk backwards in time until we get to the most recent poll. (Why backwards? Strauss argued in his paper "Florida or Ohio? Forecasting Presidential State Outcomes Using Reverse Random Walks" that this circumvents projection bias, which tends to underestimate the potential magnitude of future shocks. The technique has been more or less accepted since, conveyed mostly through folklore.) Since this is a random walk, we shade the zone where the walk would "likely be":


(The random walk is exaggerated, as most of the figures are, for the sake of conveying the intuition. The walk adjusts the trajectory by less than ~0.003 per day, barely perceptible graphically.)
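Continuing the sketch above, the reverse walk itself is a few lines of R; the step size of 0.003 per day is taken from the remark above, while the horizon of 100 days is arbitrary:

# Walk backwards in time from the drawn election-day outcome.
days_to_walk <- 100                 # days between the most recent poll and election day
step_sd      <- 0.003               # daily step size on the log-odds scale

set.seed(42)
walk <- outcome_logit + c(0, cumsum(rnorm(days_to_walk, mean = 0, sd = step_sd)))

# walk[1] is election day, walk[length(walk)] is the most recent poll's date;
# reverse it so time runs forward, then map back to vote shares.
trajectory <- plogis(rev(walk))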

Once we have polling data, we just take this region and bend it around the polling data transformed to log-odds quantities (the "x" marks):

The fitting is handled through Markov Chain Monte Carlo methods, though I suspect some maximum likelihood method could produce similar results.
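The Economist's code does this fitting in Stan over an entire trajectory; the following toy collapses everything to a single parameter (the election-day log-odds) just to show the mechanics of a random-walk Metropolis sampler pulling a prior toward a handful of made-up polls:

# Made-up polls (candidate's share) and an assumed noise level on the logit scale.
polls      <- c(0.52, 0.50, 0.53, 0.51)
poll_logit <- qlogis(polls)
poll_sd    <- 0.08

prior_mean <- qlogis(0.48)          # auxiliary model's prior (made up)
prior_sd   <- 0.05

log_posterior <- function(theta) {
  dnorm(theta, prior_mean, prior_sd, log = TRUE) +
    sum(dnorm(poll_logit, theta, poll_sd, log = TRUE))
}

# Random-walk Metropolis: propose a nearby value, accept with the usual ratio.
set.seed(42)
n_iter <- 5000
draws  <- numeric(n_iter)
theta  <- prior_mean
for (i in seq_len(n_iter)) {
  proposal <- theta + rnorm(1, 0, 0.02)
  if (log(runif(1)) < log_posterior(proposal) - log_posterior(theta)) theta <- proposal
  draws[i] <- theta
}

plogis(mean(draws[-(1:1000)]))      # posterior mean share after discarding burn-in

Even in this toy, the narrow prior (standard deviation 0.05) pulls the posterior mean down to roughly 50%, even though the polls average about 51.5%.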

Now we need to transform our red region back to probability space:

Note: the forecast does not adjust the final prediction (red endpoint) by more than 2% when polling data is added. There are some technical caveats to this, but it holds for the choice of parameters The Economist employs for 2016 and presumably 2020 (2012 narrowly avoids being overly restrictive). For this reason I describe step 1 as "tell me the outcome" and step 2 as working out how to get there. Adding polling data does not impact the forecast on election day. (I discovered this accidentally while trying to make the 2016 forecast reproducible. The skeptical reader can verify it by changing the RUN_DATE and supplying rstan with a fixed seed and a fixed set of initial values. Fixing the initial values and the seed supplied to rstan produced identical forecasts on election day: adding more polling data does not substantively adjust the final predictions in the states, but changing the initial guesses does. I suspected this when I compared the forecasts made at the convention to those made on election day; to my surprise, they differed by less than 2%. The only substantial ways to alter the final prediction are precisely what was fixed: (1) randomly selecting a different final prediction (red endpoint), and (2) randomly sampling the initial values (which accounts for only a fraction of the difference compared to randomly changing the red endpoint).)
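For readers who want to try this experiment, the relevant knobs in rstan are the seed and init arguments to sampling(); the objects below (compiled_model, stan_data, init_values) are placeholders for whatever The Economist's scripts construct after you change RUN_DATE, not their actual variable names:

library(rstan)

# Placeholders: the compiled Stan model, the data list (polls up to RUN_DATE),
# and a fixed list of initial values, however the scripts happen to build them.
fit <- sampling(compiled_model,
                data   = stan_data,
                chains = 4,
                iter   = 2000,
                seed   = 1843,         # fix the sampler's RNG seed...
                init   = init_values)  # ...and the chains' initial values

# Rerunning with a later RUN_DATE (i.e., more polls) but the same seed and
# initial values is the comparison described above.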

And we pretend this is a forecast. We do this many times, then average over the paths taken, which produces a diagram like this. Of course, if the normal distribution has a small standard deviation (is "tightly peaked" around the prior prediction), then the outcome will hug the prior prediction even more tightly...and polling data loses its pulling power. The Economist's normal distribution for 2016 is alarmingly narrow by the standards of Linzer's work. This means the red-dot forecast will nearly always sit around the horizontal line of the auxiliary model's forecast, i.e., it amounts to small fluctuations around an auxiliary model. With the data supplied, it appears very likely The Economist is using an overly narrow normal distribution for its 2020 forecasts (similar to its 2016 forecasts), which means the forecast is window-dressing for the prior prediction.
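To see why the width of that normal distribution matters so much, here is the usual precision-weighted average for combining a normal prior with a single noisy poll-based signal; all of the numbers are illustrative, with the signal's standard deviation standing in for the polls plus the intervening random walk:

# Posterior mean when a normal prior is combined with a normally distributed signal.
shrunk <- function(prior_mean, prior_sd, signal, signal_sd) {
  w_prior  <- 1 / prior_sd^2
  w_signal <- 1 / signal_sd^2
  (w_prior * prior_mean + w_signal * signal) / (w_prior + w_signal)
}

prior     <- qlogis(0.48)   # auxiliary model says 48%
signal    <- qlogis(0.53)   # the polls, propagated to election day, suggest 53%
signal_sd <- 0.2            # made-up uncertainty for that propagated signal

plogis(shrunk(prior, 1 / sqrt(20), signal, signal_sd))  # ~0.51: the polls matter
plogis(shrunk(prior, 0.10,         signal, signal_sd))  # ~0.49: hugs the prior

The second call uses a deliberately narrow prior to exaggerate the effect; the point is only that shrinking \(\sigma\) hands the election-day estimate back to the auxiliary model.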

If this still seems like a caricature, you can see for yourself how The Economist retrodicted the 2016 election.

Step 1: Tell me the Outcome

How do we tell the model what the outcome is? The chain of papers appeals to "the fundamentals" (the idea that, from economic indicators and similar "macro quantities", we can forecast the outcome). Let this sink in: we abdicate forecasting to an inferior model, around which we then fit the data. The literature and The Economist use the Abramowitz time-for-change model, a simple linear regression; in R formula notation:

incumbent_vote_share ~ june_approval_rating + q2_gdp
This is then adjusted by how well the Democratic (or Republican) candidate performed in each state to provide rough estimates for each state's election outcome. The Economist claims to use some "secret recipe" on their How do we do it? page, and it does seem like they are using some slight variant (despite having Abramowitz's model in their codebase).
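For concreteness, a hedged sketch of fitting such a time-for-change regression in R; the data frame here is filled with purely synthetic placeholder numbers rather than the real historical record, so only the mechanics are meaningful:

# Synthetic stand-in for the historical record (one row per election year).
set.seed(1)
time_for_change <- data.frame(
  incumbent_vote_share = rnorm(18, mean = 50, sd = 4),
  june_approval_rating = rnorm(18, mean = 0, sd = 15),
  q2_gdp               = rnorm(18, mean = 2, sd = 3)
)

fit <- lm(incumbent_vote_share ~ june_approval_rating + q2_gdp,
          data = time_for_change)

# National prior for 2020, plugging in made-up fundamentals (echoing the
# hypothetical scenario discussed later in this post).
predict(fit, newdata = data.frame(june_approval_rating = -15, q2_gdp = -25))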

Linzer notes this is an estimate, and we should accommodate a suitably large deviation around these estimates. We do this by taking normally distributed random variables: \[\beta_{iJ}\sim\mathcal{N}(\operatorname{logit}(h_{i}),\sigma_{i}) \tag{1}\] where the "\(h_{i}\)" are the state-level estimates we just constructed, "\(\sigma_{i}\)" is a suitably large standard deviation (Linzer warns against anything smaller than \(1/\sqrt{20}\approx 0.2236\)), and the "\(\beta_{iJ}\)" are just reparametrized estimates on the log-odds scale, intuitively representing the vote share for, say, the Democratic candidate. (The Economist uses \(\sigma_{i}^{2}\approx 0.03920\), or equivalently \(1/\sigma_{i}^{2}\approx25.51\), which overfits the outcomes given in this step. Linzer introduces the parameter "\(\tau_{i}=1/\sigma_{i}^{2}\)" and warns explicitly that "Values of \(\tau_{i}\gt20\) [i.e., \(1/\sigma_{i}^{2}\gt 20\) or \(\sigma_{i}^{2}\lt 0.05\)] generate credible intervals that are misleadingly narrow." This is not adequately discussed or analysed by The Economist.)
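Equation (1) is easy to play with in R; the 54% state-level estimate is made up, but the two standard deviations are the ones discussed above:

h_i <- 0.54                           # state-level estimate from the adjusted fundamentals

state_prior_interval <- function(sigma_i, n = 1e5) {
  beta_iJ <- rnorm(n, mean = qlogis(h_i), sd = sigma_i)   # equation (1)
  quantile(plogis(beta_iJ), c(0.025, 0.975))              # 95% interval on vote share
}

state_prior_interval(1 / sqrt(20))    # Linzer's floor: tau_i = 20
state_prior_interval(sqrt(0.0392))    # roughly The Economist's choice: tau_i ~ 25.5

Pushing \(\sigma_i\) lower narrows these intervals further, which is exactly Linzer's warning about misleadingly narrow credible intervals.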

We should be quite explicit in stressing that the auxiliary model producing the prior forecast uses "the fundamentals", a family of crude models which work under fairly stable circumstances. The unasked question remains: is this valid? Even assessing whether the assumptions behind the fundamentals hold at present is not discussed or entertained by The Economist. With a once-in-a-century pandemic and the greatest economic catastrophe since the Great Depression underway, I don't think circumstances qualify as "stable". (Given the circumstances, the Abramowitz model underpinning The Economist's forecast fails to produce sensible values. For example, it predicts Trump will win −20% of the vote in DC and merely 34% of the vote in Mississippi, the lowest share a Republican presidential candidate has received there since 1968, when Gov. Wallace won Mississippi on a third-party ticket. This should be a red flag, if only because no candidate can receive a negative share of the votes.)

The logic here may seem circular: we forecast the future because we don't know what will happen. If we need to know the outcome to perform the forecast, then we wouldn't need to perform a forecast (we'd know what will happen anyway). On the other hand, if we don't know the outcome, then we can't forecast with this approach.

Bayesian analysis avoids this by trying many different outcomes to see if the forecasting method is "stable": if we change the parameters (or swap out the prior), then the forecast varies accordingly. Or, at least, it should. As it stands now, it's impossible to do this analysis with The Economist's code without rewriting it entirely (because the code is a ball of mud). Needless to say, doing any form of model checking is intractable with The Economist's model given its code's current state.

Step 2: "Fit" the Model...to whatever we want

The difference between members of this family of models lies in how we handle \(\beta_{i,j}\). For example, Kremp expands it out at time \(t_{j}\) and state \(s_{i}\) (poor choice of notation) as \[ \beta[s_{i}, t_{j}] = \mu_{a}[t_{j}] + \mu_{b}[s_{i},t_{j}] + \begin{bmatrix}\mbox{pollster}\\\mbox{effect}\end{bmatrix}(poll_{k}) + \dots\tag{2} \] decomposing the term as a sum of nation-level polling effects \(\mu_{a}\), state-level effects \(\mu_{b}\), pollster effects (Kremp's \(\mu_{c}\)), as well as the "house effect" for pollsters, error terms, and so on. The Economist refines this further, and improves upon how polls impact this forecast. But then every family member uses Markov Chain Monte Carlo to fit polling data to match the results given in step 1. It's all kabuki theater around this outcome. The ingenuity of The Economist's model sadly amounts to an obfuscatory facade.
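Schematically, equation (2) looks like this in R; the component names mirror Kremp's notation, while the array sizes and the random numbers filling them are placeholders for what the Stan fit would actually estimate:

S <- 51; T_days <- 200; P <- 10       # states, days, pollsters (arbitrary sizes)

mu_a <- rnorm(T_days, 0, 0.02)                        # national-level trend over time
mu_b <- matrix(rnorm(S * T_days, 0, 0.02), nrow = S)  # state-level trends
mu_c <- rnorm(P, 0, 0.01)                             # pollster ("house") effects

# Latent support implied for a poll by pollster p in state s on day t
# (error terms and the remaining effects hidden in the "..." are omitted).
expected_logit_share <- function(s, t, p) mu_a[t] + mu_b[s, t] + mu_c[p]

plogis(expected_logit_share(s = 1, t = 100, p = 3))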

Here's the problem in graphic detail. With historic data, this is the plot of predictions The Economist would purportedly have produced (the black horizontal lines are the prior predictions, the very light gray horizontal line is the 50% mark):

Observe the predictions land within a normal distribution, with a standard deviation of roughly 2.5%, centered on the prior forecast (the black horizontal line). The entire complicated algorithm adjusts the initial prediction slightly based on polling data. (In fact, none of the forecasts differ in outcome from the initial prediction, including Ohio.) Now consider the hypothetical situation where Obama had a net −15% approval rating in June and the US experienced −25% Q2 GDP growth. The prior forecast projects a dim prospect for hypothetical Obama's re-election bid:

This isn't to say that the prediction is immune to polling data, but observe the prediction on election day is drawn from the top 5% of the normal distribution around the initial prediction [solid black line]. From a bad prior, we get a bad prediction (e.g., falsely predicting Obama would lose Wisconsin, Pennsylvania, Ohio, New Hampshire, Iowa, and Florida). This is mitigated if we have a sufficiently large \(\sigma^{2}\) (greater than \(0.05\)), as this hypothetical demonstrates.

Also worth reiterating: in 2012, the normal distribution around the prior is wider than in 2016; when the same normal distribution is used, the error increases considerably. We can perform the calculations again to exaggerate the effect:

Take particular note of how closely the prediction hugs the guess. Of course, we don't know ex ante how closely the guessed initial prediction matches the outcome, so it's foolish to make claims like, "This model predicts Biden will win in November with 91% probability, therefore it's nothing like 2016." From a bad crow, a bad egg.

How did this perform in 2016?

Another glaring red flag should be this model's performance in 2016. Pierre Kremp implemented this model for 2016, and forecasted a Clinton victory with 90% probability. The Economist's modifications didn't fare much better, a modest improvement at the cost of drastic overfitting.

Is overfitting really that bad? The germane XKCD illustrates its dangers.

Let's walk through the play-by-play: forecasts at multiple snapshots across time. Initially, when there's no data (say, on April 1, 2016), the forecast is very nearly the auxiliary model's forecast:


Note: each state has its own differently sized normal distribution (we can be far more confident about, say, California's results than we could about a perennially close state like Ohio or Florida).

Now, by the time of the first convention July 17, 2016, we have more data. How does the forecast do? Was it correct so far? Well, we plot out the same states' predictions with the same parameters:

Barring random noise from slightly different starting points on election day, there's no change (just a very tiny fluctuating random walk) between the last poll and election day. There's no forecast, just an initial guess around the auxiliary model's forecast.

We can then compare the initial forecast to this intermediate forecast to the final forecast:

Compare Wisconsin in this snapshot to the previous two snapshots, and you realize how badly off the forecast was until the day of the election. Even then, The Economist mispredicts what a simple Kalman filter would recognize even a week before the election: Clinton loses Wisconsin.
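For reference, the "simple Kalman filter" can be as plain as a local-level model over a state's polled margins; here is a self-contained sketch in which the observation and process variances are guesses and the margins are purely illustrative numbers, not actual 2016 Wisconsin polling:

# Local-level Kalman filter: the latent margin follows a random walk and each
# poll is a noisy observation of it.
kalman_local_level <- function(y, obs_var, proc_var, m0 = 0, v0 = 100) {
  m <- m0; v <- v0
  filtered <- numeric(length(y))
  for (t in seq_along(y)) {
    v <- v + proc_var               # predict: the random-walk step adds variance
    k <- v / (v + obs_var)          # Kalman gain
    m <- m + k * (y[t] - m)         # update toward the newest poll
    v <- (1 - k) * v
    filtered[t] <- m
  }
  filtered
}

margins <- c(6, 5, 7, 4, 3, 2, 1, -1, 0, -2)   # illustrative leads, in points
kalman_local_level(margins, obs_var = 9, proc_var = 1)

A filter like this simply follows the most recent polls, with no prior dragging the estimate back toward a fundamentals-based guess.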

The reader can observe, compared to the hypothetical Obama 2012 scenario where Q2 GDP growth was −25% and net approval rating −15%, the Clinton snapshots remain remarkably tight to the prior prediction. Why is this? Because the \(\sigma^{2}\lt0.05\) for the Clinton model, but \(\sigma^{2}\gt0.05\) for the Obama model. This is the effect of overfitting: the forecast sticks too closely to the prior until just before the election.

Although difficult to observe close to election day (given how cramped the plots are), the forecast doesn't change on election day more than a percentage point or two: it changes the day prior, sometimes drastically. We could add more polling data, but the only way for the forecast to substantially change is for the random number generator to change the endpoint, or for the Markov Chain Monte Carlo library to use a different seed parameter (different parameters for the random number generator).

Conclusion

When The Economist makes boastful claims like

Mr Comey himself confessed to being so sure of the outcome of the contest that he took unprecedented steps against one candidate (which may have ended up costing her the election). But the statistical model The Economist built to predict presidential elections would not have been so shocked. Run retroactively on the last cycle, it would have given Mr Trump a 27% chance of winning the contest on election day. In July of 2016 it would have given him a 30% shot.
And was no one else skeptical of a Clinton victory in July 2016? Recall FiveThirtyEight gave Clinton a more modest 49.9% chance of winning in July, compared to Trump's 50.1%; and, on election day, odds comparable to what The Economist claims. FiveThirtyEight had a "worse" Brier score than The Economist (the only metric they're willing to advertise), but The Economist had the worse forecast in July. Worse still, and inexcusably so, The Economist begins by assuming the outcome, then fits the data around that end. What of that "editorial malpractice" which, as Mr Morris of The Economist urged, "we should learn from"?

We should really treat predictions from The Economist's model with the same gravity as a horoscope. The Economist should take a far more measured and humble perspective on its "forecast". It is more than a little ironic that the magazine which makes such confident claims on the basis of a questionable model once wrote, "Humility is the most important virtue among forecasters." Although this election may appear to be an easy forecast (Vice President Biden's lead over President Trump is routinely in double digits in the polls), the undecideds also routinely poll in double digits (sound familiar from 2016?), and we may soon learn that The Economist exists to make astrologers look professional. (Remember, when looking at the lead one candidate has over another in a poll ("Biden has a +15% lead"), that poll's reported margin of error should roughly be doubled; see the worked example below. This implies the undecideds plus twice the reported margin of error routinely equal or exceed the margin by which Biden leads Trump. This underappreciated point is an eerie parallel to 2016.)
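To spell out that doubling (a back-of-the-envelope argument, assuming nearly everyone picks one of the two candidates, so the two shares are almost perfectly anti-correlated rather than independent): if \(p\) is one candidate's share, the lead is \[ \ell = p - (1 - p) = 2p - 1, \qquad\mbox{so}\qquad \operatorname{MOE}(\ell) \approx 2\operatorname{MOE}(p). \] A poll reporting Biden at 52% with a margin of error of ±3 points therefore pins down his lead only to within roughly ±6 points.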
