Sunday, July 12, 2020

Prediction is Easy, Brier scores are for lying reprobates

Since 2016, there's been increased interest in predicting elections. Who predicted it best? Am I getting better? Etc.

One quirk of predicting a presidential election is that only a handful of races, the swing states, are actually in doubt. Consequently, about 40 states are easy to predict. If we used a naive accuracy metric, any reasonable forecast would score at least 80% (40 solid states predicted correctly out of 50).

Slightly cleverer fellows might propose the Brier score \[ BS = \frac{1}{N}\sum_{k=1}^{N} (p_{k} - o_{k})^{2} \] where \(p_{k}\) is the probability we assigned to race \(k\) and \(o_{k}\in\{0,1\}\) records whether the predicted outcome happened. The lower the score, the closer our predictions are to observations, hence the better the forecast. We also observe \(0\leq BS\leq1\). Let's work through some examples:

  1. If we predicted 2012's outcomes would be the same as 2008's, we'd have a Brier score of 0.05882353.
  2. If we predicted the 2016 presidential election would be the same as 2012, we'd have a Brier score of 0.1372549.
  3. If we predict the 41 solid states remain solid and give 50% to each of the 10 swing states, then every hedged race contributes \((0.5-o_{k})^{2}=0.25\) regardless of outcome, for a Brier score of \(2.5/51\approx0.049\), quite a bit better than one would expect! (The sketch after this list verifies these figures.)
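To make these concrete, here's a minimal sketch of the computation in Python, treating the election as 51 winner-take-all races (50 states plus DC; the first example's 0.05882353 is exactly 3/51, i.e., three misses):

```python
def brier(pairs):
    """pairs: list of (p_k, o_k) -- the predicted probability and the
    observed outcome (1 or 0) for each race."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

# Example 1: a deterministic forecast that misses 3 of 51 races.
repeat_2008 = [(1.0, 0.0)] * 3 + [(1.0, 1.0)] * 48
print(brier(repeat_2008))  # 0.05882352... = 3/51

# Example 3: 41 confident solid-state calls plus 10 hedged swing states.
# Each 50% hedge contributes (0.5 - o)^2 = 0.25, win or lose.
hedged = [(0.5, 1.0)] * 10 + [(1.0, 1.0)] * 41
print(brier(hedged))       # 0.04901960... = 2.5/51
```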
We can game the Brier score, especially for something like presidential election forecasting. I've computed the Brier scores for a variety of forecasts:

Forecaster              Brier score   Electoral-delegate weighted Brier
FiveThirtyEight         0.0665612     0.0931435
The Economist           0.0603016     0.0865332
Inside Elections        0.0654902     0.1018216
Sabato’s Crystal Ball   0.0736765     0.1168076
Cook Political Report   0.0737255     0.0887128

These forecasters all land somewhere in \(0.06\lt BS\leq 0.08\), which is...good? It's hard to build intuition for scores that differ by ~0.003, and why are we giving "so much real estate" \(0.25\lt BS\lt 1\) to forecasts worse than guessing?

Entropy

Another, more sensible, solution is to use the negative logarithm of the probability assigned to the observed outcome. This is given the unfortunate name of "entropy" (more popularly known as "log loss"): \[ H(pred) = \sum^{N}_{k=1}-\log_{2}(p_{k}) \] where N predictions are made and \(p_{k}\) is the probability assigned to the observed outcome of prediction \(k\). We use the base-2 logarithm so we can interpret the entropy in terms of "bits". The worse the prediction, the higher the entropy.
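Here's a minimal sketch of this in Python, with made-up forecasts: a fully confident, fully correct forecast costs zero bits, while every 50/50 hedge costs exactly one bit, win or lose.

```python
import math

def entropy(probs):
    """probs: the probability the forecast assigned to the outcome that
    actually happened, one entry per race. Returns the total in bits."""
    return sum(-math.log2(p) for p in probs)

print(entropy([1.0] * 51))               # perfect confidence: 0.0 bits
print(entropy([0.5] * 10 + [1.0] * 41))  # ten 50/50 hedges: 10.0 bits
```

Note that assigning probability zero to something that then happens costs infinitely many bits (math.log2 raises a ValueError at 0), so confidently predicting the wrong outcome is punished without bound.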

This puts all races on equal footing. To weight races by their importance, we can multiply each term by the number of electoral delegates at stake. This produces the electoral-delegate weighted entropy \[ H_{ev}(pred) = \sum^{N}_{k=1}-d(k)\log_{2}(p_{k}) \] where \(d(k)\) is the number of electoral delegates at stake in race \(k\), and we predicted the winner of race \(k\) with probability \(p_{k}\).
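Extending the sketch above, the weighting is a one-line change. The two races below are toy values (the electoral-vote counts for Florida and Texas are from the 2012–2020 apportionment):

```python
import math

# (electoral delegates, probability assigned to the observed winner)
races = [(29, 0.5),    # a hedged call on, say, Florida (29 EVs)
         (38, 0.99)]   # a confident, correct call on Texas (38 EVs)

h_ev = sum(-d * math.log2(p) for d, p in races)
print(h_ev)  # 29*1.0 + 38*0.0145... ~= 29.55 bits
```

The hedged Florida call costs its full 29 electoral votes' worth of bits; the confident correct call costs almost nothing.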

But we don't want to penalize only poorly calibrated state-level predictions; we also want to penalize predicting the wrong overall outcome (i.e., predicting the wrong person will win the presidency, e.g., predicting Clinton would win 2016). An apples-to-apples scoring function is \[ H_{out} = -538\log_{2}(p) \] where the forecaster predicted the eventual winner with probability \(p\). We then rate a forecaster by the sum \(H_{out}+H_{ev}\); as always, the smaller the score, the better. The same forecasts from the table of Brier scores have the following entropy scores:

Forecaster              Entropy    EV Entropy   Outcome Entropy   Total Entropy
FiveThirtyEight         15.48132   210.9518     982.5133          1193.465
The Economist           13.38435   193.3309     1452.0608         1645.392
Inside Elections        51.00000   538.0000     711.9190          1249.919
Sabato’s Crystal Ball   26.41833   353.1973     708.3173          1061.515
Cook Political Report   51.00000   538.0000     708.3173          1246.317
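For comparison, here's a sketch scoring the swing-state-hedging strategy from the Brier examples under these measures; the 120 electoral votes assigned to the ten hedged races, and the 50% given to the overall winner, are made-up round numbers for illustration.

```python
import math

# The hedging strategy: 41 solid races at p = 1.0 (all correct),
# 10 swing races at p = 0.5, and the overall winner hedged at p = 0.5.
SWING_EV = 120  # assumed electoral votes across the ten hedged races

h     = 41 * -math.log2(1.0) + 10 * -math.log2(0.5)  # 10 bits
h_ev  = SWING_EV * -math.log2(0.5)                   # 120 bits
h_out = 538 * -math.log2(0.5)                        # 538 bits

print(h, h_ev, h_out, h_ev + h_out)  # 10.0 120.0 538.0 658.0
```

Unlike a 0.049 Brier score, 538 bits spent hedging the overall winner is impossible to overlook.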

The main takeaway: if you want to look smart, use the Brier score, confidently predict the solid states, and give the swing states 50% probability. It's a great way to mislead people into thinking you are a fantastic forecaster.
