Looking back at my prediction, we can see why I was off: the assumption (waiting time between announcements follows an exponential distribution) doesn't even hold. Look at the histogram against the expected distribution:
How can we be sure this is different? ...besides looking...
We can use the Kolmogorov-Smirnov test to see if the histogram differs from the expected exponential distribution. The only problem is there's too little data! There are only 20 candidates in my data set, and I need closer to 50 for this test to work. So I just duplicate the data, "smearing" it by adding small amounts like 0.000001 or so, and I do this twice to get a total dataset of 60 "intervals".
The null hypothesis is that the data follows the exponential distribution; the alternative hypothesis is that the data follows some other distribution. The resulting p-value for the test is 0.03019, which is below the usual 0.05 threshold, so we reject the null hypothesis.
This is based on the huge assumption that we can duplicate the data without any problem, which I have severe doubts about. (Addendum: This reasoning, I realized whilst in Maryland a few days after publication, is invalid, though there exists a technique to fabricate data; since, heuristically, around 30 data points are needed to make an inference, it seems the reasoning below is valid.)
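A minimal sketch of why the duplication trick is suspect, using a synthetic 20-point sample rather than the real announcement data. It assumes λ = 20/136 and uses the standard asymptotic approximation to the Kolmogorov p-value, not whatever software produced the numbers above; the helper names are made up for illustration.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and the model `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Approximate p-value: Kolmogorov series with a small-sample correction."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


rate = 20 / 136                      # assumed rate, in announcements per day
rng = random.Random(2019)            # arbitrary seed, only for repeatability


def expon_cdf(x):
    return 1.0 - math.exp(-rate * x)


# Synthetic stand-in for the 20 observed intervals (NOT the real data).
original = [rng.expovariate(rate) for _ in range(20)]

# "Smear" two extra copies of every point by a tiny positive amount.
tripled = original + [x + rng.uniform(0.0, 1e-6)
                      for x in original for _ in range(2)]

d_orig = ks_statistic(original, expon_cdf)
d_trip = ks_statistic(tripled, expon_cdf)
p_orig = ks_pvalue(d_orig, len(original))
p_trip = ks_pvalue(d_trip, len(tripled))

print(f"n={len(original)}: D={d_orig:.4f}, p={p_orig:.4f}")
print(f"n={len(tripled)}: D={d_trip:.4f}, p={p_trip:.4f}")
```

The tripled sample has essentially the same empirical CDF, so D barely moves. But the test treats it as 60 independent observations, so the p-value shrinks even though no new information was added.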
In fact, as a sanity test, let's try simulating candidates announcing they are entering the primary, with λ = 20/136. One quick simulation gives us 16 candidates with intervals between announcements:
This doesn't look too far from the real data. If we try the Kolmogorov-Smirnov test without fabricating data, we get a p-value of 0.3387, which tells us we fail to reject the hypothesis that this data follows an exponential distribution.
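The simulation described above can be sketched roughly as follows. This is an illustration under assumptions, not the original scratch work: it reads λ = 20/136 as 20 announcements over a 136-day window, and the function name is hypothetical.

```python
import random


def simulate_announcements(rate, horizon, rng):
    """Waiting times (days) between events of a Poisson process on [0, horizon]."""
    intervals = []
    elapsed = 0.0
    while True:
        wait = rng.expovariate(rate)     # exponential waiting time
        if elapsed + wait > horizon:     # next announcement falls outside the window
            break
        elapsed += wait
        intervals.append(wait)
    return intervals


rate = 20 / 136          # assumed: announcements per day
horizon = 136            # assumed: length of the observation window in days
rng = random.Random(136) # arbitrary seed, only for repeatability

runs = [simulate_announcements(rate, horizon, rng) for _ in range(1000)]
counts = [len(r) for r in runs]
mean_count = sum(counts) / len(counts)

print("candidates in first run:", counts[0])
print("mean candidates over 1000 runs:", round(mean_count, 2))
```

The number of candidates per run is Poisson distributed with mean 20 and standard deviation √20 ≈ 4.5, so a run with 16 candidates is within one standard deviation of the mean.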
So "what's the right answer"? There's sadly not enough data for us to reject the hypothesis (that the intervals between candidate announcements seem to be exponentially distributed), at least using the frequentist hypothesis testing framework.
Also, I discounted a few candidates which FiveThirtyEight has considered "major" like Andrew Yang or John Delaney (though both candidates announced back in 2017, which makes them outliers).
I'll have to look for a statistical test which works on around 20 observations, checking if it fits against an exponential distribution. Maybe there are some Bayesian techniques buried away in Gelman somewhere...
To be clear, however, there is no reason to believe candidate announcements are, a priori, exponentially distributed, since timing is contingent on when the (potential) candidate thinks other actors are going to announce. Joe Biden said something to the effect that he's waiting to announce as late as possible because it's part of a strategy he has. But that depends on his calculation of when "the last possible announcement [relative to his adversaries]" would be, which is explicitly dependent on what other people are doing. An exponentially distributed random variable is memoryless: candidates "wouldn't remember" the last time someone announced.
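For reference, memorylessness is a one-line computation. Writing \(X\) for the waiting time with rate \(\lambda\) (so that \(P(X > t) = e^{-\lambda t}\)), and taking any \(s, t \geq 0\):

\[
P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(X > t).
\]

In words: having already waited \(s\) days since the last announcement tells us nothing about how much longer we will wait, which is exactly what a strategically timed announcement violates.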
Ostensibly, a waggish critic might argue, this assumption does hold because it's so damn hard to keep track of the last guy or gal who announced their intention to run for President!
But the only way to know a posteriori whether the actual candidates, by accident or by design, appear to announce with intervals which seem to follow an exponential distribution...is to do some statistical analysis.
Data and scratch work is available on GitHub.
Addendum. One moral to take away from this is to not give a single number as a prediction, but the "HDI" (Highest Density Interval, the region defined as centered on the expected value, the lower bound containing 47.5% of the area, and the upper bound containing 47.5% of the area, so the entire region describes events which are 95% probable). This would've given a large spread, lying between roughly 2 hours and 27.8 days, which is far less interesting as a prediction. For the region containing 85% probability, the interval lies between 29 hours and 14.3 days (14 days, 7 hours).
Whether we use Tukey fences or the HDI, the upper bound for any exponential distribution's prediction would be around \((3\pm\varepsilon)/\lambda\) where \(|\varepsilon|\lesssim 0.03\). Since we're typically waiting for an event to occur, we're only really interested in the upper bound (at least, for candidates declaring their intent to run for President).
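A quick check of the \(3/\lambda\) rule of thumb, working in units of \(1/\lambda\). This sketch assumes the HDI is read as the shortest 95% interval; since the exponential density is strictly decreasing, that interval starts at 0 and ends at the 95th percentile. The variable names are illustrative.

```python
import math


def quantile(p, rate=1.0):
    """Inverse CDF of the exponential distribution: solves 1 - exp(-rate*x) = p."""
    return -math.log(1.0 - p) / rate


# With rate = 1, every result is a multiple of 1/lambda.
q1 = quantile(0.25)                     # ln(4/3)
q3 = quantile(0.75)                     # ln 4
tukey_upper = q3 + 1.5 * (q3 - q1)      # upper Tukey fence = ln 4 + 1.5 ln 3
hdi95_upper = quantile(0.95)            # upper end of the 95% HDI = ln 20

print(f"Tukey upper fence: {tukey_upper:.4f} / lambda")
print(f"95% HDI upper end: {hdi95_upper:.4f} / lambda")
```

The two bounds come out at roughly 3.034 and 2.996 times \(1/\lambda\), which is where the \(|\varepsilon| \lesssim 0.03\) comes from.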
And after further consideration, the fabrication of data as I have done it above is incorrect, but it is not an invalid technique provided it is done correctly. Maybe that's a topic for a future post...