"Decision theory" is a framework for picking an action based on evidence and some "loss function" (intuitively, negative utility). Almost all of statistics may be framed as a decision theoretic problem, and I'd like to review that in this post.
(Note that the diagrams in this post were really inspired by I-Hsiang Wang's lectures on Information Theory.)
I am going to, literally, give a "big picture" of statistics as decision theory. Then I'll try to drill down on various "frequentist" statistical tasks, to show the algorithmic structure of each one. Although I'm certain Bayesian data analysis can be made to fit this mould, I don't have as compelling a "natural" fit for it as I do for frequentist statistics.
And just to be clear, we "the statistician" are "the decider" in this decision making problem. We are applying decision theory to the process of "doing statistics".
Review of Decision Theory
Statistical Experiment
We have some source of data, whether it's observation, experiment, whatever. As Richard McElreath's Statistical Rethinking calls it, we work in the "small world" of our model, where we describe the data as a random variable \(X\) which follows a hypothesized probability distribution \(P_{\theta}\) where \(\theta\) is a vector of parameters describing the "state of the world" (it really just parametrizes our model). The set of all possible parameters is denoted \(\Theta\) with a capital theta. This \(\Theta\) is the boundary to our "small world". This data collection process is highlighted in the following figure:
Serious statisticians need to think carefully about sampling methods and experimental design. We are silly statisticians, not serious statisticians, and use data already assembled for us. Although we will not perform any polling or statistical experiments ourselves, it is useful to know the nuances and subtleties surrounding the methodology used to produce data. Hence we may dedicate a bit of space to discussing the aspects of data gathering and experimental methodology our sources have employed.
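Before moving on, a minimal sketch may help fix the notation (in Python; every name here is illustrative, not part of any real dataset): the "state of the world" \(\theta\) is the mean of a Gaussian, and the experiment hands us i.i.d. draws \(X\sim P_{\theta}\).

```python
# A minimal sketch of the statistical experiment: theta parametrizes the
# "small world" model P_theta (here a Gaussian with unknown mean), and the
# experiment produces observations X ~ P_theta.
import numpy as np

rng = np.random.default_rng(seed=0)

theta_true = 2.5    # the unknown "state of the world"
n_samples = 100

# The experiment: draw n i.i.d. observations from P_theta = N(theta, 1)
X = rng.normal(loc=theta_true, scale=1.0, size=n_samples)
```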
Decision Making
Given some data we have just collected, we now arrive at the romantic part of statistics: decision making. Well, that's what decision theorists call it. Statisticians call it...well, it depends on the task. It is highlighted in the following diagram:
There are really multiple tasks at hand here, so let's consider the key moments in decision making.
Inference task. Or, "What do we want to do with the data?" The answer gives us the task of estimating a specific function \(T(\theta)\) of the parameters from the observed data \(X\). The choice of this function depends on the task we are trying to accomplish.
A few examples:
- With hypothesis testing, \(T(\theta)=\theta\): we're trying to estimate the parameters themselves (which label the hypotheses we're testing).
- For regressions (i.e., given pairs of data \((X,Y)\) from the experimental process, find the function \(f\) such that \(Y = f(X) + \varepsilon\)), the function of the parameters is the relationship itself, i.e., \(T(\theta)=f\).
- For classification problems, \(T(\theta)\) gives us the "correct" labeling function for the data.
In some sense, \(T(\theta)\) is the "correct answer" we're searching for; we just have to approximate it with the next step of the game...
The Decision Rule. In the language of decision theory, an estimator is an example of a Decision Rule which we denote by \(\tau\) ("tau"). This approximates \(T(\theta)\) given the data we have and the conceptual models we're using.
For regressions, this is the estimated function \(\tau(X,Y)=\widehat{f}_{X,Y}\) which fits the observations. For hypothesis testing, \(\tau(X)=\hat{\theta}\) is which hypothesis "appears to work".
These two tasks, inference and computing the decision rule, constitute the bulk of statistical work. But there's one more crucial step to be done.
We need to see how good our estimates are! In the complete diagram, this is the highlighted part of the following figure:
The loss function \(l(T(\theta),\tau(X))\) measures how bad, given the data \(X\), the decision rule \(\tau\) is. Note this is a random variable, since it's a function of the random variable \(X\). Also note there are various candidates for the loss function (it's our job as the statistician to figure out which one to use).
The risk is just the expected value of the loss function. This tells us on average how bad the decision rule \(\tau\) turns out, given the true state of the world is \(\theta\). We denote this risk by the function \(L_{\theta}(\tau)\).
For some tasks, we don't really have much of a choice of loss function. Regressions do best with the mean squared error, though we could choose a variant of it (e.g., using the \(L^{p}\) norm instead of the \(L^{2}\) norm).
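Pulling the pieces together, here is a minimal sketch (continuing the Gaussian experiment sketched earlier; all names are illustrative) of the loss/risk machinery with \(T(\theta)=\theta\), the sample mean as the decision rule \(\tau\), and squared error as the loss:

```python
# A sketch of the loss/risk machinery, continuing the Gaussian experiment
# above: T(theta) = theta, the decision rule tau is the sample mean, the
# loss is squared error, and the risk L_theta(tau) is approximated by
# Monte Carlo over repetitions of the whole experiment.
import numpy as np

rng = np.random.default_rng(seed=1)
theta_true = 2.5
n_samples = 100

def tau(x):
    """Decision rule: estimate theta by the sample mean of the data."""
    return x.mean()

def loss(t_theta, decision):
    """Squared-error loss l(T(theta), tau(X))."""
    return (t_theta - decision) ** 2

# Risk = E[loss]; approximate it by rerunning the experiment many times.
n_trials = 10_000
losses = [
    loss(theta_true, tau(rng.normal(theta_true, 1.0, n_samples)))
    for _ in range(n_trials)
]
print("estimated risk:", np.mean(losses))  # about 1/n_samples = 0.01
```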
Examples
We will collect a bunch of examples, though the list is necessarily incomplete. The goal is to show enough examples to encourage the reader to devise their own.
Hypothesis Testing
Classical hypothesis testing may be framed as a decision problem: do we take action A or action B? In our case: do we accept or reject the null hypothesis?
More precisely, we have two hypotheses regarding the observation X, indexed by \(\theta=0\) or \(\theta=1\). The null hypothesis is that \(X\sim P_{0}\), while the alternative hypothesis states \(X\sim P_{1}\).
We have some decision rule, which in our diagrams we have denoted \(\tau(X)\), which "picks" a \(\theta\) which minimizes the risk based on the observations X. But what is the loss function?
Well, we have the probability of a false alarm, when \(\tau(x)=1\) but the true \(\theta\) is zero,
\[\alpha_{\tau} = \sum_{x}\tau(x)P_{0}(x)\tag{1}\]
and the probability of a missed detection, when \(\tau(x)=0\) but the true \(\theta\) is one,
\[\beta_{\tau} = \sum_{x}(1-\tau(x))P_{1}(x).\tag{2}\]
We note each of these error probabilities is indeed an expected value of the loss (an expectation of \(\tau\) or of \(1-\tau\)), parametrized by the choice of \(\theta\).
But how do we choose \(\tau\)?
We may construct one possible "hypothesis chooser" (randomized decision rule) as, for some constant probability \(0\leq q\leq 1\) and threshold \(c\gt0\),
\[\tau_{c,q}(x) = \begin{cases}
1 & \mbox{if } P_{1}(x) \gt cP_{0}(x)\\
q & \mbox{if } P_{1}(x) = cP_{0}(x)\\
0 & \mbox{if } P_{1}(x) \lt cP_{0}(x)
\end{cases}\tag{3}\]
In other words, \(\theta=1\) is chosen with probability \(\tau_{c,q}(x)\), and \(\theta=0\) is chosen with probability \(1-\tau_{c,q}(x)\). Starting from a given value of \(\alpha_{0}\), we then determine the parameters c and q by the equation
\[\alpha_{0}=\sum_{x}\tau_{c,q}(x)P_{0}(x).\tag{4}\]
The Neyman-Pearson lemma proves this is the most powerful test at significance level \(\alpha_{0}\) (it minimizes \(\beta_{\tau_{c,q}}\) subject to the constraint \(\alpha_{\tau_{c,q}}=\alpha_{0}\)).
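To make (3)–(4) concrete, here is a minimal sketch on a toy discrete problem; the distributions \(P_0\), \(P_1\) and the level \(\alpha_0\) below are made-up numbers for illustration only.

```python
# A sketch of the randomized test (3)-(4) on a toy discrete problem.
# P0, P1, and alpha_0 below are made-up numbers for illustration.
import numpy as np

P0 = np.array([0.5, 0.3, 0.2])  # null distribution over x in {0, 1, 2}
P1 = np.array([0.2, 0.3, 0.5])  # alternative distribution
alpha_0 = 0.25                  # target false-alarm probability

ratio = P1 / P0                 # likelihood ratios P1(x)/P0(x)

# Sweep candidate thresholds c through the observed ratio values.  For each
# c, tau = 1 wherever ratio > c; the boundary mass at ratio == c is then
# randomized with probability q so that sum_x tau(x) P0(x) = alpha_0.
for c in sorted(set(ratio)):
    above = P0[ratio > c].sum()  # P0-mass decided "1" deterministically
    at = P0[ratio == c].sum()    # P0-mass sitting on the boundary
    if above <= alpha_0 <= above + at:
        q = (alpha_0 - above) / at if at > 0 else 0.0
        break

tau = np.where(ratio > c, 1.0, np.where(ratio == c, q, 0.0))
print("c =", c, "q =", q)
print("false alarm alpha =", (tau * P0).sum())           # equals alpha_0
print("miss probability beta =", ((1 - tau) * P1).sum())
```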
We emphasize, though, this is a "toy problem" which fleshes out the details of this framework.
Exercise.
Prove that the probability of type-I errors (probability of false alarm) is \(\alpha_{\tau} = \mathbb{E}_{X\sim P_{0}}[\tau(X)]\)
and the probability of type-II errors (probability of failing to detect) is \(\beta_{\tau} = \mathbb{E}_{X\sim P_{1}}[1 - \tau(X)]\).
Regression
The goal of a regression is, when we have some training data \((\mathbf{X}^{(j)}, Y^{(j)})\) where parenthetic superscripts run through the number of observations \(j=1,\dots,N\), to find some function \(f\) such that \(\mathbb{E}[Y|\mathbf{X}]\approx f(\mathbf{X},\beta) \approx Y\). Usually we start with some preconception, like \(f\) being a linear function or a logistic function, rather than permitting \(f\) to be any arbitrary function. We then proceed to estimate \(\widehat{f}\) and the coefficients \(\widehat{\beta}\).
Some terminology: the \(\mathbf{X}\) are the Covariates (or "features", "independent variables", or most intuitively "input variables") and \(Y\) are the Regressands (or "dependent variables", "response variable", "criterion", "predicted variable", "measured variable", "explained variable", "experimental variable", "responding variable", "outcome variable", "output variable" or "label"). Unfortunately there is a preponderance of nouns for the same concepts.
Definition. Consider \(X\sim P_{\theta}\) which randomly generates observed data \(x\), where \(\theta\in\Theta\) is an unknown parameter. An Estimator of \(\theta\) based on observed \(x\) is a mapping \(\phi\colon\mathcal{X}\to\Theta\), \(x\mapsto\hat{\theta}\). An Estimator of a function \(z(\theta)\) is a mapping \(\zeta\colon\mathcal{X}\to z(\Theta)\), \(x\mapsto\widehat{z}\).
The decision rule then estimates the true function, \(\tau_{\mathbf{X},Y}=\widehat{f}\). That is to say, it produces an estimator. There are various algorithms to construct the estimator, which depend on the regression analysis being done.
The loss function is usually the squared error for a single observation
\[l(T(\theta),\tau) = (Y - \widehat{f}(\mathbf{X},\widehat{\beta}))^{2}.\tag{5}\]
Depending on the problem at hand, other loss functions may be considered. (If the Y variables were probabilities or indicator variables, we could use the [expected] entropy as the loss function.)
The risk is then the average loss function over the training data. But do not mistake this for the only diagnostic for regression analysis.
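As a concrete sketch (synthetic data, plain least squares via NumPy; all names are illustrative), the decision rule and its empirical risk might look like:

```python
# A minimal regression sketch with synthetic data: the decision rule
# tau(X, Y) is plain least squares, producing the estimator f_hat, and
# the empirical risk is the average squared-error loss (5) over the
# training data.
import numpy as np

rng = np.random.default_rng(seed=2)

beta_true = np.array([1.0, 3.0])  # intercept and slope of the true f
N = 200
X = np.column_stack([np.ones(N), rng.uniform(0.0, 1.0, N)])  # design matrix
Y = X @ beta_true + rng.normal(0.0, 0.5, N)                  # Y = f(X) + noise

# Decision rule: the least-squares estimate of beta, hence of f
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
f_hat = lambda x: x @ beta_hat

empirical_risk = np.mean((Y - f_hat(X)) ** 2)  # average of the loss (5)
print("beta_hat:", beta_hat)
print("empirical risk (training MSE):", empirical_risk)
```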
We have multiple measures of how good our estimator is, which we should briefly review.
Definition. For an estimator \(\phi(x)\) of \(\theta\),
- its Bias is \(\mathrm{Bias}_{\theta}(\phi) := \mathbb{E}_{X\sim P_{\theta}}[\phi(X)] - \theta\)
- its Mean Square Error is \(\mathrm{MSE}_{\theta}(\phi) := \mathbb{E}_{X\sim P_{\theta}}[|\phi(X) - \theta|^{2}]\)
Fact (\(\mathrm{MSE} = \mathrm{Bias}^{2} + \mathrm{Variance}\)).
Let \(\phi(x)\) be an estimator of \(\theta\), then
\[\mathrm{MSE}_{\theta}(\phi) = \left(\mathrm{Bias}_{\theta}(\phi)\right)^{2} + \mathrm{Var}_{P_{\theta}}[\phi(X)]. \tag{6}\]
In practice, this expresses a tradeoff: for a fixed MSE, the more "spread out" an estimate is (larger variance), the more "centered near the correct value" it must be (smaller bias), and vice versa. (End of Fact)
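We can sanity-check (6) by Monte Carlo. Here is a sketch using a deliberately biased estimator, the variance estimator that divides by \(n\) rather than \(n-1\); the numbers chosen are illustrative.

```python
# A Monte Carlo sanity check of (6), using a deliberately biased estimator:
# the variance estimator that divides by n rather than n - 1.  All numbers
# here (theta, n, the trial count) are illustrative.
import numpy as np

rng = np.random.default_rng(seed=3)
theta = 4.0        # true variance of the sampling distribution N(0, theta)
n = 10
n_trials = 100_000

samples = rng.normal(0.0, np.sqrt(theta), size=(n_trials, n))
estimates = samples.var(axis=1)          # ddof=0: divides by n, hence biased

bias = estimates.mean() - theta          # approximately -theta/n = -0.4
variance = estimates.var()
mse = np.mean((estimates - theta) ** 2)

print("bias^2 + variance:", bias**2 + variance)
print("MSE:              ", mse)   # the two agree up to Monte Carlo error
```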
Conclusion
Most of statistical inference falls into the schema presented here. Broadly speaking, statistical inference consists of hypothesis testing (already discussed), point estimation (and interval estimation), and confidence sets. (See, e.g., section 6 of K.M. Zuev's lecture notes on statistical inference.)
We have discussed only the frequentist approach, however, and for only a couple of these tasks.
The Bayesian approach, in contrast to all these techniques, ends up using a notion of risk which averages over values of \(\theta\) (i.e., integrates over \(\Theta\) instead of over the space of experimental results \(\mathcal{X}\)). The Bayesian prior describes a probability distribution of likely values of \(\theta\), which is used in the overall process.
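Concretely, writing \(\pi\) for a prior density on \(\Theta\), the Bayes risk of a decision rule \(\tau\) is the frequentist risk averaged against the prior:
\[r_{\pi}(\tau) = \int_{\Theta} L_{\theta}(\tau)\,\pi(\theta)\,\mathrm{d}\theta.\]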
Yet the Bayesian school offers more tools than just this, and I don't think they can neatly fit inside a diagram like the one doodled above for frequentist statistics.
But we have provided an intuition to the overall procedures and tools the frequentist school affords us. Although we abstracted away the data gathering procedure (as well as the other steps in the process), we could flesh out more on each step involved.
In short, statistics consists of decision theoretic problems (perhaps "decision theory about decision theory", or meta-decision theory, may be a good intuition), but it remains more of an art than an algorithmic task.
References
- P. J. Bickel and E. L. Lehmann, "Frequentist Inference". In Neil J. Smelser and Paul B. Baltes (eds), International Encyclopedia of the Social & Behavioral Sciences, Elsevier, 2001, pages 5789–5796.
- James O. Berger, Statistical Decision Theory and Bayesian Analysis. Springer Verlag, 1993. See especially section 2.4.3. (This is the only book on statistical decision theory that I know of worth its salt.)
- George Casella and Roger L. Berger, Statistical Inference. Second ed., Cengage Learning, 2001. (Sections 8.3.5 and 9.3.4 for hypothesis testing and point estimators in the decision theoretic framework.)
- C. Robert, The Bayesian choice: from decision-theoretic foundations to computational implementation. Second ed., Springer Verlag, 2007. See especially chapter 2.