Monday, June 20, 2022

Introduction to Ecological Inference

Consider an election for some office in the United States. Voters are organized into precincts. For the 2020 presidential election, there were a total of 176,933 precincts or equivalents (according to the United States Election Assistance Commission report). The voting-age population in 2020 was approximately 256,662,010 people (as reported in the Federal Register). That gives an average of 1450.6 voting-age individuals per precinct.

Generic Puzzle: How did [candidate] perform with [demographic of voters] in [given election]?

Since voting data is private, it's impossible to figure this out with certainty. But there are a number of statistical tools we can use to answer the question. We will look at ecological inference today.

Puzzle 1: How many women voted for Biden (or Trump)? How many men voted for Biden (or Trump)?

There were approximately 132,408,924 adult women in 2020, and approximately 124,253,086 adult men. The FEC reports 158,383,403 votes were cast in the 2020 presidential election. Relative to the 256,662,010 voting-age adults, that makes voter turnout approximately 61.71%, but we do not know its composition in terms of men and women.

If we consider precinct $i$, we can record the fraction of women who voted $\beta^{f}_{i}$, the fraction of men who voted $\beta^{m}_{i}$, the proportion of the precinct's adult population who are women $X_{i}$, and the voter turnout $T_{i}$ (as a fraction) in the following handy table:

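\begin{equation*}
\begin{array}{l|cc|c}
\text{Demographic} & \text{Voted (fraction)} & \text{Did not vote (fraction)} & \text{Total} \\ \hline
\text{Women} & \beta^{f}_{i} & 1-\beta^{f}_{i} & X_{i} \\
\text{Men} & \beta^{m}_{i} & 1-\beta^{m}_{i} & 1-X_{i} \\ \hline
\text{Total} & T_{i} & 1-T_{i} &
\end{array}
\end{equation*}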

We have the following useful identity \begin{equation}\tag{1} T_{i} = X_{i}\beta^{f}_{i} + (1 - X_{i})\beta^{m}_{i}. \end{equation} To convince ourselves of this, suppose we had $N_{i}$ adults in precinct $i$, $F_{i}$ of whom are women, and that $V_{i}$ votes were cast in the precinct, $V_{i}^{f}$ of them by women. Then $\beta^{f}_{i}=V^{f}_{i}/F_{i}$ and $X_{i} = F_{i}/N_{i}$. Multiplying these gives $\beta^{f}_{i}X_{i}=V^{f}_{i}/N_{i}$, the number of votes cast by women relative to the precinct's population. Adding $\beta^{m}_{i}(1 - X_{i}) = (V_{i} - V^{f}_{i})/N_{i}$, we recover $T_{i} = V_{i}/N_{i}$, the voter turnout as a fraction of the precinct's population. So far, so good? Good!
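If you would rather see identity (1) checked with numbers, here is a minimal sketch. The precinct counts are hypothetical (I made them up for illustration), and in practice $V^{f}_{i}$ is exactly the quantity we cannot observe.

```python
from fractions import Fraction

# Hypothetical precinct counts, chosen only for illustration.
N = 1000   # adults in the precinct
F = 520    # adult women
V = 650    # ballots cast
V_f = 350  # ballots cast by women (unobservable in practice)

X = Fraction(F, N)                 # share of adults who are women
beta_f = Fraction(V_f, F)          # turnout among women
beta_m = Fraction(V - V_f, N - F)  # turnout among men
T = Fraction(V, N)                 # overall turnout

# Identity (1): T = X * beta_f + (1 - X) * beta_m
identity_holds = (T == X * beta_f + (1 - X) * beta_m)
print(identity_holds)
```

Exact fractions are used so the comparison is not muddied by floating-point rounding.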

Now the name of the game is to determine $\beta^{f}_{i}$ and $\beta^{m}_{i}$.

But this is an underdetermined system: we have 2 unknowns per precinct, and 1 equation per precinct. This is a fundamental weakness of ecological inference known as the indeterminacy problem. It's a serious problem, because we can "cook the books" to infer whatever we want. But let us try to soldier on, and see what we can determine.

Method of Bounds

For simplicity, I will be treating the entire United States as a single precinct. If you find this dodgy, well, the more sophisticated techniques of ecological inference stipulate worse, so buckle up.

We can place some bounds on $\beta^{f}$ (I'm dropping the index tracking precincts, since there's only one). If every man who could vote did vote, then we would have \begin{align} \frac{T - (1 - X)}{X} &= \frac{(V/N) - (N-F)/N}{F/N} \tag{2a} \\ &= \frac{V - (N-F)}{F}\leq\beta^{f}. \tag{2b} \end{align} The number of votes not cast by men, $V - (N-F)$, relative to the population of adult women $F$ is the lower bound for female voter turnout. This works provided $V-(N-F)\gt0$ (there are more votes than men). To handle the other case, we need \begin{equation}\tag{3} \max\left(\frac{T - (1 - X)}{X}, 0\right)\leq\beta^{f}. \end{equation} Plugging in $V = 158{,}383{,}403$, $N - F = 124{,}253{,}086$, and $F = 132{,}408{,}924$, we find the lower bound to be empirically \begin{equation}\tag{4} 0.25776\leq\beta^{f}, \end{equation} that is, at least 25.7% of voting-age women cast a ballot in 2020.

Likewise, we can derive an upper bound by supposing only women cast ballots (which makes sense when $F\gt V$, that is, when there are more women than votes): \begin{equation}\tag{5} \min(T/X, 1) \geq\beta^{f}. \end{equation} Empirically, there were more votes than women, so with the population and vote counts quoted above \begin{equation}\tag{6} 0.25776\leq\beta^{f}\leq 1. \end{equation} This isn't terribly enlightening: somewhere between 25.7% and 100% of voting-age women cast a ballot in 2020. (The bounds on male voter turnout are similarly loose, between 20.9% and 100%.)
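The bounds (3) and (5) are easy to wrap in a small function. This is only a sketch: the name `turnout_bounds` and the precinct counts are my own hypothetical choices, not anything from a library or from real data.

```python
def turnout_bounds(T, X):
    """Deterministic bounds on turnout for the group whose population share is X.

    T is overall turnout and X the group's share of adults, both as fractions
    with 0 < X < 1. Returns (lower, upper) following equations (3) and (5).
    """
    lower = max((T - (1.0 - X)) / X, 0.0)
    upper = min(T / X, 1.0)
    return lower, upper


# Hypothetical precinct: 1000 adults, 520 of them women, 650 ballots cast.
N, F, V = 1000, 520, 650
T, X = V / N, F / N

women = turnout_bounds(T, X)        # bounds on beta^f
men = turnout_bounds(T, 1.0 - X)    # same formula with the roles swapped
print(women, men)
```

The bounds for men come from the same formula with $X$ replaced by $1 - X$, since nothing in the derivation cares which group we call "women".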

If we want something more, we need to supply additional constraints by hand. For example, we might assume the difference in voter turnout between the sexes is at most 10 percentage points, which would constrain the space of possibilities further. But we are now bringing in our own prior beliefs: the data doesn't tell us the difference in turnout is bounded by 10 points; I literally just made it up because it sounds plausible.

Had we precinct-level data, we could take a weighted average of these bounds to infer district-level turnout by sex. Perhaps we will work through state-level considerations using county-level data in a future post.
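Here is a rough sketch of what that weighted average could look like. I am assuming the weights are the number of adult women in each precinct, which is what falls out of writing district-level turnout among women as $\sum_{i} F_{i}\beta^{f}_{i} / \sum_{i} F_{i}$. The precinct counts are hypothetical.

```python
def turnout_bounds(T, X):
    """Bounds (3) and (5) on turnout for the group with population share X."""
    lower = max((T - (1.0 - X)) / X, 0.0)
    upper = min(T / X, 1.0)
    return lower, upper


# Hypothetical precincts: (adults N_i, adult women F_i, ballots cast V_i).
precincts = [
    (1000, 520, 650),
    (800, 400, 300),
    (1200, 660, 900),
]

total_women = sum(F for _, F, _ in precincts)
lower = upper = 0.0
for N, F, V in precincts:
    lo, hi = turnout_bounds(V / N, F / N)
    # Weight each precinct's bounds by its share of the district's women.
    lower += F * lo / total_women
    upper += F * hi / total_women
print(lower, upper)
```

With these made-up counts, the aggregated interval comes out narrower than the one obtained by pooling the three precincts into a single big precinct, which is the whole appeal of working at a finer level.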

The other thing I should point out is that the interval \begin{equation}\tag{7} \beta^{f}\in\left[\max\left(\frac{T - (1 - X)}{X}, 0\right),\; \min(T/X, 1)\right] \end{equation} is a 100% confidence interval. This fact isn't really taken advantage of sufficiently, in my opinion, but taking advantage of it requires imposing our beliefs on the statistics, which allows us to conclude anything we want.

Goodman's Regression Approach

Another approach is to start with our first identity \begin{equation}\tag{1} T_{i} = X_{i}\beta^{f}_{i} + (1 - X_{i})\beta^{m}_{i}. \end{equation} We then regress $T_{i}$ against $X_{i}$. This will produce district-level estimates $B^{f}$ (and $B^{m}$) of the turnout by sex.

This is a linear regression, so a couple of caveats are worth noting:

  1. The estimates may produce unreasonable results, violating the deterministic bounds. In fact, you should expect the estimates to be biased (in the technical, statistical sense).
  2. There is an explicit constancy assumption: $\beta^{f}_{i}$ and $\beta^{m}_{i}$ are taken to be the same in every precinct (or at least uncorrelated with $X_{i}$), which means the composition of a precinct does not affect voter turnout or voting behaviour. This may look reasonable superficially, but urban precincts tend to behave differently from rural precincts, and this could produce unreasonable results.
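To make the regression and the first caveat concrete, here is a minimal sketch using NumPy's least-squares solver. The function name `goodman_regression` and both data sets are hypothetical. I fit $T_{i} = B^{f}X_{i} + B^{m}(1 - X_{i})$ with no intercept, which is one common way to set this up.

```python
import numpy as np


def goodman_regression(X, T):
    """Least-squares fit of T_i = B_f * X_i + B_m * (1 - X_i), no intercept."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    design = np.column_stack([X, 1.0 - X])
    (B_f, B_m), *_ = np.linalg.lstsq(design, T, rcond=None)
    return float(B_f), float(B_m)


# Hypothetical shares of women in five precincts.
X = np.array([0.45, 0.48, 0.50, 0.52, 0.55])

# Case 1: turnout really is constant within each sex (0.7 women, 0.6 men).
T_constant = 0.7 * X + 0.6 * (1.0 - X)
constant_fit = goodman_regression(X, T_constant)
print(constant_fit)

# Case 2: turnout climbs steeply with the share of women in the precinct.
T_steep = -0.1 + 1.4 * X
steep_fit = goodman_regression(X, T_steep)
print(steep_fit)
```

In the first case the fit recovers the constant turnouts. In the second, every precinct has a perfectly sensible turnout between 53% and 67%, yet the fitted values land outside the unit interval (more than 100% for women, below 0% for men), which is exactly the kind of violation of the deterministic bounds mentioned in the first caveat.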

King's Approach

We can take some combination of the deterministic approach and Goodman's regression, restricting values to $0\leq B^{f}\leq 1$ and likewise for men. That is to say, the coefficients (when plotted against each other) live in the unit square. (Or, for demographics with $n$ categories, like age, the unit $n$-cube.)

We transform the identity at the heart of Goodman's approach into the form: \begin{equation}\tag{8} \beta^{m}_{i} = \left(\frac{T_{i}}{1-X_{i}}\right) - \left(\frac{X_{i}}{1-X_{i}}\right)\beta^{f}_{i}. \end{equation} We can then plot a line in the unit square, since we know the $X_{i}$ and $T_{i}$, and this produces a distribution of possible values for the $\beta^{m}_{i}$ and $\beta^{f}_{i}$ parameters. In order for us to extract information, we need to make three statistical assumptions.
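As a sketch of what "plotting a line in the unit square" amounts to, the snippet below computes the two endpoints of line (8) after clipping it to the unit square. The function name `tomography_segment` and the precinct values are hypothetical, and the clipping reuses the deterministic bounds (3) and (5).

```python
def tomography_segment(T, X):
    """Endpoints of line (8) clipped to the unit square, as (beta_f, beta_m) pairs.

    T is the precinct's turnout and X its share of women, with 0 < X < 1.
    """
    def beta_m(beta_f):
        # Equation (8): beta_m as a function of beta_f for fixed T and X.
        return T / (1.0 - X) - X / (1.0 - X) * beta_f

    f_lo = max((T - (1.0 - X)) / X, 0.0)  # smallest admissible beta_f
    f_hi = min(T / X, 1.0)                # largest admissible beta_f
    return (f_lo, beta_m(f_lo)), (f_hi, beta_m(f_hi))


# Hypothetical precincts as (turnout T_i, share of women X_i).
precincts = [(0.65, 0.52), (0.375, 0.50), (0.75, 0.55)]
segments = [tomography_segment(T, X) for T, X in precincts]
for (T, X), (start, end) in zip(precincts, segments):
    print(f"T={T}, X={X}: from {start} to {end}")
```

Each precinct contributes one segment; the true pair $(\beta^{f}_{i}, \beta^{m}_{i})$ lies somewhere on it, and the data alone cannot say where.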

Probably the most severe assumption in this approach is the demand of spatial homogeneity: the conditional random variables $T_{i}|X_{i}$ are independent across observations. This sounds fine, but it stipulates rural voters behave indistinguishably from urban voters (and other demographics not considered behave indistinguishably from each other).

We tend to believe this is not the case (according to polling, election results, etc.). So we need to be careful about the domain of validity for King's regression.

The next assumption we need to make is that there are no cluster points. In other words, we will be using a bivariate truncated normal distribution (or if there are $n$ demographic categories, an $n$-variate truncated normal distribution).
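To get a feel for what a bivariate truncated normal on the unit square looks like, here is a small sketch that samples one by rejection: draw from the ordinary bivariate normal and keep only the points that land inside the square. The mean and covariance are hypothetical values I picked for illustration, and this only illustrates the distributional assumption, not King's estimation procedure.

```python
import numpy as np


def sample_truncated_normal(mean, cov, size, rng):
    """Draw `size` points from a bivariate normal restricted to the unit square.

    Uses plain rejection sampling, so it is slow when little of the
    untruncated normal's mass lies inside [0, 1] x [0, 1].
    """
    mean = np.asarray(mean, dtype=float)
    kept = []
    n_kept = 0
    while n_kept < size:
        draws = rng.multivariate_normal(mean, cov, size=size)
        inside = np.all((draws >= 0.0) & (draws <= 1.0), axis=1)
        kept.append(draws[inside])
        n_kept += int(inside.sum())
    return np.concatenate(kept)[:size]


rng = np.random.default_rng(0)
mean = [0.6, 0.6]                      # hypothetical centre for (beta_f, beta_m)
cov = [[0.04, 0.01], [0.01, 0.04]]     # hypothetical spread and correlation
samples = sample_truncated_normal(mean, cov, 1000, rng)
print(samples.shape, samples.min(), samples.max())
```

Every sampled pair is a candidate $(\beta^{f}_{i}, \beta^{m}_{i})$ that respects the unit-square restriction by construction.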

The criticism of this assumption is that there's no reason to believe a truncated multivariate normal distribution is more appropriate than any other probability distribution on the unit cube. Arguably, it's not; Jaynes would argue we should use an entropy-maximizing distribution, for example. There's merit to Jaynes's argument: we are imposing a subjective prior belief onto our statistical analysis by choosing some probability distribution. We need to appeal to some statistical principle, or prove the results are independent of prior, or...

The last assumption we need to make is that there is no a priori aggregation bias; i.e., $X_{i}$ is [mean] independent of $\beta^{f}_{i}$ and $\beta^{m}_{i}$.

I don't have much to say about this, because King developed another model which weakens this condition (and lets us measure violations of it).

When the deterministic bounds on the demographic category of interest are relatively tight, all results tend to coincide. But when the bounds are wide (like our example, where the bounds on voter turnout by sex span most of the unit interval), the conclusions drawn are largely model-dependent. This subtlety can lead to contradictory results.

Concluding Remarks

These three approaches constitute the "trunk" of ecological inference, from which we can form "branches" useful for whatever problem we're interested in. King really deserves credit for rejuvenating the tool, and a lot of modern work generalizes his approach.

But we should also realize there are many incredibly subtle (and easy to miss) opportunities for us to derive results which we were looking for ab initio. For this reason alone, it should be tested against other methods like Bayesian multilevel regression (with or without post-stratification), or even simpler methods when possible. It's easy to shoot yourself in the foot with ecological inference, and it's eager to do it, too.

Again, it is worth stressing: inferences drawn from ecological inference are either obvious or model-dependent. It's easy to impose your pre-existing beliefs onto the model without realizing it. You really shouldn't be using this for election analysis.