Monday, June 29, 2020

Polling as Sampling from Urns

Last time we discussed a simple model of polling where each demographic were a different color ball in an urn, with supporters of a candidate (and those not supporting that candidate) as solids and stripes. We saw with data from the Hispanic demographics that overdispersion was present and diluted support for Biden.

But we assumed the polls which formed the basis of our reasoning were a perfect representative sample. I think we will begin to tease this apart in this post. I don't think I'll discuss sampling methods yet.

Instead, in this post, I'll start with several urns with different quantities of striped and solid balls. Even using ideal sampling methods, the techniques produce a wide range of estimates of the proportion of striped balls to the rest of the urn.

Toy Problem

Consider 4 urns, A, C, F, and T. We have balls with three colors (red, green, blue) which are either striped or solid. We want to know how many solid balls are in each urn from a given sample. But we perform this backwards: we begin with knowing the number of striped and solid balls in each urn, and consider different sampling methods.

Urn `A`	Number Solid	Number Striped	Total	% of Balls
Green	962,036	357,294	1,319,330	96.12%
Blue	34,147	7,532	41,679	3.04%
Red	3,348	8,242	11,590	0.84%
Total	999,531 (72.8%)	373,068	1,372,599

Urn `C`	Number Solid	Number Striped	Total	% of Balls
Green	6,628,592	1,910,704	8,539,296	97.48%
Blue	130,711	20,448	151,159	1.72%
Red	19,519	49,726	69,245	0.79%
Total	6,778,822 (77.4%)	1,980,878	8,759,700

Urn `F`	Number Solid	Number Striped	Total	% of Balls
Green	364,392	126,652	491,044	21.01%
Blue	598,611	193,014	791,625	33.88%
Red	276,758	777,166	1,053,924	45.11%
Total	1,239,761 (53.0%)	1,096,832	2,336,593

Urn `T`	Number Solid	Number Striped	Total	% of Balls
Green	4,723,236	1,716,632	6,439,868	96.91%
Blue	110,395	32,782	143,177	2.15%
Red	18,830	43,645	62,475	0.94%
Total	4,852,461 (73.0%)	1,793,059	6,645,520

There are a total of 13,870,575 solid balls (72.566%) and 5,243,837 (27.434%) striped balls.

Exercise 1. If we had a hypothetical urn with 5,243,837 striped balls and 13,870,575 solid balls, and if we sampled without replacement n balls, what's the expected number of striped balls drawn? What's the interval containing 75% of the probability distribution about the mean for different sample sizes \(n=75,100,125\)?

This requires being more precise about what we're really interested in finding. The expected number of striped balls in a sample of n balls would be \(0.27434n\). This is unambiguous. (The median value would be, for \(n=100\), 27 striped balls.)

Constructing the interval can be done thus: the upper bound of k for \(\Pr(X\leq k|M,N,n=100)\approx 7/8\) empirically is about \(k\approx32\), and for the lower bound \(\Pr(X\leq k|M,N,n=100)\approx 1/8\) is about \(k\approx 22\). This would give us an interval of 27 ± 5.

We could also consider constructing an interval similar to the "highest posterior density interval". Intuitively, we plot the probability density function, then take a horizontal line tangent to the mode (peak of the PDF). We begin lowering the horizontal line until the area under the probability density between the intersection points equals 75%. This produces a slightly different value since the distribution is not symmetric, though for the sample sizes we are considering the differences would not be appreciable.

Exercise 2. If we drew 4 samples (each sampled without replacement) with each sample consisting of n balls, what's the expected number of striped balls drawn? What's the interval containing 75% of the probability distribution about the mean for \(n=75,100,125\)?

Constructing the intervals of expected striped balls drawn for \(n=100\) samples: urns A and T have interval 27 ± 5 striped balls, urn C has interval \(18\leq k\leq27\), and urn F has an interval \(41\leq k\leq53\).

Observe that urn C has an interval with fewer striped balls than the hypothetical pooled urn, whereas urn F has its interval centered at nearly double the hypothetical pooled urn's, and urns A and T coincide with the hypothetical pooled urn.

Exercise 3. If we sampled proportional to the number of balls in each urn (e.g., \(2336593/(5243837 + 13870575) \approx 12.22\%\) of the sample are drawn from urn F) and we sampled without replacement, what's the expected number of striped balls drawn? What's the interval containing 75% of the probability distribution about the mean for \(n=75,100,125\)? What's the expected number of balls of each color drawn?

This is several independent sampling problems, with \(n_{A} = nN_{A}/N\approx0.07n\), \(n_{C} = nN_{C}/N\approx 0.45n\), \(n_{F} = nN_{F}/N\approx 0.12n\), and \(n_{T} = nN_{T}/N\approx 0.35n\). The expected number of striped balls in the sample would be the sum of the expected number from a sample from respective urns with respective sizes. But some simple algebra shows this is just the expected value of the hypothetical pooled urn from exercise 1.

Since the wording suggests we move from urn to urn, taking a specific sample from each one independently, the intervals are computed independently, and the sum of the lower-bounds gives us the lower-bound for the resulting sample (and similarly for the upper-bound).

Exercise 4. What if each urn gets sampled equally? So a quarter of the sample is drawn from urn A, a quarter from urn C, etc. What is the expected value of striped balls appearing in the sample? What is the 75% interval for \(n=100\)?

This gives us an expected 30.92905 striped balls in the sample. The interval with 75% probability consists of \(20\leq k\leq41\) striped balls; the 50% intervals gives us 30 ± 6 striped balls.

Exercise 5. What if each urn gets sampled in this manner: we first fix a desired sample size \(n\geq100\). We want to randomly sample without replacement so we get at least \((R/N)n\) red balls (where there are \(R\) red balls and \(N\) total balls in the urn initially), at least \((G/N)n\) green balls (where \(G\) is the initial number of green balls in the urn) and at least \((B/N)n\) blue balls (with \(B\) the initial number of blue balls in the urn). We will have our sample consist of \(g\) green balls, \(r\) red balls, and \(b\) blue balls where \(r+g+b\geq n\). (A) What is the expected sample size for each urn for \(n=150\)? (B) What is the expected number of striped balls for each color? (C) [Open ended] Can we apply some set of weightings to better reflect the urn's composition?

We first compute how many balls we want, at minimum, drawn for each urn:

Urn A's sample needs at least 144 green balls, 5 blue balls, and 1 red ball
Urn C's sample needs at least 146 green balls, 3 blue balls, and 1 red ball
Urn F's sample needs at least 32 green balls, 51 blue balls, and 68 red ball
Urn T's sample needs at least 145 green balls, 3 blue balls, and 1 red ball

The exact numbers may differ due to rounding concerns, or desire for sampling particular subpopulations.

If we consider the stopping condition to be drawing s balls of a certain color (which the urn has K balls of that particular color), then we can compute the expected number of draws \(k+s\) needed using the negative hypergeometric distribution. The expected number of draws would be \[ n \approx s + s\frac{N-K}{K+1} = s\frac{N+1}{N-K+1}. \] This formula differs from a naive reading of the wikipedia page, because their "K" is our "N-K".

However, if we want our sample to contain the desired minimum with \(\alpha\) probability, we need at least \(\Pr(X\geq s\mid N, K, n)\geq \alpha\). Just brute forcing this, we find A needs \(n=354\), C needs \(n=378\), urn F requires \(n=194\), and finally T needs \(n=318\).

For, e.g., urn A, this has the unfortunate side effect of producing somewhere between double to triple as many green and blue balls as needed. So how do we handle this? We want to avoid discarding information (as a rule of thumb in life, but especially in statistics), so we may want to take the ratio of striped green balls to solid green balls, then multiply by the desired sample size for green balls (144). There are other possibilities, but this is the quickest for us.

Numerically, we find the samples produced for each urn has with the 90% confidence interval estimating solid balls of specific color in parentheses:

Urn A's sample has 340 green balls (234–261), 11 blue balls (7–11), and 3 red ball (0–2)
Urn C's sample needs at least 369 green balls (273–299), 7 blue balls (4–7), and 3 red ball (0–2)
Urn F's sample needs at least 41 green balls (26–35), 66 blue balls (44–55), and 87 red ball (16–30)
Urn T's sample needs at least 309 green balls (214–239), 7 blue balls (3–7), and 3 red ball (0–2)

If we were to normalize the overcounted balls, then we end up with the estimates:

Urn A's weighted sample has between 99 to 111 striped green balls, 2 or 3 striped blue balls, and at most 1 striped red ball, for a weighted total of somewhere between 102 to 115 striped balls (68%-76.67% striped)
Urn C's weighted sample has between 108 to 118 striped green balls, 2 (well, between 1.75 to 2) striped blue balls, and at most 1 striped red ball, for weighted total of somewhere between 111 to 121 striped balls (74%-80%)
Urn F's weighted sample has between 20 to 27 striped green balls, 34 to 42.5 striped blue balls, and 12.5 to 23 striped red ball, for a weighted total somewhere between 66 to 93 striped balls (44%-62%)
Urn T's weighted sample reports somewhere between 100 to 112 striped green balls, 1 to 3 striped blue balls, and at most 1 striped red ball, for a weighted total of 101 to 116 striped balls (67%-77%)

But if we were given this data, working backwards, we would end up with radically different estimates for the population size. Urn F has a "margin of error" of about ±9%, which would be huge. The reported margin of error (at 90% confidence) would be at most 6.75%, though. The reported margin of error underestimates the actual range of variability the polls could report.

On the other hand, for urn T, we see the reported margin-of-error would be 6.315% (at 90% confidence) whereas its estimates have a ± 5% margin. In this case, the margin of error reported over-estimates the interval width.

Observation 1. The number of striped balls drawn using these different sampling methods produce different estimates for the total number of striped balls in each of the urns.

Observation 2. For urns with high proportion of striped balls, the margin of error decreases. For urns with low estimates of striped balls with decent sample sizes should be believed.

Homework 1. Given the range of estimates for each of these sampling methods, produce plots estimating the number of striped balls in each urn.

A concluding remark: the reader may object at the sample size of 150 being too small. This is a valid criticism, but when you examine the polling crosstabs, it's not uncommon to find 150 Hispanics polled. The guiding question I have writing this series of posts is whether we can extract anything meaningful from the polling results.

Saturday, June 27, 2020

This one strange trick helps Hispanic support of Biden

Harry Enten noted a few weeks ago how Biden is doing worse than Clinton among Hispanic voters. I wanted to use this as a pretense to investigate some of the probability theoretic aspects of polling.

We'll specifically be building a series of toy models, progressively exploring aspects of polling a composite heterogeneous demographic like Hispanic Americans. These are idealizations which can be developed further into more accurate models, but even the approximations inform us about aspects seldom considered.

The basic game plan (the tl;dr version): we consider sampling without replacement as a hypergeometric distribution, then estimate the population of "successes" out of all trials using the maximum likelihood point-estimate and Bayesian conjugate prior. In section 2, we move on to consider multiple hypergeometric distributions pooled together (analogous to striped and solid balls of different colors) and apply the same methods to this more general setting.

Assumption 1. The polls are ideal, representative samples of the populations.

We will have a follow up post discussing sampling methods, and how it affects poll results.

Base Case

The floor model of polling amounts to treating respondents as balls in an urn. There are three types of balls (red, white, and blue, for America—err, I mean, for Republican, undecided, and Democrat leaning voters) and we assume the balls do not change color. For our interests, Biden's polling, there are B blue balls and N all balls (blue and non-blue alike).

We sample without replacement n balls, meaning: we take a ball out of the urn, inspect its color, make note of it, then set the ball aside (we do not return it to the urn or replace it), and repeat this process n times. The total number of blue balls drawn from the urn b will be reported.

Exercise 1. What is the probability of drawing \(b\) blue balls given \(N, B, n\)?

There are a number of ways to derive the solution. In frequentist terms, there are \[\binom{N}{n} = \frac{(N)!}{n!(N-n)!}\tag{1a}\] ways to draw n balls from the urn containing N balls without putting the sampled balls back. (We read the left-hand side of Eq (1a) as "N choose n".) This is the denominator of the probability.

The numerator is the product of the number of ways to draw b balls from a possible B cases, and similarly the number of ways to draw \(n-b\) non-blue balls from the population of \(N-B\) non-blue balls. This gives us the answer \[\Pr(b\mid N, B, n) = \frac{\binom{B}{b}\binom{N-B}{n-b}}{\binom{N}{n}}.\tag{1b}\] Readers familiar with probability recognize this as the probability mass function for the Hypergeometric distribution.

Exercise 2. What is the expected value and variance for the hypergeometric distribution?

This is a problem the reader should work out on their own. One trick is to introduce \(p=B/N\) as the proportion of the population which is blue. For \(N\to\infty\) with \(p\) fixed, the variance of a hypergeometrically distributed random variable \(X \sim \operatorname{Hypergeometric}(N, B, n)\) should approach the binomial variance \(\operatorname{Var}[X]\to np(1-p)\).

Exercise 3. How can we estimate \(B\) given \(N\), \(n\), and an observed \(b\)?

One way to solve this is to consider the value of \(B\) which maximizes the probability \(Pr(b\mid N, B, n)\). This would give us approximately \(B\approx (N/n)b\).

Bayesian Estimates

A more Bayesian approach would use the conjugate prior for the Hypergeometric distribution, i.e., the Beta-Binomial distribution to describe a random variable \(B\sim\operatorname{BetaBin}(N, \alpha, \beta)\) for some initial prior of the relative frequency of blue balls \(\alpha\) and non-blue balls \(\beta\) (an uninformed guess would be \(\alpha=\beta=1\)).

After observing a sample of \(n\) draws produce \(b\) blue balls, the estimate is updated to \(B-b\sim\operatorname{BetaBin}(N-n, \alpha+b, \beta+n-b)\). As more samples are done, we tally up the number of blue balls seen in all of them, and treat it as if it were a single sample. (That's the property of being a conjugate prior.)

Exercise 4. Given \(N = 40\times 10^{6}\), \(n = 273\), and \(b = 161\), estimate \(B\).

These numbers are not pulled out of thin air (though it might seem that way). Roughly 2/3 of any demographic groups is voting age, and there are approximately 60m Hispanic Americans,1According to the ACS 1-year estimate, B03001. which means about 40m are voting age. In a recent poll, the New York Times found 87+74 = 161 Hispanics supporting Biden out of 273 Hispanics polled.

In a separate-though-related poll, the New York Times reported among Hispanics 64% support Biden.

Using the first set of polls from battleground states, we can produce the following estimate for the number of Biden supporting Hispanic Americans (with the quantiles at 5%, 95%, and the expected value indicated with vertical lines):

Estimates using the New York Times Battleground polls and N = 40m. The quantiles are at 21,619,245 and 25,530,281 for the 5% and 95% respectively, with the expected value at 23,589,744.

Note this corresponds to 58.97% ± 4.9% support. The wide margins stem from a lack of data. We can make an informative statement from this little data and crude model: Harry Enten noted Clinton led by 61% to 23% among Hispanics pre-election, but that lies within the credible interval we just constructed. In other words, Biden is doing alright among the Hispanics, unless the few polls we have used were skewed or biased.

Exercise 5. Perform the same analysis with Trump's numbers among Hispanics. Trump has \(t=75\) respondents supporting him and \(n-t=198\) not supporting him. [Spoiler: 27.47% ± 4.4%]

Heterogeneity

This model fails for the simple reason that people are not balls in an urn. No demographic is a homogeneous blob, so how can we start to introduce heterogeneity?

Some data to help us is a poll Telemundo conducted back in March 2020 specifically concerning the Latino vote in Florida and Arizona. We can now examine how Cuban-Americans poll compared to other Hispanic Americans.

Let's isolate the interesting aspects from a probabilistic perspective. We will consider k different colored balls, identical physically except for their color, which have counts \(N_{1}, \dots, N_{k}\) (there are \(N_{j}\) balls of color \(j\)). Suppose further that for each color \(j\) there are \(K_{j}\in\{0,1,\dots,N_{j}\}\) balls which are "striped" and \(N_{j}-K_{j}\) balls which are "solid" (think: billiards). The striped balls are analogous to our candidate's supporters, the solid balls do not support our candidate.

We sometimes have information about the number of striped balls drawn from a sample, though more often polls report the number of striped balls drawn without reference to color.

The meta-question guiding us here (i.e., the real interesting thing which we're trying to use to guide constructing exercises and worked examples) is when we have a multivariate hypergeometric distribution ("many colored balls") which can be collapsed into a hypergeometric distribution ("solids and stripes"), under what circumstances does the multivariate distribution distort the univariate distribution.

For concrete numbers, consider the following table:

Color	Number Striped	Number Solid	Total Number Balls (N)
\(C_{1}\)	64	175	1,575,667
\(C_{2}\)	112	26	3,860,969
\(C_{3}\)	374	125	24,657,774
\(C_{4}\)	161	112	???
Total	711	438	39,842,421

If we tried using an exact Binomial test of the samples drawn against \(p\approx 711/1149\), only \(C_{4}\) fails to be significantly different with \(\alpha = 0.05\) (the p-values for the first three colors are on the order of \(10^{-10}\) or so, for those curious). The keen reader will realize this is because it's the numbers from exercise 4, i.e., from a pooled sample.

The index of dispersion for this data is approximately the ratio of the sample variance of the striped counts to the sample mean, i.e., approximately 105.1228 ≫ 1, which indicates it is quite overdispersed. None of this should be surprising, since it comes from samples from a quite heterogeneous population.

Exercise 6. How can we estimate the number of striped balls \(K_{j}\) for each color \(j\)?

Isn't this a repeat of exercise 3, but with slightly different data? Arguably, yes. (That's why it's exercise 3, because it's now a tool in our toolkit!) The maximum likelihood estimator suggests that 61.88% of all balls are striped. Let's see if the Bayesian approach will be as informative with the samples drawn.

If we tried to estimate the proportion of the colors which are striped with 90% credible intervals, Bayesian methods tell us for \(C_{1}\) we'd expect 26.77% ± 4.5%, for \(C_{2}\) we'd expect 81.16% ± 5.4% striped, and for \(C_{3}\) we'd expect between 69.% and 78.1% (centered around 74.95%) striped. We can plot the densities for the proportions \(C_{1}\) in red, \(C_{2}\) in blue, \(C_{3}\) in green:

Estimates using the figures given above, divided through by the total number of balls to estimate the proportion of balls striped.

The maximum likelihood estimates tell us \(C_{1}\) has about 26.78% striped, \(C_{2}\) has about 81.1% striped, and \(C_{3}\) has 74.95% striped. These differ from Bayesian estimates around the 8th digit after the decimal place.

Exercise 7. How do these Bayesian credibility intervals compare to the confidence intervals around the most likely number of striped balls?

Exercise 8. Using the values of \(b=711\), \(n-b=438\), and \(N=39842421\), compute the credibility interval for the conjugate prior \(B\sim\operatorname{BetaBin}(N,b,n-b)\). How does it compare to our results from the first part of this blog entry? How does it compare to the estimates for individual subpopulations? [Answer: the credibility interval for \(B/N\) with 90% of the probability is about 61.88% ± 2.35%]

An observation that we can make: balls colored \(C_{1}\) seem to be significantly different than the other balls, and it skews the overall estimation if we bundled them in with the others.

Concluding Remarks

So far, we have assumed a perfect sampling method, and have attempted to extract information from the samples given. We've used one basic trick. Working through the exercises grills it into the reader's mind.

Next time, we will discuss sampling methods (including nonresponse error) and how it impacts the sample reported. Although it impacts polling results, viewed differently, this serves as a model for voter turnout, as well.

If there were more time...

This post has gone on long enough. Had I more time, I would have discussed posterior predictive checking the Bayesian models we've set up. (Because that's as important as washing your hands after using the bathroom.)

I would have also liked to examine Clinton's polling numbers among Hispanics using these methods. This would give us some benchmarks to work against.

Further, as noted, not everyone polled will vote. It would be another fertile grounds for discussion to investigate "likely voter models".

There are a number of straightforward questions we can ask about using the hypergeometric distribution for polling results, and how it modifies the calculations we made related to the polling results. An example:

Homework. Given a hypergeometrically distributed random variable \(X\sim\operatorname{Hypergeometric}(N, B, n)\) with \(p=B/N\) show the variance \(\operatorname{Var}[X] = np(1-p)(\text{something})\). The puzzle is to see how this extra factor (the parenthetic "something") impacts the naive margin-of-error calculation for polls.

Monday, June 22, 2020

Jaynes' Probability Theory

I've been reading through E.T. Jaynes' Probability Theory: The Logic of Science, published posthumously in 2003. Although difficult to describe, I suppose it is a "polemic" in the most positive sense of the word.

My own interest in the book was piqued by Jaynes' "robot". I asked myself, Why don't we have something like Siri for scientists? A robot lab assistant would be useful, I imagine. Why not? Jaynes motivates the book by trying to teach a robot how to reason with empirical evidence, and we will be "programming" Jaynes' robot's brain.

Chapter 1

Jaynes begins by discussing plausible reasoning. We have definite inference rules in classical logic, the most famous is modus ponens:

A implies B
A is true
B is true.

"Dual" to this is modus tollens (proof by contrapositive)

A implies B
B is false
A is false.

Following Polya, Jaynes introduces several "weaker syllogisms", like

A implies B
B is true
A is more plausible.

There are a few other "weak syllogisms" where the conclusion modifies the plausibility of a proposition being true. (I omit them, if you are interested, then please read the first chapter!)

Jaynes introduces his notation for propositions, writing things like \(A\mid B\) for the plausibility that proposition \(A\) is true given some background information \(B\). Conjunction of propositions \(A\) and \(B\) is written \(AB\), disjunction is written \(A+B\). We denote the negation of \(A\) as \(\bar{A}\) (writing a bar over \(A\)). We read \(A\mid BC\) as \(A\mid(BC)\), and \(A+B\mid C\) as \((A+B)\mid C\).

There's one assumption we make about the plausibilities, which Jaynes makes clear (but readers may not fully digest or appreciate at first glance):

Axiom. If we write \(A\mid BC\), then we assume propositions B and C are not contradictory. Should someone write down \(A\mid BC\) for \(BC\) being contradictory, we make no attempt to make sense of it, and discard it as meaningless. (End of Axiom)

Further, plausibilities are always considered with some background information. That is to say, we always work with statements of the form \(A\mid B\).

The first chapter concludes with specifications of his robot's behavior, which Jaynes calls his "desiderata" [something that is needed or wanted]. They are not "axioms" because they are not asserted to be truths. The spec are:

Spec. 1. Degrees of plausibility are represented by real numbers.

Spec. 2. Qualitative correspondence with common sense.

This was rather vague to me, but I think Jaynes refers to the basic scheme: if old information C is updated to \(C'\) such that \((A\mid C')\gt(A\mid C)\) but the plausibility of B given A is not changed: \[ (B\mid AC') = (B\mid AC). \] Then (1) this can only increase the plausibility both A and B is true \[(AB\mid C')\geq (AB\mid C)\] and (2) on the negation of A behaves as \[(\overline{A}\mid C')\leq (\overline{A}\mid C).\]

Spec 3. The robot reasons consistently.

This is actually given by three sub-specifications

Spec. 3A. If a conclusion may be reasoned out by more than one way, then every way produces the same result.

This cryptic specification 3A means: if \(AB\mid C\) may be obtained from (1) combining \(A\mid C\) with \(B\mid AC\), and (2) combining \(B\mid C\) with \(A\mid BC\), then they should produce the same plausibilities.

Spec. 3B. The robot takes into account all evidence available to it. The robot does not arbitrarily throw away information.

Generically, this is a good rule of thumb for everyone.

With regards to programming a robot, I was left wondering: how does a robot know what some "raw data" is evidence of? How does a robot ascertain when some evidence is relevant to a given proposition?

Spec. 3C. The robot represents equivalent states of knowledge by equivalent plausibility assignments. (I.e., if two robots have the same information but label the propositions differently, the plausibilities assigned — once the propositions from robot 1 are identified with those from robot 2 — must be the same.)

As a mathematician, I was alerted to the word equivalent. Is this an equivalence relation? Or is it equality "on the nose"? Can we weaken it a little? Can we pad it to within some margin of error? I'm not sure, but it seems at least equality up to some symmetry transformation argument.

Jaynes calls specifications 1, 2, and 3A are "structural" (governing the structure of the robot's brain), while 3B and 3C are "interface" conditions telling us how the robot's brain relates to the world.

Comment 1.8.1. Common Language. Jaynes concludes each chapter with commentary and observations. Some are outright polemical, others insightful. Here Jaynes points out English conflates ontological statements (There is noise in the room) with epistemological statements (The room is noisy). Confusing these two asserts one's own private thoughts and sensations are realities existing externally in nature. Jaynes calls this the Mind Projection Fallacy. It's a recurring phrase in his book.

Chapter 2

This chapter mostly consists of Jaynes' idiosyncratic derivation of Cox's theorem. Judging from the notation used, I think a lot of it is implicitly nodding to statistical mechanics in physics, but I don't know sufficient thermodynamics & statistical physics to get the references.

The tl;dr version of Cox's theorem is: plausibility is represented by some "sufficiently reasonable" function assigning real numbers to propositions of the form \(A\mid B\). Once we impose some very reasonable conditions on this assignment of real numbers, we can rescale it to produce numbers in the interval [0, 1]. Thus we can identify this transformed, rescaled assignment of plausibilities as Probability.

Aside: Cox's theorem is mildly controversial. Halpern's A Counter Example to Theorems of Cox and Fine provides a rather pathological counter-example with a 12-sided die, where the plausibilities assigned to the outcomes are the same as the "real probabilities", except for some reason the plausibility of \(Bel(10)=\Pr(10)+\delta\) and \(Bel(11)=\Pr(11)-\delta\) for some sufficiently small \(0\lt\delta\lt1\). Well, then, we have a counter example: the plausibilities don't reflect the probabilities at all!

This would contradict the specifications for Jaynes' robot, though.

Still, we should bear in mind there are subtleties surrounding Cox's theorem, and we should not take it as the Gospel. For further reading, I have been referred to: Halpern's Cox's Theorem Revisited, Snow's On the Correctness and Reasonableness of Cox's Theorem for Finite Domains, and van Horn's Constructing a Logic of Plausible Inference: a Guide To Cox's Theorem. See also PBworks' annotated bibliography on Cox's theorem.

In classical logic, propositions are either true or false. Probabilities generalize this, and incorporates this by writing \(p(A)=1\) when A is true, and \(p(A)=0\) when A is false.

The two rules absolutely essential to this probability function are (1) the product rule \[p(AB\mid C)=p(A\mid BC)p(B\mid C) = p(B\mid AC)p(A\mid C)\] and (2) the sum rule \[p(A+B\mid C) = p(A\mid C) + p(B\mid C) - p(AB\mid C).\] These are the critical rules to probability that Jaynes uses frequently throughout, well, the next few chapters (i.e., as much as I've read thus far).

Section 2.3. Qualitative Properties. What's really intriguing is Jaynes' translation of syllogisms to this probability notation. Generically, a derivation with n premises \(B_{1},\dots,B_{n}\) and a conclusion \(A\) is written in logic as

\(B_{1}\)
...
\(\underline{B_{n}}\)
\(A\)

In probability, this corresponds to \(p(A\mid B_{1}\dots B_{n})\).

Jaynes uses the product rule to derive modus ponens and modus tollens from the product rule. But that's not all! Jaynes then derives the "weak syllogisms" of chapter 1, showing exactly how a proposition's plausibility changes as evidence shifts.

And, in an impressive moment of triumph, it turns out the "weak syllogisms" introduced in the first chapter correspond to a Bayes update of the probabilities. Having spent a lot of time with Polya's patterns of plausible reasoning, it was delightful to see them formalized in Bayesian terms.

2.4. Assigning numerical values. At this point, I was saying to myself, "Yeah, ok, this is fascinating and all, but I can't program this. I don't even know how to assign plausibilities to a proposition yet." Jaynes tackles that issue here.

If we have n mutually exclusive and exhaustive propositions \(A_{1},\dots,A_{n}\) given some background information B, then we necessarily have \[\sum^{n}_{k=1}p(A_{k}\mid B)=1.\] Jaynes gives a very brief argument by symmetry that, since we don't know anything more about these propositions, and we can relabel them arbitrarily without changing the state of knowledge, then we must have the Principle of Indifference, i.e., \[p(A_{k}\mid B) = \frac{1}{N}\] for any k.

Here is a moment of triumph: if we have a second robot with its plausibility assignments given by \(p_{II}(-)\) and the robot has accidentally relabeled the \(A_{\pi(i)}\) (for some permutation of the indexing set \(\pi\)), and if both robots have the same information, then necessarily \(p(A_{i}\mid B)=p_{II}(A_{\pi(i)}\mid B)\)...because that is what the data supports. From this, we "derive" the principle of indifference. And a general rule is found: the data given the robot determines the function \(p(-)\).

This is also where Jaynes first starts using the term Probability. For him, it is the "Kelvin scale" of plausibilities, the natural system of units which is universally translateable. (American scientists know the difficulty of dealing with Fehrehnheit and Celsius, well, the same difficulty occurs with two different agents trying to express plausibility assessments. Probability is the natural scale permitting translation.)

In a similar situation, if we have N mutually exclusive and exhaustive propositions \(H_{1}, \dots, H_{N}\) given some background information B, and if A is true in M hypotheses, then \[p(A\mid B) = \frac{M}{N}.\] Although quite simple, these rules suffice for getting quite a bit of mileage.

2.5. Notation, finite sets of propositions. When dealing with probabilities of propositions being true, Jaynes uses the notation of capital letters \(P(A\mid B)\). But for, e.g., the probability density of a particular distribution, lowercase letters are used like \(h(x\mid r, n, p)\).

Also, for technical reasons inherent in Cox's theorem, Jaynes will be working with finitely many propositions.

2.6.1. Comment on "Subjective" versus "Objective" probability. The terms "subjective" and "objective" are thrown around wildly in probability theory. Jaynes contends every probability assignment is necessarily "subjective", in the sense that it reflects only a state of knowledge and not anything measurable in a physical experiment. And if inquired, "Whose state of knowledge?" Well, it's the robot's state of knowledge (or anyone else who reasons according to the specifications laid out).

Probability then becomes a way of expressing (or encoding) one's information about a subject, regardless of one's personal feelings, hopes, fears, desires, etc., concerning it. This is the function of specifications 3B and 3C: they make the probability assignments "objective", in this sense of "it consistently encodes one's information about a subject".

Ch. 3. Elementary Sampling Theory

Jaynes helpfully begins by reviewing what has been established so far: the product rule (1) the product rule \[p(AB\mid C)=p(A\mid BC)p(B\mid C) = p(B\mid AC)p(A\mid C)\] and (2) the sum rule \[p(A+B\mid C) = p(A\mid C) + p(B\mid C) - p(AB\mid C).\] There's the "law of excluded middle" \[p(A\mid B) + p(\bar{A}\mid B) = 1.\] The principle of indifference: if given background information B the mutually exclusive hypotheses \(H_{i}\) (for \(i=1,\dots,N\)) are exhaustive, and B does not favor any particular hypothesis, then \[p(H_{i}\mid B) = \frac{1}{N}\] for any \(i=1,\dots,N\). If a proposition A is true on some M of the hypotheses \(H_{i}\), and false on the remaining \(N-M\), then \[p(A\mid B) = \frac{M}{N}.\] That's it, that's all we need. It's amazing how little is actually needed to do probability theory.

Section 3.1. Sampling without replacement. We are considering an urn problem, with the following propositions:

B = an urn contains N balls, all physically identical and indistinguishable, aside from being numbered from 1 to N (the presence of numbers does not alter the physical properties of the balls). And further M of the balls are red, the remaining \(N-M\) balls are white. We draw one ball at a time, observe its color, then set it aside. We do this n times, for \(0\leq n\leq N\).
\(R_{i}\) = the i-th draw is a red ball
\(W_{i}\) = the i-th draw is a white ball

We can treat the proposition \(W_{i}\) and \(R_{i}\) as negations of each other, i.e., \[\bar{W}_{i}=R_{i},\quad\mbox{and}\quad\bar{R}_{i}=W_{i}.\]

The reader can verify \[p(R_{1}\mid B) = \frac{M}{N} \tag{3.1.1}\] and \[p(W_{1}\mid B) = 1 - \frac{M}{N}. \tag{3.1.2}\] These probability assignments reflect our robot's state of knowledge. It's invalid to speak of "verifying" these equations experimentally: our concern is about reasoning with incomplete information, not assertions of physical fact about what will be drawn from the urn.

Jaynes next works through a few calculations. I think it's better if presented as exercises for the reader:

Exercise 1. What is the probability of drawing r red balls, one after another? I.e., compute \(p(R_{r}R_{r-1}\dots R_{2}R_{1}\mid B)\).

Exercise 2. What is the probability of drawing w white balls, one after another? I.e., compute \(p(W_{w}W_{w-1}\dots W_{2}W_{1}\mid B)\).

Exercise 3. What is the probability, given the first r draws are red balls, that draws \(r+1\), ..., \(r+w\) are white balls? I.e., compute \(p(W_{w+r}W_{w-1+r}\dots W_{2+r}W_{1+r}\mid R_{r}R_{r-1}\dots R_{2}R_{1}B)\).

From exercise 1 and 3, we conclude the probability of drawing w white balls and r red balls is given by: \[p(W_{w+r}\dots W_{1+r} R_{r}\dots R_{1}\mid B) = \frac{M!(N-M)!(N-n)!}{(M-r)!(N-M-w)!N!}.\] Curiously, this is the probability for one particular sequence of drawing r red balls and w white balls, namely the sequence of r red balls followed by w white balls. If the order were permuted (say, drawing w white balls followed by r red balls), the probability would be the same.

How many ways are there to draw exactly r red balls out of n drawings? It's the binomial coefficient, \[\binom{n}{r} = \frac{n!}{r!(n-r)!}.\]

Now take A to be the proposition "Exactly r red balls drawn out of n balls drawn, in any order", then \[h(r\mid N, M, n) := p(A\mid B) = \binom{n}{r}p(W_{w+r}\dots W_{1+r} R_{r}\dots R_{1}\mid B)\] which can be expanded out as \[h(r\mid N, M, n) = \frac{\binom{M}{r}\binom{N-M}{n-r}}{\binom{N}{n}}.\] Astute readers recognize this is the hypergeometric distribution.

Exercise 4. What is the most probable r? (Hint: solve \(h(r\mid N, M, n) = h(r-1\mid N, M, n)\) for r.)

Exercise 5. What is the probability of drawing a red ball on the third draw given no information on the first two draws? Is it dependent or independent of the prior draws? (Hint: the proposition being considered is \(R_{3}(W_{2}+R_{2})(W_{1}+R_{1})\mid B\), so use the product and sum rules to compute the probability.)

This wraps up the first section of chapter 3. The next section is a discussion of the role of causality in statistical physics as an invalid way to think of probability. It's also as far as I've gotten. There are about 10 more sections in the third chapter, I just haven't read them yet. Jaynes is a real treat if you have a solid understanding of probability theory and a good working knowledge of statistics. Maybe I'll continue adding my reading notes, if there's any interest in it.

Monday, June 1, 2020

Fast Dirichlet Distribution Parameter Estimation

From the Bayesian perspective, looking at \(\operatorname{Dir}(\alpha_{1}, \dots, \alpha_{n})\) the parameters \(\alpha_{j}\) encode the amount of evidence or number of observations of rolling an n-sided die, we witness side j approximately \(\alpha_{j}\) times.

It is tempting to use the Dirichlet distribution for estimating voting behavior, because a voter casts their ballot for a Democrat, a Republican, or a third-party candidate. The problem with this is, if we just set \(\alpha_{1}\) to the number of votes for the Democratic candidate (and likewise for the other parameters), there is too little variation. If we've seen a die turn up 1 and 6 roughly 99% of the time, we can be confident it will do so in the future. (We have such confidence about the Sun rising, after all.)

The problem: given observed outcomes \(x_{j}\) (dice rolls), we can form the approximate probabilities \(p_{j} = x_{j}/\sum_{\ell}x_{\ell}\). Using these probabilities, we can estimate the parameters of the Dirichlet distribution. I'm really interested in finding sensible values for \[\alpha_{0} = \alpha_{1} + \dots + \alpha_{n}.\tag{1}\] This quantity in Eq (1) is called the Precision of the Dirichlet distribution.

More precisely, we will observe an election in K counties, so we will end up with a vector \(\vec{x}_{j}\) whose components are the votes received by candidate j in each county. This gives us a vector \(\vec{p}_{j}\) whose components are the proportion of the county's vote went to candidate j

Solution: we can make estimates of the Dirichlet distribution based on the data, then use this to estimate the various parameters of interest.

Different Algorithms

There are a variety of ways to estimate parameters, I'm looking at closed-form algorithms. In particular, the method of moments derived algorithms.

In a nutshell, the method of moments looks at the moments of random variables \(\mathbb{E}[X^{n}]\) in terms of the parameters of the distribution, then tries to solve for one of the parameters in terms of the expected values. When given a population, we substitute in the sample mean (and sample variance) for the expected values in these formulas, giving us numerical estimates.

Minka's Method

Minka (2012) gives the following scheme. The method of moments applied to the Dirichlet distribution with parameters \(\alpha_{i}\), if we let \(\alpha_{0} = \alpha_{1}+\dots+\alpha_{n}\), may be approximated from data \(\vec{p}_{i}\) (in the trials the proportion corresponding to \(\alpha_{i}\)) using \[\widehat{\alpha_{0}} = \frac{\mathbb{E}[p_{1}] - \mathbb{E}[p_{1}^{2}]}{\mathbb{E}[p_{1}^{2}] - \mathbb{E}[p_{1}]^{2}}\] we find the approximation \[\boxed{\widehat{\alpha_{i}} = \mathbb{E}[p_{i}]\widehat{\alpha_{0}} = \mathbb{E}[p_{i}]\frac{\mathbb{E}[p_{1}] - \mathbb{E}[p_{1}^{2}]}{\mathbb{E}[p_{1}^{2}] - \mathbb{E}[p_{1}]^{2}}}\tag{2}\] However, there is no reason why we should use \(p_{1}\) in determining \(\widehat{\alpha_{0}}\) instead of any other \(p_{i}\).

Note algebraically speaking: \[\mathbb{E}[X^{2}] = \operatorname{var}[X] + \mathbb{E}[X]^{2}\]

A simple R implementation:

minka_estimate <- function(data) {
    mu <- mean(data[,1]);
    s2 <- var(data[,1]);

    (mu*(1 - mu))/s2 - 1;
}

Ronning's Method

We can use all the data to estimate \(\alpha_{0}\), as Ronning (1989) suggested. We use \[\operatorname{var}[p_{j}] = \frac{\mathbb{E}[p_{j}](1 - \mathbb{E}[p_{j}])}{1 + \alpha_{0}}\] after some algebraic rearrangement, then taking the log of both sides \[\log(\alpha_{0}) = \log\left(\frac{\mathbb{E}[p_{j}](1 - \mathbb{E}[p_{j}])}{\operatorname{var}[p_{j}]}-1\right).\] We take the average of the right-hand side over j (the candidates) to get \[\boxed{\log(\alpha_{0}) = \frac{1}{n}\sum^{n}_{j=1}\log\left(\frac{\mathbb{E}[p_{j}](1 - \mathbb{E}[p_{j}])}{\operatorname{var}[p_{j}]}-1\right).}\tag{3}\] The astute reader will recognize this is the geometric mean of estimates.

As a sanity test that this makes sense, we recover the correct value for the Beta Distribution.

A simple R implementation:

ronning_estimate <- function(data) {
    assert_that(all(0 <= data) && all(data <= 1));
    assert_that(all(0.97 <= rowSum(data)));
    
    col_estimate <- function(col) {
        mu <- mean(col);

        if (mu == 0) return(0);
        
        log(mu*(1-mu)/var(col) - 1);
    }
    
    exp(mean(apply(data, 2, col_estimate)));
}

Crazy algebraic results

With 3 results (a "trinomial", if you will), we have an estimate I found by hand. Let \(\mu_{A}\), \(\mu_{B}\) and \(\mu_{C}\) be the sample means and \(\sigma^{2}_{B}\) the sample variance. Assuming \[ 0\lt \mu_{A}\lt 1,\quad\mbox{and}\quad 0\lt\mu_{B}\lt 1-\mu_{A}\] (and if this fails to hold, permute the labels "A", "B", "C" until it holds, if at all possible), then we can estimate \[\widehat{\alpha_{A}} = \left(\frac{(1 - \mu_{B})\mu_{B}}{\sigma^{2}_{B}} - 1\right)\mu_{A} \tag{4.1}\] \[\widehat{\alpha_{B}} = \widehat{\alpha_{A}}\frac{\mu_{B}}{\mu_{A}} \] \[\widehat{\alpha_{C}} = \widehat{\alpha_{B}}\frac{1 - \mu_{B}}{\mu_{B}} - \widehat{\alpha_{A}}\]

Observe this uses Minka's formula in estimating \(\widehat{\alpha_{0}}\) (the parenthetic term on the right hand side of Eq (4.1)) for use in computing \(\widehat{\alpha_{A}}\). The reader can verify if we add these quantities together, we in fact get Minka's estimate using the second largest sample mean (and its associated sample variance).

Testing Results

When testing which estimator to use, we should have some error implicit in our mind, in the sense of "What are we trying to optimize? What counts as 'wrong'?" It's one thing to try to recover the parameters of a Dirichlet distribution, it's another to estimate the concentration parameter for a heuristic.

We could note that the method of moments for the Beta distribution coincide with estimating \(\alpha_{0}\) for using Minka's method on any particular outcome (i.e., using \(\widehat{\alpha_{j}}\) for any particular, fixed j). It's not uncommon for Bayesians to use a Gamma distribution as a prior for the concentration parameter of a Beta distribution, which would give us some measure of its variance.

The method of moments is biased (slightly). Despite this, we can still use the Cramer-Rao bound to estimate the variance of our estimates.

For what its worth, when used on voting results, the method of moments produce similar estimates when using the proportion of votes the Republican and Democrat received. It's within the margin of error of the maximum likelihood estimates given by MASS::fitdistr, but is consistently smaller than maximum likelihood estimates. Since I'm advocating humility in making these estimates, I prefer the method of moments using the fancy-pants algebraic method I derived (Minka's method on the runner-up candidate's proportion of votes).

If I were less lazy, I'd include more detailed proofs about estimates and goodness-of-fit tests, but...meh, exercise for reader :)

References

Jonathan Huang, "Maximum Likelihood Estimation of Dirichlet Distribution Parameters". Unpublished manuscript(?), undated, Eprint
Thomas P Minka, "Estimating a Dirichlet distribution". Microsoft Research Paper, 2012. Eprint.
G. Ronning, "Maximum-likelihood estimation of dirichlet distributions". Journal of Statistical Computation and Simulation 32 (1989) 215–221.