From the Bayesian perspective, the parameters \(\alpha_{j}\) of \(\operatorname{Dir}(\alpha_{1}, \dots, \alpha_{n})\) encode an amount of evidence, or a number of observations: rolling an n-sided die, we witnessed side j approximately \(\alpha_{j}\) times.
It is tempting to use the Dirichlet distribution for modeling voting behavior, because a voter casts their ballot for a Democrat, a Republican, or a third-party candidate. The problem is that if we just set \(\alpha_{1}\) to the number of votes for the Democratic candidate (and likewise for the other parameters), the resulting distribution has far too little variation: with millions of "observations", it is concentrated almost entirely at the observed proportions. If we've seen a die turn up 1 or 6 roughly 99% of the time, we can be confident it will continue to do so. (We have such confidence about the Sun rising, after all.)
The problem: given observed outcomes \(x_{j}\) (dice rolls), we can form the approximate probabilities \(p_{j} = x_{j}/\sum_{\ell}x_{\ell}\). Using these probabilities, we can estimate the parameters of the Dirichlet distribution. I'm particularly interested in finding sensible values for \[\alpha_{0} = \alpha_{1} + \dots + \alpha_{n}.\tag{1}\] The quantity in Eq (1) is called the precision (or concentration) of the Dirichlet distribution.
More precisely, we will observe an election in K counties, so we end up with a vector \(\vec{x}_{j}\) whose components are the votes received by candidate j in each county. This gives us a vector \(\vec{p}_{j}\) whose components are the proportions of each county's vote that went to candidate j.
Solution: we can fit a Dirichlet distribution to the data, then use the fit to estimate the various parameters of interest.
Different Algorithms
There are a variety of ways to estimate the parameters; I'm looking at closed-form algorithms, in particular those derived from the method of moments.
In a nutshell, the method of moments expresses the moments \(\mathbb{E}[X^{n}]\) of a random variable in terms of the parameters of the distribution, then solves for the parameters in terms of those expected values. Given a sample, we substitute the sample mean (and sample variance) for the expected values in these formulas, giving us numerical estimates.
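To make this concrete, here's a toy illustration in R, unrelated to the Dirichlet: for an Exponential(\(\lambda\)) distribution, \(\mathbb{E}[X] = 1/\lambda\), so matching the first moment against the sample mean gives an estimate of \(\lambda\) (the simulated data and the value 2.5 are just for illustration):

# Toy method-of-moments example: Exponential(rate) has E[X] = 1/rate,
# so matching the first moment gives rate_hat = 1/mean(x).
set.seed(1);
x <- rexp(10000, rate = 2.5);
1/mean(x);   # should be close to 2.5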
Minka's Method
Minka (2012) gives the following scheme. Applying the method of moments to the Dirichlet distribution with parameters \(\alpha_{i}\) and \(\alpha_{0} = \alpha_{1}+\dots+\alpha_{n}\), the precision may be estimated from the observed proportions \(\vec{p}_{i}\) (in each trial, the proportion corresponding to \(\alpha_{i}\)) using \[\widehat{\alpha_{0}} = \frac{\mathbb{E}[p_{1}] - \mathbb{E}[p_{1}^{2}]}{\mathbb{E}[p_{1}^{2}] - \mathbb{E}[p_{1}]^{2}},\] which gives the approximation \[\boxed{\widehat{\alpha_{i}} = \mathbb{E}[p_{i}]\widehat{\alpha_{0}} = \mathbb{E}[p_{i}]\frac{\mathbb{E}[p_{1}] - \mathbb{E}[p_{1}^{2}]}{\mathbb{E}[p_{1}^{2}] - \mathbb{E}[p_{1}]^{2}}.}\tag{2}\] However, there is no reason why we should use \(p_{1}\) in determining \(\widehat{\alpha_{0}}\) rather than any other \(p_{i}\).
Note that, algebraically, \[\mathbb{E}[X^{2}] = \operatorname{var}[X] + \mathbb{E}[X]^{2},\] so the estimate above simplifies to \(\widehat{\alpha_{0}} = \mathbb{E}[p_{1}](1-\mathbb{E}[p_{1}])/\operatorname{var}[p_{1}] - 1\).
A simple R implementation of the precision estimate (using the first column of a matrix of proportions):

minka_estimate <- function(data) {
  # data: matrix whose rows are observed proportion vectors
  mu <- mean(data[,1]);
  s2 <- var(data[,1]);
  # Eq (2), rewritten using E[X^2] = var[X] + E[X]^2
  (mu*(1 - mu))/s2 - 1;
}
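Eq (2) then gives the full parameter vector by scaling the column means by this precision; a minimal sketch (the helper name minka_alpha is mine, not Minka's):

# alpha_i-hat = E[p_i] * alpha_0-hat, per Eq (2), reusing minka_estimate() above
minka_alpha <- function(data) {
  colMeans(data) * minka_estimate(data);
}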
Ronning's Method
We can use all the data to estimate \(\alpha_{0}\), as Ronning (1989) suggested. We start from \[\operatorname{var}[p_{j}] = \frac{\mathbb{E}[p_{j}](1 - \mathbb{E}[p_{j}])}{1 + \alpha_{0}};\] solving for \(\alpha_{0}\) and taking the log of both sides gives \[\log(\alpha_{0}) = \log\left(\frac{\mathbb{E}[p_{j}](1 - \mathbb{E}[p_{j}])}{\operatorname{var}[p_{j}]}-1\right).\] We then average the right-hand side over j (the candidates) to get \[\boxed{\log(\alpha_{0}) = \frac{1}{n}\sum^{n}_{j=1}\log\left(\frac{\mathbb{E}[p_{j}](1 - \mathbb{E}[p_{j}])}{\operatorname{var}[p_{j}]}-1\right).}\tag{3}\] The astute reader will recognize that exponentiating this gives the geometric mean of the n single-component estimates of \(\alpha_{0}\).
As a sanity check that this makes sense, we recover the correct value for the Beta distribution (the n = 2 case).
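Spelling that check out: for a \(\operatorname{Beta}(\alpha_{1},\alpha_{2})\) distribution, \[\operatorname{var}[p_{1}] = \frac{\alpha_{1}\alpha_{2}}{(\alpha_{1}+\alpha_{2})^{2}(\alpha_{1}+\alpha_{2}+1)} = \frac{\mathbb{E}[p_{1}](1 - \mathbb{E}[p_{1}])}{1 + \alpha_{0}},\] so \(\widehat{\alpha_{0}} = \mathbb{E}[p_{1}](1-\mathbb{E}[p_{1}])/\operatorname{var}[p_{1}] - 1\), the usual method-of-moments estimate for the Beta distribution. And since \(p_{2} = 1 - p_{1}\), both terms in Eq (3) are equal, so the geometric mean is just this value.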
A simple R implementation:
library(assertthat);  # for assert_that()

ronning_estimate <- function(data) {
  # data: matrix whose rows are observed proportion vectors (summing to ~1)
  assert_that(all(0 <= data) && all(data <= 1));
  assert_that(all(0.97 <= rowSums(data)));
  # per-candidate estimate of log(alpha_0), as in Eq (3)
  col_estimate <- function(col) {
    mu <- mean(col);
    if (mu == 0) return(0);
    log(mu*(1-mu)/var(col) - 1);
  }
  exp(mean(apply(data, 2, col_estimate)));
}
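A quick way to exercise both estimators on synthetic data (this assumes the gtools package for rdirichlet; the "true" parameters below are arbitrary):

set.seed(42);
p <- gtools::rdirichlet(5000, c(6, 5, 2));  # true alpha_0 = 13
minka_estimate(p);    # precision from the first component only
ronning_estimate(p);  # precision pooling all components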
Crazy algebraic results
With 3 outcomes (a "trinomial", if you will), we have an estimate I found by hand. Let \(\mu_{A}\), \(\mu_{B}\) and \(\mu_{C}\) be the sample means and \(\sigma^{2}_{B}\) the sample variance of the B proportions. Assuming \[ 0\lt \mu_{A}\lt 1,\quad\mbox{and}\quad 0\lt\mu_{B}\lt 1-\mu_{A}\] (and if this fails to hold, permute the labels "A", "B", "C" until it holds, if at all possible), then we can estimate \[\widehat{\alpha_{A}} = \left(\frac{(1 - \mu_{B})\mu_{B}}{\sigma^{2}_{B}} - 1\right)\mu_{A} \tag{4.1}\] \[\widehat{\alpha_{B}} = \widehat{\alpha_{A}}\frac{\mu_{B}}{\mu_{A}} \tag{4.2}\] \[\widehat{\alpha_{C}} = \widehat{\alpha_{B}}\frac{1 - \mu_{B}}{\mu_{B}} - \widehat{\alpha_{A}}\tag{4.3}\]
Observe this uses Minka's formula for \(\widehat{\alpha_{0}}\) (the parenthesized term on the right-hand side of Eq (4.1)) in computing \(\widehat{\alpha_{A}}\). The reader can verify that if we add these three quantities together, we in fact recover Minka's estimate of \(\widehat{\alpha_{0}}\) using the second-largest sample mean (and its associated sample variance).
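A sketch of Eqs (4.1)-(4.3) in R (the function name trinomial_estimate is mine; the columns of data are assumed ordered so that column 2 plays the role of "B", e.g. the runner-up candidate):

trinomial_estimate <- function(data) {
  mu_A <- mean(data[,1]); mu_B <- mean(data[,2]);
  s2_B <- var(data[,2]);
  alpha_0 <- (1 - mu_B)*mu_B/s2_B - 1;             # Minka's precision, using column B
  alpha_A <- alpha_0 * mu_A;                       # Eq (4.1)
  alpha_B <- alpha_A * mu_B/mu_A;                  # Eq (4.2)
  alpha_C <- alpha_B * (1 - mu_B)/mu_B - alpha_A;  # Eq (4.3)
  c(A = alpha_A, B = alpha_B, C = alpha_C);
}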
Testing Results
When deciding which estimator to use, we should have some notion of error in mind, in the sense of "What are we trying to optimize? What counts as 'wrong'?" It's one thing to try to recover the parameters of a Dirichlet distribution; it's another to estimate the concentration parameter for use in a heuristic.
We could note that the method of moments for the Beta distribution coincides with Minka's estimate of \(\alpha_{0}\) from any particular outcome (i.e., using \(\widehat{\alpha_{j}}\) for any particular, fixed j). It's not uncommon for Bayesians to use a Gamma distribution as a prior for the concentration parameter of a Beta distribution, which would give us some measure of its variance.
The method of moments is (slightly) biased. Despite this, we can still use the Cramér-Rao bound to estimate the variance of our estimates.
For what it's worth, when used on voting results, the method of moments produces similar estimates whether we use the proportion of votes the Republican or the Democrat received. It's within the margin of error of the maximum likelihood estimates given by MASS::fitdistr, but is consistently smaller than the maximum likelihood estimates. Since I'm advocating humility in making these estimates, I prefer the method of moments using the fancy-pants algebraic method I derived (Minka's method applied to the runner-up candidate's proportion of votes).
If I were less lazy, I'd include more detailed proofs about estimates and goodness-of-fit tests, but...meh, exercise for reader :)
References
- Jonathan Huang, "Maximum Likelihood Estimation of Dirichlet Distribution Parameters". Unpublished manuscript(?), undated. Eprint.
- Thomas P. Minka, "Estimating a Dirichlet distribution". Microsoft Research paper, 2012. Eprint.
- G. Ronning, "Maximum likelihood estimation of Dirichlet distributions". Journal of Statistical Computation and Simulation 32 (1989) 215–221.