Problem: Integral in Bayesian Inference. The basic setup is we're considering Bayesian inference \[ \Pr(\theta|D) = \frac{\Pr(D|\theta)\Pr(\theta)}{\Pr(D)} \tag{1a}\] where the marginal probabilities \(\Pr(\theta)\) is the prior probability distribution for \(\theta\), and \(\Pr(D)\) is computed as the integral \[ \Pr(D) = \int\Pr(D|\theta)\Pr(\theta)\,\mathrm{d}\theta. \tag{1b}\] If the reader is familiar with calculus, then it is no surprise to learn computing Eq (1b) is hard in general.

Puzzle: How do we estimate the posterior distribution \(\Pr(\theta|D)\)? (We will discuss one popular approach in this post.)

Clever readers will suggest using numerical methods to approximate the integral. The problem with this is the number of terms needed to approximate the solution grows as \(N^{\operatorname{dim}(\theta)}\) while error is approximately \(N^{-1}\). For an extra digit of accuracy, we need \(10^{\operatorname{dim}(\theta)}\) terms, and usually \(\theta\) is from a large dimensional parameter space. (For predicting presidential elections, \(\operatorname{dim}(\theta)\geq 50\) for example; there are \(10^{80}\) atoms in the universe, so 2 digits of extra accuracy would require more atoms than the universe can facilitate.)1The unsatisfied reader who believes this is a contrived line of reasoning may consult Bakhvalov's theorems concerning numerical integration error bounds in d dimensions. This is sometimes called the "curse of dimensionality".

That's the rub: how do we make the integral in Eq (1b) computationally tractable?

Monte Carlo Integration

The basic idea is: give up. We won't be able to find a solution to Eq (1b) using numerical methods. Instead, we will look at Eq (1b) as \[ \Pr(D) = \mathbb{E}[\Pr(D|\theta)] = \int\Pr(D|\theta)\Pr(\theta)\,\mathrm{d}\theta. \] We just need to approximate the "expected value" \(\mathbb{E}[\Pr(D|\theta)]\).

Our trick is, with an arbitrary function \(g(x,\theta)\), we can approximate its expected value using a random sample \(\theta_{1}\), ..., \(\theta_{N}\sim\Pr(\theta)\) distributed according to the prior distribution, then \[ \mathbb{E}[g(x, \theta)]\approx\frac{1}{N}\sum^{N}_{j=1}g(x,\theta_{j}). \tag{2}\] This is the Monte Carlo method for integration.

Random Numbers

This works fine, but we need some way to generate the random sample \(\theta_{j}\sim\Pr(\theta)\). I'm just going to look at a few methods of generating random numbers.

Rejection Sampling

Consider \(X\sim\operatorname{Exp}(\lambda=1)\). Observe the probability density function is \(\Pr(x)=\exp(-x)\). We generate a sample by generating a random number \(x_{i}\) from the computer between 0 and, say, 10. Then we generate a random number \(y_{i}\) from the computer between 0 and 1. Observe that the \(x_{i},y_{i}\) are drawn from uniform distributions.

If \(y_{i}\leq\Pr(X=x_{i})\), then we accept \(x_{i}\) and add it to our sample. Otherwise, we reject it, and go back to step 1.

We do this until we have generated the desired sample size.

The obvious problem is that, for most distributions, we will generate a lot of random numbers for very few acceptable results. In our example, roughly 90% of the generated \(x_{i}\) will be rejected. More precisely, 9.999546% will be accepted. So for a sample of 10, we'd need to generate 200 random numbers (100 \(x_{i}\) and 100 \(y_{i}\)). That's pretty bad.

More generally, we could take \(x_{i}\) from an arbitrary distribution \(\Pr(\theta)\), and \(y_{i}\sim U(0,1)\), then if \(y_{i}\leq\Pr(X=x_{i})\) we accept \(x_{i}\). Even if we'd accept 90% of the \(x_{i}\) generated, we would still generate twice as many random numbers as we'd actually need (since we need to generate the \(y_{i}\)).

Importance Sampling

Suppose we have a random variable \(X\sim P(\theta)\) and we know its density function \(p(x)\), but we only have the ability to sample a different distribution \(Q\) (and we know its density function is \(q(x)\)). Then we could consider \[ \mathbb{E}_{P}[f(X)] = \int f(x)p(x)\,\mathrm{d}x = \int \frac{f(x)p(q)}{q(x)}q(x)\,\mathrm{d}x \tag{3} \] and we see this is just \[ \mathbb{E}_{P}[f(X)] = \mathbb{E}_{Q}\left[f(X)\frac{p(X)}{q(X)}\right]. \tag{4} \] If we can only sample from the distribution \(Q\), then we find for \(x_{i}\sim Q\), \[ \mathbb{E}_{P}[f(X)] = \mathbb{E}_{Q}\left[f(X)\frac{p(X)}{q(X)}\right]\approx\frac{1}{N}\sum^{N}_{i} f(x_{i})\frac{p(x_{i})}{q(x_{i})}q(x_{i}). \tag{5} \] To form some confidence interval about this estimate, we need to find the variance, which is just \[ \sigma_{q}^{2} = \int\frac{(f(x)p(x))^{2}}{q(x)}\,\mathrm{d}x - \mathbb{E}_{P}[f(X)]^{2} \tag{6} \] which can itself be approximated using the sample. This can be derived from \[ \sigma_{q}^{2} = \mathbb{E}_{Q}\left[\left(f(X)\frac{p(X)}{q(X)}\right)^{2}\right] - \mathbb{E}_{P}[f(X)]^{2} \tag{7} \] in case we wish to examine this further in the future (i.e., helping any possible future Alex).

Markov Chains

Markov chains are not unique to Bayesian concerns. Any stochastic system with state-transition described by probabilities is described by a Markov chain: at any time t, there is a probability vector whose components describe the probability of finding the system in that particular state. Here, time is discrete.

A great use for this is baseball, where the state of an inning is described by the number of outs and what bases are occupied by runners: the transition probabilities are then dependent on the batter's abilities, the pitcher's abilities, and the other player's skills. (And, maybe, luck.)

Sampling from the Posterior

The process we saw before with generating random numbers, the processes were independent sampling methods (in the sense that \(x_{i}\) and \(x_{j}\) are statistically independent of each other for \(i\neq j\)). What if we sample from the posterior distribution in Eq (1a) depending on the previous sample? That is to say, we form a Markov chain of samples \(\theta_{j+1}\sim\Pr(\theta_{j}|D)\) according to "some sampling algorithm".

One way to do this is to propose the next value in the chain \(\theta_{pr}\) using some random number generator scheme, then we accept or reject it by first forming the ratio \[ \frac{\Pr(\theta_{pr}|D)}{\Pr(\theta_{j}|D)} = \frac{\Pr(D|\theta_{pr})\Pr(\theta_{pr})}{\Pr(D|\theta_{j})\Pr(\theta_{j})} \tag{8} \] so the normalization denominators cancel out. IF the ratio of unnormalized posteriors is GREATER THAN a randomly generated number \(u\) between 0 and 1 (i.e., \(u \lt \Pr(\theta_{pr}|D)/\Pr(\theta_{j}|D)\)), THEN accept the proposed value as the next \(\theta_{j+1}=\theta_{pr}\). ELSE, for the case where the random number is less than the ratio, we set \(\theta_{j+1}=\theta_{j}\), and continue on. (So we'll always accept proposals for \(\Pr(\theta_{pr}|D)\geq \Pr(\theta_{j}|D)\).)

This is the famous Metropolis-Hastings sampler.

How and Why does this work?

Theorem 1. The [empirical distribution associated to the] sequence \(\theta_{1}\), \(\theta_{2}\), ..., converges to the target posterior distribution \(\Pr(\theta|D)\).

The proof amounts to proving two claims.

Claim 1. The simulated sequence \(\theta_{1}\), ..., is a Markov chain with a unique stationary distribution.

For technical reasons, this is equivalent to the Markov chain being irreducible and aperiodic and not transient. Aperiodic and not transient hold for a random walk on any proper distribution. Irreducibility holds as long as the random walk has a positive probability of eventually reaching any state from any other state.

Claim 2. The stationary distribution equals this target posterior distribution.

Concluding Remarks

We have introduced various ways to obtain the posterior distribution estimates using Monte Carlo integration. Markov Chain Monte Carlo methods give us a decent algorithm.

Next time, we will examine error estimates, convergence issues, etc. That is to say, the practical matters of MCMC.

References

Art B. Owen's Monte Carlo theory, methods and examples.
Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin, Bayesian Data Analysis. Third ed, CRC Press, 2014; see especially chapter 10 et seq.
Martin Haugh's "MCMC and Bayesian Modeling", lecture notes
L. Martino, V. Elvira, F. Louzada, "Effective Sample Size for Importance Sampling based on discrepancy measures". arXiv:1602.03572
David Spiegelhalter, Sylvia Richardson, W. R. Gilks, Markov Chain Monte Carlo in Practice. CRC Press, 1996.
Ilker Yildirim, Bayesian Inference: Metropolis-Hastings Sampling. Unpublished note, dated 2012.

Political Arithmetic

Thursday, July 9, 2020

MCMC and its usefulness in Bayesian Statistics