Problem: Integral in Bayesian Inference. The basic setup is that we're considering Bayesian inference via Bayes's theorem,
$$\Pr(\theta\mid D) = \frac{\Pr(D\mid\theta)\,\Pr(\theta)}{\Pr(D)},\tag{1a}$$
where the evidence in the denominator is the integral
$$\Pr(D) = \int \Pr(D\mid\theta)\,\Pr(\theta)\,d\theta.\tag{1b}$$
Puzzle: How do we estimate the posterior distribution $\Pr(\theta\mid D)$? (We will discuss one popular approach in this post.)
Clever readers will suggest using numerical methods to approximate the integral. The problem with this is that the number of terms needed to approximate the solution grows as $N^{\dim(\theta)}$ while the error shrinks only as $N^{-1}$. For an extra digit of accuracy, we need $10^{\dim(\theta)}$ times as many terms, and usually $\theta$ comes from a high-dimensional parameter space. (For predicting presidential elections, $\dim(\theta)\geq 50$, for example; there are about $10^{80}$ atoms in the universe, so 2 digits of extra accuracy would require more terms than the universe has atoms.)[^1] This is sometimes called the "curse of dimensionality".

[^1]: The unsatisfied reader who believes this is a contrived line of reasoning may consult Bakhvalov's theorems concerning numerical integration error bounds in $d$ dimensions.
That's the rub: how do we make the integral in Eq (1b) computationally tractable?
Monte Carlo Integration
The basic idea is: give up. We won't be able to find a solution to Eq (1b) using numerical methods. Instead, we will look at Eq (1b) as an expectation with respect to the prior:
$$\Pr(D) = \mathbb{E}[\Pr(D\mid\theta)] = \int \Pr(D\mid\theta)\,\Pr(\theta)\,d\theta.$$
Our trick is, for an arbitrary function $g(x,\theta)$, we can approximate its expected value using a random sample $\theta_1, \dots, \theta_N \sim \Pr(\theta)$ distributed according to the prior distribution:
$$\mathbb{E}[g(x,\theta)] \approx \frac{1}{N}\sum_{j=1}^{N} g(x,\theta_j).$$
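To make this concrete, here is a minimal sketch in Python. The coin-flip model (uniform prior on the bias $\theta$, $k$ heads in $n$ flips) is my own illustrative choice, not from the text above; it's convenient because the evidence $\Pr(D)$ is known in closed form, so we can check the Monte Carlo estimate.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(42)

# Hypothetical toy model: n coin flips with k heads, uniform prior on theta,
# so the likelihood is Pr(D|theta) = theta^k (1 - theta)^(n-k).
n, k = 10, 7

def likelihood(theta):
    return theta**k * (1 - theta)**(n - k)

# Monte Carlo estimate: Pr(D) = E[Pr(D|theta)] ~ (1/N) sum_j Pr(D|theta_j)
N = 100_000
theta_samples = rng.uniform(0.0, 1.0, size=N)  # draws from the prior
mc_estimate = likelihood(theta_samples).mean()

# Exact evidence for this conjugate case: the Beta function B(k+1, n-k+1)
exact = factorial(k) * factorial(n - k) / factorial(n + 1)

print(f"Monte Carlo: {mc_estimate:.6f}, exact: {exact:.6f}")
```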
Random Numbers
This works fine, but we need some way to generate the random sample $\theta_j \sim \Pr(\theta)$. I'm just going to look at a few methods of generating random numbers.
Rejection Sampling
Consider $X \sim \operatorname{Exp}(\lambda = 1)$. Observe the probability density function is $\Pr(x) = \exp(-x)$. We generate a sample as follows:

1. Generate a random number $x_i$ between 0 and, say, 10, and a random number $y_i$ between 0 and 1. Observe that the $x_i, y_i$ are drawn from uniform distributions, which computers are good at producing.
2. If $y_i \leq \Pr(X = x_i)$, then we accept $x_i$ and add it to our sample. Otherwise, we reject it and go back to step 1.
3. We do this until we have generated the desired sample size.
The obvious problem is that, for most distributions, we will generate a lot of random numbers for very few acceptable results. In our example, roughly 90% of the generated $x_i$ will be rejected; more precisely, only $(1 - e^{-10})/10 \approx 9.999546\%$ will be accepted. So for a sample of 10, we'd expect to generate about 200 random numbers (100 $x_i$ and 100 $y_i$). That's pretty bad.
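Here is a minimal Python sketch of exactly the scheme just described (the interval $[0, 10]$ and the target sample size of 10 come from the text; everything else is implementation choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_exp_rejection(n_samples):
    """Rejection sampling for Exp(1): propose x ~ U(0, 10) and accept
    when y ~ U(0, 1) falls below the density value exp(-x)."""
    accepted = []
    n_proposed = 0
    while len(accepted) < n_samples:
        x = rng.uniform(0.0, 10.0)
        y = rng.uniform(0.0, 1.0)
        n_proposed += 1
        if y <= np.exp(-x):  # accept when y <= Pr(X = x)
            accepted.append(x)
    return np.array(accepted), n_proposed

sample, n_proposed = sample_exp_rejection(10)
print(f"{n_proposed} proposals ({2 * n_proposed} uniforms) for 10 accepted draws")
# The acceptance rate is (1 - exp(-10))/10, roughly 10%, as computed above.
```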
More generally, we could draw candidates $x_i$ from a proposal distribution with density $q$ satisfying $\Pr(x) \leq M q(x)$ for some constant $M$, draw $y_i \sim U(0,1)$, and accept $x_i$ whenever $y_i \leq \Pr(X = x_i)/(M q(x_i))$. But even if we accepted 90% of the $x_i$ generated, we would still generate twice as many random numbers as we'd actually need (since we need to generate the $y_i$).
Importance Sampling
Suppose we have a random variable $X \sim P$ and we know its density function $p(x)$, but we only have the ability to sample a different distribution $Q$ (and we know its density function is $q(x)$). Then we could consider
$$\mathbb{E}_P[f(X)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_Q\!\left[f(X)\,\frac{p(X)}{q(X)}\right],$$
so we can estimate $\mathbb{E}_P[f(X)]$ by averaging $f(x_j)\,p(x_j)/q(x_j)$ over a sample $x_1, \dots, x_N \sim Q$. The ratios $p(x_j)/q(x_j)$ are called importance weights.
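A minimal sketch in Python, assuming an illustrative pair of distributions of my own choosing: target $P = \operatorname{Exp}(1)$, proposal $Q = \operatorname{Exp}(1/2)$, and $f(x) = x$, so the exact answer is $\mathbb{E}_P[X] = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choice: target P = Exp(1) with p(x) = exp(-x),
# proposal Q = Exp(1/2) with q(x) = 0.5 exp(-x/2).
def p(x): return np.exp(-x)
def q(x): return 0.5 * np.exp(-0.5 * x)

N = 100_000
x = rng.exponential(scale=2.0, size=N)  # draws from Q (scale = 1/rate)

weights = p(x) / q(x)            # importance weights
estimate = np.mean(x * weights)  # estimates E_P[f(X)] with f(x) = x

print(f"importance sampling estimate of E_P[X]: {estimate:.4f} (exact: 1)")
```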
Markov Chains
Markov chains are not unique to Bayesian concerns. Any stochastic system whose state transitions are governed by probabilities depending only on the current state (not on the history of how the system got there) is described by a Markov chain: at any time $t$, there is a probability vector whose components describe the probability of finding the system in each particular state. Here, time is discrete.
A great use for this is baseball, where the state of an inning is described by the number of outs and which bases are occupied by runners: the transition probabilities are then dependent on the batter's abilities, the pitcher's abilities, and the other players' skills. (And, maybe, luck.) A toy sketch of the mechanics follows.
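The full baseball chain has 24 base–out states plus end-of-inning states, which is too big to write out here, so here is a hypothetical three-state chain (the transition probabilities are made up for illustration) showing how the probability vector evolves:

```python
import numpy as np

# Toy 3-state Markov chain; row i gives Pr(next state = j | current state = i).
# The numbers are invented for illustration, not a real baseball model.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

v = np.array([1.0, 0.0, 0.0])  # start in state 0 with certainty

# Evolve the probability vector: v_{t+1} = v_t P
for t in range(50):
    v = v @ P

print("distribution after 50 steps:", np.round(v, 4))
# The vector has converged to the stationary distribution, satisfying v = v P.
print("stationary?", np.allclose(v, v @ P))
```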
Sampling from the Posterior
The random number generation schemes we saw before were independent sampling methods (in the sense that $x_i$ and $x_j$ are statistically independent of each other for $i \neq j$). What if we sample from the posterior distribution in Eq (1a) depending on the previous sample? That is to say, we form a Markov chain of samples $\theta_1, \theta_2, \dots$ targeting $\Pr(\theta \mid D)$, where $\theta_{j+1}$ is generated from $\theta_j$ according to "some sampling algorithm".
One way to do this is to propose the next value in the chain $\theta_{\mathrm{pr}}$ using some random number generator scheme, then accept or reject it by first forming the ratio
$$r = \frac{\Pr(\theta_{\mathrm{pr}}\mid D)}{\Pr(\theta_j\mid D)} = \frac{\Pr(D\mid\theta_{\mathrm{pr}})\,\Pr(\theta_{\mathrm{pr}})}{\Pr(D\mid\theta_j)\,\Pr(\theta_j)}$$
and accepting $\theta_{\mathrm{pr}}$ (i.e., setting $\theta_{j+1} = \theta_{\mathrm{pr}}$) with probability $\min(1, r)$; otherwise we reject it and set $\theta_{j+1} = \theta_j$. Notice the intractable evidence $\Pr(D)$ cancels out of the ratio entirely, which is what makes this computable.
This is the famous Metropolis sampler; correcting the ratio for an asymmetric proposal scheme gives the more general Metropolis–Hastings sampler.
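Here is a minimal random-walk Metropolis sketch in Python. The target (the same hypothetical coin-flip posterior as earlier, with a uniform prior) and the Gaussian step size are my illustrative choices; note the code only ever evaluates the unnormalized posterior $\Pr(D\mid\theta)\Pr(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative target: posterior of a coin's bias theta after k heads in
# n flips, with a uniform prior. We only need it up to the constant Pr(D).
n, k = 10, 7

def unnormalized_posterior(theta):
    if theta <= 0.0 or theta >= 1.0:
        return 0.0  # the prior density vanishes outside (0, 1)
    return theta**k * (1 - theta)**(n - k)

def metropolis(n_steps, step_size=0.1):
    theta = 0.5  # arbitrary starting point
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.normal(0.0, step_size)  # symmetric proposal
        r = unnormalized_posterior(proposal) / unnormalized_posterior(theta)
        if rng.uniform() <= r:  # accept with probability min(1, r)
            theta = proposal
        chain.append(theta)
    return np.array(chain)

chain = metropolis(50_000)
burned = chain[5_000:]  # discard burn-in
print(f"posterior mean ~ {burned.mean():.3f} (exact Beta(8,4) mean: {8/12:.3f})")
```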
How and Why Does This Work?
Theorem 1. The [empirical distribution associated to the] sequence $\theta_1, \theta_2, \dots$ converges to the target posterior distribution $\Pr(\theta\mid D)$.
The proof amounts to proving two claims.
Claim 1. The simulated sequence $\theta_1, \theta_2, \dots$ is a Markov chain with a unique stationary distribution.
For technical reasons, this is equivalent to the Markov chain being irreducible, aperiodic, and not transient. Aperiodicity and non-transience hold for a random walk on any proper distribution; irreducibility holds as long as the random walk has a positive probability of eventually reaching any state from any other state.
Claim 2. The stationary distribution equals this target posterior distribution.
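The usual argument (found, e.g., in the Gelman et al. reference below) verifies detailed balance. Here is a sketch in the symmetric-proposal case, where $\pi$ and $J$ are my notation for the posterior and the proposal density:

```latex
% Write \pi(\theta) := \Pr(\theta \mid D), and let the Metropolis kernel be
%   T(\theta_b \mid \theta_a)
%     = J(\theta_b \mid \theta_a)\,\min\!\left(1, \pi(\theta_b)/\pi(\theta_a)\right)
% for \theta_a \neq \theta_b, with J a symmetric proposal density.
% Then detailed balance holds:
\[
\pi(\theta_a)\, T(\theta_b \mid \theta_a)
  = J(\theta_b \mid \theta_a)\,\min\bigl(\pi(\theta_a),\, \pi(\theta_b)\bigr)
  = \pi(\theta_b)\, T(\theta_a \mid \theta_b),
\]
% using the symmetry J(\theta_b \mid \theta_a) = J(\theta_a \mid \theta_b).
% Summing both sides over \theta_a gives
\[
\sum_{\theta_a} \pi(\theta_a)\, T(\theta_b \mid \theta_a) = \pi(\theta_b),
\]
% i.e., the posterior \pi is left invariant by one step of the chain.
```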
Concluding Remarks
We have introduced various ways to estimate the posterior distribution using Monte Carlo integration. Markov chain Monte Carlo methods give us a decent algorithm.
Next time, we will examine error estimates, convergence issues, etc. That is to say, the practical matters of MCMC.
References
- Art B. Owen, *Monte Carlo theory, methods and examples*.
- Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin, *Bayesian Data Analysis*. Third ed., CRC Press, 2014; see especially chapter 10 et seq.
- Martin Haugh, "MCMC and Bayesian Modeling". Lecture notes.
- L. Martino, V. Elvira, F. Louzada, "Effective Sample Size for Importance Sampling based on discrepancy measures". arXiv:1602.03572.
- W. R. Gilks, Sylvia Richardson, and David Spiegelhalter (eds.), *Markov Chain Monte Carlo in Practice*. CRC Press, 1996.
- Ilker Yildirim, "Bayesian Inference: Metropolis-Hastings Sampling". Unpublished note, dated 2012.