Friday, July 3, 2020

Runs in a sample

A puzzle to celebrate the 4th of July. This is a bit more open-ended (hence it's a "puzzle", not an "exercise"). Consider an urn, with N balls and an unknown number K of red balls (the remainder are white). We will be sampling without replacement.

First, define a run as drawing one ball after another of the same color.

Puzzle. How can we use the sample of n balls written as a sequence of runs \((k_{1},c_{1})\), ..., \((k_{m},c_{m})\) for a run of length \(k_{i}\) of color \(c_{i}\)? Are there cases where using this data would give less informative inferences than the vanilla inference? (And what's the inferred variance?)

Some exercises might make this all easier.

Exercise 1. Let B be the background information that an urn contains N balls, K of which are red. Is the probability the first k balls drawn are red: \[\Pr(R_{k}\dots R_{1}|B) = \frac{(K-(k-1))(\dots)(K-1)K}{(N-(k-1))(\dots)(N-1)N}\] or should we suppose the run ends when we draw a white ball? I.e., we should examine \(W_{k+1}R_{k}\dots R_{1}\) as a run of k red balls terminated by drawing a white ball? Does the sum of probabilities sum to 1? If not, why not? (And what does it sum to?)

Out of laziness, I sometimes write \(\Pr(n|K,N)\) for a run of n red balls when the urn contains K red balls and N balls in the urn in total.

Exercise 1B. What is \(\Pr(n|K,N+1)\) in terms of   \(\Pr(n-1|K,N)\)? Similarly, what is \(\Pr(n|K+1,N)\) in terms of   \(\Pr(n-1|K,N)\)? Can we relate \(\Pr(n|K+1,N+1)\) and \(\Pr(n|K,N)\)?

Exercise 2. In an urn with N balls (K of which are red), what's the expected length of a run of red balls?

This is rather tricky. Using the solution from exercise 1, the expected value for the length of a run of red balls would be \[\mathbb{E}[L] = \sum_{k=1}^{K}k\Pr(k|K,N)\] where \(L\) is the random variable denoting the length of a run of red balls drawn from the urn (sampled without replacement).

Exercise 3. Consider the "large urn limit", where \(p=K/N\) is fixed as \(N\to\infty\) and \(K\to\infty\). (a) Prove sampling without replacement [the hypergeometric distribution] becomes sampling with replacement [the binomial distribution]. (b) How does the expected length of a run of red balls change?

Could we use exercise 3 as a sanity check on the solution to exercise 2 under appropriate limits?

Exercise 4 (open-ended). Think of at least three ways to test your solutions to exercises 1, 2 and 3(b). [Punishment: if you thought about the \(K=1\) case, think of four more ways to test your solutions.]

Arguably, using the solution of exercise 3 is one check of the solution of exercise 2. Finding two more gives us more confidence in our solution being correct; it is good practice when solving problems and we don't know the solution.

Addendum: The solution will be posted Monday July 13, 2020, at 9:00am PDT.

Further Reading

  1. D.M. Bloom, "Probabilities of Clumps in a Binary Sequence (and How to Evaluate Them Without Knowing a Lot)." Mathematics Magazine 69, no.5 (1996) pp. 366–372. (Bloom considers the combinatorics for no-run samples, which is the complement of the situation we're interested in.)
  2. I.P. Goulden, and D.M. Jackson, Combinatorial Enumeration. New York: Wiley, 1983.
  3. Check your results for a deck of cards (OEIS A086438)

No comments:

Post a Comment