Recently, an eccentric billionaire bought the Los Angeles Times and sought to make it rival the New York Times as a "newspaper of record". Presumably this means hiring more journalists, but let us ask a simpler question.
Puzzle 1: How many news stories go unreported by both the New York Times and the Los Angeles Times?
We can solve this puzzle using the maximum likelihood estimator for the Hypergeometric Distribution. Think of it like this: on a remote island with some unknown deer population, we go and (without harming the wildlife) tag K deer. A month later, we return, and capture n deer, of which k are tagged. We can estimate the total population of deer N on the island.
Explicitly connecting that analogous problem to our own: we know the "tagged stories" K reported by the New York Times and the "sample stories" n reported by the Los Angeles Times, of which k "tagged sample stories" were reported by both newspapers, and we want to estimate the total number of news stories N. The maximum likelihood estimator for N is given by \[ \widehat{N}=\max\left\{N : \frac{\Pr(N,K,n,k)}{\Pr(N-1,K,n,k)}\geq1\right\}, \] the largest N for which the ratio of successive probabilities is at least 1: the likelihood keeps growing as long as this ratio is at least 1, so the last such N maximizes it. It is not hard to solve this to find \(\widehat{N} = [Kn/k]\), where the brackets indicate the integer part of the number (e.g., [3.2]=3, [4.9]=4).
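For readers who want the intermediate step, here is a sketch of the calculation, assuming \(\Pr(N,K,n,k)\) denotes the usual hypergeometric probability of finding k tagged items in a sample of n when K of the N items are tagged: \[ \Pr(N,K,n,k)=\frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}. \] Dividing the expression for N by the one for N-1, the binomial coefficients cancel down to \[ \frac{\Pr(N,K,n,k)}{\Pr(N-1,K,n,k)}=\frac{(N-K)(N-n)}{N(N-K-n+k)}, \] and expanding both sides of the inequality gives \[ (N-K)(N-n)\geq N(N-K-n+k)\iff Kn\geq Nk\iff N\leq\frac{Kn}{k}. \] So the likelihood rises while \(N\leq Kn/k\) and falls afterwards, which is where the integer part \([Kn/k]\) comes from.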
Now we just need to list the stories which the New York Times reported but the Los Angeles Times did not (giving \(K-k\)), the stories which both papers reported (k), and the total number of stories the Los Angeles Times reported (n). From this, we will estimate how many stories have gone unreported.
To answer this fully, I looked at the front section for each paper for May 28, 2019. The short answer is K = 22 stories in the New York Times, n = 12 stories in the Los Angeles Times, and k = 5 stories in both. Since Kn = 22 × 12 = 264, we may expect there to be N = [264/5] = [52.8] = 52 stories, of which K + n - k = 22 + 12 - 5 = 29 were reported by at least one paper and 52 - 29 = 23 went unreported by both newspapers. Find below a plot of the likelihood for various N, and notice how it is maximized at N = 52 (indicated by a red vertical line):
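The arithmetic can be double-checked numerically. The following is a minimal sketch, not the code used to produce the plot: the function names `likelihood` and `mle_total` and the search bound of 500 are assumptions chosen for illustration. It computes the closed-form estimate and compares it against a brute-force search over candidate values of N.

```python
from math import comb

def likelihood(N, K, n, k):
    """Hypergeometric probability of seeing k tagged items in a sample
    of n, when K of the N items in the population are tagged."""
    if N < K + n - k:
        # Fewer items than were actually observed: impossible.
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def mle_total(K, n, k):
    """Closed-form maximum likelihood estimate: integer part of K*n/k."""
    return (K * n) // k

# Counts from the May 28, 2019 front sections.
K, n, k = 22, 12, 5

N_hat = mle_total(K, n, k)
reported = K + n - k          # stories seen in at least one paper
unreported = N_hat - reported

# Brute-force check: scan candidate totals (upper bound 500 is arbitrary)
# and keep the one with the highest likelihood.
candidates = range(reported, 500)
N_brute = max(candidates, key=lambda N: likelihood(N, K, n, k))

print(N_hat, N_brute, reported, unreported)
```

Both routes should agree with the figures quoted above. The likelihood is quite flat near its peak, so neighbouring values of N are almost as plausible as the maximizer.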
Solution: Using the maximum likelihood estimate for the hypergeometric distribution, there were an estimated N = 52 news stories in total, of which 29 were reported by at least one of the two newspapers and 23 went unreported by both.
Puzzle 2: Is there a Bayesian estimate for the number of news stories? Or different ways to estimate the total number of news stories?
Puzzle 3: How stable is this estimate for N? If we examine, say, the last week's worth of articles, do we get approximately the same value for N?
Puzzle 4: What if we extend this analysis to include, e.g., the Wall Street Journal, the Washington Post, and others? How stable is N in this case?
Find two tables below, one listing the stories in the international section for both papers, and the second for national stories. Corresponding stories are listed on the same row.
The national stories in the two newspapers appear to be completely disjoint sets.