Friday, December 6, 2019

Median Voter Theorem in other Voter Models?

The Economist's Why a left-wing nominee would hurt Democrats made explicit a proposition usually left implicit in political discourse: that the median voter theorem "holds" in US elections.

Briefly put, the median voter theorem states that, if we have an odd number1The assumption of an "odd number" of voters is not strictly necessary: it merely guarantees a unique median. It can be dropped if we have some way to break ties, or if (for an even number of voters) the 2 median voters agree on how they'll vote. of voters who are rational and the vote is about an issue describable by a one-dimensional issue space, then polling the median voter will tell us the outcome of the vote. I won't digress too long on the assumptions of this theorem or the conditions under which it holds; the take-away message is that the median voter's vote coincides with the result of the vote, so a rational candidate would run as close to the median as possible.
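
To make the claim concrete, here is a minimal simulation sketch (my own illustration, not part of the theorem's proof): voters with ideal points on a one-dimensional issue space vote for whichever of two candidates sits closer, and the median voter's choice matches the majority outcome.

set.seed(2019)
voters <- rnorm(1001)    # an odd number of ideal points in a 1-D issue space
a <- 0.25; b <- -1.0     # hypothetical platform positions for candidates A and B

votes_for_a <- sum(abs(voters - a) < abs(voters - b))
outcome <- ifelse(votes_for_a > length(voters) / 2, "A", "B")

m <- median(voters)
median_choice <- ifelse(abs(m - a) < abs(m - b), "A", "B")
outcome == median_choice  # TRUE: the median voter's vote matches the outcome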

But I draw your attention to the fact that the median voter theorem assumes the voters are rational, in the game theoretic sense of the term.

Puzzle: Does the median voter theorem hold for other voter models?

We can be generous and weaken the median voter theorem's statement to something like: "There exists a voter, the Median Voter, and the Median Voter's vote 'correlates strongly' with the outcome in a first-past-the-post vote with two candidates."

I don't have an answer to this puzzle, nor have I searched the literature hard enough to be satisfied. (There may be some obscure article on this very puzzle that I am unaware of.) The answer is famously "no" for ranked-choice voting with multiple candidates, but for incomplete information or alternative voter models...these appear not to have been adequately explored in the literature. I have just accumulated some notes that may be germane to this puzzle.

References

  • Milton Lodge, Marco R. Steenbergen, Shawn Brau, "The Responsive Voter: Campaign Information and the Dynamics of Candidate Evaluation". The American Political Science Review 89, No. 2 (1995), pp. 309–326.

Wednesday, August 7, 2019

Puzzles from Bernoulli

I recently stumbled across a fascinating puzzle from Bernoulli's Ars Conjectandi, and then found that part 3 of his book consists of 24 equally exciting worked problems. But I haven't found these problems presented anywhere online.

See either Anders Hald's A History of Probability and Statistics and Their Applications before 1750 (2005), especially chapter 15, section 5...or Edith Dudley Sylla's translation The Art of Conjecturing, Together with Letter to a Friend on Sets in Court Tennis (2006).

If I misrepresented any of the problems, leave me a comment! I was rather quick and rushed in assembling the exercises, and I easily could have made mistakes. Also, I am fully aware some of these problems are ambiguous. Try working out every variant you can think of. The whole point (at least, to me) is that these present variations on a theme.

Problem 1. There are two balls in an urn, a "winner" ball and a "loser" ball. There are three players. The first player draws a ball. If it's the "loser" ball, the first player returns it to the urn; and if it's the "winner" ball, the first player wins the game. Should the first player lose, the second player performs the same task, and wins only by drawing the "winner" ball. Should the first and second players both draw the "loser" ball, the third player draws a ball. If the third player draws the "loser" ball, the house wins the game. What is the expected winnings for each player (and the house)?
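
Since the loser ball is always returned, each draw is a fair coin flip; a quick Monte Carlo sketch (my own check, not Bernoulli's method) confirms the players' chances are 1/2, 1/4, 1/8, leaving 1/8 for the house.

set.seed(1713)  # the year Ars Conjectandi was published posthumously
trials <- 1e5
winner <- replicate(trials, {
  draws <- sample(c("winner", "loser"), 3, replace = TRUE)  # one fair draw each
  hit <- match("winner", draws)                             # first player to succeed
  if (is.na(hit)) "house" else paste0("player", hit)
})
table(winner) / trials  # approx. 0.5, 0.25, 0.125 for players 1-3; 0.125 house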

Problem 2. A variant of problem 1, each player bets some amount. If no player draws the winning ball, then they divide up the bets equally. (Example: player 1 bets 4 coins, player 2 bets 2 coins, player 3 bets 1 coin; if they all draw the "loser" ball, then each player receives (4 + 2 + 1)/3 coins back.) What is the expected winnings for each player?

Problem 3. Consider a tournament (i.e., a sequence of 2-player games). Two players compete in a game with exactly one winner. The winner plays the game against player 3. The winner of the second match plays against player 4. And so on until player 6. If the lots of each player in the games are equal, how do the lots of the later players compare to those of the earlier players?

Problem 4. As a variant of problem 3, the lot of the player who wins the first game is stipulated to be double that of the third player, who plays only in the second game, and so on. Compare the expected winnings for each player.

To be clear, Bernoulli means the relative probability masses of winning the first round are even (each player has 1 favorable outcome, for odds of 1:1, or a 50% probability of winning). The relative probability mass of the winner of the first round is then doubled against player 3: the victor of the first match has 2 favorable outcomes to player 3's 1, giving a 2/3 probability that the victor of the first game wins the second. Likewise, the winner of the second round is doubled again for the third game: if the victor of the first game also won the second, this victor has 4 favorable outcomes to player 4's 1, so the probability of winning this particular game is 4/5. And so on.
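
Under one reading of this doubling rule (any game's winner carries twice the probability mass they just played with into the next game; how a winning challenger is treated is an assumption on my part), a short simulation compares the two problems:

set.seed(42)
play_ladder <- function(double = FALSE) {
  champ <- 1; mass <- 1                 # standing winner and their favorable outcomes
  for (challenger in 2:6) {
    if (runif(1) < mass / (mass + 1)) { # each challenger enters with mass 1
      mass <- if (double) 2 * mass else 1
    } else {
      champ <- challenger
      mass <- if (double) 2 else 1      # assumed: a winning challenger's mass doubles too
    }
  }
  champ
}
table(replicate(1e5, play_ladder(double = FALSE))) / 1e5  # Problem 3: equal lots
table(replicate(1e5, play_ladder(double = TRUE))) / 1e5   # Problem 4: doubling rule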

Problem 5. "This is the third problem of Huygens' Appendix." A wagers against B that, of 40 cards, 10 of each color, he will draw 4 of them in such a way as to have one of each color. (Or, for modern readers, consider a standard playing deck with Jacks, Queens, and Kings discarded; A wagers he can draw one card of each suit when drawing four cards.) What is the probability that A wins the bet? [Alleged solution: One finds in this case that the chance of A is to that of B as 1000 is to 8139.]
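
The alleged solution is easy to verify: there are 10 choices per color, against all C(40, 4) possible hands.

p_a <- 10^4 / choose(40, 4)  # one card from each of the 4 colors
p_a                          # = 1000/9139
p_a / (1 - p_a)              # odds for A: 1000/8139, matching Bernoulli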

Problem 6. "This is the fourth problem of Huygens' Appendix." One takes 12 tokens, of which 4 are white and 8 black. A wagers against B that, among 7 tokens he will draw from them blindly, there will be found 3 white ones. One demands the ratio of the chance of A to that of B; i.e., what are the odds A wins? (There is some ambiguity in the historic text; namely, is it exactly 3 white or at least 3 white?)
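
Both readings are one hypergeometric computation away; the ambiguity only changes whether we also count the 4-white hands.

p_exactly  <- dhyper(3, m = 4, n = 8, k = 7)         # exactly 3 of the 4 white tokens
p_at_least <- sum(dhyper(3:4, m = 4, n = 8, k = 7))  # 3 or 4 white tokens
p_exactly / (1 - p_exactly)    # odds for A, "exactly" reading: 35:64
p_at_least / (1 - p_at_least)  # odds for A, "at least" reading: 42:57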

Face Cards

Problem 7. Let there be a single face card in a pack of n cards, and the first player (of m players) to draw it wins. If no one draws it, and cards remain, then the players continue drawing cards. What is the probability of each player winning? What if n = mk (the number of players divides the number of cards)?

Problem 8. As a variant of Problem 7, what if there were j face cards in the deck and the winner is still the first person to draw a face card? What is the probability of each player winning?

Problem 9. As a variant of Problem 8, what if the players keep drawing cards until the deck is exhausted, and the winner is the player with the most face cards drawn? For ties, the winnings are split among the winners. (Assume each player pays 1 coin to play, for example.) What is the expected winnings for each player?

Problem 10. As an extension to Problem 9, what if we permit any player to sell their position? Specifically, Bernoulli considers four players (A, B, C, D) with a deck of 36 cards, of which 16 are face cards. The players receive cards in rounds until 23 cards have been distributed: A has received 4 face cards, B has 3, C has 2, and D a modest 1, so that there remain 13 cards, among which are 6 face cards. The fourth player D (who is next to receive a card), "seeing that almost all hope of his winning has vanished", wishes to sell his right to one of the others. How much should he sell it for, and what are the expectations of the individual players?

Dice Puzzles

Problem 11. Throw a die. Then throw a second die. If they differ, the player wins a point; if they agree, the player loses a point. Then the player throws a third die. If it agrees with any previous die, lose one additional point for each agreement; otherwise, the player wins an additional point added to their running score. Do this for a total of 6 dice. What is the expected score for the player?

Problem 12. Similar to Problem 11, but the dice must be thrown in numerical order. E.g., the first die must be "1" for the player to get a point, otherwise the player loses a point; the second die must be "2" for the player to get a point, otherwise the player loses a point; and so on. What is the expected score for the player?

Problem 13. Three players (A, B, C) each have a list of 6 numerals ("1", "2", ..., "6") on a sheet of paper before them. They take turns round-robin. On a player's turn, they roll a die; if the result is still on their sheet of paper, the player eliminates it (scratches off the number) and rolls again, but if the player has already eliminated that number, the next player gets the die. This process continues until someone eliminates all their written numerals. It happens, however, after a while that A has 2 numerals before him, B has 4, and C has 3; it's A's turn to throw. What are the probabilities for each player to win? [Bernoulli notes This problem requires more labor and patience than ingenuity.]

Problem 14. There are k players. A given player throws a die, which shows the value m. This tells the player to throw m dice and sum the values shown. (We may choose whether m itself is added to the score.) The player with the most points wins.

Here's a twist, though: one player may opt to beat a fixed number t of points. This value t is fixed by the rules of the game.

What's the expected winnings for each player if no one opts for the fixed points route? What's the expected winnings for each player if someone chooses to beat a fixed number of points?

Although Bernoulli didn't pitch it, what if there's a "bidding war" to determine which player may opt to beat a fixed number of points? Instead of having t be fixed, when one player asks to be the one to beat a fixed number of points, that player must offer a bid ("I want to beat x points"). Each other player may pass (and no longer participates in the bidding), or offer a higher bid ("I want to beat y > x points"). This continues until the maximum value of 35 is bid. What strategy works best in this bidding war?

Problem 15. As a variant of Problem 14, what if the fixed number of points is the square of the first toss of the die?

Problem 16 (Cinq et Neuf). This is a prototype of craps. Player 1 tosses a pair of dice. Player 1 wins if the first toss is a 3, an 11, or any pair. But player 2 wins if player 1 tosses a 5 or 9.

If player 1's first toss is otherwise a 4, 6, 7, 8, or 10, then the game continues (player 1 keeps tossing a pair of dice) until either (a) a 5 or 9 appears [player 2 wins], or (b) player 1 repeats the first toss's value [player 1 wins].

What is the probability of player 1 winning? (Player 2's probability of winning is, by definition, the complement of the probability for player 1 winning.)

Although not asked, what is the expected number of tosses for a given game?
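
A Monte Carlo sketch answers both questions at once, under the assumption that pairs take precedence over 4, 6, 8, and 10 on the opening toss:

set.seed(59)
play_cinq_et_neuf <- function() {
  toss <- function() sample(1:6, 2, replace = TRUE)
  d <- toss(); s <- sum(d); n <- 1
  if (s %in% c(3, 11) || d[1] == d[2]) return(c(win = 1, tosses = n))  # player 1
  if (s %in% c(5, 9)) return(c(win = 0, tosses = n))                   # player 2
  point <- s                        # 4, 6, 7, 8, or 10: keep tossing
  repeat {
    n <- n + 1
    s <- sum(toss())
    if (s %in% c(5, 9)) return(c(win = 0, tosses = n))
    if (s == point) return(c(win = 1, tosses = n))
  }
}
res <- replicate(1e5, play_cinq_et_neuf())
mean(res["win", ])     # player 1's probability of winning
mean(res["tosses", ])  # expected number of tosses per game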

Wheel Games

Problem 17. The player pays 4 coins to toss 4 balls onto a roulette-like wheel. The wheel has 32 pockets, labelled "1", "2", ..., "8": four pockets labelled "1", four labelled "2", and so on. Each pocket may contain at most 1 ball. The player wins the sum of the labels of the pockets containing the balls. What is the expected winnings for the player? [Solution.]
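
Linearity of expectation makes short work of this one: by symmetry each ball contributes the mean label, regardless of the at-most-one-ball constraint.

labels <- rep(1:8, each = 4)  # the 32 pockets
4 * mean(labels)              # expected gross winnings: 18 coins
4 * mean(labels) - 4          # net of the 4-coin stake: 14 coins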

Card Games

Problem 18 (Trijaques). We consider a "toy model" of poker.

We assemble a deck of 28 cards from a standard playing deck by discarding the cards 2 through 7 for each suit. The value of each card is determined by its face value, but the Jack of Clubs and the 9s are wild.

The player will be given 4 cards. The goal is for the player to assemble either a flush (a run of all 4 cards regardless of suit, e.g., "9, 10, J, Q"), or a pair, three of a kind, or four of a kind. The value of a hand is the sum of the value of the cards in the combination. The player with the highest valued hand wins the pot (or it is split among the highest valued hands).

But the sequence of play is as follows: each player is dealt 2 cards face down. Then the players bet. Then each player is dealt 2 cards face up.

What is the expected winnings for each player? What strategies could be considered in the betting process?

Problem 19. Consider a generic game, where one player is the "banker". The banker has an advantage over the other players (i.e., is more probable than any other player to win a given round). But the rules of the game may allow moving the banker role to another player.

Specifically, the banker has probability p of winning a round, and probability q of losing, with p + q = 1 and \(r = p - q > 0\). The banker has probability h of continuing the next round as banker, and probability k of losing the position as banker to another player, with \(t = h - k > 0\). Let a denote the amount won by either the player or the banker, whoever wins the round.

What is the expected winnings for the banker after "many" rounds?

Problem 20 (Capriludium, Bockspiel). At the beginning of the game, each player puts down their bet. Then the banker shuffles the deck and divides it into equally sized hands. Each player (and the banker) gets a hand. Bernoulli says the player just turns the hand over without organizing it. If the punter [a player who is not the banker] has a facing card of equal or higher value than the banker's, then the punter wins an amount of money equal to his bet from the banker. Otherwise the player loses their wager to the banker. When the banker loses to all the players in a game, the next player becomes the banker.

After one round, the top cards are not yet discarded. New wagers are first made. Then the top card is discarded (collected by the banker for later shuffling).

Suppose there are N = sf cards in the deck, with s suits and f card values (ranging from 1 to f), and suppose there are n players (including the banker), for n = 2, 3, 4.

What is the expected winnings for the banker? What is the expected number of rounds to be played for a given deck? How does it vary on the number of players? What is the probability the banker will remain in their role as banker after one hand? After h hands?

Problem 21 (Basset). The basic formalization: there are 2n cards, of which k are marked "a" and \(2n - k\) are marked "b". For example, 2n = 52 and k = 4 (e.g., aces in a standard deck of playing cards). The player draws two cards in succession (no replacement). The possible outcomes:

  • ab = the banker wins 1 point
  • ba = the player wins 1 point
  • aa = the banker wins 1 point
  • bb = toss the cards aside and draw 2 new cards (and consult this table of outcomes again)

What is the expected winnings for the banker after 1 hand? For 1 game (exhausting the whole deck)?

As a variant, we could try using the full rules for Basset, with the bizarre bet multiplying schemes.
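
For a single two-card draw the expectation is a direct computation; the banker's edge comes entirely from the aa outcome. A sketch for 2n = 52, k = 4, treating bb as a redraw and ignoring how redraws deplete the deck:

n2 <- 52; k <- 4
p_ab <- k * (n2 - k) / (n2 * (n2 - 1))  # banker +1
p_ba <- (n2 - k) * k / (n2 * (n2 - 1))  # player +1
p_aa <- k * (k - 1) / (n2 * (n2 - 1))   # banker +1
p_bb <- 1 - p_ab - p_ba - p_aa          # push: draw two fresh cards
p_ab + p_aa - p_ba                      # banker's edge per two-card draw = p_aa
(p_ab + p_aa - p_ba) / (1 - p_bb)       # approx. edge per resolved hand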

Curious Puzzles

Problem 22. There are two players, Titus and Caius. Titus pays Caius one coin for each round, in which Titus throws a single die. There are a possible outcomes, of which b favor Titus (Caius pays him one coin) and c favor Caius (Titus wins nothing). If Titus throws one of the c cases n times in a row, Caius must return all n coins to Titus. What is the expected winnings of Caius and Titus?

Problem 23 (Blinde Würffel, "Blind Dice"). We have 6 dice with a number on only one face, and blanks on the remaining faces. One die has "1" for its non-blank side, another has "2" for its non-blank side, and so on, so each label "1" through "6" may possibly show up. Blank sides are treated as having value 0. Suppose the player rolls all 6 dice, and wins the sum of the values shown. What is the player's expected winnings?

Problem 24. A variant of Problem 23: if the player gets no points in 5 tosses in a row, then the player gets his money back for those 5 tosses. What is the player's expected winnings now?

Monday, July 29, 2019

Clustering States

I was curious about clustering states into regions based off of voting behaviour, similar to what Nate Silver did back in 2008. The basic idea is to use each state's presidential vote to find correlated behaviour among neighboring states, then form clusters. The result:

The colors are chosen only to distinguish neighboring clusters; they encode no other relationship. It is pure aesthetics. There are 11 distinct clusters.

As a first stab, for a given state, I created a list of the percentages of votes each party received in the presidential elections since 1976. One given list might schematically look like: (1976 D%, 1976 R%, 1976 third%, 1980 D%, ...). This is all for one single state.

Each state having such a list of percentages, I compute the correlation between each pair of states' voting behaviour. This produces a long list of connections between states, each with its correlation. I throw out all connections whose correlation falls below the 92.5th percentile, then cluster neighboring states if they have a strong enough correlation.

Texas, Vermont, and West Virginia did not correlate within the top 7.5% with any neighboring states, but they did correlate quite strongly with neighbors regardless: strongly enough for me to manually cluster them with specific neighbors. They were the only states I clustered manually.
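
A minimal sketch of the procedure, assuming a hypothetical matrix shares (rows = states, columns = the per-election vote percentages above) and a hypothetical data frame adjacent of neighboring state pairs; the actual code is in the github repository linked below.

library(igraph)

cors <- cor(t(shares))                            # state-to-state correlations
cutoff <- quantile(cors[lower.tri(cors)], 0.925)  # keep only the top 7.5% of links

strong <- subset(adjacent, cors[cbind(state1, state2)] >= cutoff)
g <- graph_from_data_frame(strong, directed = FALSE,
                           vertices = rownames(shares))
components(g)$membership                          # cluster label for each state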

It would be curious to include voting for House members and Senators. At present, I have not investigated this avenue, though it may be interesting to pursue.

As always, the code related to this is available on github.

Saturday, July 6, 2019

Post-Mortem of 2016 (Fragment)

How did Trump win 2016? It appeared to turn the world upside down, seemingly defying polls and reason. What happened? Was it really surprising or were we misled?

The game plan will be first to examine a few popular myths, eliminate these explanations, then examine the results of 2016. By articulating why 2016 was surprising, we will find the factors responsible for the outcome.

Executive Summary: It's all Gary Johnson's fault.

Remark (On Obama-Trump Voters). One plausible explanation for Trump's victory is the shift of Obama voters of 2012 who came to vote for Trump in 2016, the so-called "Obama-Trump voters". This is worth mentioning only in passing, because the media has an irrational fascination with such voters.

Myth #1: Shy Trump Supporter

A popular conjecture is that a subpopulation of Trump supporters believed supporting Trump was "socially undesirable", and hence would not publicly acknowledge backing Trump. Alexander Coppock conclusively tested this theory, and found no evidence that shy Trump supporters exist: predictive models with shy Trump supporters make predictions statistically indistinguishable from those without.

Andrew Gelman independently reached the same conclusion along different lines. Gelman also reasons that Republican candidates outperformed expectations in the Senate races, which casts doubt on the model in which respondents would not admit they supported Trump; rather, the Senate results are consistent with differential nonresponse or unexpected turnout or opposition to Hillary Clinton. It is possible that the anti-media, anti-elite, and even anti-pollster sentiment stoked by the Trump campaign has been a part of the reason for the low response of Trump supporters in states with large rural populations. Emphasis added.

It is worth remembering that Politico/Morning Consult released a poll, gathered both online and via live phone calls, indicating that despite different methodologies the different results show only a slight, not-statistically-significant difference in their effect on voters' preferences for president. In other words, it didn't matter whether a respondent talked to a pollster on the phone (where shyness would prevent the respondent from announcing support for Trump) or communicated online: the results were statistically indistinguishable.

Testing this hypothesis three different ways, with all three reaching the same conclusion, seriously undermines the hypothesis that "shy Trump supporters" exist. We can discard this explanation as lacking empirical support.

Myth #2: Comey Ruined the Election

Hillary Clinton has personally blamed the election outcome on Comey's public announcement that he was re-opening the FBI investigation into Clinton's email servers, eleven days prior to the vote. This explanation has become popular, presumably for the proximity of Comey's announcement to the perceived surprising loss. Neither evidence nor reason supports this misconception.

This claim is a contentious matter. But the Comey letter is the sort of mirage which our cognitive biases are susceptible to mistaking for real. We must heed Thucydides's words (1.21): the search for truth strains the patience of most people, who would rather believe the first things that come to hand.

Using rather rudimentary post-stratified modeling techniques, Chad P. Kiewiet de Jonge and Gary Langer have shown that, in fact, Clinton began losing the projected electoral college the Tuesday prior to Comey's announcement. Why didn't anyone else have this insight? The other prognosticating psephologists used some combination of poll-weighting and a likely voter model, which would have missed it.1Nate Silver noted, regarding Comey's letter, As of Oct. 28, the polls-plus version of FiveThirtyEight's forecast, which accounts for these factors, expected Clinton to lose a point or so off her lead before Election Day. Silver's model did not detect the deviations which MRP modeling found.

The New York Times released a poll, days prior to the letter, showing Trump ahead of Clinton by 4 points. Bloomberg/Selzer released a poll showing Trump ahead of Clinton by 2 points. Could we trust these polls? Well, FiveThirtyEight has awarded Selzer & Co. an "A+" pollster grade (and similarly high marks for both Siena College and the New York Times).2As of July 3, 2019 and November 11, 2016, Selzer & Co. received an "A+". The New York Times, in collaboration with CBS, received the more modest "A-" grade, but Siena College won a solid "A". Although this is weak evidence supporting the claim that Clinton began losing before Comey's announcement, the point we stress is that this is a second approach to thinking about the matter.

Could it be that Comey's letter accelerated Clinton's decline? From post-stratified modeling based on polling released afterwards, there is insufficient evidence that Comey's letter impacted Clinton's standing. The MRP model using surveys fielded November 4–6 shows Clinton recovering, albeit insufficiently to win the election or regain significant ground. Such results undermine the claim that Comey impacted Clinton at all.

The raw data, and refined statistics, both reach the same conclusion: Clinton began losing the election before Comey even spoke.

What Happened?

To better answer this question, we should first note what people expected. If, for each state, we take a rolling mean of the proportion of the vote for each party (bundling all "third parties" together into a single "third party"), then normalize the result per state (so we have proportions again), the result is the proportion of the vote one might have expected to find on election day of 2016. The results may be described by the following map:

The electoral college count would have been 332 for the Democratic candidate, 206 for the Republican candidate. We can actually quantify how surprising the actual results were using the Kullback–Leibler divergence, but to make sense of this we should compare it to previous elections:

Election Year Surprise (in bits)
1988 0.8504352
1992 33.3329646
1996 2.8056336
2000 2.8825127
2004 1.3989161
2008 0.7251642
2012 0.3414356
2016 3.5202876

For 1992, remember Ross Perot captured about 20% of the popular vote, the most a third party received since Teddy Roosevelt ran for a third term on the Progressive Party ticket in 1912,3Perot's 1992 performance stands third in the rankings of "percent of popular vote a third party candidate received in the presidential election". Teddy in 1912 stands first with 27.39% of the vote, Millard Fillmore's 1856 bid on the American ("Know Nothing") Party ticket ranks second at 21.54%, and Perot's 1992 bid comes in third at 18.91% of the popular vote. In contrast, Gary Johnson's 2016 bid received 3.28% of the popular vote. which is why it is the most surprising row in our table.
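
The surprise column is just a per-state KL divergence, summed over states. A sketch of the computation, assuming hypothetical state-by-party matrices actual and expected of vote proportions:

# D(actual || expected) in bits, per state, summed over all states
kl_bits <- function(p, q) sum(ifelse(p > 0, p * log2(p / q), 0))
surprise <- sum(sapply(seq_len(nrow(actual)),
                       function(s) kl_bits(actual[s, ], expected[s, ])))
surprise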

So what happened? Why was this so surprising? There are a variety of ways to approach answering this. We may take the expected vote proportions and compare them to the actual vote proportions, and extrapolate out the difference in votes (assuming voter turnout remained the same in this hypothetical 2016 election as in the actual 2016 election). We find the difference in votes:

Party Difference in votes As Proportion of Total Vote
Third Parties 4,397,608 0.0321493
Democratic -2,605,576 -0.0190484
Republican -1,792,032 -0.0131009

Computed by lumping all third parties into a single "third party". Then the rolling mean for each state and for each party was taken from 1976 until 2012. The result was renormalized in each state, then multiplied by the total votes cast in each state. The second column shows the difference between the actual votes and the expected votes, summed over each state.

We find Trump underperformed by 1.31% of the vote, and Clinton underperformed by 1.9%. Third parties overperformed by roughly 3.21% of the popular vote. Note: Johnson received 3.28% of the popular vote in 2016 — does this account for the overperformance of third party candidates in 2016?

Two questions immediately emerge: (a) which Third party candidate overperformed? (b) If we supposed the third parties received lower votes (i.e., they received the expected votes), how would the difference be re-allocated between Trump and Clinton?

Measuring Third Party Overperformance

We can actually measure the Kullback–Leibler divergence for how Johnson performed in 2016 compared to 2012 (lumping all non-Johnson votes in one category). This measures the surprise in Johnson's 2016 performance relative to 2012 expectations, an adequate way to gauge improvement. We may do similarly for Stein, since both ran in 2012 and 2016. The result may be summed up thus:

The ratio of "Johnson's improvement" to "Stein's improvement" averaged 22.88781 — that's over an order of magnitude improvement!

In the states which went for Obama (in 2008 or 2012) but for Trump in 2016 (specifically: Florida, Michigan, Pennsylvania, Wisconsin, Iowa, Indiana, North Carolina, Ohio, Nebraska), Johnson doubled his votes in 2016 compared to 2012...or better (in Florida, Johnson saw his votes grow from 44,726 to 207,043). Johnson's average improvement among these swing states was 336% more votes in 2016 compared to 2012. We should observe, though, that Johnson didn't run in either Michigan or Wisconsin in 2012.

These numbers tell us, quite simply, that Johnson improved considerably between 2012 and 2016. This alone is quite surprising: third party candidates seldom improve so drastically. Jill Stein, on the other hand, saw very little improvement in votes. We may safely conclude that Johnson is the dominant (sole?) dynamo for the third party's surprise improvement, which answers the first question the previous section posed: which third party candidate overperformed? It was Johnson.

A Wonderful Life

The Economist's Lexington, stating the obvious, informs us, Most of those who voted for Mr Johnson in 2016 were protesting against the alternatives. But if there were a timeline where there was no Libertarian ticket in 2016, or any other third party for that matter, how would the election have changed?

Looking at the numbers, third party voters changed the outcome in Florida, Michigan, Pennsylvania, and Wisconsin. (Third party votes also exceeded the margin in North Carolina, but there they would have had to break 95.7% for Clinton, which is implausible.) If 68.997%+1 of third party voters broke for Clinton in this hypothetical, then Clinton would have won an additional 75 delegates in the electoral college. This would have changed the outcome of the election. Is this feasible?

FiveThirtyEight's Harry Enten asked this very question in his article, Election Update: Is Gary Johnson Taking More Support From Clinton Or Trump? If we take his observations as a jumping-off point, then third party voters are divided up thus: 1.19864% (of the total vote) is taken from the third party candidates, and the remaining third party voters are divvied up evenly between Trump and Clinton. This effectively erases the margin of victory for Trump. (With the exception of Florida, only a small fraction of third party voters need to be shaved off to change the outcome.)

State Delegates Trump margin Shift to Clinton
Florida 29 0.0119863 0.0207737
Michigan 16 0.0022303 0.0311395
Pennsylvania 20 0.0072427 0.0228425
Wisconsin 10 0.0076434 0.0366399

This alternate timeline, where third parties vanished and their constituents had to pick between Trump and Clinton, would have produced a drastically different result.

Even if we fear taking Harry Enten's findings too generously, and suppose Clinton's edge was not 1% but a more conservative 0.7643%+1 (just enough to win Wisconsin), in that hypothetical Clinton would still lose Florida but win Michigan, Pennsylvania, and Wisconsin. This would have given Clinton 46 electoral delegates, enough to make her delegate count 278 to Trump's 260. Again we find Johnson acted as spoiler, prevented Clinton's victory, and delivered to us a Trump presidency.

Remark. Let us suppose Enten's findings could be used to construct a random variable describing how third party voters will likely vote. Given that polls have a margin of error of 0.04 at a 95% confidence level, we can construct a normally distributed random variable X centered at 0.01 [which Enten determined is the edge Clinton has] with a 0.02 sigma [from the noise for polling] for the third party supporter who just "randomly" picks who to vote for as follows: generate a random real number following this distribution and, if it is positive, vote for Clinton, otherwise vote for Trump. In this scheme, Trump receives 30.85375% of the third party voters, Clinton receives 69.14625%, enough for Clinton to win Florida, Michigan, Wisconsin, and Pennsylvania.
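
The remark's arithmetic is a one-liner to verify in R:

pnorm(0, mean = 0.01, sd = 0.02)                      # Trump's share: 0.3085375
pnorm(0, mean = 0.01, sd = 0.02, lower.tail = FALSE)  # Clinton's share: 0.6914625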

Exercise 1. The New York Times's Libertarian Gary Johnson Polls at 10 Percent. Who Are His Supporters? surveys the demographics of Johnson supporters. Consider using an MRP model (like Gelman et al.'s in arXiv:1802.00842) to estimate the preference of third party voters.

Exercise 2. The argument produced above, and the results of exercise 1, give two different ways to show Johnson was a spoiler candidate and Clinton would have won the election had Johnson not run. But it is not wise to go to sea with two chronometers (take either one or three). Think of another test for showing Johnson was a spoiler candidate.

What Remains to be Investigated?

Aside from the exercises for myself, points worth pursuing include: what could Clinton have done differently? It's one thing for us to sit back and say, "Well, well, third parties ruined everything." But it's more useful to consider how third parties attracted voters, and what Clinton could have done to counter this effect.

Comparing the demographics of Obama-Trump supporters to Johnson supporters may also be insightful. If it turns out these two groups share a suitably similar political culture, then we may have found one stratum of swing voters. It remains to be seen if they are so fed up with Trump that they abstain from even voting in 2020.

But also worth considering is the newly energized Democratic base which didn't materialize for Clinton in 2016 but sure as Hell materialized to protest Trump and vote out Republicans in 2018. If the newly energized base is larger than the Obama-Trump and Johnson voters, especially in the swing states, then it may be worth considering alternative 2020 strategies.

All the scratchwork for this post may be found on Github.

Thursday, July 4, 2019

Explanations in Psephology

What qualifies as an "explanation"? Specifically, when will a statistical analysis explain why candidate A lost the election to candidate B?

Good explanations have a variety of characteristics (they are independent of method, contrastive, social, etc.). The late Cambridge philosophy professor Peter Lipton defines an explanation thus (as quoted in arXiv:1811.03163):

To explain why P rather than Q, we must cite a causal difference between P and not-Q, consisting of a cause of P and the absence of a corresponding event in the history of not-Q.

The question I'm pursuing (implicitly, through a number of posts) is "Why did Trump win 2016?" The first puzzle: will this generate the same explanations as "Why did Clinton lose 2016"?

There are many variants on these questions which may be worth considering:

  • What could Clinton have done differently to win 2016?
  • What did Trump do (as opposed to [generic Republican candidate]) which contributed to winning 2016?
  • Could a generic Republican candidate have won 2016?
  • Could a generic Democrat have defeated Trump in 2016?
And could we rank how decisively each factor in the answers contributed to the outcome?

All of these questions invite different answers, particularly since we're focusing on different actors. For our purposes, understanding how the election of 2016 unfolded as it did, all of these questions may be worth investigating.

But What's a "Good Explanation"?

The second puzzle is what qualifies as a "good explanation". Let's try to examine a few (hypothetical) propositions, and see if they qualify as "explanations".

Proposition 1. If 50%+1 of voters who voted for Obama in 2012 and Trump in 2016 had changed their vote to Clinton in 2016, then Clinton would have won the election.

This gives us a path to victory, but it does not illuminate why Obama-Trump supporters jumped ship from Obama to Trump. As Lipton phrased it, this gives us knowledge but not understanding. We do not understand voter "issue preferences", to borrow a game theoretic term.

Proposition 2. Clinton didn't alter her campaign sufficiently in 2016 compared to her past campaigns.

This explanation gives understanding of why she lost, but it is incomplete or not fully fleshed out...depending on what question we're really trying to answer. Proposition 2 only partially explains why she lost 2016; we would implicitly need to establish that "typical campaigning" didn't work against Trump (which hardly seems like something worth explaining to anyone who lived through it).

How could we empirically test this proposition? This is an orthogonal concern: providing evidence for an explanation. It is worth pondering, though, what data would suffice to merit this explanation.

Concerns for proof notwithstanding, as an explanation, proposition 2 has the quality of understanding and some flavor of causal reasoning.

Proposition 3. Johnson acted as a spoiler candidate, particularly among swing voters.

Explanations, like proposition 3, tend to sound more like excuses. How can we rigorously test such a proposition? How can we avoid fooling ourselves?

There is a clear counter-factual claim we could make, premised on proposition 3: had Johnson not run, Clinton would have been President. So proposition 3 qualifies as an "explanation" per se, but there are lurking factors hidden beneath it: why did Johnson get so many votes? How could Clinton have campaigned differently?

But that's a story for another day...

Tuesday, July 2, 2019

Swing Voters: A Glance at the Literature

Puzzle 0. Journalists have introduced the term "swing voters". (a) Can we make this notion rigorous? Who is a swing voter? Assuming this notion is well-defined, we have follow up queries: (b) Can we "model" swing voters (in some sense)? Is there some "psychological profile" for "swing voters"? (c) What correlates with the number of swing voters in a state?

William Mayer, in his book The Swing Voter in American Politics (2008, pg 2), describes a "swing voter" as a voter who is persuadable:

In simple terms, a swing voter is, as the name implies, a voter who could go either way: a voter who is not so solidly committed to one candidate or the other as to make all the efforts at persuasion futile.1 If some voters are firm, clear, dependable supporters of one candidate or the other, swing voters are the opposite: those whose final allegiance is in some doubt all the way up until Election Day. Put another way, swing voters are ambivalent or, to use a term with a somewhat better political science lineage, cross-pressured.2 Rather than seeing one party as the embodiment of all virtue and the other as the quintessence of vice, swing voters are pulled—or repulsed—in both directions.

1 As indicated in the text, among media articles that do provide an explicit definition of the swing voter, this is the most common approach. See, for example, Joseph Perkins, "Which Candidate Can Get Things Done?" San Diego Union-Tribune, October 20, 2000, p. B-11; Saeed Ahmed, "Quick Hits from the Trail," Atlanta Constitution, October 26, 2000, p. 14A; and "Power of the Undecideds," New York Times, November 5, 2000, sec. IV, p. 14.

2 Though it never employed the term "swing voter," one antecedent to the analysis in this chapter is the discussion in most of the great early voting studies of social and attitudinal cross-pressures within the electorate. See, in particular, Lazarsfeld, Berelson, and Gaudet [The People's Choice: How the Voter Makes Up His Mind in a Presidential Campaign] (1948, pp. 56–64); Berelson, Lazarsfeld, and McPhee [Voting: A Study of Opinion Formation in a Presidential Campaign] (1954, pp. 128–32); Campbell, Gurin, and Miller [The Voter Decides] (1954, pp. 157–64); and Campbell and others [The American Voter] (1960, pp. 78–88). There was, however, never any agreement as to how to operationalize this concept (Lazarsfeld and his collaborators tended to look at demographic characteristics; the Michigan school used attitudinal data); and almost the only empirical finding of this work was that cross-pressured voters tended to be late deciders. For reasons that are not immediately clear, more recent voting studies have almost entirely ignored the concept. The term appears nowhere in Nie, Verba, and Petrocik [The Changing American Voter] (1976); Fiorina [Retrospective Voting in American National Elections] (1981); or Miller and Shanks [The New American Voter] (1996).

The American National Election Studies have surveyed voters in every presidential election since 1972. We can use their so-called "feeling thermometer" questions, which give a value between 0 and 100 for each candidate. Mayer constructs a new statistic by taking the Republican's "feeling thermometer" value and subtracting the Democrat's. The voters around 0 degrees, Mayer suggests, are the swing voters.

Since Mayer's book was published, we have acquired more data about swing voters using ad-tracking technology. Quartz's Ashley Rodriguez reviewed findings for 2016 swing voters. While the minute details of these studies are fascinating, if true, they don't tell us any "macro-statistics" correlated with "swing-iness".

After the 2018 midterm elections, Vox's Matthew Yglesias argued swing voters still exist, but his argument rests on uncompelling circumstantial evidence. There are voters who cast their 2012 vote for Obama yet their 2016 vote for Trump (and similarly those who voted for Romney in 2012 and Clinton in 2016), but this data alone is insufficient to prove all such voters are "swing voters". We need more to establish such voters are "swingers".

Palfrey and Poole (1987) have shown low information voters tend to constitute the majority of swing voters, as Mayer defines the term.

Puzzle 1. Can we reproduce Palfrey and Poole's results? Has there been more modern work confirming this?

Here's the unintuitive thing: if we control for partisans masquerading as "independents", as Gelman et al. (2014) have done, then swing voters in 2012 are sample artifacts whose effects are quite small. Dr Gelman wrote a piece in the Washington Post explaining his findings, in simpler terms.

Puzzle 2. How does Gelman, et al., hold in light of Palfrey and Poole? Are uninformed voters no longer "swingable"? Or have uninformed voters vanished (or, at least, no longer vote)?

Happily, Mayer has a resolution for this puzzle. It's well-known (since at least Keith et al.'s work in 1992, building upon many others' work from the '80s) that self-proclaimed "independents" are "hidden partisans". Indeed, using "political independent" as a synonym for "swing voter" is a Bad Idea.

The "undecided voter" is a much closer concept to a "swing voter". Although conceptually similar, it is harder to gauge if a voter is really "undecided" or not. It turns out people eagerly claim to be "undecided", more than matches reality.

Voter Model

There are a variety of voter models we could consider. I'm going to summarize the models as presented in R. Douglas Arnold's The Logic of Congressional Action; they're really quite simple decision rules.

Party Performance Rule. A voter asks themselves, "Are things better off than they were at the last election? Which party is 'in charge' [of the White House]?" If things are deteriorating, the voter will cast their vote against the President's party. If conditions are improving, the voter will cast their vote supporting the President's party.

Incumbent Performance Rule. Voters, Arnold writes, first evaluate current conditions in society, decide how acceptable those conditions are, and then either reward or punish incumbent legislators for actions that they think contributed to the current state of affairs. (pg. 44) Although very similar to the party performance rule, the difference lies in who is held responsible: the members of the President's party, or the legislators themselves.

Party Position Rule. A citizen first identifies the party offering the most pleasant package of policy positions, then votes for candidates belonging to that party.

Candidate Position Rule. A citizen first identifies the candidate offering the most pleasant package of policy positions, then votes for that candidate.

So...which rule is it? Arnold suggests, arguably, a fifth decision rule which resembles aspects of all these rules. Basically, a voter keeps four "accounts" (integers) in his brain, one for each party, another for the incumbent, and the fourth for the challenger. These values may be positive or negative.

The two accounts for the parties are given some initial values during childhood. As the voter acquires information about the parties' achievements in office, the voter updates their "accounts" for the parties using something like the party position and party performance rules.

When the voter learns about the incumbent, our voter opens up a third account for that incumbent. As our voter learns about the incumbent's positions and accomplishments, the voter updates the incumbent's account using some amalgam of the candidate position and candidate performance rules. Ancillary information (like extramarital affairs, hiking the Appalachian trail, etc.) nudges the account, one way or another.

Finally, a challenger appears, and the voter opens up a fourth account for that challenger. The value assigned to this fourth account is a variant of the candidate position rule, combined with extra-information adjustments. (Is the challenger's party responsible for a disastrous war? Did the economy collapse? Etc.)

On election day, the intrepid voter goes to the polls, combines these four values, then decides how to vote. The simplest model adds the four values together, then if the sum is positive votes for the incumbent (otherwise, the voter sides with the challenger).

How well does this "Impression-driven" model work? There's evidence this model is something along the lines of how people actually decide to vote, but that's a contentious point among academics. I will side-step arguments, and just note that cognitive heuristics probably account for most (all?) of the decision-making process, and some variant of this "impression-driven" model probably works "good enough".

Remark. I suspect something like a moving average formula is used to update these accounts. Doubtless there are countless variants of this model, depending on what formulas we want to use.
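
To make the Remark concrete, here is one possible updating rule (a plain exponential moving average; the 0.2 weight is an arbitrary assumption of mine):

# new impression `signal` nudges the account; lambda controls how fast old
# impressions fade
update_account <- function(account, signal, lambda = 0.2) {
  (1 - lambda) * account + lambda * signal
}
update_account(account = 3, signal = -10)  # a scandal dents a positive account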

How do Swing Voters fit in this Model?

It seems there are multiple narratives one could generate to produce a swing voter. But the only ones which I can think of produce voters whose "accounts" are all "near zero" around election time.

There are some genuine "party switchers", like Reagan-Democrats or Obama-Trump voters. These voters seem to be dissatisfied with the Democratic party and/or their candidate, and update their "accounts" accordingly. This "swing" is a re-evaluation of party performance, or candidate performance, rather than "Starting uninformed and scrambling to form an opinion."

Swing voters seem to have their accounts return to "near zero" after the election, paying little attention to politics. Empirically, it is hard to find covariates correlating with this quality. Mayer's book discusses this in greater detail.

References

Swing Voters

  • Gary Cox, "Swing voters, core voters, and distributive politics". In Political Representation (edited by Ian Shapiro, Susan C. Stokes, Elisabeth Jean Wood, Alexander S. Kirshner), Cambridge University Press, 2010, pp.342–357. Eprint.
  • Timothy J. Feddersen, Wolfgang Pesendorfer, "The Swing Voter's Curse". The American Economic Review 86, no. 3 (1996) pp. 408–424. Provides a decision-theoretic model for voters abstaining from voting.
  • Andrew Gelman, Sharad Goel, Douglas Rivers, and David Rothschild, "The Mythical Swing Voter" (2014).
  • S. Kelley, Interpreting Elections. Princeton University Press, 1983.
  • William G. Mayer (ed.), The Swing Voter in American Politics. Brookings Institution Press, 2008. See esp. ch. 1.
  • Thomas R. Palfrey, Keith T. Poole, "The Relationship between Information, Ideology, and Voting Behavior". American Journal of Political Science 31, no. 3 (1987) pp. 511–530.
  • Ashley Rodriguez, Undecided voters are as scared as the rest of us, and other insights from a trove of data on swing voters, Quartz, October 8, 2016.
  • Nate Silver, The Invisible Undecided Voter, FiveThirtyEight, Jan. 23, 2017.
  • Matthew Yglesias, Swing voters are extremely real, Vox, July 23, 2018.

Voter Model

  • R. Douglas Arnold, The Logic of Congressional Action. Yale University Press, 1990. See chapter 3.
  • Richard Lau and David Redlawsk, "Advantages and disadvantages of cognitive heuristics in political decision making". American Journal of Political Science 45, No. 4 (2001): 951–971.
  • Milton Lodge, Kathleen M. McGraw, Patrick Stroh, "An Impression-Driven Model of Candidate Evaluation". The American Political Science Review 83, No. 2 (1989), pp. 399–419.
  • Milton Lodge, Marco R. Steenbergen, Shawn Brau, "The Responsive Voter: Campaign Information and the Dynamics of Candidate Evaluation". The American Political Science Review 89, No. 2. (1995), pp. 309–326.

Addendum (). FiveThirtyEight had published an article rehashing much the same, but it skimps on the references.

Monday, June 24, 2019

How is my candidate doing in the polls?

Given the profusion of polls, it is difficult to accurately gauge how well a given candidate is doing. A simple average of poll numbers won't adequately capture momentum (if such a concept exists), and a few averages (say, one of polls done in the past week, another of polls done in the past month) are difficult to parse. We want one, single, simple number.

We fix the candidate we're interested in, and we have polls \(P_{n}\) and \(P_{n+1}\) released at times \(t_{n}\lt t_{n+1}\). Ideally, we should be able to truncate the N polls to the last k without "much loss".

We could take a moving average, something like \[M_{n+1} = \alpha(t_{n}, t_{n+1}) P_{n+1} + (1 - \alpha(t_{n}, t_{n+1}))M_{n}\tag{1}\] where \(P_{n}\) refers to the nth poll in chronological order, released on the date \(t_{n}\), with the initial condition \(M_{1} = P_{1}\) and the function \[\alpha(t_{n}, t_{n+1}) = 1 - \exp\left(-\frac{|t_{n+1} - t_{n}|}{30 W}\right)\tag{2}\] where W is the average of the intervals between polls, and the difference in dates is measured in days. The 30 in the denominator of the exponent reflects the 30 days in a month.
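
A direct implementation of Eqs. (1)-(2), assuming the polls come as a numeric vector ordered oldest to newest with matching Date objects:

poll_average <- function(polls, dates) {
  W <- mean(as.numeric(diff(dates)))  # average interval between polls, in days
  M <- polls[1]                       # initial condition: M_1 = P_1
  for (n in seq_along(polls)[-1]) {
    a <- 1 - exp(-abs(as.numeric(dates[n] - dates[n - 1])) / (30 * W))
    M <- a * polls[n] + (1 - a) * M
  }
  M
}

# Example: four weekly polls for one candidate
poll_average(c(42, 44, 43, 46), as.Date("2019-06-01") + c(0, 7, 14, 21))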

Exercise 1. Show (1) α takes values between 0 and 1, (2) the larger α is, the quicker the average "forgets" older data, and (3) the longer the gap between successive polls, the faster older data is forgotten. [What happens for regularly released polling data? Say a new poll is released weekly: what does α look like?]

Weighing Pollsters

If we knew about poll quality, we could add this in as another factor. Suppose we had a function1The codomain is a little ambiguous; we have it here as \(0\lt Q(P)\lt 1\), but either inequality could be weakened to "less than or equal to" conditions. So it could be extended to \(0\leq Q(P)\lt 1\) or \(0\lt Q(P)\leq 1\) or even \(0\leq Q(P)\leq 1\). \[Q\colon \mathrm{Polls}\to (0,1)\tag{3}\] which gives each poll its quality (higher quality polls are nearer to 1). Then we could modify our function in Eq (2) to be something like \[ \begin{split} \tilde{\alpha}(t_{n}, t_{n+1}, P_{n+1}) &= Q(P_{n+1})\cdot\alpha(t_{n},t_{n+1}) \\ &= Q(P_{n+1})\cdot\left(1 - \exp\left(-\frac{|t_{n+1} - t_{n}|}{30 W}\right)\right)\end{split}\tag{4}\] which penalizes "worse polls" from influencing the moving average. (Worse polls have smaller Q values, hence smaller \(\tilde{\alpha}\), so the new poll contributes less.)

One lazy way to go about this is to use pollster ratings from FiveThirtyEight, discard "F" rated polls, then take the moving average with \(Q(-)\) the familiar grading scheme used in the United States. (Or, more precisely, the midpoint of the interval for the grade.)

Letter grade Percentage Q-value
A+ 97–100% 0.985
A 93–96% 0.945
A- 90–92% 0.91
B+ 87–89% 0.88
B 83–86% 0.845
B- 80–82% 0.81
C+ 77–79% 0.78
C 73–76% 0.745
C- 70–72% 0.71
D+ 67–69% 0.68
D 63–66% 0.645
D- 60–62% 0.61

The other "natural" choices include (a) equidistant spacing in the interval (0, 1] so D- is given the value \(1/13\) all the way to A+ given \(13/13\), or (b) the roots of an orthogonal family of polynomials defined on the interval [0, 1].

Exercise 2. How do the different possible choices of Q-values affect the running average? [Hint: using the table above, is \(\widetilde{\alpha}\leq 0.61\) an upper bound? Consider different scenarios: good poll numbers from bad polls, bad numbers from good polls.]

Exercise 3. If we assign \(Q(\mathrm{F}) = 0\) as opposed to discarding F-scored polls, how does that affect the weighted running average?

Some computed examples are available on github, but they're what you'd expect.

Thursday, June 20, 2019

Software analogies in statistics

When it comes to programming, I'm fond of "agile" practices (unit testing, contracts, etc.) as well as drawing upon standard practices (design patterns, etc.). One time when I was doing some R coding, which really feels like scripting statistics, I wondered if the "same" concepts had analogous counterparts in statistics.

(To be clear, I am not interested in contriving some functorial pullback or pushforward of concepts, e.g., "This is unit testing in R. The allowable statistical methods within R unit tests are as follows: [...]. Therefore these must be the analogues to unit testing in statistics." This is not what I am looking for.)

The problem with analogies is there may be different aspects which we can analogize. So there is no "one" analogous concept, there may be several (if any) corresponding to each of these software development concepts.

The concepts I'd like to explore in this post are Design Patterns, Unit Testing, and Design by Contract. There are other concepts which I don't believe have good counterparts (structured programming amounts to writing linearly so you read the work from top to bottom; McCabe's analogy of "modules :: classes :: methods" to "house :: room :: door" does not appear to have counterparts in statistics; etc.); perhaps the gentle reader will take up the challenge of considering analogies where I do not: I just do not pretend to be complete with these investigations.

Design Patterns

Design patterns in software were actually inspired by Christopher Alexander's pattern language in architecture. For software, design patterns standardize the terminology for recurring patterns (like iterators, singletons, abstract factories, etc.).

One line of thinking may be to emphasize the "pattern language" for statistics. I think this would be a repackaged version of statistics. This may or may not be fruitful for one's own personal insight into statistics, but it's not "breaking new ground": it's "repackaging". Unwin's "Patterns of Data Analysis?" seems to be the only work done along these lines.

For what it's worth, I believe it is useful to write notes for one's self, especially in statistics. I found Unwin's article a good example of what such entries should look like, using "patterns" to describe the situation you are facing (so you can ascertain if the pattern is applicable or not), what to do, how to do some kind of sanity test or cross check, etc. As an applied math, statistics is example-driven, and maintaining one's own "pattern book" with examples added to each pattern is quite helpful.

Another line may pursue the fact that software design patterns are "best practices", hence standardizing "best statistical practices" may be the analogous concept. Best coding practices are informal rules designed to improve the quality of software. I suppose the analogous thing would be folklore like using the geometric mean to combine disparate probability estimates. Or when to avoid misusing statistical tests to get bogus results.

Unit Testing

Unit testing has a quirky place in software development. Some adhere strictly to test driven development, where a function signature is written (e.g., "int FibonacciNumber(int k)") and then before writing the body of the function, unit tests are written (we make sure, e.g., "FibonacciNumber(0) == 1", negative numbers throw errors, etc.). Only after the unit tests are written do we begin to implement the function.

Unit tests do not "prove" the correctness of the code, but it increases our confidence in it. Sanity checks are formalized: squareroots of negative numbers raise errors, easy cases (and edge cases) are checked, and so forth. Code is designed to allow dependency injection (for mock objects, to facilitate testing). These tests are run periodically (e.g., nightly, or after every push to the version control system) and failures are flagged for the team to fix. I can't imagine anything remotely analogous to this.

The analogous counterpart to "increasing our confidence in our work" would be some form of model verification, like cross-validation. However, model verification usually comes after creating a model, whereas unit testing is a critical component of software development (i.e., during creation).

Design by Contract

Contracts implement Hoare triples, specifying preconditions, postconditions, and invariants. These guarantee the correctness of the software, but that correctness stems from Hoare logic.

Statistical tests frequently make assumptions, which are not usually checked. These seem quite clearly analogous to preconditions or postconditions. For example, with a linear regression, we should check the error is not correlated with any input (this would be a postcondition) and there is no multicollinearity, i.e., no two inputs are correlated (this would be a precondition). For example, we could imagine something like the following R snippet (possibly logging warnings instead of throwing errors):

library(assertthat)  # for assert_that()

foo <- function(mpg, wt, y, alpha = 0.05) {
  # precondition: mpg is plausibly normally distributed
  assert_that(shapiro.test(mpg)$p.value >= alpha)
  # precondition: wt is plausibly normally distributed
  assert_that(shapiro.test(wt)$p.value >= alpha)
  # precondition: mpg and wt are uncorrelated (no multicollinearity)
  assert_that(cor.test(mpg, wt, method = "kendall")$p.value >= alpha)

  # rest of code not shown
}

The other aspect of contracts may be the underlying formalism, whose analogous concept would be some formal system of statistics. By "formal system", I mean a logical calculus, a formal language with rules of inference; I do not mean "probabilistic inference". We need to formalize a manner of saying, "Given these statistical assumptions, or these mathematical relations, we may perform the following procedure or calculation." I have seen little literature on this (arXiv:1706.08605 being a notable exception). The R snippet above attempted to encode this more explicitly; in statistics textbooks, such Hoare-logic analogues remain implicit.

We might be able to capture the "sanity check" aspect of postconditions in special situations. For example, when testing whether two samples have the same mean, we could verify we rejected the null hypothesis sensibly by looking at the confidence intervals for the two samples and seeing that they are "mostly disjoint". This example is imprecise and heuristic, but illustrates the underlying idea; see the sketch below.
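To illustrate, a heuristic sketch (the helper check_disjoint_cis is hypothetical, and "disjoint confidence intervals" is a stricter criterion than a proper test):

check_disjoint_cis <- function(a, b, conf = 0.95) {
  # postcondition sanity check: after rejecting "equal means",
  # the two confidence intervals should be (mostly) disjoint
  ci_a <- t.test(a, conf.level = conf)$conf.int
  ci_b <- t.test(b, conf.level = conf)$conf.int
  ci_a[2] < ci_b[1] || ci_b[2] < ci_a[1]
}

set.seed(1)
a <- rnorm(50, mean = 0)
b <- rnorm(50, mean = 1)
t.test(a, b)$p.value < 0.05  # TRUE: we reject equal means...
check_disjoint_cis(a, b)     # ...and the postcondition holds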

Conclusion

Although statistics has been referred to as "mathematical engineering", many techniques of software engineering don't really apply or lack analogous counterparts. Some, like preconditions, have something mildly similar for R scripts. Others, like "design patterns", are more a meta-concept: a guide for one's own notetaking rather than something directly applicable to doing statistics.

Tuesday, June 18, 2019

Sample Size and Central Limit Theorem

I am trying to reproduce results from Zachary R. Smith and Craig S. Wells's Central Limit Theorem and Sample Size.

Therein the authors claim the central limit theorem does not hold for samples of size 30 or so (they go so far as to claim 300). I have tried reproducing this claim, and their work appears unreproducible. Moreover, their conclusion is visually wrong: you can literally plot sums of uniformly distributed variables and observe the sums are approximately normally distributed.
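Here is a minimal sketch of that visual check (my own code, not theirs): standardize sums of 30 uniform draws and overlay the standard normal density.

set.seed(42)
n <- 30        # summands per sample
trials <- 1e4  # number of sums to draw
sums <- replicate(trials, sum(runif(n)))
z <- (sums - n / 2) / sqrt(n / 12)  # a sum of n Uniform(0,1) draws has mean n/2, variance n/12
hist(z, breaks = 50, freq = FALSE, main = "Standardized sums of 30 uniforms")
curve(dnorm(x), add = TRUE, lwd = 2)  # overlay the standard normal density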

For those interested in the sordid details, I have posted them on GitHub.

Why did they get this so wrong? Well, they failed to adequately handle the multiple-comparisons problem, and they ignored the issues involved with the Kolmogorov-Smirnov test (e.g., with large samples it flags even trivial deviations as significant). The latter was particularly troublesome, as it led them to completely erroneous findings.

Thursday, June 13, 2019

Statistics as Decision Problem

"Decision theory" is a framework for picking an action based on evidence and some "loss function" (intuitively, negative utility). Almost all of statistics may be framed as a decision theoretic problem, and I'd like to review that in this post.

(Note that the diagrams in this post were really inspired by I-Hsiang Wang's lectures on Information Theory.)

I am going to, literally, give a "big picture" of statistics as decision theory. Then I'll try to drill down into various "frequentist" statistical tasks, to show the algorithmic structure of each one. Although I'm certain Bayesian data analysis can be made to fit this mould, I don't have as compelling a "natural" fit as frequentist statistics.

And just to be clear, we "the statistician" are "the decider" in this decision making problem. We are applying decision theory to the process of "doing statistics".

Review of Decision Theory

Statistical Experiment

We have some source of data, whether it's observation, experiment, whatever. As Richard McElreath's Statistical Rethinking calls it, we work in the "small world" of our model, where we describe the data as a random variable \(X\) which follows a hypothesized probability distribution \(P_{\theta}\), where \(\theta\) is a vector of parameters describing the "state of the world" (it really just parametrizes our model). The set of all possible parameters is denoted \(\Theta\) with a capital theta. This \(\Theta\) is the boundary to our "small world". This data collection process is highlighted in the following figure:

Serious statisticians need to actually think about sampling methods and experimental methods. We are silly statisticians, not serious ones, and use data already assembled for us. Although we will not perform any polling or statistical experiments ourselves, it is useful to know the nuances and subtleties surrounding the methodology used to produce data. Hence we may dedicate a bit of space to discussing the data-gathering and experimental methodologies our sources have employed.

Decision Making

Given some data we have just collected, we now arrive at the romantic part of the process: decision making. Well, that's what decision theorists call it. Statisticians call it... well, it depends on the task. It is highlighted in the following diagram:

There are really multiple tasks at hand here, so let's consider the key moments in decision making.

Inference task. Or, "What do we want to do with the data?" The answer gives us the task of estimating a specific function \(T\) of the parameters, \(T(\theta)\), from the observed data \(X\). The choice of this function depends on the task we are trying to accomplish.

A few examples:

  • With hypothesis testing, \(T(\theta)=\theta\): we're trying to estimate the parameters themselves (which label the hypotheses we're testing).
  • For regressions (i.e., given pairs of data from the experimental process \((X,Y)\) find the function f such that \(Y = f(X) + \varepsilon\)) the function of the parameters is the relationship itself, i.e., \(T(\theta)=f\).
  • For classification problems, \(T(\theta)\) gives us the "correct" labeling function for the data.

In some sense, \(T(\theta)\) is the "correct answer" we're searching for; we just have to approximate it with the next step of the game...

The Decision Rule. In the language of decision theory, an estimator is an example of a Decision Rule which we denote by \(\tau\) ("tau"). This approximates \(T(\theta)\) given the data we have and the conceptual models we're using.

For regressions, this is the estimated function \(\tau(X,Y)=\widehat{f}_{X,Y}\) which fits the observations. For hypothesis testing, \(\tau(X)=\hat{\theta}\) picks out which hypothesis "appears to work".

These two tasks, inference and computing the decision rule, constitute the bulk of statistical work. But there's one more crucial step to be done.

Performance Evaluation

We need to see how good our estimates are! In the complete diagram, this is the highlighted part of the following figure:

The loss function \(l(T(\theta),\tau(X))\) measures how bad, given the data X, the decision rule \(\tau\) is. Note this is a random variable, since it's a function of the random variable X. Also note, there are various different candidates for the loss function (it's our job as the statistician to figure out which one to use).

The risk is just the expected value of the loss function, \[L_{\theta}(\tau) = \mathbb{E}_{X\sim P_{\theta}}[l(T(\theta),\tau(X))].\] This tells us on average how bad the decision rule \(\tau\) turns out, given the true state of the world is \(\theta\).

For some tasks, we don't really have much of a choice of loss function: regressions do best with the mean-squared error, though we could choose a variant (e.g., using the \(L^{p}\) norm instead of the \(L^{2}\) norm).

Remark. It might seem strange that \(T\), which is never really knowable, appears in the risk function. We typically use it in a tricky way. For example, in regression we're really using \(T(\theta)=f\), and the trick is that \(f(X) \approx Y\), so we can use the observed results \(Y\) instead of worrying about the unobservable, incalculable, unknowable \(T\).

Examples

We will collect a bunch of examples, but this is incomplete. The goal is to show enough examples to encourage the reader to devise their own.

Hypothesis Testing

Classical hypothesis testing may be framed as a decision problem: do we take action A or action B? In our case: do we accept or reject the null hypothesis?

More precisely, we have two hypotheses regarding the observation X, indexed by \(\theta=0\) or \(\theta=1\). The null hypothesis is that \(X\sim P_{0}\), while the alternative hypothesis states \(X\sim P_{1}\).

We have some decision rule, which in our diagrams we have denoted \(\tau(X)\), that "picks" a \(\theta\) minimizing the risk based on the observations \(X\). But what is the loss function?

Well, we have the probability of a false alarm, when \(\tau(x)=1\) but it should be zero, \[\alpha_{\tau} = \sum_{x}\tau(x)P_{0}(x)\tag{1}\] and the probability of missing a detection, when \(\tau(x)=0\) but it should be one, \[\beta_{\tau} = \sum_{x}(1-\tau(x))P_{1}(x)\tag{2}.\] We note each of these is indeed an expected value of \(\tau\) (a risk), parametrized by the choice of \(\theta\).

But how do we choose \(\tau\)?

We may construct one possible "hypothesis chooser" (randomized decision rule) as, for some constant probability \(0\leq q\leq 1\) and threshold \(c\gt0\), \[\tau_{c,q}(x) = \begin{cases} 1 & \mbox{if } P_{1}(x) \gt cP_{0}(x)\\ q & \mbox{if } P_{1}(x) = cP_{0}(x)\\ 0 & \mbox{if } P_{1}(x) \lt cP_{0}(x) \end{cases}\tag{3}\] In other words, \(\theta=1\) is chosen with probability \(\tau_{c,q}(x)\), and \(\theta=0\) is chosen with probability \(1-\tau_{c,q}(x)\). Starting from a given value of \(\alpha_{0}\), we then determine the parameters c and q by the equation \[\alpha_{0}=\sum_{x}\tau_{c,q}(x)P_{0}(x).\tag{4}\] The Neyman-Pearson lemma proves this is the most powerful test at significance level \(\alpha_{0}\) (it minimizes \(\beta_{\tau_{c,q}}\) subject to the constraint \(\alpha_{\tau_{c,q}}=\alpha_{0}\)).

We emphasize, though, this is a "toy problem" which fleshes out the details of this framework.

Exercise. Prove that the probability of type-I errors (probability of false alarm) is \(\alpha_{\tau} = \mathbb{E}_{X\sim P_{0}}[\tau(X)]\) and the probability of type-II errors (probability of failing to detect) is \(\beta_{\tau} = \mathbb{E}_{X\sim P_{1}}[1 - \tau(X)]\).
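Though I leave the proof to the reader, a Monte Carlo sketch (mine, not from any reference) can corroborate these formulas for a concrete pair of simple hypotheses, \(P_{0}=N(0,1)\) and \(P_{1}=N(1,1)\). Here the likelihood ratio \(P_{1}(x)/P_{0}(x)\) is monotone in \(x\), so rule (3) reduces to "choose \(\theta=1\) when \(x\) exceeds a cutoff", the randomization \(q\) never fires (ties have probability zero), and calibrating via (4) gives the cutoff \(\Phi^{-1}(1-\alpha_{0})\):

alpha0 <- 0.05
cutoff <- qnorm(1 - alpha0)  # calibrated under P0 = N(0,1) via equation (4)

set.seed(1)
x0 <- rnorm(1e5, mean = 0)  # draws under the null, P0
x1 <- rnorm(1e5, mean = 1)  # draws under the alternative, P1
mean(x0 > cutoff)   # ~ alpha_tau = E_{P0}[tau(X)], false-alarm probability
mean(x1 <= cutoff)  # ~ beta_tau = E_{P1}[1 - tau(X)], missed-detection probability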

Regression

The goal of a regression is, when we have some training data \((\mathbf{X}^{(j)}, Y^{(j)})\) where parenthetic superscripts run through the number of observations \(j=1,\dots,N\), to find some function f such that \(\mathbb{E}[Y|\mathbf{X}]\approx f(\mathbf{X},\beta) \approx Y\). Usually we start with some preconception like f is a linear function, or a logistic function, or something similar, rather than permitting f to be any arbitrary function. We then proceed to estimate \(\widehat{f}\) and the coefficients \(\widehat{\beta}\).

Some terminology: the \(\mathbf{X}\) are the Covariates (or "features", "independent variables", or most intuitively "input variables") and \(Y\) are the Regressands (or "dependent variables", "response variable", "criterion", "predicted variable", "measured variable", "explained variable", "experimental variable", "responding variable", "outcome variable", "output variable" or "label"). Unfortunately there is a preponderance of nouns for the same concepts.

Definition. Consider \(X\sim P_{\theta}\) which randomly generates observed data \(x\), where \(\theta\in\Theta\) is an unknown parameter. An Estimator of \(\theta\) based on observed \(x\) is a mapping \(\phi\colon\mathcal{X}\to\Theta\), \(x\mapsto\hat{\theta}\). An Estimator of a function \(z(\theta)\) is a mapping \(\zeta\colon\mathcal{X}\to z(\Theta)\), \(x\mapsto\widehat{z}\).

The decision rule then estimates the true function, \(\tau_{\mathbf{X},Y}=\widehat{f}\). That is to say, it produces an estimator. There are various algorithms to construct the estimator, depending on the regression analysis being done.

The loss function is usually the squared error; averaged over a single observation drawn from the model, \[l(T,\tau) = \mathbb{E}_{(\mathbf{X},Y)\sim P_{\theta}}[(Y - \widehat{f}(\mathbf{X},\widehat{\beta}))^{2}].\tag{5}\] Depending on the problem at hand, other loss functions may be considered. (If the \(Y\) variables were probabilities or indicator variables, we could use the [expected] cross entropy as the loss function.)

The risk is then the average loss function over the training data. But do not mistake this for the only diagnostic for regression analysis.
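As a trivial concrete example (a sketch of mine, using R's built-in mtcars data), the empirical risk of an ordinary least-squares fit is just its mean squared residual:

fit <- lm(mpg ~ wt, data = mtcars)
mean(residuals(fit)^2)  # empirical risk: average squared-error loss on the training data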

We have multiple measures of how good our estimator is, which we should briefly review.

Definition. For an estimator \(\phi(x)\) of \(\theta\),

  • its Bias is  \(\mathrm{Bias}_{\theta}(\phi) := \mathbb{E}_{X\sim P_{\theta}}[\phi(X)] - \theta\)
  • its Mean Square Error is  \(\mathrm{MSE}_{\theta}(\phi) := \mathbb{E}_{X\sim P_{\theta}}[|\phi(X) - \theta|^{2}]\)

Fact (The MSE = Bias\(^{2}\) + Variance). Let \(\phi(x)\) be an estimator of \(\theta\), then \[\mathrm{MSE}_{\theta}(\phi) = \left(\mathrm{Bias}_{\theta}(\phi)\right)^{2} + \mathrm{Var}_{P_{\theta}}[\phi(X)]. \tag{6}\] In practice, this means that for a fixed MSE there is a trade-off: the more "spread out" an estimator is (higher variance), the more "centered near the correct value" it must be (smaller bias), and vice versa. (End of Fact)
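A quick Monte Carlo sketch (mine, not from the references) corroborating (6) for the biased divide-by-\(n\) variance estimator of a normal sample:

set.seed(7)
n <- 10
sigma2 <- 4  # the true parameter theta
phi <- replicate(1e5, {
  x <- rnorm(n, sd = sqrt(sigma2))
  mean((x - mean(x))^2)  # the divide-by-n variance estimator
})
bias <- mean(phi) - sigma2     # estimates Bias_theta(phi), about -sigma2/n
mse <- mean((phi - sigma2)^2)  # estimates MSE_theta(phi)
mse - (bias^2 + var(phi))      # approximately zero, confirming (6)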

Conclusion

Most of statistical inference falls into the schema presented here. Broadly speaking, statistical inference consists of hypothesis testing (already discussed), point estimation (and interval estimation), and confidence sets.1See, e.g., section 6 of K.M. Zuev's lecture notes on statistical inference. We have discussed only the frequentist approach, however, and only a couple of these tasks.

The Bayesian approach, in contrast to all these techniques, ends up using a loss function which sums over values of \(\theta\): it integrates over \(\Theta\) instead of over the space of experimental results \(\mathcal{X}\), weighting by the posterior, as in the posterior expected loss \(\int_{\Theta} l(T(\theta),\tau(x))\,\pi(\theta\mid x)\,d\theta\). The Bayesian priors describe a probability distribution of likely values of \(\theta\), which would be used in the overall process.

Yet the Bayesian school offers more tools than just this, and I don't think they can neatly fit inside a diagram like the one doodled above for frequentist statistics.

But we have provided an intuition for the overall procedures and tools the frequentist school affords us. Although we abstracted away the data-gathering procedure (as well as the other steps in the process), we could flesh out each step further.

In short, statistics consists of decision theoretic problems (perhaps "decision theory about decision theory", or meta-decision theory, may be a good intuition), but it remains more of an art than an algorithmic task.

References

  • P. J. Bickel and E. L. Lehmann, "Frequentist Inference". In Neil J. Smelser and Paul B. Baltes (eds.), International Encyclopedia of the Social & Behavioral Sciences, Elsevier, 2001, pp. 5789–5796.
  • James O. Berger, Statistical Decision Theory and Bayesian Analysis. Springer Verlag, 1993. See especially section 2.4.3. (This is the only book on statistical decision theory that I know of worth its salt.)
  • George Casella and Roger L. Berger, Statistical Inference. Second ed., Cengage Learning, 2001. (Section 8.3.5, 9.3.4 for hypothesis testing and point estimators in the decision theoretic framework.)
  • C. Robert, The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Second ed., Springer Verlag, 2007. See especially chapter 2.