"Decision theory" is a framework for picking an action based on evidence and some "loss function" (intuitively, negative utility). Almost all of statistics may be framed as a decision theoretic problem, and I'd like to review that in this post.
(Note that the diagrams in this post were really inspired by I-Hsiang Wang's lectures on Information Theory.)
I am going to, literally, give a "big picture" of statistics as decision theory. Then I'll try to drill down on various "frequentist" statistical tasks, to show the algorithmic structure of each one. Although I'm certain Bayesian data analysis can be made to fit this mould, I don't have as compelling a "natural" fit for it as I do for frequentist statistics.
And just to be clear, we "the statistician" are "the decider" in this decision making problem. We are applying decision theory to the process of "doing statistics".
Review of Decision Theory
Statistical Experiment
We have some source of data, whether it's observation, experiment, whatever. As Richard McElreath's Statistical Rethinking calls it, we work in the "small world" of our model, where we describe the data as a random variable \(X\) which follows a hypothesized probability distribution \(P_{\theta}\) where \(\theta\) is a vector of parameters describing the "state of the world" (it really just parametrizes our model). The set of all possible parameters is denoted \(\Theta\) with a capital theta. This \(\Theta\) is the boundary to our "small world". This data collection process is highlighted in the following figure:
Serious statisticians need to think carefully about sampling methods and experimental design. We are silly statisticians, not serious statisticians, and use data already assembled for us. Although we will not perform any polling or statistical experiments ourselves, it is useful to know the nuances and subtleties surrounding the methodology used to produce data. Hence we may dedicate a bit of space to discussing the aspects of data gathering and experimental methodology our sources have employed.
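Before moving on, a minimal sketch may help fix the notation (in Python; every name here is illustrative, not part of any real dataset): the "state of the world" \(\theta\) is the mean of a Gaussian, and the experiment hands us i.i.d. draws \(X\sim P_{\theta}\).

```python
# A minimal sketch of the statistical experiment: theta parametrizes the
# "small world" model P_theta (here a Gaussian with unknown mean), and the
# experiment produces observations X ~ P_theta.
import numpy as np

rng = np.random.default_rng(seed=0)

theta_true = 2.5    # the unknown "state of the world"
n_samples = 100

# The experiment: draw n i.i.d. observations from P_theta = N(theta, 1)
X = rng.normal(loc=theta_true, scale=1.0, size=n_samples)
```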
Decision Making
Given some data we have just collected, we now arrive at the romantic part of statistics: decision making. Well, that's what decision theorists call it. Statisticians call it...well, it depends on the task. It is highlighted in the following diagram:
There are really multiple tasks at hand here, so let's consider the key moments in decision making.
Inference task. Or, "What do we want to do with the data?" The answer gives us the task of estimating a specific function \(T(\theta)\) of the parameters from the observed data \(X\). The choice of this function depends on the task we are trying to accomplish.
A few examples:
- With hypothesis testing, \(T(\theta)=\theta\): we're trying to estimate the parameters themselves (which label the hypotheses we're testing).
- For regressions (i.e., given pairs of data \((X,Y)\) from the experimental process, find the function \(f\) such that \(Y = f(X) + \varepsilon\)), the function of the parameters is the relationship itself, i.e., \(T(\theta)=f\).
- For classification problems, \(T(\theta)\) gives us the "correct" labeling function for the data.
In some sense, \(T(\theta)\) is the "correct answer" we're searching for; we just have to approximate it with the next step of the game...
The Decision Rule. In the language of decision theory, an estimator is an example of a Decision Rule which we denote by \(\tau\) ("tau"). This approximates \(T(\theta)\) given the data we have and the conceptual models we're using.
For regressions, this is the estimated function \(\tau(X,Y)=\widehat{f}_{X,Y}\) which fits the observations. For hypothesis testing, \(\tau(X)=\hat{\theta}\) is which hypothesis "appears to work".
These two tasks, inference and computing the decision rule, constitute the bulk of statistical work. But there's one more crucial step to be done.
We need to see how good our estimates are! In the complete diagram, this is the highlighted part of the following figure:
The loss function \(l(T(\theta),\tau(X))\) measures how bad, given the data \(X\), the decision rule \(\tau\) is. Note this is a random variable, since it's a function of the random variable \(X\). Also note there are various candidates for the loss function (it's our job as the statistician to figure out which one to use).
The risk is just the expected value of the loss function. This tells us on average how bad the decision rule \(\tau\) turns out, given the true state of the world is \(\theta\). We denote this risk by the function \(L_{\theta}(\tau)\).
For some tasks, we don't really have much of a choice of loss function. Regressions do best with the mean squared error, though we could choose a variant of it (e.g., using the \(L^{p}\) norm instead of the \(L^{2}\) norm).
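Pulling the pieces together, here is a minimal sketch (continuing the Gaussian experiment sketched earlier; all names are illustrative) of the loss/risk machinery with \(T(\theta)=\theta\), the sample mean as the decision rule \(\tau\), and squared error as the loss:

```python
# A sketch of the loss/risk machinery, continuing the Gaussian experiment
# above: T(theta) = theta, the decision rule tau is the sample mean, the
# loss is squared error, and the risk L_theta(tau) is approximated by
# Monte Carlo over repetitions of the whole experiment.
import numpy as np

rng = np.random.default_rng(seed=1)
theta_true = 2.5
n_samples = 100

def tau(x):
    """Decision rule: estimate theta by the sample mean of the data."""
    return x.mean()

def loss(t_theta, decision):
    """Squared-error loss l(T(theta), tau(X))."""
    return (t_theta - decision) ** 2

# Risk = E[loss]; approximate it by rerunning the experiment many times.
n_trials = 10_000
losses = [
    loss(theta_true, tau(rng.normal(theta_true, 1.0, n_samples)))
    for _ in range(n_trials)
]
print("estimated risk:", np.mean(losses))  # about 1/n_samples = 0.01
```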
Examples
We will collect a bunch of examples, though the list is necessarily incomplete. The goal is to show enough examples to encourage the reader to devise their own.
Hypothesis Testing
Classical hypothesis testing may be framed as a decision problem: do we take action A or action B? In our case: do we accept or reject the null hypothesis?
More precisely, we have two hypotheses regarding the observation X, indexed by \(\theta=0\) or \(\theta=1\). The null hypothesis is that \(X\sim P_{0}\), while the alternative hypothesis states \(X\sim P_{1}\).
We have some decision rule, which in our diagrams we have denoted \(\tau(X)\), which "picks" a \(\theta\) which minimizes the risk based on the observations X. But what is the loss function?
Well, we have the probability of a false alarm, when \(\tau(x)=1\) but the true \(\theta\) is zero,
\[\alpha_{\tau} = \sum_{x}\tau(x)P_{0}(x)\tag{1}\]
and the probability of a missed detection, when \(\tau(x)=0\) but the true \(\theta\) is one,
\[\beta_{\tau} = \sum_{x}(1-\tau(x))P_{1}(x).\tag{2}\]
We note each of these error probabilities is indeed an expected value of the loss (an expectation of \(\tau\) or of \(1-\tau\)), parametrized by the choice of \(\theta\).
But how do we choose \(\tau\)?
We may construct one possible "hypothesis chooser" (randomized decision rule) as, for some constant probability \(0\leq q\leq 1\) and threshold \(c\gt0\),
\[\tau_{c,q}(x) = \begin{cases}
1 & \mbox{if } P_{1}(x) \gt cP_{0}(x)\\
q & \mbox{if } P_{1}(x) = cP_{0}(x)\\
0 & \mbox{if } P_{1}(x) \lt cP_{0}(x)
\end{cases}\tag{3}\]
In other words, \(\theta=1\) is chosen with probability \(\tau_{c,q}(x)\), and \(\theta=0\) is chosen with probability \(1-\tau_{c,q}(x)\). Starting from a given value of \(\alpha_{0}\), we then determine the parameters c and q by the equation
\[\alpha_{0}=\sum_{x}\tau_{c,q}(x)P_{0}(x).\tag{4}\]
The Neyman-Pearson lemma proves this is the most powerful test at significance level \(\alpha_{0}\) (it minimizes \(\beta_{\tau_{c,q}}\) subject to the constraint \(\alpha_{\tau_{c,q}}=\alpha_{0}\)).
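To make (3)–(4) concrete, here is a minimal sketch on a toy discrete problem; the distributions \(P_0\), \(P_1\) and the level \(\alpha_0\) below are made-up numbers for illustration only.

```python
# A sketch of the randomized test (3)-(4) on a toy discrete problem.
# P0, P1, and alpha_0 below are made-up numbers for illustration.
import numpy as np

P0 = np.array([0.5, 0.3, 0.2])  # null distribution over x in {0, 1, 2}
P1 = np.array([0.2, 0.3, 0.5])  # alternative distribution
alpha_0 = 0.25                  # target false-alarm probability

ratio = P1 / P0                 # likelihood ratios P1(x)/P0(x)

# Sweep candidate thresholds c through the observed ratio values.  For each
# c, tau = 1 wherever ratio > c; the boundary mass at ratio == c is then
# randomized with probability q so that sum_x tau(x) P0(x) = alpha_0.
for c in sorted(set(ratio)):
    above = P0[ratio > c].sum()  # P0-mass decided "1" deterministically
    at = P0[ratio == c].sum()    # P0-mass sitting on the boundary
    if above <= alpha_0 <= above + at:
        q = (alpha_0 - above) / at if at > 0 else 0.0
        break

tau = np.where(ratio > c, 1.0, np.where(ratio == c, q, 0.0))
print("c =", c, "q =", q)
print("false alarm alpha =", (tau * P0).sum())           # equals alpha_0
print("miss probability beta =", ((1 - tau) * P1).sum())
```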
We emphasize, though, this is a "toy problem" which fleshes out the details of this framework.
Exercise.
Prove that the probability of type-I errors (probability of false alarm) is \(\alpha_{\tau} = \mathbb{E}_{X\sim P_{0}}[\tau(X)]\)
and the probability of type-II errors (probability of failing to detect) is \(\beta_{\tau} = \mathbb{E}_{X\sim P_{1}}[1 - \tau(X)]\).
Regression
The goal of a regression is, when we have some training data \((\mathbf{X}^{(j)}, Y^{(j)})\) where parenthetic superscripts run through the number of observations \(j=1,\dots,N\), to find some function \(f\) such that \(\mathbb{E}[Y|\mathbf{X}]\approx f(\mathbf{X},\beta) \approx Y\). Usually we start with some preconception, like \(f\) being a linear function or a logistic function, rather than permitting \(f\) to be any arbitrary function. We then proceed to estimate \(\widehat{f}\) and the coefficients \(\widehat{\beta}\).
Some terminology: the \(\mathbf{X}\) are the Covariates (or "features", "independent variables", or most intuitively "input variables") and \(Y\) are the Regressands (or "dependent variables", "response variable", "criterion", "predicted variable", "measured variable", "explained variable", "experimental variable", "responding variable", "outcome variable", "output variable" or "label"). Unfortunately there is a preponderance of nouns for the same concepts.
Definition. Consider \(X\sim P_{\theta}\) which randomly generates observed data \(x\), where \(\theta\in\Theta\) is an unknown parameter. An Estimator of \(\theta\) based on observed \(x\) is a mapping \(\phi\colon\mathcal{X}\to\Theta\), \(x\mapsto\hat{\theta}\). An Estimator of a function \(z(\theta)\) is a mapping \(\zeta\colon\mathcal{X}\to z(\Theta)\), \(x\mapsto\widehat{z}\).
The decision rule then estimates the true function, \(\tau_{\mathbf{X},Y}=\widehat{f}\). That is to say, it produces an estimator. There are various algorithms to construct the estimator, which depend on the regression analysis being done.
The loss function is usually the squared error for a single observation
\[l(T(\theta),\tau) = (Y - \widehat{f}(\mathbf{X},\widehat{\beta}))^{2}.\tag{5}\]
Depending on the problem at hand, other loss functions may be considered. (If the Y variables were probabilities or indicator variables, we could use the [expected] entropy as the loss function.)
The risk is then the average loss function over the training data. But do not mistake this for the only diagnostic for regression analysis.
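As a concrete sketch (synthetic data, plain least squares via NumPy; all names are illustrative), the decision rule and its empirical risk might look like:

```python
# A minimal regression sketch with synthetic data: the decision rule
# tau(X, Y) is plain least squares, producing the estimator f_hat, and
# the empirical risk is the average squared-error loss (5) over the
# training data.
import numpy as np

rng = np.random.default_rng(seed=2)

beta_true = np.array([1.0, 3.0])  # intercept and slope of the true f
N = 200
X = np.column_stack([np.ones(N), rng.uniform(0.0, 1.0, N)])  # design matrix
Y = X @ beta_true + rng.normal(0.0, 0.5, N)                  # Y = f(X) + noise

# Decision rule: the least-squares estimate of beta, hence of f
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
f_hat = lambda x: x @ beta_hat

empirical_risk = np.mean((Y - f_hat(X)) ** 2)  # average of the loss (5)
print("beta_hat:", beta_hat)
print("empirical risk (training MSE):", empirical_risk)
```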
We have multiple measures of how good our estimator is, which we should briefly review.
Definition. For an estimator \(\phi(x)\) of \(\theta\),
- its Bias is \(\mathrm{Bias}_{\theta}(\phi) := \mathbb{E}_{X\sim P_{\theta}}[\phi(X)] - \theta\)
- its Mean Square Error is \(\mathrm{MSE}_{\theta}(\phi) := \mathbb{E}_{X\sim P_{\theta}}[|\phi(X) - \theta|^{2}]\)
Fact (\(\mathrm{MSE} = \mathrm{Bias}^{2} + \mathrm{Variance}\)).
Let \(\phi(x)\) be an estimator of \(\theta\), then
\[\mathrm{MSE}_{\theta}(\phi) = \left(\mathrm{Bias}_{\theta}(\phi)\right)^{2} + \mathrm{Var}_{P_{\theta}}[\phi(X)]. \tag{6}\]
In practice, this expresses a tradeoff: for a fixed MSE, the more "spread out" an estimate is (larger variance), the more "centered near the correct value" it must be (smaller bias), and vice versa. (End of Fact)
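We can sanity-check (6) by Monte Carlo. Here is a sketch using a deliberately biased estimator, the variance estimator that divides by \(n\) rather than \(n-1\); the numbers chosen are illustrative.

```python
# A Monte Carlo sanity check of (6), using a deliberately biased estimator:
# the variance estimator that divides by n rather than n - 1.  All numbers
# here (theta, n, the trial count) are illustrative.
import numpy as np

rng = np.random.default_rng(seed=3)
theta = 4.0        # true variance of the sampling distribution N(0, theta)
n = 10
n_trials = 100_000

samples = rng.normal(0.0, np.sqrt(theta), size=(n_trials, n))
estimates = samples.var(axis=1)          # ddof=0: divides by n, hence biased

bias = estimates.mean() - theta          # approximately -theta/n = -0.4
variance = estimates.var()
mse = np.mean((estimates - theta) ** 2)

print("bias^2 + variance:", bias**2 + variance)
print("MSE:              ", mse)   # the two agree up to Monte Carlo error
```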
Conclusion
Most of statistical inference falls into the schema presented here. Broadly speaking, statistical inference consists of hypothesis testing (already discussed), point estimation (and interval estimation), and confidence sets. (See, e.g., section 6 of K.M. Zuev's lecture notes on statistical inference.)
We have discussed only the frequentist approach, however, and for only a couple of these tasks.
The Bayesian approach, in contrast to all these techniques, ends up using a notion of risk which averages over values of \(\theta\) (i.e., integrates over \(\Theta\) instead of over the space of experimental results \(\mathcal{X}\)). The Bayesian prior describes a probability distribution of likely values of \(\theta\), which is used in the overall process.
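Concretely, writing \(\pi\) for a prior density on \(\Theta\), the Bayes risk of a decision rule \(\tau\) is the frequentist risk averaged against the prior:
\[r_{\pi}(\tau) = \int_{\Theta} L_{\theta}(\tau)\,\pi(\theta)\,\mathrm{d}\theta.\]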
Yet the Bayesian school offers more tools than just this, and I don't think they can neatly fit inside a diagram like the one doodled above for frequentist statistics.
But we have provided an intuition to the overall procedures and tools the frequentist school affords us. Although we abstracted away the data gathering procedure (as well as the other steps in the process), we could flesh out more on each step involved.
In short, statistics consists of decision theoretic problems (perhaps "decision theory about decision theory", or meta-decision theory, may be a good intuition), but it remains more of an art than an algorithmic task.
References
- P. J. Bickel and E. L. Lehmann, "Frequentist Inference". In Neil J. Smelser and Paul B. Baltes (eds), International Encyclopedia of the Social & Behavioral Sciences, Elsevier, 2001, pages 5789–5796.
- James O. Berger, Statistical Decision Theory and Bayesian Analysis. Springer Verlag, 1993. See especially section 2.4.3. (This is the only book on statistical decision theory that I know of worth its salt.)
- George Casella and Roger L. Berger, Statistical Inference. Second ed., Cengage Learning, 2001. (Sections 8.3.5 and 9.3.4 for hypothesis testing and point estimators in the decision theoretic framework.)
- C. Robert, The Bayesian choice: from decision-theoretic foundations to computational implementation. Second ed., Springer Verlag, 2007. See especially chapter 2.