Tentative Definition. A random variable assigns to each outcome of a given "random phenomenon" or "random process" some (real) number, or a vector ("list") of numbers.
Examples. The following are all random variables.
- Let X be the number of heads in 10 coin flips.
- Let R be the number of times a given baseball pitcher strikes out an opponent in the course of a given game.
- Let S be the number of "successes" (heads) until the first "failure" (tail) in a repeated trial (flipping a coin over and over again).
- Let T be the waiting time (in minutes) until the next bus arrives.
Slightly more formal Definition. If we represent the possible outcomes for a given "random process" (i.e., the set underlying its sigma algebra) by Ω, then a random variable is a function \(X\colon\Omega\to\mathbb{R}\) (or possibly \(X\colon\Omega\to\mathbb{R}^{k}\) for some fixed positive integer k) such that for any \(x\in\mathbb{R}\) the preimage of smaller values is an event \(\{\omega\in\Omega:X(\omega)\leq x\}\in\Sigma\) ("is measurable", i.e., we can assign a probability to that preimage). We will write \(\{X\leq x\} = \{\omega\in\Omega:X(\omega)\leq x\}\) as an abuse of notation.
Remark. We should remember that the data describing a "random phenomenon" is not just the space of outcomes \(\Omega\), but also a specific set of well-defined events \(\Sigma\subseteq\mathcal{P}(\Omega)\) (the sigma algebra). The events must satisfy a few requirements, for instance "something happens", \(\Omega\in\Sigma\); for any event \(E\in\Sigma\) its complement is also a well-defined event, \(\Omega\setminus E\in\Sigma\) ["an event does not happen"]; for any countable family of events \(\{E_{j}\}_{j\in J}\subseteq\Sigma\) their union is also a well-defined event, \(\bigcup_{j\in J}E_{j}\in\Sigma\) ["at least one of these events happens"]; and so forth. Implicitly we also fix a probability measure on the set of well-defined events, generically denoted \(\Pr(-)\). Altogether, this is the data necessary to describe some "random phenomenon".
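As a concrete illustration, here is a minimal sketch of this data for two fair coin flips: the outcomes, the events (for a finite sample space we can take every subset), and a probability measure. The names `Omega`, `Sigma`, and `prob` are our own.

```python
from itertools import combinations

# Sample space for two fair coin flips.
Omega = {"HH", "HT", "TH", "TT"}

# For a finite sample space we may take the sigma algebra to be the full
# power set: every subset of Omega is a well-defined event.
def power_set(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

Sigma = power_set(Omega)

# A probability measure: each outcome is equally likely, so the probability
# of an event is proportional to how many outcomes it contains.
def prob(event):
    return len(event) / len(Omega)

print(prob(frozenset({"HH", "HT"})))  # Pr("first flip is heads") = 0.5
print(prob(frozenset(Omega)))         # Pr(Omega) = 1.0
```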
Example. Let E be any event. Then the indicator function \(I_{E}\), defined by \(I_{E}(\omega)=0\) for any \(\omega\notin E\) and \(I_{E}(\omega)=1\) for all \(\omega\in E\), is a random variable. For multiple events \(E_{1},\dots, E_{n}\), we find \(I_{E_{1}\cup\dots\cup E_{n}}(\omega)=\max(I_{E_{1}}(\omega),\dots,I_{E_{n}}(\omega))\) and \(I_{E_{1}\cap\dots\cap E_{n}}(\omega) = \prod_{j} I_{E_{j}}(\omega)\). For discrete random phenomena, these indicator functions are the basic building blocks for constructing other random variables.
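We can check these identities on a toy sample space; a quick sketch (the helper `indicator` is our own name):

```python
# Indicator random variables on the sample space of a single die roll.
Omega = [1, 2, 3, 4, 5, 6]
E1 = {2, 4, 6}   # "the roll is even"
E2 = {4, 5, 6}   # "the roll is at least 4"

def indicator(E):
    """Return the indicator random variable I_E as a function on Omega."""
    return lambda omega: 1 if omega in E else 0

I_union = indicator(E1 | E2)
I_inter = indicator(E1 & E2)

for omega in Omega:
    assert I_union(omega) == max(indicator(E1)(omega), indicator(E2)(omega))
    assert I_inter(omega) == indicator(E1)(omega) * indicator(E2)(omega)
```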
What happens to all that sigma-algebra baggage? Given a random variable, we can ask "What is the probability its value will be in a given range?" For example, "What is the probability the starting pitcher for the Dodgers will strike out at least 30 batters?"
This would be computed by first assembling all possible outcomes which satisfy this condition, \(\mathcal{E}=\{\omega\in\Omega : R(\omega)\geq30\}\subseteq\Omega\), and then applying the probability measure on the sample space for the random process, \(\Pr(\mathcal{E})\), or more imaginatively \(\Pr(R\geq30)\).
As a caveat, we hasten to add, the definition only guarantees that inequalities like \(\Pr(X\leq x)\) are well-defined. We nevertheless abuse notation and write \(\Pr(X=x)\) for the probability mass function of a discrete random variable. (This gets tricky with subtle nuances when dealing with continuous random variables instead of discrete ones, where \(\Pr(X=x)\) is typically zero.)
This induces a nice mathematical structure on the image of the random variable \(R(\Omega)\), namely we can "transport" the probability distribution from the sigma algebra \(\Sigma\) on \(\Omega\) to \(R(\Omega)\). This is the (cumulative) Distribution Function for the random variable, \(F_{R}(x) = \Pr(R\leq x)\).
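For a discrete random variable on a finite sample space, the distribution function can be computed directly from the definition. A minimal sketch (the function `cdf` and the two-flip example are our own):

```python
from fractions import Fraction

# Two fair coin flips; R counts the number of heads.
Omega = ["HH", "HT", "TH", "TT"]
Pr = {omega: Fraction(1, 4) for omega in Omega}
R = {omega: omega.count("H") for omega in Omega}

def cdf(x):
    """F_R(x) = Pr(R <= x), summing over the outcomes in the preimage."""
    return sum(p for omega, p in Pr.items() if R[omega] <= x)

print(cdf(0))  # 1/4
print(cdf(1))  # 3/4
print(cdf(2))  # 1
```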
Equivalence relation of Random Variables. If we have two random variables, say, \(X\) and \(Y\), we say they are Equivalent (or equal in distribution) if for any \(x\in\mathbb{R}\) we have \(\Pr(X\leq x) = \Pr(Y\leq x)\). This is usually denoted \(X\sim Y\).
Algebra of Random Variables. Given some random variables X, Y on the same sample space, we can define new random variables \(X+Y\), \(X - Y\), \(XY\), \(X/Y\) provided Y is never zero, and exponentiation \(X^{Y} = \exp(Y\log(X))\) provided X is always positive. The intuition to have is that the operations are done as operations on real-valued functions.
So, specifically, if \(\omega\in\Omega\), then \((X+Y)(\omega)=X(\omega)+Y(\omega)\), \((X - Y)(\omega) = X(\omega) - Y(\omega)\), \((XY)(\omega) = X(\omega)Y(\omega)\), \((X/Y)(\omega) = X(\omega)/Y(\omega)\) provided \(Y(\omega)\neq0\), and exponentiation \((X^{Y})(\omega) = \exp(Y(\omega)\log(X(\omega)))\).
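In code this pointwise picture is literal: a random variable is just a function on \(\Omega\), and the operations combine function values. A sketch with our own names (the example X is always positive, as the formula \(\exp(Y\log X)\) requires):

```python
import math

Omega = ["HH", "HT", "TH", "TT"]
X = lambda omega: omega.count("H") + 1   # always positive, so X**Y makes sense
Y = lambda omega: omega.count("T")

def rv_add(X, Y): return lambda omega: X(omega) + Y(omega)
def rv_mul(X, Y): return lambda omega: X(omega) * Y(omega)
def rv_pow(X, Y): return lambda omega: math.exp(Y(omega) * math.log(X(omega)))

print([rv_add(X, Y)(omega) for omega in Omega])  # pointwise sums
print([rv_mul(X, Y)(omega) for omega in Omega])  # pointwise products
print([rv_pow(X, Y)(omega) for omega in Omega])  # X(omega) ** Y(omega), as floats
```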
Probability Distributions. We have a few "standard" distributions which are the "template" for various random processes. For example, the number of heads when we flip a coin once follows a Bernoulli Distribution, and the number of heads when we flip it N times follows a Binomial Distribution.
The notation for these families may vary from reference to reference. A Bernoulli distribution with probability p of success is usually denoted \(\mathrm{Bernoulli}(p)\) or \(\mathrm{Ber}(p)\).
To indicate a random variable is distributed like one of these standard distributions, we abuse notation and write \(X\sim \mathrm{Bernoulli}(p)\).
We can build more distributions out of a handful of basic ones. For example, \(Y = X_{1} + \dots + X_{N}\), where the \(X_{j}\sim\mathrm{Bernoulli}(p)\), describes flipping a coin N times and counting the number of "successes" ("heads"). This Y follows the Binomial distribution; for instance, \(\Pr(Y\leq k)\) is the probability that there are at most k heads in N coin flips.
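We can check this numerically: simulate N Bernoulli(p) trials many times and compare the observed frequencies of Y against the binomial probability mass function \(\binom{N}{k}p^{k}(1-p)^{N-k}\). A sketch using only the standard library (the parameter values are our own choices):

```python
import random
from math import comb

N, p, trials = 10, 0.3, 100_000
random.seed(0)

# Y = X_1 + ... + X_N, each X_j ~ Bernoulli(p).
counts = [0] * (N + 1)
for _ in range(trials):
    y = sum(1 for _ in range(N) if random.random() < p)
    counts[y] += 1

for k in range(N + 1):
    empirical = counts[k] / trials
    exact = comb(N, k) * p**k * (1 - p)**(N - k)   # Binomial(N, p) pmf
    print(f"k={k:2d}  empirical={empirical:.4f}  exact={exact:.4f}")
```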
We can specify a probability distribution by its parameters (like p in the Bernoulli distribution), the probability mass function and/or the probability density function. Often it's useful to give other summary statistics alongside this data.
Expected Value. We also have for any random variable X its Expected Value, given by \(\mathbb{E}[X] = \sum_{x\in X(\Omega)}x\Pr(X=x)\) (or replacing the sum with an integral for continuous random variables). The intuition we should have for the expected value of a random variable is that it captures the "average value" of the random variable.
If we have some function \(f\colon\mathbb{R}\to\mathbb{R}\), then we have \(\mathbb{E}[f(X)] = \sum_{x\in X(\Omega)}f(x)\Pr(X=x)\) (and again, an integral instead of a sum for continuous random variables, with the restriction that f is an integrable function).
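Both formulas are direct sums for a discrete random variable. A minimal sketch (the names `pmf` and `expectation` are ours) computing \(\mathbb{E}[X]\) and \(\mathbb{E}[X^{2}]\) for the number of heads in two fair coin flips:

```python
# pmf for X = number of heads in two fair flips.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

def expectation(f, pmf):
    """E[f(X)] = sum over x of f(x) * Pr(X = x)."""
    return sum(f(x) * p for x, p in pmf.items())

print(expectation(lambda x: x, pmf))      # E[X]   = 1.0
print(expectation(lambda x: x**2, pmf))   # E[X^2] = 1.5
```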
Note that, in general, \(\mathbb{E}[X^{2}]\neq(\mathbb{E}[X])^{2}\) and, more generally, \(\mathbb{E}[XY]\neq \mathbb{E}[X]\mathbb{E}[Y]\) (equality holds when X and Y are uncorrelated, e.g. when they are independent). But we do have \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\) and, for any real number \(a\in\mathbb{R}\), \(\mathbb{E}[aX] = a\mathbb{E}[X]\).
Exercise. Let \(E\in\Sigma\) be an event in a sigma algebra, and \(I_{E}\) be the indicator function on E. What is \(\mathbb{E}[I_{E}]\)?
Theorem. Let X and Y be random variables, and let a and b be real numbers. Then (a numerical spot-check follows the list):
- \(\mathbb{E}[aX + b] = a\mathbb{E}[X] + b\)
- \(\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]\)
- \(\displaystyle\mathbb{E}[XY] = \sum_{\omega\in\Omega}X(\omega)Y(\omega)\Pr(\omega)\) (for a discrete sample space, writing \(\Pr(\omega)\) for \(\Pr(\{\omega\})\))
- \(\displaystyle\mathbb{E}[X/Y] = \sum_{\omega\in\Omega}\frac{X(\omega)}{Y(\omega)}\Pr(\omega)\) provided \(Y(\omega)\neq0\) for every \(\omega\in\Omega\)
- \(\displaystyle\mathbb{E}[X^{Y}] = \sum_{\omega\in\Omega}\exp(Y(\omega)\log(X(\omega)))\Pr(\omega)\) provided \(X(\omega)>0\) for every \(\omega\in\Omega\)
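A numerical spot-check of the first two identities on the two-flip sample space, a minimal sketch with the expectation over \(\Omega\) written out explicitly:

```python
from fractions import Fraction

Omega = ["HH", "HT", "TH", "TT"]
Pr = {omega: Fraction(1, 4) for omega in Omega}
X = {omega: omega.count("H") for omega in Omega}
Y = {omega: omega.count("T") for omega in Omega}

def E(Z):
    """Expected value of a random variable given as a dict over Omega."""
    return sum(Z[omega] * Pr[omega] for omega in Omega)

a, b = 3, 2
assert E({w: a * X[w] + b for w in Omega}) == a * E(X) + b   # E[aX + b] = a E[X] + b
assert E({w: X[w] + Y[w] for w in Omega}) == E(X) + E(Y)     # E[X + Y] = E[X] + E[Y]
```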
Variance. If expected value tells us what neighborhood a random variable is likely to live in, the variance tells us how spread out this neighborhood is. We define it as \[\mathrm{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \sum_{\omega}(X(\omega) - \mathbb{E}[X])^{2}\Pr(\omega).\] The variance and expected value for a random variable contain a lot of useful information, which we use when trying to infer parameters from data.
More generally, we have for any two random variables X and Y a measure of how they vary together, the covariance \[\mathrm{Cov}(X,Y) = \sum_{\omega}(X(\omega)-\mathbb{E}[X])(Y(\omega)-\mathbb{E}[Y])\Pr(\omega)\] which is such that \(\mathrm{Cov}(X,X)=\mathrm{Var}[X]\). Correlatedness is not the same as independence: independent random variables are uncorrelated, but uncorrelated random variables may or may not be independent (an "all salmon are fish, but not all fish are salmon" type of statement). So uncorrelated is a "weaker" property than independence.
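The classic example of uncorrelated but dependent random variables takes X uniform on \(\{-1,0,1\}\) and \(Y = X^{2}\): then \(\mathrm{Cov}(X,Y)=0\) even though Y is completely determined by X. A sketch (the helper names are ours):

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X^2: dependent, yet uncorrelated.
Omega = [-1, 0, 1]
Pr = {omega: Fraction(1, 3) for omega in Omega}
X = {omega: omega for omega in Omega}
Y = {omega: omega**2 for omega in Omega}

def E(Z):
    return sum(Z[w] * Pr[w] for w in Omega)

def cov(A, B):
    """Cov(A, B) = sum over omega of (A - E[A])(B - E[B]) Pr(omega)."""
    mA, mB = E(A), E(B)
    return sum((A[w] - mA) * (B[w] - mB) * Pr[w] for w in Omega)

print(cov(X, Y))   # 0: uncorrelated, despite Y being a function of X
print(cov(X, X))   # 2/3, which equals Var[X]
```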
Applications?
Iterate! Note that we can treat the parameters of these distributions (like p, the probability of success in a Bernoulli trial) as random variables themselves. This is precisely what Bayesian data analysis does: the parameters are random variables following prior probability distributions, which we update as new data becomes available using Bayes's theorem.
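For instance, treating the Bernoulli parameter p as a random variable with a \(\mathrm{Beta}(\alpha,\beta)\) prior, observing k successes in n trials updates it to a \(\mathrm{Beta}(\alpha+k,\beta+n-k)\) posterior (the standard conjugate pair; the numbers below are made up for illustration):

```python
# Beta-Bernoulli conjugate update: prior Beta(alpha, beta), data = k successes in n trials.
alpha, beta = 1, 1     # uniform prior on p
k, n = 7, 10           # hypothetical data: 7 heads in 10 flips

alpha_post, beta_post = alpha + k, beta + (n - k)
posterior_mean = alpha_post / (alpha_post + beta_post)
print(f"posterior: Beta({alpha_post}, {beta_post}), mean {posterior_mean:.3f}")
```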
Regressions! A linear regression basically says that the observations are really values of a random variable, i.e., \(Y\sim\mathcal{N}(aX + b, \sigma^{2})\) where \(\mathcal{N}\) is the normal distribution. There are other useful regressions, but this is the basic idea.
We should admit that this is one formulation of regressions in terms of random variables. The other uses conditional random variables, \(Y|X\sim f(\beta\cdot X,\theta)\), when the regressors (X) are stochastic (i.e., "not controlled by the experimenter/statistician"). Formally these are different models, but when actually carrying out the regression they are treated "the same".
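A minimal sketch of the first formulation, assuming NumPy is available: simulate data from \(Y = aX + b + \text{noise}\) and recover a and b by least squares, which is the maximum-likelihood fit under the normal model (the true values 2 and 1 are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, size=n)   # Y ~ N(2x + 1, 1.5^2)

# Least squares for [a, b] in y = a*x + b, i.e. the MLE under the normal model.
A = np.column_stack([x, np.ones(n)])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
print(a_hat, b_hat)   # should come out close to 2 and 1
```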
Tests! We often do an experiment, producing some data points \(x_{1},\dots,x_{n}\), which we interpret as values of a random variable X following a prescribed distribution. We test the assumption (that X follows the given distribution with specific parameters) by comparing the sample mean \[ \mu = \frac{1}{n}\sum_{j=1}^{n}x_{j} \] to the expected value \(\mathbb{E}[X]\). The central limit theorem suggests that \(\sqrt{n}(\mu - \mathbb{E}[X])\) looks like a normal random variable centered at 0 with variance approximately \(\mathrm{Var}[X]\), which we estimate by the variance of the data points (loosely speaking).
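We can watch the central limit theorem at work by simulation: draw many samples of size n from a non-normal distribution (here Bernoulli(0.3), our own choice), compute \(\sqrt{n}(\mu - \mathbb{E}[X])\) for each sample, and look at its mean and spread. A sketch using only the standard library:

```python
import random
import statistics

random.seed(0)
p, n, reps = 0.3, 200, 5_000
EX, VarX = p, p * (1 - p)          # E[X] and Var[X] for Bernoulli(p)

zs = []
for _ in range(reps):
    sample = [1 if random.random() < p else 0 for _ in range(n)]
    mu = sum(sample) / n           # sample mean
    zs.append(n**0.5 * (mu - EX))  # sqrt(n) * (mu - E[X])

print(statistics.mean(zs))       # close to 0
print(statistics.variance(zs))   # close to Var[X] = 0.21
```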
Reading List
- Discrete Random Variables and their Expected Values, MIT OCW 18-05 Introduction to Probability and Statistics
- Expected Value and Variance, lecture notes.
- Statistical Machine Learning, by Han Liu and Larry Wasserman, specifically chapter 12 on Bayesian inference using random variables.