Monday, June 22, 2020

Jaynes' Probability Theory

I've been reading through E.T. Jaynes' Probability Theory: The Logic of Science, published posthumously in 2003. The book is difficult to describe; I suppose it is a "polemic" in the most positive sense of the word.

My own interest in the book was piqued by Jaynes' "robot". I asked myself: why don't we have something like Siri for scientists? A robot lab assistant would be useful, I imagine. Why not? Jaynes motivates the book as an attempt to teach a robot how to reason with empirical evidence; we, the readers, will be "programming" the robot's brain.

Chapter 1

Jaynes begins by discussing plausible reasoning. Classical logic gives us definite inference rules, the most famous being modus ponens:

A implies B
A is true
B is true.
"Dual" to this is modus tollens (proof by contrapositive)
A implies B
B is false
A is false.
Following Polya, Jaynes introduces several "weaker syllogisms", like
A implies B
B is true
A is more plausible.
There are a few other "weak syllogisms" where the conclusion modifies the plausibility of a proposition being true. (I omit them; if you are interested, please read the first chapter!)

Jaynes introduces his notation for propositions, writing things like \(A\mid B\) for the plausibility that proposition \(A\) is true given some background information \(B\). Conjunction of propositions \(A\) and \(B\) is written \(AB\), disjunction is written \(A+B\). We denote the negation of \(A\) as \(\bar{A}\) (writing a bar over \(A\)). We read \(A\mid BC\) as \(A\mid(BC)\), and \(A+B\mid C\) as \((A+B)\mid C\).

There's one assumption we make about the plausibilities, which Jaynes makes clear (but readers may not fully digest or appreciate at first glance):

Axiom. If we write \(A\mid BC\), then we assume propositions \(B\) and \(C\) are not contradictory. Should someone write down \(A\mid BC\) when \(BC\) is contradictory, we make no attempt to make sense of it, and discard it as meaningless. (End of Axiom)

Further, plausibilities are always considered with some background information. That is to say, we always work with statements of the form \(A\mid B\).

The first chapter concludes with specifications of his robot's behavior, which Jaynes calls his "desiderata" [something that is needed or wanted]. They are not "axioms" because they are not asserted to be truths. The specs are:

Spec. 1. Degrees of plausibility are represented by real numbers.

Spec. 2. Qualitative correspondence with common sense.

This was rather vague to me, but I think Jaynes refers to the basic scheme: if old information \(C\) is updated to \(C'\) in such a way that \((A\mid C')\gt(A\mid C)\) while the plausibility of \(B\) given \(A\) is unchanged, \[ (B\mid AC') = (B\mid AC), \] then (1) the plausibility that both \(A\) and \(B\) are true can only increase, \[(AB\mid C')\geq (AB\mid C),\] and (2) the plausibility of the negation of \(A\) can only decrease, \[(\overline{A}\mid C')\leq (\overline{A}\mid C).\]

Spec. 3. The robot reasons consistently.

This is actually given by three sub-specifications:

Spec. 3A. If a conclusion may be reasoned out by more than one way, then every way produces the same result.

This cryptic specification 3A means: if \(AB\mid C\) may be obtained either by (1) combining \(A\mid C\) with \(B\mid AC\), or by (2) combining \(B\mid C\) with \(A\mid BC\), then both routes must produce the same plausibility.

Spec. 3B. The robot takes into account all evidence available to it. The robot does not arbitrarily throw away information.

Generically, this is a good rule of thumb for everyone.

With regards to programming a robot, I was left wondering: how does a robot know what some "raw data" is evidence of? How does a robot ascertain when some evidence is relevant to a given proposition?

Spec. 3C. The robot represents equivalent states of knowledge by equivalent plausibility assignments. (I.e., if two robots have the same information but label the propositions differently, the plausibilities assigned — once the propositions from robot 1 are identified with those from robot 2 — must be the same.)

As a mathematician, I was alerted to the word equivalent. Is this an equivalence relation? Or is it equality "on the nose"? Can we weaken it a little? Can we pad it to within some margin of error? I'm not sure, but it seems to be at least equality up to some relabeling (symmetry) of the propositions.

Jaynes calls specifications 1, 2, and 3A "structural" (governing the structure of the robot's brain), while 3B and 3C are "interface" conditions telling us how the robot's brain relates to the world.

Comment 1.8.1. Common Language. Jaynes concludes each chapter with commentary and observations. Some are outright polemical, others insightful. Here Jaynes points out that English conflates ontological statements (There is noise in the room) with epistemological statements (The room is noisy). Confusing the two amounts to asserting that one's own private thoughts and sensations are realities existing externally in nature. Jaynes calls this the Mind Projection Fallacy. It's a recurring phrase in his book.

Chapter 2

This chapter mostly consists of Jaynes' idiosyncratic derivation of Cox's theorem. Judging from the notation used, I think a lot of it is implicitly nodding to statistical mechanics in physics, but I don't know sufficient thermodynamics & statistical physics to get the references.

The tl;dr version of Cox's theorem is: plausibility is represented by some "sufficiently reasonable" function assigning real numbers to propositions of the form \(A\mid B\). Once we impose some very reasonable conditions on this assignment of real numbers, we can rescale it to produce numbers in the interval [0, 1]. Thus we can identify this transformed, rescaled assignment of plausibilities as Probability.

In classical logic, propositions are either true or false. Probability generalizes this, incorporating it by writing \(p(A)=1\) when \(A\) is true and \(p(A)=0\) when \(A\) is false.

The two rules absolutely essential to this probability function are (1) the product rule \[p(AB\mid C)=p(A\mid BC)p(B\mid C) = p(B\mid AC)p(A\mid C)\] and (2) the sum rule \[p(A+B\mid C) = p(A\mid C) + p(B\mid C) - p(AB\mid C).\] These are the critical rules of probability that Jaynes uses frequently throughout, well, the next few chapters (i.e., as much as I've read thus far).
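To convince myself these two rules behave as advertised, here is a quick numerical check in Python. This is a sketch on a made-up toy joint distribution over two propositions A and B (the background information C is left implicit); none of it is from the book.

# A made-up joint distribution over two binary propositions A and B,
# conditioned on background information C (left implicit).
joint = {
    (True,  True):  0.30,
    (True,  False): 0.20,
    (False, True):  0.15,
    (False, False): 0.35,
}

def p(event):
    """Plausibility that `event` (a predicate on the truth values of A, B) holds."""
    return sum(prob for (a, b), prob in joint.items() if event(a, b))

def p_given(event, given):
    """Conditional plausibility p(event | given, C)."""
    return p(lambda a, b: event(a, b) and given(a, b)) / p(given)

A = lambda a, b: a
B = lambda a, b: b

# Product rule: p(AB|C) = p(A|BC) p(B|C) = p(B|AC) p(A|C)
p_AB = p(lambda a, b: a and b)
assert abs(p_AB - p_given(A, B) * p(B)) < 1e-12
assert abs(p_AB - p_given(B, A) * p(A)) < 1e-12

# Sum rule: p(A+B|C) = p(A|C) + p(B|C) - p(AB|C)
assert abs(p(lambda a, b: a or b) - (p(A) + p(B) - p_AB)) < 1e-12
print("product and sum rules hold on the toy distribution")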

Section 2.3. Qualitative Properties. What's really intriguing is Jaynes' translation of syllogisms to this probability notation. Generically, a derivation with n premises \(B_{1},\dots,B_{n}\) and a conclusion \(A\) is written in logic as

\(B_{1}\)
...
\(\underline{B_{n}}\)
\(A\)
In probability, this corresponds to \(p(A\mid B_{1}\dots B_{n})\).

Jaynes uses the product rule to derive modus ponens and modus tollens. But that's not all! Jaynes then derives the "weak syllogisms" of chapter 1, showing exactly how a proposition's plausibility changes as evidence shifts.

And, in an impressive moment of triumph, it turns out the "weak syllogisms" introduced in the first chapter correspond to a Bayes update of the probabilities. Having spent a lot of time with Polya's patterns of plausible reasoning, it was delightful to see them formalized in Bayesian terms.
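Roughly, the derivation of the weak syllogism I quoted above goes like this: suppose the background information \(C\) contains "\(A\) implies \(B\)", so that \(p(B\mid AC)=1\). Rearranging the product rule (i.e., Bayes' theorem) gives \[p(A\mid BC) = p(A\mid C)\frac{p(B\mid AC)}{p(B\mid C)} = \frac{p(A\mid C)}{p(B\mid C)} \geq p(A\mid C),\] since \(p(B\mid C)\leq 1\). Learning that \(B\) is true can only raise the plausibility of \(A\), exactly as promised in chapter 1.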

2.4. Assigning numerical values. At this point, I was saying to myself, "Yeah, ok, this is fascinating and all, but I can't program this. I don't even know how to assign plausibilities to a proposition yet." Jaynes tackles that issue here.

If we have N mutually exclusive and exhaustive propositions \(A_{1},\dots,A_{N}\) given some background information B, then we necessarily have \[\sum^{N}_{k=1}p(A_{k}\mid B)=1.\] Jaynes gives a very brief argument by symmetry: since we don't know anything more about these propositions, and we can relabel them arbitrarily without changing the state of knowledge, we must have the Principle of Indifference, i.e., \[p(A_{k}\mid B) = \frac{1}{N}\] for every k.

Here is a moment of triumph: suppose a second robot, with plausibility assignments \(p_{II}(-)\), has accidentally relabeled the propositions so that its \(A_{\pi(i)}\) is our \(A_{i}\) (for some permutation \(\pi\) of the indexing set). If both robots have the same information, then necessarily \(p(A_{i}\mid B)=p_{II}(A_{\pi(i)}\mid B)\)...because that is what the data supports. From this, we "derive" the principle of indifference. And a general rule emerges: the data given to the robot determines the function \(p(-)\).

This is also where Jaynes first starts using the term Probability. For him, it is the "Kelvin scale" of plausibilities, the natural system of units which is universally translatable. (American scientists know the difficulty of dealing with Fahrenheit and Celsius; well, the same difficulty occurs with two different agents trying to express plausibility assessments. Probability is the natural scale permitting translation.)

In a similar situation, if we have N mutually exclusive and exhaustive propositions \(H_{1}, \dots, H_{N}\) given some background information B, and if A is true on M of these hypotheses, then \[p(A\mid B) = \frac{M}{N}.\] Although simple, these rules give us quite a bit of mileage.
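As a toy illustration of the M/N rule (my own example, not from the book): take the six mutually exclusive, exhaustive hypotheses "the die shows k" for k = 1, ..., 6, and let A be "the die shows an even number". Indifference gives each hypothesis plausibility 1/6, and A is true on M = 3 of the N = 6 hypotheses:

from fractions import Fraction

# Hypotheses H_1, ..., H_N: "the die shows k" for k = 1, ..., 6.
hypotheses = list(range(1, 7))
N = len(hypotheses)

# A = "the die shows an even number"; count the hypotheses on which A is true.
A = lambda k: k % 2 == 0
M = sum(1 for k in hypotheses if A(k))

print(Fraction(M, N))  # 1/2, i.e. M/N with M = 3 and N = 6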

2.5. Notation, finite sets of propositions. When dealing with probabilities of propositions being true, Jaynes uses capital letters, as in \(P(A\mid B)\). But for, e.g., the probability density of a particular distribution, lowercase letters are used, like \(h(x\mid r, n, p)\).

Also, for technical reasons inherent in Cox's theorem, Jaynes will be working with finitely many propositions.

2.6.1. Comment on "Subjective" versus "Objective" probability. The terms "subjective" and "objective" are thrown around wildly in probability theory. Jaynes contends every probability assignment is necessarily "subjective", in the sense that it reflects only a state of knowledge and not anything measurable in a physical experiment. And if one asks, "Whose state of knowledge?", well, it's the robot's (or that of anyone else who reasons according to the specifications laid out).

Probability then becomes a way of expressing (or encoding) one's information about a subject, regardless of one's personal feelings, hopes, fears, desires, etc., concerning it. This is the function of specifications 3B and 3C: they make the probability assignments "objective", in this sense of "it consistently encodes one's information about a subject".

Chapter 3. Elementary Sampling Theory

Jaynes helpfully begins by reviewing what has been established so far: (1) the product rule \[p(AB\mid C)=p(A\mid BC)p(B\mid C) = p(B\mid AC)p(A\mid C)\] and (2) the sum rule \[p(A+B\mid C) = p(A\mid C) + p(B\mid C) - p(AB\mid C).\] There's the "law of excluded middle" \[p(A\mid B) + p(\bar{A}\mid B) = 1.\] The principle of indifference: if, given background information B, the mutually exclusive hypotheses \(H_{i}\) (for \(i=1,\dots,N\)) are exhaustive, and B does not favor any particular hypothesis, then \[p(H_{i}\mid B) = \frac{1}{N}\] for any \(i=1,\dots,N\). And if a proposition A is true on some M of the hypotheses \(H_{i}\), and false on the remaining \(N-M\), then \[p(A\mid B) = \frac{M}{N}.\] That's it, that's all we need. It's amazing how little is actually needed to do probability theory.
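Since the whole point is that this short list suffices to "program" the robot on finite problems, here is a minimal sketch of what such a brain might look like in Python. This is my own toy encoding, not Jaynes': propositions are predicates over a finite set of mutually exclusive, exhaustive hypotheses that the background information treats indifferently, and the robot simply counts.

from fractions import Fraction

class Robot:
    """A Jaynes-style reasoner over finitely many mutually exclusive,
    exhaustive hypotheses, all treated indifferently by the background info B."""

    def __init__(self, hypotheses):
        self.hypotheses = list(hypotheses)

    def p(self, A, given=lambda h: True):
        """p(A | given, B), computed by counting hypotheses (the M/N rule)."""
        live = [h for h in self.hypotheses if given(h)]
        if not live:
            raise ValueError("conditioning on a contradiction is meaningless")
        return Fraction(sum(1 for h in live if A(h)), len(live))

# Example: two dice, indifferent over the 36 outcomes.
robot = Robot((i, j) for i in range(1, 7) for j in range(1, 7))
doubles = lambda h: h[0] == h[1]
total_at_least_10 = lambda h: h[0] + h[1] >= 10
print(robot.p(doubles))                          # 1/6
print(robot.p(doubles, given=total_at_least_10)) # 1/3

The ValueError mirrors the Axiom from chapter 1: the robot refuses to make sense of a contradictory conditioning proposition.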

Section 3.1. Sampling without replacement. We are considering an urn problem, with the following propositions:

  • B = an urn contains N balls, all physically identical and indistinguishable, aside from being numbered from 1 to N (the presence of numbers does not alter the physical properties of the balls). Further, M of the balls are red and the remaining \(N-M\) balls are white. We draw one ball at a time, observe its color, then set it aside. We do this n times, for \(0\leq n\leq N\).
  • \(R_{i}\) = the i-th draw is a red ball
  • \(W_{i}\) = the i-th draw is a white ball
We can treat the propositions \(W_{i}\) and \(R_{i}\) as negations of each other, i.e., \[\bar{W}_{i}=R_{i},\quad\mbox{and}\quad\bar{R}_{i}=W_{i}.\]

The reader can verify \[p(R_{1}\mid B) = \frac{M}{N} \tag{3.1.1}\] and \[p(W_{1}\mid B) = 1 - \frac{M}{N}. \tag{3.1.2}\] These probability assignments reflect our robot's state of knowledge. It's invalid to speak of "verifying" these equations experimentally: our concern is about reasoning with incomplete information, not assertions of physical fact about what will be drawn from the urn.

Jaynes next works through a few calculations. I think it's better if presented as exercises for the reader:

Exercise 1. What is the probability of drawing r red balls, one after another? I.e., compute \(p(R_{r}R_{r-1}\dots R_{2}R_{1}\mid B)\).

Exercise 2. What is the probability of drawing w white balls, one after another? I.e., compute \(p(W_{w}W_{w-1}\dots W_{2}W_{1}\mid B)\).

Exercise 3. What is the probability, given the first r draws are red balls, that draws \(r+1\), ..., \(r+w\) are white balls? I.e., compute \(p(W_{w+r}W_{w-1+r}\dots W_{2+r}W_{1+r}\mid R_{r}R_{r-1}\dots R_{2}R_{1}B)\).

From Exercises 1 and 3, we conclude the probability of drawing r red balls followed by w white balls (with \(n = r + w\) draws total) is given by: \[p(W_{w+r}\dots W_{1+r} R_{r}\dots R_{1}\mid B) = \frac{M!(N-M)!(N-n)!}{(M-r)!(N-M-w)!N!}.\] Curiously, this is the probability for one particular sequence, namely r red balls followed by w white balls. If the order were permuted (say, w white balls followed by r red balls), the probability would be the same; only the numbers r and w matter.
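A quick way to check the order-invariance numerically is to chain the product rule draw by draw: each draw contributes a factor of (balls of that color remaining)/(balls remaining). Here is a small sketch of that check (my own, with made-up numbers N = 10 and M = 4):

from fractions import Fraction

def sequence_probability(colors, N, M):
    """p(this exact color sequence | B), chaining the product rule:
    each draw contributes (#remaining of that color) / (#remaining in urn)."""
    red, white = M, N - M
    prob = Fraction(1)
    for c in colors:
        total = red + white
        if c == "R":
            prob *= Fraction(red, total)
            red -= 1
        else:
            prob *= Fraction(white, total)
            white -= 1
    return prob

N, M = 10, 4
print(sequence_probability("RRWW", N, M))  # reds first: 1/14
print(sequence_probability("WRWR", N, M))  # same counts, different order: 1/14

Both orderings give 1/14, matching the closed-form expression above with r = w = 2 and n = 4.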

How many ways are there to draw exactly r red balls out of n drawings? It's the binomial coefficient, \[\binom{n}{r} = \frac{n!}{r!(n-r)!}.\]

Now take A to be the proposition "Exactly r red balls out of n draws, in any order"; then \[h(r\mid N, M, n) := p(A\mid B) = \binom{n}{r}p(W_{w+r}\dots W_{1+r} R_{r}\dots R_{1}\mid B),\] which can be expanded out as \[h(r\mid N, M, n) = \frac{\binom{M}{r}\binom{N-M}{n-r}}{\binom{N}{n}}.\] Astute readers will recognize this as the hypergeometric distribution.
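Here is a small numerical sanity check of that formula (again my own sketch, not the book's): compute \(h(r\mid N,M,n)\) with Python's math.comb and compare it against a Monte Carlo estimate obtained by actually sampling balls without replacement.

import random
from math import comb

def h(r, N, M, n):
    """Hypergeometric probability of exactly r red balls in n draws
    without replacement from an urn of M red and N - M white balls."""
    return comb(M, r) * comb(N - M, n - r) / comb(N, n)

N, M, n = 20, 7, 5
urn = ["R"] * M + ["W"] * (N - M)

# Monte Carlo: repeatedly draw n balls without replacement and tally the reds.
trials = 100_000
counts = [0] * (n + 1)
for _ in range(trials):
    counts[random.sample(urn, n).count("R")] += 1

for r in range(n + 1):
    print(r, round(h(r, N, M, n), 4), round(counts[r] / trials, 4))

The exact and estimated columns should agree to within Monte Carlo noise.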

Exercise 4. What is the most probable r? (Hint: solve \(h(r\mid N, M, n) = h(r-1\mid N, M, n)\) for r.)

Exercise 5. What is the probability of drawing a red ball on the third draw given no information on the first two draws? Is it dependent or independent of the prior draws? (Hint: the proposition being considered is \(R_{3}(W_{2}+R_{2})(W_{1}+R_{1})\mid B\), so use the product and sum rules to compute the probability.)

This wraps up the first section of chapter 3. The next section is a discussion of the role of causality in statistical physics as an invalid way to think about probability. It's also as far as I've gotten: there are about 10 more sections in the third chapter, and I just haven't read them yet. Jaynes is a real treat if you have a solid understanding of probability theory and a good working knowledge of statistics. Maybe I'll continue adding my reading notes, if there's any interest.
