Lecture on Probability


Probability theory

An algebra is a collection of propositions closed under the Boolean operations {and, or, not}.

A probability function P satisfies the following axioms:

  1. P(A) >= 0.
  2. P(A) = 1, if A is a tautology.
  3. P(A or B) = P(A) + P(B), provided that A and B are mutually incompatible.
  4. (Optional: countable additivity) P(A(0) or A(1) or A(2) or ...) = P(A(0)) + P(A(1)) + P(A(2)) + ..., provided that the A(i) are pairwise incompatible.

Conditional probability defined

If P(B) > 0 then P(A | B) = P(A & B)/P(B)
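
To make the definitions concrete, here is a minimal sketch in Python (the worlds, the weights, and the propositions A and B are all invented for illustration). Propositions are represented as sets of possible worlds, so the Boolean operations of the algebra become set operations, and the definition of conditional probability can be computed directly:

    from fractions import Fraction

    # A finite possibility space: each world carries a probability weight.
    worlds = {'w1': Fraction(1, 2), 'w2': Fraction(1, 4), 'w3': Fraction(1, 4)}

    def P(prop):
        """Probability of a proposition, represented as a set of worlds."""
        return sum(worlds[w] for w in prop)

    def cond(A, B):
        """P(A | B) = P(A & B)/P(B), defined only when P(B) > 0."""
        if P(B) == 0:
            raise ValueError("conditional probability undefined when P(B) = 0")
        return P(A & B) / P(B)              # set intersection plays the role of '&'

    A, B = {'w1'}, {'w1', 'w2'}
    assert P(set(worlds)) == 1              # axiom 2: the sure proposition gets 1
    assert P(A.union(B)) == P(A) + P(B - A) # axiom 3: additivity over incompatible parts
    print(cond(A, B))                       # 2/3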


Interpretations

The above definition is mathematical. It is another question how to interpret it.


Ontic Interpretations (probabilities in the world)


Frequentism: the probability of A is the relative frequency (in the limit) of A-type outcomes in a suitable sequence of trials.

Difficulties: what is the probability of the single case? Which reference class of trials should be used? And limiting relative frequencies depend on the order in which the trials are taken.


Propensity:

Something counterfactual or modal rather than merely actual about the world.


Epistemic interpretations (pertaining to belief)


Classical: the probability of A is the ratio of the cases favorable to A to all of the equally possible cases.

Difficulties: which cases are "equally possible"? If equally possible cases are just equally probable cases, the definition is circular.


Logical (Rudolf Carnap)

"Partial Entailment"

Difficulties:

Goodman's problem:


Personal Probability Theory

Dutch Book Argument:

Bets: a bet ($s, A) costs its price up front and pays the prize $s if A is true, and nothing otherwise.

Negative prizes and selling: buying a bet with a negative prize, such as ($-1, A), amounts to selling the bet ($1, A).

"Put your money where your mouth is" principle: your fair price for the bet ($s, A) is $s times your degree of belief in A.

Packages of bets: collections of bets bought together.

"Package" principle: your fair price for a package of bets is the sum of your fair prices for its members.

Coherence: your degrees of belief are coherent just in case betting at your own fair prices cannot expose you to a guaranteed loss.

Dutch Book: a package of bets, each bought or sold at your fair prices, that yields you a net loss no matter how the world turns out.

Dutch Book theorem: your degrees of belief are coherent if and only if they satisfy the axioms of probability.

Project: Degrees of Dutch book. It is widely objected that it is impossible, both for practical and for computational reasons, to be probabilistically coherent. One response to this objection is to define degrees of incoherence in terms of how much money an evil bettor could be guaranteed to milk from you per unit stake. In other words, in a single betting cycle, the evil bettor has to put up money to induce you to bet. The more incoherent you are, the less money he has to put up to get a given surefire return from his dealings with you.
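
Here is a toy sketch in Python of the simplest case of the proposed measure (the credences are invented, and normalizing by the money at stake is just one way the project might fix "per unit stake"). The agent posts degrees of belief over a mutually exclusive and exhaustive partition; a bookie trading $1 bets at the agent's own prices is guaranteed the amount by which the credences fail to sum to 1:

    def milk_rate(credences):
        """Degrees of belief over a mutually exclusive, exhaustive partition.
        If they sum to s > 1, the bookie sells the agent a $1 bet on each cell
        at the agent's own prices, collecting s and paying out exactly 1;
        if s < 1, the bookie buys those bets instead, paying s and collecting 1.
        Either way the guaranteed profit is |s - 1|; dividing by the money
        changing hands is one (hypothetical) way to normalize per unit stake."""
        s = sum(credences)
        return abs(s - 1) / max(s, 1)

    print(milk_rate([0.6, 0.6]))   # 0.1666...: incoherent, cheaply milked
    print(milk_rate([0.6, 0.4]))   # 0.0: coherent, nothing can be guaranteed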

Objections:


Personalist Evidence and Learning

Recall:

If P(B) > 0 then P(A | B) = P(A & B)/P(B).

This is just a definition. In personalism, this defined concept is used as a procedure for learning.

Let Pn(A) be the agent's degrees of belief at stage n.

Suppose that at time n+1 the agent learns E. Then the personalist idea for learning is:

Pn+1(A) = Pn(A | E), for each proposition A.

This procedure is called updating by conditionalization or Bayesian updating.
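
A minimal sketch of the procedure in Python (the possibility space and the stage-n degrees of belief are hypothetical): conditionalizing on E zeroes out the worlds where E fails and renormalizes the rest:

    def conditionalize(P_n, E):
        """Return P_{n+1}(.) = P_n(. | E): zero the worlds where E fails, renormalize."""
        total = sum(p for w, p in P_n.items() if w in E)
        return {w: (p / total if w in E else 0.0) for w, p in P_n.items()}

    # Hypothetical stage-n degrees of belief over four weather worlds.
    P0 = {'rain&cold': 0.3, 'rain&warm': 0.2, 'dry&cold': 0.1, 'dry&warm': 0.4}
    E = {'rain&cold', 'rain&warm'}            # at stage n+1 the agent learns: rain
    P1 = conditionalize(P0, E)
    print(P1['rain&cold'])                    # 0.3/0.5 = 0.6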


Diachronic Dutch Book

The learning idea is an extra assumption requiring extra justification. Is there a Dutch book argument for it? In a sense, yes.

Let

a = Pn(A | B), and let c = Pn+1(A) be the degree of belief in A that the agent will adopt upon learning B, where c differs from a (say c < a), so that the agent violates conditionalization.

By definition:

Pn(A | B) = Pn(A & B)/Pn(B),

so we have

Pn(A & B) = a Pn(B).

Objections

The package principle

Earlier we objected that the package principle is invalid across time. But in this case, the package is

($1, A & B), ($(c - a), B), ($x, not-B), and ($-1, A),

and it is clear that ($1, A & B), ($(c - a), B), and ($x, not-B) are bought at stage n whereas ($-1, A) is only purchased at n+1, after observing B. So in this case, there is no way to ignore the fact that the alleged "incoherence" involves different temporal stages of the same agent.

But something is wrong here, for ultimately, what is at stake in Dutch book is preference among acts, and changes of preference through time are not normally thought to be irrational. We learn to appreciate red wine and some even come to appreciate Wagner (!), even though they couldn't have dreamed of paying for these things in earlier years.

Using self-knowledge

Moreover, if I know now that I fail to update by conditioning, then, seeing the Dutch book I will be subjected to if I buy the first package of bets, I should refuse to buy it. The argument succeeds only because it does not permit me to use any self-knowledge about my future opinions when I buy bets in the present. But from my current viewpoint, my future degrees of belief and the choices based upon them are future states of the world about which I should have degrees of belief now.


Bayes' Theorem

P(H | E) = P(E | H)P(H)/P(E).

Terminology

P(H) = the prior probability of H.

P(E | H) = the likelihood of E.

P(H | E) = the posterior probability of H.

P(E) = the prior probability of E.

Total Probability Theorem

P(E) = P(E | H)P(H) + P(E | not-H)P(not-H).
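
A worked example with hypothetical numbers shows how the two formulas fit together: the total probability theorem supplies P(E), which Bayes' theorem then consumes:

    # Hypothetical numbers: a rare hypothesis and a fairly reliable test for it.
    P_H = 0.01                                 # prior P(H)
    P_E_given_H, P_E_given_notH = 0.95, 0.05   # likelihoods

    # Total probability theorem supplies the prior of the evidence:
    P_E = P_E_given_H * P_H + P_E_given_notH * (1 - P_H)

    # Bayes' theorem then yields the posterior:
    P_H_given_E = P_E_given_H * P_H / P_E
    print(P_E, P_H_given_E)                    # 0.059 and roughly 0.161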

The problem of the catch-all hypothesis:

So why is the prior of the evidence E so hard to find? By the total probability theorem,

P(E) = P(E | H)P(H) + P(E | not-H)P(not-H).

But P(E | not-H) is nearly as inaccessible as P(E) itself. The hypothesis not-H is called the "catch-all" hypothesis, because it says "anything but H". We need to continue chopping mutually incompatible hypotheses out of not-H and apply the total probability rule repeatedly to get

P(E) = P(E | H)P(H) + P(E | H')P(H') + P(E | H'')P(H'') + ...

But this expresses P(E) in terms of all possible hypotheses! Who knows what the possible hypotheses are? They haven't been thought of yet!

In cases of exciting scientific change, a genius (Bohr, Einstein, Darwin, Maxwell, Fresnel, Kepler, etc.) thinks of a new, plausible alternative hypothesis H' that nobody thought of before such that the likelihood P(E | H') is high (i.e. H' is a good, plausible explanation). This causes a revolutionary shift of probability mass from the old hypothesis H to the new one H', and hence onto not-H. Such shifts are due not to new data but to new ideas. Revolutionary scientific change always has this character because what is inconceivable cannot be probable.

Likelihood ratios

The troublesome quantity P(E) can be avoided if we always compare the empirical support of two hypotheses against each other. Then the P(E)s cancel:

P(H | E)/P(H' | E) = P(E | H)P(H)/(P(E | H')P(H')).

But then we obtain no absolute grasp of the posterior probability: we can merely compare the probabilities of hypotheses on the table. Hence, even the subjective degree of belief in a theory is a fairly idealized quantity unless the alternatives are artificially constrained.
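
A short sketch of the cancellation (the priors and likelihoods are invented): the posterior ratio of two rival hypotheses is computed without P(E) ever appearing:

    def posterior_ratio(prior_1, like_1, prior_2, like_2):
        """P(H | E)/P(H' | E) = P(E | H)P(H) / (P(E | H')P(H')); P(E) cancels."""
        return (like_1 * prior_1) / (like_2 * prior_2)

    # Hypothetical priors and likelihoods for two hypotheses on the table:
    print(posterior_ratio(0.3, 0.8, 0.2, 0.1))   # 12.0: H is 12 times as probable as H' on E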


Relevant Information

If information E is irrelevant to H, then our degree of belief in H does not change upon learning that E. If E is positively relevant to H, then learning E makes us more confident that H. If E is negatively relevant to H, then learning E makes us less confident that H.

Unfortunately, the obtuse, technical term "statistical independence" has been attached to the concept of irrelevance. An equivalent, perhaps more familiar formulation of this concept is as follows:

    Proposition:

    1. P(H | E) = P(H) <===> P(H & E) = P(H)P(E).
    2. P(H | E) > P(H) <===> P(H & E) > P(H)P(E).
    3. P(H | E) < P(H) <===> P(H & E) < P(H)P(E).

    Proof:

      P(H | E) = P(H) <===> P(H & E)/P(E) = P(H) <===> P(H & E) = P(H)P(E).

      P(H | E) > P(H) <===> P(H & E)/P(E) > P(H) <===> P(H & E) > P(H)P(E).

      P(H | E) < P(H) <===> P(H & E)/P(E) < P(H) <===> P(H & E) < P(H)P(E).

Moreover, both positive and negative relevance are symmetric: since P(H & E) = P(E & H), each condition on the right-hand side above holds with H and E interchanged.

Hence, positive relevance, negative relevance, and irrelevance are symmetric, as is intuitive.
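
A quick numerical check of the proposition and of symmetry, on an invented joint distribution over H and E:

    # An invented joint distribution over the four combinations of H and E.
    P_HE, P_HnotE, P_notHE, P_notHnotE = 0.20, 0.10, 0.30, 0.40
    assert abs(P_HE + P_HnotE + P_notHE + P_notHnotE - 1) < 1e-9
    P_H, P_E = P_HE + P_HnotE, P_HE + P_notHE      # marginals: 0.30 and 0.50

    print(P_HE / P_E > P_H)     # P(H | E) > P(H): True  (0.40 > 0.30)
    print(P_HE / P_H > P_E)     # P(E | H) > P(E): True  (0.666... > 0.50)
    print(P_HE > P_H * P_E)     # the symmetric product test: True (0.20 > 0.15)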


Bayesian Confirmation

Positive relevance provides us with a quantitative concept of confirmation. Many philosophers believe that this has a lot to do with justification. In fact, conditional probability provides an intuitively appealing theory of justification.

Refutation is bad: if P(E | H) = 0 and E is learned, then P(H | E) = 0.

Plausible competitors are bad: each plausible competitor H' contributes P(E | H')P(H') to P(E), which sits in the denominator of Bayes' theorem, driving P(H | E) down.

Refuting plausible competitors helps: when a plausible competitor is refuted, its share of P(E) disappears, and P(H | E) rises accordingly.

Surprising predictions are good: if P(E) is low but P(E | H) is high, then the boost P(H | E)/P(H) = P(E | H)/P(E) is large.

Prediction of apparently independent phenomena is good: if E and E' are antecedently independent, then P(E & E') = P(E)P(E') is small, so a theory that makes both probable receives a large boost from their joint occurrence.

Unification of apparently independent phenomena is good:

Predicting disparate phenomena is one thing, but it is another matter to "unify" disparate phenomena, revealing that the one really does provide information about the other.

Suppose E is irrelevant to E', but conditional on H, E is highly relevant to E'. For example, the wave theory of optics provided a perfect correlation between the swirling pink and blue colors of an oily puddle and the bands or fringes around the edge of a shadow. On Newton's particle-based theory of optics, these were entirely uncorrelated phenomena. Then we have the situation:

P(E' | E & W) > P(E' | W), whereas P(E' | E & N) = P(E' | N).

Then

P(E & E' | W) = P(E | W)P(E' | E & W) > P(E | W)P(E' | W).

Then

P(E & E' | N) = P(E | N)P(E' | E & N) = P(E | N)P(E' | N).

If we define the unifying coefficient of W given E & E' as follows:

U(W) = P(E & E' | W)/(P(E | W)P(E' | W)), and similarly for N,

Then the likelihood ratio can be expressed as

P(E & E' | W)/P(E & E' | N) = [U(W)/U(N)] [P(E | W)/P(E | N)] [P(E' | W)/P(E' | N)].

If neither theory makes the data individually much more likely, but one does a much better job of unifying them, then it will be better confirmed.
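
Here is the arithmetic in a short Python sketch (the likelihoods are invented; W stands in for the wave theory and N for Newton's). Both theories give E and E' the same individual likelihoods, but W makes them perfectly correlated while N leaves them independent, so the whole likelihood ratio comes from the unifying coefficients:

    def U(like_joint, like_1, like_2):
        """Unifying coefficient: P(E & E' | H) / (P(E | H) P(E' | H))."""
        return like_joint / (like_1 * like_2)

    # Invented likelihoods: individually, W and N treat E and E' exactly alike...
    P_E_W, P_E2_W, P_EE2_W = 0.2, 0.2, 0.2     # ...but on W, E' is certain given E
    P_E_N, P_E2_N, P_EE2_N = 0.2, 0.2, 0.04    # on N, E and E' are independent

    U_W = U(P_EE2_W, P_E_W, P_E2_W)            # 5.0
    U_N = U(P_EE2_N, P_E_N, P_E2_N)            # 1.0

    # The individual likelihood ratios are 1, so the joint likelihood ratio
    # is driven entirely by the unification factor U_W/U_N:
    print(P_EE2_W / P_EE2_N, U_W / U_N)        # 5.0 and 5.0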

Unification seems to play an important role in scientific revolutions. Usually the unifying theory wasn't yet conceived prior to the scientific revolution, so there is an unexpected flow of mass to the unifying theory (cf. the discussion of the catch-all hypothesis). William Whewell, the 19th century divine who invented the term "anode" and who advised Darwin on scientific method, called such a unification a "consilience of inductions". Whewell perhaps went overboard when he concluded that a theory that had achieved such a consilience could not be false. He based this claim on Kant's idea that whatever is nontrivially certain must reflect our own cognitive structure. Since consilience produces strong belief, Whewell concluded that it must be a reflection of our cognitive structure. In fact, the credibility of consilience is a feature of the structure of probabilistic coherence!

Also, there has been a tendency to separate "explanatory virtues" from confirmation. This discussion shows that qualitative explanatory virtues, such as "unifying" what seem a priori to be disparate phenomena, contribute directly to confirmation.


Direct Inference and Objectivity

Personal probabilities express your degrees of belief. What about quantum mechanics, a physical theory that makes assertions about "probabilities"? What is it talking about, Einstein's degrees of belief? Or yours? What if you don't even understand the theory?

The direct inference principle is a structural constraint that forces personal likelihoods to agree with theoretical probabilities, given that the theory is true. Let

p(A) = r

be the theoretical statement that the objective chance of proposition A is r (whatever that means).

Then the direct inference principle requires, roughly, that:

P(A | p(A) = r) = r.

In other words, conditional on the fact that p(A) = r, we should adjust our personal credence in A to level r.

You might expect that a good theory of chance would explain why we should obey the direct inference principle. You might therefore be surprised to learn that some personalists view the matter just the other way around! Because of Dutch book and the direct inference principle, they conclude that it would be irrational not to regard chances as probabilities!

The direct inference principle is responsible for maintaining the illusion that physical probabilities are relevant to our lives and can be discovered from evidence. Here is how the story works.

Notice that the direct inference principle allows us to substitute objective chances for our personal likelihoods in Bayes' theorem. That is why I remarked above that the likelihood is often understood to be "objective" or even "ontic".

What about practical relevance? Suppose we have come to fully believe a probabilistic hypothesis: P(p(A) = r) = 1. Let ($s, A) be a bet on A (that's practical!). Then you should be willing to pay:

$rs, since by direct inference your degree of belief that the bet pays off is r.

So again, highly confirmed statements about chance serve to constrain our practical deliberations via the direct inference principle.
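
A one-step sketch with hypothetical numbers: once P(p(A) = r) = 1, direct inference sets your credence in A to r, and the "put your money where your mouth is" principle prices the bet:

    def fair_price(stake, credence):
        """Fair price of the bet ($stake, A) at the given degree of belief in A."""
        return stake * credence

    r, s = 0.3, 10.0               # hypothetical chance value and stake
    # Having fully accepted p(A) = 0.3, direct inference sets credence in A to 0.3:
    print(fair_price(s, r))        # 3.0: pay $3 for the bet ($10, A)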


Bayesian Learning

Suppose hypothesis H entails the successive observations

E(0), E(1), ..., E(n), ....

Then

P(H | E(0) & ... & E(n)) = P(E(0) & ... & E(n) | H)P(H)/P(E(0) & ... & E(n)) = P(H)/P(E(0) & ... & E(n)).

What can we say about

P(E(0) & ... & E(n))?

This is easier to figure out if we look at the probability of the denial:

P(not-(E(0) & ... & E(n))) = P(not-E(0) or ... or not-E(n)).

That's more promising. But we can't apply axiom 3 unless the disjuncts are mutually incompatible. There's a neat trick for forcing this condition to hold:

not-E(0) or ... or not-E(n) <===> not-E(0) or (E(0) & not-E(1)) or ... or (E(0) & ... & E(n-1) & not-E(n)).

Now we have an instance of axiom 3, so

P(not-(E(0) & ... & E(n))) = P(not-E(0)) + P(E(0) & not-E(1)) + ... + P(E(0) & ... & E(n-1) & not-E(n)).

So long as none of these propositions is known a priori, the probability of

E(0) & ... & E(n)

goes down forever as n ---> oo. Hence,

P(H | E(0) & ... & E(n)) = P(H)/P(E(0) & ... & E(n))

continues to rise forever as n ---> oo. But we also know that eventually the terms in the sum

P(not-E(0)) + P(E(0) & not-E(1)) + ... + P(E(0) & ... & E(n-1) & not-E(n))

have to get ever smaller so as not to exceed 1 as n ---> oo. So there are "diminishing returns" for learning. After a certain point, more positive information doesn't add much credence. Nothing is involved here except for the first three axioms of probability!
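
A simulation with an invented joint model makes the diminishing returns visible. Take P(H) = 0.1, let H entail every E(i), and let each E(i) be an independent fair coin flip given not-H; then P(E(0) & ... & E(n)) shrinks toward P(H), so the posterior climbs toward 1, but eventually by ever smaller increments:

    # Hypothetical model: P(H) = 0.1; H entails every E(i); given not-H, each
    # E(i) is an independent fair coin flip.
    p_H = 0.1
    prev = p_H
    for n in range(1, 11):
        p_data = p_H + (1 - p_H) * 0.5 ** n    # P(E(0) & ... & E(n-1))
        post = p_H / p_data                    # P(H | E(0) & ... & E(n-1))
        print(n, round(post, 4), round(post - prev, 4))   # posterior and its increment
        prev = post
    # The increments eventually shrink toward 0: diminishing returns.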


Bayesian Induction

In the preceding section, we supposed that H |= E(0), E(1), .... Now let's suppose that H simply is the claim: for all i, E(i). By the same reasoning employed above, set

exists i, not-E(i) <===> not-E(0) or (E(0) & not-E(1)) or (E(0) & E(1) & not-E(2)) or ....

Since the terms under the last OR are all mutually exclusive, countable additivity (optional axiom 4) yields:

P(exists i, not-E(i)) = P(not-E(0)) + P(E(0) & not-E(1)) + P(E(0) & E(1) & not-E(2)) + ....

Similarly,

P(for all i, E(i)) = 1 - P(exists i, not-E(i)), and P(E(0) & ... & E(n)) = 1 - [P(not-E(0)) + ... + P(E(0) & ... & E(n-1) & not-E(n))].

Since for each n, (for all i, E(i)) |= E(0) & ... & E(n-1) & E(n), we have

P(for all i, E(i) | E(0) & ... & E(n)) = P(for all i, E(i))/P(E(0) & ... & E(n)).

This ratio goes to 1 as n ---> oo, provided that P(for all i, E(i)) > 0: the numerator is fixed, and the denominator decreases to it.

The argument depends essentially on the countable additivity axiom. Without it, the SUM can add up to less than P(exists i, not-E(i)). Then the ratio just discussed would converge to a value less than 1, as you may verify yourself.
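
The role of countable additivity can be checked numerically (the terms are invented, chosen geometric so that the sums are easy). With countable additivity the full sum equals P(exists i, not-E(i)) and the ratio tends to 1; if the terms sum to less than that, the ratio tends to something smaller:

    # Hypothetical terms: P(E(0) & ... & E(i-1) & not-E(i)) = 0.4 * 0.5**i, so the
    # full sum is S = 0.8. With countable additivity, P(for all i, E(i)) = 1 - S = 0.2.
    S_full = 0.8
    S_n = 0.0
    for n in range(15):
        S_n += 0.4 * 0.5 ** n
        ratio = (1 - S_full) / (1 - S_n)   # P(for all i, E(i)) / P(E(0) & ... & E(n))
        print(n, round(ratio, 6))          # climbs to 1

    # If additivity is merely finite, P(exists i, not-E(i)) may exceed the sum --
    # say it is 0.9 while the terms still sum to 0.8. Then the ratio tends only to
    print((1 - 0.9) / (1 - 0.8))           # 0.5, short of certainty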

Epistemological question: countable additivity is what makes Bayesian induction work. Since it is pivotal to the Bayesian response to inductive skepticism, why should we believe it? Note that the Dutch book arguments do not vindicate countable additivity: one would have to be willing to buy infinite combinations of bets.


Washing out the priors

This slogan refers to the fact, reflected in the preceding results, that although different Bayesians may start out with wildly divergent views about some topic (due to their differing priors), they very rapidly come to near agreement of opinion if they see largely the same evidence. Thus, it is claimed, the priors "wash out" with increasing evidence. So Bayesians respond that insofar as science is objective, personal probabilities are objective. And to the extent that science is not objective, neither are personal probabilities.
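
A small simulation illustrates washing out (the setup is invented: a coin's bias is known to be either 0.3 or 0.7, the true bias is 0.7, and two agents with nearly opposite priors conditionalize on the same 100 flips):

    import random

    random.seed(0)
    flips = [random.random() < 0.7 for _ in range(100)]    # true bias: 0.7

    for name, (p3, p7) in {'skeptic': (0.99, 0.01), 'believer': (0.05, 0.95)}.items():
        for heads in flips:
            l3 = 0.3 if heads else 0.7        # likelihood of the flip under bias 0.3
            l7 = 0.7 if heads else 0.3        # likelihood of the flip under bias 0.7
            p3, p7 = p3 * l3, p7 * l7
            p3, p7 = p3 / (p3 + p7), p7 / (p3 + p7)   # update by conditionalization
        print(name, round(p7, 6))             # both end up nearly certain of bias 0.7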

Here is a general result of that sort which will not be proved here. Say that H is empirically supervenient just in case the truth of H depends only on whether the data sentences E(n) are true or false. Then we have:

Theorem (Halmos): If H is empirically supervenient, and P satisfies countable additivity, then

P(P(H | E(0) & ... & E(n)) ---> the truth value of H as n ---> oo) = 1.

In other words, a countably additive Bayesian conditionalizer must be willing to bet her life against nothing that she will converge to the truth value of an arbitrary, empirically supervenient hypothesis in the limit.

Note:

  1. This doesn't say that she will converge to the truth value of H. It says only that she must be morally sure a priori that she will. Contrast this with the reliabilist views of Nozick, Goldman, and Alston.
  2. There exist empirically supervenient hypotheses H for which no possible updating method, conditioning included, can be guaranteed to find the truth value of H. One intuitive such example, due to Kant, is "Matter is infinitely divisible" (cf. Kelly 96).
  3. Also, the result fails without countable additivity (cf. Kelly 96).
  4. If we drop empirical supervenience, then the result may fail because all the hypotheses consistent with the actual data stream must share the remaining probability among one another.