Introduction to Bayesian Methodology

Kevin T. Kelly
Department of Philosophy
Carnegie Mellon University

Mathematical probability theory

Think of propositions as sets of possible states of the world. Thus, "the sky is blue" picks out all world states in which the color of the sky is blue. Some of these world states will have houses and cars and others will not.
An algebra is a collection of propositions closed under the Boolean operations: for any two propositions p, q in the collection, the propositions
• p & q
• p or q
• not p
are also in the collection.

T = the vacuous proposition.

A probability function P on an algebra is an assignment of numbers to propositions such that:

1. P(h) >= 0;
2. P(T) = 1;
3. a is inconsistent with b ===> P(a or b) = P(a) + P(b).
Definition of conditional probability:
P(h|e) = P(h & e)/P(e), if P(e) > 0.
Bayes' theorem: A trivial consequence of the definition.
P(h|e) = [P(h)P(e|h)]/P(e).

Proof:
P(e|h) = P(e & h)/P(h) [definition of conditional probability].
P(e & h) = P(e|h)P(h) [multiply both sides by P(h)].
P(h|e)  = P(h & e)/P(e) [definition of conditional probability].
P(h|e)  = P(e|h)P(h)/P(e) [substitution].
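As a numeric sanity check, Bayes' theorem can be run on the standard diagnostic-test toy example. All the numbers below are illustrative assumptions, not taken from the text:

```python
# Toy diagnostic-test example (all numbers are illustrative assumptions).
p_h = 0.01               # prior P(h): the hypothesis "patient is ill"
p_e_given_h = 0.95       # likelihood P(e|h): positive test given illness
p_e_given_not_h = 0.10   # P(e|not h): the false-positive rate

# P(e) via total probability over h and not-h
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' theorem: P(h|e) = P(h)P(e|h)/P(e)
p_h_given_e = p_h * p_e_given_h / p_e
print(round(p_h_given_e, 4))  # 0.0876 -- small despite the accurate test
```

Note how the low prior dominates: even a positive result from an accurate test leaves the posterior below 0.1.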

Total probability theorem: A trivial consequence of the definition of conditional probability and axiom 3.  This is the form in which the axiom is usually applied in practice.
P(e) = SUMi P(e|hi)P(hi), where the hi's are mutually exclusive and exhaustive.

Proof:
e = ORi (e & hi) [logic]
P(ORi (e & hi)) = SUMi P(e & hi) [third axiom of probability]
P(e|hi) = P(e & hi)/P(hi) [definition of conditional probability]
P(e & hi) = P(e|hi)P(hi) [multiply both sides by P(hi)].
P(e) = P(ORi (e & hi)) = SUMi P(e & hi) = SUMi P(e|hi)P(hi) [preceding lines]
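The theorem is easy to check numerically. A sketch with three mutually exclusive, exhaustive hypotheses (three urns) and assumed priors and likelihoods:

```python
# Three urns h1, h2, h3 (mutually exclusive and exhaustive), with
# illustrative priors and likelihoods for e = "a red ball is drawn".
priors = [0.5, 0.3, 0.2]        # P(hi); these sum to 1
likelihoods = [0.9, 0.5, 0.1]   # P(e|hi) for each urn

# Total probability: P(e) = SUMi P(e|hi)P(hi)
p_e = sum(l * p for l, p in zip(likelihoods, priors))
print(round(p_e, 2))  # 0.62
```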

Definition of probabilistic independence:
e is independent of h if and only if P(e & h) = P(e)P(h).

The preceding definition seems rather technical.  The following recharacterization makes intuitive sense.  Probabilistic independence is irrelevance: learning the one event wouldn't change your belief in the other.

Theorem: probabilistic independence is informational irrelevance:
e is independent of h if and only if P(h|e) = P(h).

Proof:  Suppose P(h|e) = P(h).  Then P(h & e) = P(h|e)P(e) = P(h)P(e).
Suppose P(h & e) = P(h)P(e).  Then P(h|e) = P(h & e)/P(e) = P(h)P(e)/P(e) = P(h).
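Both characterizations can be verified on the simplest possible case, two tosses of a fair coin (the fair-coin numbers are an assumption):

```python
# e = "first toss lands heads", h = "second toss lands heads".
p_e, p_h = 0.5, 0.5
p_e_and_h = 0.25   # one of four equiprobable outcomes

# Definition: P(e & h) = P(e)P(h)
assert p_e_and_h == p_e * p_h

# Irrelevance: P(h|e) = P(h), so learning e leaves belief in h unchanged
p_h_given_e = p_e_and_h / p_e
assert p_h_given_e == p_h
print(p_h_given_e)  # 0.5
```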

Bayesian methodology:

A rational agent whose degrees of belief are represented by probability function P should update her degrees of belief to P(.|e) after observing e.

By Bayes' theorem, the new degree of belief in h after seeing e is

P(h|e) = [P(h)P(e|h)]/P(e).
This formula is so important that the individual parts have special, time-honored names:
Likelihood of e given h = P(e|h). This is usually fairly definite and objective, since a theory usually makes a determinate prediction or specifies a probability of occurrence of a given experimental outcome. Classical statisticians allow only such probabilities to enter into methodology, so theories themselves cannot be said to have probabilities. You were warned about this in your elementary statistics class. Possibly you already forgot!

Prior probability of h = P(h). This may be quite subjective, reflecting a theory's initial "plausibility" prior to scientific investigation. This plausibility depends on such factors as intelligibility, simplicity, and whether the mechanism posited by the theory has been observed to operate elsewhere in nature (e.g., uniformitarian vs. catastrophist geology). In the 19th c. it was proposed that only causes observed to operate in nature could be invoked in new theories. This reflects prior probability.

Prior probability of e = P(e). This is subjective and very hard to specify. Using total probability,

P(e) = SUMi P(e|hi)P(hi),
with the sum taken over all possible theories. Nobody knows what all possible theories are. At most they are aware of the dominant paradigm and a few competitors.
In order to ameliorate the problem of assigning P(e), pairwise comparisons of theories are made by looking at ratios of posterior probabilities:
P(h|e)/P(h'|e) = [P(h)/P(h')][P(e|h)/P(e|h')].
The ratio P(h)/P(h') is the prior ratio and the ratio P(e|h)/P(e|h') is the likelihood ratio.  Changes in relative probability between competing theories are governed entirely by the likelihood ratio, since the prior ratio is a fixed constant.
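Since the prior ratio is a fixed constant, repeated independent evidence acts on the comparison only by multiplying in one likelihood ratio per observation. A sketch with assumed numbers:

```python
# Compare h and h' by their posterior ratio (illustrative numbers).
prior_ratio = 0.5 / 0.5          # P(h)/P(h'): start the theories even
likelihood_ratio = 0.8 / 0.4     # P(e|h)/P(e|h') = 2 per observation

# Each independent observation of e multiplies in one likelihood ratio.
posterior_ratio = prior_ratio
for _ in range(3):
    posterior_ratio *= likelihood_ratio
print(posterior_ratio)  # 8.0 -- h is now eight times as probable as h'
```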

Some methodological consequences:

High initial plausibility is good, explanations being similar: P(h1|e)/P(h2|e) = [P(h1)/P(h2)][P(e|h1)/P(e|h2)].

Refutation is fatal: If e is itself consistent (so that P(e) > 0) but inconsistent with h, then P(h|e) = 0.

Proof:
Note e & h = not T (the contradictory proposition), since the two are inconsistent.
Also, P(not T) + P(T) = P(not T or T) = P(T) by axiom (3) and logic.
P(T) = 1 by axiom (2).
Hence, P(not T) = 0. Now we have:
P(h|e) = P(h & e)/P(e) = P(not T)/P(e) = 0/P(e) = 0.
Surprising predictions are good, initial plausibilities being similar: If h entails e, then P(h & e) = P(h), so P(h|e) = P(h)/P(e), which is greater insofar as P(e) is lower (i.e., the occurrence of e is more surprising).
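A quick numerical sketch of the effect, assuming h entails e and an arbitrary prior P(h) = 0.1:

```python
# When h entails e, P(h|e) = P(h)/P(e): the less expected the
# prediction, the bigger the boost. P(h) = 0.1 is an assumption.
p_h = 0.1
posteriors = [p_h / p_e for p_e in (0.9, 0.5, 0.2)]
print(posteriors)  # the posterior grows as P(e) shrinks
```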

Diminishing returns of repeated testing:  Once e comes to be expected (P(e) is close to 1), the preceding argument shows that confirming it yields little boost.

Strong explanations are good, initial plausibilities being similar: The ratio P(h1|e)/P(h2|e) changes through time entirely as a function of the ratio of relative strength of explanation P(e|h1)/P(e|h2), for

P(h1|e)/P(h2|e) = [P(h1)/P(h2)][P(e|h1)/P(e|h2)] = k[P(e|h1)/P(e|h2)].
Unification is good, initial plausibilities being similar: A unified theory explains some regularity that the disunified theory does not. For example, Copernicus' theory entails that the total number of years must equal the total number of synodic periods + the total number of periods of revolution. To see this, suppose that data e, e' are independent a priori, so
P(e & e') = P(e)P(e').
Now suppose that e and e' remain independent given h1 but are completely dependent given h2 so that
h2 & e --> e'.  Thus
P(e & e'|h1) = P(e'|h1)P(e|h1) and
P(e & e'|h2) = P(e|h2).
So
P(h1|e & e')/P(h2|e & e') =
[P(h1)/P(h2)][P(e & e'|h1)/P(e & e'|h2)] =
k[P(e & e'|h1)/P(e & e'|h2)] =
k[P(e|h1)P(e'|h1)/P(e|h2)].
Now there is no reason to suppose that P(e|h1), P(e'|h1), and P(e|h2) are high, so the disunified theory has to overcome the effect of a product of low numbers while the unified theory does not. The more disunified phenomena a theory unifies compared to a competitor, the bigger this advantage becomes (suppose the likelihoods are all less than 0.5; then the degree of belief drops exponentially in the number of disunified phenomena).
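The exponential advantage can be sketched numerically. Suppose each of n phenomena has likelihood q under the disunified theory, while the unified theory entails the remaining phenomena given the first; q = 0.5 and this dependence structure are assumptions for illustration:

```python
q = 0.5
advantages = []
for n in (1, 2, 4, 8):
    disunified = q ** n   # product of n independent likelihoods
    unified = q           # the first phenomenon determines the rest
    advantages.append(unified / disunified)
print(advantages)  # [1.0, 2.0, 8.0, 128.0] -- grows exponentially in n
```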

Saying more lowers probability: h entails h' ==> P(h) <= P(h').

Conflict turns explanatory strength into an asset: Didn't we just say that strong explanations are good??? That is true if the initial plausibilities are similar. But if one theory entails the other, they won't be. Thus, unification-style arguments only work if the competing theories are mutually contradictory!

Some defeasible objections

Scientific method should be objective. The method is objective. Everybody is supposed to update by calculating personal probabilities. Some of the inputs to this method (prior probabilities) are not objective.

Scientific method should not consider subjective, prior plausibilities. That's just the kind of blind, pre-paradigm science Kuhn ridicules as being sterile. Without prior plausibilities to guide inquiry, no useful experiments would ever be performed.

Priors should be flat. What is flat? If we are uncertain about the size of a cube, should we be indifferent about

• the possible volumes,
• the areas of the sides, or
• the lengths of the sides?
Whichever one we are unbiased about, we are strongly biased about the others!
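A quick simulation makes the cube point concrete: a prior flat on side length is far from flat on volume. The range [0, 2] for the side is an assumption:

```python
import random

random.seed(0)
# Prior flat on the side length s in [0, 2].
sides = [random.uniform(0, 2) for _ in range(100_000)]
volumes = [s ** 3 for s in sides]

# Under the flat-on-sides prior, P(volume <= 1) = P(side <= 1) = 1/2;
# a prior flat on volume over [0, 8] would instead assign it 1/8.
frac = sum(v <= 1 for v in volumes) / len(volumes)
print(round(frac, 2))  # about 0.5, not 0.125
```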

Some more stubborn objections

High posterior probability doesn't mean that the theory is true. To some extent, one can show that the agent must believe that she will converge to the truth. But this doesn't mean that she will.

It isn't clear that numbers like P(e) even exist. One can respond with a protocol for eliciting such numbers, but in practice it doesn't always work. One can say that the subjects are "irrational", but the audience can always blame Bayesianism instead of the subjects.

The old evidence problem. If e is already known, then P(e) = 1 and hence P(e|h) = 1, so P(h|e) = P(h)P(e|h)/P(e) = P(h). So old evidence never "confirms" a hypothesis.

Responses:

Counterfactual confirmation: Some counterfactual version of yourself who had not already learned that e would have found that P(h|e) > P(h) even though you do not because for you P(e) = 1.
Objection: Many different counterfactual persons could have turned into you after seeing e. Which are you?
Something new is learned: that the theory entails the old data. This makes old evidence an instance of the "problem of found constraints" below.
Objection: Evidential support shouldn't depend on prior mathematical ignorance.