Introduction to Bayesian Methodology

Kevin T. Kelly
Department of Philosophy
Carnegie Mellon University


Mathematical probability theory

Think of propositions as sets of possible states of the world. Thus, "the sky is blue" picks out all world states in which the color of the sky is blue. Some of these world states will have houses and cars and others will not.


Bayesian methodology:

A rational agent whose degrees of belief are represented by probability function P should update her degrees of belief to P(.|e) after observing e.

By Bayes' theorem, the new degree of belief in h after seeing e is

P(h|e) = P(h)P(e|h)/P(e).

This formula is so important that the individual parts have special, time-honored names: P(h|e) is the posterior probability of h, P(h) is the prior probability of h, P(e|h) is the likelihood of h on e, and P(e) is the prior probability of the evidence. In order to ameliorate the problem of assigning P(e), pairwise comparisons of theories are made by looking at ratios of posterior probabilities:
P(h|e)/P(h'|e) = [P(h)/P(h')][P(e|h)/P(e|h')].
The ratio [P(h)/P(h')] is the prior ratio and the ratio [P(e|h)/P(e|h')] is the likelihood ratio.  Changes in relative probability between competing theories are governed entirely by the likelihood ratio, since the prior ratio is a fixed constant.
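As a numeric illustration, the factorization of the posterior ratio into prior ratio times likelihood ratio can be computed directly. All probability values below are hypothetical, chosen only to exhibit the algebra:

```python
# Sketch: posterior ratio = prior ratio * likelihood ratio.
# All numbers are made up for illustration.

def posterior_ratio(prior_h, prior_h_alt, like_h, like_h_alt):
    """P(h|e)/P(h'|e) = [P(h)/P(h')] * [P(e|h)/P(e|h')]."""
    return (prior_h / prior_h_alt) * (like_h / like_h_alt)

# With equal priors, the comparison is driven entirely by the likelihood ratio:
print(posterior_ratio(0.5, 0.5, 0.8, 0.2))  # -> 4.0
```

Note that the function never needs P(e): that troublesome term cancels when two theories are compared, which is exactly the point of working with ratios.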
 


Some methodological consequences:

High initial plausibility is good, explanations being similar: P(h1|e)/P(h2|e) = [P(h1)/P(h2)][P(e|h1)/P(e|h2)].

Refutation is fatal: If consistent e is inconsistent with h, then P(h|e) = 0.

Surprising predictions are good, initial plausibilities being similar: If h entails e, then P(h & e) = P(h), so P(h|e) = P(h)/P(e), which is greater insofar as P(e) is lower (i.e., the occurrence of e is more surprising).
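The effect can be seen numerically. The prior value 0.1 and the values of P(e) below are hypothetical; the point is only that a lower P(e) yields a higher posterior:

```python
# Sketch: when h entails e, P(h|e) = P(h)/P(e).
# The prior and the values of P(e) are hypothetical, for illustration only.
prior_h = 0.1
for p_e in (0.9, 0.5, 0.2):  # from unsurprising to surprising evidence
    posterior = prior_h / p_e
    print(f"P(e) = {p_e}: P(h|e) = {posterior:.3f}")
```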

Diminishing returns of repeated testing:  Once e has come to be expected (P(e) is high), the preceding argument shows that further confirmation from observing e is reduced.

Strong explanations are good, initial plausibilities being similar: The ratio P(h1|e)/P(h2|e) changes through time entirely as a function of the ratio of relative strength of explanation P(e|h1)/P(e|h2), for

P(h1|e)/P(h2|e) = [P(h1)/P(h2)][P(e|h1)/P(e|h2)] = k[P(e|h1)/P(e|h2)].
Unification is good, initial plausibilities being similar: A unified theory explains some regularity that the disunified theory does not. For example, Copernicus' theory entails that the total number of years must equal the total number of synodic periods + the total number of periods of revolution. To see this, suppose that data e, e' are independent a priori, so
P(e & e') = P(e)P(e').
Now suppose that e and e' remain independent given h1 but are completely dependent given h2, so that
h2 & e --> e'.  Thus
P(e & e'|h1) = P(e'|h1)P(e|h1) and
P(e & e'|h2) = P(e|h2).
So
P(h1|e & e')/P(h2|e & e') =
[P(h1)/P(h2)][P(e & e'|h1)/P(e & e'|h2)] =
k[P(e & e'|h1)/P(e & e'|h2)] =
k[P(e|h1)P(e'|h1)/P(e|h2)].
Now there is no reason to suppose that P(e|h1), P(e'|h1), and P(e|h2) are high, so the disunified theory has to overcome the effect of a product of low numbers while the unified theory does not. The more disunified phenomena a theory unifies compared to a competitor, the bigger this advantage becomes (suppose the likelihoods are all less than 0.5; then the degree of belief drops exponentially in the number of disunified phenomena).
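The exponential advantage can be checked with a small computation. The likelihood value q = 0.4 is hypothetical; any value below 0.5 shows the same pattern:

```python
# Sketch: with n phenomena, each with likelihood q < 0.5, the disunified
# theory h1 pays q**n (independence given h1), while the unified theory h2
# pays a single likelihood q (one entailment covers all the phenomena).
q = 0.4  # hypothetical likelihood
for n in (1, 2, 5, 10):
    disunified = q ** n   # P(e1 & ... & en | h1)
    unified = q           # P(e1 & ... & en | h2)
    print(f"n = {n:2d}: likelihood ratio h2/h1 = {unified / disunified:.1f}")
```

The ratio grows as (1/q)**(n-1), so the unified theory's advantage compounds with every further phenomenon it unifies.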

Saying more lowers probability: If h entails h', then P(h) <= P(h'), strictly whenever P(h' & ~h) > 0.

Conflict turns explanatory strength into an asset: Didn't we just say that strong explanations are good? That is true if the initial plausibilities are similar. But if one theory entails the other, they won't be, since saying more lowers probability. Thus, unification-style arguments only work if the competing theories are mutually contradictory!


Some defeasible objections

Scientific method should be objective. The method itself is objective: everybody is supposed to update by calculating conditional probabilities. Only some of the inputs to this method (the prior probabilities) are not objective.

Scientific method should not consider subjective, prior plausibilities. That's just the kind of blind, pre-paradigm science Kuhn ridicules as being sterile. Without prior plausibilities to guide inquiry, no useful experiments would ever be performed.

Priors should be flat. What is flat? If we are uncertain about the size of a cube, should we be indifferent about its side length, its face area, or its volume?

Whichever one we are unbiased about, we are strongly biased about the others!


Some more stubborn objections

High posterior probability doesn't mean that the theory is true. To some extent, one can show that the agent must believe that she will converge to the truth. But this doesn't mean that she will.

It isn't clear that numbers like P(e) even exist. One can respond with a protocol for eliciting such numbers, but in practice it doesn't always work. One can say that the subjects are "irrational", but the audience can always blame Bayesianism instead of the subjects.

The old evidence problem. If e is already known, then P(e) = 1, so P(e|h) = 1 for any h with P(h) > 0, and hence P(h|e) = P(h)P(e|h)/P(e) = P(h). So old evidence never "confirms" a hypothesis.
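The problem is visible in a one-line computation. The prior value 0.3 is made up; the point is that certain evidence leaves any prior untouched:

```python
# Sketch: Bayes updating on evidence that is already certain.
# If P(e) = 1 then P(e|h) = 1, and the posterior equals the prior.
def bayes_update(prior_h, like_e_given_h, p_e):
    """P(h|e) = P(h) * P(e|h) / P(e)."""
    return prior_h * like_e_given_h / p_e

print(bayes_update(0.3, 1.0, 1.0))  # old evidence: posterior == prior (0.3)
```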

Responses: