Lecture on Probability


Probability theory

An algebra is a collection of propositions closed under the Boolean operations {and, or, not}.

A probability function P satisfies the following axioms:

  1. P(A) >= 0.
  2. P(A) = 1, if A is a tautology.
  3. P(A or B) = P(A) + P(B), provided that A and B are mutually incompatible.
  4. (Optional: countable additivity) P(A(0) or A(1) or A(2) or ...) = P(A(0)) + P(A(1)) + P(A(2)) + ..., provided that the A(i) are pairwise incompatible.

Conditional probability defined

If P(B) > 0 then P(A | B) = P(A & B)/P(B)
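
To make the definitions concrete, here is a minimal sketch in Python (the worlds, the weights, and the propositions A and B are all invented for illustration). Propositions are represented as sets of possible worlds, so the Boolean operations of the algebra become set operations, and the definition of conditional probability can be computed directly:

    from fractions import Fraction

    # A finite possibility space: each world carries a probability weight.
    worlds = {'w1': Fraction(1, 2), 'w2': Fraction(1, 4), 'w3': Fraction(1, 4)}

    def P(prop):
        """Probability of a proposition, represented as a set of worlds."""
        return sum(worlds[w] for w in prop)

    def cond(A, B):
        """P(A | B) = P(A & B)/P(B), defined only when P(B) > 0."""
        if P(B) == 0:
            raise ValueError("conditional probability undefined when P(B) = 0")
        return P(A & B) / P(B)              # set intersection plays the role of '&'

    A, B = {'w1'}, {'w1', 'w2'}
    assert P(set(worlds)) == 1              # axiom 2: the sure proposition gets 1
    assert P(A.union(B)) == P(A) + P(B - A) # axiom 3: additivity over incompatible parts
    print(cond(A, B))                       # 2/3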


Interpretations

The above definition is mathematical. It is another question how to interpret it.


Ontic Interpretations (probabilities in the world)


Frequentism: the probability of A is the relative frequency (in the limit) of A-type outcomes in a suitable sequence of trials.

Difficulties: what is the probability of the single case? Which reference class of trials should be used? And limiting relative frequencies depend on the order in which the trials are taken.


Propensity:

Something counterfactual or modal rather than merely actual about the world.


Epistemic interpretations (pertaining to belief)


Classical: the probability of A is the ratio of the cases favorable to A to all of the equally possible cases.

Difficulties: which cases are "equally possible"? If equally possible cases are just equally probable cases, the definition is circular.


Logical (Rudolf Carnap)

"Partial Entailment"

Difficulties:

Goodman's problem:


Personal Probability Theory

Dutch Book Argument:

Bets: a bet ($s, A) costs its price up front and pays the prize $s if A is true, and nothing otherwise.

Negative prizes and selling: buying a bet with a negative prize, such as ($-1, A), amounts to selling the bet ($1, A).

"Put your money where your mouth is" principle: your fair price for the bet ($s, A) is $s times your degree of belief in A.

Packages of bets: collections of bets bought together.

"Package" principle: your fair price for a package of bets is the sum of your fair prices for its members.

Coherence: your degrees of belief are coherent just in case betting at your own fair prices cannot expose you to a guaranteed loss.

Dutch Book: a package of bets, each bought or sold at your fair prices, that yields you a net loss no matter how the world turns out.

Dutch Book theorem: your degrees of belief are coherent if and only if they satisfy the axioms of probability.

Project: Degrees of Dutch book. It is widely objected that it is impossible, both for practical and for computational reasons, to be probabilistically coherent. One response to this objection is to define degrees of incoherence in terms of how much money an evil bettor could be guaranteed to milk from you per unit stake. In other words, in a single betting cycle, the evil bettor has to put up money to induce you to bet. The more incoherent you are, the less money he has to put up to get a given surefire return from his dealings with you.
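
Here is a toy sketch in Python of the simplest case of the proposed measure (the credences are invented, and normalizing by the money at stake is just one way the project might fix "per unit stake"). The agent posts degrees of belief over a mutually exclusive and exhaustive partition; a bookie trading $1 bets at the agent's own prices is guaranteed the amount by which the credences fail to sum to 1:

    def milk_rate(credences):
        """Degrees of belief over a mutually exclusive, exhaustive partition.
        If they sum to s > 1, the bookie sells the agent a $1 bet on each cell
        at the agent's own prices, collecting s and paying out exactly 1;
        if s < 1, the bookie buys those bets instead, paying s and collecting 1.
        Either way the guaranteed profit is |s - 1|; dividing by the money
        changing hands is one (hypothetical) way to normalize per unit stake."""
        s = sum(credences)
        return abs(s - 1) / max(s, 1)

    print(milk_rate([0.6, 0.6]))   # 0.1666...: incoherent, cheaply milked
    print(milk_rate([0.6, 0.4]))   # 0.0: coherent, nothing can be guaranteed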

Objections:


Personalist Evidence and Learning

Recall:

If P(B) > 0 then P(A | B) = P(A & B)/P(B).

This is just a definition. In personalism, this defined concept is used as a procedure for learning.

Let Pn(A) be the agent's degrees of belief at stage n.

Suppose that at time n+1 the agent learns E. Then the personalist idea for learning is:

Pn+1(A) = Pn(A | E), for each proposition A.

This procedure is called updating by conditionalization or Bayesian updating.
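
A minimal sketch of the procedure in Python (the possibility space and the stage-n degrees of belief are hypothetical): conditionalizing on E zeroes out the worlds where E fails and renormalizes the rest:

    def conditionalize(P_n, E):
        """Return P_{n+1}(.) = P_n(. | E): zero the worlds where E fails, renormalize."""
        total = sum(p for w, p in P_n.items() if w in E)
        return {w: (p / total if w in E else 0.0) for w, p in P_n.items()}

    # Hypothetical stage-n degrees of belief over four weather worlds.
    P0 = {'rain&cold': 0.3, 'rain&warm': 0.2, 'dry&cold': 0.1, 'dry&warm': 0.4}
    E = {'rain&cold', 'rain&warm'}            # at stage n+1 the agent learns: rain
    P1 = conditionalize(P0, E)
    print(P1['rain&cold'])                    # 0.3/0.5 = 0.6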


Diachronic Dutch Book

The learning idea is an extra assumption requiring extra justification. Is there a Dutch book argument for it? In a sense, yes.

Let

a = Pn(A | B), and let c = Pn+1(A) be the degree of belief in A that the agent will adopt upon learning B, where c differs from a (say c < a), so that the agent violates conditionalization.

By definition:

Pn(A | B) = Pn(A & B)/Pn(B),

so we have

Pn(A & B) = a Pn(B).

Objections

The package principle

Earlier we objected that the package principle is invalid across time. But in this case, the package is

($1, A & B), ($(c - a), B), ($x, not-B), and ($-1, A),

and it is clear that ($1, A & B), ($(c - a), B), and ($x, not-B) are bought at stage n whereas ($-1, A) is only purchased at n+1, after observing B. So in this case, there is no way to ignore the fact that the alleged "incoherence" involves different temporal stages of the same agent.

But something is wrong here, for ultimately, what is at stake in Dutch book is preference among acts, and changes of preference through time are not normally thought to be irrational. We learn to appreciate red wine and some even come to appreciate Wagner (!), even though they couldn't have dreamed of paying for these things in earlier years.

Using self-knowledge

Moreover, if I know now that I fail to update by conditioning, then, seeing the Dutch book I will be subjected to if I buy the first package of bets, I should refuse to buy it. The argument succeeds only because it does not permit me to use any self-knowledge about my future opinions when I buy bets in the present. But from my current viewpoint, my future degrees of belief and the choices based upon them are future states of the world about which I should have degrees of belief now.


Bayes' Theorem

P(H | E) = P(E | H)P(H)/P(E).

Terminology

P(H) = the prior probability of H.

P(E | H) = the likelihood of E.

P(H | E) = the posterior probability of H.

P(E) = the prior probability of E.

Total Probability Theorem

P(E) = P(E | H)P(H) + P(E | not-H)P(not-H).
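
A worked example with hypothetical numbers shows how the two formulas fit together: the total probability theorem supplies P(E), which Bayes' theorem then consumes:

    # Hypothetical numbers: a rare hypothesis and a fairly reliable test for it.
    P_H = 0.01                                 # prior P(H)
    P_E_given_H, P_E_given_notH = 0.95, 0.05   # likelihoods

    # Total probability theorem supplies the prior of the evidence:
    P_E = P_E_given_H * P_H + P_E_given_notH * (1 - P_H)

    # Bayes' theorem then yields the posterior:
    P_H_given_E = P_E_given_H * P_H / P_E
    print(P_E, P_H_given_E)                    # 0.059 and roughly 0.161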

The problem of the catch-all hypothesis:

So why is the prior of the evidence E so hard to find? By the total probability theorem,

P(E) = P(E | H)P(H) + P(E | not-H)P(not-H).

But P(E | not-H) is nearly as inaccessible as P(E) itself. The hypothesis not-H is called the "catch-all" hypothesis, because it says "anything but H". We need to continue chopping mutually incompatible hypotheses out of not-H and apply the total probability rule repeatedly to get

P(E) = P(E | H)P(H) + P(E | H')P(H') + P(E | H'')P(H'') + ...

But this expresses P(E) in terms of all possible hypotheses! Who knows what the possible hypotheses are? They haven't been thought of yet!

In cases of exciting scientific change, a genius (Bohr, Einstein, Darwin, Maxwell, Fresnel, Kepler, etc.) thinks of a new, plausible alternative hypothesis H' that nobody thought of before such that the likelihood P(E | H') is high (i.e. H' is a good, plausible explanation). This causes a revolutionary shift of probability mass from the old hypothesis H to the new one H', and hence onto not-H. Such shifts are due not to new data but to new ideas. Revolutionary scientific change always has this character because what is inconceivable cannot be probable.

Likelihood ratios

The troublesome quantity P(E) can be avoided if we always compare the empirical support of two hypotheses against each other. Then the P(E)s cancel:

P(H | E)/P(H' | E) = P(E | H)P(H)/(P(E | H')P(H')).

But then we obtain no absolute grasp of the posterior probability: we can merely compare the probabilities of hypotheses on the table. Hence, even the subjective degree of belief in a theory is a fairly idealized quantity unless the alternatives are artificially constrained.
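
A short sketch of the cancellation (the priors and likelihoods are invented): the posterior ratio of two rival hypotheses is computed without P(E) ever appearing:

    def posterior_ratio(prior_1, like_1, prior_2, like_2):
        """P(H | E)/P(H' | E) = P(E | H)P(H) / (P(E | H')P(H')); P(E) cancels."""
        return (like_1 * prior_1) / (like_2 * prior_2)

    # Hypothetical priors and likelihoods for two hypotheses on the table:
    print(posterior_ratio(0.3, 0.8, 0.2, 0.1))   # 12.0: H is 12 times as probable as H' on E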


Relevant Information

If information E is irrelevant to H, then our degree of belief in H does not change upon learning that E. If E is positively relevant to H, then learning E makes us more confident that H. If E is negatively relevant to H, then learning E makes us less confident that H.

Unfortunately, the obtuse, technical term "statistical independence" has been attached to the concept of irrelevance. An equivalent, perhaps more familiar formulation of this concept is as follows:

    Proposition:

    1. P(H | E) = P(H) <===> P(H & E) = P(H)P(E).
    2. P(H | E) > P(H) <===> P(H & E) > P(H)P(E).
    3. P(H | E) < P(H) <===> P(H & E) < P(H)P(E).

    Proof:

      P(H | E) = P(H) <===> P(H & E)/P(E) = P(H) <===> P(H & E) = P(H)P(E).

      P(H | E) > P(H) <===> P(H & E)/P(E) > P(H) <===> P(H & E) > P(H)P(E).

      P(H | E) < P(H) <===> P(H & E)/P(E) < P(H) <===> P(H & E) < P(H)P(E).

Moreover, both positive and negative relevance are symmetric: since P(H & E) = P(E & H), each condition on the right-hand side above holds with H and E interchanged.

Hence, positive relevance, negative relevance, and irrelevance are symmetric, as is intuitive.
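
A quick numerical check of the proposition and of symmetry, on an invented joint distribution over H and E:

    # An invented joint distribution over the four combinations of H and E.
    P_HE, P_HnotE, P_notHE, P_notHnotE = 0.20, 0.10, 0.30, 0.40
    assert abs(P_HE + P_HnotE + P_notHE + P_notHnotE - 1) < 1e-9
    P_H, P_E = P_HE + P_HnotE, P_HE + P_notHE      # marginals: 0.30 and 0.50

    print(P_HE / P_E > P_H)     # P(H | E) > P(H): True  (0.40 > 0.30)
    print(P_HE / P_H > P_E)     # P(E | H) > P(E): True  (0.666... > 0.50)
    print(P_HE > P_H * P_E)     # the symmetric product test: True (0.20 > 0.15)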


Bayesian Confirmation

Positive relevance provides us with a quantitative concept of confirmation. Many philosophers believe that this has a lot to do with justification. In fact, conditional probability provides an intuitively appealing theory of justification.

Refutation is bad: if P(E | H) = 0 and E is learned, then P(H | E) = 0.

Plausible competitors are bad: each plausible competitor H' contributes P(E | H')P(H') to P(E), which sits in the denominator of Bayes' theorem, driving P(H | E) down.

Refuting plausible competitors helps: when a plausible competitor is refuted, its share of P(E) disappears, and P(H | E) rises accordingly.

Surprising predictions are good: if P(E) is low but P(E | H) is high, then the boost P(H | E)/P(H) = P(E | H)/P(E) is large.

Prediction of apparently independent phenomena is good: if E and E' are antecedently independent, then P(E & E') = P(E)P(E') is small, so a theory that makes both probable receives a large boost from their joint occurrence.

Unification of apparently independent phenomena is good:

Predicting disparate phenomena is one thing, but it is another matter to "unify" disparate phenomena, revealing that the one really does provide information about the other.

Suppose E is irrelevant to E', but conditional on H, E is highly relevant to E'. For example, the wave theory of optics provided a perfect correlation between the swirling pink and blue colors of an oily puddle and the bands or fringes around the edge of a shadow. On Newton's particle-based theory of optics, these were entirely uncorrelated phenomena. Then we have the situation:

P(E' | E & W) > P(E' | W), whereas P(E' | E & N) = P(E' | N).

Then

P(E & E' | W) = P(E | W)P(E' | E & W) > P(E | W)P(E' | W).

Then

P(E & E' | N) = P(E | N)P(E' | E & N) = P(E | N)P(E' | N).

If we define the unifying coefficient of W given E & E' as follows:

U(W) = P(E & E' | W)/(P(E | W)P(E' | W)), and similarly for N,

Then the likelihood ratio can be expressed as

P(E & E' | W)/P(E & E' | N) = [U(W)/U(N)] [P(E | W)/P(E | N)] [P(E' | W)/P(E' | N)].

If neither theory makes the data individually much more likely, but one does a much better job of unifying them, then it will be better confirmed.
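
Here is the arithmetic in a short Python sketch (the likelihoods are invented; W stands in for the wave theory and N for Newton's). Both theories give E and E' the same individual likelihoods, but W makes them perfectly correlated while N leaves them independent, so the whole likelihood ratio comes from the unifying coefficients:

    def U(like_joint, like_1, like_2):
        """Unifying coefficient: P(E & E' | H) / (P(E | H) P(E' | H))."""
        return like_joint / (like_1 * like_2)

    # Invented likelihoods: individually, W and N treat E and E' exactly alike...
    P_E_W, P_E2_W, P_EE2_W = 0.2, 0.2, 0.2     # ...but on W, E' is certain given E
    P_E_N, P_E2_N, P_EE2_N = 0.2, 0.2, 0.04    # on N, E and E' are independent

    U_W = U(P_EE2_W, P_E_W, P_E2_W)            # 5.0
    U_N = U(P_EE2_N, P_E_N, P_E2_N)            # 1.0

    # The individual likelihood ratios are 1, so the joint likelihood ratio
    # is driven entirely by the unification factor U_W/U_N:
    print(P_EE2_W / P_EE2_N, U_W / U_N)        # 5.0 and 5.0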

Unification seems to play an important role in scientific revolutions. Usually the unifying theory wasn't yet conceived prior to the scientific revolution, so there is an unexpected flow of mass to the unifying theory (cf. the discussion of the catch-all hypothesis). William Whewell, the 19th century divine who invented the term "anode" and who advised Darwin on scientific method, called such a unification a "consilience of inductions". Whewell perhaps went overboard when he concluded that a theory that had achieved such a consilience could not be false. He based this claim on Kant's idea that whatever is nontrivially certain must reflect our own cognitive structure. Since consilience produces strong belief, Whewell concluded that it must be a reflection of our cognitive structure. In fact, the credibility of consilience is a feature of the structure of probabilistic coherence!

Also, there has been a tendency to separate "explanatory virtues" from confirmation. This discussion shows that qualitative explanatory virtues, such as "unifying" what seem a priori to be disparate phenomena, contribute directly to confirmation.


Direct Inference and Objectivity

Personal probabilities express your degrees of belief. What about quantum mechanics, a physical theory that makes assertions about "probabilities"? What is it talking about, Einstein's degrees of belief? Or yours? What if you don't even understand the theory?

The direct inference principle is a structural constraint that forces personal likelihoods to agree with theoretical probabilities, given that the theory is true. Let

p(A) = r

be the theoretical statement that the objective chance of proposition A is r (whatever that means).

Then the direct inference principle requires, roughly, that:

P(A | p(A) = r) = r.

In other words, conditional on the fact that p(A) = r, we should adjust our personal credence in A to level r.

You might expect that a good theory of chance would explain why we should obey the direct inference principle. You might therefore be surprised to learn that some personalists view the matter just the other way around! Because of Dutch book and the direct inference principle, they conclude that it would be irrational not to regard chances as probabilities!

The direct inference principle is responsible for maintaining the illusion that physical probabilities are relevant to our lives and can be discovered from evidence. Here is how the story works.

Notice that the direct inference principle allows us to substitute objective chances for our personal likelihoods in Bayes' theorem. That is why I remarked above that the likelihood is often understood to be "objective" or even "ontic".

What about practical relevance? Suppose we have come to fully believe a probabilistic hypothesis: P(p(A) = r) = 1. Let ($s, A) be a bet on A (that's practical!). Then you should be willing to pay:

$rs, since by direct inference your degree of belief that the bet pays off is r.

So again, highly confirmed statements about chance serve to constrain our practical deliberations via the direct inference principle.
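
A one-step sketch with hypothetical numbers: once P(p(A) = r) = 1, direct inference sets your credence in A to r, and the "put your money where your mouth is" principle prices the bet:

    def fair_price(stake, credence):
        """Fair price of the bet ($stake, A) at the given degree of belief in A."""
        return stake * credence

    r, s = 0.3, 10.0               # hypothetical chance value and stake
    # Having fully accepted p(A) = 0.3, direct inference sets credence in A to 0.3:
    print(fair_price(s, r))        # 3.0: pay $3 for the bet ($10, A)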


Bayesian Learning

Suppose hypothesis H entails the successive observations

E(0), E(1), ..., E(n), ....

Then

P(H | E(0) & ... & E(n)) = P(E(0) & ... & E(n) | H)P(H)/P(E(0) & ... & E(n)) = P(H)/P(E(0) & ... & E(n)).

What can we say about

P(E(0) & ... & E(n))?

This is easier to figure out if we look at the probability of the denial:

P(not-(E(0) & ... & E(n))) = P(not-E(0) or ... or not-E(n)).

That's more promising. But we can't apply axiom 3 unless the disjuncts are mutually incompatible. There's a neat trick for forcing this condition to hold:

not-E(0) or ... or not-E(n) <===> not-E(0) or (E(0) & not-E(1)) or ... or (E(0) & ... & E(n-1) & not-E(n)).

Now we have an instance of axiom 3, so

P(not-(E(0) & ... & E(n))) = P(not-E(0)) + P(E(0) & not-E(1)) + ... + P(E(0) & ... & E(n-1) & not-E(n)).

So long as none of these propositions is known a priori, the probability of

E(0) & ... & E(n)

goes down forever as n ---> oo. Hence,

P(H | E(0) & ... & E(n)) = P(H)/P(E(0) & ... & E(n))

continues to rise forever as n ---> oo. But we also know that eventually the terms in the sum

P(not-E(0)) + P(E(0) & not-E(1)) + ... + P(E(0) & ... & E(n-1) & not-E(n))

have to get ever smaller so as not to exceed 1 as n ---> oo. So there are "diminishing returns" for learning. After a certain point, more positive information doesn't add much credence. Nothing is involved here except for the first three axioms of probability!
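
A simulation with an invented joint model makes the diminishing returns visible. Take P(H) = 0.1, let H entail every E(i), and let each E(i) be an independent fair coin flip given not-H; then P(E(0) & ... & E(n)) shrinks toward P(H), so the posterior climbs toward 1, but eventually by ever smaller increments:

    # Hypothetical model: P(H) = 0.1; H entails every E(i); given not-H, each
    # E(i) is an independent fair coin flip.
    p_H = 0.1
    prev = p_H
    for n in range(1, 11):
        p_data = p_H + (1 - p_H) * 0.5 ** n    # P(E(0) & ... & E(n-1))
        post = p_H / p_data                    # P(H | E(0) & ... & E(n-1))
        print(n, round(post, 4), round(post - prev, 4))   # posterior and its increment
        prev = post
    # The increments eventually shrink toward 0: diminishing returns.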


Bayesian Induction

In the preceding section, we supposed that H |= E(0), E(1), .... Now let's suppose that H simply is the claim: for all i, E(i). By the same reasoning employed above, set

exists i, not-E(i) <===> not-E(0) or (E(0) & not-E(1)) or (E(0) & E(1) & not-E(2)) or ....

Since the terms under the last OR are all mutually exclusive, countable additivity (optional axiom 4) yields:

P(exists i, not-E(i)) = P(not-E(0)) + P(E(0) & not-E(1)) + P(E(0) & E(1) & not-E(2)) + ....

Similarly,

P(for all i, E(i)) = 1 - P(exists i, not-E(i)), and P(E(0) & ... & E(n)) = 1 - [P(not-E(0)) + ... + P(E(0) & ... & E(n-1) & not-E(n))].

Since for each n, (for all i, E(i)) |= E(0) & ... & E(n-1) & E(n), we have

P(for all i, E(i) | E(0) & ... & E(n)) = P(for all i, E(i))/P(E(0) & ... & E(n)).

This ratio goes to 1 as n ---> oo, provided that P(for all i, E(i)) > 0: the numerator is fixed, and the denominator decreases to it.

The argument depends essentially on the countable additivity axiom. Without it, the SUM can add up to less than P(exists i, not-E(i)). Then the ratio just discussed would converge to a value less than 1, as you may verify yourself.
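
The role of countable additivity can be checked numerically (the terms are invented, chosen geometric so that the sums are easy). With countable additivity the full sum equals P(exists i, not-E(i)) and the ratio tends to 1; if the terms sum to less than that, the ratio tends to something smaller:

    # Hypothetical terms: P(E(0) & ... & E(i-1) & not-E(i)) = 0.4 * 0.5**i, so the
    # full sum is S = 0.8. With countable additivity, P(for all i, E(i)) = 1 - S = 0.2.
    S_full = 0.8
    S_n = 0.0
    for n in range(15):
        S_n += 0.4 * 0.5 ** n
        ratio = (1 - S_full) / (1 - S_n)   # P(for all i, E(i)) / P(E(0) & ... & E(n))
        print(n, round(ratio, 6))          # climbs to 1

    # If additivity is merely finite, P(exists i, not-E(i)) may exceed the sum --
    # say it is 0.9 while the terms still sum to 0.8. Then the ratio tends only to
    print((1 - 0.9) / (1 - 0.8))           # 0.5, short of certainty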

Epistemological question: countable additivity is what makes Bayesian induction work. Since it is pivotal to the Bayesian response to inductive skepticism, why should we believe it? Note that the Dutch book arguments do not vindicate countable additivity: one would have to be willing to buy infinite combinations of bets.


Washing out the priors

This slogan refers to the fact, reflected in the preceding results, that although different Bayesians may start out with wildly divergent views about some topic (due to their differing priors), they very rapidly come to near agreement of opinion if they see largely the same evidence. Thus, it is claimed, the priors "wash out" with increasing evidence. So Bayesians respond that insofar as science is objective, personal probabilities are objective. And to the extent that science is not objective, neither are personal probabilities.
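
A small simulation illustrates washing out (the setup is invented: a coin's bias is known to be either 0.3 or 0.7, the true bias is 0.7, and two agents with nearly opposite priors conditionalize on the same 100 flips):

    import random

    random.seed(0)
    flips = [random.random() < 0.7 for _ in range(100)]    # true bias: 0.7

    for name, (p3, p7) in {'skeptic': (0.99, 0.01), 'believer': (0.05, 0.95)}.items():
        for heads in flips:
            l3 = 0.3 if heads else 0.7        # likelihood of the flip under bias 0.3
            l7 = 0.7 if heads else 0.3        # likelihood of the flip under bias 0.7
            p3, p7 = p3 * l3, p7 * l7
            p3, p7 = p3 / (p3 + p7), p7 / (p3 + p7)   # update by conditionalization
        print(name, round(p7, 6))             # both end up nearly certain of bias 0.7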

Here is a general result of that sort which will not be proved here. Say that H is empirically supervenient just in case the truth of H depends only on whether the data sentences E(n) are true or false. Then we have:

Theorem (Halmos): If H is empirically supervenient, and P satisfies countable additivity, then

P(P(H | E(0) & ... & E(n)) ---> the truth value of H as n ---> oo) = 1.

In other words, a countably additive Bayesian conditionalizer must be willing to bet her life against nothing that she will converge to the truth value of an arbitrary, empirically supervenient hypothesis in the limit.

Note:

  1. This doesn't say that she will converge to the truth value of H. It says only that she must be morally sure a priori that she will. Contrast this with the reliabilist views of Nozick, Goldman, and Alston.
  2. There exist empirically supervenient hypotheses H for which no possible updating method, conditioning included, can be guaranteed to find the truth value of H. One intuitive such example, due to Kant, is "Matter is infinitely divisible" (cf. Kelly 96).
  3. Also, the result fails without countable additivity (cf. Kelly 96).
  4. If we drop empirical supervenience, then the result may fail because all the hypotheses consistent with the actual data stream must share the remaining probability among one another.