80-300 Seminar on Simplicity and Ockham’s Razor

 

Instructor:  Kevin T. Kelly, Professor

Office: 135 K BH

Office hours:  M, T 11:00-12:00

Contact: X 8567, kk3n@andrew.cmu.edu

Room: DH 4303

Time: Mon 4:30-6:50 PM.

Text:  articles available for download in pdf format from this page.

 

Requirements:

Read, think, talk.

Final project of some sort related to simplicity (original or expository).

 

Description:  Ockham’s razor is a systematic bias toward simpler or more unified theories in scientific theory choice.  The question is why one should have such a bias if one’s aim is to find the true theory.  We will survey standard accounts and find that they are all either circular (they presuppose that simple worlds are more probable than complex worlds) or give up on finding the true theory altogether.  Then we will consider a new approach, according to which Ockham’s razor does not point directly at the truth but is the unique strategy that keeps one on the most direct path to the truth.  Directness is understood in terms of costs such as the number of reversals of opinion prior to convergence to the truth and the times at which these reversals occur.  The approach will be applied to the inference of causal networks.  There are many topics to investigate, most notably how to extend the approach to stochastic theories and how to apply it to belief revision and epistemic logic.

Simplicity and its Puzzles

Thomas Kuhn, The Copernican Revolution, chapters 2 and 5.

Clark Glymour, “Relevant Evidence”.

Philip Kitcher, “Explanatory Unification”.

Margaret Morrison, Unifying Scientific Theories, chapter 1, chapter 2.

Nelson Goodman, Fact, Fiction and Forecast.

 

P.S.  David Christensen, “Glymour on Evidential Relevance”.  This is the counterexample to “bootstrapping” that Clark mentioned in class.

Part I:  Alternative Accounts

Testability

Karl Popper thought that there is no such thing as “inductive support” by evidence.  Instead, there are simply “conjectures” and sincere attempts to refute them.  Science is “demarcated” from pseudo-science by that sincere effort.  When a theory “proves its mettle” by standing up to a severe test, it is said not to be confirmed or supported by the data, but “corroborated”, which means nothing more than that it stood up to a severe test.  Thus, the aim of severe testing is to have a theory that survives severe testing. 

 

Mayo explicates “severe test” in terms of statistical power, the chance that the theory is refuted given that it is false.   She never mentions simplicity directly.   But think of testing Ptolemy and Copernicus with noisy observations, and ask which one is severely testable. 

 

It seems that Glymour’s bootstrap theory is also a testability-based account of simplicity, even though he speaks of “confirmation” rather than “corroboration”. 

 

Popper, Glymour, and Mayo all leave the relationship between simplicity and truth vague or unspecified.  Focus your reading on that point. 

 

Karl Popper, The Logic of Scientific Discovery, Chapters 6-7, Chapter 10.

Deborah Mayo, Error and the Growth of Experimental Knowledge, Chapter 1, Chapter 6.

Accurate Prediction

 

To predict with a deterministic theory, one uses some data to set the free parameters of the theory and when the parameters have been set the theory yields unique predictions for future data.  If the theory is probabilistic, one can do approximately the same thing by choosing the values of the free parameters that make the observed data most probable (by the lights of the theory).  When the parameters are set, the theory can be used to predict other observations.  In the deterministic case, it takes a sufficient number of data to fix the values of the theory’s parameters---2 for a line, 3 for a quadratic, etc.  If there are too few data to fix the parameters, the theory’s predictions are indeterminate.  Again, something similar is true in the case of stochastic theories, but in a gradual way: if the sample is too small, there is greater statistical “spread” to the predictions of the theory.  This excess spread implies greater expected distance of the empirical estimate from the true, underlying value of the quantity estimated.  So if we are given a true theory, it makes sense to set its parameters with a sample large enough to reduce the spread (and hence the error) of the empirical prediction to an acceptable value.
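
Here is a small Python sketch of the point about spread (my own illustration, not from the readings; the particular line, noise level, and sample sizes are assumptions).  Two well-placed data points fix the parameters of a line exactly, and with noisy data the spread of the fitted prediction shrinks as the sample grows.

    import numpy as np

    rng = np.random.default_rng(0)
    a_true, b_true, sigma = 1.0, 2.0, 1.0     # assumed true intercept, slope, noise level

    def predict_at(x_new, n):
        """Fit a line to n noisy points and predict y at x_new."""
        x = np.linspace(0.0, 1.0, n)
        y = a_true + b_true * x + rng.normal(0.0, sigma, n)
        slope, intercept = np.polyfit(x, y, 1)
        return intercept + slope * x_new

    for n in (2, 10, 100, 1000):
        preds = [predict_at(0.5, n) for _ in range(2000)]
        print(n, round(float(np.std(preds)), 3))   # spread falls roughly like 1/sqrt(n)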

 

That all assumes that we have the true theory.  But if we don’t, then the false theory we employ may exclude the true value of the predicted quantity.  
That shows up as bias or deviation of the average value of the prediction from the true value.  Over-simplified theories yield biased estimates (think of a flat or constant line when the true relation is a tilted line).  Overly complex theories have no bias (their parameters accommodate anything) but greater spread (think of the extreme case in which we estimate n free parameters with 1 data point).  Expected deviation from the true prediction (a quantity called predictive “risk”) reflects both bias and variance.  Since bias goes up for overly simple theories and variance goes up for overly complex ones, there should be a “sweet spot” where the combined bias and variance of the empirical prediction is minimized.  For purposes of predictive accuracy, one ought to choose the model or theory that minimizes predictive risk at the current sample size. 

 

C.S. Peirce thought that prediction is the ultimate aim of knowledge and that no theory predicts better than the true theory.  That is plausible, but false for stochastic theories.  As Vapnik explains, even if the true theory is known, on small samples a false, over-simplified theory will minimize expected distance from the true prediction better than the true theory will.  Hence, accurate prediction is a sui generis aim incompatible with finding the true theory on small samples.  Therefore, explanations of Ockham’s razor based on the principle of minimizing predictive risk result in an instrumentalistic attitude toward theories, as shows up candidly in the article by Sober.  But watch out when advocates of risk minimization, like Forster and Sober, motivate it by distinguishing “signal” or “trend” from “noise”.  If the truth is a noisy equation:

 

Y = aX + bX^2 + e,

 

where e is a normally distributed random variable, then presumably the “trend” or “signal” is Y = aX + bX^2 and the “noise” is e.  But aX + bX^2 is not what risk minimizers are even intended to recover.  For them, the “trend” is the functional form that, when estimated, minimizes predictive risk.  If the sample is small, that will result in an over-simplified model compared to the true one.  One also frequently hears the term “over-fitting” in connection with risk minimization.  One might mistakenly think that over-fitting is estimation by means of a model more complicated than the true model.  But the definition of over-fitting is to estimate with a model that has too much variance to minimize overall risk (and under-fitting is to estimate with a model that has too much bias to minimize overall risk).  On small samples, estimation with the true model over-fits (after all, the true model may have 50 free parameters while the sample consists of a single data point).  
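
The following Python sketch makes that point concrete (my own illustration; the coefficients, noise level, and sample size are assumptions).  At a small sample size, estimating with the false, over-simplified model Y = aX typically yields fitted values closer to the true signal than estimating with the true model Y = aX + bX^2.

    import numpy as np

    rng = np.random.default_rng(1)
    a, b, sigma, n = 1.0, 0.5, 1.0, 5
    x = np.linspace(0.1, 1.0, n)
    signal = a * x + b * x**2                      # noise-free part of the truth

    def risk(X, trials=20000):
        """Average squared distance of the fitted values from the true signal."""
        total = 0.0
        for _ in range(trials):
            y = signal + rng.normal(0.0, sigma, n)
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            total += np.mean((X @ coef - signal) ** 2)
        return total / trials

    print("Y = aX (false, simple):       ", round(risk(x[:, None]), 3))
    print("Y = aX + bX^2 (true, complex):", round(risk(np.column_stack([x, x**2])), 3))
    # On this small sample the false, simpler model typically has the lower risk.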

 

Forster and Sober generated some hoopla in philosophy for Akaike’s particular approach, but in fact the idea of controlling predictive risk by constraining estimates in some manner has been around for a long time in statistics and motivates techniques like Mallows’ Cp statistic and cross-validation.  
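
For concreteness, here is a minimal sketch of one such technique, leave-one-out cross-validation, used to estimate predictive risk and pick a polynomial degree (the data-generating line, noise level, and candidate degrees are assumptions of mine).

    import numpy as np

    rng = np.random.default_rng(2)
    n = 20
    x = np.linspace(0.0, 1.0, n)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, n)     # the truth is assumed linear

    def loo_cv(degree):
        """Average squared error of predicting each held-out point."""
        errs = []
        for i in range(n):
            mask = np.arange(n) != i
            coeffs = np.polyfit(x[mask], y[mask], degree)
            errs.append((np.polyval(coeffs, x[i]) - y[i]) ** 2)
        return float(np.mean(errs))

    for d in range(5):
        print("degree", d, "CV estimate of risk:", round(loo_cv(d), 3))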

 

It is important to understand some limitations of this approach. First of all, estimating from a constrained model yields lousy predictions if the model happens to be very false and the classical statistician has no way of telling other than to estimate, which is the original problem to be solved! Forster concedes this point near the end of "The New Science of Simplicity". Second, the approach is about accurate prediction, not finding the true model. Indeed, many classical statisticians think it is a terrible idea to first try to find the true model and then to estimate it (see the Hodges' estimator article and Leeb and Poetscher). That is because estimation based on a convergent model-selection technique implies that the worst-case predictive error goes to infinity. Techniques that fail to converge to the simplest theory when it is true (e.g., AIC) do not have this infinite blow-up property. But they can still do way worse than estimating the most complex model. So what is the motive for Ockham's razor, after all?

Third, "accuracy" is mean squared error, which may be the wrong loss function (when a miss is as good as a mile). Fourth, expected accuracy is with respect to the actual sampling distribution. Any manipulation or policy that changes the sampling distribution will invalidate the risk estimates. Thus, a very accurate estimate of the regression coefficient from ashtrays to lung cancer can be a disastrous predictor of the effect of eliminating ashtrays. Causal inference requires correct causal conclusions---a method that converges to the true causal model, not an accurate prediction of the observed value of Y from the observed value of X.
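
To see the worst-case blow-up behind the second point, here is a rough sketch of Hodges' estimator (my own illustration of the phenomenon the readings discuss; the threshold and parameter values are assumptions).  The estimator sets the sample mean to zero whenever it is small; it beats the plain mean when the true parameter is exactly zero, but after the usual scaling by n its risk spikes at parameter values near n^(-1/4).

    import numpy as np

    rng = np.random.default_rng(3)
    n, trials = 1000, 20000

    def scaled_risk(theta):
        """n times the mean squared error of Hodges' estimator at theta."""
        xbar = theta + rng.normal(0.0, 1.0, trials) / np.sqrt(n)   # simulated sample means
        hodges = np.where(np.abs(xbar) > n ** -0.25, xbar, 0.0)    # shrink small means to 0
        return n * np.mean((hodges - theta) ** 2)                  # the plain mean gives ~1

    for theta in (0.0, 0.5 * n ** -0.25, n ** -0.25, 0.5, 2.0):
        print(round(theta, 4), "->", round(scaled_risk(theta), 2))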

 

The over-fitting idea assumes a very elegant formulation in Vladimir Vapnik’s Statistical Learning Theory, which was developed for a specific problem called pattern recognition. In that problem, one learns a deterministic concept from randomly sampled data (in model selection, one chooses a stochastic theory based on randomly sampled data). Vapnik defines a quantity, the VC dimension, that determines how large a sample one requires to minimize the predictive risk of predictions based on a concept selected from the class---the richer the class, the more data one requires. All of the above caveats apply, plus one more. If the concept class poses the problem of induction, the VC dimension is infinite and all bets are off.
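
As a toy illustration of the definition (my own example, not necessarily Vapnik's), here is a brute-force check that interval concepts [a, b] on the real line shatter every 2-point set but no 3-point set, so their VC dimension is 2.

    from itertools import product

    def interval_shatters(points):
        """Check whether intervals [a, b] realize every +/- labeling of the points."""
        # A coarse grid of candidate endpoints suffices for these well-separated points.
        candidates = [p + d for p in points for d in (-0.5, 0.0, 0.5)]
        labelings = set()
        for a, b in product(candidates, repeat=2):
            labelings.add(tuple(a <= p <= b for p in points))
        return len(labelings) == 2 ** len(points)

    print(interval_shatters([1.0, 2.0]))        # True: two points can be shattered
    print(interval_shatters([1.0, 2.0, 3.0]))   # False: the labeling (+, -, +) is impossible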

 

Gilbert Harman is famous for recommending Ockham’s razor as a principle of theoretical inference, under the name of  “inference to the best explanation” (IBE).

In Harman and Kulkarni's Reliable Reasoning, he proposes statistical learning theory as a formal justification for IBE.  But IBE is supposed to be theoretical inference aimed at true theories, whereas structural risk minimization aims only at accurate prediction.  It is, therefore, a mistake for him to recommend structural risk minimization as an explanation of that principle in the reading for this week.  He tries to overcome that objection by arguing that structural risk minimization can converge to the truth in the limit, but alternative short-run biases would converge to the truth in the limit as well.  If that were all there is to it, the usual Bayesian position would suffice.  For his part, Vapnik candidly concedes that SRM is no better than other standard estimation techniques on large samples.  

 

Malcolm Forster and Elliott Sober, “How to Tell When Simpler, More Unified, or Less Ad Hoc Theories Will Provide More Accurate Predictions”

Gilbert Harman and Sanjeev Kulkarni, Reliable Reasoning, selections

Vladimir Vapnik, The Nature of Statistical Learning Theory, selections.

Kevin Kelly and Conor Mayo-Wilson, Review of Reliable Reasoning

 

Malcolm Forster, "The New Science of Simplicity", especially the discussion around p. 109.

Wikipedia, "Hodges' Estimator". Compare the plot to Forster's.

Hannes Leeb and Benedikt Poetscher, "Sparse Estimators and the Oracle Property, or the Return of Hodges' Estimator". Don't get hung up on details---just get the point the authors want to make.

 

 

Part II: Flattish Prior Probabilities

Bayesian theory views probabilities as [rational] degrees of belief and recommends updating them on data by the rule of conditioning, which means setting new degrees of belief to one’s old degrees of belief conditional on the current, total evidence.  All of this is supposed to be as obvious as deductive logic itself, although there are some attempts to give some sorts of arguments to the savages who don’t yet agree with that enlightened point of view. 

 

Since every Bayesian analysis of evidence begins with prior probabilities for theories and parameter settings, these prior probabilities feed back into one’s assessment of the bearing of evidence and on overall plausibility after the evidence has been received.  Under some natural constraints on how these numbers are assigned, something like Ockham’s razor seems to follow.  Our question is: what does that tell us about the truth-conduciveness of Ockham’s razor?  The concern is that such Bayesian analyses get out exactly what you put in, namely, a circular, prior bias toward simplicity. 

 

One very bald way to beg the question is to assume that simpler models have higher prior probabilities (cf. the short piece by Harold Jeffreys).  A more subtle argument, expressed by Rosenkrantz and surveyed by Kass and Raftery, turns on averaging the likelihoods over the prior probabilities of parameter settings---the theory with more free parameters will end up with a lower averaged likelihood than a theory that predicts the same data without free parameters.  That is why more free parameters tend to lower the confirmation of accommodated data.  The issue is whether this ultimately begs the question for Ockham (if the posterior probability is considered) or fails to address it (if only the Bayes factors are considered).  So I will argue.
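
Here is a toy Bayes-factor computation (my own example, not Rosenkrantz's or Kass and Raftery's; the prior range, sample, and grid are assumptions) showing how averaging the likelihood over a prior on a free parameter penalizes the more flexible theory.

    import numpy as np

    rng = np.random.default_rng(4)
    data = rng.normal(0.0, 1.0, 10)             # the data happen to lie near 0

    def log_lik(theta):
        return float(np.sum(-0.5 * (data - theta) ** 2 - 0.5 * np.log(2 * np.pi)))

    # Simple theory: theta fixed at 0, no free parameter.
    marg_simple = np.exp(log_lik(0.0))

    # Complex theory: theta free, flat prior on [-10, 10]; average the likelihood.
    grid = np.linspace(-10.0, 10.0, 4001)
    marg_complex = float(np.mean([np.exp(log_lik(t)) for t in grid]))

    print("Bayes factor (simple over complex):", round(marg_simple / marg_complex, 1))
    # The ratio is well above 1: the free parameter dilutes the averaged likelihood,
    # even though the complex theory contains the truth as a special case.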

 

Readings:

Clark Glymour, “Why I am Not a Bayesian”, Chapter 3 of Theory and Evidence. 

Roger Rosenkrantz, “Why Glymour is a Bayesian”, in Testing Scientific Theories.

Harold Jeffreys, “Theory of Probability”, short selection on Ockham’s razor.

Robert Kass and Adrian Raftery, “Bayes Factors”, JASA, sections 3, 5, 6, 8, 10.

 

Background:  Gideon Schwarz, “Estimating the Dimension of a Model”, Annals of Statistics.

Part III:  Efficient Convergence to the Truth

O.K., here is my idea.  Ockham’s razor is the unique strategy that keeps inquiry on the straightest path to the truth.  Since “straightest” possible convergence is weaker than pointing at the truth immediately, no circular assumption that the world is probably simple is required.  Since “straightest” possible convergence is stronger than mere convergence in the limit, Ockham’s razor is not merely sufficient but necessary.  Unlike the over-fitting story, this one is directed at finding the true theory rather than accurate predictions.  Unlike the Bayesian story, it does not appeal to the very bias toward simplicity that is to be explained.  
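
Here is a toy Python sketch in the spirit of the idea (not the Ockham efficiency theorem itself; the stream of effects and the violator's behavior are assumptions of mine).  A learner conjectures how many "effects" there are in total, and a retraction is any change of conjecture.  The Ockham learner posits only the effects seen so far; a learner that leaps ahead can be forced into an extra reversal of opinion on a stream that withholds the anticipated effect until after the leaper retreats.

    def retractions(seq):
        """Number of changes of conjecture (reversals of opinion)."""
        return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

    # 1 = a new effect appears at that stage, 0 = nothing new is observed.
    stream = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
    counts = [sum(stream[:t + 1]) for t in range(len(stream))]   # effects seen so far

    ockham = counts                                    # posit exactly what has been seen
    violator = [c + 1 if t < 5 else c                  # leap one effect ahead, then retreat
                for t, c in enumerate(counts)]

    print("Ockham:  ", ockham,   "retractions:", retractions(ockham))
    print("Violator:", violator, "retractions:", retractions(violator))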

 

K. Kelly, “How Simplicity Helps You Find the Truth Without Pointing at It”, in M. Friend, N. B. Goethe, and V. Harizanov, eds., Induction, Algorithmic Learning Theory, and Philosophy, Dordrecht: Springer, 2007.

 

K. Kelly, “Simplicity and Truth: An Alternative Explanation of Ockham’s Razor”, keynote lecture, IDEAL 2008, Birmingham, UK. 

 

For next time, we’ll continue to go through the Ockham efficiency theorem more carefully and we will start to look at the underlying concept of simplicity.  The following paper is somewhat different from “How Simplicity Helps You Find the Truth Without Pointing at it” and has a nicer theory of simplicity near the end:

 

K. Kelly, “Ockham’s Razor, Truth, and Information”, in Handbook on the Philosophy of Information, J. van Benthem and P. Adriaans, eds., 2008.

Part IV: The Nature of Empirical Simplicity

To this point, simplicity has been understood in terms of “empirical effects” that arrive in the data like discrete marbles.  This week I will explain what empirical effects are.  The basic idea is that the empirical complexity of a theory T is a partial order reflecting iterated problems of induction relative to the given question.  This topic will interface with the visit of Jeroen Groenendijk, a world expert on the logic and linguistics of questions.

Part V: Ockham’s Shaky Razor

In game theory, it is typically the case that one can do better in the worst case by randomizing.  A familiar example is “Rock-paper-scissors”, whose unique equilibrium is the strategy of playing each option with probability 1/3.  As it turns out, deterministic Ockham strategies are optimal.  The only randomization compatible with optimality is to randomize between the unique Ockham answer and the question mark at the first moment the Ockham answer is produced with any probability.  Thereafter, the Ockham answer must be produced with unit probability until it is no longer Ockham.  
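
For contrast, here is a small sketch of the rock-paper-scissors point (standard game theory, not the paper's formalism): the uniform mixed strategy guarantees expected payoff 0 against any opponent, whereas every deterministic strategy has worst-case payoff -1.

    import numpy as np

    # Payoff matrix for the row player; rows and columns are rock, paper, scissors.
    A = np.array([[ 0, -1,  1],
                  [ 1,  0, -1],
                  [-1,  1,  0]])

    uniform = np.array([1/3, 1/3, 1/3])
    print("uniform mixture, worst-case expected payoff:", (uniform @ A).min())
    for i, name in enumerate(["rock", "paper", "scissors"]):
        pure = np.eye(3)[i]
        print(name, "worst-case payoff:", (pure @ A).min())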

 

Another reason to look at random strategies is that doing so forces one to generalize the concept of efficiency to random outputs, a crucial step toward a full theory of statistical inference.

 

The paper is a long working draft.  Ignore the results on measurability and focus on the definitions of Ockham’s razor, stalwartness, and efficiency.  

 

Kelly and Mayo-Wilson:  “Ockham Efficiency Theorem for Empirical Methods Conceived as Empirically-Driven, Countable-State Stochastic Processes”, manuscript.

 

Part VI: Acceptance and Bayesian Retractions

What does it mean for a Bayesian to retract?  We will look at a new theory that Hanti Lin and I have worked out.