Konstantin Genin and Kevin T. Kelly
Department of Philosophy
Carnegie Mellon University
Statisticians tend to think of scientific theories as "models" that are literally false but that can be fit to the data in a way that yields reasonably accurate predictions of features of future samples. But sometimes one hopes to find the true model rather than a merely predictive one, as in the fundamental problem of causal inference from non-experimental data. That is hard, due to the ancient problem of inductive skepticism, which explains why frequentist statisticians don't want to touch such problems.
Bayesian conditioning and other methods like the Bayesian information criterion (BIC) or the PC algorithm for causal network search can converge (point-wise) in probability to the true model if the true model is entertained at all. But one should not imagine that they lead one straight to the truth like a compass! Nature can force all convergent methods to perform dramatic retractions of opinion because inductive inference of true models inherently excludes bounds on the chance of error en route to the truth. Theory choice methods can be forced to choose one theory with arbitrarily high chance and then choose another theory with arbitrarily high chance, etc. Bayesians can be forced to put a high expected posterior on one theory followed by a high expected posterior on another theory, etc., where the expectations are taken with respect to chance. But how do these retractions in chance happen and, more importantly, how can they be minimized so that we at least are not subjected to needless surprises in future studies at larger sample size?
We examine a toy problem well-adapted to graphical representation, with the idea that it can serve as a proxy for more interesting problems that are difficult to visualize, like causal network search. Suppose that there are two variables X and Y with known covariance and the question is exactly which components of the mean vector (muX, muY) are nonzero. The possible theories are none, both, just muX or just muY. Note that as propositions they are all mutually exclusive, but since nonzero means are free parameters, a theory with fewer nonzero means is simpler than one with more. Geometrically speaking, "none" is true precisely at the origin of the Cartesian coordinates, "just muX" is the X axis minus the origin, "just muY" is the Y axis minus the origin, and "both" is everything but the coordinate axes.
The BIC score for model H on sample S of size n is defined as: BIC(H, S) = -2 ln L(H, S) + k ln(n), where n is the sample size, k is the number of free parameters in H, and L(H, S) is the maximized likelihood of H given the sample. A standard procedure is to draw a sample of size n and to choose the model H whose BIC score is minimum. That method chooses a model for each point in (X-bar, Y-bar) space, so one can plot the regions in the plane at which each such model is chosen. In our simulations, each such region is a different color. The zone for the origin model is blue, for the X axis model is yellow, for the Y axis model is red, and for the background model is green. The blue unfilled ellipses are the 95th and 99th quantiles of the sampling Gaussian, respectively. As covariance increases, so does the eccentricity of the ellipse.
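The BIC comparison just described can be sketched in a few lines. This is a minimal illustration, not the code behind the simulations. It assumes that both variables have unit variance and known correlation rho (a hypothetical parameter name, with -1 < rho < 1), and it drops the part of -2 ln L that is the same for all four models, so only differences between scores are meaningful.

```python
import math

def bic_scores(xbar, ybar, n, rho=0.0):
    """BIC scores of the four models, up to a constant shared by all of them.

    Assumes unit variances and known correlation rho (illustrative assumption).
    For a Gaussian with known covariance, -2 ln L reduces to n times the
    squared Mahalanobis distance from the sample mean to the closest mean
    vector that the model allows.
    """
    # Squared Mahalanobis distance from the sample mean to the origin.
    q_origin = (xbar**2 - 2 * rho * xbar * ybar + ybar**2) / (1 - rho**2)
    penalty = math.log(n)
    return {
        "none": n * q_origin,                # k = 0 free parameters
        "just muX": n * ybar**2 + penalty,   # k = 1, muX fitted, muY = 0
        "just muY": n * xbar**2 + penalty,   # k = 1, muY fitted, muX = 0
        "both": 2 * penalty,                 # k = 2, perfect fit to the sample mean
    }

def bic_choice(xbar, ybar, n, rho=0.0):
    """Return the model with the minimum BIC score at this sample mean."""
    scores = bic_scores(xbar, ybar, n, rho)
    return min(scores, key=scores.get)
```

Evaluating `bic_choice` over a grid of (X-bar, Y-bar) points and coloring each point by the returned label would reproduce the kind of acceptance-zone picture described above, under the stated assumptions.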
We choose a true mean vector in which both components are non-zero but very small, with muY much smaller than muX. The sampling density for (X-bar, Y-bar) shrinks as sample size increases but we zoom in on it to keep it centered in the picture at a fixed size, as though we are watching an airplane through powerful binoculars. The result is that the acceptance zones in the background are magnified and shift as we maintain our focus on the sampling density. The effect is reminiscent of a 747 taking off from a runway. The acceleration effect is artificial---it reflects our use of a log time scale to speed things up. Actually, the convergence is extremely slow. The slowness is easy to understand from the picture---the sampling distribution follows a straight-line path away from the origin, so the shallower the angle of departure, the longer it takes to escape the yellow band.
During the simulation one sees the .99 quantile of the sampling distribution filled first with blue and then with yellow and finally with green. Those are retractions in chance: momentous drops in the chance of producing the first two hypotheses. The graph to the right plots the chance of producing each answer in the color that corresponds to its acceptance zone. Total retractions are tallied at the top of the frame and the retractions of each answer are to the right, in matching colors. The graphs should be smooth. The choppiness is due to sampling effects in the Monte Carlo estimates of the chances and can be eliminated by using larger samples and by computing retractions over larger intervals. Because every spurious dip counts as a retraction, the total retraction estimate in the posted simulations is currently too high. We will re-run them to get better estimates. But the comparisons between methods are realistic.
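A rough sketch of how such chances and retraction tallies might be estimated follows. It is an illustration under assumptions, not the simulation code: variances are taken to be 1 with known correlation rho, the sample mean is drawn directly from its sampling distribution, and the retractions of an answer are taken to be the summed drops in its estimated chance between consecutive sample sizes. The function names and the `trials` and `seed` parameters are hypothetical.

```python
import math
import random

def bic_choice(xbar, ybar, n, rho=0.0):
    # Minimum-BIC model, unit variances assumed; shared constants dropped.
    q0 = (xbar**2 - 2 * rho * xbar * ybar + ybar**2) / (1 - rho**2)
    scores = {"none": n * q0,
              "just muX": n * ybar**2 + math.log(n),
              "just muY": n * xbar**2 + math.log(n),
              "both": 2 * math.log(n)}
    return min(scores, key=scores.get)

def estimate_chances(mu_x, mu_y, n, rho=0.0, trials=2000, seed=0):
    """Monte Carlo estimate of the chance that each answer is produced."""
    rng = random.Random(seed)
    counts = {"none": 0, "just muX": 0, "just muY": 0, "both": 0}
    for _ in range(trials):
        # The sample mean of n draws has covariance Sigma / n.
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xbar = mu_x + z1 / math.sqrt(n)
        ybar = mu_y + (rho * z1 + math.sqrt(1 - rho**2) * z2) / math.sqrt(n)
        counts[bic_choice(xbar, ybar, n, rho)] += 1
    return {h: c / trials for h, c in counts.items()}

def retractions(chance_path):
    """Summed drops in the chance of each answer along a list of estimates."""
    drops = {h: 0.0 for h in chance_path[0]}
    for before, after in zip(chance_path, chance_path[1:]):
        for h in drops:
            drops[h] += max(0.0, before[h] - after[h])
    return drops
```

For instance, `retractions([estimate_chances(0.05, 0.005, n) for n in (10, 100, 1000, 10000)])` would tally drops along a coarse path of sample sizes. Any Monte Carlo noise in the estimates shows up as extra small drops, which is one way to see why a choppy curve inflates the tally.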
Bayesian conditioning looks very similar to BIC. Retractions in chance are measured as total drops in expected posterior probability.
The following findings illustrate vividly how the aim of minimizing retractions en route to convergence can yield concrete suggestions for improving very familiar inductive strategies in statistics.
Both the Bayesian posterior .95 quantile zone for the origin model and the BIC acceptance zone for the origin model have a narrow, pinched-in shape when X and Y are correlated and are square (!) and too small when X and Y are independent even though the sampling distribution is round (how do you fit a round peg into a square hole?). The mismatch between the acceptance zones and the sampling distribution results in extra retractions that can be eliminated by making the acceptance zone for the origin model match the shape of the sampling distribution. For example, one can compare the BIC score of the origin model only with the background and the axis models only with the background. That is actually easier to compute and results in a more sensible method, so far as minimizing retractions is concerned.
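One possible reading of that modified comparison is sketched below. The order of the tests and the tie-breaking between the two axis models are assumptions made for illustration, as are the unit variances and the parameter name rho; the text above does not settle those details.

```python
import math

def pairwise_choice(xbar, ybar, n, rho=0.0):
    """Compare each simpler model only against the background ("both") model.

    Assumed reading: accept the origin model if it beats the background;
    otherwise accept the better axis model if it beats the background;
    otherwise return the background. Unit variances, known correlation rho.
    """
    penalty = math.log(n)
    background = 2 * penalty
    # Origin versus background: the zone is an ellipse n * q0 < 2 ln(n),
    # with the same shape as the contours of the sampling distribution.
    q0 = (xbar**2 - 2 * rho * xbar * ybar + ybar**2) / (1 - rho**2)
    if n * q0 < background:
        return "none"
    # Each axis model versus background.
    axis = {"just muX": n * ybar**2 + penalty,
            "just muY": n * xbar**2 + penalty}
    best = min(axis, key=axis.get)
    return best if axis[best] < background else "both"
```

With rho = 0 the origin zone of this rule is a disc of radius sqrt(2 ln(n) / n), which circumscribes the square of half-width sqrt(ln(n) / n) obtained when the origin model must also beat both axis models. At n = 100, for example, the sample mean (0.28, 0) falls in the disc but outside the square.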
One unexpected finding is that the sampling density drags across the corner of the red axis zone when it leaves the origin (blue) zone, which generates extra retractions (look for the red bump) that could have been avoided by returning "I don't know" rather than an informative theory in a region around the origin zone. That is something new---a sort of diachronic violation of Ockham's razor. The extra retractions are large when the correlation is high. These retractions can be reduced by treating the Bayesian as accepting theories only when they have posterior probability greater than some high threshold.
Retractions in chance when a theory is accepted only if its posterior probability exceeds .95, with X and Y independent. [stay tuned]
The same effect can be obtained by thresholding the BIC score. That raises the intriguing possibility that the apparent need to wait for data to confirm the simplest hypothesis is based on minimizing retractions rather than on reliability (which nobody can promise for questions of this sort).
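How the BIC score is thresholded is not spelled out here, so the following is only one way it might be done. The sketch converts BIC scores into normalized weights proportional to exp(-BIC/2), a common rough stand-in for posterior probabilities, and returns "I don't know" unless the best model's weight exceeds a threshold. The weighting scheme, the 0.95 default, the unit variances, and all names are assumptions.

```python
import math

def thresholded_choice(xbar, ybar, n, rho=0.0, threshold=0.95):
    """Accept the minimum-BIC model only if its BIC weight exceeds threshold.

    Assumes unit variances and known correlation rho. The weights
    exp(-BIC / 2), normalized to sum to 1, are an illustrative proxy
    for posterior probabilities.
    """
    q0 = (xbar**2 - 2 * rho * xbar * ybar + ybar**2) / (1 - rho**2)
    penalty = math.log(n)
    scores = {"none": n * q0,
              "just muX": n * ybar**2 + penalty,
              "just muY": n * xbar**2 + penalty,
              "both": 2 * penalty}
    best = min(scores, key=scores.get)
    # Subtract the best score before exponentiating to avoid underflow.
    weights = {h: math.exp(-(s - scores[best]) / 2) for h, s in scores.items()}
    share = weights[best] / sum(weights.values())
    return best if share > threshold else "I don't know"
```

Under these assumptions, a sample mean exactly at the origin gives the origin model a weight of 1 / (1 + 2/sqrt(n) + 1/n). That is about 0.83 at n = 100 and about 0.98 at n = 10000, so the rule withholds judgment on the simplest theory until the sample is fairly large.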
Pending: John Templeton Foundation grant 24145, Simplicity, Truth, and Ockham's Razor.
2009-2011: NSF grant 0740681, Ockham's Razor: A New Justification, Division of Social and Economic Sciences, Program for History and Philosophy of Science Engineering and Technology.