The Sample Selection Problem

Sample selection is a particular instance of the identification problem. Suppose we are interested in a theory about the relationship between two variables, say X and Y, which says that high values of X induce high values of Y. We collect data on X and Y to test the theory. A sample selection problem arises when the observations for which X is high also correspond to high values of another variable, say Z, because individuals with high Z select themselves into having high X. Because Z also has a direct effect on Y, it is hard to isolate the effect of X on Y.
As an example, let Y denote earnings after leaving school. Let X denote the number of years of schooling an individual has had, and let Z denote an individual's ability (this is one of the examples we looked at in class). There is a sample selection problem here because high-ability individuals choose to stay in school longer. Thus, they have high earnings both because they are smarter and because they got more schooling. If we cannot measure ability well, we will not be able to disentangle the separate effects of ability and schooling on earnings.
Imagine that we compared years of schooling with earnings without adjusting for the sample selection problem. Individuals with more schooling are, on average, smarter, so they also earn more because of their ability. We would find a statistically very strong effect of an extra year of schooling on earnings, but in fact only a fraction of that estimated effect would actually be due to schooling; the rest would be due to unmeasured ability differences.
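The size of this bias can be illustrated with a small simulation. The sketch below is only an illustration under made-up assumptions: the function names, the true return to schooling (0.5), the effect of ability on earnings (1.0), and the rule by which ability raises schooling are all hypothetical choices, not estimates from any data. It regresses earnings on schooling alone, ignoring ability, and compares the resulting slope with the true return built into the simulation.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def naive_schooling_slope(n=20000, true_return=0.5, ability_effect=1.0, seed=0):
    """Simulate self-selection into schooling and return the naive slope.

    All parameter values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    # Z: ability, which the analyst does not observe.
    ability = [rng.gauss(0, 1) for _ in range(n)]
    # X: schooling. High-ability individuals select into more schooling.
    schooling = [12 + 2 * z + rng.gauss(0, 1) for z in ability]
    # Y: earnings depend on schooling AND directly on ability.
    earnings = [
        true_return * x + ability_effect * z + rng.gauss(0, 1)
        for x, z in zip(schooling, ability)
    ]
    # Regress Y on X only, leaving ability out.
    return ols_slope(schooling, earnings)


true_return = 0.5
naive = naive_schooling_slope(true_return=true_return)
print("true return to a year of schooling:", true_return)
print("naive estimate ignoring ability:   ", round(naive, 3))
print("share of naive estimate due to schooling:", round(true_return / naive, 3))
```

Under these assumed parameters the naive slope should come out near 0.5 + 1.0 * 2/5 = 0.9, because the covariance between schooling and ability is 2 and the variance of schooling is 5. Only a little over half of the estimated effect is then actually due to schooling; the rest reflects ability.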
To deal with this problem, we would like to conduct some form of experiment.