# Final project: missing data & topcoded values

Prof. Chouldechova

### Final project: missing data

There is a fair bit of missingness in the data set. There are several approaches to dealing with missing data:

1. Exclude
   • You can omit observations with missing values (e.g., remove any rows that contain missing data)

2. Impute
   • R has various packages (Amelia, mice, mi, impute() function from Hmisc, etc.) that can help with imputing missing values.

3. Think carefully about whether certain kinds of missingness are informative
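
As a rough illustration of the Exclude approach, here is a minimal sketch in Python using only the standard library. The records, the column names, and the helper functions `count_missing` and `drop_incomplete` are all hypothetical and not part of the project data. It first counts the missing entries per column, then keeps only the complete rows.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 39, "workingclass": "Private", "hours": 40},
    {"age": 50, "workingclass": None, "hours": 13},
    {"age": None, "workingclass": "State-gov", "hours": 40},
    {"age": 28, "workingclass": "Private", "hours": 45},
]


def count_missing(rows):
    """Return the number of missing (None) entries in each column."""
    counts = {key: 0 for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return counts


def drop_incomplete(rows):
    """Keep only the rows that have no missing entries."""
    return [row for row in rows if all(v is not None for v in row.values())]


missing_counts = count_missing(rows)
complete_rows = drop_incomplete(rows)

print(missing_counts)
print(len(complete_rows), "of", len(rows), "rows kept")
```

Counting before dropping shows how many observations exclusion will cost. In R, `na.omit()` and `complete.cases()` apply the same kind of row filter to a data frame.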

### Final project: missing data

The downsides of the Impute approach:

• Imputation methods often rely on fairly strong assumptions concerning the process governing the appearance of missing values (assumptions such as MAR, missing at random; or MCAR, missing completely at random).

• This is a lot of hassle to go through unless you want practice imputing values

### Final project: missing data

Why the "think carefully" approach can be a good one:

• For factor variables, you can treat missing values as just another factor level. Sometimes missingness can be informative (predictive), leading to a significant coefficient for the missing level.

• E.g., just now we ran a logistic regression in which we used ? as one of the levels of the workingclass variable to indicate individuals whose working class is unknown. Having workingclass = ? turned out to be strongly associated with earning under 50k a year.
• For numeric variables, there's not much you can do. Just recode the negative values (the codes that stand in for missing responses) to NA.
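
The two recoding ideas above can be sketched as follows, in Python with only the standard library. The toy records are made up for illustration, `share_over_50k` and `cleaned_values` are hypothetical names, and the sketch assumes that every negative numeric value is a missing-value code. The first part keeps `?` as its own level and tabulates the outcome by level; the second part turns negative codes into a missing marker.

```python
from collections import Counter

# Hypothetical (workingclass, earns_over_50k) pairs; "?" means unknown.
records = [
    ("Private", True), ("Private", False), ("Private", True), ("Private", False),
    ("?", False), ("?", False), ("?", False),
    ("Self-emp", True), ("Self-emp", False),
]

# Treat "?" as just another level and compare outcome rates across levels.
totals = Counter(level for level, _ in records)
high_earners = Counter(level for level, over_50k in records if over_50k)
share_over_50k = {level: high_earners[level] / totals[level] for level in totals}
print(share_over_50k)

# Numeric variable: assume negative values are missing-value codes,
# and recode them to None (the analogue of NA in R).
raw_values = [52000, -4, 31000, -2, 78000]
cleaned_values = [None if v < 0 else v for v in raw_values]
print(cleaned_values)
```

If the outcome rate for the `?` level differs clearly from the other levels, the missingness is likely informative and worth keeping as a level rather than dropping.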

### Final project: missing data

My recommendation

1. Start by thinking carefully about missing values

2. If nothing interesting turns up, go ahead and exclude them (code as NA, proceed accordingly)

• Warning: Trying to impute can consume a lot of time, and it is not guaranteed to produce better results than you'd get by simply excluding all observations with missing values.

### Final project: topcoded outcome variable

• The income variable that you have available is topcoded.

• For the top 2% of earners, you don't observe their actual income.

• Instead, their income is recorded as the average of the top 2% of incomes.

• Standard regression applied to data with a topcoded outcome is inconsistent.

• i.e., even if you had infinite data, your coefficient estimates wouldn't converge to the “true” coefficients.
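
To see the effect of topcoding on a regression slope, here is a small simulation sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the linear model, the normal errors, the true slope of 2, and the helper names `ols_slope` and `topcode`. It replaces the top 2% of outcomes with their average, as described above, and fits a simple least-squares slope to the same sample before and after topcoding.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope from regressing ys on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def topcode(ys, top_fraction=0.02):
    """Replace the largest top_fraction of values with their average."""
    n_top = max(1, int(len(ys) * top_fraction))
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    top_idx = order[-n_top:]
    top_mean = sum(ys[i] for i in top_idx) / n_top
    coded = list(ys)
    for i in top_idx:
        coded[i] = top_mean
    return coded


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000
true_slope = 2.0  # assumed "true" coefficient for this simulation
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [1.0 + true_slope * x + rng.gauss(0, 1) for x in xs]

slope_full = ols_slope(xs, ys)
slope_topcoded = ols_slope(xs, topcode(ys))

print("slope, actual outcomes:  ", slope_full)
print("slope, topcoded outcomes:", slope_topcoded)
```

In this setup the topcoded slope comes out below the slope fitted to the actual outcomes, because the variation of the outcome within the top group is erased. The gap is small here, but it comes from the topcoding itself rather than from sampling noise, so collecting more data does not remove it. Raising `top_fraction` makes the attenuation easier to see.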

### Final project: topcoded outcome variable

1. Tobit regression (censored regression).