Prof. Chouldechova

There is a fair bit of missingness in the data set. There are several approaches to dealing with missing data:

- **Exclude**: You can omit observations with missing values (e.g., remove any rows that contain missing data).
- **Impute**: R has various packages (Amelia, mice, mi, the `impute()` function from Hmisc, etc.) that can help with imputing missing values.
- **Think carefully** about whether certain kinds of missingness are **informative**.

The downsides of the **Impute** approach:

- Imputation methods often rely on fairly **strong assumptions** concerning the process governing the appearance of missing values (assumptions such as MAR, missing at random; or MCAR, missing completely at random).
- This is **a lot of hassle** to go through unless you want practice imputing values.

Why the **think carefully** approach can be a good one:

- For factor variables, you can treat missing values as just another factor level. Sometimes **missingness can be informative (predictive)**, leading to a significant coefficient for the missing level.
  - E.g., just now we ran a logistic regression in which we used `?` as one of the levels of the `workingclass` variable to indicate individuals whose working class is unknown. Having `workingclass = ?` turned out to be strongly associated with earning under 50k a year.
- For numeric variables, there's not much you can do. Just recode negative values to `NA`.
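A minimal sketch of both ideas on a toy data frame. The data values are made up for illustration; only the `workingclass` variable and its `?` level come from the example above, and the negative numbers stand in for whatever missing-value codes the real data uses.

```r
# Hypothetical toy data: negative incomes stand in for missing-value codes
df <- data.frame(
  income = c(52000, -4, 31000, -1, 78000),
  workingclass = c("Private", "?", "State-gov", "Private", "?"),
  stringsAsFactors = FALSE
)

# Numeric variable: recode the negative codes to NA
df$income[df$income < 0] <- NA

# Factor variable: keep "?" as its own level instead of dropping it,
# so a model can estimate a coefficient for "unknown working class"
df$workingclass <- factor(df$workingclass)

levels(df$workingclass)
table(df$workingclass)
```

With `?` kept as a level, a regression that includes `workingclass` will report a separate coefficient for it, which is how informative missingness would show up.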

**My recommendation**

- Start by **thinking carefully** about missing values.
- If nothing interesting turns up, go ahead and **exclude** them (code as `NA`, proceed accordingly).
- **Warning**: Trying to impute can consume a lot of time.
  - It is not guaranteed to produce better results than what you'd have if you just excluded all observations with missing values.
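Once missing values are coded as `NA`, excluding them takes very little code. A minimal sketch with a made-up data frame (the column names and values are assumptions for illustration):

```r
# Hypothetical data with missing values already coded as NA
df <- data.frame(
  income = c(52000, NA, 31000, NA, 78000),
  age    = c(34, 51, NA, 29, 45)
)

# How much missingness is there in each column?
n_missing <- colSums(is.na(df))
n_missing

# Keep only the rows with no missing values
df_complete <- na.omit(df)
nrow(df_complete)
```

Note that `lm()` and `glm()` drop rows with `NA` in the model variables by default, so explicit exclusion mostly matters when you want the same set of rows used across several models.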

The income variable that you have available is **topcoded**. For the top 2% of earners, you don't observe their actual income. Instead, their income is recorded as the average of the top 2% of incomes.

Standard regression applied to data with a topcoded outcome is **inconsistent**.

- I.e., even if you had infinite data, your coefficient estimates won't converge to the “true” coefficients.
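A small simulation can make this concrete. This is a sketch under assumed settings (a skewed regressor, normal errors, a true slope of 2), not the course data; every name in it is made up for illustration.

```r
set.seed(1)
n <- 100000

# Assumed data-generating process: true slope on x is 2
x <- rexp(n)
y <- 1 + 2 * x + rnorm(n)

# Topcode: replace the top 2% of outcomes with their average
cutoff <- as.numeric(quantile(y, 0.98))
is_top <- y > cutoff
y_topcoded <- ifelse(is_top, mean(y[is_top]), y)

# Fit the same regression on the true and the topcoded outcome
slope_full     <- coef(lm(y ~ x))[["x"]]
slope_topcoded <- coef(lm(y_topcoded ~ x))[["x"]]

c(full = slope_full, topcoded = slope_topcoded)
```

The slope from the topcoded outcome should come out below the slope from the untouched outcome, and rerunning with a larger `n` should not close the gap, which is what inconsistency means here.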

1. **Tobit regression** (censored regression).
   - We didn't talk about this method in class.
   - It's not too difficult to understand if you already understand linear regression.
   - A tutorial can be found here.
2. Try fitting the regression models / running hypothesis tests two ways.
   - **First way**: include the topcoded observations.
   - **Second way**: exclude all observations with topcoded outcomes.
   - If your estimates change a lot, then you probably don't want to use the topcoded observations.
   - If you go this route, be sure to explain what omitting the high-earning individuals means for the scope of your conclusions.
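The two-way comparison can be sketched as follows. The data are simulated and the variable names (`x`, `income`, `is_topcoded`) are assumptions for illustration. The sketch flags topcoded rows as those sharing the largest recorded value, which follows from every topcoded income being recorded as the same average.

```r
set.seed(42)
n <- 5000

# Simulated stand-in for the real data
dat <- data.frame(x = rexp(n))
dat$income <- 10 + 2 * dat$x + rnorm(n)

# Mimic topcoding: the top 2% are recorded as their average
top <- dat$income > quantile(dat$income, 0.98)
dat$income[top] <- mean(dat$income[top])

# In the recorded data, topcoded rows all share the maximum value
dat$is_topcoded <- dat$income == max(dat$income)

# First way: include the topcoded observations
fit_all <- lm(income ~ x, data = dat)

# Second way: exclude them
fit_excl <- lm(income ~ x, data = dat, subset = !is_topcoded)

# Side-by-side coefficient estimates
comparison <- cbind(include = coef(fit_all), exclude = coef(fit_excl))
comparison
```

If the two columns of `comparison` differ by a lot relative to their standard errors, that is the signal to be wary of the topcoded observations.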

**My recommendation**: Take approach (2), unless you want practice with tobit regression.
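For anyone who does want the practice, here is a minimal tobit sketch. It uses `survreg()` from the survival package, since a Gaussian model with right-censoring is a tobit model; this is one possible route, not necessarily the one the tutorial takes. The data are simulated, and the sketch assumes the censoring cutoff is known, which you would have to establish for the real data.

```r
library(survival)  # a recommended package bundled with most R installations

set.seed(1)
n <- 50000

# Assumed data-generating process: true slope on x is 2
x <- rexp(n)
y <- 10 + 2 * x + rnorm(n)

# Treat the top 2% as right-censored at the cutoff (assumed known here)
cutoff   <- as.numeric(quantile(y, 0.98))
observed <- y < cutoff        # TRUE = actual value seen, FALSE = censored
y_cens   <- pmin(y, cutoff)   # censored outcomes are recorded at the cutoff

# Gaussian survreg with right-censoring = tobit regression
fit_tobit <- survreg(Surv(y_cens, observed, type = "right") ~ x,
                     dist = "gaussian")
slope_tobit <- coef(fit_tobit)[["x"]]
slope_tobit
```

Under these assumptions the tobit slope should land close to the true value of 2. The AER package also provides a `tobit()` function for the same kind of model.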