
## What is Cook’s Distance?

Cook’s distance, D_{i}, is used in Regression Analysis to find influential outliers in a set of predictor variables. In other words, it’s a way to identify points that negatively affect your regression model. The measurement is a combination of each observation’s leverage and residual values; the higher the leverage and residuals, the higher the Cook’s distance.
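One common way to see the "leverage and residual" combination is the algebraically equivalent form below. The symbols are introduced here for illustration and are not defined in the text above: e_{i} is the i-th residual, h_{ii} is the i-th observation's leverage (the i-th diagonal entry of the hat matrix), p is the number of regression coefficients including the intercept, and MSE is the model's mean squared error.

$$D_i = \frac{e_i^2}{p \cdot \mathrm{MSE}} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

Read this way, a point needs both a non-zero residual and some leverage to be influential: a large residual at a low-leverage point, or a high-leverage point that sits almost exactly on the fitted line, can still produce a small D_{i}.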

Several interpretations for Cook’s distance exist.

- A **general rule of thumb** is that any observation with a Cook’s D of more than 3 times the mean, μ, of all the D_{i} values is a possible outlier.
- An alternative interpretation is to investigate any point over 4/n, where n is the number of observations.
- Other authors suggest that any “large” D_{i} should be investigated. How large is “too large”? The consensus seems to be that a D_{i} value of more than 1 indicates an influential value, but you may want to look at values above 0.5. Any value that sticks out from the others (for example, on a chart of the D_{i} values) should also be investigated.
- An alternative (but slightly more technical) way to interpret D_{i} is to find the potential outlier’s percentile value using the F-distribution. A percentile of over 50 indicates a highly influential point.
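The fixed cutoffs in the list above are easy to apply once you have the D_{i} values. The following is a minimal sketch in plain Python, not a definitive implementation: the function name `flag_influential`, the rule labels, and the example values are all hypothetical, and the F-distribution percentile rule is left out because it needs a statistics library.

```python
from statistics import mean


def flag_influential(cooks_d):
    """Return, for each rule of thumb, the indices of the observations it flags."""
    n = len(cooks_d)
    mu = mean(cooks_d)
    # Cutoffs taken from the rules of thumb listed above.
    rules = {
        "over 3x mean": 3 * mu,
        "over 4/n": 4 / n,
        "over 0.5": 0.5,
        "over 1": 1.0,
    }
    return {
        name: [i for i, d in enumerate(cooks_d) if d > cutoff]
        for name, cutoff in rules.items()
    }


# Hypothetical Cook's D values for ten observations (illustrative only).
example_d = [0.02, 0.05, 0.01, 0.03, 0.60, 0.04, 0.02, 1.30, 0.03, 0.02]
flags = flag_influential(example_d)
for rule, indices in flags.items():
    print(rule, indices)
```

The rules do not always agree. In this made-up example the value 0.60 is flagged by the 4/n and 0.5 cutoffs but not by the 3-times-the-mean rule, which is why flagged points are best treated as candidates for a closer look rather than as automatic deletions.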

If you have a *lot* of points with large D_{i} values, that could indicate a problem with your regression model in general.

## Formula

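Technically, Cook’s D is calculated by removing the i-th data point from the model and recalculating the regression. **It summarizes how much all the fitted values in the regression model change when the i-th observation is removed.** The formula for Cook’s distance is:

$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot \mathrm{MSE}}$$

where $\hat{y}_j$ is the j-th fitted value from the full model, $\hat{y}_{j(i)}$ is the j-th fitted value after the model is refit without observation i, p is the number of regression coefficients (including the intercept), and MSE is the mean squared error of the full model (the residual sum of squares divided by n − p).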

As this can get quite cumbersome by hand, you’ll want to use software like Minitab or SPSS to do it.

In **Minitab**:

- Go to Regression > Regression.
- Click “Storage” then select “Cook’s Distance.”
- Click “OK.”

A COOK column will appear in your data cells with the Cook’s D values.
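If you want to see what the software is doing, the delete-one-and-refit calculation can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Minitab's or SPSS's implementation: it handles only a single predictor (so p = 2 coefficients), the helper names `fit_line` and `cooks_distance` are hypothetical, and the data are made up.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cooks_distance(x, y):
    """Cook's D for each observation, found by deleting it and refitting."""
    n, p = len(x), 2  # p = 2 coefficients: intercept and slope
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    # Mean squared error of the full model.
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit the line without observation i.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Squared change in every fitted value (all n points, including i).
        shift = sum((fi - (a0 + a1 * xi)) ** 2 for fi, xi in zip(fitted, x))
        distances.append(shift / (p * mse))
    return distances


# Hypothetical data: the last point has an extreme x and lies off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
d = cooks_distance(x, y)
for i, value in enumerate(d):
    print(i, round(value, 3))
```

Refitting the model n times is fine for a small data set like this one. Statistical packages typically avoid the refits by using the equivalent leverage-and-residual form of the statistic.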

**Reference**:

Cook, R. Dennis (February 1977). “Detection of Influential Observations in Linear Regression”. Technometrics (American Statistical Association) 19 (1): 15–18.
