
Stepwise regression builds a model by adding or removing predictor variables one at a time, usually via a series of F-tests or t-tests. The variables to add or remove are chosen based on the test statistics of the estimated coefficients. While the technique has its benefits, it requires skill on the part of the researcher, so it should be performed by people who are very familiar with statistical testing. In essence, models created with stepwise regression should be taken with a grain of salt, more so than most regression models; they require a keen eye to detect whether they make sense or not.

## How Stepwise Regression Works

The two ways that software will perform stepwise regression are:

**Start the test with all available predictor variables** (the “Backward” method), deleting one variable at a time as the regression model progresses. Use this method if you have a modest number of predictor variables and you want to eliminate a few. At each step, the variable with the lowest “F-to-remove” statistic is deleted from the model. The “F-to-remove” statistic is calculated as follows:

- A t-statistic is calculated for the estimated coefficient of each variable in the model.
- The t-statistic is squared, creating the “F-to-remove” statistic.
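
The backward procedure can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not what any particular statistics package does: the function names `f_to_remove` and `backward_eliminate`, the synthetic data, and the stopping rule (stop once every remaining variable has an F-to-remove of at least `f_threshold=4.0`) are all assumptions made for the example.

```python
import numpy as np

def f_to_remove(X, y):
    """Squared t-statistic of each column of X in an OLS fit with intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - p - 1)        # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)       # covariance of the coefficients
    t = beta / np.sqrt(np.diag(cov))            # t-statistic per coefficient
    return t[1:] ** 2                           # square it; skip the intercept

def backward_eliminate(X, y, names, f_threshold=4.0):
    """Drop the variable with the lowest F-to-remove until all pass the threshold."""
    keep = list(range(X.shape[1]))
    while keep:
        f = f_to_remove(X[:, keep], y)
        worst = int(np.argmin(f))
        if f[worst] >= f_threshold:             # assumed stopping rule
            break
        del keep[worst]
    return [names[i] for i in keep]

# Synthetic data: only x0 and x1 actually drive y; x2 and x3 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

selected = backward_eliminate(X, y, ["x0", "x1", "x2", "x3"])
print(selected)
```

The two real predictors have very large F-to-remove values and survive; whether a pure-noise column survives depends on the random draw, which is itself a small preview of the problems discussed later in this article.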

**Start the test with no predictor variables** (the “Forward” method), adding one at a time as the regression model progresses. If you have a large set of predictor variables, use this method. The “F-to-add” statistic is calculated the same way as the “F-to-remove” statistic (a squared t-statistic), except that it is computed for each variable *not* in the model, as if that variable had been added. The variable with the highest “F-to-add” statistic is added to the model.
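
A forward pass might look like the following sketch. As before, the names `f_to_add` and `forward_select`, the synthetic data, and the rule of stopping once the best candidate's F-to-add falls below `f_threshold=4.0` are assumptions for illustration; real software typically exposes its own entry criteria.

```python
import numpy as np

def f_to_add(X, y, in_model, candidate):
    """Squared t-statistic the candidate column would have if added to the model."""
    cols = in_model + [candidate]
    n = len(y)
    A = np.column_stack([np.ones(n), X[:, cols]])   # intercept + current + candidate
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])       # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return beta[-1] ** 2 / cov[-1, -1]              # t^2 for the candidate (last column)

def forward_select(X, y, f_threshold=4.0):
    """Add the variable with the highest F-to-add until none passes the threshold."""
    in_model = []
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: f_to_add(X, y, in_model, j) for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < f_threshold:              # assumed stopping rule
            break
        in_model.append(best)
        remaining.remove(best)
    return in_model

# Synthetic data: columns 0 and 1 drive y; the other four are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)

order = forward_select(X, y)
print(order)  # column indices in the order they entered the model
```

The order of entry is what the article later refers to as potentially informative: stronger predictors tend to enter first.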

## Advantages and Disadvantages

Advantages of stepwise regression include:

- It can handle a large number of potential predictor variables, fine-tuning the model to choose the best predictor variables from the available options.
- It’s faster than other automatic model-selection methods.
- Watching the order in which variables are removed or added can provide valuable information about the quality of the predictor variables.

Although stepwise regression is popular, many statisticians agree that it’s riddled with problems and should *not* be used. Some issues include:

- Stepwise regression often has many potential predictor variables but too little data to estimate coefficients meaningfully. Adding more data does not help much, if at all.
- If two predictor variables in the model are highly correlated, only one may make it into the model.
- R-squared values are usually too high.
- Adjusted R-squared values might be high and then dip sharply as the model progresses. If so, identify the variables that were added or removed at that step and adjust the model.
- The F and chi-square statistics listed next to each variable in the output don’t actually follow the F and chi-square distributions they are named for.
- Confidence intervals for coefficients and for predicted values are too narrow.
- P-values are given that do not have the correct meaning.
- Regression coefficients are biased: the coefficients of the variables that remain in the model tend to be too large in absolute value.
- Collinearity is usually a major issue. Excessive collinearity may cause the program to dump predictor variables into the model.
- Some variables (especially dummy variables) may be removed from the model even though they are considered important to include. These can be manually added back in.
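
The inflated R-squared is easy to see in a small simulation. The sketch below is an illustration under stated assumptions, not a formal proof: the response `y` is generated independently of all 50 predictors, and a greedy forward search picks the 5 columns that raise R-squared the most (for a fixed current model, this is the same choice as picking the largest F-to-add). The helper `r_squared` and the sizes `n`, `p`, and `k` are hypothetical choices for the example.

```python
import numpy as np

def r_squared(X, y, cols):
    """R-squared of an OLS fit of y on an intercept plus the given columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n, p, k = 100, 50, 5                 # observations, candidate predictors, variables to pick
X = rng.normal(size=(n, p))
y = rng.normal(size=n)               # pure noise: y is unrelated to every column of X

# Greedy forward search: at each step add the column that raises R-squared the most.
chosen = []
for _ in range(k):
    rest = [j for j in range(p) if j not in chosen]
    chosen.append(max(rest, key=lambda j: r_squared(X, y, chosen + [j])))

r2_stepwise = r_squared(X, y, chosen)            # model found by searching
r2_fixed = r_squared(X, y, list(range(k)))       # first k columns, chosen without looking at y

print(f"R-squared, {k} variables picked by forward search: {r2_stepwise:.3f}")
print(f"R-squared, {k} variables fixed in advance:         {r2_fixed:.3f}")
```

Because the search sifts through many candidates and keeps only the ones that happen to correlate with `y` in this sample, the selected model typically reports a noticeably higher R-squared than a model whose variables were fixed in advance, even though no predictor has any real relationship with the response.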
