Regression Analysis > Overfitting
What is Overfitting?Overfitting is where your model is too complex for your data — it happens when your sample size is too small. If you put enough predictor variables in your regression model, you will nearly always get a model that looks significant.
While an overfitted model may fit the idiosyncrasies of your data extremely well, it won’t fit additional test samples or the overall population. The model’s
p-values, R-Squared and regression coefficients can all be misleading. Basically, you’re asking too much from a small set of data.
How to Avoid Overfitting
In linear modeling (including multiple regression), you should have at least 10-15 observations for each term you are trying to estimate. Any less than that, and you run the risk of overfitting your model.
- Interaction Effects,
- Polynomial expressions (for modeling curved lines),
- Predictor variables.
While this rule of thumb is generally accepted, Green (1991) takes this a step further and suggests that the minimum sample size for any regression should be 50, with an additional 8 observations per term. For example, if you have one interacting variable and three predictor variables, you’ll need around 45-60 items in your sample to avoid overfitting, or 50 + 3(8) = 74 items according to Green.
There are exceptions to the “10-15” rule of thumb. They include:
- When there is multicollinearity in your data, or if the effect size is small. If that’s the case, you’ll need to include more terms (although there is, unfortunately, no rule of thumb for how many terms to add!).
- You may be able to get away with as few as 10 observations per predictor if you are using logistic regression or survival models, as long as you don’t have extreme event probabilities, small effect sizes, or predictor variables with truncated ranges. (Peduzzi et al.)
How to Detect and Avoid Overfitting
The easiest way to avoid overfitting is to increase your sample size by collecting more data. If you can’t do that, the second option is to reduce the number of predictors in your model — either by combining or eliminating them. Factor Analysis is one method you can use to identify related predictors that might be candidates for combining.
Use cross validation to detect overfitting: this partitions your data, generalizes your model, and chooses the model which works best. One form of cross-validation is predicted R-squared. Most good statistical software will include this statistic, which is calculated by:
- Removing one observation at a time from your data,
- Estimating the regression equation for each iteration,
- Using the regression equation to predict the removed observation.
Cross validation isn’t a magic cure for small data sets though, and sometimes a clear model isn’t identified even with an adequate sample size.
2. Shrinkage & Resampline
3. Automated Methods
Automated stepwise regression shouldn’t be used as an overfitting solution for small data sets. According to Babyak (2004),
“The problems with automated selection conducted in this very typical manner are so
numerous that it would be hard to catalogue all of them [in a journal article].”
Babyak also recommends avoiding univariate pretesting or screening (a “variation of automated selection in disguise”), dichotomizing continuous variables — which can dramatically increase Type I errors, or multiple testing of confounding variables (although this may be ok if used judiciously).
- Babyak, M.A.,(2004). “What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models.” Psychosomatic Medicine. 2004 May-Jun;66(3):411-21.
- Green S.B., (1991) “How many subjects does it take to do a regression analysis?” Multivariate Behavior Research 26:499–510.
- Peduzzi P.N., et. al (1995). “The importance of events per independent variable in multivariable analysis, II: accuracy and precision of regression estimates.” Journal of Clinical Epidemiology 48:1503–10.
- Peduzzi P.N., et. al (1996). “A simulation study of the number of events per variable in logistic regression analysis.” Journal of Clinical Epidemiology 49:1373–9.