Statistics How To

Regression Analysis: Step by Step Articles, Videos, Simple Definitions


[Image: a simple linear regression plot for amount of rainfall.]

Regression analysis is used in statistics to find trends in data. For example, you might guess that there’s a connection between how much you eat and how much you weigh; regression analysis can help you quantify that. Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll weigh in ten years’ time if you continue to put on weight at the same rate. It will also give you a slew of statistics (including a p-value and a correlation coefficient) to tell you how accurate your model is. Most elementary stats courses cover very basic techniques, like making scatter plots and performing linear regression. However, you may come across more advanced techniques like multiple regression.

Contents:

  1. Introduction to Regression Analysis
  2. Multiple Regression Analysis
  3. Overfitting and how to avoid it
  4. More articles

Regression Analysis: An Introduction

In statistics, it’s hard to stare at a set of random numbers in a table and try to make any sense of them. For example, global warming may be reducing average snowfall in your town and you are asked to predict how much snow you think will fall this year. Looking at the following table you might guess somewhere around 10-20 inches. That’s a good guess, but you could make a better guess by using regression.
[Table: annual snowfall totals by year.]

Essentially, regression is the “best guess” at using a set of data to make some kind of prediction. It’s fitting a set of points to a graph. There’s a whole host of tools that can run regression for you, including Excel, which I used here to help make sense of that snowfall data:
[Chart: the snowfall data with a fitted regression line, produced in Excel.]
Just by looking at the regression line running down through the data, you can fine-tune your best guess a bit. You can see that the original guess (20 inches or so) was way off. For 2015, it looks like the line will be somewhere between 5 and 10 inches! That might be “good enough”, but regression also gives you a useful equation, which for this chart is:
y = -2.2923x + 4624.4.
What that means is you can plug in an x value (the year) and get a pretty good estimate of snowfall for any year. For example, 2005:
y = -2.2923(2005) + 4624.4 = 28.3385 inches, which is pretty close to the actual figure of 30 inches for that year.

Best of all, you can use the equation to make predictions. For example, how much snow will fall in 2017?
y = -2.2923(2017) + 4624.4 = 0.8 inches.
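The two calculations above can be sketched in a few lines of Python. The coefficients come straight from the chart’s equation; the snowfall data itself isn’t reproduced here, so this only evaluates the fitted line:

```python
# Evaluate the fitted trend line y = -2.2923x + 4624.4 from the chart above.
def predicted_snowfall(year):
    """Predicted snowfall (inches) for a given year."""
    return -2.2923 * year + 4624.4

print(predicted_snowfall(2005))  # ≈ 28.34 inches (actual was about 30)
print(predicted_snowfall(2017))  # ≈ 0.83 inches
```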

Regression also gives you an R-squared value, which for this graph is 0.702. This number tells you how good your model is. The values range from 0 to 1, with 0 being a terrible model and 1 being a perfect model. As you can probably see, 0.7 is a fairly decent model, so you can be fairly confident in your weather prediction!
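Under the hood, R-squared compares the scatter around the fitted line with the scatter around the mean. Here is a minimal sketch with made-up data (the article’s snowfall table isn’t reproduced here, so these numbers are purely illustrative):

```python
import numpy as np

# Made-up, roughly downward-trending data (illustrative only)
x = np.array([2000, 2001, 2002, 2003, 2004, 2005, 2006], dtype=float)
y = np.array([38.0, 36.0, 31.0, 33.0, 28.0, 30.0, 25.0])

slope, intercept = np.polyfit(x, y, 1)   # least-squares line
fitted = slope * x + intercept

ss_res = ((y - fitted) ** 2).sum()       # scatter around the line
ss_tot = ((y - y.mean()) ** 2).sum()     # scatter around the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)  # a value near 1 means the line explains most of the variation
```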


Multiple Regression Analysis

Multiple regression analysis is used to see if there is a statistically significant relationship between sets of variables. It’s used to find trends in those sets of data.

Multiple regression analysis is almost the same as simple linear regression. The only difference between simple linear regression and multiple regression is in the number of predictors (“x” variables) used in the regression.

  • Simple regression analysis uses a single x variable for each dependent “y” variable. For example: (x1, Y1).
  • Multiple regression uses multiple “x” variables for each dependent variable: ((x1)1, (x2)1, (x3)1, Y1).

In one-variable linear regression, you would input one dependent variable (e.g. “sales”) against an independent variable (e.g. “profit”). But you might be interested in how different types of sales affect the regression. You could set your X1 as one type of sales, your X2 as another type of sales and so on.

When to Use Multiple Regression Analysis.

Ordinary linear regression usually isn’t enough to take into account all of the real-life factors that have an effect on an outcome. For example, the following graph plots a single variable (number of doctors) against another variable (life-expectancy of women).

[Image: number of doctors plotted against life expectancy of women. Image: Columbia University]
From this graph it might appear there is a relationship between life-expectancy of women and the number of doctors in the population. In fact, that’s probably true and you could say it’s a simple fix: put more doctors into the population to increase life expectancy. But the reality is you would have to look at other factors like the possibility that doctors in rural areas might have less education or experience. Or perhaps they have a lack of access to medical facilities like trauma centers.

The addition of those extra factors would cause you to add additional independent variables to your regression analysis and create a multiple regression analysis model.

Multiple Regression Analysis Output.

Regression analysis is almost always performed in software, like Excel or SPSS. The output differs according to how many variables you have, but it’s essentially the same type of output you would find in a simple linear regression. There’s just more of it:

  • Simple regression: Y = b0 + b1 x.
  • Multiple regression: Y = b0 + b1 x1 + b2 x2 + … + bn xn.
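To see where the b coefficients come from, here’s a minimal sketch in Python using made-up data with two predictors (the data and “true” coefficient values are invented for illustration):

```python
import numpy as np

# Made-up data: y depends on two predictors plus a little noise
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(0, 0.1, 50)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: solve for [b0, b1, b2]
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # should be close to the true values 2.0, 1.5, -0.5
```

In practice the software (Excel, SPSS, R) does this solve for you and also reports the standard errors, p-values and R-squared alongside the coefficients.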

The output would include a summary table, similar to the one for simple linear regression. These statistics help you figure out how well the regression model fits the data. The ANOVA table in the output gives you the p-value and F-statistic.

Minimum Sample Size

“The answer to the sample size question appears to depend in part on the objectives of the researcher, the research questions that are being addressed, and the type of model being utilized. Although there are several research articles and textbooks giving recommendations for minimum sample sizes for multiple regression, few agree on how large is large enough and not many address the prediction side of MLR.” ~ Gregory T. Knofczynski

If you’re concerned with finding accurate values for the squared multiple correlation coefficient, minimizing the shrinkage of the squared multiple correlation coefficient, or have another specific goal, Gregory Knofczynski’s paper is a worthwhile read and comes with lots of references for further study. That said, many people just want to run MLR to get a general idea of trends and don’t need very specific estimates. If that’s the case, you can use a rule of thumb. It’s widely stated in the literature that you should have more than 100 items in your sample. While this is sometimes adequate, you’ll be on the safer side if you have at least 200 observations, or better yet more than 400.


Overfitting in Regression

[Image: an overfitted model. Overfitting can lead to a poor model for your data.]

Overfitting is where your model is too complex for your data; it often happens when your sample size is too small. If you put enough predictor variables in your regression model, you will nearly always get a model that looks significant.

While an overfitted model may fit the idiosyncrasies of your data extremely well, it won’t fit additional test samples or the overall population. The model’s
p-values, R-Squared and regression coefficients can all be misleading. Basically, you’re asking too much from a small set of data.

How to Avoid Overfitting

In linear modeling (including multiple regression), you should have at least 10-15 observations for each term you are trying to estimate. Any less than that, and you run the risk of overfitting your model.
“Terms” include predictor variables, interaction effects, and polynomial terms.

While this rule of thumb is generally accepted, Green (1991) takes it a step further and suggests that the minimum sample size for any regression should be 50, with an additional 8 observations per term. For example, if you have three predictor variables plus one interaction term (four terms in total), you’ll need around 40-60 items in your sample by the 10-15 rule, or 50 + 8(4) = 82 items according to Green.
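Both rules of thumb are simple arithmetic; here’s a quick sketch (the function names are mine, not standard terminology):

```python
def green_min_n(num_terms):
    """Green's (1991) rule: at least 50 observations plus 8 per term."""
    return 50 + 8 * num_terms

def ten_to_fifteen_rule(num_terms):
    """The general 10-15 observations-per-term rule of thumb: (low, high)."""
    return 10 * num_terms, 15 * num_terms

# Compare the two rules for models of increasing size
for terms in (1, 2, 4, 6):
    print(terms, ten_to_fifteen_rule(terms), green_min_n(terms))
```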

Exceptions

There are exceptions to the “10-15” rule of thumb. They include:

  1. When there is multicollinearity in your data, or if the effect size is small. If that’s the case, you’ll need to include more terms (although there is, unfortunately, no rule of thumb for how many terms to add!).
  2. You may be able to get away with as few as 10 observations per predictor if you are using logistic regression or survival models, as long as you don’t have extreme event probabilities, small effect sizes, or predictor variables with truncated ranges. (Peduzzi et al.)

How to Detect and Avoid Overfitting

The easiest way to avoid overfitting is to increase your sample size by collecting more data. If you can’t do that, the second option is to reduce the number of predictors in your model — either by combining or eliminating them. Factor Analysis is one method you can use to identify related predictors that might be candidates for combining.

1. Cross-Validation

Use cross-validation to detect overfitting: it partitions your data, fits the model to one portion, and tests how well that model predicts the rest. One form of cross-validation is predicted R-squared. Most good statistical software will include this statistic, which is calculated by:

  • Removing one observation at a time from your data,
  • Estimating the regression equation for each iteration,
  • Using the regression equation to predict the removed observation.
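The three steps above amount to the classic leave-one-out PRESS statistic. A sketch in Python, using invented toy data:

```python
import numpy as np

# Toy data: roughly linear with small disturbances (invented for illustration)
x = np.arange(10, dtype=float)
y = 3.0 * x + 1.0 + np.array([0.2, -0.1, 0.3, -0.3, 0.1,
                              0.0, -0.2, 0.2, -0.1, 0.1])

press = 0.0
for i in range(len(x)):
    keep = np.arange(len(x)) != i                       # remove one observation
    slope, intercept = np.polyfit(x[keep], y[keep], 1)  # re-estimate the line
    pred = slope * x[i] + intercept                     # predict the removed point
    press += (y[i] - pred) ** 2                         # accumulate squared error

ss_tot = ((y - y.mean()) ** 2).sum()
pred_r2 = 1 - press / ss_tot        # predicted R-squared
print(pred_r2)  # close to 1 here, since the toy data is almost perfectly linear
```

A predicted R-squared much lower than the ordinary R-squared is a warning sign that the model is overfitted.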

Cross validation isn’t a magic cure for small data sets though, and sometimes a clear model isn’t identified even with an adequate sample size.

2. Shrinkage & Resampling

Shrinkage and resampling techniques can help you to find out how well your model might fit a new sample.

3. Automated Methods

Automated stepwise regression shouldn’t be used as an overfitting solution for small data sets. According to Babyak (2004),

“The problems with automated selection conducted in this very typical manner are so
numerous that it would be hard to catalogue all of them [in a journal article].”

Babyak also recommends avoiding univariate pretesting or screening (a “variation of automated selection in disguise”), dichotomizing continuous variables (which can dramatically increase Type I errors), and multiple testing of confounding variables (although the latter may be OK if used judiciously).

References

  1. Babyak, M.A. (2004). “What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models.” Psychosomatic Medicine 66(3):411–21.
  2. Green, S.B. (1991). “How many subjects does it take to do a regression analysis?” Multivariate Behavioral Research 26:499–510.
  3. Peduzzi, P.N., et al. (1995). “The importance of events per independent variable in multivariable analysis, II: accuracy and precision of regression estimates.” Journal of Clinical Epidemiology 48:1503–10.
  4. Peduzzi, P.N., et al. (1996). “A simulation study of the number of events per variable in logistic regression analysis.” Journal of Clinical Epidemiology 49:1373–9.


Check out our YouTube channel for hundreds of videos on elementary statistics, including regression analysis using a variety of tools like Excel and the TI-83.

More articles

  1. How to Construct a Scatter Plot.
  2. How to Calculate Pearson’s Correlation Coefficients.
  3. How to Compute a Linear Regression Test Value.
  4. Chow Test for Split Data Sets
  5. How to Find a Linear Regression Equation.
  6. How to Find a Regression Slope Intercept.
  7. How to Find a Linear Regression Slope.
  8. How to Find the Standard Error of Regression Slope.
  9. Validity Coefficient: What it is and how to find it.
  10. Quadratic Regression.
  11. Stepwise Regression

Technology

  1. Regression in Minitab


Definitions

  1. Assumptions and Conditions for Regression.
  2. Betas / Standardized Coefficients.
  3. What is a Beta Weight?
  4. Bilinear Regression
  5. The Breusch-Pagan-Godfrey Test
  6. What is the Correlation Coefficient Formula?
  7. Cook’s Distance.
  8. What is a Covariate?
  9. Detrend Data.
  10. Gauss-Newton Algorithm.
  11. What is the General Linear Model?
  12. What is the Generalized Linear Model?
  13. What is the Hausman Test?
  14. What is Homoscedasticity?
  15. What is an Instrumental Variable?
  16. Lasso Regression.
  17. What is a Linear Relationship?
  18. What is the Line of best fit?
  19. What is Logistic Regression?
  20. Model Misspecification.
  21. Multinomial Logistic Regression.
  22. What is Nonlinear Regression?
  23. Ordered Logit / Ordered Logistic Regression
  24. What is Ordinary Least Squares Regression
  25. Overfitting.
  26. Parsimonious Models.
  27. What is Pearson’s Correlation Coefficient?
  28. Poisson Regression.
  29. Probit Model.
  30. What is a Prediction Interval?
  31. What is Regularization?
  32. What are Relative Weights?
  33. What are Residual Plots?
  34. Reverse Causality.
  35. Ridge Regression
  36. Root Mean Square Error.
  37. Semiparametric models
  38. Simultaneity Bias.
  39. Simultaneous Equations Model.
  40. What is Spurious Correlation?
  41. Structural Equations Model
  42. What are Tolerance Intervals?
  43. Tuning Parameter
  44. What is Weighted Least Squares Regression?


If you prefer an online interactive environment to learn R and statistics, this free R Tutorial by Datacamp is a great way to get started. If you're somewhat comfortable with R and are interested in going deeper into Statistics, try this Statistics with R track.

Regression Analysis: Step by Step Articles, Videos, Simple Definitions was last modified: October 15th, 2017 by Andale

5 thoughts on “Regression Analysis: Step by Step Articles, Videos, Simple Definitions”

  1. Sekeli Maboshe

    PLEASE SHOW ME HOW TO DETREND THIS DATA AS (III) BELOW REQUIRES
    The total annual fertilizer consumption in thousands of tonnes during 1995-2001 in XYZ Province of Zambia was recorded as given in the table below.
    Year:        1995 1996 1997 1998 1999 2000 2001
    Consumption:   50   56   60   68   70   75   78
    (i) Fit a straight line trend by the method of least squares and compute the trend quantities.
    (ii) What has been the annual increase in fertiliser consumption?
    (iii) Eliminate the trend variation from the fertilizer consumption data.

  2. Andale Post author

    Hi Sekeli, just calculate the least squares regression line and then subtract each trend value from the corresponding data point. See here.
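    For what it’s worth, the whole exercise can be sketched in Python with the data from the question (a minimal least-squares fit, no stats library needed):

```python
# Detrend the fertilizer-consumption series by subtracting a least-squares trend
years = list(range(1995, 2002))
consumption = [50, 56, 60, 68, 70, 75, 78]   # thousands of tonnes

n = len(years)
x = [yr - years[0] for yr in years]          # code years as 0..6
mean_x = sum(x) / n
mean_y = sum(consumption) / n

# Least-squares slope and intercept for y = intercept + slope * x
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, consumption))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

trend = [intercept + slope * xi for xi in x]                 # (i) trend quantities
detrended = [yi - ti for yi, ti in zip(consumption, trend)]  # (iii) trend removed

print(f"annual increase ≈ {slope:.2f} thousand tonnes")      # (ii)
```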

  3. Sekeli Maboshe

    Hi Andale,
    I’m failing to understand the improvement that a three-variable linear regression analysis makes over the two-variable case. Please explain.

  4. Andale Post author

    If you have three predictor variables you should be running multiple regression to take into account all of the independent variables in your model. If you only choose two, you’re leaving out info.

  5. FM

    Really helpful website, thank you for simplifying “things”, well explained… makes all the difference!