Statistics How To

Undersampling and Oversampling in Data Analysis

Undersampling attempts to reduce the bias associated with imbalanced classes of data. In machine learning, undersampling and oversampling are two techniques for dealing with imbalances in a training set (the part of the data used to fit a model). You can undersample the majority class, oversample the minority class, or combine the two techniques.

In general, undersampling the majority class (rather than oversampling the minority class) works best for large data sets. That’s because oversampling adds data points, which can make the data set too large for classifiers like support vector machines (García-Pedrajas, 2010).
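To make the size argument concrete, here is a minimal Python sketch that compares how large a training set becomes under each approach. It assumes both techniques are run until every class has the same number of members, which is only one possible stopping rule; the function name `balanced_sizes` and the 990/10 class split are made up for illustration.

```python
from collections import Counter

def balanced_sizes(labels):
    """Return (oversampled_size, undersampled_size) for a fully balanced set.

    Assumption: oversampling grows every class to the size of the largest
    class, and undersampling shrinks every class to the size of the smallest.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    oversampled = n_classes * max(counts.values())
    undersampled = n_classes * min(counts.values())
    return oversampled, undersampled

# Hypothetical imbalanced training set: 990 majority rows, 10 minority rows.
labels = ["majority"] * 990 + ["minority"] * 10
over_n, under_n = balanced_sizes(labels)
print(over_n, under_n)  # 1980 20
```

With these made-up counts, oversampling nearly doubles the 1,000 original rows, while undersampling leaves only 20.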


Random Undersampling

With random undersampling, you randomly remove members of the majority class until you reach a preset threshold.
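The procedure can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not a reference implementation: the "preset threshold" is taken to mean a target number of majority-class rows to keep, and the names `random_undersample`, `majority_label`, and `target_size` are hypothetical.

```python
import random
from collections import Counter

def random_undersample(rows, labels, majority_label, target_size, seed=0):
    """Randomly drop majority-class rows until only target_size of them remain.

    Assumption: the threshold is a count of majority rows to keep.
    Rows of every other class are left untouched.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    if target_size >= len(majority_idx):
        # Already at or below the threshold: nothing to remove.
        return list(rows), list(labels)
    keep_majority = set(rng.sample(majority_idx, target_size))
    keep = [i for i, y in enumerate(labels)
            if y != majority_label or i in keep_majority]
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Hypothetical data: 90 majority rows (label 0) and 10 minority rows (label 1).
rows = list(range(100))
labels = [0] * 90 + [1] * 10
new_rows, new_labels = random_undersample(rows, labels, majority_label=0, target_size=10)
print(Counter(new_labels))  # 10 rows of each class remain
```

Sampling without replacement (`rng.sample`) means each surviving majority row is an original observation, and the original row order is preserved.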

One advantage to random selection here is that you don’t have to make decisions on which points are important and which are not: you simply let the random process do the work. Several studies have shown that random selection performs as well as, if not better than, processes where deliberate removal choices are made.

However, a distinct disadvantage is that the process could remove important members. Problems tend to arise in data that is non-smooth or that has boundaries or small features (Dey, n.d.). One way to avoid this pitfall is to combine undersampling with boosting (Liu et al., as cited in García-Pedrajas, 2010). You might also want to resample manually or repair any holes in the data algorithmically.

References

Dey, T. (n.d.). Undersampling and Oversampling in Sample Based Shape Modeling. Retrieved December 16, 2019 from: https://graphics.stanford.edu/courses/cs468-03-fall/Papers/deygiesen_undersampling.pdf
García-Pedrajas, N. et al. (2010). Trends in Applied Intelligent Systems: 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2010, Cordoba, Spain, June 1-4, 2010, Proceedings. Springer Science & Business Media.

CITE THIS AS:
Stephanie Glen. "Undersampling and Oversampling in Data Analysis" From StatisticsHowTo.com: Elementary Statistics for the rest of us! https://www.statisticshowto.com/undersampling/