Undersampling attempts to reduce the bias (error) associated with imbalanced classes of data. In machine learning, undersampling and oversampling are two techniques for dealing with class imbalance in a training set (the part of the data used to fit a model). You can undersample the majority class, oversample the minority class, or combine the two techniques.
In general, undersampling the majority class (rather than oversampling the minority class) works best for large data sets. That’s because oversampling adds data points, which can make the data set too large for classifiers such as support vector machines (García-Pedrajas, 2010).
With random undersampling, you randomly remove members of the majority class until you reach a preset threshold.
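The procedure just described can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name random_undersample, its parameters, and the toy data are assumptions made for this example, and the "preset threshold" is interpreted here as the number of majority-class members to keep.

```python
import random
from collections import Counter


def random_undersample(samples, labels, majority_label, target_count, seed=None):
    """Randomly drop majority-class members until only target_count remain.

    Hypothetical helper for illustration; minority-class members are never removed.
    """
    rng = random.Random(seed)
    # Positions of all majority-class members.
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    # Nothing to remove if the majority class is already at or below the threshold.
    if target_count >= len(majority_idx):
        return list(samples), list(labels)
    # Choose, uniformly at random and without replacement, which majority members survive.
    keep_majority = set(rng.sample(majority_idx, target_count))
    kept = [
        i for i, y in enumerate(labels)
        if y != majority_label or i in keep_majority
    ]
    return [samples[i] for i in kept], [labels[i] for i in kept]


# Toy data (assumed): 90 majority ("neg") and 10 minority ("pos") members.
samples = list(range(100))
labels = ["neg"] * 90 + ["pos"] * 10

X_res, y_res = random_undersample(samples, labels, "neg", target_count=10, seed=0)
print(Counter(y_res))  # class counts after undersampling
```

Fixing the seed makes the removal reproducible, which helps when comparing models trained on the same reduced set.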
One advantage to random selection here is that you don’t have to make decisions on which points are important and which are not: you simply let the random process do the work. Several studies have shown that random selection performs as well as, if not better than, processes where deliberate removal choices are made.
However, a distinct disadvantage is that the process could remove important members. Problems tend to arise in data that is non-smooth or that has boundaries or small features (Dey, n.d.). One way to avoid this pitfall is to combine undersampling and boosting (Liu et al., as cited in García-Pedrajas, 2010). You might also want to manually resample or repair any holes in the data algorithmically.
Dey, T. (n.d.). Undersampling and Oversampling in Sample Based Shape Modeling. Retrieved December 16, 2019 from: https://graphics.stanford.edu/courses/cs468-03-fall/Papers/deygiesen_undersampling.pdf
García-Pedrajas, N. et al. (2010). Trends in Applied Intelligent Systems: 23rd International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2010, Cordoba, Spain, June 1-4, 2010, Proceedings. Springer Science & Business Media.