Reservoir Sampling

Reservoir sampling is a quota-based random sampling method, used to get a particular sample size when you don’t know the population size (i.e. when you’re dealing with a data stream of unknown length). It can also be used to create a sample for very large data sets.

It’s called reservoir sampling because the selected items are placed into a reservoir (i.e. a holding set). As each stream-tuple is received, the algorithm updates dynamically. The reservoir can be updated with replacement, or without replacement.

Originally developed for one-pass processing from magnetic tapes (Andrade et al. 2014), reservoir sampling is now used for one pass stream processing in data mining.

Reservoir Sampling Without Replacement

A reservoir sample without replacement is one where every distinct element has an equal probability of being selected:

Where:

• n = population size.
• m = a distinct element.

As the sampling is done without replacement, each element in the set is distinct (i.e. only selected once).

Reservoir Sampling With Replacement

Reservoir sampling with replacement means that every element has the possibility of being chosen for the reservoir more than once. It should guarantee that every element in the sample has an equal chance (1/n) of being placed in a certain position in the sample, no matter what elements are in the other positions. Formally, this is written as:

P(𝒥 – {i1, i2, …, im,}) = 1/nm

References

Andrade, H. et al. (2014). Fundamentals of Stream Processing: Application Design, Systems, and Analytics. Cambridge University Press.
Park, B. et al. Reservoir-Based Random Sampling with Replacement from Data Stream. (1987). In Proceedings of the Fourth SIAM International Conference on Data Mining (Proceedings in Applied Mathematics) 4th ed. Edition. Society for Industrial and Applied Mathematics. pp. 492-496.
Vitter, J. (1985). Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11, No. 1, March.
Steele, P. & Pallone, S. (2017). Reservoir Sampling. Retrieved January 6, 2021 from: https://people.orie.cornell.edu/snp32/orie_6125/algorithms/reservoir-sampling.html