Bootstrap Sample: Definition, Example

What is a Bootstrap Sample?

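A bootstrap sample is a sample that is “bootstrapped” (resampled) from an existing sample. Bootstrapping is a type of resampling in which a large number of samples of equal size are repeatedly drawn, with replacement, from a single original sample. Resamples are typically the same size as the original sample, although the simplified example below draws fewer items to keep things short.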

For example, let’s say your sample was made up of ten numbers: 49, 34, 21, 18, 10, 8, 6, 5, 2, 1. You randomly draw three numbers, returning each number to the pot before drawing the next: 5, 1, and 49. That is one resample. You then draw three numbers again in the same way, and repeat the process until you have B resamples. Usually, original samples are much larger than in this simple example, and B can reach into the thousands. A statistic (e.g. the mean) is calculated for each resample, and after a large number of iterations these bootstrap statistics are compiled into a bootstrap distribution. Because every number goes back into the pot, the same item can appear more than once within a resample and in many different resamples (e.g. 49 could appear in a dozen different resamples).
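The drawing procedure in the example above can be sketched in a few lines of Python using only the standard library. The function name draw_resamples, the fixed seed, and the choice of B = 1000 are illustrative assumptions, not part of the original example.

```python
import random

# The ten-number sample from the example above.
sample = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]

def draw_resamples(data, size, B, seed=0):
    """Return B resamples of `size` items, each drawn with replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    # random.choices samples WITH replacement, so values can repeat.
    return [rng.choices(data, k=size) for _ in range(B)]

# Three numbers per resample, as in the example; B = 1000 is arbitrary.
resamples = draw_resamples(sample, size=3, B=1000)
print(resamples[0])  # one resample of three numbers
```

Passing size=len(sample) instead of 3 gives resamples as large as the original sample, which is the usual choice in practice.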

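Bootstrapping is loosely based on the law of large numbers, which states that as a sample grows, its statistics (such as the sample mean) converge to the corresponding population values. By the same reasoning, a reasonably large sample should resemble the population it came from, so resampling from it mimics sampling from the population. This works, perhaps surprisingly, even when you’re using a single sample to generate the data.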

An empirical bootstrap sample is drawn from observations.
A parametric bootstrap sample is drawn from a parameterized distribution (e.g. a normal distribution).
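The difference between the two kinds of bootstrap sample can be illustrated with a short Python sketch. It assumes a normal model whose mean and standard deviation are estimated from the data; the data set, seed, and variable names are hypothetical choices made only for illustration.

```python
import random
import statistics

observations = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]
rng = random.Random(1)  # fixed seed (an assumption) for repeatability
n = len(observations)

# Empirical bootstrap: resample the observed values themselves.
empirical_resample = rng.choices(observations, k=n)

# Parametric bootstrap: estimate the parameters of a distribution from the
# data (a normal distribution is assumed here purely for illustration),
# then draw new values from that fitted distribution.
mu = statistics.mean(observations)
sigma = statistics.stdev(observations)
parametric_resample = [rng.gauss(mu, sigma) for _ in range(n)]

print(empirical_resample)
print(parametric_resample)
```

The empirical resample can only contain values that were actually observed, while the parametric resample can contain any value the assumed distribution allows.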

Why Resample?

Ideally, you would draw many large, separate samples from a population in order to create a sampling distribution for a statistic. However, you may be limited to one sample because of finances or time. This single sample can serve as a mini population, from which resamples are drawn with replacement over and over again. As well as saving time and money, the resulting bootstrap distribution can be quite a good approximation of the statistic’s sampling distribution.

Running the Procedure

Bootstrapping is usually performed with software (e.g. Stata or the R bootstrap package). The process generally follows three steps:

Resample the data set B times, with replacement.
Find a summary statistic (called a bootstrap statistic) for each of the B resamples.
Estimate the standard error of the statistic using the standard deviation of the bootstrap distribution.
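
The three steps above can be sketched in Python with the standard library alone. This is a minimal illustration rather than a replacement for a statistics package; the function name bootstrap_standard_error, the default of 2000 resamples, and the seed are assumptions.

```python
import random
import statistics

def bootstrap_standard_error(data, statistic, B=2000, seed=0):
    """Estimate the standard error of `statistic` by bootstrapping `data`."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    n = len(data)
    # Steps 1 and 2: draw B resamples of size n with replacement and
    # compute the statistic for each one.
    boot_stats = [statistic(rng.choices(data, k=n)) for _ in range(B)]
    # Step 3: the standard deviation of the bootstrap statistics is the
    # bootstrap estimate of the standard error.
    return statistics.stdev(boot_stats), boot_stats

data = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]
se, boot_stats = bootstrap_standard_error(data, statistics.mean)
print(round(se, 2))
```

Any function that maps a list of numbers to a single number (for example statistics.median) can be passed in place of statistics.mean.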

Notation

The number of bootstrap samples can be indicated with B (e.g. if you resample 10 times then B = 10).
A bootstrap sample is identified by “star” notation: x*₁, x*₂, …, x*ₙ. This is similar to the notation for sample data, which is traditionally denoted by x₁, x₂, …, xₙ.
A star next to a statistic, like s* or x̄*, indicates that the statistic was calculated from a resample. A bootstrap statistic is sometimes denoted with a T, where T*_b is the statistic T calculated from the b-th bootstrap sample (b = 1, 2, …, B).
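
Using this notation, the bootstrap estimate of the standard error can be written out as below. The symbol for the mean of the bootstrap statistics and the B − 1 divisor (the usual sample standard deviation) are conventions assumed here; some texts divide by B instead.

```latex
\bar{T}^{*} = \frac{1}{B}\sum_{b=1}^{B} T^{*}_{b},
\qquad
\widehat{\mathrm{se}}_{B} =
  \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(T^{*}_{b}-\bar{T}^{*}\right)^{2}}
```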

Bootstrap Percentile Method

The bootstrap percentile method is a way to calculate confidence intervals from a bootstrap distribution.

With the simple method, a certain percentage (e.g. 5% or 10%) is trimmed from the lower and upper ends of the bootstrap distribution of the statistic (e.g. the mean or standard deviation). How much you trim depends on the confidence level you’re looking for. For example, a 90% confidence interval calls for a 100% – 90% = 10% trim (i.e. 5% from each end). Or, put another (slightly more technical) way, you can get a 90% confidence interval by taking the 5% quantile as the lower bound and the 95% quantile as the upper bound of the B replications T*₁, T*₂, …, T*_B.
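
The trimming rule described above can be sketched in Python as follows. The function name percentile_interval, the default of 5000 resamples, the seed, and the nearest-rank rule used to pick the quantiles are all assumptions; statistical packages may use slightly different quantile conventions.

```python
import random
import statistics

def percentile_interval(data, statistic, confidence=0.90, B=5000, seed=0):
    """Bootstrap percentile confidence interval for `statistic`."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    n = len(data)
    # Build the bootstrap distribution and sort it so quantiles are positions.
    boot_stats = sorted(statistic(rng.choices(data, k=n)) for _ in range(B))
    alpha = 1.0 - confidence  # total share to trim, e.g. 10% for a 90% interval
    # Simple nearest-rank positions for the alpha/2 and 1 - alpha/2 quantiles.
    lower_index = round((alpha / 2) * (B - 1))
    upper_index = round((1 - alpha / 2) * (B - 1))
    return boot_stats[lower_index], boot_stats[upper_index]

data = [49, 34, 21, 18, 10, 8, 6, 5, 2, 1]
low, high = percentile_interval(data, statistics.mean)
print(low, high)  # approximate 90% interval for the mean
```

Changing confidence to 0.95 trims 2.5% from each end instead of 5%.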

A more complicated method is Efron’s BCa (bias-corrected and accelerated) method (see DiCiccio and Efron, 1996). As well as adjusting for bias, it also corrects for skewness in the bootstrap distribution. Other variants include Rubin’s Bayesian extension and DiCiccio and Efron’s ABC method.

With the percentile method, the range of the statistic that remains after trimming is the confidence interval for the population parameter of interest.

References:
DiCiccio, T.J. and Efron, B. (1996) Bootstrap confidence intervals. Statistical Science, 11, 189-228.
Efron, B. and Tibshirani, R. (1993) An Introduction to the Bootstrap. Chapman and Hall, New York, London.
Rubin, D. (1981) The Bayesian bootstrap. Annals of Statistics, 9, 130-134.