
## What is Winsorization?

Winsorization is a way to minimize the influence of outliers in your data by either:

- Assigning the outlier a lower weight, or
- Changing the value so that it is close to other values in the set.

Note that the data points are modified, not trimmed/removed (as in the trimmed mean). The Winsorize technique was first introduced by Dixon, who attributed it to Charles P. Winsor.

Statistics such as the mean and variance are very susceptible to outliers; Winsorization can be an effective way to deal with this problem, improve statistical efficiency and increase the robustness of statistical inferences.

The downside is that bias is introduced into your results, although the bias is a lot less than if you had simply deleted the data point. The alternative is to keep the data point as-is, but that may not be the best choice, as it could dramatically skew your results. Either way, you should always have a good justification for Winsorizing your data; never run the procedure arbitrarily in the hope of getting more significant results.

## A Basic Method to Winsorize by Hand

1. **Analyze your data** to make sure the outlier isn't a result of measurement error or some other fixable error.
2. **Decide how much Winsorization you want.** This is specified as the total percentage of *untouched* data. For example, if you want to Winsorize the top 5% and bottom 5% of data points, this is equal to 100% – 5% – 5% = 90% Winsorization. An 80% Winsorization means that 10% is modified in each tail (see Tips on Cut-Off Point Selection below).
3. Replace the extreme values with the maximum and/or minimum values at the threshold. For example:
   - The following data set has several (bolded) extremes:

     {**0.1, 1**, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, **99, 125**}

     Mean = 33.405.
   - After modifying the top and bottom 10% (I set those values to the nearest untouched value, 12 in the lower tail and 44 in the upper tail):

     {**12, 12**, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, **44, 44**}

     80% Winsorized mean = 27.75.
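The by-hand steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `winsorize` and its parameters `lower_frac` and `upper_frac` are hypothetical, and truncating the tail count to a whole number of points is an assumption the text does not specify.

```python
from statistics import mean

def winsorize(values, lower_frac=0.10, upper_frac=0.10):
    """Return a sorted copy with each tail replaced by the nearest untouched value."""
    data = sorted(values)
    n = len(data)
    k_low = int(n * lower_frac)    # points to replace in the lower tail (truncated; an assumption)
    k_high = int(n * upper_frac)   # points to replace in the upper tail
    if k_low + k_high >= n:
        raise ValueError("fractions leave no untouched data")
    low_value = data[k_low]             # smallest untouched value
    high_value = data[n - k_high - 1]   # largest untouched value
    # Clamp every point into [low_value, high_value]; values are modified, not removed.
    return [min(max(x, low_value), high_value) for x in data]

data = [0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26,
        29, 32, 33, 35, 39, 40, 41, 44, 99, 125]
winsorized = winsorize(data)  # 80% Winsorization: 10% modified in each tail

print(round(mean(data), 3))   # 33.405
print(mean(winsorized))       # 27.75
```

Note that the list keeps all 20 points, which is what separates Winsorizing from trimming.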

You *could* choose to add a little more to the larger/smaller values to account for their weights. For example, the values 99 and 125 were modified, but 125 is approximately 125% of 99 (that is, about 25% larger). Therefore, instead of replacing those values with 44 and 44, I could replace them with 44 and 55 (because 125% × 44 = 55).
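The proportional adjustment can be written out as a short calculation. This is only a sketch of one way to read the idea: the variable names are hypothetical, and it uses the exact ratio 125/99 rather than the rounded 125%, so the second replacement comes out slightly above 55.

```python
kept_max = 44               # largest untouched value in the example data set
upper_outliers = [99, 125]  # the two values being replaced in the upper tail

base = min(upper_outliers)  # smallest outlier maps exactly onto kept_max
# Scale each replacement by the outlier's size relative to the smallest outlier.
scaled = [kept_max * (x / base) for x in upper_outliers]

print([round(v, 2) for v in scaled])  # [44.0, 55.56]
```

The replaced values keep their original ordering and relative spacing while still sitting close to the rest of the data.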

## Tips on Cut-Off Point Selection

A poor choice in Step 2 above can result in estimators with inflated mean squared errors (MSE). A few suggestions for cut-off point choice and avoiding this problem:

- Compare the MSE from modified and unmodified results. If a classical estimator (like the arithmetic mean) has a much smaller MSE, this may indicate a poor cut-off point choice.
  **Note**: it stands to reason that you should probably choose the cut-off point that minimizes the MSE compared to the classical estimator, but in practice this is very difficult to do.
- If in doubt, refer to published literature to see if your data type (e.g. cholesterol levels, intelligence, rock minerals or something else) is commonly Winsorized and what percentage is usually used in your particular field.
- Do not set your cut-off point before collecting your data. Wait until you actually have the data in front of you before making your choice.
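With real data the true mean is unknown, so the MSE comparison in the first tip usually has to be approximated. One way to get a feel for it is a small simulation, sketched below under loud assumptions: the data model (standard normal values with a 10% chance of a wide-variance outlier, true mean 0), the sample size, and the function names are all hypothetical choices for illustration, not something the text prescribes.

```python
import random
from statistics import mean

def winsorized_mean(values, frac):
    """Mean after clamping `frac` of the points in each tail to the nearest untouched value."""
    data = sorted(values)
    n = len(data)
    k = int(n * frac)                    # points modified per tail (truncated)
    low, high = data[k], data[n - k - 1]
    return mean([min(max(x, low), high) for x in data])

def simulate_mse(frac, n=20, trials=2000, seed=0):
    """Estimate the MSE of the plain mean and the Winsorized mean by simulation."""
    rng = random.Random(seed)
    true_mean = 0.0                      # known only because the data are simulated
    err_plain = err_wins = 0.0
    for _ in range(trials):
        # Assumed model: N(0, 1) values, with a 10% chance of an N(0, 10) outlier.
        sample = [rng.gauss(0, 10) if rng.random() < 0.10 else rng.gauss(0, 1)
                  for _ in range(n)]
        err_plain += (mean(sample) - true_mean) ** 2
        err_wins += (winsorized_mean(sample, frac) - true_mean) ** 2
    return err_plain / trials, err_wins / trials

mse_plain, mse_wins = simulate_mse(frac=0.10)
print(f"plain mean MSE: {mse_plain:.3f}, 80% Winsorized mean MSE: {mse_wins:.3f}")
```

Under this symmetric, outlier-heavy model the Winsorized mean should come out with the smaller MSE. Rerunning with different `frac` values, or with a skewed data model, shows how the comparison can shift with the cut-off point.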

**Reference**:

W. J. Dixon (1960). Simplified Estimation from Censored Normal Samples, The Annals of Mathematical Statistics, 31, 385–391.
