Descriptive Statistics > How to Choose Bin Sizes in Statistics
Watch the video or read the steps below:
Choose Bin size: Overview
(What is a bin?). There are a few general rules for choosing bins:
- Bins should be all the same size. For example, groups of ten or a hundred.
- Bins should include all of the data, even outliers. If your outliers fall way outside of your other data, consider lumping them in with your first or last bin. This creates a rough histogram — make sure you note where outliers are being included.
- Boundaries for bins should land at whole numbers whenever possible (this makes the chart easier to read).
- Choose between 5 and 20 bins. The larger the data set, the more likely you’ll want a large number of bins. For example, a set of 12 data pieces might warrant 5 bins but a set of 1000 numbers will probably be more useful with 20 bins. The exact number of bins is usually a judgment call.
- If at all possible, try to make your data set evenly divisible by the number of bins. For example, if you have 10 pieces of data, work with 5 bins instead of 6 or 7.
Choose Bin size: Steps
Step 1: Find the smallest and largest data point. If your smallest and/or largest numbers are not whole numbers, go to Step 2. If they are whole numbers, go to Step 3.
Step 2: Lower the minimum a little and raise the maximum a little. For example, 1.2 as a minimum becomes 1, and 99.9 as a maximum becomes 100.
Step 3: Decide how many bins you need using your best guess and using the guidelines listed in the intro paragraph above.
Step 4: Divide your range (the numbers in your data set) by the bin size you chose in Step 3. For example, if you have numbers that range from 0 to 50, and you chose 5 bins, your bin size is 50/5=10.
Step 5: Create the bin boundaries by starting with your smallest number (from Steps 1 and 2) and adding the bin size from Step 4. For example, if your smallest number is 0 and your bin size is 10 you would have bin boundaries of 0, 10, 20…
Tip: If you have a large data set, you may want to use Excel to find the smallest and largest point. Type your data into a single column and then use the “Sort” function or type =MIN(A:A) in a blank cell in a different column (i.e. column B) and then type =MAX(A:A) to get the biggest number.
Choose bin sizes with Sturge’s Rule
Sturge’s rule is another way to choose bin sizes. Although it’s widely used in statistical packages for making histograms, it has been criticized for over-smoothing of histograms (Hyndman, 1995). Therefore it should probably be considered a “Rule of Thumb” rather than an absolute formula with the perfect solution.
The formula is:
K = number of class intervals (bins).
N = number of observations in the set.
log = logarithm of the number.
For 10 observations in the set, the number of class intervals is:
K = 1 + 3.322 log(10) = 4.322 ≅ 4
For 55 observations in the set, the number of class intervals is:
K = 1 + 3.322 log(55) = 6.781 ≅ 7
When to Use Sturge’s Rule
Sturge’s rule works best for continuous data that is normally distributed and symmetrical. It helps us to convert this data into discrete, symmetric, binomial classes. As long as your data is not skewed, using Sturge’s rule should give you a nice-looking, easy to read histogram that represents the data well.
What Sturge’s rule is not much good for is severely skewed, non symmetric data sets, or for an extremely large number of observations. Here you’ll want to use one of the many available alternatives: either Doane’s Rule, Scott’s Rule, Rice’s Rule, or Freedman and Diaconis’s (1981) rule.
Advanced Formulas to Choose Bin Sizes
These alternate formulas are not what you can expected to find in an elementary statistics class; they are not used very often.
This modified version of Sturge’s rule which may also lead to over-smoothing:
Scott’s rule is based on the standard deviation(σ) of the data. The formula is: 3.49σn−1/3.
Rice’s rule is defined as: (cube root of the number of observations) * 2.
For 216 observations, the Rice rule equals 12 (the cubed root of 216 is 6; 6 * 2 = 12).
This formula uses the interquartile range (IQR):
Doane, D.P. Aesthetic frequency classification. American Statistician, 30, 181– 183. Retrieved December 13, 2017 from http://www.jstor.org/stable/2683757 December 13, 2017
Legg et. al (2013). Improving Accuracy and Efficiency of Mutual Information for Multi-modal Retinal Image Registration using Adaptive Probability Density Estimation.
Hyndman, R. (1995). The problem with Sturges’ rule for constructing histograms. Retrieved December 13, 2017 from: https://robjhyndman.com/papers/sturges.pdf
If you prefer an online interactive environment to learn R and statistics, this free R Tutorial by Datacamp is a great way to get started. If you're are somewhat comfortable with R and are interested in going deeper into Statistics, try this Statistics with R track.Comments are now closed for this post. Need help or want to post a correction? Please post a comment on our Facebook page and I'll do my best to help!