## Post-Hoc Tests

Post-hoc (Latin, meaning “after this”) tests are run after the fact to analyze the results of your experimental data. They are often based on a familywise error rate: the probability of at least one Type I error in a set (family) of comparisons. The most common post-hoc tests are:

- Bonferroni Procedure
- Duncan’s new multiple range test (MRT)
- Dunn’s Multiple Comparison Test
- Fisher’s Least Significant Difference (LSD)
- Holm-Bonferroni Procedure
- Newman-Keuls
- Rodger’s Method
- Scheffé’s Method
- Tukey’s Test (see also: Studentized Range Distribution)
- Dunnett’s correction
- Benjamini-Hochberg (BH) procedure

Bonferroni Procedure (Bonferroni Correction)

This multiple-comparison post-hoc correction is used when you are performing many independent or dependent statistical tests at the same time. The problem with running many simultaneous tests is that the probability of getting at least one significant result by chance alone increases with each test run. This post-hoc test sets the significance cutoff at α/n, where n is the number of tests. For example, if you are running 20 simultaneous tests at α = 0.05, the corrected cutoff would be 0.05/20 = 0.0025. The Bonferroni correction does suffer from a loss of power. Because each individual test is held to a much stricter cutoff, Type II error rates are high for each test. In other words, it overcorrects for Type I errors.

Holm-Bonferroni Method

The ordinary Bonferroni method is sometimes viewed as too conservative. Holm’s sequential Bonferroni post-hoc test is a less strict correction for multiple comparisons. See: Holm-Bonferroni method for a step-by-step example.
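To make the "less strict" idea concrete, here is a minimal Python sketch of the usual step-down form of Holm's procedure. The function name `holm_bonferroni` and the example p-values are made up for illustration and are not taken from this article. The sketch assumes the common formulation in which the smallest p-value is compared with α/m, the next with α/(m − 1), and so on, stopping at the first failure.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected.

    Sketch of Holm's step-down procedure (illustrative, not a reference
    implementation): sort the p-values, compare the k-th smallest with
    alpha / (m - k + 1), and stop at the first one that fails.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank = 0 for the smallest p
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also left unrejected
    return reject


# Hypothetical p-values from four simultaneous tests.
example = [0.001, 0.015, 0.02, 0.04]
print(holm_bonferroni(example))            # Holm's step-down decisions
print([p < 0.05 / 4 for p in example])     # plain Bonferroni, for comparison
```

With these made-up numbers, the plain Bonferroni cutoff of 0.05/4 = 0.0125 rejects only the first test, while the step-down thresholds let the remaining three through as well.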

Duncan’s new multiple range test (MRT)

When you run Analysis of Variance (ANOVA), the results will tell you if there is a difference in means. However, it won’t pinpoint the pairs of means that are different. Duncan’s Multiple Range Test will identify the pairs of means (from at least three) that differ. The MRT is similar to the LSD, but instead of a t-value, a Q Value is used.

Fisher’s Least Significant Difference (LSD)

A tool to identify which pairs of means are statistically different. Essentially the same as Duncan’s MRT, but with t-values instead of Q values. **See**: Fisher’s Least Significant Difference.

Newman-Keuls

Like Tukey’s, this post-hoc test identifies sample means that are different from each other. Unlike Tukey’s, Newman-Keuls uses different critical values for comparing different pairs of means. Therefore, it is more likely than Tukey’s to find significant differences.

Rodger’s Method

Considered by some to be the most powerful post-hoc test for detecting differences among groups. This test protects against loss of statistical power as the degrees of freedom increase.

Scheffé’s Method

Used when you want to look at post-hoc comparisons in general (as opposed to just pairwise comparisons). Scheffé’s method controls the overall confidence level. It is customarily used with unequal sample sizes.

See: The Scheffé Test.

Tukey’s Test

The purpose of Tukey’s test is to figure out which groups in your sample differ. It uses the “Honest Significant Difference,” a number that represents the distance between groups, to compare every mean with every other mean.

Dunnett’s correction

Like Tukey’s, this post-hoc test is used to compare means. Unlike Tukey’s, it compares every mean to a control mean. For calculation steps, see: Dunnett’s Test.

Benjamini-Hochberg (BH) procedure

If you perform a very large number of tests, one or more of them is likely to produce a significant result purely by chance. This post-hoc procedure accounts for that by controlling the false discovery rate. For more details, including how to run the procedure, see: Benjamini-Hochberg Procedure.
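As a rough illustration, the step-up rule usually given for the BH procedure can be sketched in a few lines of Python. The function name `benjamini_hochberg`, the parameter `q` (the chosen false discovery rate), and the example p-values are assumptions made for this sketch, not something taken from this article.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null is rejected.

    Sketch of the BH step-up rule (illustrative only): rank the p-values
    from smallest to largest, find the largest rank k whose p-value is at
    most (k / m) * q, and reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank  # keep the largest rank that passes
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values from eight simultaneous tests.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(example, q=0.05))
```

Note that the rule looks for the largest passing rank rather than stopping at the first failure, so a small p-value that misses its own threshold can still be rejected if a larger one passes further down the list.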

## More on the Bonferroni Correction

The Bonferroni correction is used to limit the possibility of getting a statistically significant result by chance when testing multiple hypotheses. It’s needed because the more tests you run, the more likely you are to get a significant result by chance alone. The correction shrinks the region where you can reject the null hypothesis. In other words, it lowers the alpha level that each p-value is compared against.

Imagine looking for the Ace of Clubs in a deck of cards: if you pull one card from the deck, the odds are pretty low (1/52) that you’ll get the Ace of Clubs. Try again (and again, perhaps 50 times) and you’ll probably end up getting the Ace. The same principle works with hypothesis testing: the more simultaneous tests you run, the more likely you’ll get a “significant” result. Let’s say you were running 50 independent tests simultaneously with an alpha level of 0.05. The probability of observing at least one significant event due to chance alone is:

P(significant event) = 1 – P(no significant event)

= 1 – (1 – 0.05)^50 ≈ 0.92.

That means you’re almost certain (a 92% chance) to get at least one significant result.
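The same arithmetic can be checked for other numbers of tests with a short Python sketch. The function name `familywise_error_rate` is made up for illustration, and the formula assumes the tests are independent, as in the calculation above.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n independent tests.

    Uses 1 - (1 - alpha)^n, which only holds if the tests are independent
    (an assumption of this sketch).
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# How quickly the chance of a spurious "significant" result grows.
for n in (1, 5, 20, 50):
    print(n, round(familywise_error_rate(0.05, n), 2))
```

For 50 tests at an alpha level of 0.05 this gives the 0.92 worked out above.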

## How to Calculate the Bonferroni Correction

The calculation is actually very simple: it’s just the alpha level (α) divided by the number of tests you’re running.

**Sample question:** A researcher is testing 25 different hypotheses at the same time, using an alpha level of 0.05. What is the Bonferroni correction?

**Answer:**

Bonferroni correction is α/n = .05/25 = .002

For this set of 25 tests, you would reject the null only if your p-value was smaller than .002.
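If you would rather let a computer do the division, a minimal Python sketch might look like this. The function names `bonferroni_threshold` and `bonferroni_reject` and the example p-values are hypothetical, chosen only for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Corrected significance cutoff: alpha divided by the number of tests."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """True for each p-value that falls below the corrected cutoff."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# The sample question: 25 tests at alpha = 0.05.
print(bonferroni_threshold(0.05, 25))

# Three made-up p-values, each compared with 0.05 / 3.
print(bonferroni_reject([0.001, 0.01, 0.04]))
```

The second call shows why the correction matters: a p-value of 0.04 would pass an uncorrected 0.05 cutoff but not the corrected one.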

## The Bonferroni Correction and Medical Testing

Matthew A. Napierala, MD points out how multiple tests affect physicians (and patients) in an article for the American Academy of Orthopaedic Surgeons. “In contemporary orthopaedic research studies, numerous simultaneous tests are routinely performed.” This means that given enough tests, one of them is bound to come back as a false positive. Definitely *not* a good thing when we’re talking about health issues.

Post hoc comparisons need to correct for some types of error rates, but not others. What are some examples?

Rudy, I’m updating a few of the tests with your answer. For example, the Bonferroni overcorrects for

Type I error and in general post hocs use familywise error rates. If that doesn’t answer your question, could you please be more specific about what test you need info on, or is it about post hoc tests in general?

I’m running a single one-way repeated measures ANOVA with 9 conditions total. Would n=72 in this case? Is that correct? If so, that would put my p-value at 0.00069 which seems way too conservative.

Kelsey,

N is the sample size. It doesn’t have anything to do with the number of conditions.

Hello, good summary on post hoc tests. If one was doing cluster analysis with say 2 groups, what tests would you recommend to test for differences in means between cluster groups generated from cluster analysis?

I wouldn’t recommend anything. It’s not a good idea to use the same data to make clusters and then try to perform significance tests on those clusters. You need random data for hypothesis tests, and clustering gives you just about the opposite of random!

Hi,

Even after transformation, I still have non-normal distributions, so I guess I should use non-parametric tests such as Kruskal-Wallis (I have 3 groups and several variables to compare). Is there a correction for multiple comparisons that is more appropriate for non-parametric tests? I was thinking of using Holm-Bonferroni (although I haven’t figured out yet how to do that in SPSS).

Thank you!

You’re on the right track. Bonferroni is your only option for nonparametric tests.

Hello, I ran the Bonferroni correction for three samples and all were significant. What else can I do to locate where the difference lies among the samples?

It depends on your data. If you have means, Fisher’s LSD can identify which pairs of means are statistically different.

Good day. I want the best statistical analysis for 4 different groups of rats: 1. n=5; 2. n=5; 3. n=4; 4. n=5. I have used one-way ANOVA, but what is the best post-hoc comparison test to use? I want to compare each of the groups with each other.

Thank you

Yes, I have a mean for each of the groups. For example: Group 1 (control) mean: 1.93+/-0.12 (n=5); group 2 mean: 1.72+/-0.03 (n=5); group 3 mean: 2.01+/-0.11 (n=4); group 4 mean: 1.99+/-0.09 (n=5).

I want to compare group 1 with groups 2,3,4; 2 with 1,3,4; 3 with 1,2,4 and 4 with 1,2,3.

Hello, Sola. Sorry, but for ethical reasons (see Experiments on Animals), I am going to have to decline to answer this question. Please consider more humane foundations for your research. Regards, Stephanie.

Hello, I discovered that using different post hoc tests gives different values of significance. Which post hoc test is more appropriate for analysing the difference between a control group and other groups: Dunnett or Duncan? Or better still, when am I supposed to use Dunnett and not Duncan? Thank you

Use Dunnett if you have a control mean.

Hello, in a scenario where the hypothesis is “there will be no differences in X between the three groups” is it more appropriate to use a post-hoc than run a one-way anova with contrasts? (there will obviously be three contrasts in that case, 1 -1 0, 0 -1 1, 1 0 -1). And would you ever choose to do a post-hoc in a planned comparison when you have hypothesised a difference (directional or not)? Thank you so much in advance!

Hi Stephanie,

just a remark on animal studies: they are actually the most humane way to test drugs before giving them to humans without risking their lives. Would you give a drug to your children without testing them on animals before? Researchers know that animal studies are not 100% reliable, but going from in vitro to human without animal testing would be even less ethical. Regards, Stephane

“…: they are actually the most humane way to test drugs before giving them to humans without risking their lives”.

Humane to humans, yes.

” Would you give a drug to your children without testing them on animals before?” Yes, I would. Animal tests are no gauge of how reliably a drug performs in humans. I trust mathematics (i.e. computer simulations) more than I would trust how the drug works in a mouse.