Kolmogorov-Smirnov Test
- What is the Kolmogorov-Smirnov Test?
- How to run the test by hand
- Using software
- K-S Test P-Value Table
- Advantages and Disadvantages
The Kolmogorov-Smirnov Goodness of Fit Test (K-S test) compares your data with a known distribution and lets you know if they have the same distribution. Although the test is nonparametric (it doesn’t assume any particular underlying distribution), it is commonly used as a test for normality to see if your data is normally distributed. It’s also used to check the assumption of normality in Analysis of Variance.
The Lilliefors test, a corrected version of the K-S test for normality, generally gives a more accurate approximation of the test statistic’s distribution when the mean and variance are estimated from the sample. In fact, many statistical packages (like SPSS) combine the two tests as a “Lilliefors corrected” K-S test.
Note: If you’ve never compared an experimental distribution to a hypothetical distribution before, you may want to read the empirical distribution article first. It’s a short article, and includes an example where you compare two data sets simply, using a scatter plot instead of a hypothesis test.
The hypotheses for the test are:
- Null hypothesis (H0): the data comes from the specified distribution.
- Alternate hypothesis (H1): the data does not come from the specified distribution.
H0: P = P0, H1: P ≠ P0.
Where P is the distribution of your sample (i.e. the EDF) and P0 is a specified distribution.
The general steps to run the test are:
- Create an EDF for your sample data (see Empirical Distribution Function for steps),
- Specify a parent distribution (i.e. one that you want to compare your EDF to),
- Graph the two distributions together.
- Measure the greatest vertical distance between the two graphs.
- Calculate the test statistic.
- Find the critical value in the K-S table.
- Compare the test statistic to the critical value.
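The first few steps above can be sketched in code. This is a minimal illustration rather than a reference implementation: the sample values are made up, the parent distribution is assumed to be a standard normal, and the helper names `normal_cdf` and `ks_statistic` are hypothetical.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution with fully specified parameters."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest vertical distance between the sample's EDF and the given cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf(x)
        # The EDF jumps from (i-1)/n to i/n at x, so check both sides of the step.
        d = max(d, i / n - f0, f0 - (i - 1) / n)
    return d

# Made-up data, for illustration only.
sample = [-1.2, -0.6, -0.1, 0.3, 0.4, 0.9, 1.5, 2.1]
d = ks_statistic(sample, normal_cdf)
print(f"D = {d:.4f}")
```

The resulting D is what you would then compare against the critical value from the K-S table.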
Calculating the Test Statistic
The K-S test statistic measures the largest distance between the EDF Fdata(x) and the theoretical function F0(x), measured in a vertical direction (Kolmogorov as cited in Stephens 1992). The test statistic is given by:
$$D = \sup_x \left| F_0(x) - F_{data}(x) \right|$$
Where (for a two-tailed test):
- F0(x) = the cdf of the hypothesized distribution,
- Fdata(x) = the empirical distribution function of your observed data.
For a one-tailed test, omit the absolute values from the formula.
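Because the EDF is a step function, the largest distance can only occur at one of the observed data points. Assuming the sample is sorted as x(1) ≤ x(2) ≤ ... ≤ x(n) (this notation is introduced here for illustration), the one-sided and two-sided statistics can be written in a form that is easier to compute:

```latex
D^{+} = \max_{1 \le i \le n} \left( \frac{i}{n} - F_0\!\left(x_{(i)}\right) \right),
\qquad
D^{-} = \max_{1 \le i \le n} \left( F_0\!\left(x_{(i)}\right) - \frac{i-1}{n} \right),
\qquad
D = \max\left(D^{+}, D^{-}\right)
```

Here D+ measures how far the EDF rises above F0(x), D- measures how far it falls below, and the two-tailed D is the larger of the two.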
How to run the test by hand
Step 1: Find the EDF. In the EDF article, I generated an EDF using Excel that I’ll use for this example.
Step 2: Specify the parent distribution. In the same article, I also calculated the corresponding values for the gamma distribution.
Step 3: Graph the functions together. A snapshot of the scatter graph looked like this:
Step 4: Measure the greatest vertical distance. Let’s assume that I graphed the entire sample and the largest vertical distance separating my two graphs is .04 (in the yellow highlighted box).
Step 5: Find the critical value in the K-S table. For this example, the table value is .190.
Step 6: Compare the results from Step 4 and Step 5. Since .04 is less than .190, the null hypothesis (that the distributions are the same) is not rejected.
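If a K-S table isn't handy, a common large-sample approximation for the two-tailed critical value is sqrt(-ln(α/2)/2) / sqrt(n). The sketch below applies it to the comparison in Step 6. The sample size of 50 and the alpha level of .05 are assumptions made for illustration (the example above does not state them), and the function name `ks_critical_value` is hypothetical.

```python
import math

def ks_critical_value(alpha, n):
    """Large-sample approximation to the two-sided K-S critical value."""
    return math.sqrt(-0.5 * math.log(alpha / 2.0)) / math.sqrt(n)

d_observed = 0.04  # largest vertical distance from Step 4
n = 50             # assumed sample size (not stated in the example)
alpha = 0.05       # assumed significance level (not stated in the example)

critical = ks_critical_value(alpha, n)
reject = d_observed > critical
print(f"critical value ~ {critical:.3f}, reject H0: {reject}")
```

Under these assumptions the approximate critical value lands close to the tabled value used in the example, and the conclusion is the same. For small samples, an exact table is the safer choice.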
Using software
Most software packages can run this test.
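As one example, Python's SciPy library provides `scipy.stats.kstest`. The sketch below tests an artificial sample against a standard normal distribution. The data are evenly spaced normal quantiles generated purely for illustration, so they fit the hypothesized distribution unusually well.

```python
from statistics import NormalDist
from scipy import stats

n = 20
# Artificial sample: evenly spaced standard-normal quantiles (illustration only).
data = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# The parameters (mean 0, sd 1) are specified in advance, not estimated from the data.
result = stats.kstest(data, "norm", args=(0.0, 1.0))
print(f"D = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
```

A large p-value means there is no evidence against the hypothesized distribution. A p-value below your chosen alpha level would lead you to reject the null hypothesis.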
Advantages and Disadvantages
Advantages of the K-S test include:
- The test is distribution free. That means you don’t have to know the underlying population distribution for your data before running this test.
- The D statistic (not to be confused with Cohen’s D) used for the test is easy to calculate.
- It can be used as a goodness of fit test following regression analysis.
- There are no restrictions on sample size; small samples are acceptable.
- Tables are readily available.
Although the K-S test has many advantages, it also has a few limitations:
- In order for the test to work, you must fully specify the location, scale, and shape parameters of the hypothesized distribution in advance. If these parameters are estimated from the same data, the standard critical values no longer apply and the test is invalid. If you don’t know these parameters, you may want to run a less formal test (like the one outlined in the empirical distribution function article).
- It generally can’t be used for discrete distributions, especially if you are using software (most software packages don’t have the necessary extensions for a discrete K-S test, and the manual calculations are convoluted).
- Sensitivity is higher at the center of the distribution and lower at the tails.
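To illustrate the first limitation: when the mean and standard deviation of a normal distribution are estimated from the sample, one workaround (in the spirit of the Lilliefors correction) is to simulate the distribution of D under the same fitting procedure. The following is a rough Monte Carlo sketch, not the method any particular package uses. The function names, the number of simulations, and the seed are all assumptions.

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic_fitted(sample):
    """K-S distance to a normal whose mean and sd are estimated from the sample."""
    mu = statistics.fmean(sample)
    sigma = statistics.stdev(sample)
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = normal_cdf(x, mu, sigma)
        d = max(d, i / n - f0, f0 - (i - 1) / n)
    return d

def monte_carlo_p_value(sample, n_sim=2000, seed=0):
    """Approximate p-value for normality when parameters are estimated."""
    rng = random.Random(seed)
    d_obs = ks_statistic_fitted(sample)
    n = len(sample)
    exceed = 0
    for _ in range(n_sim):
        # Simulate normal data and refit, mirroring what was done to the real sample.
        sim = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if ks_statistic_fitted(sim) >= d_obs:
            exceed += 1
    return (exceed + 1) / (n_sim + 1)
```

Simulating from a standard normal is enough here because refitting the mean and standard deviation makes D unaffected by the true location and scale.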
References
Chakravarti, Laha, and Roy (1967). Handbook of Methods of Applied Statistics, Volume I. John Wiley and Sons, pp. 392-394.
Ruppert, D. (2004). Statistics and Finance: An Introduction. Springer Science and Business Media.
Stephens, M.A. (1992). Introduction to Kolmogorov (1933) On the Empirical Determination of a Distribution. In: Kotz, S., Johnson, N.L. (eds) Breakthroughs in Statistics. Springer Series in Statistics (Perspectives in Statistics). Springer, New York, NY.