# Kolmogorov-Smirnov Goodness of Fit Test

## What is the Kolmogorov-Smirnov Test?

The Kolmogorov-Smirnov Goodness of Fit Test (K-S test) compares your data with a known distribution and tells you whether they have the same distribution. Although the test is nonparametric (it doesn’t assume any particular underlying distribution), it is commonly used as a test for normality, to see whether your data are normally distributed. It’s also used to check the assumption of normality in Analysis of Variance.

More specifically, the test compares a known hypothetical probability distribution (e.g. the normal distribution) to the distribution generated by your data, which is called the empirical distribution function (EDF).

The Lilliefors test, a corrected version of the K-S test for normality, generally gives a more accurate approximation of the test statistic’s distribution when the mean and variance are estimated from the sample. In fact, many statistical packages (like SPSS) combine the two tests as a “Lilliefors corrected” K-S test.

Note: If you’ve never compared an experimental distribution to a hypothetical distribution before, you may want to read the empirical distribution article first. It’s a short article, and it includes an example where you compare two data sets simply, using a scatter plot instead of a hypothesis test.

## How to run the test by hand

The hypotheses for the test are:

H0: P = P0 versus H1: P ≠ P0,

where P is the distribution of your sample (i.e. the EDF) and P0 is a specified distribution.

## General Steps

The general steps to run the test are:

1. Create an EDF for your sample data (see Empirical Distribution Function for steps).
2. Specify a parent distribution (i.e. one that you want to compare your EDF to).
3. Graph the two distributions together.
4. Measure the greatest vertical distance between the two graphs.
5. Calculate the test statistic.
6. Find the critical value in the K-S table.
7. Compare the test statistic to the critical value.
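The first five steps can also be sketched in code rather than on a graph. The following is a minimal sketch, not a reference implementation: the function name `ks_statistic`, the sample values, and the choice of a standard normal parent distribution are all assumptions made for illustration.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Return the largest vertical gap between the sample's EDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf(x)
        # The EDF jumps from (i - 1)/n to i/n at x, so check both sides of the step.
        d = max(d, i / n - f0, f0 - (i - 1) / n)
    return d


# Hypothetical sample and a fully specified parent distribution (standard normal).
sample = [-1.2, -0.7, -0.3, 0.1, 0.4, 0.9, 1.5]
parent = NormalDist(mu=0.0, sigma=1.0)

d = ks_statistic(sample, parent.cdf)
print(f"D = {d:.4f}")
```

Because the EDF is a step function, the largest gap can only occur at a sample point, which is why the loop checks the gap just before and just after each jump.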

## Calculating the Test Statistic

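The K-S test statistic measures the largest distance between the EDF Fdata(x) and the theoretical function F0(x), measured in a vertical direction (Kolmogorov as cited in Stephens 1992). The test statistic is given by:

$$D = \sup_x \left| F_0(x) - F_{\text{data}}(x) \right|$$

Where (for a two-tailed test):

• F0(x) = the cumulative distribution function of the hypothesized distribution,
• Fdata(x) = the empirical distribution function of your observed data.

For a one-tailed test, omit the absolute values from the formula.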
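For hand calculation, the largest distance only needs to be checked at the sample points, because the EDF is a step function. A sketch of the usual computational form follows; the sorted-sample notation $x_{(1)} \le \dots \le x_{(n)}$ is introduced here for illustration, and F0 is assumed to be continuous:

$$D^{+} = \max_{1 \le i \le n}\left(\frac{i}{n} - F_0\big(x_{(i)}\big)\right), \qquad D^{-} = \max_{1 \le i \le n}\left(F_0\big(x_{(i)}\big) - \frac{i-1}{n}\right), \qquad D = \max\left(D^{+}, D^{-}\right)$$

Here $D^{+}$ and $D^{-}$ are the two one-tailed statistics, and the two-tailed statistic is the larger of the two.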

If D is greater than the critical value, the null hypothesis is rejected. Critical values for D are found in a K-S table of critical values.

## Example

Step 1: Find the EDF. In the EDF article, I generated an EDF using Excel that I’ll use for this example.

Step 2: Specify the parent distribution. In the same article, I also calculated the corresponding values for the gamma distribution.

Step 3: Graph the functions together on a single scatter graph.

[Figure: snapshot of the scatter graph. The largest vertical distance is highlighted by the yellow box.]

Step 4: Measure the greatest vertical distance. Let’s assume that I graphed the entire sample and the largest vertical distance separating my two graphs is .04 (in the yellow highlighted box).

Step 5: Look up the critical value in the K-S table. I have 50 observations in my sample. At an alpha level of .05, the K-S table value is .190.

Step 6: Compare the results from Step 4 and Step 5. Since .04 is less than .190, we fail to reject the null hypothesis (that the distributions are the same).
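If a table isn’t handy, the comparison in this example can be approximated in code. This is a sketch that assumes the common large-sample approximation for the two-tailed critical value, c(α)/√n with c(α) = √(−ln(α/2)/2); the function name `ks_critical_value` is hypothetical. The approximation lands close to the tabulated .190, not exactly on it.

```python
import math


def ks_critical_value(n, alpha=0.05):
    """Large-sample approximation to the two-tailed K-S critical value."""
    # c(alpha) = sqrt(-ln(alpha / 2) / 2), which is about 1.36 for alpha = 0.05.
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha / math.sqrt(n)


n = 50             # observations in the example
d_observed = 0.04  # largest vertical distance from Step 4
critical = ks_critical_value(n, alpha=0.05)

print(f"critical value = {critical:.3f}")
print("reject H0" if d_observed > critical else "fail to reject H0")
```

For small samples the approximation is less accurate, so an exact table is the safer source there.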

## Using Technology

Most software packages can run this test.

The R function ecdf creates empirical distribution functions. R functions named with a p followed by a distribution name (pnorm, pbinom, etc.) give the theoretical distribution function, and the function ks.test runs the test itself.

Several online calculators are also available.

When you use software to test for normality, small p-values in your output generally indicate that the data are not from a normal distribution (Ruppert, 2004).
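To see how a p-value relates to D, here is a rough sketch based on Kolmogorov’s limiting distribution, p ≈ 2 Σ (−1)^(k−1) exp(−2k²nD²). It is an asymptotic approximation, not what any particular package computes, and the function name `ks_pvalue_asymptotic` and the two D values are assumptions chosen for illustration.

```python
import math


def ks_pvalue_asymptotic(d, n, terms=100):
    """Approximate two-tailed p-value from Kolmogorov's limiting distribution."""
    lam_sq = n * d * d
    # For very small n * d^2 the p-value is 1 to many decimal places,
    # and the alternating series converges too slowly to be useful.
    if lam_sq < 0.04:
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam_sq)
    # Clamp to [0, 1]; rounding can push the sum slightly outside.
    return min(1.0, max(0.0, 2.0 * total))


p_large_gap = ks_pvalue_asymptotic(0.30, 50)  # big gap between EDF and F0
p_small_gap = ks_pvalue_asymptotic(0.04, 50)  # gap from the worked example

print(f"D = 0.30, n = 50: p = {p_large_gap:.6f}")
print(f"D = 0.04, n = 50: p = {p_small_gap:.6f}")
```

A large gap gives a p-value far below .05 (evidence against the hypothesized distribution), while the small gap from the example gives a p-value near 1.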

## Advantages and Disadvantages

Advantages of the K-S test include:

• The test is distribution free. That means you don’t have to know the underlying population distribution for your data before running this test.
• The D statistic (not to be confused with Cohen’s d) used for the test is easy to calculate.
• It can be used as a goodness of fit test following regression analysis.
• There are no restrictions on sample size; small samples are acceptable.
• Tables are readily available.

Although the K-S test has many advantages, it also has a few limitations:

• For the test to be valid, you must fully specify the location, scale, and shape parameters of the distribution you are comparing against. Estimating these parameters from the same data invalidates the test. If you don’t know these parameters, you may want to run a less formal test (like the one outlined in the empirical distribution function article).
• It generally can’t be used for discrete distributions, especially if you are using software (most software packages don’t have the necessary extensions for discrete K-S Test and the manual calculations are convoluted).
• Sensitivity is higher at the center of the distribution and lower at the tails.

