Nearest Neighbor matching > k-NN (k-Nearest Neighbor)
K-nn (k-Nearest Neighbor) is a non-parametric classification and regression technique. The basic idea is that you input a known data set, add an unknown, and the algorithm will tell you to which class that unknown data point belongs. The unknown is classified by a simple neighborly vote, where the class of close neighbors “wins.” It’s most popular use is for predictive decision making. For example:
- Will a customer default on a loan, or not?
- Is the business going to make a profit?
- Should we expand into a certain market segment?
The following simple example shows how k-nn works.
K-nn: Steps
1. Take a dataset with known categories
In this initial step, you’re just collecting the unsorted, raw data. In this example, the data is clearly categorized with hares and tortoises.
2. Cluster the data
You’ve got a few choices in this step; How you cluster the data is up to you. (e.g. with PCA or another clustering method).
3. Add a cell with an unknown category
4. Find the “k”
Perhaps the most challenging step is finding a k that’s “just right”. The square root of n (the number of items in the data set) is an easy place to start.
- √(n)
- = √(8)
- = 2.82
- = ≅ 3
Although the square root of n is simple, it isn’t the most accurate method. Ideally you should use a training set (i.e. a nicely categorized set) to find a “k” that works for your data. Remove a few categorized data points and make them “unknowns”, testing a few values for k to see what works. An elbow method can work well, where you find an optimal k based on lowest error rates. For a brief look at the elbow method, see: Determining number of clusters in one picture.
5. Locate the “k” nearest neighbors
For this example, I just used the visual to locate the nearest neighbors.
6. Classify the new point
The new point is classified by a majority vote. If most of your neighbors are turtles, odds are that you’re also a turtle. In this case, two out of three of the unknown’s neighbors are hares so the new point is classified as a hare.
See also: K-nn clustering explained in one picture.