Nearest Neighbor matching > k-NN (k-Nearest Neighbor)
K-nn (k-Nearest Neighbor) is a non-parametric classification and regression technique. The basic idea is that you input a known data set, add an unknown, and the algorithm will tell you to which class that unknown data point belongs. The unknown is classified by a simple neighborly vote, where the class of close neighbors “wins.” It’s most popular use is for predictive decision making. For example:
- Will a customer default on a loan, or not?
- Is the business going to make a profit?
- Should we expand into a certain market segment?
The following simple example shows how k-nn works.
1. Take a dataset with known categories
2. Cluster the data
3. Add a cell with an unknown category
4. Find the “k”
Perhaps the most challenging step is finding a k that’s “just right”. The square root of n (the number of items in the data set) is an easy place to start.
- = √(8)
- = 2.82
- = ≅ 3
Although the square root of n is simple, it isn’t the most accurate method. Ideally you should use a training set (i.e. a nicely categorized set) to find a “k” that works for your data. Remove a few categorized data points and make them “unknowns”, testing a few values for k to see what works. An elbow method can work well, where you find an optimal k based on lowest error rates. For a brief look at the elbow method, see: Determining number of clusters in one picture.
5. Locate the “k” nearest neighbors
6. Classify the new point
The new point is classified by a majority vote. If most of your neighbors are turtles, odds are that you’re also a turtle. In this case, two out of three of the unknown’s neighbors are hares so the new point is classified as a hare.
See also: K-nn clustering explained in one picture.