k-NN (k-Nearest Neighbor): Overview, Simple Example

Nearest Neighbor matching > k-NN (k-Nearest Neighbor)

K-nn (k-Nearest Neighbor) is a non-parametric classification and regression technique. The basic idea is that you input a known data set, add an unknown, and the algorithm will tell you to which class that unknown data point belongs. The unknown is classified by a simple neighborly vote, where the class of close neighbors “wins.” It’s most popular use is for predictive decision making. For example:

  • Will a customer default on a loan, or not?
  • Is the business going to make a profit?
  • Should we expand into a certain market segment?

The following simple example shows how k-nn works.

K-nn: Steps

1. Take a dataset with known categories

In this initial step, you’re just collecting the unsorted, raw data. In this example, the data is clearly categorized with hares and tortoises.
Take a dataset with known categories - the data is clearly categorized with hares and tortoises

2. Cluster the data

You’ve got a few choices in this step; How you cluster the data is up to you. (e.g. with PCA or another clustering method).
Cluster the data

3. Add a cell with an unknown category

Add a cell with an unknown category

4. Find the “k”

Perhaps the most challenging step is finding a k that’s “just right”. The square root of n (the number of items in the data set) is an easy place to start.

  • √(n)
  • = √(8)
  • = 2.82
  • = ≅ 3

Although the square root of n is simple, it isn’t the most accurate method. Ideally you should use a training set (i.e. a nicely categorized set) to find a “k” that works for your data. Remove a few categorized data points and make them “unknowns”, testing a few values for k to see what works. An elbow method can work well, where you find an optimal k based on lowest error rates. For a brief look at the elbow method, see: Determining number of clusters in one picture.

5. Locate the “k” nearest neighbors

For this example, I just used the visual to locate the nearest neighbors.
For this example, I just used the visual to locate the nearest neighbors

6. Classify the new point

The new point is classified by a majority vote. If most of your neighbors are turtles, odds are that you’re also a turtle. In this case, two out of three of the unknown’s neighbors are hares so the new point is classified as a hare.
Classify the new point

See also: K-nn clustering explained in one picture.


Comments? Need to post a correction? Please Contact Us.