Clustering > Hierarchical Clustering

## What is Hierarchical Clustering?

Hierarchical clustering is where you build a cluster tree (a dendrogram) to represent data, where each group (or “node”) is linked to two or more successor groups. The groups are nested and organized as a tree, which ideally ends up as a meaningful classification scheme.

Each node in the cluster tree contains a group of similar data; Nodes are placed on the graph next to other, similar nodes. Clusters at one level are joined with clusters in the next level up, using a degree of similarity; The process carries on until all nodes are in the tree, which gives a visual snapshot of the data contained in the whole set. The total number of clusters is *not *predetermined before you start the tree creation.

## Hierarchical Clustering Algorithms

All hierarchical clustering algorithms are monotonic — they either increase or decrease.

The algorithms can be *bottom up* or *top down*:

**1. Bottom up (Hierarchical Agglomerative Clustering, HAC): **

- Each document is treated as a single cluster at the beginning of the algorithm.
- Items are merged (agglomerated) two at a time into a new cluster.
*How*the pairs are merged involves calculating a dissimilarity between each merged pair and the other samples. There are*many*ways to do this. Popular options:**Complete linkage**: similarity of the furthest pair. One drawback is that outliers can cause close groups to be merged later than is optimal.**Single-linkage**: similarity of the closest pair. This can cause premature merging of groups with close pairs, even if those groups are quite dissimilar overall.**Group average**: similarity between groups.**Centroid similarity: each iteration merges the clusters with the most similar central point.**

- The pairing process continues until all items have been merged into a single cluster.

HAC’s account for the vast majority of hierarchical clustering algorithms. However, one downside is that they have **significant computational and storage requirements** — especially for big data. These complex algorithms are about quadruple the size of the K-means algorithm. Also, merging can’t be reversed, which can create a problem if you have noisy, high-dimensional data.

**2. Top down (Divisive Clustering): **

- Data starts as one combined cluster.
- Cluster is split into two distinct parts, according to some degree of similarity.
- Clusters are split into two again and again until the clusters only contain a single data point.

Divisive clustering is very rarely used.

## Downsides

**Hierarchical clustering can easily lead to dendrograms that are just plain wrong.** Unless you known your data inside out (pretty much impossible for big data sets), this is largely unavoidable. One of the main reasons for this is that the clustering algorithm will work even on the most unsuitable data. Another reason is that the decision you make for creating clusters (Step 2 above) can lead to significantly different dendrograms. The choice can be tough to make in advance, and you may not be able to tell which of the four end results are the most suitable.

## Real Life Example

The fact that the hierarchical clustering algorithm will work even if presented with seemingly unrelated data can be a positive as well as a negative. For example, a 2003 research team used hierarchical clustering to “support the idea that many…breast tumor subtypes represent biologically distinct disease entities.” To the human eye, the original data looked like noise, but the algorithm was able to find patterns.

If you prefer an online interactive environment to learn R and statistics, this free R Tutorial by Datacamp is a great way to get started. If you're are somewhat comfortable with R and are interested in going deeper into Statistics, try this Statistics with R track.

Comments are now closed for this post. Need help or want to post a correction? Please post a comment on our Facebook page and I'll do my best to help!