What is the Mahalanobis distance?
The Mahalanobis distance (MD) is the distance between two points in multivariate space. In a regular Euclidean space, variables (e.g. x, y, z) are represented by axes drawn at right angles to each other, and the distance between any two points can be measured with a ruler. For uncorrelated variables, the Euclidean distance equals the MD. However, if two or more variables are correlated, the axes are no longer at right angles and the distance can no longer be read off with a ruler. In addition, if you have more than three variables, you can’t plot them in regular 3D space at all. The MD solves this measurement problem: it measures the distance between points even when the variables are correlated, and it works for any number of variables.
The Mahalanobis distance measures distance relative to the centroid, a central point that can be thought of as an overall mean for multivariate data. The centroid is the point in multivariate space where the means of all the variables intersect. The larger the MD, the further the data point is from the centroid.
How the Mahalanobis distance is measured
As described above, the Mahalanobis distance quantifies how far a data point lies from the centroid, the point in multivariate space where the means of all the variables intersect. Because correlated variables tilt the axes away from right angles, this distance cannot be read off with a ruler the way a Euclidean distance can; instead, it is computed from the data's covariance structure, as the formulas below show.
The Mahalanobis distance between two objects is formally defined as [2]:
d_{\text{Mahalanobis}} = \left[ (x_B - x_A)^{T} \, C^{-1} \, (x_B - x_A) \right]^{0.5}
where
- x_A and x_B are a pair of objects,
- C is the sample covariance matrix, and
- T is the transpose operation, which flips a matrix over its diagonal.
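As a concrete illustration of this pairwise form, here is a minimal Python sketch using NumPy; the height/weight numbers are made up purely for illustration, and the covariance matrix is estimated from the small sample itself.

```python
import numpy as np

# Hypothetical sample: rows are observations, columns are height (in) and weight (lb).
X = np.array([[64, 130],
              [66, 140],
              [68, 155],
              [70, 160],
              [72, 180]], dtype=float)

x_a, x_b = X[0], X[3]                    # the pair of objects x_A and x_B
C = np.cov(X, rowvar=False)              # sample covariance matrix C
diff = x_b - x_a

# d = [(x_B - x_A)^T C^-1 (x_B - x_A)]^0.5
d = np.sqrt(diff @ np.linalg.inv(C) @ diff)
print(d)
```

The same quantity can also be obtained with scipy.spatial.distance.mahalanobis(x_b, x_a, np.linalg.inv(C)).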
A different version of the formula measures the distance from each observation to the centroid (the mean vector):
d_i = \left[ (x_i - \bar{x})^{T} \, C^{-1} \, (x_i - \bar{x}) \right]^{0.5}
Where:
- x_i is an object (observation) vector, and
- x̄ is the arithmetic mean vector (the centroid).
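The centroid version can be sketched the same way; the simulated data below (and the seed) are arbitrary, and the einsum line simply evaluates the quadratic form for every observation at once.

```python
import numpy as np

# Simulated correlated data, purely for illustration: rows are observations.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.8], [0.8, 1]], size=200)

centroid = X.mean(axis=0)                          # the mean vector x̄
C_inv = np.linalg.inv(np.cov(X, rowvar=False))     # inverse sample covariance matrix

# d_i = [(x_i - x̄)^T C^-1 (x_i - x̄)]^0.5 for every observation at once
diffs = X - centroid
d = np.sqrt(np.einsum('ij,jk,ik->i', diffs, C_inv, diffs))
print(d[:5])
```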
Uses
The most common use for the Mahalanobis distance is to find multivariate outliers, which are unusual combinations of values across two or more variables. For example, it’s fairly common to find a 6′ tall woman weighing 185 lbs., but it’s rare to find a 4′ tall woman who weighs that much.
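One common way to turn these distances into an outlier flag is to compare the squared distances to a chi-square cutoff, since the squared MD is approximately chi-square distributed (with degrees of freedom equal to the number of variables) when the data are multivariate normal. The sketch below uses invented height/weight data and an arbitrary 99.9% cutoff, so it is only meant to illustrate the idea.

```python
import numpy as np
from scipy.stats import chi2

# Invented height (in) / weight (lb) data with one unusual combination appended.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([66, 150], [[9, 24], [24, 100]], size=100)
X = np.vstack([X, [48, 185]])                        # a 4-foot, 185-lb observation

centroid = X.mean(axis=0)
C_inv = np.linalg.inv(np.cov(X, rowvar=False))
diffs = X - centroid
md2 = np.einsum('ij,jk,ik->i', diffs, C_inv, diffs)  # squared Mahalanobis distances

# Under multivariate normality, md2 is roughly chi-square with df = number of variables.
cutoff = chi2.ppf(0.999, df=X.shape[1])
print(np.where(md2 > cutoff)[0])                     # indices flagged as multivariate outliers
```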
Related Measurements
A related term is leverage, which uses a different measurement scale than the Mahalanobis distance. The two are related by the following formula [3], where h_{ii} is the leverage:

\text{Mahalanobis distance} = (N - 1)\left(h_{ii} - \tfrac{1}{N}\right)
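A quick numerical sketch of this relationship is below, using randomly generated predictors. Note that with d_i defined as above, it is the squared distance d_i^2 that equals (N - 1)(h_ii - 1/N); some regression packages report the Mahalanobis distance in this squared form.

```python
import numpy as np

# Hypothetical predictors only: N observations of two independent variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
N = X.shape[0]

# Leverage h_ii: diagonal of the hat matrix for a regression with an intercept.
X_design = np.column_stack([np.ones(N), X])
H = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
h = np.diag(H)

# Squared Mahalanobis distance of each row of X from the centroid of the predictors.
diffs = X - X.mean(axis=0)
d2 = np.einsum('ij,jk,ik->i', diffs, np.linalg.inv(np.cov(X, rowvar=False)), diffs)

print(np.allclose(d2, (N - 1) * (h - 1 / N)))      # True
```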
While the Mahalanobis distance uses independent variables in its calculations, Cook’s distance — a measure of how much an observation affects the fitted regression model — uses both independent variables and dependent variables. It is calculated by multiplying the leverage and the studentized residual, which is the difference between the observed and predicted value, divided by the standard error of the residual.
Cook’s distance can be interpreted as the change in the fitted regression coefficients if the observation is removed from the data set. A higher Cook’s distance indicates a more influential observation on the fitted model. An observation with a high Cook’s distance is likely to have a high Mahalanobis distance because an outlier that is far away from the fitted model is also likely to have a large impact on the fitted regression coefficients. That said, there are some circumstances where an observation can have a high Mahalanobis distance but a low Cook’s distance, such as when an outlier does not have a strong impact on the fitted regression coefficients.
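As a rough illustration of the connection, here is a sketch with statsmodels on made-up data containing one observation that has both an unusual predictor value and a large residual; the data and seed are arbitrary.

```python
import numpy as np
import statsmodels.api as sm

# Made-up regression data with one influential observation appended.
rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 2 * x + rng.normal(scale=0.5, size=50)
x = np.append(x, 4.0)                            # unusual predictor value (high leverage)
y = np.append(y, -5.0)                           # and far from the fitted line (large residual)

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

cooks_d, _ = influence.cooks_distance            # Cook's distance for each observation
leverage = influence.hat_matrix_diag             # leverage h_ii for each observation
print(np.argmax(cooks_d), np.argmax(leverage))   # both point to the appended observation
```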
Disadvantages
Although the Mahalanobis distance is included with many popular statistics packages, some authors question the reliability of results [4, 5]. A major issue with the MD is that the calculation requires the inverse of the sample covariance matrix. When variables are highly correlated, this matrix is (nearly) singular, so the inverse cannot be computed reliably [2].
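A tiny sketch of the problem: if one variable is an exact linear function of another, the sample covariance matrix is singular and has no inverse, and with near-perfect correlation the inverse is numerically unstable. The data below are contrived to make the point.

```python
import numpy as np

# Contrived example: the second variable is an exact multiple of the first.
x = np.arange(10, dtype=float)
X = np.column_stack([x, 2 * x])

C = np.cov(X, rowvar=False)
print(np.linalg.det(C))      # (numerically) zero: C is singular
print(np.linalg.cond(C))     # enormous condition number: any computed "inverse" is unreliable
```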
References
[1] Integrative set enrichment testing for multiple omics platforms – Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Mahalanobis-distance-plot-example-A-contour-plot-overlaying-the-scatterplot-of-100_fig1_51835228 [accessed 8 Jun, 2024]
[2] Varmuza, K. & Filzmoser, P. (2009). Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press.
[3] Weiner, I. et al. (2003). Handbook of Psychology, Research Methods in Psychology. John Wiley & Sons.
[4] Egan, W. & Morgan, S. (1998). Outlier detection in multivariate analytical chemical data. Analytical Chemistry, 70, 2372-2379.
[5] Hadi, A. & Simonoff, J. (1993). Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88, 1264-1272.