What is Dimensionality?
Dimensionality in statistics refers to how many attributes a dataset has. For example, healthcare data is notorious for having vast numbers of variables (e.g. blood pressure, weight, cholesterol level). In an ideal world, this data could be represented in a spreadsheet, with one column representing each dimension. In practice, this is difficult to do, in part because many variables are interrelated (like weight and blood pressure).
High dimensional means that the number of dimensions is staggeringly high, so high that computations become extremely difficult. For example, microarrays, which measure gene expression, can contain tens to hundreds of samples, each with measurements for tens of thousands of genes.
Reduction of dimensionality means simplifying the data, either numerically or visually, while maintaining its integrity. To reduce dimensionality, you could use a technique like multidimensional scaling, which identifies similarities in the data, to combine related data into groups. You could also use clustering to group items together.
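As a rough illustration of the clustering idea, here is a minimal sketch of k-means, one common clustering method, written in plain Python. The patient values (weight in kg, systolic blood pressure), the function name kmeans, and the choice of two starting centers are all made up for this example; they do not come from any real dataset or library.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest center for every point.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # nothing moved, so the grouping is stable
        centers = new_centers
    return labels, centers

# Hypothetical patients: (weight in kg, systolic blood pressure in mmHg).
patients = [(60, 110), (62, 115), (58, 112), (95, 150), (98, 155), (92, 148)]

# Start from two of the patients as initial centers (an arbitrary choice).
labels, centers = kmeans(patients, [patients[0], patients[3]])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Each patient is now summarized by a single group label instead of two measurements. With real data you would normally standardize the variables first and try several starting centers, which this sketch leaves out.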
Note: Dimensionality means something slightly different in other areas of mathematics and science. For example, in physics, dimensionality can usually be expressed in terms of fundamental dimensions like mass, time, or length. In linear algebra, two units of measure have the same dimensionality if a function exists that maps one variable onto another variable and the inverse of the function does the reverse.
Curse of Dimensionality
The curse of dimensionality usually refers to what happens when you add more and more variables to a multivariate model: The more dimensions you add to a data set, the more difficult it becomes to predict certain quantities.
It would seem that the more variables you add to a model, the more useful the model would be. However, the opposite is often true: adding more variables can greatly weaken the model's predictive ability, because the space the model has to cover grows exponentially with each added dimension.
As a simple example, let's say you are using a model to predict the location of a large bacterium in a 25 cm² petri dish (5 cm by 5 cm). The model might be fairly accurate at pinning the bacterium down to the nearest square centimeter, which is one of 25 possible squares. However, if you add just one more dimension, say by swapping the 2D petri dish for a 3D beaker with 5 cm sides, the space to search grows to 125 cm³, or 125 possible one-centimeter cubes. The computational burden increases along with the number of dimensions. It wouldn't be impossible to pinpoint where the bacterium might be in a 3D model, but it's certainly a more challenging task.
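The arithmetic behind the petri dish example can be sketched in a few lines of Python. The 5 cm side length is an assumption implied by the 25 cm² and 125 cm³ figures, and the function name unit_cells is invented for this illustration.

```python
side_cm = 5  # assumed edge length: 5 cm x 5 cm = 25 cm^2, 5 x 5 x 5 = 125 cm^3

def unit_cells(side, dims):
    """Number of 1 cm-wide cells in a cube with the given side length and dimension count."""
    return side ** dims

for dims in range(1, 7):
    print(f"{dims} dimension(s): {unit_cells(side_cm, dims):>6} cells to search")
```

Going from two dimensions to three multiplies the number of cells by five, and every further dimension multiplies it by five again, so six dimensions already give 15,625 cells.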
The statistical curse of dimensionality refers to a related fact: the required sample size n grows exponentially with the number of dimensions d. In other words, adding more dimensions could mean that the required sample sizes quickly become unmanageable.
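One way to see why the required sample size grows so quickly is to hold the sample size fixed and watch the data thin out as dimensions are added. The following is a small simulation sketch, assuming points drawn uniformly at random from a unit hypercube; the function name, the sample size of 200, and the seed are arbitrary choices for illustration.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, dims, seed=0):
    """Average distance from each point to its closest neighbor in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    points = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Distance to the closest other point (brute force, fine for small samples).
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

for dims in (1, 2, 5, 10):
    d = mean_nearest_neighbor_distance(200, dims)
    print(f"{dims:>2} dimensions: mean nearest-neighbor distance = {d:.3f}")
```

The same 200 points sit much farther from their nearest neighbors in 10 dimensions than in 1 or 2, so each observation says less about its surroundings. Keeping the points as densely packed as they were in low dimensions would take a far larger sample.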
Finney, D.J. (1977). “Dimensions of Statistics.” Journal of the Royal Statistical Society. Series C (Applied Statistics). 26, No.3, p.285-289. Royal Statistical Society.