Dummy Variables / Indicator Variables: Simple Definition, Examples

What are Dummy Variables?

Dummy variables (sometimes called indicator variables) are used in regression analysis and latent class analysis. As the name implies, they are artificial variables, created to represent a categorical variable with two or more categories or levels. They are used when you want to work with categorical variables whose categories have no quantifiable relationship with each other.

For example, race can be categorized as Caucasian, African American, Asian, Hispanic, or Other. If you assign the numbers 1-5 to these categories when performing regression analysis, the results would make no sense at all, because the regression would treat the codes as quantities (is the “Other” category in any way 5 times the “Caucasian” category?). However, if you create a variable called Caucasian and assign it the value 1 to mean “is Caucasian” and 0 to mean “is not Caucasian”, then you can start to see how dummy variables are useful.
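The Caucasian example above can be sketched in a few lines of Python. This is only an illustration: the sample data and the helper name `indicator` are made up for the example.

```python
# Hypothetical sample data: one race label per respondent.
races = ["Caucasian", "Asian", "Hispanic", "Caucasian", "Other"]


def indicator(values, level):
    """Return 1 where a value equals `level` and 0 everywhere else."""
    return [1 if value == level else 0 for value in values]


# Dummy variable "Caucasian": 1 = is Caucasian, 0 = is not Caucasian.
caucasian = indicator(races, "Caucasian")
print(caucasian)  # [1, 0, 0, 1, 0]
```

The 0/1 coding carries no ordering or magnitude between categories; it only records whether a respondent belongs to the category.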

In latent class analysis, the term indicator variable means something more specific, although it’s still an artificial variable. A set of observed variables can “indicate” the presence of one or more latent (hidden) variables — hence the term indicator variable.

Coding Categorical Variables with Multiple Levels

If you have a categorical variable with more than two levels (levels are the different groups within the same independent variable), multiple dummy variables need to be created. In the above example, the categorical variable “Race” has five levels (Caucasian, African American, Asian, Hispanic, Other). The formula k-1 is used to decide how many dummy variables to code, where “k” is the number of levels. In other words, only four of these five levels are coded with dummy variables; the level left out becomes the reference level, identified by a 0 on all four dummy variables. Which level should you leave out? It’s usually the largest group, to which all the others will be compared. In this example, let’s assume the data come from Mexico City, Mexico. The largest group would be Hispanic, and that would be the level left out. Ultimately, which level is not coded with a dummy variable is up to you, the researcher, and depends on which level you want to compare the others to.
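The k-1 rule can be sketched in plain Python as follows. The function name `dummy_code`, its default of leaving out the most frequent level, and the sample data are assumptions made for illustration; statistical packages provide their own routines for this.

```python
from collections import Counter


def dummy_code(values, reference=None):
    """Code a categorical variable with k levels as k-1 dummy variables.

    If no reference level is given, the most frequent level is left out.
    Returns the reference level and a dict mapping each remaining level
    to its list of 0/1 codes.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = Counter(values).most_common(1)[0][0]
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference!r}")
    coded_levels = [level for level in levels if level != reference]
    columns = {
        level: [1 if value == level else 0 for value in values]
        for level in coded_levels
    }
    return reference, columns


# Hypothetical sample in which Hispanic is the largest group.
races = ["Hispanic", "Caucasian", "Hispanic", "Asian",
         "Hispanic", "African American", "Other", "Hispanic"]

reference, dummies = dummy_code(races)
print(reference)      # Hispanic
print(len(dummies))   # 4 dummy variables for k = 5 levels
for level, codes in dummies.items():
    print(level, codes)
```

A respondent in the reference group (Hispanic here) has 0 on all four dummy variables, which is why a fifth dummy variable would add no information. Passing `reference="Caucasian"` instead would make Caucasian the comparison group.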

References

Edwards, A. (1976). An Introduction to Linear Regression and Correlation. W. H. Freeman.
Everitt, B. S., & Skrondal, A. (2010). The Cambridge Dictionary of Statistics. Cambridge University Press.