What is an Instrumental Variable?
An instrumental variable (sometimes called an “instrument”) is a third variable, Z, used in regression analysis when you have endogenous variables: explanatory variables that are correlated with the error term, for example because they are influenced by other variables in the model. In other words, you use it to account for unexpected behavior between variables. Using an instrumental variable to isolate the part of the explanatory variable, X, that is unrelated to hidden (unobserved) factors allows you to estimate the true relationship between X and the response variable, Y.
Instrumental variables are widely used in econometrics, a branch of economics that uses statistics to describe economic systems, and are sometimes seen in other fields like health sciences and epidemiology.
Example of an Instrumental Variable
Let’s say you had two correlated variables that you wanted to regress: X and Y. Their correlation might be described by a third variable Z, which is associated with X in some way. Z is also associated with Y, but only through Y’s direct association with X. For example, let’s say you wanted to investigate the link between depression (X) and smoking (Y). Lack of job opportunities (Z) could lead to depression, but it is only associated with smoking through its association with depression (i.e. there isn’t a direct correlation between lack of job opportunities and smoking). This third variable, Z (lack of job opportunities), can generally be used as an instrumental variable if it can be measured and its behavior can be accounted for.
What is Instrumental Variables Regression?
Instrumental variables (IV) regression basically splits your explanatory variable into two parts: one part that could be correlated with the error term, ε, and one part that probably isn’t. By isolating the part with no correlation, it’s possible to estimate β in the regression equation:
Y_i = β_0 + β_1 X_i + ε_i.
This type of regression can control for threats to internal validity, like:
- Confounding variables,
- Measurement error,
- Omitted variable bias (sometimes called spuriousness),
- Reverse Causality.
In essence, IV is used when your variables are related in some way. If the explanatory variable is correlated with the error term (e.g. because of bidirectional causation between X and Y), then you can’t rely on the more common methods like ordinary least squares, because one requirement of those methods is that the explanatory variables are not correlated with the error term.
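To make the "two parts" idea concrete, here is a minimal sketch using simulated data. It runs the two regressions by hand (the approach usually called two-stage least squares). Everything about the data is an assumption made up for illustration: the unobserved confounder `u`, the coefficients, and the true slope of 2.0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (all coefficients are made up).
u = rng.normal(size=n)                       # unobserved confounder
z = rng.normal(size=n)                       # instrument: drives x, unrelated to u
x = 0.8 * z + u + rng.normal(size=n)         # x is endogenous because it contains u
y = 1.0 + 2.0 * x + 3.0 * u + rng.normal(size=n)   # true beta1 is 2.0

def ols(design, target):
    """Least-squares coefficients for target regressed on design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

ones = np.ones(n)

# Naive OLS of y on x: biased, because x is correlated with the error (3u + noise).
beta_ols = ols(np.column_stack([ones, x]), y)[1]

# Stage 1: regress x on z and keep the fitted values (the part of x driven by z).
stage1 = np.column_stack([ones, z])
x_hat = stage1 @ ols(stage1, x)

# Stage 2: regress y on those fitted values.
beta_iv = ols(np.column_stack([ones, x_hat]), y)[1]

print(f"OLS estimate of beta1: {beta_ols:.3f}")
print(f"IV  estimate of beta1: {beta_iv:.3f}")
```

In this setup the OLS slope should land well above 2.0, while the IV slope should land close to it. Note that the standard errors from a hand-rolled second stage are not valid, so a dedicated IV routine is the better choice for real analyses.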
Finding Instrumental Variables
IV regression isn’t an easy fix for confounding or other issues. In real life, instrumental variables can be difficult to find and, in fact, may not exist at all. You cannot use the actual data to find IVs (e.g. you can’t perform a regression to identify any); you must rely on your knowledge about the model’s structure and the theory behind your experiment (e.g. economic theory). When looking for IVs, keep in mind that Z should be:
- Exogenous: not affected by other variables in the system (i.e. Cov(Z,ε) = 0). This can’t be directly tested; you have to use your knowledge of the system to determine whether Z is exogenous or not.
- Correlated with X, an endogenous explanatory variable (i.e. Cov(Z,X) ≠ 0). A strong, statistically significant correlation is called a strong first stage. Weak correlations can lead to misleading estimates for parameters and standard errors.
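Unlike exogeneity, the second condition can be checked against data. A rough sketch, assuming a single instrument: compute the F-statistic from the first-stage regression of X on Z. The function name `first_stage_f` and the simulated coefficients are made up for this example. A commonly quoted rule of thumb treats F below about 10 as a sign of a weak instrument, but that is a guideline rather than a formal test.

```python
import numpy as np

def first_stage_f(z, x):
    """F-statistic for regressing x on a single instrument z (with intercept)."""
    n = len(x)
    r2 = np.corrcoef(z, x)[0, 1] ** 2
    # With one regressor, F equals the squared t-statistic of the slope.
    return r2 / (1.0 - r2) * (n - 2)

rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)
noise = rng.normal(size=n)

x_strong = 0.5 * z + noise     # hypothetical strong first stage
x_weak = 0.02 * z + noise      # hypothetical weak first stage

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"strong instrument: F = {f_strong:.1f}")
print(f"weak instrument:   F = {f_weak:.1f}")
```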
A couple of ideas for finding IVs: if available, you could use two different data sources for your instrumental variables, or you could collect longitudinal data and use that. If you know that a mediating variable is causing the effect of X on Y, you can use it as an instrumental variable.
Causal graphs can be used to outline your model structure and identify possible IVs.
Suppose that you want to estimate the effect of a counseling program on senior depression (measured by a rating scale like the HAM-D). The relationship between attending counseling and score on the HAM-D may be confounded by various factors. For example, people who attend counseling sessions might care more about improving their health, or they may have a support network encouraging them to go to counseling. The proximity of a patient’s home to the counseling program is a potential instrumental variable.
However, what if the counseling center is located within a senior community center? Proximity may then cause seniors to spend time socializing or taking up a hobby, which could improve their HAM-D scores. The causal graph in Figure 2 shows that Proximity cannot be used as an IV because it is connected to depression scoring through the path Proximity → Community Center Hours → HAM-D Score.
However, you can control for Community Center Hours by adding it as a covariate. If you do that, then Proximity can be used as an IV, since Proximity is separated from HAM-D score, given Community Center Hours.
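A small simulation can illustrate why the covariate matters. This is only a sketch under made-up assumptions: the variables are continuous, the true effect of counseling on HAM-D is set to -1.0, `motivation` stands in for the unobserved confounders, and the helper `two_sls` is a hypothetical name for a hand-rolled two-stage estimator.

```python
import numpy as np

def two_sls(y, x, z, covariates=()):
    """Two-stage least squares estimate of the effect of x on y, using z as
    the instrument and including any covariates in both stages."""
    n = len(y)
    controls = np.column_stack([np.ones(n), *covariates])
    stage1 = np.column_stack([controls, z])
    x_hat = stage1 @ np.linalg.lstsq(stage1, x, rcond=None)[0]
    stage2 = np.column_stack([controls, x_hat])
    # The coefficient on x_hat is the last column of the second stage.
    return np.linalg.lstsq(stage2, y, rcond=None)[0][-1]

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical structural model (all coefficients are made up).
proximity = rng.normal(size=n)
motivation = rng.normal(size=n)                       # unobserved confounder
counseling = 0.7 * proximity + motivation + rng.normal(size=n)
center_hours = 0.8 * proximity + rng.normal(size=n)   # Proximity -> Community Center Hours
hamd = (-1.0 * counseling - 0.5 * center_hours
        - 1.0 * motivation + rng.normal(size=n))      # true counseling effect is -1.0

est_without = two_sls(hamd, counseling, proximity)
est_with = two_sls(hamd, counseling, proximity, covariates=[center_hours])

print(f"Proximity as IV, no covariate:          {est_without:.2f}")
print(f"Proximity as IV, center hours included: {est_with:.2f}")
```

Without the covariate, the estimate also absorbs the effect that runs through community center hours, so it overstates the benefit of counseling. With the covariate included, it should land close to -1.0.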
Next, suppose that extroverts are more likely to spend time in the community center and are generally happier than introverts. This is shown in the following graph:
Community center hours is a collider variable; conditioning on it opens up a part-bidirectional path Proximity → Community Center Hours ← Extroversion → HAM-D Score. This means that Proximity can’t be used as an IV.
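The collider problem can also be seen in a simulation. As a sketch under made-up assumptions (continuous variables, a true counseling effect of -1.0, unobserved `motivation` and `extroversion`, and a hypothetical helper `two_sls`), the code below adds extroversion to the model. Here Proximity fails as an IV whether or not community center hours is included as a covariate.

```python
import numpy as np

def two_sls(y, x, z, covariates=()):
    """Two-stage least squares estimate of the effect of x on y, using z as
    the instrument and including any covariates in both stages."""
    n = len(y)
    controls = np.column_stack([np.ones(n), *covariates])
    stage1 = np.column_stack([controls, z])
    x_hat = stage1 @ np.linalg.lstsq(stage1, x, rcond=None)[0]
    stage2 = np.column_stack([controls, x_hat])
    return np.linalg.lstsq(stage2, y, rcond=None)[0][-1]

rng = np.random.default_rng(3)
n = 50_000

# Hypothetical structural model (all coefficients are made up).
proximity = rng.normal(size=n)
motivation = rng.normal(size=n)      # unobserved
extroversion = rng.normal(size=n)    # unobserved
counseling = 0.7 * proximity + motivation + rng.normal(size=n)
# Community center hours is a collider: Proximity -> hours <- Extroversion.
center_hours = 0.8 * proximity + extroversion + rng.normal(size=n)
hamd = (-1.0 * counseling - 0.5 * center_hours - 1.0 * motivation
        - 1.0 * extroversion + rng.normal(size=n))   # true counseling effect is -1.0

est_uncontrolled = two_sls(hamd, counseling, proximity)
est_controlled = two_sls(hamd, counseling, proximity, covariates=[center_hours])

print(f"no covariate:          {est_uncontrolled:.2f}")
print(f"center hours included: {est_controlled:.2f}")
```

Leaving the covariate out keeps the path through community center hours open. Including it links Proximity to extroversion, which then biases the estimate in the other direction.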
As a final step for this example, let’s say you find that community center hours doesn’t affect HAM-D scores because people who don’t socialize in the community center actually socialize in other places. This is depicted in the following graph:
If you stop controlling for community center hours (that is, you remove it as a covariate), then you can use Proximity as an IV again.
Epsilon (ε) is a measurement of how far from the true regression line the observation y is. The true regression line is the line of the means (the mean of epsilon is zero).
Glymour, M. Using Causal Diagrams to Understand Common Problems in Social Epidemiology in Methods in Social Epidemiology May 2006, Jossey-Bass
Are Extroverts Happier? Psychology Today.