 # Multicollinearity


## Definition

A term for the linear associations between two or more Independent Variables used to model or predict a response variable, making it difficult to separate the effects of the associated variables on the response. Multicollinearity inflates the Variances of the coefficients of the affected variables, so the coefficient estimates are no longer precise. In the worst case (perfect multicollinearity), the model coefficients cannot be estimated at all.

## Application

Multicollinearity is identified by regressing each independent variable on the others in the set and checking the resulting coefficient of determination (R²). High R² values (usually 70% or higher) indicate high multicollinearity. Another commonly used statistic is the Variance Inflation Factor (VIF), calculated as 1/(1 - R²); again, high VIF values signal multicollinearity among the variables.
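The diagnostic above can be sketched in plain NumPy. The `vif` helper and the toy data are illustrative, not part of any particular library:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factors: regress each column on the others
    (with an intercept) and return 1 / (1 - R^2) for each column."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS fit
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Toy data: two nearly collinear predictors plus an unrelated one
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)  # x2 is almost a copy of x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
print(vif(X))  # first two VIFs large, third close to 1
```

Here the first two VIFs are large because x1 and x2 nearly duplicate each other, while the VIF of the unrelated x3 stays near 1.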

If the study goal is to predict the value of the response variable for given values of the predictors, multicollinearity does not harm the results: the predictions remain reliable, provided the same pattern of association among the predictors holds for the new data. However, if the goal is to describe the response distribution in terms of the chosen predictors, or to test their significance, then the estimated model coefficients (weights) will be imprecise. As a result, the model as a whole may be significant while the individual coefficients show insignificant t-statistics.
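A small simulation (plain NumPy, toy data assumed for illustration) makes this concrete: with two nearly identical predictors, the overall fit is strong, yet the individual slope standard errors are inflated and the t-statistics tend to be small:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# OLS fit with intercept
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])           # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t = beta / se

r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
print(f"model R^2 = {r2:.3f}")   # fit is excellent
print(f"slope SEs = {se[1:].round(2)}")   # inflated by collinearity
print(f"t-stats   = {t[1:].round(2)}")    # typically small despite the good fit
```

The model explains most of the variance in y, but neither slope can be pinned down individually because x1 and x2 carry almost the same information.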

There are several ways to deal with multicollinearity, among them:
- Increase the sample size.
- Drop one of the redundant variables, or combine them into a single predictor variable that makes sense.
- Use a priori theoretical information about the associations among the predictors to impose restrictions on the coefficients.
- Do nothing, but note the presence of multicollinearity among the model variables in the analysis report.
- Use a different method of analysis, such as ridge regression or principal components regression.
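As a sketch of the ridge option, the closed-form estimator below adds a penalty term lam·I to XᵀX, which stabilizes the coefficients of nearly collinear predictors; the `ridge` helper name and the toy data are illustrative assumptions, not a specific library API:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate for centered X and y
    (the intercept is handled by centering). lam = 0 gives plain OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Toy data: two nearly collinear predictors
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)
y = y - y.mean()

print("OLS  :", ridge(X, y, 0.0).round(2))  # unstable split between x1 and x2
print("ridge:", ridge(X, y, 5.0).round(2))  # near-equal, stable split
```

The OLS solution can split the combined effect between x1 and x2 almost arbitrarily, while the ridge penalty pulls the two coefficients toward a stable, near-equal split whose sum still recovers the combined effect.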