Multicollinearity

Multicollinearity means that some of the predictor variables in a multiple regression model are highly correlated with one another, so that one or more of them can be predicted almost exactly as a linear function of the others.
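To make this concrete, here is a minimal Python sketch on synthetic data (all variable names are made up for illustration): x3 is constructed as a near-linear function of x1, and the correlation matrix exposes the redundancy.

```python
# A minimal sketch illustrating multicollinearity on synthetic data.
# All variable names here are hypothetical, chosen for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 + rng.normal(scale=0.1, size=n)  # nearly a linear function of x1

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(X.corr().round(2))
# x1 and x3 show a correlation close to 1.0 -- the hallmark of
# multicollinearity; x2 stays roughly uncorrelated with both.
```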

How to deal with it?

You can use decision-tree-based models (a single tree or a boosted ensemble), as their predictions are largely unaffected by multicollinearity: of, say, two highly correlated features, a tree will use only one at any given split. However, it is still good practice to remove redundant features during the preprocessing phase.
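As a quick illustration, here is a sketch using scikit-learn's DecisionTreeRegressor on made-up data: appending an exact copy of a feature should leave the tree's training-set predictions essentially unchanged, because each split simply uses one of the two identical columns.

```python
# A minimal sketch (hypothetical data) suggesting that a decision tree's
# predictions are essentially unchanged when an exact copy of a feature
# is added: each split just picks one of the two identical columns.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)

X_dup = np.hstack([X, X[:, [0]]])  # append a duplicate of column 0

tree_a = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_b = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_dup, y)

# The duplicate column offers no split the original didn't, so the
# learned partitions -- and hence the predictions -- should agree.
print(np.allclose(tree_a.predict(X), tree_b.predict(X_dup)))
```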

How to remove redundant features?

One possible way is to compute the pairwise correlations between the predictor variables and, whenever a pair exceeds a chosen threshold, drop the member of the pair that correlates less strongly with the target. Most data-science libraries can compute these correlations for you (for example, pandas' DataFrame.corr), and you can tune the threshold to decide which features survive into your final model.
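Here is a sketch of such a filter, assuming pandas is available; the 0.9 threshold and the tie-breaking rule (keep whichever feature correlates more strongly with the target) are illustrative choices, not fixed recommendations.

```python
# A minimal sketch of a correlation-based redundancy filter.
import pandas as pd

def drop_redundant(X: pd.DataFrame, y: pd.Series, threshold: float = 0.9) -> pd.DataFrame:
    corr = X.corr().abs()            # pairwise correlations among predictors
    target_corr = X.corrwith(y).abs()  # each predictor's correlation with the target
    dropped = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                # Drop the member of the pair that tells us less about y.
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    return X.drop(columns=sorted(dropped))
```

On the synthetic data from the first sketch, this would drop one of x1/x3 and keep x2 untouched.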

Is it necessary to deal with it?

Tree-based models, which are very widely used these days, are not much affected by multicollinearity. If you are using linear models, however, you do need to deal with it as described above; otherwise you may get unstable, hard-to-interpret coefficient estimates. Many ML libraries also offer regularized linear models (such as ridge regression) that mitigate the problem for you, and using them can save you some of this work.
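For instance, here is a sketch (on hypothetical, nearly collinear data) contrasting plain least squares with scikit-learn's Ridge; the penalty strength alpha=1.0 is an arbitrary illustrative choice.

```python
# A minimal sketch contrasting OLS with ridge regression when two
# predictors are nearly collinear; all data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x3 = 2.0 * x1 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x3])
y = x1 + rng.normal(scale=0.1, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("OLS coefficients:  ", ols.coef_)    # can be large and offsetting
print("Ridge coefficients:", ridge.coef_)  # shrunk toward stable values
```

The ridge penalty discourages the large, mutually cancelling coefficients that near-collinearity tends to produce, which is why regularized models are a convenient default here.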
