Multicollinearity

Multicollinearity means that some of the predictor variables in a multiple regression model are highly correlated with one another, so that one or more of them can be predicted almost exactly as a linear function of the others.
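To make this concrete, here is a minimal Python sketch on synthetic data (all variable names are made up for illustration): x3 is constructed as a near-linear function of x1, and the correlation matrix exposes the redundancy.

```python
# A minimal sketch illustrating multicollinearity on synthetic data.
# All variable names here are hypothetical, chosen for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 + rng.normal(scale=0.1, size=n)  # nearly a linear function of x1

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(X.corr().round(2))
# x1 and x3 show a correlation close to 1.0 -- the hallmark of
# multicollinearity; x2 stays roughly uncorrelated with both.
```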

How to deal with it?

You can use decision-tree-based models (a single tree or a boosted ensemble), as their predictions are largely unaffected by multicollinearity: of, say, two highly correlated features, a tree will use only one at any given split. However, it is still good practice to remove redundant features during the preprocessing phase.
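As a quick illustration, here is a sketch using scikit-learn's DecisionTreeRegressor on made-up data: appending an exact copy of a feature should leave the tree's training-set predictions essentially unchanged, because each split simply uses one of the two identical columns.

```python
# A minimal sketch (hypothetical data) suggesting that a decision tree's
# predictions are essentially unchanged when an exact copy of a feature
# is added: each split just picks one of the two identical columns.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)

X_dup = np.hstack([X, X[:, [0]]])  # append a duplicate of column 0

tree_a = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_b = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_dup, y)

# The duplicate column offers no split the original didn't, so the
# learned partitions -- and hence the predictions -- should agree.
print(np.allclose(tree_a.predict(X), tree_b.predict(X_dup)))
```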

How to remove redundant features?

One possible way is to compute the pairwise correlations between the predictor variables and, whenever a pair exceeds a chosen threshold, drop the member of the pair that correlates less strongly with the target. Most data-science libraries can compute these correlations for you (for example, pandas' DataFrame.corr), and you can tune the threshold to decide which features survive into your final model.
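Here is a sketch of such a filter, assuming pandas is available; the 0.9 threshold and the tie-breaking rule (keep whichever feature correlates more strongly with the target) are illustrative choices, not fixed recommendations.

```python
# A minimal sketch of a correlation-based redundancy filter.
import pandas as pd

def drop_redundant(X: pd.DataFrame, y: pd.Series, threshold: float = 0.9) -> pd.DataFrame:
    corr = X.corr().abs()            # pairwise correlations among predictors
    target_corr = X.corrwith(y).abs()  # each predictor's correlation with the target
    dropped = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                # Drop the member of the pair that tells us less about y.
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    return X.drop(columns=sorted(dropped))
```

On the synthetic data from the first sketch, this would drop one of x1/x3 and keep x2 untouched.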

Is it necessary to deal with it?

Tree-based models, which are very widely used these days, are not much affected by multicollinearity. If you are using linear models, however, you do need to deal with it as described above; otherwise you may get unstable, hard-to-interpret coefficient estimates. Many ML libraries also offer regularized linear models (such as ridge regression) that mitigate the problem for you, and using them can save you some of this work.
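For instance, here is a sketch (on hypothetical, nearly collinear data) contrasting plain least squares with scikit-learn's Ridge; the penalty strength alpha=1.0 is an arbitrary illustrative choice.

```python
# A minimal sketch contrasting OLS with ridge regression when two
# predictors are nearly collinear; all data here is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x3 = 2.0 * x1 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x3])
y = x1 + rng.normal(scale=0.1, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
print("OLS coefficients:  ", ols.coef_)    # can be large and offsetting
print("Ridge coefficients:", ridge.coef_)  # shrunk toward stable values
```

The ridge penalty discourages the large, mutually cancelling coefficients that near-collinearity tends to produce, which is why regularized models are a convenient default here.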
