Missing Values in a Data Science Interview
If an interviewer shows you a sample dataset and asks what comes to mind, they likely want you to talk about MISSING VALUES in the dataset. So, check whether the dataset has any missing values and answer accordingly. The next question will likely be about strategies for dealing with those missing values, so be prepared.
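As a quick illustration of that first check, here is a minimal pandas sketch. The DataFrame and its column names (`price`, `color`, `units`) are made up for this example and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample dataset; the column names are assumptions for illustration.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 25.0],
    "color": ["red", None, "blue", None],
    "units": [1, 2, 3, 4],
})

# Number of missing values in each column.
missing_counts = df.isna().sum()

# Fraction of missing values in each column (between 0 and 1).
missing_share = df.isna().mean()

print(missing_counts)
print(missing_share)
```

Looking at the share as well as the count helps when deciding between dropping and imputing, since a handful of missing rows in a large dataset is a very different situation from a column that is half empty.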
4 Strategies to Deal with Missing Values in a Dataset:
- Drop all columns with missing values (wastes a lot of valuable data, so NOT recommended)
- Drop all rows with missing values (acceptable if only a few rows have missing values)
- Imputation: Fill each missing value with a default value (like -1) or a calculated value (like the mean). This is the most used strategy.
- Imputation with tracking: Impute a column, then create a new column that records which rows were imputed. The new column is TRUE for rows where the value was imputed and FALSE otherwise. You can extend this by adding one tracking column for every column that has missing values. (This is used less often because it can create a lot of new columns.)
Ending remark: In practice, you will most likely use the third strategy (simple imputation) most of the time.
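The four strategies above can be sketched in pandas roughly as follows. This is only an illustrative sketch: the DataFrame, the column names, and the `_was_missing` suffix for the tracking columns are assumptions, and the mean is just one possible fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; column names are assumptions for illustration.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, 20.0],
    "units": [1.0, 2.0, np.nan, 4.0],
    "store": ["a", "b", "c", "d"],
})
numeric_cols = ["price", "units"]

# Strategy 1: drop every column that contains a missing value.
dropped_cols = df.dropna(axis=1)

# Strategy 2: drop every row that contains a missing value.
dropped_rows = df.dropna(axis=0)

# Strategy 3: simple imputation, here with each numeric column's mean.
imputed = df.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].mean())

# Strategy 4: imputation with tracking. Record which rows were missing
# BEFORE filling them, then impute as in strategy 3.
tracked = df.copy()
for col in numeric_cols:
    tracked[col + "_was_missing"] = tracked[col].isna()
    tracked[col] = tracked[col].fillna(tracked[col].mean())

print(dropped_cols.columns.tolist())
print(len(dropped_rows))
print(tracked)
```

In this toy data, strategy 1 keeps only the `store` column and strategy 2 keeps two of the four rows, which shows how quickly dropping can discard data.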
Imputation of Missing Values: Numerical vs. Categorical Features
Imputation for a numerical feature
You impute (fill) the missing value with a default constant value (like -1) or a calculated value such as the mean, median, or mode.
If you fill with a default value, make sure the replacement is outside the range of possible values for that column. For example, you could use -1 for the column PRICE because a price cannot be negative, so -1 can only mean that the value was missing. Keep in mind that the algorithm still treats -1 as an ordinary number: tree-based models can usually separate it from real prices, but linear models may be distorted by it.
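A small sketch of both options for a numerical column, assuming a hypothetical `price` Series with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical price column; the values are made up for illustration.
price = pd.Series([10.0, np.nan, 30.0, 200.0, np.nan], name="price")

# Option 1: a default constant outside the valid range (prices are never negative).
filled_constant = price.fillna(-1)

# Option 2: a calculated value. mean() and median() skip missing values by default.
filled_mean = price.fillna(price.mean())      # mean of 10, 30, 200 is 80.0
filled_median = price.fillna(price.median())  # median of 10, 30, 200 is 30.0

print(filled_constant.tolist())
print(filled_mean.tolist())
print(filled_median.tolist())
```

The single large value (200.0) pulls the mean well above the median here, which is one reason the median is often preferred for skewed columns.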
Imputation for a categorical feature
For a categorical feature, you generally impute (fill) the missing value with the most frequently occurring value (the mode) or with a default label like "_MISSING".
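Both categorical options might look like this in pandas. The `color` Series and its values are assumptions for illustration; the "_MISSING" label is the one suggested above.

```python
import pandas as pd

# Hypothetical categorical column; the values are made up for illustration.
color = pd.Series(["red", "blue", None, "red", None], name="color")

# Option 1: fill with the most frequent value.
# mode() ignores missing values and returns a Series (there can be ties),
# so take the first entry.
most_frequent = color.mode().iloc[0]
filled_mode = color.fillna(most_frequent)

# Option 2: fill with an explicit label that marks the value as missing.
filled_label = color.fillna("_MISSING")

print(filled_mode.tolist())
print(filled_label.tolist())
```

Filling with the mode hides the fact that a value was missing, while the "_MISSING" label keeps that information as its own category.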