Python for Feature Engineering: Handling missing data.

Feature engineering is one of the most important parts of building good machine learning models, and handling missing data is one of its most basic steps. Missing data can badly distort your models, so it has to be handled properly. Here I’m going to explain several methods for handling missing data in different scenarios.

What is missing data?

Missing data occurs when features/columns of a record/row have not been recorded. Missing data results in incomplete records that may impact the performance of the machine learning model created using this data.

Missing data can happen for many reasons. It could be human error, or the observer could not find a certain value for an observation, or the person recording the data could simply be lazy like me.

Types of missing data

  1. Missing completely at random: When data is missing completely at random, the fact that a value is missing has no relationship with any other feature in the dataset, or with the missing value itself. The good thing about data being missing completely at random is that it does not affect the original distribution of the data.
  2. Missing not at random: This occurs when the missingness is related to the value that is missing, or to the expected output from the data. E.g., in the publicly available Titanic dataset there is a lot of missing information for the passengers who did not survive, because the team that recorded the data could only collect complete data from the passengers who survived.
  3. Missing at random: This occurs when there is a relationship between the missing values and other features in the dataset. For example, if blue collar employees are more likely to disclose their salary than white collar employees, there will be more missing values for white collar employees than for blue collar employees. Hence the missing data has a relationship with the type of employee.
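
To make the difference between the first and third type concrete, here is a minimal simulation sketch. The column names, the sample size and the missing probabilities (10%, 30%, 5%) are all made up for illustration; they do not come from a real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: job type and salary
df = pd.DataFrame({
    "job_type": rng.choice(["blue_collar", "white_collar"], size=n),
    "salary": rng.normal(50_000, 10_000, size=n),
})

# Missing completely at random: every salary has the same 10% chance of being missing
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "salary"] = np.nan

# Missing at random: the chance of a missing salary depends on another observed column
p_missing = np.where(df["job_type"] == "white_collar", 0.30, 0.05)
mar = df.copy()
mar.loc[rng.random(n) < p_missing, "salary"] = np.nan

# Share of missing salaries per job type under each mechanism
mcar_rates = mcar["salary"].isnull().groupby(mcar["job_type"]).mean()
mar_rates = mar["salary"].isnull().groupby(mar["job_type"]).mean()
print(mcar_rates)
print(mar_rates)
```

In the first case the missing rate should be roughly the same for both job types, while in the second case it should be clearly higher for white collar employees.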

Handling Missing Data

Now that you have a basic idea about the types of missing data, let's see how we can handle it.

The first step in handling missing data is to identify the columns/features that have a considerably large share of missing values and remove them, as it is better not to take them into consideration.

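import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 'data' is assumed to be a pandas DataFrame that is already loaded
# Columns that contain at least one missing value
cols_with_na = [col for col in data.columns if data[col].isnull().mean() > 0]
# Fraction of missing values in each of those columns (0.10 means 10%)
na_info = data[cols_with_na].isnull().mean()
na_info = pd.DataFrame(na_info.reset_index())
na_info.columns = ['variable', 'missing_perc']
na_info.sort_values(by='missing_perc', ascending=False, inplace=True)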

After running the above code, you have to handpick the features/columns that have a large share of missing data, e.g. more than 10%. This cutoff can change depending on the result you are looking for.
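
As a minimal sketch of that handpicking step, the snippet below drops every column above a cutoff. The toy DataFrame and the 10% threshold are assumptions for illustration; pick your own cutoff.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with 20 rows
data = pd.DataFrame({
    "price": np.arange(20, dtype=float),
    "quantity": np.arange(20),
    "notes": ["ok"] * 20,
})
data.loc[0, "price"] = np.nan      # 1 of 20 values missing (5%)
data.loc[:11, "notes"] = np.nan    # 12 of 20 values missing (60%)

threshold = 0.10  # assumed cutoff: drop columns with more than 10% missing

# Fraction of missing values per column
missing_frac = data.isnull().mean()
cols_to_drop = missing_frac[missing_frac > threshold].index.tolist()
data_reduced = data.drop(columns=cols_to_drop)
print(cols_to_drop)
```

Only the column above the cutoff is removed; the column with a small share of missing values is kept for imputation later.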

Method 1: Complete case analysis

Complete-case analysis is discarding observations/rows where values in any of the columns/features are missing. Complete Case Analysis only takes into consideration information that is complete.

When to use:

  • Data is missing completely at random
  • No more than 5% of the total dataset contains missing data

Pros:

  • Easy implementation
  • No data manipulation required

Cons:

  • If there are a lot of records with randomly missing data, it could reduce the size of the original dataset by a huge margin.
  • If new data being pushed into the model has missing values, the model will not be able to handle it.

Implementation:

Identify what % of the total dataset would remain if we drop rows with missing data:

# Identify columns with less than 5% missing data
perc = 5
cols_cca = [col for col in data.columns if data[col].isnull().mean() < (perc / 100)]
# Find what fraction of the data remains compared to the original dataset
len(data[cols_cca].dropna()) / len(data)

The above code will show you how much data would remain after we remove the incomplete records. If you are happy with the results go ahead and drop the rows:

data_cca = data[cols_cca].dropna()

Note: You should ensure that the distribution of different columns/features does not vary much from the original dataset by comparing their distributions using histograms or density plots.

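Example code for comparing distributions:

fig = plt.figure()
ax = fig.add_subplot(111)
# original dataset
data['feature1'].plot.density(color='red', ax=ax, label='original')
# dataset after complete case analysis
data_cca['feature1'].plot.density(color='blue', ax=ax, label='complete cases')
ax.legend()
plt.show()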

Method 2: Mean/Median imputation

Imputation is the act of replacing missing data with statistical estimates of the missing values.

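Mean/median imputation replaces all missing values within a column with the mean or the median, depending on the distribution of the column/feature. The mean suits roughly symmetric distributions, while the median is more robust when the distribution is skewed or has outliers.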
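
One rough way to pick between the two is to look at the skewness of each column. The sketch below is only a rule of thumb: the helper `pick_strategy` and its cutoff of 1.0 are my own assumptions, and the data is randomly generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical features: one roughly symmetric, one strongly right-skewed
data = pd.DataFrame({
    "quantity": rng.normal(100, 10, size=1000),
    "price": rng.lognormal(mean=3, sigma=1, size=1000),
})

def pick_strategy(series, skew_limit=1.0):
    """Return 'mean' for roughly symmetric data, 'median' otherwise.

    skew_limit is an assumed rule-of-thumb cutoff, not a standard value.
    """
    return "mean" if abs(series.skew()) < skew_limit else "median"

strategies = {col: pick_strategy(data[col]) for col in data.columns}
print(strategies)
```

It is still worth plotting the feature before trusting a single number like this.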

When to use:

  • Data is missing completely at random
  • No more than 5% of the total dataset contains missing data

Pros:

  • Easy implementation
  • Can be used in production

Cons:

  • Could distort the original distribution of the feature
  • Can only be used on features that are numeric

Implementation:

Since Mean/Median imputation can only be used on numeric features, we have to identify them in our dataset and save those column names inside a list for easy access.

Note: The target label which we intend to predict should not be included in this list


numeric_cols = ['price', 'quantity', 'numeric_feature3']

Once we define the numeric variables, we can split the dataset into train/test and perform our imputations.

from sklearn.impute import SimpleImputer
# to split the datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[numeric_cols],
    data['target_label'],
    test_size=0.3,
    random_state=0)
# Define the type of imputation
imputer = SimpleImputer(strategy='median')
# Learn the medians from the training set only
imputer.fit(X_train[numeric_cols])
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

If you want to apply more than one type of imputation, you can make use of the ColumnTransformer and Pipeline features of scikit-learn.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Define the columns
numeric_cols_mean = ['price', 'numeric_feature3']
numeric_cols_median = ['quantity']
# Define the pipelines to run; each imputer is applied to its own list of columns
numeric_mean_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])
numeric_median_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])
preprocessor = ColumnTransformer(transformers=[
    ('mean_imputer', numeric_mean_imputer, numeric_cols_mean),
    ('median_imputer', numeric_median_imputer, numeric_cols_median)
], remainder='passthrough')
# X_train and X_test are assumed to be the DataFrames returned by train_test_split
preprocessor.fit(X_train)
X_test = preprocessor.transform(X_test)
X_train = preprocessor.transform(X_train)

Note: The parameter remainder = ‘passthrough’ is used to retain all the columns in the dataset, otherwise only the columns undergoing imputation would be retained.
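
One thing to watch out for: by default the transformed output is a NumPy array, with the imputed columns first (in the order of the transformers list) followed by the passthrough columns. Here is a minimal sketch of putting the column names back. The toy data and the `store_id` column are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy training set
X_train = pd.DataFrame({
    "store_id": [1, 2, 3, 4],
    "price": [10.0, np.nan, 14.0, 12.0],
    "quantity": [1.0, 2.0, np.nan, 9.0],
})

cols_mean = ["price"]
cols_median = ["quantity"]

preprocessor = ColumnTransformer(transformers=[
    ("mean_imputer", SimpleImputer(strategy="mean"), cols_mean),
    ("median_imputer", SimpleImputer(strategy="median"), cols_median),
], remainder="passthrough")

X_array = preprocessor.fit_transform(X_train)

# Output order: transformer columns first, then the passthrough columns
remaining = [c for c in X_train.columns if c not in cols_mean + cols_median]
X_imputed = pd.DataFrame(
    X_array,
    columns=cols_mean + cols_median + remaining,
    index=X_train.index,
)
print(X_imputed)
```

Note that `store_id` ends up as the last column even though it was the first column in the input.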

After imputation, you have to verify that the transformed features retain the distribution of the original dataset.
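
Besides plotting, a quick numeric check is to compare summary statistics before and after imputation. This is a minimal sketch on randomly generated data; the 20% missing rate is deliberately high so the effect is easy to see.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
original = pd.Series(rng.normal(50, 10, size=2000), name="price")

# Remove about 20% of the values completely at random
with_na = original.copy()
with_na[rng.random(len(with_na)) < 0.20] = np.nan

# Median imputation on the single column
imputer = SimpleImputer(strategy="median")
imputed = pd.Series(imputer.fit_transform(with_na.to_frame()).ravel(), name="price")

# describe() ignores missing values, so 'before' covers the observed values only
summary = pd.DataFrame({
    "before": with_na.describe(),
    "after": imputed.describe(),
})
print(summary.loc[["mean", "std", "min", "max"]])
```

The mean barely moves, but the standard deviation shrinks because every missing value is replaced by the same central value. The more data is missing, the stronger this effect gets.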

Method 3: Arbitrary value imputation

Arbitrary value imputation replaces all missing values within a column with an arbitrary value that we choose. Both categorical and numerical variables can be imputed using this method. For categorical features it is common practice to replace all instances of missing values with a new category (named ‘Missing’).

When to use:

  • Data is not missing at random

Pros:

  • Easy implementation
  • Can be used in production

Cons:

  • Could distort the original distribution of the feature
  • If the arbitrary value chosen is at the end of the distribution, it may lead to the creation of outliers

Implementation:

We start by splitting the dataset into train and test.

Note: Make sure you select only the columns that need to be imputed, by checking which columns have missing data and removing the columns with too much missing data. Since this was demonstrated in the last two methods, I will not repeat it here.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

cols_to_use = ['col1', 'col2', 'col3']
# Split only the selected columns, so the target label is not part of X
X_train, X_test, y_train, y_test = train_test_split(
    data[cols_to_use], data['target_label'], test_size=0.3, random_state=0)

We will be implementing an imputation pipeline instead of applying the same imputation to all the columns.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

feature_1 = ['col1']
feature_2 = ['col2']
categor_feature_1 = ['col3']
# Define the imputers and pipelines
imputer_feature_1 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=999)),
])
imputer_feature_2 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=100)),
])
imputer_missing_category_1 = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
])
preprocessor = ColumnTransformer(transformers=[
    ('imputer_feature_1', imputer_feature_1, feature_1),
    ('imputer_feature_2', imputer_feature_2, feature_2),
    ('imputer_category_1', imputer_missing_category_1, categor_feature_1)
], remainder='drop')
# Fit the preprocessor on the X_train dataset
preprocessor.fit(X_train)
# Transform both X_train and X_test using the fitted preprocessor
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

As you can see, we have chosen a different arbitrary value for each column that we are imputing.

The arbitrary value for the categorical feature is ‘Missing’. This means that all the missing values in that specific column will be replaced with a new category named ‘Missing’.

The arbitrary values for numeric features are selected based on the distribution of that specific column. Most probably the arbitrary value chosen will be the median or a value close to the median of the distribution. The distribution of all the features in the train set can be plotted using a histogram as follows. Run this before the transform step, while X_train is still a DataFrame, because preprocessor.transform returns a NumPy array.

import matplotlib.pyplot as plt
X_train.hist(bins=50, figsize=(10,10))
plt.show()

Method 4: Frequent category Imputation

Frequent category imputation replaces all missing values within a column with the most frequent value in the column. This technique is mainly used on categorical features. For numeric features, mean or median imputation tends to result in a distribution similar to the input.
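
To see what this does on a single column, here is a minimal pandas sketch. The toy values are made up. The important part is that the most frequent value is learned from the train set and then reused for the test set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy train and test sets with one categorical column
X_train = pd.DataFrame(
    {"dish_category": ["main", "dessert", "main", np.nan, "starter", "main"]}
)
X_test = pd.DataFrame({"dish_category": [np.nan, "dessert"]})

# mode() ignores missing values and returns the most frequent category first
most_frequent = X_train["dish_category"].mode()[0]

# Fill both sets with the value learned from the train set
X_train["dish_category"] = X_train["dish_category"].fillna(most_frequent)
X_test["dish_category"] = X_test["dish_category"].fillna(most_frequent)
print(most_frequent)
```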

When to use:

  • Data is missing completely at random
  • No more than 5% of the variable contains missing data

Pros:

  • Easy implementation
  • Can be used in production

Cons:

  • Could distort the original distribution of the feature
  • Mainly suited to categorical features; for numeric features, mean or median imputation is usually a better fit
  • If many values are missing, the most frequent category becomes over-represented

Implementation:

We start by splitting the dataset into train and test. I will not be coding this step again; you can refer to the previous method for the code. To make this a more practical example, I will combine the most frequent imputation method with the mean imputation method using a pipeline.

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Split the columns in the input dataset that we want to impute into
# numeric and categorical
features_numeric = ['price', 'quantity']
features_categoric = ['dish_category']
# Define the imputers and pipelines
numeric_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])
categoric_imputer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
])
preprocessor = ColumnTransformer(transformers=[
    ('numeric_imputer', numeric_imputer, features_numeric),
    ('categoric_imputer', categoric_imputer, features_categoric)
])
# Fit the preprocessor on the X_train dataset
preprocessor.fit(X_train)
# Transform both X_train and X_test using the fitted preprocessor
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

Method 5: Using a missing Indicator

In the previous methods we saw how to replace missing values using the mean, the median, and frequent categories. These methods are only effective if the data is missing completely at random.

In cases where data is not missing completely at random we would have to combine mean/median/frequent category imputation with a flag indicating that the value that is being imputed was initially missing. This helps our machine learning models identify these values, hence giving a better result. This method can be used for both numerical and categorical features.

When to use:

  • Data is not missing at random

Pros:

  • Easy implementation
  • Helps the machine learning model identify the missing values while training

Cons:

  • Expands the feature space, i.e. an extra column is added for each feature imputed using this method, so the total number of features the machine learning model is trained on increases.
  • If a single record has a lot of missing data, it could affect the result of the output ML model

Implementation:

import pandas as pd
from sklearn.impute import SimpleImputer, MissingIndicator

indicator = MissingIndicator(error_on_new=True, features='missing-only')
indicator.fit(X_train)
# One indicator column, named <feature>_NA, per feature that has missing values in X_train
indicator_cols = [c + '_NA' for c in X_train.columns[indicator.features_]]

def add_indicators(X):
    # Flag whether each value was originally missing
    flags = pd.DataFrame(indicator.transform(X), columns=indicator_cols)
    # drop=True keeps the old index from being added as an extra column
    return pd.concat([X.reset_index(drop=True), flags], axis=1)

# Add the same indicator columns to both sets so their columns stay identical
X_train = add_indicators(X_train)
X_test = add_indicators(X_test)

Now we have to impute the column with missing data, in this case ‘Alley’, using a suitable imputation method. This snippet is from a very famous, open-source dataset named House Prices. In this dataset, Alley is a categorical feature. We are going to use the most frequent imputation method for it.

imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

Try combining both of these steps using a pipeline.
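
As a starting point, scikit-learn's SimpleImputer has an add_indicator parameter that does the imputation and appends the indicator columns in one step. Below is a minimal sketch, shown on a made-up numeric column with median imputation to keep the output easy to read.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy train and test sets with one numeric column
X_train = pd.DataFrame({"price": [10.0, np.nan, 14.0, 12.0]})
X_test = pd.DataFrame({"price": [np.nan, 11.0]})

# add_indicator=True appends one indicator column for each feature
# that had missing values when the imputer was fitted
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)

train_out = imputer.transform(X_train)
test_out = imputer.transform(X_test)
print(train_out)
print(test_out)
```

The first output column holds the imputed values and the second one is the flag, with 1 marking a value that was originally missing. Since it is a regular transformer, it can be used inside a Pipeline or ColumnTransformer like the imputers in the earlier methods.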

Conclusion

Missing data is a common occurrence in any dataset. Handling it is the first step in feature engineering a dataset and making it ready for creating machine learning models. As you saw above, missing data can be handled in many ways depending on the use case. The best way to find the right method is to visualize each feature in the dataset and analyze how it is distributed. The amount of data that is missing and the type of data should also be taken into consideration. The approach I use to handle missing data is:

  1. Group the features (column names) into lists based on the type of missing data imputation they require
  2. Define an imputation method for each group of features
  3. Add all the defined imputation methods into a single pipeline
  4. Fit the preprocessor (pipeline) on the training set
  5. Transform the train and test sets using the preprocessor
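
Put together, the five steps above can be sketched roughly like this. The toy dataset, the column names and the choice of strategy per column are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset
data = pd.DataFrame({
    "price": [10.0, np.nan, 14.0, 12.0, 11.0, np.nan, 13.0, 9.0, 15.0, 10.0],
    "quantity": [1.0, 2.0, np.nan, 4.0, 2.0, 3.0, 2.0, np.nan, 1.0, 2.0],
    "dish_category": ["main", "main", np.nan, "dessert", "main",
                      "starter", np.nan, "main", "dessert", "main"],
    "target_label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# 1. Group the features by the imputation they need
mean_cols = ["price"]
median_cols = ["quantity"]
frequent_cols = ["dish_category"]
feature_cols = mean_cols + median_cols + frequent_cols

# 2 and 3. Define one imputer per group and collect them in a single preprocessor
preprocessor = ColumnTransformer(transformers=[
    ("mean", SimpleImputer(strategy="mean"), mean_cols),
    ("median", SimpleImputer(strategy="median"), median_cols),
    ("frequent", SimpleImputer(strategy="most_frequent"), frequent_cols),
])

X_train, X_test, y_train, y_test = train_test_split(
    data[feature_cols], data["target_label"], test_size=0.3, random_state=0)

# 4. Fit the preprocessor on the training set only
preprocessor.fit(X_train)

# 5. Transform both sets with the fitted preprocessor
#    (output columns follow the order of the transformers list)
X_train_imp = pd.DataFrame(preprocessor.transform(X_train), columns=feature_cols)
X_test_imp = pd.DataFrame(preprocessor.transform(X_test), columns=feature_cols)
print(X_train_imp.isnull().sum())
```

Fitting on the train set only means the test set is filled with statistics it did not contribute to, which is also how new data would be treated in production.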

Hello fellow Developers, my name's Pranoy. I'm a 24 year old programmer living in Kerala, India.