# install the libraries for this demonstration
# ! pip install -U dataidea
Handling Missing Data
Introduction:
Missing data is a common hurdle in data analysis, impacting the reliability of insights drawn from datasets. Python offers a range of solutions to address this issue, some of which we discussed in the earlier weeks. In this notebook, we look into the top four missing data imputation methods:
- SimpleImputer
- KNNImputer
- IterativeImputer
- Datawig
We’ll explore these essential techniques, using sklearn and the weather dataset.
import pandas as pd
import dataidea as di
loadDataset allows us to load the datasets built into the dataidea library.
weather = di.loadDataset('weather')
weather
| | day | temperature | windspead | event |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | NaN |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | NaN |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
weather.isna().sum()
day 0
temperature 4
windspead 4
event 2
dtype: int64
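Raw counts are easier to judge relative to the size of the table. The sketch below rebuilds the weather table shown above by hand (so it runs without dataidea) and computes the share of missing values per column, plus the number of rows where both numeric columns are missing. The names missing_share and both_missing are illustrative, not part of any library.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the weather table displayed above
weather = pd.DataFrame({
    "day": ["01/01/2017", "04/01/2017", "05/01/2017", "06/01/2017", "07/01/2017",
            "08/01/2017", "09/01/2017", "10/01/2017", "11/01/2017"],
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspead": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
    "event": ["Rain", "Sunny", "Snow", np.nan, "Rain", "Sunny", np.nan, "Cloudy", "Sunny"],
})

# Fraction of missing values in each column (mean of a boolean mask)
missing_share = weather.isna().mean()
print(missing_share)

# Rows where temperature AND windspead are both missing; these rows give
# feature-based imputers nothing to work with
both_missing = int(weather[["temperature", "windspead"]].isna().all(axis=1).sum())
print("rows with both numeric columns missing:", both_missing)
```

Rows with no observed numeric value matter later on, because methods that rely on the other features have no information to use for them.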
Let’s demonstrate the first three of these methods (SimpleImputer, KNNImputer, and IterativeImputer) on the simple weather dataset.
# select the numeric columns (temperature and windspead) from the data
temp_wind = weather[['temperature', 'windspead']].copy()
temp_wind_imputed = temp_wind.copy()
SimpleImputer from scikit-learn:
- Usage: SimpleImputer is a straightforward method for imputing missing values by replacing them with a constant, mean, median, or most frequent value along each column.
- Pros:
- Easy to use and understand.
- Can handle both numerical and categorical data.
- Offers flexibility with different imputation strategies.
- Cons:
- It doesn’t consider relationships between features.
- May not be the best choice for datasets with complex patterns of missingness.
- Example:
from sklearn.impute import SimpleImputer
simple_imputer = SimpleImputer(strategy='mean')
temp_wind_simple_imputed = simple_imputer.fit_transform(temp_wind)
temp_wind_simple_imputed_df = pd.DataFrame(temp_wind_simple_imputed, columns=temp_wind.columns)
Let’s have a look at the outcome
temp_wind_simple_imputed_df
| | temperature | windspead |
---|---|---|
0 | 32.0 | 6.0 |
1 | 33.2 | 9.0 |
2 | 28.0 | 8.4 |
3 | 33.2 | 7.0 |
4 | 32.0 | 8.4 |
5 | 33.2 | 8.4 |
6 | 33.2 | 8.4 |
7 | 34.0 | 8.0 |
8 | 40.0 | 12.0 |
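To see what the mean strategy does under the hood, the following sketch checks that it matches filling each column with its own mean in pandas. It rebuilds the two numeric columns by hand so that it runs on its own; the names manual and same are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hand-built copy of the two numeric columns of the weather table
temp_wind = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspead": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})

# Mean imputation with scikit-learn
imputed = SimpleImputer(strategy="mean").fit_transform(temp_wind)

# The same idea in plain pandas: fill each column with its own mean
manual = temp_wind.fillna(temp_wind.mean())

# Both approaches should produce the same numbers
same = np.allclose(imputed, manual.to_numpy())
print(same)
```

Each column is filled independently of the other, which is why this method cannot use relationships between features.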
Exercise:
- Try out the SimpleImputer with other imputation strategies, such as strategy='most_frequent' (the mode) or strategy='constant' together with a fill_value
- Choose and try some imputation techniques on categorical data
KNNImputer from scikit-learn:
- Usage:
- KNNImputer imputes missing values using k-nearest neighbors, replacing them with the mean value of the nearest neighbors.
- You can read more about the KNNImputer from the sklearn official docs site
- Pros:
- Considers relationships between features, making it suitable for datasets with complex patterns of missingness.
- Works directly on numerical data; categorical features must be encoded as numbers before they can be imputed.
- Cons:
- Computationally expensive for large datasets.
- Requires careful selection of the number of neighbors (k).
Note!
By default, the KNNImputer treats np.nan as the missing-value marker and uses the ‘nan_euclidean’ metric to compute the distances between rows.
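As a rough illustration of the ‘nan_euclidean’ metric, the sketch below compares the first two rows of the weather data using scikit-learn's nan_euclidean_distances. Only the coordinates present in both rows are used, and the squared distance is scaled up by (total coordinates / present coordinates). The name manual is illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# First two rows of the (temperature, windspead) data
row_0 = np.array([[32.0, 6.0]])
row_1 = np.array([[np.nan, 9.0]])

# Distance as computed by scikit-learn
d = nan_euclidean_distances(row_0, row_1)[0, 0]

# Manual calculation: only windspead is present in both rows,
# so the weight is 2 total coordinates / 1 present coordinate
manual = np.sqrt((2 / 1) * (6.0 - 9.0) ** 2)

print(d, manual)
```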
- Example:
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=2)
temp_wind_knn_imputed = knn_imputer.fit_transform(temp_wind)
temp_wind_knn_imputed_df = pd.DataFrame(temp_wind_knn_imputed, columns=temp_wind.columns)
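Note that fit_transform returns a new array, so the imputed values live in temp_wind_knn_imputed_df; the original weather data still contains its missing values: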
weather
| | day | temperature | windspead | event |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | NaN |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | NaN |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
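To inspect the imputed values themselves, a self-contained sketch along these lines may help. It rebuilds the two numeric columns by hand, runs the same KNNImputer settings as the example, and checks that the observed values are untouched; the name observed is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hand-built copy of the two numeric columns of the weather table
temp_wind = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspead": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})

knn_imputer = KNNImputer(n_neighbors=2)
temp_wind_knn_imputed_df = pd.DataFrame(
    knn_imputer.fit_transform(temp_wind),
    columns=temp_wind.columns,
    index=temp_wind.index,
)

# Mask of the cells that were observed in the input
observed = temp_wind.notna()

print(temp_wind_knn_imputed_df)
print("remaining NaNs:", int(temp_wind_knn_imputed_df.isna().sum().sum()))
```

Printing the imputed frame next to the original makes it easy to see which cells were filled.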
Filling a single column independently using the KNNImputer
To use the KNNImputer on a single column independently, you can pass the row index as the second feature. Because the target value is missing in the rows being filled, their distances to other rows are computed from the index alone, so the imputer averages the values of the nearest rows in the table that have a value.
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=2)
windspead_imputed = knn_imputer.fit_transform(weather[['windspead']].reset_index())
windspead_imputed
array([[ 0. , 6. ],
[ 1. , 9. ],
[ 2. , 8. ],
[ 3. , 7. ],
[ 4. , 8. ],
[ 5. , 7.5],
[ 6. , 10. ],
[ 7. , 8. ],
[ 8. , 12. ]])
# we can fill it back in the weather data
weather['windspead'] = windspead_imputed[:, 1]
# now looking at the data
weather
| | day | temperature | windspead | event |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | 8.0 | Snow |
3 | 06/01/2017 | NaN | 7.0 | NaN |
4 | 07/01/2017 | 32.0 | 8.0 | Rain |
5 | 08/01/2017 | NaN | 7.5 | Sunny |
6 | 09/01/2017 | NaN | 10.0 | NaN |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
Exercise
- Try out the KNNImputer with different numbers of neighbors and compare the results
- Find out how to use KNNImputer to fill categorical data
IterativeImputer from scikit-learn:
- Usage: IterativeImputer models each feature with missing values as a function of other features and uses that estimate for imputation. It iteratively estimates the missing values.
- Pros:
- Takes into account relationships between features, making it suitable for datasets with complex missing patterns.
- Often gives more plausible values than SimpleImputer when the features are correlated.
- Cons:
- Can be computationally intensive and slower than SimpleImputer.
- Requires careful tuning of model parameters.
- Example:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
iterative_imputer = IterativeImputer()
temp_wind_iterative_imputed = iterative_imputer.fit_transform(temp_wind)
temp_wind_iterative_imputed_df = pd.DataFrame(temp_wind_iterative_imputed, columns=temp_wind.columns)
temp_wind_iterative_imputed_df
| | temperature | windspead |
---|---|---|
0 | 32.000000 | 6.0 |
1 | 33.967053 | 9.0 |
2 | 28.000000 | 8.0 |
3 | 31.410210 | 7.0 |
4 | 32.000000 | 8.0 |
5 | 32.049421 | 7.5 |
6 | 35.245474 | 10.0 |
7 | 34.000000 | 8.0 |
8 | 40.000000 | 12.0 |
You can also pass an estimator of your choice (the default is BayesianRidge); let’s try a LinearRegression model.
from sklearn.linear_model import LinearRegression
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# set estimator to an instance of a model
iterative_imputer = IterativeImputer(estimator=LinearRegression())
temp_wind_iterative_imputed = iterative_imputer.fit_transform(temp_wind)
temp_wind_iterative_imputed_df = pd.DataFrame(temp_wind_iterative_imputed, columns=temp_wind.columns)
temp_wind_iterative_imputed_df
| | temperature | windspead |
---|---|---|
0 | 32.000000 | 6.0 |
1 | 34.125000 | 9.0 |
2 | 28.000000 | 8.0 |
3 | 31.041667 | 7.0 |
4 | 32.000000 | 8.0 |
5 | 31.812500 | 7.5 |
6 | 35.666667 | 10.0 |
7 | 34.000000 | 8.0 |
8 | 40.000000 | 12.0 |
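Since the IterativeImputer needs some tuning, here is a minimal sketch of two parameters that are often set explicitly: max_iter (the number of imputation rounds) and random_state (for reproducible results). The values chosen here are arbitrary assumptions, and the data is a hand-built copy of the original numeric columns, so the numbers will not necessarily match the tables above.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer

# Hand-built copy of the two numeric columns of the weather table
temp_wind = pd.DataFrame({
    "temperature": [32.0, np.nan, 28.0, np.nan, 32.0, np.nan, np.nan, 34.0, 40.0],
    "windspead": [6.0, 9.0, np.nan, 7.0, np.nan, np.nan, np.nan, 8.0, 12.0],
})

# max_iter and random_state values are illustrative, not recommendations
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = imputer.fit_transform(temp_wind)

filled_df = pd.DataFrame(filled, columns=temp_wind.columns, index=temp_wind.index)
print(filled_df)
```

On a dataset this small the imputer may warn that it has not converged; increasing max_iter or comparing a few estimators is a reasonable next step.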
Datawig:
Datawig is a library specifically designed for imputing missing values in tabular data using deep learning models.
# import datawig
# # Impute missing values
# df_imputed = datawig.SimpleImputer.complete(weather)
These top imputation methods offer different trade-offs in terms of computational complexity, handling of missing data patterns, and ease of use. The choice between them depends on the specific characteristics of the dataset and the requirements of the analysis.
Homework
- Try out these techniques for categorical data