Data Cleaning

we’ll be using numpy and pandas, to explore some techniques we can use to manipulate and clean a dataset.
Author

Juma Shafara

Published

January 1, 2024

Modified

July 25, 2024

Photo by DATAIDEA

Cleaning the weather dataset

In this notebook, we’ll be using numpy and pandas, to explore some techniques we can use to manipulate and clean a dataset. We’ll be using the weather dataset which I developed specifically for the purpose of this lesson.

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool whereas Numpy is the fundamental package for scientific computing with Python.

To continue with this notebook, you must have python, pandas and numpy installed.

## Uncomment and run this cell to install pandas and numpy
#!pip install pandas numpy
# import the libraries
import pandas as pd
import numpy as np
from dataidea.datasets import loadDataset

Let’s check the versions of python, numpy and pandas we’ll be using for this notebook

# checking python version
print('Python Version: ',)
!python --version
Python Version: 
Python 3.10.12
# Checking numpy and pandas versions
print('Pandas Version: ', pd.__version__)
print('Numpy Version: ', np.__version__)
Pandas Version:  2.2.2
Numpy Version:  1.26.4

Let’s load the dataset. We’ll be using a weather dataset that imagined for learning purposes.

# load the dataset
weather_data = loadDataset('weather')

We can sample out random rows from the dataset using the sample() method, we can use the n parameter to specify the number of rows to sample

# sample out random values from the dataset
weather_data.sample(n=5)
day temperature windspead event
4 07/01/2017 32.0 NaN Rain
1 04/01/2017 NaN 9.0 Sunny
7 10/01/2017 34.0 8.0 Cloudy
6 09/01/2017 NaN NaN NaN
3 06/01/2017 NaN 7.0 NaN

From our quick our sample, we can already observe some probles with the data that will need fixing

Display some info about the dataset eg number of entries, count of non-null values and variable datatypes using the info() method

# get quick dataframe info
weather_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          9 non-null      object 
 1   temperature  5 non-null      float64
 2   windspead    5 non-null      float64
 3   event        7 non-null      object 
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes

We can count all missing values in each column in our dataframe by using dataframe.isna().sum(), eg

# count missing values in each column
weather_data.isna().sum()
day            0
temperature    4
windspead      4
event          2
dtype: int64

We can use a boolean-indexing like technique to find all rows in a dataset with missing values in a specific column.

# get rows with missing data in temperature
weather_data[weather_data.temperature.isna()]
day temperature windspead event
1 04/01/2017 NaN 9.0 Sunny
3 06/01/2017 NaN 7.0 NaN
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN NaN
# get rows with missing data in event column
weather_data[weather_data.event.isna()]
day temperature windspead event
3 06/01/2017 NaN 7.0 NaN
6 09/01/2017 NaN NaN NaN

For the next part, we would like to demonstrate forward fill (ffill()) and backward fill (bfill), we first create two copies of the dataframe to avoid modifying our original copy in memory. - ffill() fills the missing values with the previous valid value in the column - bfill() fills the missing values with the next valid value in the column

Let’s create 2 copies of our dataframe and test out each of these concepts on either of the copies

For the first copy, let’s fill NaN values in the event column with ffill()

# fill with the previous valid value
weather_data['event'] = weather_data.event.ffill()
weather_data
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 NaN 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 Snow
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN Sunny
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

From the returned dataframe we can observe that the NaN values in event have been replaced with their corresponding non-null values orccurring earlier than them ie Snow at row 3 and Sunny at row 6

Exercise: Demonstrate how to replace missing values with the bfill() method

Choosing Between ffill and bfill

  • Context: Choose based on the context and the logical assumption that fits the nature of your data. If the past influences the present, use ffill. If the future influences the present, use bfill.
  • Data Patterns: Consider the patterns in your data and what makes sense for your specific analysis or model. Ensure that the method you choose maintains the integrity and meaning of your data.

Fill with a specific value

We can modify (or fill) a specific value in the dataframe by using the loc[] method. This picks the value by its row (index) and column names. Assigning it a new value modifies it in the dataframe as illustrated below

# modify a specific value in the dataframe
weather_data.loc[1, 'temperature'] = 29
weather_data
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 29.0 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 Snow
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN Sunny
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

Observe that the missing value in row 1 and temperature has been replaced with 29

We can use the fillna() method to replace all missing values in a column with a specific value as demostrated value

# replace missing values in temperature column with mean
weather_data['temperature'] = weather_data.temperature.fillna(
    value=weather_data.temperature.mean()
)
weather_data
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 29.0 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 32.5 7.0 Snow
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 32.5 NaN Sunny
6 09/01/2017 32.5 NaN Sunny
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

Exercise: Demonstrate some technniques to replace missing data for numeric, and categorical data using the fillna()

We can also use the fillna() method to fill missing values in multiple columns by passing in the dictionary of key/value pairs of column-name and value to replace. To demonstrate this, let’s first reload a fresh dataframe with missing data

weather_data = loadDataset('weather')
weather_data
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 NaN 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 NaN
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN NaN
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny
# Replace missing values in temperature, column and event
weather_data.fillna(value={
    'temperature': weather_data.temperature.mean(), 
    'windspead': weather_data.windspead.max(), 
    'event': weather_data.event.bfill()
    }, inplace=True)

# Now let's look at our data
weather_data
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 29.0 9.0 Sunny
2 05/01/2017 28.0 12.0 Snow
3 06/01/2017 32.5 7.0 Snow
4 07/01/2017 32.0 12.0 Rain
5 08/01/2017 32.5 12.0 Sunny
6 09/01/2017 32.5 12.0 Sunny
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

We can optionally drop all rows with missing values using the dropna() method. To demonstrate this, let’s first reload a fresh dataframe with missing data

weather_data = loadDataset('weather')
weather_data
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
1 04/01/2017 NaN 9.0 Sunny
2 05/01/2017 28.0 NaN Snow
3 06/01/2017 NaN 7.0 NaN
4 07/01/2017 32.0 NaN Rain
5 08/01/2017 NaN NaN Sunny
6 09/01/2017 NaN NaN NaN
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny
# Drop all rows with missing values
weather_data.dropna()
day temperature windspead event
0 01/01/2017 32.0 6.0 Rain
7 10/01/2017 34.0 8.0 Cloudy
8 11/01/2017 40.0 12.0 Sunny

In the next chapter we’ll look at some data visualization tools in Python

Congratulations!

Congratulations on finishing this lesson. In this lesson, you have learned various methods of handling missing data including:
  • Finding missing values
  • bfill and ffill
  • filling with a specific value
  • min, max and mean values

What’s on your mind? Put it in the comments!

Back to top