## Uncomment and run this cell to install pandas and numpy
#!pip install pandas numpy
Data Cleaning
Cleaning the weather dataset
In this notebook, we’ll be using numpy and pandas, to explore some techniques we can use to manipulate and clean a dataset. We’ll be using the weather dataset
which I developed specifically for the purpose of this lesson.
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool whereas Numpy is the fundamental package for scientific computing with Python.
To continue with this notebook, you must have python, pandas and numpy installed.
# import the libraries
import pandas as pd
import numpy as np
from dataidea.datasets import loadDataset
Let’s check the versions of python, numpy and pandas we’ll be using for this notebook
# checking python version
print('Python Version: ',)
!python --version
Python Version:
Python 3.10.12
# Checking numpy and pandas versions
print('Pandas Version: ', pd.__version__)
print('Numpy Version: ', np.__version__)
Pandas Version: 2.2.2
Numpy Version: 1.26.4
Let’s load the dataset. We’ll be using a weather dataset that imagined for learning purposes.
# load the dataset
= loadDataset('weather') weather_data
We can sample out random rows from the dataset using the sample()
method, we can use the n
parameter to specify the number of rows to sample
# sample out random values from the dataset
=5) weather_data.sample(n
day | temperature | windspead | event | |
---|---|---|---|---|
4 | 07/01/2017 | 32.0 | NaN | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
6 | 09/01/2017 | NaN | NaN | NaN |
3 | 06/01/2017 | NaN | 7.0 | NaN |
From our quick our sample, we can already observe some probles with the data that will need fixing
Display some info about the dataset eg number of entries, count of non-null values and variable datatypes using the info()
method
# get quick dataframe info
weather_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 day 9 non-null object
1 temperature 5 non-null float64
2 windspead 5 non-null float64
3 event 7 non-null object
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes
We can count all missing values in each column in our dataframe by using dataframe.isna().sum()
, eg
# count missing values in each column
sum() weather_data.isna().
day 0
temperature 4
windspead 4
event 2
dtype: int64
We can use a boolean-indexing like technique to find all rows in a dataset with missing values in a specific column.
# get rows with missing data in temperature
weather_data[weather_data.temperature.isna()]
day | temperature | windspead | event | |
---|---|---|---|---|
1 | 04/01/2017 | NaN | 9.0 | Sunny |
3 | 06/01/2017 | NaN | 7.0 | NaN |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | NaN |
# get rows with missing data in event column
weather_data[weather_data.event.isna()]
day | temperature | windspead | event | |
---|---|---|---|---|
3 | 06/01/2017 | NaN | 7.0 | NaN |
6 | 09/01/2017 | NaN | NaN | NaN |
For the next part, we would like to demonstrate forward fill (ffill()
) and backward fill (bfill
), we first create two copies of the dataframe to avoid modifying our original copy in memory. - ffill()
fills the missing values with the previous valid value in the column - bfill()
fills the missing values with the next valid value in the column
Let’s create 2 copies of our dataframe and test out each of these concepts on either of the copies
For the first copy, let’s fill NaN values in the event
column with ffill()
# fill with the previous valid value
'event'] = weather_data.event.ffill()
weather_data[ weather_data
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | Snow |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | Sunny |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
From the returned dataframe we can observe that the NaN
values in event
have been replaced with their corresponding non-null values orccurring earlier than them ie Snow at row 3 and Sunny at row 6
Exercise: Demonstrate how to replace missing values with the bfill()
method
Choosing Between ffill
and bfill
- Context: Choose based on the context and the logical assumption that fits the nature of your data. If the past influences the present, use
ffill
. If the future influences the present, usebfill
. - Data Patterns: Consider the patterns in your data and what makes sense for your specific analysis or model. Ensure that the method you choose maintains the integrity and meaning of your data.
Fill with a specific value
We can modify (or fill) a specific value in the dataframe by using the loc[]
method. This picks the value by its row (index) and column names. Assigning it a new value modifies it in the dataframe as illustrated below
# modify a specific value in the dataframe
1, 'temperature'] = 29
weather_data.loc[ weather_data
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | Snow |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | Sunny |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
Observe that the missing value in row 1 and temperature
has been replaced with 29
We can use the fillna()
method to replace all missing values in a column with a specific value as demostrated value
# replace missing values in temperature column with mean
'temperature'] = weather_data.temperature.fillna(
weather_data[=weather_data.temperature.mean()
value
) weather_data
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | 32.5 | 7.0 | Snow |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | 32.5 | NaN | Sunny |
6 | 09/01/2017 | 32.5 | NaN | Sunny |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
Exercise: Demonstrate some technniques to replace missing data for numeric, and categorical data using the fillna()
We can also use the fillna()
method to fill missing values in multiple columns by passing in the dictionary of key/value pairs of column-name and value to replace. To demonstrate this, let’s first reload a fresh dataframe with missing data
= loadDataset('weather')
weather_data weather_data
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | NaN |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | NaN |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
# Replace missing values in temperature, column and event
={
weather_data.fillna(value'temperature': weather_data.temperature.mean(),
'windspead': weather_data.windspead.max(),
'event': weather_data.event.bfill()
=True)
}, inplace
# Now let's look at our data
weather_data
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | 12.0 | Snow |
3 | 06/01/2017 | 32.5 | 7.0 | Snow |
4 | 07/01/2017 | 32.0 | 12.0 | Rain |
5 | 08/01/2017 | 32.5 | 12.0 | Sunny |
6 | 09/01/2017 | 32.5 | 12.0 | Sunny |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can optionally drop all rows with missing values using the dropna()
method. To demonstrate this, let’s first reload a fresh dataframe with missing data
= loadDataset('weather')
weather_data weather_data
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | NaN |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | NaN |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
# Drop all rows with missing values
weather_data.dropna()
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
In the next chapter we’ll look at some data visualization tools in Python
Congratulations!
Congratulations on finishing this lesson. In this lesson, you have learned various methods of handling missing data including:- Finding missing values
- bfill and ffill
- filling with a specific value
- min, max and mean values