# Uncomment and run this cell to install pandas
# !pip install pandas
# !pip install openpyxl
Pandas Crash Course
Creating Dataframes, Concatenating DataFrames, Sampling values in the DataFrame, Selecting, Boolean Indexing and Setting, Dropping, Retrieving information about DataFrame
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
This tutoral will show you the basic and intermediate concepts in Pandas
# import pandas
import pandas as pd
# to check python version
print(pd.__version__)
2.2.2
Creating Dataframes
The reason why data analysts like pandas is because pandas provides them with a very powerful data structure called a dataframe. A dataframe is a 2D structure that offers us rows and columns similar to tables in excel, sql etc
Creating dataframes from existing files
If you already have some data in maybe an excel, csv, or stata file, you can be able to load it into a dataframe and then perform manipulation.
# loading an excel file into a dataframe
= pd.read_excel(io='../assets/demo.xlsx') data
Note!
You must have openpyxl
installed to be able to read excel files using pandas.
The data structure that is returned by the statement is called a DataFrame
# checking the datatype of the data object
print(type(data))
<class 'pandas.core.frame.DataFrame'>
# randomly sample some values
=5) data.sample(n
Age | Gender | Marital Status | Address | Income | Income Category | Job Category | |
---|---|---|---|---|---|---|---|
137 | 32 | m | 0 | 1 | 26 | 2 | 2 |
92 | 61 | m | 1 | 18 | 23 | 1 | 3 |
39 | 21 | f | 0 | 0 | 13 | 1 | 1 |
41 | 56 | f | 0 | 7 | 213 | 4 | 3 |
48 | 51 | f | 0 | 0 | 47 | 2 | 1 |
Creating a DataFrame from a Dictionary
For the previous case, we may be having some data already, but sometimes we may want to create a dataframe from scratch. We can create pandas dataframes using two major ways:
- Using a dictionary
- Using a 2D list
We’ve met dictionaries and lists in the Containers/Collections module of the Python 3 Beginner Course.
To begin with, we’re gonna create a dataframe from a dictionary. The way we create the dictionary is important, keys will be used as column names, and the values will be used as rows. So, you typically want your values to be lists or tuples.
Now you might observe that the values (lists) are of equal length
# create a pandas dataframe using a dictionary
= {
data_dictionary 'age': [65, 51, 45, 38, 40],
'gender': ['m', 'm', 'm', 'f', 'm'],
'income': [42, 148, 147, 43, 89]
}
= pd.DataFrame(data=data_dictionary) dataframe_from_dict
# display the dataframe
dataframe_from_dict
age | gender | income | |
---|---|---|---|
0 | 65 | m | 42 |
1 | 51 | m | 148 |
2 | 45 | m | 147 |
3 | 38 | f | 43 |
4 | 40 | m | 89 |
Next up, we are gonna create a dataframe from a list. For this case, the list be of 2D shape. Again, the way we organize data in our list is important. We should organize that in a format close to rows and columns as showed below.
It turns out that when creating a dataframe from a list, we need to explicitly define the column names as demonstrated below
# creating a dataframe from a 2D list
= [
data_list 28, 'm', 24],
[59, 'm', 841],
[54, 'm', 741],
[83, 'f', 34],
[34, 'm', 98]
[
]# let's specify the column names
= ['age', 'gender', 'income']
names
= pd.DataFrame(data=data_list,
dataframe_from_list =names) columns
# display the dataframe
dataframe_from_list
age | gender | income | |
---|---|---|---|
0 | 28 | m | 24 |
1 | 59 | m | 841 |
2 | 54 | m | 741 |
3 | 83 | f | 34 |
4 | 34 | m | 98 |
Before we continue, I would like to share some ways you would look for help or more information about pandas methods.
-
One way is by using the
help()
method. -
Another is by using the query operator
?
# Finding more information
# help(pd.DataFrame)
## Another way to find more information
# ?pd.DataFrame
The latter is my favorite because it works in all situations.
Concatenating DataFrames
Sometimes there’s a need to add two or more dataframes. To perform this, for the start, we can use the pd.concat()
. The concat()
method takes in a list of dataframes we would like to combine
Remember we created 2 dataframes earlier, one from a dictionary and another from a list, now let’s combine them to make one dataframe
= pd.concat(
concatenated_dataframe =[dataframe_from_dict, dataframe_from_list],
objs=True
ignore_index )
concatenated_dataframe
age | gender | income | |
---|---|---|---|
0 | 65 | m | 42 |
1 | 51 | m | 148 |
2 | 45 | m | 147 |
3 | 38 | f | 43 |
4 | 40 | m | 89 |
5 | 28 | m | 24 |
6 | 59 | m | 841 |
7 | 54 | m | 741 |
8 | 83 | f | 34 |
9 | 34 | m | 98 |
We set ignore_index=True
to correct the indexing so that we can have unique indexes and hence be able to able to uniquely identify rows by index
Exercise: Demonstrate how to concatenate two or more dataframes by column ie if dataframe A has columns a, b, c and dataframe B has columns x, y, z, the resulting dataframe should have columns a, b, c, x, y, z
Sampling values in the DataFrame
In this section, we are gonna look at how to pick out some sections or parts of the data. We’ll look at head()
, tail()
and sample()
.
To demonstrate these, we’ll continue with our concatenated dataframe from the previous section
We can use head()
to look at the top part of the data. Out of the box, it returns the top 5 rows, however modifying the value for n
can help us pick a specific number of rows from the top.
# We can have look at the top part
=3) concatenated_dataframe.head(n
age | gender | income | |
---|---|---|---|
0 | 65 | m | 42 |
1 | 51 | m | 148 |
2 | 45 | m | 147 |
We can use tail()
to look at the bottom part of the data. Out of the box, it returns the bottom 5 rows, however modifying the value for n
can help us pick a specific number of rows from the bottom.
# We can look at the bottom part
=3) concatenated_dataframe.tail(n
age | gender | income | |
---|---|---|---|
7 | 54 | m | 741 |
8 | 83 | f | 34 |
9 | 34 | m | 98 |
We can use sample()
to look random rows data rows. Out of the box, it returns only 1 row, however modifying the value for n
can help us pick a specific number of rows at random.
# We can also randomly sample out some values in a DataFrame
=3) concatenated_dataframe.sample(n
age | gender | income | |
---|---|---|---|
0 | 65 | m | 42 |
2 | 45 | m | 147 |
8 | 83 | f | 34 |
Selection
In this section we are gonna look at some tricks and techniques we can use to pick some really specific values from the dataframe
Selecting, Boolean Indexing and Setting
To demonstrate these, we’re creating a little countries dataframe. As you may observe we use pd.DataFrame()
to create our dataframe and notice we’re passing in a dictionary for the value of data.
= pd.DataFrame(data={
country_data 'Country': ['Uganda', 'Kenya', 'Tanzania'],
'Capital': ['Kampala', 'Nairobi', 'Dodoma'],
'Population': [11190846, 1303171035, 207847528]
}) country_data
Country | Capital | Population | |
---|---|---|---|
0 | Uganda | Kampala | 11190846 |
1 | Kenya | Nairobi | 1303171035 |
2 | Tanzania | Dodoma | 207847528 |
We can pick a specific value (or values) from a dataframe by indexing usin the iloc
and iat
methods. We insert the row number and the column number of an item that we want to pick from the dataframe in the square brackets.
We can also use these techniques to replace values in a dataframe.
# position 1
print(country_data.iloc[0, 0])
print(country_data.iloc[2, 1])
Uganda
Dodoma
# position 2
print(country_data.iat[0, 0])
print(country_data.iat[2, 1])
Uganda
Dodoma
Ponder:
How can you use the pd.DataFrame.iat
method to replace (or modify) a specific value in a dataframe
We can access any value(s) by their row index and column name with the help of the loc[]
and at[]
methods.
As you may observe the difference now is that we are using row index and column name instead of row index and column index for iloc[]
and iat[]
# using label
print(country_data.loc[0, 'Capital'])
print(country_data.loc[1, 'Population'])
Kampala
1303171035
# using label
print(country_data.at[2, 'Population'])
print(country_data.at[1, 'Capital'])
207847528
Nairobi
We can be able to pick out an entire column by either using the .
operator or []
.
-
We use the
.
operator when a column name is one single word -
We can use the
[]
when the column name is containing more that one word -
We can also use the
[]
when creating or assigning values to columns in a dataframe
# picking out data from a specific column
country_data.Country
0 Uganda
1 Kenya
2 Tanzania
Name: Country, dtype: object
# another way to pick data from a specific column
'Country'] country_data[
0 Uganda
1 Kenya
2 Tanzania
Name: Country, dtype: object
The data structure that is returned by the statement is called a Series
= ['Albert', 'Turkana', 'Tanganyika']
lakes 'Lake'] = lakes
country_data[
# lets display the updated data
country_data
Country | Capital | Population | Lake | |
---|---|---|---|---|
0 | Uganda | Kampala | 11190846 | Albert |
1 | Kenya | Nairobi | 1303171035 | Turkana |
2 | Tanzania | Dodoma | 207847528 | Tanganyika |
# lets check it
type(country_data['Capital'])
pandas.core.series.Series
# Get specific row data (using indexing)
0] country_data.iloc[
Country Uganda
Capital Kampala
Population 11190846
Lake Albert
Name: 0, dtype: object
We can get a range of rows by passing into iloc[]
a range of indexes. The demonstration below returns rows of indexes 0 until but not including 2 ie 0 for Uganda and 1 for Kenya
# Get specific rows (using subsetting)
0:2] country_data.iloc[
Country | Capital | Population | Lake | |
---|---|---|---|---|
0 | Uganda | Kampala | 11190846 | Albert |
1 | Kenya | Nairobi | 1303171035 | Turkana |
We can be able to pickout only rows whose values satisfy a specific condition, this trick is called boolean indexing. In the example below, we find all rows whose contry is Uganda
# get all rows that have a column-value matching a specific value
# eg where country is Belgium
'Country'] == 'Uganda'] country_data[country_data[
Country | Capital | Population | Lake | |
---|---|---|---|---|
0 | Uganda | Kampala | 11190846 | Albert |
# Think about this
'Country'] == 'Tanzania' country_data[
0 False
1 False
2 True
Name: Country, dtype: bool
You donot have to submit that
Dropping
In this part we will learn some tricks and techniques to drop or remove some data from a dataframe.
country_data
Country | Capital | Population | Lake | |
---|---|---|---|---|
0 | Uganda | Kampala | 11190846 | Albert |
1 | Kenya | Nairobi | 1303171035 | Turkana |
2 | Tanzania | Dodoma | 207847528 | Tanganyika |
We may realize that all our population values are terribly wrong and may choose the entire column. we can do this by using the drop()
method. We specify the column name and axis=1
to drop a column.
# drop a column from a dataframe
country_data.drop(='Population',
labels=1
axis )
Country | Capital | Lake | |
---|---|---|---|
0 | Uganda | Kampala | Albert |
1 | Kenya | Nairobi | Turkana |
2 | Tanzania | Dodoma | Tanganyika |
To drop many columns, we can pass all the columns to the drop()
method as a list or tuple, and again specify the axis=1
.
country_data.drop(=['Lake', 'Population'],
labels=1
axis )
Country | Capital | |
---|---|---|
0 | Uganda | Kampala |
1 | Kenya | Nairobi |
2 | Tanzania | Dodoma |
Another way to drop many columns is by passing them to the drop()
method as a list value to the columns
parameter. In this case we don’t need to specify the axis
.
# You can drop many columns by passing in a columns list
=['Country', 'Population']) country_data.drop(columns
Capital | Lake | |
---|---|---|
0 | Kampala | Albert |
1 | Nairobi | Turkana |
2 | Dodoma | Tanganyika |
To drop a row or many rows, we shall pass the index(es) as labels to the drop
method method and optionally set axis=0
. It turns out that the default value for axis
is actually 0. Below, we have some two examples.
# how to drop 1 row
=0) # drops out Uganda country_data.drop(labels
Country | Capital | Population | Lake | |
---|---|---|---|---|
1 | Kenya | Nairobi | 1303171035 | Turkana |
2 | Tanzania | Dodoma | 207847528 | Tanganyika |
# how to drop row data
=[0, 2], axis=0) country_data.drop(labels
Country | Capital | Population | Lake | |
---|---|---|---|---|
1 | Kenya | Nairobi | 1303171035 | Turkana |
This drops rows in indexes 0 and 2, ie Uganda and Tanzania
Research on:
sort and rank data
Retrieving information about DataFrame
Pandas offers us some quick way with which we can find some quick information about our dataset (or dataframe)
Basic Information
= pd.DataFrame({
country_data 'Country': ['Uganda', 'Kenya', 'Tanzania'],
'Capital': ['Kampala', None, None],
'Population': [11190846, 1303171035, 207847528]
})
country_data
Country | Capital | Population | |
---|---|---|---|
0 | Uganda | Kampala | 11190846 |
1 | Kenya | None | 1303171035 |
2 | Tanzania | None | 207847528 |
We can use the shape
attribute to obtain the number of rows and columns available in our dataset as illustrated.
# shape of a dataframe (ie rows, columns)
country_data.shape
(3, 3)
If you’re only interested in the number of rows you can use the len()
method to find that eg
# len of a dataframe (ie no of rows)
len(country_data)
3
If you are interested in looking at the columns (column names), you can use the columns
attribute to obtain the Index
object of column names
# Get all columns in a dataframe
country_data.columns
Index(['Country', 'Capital', 'Population'], dtype='object')
We can also use the len()
method on this Index
object to obtain the number of columns
len(country_data.columns)
3
We can use the info()
method to find some information on our dataframe ie:
- Columns (All columns in the dataframe)
- Non-Null Count (Number of non null values per column)
- Dtype (Data type each column)
- Total number of entries (rows)
# get some basic info about the dataframe
country_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country 3 non-null object
1 Capital 1 non-null object
2 Population 3 non-null int64
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes
By using the count()
method on the dataframe, we can obtain the number of non-null values per a column
# Count non-null values in each column
country_data.count()
Country 3
Capital 1
Population 3
dtype: int64
Summary
Finally, we can use the describe()
to obtain some quick summary (descriptive) statistics about our data eg count, mean, standard deviation, minimum and maximum values, percentile
# summary statistics
country_data.describe()
Population | |
---|---|
count | 3.000000e+00 |
mean | 5.074031e+08 |
std | 6.961346e+08 |
min | 1.119085e+07 |
25% | 1.095192e+08 |
50% | 2.078475e+08 |
75% | 7.555093e+08 |
max | 1.303171e+09 |
Research
Find out how to get for specific columns:
- mean
- median
- cummulative sum
- minimum
- maximum