Matplotlib Crash Course

Matplotlib is a powerful plotting library in Python commonly used for data visualization.
Author: Juma Shafara

Published: January 1, 2024

Modified: July 25, 2024

Keywords: python visualization, matplotlib, bar chart, histogram, scatter plot, line plot, box plot, pie chart, stacked bar chart, quiz, python quiz, dataidea, data science


What is Matplotlib

Matplotlib is a powerful plotting library in Python commonly used for data visualization.

When working with datasets, you can use Matplotlib to create various plots to explore and visualize the data.

Below we create some of the major Matplotlib plots using the Titanic dataset. First, install and import the libraries:

# Uncomment and run this cell to install the libraries
# !pip install pandas matplotlib dataidea

# Import the libraries, packages and modules
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from dataidea_science.datasets import loadDataset

Let’s demonstrate each of the plots using the Titanic dataset. We’ll first load the dataset and then create each plot using Matplotlib.

# Load the Titanic dataset
titanic_df = loadDataset('titanic')

We can load the dataset this way because it is built into the dataidea package.

titanic_df.head(n=5)
   pclass  survived  name                                             sex     age      sibsp  parch  ticket  fare      cabin    embarked  boat  body   home.dest
0  1.0     1.0       Allen, Miss. Elisabeth Walton                    female  29.0000  0.0    0.0    24160   211.3375  B5       S         2     NaN    St Louis, MO
1  1.0     1.0       Allison, Master. Hudson Trevor                   male    0.9167   1.0    2.0    113781  151.5500  C22 C26  S         11    NaN    Montreal, PQ / Chesterville, ON
2  1.0     0.0       Allison, Miss. Helen Loraine                     female  2.0000   1.0    2.0    113781  151.5500  C22 C26  S         NaN   NaN    Montreal, PQ / Chesterville, ON
3  1.0     0.0       Allison, Mr. Hudson Joshua Creighton             male    30.0000  1.0    2.0    113781  151.5500  C22 C26  S         NaN   135.0  Montreal, PQ / Chesterville, ON
4  1.0     0.0       Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0000  1.0    2.0    113781  151.5500  C22 C26  S         NaN   NaN    Montreal, PQ / Chesterville, ON

  1. Bar Plot: You can create a bar plot to visualize categorical data such as the number of passengers in each class (first class, second class, third class), the number of survivors vs. non-survivors, or the number of passengers embarked from each port (Cherbourg, Queenstown, Southampton).
# 1. Bar Plot - Number of passengers in each class
class_counts = titanic_df.pclass.value_counts()
classes = class_counts.index
counts = class_counts.values

plt.bar(x=classes, height=counts, color='#008374')
plt.title('Number of Passengers Per Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Number of Passengers')

plt.show()

It’s easy to see from the graph that the 3rd class had the largest number of passengers, followed by the 1st class, with the 2nd class last.
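To make this comparison easier to read off the chart, the bars can be ordered by class and labelled with their counts. The sketch below is one way to do that. It assumes a small hypothetical `pclass` Series in place of the real dataset so that it runs on its own, and `Axes.bar_label` needs Matplotlib 3.4 or newer.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for titanic_df.pclass (not the real data)
pclass = pd.Series([3, 1, 3, 2, 3, 1, 3, 2, 3, 1])

# value_counts() sorts by count; sort_index() puts the classes in 1, 2, 3 order
class_counts = pclass.value_counts().sort_index()

fig, ax = plt.subplots()
bars = ax.bar(x=class_counts.index.astype(str), height=class_counts.values,
              color='#008374')
ax.bar_label(bars)  # write each count above its bar
ax.set_title('Number of Passengers Per Passenger Class')
ax.set_xlabel('Passenger Class')
ax.set_ylabel('Number of Passengers')
plt.show()
```

Converting the class numbers to strings makes Matplotlib treat them as categories, so the x-axis shows only the three classes.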

  2. Histogram: Histograms are useful for visualizing the distribution of continuous variables such as age or fare. You can create histograms to see the age distribution of passengers or the fare distribution.
# 2. Histogram - Age distribution of passengers
ages = titanic_df.age
plt.hist(x=ages, bins=20, color='#008374', 
         edgecolor='#66FDEE')
plt.title('Age Distribution of Passengers')
plt.ylabel('Frequency')
plt.xlabel('Age')
plt.show()

From the histogram we can observe that:

  • The majority of the passengers were between 15 and 35 years old
  • Fewer older people (above 60 years) boarded the Titanic, with a frequency below 20
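Observations like these can also be checked numerically. As a rough sketch, `numpy.histogram` returns the counts and bin edges that a histogram would draw. The `ages` Series and the 20-year bins here are hypothetical sample values, not the Titanic column.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value (not the real dataset)
ages = pd.Series([2, 18, 22, 24, 25, 29, 30, 31, 34, 45, 52, 64, np.nan])

# Drop missing values, then count how many ages fall into each 20-year bin
counts, edges = np.histogram(ages.dropna(), bins=[0, 20, 40, 60, 80])

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:>2} to {right:>2}: {count}')
```

With the real data, you would pass `titanic_df.age.dropna()` and the same `bins` value used in the plot.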
  3. Box Plot: A box plot can be used to show the distribution of a continuous variable across different categories. For example, you can create a box plot to visualize the distribution of age or fare across different passenger classes.

3.1 Age Distribution Box Plot

# 3.1 Age distribution boxplot
ages = titanic_df.age.dropna()
plt.boxplot(x=ages, vert=False)
plt.title('Age Distribution of Passengers')
plt.xlabel('Age')
plt.show()

Features of a box plot:

Box: The box in a boxplot represents the interquartile range (IQR), which contains the middle 50% of the data. In a vertical box plot, the top and bottom edges of the box are the third quartile (Q3) and the first quartile (Q1), respectively; in a horizontal one, such as the plot above, they are the right and left edges.

Median Line: A line inside the box indicates the median (Q2) of the data, which is the middle value of the dataset.

Whiskers: The whiskers extend from the edges of the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3. They represent the range of the bulk of the data.

Outliers: Data points that fall outside the whiskers are considered outliers. They are typically plotted as individual points. Outliers can be indicative of variability or errors in the data.

Minimum and Maximum: The ends of the whiskers show the minimum and maximum values within the range of 1.5 times the IQR from the first and third quartiles.

Meaning: A boxplot provides a visual summary of several important aspects of a dataset:

  • Central Tendency: The median line shows the central point of the data.
  • Spread: The IQR (the length of the box) shows the spread of the middle 50% of the data.
  • Symmetry and Skewness: The relative position of the median within the box and the length of the whiskers can indicate whether the data is symmetric or skewed.
  • Outliers: Individual points outside the whiskers highlight potential outliers.

Boxplots are particularly useful for comparing distributions between several groups or datasets and identifying outliers and potential anomalies.
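The quantities described above can be computed by hand to see what the plot is drawing. This is a minimal sketch using NumPy on a hypothetical array of ages (not the Titanic data); the names `lower_fence` and `upper_fence` are our own labels for the 1.5 × IQR limits.

```python
import numpy as np

# Hypothetical ages (not the real dataset)
ages = np.array([2, 18, 22, 24, 25, 29, 30, 31, 34, 45, 52, 80])

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these limits are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f'Q1={q1}, median={median}, Q3={q3}, IQR={iqr}')
print(f'whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}')
```

Note that the whiskers stop at actual data points, so they are usually shorter than the full 1.5 × IQR distance.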

3.2 Age Distribution Across Passenger Classes

# 3.2 Box Plot - Distribution of age across passenger classes
# Note: tick_labels requires Matplotlib 3.9+; on older versions use labels= instead
plt.boxplot([titanic_df[titanic_df['pclass'] == 1]['age'].dropna(),
             titanic_df[titanic_df['pclass'] == 2]['age'].dropna(),
             titanic_df[titanic_df['pclass'] == 3]['age'].dropna()],
            tick_labels=['1st Class', '2nd Class', '3rd Class'])
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.title('Distribution of Age Across Passenger Classes')
plt.show()
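The three per-class filters can also be produced with a single `groupby`. The sketch below is one possible variant, run on a small hypothetical DataFrame rather than the real dataset. It sets the tick labels through `set_xticklabels`, which sidesteps the `labels`/`tick_labels` rename between Matplotlib versions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for titanic_df (not the real data)
df = pd.DataFrame({
    'pclass': [1, 1, 2, 2, 3, 3, 3],
    'age': [38.0, 54.0, 27.0, None, 22.0, 4.0, 30.0],
})

# One array of ages per class, with missing ages dropped
groups = {pclass: ages.dropna().to_numpy()
          for pclass, ages in df.groupby('pclass')['age']}

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))
# Boxes are drawn at positions 1..N by default, so label those ticks
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels([f'Class {pclass}' for pclass in groups])
ax.set_xlabel('Passenger Class')
ax.set_ylabel('Age')
plt.show()
```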

  4. Scatter Plot: Scatter plots are helpful for visualizing the relationship between two continuous variables. You can create scatter plots to explore relationships such as age vs. fare. Read more about the scatter plot in the Matplotlib documentation.
# 4. Scatter Plot - Age vs. Fare
plt.scatter(
    x=titanic_df['age'], 
    y=titanic_df['fare'], 
    alpha=.5, 
    c=titanic_df['survived'], 
    cmap=ListedColormap(['#008374', '#000000'])
)
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs. Fare')
plt.colorbar(label='Survived')  
plt.show()

I don’t know about you, but I don’t see a linear relationship between the age and fare of the Titanic passengers.
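A visual impression like this can be backed up with a number. One common choice is the Pearson correlation coefficient, which pandas computes with `Series.corr`. The values below are hypothetical, so the result says nothing about the real dataset; the sketch only shows the mechanics.

```python
import pandas as pd

# Hypothetical age and fare values (not the real dataset)
df = pd.DataFrame({
    'age': [22, 38, 26, 35, 54, 2, 27, 14],
    'fare': [7.25, 71.28, 7.92, 53.10, 51.86, 21.08, 11.13, 30.07],
})

# Pearson's r ranges from -1 to 1; values near 0 suggest little linear relationship
r = df['age'].corr(df['fare'])
print(f'Pearson correlation between age and fare: {r:.2f}')
```

On the Titanic data, the equivalent call would be `titanic_df['age'].corr(titanic_df['fare'])`, which skips rows where either value is missing.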

  5. Pie Chart: Pie charts can be used to visualize the proportion of different categories within a dataset. For example, you can create a pie chart to show the proportion of male vs. female passengers or the proportion of survivors vs. non-survivors.
# 5. Pie Chart - Proportion of male vs. female passengers
gender_counts = titanic_df['sex'].value_counts()
plt.pie(x=gender_counts, labels=gender_counts.index, 
        autopct='%1.1f%%', startangle=90, 
        colors=['#008374', '#66FDEE'])
plt.title('Proportion of Male vs. Female Passengers')
plt.legend(loc='lower right')
plt.show()
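The percentages that `autopct` prints on the wedges can also be obtained directly from pandas. A minimal sketch, assuming a hypothetical `sex` Series in place of `titanic_df['sex']`:

```python
import pandas as pd

# Hypothetical stand-in for titanic_df['sex'] (not the real data)
sex = pd.Series(['male', 'female', 'male', 'male', 'female',
                 'male', 'male', 'female', 'male', 'male'])

# normalize=True returns fractions instead of counts
proportions = sex.value_counts(normalize=True)
percentages = (proportions * 100).round(1)
print(percentages)
```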

  6. Stacked Bar Plot: Stacked bar plots can be used to compare the composition of different categories across groups. For example, you can create a stacked bar plot to compare the proportion of survivors and non-survivors within each passenger class.
# 6. Stacked Bar Plot - Survival status within each passenger class
survival_counts = titanic_df.groupby(['pclass', 'survived']).size().unstack()
survival_counts.plot(kind='bar', stacked=True,  
                     color=['#008374', '#66FDEE'])
plt.xlabel('Passenger Class')
plt.ylabel('Number of Passengers')
plt.title('Survival Status Within Each Passenger Class')
plt.legend(['Did not survive', 'Survived'])
plt.show()

titanic_df.groupby(['pclass', 'survived']).size().unstack()
survived  0.0  1.0
pclass
1.0       123  200
2.0       158  119
3.0       528  181

We observe that:

  • More passengers in class 1 survived than those that did not survive (200 vs 123)
  • Most of the passengers in class 3 did not survive (528 vs 181)
  • Slightly more passengers did not survive as compared to those that survived in class 2 (158 vs 119)
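Because the classes differ in size, raw counts are easier to compare once they are converted to shares. The sketch below rebuilds the table above by hand and divides each row by its total; `survival_share` is a name introduced here for illustration.

```python
import pandas as pd

# Counts copied from the table above (rows: pclass, columns: survived)
survival_counts = pd.DataFrame(
    {0.0: [123, 158, 528], 1.0: [200, 119, 181]},
    index=pd.Index([1.0, 2.0, 3.0], name='pclass'),
)

# Divide each row by its total to get the share in each outcome
survival_share = survival_counts.div(survival_counts.sum(axis=1), axis=0)
print((survival_share * 100).round(1))
```

From these counts, roughly 62% of first-class passengers survived, compared with about 43% in second class and about 26% in third class.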
  7. Line Plot: Line plots can be useful for visualizing trends over time or continuous variables. While the Titanic dataset may not have explicit time data, you can still use line plots to visualize trends such as the change in survival rate with increasing age or fare.
# 7. Line Plot - Mean age of passengers by passenger class
mean_age_by_class = titanic_df.groupby('pclass')['age'].mean()
plt.plot(mean_age_by_class.index, mean_age_by_class.values, 
         marker='*', color='#008374')
plt.xlabel('Passenger Class')
plt.ylabel('Mean Age')
plt.title('Mean Age of Passengers by Passenger Class')
plt.show()

We can quickly see the average age for each passenger class, i.e.:

  • Around 39 for first class
  • Around 30 for second class
  • Around 25 for third class
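The other trend mentioned for line plots, survival rate against increasing age, can be sketched in a similar way. This example is an illustration on hypothetical data, not the real dataset; the 20-year bins are an arbitrary choice, and `pd.cut` is used to group the ages.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for titanic_df (not the real data)
df = pd.DataFrame({
    'age': [4, 9, 17, 22, 25, 31, 38, 44, 52, 58, 63, 71],
    'survived': [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
})

# Group ages into 20-year bins, then take the mean of the 0/1 survived flag
age_bins = pd.cut(df['age'], bins=[0, 20, 40, 60, 80])
survival_rate = df.groupby(age_bins, observed=False)['survived'].mean()

plt.plot(survival_rate.index.astype(str), survival_rate.values,
         marker='*', color='#008374')
plt.xlabel('Age Group')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Age Group')
plt.show()
```

The mean of a 0/1 column is the fraction of ones, which is why `.mean()` gives the survival rate for each bin.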

These are some of the major plots you can create using Matplotlib. Each plot serves a different purpose and can help you gain insights into the data and explore relationships between variables.

air_passengers_data = loadDataset('air_passengers')
air_passengers_data.head()
   Month    Passengers
0  1949-01  112
1  1949-02  118
2  1949-03  132
3  1949-04  129
4  1949-05  121
air_passengers_data['Month'] = pd.to_datetime(air_passengers_data.Month)
plt.plot('Month', 'Passengers', data=air_passengers_data, color='#008374')
plt.xlabel('Years')
plt.ylabel('Number of Passengers')
plt.show()

We can observe that the number of passengers tends to increase over time.
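When a monthly series also contains seasonal ups and downs, a rolling mean can make the long-term trend easier to see. The following is a sketch on a hypothetical series (a steady rise plus a summer bump), not the air passengers data; the 12-month window is an assumption that suits monthly data with a yearly cycle.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly series with an upward trend plus a seasonal bump
months = pd.date_range('1949-01', periods=36, freq='MS')
passengers = pd.Series(
    [100 + 2 * i + (15 if i % 12 in (6, 7) else 0) for i in range(36)],
    index=months,
)

# A 12-month rolling mean averages out the seasonal pattern
trend = passengers.rolling(window=12).mean()

plt.plot(passengers.index, passengers.values, color='#008374', label='Monthly')
plt.plot(trend.index, trend.values, color='#000000', label='12-month mean')
plt.xlabel('Years')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
```

The first 11 values of `trend` are missing because a full 12-month window is not yet available for them.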

Review

Congratulations on reaching the end of this tutorial. In this tutorial, we have learned the basic graphs and how to interpret them, i.e.:

  • Bar chart
  • Histogram
  • Scatter plot
  • Line plot
  • Box plot
  • Pie chart
  • Stacked bar chart

