# # Uncomment and run this cell to install the libraries
# !pip install pandas matplotlib dataidea-science
Matplotlib Crash Course
What is Matplotlib
Matplotlib is a powerful plotting library in Python commonly used for data visualization.
When working with datasets, you can use Matplotlib to create various plots to explore and visualize the data.
Here are some major plots you can create using Matplotlib with the Titanic dataset:
# import the libraries, packages and modules
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from dataidea_science.datasets import loadDataset
Let’s demonstrate each of the plots using the Titanic dataset. We’ll first load the dataset and then create each plot using Matplotlib.
# Load the Titanic dataset
titanic_df = loadDataset('titanic')
We can load the dataset this way because it is built into the dataidea package.
titanic_df.head(n=5)
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1.0 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | 0.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | 0.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | 0.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
- Bar Plot: You can create a bar plot to visualize categorical data such as the number of passengers in each class (first class, second class, third class), the number of survivors vs. non-survivors, or the number of passengers embarked from each port (Cherbourg, Queenstown, Southampton).
# 1. Bar Plot - Number of passengers in each class
class_counts = titanic_df.pclass.value_counts()
classes = class_counts.index
counts = class_counts.values

plt.bar(x=classes, height=counts, color='#008374')
plt.title('Number of Passengers Per Passenger Class')
plt.xlabel('Passenger Class')
plt.ylabel('Number of Passengers')
plt.show()
It’s easy to see from the graph that the 3rd class had the largest number of passengers, followed by the 1st class, with the 2nd class last.
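The bar plot description also mentions survivors vs. non-survivors. A minimal sketch of that variant follows. It does not load the dataset; the counts are summed from the pclass-by-survived table shown in the stacked bar plot section of this tutorial, and `bar_label` (available in Matplotlib 3.4 and later) is an optional extra that writes each count above its bar.

```python
import matplotlib.pyplot as plt

# Counts summed over the three classes, taken from the
# pclass-by-survived table in the stacked bar plot section
did_not_survive = 123 + 158 + 528
survived = 200 + 119 + 181

labels = ['Did not survive', 'Survived']
counts = [did_not_survive, survived]

fig, ax = plt.subplots()
bars = ax.bar(labels, counts, color='#008374')
ax.bar_label(bars)  # annotate each bar with its height
ax.set_title('Number of Survivors vs. Non-Survivors')
ax.set_xlabel('Survival Status')
ax.set_ylabel('Number of Passengers')
# Call plt.show() here when running interactively
```

Passing strings as the x values makes Matplotlib treat the axis as categorical, so no separate tick labels are needed.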
- Histogram: Histograms are useful for visualizing the distribution of continuous variables such as age or fare. You can create histograms to see the age distribution of passengers or the fare distribution.
# 2. Histogram - Age distribution of passengers
ages = titanic_df.age
plt.hist(x=ages, bins=20, color='#008374',
         edgecolor='#66FDEE')
plt.title('Age Distribution of Passengers')
plt.ylabel('Frequency')
plt.xlabel('Age')
plt.show()
From the histogram we can observe that:
- The majority of the passengers were aged between 15 and 35
- Few older people (above 60 years) boarded the Titanic (frequencies below 20)
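Readings like the ones above come from the bin counts, which `hist` returns alongside the bin edges. The sketch below shows how to inspect them. The ages are hypothetical values chosen for illustration, not rows from the Titanic dataset, and the missing value is dropped explicitly before binning.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ages (not from the Titanic dataset), with one missing value
ages = np.array([2, 18, 21, 24, 28, 30, 35, 39, 45, 80, np.nan])
ages = ages[~np.isnan(ages)]  # drop missing values before binning

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges, and the drawn bars
counts, bin_edges, patches = ax.hist(ages, bins=4, color='#008374',
                                     edgecolor='#66FDEE')
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')

# Print each bin's range next to its frequency
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f'{left:.1f} to {right:.1f}: {int(count)} passengers')
```

With `bins=4` there are five bin edges, and the counts add up to the number of non-missing ages.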
- Box Plot: A box plot can be used to show the distribution of a continuous variable across different categories. For example, you can create a box plot to visualize the distribution of age or fare across different passenger classes.
3.1 Age Distribution Boxplot
# 3.1 Age distribution boxplot
ages = titanic_df.age.dropna()
plt.boxplot(x=ages, vert=False)
plt.title('Age Distribution of Passengers')
plt.xlabel('Age')
plt.show()
Features of a box plot:
Box: The box in a boxplot represents the interquartile range (IQR), which contains the middle 50% of the data. In a vertical boxplot, the top and bottom edges of the box are the third quartile (Q3) and the first quartile (Q1), respectively; in a horizontal boxplot like the one above, they are the right and left edges.
Median Line: A line inside the box indicates the median (Q2) of the data, which is the middle value of the dataset.
Whiskers: The whiskers extend from the edges of the box to the smallest and largest values within 1.5 times the IQR from Q1 and Q3. They represent the range of the bulk of the data.
Outliers: Data points that fall outside the whiskers are considered outliers. They are typically plotted as individual points. Outliers can be indicative of variability or errors in the data.
Minimum and Maximum: The ends of the whiskers show the minimum and maximum values within the range of 1.5 times the IQR from the first and third quartiles.
Meaning: A boxplot provides a visual summary of several important aspects of a dataset:
- Central Tendency: The median line shows the central point of the data.
- Spread: The IQR (the length of the box) shows the spread of the middle 50% of the data.
- Symmetry and Skewness: The relative position of the median within the box and the length of the whiskers can indicate whether the data is symmetric or skewed.
- Outliers: Individual points outside the whiskers highlight potential outliers.
Boxplots are particularly useful for comparing distributions between several groups or datasets and identifying outliers and potential anomalies.
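The quantities described above can be computed by hand, which helps when checking what a boxplot is drawing. This is a minimal sketch with NumPy on hypothetical ages (not from the Titanic dataset). It assumes the default whisker factor of 1.5 and the linear-interpolation percentiles that `np.percentile` uses by default.

```python
import numpy as np

# Hypothetical ages (not from the Titanic dataset)
ages = np.array([2, 18, 21, 24, 28, 30, 35, 39, 45, 80])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which a point counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
whisker_low = ages[ages >= lower_fence].min()
whisker_high = ages[ages <= upper_fence].max()
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print('Q1:', q1, 'Median:', median, 'Q3:', q3, 'IQR:', iqr)
print('Whiskers:', whisker_low, 'to', whisker_high)
print('Outliers:', outliers)
```

Note that the whiskers stop at actual data points (2 and 45 here), not at the fences themselves, so the value 80 is drawn as an individual outlier.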
3.2 Age Distribution Across Passenger Classes
# 3.2 Box Plot - Distribution of age across passenger classes
# Matplotlib 3.9 renamed the `labels` parameter of boxplot() to `tick_labels`;
# the old name is deprecated and will be dropped in 3.11.
# On Matplotlib versions older than 3.9, pass `labels=` instead.
plt.boxplot([titanic_df[titanic_df['pclass'] == 1]['age'].dropna(),
             titanic_df[titanic_df['pclass'] == 2]['age'].dropna(),
             titanic_df[titanic_df['pclass'] == 3]['age'].dropna()],
            tick_labels=['1st Class', '2nd Class', '3rd Class'])
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.title('Distribution of Age Across Passenger Classes')
plt.show()
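Matplotlib 3.9 renamed the `labels` parameter of `boxplot()` to `tick_labels`. If the same notebook has to run on both older and newer installations, one possible approach is to pick the keyword from the installed version. This is only a sketch: the grouped ages are hypothetical, and the version check assumes the version string starts with numeric major and minor parts.

```python
import matplotlib
import matplotlib.pyplot as plt

# Hypothetical ages per class (not from the Titanic dataset)
groups = [[29, 38, 45, 52], [22, 30, 34, 41], [18, 21, 25, 33]]
names = ['1st Class', '2nd Class', '3rd Class']

# Compare only the major and minor version numbers, e.g. "3.9.2" -> (3, 9)
major, minor = (int(part) for part in matplotlib.__version__.split('.')[:2])
if (major, minor) >= (3, 9):
    label_kwarg = {'tick_labels': names}  # new parameter name
else:
    label_kwarg = {'labels': names}       # name used before 3.9

fig, ax = plt.subplots()
ax.boxplot(groups, **label_kwarg)
ax.set_xlabel('Passenger Class')
ax.set_ylabel('Age')
# Call plt.show() here when running interactively
```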
- Scatter Plot: Scatter plots are helpful for visualizing the relationship between two continuous variables. You can create scatter plots to explore relationships such as age vs. fare. Read more about the scatter plot in the Matplotlib documentation.
# 4. Scatter Plot - Age vs. Fare
plt.scatter(
    x=titanic_df['age'],
    y=titanic_df['fare'],
    alpha=.5,
    c=titanic_df['survived'],
    cmap=ListedColormap(['#008374', '#000000'])
)
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs. Fare')
plt.colorbar(label='Survived')
plt.show()
I don’t know about you, but I don’t see a linear relationship between the age and fare of the Titanic passengers.
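A visual impression like this can be backed up with a correlation coefficient. The sketch below uses the pandas `Series.corr` method, which computes the Pearson correlation by default. The ages and fares are hypothetical values for illustration, so the printed number says nothing about the real Titanic data; on the actual dataset you could apply the same call to `titanic_df['age']` and `titanic_df['fare']`.

```python
import pandas as pd

# Hypothetical passengers (not rows from the Titanic dataset)
toy = pd.DataFrame({
    'age': [22, 35, 4, 58, 41, 29],
    'fare': [7.25, 71.28, 21.08, 26.55, 8.05, 13.00],
})

# Pearson correlation: close to +1 or -1 suggests a strong linear
# relationship, close to 0 suggests little or none
r = toy['age'].corr(toy['fare'])
print(f'Correlation between age and fare: {r:.2f}')

# Sanity check on a perfectly linear pair of series
perfect = pd.Series([1, 2, 3]).corr(pd.Series([2, 4, 6]))
print(f'Correlation of a perfectly linear pair: {perfect:.2f}')
```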
- Pie Chart: Pie charts can be used to visualize the proportion of different categories within a dataset. For example, you can create a pie chart to show the proportion of male vs. female passengers or the proportion of survivors vs. non-survivors.
# 5. Pie Chart - Proportion of male vs. female passengers
gender_counts = titanic_df['sex'].value_counts()
plt.pie(x=gender_counts, labels=gender_counts.index,
        autopct='%1.1f%%', startangle=90,
        colors=['#008374', '#66FDEE'])
plt.title('Proportion of Male vs. Female Passengers')
plt.legend(loc='lower right')
plt.show()
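The same pattern works for any categorical breakdown. As a sketch, the pie chart below shows the share of passengers in each class. It does not load the dataset; the class totals are summed from the pclass-by-survived table in the stacked bar plot section, and the third colour is an arbitrary choice. The percentages that `autopct` prints are also computed by hand for comparison.

```python
import matplotlib.pyplot as plt

# Class totals summed from the pclass-by-survived table in this tutorial
class_totals = {
    '1st Class': 123 + 200,
    '2nd Class': 158 + 119,
    '3rd Class': 528 + 181,
}
total = sum(class_totals.values())

# The percentages that autopct='%1.1f%%' will display on the wedges
shares = {name: 100 * count / total for name, count in class_totals.items()}
for name, share in shares.items():
    print(f'{name}: {share:.1f}%')

fig, ax = plt.subplots()
ax.pie(list(class_totals.values()), labels=list(class_totals.keys()),
       autopct='%1.1f%%', startangle=90,
       colors=['#008374', '#66FDEE', '#000000'])
ax.set_title('Proportion of Passengers Per Passenger Class')
# Call plt.show() here when running interactively
```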
- Stacked Bar Plot: Stacked bar plots can be used to compare the composition of different categories across groups. For example, you can create a stacked bar plot to compare the proportion of survivors and non-survivors within each passenger class.
# 6. Stacked Bar Plot - Survival status within each passenger class
survival_counts = titanic_df.groupby(['pclass', 'survived']).size().unstack()
survival_counts.plot(kind='bar', stacked=True,
                     color=['#008374', '#66FDEE'])
plt.xlabel('Passenger Class')
plt.ylabel('Number of Passengers')
plt.title('Survival Status Within Each Passenger Class')
plt.legend(['Did not survive', 'Survived'])
plt.show()
titanic_df.groupby(['pclass', 'survived']).size().unstack()
| pclass | survived = 0.0 | survived = 1.0 |
|---|---|---|
| 1.0 | 123 | 200 |
| 2.0 | 158 | 119 |
| 3.0 | 528 | 181 |
We observe that:
- More passengers in class 1 survived than did not survive (200 vs 123)
- Most of the passengers in class 3 did not survive (528 vs 181)
- More passengers in class 2 did not survive than survived (158 vs 119)
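Because the classes differ in size, raw counts can make the comparison harder to read. One option, sketched here, is to divide each row by its total so every bar shows proportions that sum to 1. The counts are typed in from the table above rather than loaded from the dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Counts copied from the pclass-by-survived table above
survival_counts = pd.DataFrame(
    {0.0: [123, 158, 528], 1.0: [200, 119, 181]},
    index=pd.Index([1.0, 2.0, 3.0], name='pclass'),
)
survival_counts.columns.name = 'survived'

# Divide each row by its row total to get proportions per class
survival_share = survival_counts.div(survival_counts.sum(axis=1), axis=0)
print(survival_share.round(3))

ax = survival_share.plot(kind='bar', stacked=True,
                         color=['#008374', '#66FDEE'])
ax.set_xlabel('Passenger Class')
ax.set_ylabel('Proportion of Passengers')
ax.set_title('Survival Proportion Within Each Passenger Class')
ax.legend(['Did not survive', 'Survived'])
# Call plt.show() here when running interactively
```

In this view the survival rate drops from about 62% in first class to about 43% in second class and about 26% in third class.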
- Line Plot: Line plots can be useful for visualizing trends over time or continuous variables. While the Titanic dataset may not have explicit time data, you can still use line plots to visualize trends such as the change in survival rate with increasing age or fare.
# 7. Line Plot - Mean age of passengers by passenger class
mean_age_by_class = titanic_df.groupby('pclass')['age'].mean()

plt.plot(mean_age_by_class.index, mean_age_by_class.values,
         marker='*', color='#008374')
plt.xlabel('Passenger Class')
plt.ylabel('Mean Age')
plt.title('Mean Age of Passengers by Passenger Class')
plt.show()
We can quickly see the average ages for each passenger class, i.e.:
- Around 39 for first class
- Around 30 for second class
- Around 25 for third class
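The line plot description also suggests plotting the change in survival rate with increasing age. A minimal sketch of that idea: group ages into bands with `pd.cut`, take the mean of the 0/1 `survived` column in each band, and plot the result against the band midpoints. The rows below are hypothetical and the band edges are an arbitrary choice, so the shape of the line is illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical passengers (not rows from the Titanic dataset)
toy = pd.DataFrame({
    'age': [4, 9, 17, 23, 28, 34, 41, 47, 55, 63],
    'survived': [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Assign each passenger to an age band: (0, 20], (20, 40], ...
toy['age_band'] = pd.cut(toy['age'], bins=[0, 20, 40, 60, 80])

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
rate = toy.groupby('age_band', observed=True)['survived'].mean()

# Use the midpoint of each band as its x position
midpoints = [band.mid for band in rate.index]

fig, ax = plt.subplots()
ax.plot(midpoints, rate.values, marker='*', color='#008374')
ax.set_xlabel('Age (band midpoint)')
ax.set_ylabel('Survival Rate')
ax.set_title('Survival Rate by Age Band')
# Call plt.show() here when running interactively
```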
These are some of the major plots you can create using Matplotlib. Each plot serves a different purpose and can help you gain insights into the data and explore relationships between variables.
# Line plot of a time series - air passengers dataset
air_passengers_data = loadDataset('air_passengers')
air_passengers_data.head()
| | Month | Passengers |
|---|---|---|
| 0 | 1949-01 | 112 |
| 1 | 1949-02 | 118 |
| 2 | 1949-03 | 132 |
| 3 | 1949-04 | 129 |
| 4 | 1949-05 | 121 |
air_passengers_data['Month'] = pd.to_datetime(air_passengers_data.Month)
plt.plot('Month', 'Passengers', data=air_passengers_data, color='#008374')
plt.xlabel('Years')
plt.ylabel('Number of Passengers')
plt.show()
We can observe that the number of passengers seems to increase with time.
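The `pd.to_datetime` step is what lets Matplotlib treat the x-axis as dates rather than as text labels. The sketch below reproduces it on the five rows shown in the table above, typed in by hand instead of loaded with `loadDataset`. Passing `format='%Y-%m'` is an optional addition that makes the expected layout of the strings explicit.

```python
import pandas as pd

# The first five rows of the air passengers table shown above
air = pd.DataFrame({
    'Month': ['1949-01', '1949-02', '1949-03', '1949-04', '1949-05'],
    'Passengers': [112, 118, 132, 129, 121],
})
print(air['Month'].dtype)  # strings before conversion

# Parse the year-month strings into timestamps (first day of each month)
air['Month'] = pd.to_datetime(air['Month'], format='%Y-%m')
print(air['Month'].dtype)  # datetime values after conversion
print(air.head())
```

Each string becomes a timestamp on the first day of its month, so `'1949-01'` is stored as 1 January 1949.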
Review
Congratulations on reaching the end of this tutorial. In this tutorial, we have learned the basic graphs and how to interpret them, namely:
- Bar chart
- Histogram
- Scatter plot
- Line plot
- Box plot
- Pie chart
- Stacked bar chart