Classification Metrics Practice

Learn Programming for Data Science. Demonstrate loading, preparing, training, and evaluating a machine learning model using the Iris dataset

Author

Juma Shafara

Published

February 1, 2024

Modified

July 25, 2024

Keywords

machine learning, machine learning classification, machine learning classification metrics, decision trees, python, precision, recall, f1 score, weighted, accuracy, linear regression

In this notebook, we’ll walk through the process of building and evaluating a decision tree classifier using Scikit-Learn. We’ll use the Iris dataset for demonstration and then provide an exercise to apply the same steps to the Wine dataset.

Importing Necessary Libraries

First, we import the necessary libraries for data manipulation and loading the dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

numpy and pandas are imported for data manipulation.
load_iris from sklearn.datasets is imported to load the Iris dataset.

Loading the Iris Dataset

iris = load_iris()

The Iris dataset is loaded and stored in the variable iris.

Displaying Dataset Description

For a better understanding of the dataset, we can uncomment the following line to print the description of the Iris dataset.

## uncomment and run to read the data description
# print(iris['DESCR'])

Extracting Features and Target Variables

X = iris['data']
y = iris['target']

X contains the feature data (sepal length, sepal width, petal length, petal width).
y contains the target data (class labels: 0, 1, 2).
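To check what X and y actually hold, the feature and class names can be read from the loaded object. The following is a minimal sketch that reloads the dataset so it runs on its own:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']
y = iris['target']

# Names of the four measured features (the columns of X)
print(iris['feature_names'])

# Species names that the labels 0, 1, 2 in y stand for
print(iris['target_names'])

# 150 samples with 4 features each, and one label per sample
print(X.shape, y.shape)
```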

Importing Train-Test Split Function

from sklearn.model_selection import train_test_split

train_test_split is imported to split the data into training and testing sets.

Splitting the Data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

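The dataset is split into training (70%) and testing (30%) sets. Because no random_state is passed, the split is different on every run, so the scores shown below will vary slightly when you rerun the notebook.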
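If you want the same split on every run, one option is to pass random_state, and optionally stratify to keep the class proportions equal in both sets. This is a sketch rather than the setup used for the outputs in this notebook; the seed value 42 is an arbitrary assumption, and the scores it produces will not match the ones shown here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle so every run produces the same split.
# stratify=y keeps the share of each class the same in train and test.
# The seed 42 is an arbitrary value chosen for illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```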

Importing Decision Tree Classifier

Next, we import the Decision Tree classifier from Scikit-Learn.

from sklearn.tree import DecisionTreeClassifier

Initializing the Classifier

We create an instance of the Decision Tree classifier with its default settings.

classifier = DecisionTreeClassifier()

Training the Classifier

We train the classifier using the training data.

classifier.fit(X_train, y_train)

DecisionTreeClassifier()
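With default settings, the tree keeps growing until its leaves are pure or too small to split further. As a hedged illustration of how the constructor can be configured, the sketch below limits the depth and fixes the random seed; the values 3 and 0 are illustrative assumptions, not tuned choices, and the name shallow_tree is hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# max_depth caps how many levels the tree may grow.
# random_state makes the choice between equally good splits repeatable.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
shallow_tree.fit(X, y)

print(shallow_tree.get_depth())     # never more than max_depth
print(shallow_tree.get_n_leaves())  # number of leaf nodes in the fitted tree
```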

Making Predictions

We then make predictions on the test data using the model’s predict() method.

preds = classifier.predict(X_test)

Importing Metrics for Evaluation

To evaluate our model, we import various metrics from Scikit-Learn.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Calculating Accuracy

Accuracy refers to the proportion of correctly predicted instances out of the total instances.

accuracy_score(y_test, preds)

0.9777777777777777
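The same number can be reproduced by hand from the counts of this run, which appear in the confusion matrix printed near the end of this notebook: 44 of the 45 test samples were classified correctly. A small sketch using only NumPy:

```python
import numpy as np

# Confusion matrix of the run shown in this notebook
# (rows = actual class, columns = predicted class)
cm = np.array([
    [17, 0, 0],
    [0, 15, 1],
    [0, 0, 12],
])

correct = np.trace(cm)   # diagonal entries are the correct predictions
total = cm.sum()         # every test sample appears exactly once
accuracy = correct / total

print(correct, total, accuracy)
```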

Calculating Precision

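Precision is the ratio of correctly predicted positive observations to the total predicted positives. Because Iris has three classes, precision is computed for each class in turn (treating that class as the positive one), and average='weighted' combines the per-class values into one score, weighting each class by its number of true samples.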

precision_score(y_test, preds, average='weighted')

0.9794871794871796
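To see where this value comes from, the weighted precision can be rebuilt from the confusion matrix of this run (printed near the end of the notebook). The sketch below is a manual calculation for illustration, not how Scikit-Learn is implemented internally:

```python
import numpy as np

# Confusion matrix of the run shown in this notebook
# (rows = actual class, columns = predicted class)
cm = np.array([
    [17, 0, 0],
    [0, 15, 1],
    [0, 0, 12],
])

true_positives = np.diag(cm)
predicted_per_class = cm.sum(axis=0)  # column sums: times each class was predicted
support = cm.sum(axis=1)              # row sums: true samples of each class

# Precision of each class: correct predictions / all predictions of that class
precision_per_class = true_positives / predicted_per_class

# Weighted average: each class counts in proportion to its support
weighted_precision = np.average(precision_per_class, weights=support)

print(precision_per_class)
print(weighted_precision)
```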

Calculating Recall

Recall is the ratio of correctly predicted positive observations to all the actual positives.

recall_score(y_test, preds, average='weighted')

0.9777777777777777
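Weighted recall is identical to the accuracy computed earlier. This is not a coincidence: weighting each class's recall by its number of true samples leaves the total number of correct predictions divided by the total number of samples. The sketch below checks this on labels rebuilt to match the counts of this run; the arrays y_true and y_pred are reconstructions for illustration, not the notebook's actual test split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 17, 16 and 12 true samples of classes 0, 1 and 2
y_true = np.repeat([0, 1, 2], [17, 16, 12])

# Same labels, except one class 1 sample is predicted as class 2
y_pred = y_true.copy()
y_pred[17] = 2

per_class_recall = recall_score(y_true, y_pred, average=None)
weighted_recall = recall_score(y_true, y_pred, average='weighted')
accuracy = accuracy_score(y_true, y_pred)

print(per_class_recall)
print(weighted_recall, accuracy)
```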

Calculating F1 Score

The F1 score is the harmonic mean of precision and recall.

f1_score(y_test, preds, average='weighted')

0.977863799283154
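With average='weighted', the F1 score is computed per class and then averaged by support. It is not the harmonic mean of the weighted precision and weighted recall reported above. A manual sketch from the confusion matrix of this run (printed near the end of the notebook) shows the calculation:

```python
import numpy as np

# Confusion matrix of the run shown in this notebook
# (rows = actual class, columns = predicted class)
cm = np.array([
    [17, 0, 0],
    [0, 15, 1],
    [0, 0, 12],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)

precision = true_positives / cm.sum(axis=0)
recall = true_positives / support

# Harmonic mean of precision and recall, one value per class
f1_per_class = 2 * precision * recall / (precision + recall)

# Average of the per-class F1 values, weighted by support
weighted_f1 = np.average(f1_per_class, weights=support)

print(f1_per_class)
print(weighted_f1)
```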

Displaying the Classification Report

We can print the classification report, which provides precision, recall, F1-score, and support for each class.

from sklearn.metrics import classification_report

# store the result under a different name so the imported function is not overwritten
report = classification_report(y_test, preds)
print(report)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        17
           1       1.00      0.94      0.97        16
           2       0.92      1.00      0.96        12

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

The report shows how well the model classifies each iris species. In this run, class 0 is classified perfectly, while the recall of 0.94 for class 1 and the precision of 0.92 for class 2 both come from a single class 1 sample that was predicted as class 2.
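The report can also show species names instead of 0, 1, 2, and can be returned as a dictionary for further processing. The sketch below uses the target_names and output_dict parameters of classification_report. It reruns the whole pipeline with an arbitrary seed of 0 (an assumption for repeatability), so its numbers will differ from the report above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# target_names replaces the numeric labels with species names;
# output_dict=True returns a nested dict instead of a formatted string
report_dict = classification_report(
    y_test, preds, target_names=iris['target_names'], output_dict=True
)

# One row per class and per average, one column per metric
report_df = pd.DataFrame(report_dict).transpose()
print(report_df)
```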

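Displaying the Confusion Matrix

A confusion matrix counts, for each actual class (rows), how many test samples were assigned to each predicted class (columns). Values on the diagonal are correct predictions.

from sklearn.metrics import confusion_matrix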

conf_matrix = confusion_matrix(y_test, preds)
conf_matrix = pd.DataFrame(conf_matrix, index=[0, 1, 2], columns=[0, 1, 2])
# print("Confusion Matrix:\n", conf_matrix)
conf_matrix

	0	1	2
0	17	0	0
1	0	15	1
2	0	0	12
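The matrix is easier to read with species names and labelled axes. The sketch below rebuilds labels that match the counts above (the arrays y_true and y_pred are reconstructions for illustration, not the notebook's actual test split) and labels the rows and columns:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

species = ['setosa', 'versicolor', 'virginica']

# 17, 16 and 12 true samples of classes 0, 1 and 2
y_true = np.repeat([0, 1, 2], [17, 16, 12])

# Same labels, except one versicolor sample is predicted as virginica
y_pred = y_true.copy()
y_pred[17] = 2

cm = confusion_matrix(y_true, y_pred)

# Rows are actual classes, columns are predicted classes
cm_df = pd.DataFrame(cm, index=species, columns=species)
cm_df.index.name = 'actual'
cm_df.columns.name = 'predicted'
print(cm_df)
```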

Exercise

Repeat the steps above using the Wine dataset from Scikit-Learn, which can be loaded with load_wine from sklearn.datasets.
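As a starting point, the Wine dataset loads in the same way as Iris. The sketch below only covers loading and inspection; splitting, training, and evaluation are left for the exercise.

```python
from sklearn.datasets import load_wine

wine = load_wine()
X = wine['data']
y = wine['target']

# 178 samples with 13 features each
print(X.shape)

# Names of the three wine classes encoded as 0, 1, 2
print(wine['target_names'])
```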