GridSearchCV

Author

Juma Shafara

Published

March 1, 2024

Keywords

What is GridSearchCV, What is a Pipeline, Machine Learning, Machine Learning Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, sklearn, KNeighborsRegressor, Hyperparameters, Hyperparameter tuning


What is GridSearchCV

GridSearchCV is a method in the scikit-learn library, which is a popular machine learning library in Python. It’s used for hyperparameter optimization, which involves searching for the best set of hyperparameters for a machine learning model. In this notebook, we’ll learn:

  • how to set up a proper GridSearchCV, and
  • how to use it for hyperparameter optimization.

Let’s import some packages

We begin by importing the necessary packages and modules. The KNeighborsRegressor model is imported from the sklearn.neighbors module. KNN regression is a non-parametric method that, intuitively, approximates the association between the independent variables and the continuous outcome by averaging the observations in the same neighbourhood. Read more about the KNN Regressor from this link

# Let's import some packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import dataidea as di
from sklearn.neighbors import KNeighborsRegressor
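
To make the averaging idea concrete, here is a minimal sketch on made-up toy data (the numbers are illustrative only and not part of the Boston workflow):

# A tiny illustration: with k=2, the prediction is simply the mean
# of the 2 nearest training targets (toy data, for intuition only)
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([1.0, 2.0, 3.0, 10.0])

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_toy, y_toy)

# the nearest neighbours of 1.5 are 1.0 and 2.0, so the prediction is (1 + 2) / 2
print(knn.predict([[1.5]]))  # [1.5]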

Let’s import necessary components from sklearn

We import essential components from sklearn: Pipeline, which we'll use to build a pipe as in the previous section; ColumnTransformer, which routes columns to their transformers; and StandardScaler and OneHotEncoder, which we'll use to transform the numeric and categorical columns respectively into a form suitable for modelling.

# let's import the pipeline components from sklearn
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

Loading the dataset

We load the dataset named boston using the loadDataset function, which is built into the dataidea package. The loaded dataset is stored in the variable data.

# loading the dataset
data = di.loadDataset('boston')
# looking at the top part
data.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
More about the Boston dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following describes the dataset columns:

  • CRIM - per capita crime rate by town
  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS - proportion of non-retail business acres per town.
  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX - nitric oxides concentration (parts per 10 million)
  • RM - average number of rooms per dwelling
  • AGE - proportion of owner-occupied units built prior to 1940
  • DIS - weighted distances to five Boston employment centres
  • RAD - index of accessibility to radial highways
  • TAX - full-value property-tax rate per $10,000
  • PTRATIO - pupil-teacher ratio by town
  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT - % lower status of the population
  • MEDV - Median value of owner-occupied homes in $1000’s
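
Before modelling, a quick structural check (an optional sketch; these are plain pandas methods on the DataFrame we loaded above) can confirm the column types and value ranges:

# quick sanity checks on the loaded data
print(data.shape)   # (number of rows, number of columns)
print(data.dtypes)  # the data type of each column
data.describe()     # summary statistics for the numeric columns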

Selecting features (X) and target variable (y)

We separate the features (X) from the target variable (y). Features are stored in X, excluding the target variable ‘MEDV’, which is stored in y.

# Selecting our X set and y
X = data.drop('MEDV', axis=1)
y = data.MEDV

Defining numeric and categorical columns

We define lists of column names representing numeric and categorical features in the dataset. We identified these columns as the best features in the previous section of this week. Click here to learn about feature selection

# numeric columns
numeric_cols = [
    'INDUS', 'NOX', 'RM',
    'TAX', 'PTRATIO', 'LSTAT'
]

# categorical columns
categorical_cols = ['CHAS', 'RAD']

Preprocessing steps

We define transformers for preprocessing numeric and categorical features. StandardScaler is used for standardizing numeric features, while OneHotEncoder is used for one-hot encoding categorical features. These transformers are applied to respective feature types using ColumnTransformer as we learned in the previous section.

# Preprocessing steps
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

# Combine preprocessing steps
column_transformer = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_cols),
        ('categorical', categorical_transformer, categorical_cols)
    ])
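
If you want to inspect what the transformer produces (an optional sketch; fit_transform is the standard ColumnTransformer API), you can apply it directly to X:

# fit the transformer and look at the shape of its output:
# 6 scaled numeric columns plus the one-hot columns for CHAS and RAD
transformed = column_transformer.fit_transform(X)
print(transformed.shape)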

Defining the pipeline

We construct a machine learning pipeline using Pipeline. The pipeline consists of preprocessing steps (defined in column_transformer) and a KNeighborsRegressor model with 10 neighbors. Learn about Machine Learning Pipelining here

# Pipeline
pipe = Pipeline([
    ('column_transformer', column_transformer),
    ('model', KNeighborsRegressor(n_neighbors=10))
])

pipe
Pipeline(steps=[('column_transformer',
                 ColumnTransformer(transformers=[('numeric', StandardScaler(),
                                                  ['INDUS', 'NOX', 'RM', 'TAX',
                                                   'PTRATIO', 'LSTAT']),
                                                 ('categorical',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['CHAS', 'RAD'])])),
                ('model', KNeighborsRegressor(n_neighbors=10))])

Fitting the pipeline

As we learned, the Pipeline exposes the fit, score and predict methods: we fit it on the dataset (X, y), evaluate the model's performance with score(), and finally make predictions with predict().

# Fit the pipeline
pipe.fit(X, y)

# Score the pipeline
pipe_score = pipe.score(X, y)

# Predict using the pipeline
pipe_predicted_y = pipe.predict(X)

print('Pipe Score:', pipe_score)
Pipe Score: 0.818140222027107
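
Note that scoring on the same data used for fitting tends to be optimistic. A quick held-out check (a minimal optional sketch using train_test_split; the 80/20 split and random_state here are arbitrary choices, not part of the original flow) would look like this:

from sklearn.model_selection import train_test_split

# hold out 20% of the data for a more honest evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipe.fit(X_train, y_train)
print('Held-out score:', pipe.score(X_test, y_test))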

Hyperparameter tuning using GridSearchCV

We perform hyperparameter tuning using GridSearchCV. The pipeline (pipe) serves as the base estimator, and we define a grid of hyperparameters to search through.

For this demonstration, we will focus on the number of neighbors for the KNN model.

from sklearn.model_selection import GridSearchCV

model = GridSearchCV(
    estimator=pipe,
    param_grid={
        'model__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    },
    cv=3
)
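
The key 'model__n_neighbors' follows scikit-learn's double-underscore convention: 'model' is the name we gave the KNeighborsRegressor step in the pipeline, and n_neighbors is the parameter of that step being tuned. If you are ever unsure which keys are valid, you can list them (get_params is the standard estimator API):

# list every tunable parameter name of the pipeline
print(sorted(pipe.get_params().keys()))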

Fitting the model for hyperparameter tuning

We fit the GridSearchCV model on the dataset to find the optimal hyperparameters. This involves preprocessing the data and training the model multiple times using cross-validation.

model.fit(X, y)
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('column_transformer',
                                        ColumnTransformer(transformers=[('numeric',
                                                                         StandardScaler(),
                                                                         ['INDUS',
                                                                          'NOX',
                                                                          'RM',
                                                                          'TAX',
                                                                          'PTRATIO',
                                                                          'LSTAT']),
                                                                        ('categorical',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['CHAS',
                                                                          'RAD'])])),
                                       ('model',
                                        KNeighborsRegressor(n_neighbors=10))]),
             param_grid={'model__n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

Extracting and displaying cross-validation results

We extract the results of cross-validation performed during hyperparameter tuning and present them in a tabular format using a DataFrame.

cv_results = pd.DataFrame(model.cv_results_)
cv_results
mean_fit_time std_fit_time mean_score_time std_score_time param_model__n_neighbors params split0_test_score split1_test_score split2_test_score mean_test_score std_test_score rank_test_score
0 0.006364 0.003203 0.003916 0.000833 1 {'model__n_neighbors': 1} 0.347172 0.561780 0.295295 0.401415 0.115356 10
1 0.004014 0.000250 0.003659 0.001165 2 {'model__n_neighbors': 2} 0.404829 0.612498 0.276690 0.431339 0.138369 9
2 0.003741 0.000159 0.003376 0.000710 3 {'model__n_neighbors': 3} 0.466325 0.590333 0.243375 0.433345 0.143552 8
3 0.004399 0.000464 0.002981 0.000075 4 {'model__n_neighbors': 4} 0.569672 0.619854 0.246539 0.478688 0.165428 4
4 0.003881 0.000336 0.002855 0.000071 5 {'model__n_neighbors': 5} 0.613900 0.600994 0.230320 0.481738 0.177857 2
5 0.004046 0.000582 0.003318 0.000555 6 {'model__n_neighbors': 6} 0.620587 0.607083 0.225238 0.484302 0.183269 1
6 0.003628 0.000127 0.002781 0.000018 7 {'model__n_neighbors': 7} 0.639693 0.583685 0.218612 0.480663 0.186704 3
7 0.003585 0.000059 0.002839 0.000093 8 {'model__n_neighbors': 8} 0.636143 0.567841 0.209472 0.471152 0.187125 5
8 0.003649 0.000175 0.002755 0.000031 9 {'model__n_neighbors': 9} 0.649335 0.542624 0.197917 0.463292 0.192639 6
9 0.003591 0.000071 0.002790 0.000060 10 {'model__n_neighbors': 10} 0.653370 0.535112 0.191986 0.460156 0.195674 7
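
To read the table best-first, you can sort it by rank (plain pandas on the cv_results DataFrame above):

# order the parameter settings from best to worst mean test score
cv_results.sort_values('rank_test_score')[
    ['param_model__n_neighbors', 'mean_test_score', 'rank_test_score']
]
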
Because refit=True by default, GridSearchCV has refit a final model on the whole dataset using the best hyperparameters it found, so score evaluates that refitted estimator:

model.score(X, y)
0.8661624926868122
Interpretation of the CV results

These are the results of a grid search cross-validation performed on our pipeline (pipe). Let’s break down each column:

  • mean_fit_time: The average time taken to fit the estimator on the training data across all folds.
  • std_fit_time: The standard deviation of the fitting time across all folds.
  • mean_score_time: The average time taken to score the estimator on the test data across all folds.
  • std_score_time: The standard deviation of the scoring time across all folds.
  • param_model__n_neighbors: The value of the n_neighbors parameter of the KNeighborsRegressor model in our pipeline for this particular grid search iteration.
  • params: A dictionary containing the parameters used in this grid search iteration.
  • split0_test_score, split1_test_score, split2_test_score: The test scores obtained for each fold of the cross-validation. Each fold corresponds to one entry here.
  • mean_test_score: The average test score across all folds.
  • std_test_score: The standard deviation of the test scores across all folds.
  • rank_test_score: The rank of this model configuration based on the mean test score. Lower values indicate better performance.

These results allow you to compare different parameter configurations and select the one that performs best based on the mean test score and other relevant metrics.

From the results above, it appears that the best number of neighbors is 6.
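
You can also read the winner off the fitted search object directly (best_params_ and best_score_ are standard GridSearchCV attributes):

# the winning hyperparameters and their mean cross-validated score
print('Best params:', model.best_params_)    # {'model__n_neighbors': 6}
print('Best CV score:', model.best_score_)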

From now on, I would like you to consider using GridSearchCV whenever you want to build a machine learning model.

Congratulations!

If you reached here, you have learned the following:

  • Selecting Features
  • Preprocessing data
  • Creating a Machine Learning Pipeline
  • Creating a GridSearchCV
  • Using GridSearchCV to find the best Hyperparameters for our Machine Learning model.

What’s on your mind? Put it in the comments!
