Pipelines
title: Pipeline
keywords: [pipeline, machine learning, supervised machine learning, data preprocessing, preprocessing, feature extraction, feature selection, model fitting]
description: "In this notebook we look at the following steps in a pipeline: preprocessing, feature extraction, feature selection, and model fitting"
author: Juma Shafara
date: "2024-03"

Pipeline
A pipeline is a series of data processing steps that are chained together sequentially. Each step in the pipeline typically performs some transformation on the data. In this notebook we will look at the following steps in a pipeline:
- Preprocessing
- Feature extraction
- Feature selection
- Model fitting
Let's redefine a model
In week 4, we introduced ourselves to Machine Learning concepts; in week 5, we learned some statistical tests; and in week 7, we applied them to find the best features and transform them into more useful forms. In this section, we will build on those concepts to redefine what a Machine Learning model is and arrive at a more efficient way of developing good Machine Learning models.
First, let's install the dataidea package, which makes loading packages and datasets much easier.
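A minimal install sketch for a notebook cell (assuming pip is available in the environment; the `--quiet` flag is optional):

```python
# install the dataidea package (run once per environment)
!pip install dataidea --quiet
```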
The Boston Housing Dataset
The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following describes the dataset columns:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's
Training our first model
In week 4, we learned that to train a model (for supervised machine learning), we need a set of X variables (also called independent or predictor variables) and a y variable (also called the dependent, outcome, or predicted variable).
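As a minimal sketch, assuming the Boston Housing data has been loaded into a pandas DataFrame called `data`, we can use MEDV as the outcome y, keep the remaining columns as X, and hold out a test set:

```python
from sklearn.model_selection import train_test_split

# X: predictor variables, y: the outcome variable (median home value)
X = data.drop('MEDV', axis=1)
y = data.MEDV

# hold out 25% of the rows for evaluating the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```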
Now we can train the KNeighborsRegressor model. By default, this model makes predictions by averaging the values of the 5 nearest neighbors of the point you want to predict.
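A minimal sketch of fitting and scoring such a model, reusing the hypothetical split above (exact scores will depend on the split):

```python
from sklearn.neighbors import KNeighborsRegressor

# fit the model on the training data
knn_model = KNeighborsRegressor()  # n_neighbors=5 by default
knn_model.fit(X_train, y_train)

# R^2 score on the held-out test data
print(knn_model.score(X_test, y_test))
```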
Now let's go ahead and try to visualize the performance of the model. The scatter plot shows the true labels against the predicted labels. Do you think the model is doing well?
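One way to draw such a plot (a sketch, assuming the hypothetical `knn_model` and test split defined above):

```python
import matplotlib.pyplot as plt

predictions = knn_model.predict(X_test)

# true values on the x-axis, predicted values on the y-axis
plt.scatter(y_test, predictions)
plt.xlabel('True MEDV')
plt.ylabel('Predicted MEDV')
plt.title('True vs predicted labels')
plt.show()
```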
Some feature selection
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.
In week 7, we learned that having irrelevant features in your data can decrease the accuracy of many models. In the code below, we try to identify the features that contribute most to the outcome variable.
# let's do some feature selection using ANOVA
from sklearn.feature_selection import SelectKBest, f_regression

data_num = data.drop(['CHAS', 'RAD'], axis=1)  # drop the categorical columns
X = data_num.drop("MEDV", axis=1)              # predictor variables
y = data_num.MEDV                              # outcome variable

# using SelectKBest with the ANOVA F-statistic to keep the 6 best features
test_reg = SelectKBest(score_func=f_regression, k=6)
fit_boston = test_reg.fit(X, y)
indexes = fit_boston.get_support(indices=True)

print(fit_boston.scores_)  # F-score for each feature
print(indexes)             # column indexes of the selected features
From the output above, we can see that the best features for now are those at indexes [2 3 4 7 8 10] in the data_num dataset. Let's find them in the data and add our categorical columns to set up our new X set.
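A minimal sketch of one way to do this, reusing `X`, `indexes`, and `data` from the previous step (the variable names `selected_columns` and `new_X` are illustrative):

```python
# names of the numeric columns selected by SelectKBest
selected_columns = X.columns[indexes]
print(selected_columns)

# add back the categorical columns to form the new feature set
new_X = data[list(selected_columns) + ['CHAS', 'RAD']]
```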
Training our second model
Now that we have selected the features, X, that we think best contribute to the outcome, let's retrain our machine learning model and see if we get better results.
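A sketch of retraining on the reduced feature set, using the hypothetical `new_X` from above (exact scores will depend on the split):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# new split using only the selected features
X_train, X_test, y_train, y_test = train_test_split(
    new_X, y, test_size=0.25, random_state=42
)

knn_model2 = KNeighborsRegressor()
knn_model2.fit(X_train, y_train)
print(knn_model2.score(X_test, y_test))
```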
The model seems to score better, with a significant increase in accuracy from 0.71 to 0.83. As we did last time, let us try to visualize the difference in performance.
I do not know about you, but as for me, I notice a meaningful improvement in the predictions made by the model, judging from this scatter plot.
Transforming the data
In week 7, we learned some advantages of scaling our data like:
- preventing dominance by features with larger scales
- faster convergence in optimization algorithms
- reducing the impact of outliers

For the transformation step, we will use the following transformers:

- Numeric Transformer:
This initializes a StandardScaler which standardizes features by removing the mean and scaling to unit variance. It's applied to numeric columns to ensure they are on a similar scale.
- Categorical Transformer:
This initializes a OneHotEncoder which converts categorical variables into a format that can be provided to ML algorithms to do a better job in prediction.
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# one-hot encode the categorical columns
one_hot_encoder = OneHotEncoder()
encoded_data_cat = one_hot_encoder.fit_transform(data[['CHAS', 'RAD']])
encoded_data_cat_array = encoded_data_cat.toarray()  # from sparse matrix to dense array

# Get feature names
feature_names = one_hot_encoder.get_feature_names_out(['CHAS', 'RAD'])
encoded_data_cat_df = pd.DataFrame(
    data=encoded_data_cat_array,
    columns=feature_names
)
Let us add that to the new X and form a standardized new X set
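A sketch of one way to put this together, reusing the hypothetical `selected_columns` from earlier and the `encoded_data_cat_df` built above (the name `standardized_X` is illustrative):

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

# standardize the selected numeric columns
scaler = StandardScaler()
scaled_numeric = pd.DataFrame(
    scaler.fit_transform(data[selected_columns]),
    columns=selected_columns
)

# combine scaled numeric features with the one-hot encoded categorical ones
standardized_X = pd.concat([scaled_numeric, encoded_data_cat_df], axis=1)
```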
Training our third model
Now that we have the right features selected and standardized, let us train a new model and see if it is going to beat the earlier models.
This new model appears to do better than the earlier ones, with an improvement in score from 0.83 to 0.87. Do you think this is now a good model?
The Pipeline
It turns out that the above efforts to improve the performance of the model add extra steps that we have to go through before we can have a good model. But what if we could put the transformers and the model together into one object that does most of that work for us?
The sklearn Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.
Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods. The final estimator only needs to implement fit.
Let us build a model that puts together transformation and modelling steps into one pipeline object
# Imports for the preprocessing and pipeline steps
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsRegressor

# numeric_cols and categorical_cols are lists holding the names of the
# numeric and categorical feature columns respectively (defined earlier)

# Preprocessing steps
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder()

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numeric_transformer, numeric_cols),
        ('categorical', categorical_transformer, categorical_cols)
    ])

# Pipeline
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', KNeighborsRegressor())
])

# display pipe
pipe
The code above sets up a data preprocessing and modeling pipeline using the scikit-learn library. Let's break down each part:
Combine Preprocessing Steps
- ColumnTransformer:
  - The `ColumnTransformer` is used to apply different preprocessing steps to different columns of the data. It combines the `numeric_transformer` for numeric columns and the `categorical_transformer` for categorical columns. `numeric_cols` and `categorical_cols` are lists containing the names of the numeric and categorical columns respectively.
Pipeline
- Pipeline Setup:
  - A `Pipeline` is created which sequentially applies a list of transforms and a final estimator.
  - The `preprocessor` step applies the `ColumnTransformer` defined earlier.
  - The `model` step applies a `KNeighborsRegressor`, which is a regression model that predicts the target variable based on the k-nearest neighbors in the feature space.
Now we can instead fit the Pipeline and use it for making predictions
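A minimal sketch of fitting and scoring the pipeline, assuming a train/test split whose columns match `numeric_cols` and `categorical_cols` (exact scores will depend on the split):

```python
# fit all preprocessing steps and the model in one call
pipe.fit(X_train, y_train)

# the same preprocessing is applied automatically before predicting/scoring
predictions = pipe.predict(X_test)
print(pipe.score(X_test, y_test))
```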
We can observe that the model still gets the same good score, but now all the transformation steps, on both the numeric and categorical variables, are in a single pipeline object together with the model.