import pandas as pd
from sklearn.model_selection import train_test_split
from dataidea.datasets import loadDataset
Feature Selection
What is Feature Selection?
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.
Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.
Three benefits of performing feature selection before modeling your data are:
- Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
- Improves Accuracy: Less misleading data means modeling accuracy improves.
- Reduces Training Time: Less data means that algorithms train faster.
You can learn more about feature selection with scikit-learn in the article Feature selection.
data = loadDataset('../assets/demo_cleaned.csv', inbuilt=False, file_type='csv')
data.head()
| | age | gender | marital_status | address | income | income_category | job_category |
|---|---|---|---|---|---|---|---|
| 0 | 55 | f | 1 | 12 | 72.0 | 3.0 | 3 |
| 1 | 56 | m | 0 | 29 | 153.0 | 4.0 | 3 |
| 2 | 24 | m | 1 | 4 | 26.0 | 2.0 | 1 |
| 3 | 45 | m | 0 | 9 | 76.0 | 4.0 | 2 |
| 4 | 44 | m | 1 | 17 | 144.0 | 4.0 | 3 |
data = pd.get_dummies(data, columns=['gender'], dtype='int', drop_first=True)
data.head(n=5)
| | age | marital_status | address | income | income_category | job_category | gender_m |
|---|---|---|---|---|---|---|---|
| 0 | 55 | 1 | 12 | 72.0 | 3.0 | 3 | 0 |
| 1 | 56 | 0 | 29 | 153.0 | 4.0 | 3 | 1 |
| 2 | 24 | 1 | 4 | 26.0 | 2.0 | 1 | 1 |
| 3 | 45 | 0 | 9 | 76.0 | 4.0 | 2 | 1 |
| 4 | 44 | 1 | 17 | 144.0 | 4.0 | 3 | 1 |
Univariate Feature Selection Techniques
Statistical tests can be used to select those features that have the strongest relationship with the output variable.
The scikit-learn library provides the SelectKBest
class that can be used with a suite of different statistical tests to select a specific number of features.
Many different statistical tests can be used with this selection method. For example, the ANOVA F-value method is appropriate for numerical inputs with a categorical output. This can be used via the f_classif() function. We will select the 2 best numeric features using this method in the example below.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import f_regression
Let's first separate our data into features, i.e. X, and outcome, i.e. y, as below.
X = data.drop('marital_status', axis=1)
y = data.marital_status
Numeric or Continuous Features with Categorical Outcome
Beginning with the numeric columns, let's find which of them contribute most to the outcome variable.
X_numeric = X[['age', 'income', 'address']].copy()
# create a test object from SelectKBest
test = SelectKBest(score_func=f_classif, k=2)

# fit the test object to the data
fit = test.fit(X_numeric, y)

# get the scores
scores = fit.scores_

# get the selected features and their indices
features = fit.transform(X_numeric)
selected_indices = test.get_support(indices=True)
# print the scores and features
print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores: [1.34973748 1.73808724 0.02878244]
Selected Features Indices: [0 1]
This shows us that the best 2 features for differentiating between the groups in our outcome are those at indices [0, 1], i.e. age and income.
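Rather than reading off the indices by hand, we can map them back to column names. This is a small sketch reusing the X_numeric and selected_indices objects created above.

# map the selected indices back to the original column names
selected_columns = X_numeric.columns[selected_indices]
print('Selected Columns: ', list(selected_columns))
# for this data, these should be 'age' and 'income'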
Numeric Features with Numeric Outcome
Let's select the input features, X, and the output (outcome), y.
# pick numeric input and output
X = data[['age', 'address']].copy()
y = data.income
We will still use the SelectKBest
class but with our score_func
as f_regression
instead.
test = SelectKBest(score_func=f_regression, k=1)

# fit the test to the data
fit = test.fit(X, y)

# get scores
test_scores = fit.scores_

# summarize selected features
features = fit.transform(X)

# get the selected feature indices
selected_indices = fit.get_support(indices=True)
print('Feature Scores: ', test_scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores: [25.18294605 23.43115992]
Selected Features Indices: [0]
Here, we can see that age is selected because it returns the higher F-statistic of the two features.
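As before, we can pair each column with its score and map the selected index back to a name. This is a small sketch reusing the X, test_scores, and selected_indices objects from above.

# pair each input column with its F-statistic for easier reading
for column, score in zip(X.columns, test_scores):
    print(f'{column}: {score:.2f}')

# map the selected index back to a column name
print('Selected Columns: ', list(X.columns[selected_indices]))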
Both Input and Outcome Categorical
Let's begin by selecting out only the categorical features to make our X set, and set y as the categorical outcome.
# selecting categorical features
X = data[['gender_m', 'income_category', 'job_category']].copy()

# selecting categorical outcome
y = data.marital_status
Now we shall again use SelectKBest, but with the score_func set to chi2.
from sklearn.feature_selection import chi2
test = SelectKBest(score_func=chi2, k=2)
fit = test.fit(X, y)
scores = fit.scores_
features = fit.transform(X)
selected_indices = fit.get_support(indices=True)
print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores: [0.20921223 0.61979264 0.00555967]
Selected Features Indices: [0 1]
Note: When using Chi-Square (chi2) as the score function for feature selection, the features are scored with the Chi-Square statistic rather than an F-statistic. Again, we can see that the two features with the higher Chi-Square scores have been selected.
- f_classif is most applicable where the input features are continuous and the outcome is categorical.
- f_regression is most applicable where the input features are continuous and the outcome is continuous.
- chi2 is best when both the input and outcome are categorical.
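As a quick reference, these rules can be captured in a small lookup table. The score_functions dictionary below is a hypothetical helper (not part of scikit-learn) that simply restates the guidance above.

from sklearn.feature_selection import SelectKBest, f_classif, f_regression, chi2

# hypothetical lookup table summarising the rules above;
# keys are (input_type, outcome_type) pairs
score_functions = {
    ('continuous', 'categorical'): f_classif,
    ('continuous', 'continuous'): f_regression,
    ('categorical', 'categorical'): chi2,
}

# example: continuous inputs with a categorical outcome -> f_classif
test = SelectKBest(score_func=score_functions[('continuous', 'categorical')], k=2)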
Recursive Feature Elimination
The Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on those attributes that remain.
It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.
You can learn more about the RFE class in the scikit-learn documentation.
Logistic Regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
X = data.drop('marital_status', axis=1)
y = data.marital_status

# feature extraction
model = LogisticRegression()
rfe = RFE(model)
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
Num Features: 3
Selected Features: [False False False True True True]
Feature Ranking: [2 3 4 1 1 1]
X.head()
| | age | address | income | income_category | job_category | gender_m |
|---|---|---|---|---|---|---|
| 0 | 55 | 12 | 72.0 | 3.0 | 3 | 0 |
| 1 | 56 | 29 | 153.0 | 4.0 | 3 | 1 |
| 2 | 24 | 4 | 26.0 | 2.0 | 1 | 1 |
| 3 | 45 | 9 | 76.0 | 4.0 | 2 | 1 |
| 4 | 44 | 17 | 144.0 | 4.0 | 3 | 1 |
From the operation above, we can observe which features bring out the best in the LogisticRegression model: features ranked 1 are the selected (most important) ones, and larger ranking values indicate less important features.
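To make the boolean mask and ranking easier to read, we can line them up with the column names of X. This is a small sketch using pandas and the fit object from above.

# pair each column with its RFE ranking (1 means the feature was selected)
ranking = pd.Series(fit.ranking_, index=X.columns).sort_values()
print(ranking)

# the columns RFE kept
print('Selected Columns: ', list(X.columns[fit.support_]))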
Feature Importance
Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.
In the example below we construct an ExtraTreesClassifier for our dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.
Extra Trees Classifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, y)
# see the best features
print(model.feature_importances_)
[0.29058207 0.24978811 0.26342117 0.06763375 0.08501043 0.04356447]
Random Forest Classifier
# feature extraction
model = RandomForestClassifier()
model.fit(X, y)
# see the best features
print(model.feature_importances_)
[0.28927782 0.2515934 0.28839236 0.06166801 0.06610313 0.04296528]
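Since feature_importances_ comes back as a bare array, it is easier to interpret when paired with the column names. The sketch below does this with pandas for the model fitted above.

# pair each column with its importance and sort from most to least important
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))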
You can learn more about Random Forest here.