ANOVA for Feature Selection

In this notebook, we demonstrate how ANOVA (Analysis of Variance) can be used to identify better features for machine learning models.
Author

Juma Shafara

Published

March 1, 2023

Keywords

ANOVA for feature selection, Feature selection techniques, Data Science tutorial, Machine learning feature selection, ANOVA in data science, Univariate feature selection, Fantasy Premier League dataset analysis, SelectKBest example, Statistical tests for feature selection, F-statistic in ANOVA, Python feature selection, Scikit-learn SelectKBest, Analysis of Variance, Machine learning with ANOVA, Data science programming


In this notebook, we demonstrate how ANOVA (Analysis of Variance) can be used to identify better features for machine learning models. We’ll use the Fantasy Premier League (FPL) dataset to show how ANOVA helps in selecting features that best differentiate categories.

# Uncomment the line below if you need to install the dataidea package
# !pip install -U dataidea

First, we’ll import the necessary packages: scipy for performing ANOVA, dataidea for loading the FPL dataset, and SelectKBest from scikit-learn for univariate feature selection based on statistical tests.

import scipy as sp
import scipy.stats  # explicitly load the stats submodule so sp.stats works on all SciPy versions
from sklearn.feature_selection import SelectKBest, f_classif
import dataidea as di

Let’s load the FPL dataset and preview the top 5 rows.

# Load FPL dataset
fpl = di.loadDataset('fpl') 

# Preview the top 5 rows
fpl.head(n=5)

|   | First_Name | Second_Name | Club | Goals_Scored | Assists | Total_Points | Minutes | Saves | Goals_Conceded | Creativity | Influence | Threat | Bonus | BPS | ICT_Index | Clean_Sheets | Red_Cards | Yellow_Cards | Position |
|---|------------|-------------|------|--------------|---------|--------------|---------|-------|----------------|------------|-----------|--------|-------|-----|-----------|--------------|-----------|--------------|----------|
| 0 | Bruno | Fernandes | MUN | 18 | 14 | 244 | 3101 | 0 | 36 | 1414.9 | 1292.6 | 1253 | 36 | 870 | 396.2 | 13 | 0 | 6 | MID |
| 1 | Harry | Kane | TOT | 23 | 14 | 242 | 3083 | 0 | 39 | 659.1 | 1318.2 | 1585 | 40 | 880 | 355.9 | 12 | 0 | 1 | FWD |
| 2 | Mohamed | Salah | LIV | 22 | 6 | 231 | 3077 | 0 | 41 | 825.7 | 1056.0 | 1980 | 21 | 657 | 385.8 | 11 | 0 | 0 | MID |
| 3 | Heung-Min | Son | TOT | 17 | 11 | 228 | 3119 | 0 | 36 | 1049.9 | 1052.2 | 1046 | 26 | 777 | 315.2 | 13 | 0 | 0 | MID |
| 4 | Patrick | Bamford | LEE | 17 | 11 | 194 | 3052 | 0 | 50 | 371.0 | 867.2 | 1512 | 26 | 631 | 274.6 | 10 | 0 | 3 | FWD |

ANOVA tests whether the means of several groups differ significantly. For feature selection, we compute each candidate feature’s F-statistic across the target categories; features with higher F-statistics separate the categories more clearly and are preferred.
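For reference, the one-way ANOVA F-statistic is the ratio of between-group to within-group variability. With $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{x}_i$, and grand mean $\bar{x}$:

$$
F = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \,/\, (k - 1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \,/\, (N - k)}
$$

The larger the F-statistic, the more the between-group differences dominate the within-group noise.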

ANOVA for Goals Scored

We will create groups of goals scored by each player position (forwards, midfielders, defenders, and goalkeepers) and run an ANOVA test.

# Create groups of goals scored for each player position
forwards_goals = fpl[fpl.Position == 'FWD']['Goals_Scored']
midfielders_goals = fpl[fpl.Position == 'MID']['Goals_Scored']
defenders_goals = fpl[fpl.Position == 'DEF']['Goals_Scored']
goalkeepers_goals = fpl[fpl.Position == 'GK']['Goals_Scored']

# Perform the ANOVA test for the groups
f_statistic, p_value = sp.stats.f_oneway(forwards_goals, midfielders_goals, defenders_goals, goalkeepers_goals)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 33.281034594400445
p-value: 3.9257634156019246e-20

We observe an F-statistic of 33.281 and a p-value of 3.926e-20. Since the p-value is far below conventional significance thresholds such as 0.05 and 0.01, mean goals scored differs significantly across player positions.
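To see the group differences behind this result, we can summarize goals scored per position directly; a quick sketch using the fpl DataFrame loaded above:

# Summarize goals scored by position to see what drives the F-statistic
fpl.groupby('Position')['Goals_Scored'].agg(['mean', 'std', 'count'])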

ANOVA for Assists

Next, we’ll create groups for assists and run an ANOVA test.

# Create groups of assists for each player position
forwards_assists = fpl[fpl.Position == 'FWD']['Assists']
midfielders_assists = fpl[fpl.Position == 'MID']['Assists']
defenders_assists = fpl[fpl.Position == 'DEF']['Assists']
goalkeepers_assists = fpl[fpl.Position == 'GK']['Assists']

# Perform the ANOVA test for the groups
f_statistic, p_value = sp.stats.f_oneway(forwards_assists, midfielders_assists, defenders_assists, goalkeepers_assists)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 19.263717036430815
p-value: 5.124889288362087e-12

We observe an F-statistic of 19.264 and a p-value of 5.125e-12, again a statistically significant difference in means across positions.

Comparing Results

Both tests are statistically significant, but goals scored has the higher F-statistic, indicating that it separates the player positions more strongly and is therefore the better feature of the two.
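As a sanity check, the same F-statistics can be reproduced from first principles. Below is a minimal sketch (our own helper, not part of scipy) that applies the formula above to the goal groups created earlier:

import numpy as np

def f_statistic(groups):
    """One-way ANOVA F-statistic computed directly from raw groups."""
    values = np.concatenate(groups)
    grand_mean = values.mean()
    k, n = len(groups), len(values)
    # Between-group mean square (k - 1 degrees of freedom)
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    # Within-group mean square (N - k degrees of freedom)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return ms_between / ms_within

goal_groups = [g.to_numpy() for g in (forwards_goals, midfielders_goals,
                                      defenders_goals, goalkeepers_goals)]
print('Manual F-statistic (goals):', f_statistic(goal_groups))  # ≈ 33.281

The result should match the value reported by sp.stats.f_oneway above.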

Using SelectKBest for Feature Selection

We can also use SelectKBest from scikit-learn to automate this process.

# Use scikit-learn's SelectKBest (with f_classif)
test = SelectKBest(score_func=f_classif, k=1)

# Fit the model to the data
fit = test.fit(fpl[['Goals_Scored', 'Assists']], fpl.Position)

# Get the F-statistics
scores = fit.scores_

# Select the best feature
features = fit.transform(fpl[['Goals_Scored', 'Assists']])

# Get the indices of the selected features (optional)
selected_indices = test.get_support(indices=True)

# Print indices and scores
print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores:  [33.28103459 19.26371704]
Selected Features Indices:  [0]

The feature at index 0 (Goals_Scored) is selected as the better of the two, consistent with the F-statistics above.
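To report names rather than indices, we can map the selected indices back onto the list of columns we passed in; a small sketch:

# Map selected indices back to the original column names
candidate_features = ['Goals_Scored', 'Assists']
selected_names = [candidate_features[i] for i in selected_indices]
print('Selected Features:', selected_names)  # ['Goals_Scored']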

Summary

In this notebook, we demonstrated how to use ANOVA for feature selection on the Fantasy Premier League dataset. By comparing F-statistics, we found that ‘Goals Scored’ differentiates player positions more strongly than ‘Assists’, and SelectKBest from scikit-learn confirmed it as the better of the two features. The same method can be applied to other datasets and features to improve the performance of machine learning models.
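To reuse this approach elsewhere, a small helper along the lines of the hypothetical rank_features below (our own sketch, not part of scikit-learn or dataidea) scores every numeric column against a categorical target and ranks them:

import pandas as pd
from sklearn.feature_selection import f_classif

def rank_features(df, target):
    """Rank numeric columns by their ANOVA F-statistic against a categorical target column."""
    X = df.drop(columns=[target]).select_dtypes('number')
    scores, _ = f_classif(X, df[target])
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)

# Example with the FPL data:
# rank_features(fpl[['Goals_Scored', 'Assists', 'Position']], 'Position')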

What’s on your mind? Put it in the comments!
