ANOVA for Feature Selection

In this notebook, we demonstrate how ANOVA (Analysis of Variance) can be used to identify better features for machine learning models.
Author

Juma Shafara

Published

March 1, 2023

Keywords

ANOVA for feature selection, Feature selection techniques, Data Science tutorial, Machine learning feature selection, ANOVA in data science, Univariate feature selection, Fantasy Premier League dataset analysis, SelectKBest example, Statistical tests for feature selection, F-statistic in ANOVA, Python feature selection, Scikit-learn SelectKBest, Analysis of Variance, Machine learning with ANOVA, Data science programming


In this notebook, we demonstrate how ANOVA (Analysis of Variance) can be used to identify better features for machine learning models. We’ll use the Fantasy Premier League (FPL) dataset to show how ANOVA helps in selecting features that best differentiate categories.

# Uncomment the line below if you need to install the dataidea package
# !pip install -U dataidea

First, we’ll import the necessary packages: scipy for performing ANOVA, dataidea for loading the FPL dataset, and SelectKBest from scikit-learn for univariate feature selection based on statistical tests.

import scipy as sp
import scipy.stats  # explicitly load the stats submodule so sp.stats works on all SciPy versions
from sklearn.feature_selection import SelectKBest, f_classif
import dataidea as di

Let’s load the FPL dataset and preview the top 5 rows.

# Load FPL dataset
fpl = di.loadDataset('fpl') 

# Preview the top 5 rows
fpl.head(n=5)

|   | First_Name | Second_Name | Club | Goals_Scored | Assists | Total_Points | Minutes | Saves | Goals_Conceded | Creativity | Influence | Threat | Bonus | BPS | ICT_Index | Clean_Sheets | Red_Cards | Yellow_Cards | Position |
|---|------------|-------------|------|--------------|---------|--------------|---------|-------|----------------|------------|-----------|--------|-------|-----|-----------|--------------|-----------|--------------|----------|
| 0 | Bruno | Fernandes | MUN | 18 | 14 | 244 | 3101 | 0 | 36 | 1414.9 | 1292.6 | 1253 | 36 | 870 | 396.2 | 13 | 0 | 6 | MID |
| 1 | Harry | Kane | TOT | 23 | 14 | 242 | 3083 | 0 | 39 | 659.1 | 1318.2 | 1585 | 40 | 880 | 355.9 | 12 | 0 | 1 | FWD |
| 2 | Mohamed | Salah | LIV | 22 | 6 | 231 | 3077 | 0 | 41 | 825.7 | 1056.0 | 1980 | 21 | 657 | 385.8 | 11 | 0 | 0 | MID |
| 3 | Heung-Min | Son | TOT | 17 | 11 | 228 | 3119 | 0 | 36 | 1049.9 | 1052.2 | 1046 | 26 | 777 | 315.2 | 13 | 0 | 0 | MID |
| 4 | Patrick | Bamford | LEE | 17 | 11 | 194 | 3052 | 0 | 50 | 371.0 | 867.2 | 1512 | 26 | 631 | 274.6 | 10 | 0 | 3 | FWD |

ANOVA tests whether the means of several groups differ significantly. For feature selection, we compute each candidate feature’s F-statistic across the target categories; features with higher F-statistics separate the categories more clearly and are preferred.
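For reference, the one-way ANOVA F-statistic is the ratio of between-group to within-group variability. With $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{x}_i$, and grand mean $\bar{x}$:

$$
F = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2 \,/\, (k - 1)}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 \,/\, (N - k)}
$$

The larger the F-statistic, the more the between-group differences dominate the within-group noise.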

ANOVA for Goals Scored

We will create groups of goals scored by each player position (forwards, midfielders, defenders, and goalkeepers) and run an ANOVA test.

# Create groups of goals scored for each player position
forwards_goals = fpl[fpl.Position == 'FWD']['Goals_Scored']
midfielders_goals = fpl[fpl.Position == 'MID']['Goals_Scored']
defenders_goals = fpl[fpl.Position == 'DEF']['Goals_Scored']
goalkeepers_goals = fpl[fpl.Position == 'GK']['Goals_Scored']

# Perform the ANOVA test for the groups
f_statistic, p_value = sp.stats.f_oneway(forwards_goals, midfielders_goals, defenders_goals, goalkeepers_goals)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 33.281034594400445
p-value: 3.9257634156019246e-20

We observe an F-statistic of 33.281 and a p-value of 3.926e-20. Since the p-value is far below conventional significance thresholds such as 0.05 and 0.01, mean goals scored differs significantly across player positions.
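To see the group differences behind this result, we can summarize goals scored per position directly; a quick sketch using the fpl DataFrame loaded above:

# Summarize goals scored by position to see what drives the F-statistic
fpl.groupby('Position')['Goals_Scored'].agg(['mean', 'std', 'count'])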

ANOVA for Assists

Next, we’ll create groups for assists and run an ANOVA test.

# Create groups of assists for each player position
forwards_assists = fpl[fpl.Position == 'FWD']['Assists']
midfielders_assists = fpl[fpl.Position == 'MID']['Assists']
defenders_assists = fpl[fpl.Position == 'DEF']['Assists']
goalkeepers_assists = fpl[fpl.Position == 'GK']['Assists']

# Perform the ANOVA test for the groups
f_statistic, p_value = sp.stats.f_oneway(forwards_assists, midfielders_assists, defenders_assists, goalkeepers_assists)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 19.263717036430815
p-value: 5.124889288362087e-12

We observe an F-statistic of 19.264 and a p-value of 5.125e-12, again a statistically significant difference in means across positions.

Comparing Results

Both tests are statistically significant, but goals scored has the higher F-statistic, indicating that it separates the player positions more strongly and is therefore the better feature of the two.
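As a sanity check, the same F-statistics can be reproduced from first principles. Below is a minimal sketch (our own helper, not part of scipy) that applies the formula above to the goal groups created earlier:

import numpy as np

def f_statistic(groups):
    """One-way ANOVA F-statistic computed directly from raw groups."""
    values = np.concatenate(groups)
    grand_mean = values.mean()
    k, n = len(groups), len(values)
    # Between-group mean square (k - 1 degrees of freedom)
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    # Within-group mean square (N - k degrees of freedom)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return ms_between / ms_within

goal_groups = [g.to_numpy() for g in (forwards_goals, midfielders_goals,
                                      defenders_goals, goalkeepers_goals)]
print('Manual F-statistic (goals):', f_statistic(goal_groups))  # ≈ 33.281

The result should match the value reported by sp.stats.f_oneway above.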

Using SelectKBest for Feature Selection

We can also use SelectKBest from scikit-learn to automate this process.

# Use scikit-learn's SelectKBest (with f_classif)
test = SelectKBest(score_func=f_classif, k=1)

# Fit the model to the data
fit = test.fit(fpl[['Goals_Scored', 'Assists']], fpl.Position)

# Get the F-statistics
scores = fit.scores_

# Select the best feature
features = fit.transform(fpl[['Goals_Scored', 'Assists']])

# Get the indices of the selected features (optional)
selected_indices = test.get_support(indices=True)

# Print indices and scores
print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores:  [33.28103459 19.26371704]
Selected Features Indices:  [0]

The feature at index 0 (Goals_Scored) is selected as the better of the two, consistent with the F-statistics above.
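To report names rather than indices, we can map the selected indices back onto the list of columns we passed in; a small sketch:

# Map selected indices back to the original column names
candidate_features = ['Goals_Scored', 'Assists']
selected_names = [candidate_features[i] for i in selected_indices]
print('Selected Features:', selected_names)  # ['Goals_Scored']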

Summary

In this notebook, we demonstrated how to use ANOVA for feature selection on the Fantasy Premier League dataset. By comparing F-statistics, we found that ‘Goals Scored’ differentiates player positions more strongly than ‘Assists’, and SelectKBest from scikit-learn confirmed it as the better of the two features. The same method can be applied to other datasets and features to improve the performance of machine learning models.
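To reuse this approach elsewhere, a small helper along the lines of the hypothetical rank_features below (our own sketch, not part of scikit-learn or dataidea) scores every numeric column against a categorical target and ranks them:

import pandas as pd
from sklearn.feature_selection import f_classif

def rank_features(df, target):
    """Rank numeric columns by their ANOVA F-statistic against a categorical target column."""
    X = df.drop(columns=[target]).select_dtypes('number')
    scores, _ = f_classif(X, df[target])
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)

# Example with the FPL data:
# rank_features(fpl[['Goals_Scored', 'Assists', 'Position']], 'Position')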

What’s on your mind? Put it in the comments!
