Bonus
In this notebook, we demonstrate how ANOVA (Analysis of Variance) can be used to identify better features for machine learning models. We'll use the Fantasy Premier League (FPL) dataset to show how ANOVA helps in selecting features that best differentiate categories.
First, we'll import the necessary packages: `scipy` for performing ANOVA, `dataidea` for loading the FPL dataset, and `SelectKBest` from scikit-learn for univariate feature selection based on statistical tests.
import scipy as sp
from sklearn.feature_selection import SelectKBest, f_classif
import dataidea as di
Let's load the FPL dataset and preview the top 5 rows.
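If the `dataidea` loader isn't available locally, a small stand-in frame is enough to follow along with the code in this notebook. This sketch builds one by hand from a few of the preview rows below and only the columns the later examples use; the real notebook loads the full dataset through `dataidea`.

```python
import pandas as pd

# Hypothetical stand-in for the full FPL dataset: a handful of rows with
# the columns used in this notebook (values copied from the preview below).
fpl = pd.DataFrame({
    'First_Name':   ['Bruno', 'Harry', 'Mohamed', 'Heung-Min', 'Patrick'],
    'Second_Name':  ['Fernandes', 'Kane', 'Salah', 'Son', 'Bamford'],
    'Club':         ['MUN', 'TOT', 'LIV', 'TOT', 'LEE'],
    'Goals_Scored': [18, 23, 22, 17, 17],
    'Assists':      [14, 14, 6, 11, 11],
    'Position':     ['MID', 'FWD', 'MID', 'MID', 'FWD'],
})

print(fpl.head())
```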
| | First_Name | Second_Name | Club | Goals_Scored | Assists | Total_Points | Minutes | Saves | Goals_Conceded | Creativity | Influence | Threat | Bonus | BPS | ICT_Index | Clean_Sheets | Red_Cards | Yellow_Cards | Position |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bruno | Fernandes | MUN | 18 | 14 | 244 | 3101 | 0 | 36 | 1414.9 | 1292.6 | 1253 | 36 | 870 | 396.2 | 13 | 0 | 6 | MID |
| 1 | Harry | Kane | TOT | 23 | 14 | 242 | 3083 | 0 | 39 | 659.1 | 1318.2 | 1585 | 40 | 880 | 355.9 | 12 | 0 | 1 | FWD |
| 2 | Mohamed | Salah | LIV | 22 | 6 | 231 | 3077 | 0 | 41 | 825.7 | 1056.0 | 1980 | 21 | 657 | 385.8 | 11 | 0 | 0 | MID |
| 3 | Heung-Min | Son | TOT | 17 | 11 | 228 | 3119 | 0 | 36 | 1049.9 | 1052.2 | 1046 | 26 | 777 | 315.2 | 13 | 0 | 0 | MID |
| 4 | Patrick | Bamford | LEE | 17 | 11 | 194 | 3052 | 0 | 50 | 371.0 | 867.2 | 1512 | 26 | 631 | 274.6 | 10 | 0 | 3 | FWD |
ANOVA helps us determine if there's a significant difference between the means of different groups. We use it to select features that best show the difference between categories. Features with higher F-statistics are preferred.
ANOVA for Goals Scored
We will create groups of goals scored by each player position (forwards, midfielders, defenders, and goalkeepers) and run an ANOVA test.
# Create groups of goals scored for each player position
forwards_goals = fpl[fpl.Position == 'FWD']['Goals_Scored']
midfielders_goals = fpl[fpl.Position == 'MID']['Goals_Scored']
defenders_goals = fpl[fpl.Position == 'DEF']['Goals_Scored']
goalkeepers_goals = fpl[fpl.Position == 'GK']['Goals_Scored']
# Perform the ANOVA test for the groups
f_statistic, p_value = sp.stats.f_oneway(forwards_goals, midfielders_goals, defenders_goals, goalkeepers_goals)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 33.281034594400445
p-value: 3.9257634156019246e-20
We observe an F-statistic of 33.281 and a p-value of 3.926e-20, indicating a statistically significant difference in goals scored across positions at any conventional significance level.
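As a sanity check on what `f_oneway` reports, the F-statistic can be computed by hand as the ratio of between-group to within-group mean squares. This is a sketch on small made-up groups (not the FPL data) so the arithmetic is easy to follow:

```python
import numpy as np
from scipy.stats import f_oneway

# Three made-up groups (hypothetical values, not FPL data)
groups = [np.array([2., 4., 3.]), np.array([6., 7., 5.]), np.array([1., 2., 0.])]

all_vals = np.concatenate(groups)
grand_mean = all_vals.mean()
k = len(groups)      # number of groups
n = all_vals.size    # total observations

# Between-group sum of squares, divided by its degrees of freedom (k - 1)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within-group sum of squares, divided by its degrees of freedom (n - k)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n - k)

f_manual = ms_between / ms_within
f_scipy, _ = f_oneway(*groups)
print(f_manual, f_scipy)  # the two values agree
```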
ANOVA for Assists
Next, we'll create groups for assists and run an ANOVA test.
# Create groups of assists for each player position
forwards_assists = fpl[fpl.Position == 'FWD']['Assists']
midfielders_assists = fpl[fpl.Position == 'MID']['Assists']
defenders_assists = fpl[fpl.Position == 'DEF']['Assists']
goalkeepers_assists = fpl[fpl.Position == 'GK']['Assists']
# Perform the ANOVA test for the groups
f_statistic, p_value = sp.stats.f_oneway(forwards_assists, midfielders_assists, defenders_assists, goalkeepers_assists)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 19.263717036430815
p-value: 5.124889288362087e-12
We observe an F-statistic of 19.264 and a p-value of 5.125e-12, again indicating a significant difference across positions.
Comparing Results
Both features show significant F-statistics, but goals scored has a higher value, indicating it is a better feature for differentiating player positions.
Using SelectKBest for Feature Selection
We can also use `SelectKBest` from scikit-learn to automate this process.
# Use scikit-learn's SelectKBest (with f_classif)
test = SelectKBest(score_func=f_classif, k=1)
# Fit the model to the data
fit = test.fit(fpl[['Goals_Scored', 'Assists']], fpl.Position)
# Get the F-statistics
scores = fit.scores_
# Select the best feature
features = fit.transform(fpl[['Goals_Scored', 'Assists']])
# Get the indices of the selected features (optional)
selected_indices = test.get_support(indices=True)
# Print indices and scores
print('Feature Scores: ', scores)
print('Selected Features Indices: ', selected_indices)
Feature Scores: [33.28103459 19.26371704]
Selected Features Indices: [0]
The feature at index 0 (Goals_Scored) is selected as the best feature based on its F-statistic.
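To recover the selected column's name rather than just its index, the boolean mask from `get_support()` can be zipped against the feature names. A minimal sketch on synthetic data (hypothetical feature names, since the FPL frame isn't rebuilt here), where one feature separates the classes and the other is pure noise:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
# Synthetic stand-in: feature 0 separates the two classes, feature 1 is noise
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)]),
    rng.normal(0, 1, 100),
])
y = np.array([0] * 50 + [1] * 50)

selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
feature_names = ['informative', 'noise']
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print(selected)
```

With a pandas DataFrame, `X.columns` plays the role of `feature_names` here.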
Summary
In this notebook, we demonstrated how to use ANOVA for feature selection on the Fantasy Premier League dataset. By comparing F-statistics, we found that Goals_Scored differentiates player positions more strongly than Assists. Using `SelectKBest` from scikit-learn, we confirmed that Goals_Scored is the better of the two features. The same method can be applied to other datasets and features to improve the performance of machine learning models.