import numpy as np
import dataidea as di
Normalization vs Standardization
Keywords: normalization, standardization, rescaling
Introduction:
In data analysis and machine learning, preprocessing steps such as data normalization and standardization are crucial for improving the performance and interpretability of models.
This Jupyter Notebook provides an overview of the importance of data normalization and standardization in preparing data for analysis and modeling.
Normalization
- Normalization: Normalization typically refers to scaling numerical features to a common scale, often between 0 and 1. This is usually done by subtracting the minimum value and then dividing by the range (maximum - minimum). Normalization is useful when the distribution of the data does not follow a Gaussian distribution (Normal Distribution).
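In formula form, each value x is mapped to x_scaled = (x - x_min) / (x_max - x_min), so the smallest value in the feature becomes 0 and the largest becomes 1.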
# Data Normalization without libraries:
def minMaxScaling(data):
    min_val = min(data)
    max_val = max(data)

    scaled_data = []
    for value in data:
        scaled = (value - min_val) / (max_val - min_val)
        scaled_data.append(scaled)
    return scaled_data
# Example data
data = np.array([10, 20, 30, 40, 50])
normalized_data = minMaxScaling(data)
print("Normalized data (Min-Max Scaling):", normalized_data)
Normalized data (Min-Max Scaling): [0.0, 0.25, 0.5, 0.75, 1.0]
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])

# Create the scaler
scaler = MinMaxScaler()

# Fit the scaler to the data and transform the data
normalized_data = scaler.fit_transform(data)
print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data)
Original data:
[[1 2]
[3 4]
[5 6]]
Normalized data:
[[0. 0. ]
[0.5 0.5]
[1. 1. ]]
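Because MinMaxScaler remembers the minimum and range it learned during fitting, the scaling can also be undone. A minimal sketch, reusing the scaler and normalized_data from the cell above:

# Recover the original values from the normalized ones
recovered_data = scaler.inverse_transform(normalized_data)
print("Recovered data:")
print(recovered_data)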
Let’s now try this on a real-world dataset!
boston_data = di.loadDataset('boston')
boston_data.head()
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
boston_scaler = MinMaxScaler()
normalized_data = boston_scaler.fit_transform(boston_data[['CRIM', 'AGE', 'TAX']])
np.set_printoptions(suppress=True)
normalized_data
array([[0. , 0.64160659, 0.20801527],
[0.00023592, 0.78269825, 0.10496183],
[0.0002357 , 0.59938208, 0.10496183],
...,
[0.00061189, 0.90731205, 0.16412214],
[0.00116073, 0.88980433, 0.16412214],
[0.00046184, 0.80226571, 0.16412214]])
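In a modeling workflow, the scaler should be fitted on the training split only and then applied to both splits; otherwise information from the test set leaks into preprocessing. A minimal sketch of that pattern, assuming boston_data is loaded as above (the column choice here is just for illustration):

from sklearn.model_selection import train_test_split

features = boston_data[['CRIM', 'AGE', 'TAX']]
train_features, test_features = train_test_split(features, test_size=0.2, random_state=42)

split_scaler = MinMaxScaler()
train_scaled = split_scaler.fit_transform(train_features)  # fit on training data only
test_scaled = split_scaler.transform(test_features)        # reuse the training min and max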
Standardization
- Standardization: Standardization, often implemented as z-score standardization, transforms the data to have a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution: the result is Gaussian only if the original data was Gaussian.
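In formula form, each value x is mapped to z = (x - mean) / std, where mean and std are the mean and standard deviation of the feature.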
def zScoreNormalization(data):
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    std_dev = variance ** 0.5
    standardized_data = [(x - mean) / std_dev for x in data]
    return standardized_data
# Example data
data = [10, 20, 30, 40, 50]
standardized_data = zScoreNormalization(data)
print("Standardized data (Z-Score Normalization):", standardized_data)
Standardized data (Z-Score Normalization): [-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
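The same computation can be done in a vectorized way with NumPy, which is handy as a quick check of the manual function above. A small sketch using the same example data:

arr = np.array(data)
numpy_standardized = (arr - arr.mean()) / arr.std()  # population standard deviation, matching the function above
print("Standardized with NumPy:", numpy_standardized)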
In Python, we typically use the StandardScaler from the sklearn.preprocessing module to standardize data.
from sklearn.preprocessing import StandardScaler
# Sample data
data = np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7]])

# Create the scaler
scaler = StandardScaler()

# Fit the scaler to the data and transform the data
standardized_data = scaler.fit_transform(data)
print("Original data:")
print(data)
print("\nStandardized data:")
print(standardized_data)
Original data:
[[1 2 3]
[3 4 5]
[5 6 7]]
Standardized data:
[[-1.22474487 -1.22474487 -1.22474487]
[ 0. 0. 0. ]
[ 1.22474487 1.22474487 1.22474487]]
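After fitting, the scaler stores the statistics it learned, which is useful for checking what transformation was applied or for reversing it. A short sketch using the scaler and standardized_data from the cell above:

print("Feature means:", scaler.mean_)                  # per-column means learned from the data
print("Feature standard deviations:", scaler.scale_)   # per-column standard deviations
print("Recovered data:")
print(scaler.inverse_transform(standardized_data))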
boston_scaler = StandardScaler()
standardized_data = boston_scaler.fit_transform(boston_data[['CRIM', 'AGE', 'TAX']])
np.set_printoptions(suppress=True)
standardized_data
array([[-0.41978194, -0.12001342, -0.66660821],
[-0.41733926, 0.36716642, -0.98732948],
[-0.41734159, -0.26581176, -0.98732948],
...,
[-0.41344658, 0.79744934, -0.80321172],
[-0.40776407, 0.73699637, -0.80321172],
[-0.41500016, 0.43473151, -0.80321172]])
Importance:
- Data Normalization:
- Uniform Scaling: Ensures all features are scaled to a similar range, preventing dominance by features with larger scales.
- Improved Convergence: Facilitates faster convergence in optimization algorithms by making the loss surface more symmetric.
- Interpretability: Easier interpretation as values are on a consistent scale, aiding in comparison and understanding of feature importance.
- Data Standardization:
- Mean Centering: Transforms data to have a mean of 0 and a standard deviation of 1, simplifying interpretation of coefficients in linear models.
- Handling Different Scales: Useful when features have different scales or units, making them directly comparable.
- Reducing Sensitivity to Outliers: Less affected by outliers compared to normalization, leading to more robust models.
- Maintaining Information: Preserves relative relationships between data points without altering the distribution shape.
Which one?
The choice between normalization and standardization depends on your data and the requirements of your analysis. Here are some guidelines to help you decide:
- Normalization:
- Use normalization when the scale of features is meaningful and should be preserved.
- Normalize data when you’re working with algorithms that require input features to be on a similar scale, such as algorithms using distance metrics like k-nearest neighbors or clustering algorithms like K-means.
- If the distribution of your data is not Gaussian and you want to scale the features to a fixed range, normalization might be a better choice.
- Standardization:
- Use standardization when the distribution of your data is Gaussian or when you’re unsure about the distribution.
- Standardization is less affected by outliers compared to normalization, making it more suitable when your data contains outliers.
- If you’re working with algorithms that assume your data is normally distributed, such as linear regression or logistic regression, standardization is typically preferred.
In some cases, you might experiment with both approaches and see which one yields better results for your specific dataset and analysis. Additionally, it’s always a good practice to understand your data and the underlying assumptions of the algorithms you’re using to make informed decisions about data preprocessing techniques.
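One convenient way to run such an experiment is to wrap each scaler together with a model in a scikit-learn Pipeline and compare cross-validated scores. Below is a rough sketch, assuming boston_data from above with MEDV as the target and a k-nearest neighbors regressor as an example of a distance-based model:

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

features = boston_data.drop(columns=['MEDV'])
target = boston_data['MEDV']

for scaler in [MinMaxScaler(), StandardScaler()]:
    # Scaling happens inside the pipeline, so each CV fold is scaled using only its own training part
    pipeline = Pipeline([('scaler', scaler), ('model', KNeighborsRegressor())])
    scores = cross_val_score(pipeline, features, target, cv=5, scoring='r2')
    print(f"{scaler.__class__.__name__}: mean R^2 = {scores.mean():.3f}")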