Normalization vs Standardization

This Jupyter Notebook provides an overview of the importance of data normalization and standardization in preparing data for analysis and modeling.
Author

Juma Shafara

Published

March 1, 2024

Keywords

normalization, standardization, rescaling


Introduction:

In data analysis and machine learning, preprocessing steps such as data normalization and standardization are crucial for improving the performance and interpretability of models.

This Jupyter Notebook provides an overview of the importance of data normalization and standardization in preparing data for analysis and modeling.

import numpy as np  # numerical computing
import dataidea as di  # provides loadDataset() for the sample dataset used below

Normalization

  1. Normalization: Normalization typically refers to rescaling numerical features to a common range, often between 0 and 1 (min-max scaling). This is done by subtracting the minimum value and dividing by the range (maximum minus minimum). Normalization is useful when the distribution of the data does not follow a Gaussian (normal) distribution.
# Data normalization (min-max scaling) without libraries:
def minMaxScaling(data):
    # Find the smallest and largest values in the data
    min_val = min(data)
    max_val = max(data)

    # Rescale each value to the range [0, 1]
    scaled_data = []
    for value in data:
        scaled = (value - min_val) / (max_val - min_val)
        scaled_data.append(scaled)
    return scaled_data
# Example data
data = np.array([10, 20, 30, 40, 50])
normalized_data = minMaxScaling(data)
print("Normalized data (Min-Max Scaling):", normalized_data)
Normalized data (Min-Max Scaling): [0.0, 0.25, 0.5, 0.75, 1.0]
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = np.array([[1, 2], [3, 4], [5, 6]])

# Create the scaler
scaler = MinMaxScaler()

# Fit the scaler to the data and transform the data
normalized_data = scaler.fit_transform(data)

print("Original data:")
print(data)
print("\nNormalized data:")
print(normalized_data)
Original data:
[[1 2]
 [3 4]
 [5 6]]

Normalized data:
[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]
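
If you later need the original values back, a fitted scaler can undo the transformation. Here is a minimal sketch, reusing the scaler and normalized_data objects from the cell above, with sklearn's inverse_transform:

# Recover the original values from the normalized ones
recovered_data = scaler.inverse_transform(normalized_data)
print("Recovered data:")
print(recovered_data)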

Let’s now try it on a real-world dataset!

boston_data = di.loadDataset('boston')
boston_data.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222.0 18.7 396.90 5.33 36.2
boston_scaler = MinMaxScaler()
normalized_data = boston_scaler.fit_transform(boston_data[['CRIM', 'AGE', 'TAX']])
np.set_printoptions(suppress=True)
normalized_data
array([[0.        , 0.64160659, 0.20801527],
       [0.00023592, 0.78269825, 0.10496183],
       [0.0002357 , 0.59938208, 0.10496183],
       ...,
       [0.00061189, 0.90731205, 0.16412214],
       [0.00116073, 0.88980433, 0.16412214],
       [0.00046184, 0.80226571, 0.16412214]])
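
As a quick sanity check, every scaled column should now lie between 0 and 1. A minimal check with NumPy, reusing normalized_data from the cell above:

# Each column of the min-max scaled data should span the range [0, 1]
print("Column minimums:", normalized_data.min(axis=0))
print("Column maximums:", normalized_data.max(axis=0))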

Standardization

  1. Standardization: Standardization, often implemented as z-score standardization, transforms the data to have a mean of 0 and a standard deviation of 1. It does not change the shape of the distribution: if the original data is Gaussian, the standardized data will follow a standard normal distribution; if it is not, it stays non-Gaussian, just recentered and rescaled.
# Data standardization (z-score) without libraries:
def zScoreNormalization(data):
    # Mean and (population) standard deviation of the data
    mean = sum(data) / len(data)
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    std_dev = variance ** 0.5

    # Subtract the mean and divide by the standard deviation
    standardized_data = [(x - mean) / std_dev for x in data]
    return standardized_data
# Example data
data = [10, 20, 30, 40, 50]
standardized_data = zScoreNormalization(data)
print("Standardized data (Z-Score Normalization):", standardized_data)
Standardized data (Z-Score Normalization): [-1.414213562373095, -0.7071067811865475, 0.0, 0.7071067811865475, 1.414213562373095]
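
As a quick check, the standardized values should have a mean of approximately 0 and a standard deviation of approximately 1. A small sketch using NumPy on the list returned above:

# Verify the z-scored values: mean ~ 0 and (population) standard deviation ~ 1
print("Mean:", np.mean(standardized_data))
print("Std :", np.std(standardized_data))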

In Python, we typically use the StandardScaler class from the sklearn.preprocessing module to standardize data.

from sklearn.preprocessing import StandardScaler

# Sample data
data = np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7]])

# Create the scaler
scaler = StandardScaler()

# Fit the scaler to the data and transform the data
standardized_data = scaler.fit_transform(data)

print("Original data:")
print(data)
print("\nStandardized data:")
print(standardized_data)
Original data:
[[1 2 3]
 [3 4 5]
 [5 6 7]]

Standardized data:
[[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]
boston_scaler = StandardScaler()
standardized_data = boston_scaler.fit_transform(boston_data[['CRIM', 'AGE', 'TAX']])
np.set_printoptions(suppress=True)
standardized_data
array([[-0.41978194, -0.12001342, -0.66660821],
       [-0.41733926,  0.36716642, -0.98732948],
       [-0.41734159, -0.26581176, -0.98732948],
       ...,
       [-0.41344658,  0.79744934, -0.80321172],
       [-0.40776407,  0.73699637, -0.80321172],
       [-0.41500016,  0.43473151, -0.80321172]])
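
The same check applies here: after StandardScaler, each column should have a mean of roughly 0 and a standard deviation of roughly 1. A short sketch reusing standardized_data from the cell above:

# Column-wise mean and standard deviation of the standardized features
print("Column means:", standardized_data.mean(axis=0).round(6))
print("Column stds :", standardized_data.std(axis=0).round(6))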

Importance:

  1. Data Normalization:
    • Uniform Scaling: Ensures all features are scaled to a similar range, preventing dominance by features with larger scales (see the sketch after this list).
    • Improved Convergence: Facilitates faster convergence in optimization algorithms by making the loss surface more symmetric.
    • Interpretability: Easier interpretation as values are on a consistent scale, aiding in comparison and understanding of feature importance.
  2. Data Standardization:
    • Mean Centering: Transforms data to have a mean of 0 and a standard deviation of 1, simplifying interpretation of coefficients in linear models.
    • Handling Different Scales: Useful when features have different scales or units, making them directly comparable.
    • Reducing Sensitivity to Outliers: Less affected by outliers compared to normalization, leading to more robust models.
    • Maintaining Information: Preserves relative relationships between data points without altering the distribution shape.
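
To make the uniform scaling point above concrete, here is a small sketch with made-up numbers: two features on very different scales (income in the tens of thousands, age in the tens) and the Euclidean distance between two people before and after min-max scaling. On the raw scale the distance is driven almost entirely by income; after scaling, the 35-year age gap carries real weight.

from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: column 0 is income (large scale), column 1 is age (small scale)
points = np.array([[30000, 25],
                   [32000, 60],
                   [90000, 40]])

# Distance between the first two people on the raw scale
raw_distance = np.linalg.norm(points[0] - points[1])

# The same distance after min-max scaling both columns to [0, 1]
scaled_points = MinMaxScaler().fit_transform(points)
scaled_distance = np.linalg.norm(scaled_points[0] - scaled_points[1])

print("Raw distance:   ", raw_distance)
print("Scaled distance:", scaled_distance)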

Which one?

The choice between normalization and standardization depends on your data and the requirements of your analysis. Here are some guidelines to help you decide:

  1. Normalization:
    • Use normalization when the scale of features is meaningful and should be preserved.
    • Normalize data when you’re working with algorithms that require input features to be on a similar scale, such as algorithms using distance metrics like k-nearest neighbors or clustering algorithms like K-means.
    • If the distribution of your data is not Gaussian and you want to scale the features to a fixed range, normalization might be a better choice.
  2. Standardization:
    • Use standardization when the distribution of your data is Gaussian or when you’re unsure about the distribution.
    • Standardization is less affected by outliers compared to normalization, making it more suitable when your data contains outliers.
    • If you’re working with algorithms that assume your data is normally distributed, such as linear regression or logistic regression, standardization is typically preferred.

In some cases, you might experiment with both approaches and see which one yields better results for your specific dataset and analysis. Additionally, it’s always a good practice to understand your data and the underlying assumptions of the algorithms you’re using to make informed decisions about data preprocessing techniques.
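
One way to run such an experiment is to compare a scale-sensitive model, such as k-nearest neighbors, with and without a scaling step. The sketch below is illustrative only: it generates synthetic data with sklearn's make_classification (standing in for your own X and y), exaggerates the scale of one feature, and compares the cross-validated accuracy of a bare KNN classifier against a pipeline that normalizes first. The exact numbers will differ on your data; the comparison pattern is the point.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for your own features and labels
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Exaggerate the scale of one feature so that distances are dominated by it
X[:, 0] = X[:, 0] * 1000

# KNN without any scaling
knn = KNeighborsClassifier()
print("Accuracy without scaling:", cross_val_score(knn, X, y, cv=5).mean())

# KNN with min-max normalization applied inside a pipeline
scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
print("Accuracy with scaling:   ", cross_val_score(scaled_knn, X, y, cv=5).mean())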

What’s on your mind? Put it in the comments!
