    Momentum

    Objective for this Notebook

    1. Learn how momentum helps with saddle points, local minima, and noise

    Table of Contents

    In this lab, you will deal with several problems associated with optimization and see how momentum can improve your results.

    • Saddle Points
    • Local Minima
    • Noise

    Estimated Time Needed: 25 min


    Preparation

    Import the following libraries that you'll use for this lab:

    In [ ]:
    # These are the libraries that will be used for this lab.
    
    import torch 
    import torch.nn as nn
    import matplotlib.pylab as plt
    import numpy as np
    
    torch.manual_seed(0)
    

    This function will plot a cubic function and the parameter values obtained via Gradient Descent.

    In [ ]:
    # Plot the cubic function and the parameter values found by gradient descent
    
    def plot_cubic(w, optimizer):
        LOSS = []
        # sweep the parameter over a grid of values to trace the loss function l(w)
        W = torch.arange(-4, 4, 0.1)
        for w.state_dict()['linear.weight'][0] in W:
            LOSS.append(cubic(w(torch.tensor([[1.0]]))).item())
        # reset the parameter to its starting value
        w.state_dict()['linear.weight'][0] = 4.0
        n_epochs = 10
        parameter = []
        loss_list = []
    
        # run n_epochs iterations of gradient descent, recording the parameter and loss
        for n in range(n_epochs):
            optimizer.zero_grad()
            loss = cubic(w(torch.tensor([[1.0]])))
            loss_list.append(loss.item())
            parameter.append(w.state_dict()['linear.weight'][0].item())
            loss.backward()
            optimizer.step()
        plt.plot(parameter, loss_list, 'ro', label='parameter values')
        plt.plot(W.numpy(), LOSS, label='objective function')
        plt.xlabel('w')
        plt.ylabel('l(w)')
        plt.legend()
    

    This function will plot a 4th order function and the parameter values obtained via Gradient Descent. You can also add Gaussian noise with a standard deviation determined by the parameter std.

    In [ ]:
    # Plot the fourth order function and the parameter values
    
    def plot_fourth_order(w, optimizer, std=0, color='r', paramlabel='parameter values', objfun=True):
        # sweep the parameter over a grid of values to trace the loss function l(w)
        W = torch.arange(-4, 6, 0.1)
        LOSS = []
        for w.state_dict()['linear.weight'][0] in W:
            LOSS.append(fourth_order(w(torch.tensor([[1.0]]))).item())
        # reset the parameter to its starting value
        w.state_dict()['linear.weight'][0] = 6.0
        n_epochs = 100
        parameter = []
        loss_list = []
    
        # run n_epochs iterations of gradient descent, optionally adding Gaussian noise to the loss
        for n in range(n_epochs):
            optimizer.zero_grad()
            loss = fourth_order(w(torch.tensor([[1.0]]))) + std * torch.randn(1, 1)
            loss_list.append(loss.item())
            parameter.append(w.state_dict()['linear.weight'][0].item())
            loss.backward()
            optimizer.step()
    
        # Plotting
        if objfun:
            plt.plot(W.numpy(), LOSS, label='objective function')
        plt.plot(parameter, loss_list, 'o', label=paramlabel, color=color)
        plt.xlabel('w')
        plt.ylabel('l(w)')
        plt.legend()
    

    This is a custom module. It behaves like a single parameter value. We do it this way so we can use PyTorch's built-in optimizers.

    In [ ]:
    # Create a linear model
    
    class one_param(nn.Module):
        
        # Constructor
        def __init__(self, input_size, output_size):
            super(one_param, self).__init__()
            self.linear = nn.Linear(input_size, output_size, bias=False)
            
        # Prediction
        def forward(self, x):
            yhat = self.linear(x)
            return yhat
    

    We create an object w. When we call the object with an input of one, it behaves like an individual parameter value, i.e. w(1) is analogous to $w$.

    In [ ]:
    # Create a one_param object
    
    w = one_param(1, 1)
    
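    As a quick sanity check (a minimal sketch using the w just created above), calling the model on an input of 1.0 simply returns the current weight, since there is no bias term:

    # Sanity check: with no bias, the forward pass computes weight * 1.0
    x = torch.tensor([[1.0]])
    print(w(x))                               # model output
    print(w.state_dict()['linear.weight'])    # the underlying parameter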

    Saddle Points

    Let's create a cubic function with a saddle point.

    In [ ]:
    # Define a function to output a cubic 
    
    def cubic(yhat):
        out = yhat ** 3
        return out
    
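    For this cubic, $l(w) = w^3$, the derivative $3w^2$ vanishes at $w = 0$, yet $w = 0$ is neither a minimum nor a maximum: it is a saddle (inflection) point, so plain gradient descent slows to a crawl as it approaches it.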

    We create an optimizer with no momentum term.

    In [ ]:
    # Create an optimizer without momentum
    
    optimizer = torch.optim.SGD(w.parameters(), lr=0.01, momentum=0)
    

    We run several iterations of stochastic gradient descent and plot the results. The parameter values get stuck near the saddle point, where the gradient is close to zero and the updates become vanishingly small.

    In [ ]:
    # Plot the model
    
    plot_cubic(w, optimizer)
    

    We create an optimizer with a momentum term of 0.9.
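
    For reference, SGD with momentum (as implemented in torch.optim.SGD with the default dampening of 0) keeps a running velocity and updates the parameter with it instead of the raw gradient: $v_{t+1} = \mu v_t + \nabla l(w_t)$ and $w_{t+1} = w_t - \eta v_{t+1}$, where $\mu$ is the momentum term (here 0.9) and $\eta$ is the learning rate. Because past gradients keep contributing to the step, the parameter can coast through flat regions and shallow minima.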

    In [ ]:
    # Create an optimizer with momentum
    
    optimizer = torch.optim.SGD(w.parameters(), lr=0.01, momentum=0.9)
    

    We run several iterations of stochastic gradient descent with momentum and plot the results. We see the parameter values do not get stuck in the saddle point.

    In [ ]:
    # Plot the model
    
    plot_cubic(w, optimizer)
    

    Local Minima

    In this section, we will create a fourth order polynomial with a local minimum at 4 and a global minimum at -2. We will then see how the momentum parameter affects convergence to the global minimum. The fourth order polynomial is given by:
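
    $l(w) = 2w^4 - 9w^3 - 21w^2 + 88w + 48$

    (reading the coefficients off the cell below: its derivative vanishes at $w = -2$, $w = 11/8$, and $w = 4$, with the global minimum $l(-2) = -108$ and the local minimum $l(4) = 0$).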

    In [ ]:
    # Create a function to calculate the fourth order polynomial 
    
    def fourth_order(yhat): 
        out = torch.mean(2 * (yhat ** 4) - 9 * (yhat ** 3) - 21 * (yhat ** 2) + 88 * yhat + 48)
        return out
    

    We create an optimizer with no momentum term. We run several iterations of stochastic gradient descent and plot the results. We see the parameter values get stuck in the local minimum.

    In [ ]:
    # Make the prediction without momentum
    
    optimizer = torch.optim.SGD(w.parameters(), lr=0.001)
    plot_fourth_order(w, optimizer)
    

    We create an optimizer with a momentum term of 0.9. We run several iterations of stochastic gradient descent and plot the results. We see the parameter values reach a global minimum.

    In [ ]:
    # Make the prediction with momentum
    
    optimizer = torch.optim.SGD(w.parameters(), lr=0.001, momentum=0.9)
    plot_fourth_order(w, optimizer)
    

    Noise

    In this section, we will create a fourth order polynomial with a local minimum at 4 and a global minimum at -2, but this time we will add Gaussian noise to the loss at each iteration. We will then see how the momentum parameter affects convergence to the global minimum.
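
    Concretely, at each step the loss becomes $\tilde{l}(w) = l(w) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \text{std}^2)$, implemented in plot_fourth_order above by adding std * torch.randn(1, 1) to the loss.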

    With no momentum, we get stuck in a local minimum.

    In [ ]:
    Copied!
    # Make the prediction without momentum when there is noise
    
    optimizer = torch.optim.SGD(w.parameters(), lr=0.001)
    plot_fourth_order(w, optimizer, std=10)
    

    With momentum, we reach the global minimum.

    In [ ]:
    Copied!
    # Make the prediction with momentum when there is noise
    
    optimizer = torch.optim.SGD(w.parameters(), lr=0.001, momentum=0.9)
    plot_fourth_order(w, optimizer, std=10)
    

    Practice

    Create two SGD optimizers with a learning rate of 0.001: use the default momentum value for one and a momentum of 0.9 for the other. Use the function plot_fourth_order with std=100 to plot the steps of each. Make sure you run each call in its own cell.

    In [ ]:
    # Practice: Create two SGD optimizer with lr = 0.001, and one without momentum and the other with momentum = 0.9. Plot the result out.
    
    # Type your code here
    

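    One possible approach (a sketch following the pattern of the cells above, not necessarily the intended solution):

    # SGD with the default momentum (run this in its own cell)
    optimizer = torch.optim.SGD(w.parameters(), lr=0.001)
    plot_fourth_order(w, optimizer, std=100)

    # SGD with momentum = 0.9 (run this in a separate cell)
    optimizer = torch.optim.SGD(w.parameters(), lr=0.001, momentum=0.9)
    plot_fourth_order(w, optimizer, std=100)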


    June 3, 2025

    © 2025 DATAIDEA. All rights reserved. Built with ❤️ by Juma Shafara.