Simple Dataset

In this lab, you will construct a basic dataset by using PyTorch and learn how to apply basic transformations to it.

Author

Juma Shafara

Published

September 1, 2023

Modified

September 2, 2024

Keywords

custom dataset classes in pytorch, Simple dataset, Transforms, Compose

Objective

How to create a dataset in PyTorch.
How to apply transformations to the dataset.

Simple dataset
Transforms
Compose

Estimated Time Needed: 30 min

Preparation

The following are the libraries we are going to use for this lab. torch.manual_seed() seeds PyTorch's random number generator so that random operations produce the same values every time the code is rerun.

# These are the libraries that will be used for this lab.

import torch
from torch.utils.data import Dataset
torch.manual_seed(1)

<torch._C.Generator at 0x71849b542f70>
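To see the effect of the seed, the minimal sketch below draws random numbers twice and resets the seed before each draw. The variable names first_draw and second_draw are illustrative and not part of the lab.

```python
import torch

# Reset the global seed, then draw three random numbers.
torch.manual_seed(1)
first_draw = torch.rand(3)

# Resetting the same seed restarts the generator from the same state.
torch.manual_seed(1)
second_draw = torch.rand(3)

# Both draws contain identical values.
print(torch.equal(first_draw, second_draw))  # True
```

Without the second call to torch.manual_seed(1), the generator would continue from its current state and the two draws would generally differ.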

Simple dataset

Let us try to create our own dataset class.

# Define class for dataset

class toy_set(Dataset):
    
    # Constructor with default values
    def __init__(self, length = 10, transform = None):
        self.len = length
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)
        self.transform = transform
     
    # Getter
    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)     
        return sample
    
    # Get Length
    def __len__(self):
        return self.len

Now, let us create our toy_set object and find the value at index 0 and the length of the initial dataset:

# Create Dataset Object. Find out the value on index 0. Find out the length of Dataset Object.

our_dataset = toy_set()
print("Our toy_set object: ", our_dataset)
print("Value on index 0 of our toy_set object: ", our_dataset[0])
print("Our toy_set length: ", len(our_dataset))

Our toy_set object:  <__main__.toy_set object at 0x7184a1b0ec60>
Value on index 0 of our toy_set object:  (tensor([2., 2.]), tensor([1.]))
Our toy_set length:  10

As a result, we can use the same indexing convention as a list and apply the function len to the toy_set object. The indexing and length behavior are customized by def __getitem__(self, index) and def __len__(self).
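The link between the bracket syntax, len(), and these two special methods can be shown without PyTorch. The sketch below uses a hypothetical Countdown class that exists only for illustration.

```python
class Countdown:
    """Hypothetical container used only to show the two special methods."""

    def __init__(self, start):
        self.values = list(range(start, 0, -1))  # for start=3: [3, 2, 1]

    def __getitem__(self, index):
        # Called for countdown[index]
        return self.values[index]

    def __len__(self):
        # Called for len(countdown)
        return len(self.values)


countdown = Countdown(3)

# The bracket syntax and the explicit method call are equivalent.
print(countdown[1], countdown.__getitem__(1))  # 2 2
print(len(countdown), countdown.__len__())     # 3 3
```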

Now, let us print out the first 3 elements and assign them to x and y:

# Use loop to print out first 3 elements in dataset

for i in range(3):
    x, y=our_dataset[i]
    print("index: ", i, '; x:', x, '; y:', y)

index:  0 ; x: tensor([2., 2.]) ; y: tensor([1.])
index:  1 ; x: tensor([2., 2.]) ; y: tensor([1.])
index:  2 ; x: tensor([2., 2.]) ; y: tensor([1.])

The dataset object is iterable, so we can apply the loop directly to the dataset object:

for x,y in our_dataset:
    print(' x:', x, 'y:', y)

 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
 x: tensor([2., 2.]) y: tensor([1.])
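The loop works even though toy_set defines no __iter__ method. When a class only provides __getitem__, Python falls back to calling it with 0, 1, 2, and so on until an IndexError is raised. In toy_set, that error comes from indexing self.x past its last row. The sketch below shows the same mechanism with a hypothetical class and no PyTorch.

```python
class FirstThreeSquares:
    """Hypothetical class that defines __getitem__ but not __iter__."""

    def __getitem__(self, index):
        if index >= 3:
            # Signals the end of the sequence to the for loop.
            raise IndexError(index)
        return index * index


# Python calls __getitem__(0), __getitem__(1), ... until IndexError is raised.
squares = [value for value in FirstThreeSquares()]
print(squares)  # [0, 1, 4]
```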

An existing dataset

For learning purposes, we will use a simple dataset called music from the dataidea package. It is made up of two features, age and gender, and an outcome variable, genre.

import dataidea

# load dataset
music_data = dataidea.loadDataset('music')

# get features and target
features = music_data.drop('genre', axis=1)
target = music_data['genre']

# display sample
music_data.sample(n=3)

	age	gender	genre
4	29	1	Jazz
5	30	1	Jazz
18	35	1	Classical

# let's encode the target as integers
from sklearn.preprocessing import LabelEncoder

target = LabelEncoder().fit_transform(target)
print(f'Encoded Genre: {target}')

Encoded Genre: [3 3 3 4 4 4 1 1 1 2 2 2 0 0 0 1 1 1 1 1]
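LabelEncoder assigns the integer codes by sorting the unique labels, and it keeps the sorted labels in its classes_ attribute. The pure-Python sketch below mimics that mapping on a few made-up labels. It is an illustration of the idea, not scikit-learn's implementation, and the genres list is hypothetical.

```python
# Hypothetical labels; the real genre column comes from the music dataset.
genres = ["Jazz", "Classical", "Jazz", "Dance"]

# Unique labels in sorted order; the position of each label is its integer code.
classes = sorted(set(genres))
label_to_code = {label: code for code, label in enumerate(classes)}

# Encode the labels, then map the codes back to the original labels.
encoded = [label_to_code[label] for label in genres]
decoded = [classes[code] for code in encoded]

print(classes)  # ['Classical', 'Dance', 'Jazz']
print(encoded)  # [2, 0, 2, 1]
```

Keeping the sorted list of classes around makes it possible to turn a predicted code back into a readable genre name.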

We can create a custom class for this dataset, as demonstrated below:

class MusicDataset(Dataset):
    
    def __init__(self):
        # features (DataFrame) and target (encoded NumPy array) are the variables defined above
        self.features = torch.tensor(features.values, dtype=torch.float32)
        self.target = torch.from_numpy(target)

    def __getitem__(self, index):
        x = self.features[index]
        y = self.target[index]
        return x, y

    def __len__(self):
        return len(self.features)
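MusicDataset reads features and target from variables defined outside the class. One possible variation, sketched below, is to pass the data through the constructor so the same class can wrap any table. The class name TabularDataset and the example rows are hypothetical and do not come from the music dataset.

```python
import torch
from torch.utils.data import Dataset


class TabularDataset(Dataset):
    """Hypothetical variant that receives its data through the constructor."""

    def __init__(self, features, target):
        self.features = torch.tensor(features, dtype=torch.float32)
        self.target = torch.tensor(target, dtype=torch.long)

    def __getitem__(self, index):
        return self.features[index], self.target[index]

    def __len__(self):
        return len(self.features)


# Made-up rows in the (age, gender) layout, with made-up integer labels.
example_features = [[20, 1], [25, 0], [31, 1]]
example_target = [3, 1, 4]

example_dataset = TabularDataset(example_features, example_target)
print(len(example_dataset))  # 3
print(example_dataset[1])    # features and label of the second row
```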

Now let’s create a MusicDataset object and use its methods to access some data and information:

music_torch_dataset = MusicDataset()

# Row 0 data
print(f"Row 0: {music_torch_dataset[0]}")

# display no of rows
print(f"Number of rows: {len(music_torch_dataset)}")

Row 0: (tensor([20.,  1.]), tensor(3))
Number of rows: 20

Let’s have a look at the first 5 rows. The loop variable is used as the index so that each iteration reads a different row:

for row in range(5):
    print(f"Row {row}: {music_torch_dataset[row]}")

Practice

Try to create a toy_set object with length 50. Print out the length of your object.

# Practice: Create a new object with length 50, and print the length of object out.

# Type your code here

Transforms

You can also create a class for transforming the data. In this case, we will try to add 1 to x and multiply y by 2:

# Create transform class add_mult

class add_mult(object):
    
    # Constructor
    def __init__(self, addx = 1, muly = 2):
        self.addx = addx
        self.muly = muly
    
    # Executor
    def __call__(self, sample):
        x = sample[0]
        y = sample[1]
        x = x + self.addx
        y = y * self.muly
        sample = x, y
        return sample

Now, create a transform object:

# Create an add_mult transform object, and a toy_set object

a_m = add_mult()
data_set = toy_set()

Assign the outputs of the original dataset to x and y. Then, apply the transform add_mult to the dataset and output the values as x_ and y_, respectively:

# Use loop to print out first 10 elements in dataset

for i in range(10):
    x, y = data_set[i]
    print('Index: ', i, 'Original x: ', x, 'Original y: ', y)
    x_, y_ = a_m(data_set[i])
    print('Index: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)

As a result, 1 has been added to x and y has been multiplied by 2: [2, 2] + 1 = [3, 3] and [1] x 2 = [2].
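The arithmetic can be checked directly on two small tensors. This is a minimal sketch in which x and y are created by hand instead of being read from the dataset.

```python
import torch

x = torch.tensor([2.0, 2.0])
y = torch.tensor([1.0])

# Adding a scalar broadcasts over every element of x.
x_new = x + 1
# Multiplying by a scalar scales every element of y.
y_new = y * 2

print(x_new)  # tensor([3., 3.])
print(y_new)  # tensor([2.])
```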

We can also attach the transform when we create a new toy_set object. Remember, the constructor of the toy_set class has the parameter transform = None. When we create a new object, we can pass the transform object to this parameter, as the following code demonstrates.

# Create a new data_set object with add_mult object as transform

cust_data_set = toy_set(transform = a_m)

This applies the a_m transform to each element of cust_data_set whenever the element is accessed. Let us print out the first 10 elements of cust_data_set to check that a_m is applied:

# Use loop to print out first 10 elements in dataset

for i in range(10):
    x, y = data_set[i]
    print('Index: ', i, 'Original x: ', x, 'Original y: ', y)
    x_, y_ = cust_data_set[i]
    print('Index: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)

The result is the same as the previous method.
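Because the transform is called inside __getitem__, it runs each time an element is read, and the tensors stored in the dataset stay unchanged. The sketch below repeats the toy_set class so that it runs on its own. The function add_one_double is a hypothetical function-style transform with the same effect as add_mult().

```python
import torch
from torch.utils.data import Dataset


class toy_set(Dataset):
    # Same structure as the toy_set class defined earlier in the lab.
    def __init__(self, length=10, transform=None):
        self.len = length
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)
        self.transform = transform

    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return self.len


def add_one_double(sample):
    # Hypothetical function-style transform: x + 1 and y * 2.
    x, y = sample
    return x + 1, y * 2


lazy_data_set = toy_set(transform=add_one_double)

x_, y_ = lazy_data_set[0]
print(x_, y_)              # transformed values returned by __getitem__
print(lazy_data_set.x[0])  # the stored row is still tensor([2., 2.])
```

Any callable that takes a sample and returns a sample can serve as a transform here, since toy_set only calls self.transform(sample).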

# Practice: Construct your own my_add_mult transform. Apply my_add_mult on a new toy_set object. Print out the first three elements from the transformed dataset.

# Type your code here.

Compose

You can compose multiple transforms on the dataset object. First, import transforms from torchvision:

# Run the command below when you do not have torchvision installed
# !mamba install -y torchvision

from torchvision import transforms

Then, create a new transform class that multiplies each of the elements by 100:

# Create transform class mult

class mult(object):
    
    # Constructor
    def __init__(self, mult = 100):
        self.mult = mult
        
    # Executor
    def __call__(self, sample):
        x = sample[0]
        y = sample[1]
        x = x * self.mult
        y = y * self.mult
        sample = x, y
        return sample

Now let us try to combine the transforms add_mult and mult:

# Combine the add_mult() and mult()

data_transform = transforms.Compose([add_mult(), mult()])
print("The combination of transforms (Compose): ", data_transform)

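The new Compose object applies the transforms sequentially, in the order they appear in the list: add_mult() runs first and its output is passed to mult(), as shown in this figure:

[Figure: Compose PyTorch]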

data_transform(data_set[0])

x, y = data_set[0]
x_, y_ = data_transform(data_set[0])
print('Original x: ', x, 'Original y: ', y)
print('Transformed x_:', x_, 'Transformed y_:', y_)

Now we can pass the new Compose object (the combination of the transforms add_mult() and mult()) to the constructor when creating a toy_set object.

# Create a new toy_set object with compose object as transform

compose_data_set = toy_set(transform = data_transform)

Let us print out the first 3 elements in different toy_set datasets in order to compare the output after different transforms have been applied:

# Use loop to print out first 3 elements in dataset

for i in range(3):
    x, y = data_set[i]
    print('Index: ', i, 'Original x: ', x, 'Original y: ', y)
    x_, y_ = cust_data_set[i]
    print('Index: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)
    x_co, y_co = compose_data_set[i]
    print('Index: ', i, 'Compose Transformed x_co: ', x_co ,'Compose Transformed y_co: ',y_co)

Let us see what happened at index 0. The original value of x is [2, 2], and the original value of y is [1]. With only add_mult() applied, x becomes [3, 3] and y becomes [2]. With both add_mult() and mult() applied, x is [300, 300] and y is [200]. The calculation equivalent to the Compose is x = ([2, 2] + 1) x 100 = [300, 300] and y = ([1] x 2) x 100 = [200].
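The chaining can be sketched in plain Python. The compose function below is a minimal stand-in for transforms.Compose, assuming only that each transform is applied in list order. The names compose, add_mult_fn, and mult_fn are hypothetical, and scalars stand in for the tensors.

```python
def compose(transform_list):
    """Minimal stand-in for transforms.Compose: apply each callable in order."""
    def apply(sample):
        for transform in transform_list:
            sample = transform(sample)
        return sample
    return apply


def add_mult_fn(sample):
    # Function version of add_mult(): x + 1 and y * 2.
    x, y = sample
    return x + 1, y * 2


def mult_fn(sample):
    # Function version of mult(): multiply both values by 100.
    x, y = sample
    return x * 100, y * 100


sample = (2, 1)  # scalar stand-ins for x = [2, 2] and y = [1]

print(compose([add_mult_fn, mult_fn])(sample))  # (300, 200)
print(compose([mult_fn, add_mult_fn])(sample))  # (201, 200)
```

The second call shows that the order of the list matters: multiplying first gives 2 x 100 + 1 = 201 for x instead of 300.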

Practice

Try to combine mult() and add_mult() so that mult() is executed first. Apply this to a new toy_set dataset, and print out the first 3 elements of the transformed dataset.

# Practice: Make a compose in which mult() executes first and then add_mult(). Apply the compose on toy_set dataset. Print out the first 3 elements in the transformed dataset.

# Type your code here.
