Juma Shafara
September 1, 2023
September 2, 2024
custom dataset classes in pytorch, Simple dataset, Transforms, Compose
In this lab, you will construct a basic dataset by using PyTorch and learn how to apply basic transformations to it.
Estimated Time Needed: 30 min
The following are the libraries we are going to use for this lab. torch.manual_seed() seeds PyTorch's random number generator so that random operations produce the same values every time the code is run.
# These are the libraries that will be used for this lab.
import torch
from torch.utils.data import Dataset
torch.manual_seed(1)
<torch._C.Generator at 0x71849b542f70>
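As a quick check of what seeding does, here is a minimal sketch (the variable names first_draw and second_draw are only illustrative). It draws random numbers, resets the seed to the same value, and draws again:

```python
import torch

# Seed the global random number generator, then draw three random numbers.
torch.manual_seed(1)
first_draw = torch.rand(3)

# Re-seeding with the same value restarts the sequence, so the draw repeats.
torch.manual_seed(1)
second_draw = torch.rand(3)

print(torch.equal(first_draw, second_draw))  # True
```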
Let us try to create our own dataset class.
# Define class for dataset
class toy_set(Dataset):

    # Constructor with default values
    def __init__(self, length = 10, transform = None):
        self.len = length
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)
        self.transform = transform

    # Getter
    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)
        return sample

    # Get Length
    def __len__(self):
        return self.len
Now, let us create our toy_set object, and find out the value at index 0 and the length of the initial dataset.
# Create Dataset Object. Find out the value on index 0. Find out the length of Dataset Object.
our_dataset = toy_set()
print("Our toy_set object: ", our_dataset)
print("Value on index 0 of our toy_set object: ", our_dataset[0])
print("Our toy_set length: ", len(our_dataset))
Our toy_set object: <__main__.toy_set object at 0x7184a1b0ec60>
Value on index 0 of our toy_set object: (tensor([2., 2.]), tensor([1.]))
Our toy_set length: 10
As a result, we can apply the same indexing convention as a list, and apply the function len on the toy_set object. We are able to customize the indexing and length methods by defining def __getitem__(self, index) and def __len__(self).
Now, let us print out the first 3 elements and assign them to x and y:
# Use loop to print out first 3 elements in dataset
for i in range(3):
    x, y = our_dataset[i]
    print("index: ", i, '; x:', x, '; y:', y)
index: 0 ; x: tensor([2., 2.]) ; y: tensor([1.])
index: 1 ; x: tensor([2., 2.]) ; y: tensor([1.])
index: 2 ; x: tensor([2., 2.]) ; y: tensor([1.])
The dataset object is iterable; as a result, we can apply the loop directly on the dataset object:
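A minimal sketch of such a loop is shown here, with toy_set restated so the block runs on its own. Because toy_set defines __getitem__ but no __iter__, Python falls back to calling __getitem__ with 0, 1, 2, ... until an IndexError is raised, which happens once the index runs past the end of the tensors.

```python
import torch
from torch.utils.data import Dataset

# toy_set as defined earlier, restated so this block runs on its own.
class toy_set(Dataset):
    def __init__(self, length=10, transform=None):
        self.len = length
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)
        self.transform = transform

    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return self.len

our_dataset = toy_set()

# Loop directly over the dataset; each element unpacks into x and y.
for x, y in our_dataset:
    print(' x:', x, 'y:', y)
```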
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
x: tensor([2., 2.]) y: tensor([1.])
For learning purposes, we will use a simple dataset called music from the dataidea package. It is made up of two features, age and gender, and an outcome variable, genre.
import dataidea
# load dataset
music_data = dataidea.loadDataset('music')
# get features and target
features = music_data.drop('genre', axis=1)
target = music_data['genre']
# display sample
music_data.sample(n=3)
| | age | gender | genre |
|---|---|---|---|
| 4 | 29 | 1 | Jazz |
| 5 | 30 | 1 | Jazz |
| 18 | 35 | 1 | Classical |
# let's encode the target
from sklearn.preprocessing import LabelEncoder
target = LabelEncoder().fit_transform(target)
print(f'Encoded Genre: {target}')
Encoded Genre: [3 3 3 4 4 4 1 1 1 2 2 2 0 0 0 1 1 1 1 1]
We can create a custom Class for this dataset as demonstrated below
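One way such a class could look is sketched below. This is an assumption about the implementation rather than the original cell: it reads the module-level features and target variables and converts them to tensors. The four stand-in rows exist only so the block runs on its own; when working through the lab, leave those two lines out so that the features and target created above are used.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

# Stand-in data so this sketch runs on its own (assumption, not the full dataset).
# In the lab, omit these two lines and use `features` and `target` from above.
features = np.array([[20, 1], [29, 1], [30, 1], [35, 1]])
target = np.array([3, 4, 4, 1])

class MusicDataset(Dataset):
    # Constructor: convert the features and encoded target to tensors
    def __init__(self):
        # np.asarray accepts both a pandas DataFrame and a NumPy array
        self.features = torch.as_tensor(np.asarray(features), dtype=torch.float32)
        self.target = torch.as_tensor(np.asarray(target), dtype=torch.long)

    # Getter: return one (features, label) pair
    def __getitem__(self, index):
        return self.features[index], self.target[index]

    # Get Length: number of rows
    def __len__(self):
        return len(self.features)

print(MusicDataset()[0])
```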
Now let’s create a MusicDataset object and use its methods to access some data and info
music_torch_dataset = MusicDataset()
# Row 0 data
print(f"Row 0: {music_torch_dataset[0]}")
# display no of rows
print(f"Number of rows: {len(music_torch_dataset)}")
Row 0: (tensor([20., 1.]), tensor(3))
Number of rows: 20
Let’s have a look at the first 5 rows
Row 0: (tensor([20., 1.]), tensor(3))
Row 1: (tensor([20., 1.]), tensor(3))
Row 2: (tensor([20., 1.]), tensor(3))
Row 3: (tensor([20., 1.]), tensor(3))
Row 4: (tensor([20., 1.]), tensor(3))
Try to create a toy_set object with length 50. Print out the length of your object.
You can also create a class for transforming the data. In this case, we will try to add 1 to x and multiply y by 2:
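A minimal sketch of such a transform class follows. The parameter names addx and muly are assumptions chosen for illustration. The transform is tried here on a hand-built sample shaped like one toy_set element.

```python
import torch

# Transform class: add a constant to x and multiply y by a constant
class add_mult(object):

    # Constructor: addx and muly are illustrative parameter names
    def __init__(self, addx=1, muly=2):
        self.addx = addx
        self.muly = muly

    # Executor: called as transform(sample), where sample is an (x, y) tuple
    def __call__(self, sample):
        x, y = sample
        x = x + self.addx
        y = y * self.muly
        return x, y

# Try it on a hand-built sample shaped like one toy_set element
sample = (2 * torch.ones(2), torch.ones(1))
x_, y_ = add_mult()(sample)
print('x_:', x_, 'y_:', y_)
```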
Now, create a transform object. Then assign the outputs of the original dataset to x and y, apply the transform add_mult to the dataset, and output the values as x_ and y_, respectively:
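These steps can be sketched as follows. The classes are restated compactly so the block runs on its own, and the add_mult parameter names are assumptions.

```python
import torch
from torch.utils.data import Dataset

# toy_set and add_mult restated compactly so this block runs on its own.
class toy_set(Dataset):
    def __init__(self, length=10, transform=None):
        self.len = length
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)
        self.transform = transform

    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return self.len

class add_mult(object):
    def __init__(self, addx=1, muly=2):
        self.addx = addx
        self.muly = muly

    def __call__(self, sample):
        x, y = sample
        return x + self.addx, y * self.muly

# Create the transform object and an untransformed dataset
a_m = add_mult()
data_set = toy_set()

# Compare the original and transformed values of the first 3 elements
for i in range(3):
    x, y = data_set[i]
    print('Index: ', i, 'Original x: ', x, 'Original y: ', y)
    x_, y_ = a_m(data_set[i])
    print('Index: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)
```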
As a result, 1 has been added to x and y has been multiplied by 2, as [2, 2] + 1 = [3, 3] and [1] x 2 = [2].
We can apply the transform object every time we create a new toy_set object. Remember, the constructor in the toy_set class has the parameter transform = None. When we create a new object using the constructor, we can assign the transform object to the parameter transform, for example toy_set(transform = a_m).
With the a_m object (a transform) passed to the constructor, it is applied to every element of cust_data_set when that element is accessed. Let us print out the first 10 elements of cust_data_set in order to see whether a_m has been applied:
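A sketch of this step, under the same assumptions as before (classes restated compactly, add_mult parameter names illustrative), could look like this:

```python
import torch
from torch.utils.data import Dataset

# toy_set and add_mult restated compactly so this block runs on its own.
class toy_set(Dataset):
    def __init__(self, length=10, transform=None):
        self.len = length
        self.x = 2 * torch.ones(length, 2)
        self.y = torch.ones(length, 1)
        self.transform = transform

    def __getitem__(self, index):
        sample = self.x[index], self.y[index]
        if self.transform:
            sample = self.transform(sample)
        return sample

    def __len__(self):
        return self.len

class add_mult(object):
    def __init__(self, addx=1, muly=2):
        self.addx = addx
        self.muly = muly

    def __call__(self, sample):
        x, y = sample
        return x + self.addx, y * self.muly

a_m = add_mult()
data_set = toy_set()

# Pass the transform to the constructor so it runs inside __getitem__
cust_data_set = toy_set(transform=a_m)

# Use loop to print out first 10 elements in dataset
for i in range(10):
    x, y = data_set[i]
    print('Index: ', i, 'Original x: ', x, 'Original y: ', y)
    x_, y_ = cust_data_set[i]
    print('Index: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)
```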
The result is the same as the previous method.
You can compose multiple transforms on the dataset object. First, import transforms from torchvision:
Then, create a new transform class that multiplies each of the elements by 100:
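A minimal sketch of such a class is shown here. The constructor parameter name mult and its default of 100 are assumptions for illustration. The transform is tried on a hand-built sample shaped like one toy_set element.

```python
import torch

# Transform class: multiply both x and y by the same constant
class mult(object):

    # Constructor: the parameter name is an illustrative choice
    def __init__(self, mult=100):
        self.mult = mult

    # Executor: called as transform(sample), where sample is an (x, y) tuple
    def __call__(self, sample):
        x, y = sample
        x = x * self.mult
        y = y * self.mult
        return x, y

# Try it on a hand-built sample shaped like one toy_set element
sample = (2 * torch.ones(2), torch.ones(1))
x_, y_ = mult()(sample)
print('x_:', x_, 'y_:', y_)
```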
Now let us try to combine the transforms add_mult and mult:
The new Compose object will perform each transform in sequence, in the order in which they are listed, as shown in this figure:
Now we can pass the new Compose object (the combination of the transforms add_mult() and mult()) to the constructor when creating a toy_set object.
Let us print out the first 3 elements in the different toy_set datasets in order to compare the output after different transforms have been applied:
# Use loop to print out first 3 elements in dataset
for i in range(3):
    x, y = data_set[i]
    print('Index: ', i, 'Original x: ', x, 'Original y: ', y)
    x_, y_ = cust_data_set[i]
    print('Index: ', i, 'Transformed x_:', x_, 'Transformed y_:', y_)
    x_co, y_co = compose_data_set[i]
    print('Index: ', i, 'Compose Transformed x_co: ', x_co, 'Compose Transformed y_co: ', y_co)
Let us see what happened at index 0. The original value of x is [2, 2], and the original value of y is [1]. If we only apply add_mult() to the original dataset, then x becomes [3, 3] and y becomes [2]. Now let us see the values after applying both add_mult() and mult(). The result is x = [300, 300] and y = [200]. The calculation equivalent to the compose is x = ([2, 2] + 1) x 100 = [300, 300], y = ([1] x 2) x 100 = [200].
Try to combine mult() and add_mult() so that mult() is executed first, and apply this to a new toy_set dataset. Print out the first 3 elements in the transformed dataset.