
PyTorch DataLoader

In almost all machine learning tasks, the first step is data loading. ML resources rarely say much about handling the data; they usually focus on preprocessing or classification. But in many practical applications, loading data is very challenging, mainly because of its sheer size.

When the data is too large, traditional approaches such as loading everything into memory are not possible. The data must be loaded in batches, and handling batches is not straightforward. For instance, in medical image processing, images can be huge (e.g., 50k × 50k pixels). To apply a CNN, a window must slide over the image, and consecutive windows should overlap. The window size is usually predefined and small, such as 256×256, so boundary conditions must be handled carefully. In many applications it is also better to feed the data in shuffled order rather than a fixed one. Moreover, choosing the batch size and, most importantly, cross-validation and train/dev splitting are further challenges.
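The boundary handling described above can be sketched in plain Python. The function below computes the top-left origins of overlapping windows along one axis, snapping the last window back so it ends exactly at the image edge; the name `window_origins` and its parameters are illustrative, not from any library:

```python
def window_origins(length, win, stride):
    """Origins of windows of size `win`, stepped by `stride`, along an
    axis of size `length` (assumes win <= length). The final window is
    shifted left so it ends exactly at the boundary."""
    origins = list(range(0, length - win + 1, stride))
    if origins[-1] + win < length:
        origins.append(length - win)  # boundary window overlaps its neighbor
    return origins

# A 1000-pixel axis with 256-pixel windows and 128-pixel stride:
origins = window_origins(1000, 256, 128)
```

For a 2-D image the same function would be applied once per axis and the origins combined in a nested loop.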

PyTorch makes life easy here. In torch, you can find several predefined datasets:

https://pytorch.org/docs/stable/torchvision/datasets.html

Some of these datasets don't contain actual data; instead, they are designed to read data from disk in a specific manner. For example, ImageFolder is one of the most widely used: it receives a directory and derives the class labels from that directory's hierarchy.
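A minimal sketch of that labeling scheme, in plain Python rather than the actual torchvision implementation: each subdirectory of the root becomes a class, sorted alphabetically and mapped to an integer index (the helper `folder_labels` is illustrative):

```python
import os
import tempfile

def folder_labels(root):
    """Mimic ImageFolder's labeling: each subdirectory of `root` is a
    class; classes are sorted alphabetically and mapped to indices."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    class_to_idx = {c: i for i, c in enumerate(classes)}
    samples = [(os.path.join(root, c, f), class_to_idx[c])
               for c in classes
               for f in sorted(os.listdir(os.path.join(root, c)))]
    return class_to_idx, samples

# Build a toy directory: root/cats/a.jpg, root/dogs/b.jpg
root = tempfile.mkdtemp()
for cls, fname in [("cats", "a.jpg"), ("dogs", "b.jpg")]:
    os.makedirs(os.path.join(root, cls), exist_ok=True)
    open(os.path.join(root, cls, fname), "w").close()

class_to_idx, samples = folder_labels(root)
```

With torchvision installed, `torchvision.datasets.ImageFolder(root)` applies the same convention and additionally decodes the images.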

When a torchvision.datasets.ImageFolder instance is fed to torch.utils.data.DataLoader, managing the batch size, shuffling the data, and train-dev splitting are handled by torch. It is also very easy to preprocess the images with a chain of operations, or transforms, in PyTorch's Dataset classes. Moreover, DataLoaders can read the data with parallel worker processes, which makes data loading very fast.
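The same machinery works for any Dataset, not just ImageFolder. A minimal sketch, using a toy tensor dataset so it runs without image files (the class `SquaresDataset` is invented for illustration):

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split

class SquaresDataset(Dataset):
    """Toy dataset: x in [0, n), y = x**2."""
    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        self.y = self.x ** 2

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

dataset = SquaresDataset(100)
# Train/dev split handled by torch:
train_set, dev_set = random_split(dataset, [80, 20])
# Batching and shuffling handled by the DataLoader;
# set num_workers > 0 to read batches in parallel processes.
train_loader = DataLoader(train_set, batch_size=16, shuffle=True,
                          num_workers=0)
batches = list(train_loader)
```

For ImageFolder, the only change would be constructing the dataset from a directory (optionally with a `transform` chain) instead of from tensors.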

These are some of the reasons researchers and developers prefer PyTorch's Dataset and DataLoader classes over writing their own. Spending time adapting our dataset to PyTorch's classes is faster and far more efficient than writing new classes to handle the input data.

Published in Machine Learning
