Learn how to ingest and perform basic ML operations on the MNIST dense dataset.
A dense dataset is one in which most of the elements in the data matrix or array have nonzero values, so it has little to no sparsity.
Dense datasets arise in many kinds of data, including numerical data, images, and text, where most of the features or dimensions are relevant and contribute to the information content of the dataset.
Dense datasets often require more memory and computational resources to process and analyze than sparse ones, because nearly every value must be stored and processed.
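As a rough illustration of the definition above, the following sketch measures the fraction of nonzero values in a small matrix and estimates the cost of storing every cell. The helper names `density` and `dense_bytes` are hypothetical and exist only for this example, and the sample matrix is made up.

```python
from typing import Sequence


def density(matrix: Sequence[Sequence[float]]) -> float:
    """Return the fraction of entries in `matrix` that are nonzero."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix has no elements")
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


def dense_bytes(rows: int, cols: int, bytes_per_value: int) -> int:
    """Bytes needed when every cell is stored, whether it is zero or not."""
    return rows * cols * bytes_per_value


# Made-up 3x4 matrix: 10 of its 12 entries are nonzero.
sample = [
    [0.5, 1.2, 0.0, 3.1],
    [2.2, 0.7, 4.0, 1.1],
    [0.0, 9.3, 0.4, 6.6],
]

print(f"density = {density(sample):.2f}")
print(f"dense storage = {dense_bytes(3, 4, 8)} bytes")
```

A density close to 1 means a sparse representation would save little, which is why such data is usually stored with every cell materialized.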
For the dense case, this tutorial will use the MNIST dataset.
Note
The MNIST dataset is a widely used benchmark dataset in the field of machine learning. It consists of a collection of 28×28 pixel grayscale images of handwritten digits (0 to 9), along with their corresponding labels showing the digit represented in each image. The dataset is commonly used for training and evaluating machine learning models, particularly for image classification tasks.
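To make the image dimensions concrete, this short calculation shows how many values one image holds and roughly how much memory the images occupy when every pixel is stored. It assumes one byte per pixel (unsigned 8-bit intensities) and the standard MNIST split of 60,000 training and 10,000 test images; neither figure is stated in the note above.

```python
HEIGHT, WIDTH = 28, 28   # image size stated in the note above
BYTES_PER_PIXEL = 1      # assumption: 8-bit grayscale intensities
TRAIN_IMAGES = 60_000    # assumption: standard MNIST training split
TEST_IMAGES = 10_000     # assumption: standard MNIST test split

# Each image flattens to HEIGHT * WIDTH values.
pixels_per_image = HEIGHT * WIDTH
train_bytes = TRAIN_IMAGES * pixels_per_image * BYTES_PER_PIXEL
test_bytes = TEST_IMAGES * pixels_per_image * BYTES_PER_PIXEL

print(f"values per image: {pixels_per_image}")
print(f"training images:  {train_bytes / 1e6:.1f} MB")
print(f"test images:      {test_bytes / 1e6:.1f} MB")
```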
TileDB offers an API with native dataloaders for every ML framework that TileDB integrates with. After you store your data, you can use the API to create a dataloader in each framework, which later serves as input to the model’s training stage. The API takes two TileDB arrays as inputs: x, which holds the sample data, and y, which holds the label for each sample in x. The dataloader collates these two arrays into a single data object that you can pass to model training.
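Conceptually, collating means pairing the i-th sample in x with the i-th label in y and grouping those pairs into batches. The pure-Python sketch below illustrates that idea only. It is not TileDB's implementation, and the function name `paired_batches` is hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def paired_batches(
    x: Sequence[X], y: Sequence[Y], batch_size: int
) -> Iterator[Tuple[List[X], List[Y]]]:
    """Yield (samples, labels) batches with x[i] aligned to y[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of rows")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(x), batch_size):
        stop = start + batch_size
        yield list(x[start:stop]), list(y[start:stop])


# Toy stand-ins for the image and label arrays.
samples = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
labels = [0, 1, 0, 1, 0]

batches = list(paired_batches(samples, labels, batch_size=2))
for batch_x, batch_y in batches:
    print(batch_x, batch_y)
```

The last batch is shorter when the number of rows is not a multiple of the batch size, which is also what you should expect to see at the end of an epoch with most dataloaders.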
Jupyter notebooks have limited support for Python multiprocessing. Avoid using multiple dataloader workers inside a notebook. If you need multiprocessing, run your code as a script with a standard Python interpreter instead.
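import tiledb
from tiledb.ml.readers.pytorch import PyTorchTileDBDataLoader
from tiledb.ml.readers.types import ArrayParams

# training_images and training_labels are the URIs of the TileDB arrays
# that hold the images and labels, assumed to be defined earlier.
with tiledb.open(training_images) as x, tiledb.open(training_labels) as y:
    train_loader = PyTorchTileDBDataLoader(
        ArrayParams(x),
        ArrayParams(y),
        batch_size=128,
        num_workers=0,  # single process, safe to run in Jupyter
        shuffle_buffer_size=256,
    )
    # Fetch one batch to check the shapes.
    batch_imgs, batch_labels = next(iter(train_loader))
    print(f"Input Shape: {batch_imgs.shape}")
    print(f"Label Shape: {batch_labels.shape}")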
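
The `shuffle_buffer_size` argument suggests buffer-based shuffling, in which rows are read in order into a fixed-size buffer and emitted in random order. The sketch below shows one common way to do this for a stream of items. It is an assumption about the general technique, not a description of how TileDB implements it, and `buffered_shuffle` is a hypothetical name.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    items: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield items in a partially random order using a bounded buffer."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and reuse its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

In this scheme a larger buffer gives an ordering closer to a full shuffle at the cost of more memory, while a buffer of size 1 leaves the input order unchanged.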