Learn how to ingest and perform basic ML operations on the MovieLens 100k sparse dataset.
A sparse dataset is one in which most of the values in the data matrix or array are zero (or missing), which gives it a high level of sparsity.
Sparse datasets are common in various domains, such as natural language processing (where many words may not appear in a document), recommendation systems (where users may interact with only a small subset of items), and certain scientific applications (where measurements are missing or irrelevant for many observations).
Sparse datasets can be represented more efficiently using sparse matrix or sparse tensor formats, which store only the non-zero values along with their indices. This representation saves memory and computational resources compared to dense representations.
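As a rough illustration of the idea above, the following sketch converts a small dense matrix into a coordinate-style sparse form that keeps only the non-zero values and their row and column indices. The matrix contents and the helper names `to_coo` and `sparsity` are made up for this example; they are not part of any TileDB API.

```python
# Hypothetical toy user-by-item matrix; 0 means "no value stored".
dense = [
    [5, 0, 0, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 1],
]


def to_coo(matrix):
    """Return (rows, cols, values) for the non-zero cells of a 2-D list."""
    rows, cols, values = [], [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                values.append(value)
    return rows, cols, values


def sparsity(matrix):
    """Return the fraction of cells that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


rows, cols, values = to_coo(dense)
print(rows, cols, values)  # three short lists instead of 16 cells
print(sparsity(dense))
```

Here the dense form stores 16 cells, while the sparse form stores 4 values plus their coordinates. The saving grows as the share of zero cells increases.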
Sparse ingestion
For the sparse case, this tutorial will use the MovieLens 100k dataset.
Note
The MovieLens dataset is a well-known and widely used dataset in recommendation systems and collaborative filtering research. It comprises several collections of movie ratings provided by users of the MovieLens website. Ratings are typically on a scale of 1 to 5, where 5 is the highest rating and 1 is the lowest. The dataset comes in several versions, the most commonly used being MovieLens 100k, MovieLens 1M, MovieLens 10M, MovieLens 20M, and MovieLens 25M, where each name denotes the approximate number of ratings. How sparse the data is depends on the version and on how it is represented. Viewed as a user-by-movie rating matrix, the data is sparse because each user rates only a small fraction of the movies, so most entries are missing, and the sparsity tends to grow with dataset size.
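To put a number on that sparsity, the short calculation below estimates how much of the user-by-movie matrix is filled in for MovieLens 100k. The user and movie counts are not stated in this tutorial; they are taken from the dataset's README (943 users, 1,682 movies, 100,000 ratings) and should be treated as assumptions to verify against your copy of the data.

```python
# Counts assumed from the MovieLens 100k README; verify against your download.
n_users = 943
n_movies = 1682
n_ratings = 100_000

# Every (user, movie) pair is a potential cell in the rating matrix.
possible = n_users * n_movies
density = n_ratings / possible
sparsity = 1 - density

print(f"possible ratings: {possible}")
print(f"density:  {density:.2%}")
print(f"sparsity: {sparsity:.2%}")
```

Under these assumptions, only about 6% of the possible user-movie pairs carry a rating, which is why a sparse representation is a natural fit.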
TileDB offers an API with native dataloaders for each ML framework it integrates with. After you store your data, you can use the API to create a dataloader in the framework of your choice. The API takes two TileDB arrays as inputs: x, which holds the sample data, and y, which holds the label corresponding to each sample in x. The dataloader collates these two arrays into a single data object that you can then pass to the model’s training stage.
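The sketch below illustrates only the collation idea in plain Python: it pairs each sample in x with its label in y and yields them in batches. It is a conceptual stand-in, not the TileDB dataloader API. The function name `iter_batches` and the sample values are hypothetical, and plain lists stand in for the two TileDB arrays.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    x: Sequence, y: Sequence, batch_size: int
) -> Iterator[Tuple[List, List]]:
    """Yield (samples, labels) batches, pairing x[i] with y[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of samples")
    for start in range(0, len(x), batch_size):
        stop = start + batch_size
        yield list(x[start:stop]), list(y[start:stop])


# Hypothetical stand-ins for the two arrays:
# each x entry is a (user_id, movie_id) pair, each y entry is a rating.
x = [[1, 10], [1, 20], [2, 10], [3, 30], [3, 40]]
y = [5, 3, 4, 1, 2]

batches = list(iter_batches(x, y, batch_size=2))
for samples, labels in batches:
    print(samples, labels)
```

A framework dataloader typically adds shuffling, tensor conversion, and parallel reads on top of this pairing, but the contract is the same: every batch carries samples and their matching labels together.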