Learn how to ingest and perform basic ML operations on the MNIST dense dataset.
A dense dataset is one in which most of the elements in the data matrix or array have nonzero values, so the dataset has little to no sparsity.
Dense datasets are common in many types of data, including numerical data, images, and text, where most of the features or dimensions are relevant and contribute to the information content of the dataset.
Dense datasets often require more memory and computational resources to process and analyze due to the larger amount of data present.
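As a rough illustration of the definition above, the following sketch measures density as the fraction of nonzero entries. The `density` helper and both small matrices are hypothetical and use only the Python standard library; they are not part of the tutorial's data.

```python
def density(matrix):
    """Return the fraction of entries in a 2-D list that are nonzero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


# Hypothetical 3x4 examples: one mostly nonzero, one mostly zero.
dense_matrix = [
    [5, 1, 7, 2],
    [3, 0, 9, 4],
    [8, 6, 2, 1],
]
sparse_matrix = [
    [0, 0, 7, 0],
    [0, 0, 0, 0],
    [8, 0, 0, 0],
]

dense_density = density(dense_matrix)    # 11 of 12 entries are nonzero
sparse_density = density(sparse_matrix)  # 2 of 12 entries are nonzero
print(f"dense:  {dense_density:.2f}")
print(f"sparse: {sparse_density:.2f}")
```

A dense layout stores every cell, zeros included, which is why its memory cost grows with the full shape of the array rather than with the number of nonzero values.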
For the dense case, this tutorial will use the MNIST dataset.
Note
The MNIST dataset is a widely used benchmark dataset in the field of machine learning. It consists of a collection of 28×28 pixel grayscale images of handwritten digits (0 to 9), along with their corresponding labels showing the digit represented in each image. The dataset is commonly used for training and evaluating machine learning models, particularly for image classification tasks.
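To put the image dimensions in the note into perspective, the arithmetic below estimates the size of the images when every pixel is stored. It assumes the standard 60,000-image training split and one byte per pixel (8-bit grayscale); the arrays used in this tutorial may use a different data type, so treat the result as an estimate.

```python
IMAGE_HEIGHT = 28
IMAGE_WIDTH = 28
NUM_TRAIN_IMAGES = 60_000  # assumption: standard MNIST training split
BYTES_PER_PIXEL = 1        # assumption: 8-bit grayscale (uint8)

pixels_per_image = IMAGE_HEIGHT * IMAGE_WIDTH
total_bytes = NUM_TRAIN_IMAGES * pixels_per_image * BYTES_PER_PIXEL

print(f"{pixels_per_image} pixels per image")
print(f"{total_bytes / 2**20:.1f} MiB for the densely stored training images")
```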
TileDB offers an API with native dataloaders for all the ML frameworks with which TileDB integrates. After you store your data, you can use the API to create a dataloader in each framework and pass it as input to the model’s training stage. The API takes two TileDB arrays as inputs: x, which holds the sample data, and y, which holds the label corresponding to each sample in x. The dataloader collates these two arrays into a single data object that you can later use as input for training a model.
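The idea of collating x and y can be sketched without any framework. The generator below is a conceptual illustration only, not the TileDB-ML implementation: `paired_batches` and the toy lists are hypothetical stand-ins for the dataloader and the two arrays.

```python
def paired_batches(x, y, batch_size):
    """Yield (samples, labels) batches from two aligned sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of samples")
    for start in range(0, len(x), batch_size):
        stop = start + batch_size
        # Slice both sequences with the same bounds so labels stay aligned.
        yield x[start:stop], y[start:stop]


# Hypothetical stand-ins for the two arrays: 5 tiny "images" and their labels.
x = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
y = [0, 1, 2, 3, 4]

batches = list(paired_batches(x, y, batch_size=2))
first_samples, first_labels = batches[0]
print(len(batches), first_samples, first_labels)
```

Each batch keeps sample `i` next to label `i`, which is the property a training loop relies on when it unpacks a batch into inputs and targets.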
Jupyter notebooks have limited support for Python multiprocessing. Avoid using multiple workers in a notebook. If you need multiprocessing, run your code as a script with a regular Python interpreter instead.
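One way to follow this advice is to choose the worker count at run time. The sketch below relies on a common heuristic, checking whether `ipykernel` has been imported, to guess that the code is running in a notebook. Both helper functions are hypothetical and the heuristic is an assumption, not a TileDB-ML feature.

```python
import sys


def running_in_notebook(modules=None):
    """Heuristic: an IPython kernel has been imported into this process."""
    modules = sys.modules if modules is None else modules
    return "ipykernel" in modules


def choose_num_workers(requested, modules=None):
    """Return 0 workers inside a notebook, otherwise the requested count."""
    return 0 if running_in_notebook(modules) else requested


# Assumption: 4 workers is the count you would want in a standalone script.
num_workers = choose_num_workers(4)
print(f"num_workers={num_workers}")
```

The resulting value can then be passed as the `num_workers` argument when you create the dataloader.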
# PyTorch: wrap the image and label arrays in a dataloader and fetch one batch.
with tiledb.open(training_images) as x, tiledb.open(training_labels) as y:
    train_loader = PyTorchTileDBDataLoader(
        ArrayParams(x),
        ArrayParams(y),
        batch_size=128,
        num_workers=0,  # single process, safe to run inside Jupyter
        shuffle_buffer_size=256,
    )
    batch_imgs, batch_labels = next(iter(train_loader))
    print(f"Input Shape: {batch_imgs.shape}")
    print(f"Label Shape: {batch_labels.shape}")
# TensorFlow: build a dataset from the same two arrays, batch it, and fetch one batch.
with tiledb.open(training_images) as x, tiledb.open(training_labels) as y:
    tiledb_dataset = TensorflowTileDBDataset(
        ArrayParams(array=x),
        ArrayParams(array=y),
    )
    batched_dataset = tiledb_dataset.batch(128)
    batch_imgs, batch_labels = next(batched_dataset.as_numpy_iterator())
    print(f"Input Shape: {batch_imgs.shape}")
    print(f"Label Shape: {batch_labels.shape}")
The recorded output of the TensorFlow example is the following error:

---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
Cell In[9], line 12
 7 tiledb_dataset = TensorflowTileDBDataset(
 8 ArrayParams(array=x),
 9 ArrayParams(array=y),
 10 )
 11 batched_dataset = tiledb_dataset.batch(128)
---> 12 batch_imgs, batch_labels = next(batched_dataset.as_numpy_iterator())
 13 print(f"Input Shape: {batch_imgs.shape}")
 14 print(f"Label Shape: {batch_labels.shape}")

File ~/mambaforge/envs/tiledb-ml-env/lib/python3.9/site-packages/tensorflow/python/data/ops/dataset_ops.py:4700, in _NumpyIterator.__next__(self)
 4697 numpy.setflags(write=False)
 4698 return numpy
-> 4700 return nest.map_structure(to_numpy, next(self._iterator))

File ~/mambaforge/envs/tiledb-ml-env/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py:814, in OwnedIterator.__next__(self)
 812 def __next__(self):
 813 try:
--> 814 return self._next_internal()
 815 except errors.OutOfRangeError:
 816 raise StopIteration

File ~/mambaforge/envs/tiledb-ml-env/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py:777, in OwnedIterator._next_internal(self)
 774 # TODO(b/77291417): This runs in sync mode as iterators use an error status
 775 # to communicate that there is no more data to iterate over.
 776 with context.execution_mode(context.SYNC):
--> 777 ret = gen_dataset_ops.iterator_get_next(
 778 self._iterator_resource,
 779 output_types=self._flat_output_types,
 780 output_shapes=self._flat_output_shapes)
 782 try:
 783 # Fast path for the case `self._structure` is not a nested structure.
 784 return self._element_spec._from_compatible_tensor_list(ret) # pylint: disable=protected-access

File ~/mambaforge/envs/tiledb-ml-env/lib/python3.9/site-packages/tensorflow/python/ops/gen_dataset_ops.py:3028, in iterator_get_next(iterator, output_types, output_shapes, name)
 3026 return _result
 3027 except _core._NotOkStatusException as e:
-> 3028 _ops.raise_from_not_ok_status(e, name)
 3029 except _core._FallbackException:
 3030 pass

File ~/mambaforge/envs/tiledb-ml-env/lib/python3.9/site-packages/tensorflow/python/framework/ops.py:6656, in raise_from_not_ok_status(e, name)
 6654 def raise_from_not_ok_status(e, name):
 6655 e.message += (" name: " + str(name if name is not None else ""))
-> 6656 raise core._status_to_exception(e) from None

InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_2_device_/job:localhost/replica:0/task:0/device:CPU:0}} TypeError: `generator` yielded an element of shape (60000,) where an element of shape (None, 60000) was expected.
Traceback (most recent call last):
 File "/Users/konstantinostsitsimpikos/mambaforge/envs/tiledb-ml-env/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
 ret = func(*args)
 File "/Users/konstantinostsitsimpikos/mambaforge/envs/tiledb-ml-env/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
 return func(*args, **kwargs)
 File "/Users/konstantinostsitsimpikos/mambaforge/envs/tiledb-ml-env/lib/python3.9/site-packages/tensorflow/python/data/ops/from_generator_op.py", line 235, in generator_py_func
 raise TypeError(
TypeError: `generator` yielded an element of shape (60000,) where an element of shape (None, 60000) was expected.
 [[{{node PyFunc}}]] [Op:IteratorGetNext] name:
Visualization
Render the first image from the batched data fetched by TileDB-ML loaders: