Plugins
A DEEL dataset plugin is an extension of the
Dataset or
VolatileDataset class defined in the DEEL
dataset manager project.
It allows to access to specific datasets files using the load method of defined
modes.
Extending the Dataset class
Below is an example implementation of a dataset class ExampleDataset.
The load_XXX methods defines the various mode, e.g. load_pytorch adds
a pytorch mode to the dataset.
The default mode used (when none is specified) can be set using the
_default_mode class attribute.
import h5py
import pathlib
import typing
from deel.datasets.dataset import Dataset
from deel.datasets.settings import Settings
class ExampleDataset(Dataset):
# Default mode:
_default_mode: str = "numpy"
def __init__(
self,
version: str = "latest",
settings: typing.Optional[Settings] = None
):
"""
Args:
version: Version of the dataset.
settings: The settings to use for this dataset, or `None` to use the
default settings.
"""
# `data_name` is the name of the folder containing the dataset on the
# provider (remote or local).
super().__init__("data_name", version, settings)
def load_numpy(self, path: pathlib.Path):
"""
Numpy mode for this dataset.
"""
# Dataset-specific code:
return data
def load_csv(self, path: pathlib.Path):
"""
CSV mode for this dataset.
"""
import pandas as pd
return pd.read_csv(path, sep=";", index_col=0)
def load_pytorch(
self,
path: pathlib.Path,
nstack: int = 4,
transform: typing.Callable = None,
):
"""
Pytorch mode for this dataset. With extra arguments that can
be passed to the `deel.datasets.load` method using named parameters.
"""
from .torch import SourceDataSet
return SourceDataSet(self.load_path(path), nstack, transform)
By default, the with_info option will return a dictionary containing the name
and the version of the dataset.
If you want to provide extra information, you can return a dictionary from the
load_XXX methods, e.g.:
def load_pytorch(self, path: pathlib.Path):
# Load a pytorch dataset:
dataset = ...
return dataset, {"classes": ["foo", "bar"]}
Utility functions
The deel.datasets.utils package contains utility functions to load or split numpy,
pytorch and tensorflow image dataset in a consistent way, and the Dataset
class contains some utility methods to generate the information dictionary from
the return of these methods.
Here is a very simple example for loading a dataset:
def load_pytorch(self, path: pathlib.Path, image_size: Tuple[int, int]):
# Use relative import only if you are inside the deel package:
from ..utils import load_pytorch_image_dataset
# Load the dataset using the utility function:
dataset, idx_to_class = load_pytorch_image_dataset(
self.load_path(path), # This is require only if `load_path` modifies the path.
image_size=image_size,
train_split=.7,
)
# The `_make_class_info` is provided by `Dataset`:
return dataset, self._make_class_info(idx_to_class)
Packaging the dataset(s)
To be found by the dataset manager, the ExampleDataset class must be put in
a package with a specific entrypoint (defined in setup.py).
The entry point provides to the plugin to be discovered and used by DEEL dataset
manager project.
The name of the DEEL dataset manager entry point is unique:
plugins.deel.dataset.
It is possible to define many aliases for the same plugin by adding multiple
alias = package:plugin class entries in entry points list.
# Assuming `ExampleDataset` is in `my_dataset/__init__.py`:
from setuptools import setup
setup(
# Other `setup` arguments:
...
# Entry points:
entry_points={
"plugins.deel.dataset": [
"example = my_dataset:ExampleDataset",
"my_dataset.example = my_dataset:ExampleDataset"
]
}
)
A single plugin can expose multiple datasets through different entry points.