deel.datasets.utils package

Submodules

deel.datasets.utils.numpy_utils module

deel.datasets.utils.numpy_utils.numpy_split_on_label(dataset, labels_in)

Splits a numpy dataset into an in-dataset and an out-dataset according to labels_in.

Parameters
  • dataset (Tuple[ndarray, ndarray]) – A numpy dataset.

  • labels_in (Sequence[int]) – Array of ‘normal’ labels.

Return type

Tuple[Tuple[ndarray, ndarray], Tuple[ndarray, ndarray]]

Returns

A tuple of split datasets (dataset_in, dataset_out).
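The documented contract can be sketched with plain NumPy. This is an illustrative sketch only, not the library's implementation: the helper name split_on_label_sketch and the sample data are hypothetical, and the sketch assumes that a dataset is an (x, y) pair whose second array holds integer labels.

```python
from typing import Sequence, Tuple

import numpy as np

Dataset = Tuple[np.ndarray, np.ndarray]


def split_on_label_sketch(
    dataset: Dataset, labels_in: Sequence[int]
) -> Tuple[Dataset, Dataset]:
    """Hypothetical helper mirroring the documented (dataset_in, dataset_out) contract."""
    x, y = dataset
    # Boolean mask: True where the sample's label is one of the 'normal' labels.
    mask = np.isin(y, np.asarray(labels_in))
    return (x[mask], y[mask]), (x[~mask], y[~mask])


# Made-up data: six samples with labels 0, 1 and 2.
x = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 2, 0, 1, 2])
(x_in, y_in), (x_out, y_out) = split_on_label_sketch((x, y), labels_in=[0, 1])
```

Here the samples labelled 0 or 1 end up in the in-dataset and the samples labelled 2 in the out-dataset.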

deel.datasets.utils.supervised module

deel.datasets.utils.supervised.load_hierarchical_python_image_dataset(folder, dispatch_fn, unique_labels=False)

Walks the given folder, applying the given function to determine the dataset and class with which each file should be associated.

The function should return a list of parts and a class. The number of parts can be different for each file.

Parameters
  • folder (Path) – The folder in which to look for files.

  • dispatch_fn (Callable[[Path], Optional[Tuple[List[str], str]]]) – A function that should return a 2-tuple where the first element is a list of str identifying the dataset (e.g., [“train”, “a”] for the dataset train/a) and the second element is the class of the file.

  • unique_labels (bool) – If True, the labels will be unique across all datasets; otherwise, the labels will go from 0 to the number of classes in the dataset - 1.

Returns

A dictionary mapping dataset hierarchy to 3-tuples (paths, labels, classes) where paths is a list of paths, labels is a list of labels (integers) and classes is a dictionary from labels to names.

Examples

Let us assume the folder contains images under $class/train/ and $class/test/, where $class is a class name. We could have dispatch_fn return either ([“train”], $class) or ([“test”], $class), which would result in two distinct datasets.
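A dispatch function for this layout could look like the sketch below. The documentation does not say whether the path passed to dispatch_fn is absolute or relative, so the sketch only inspects the last two directory names. Returning None to skip a file is an assumption inferred from the Optional return type, and the file names are made up.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def dispatch_fn(path: Path) -> Optional[Tuple[List[str], str]]:
    """Map <class>/<split>/<file> to ([split], class)."""
    split = path.parent.name  # expected to be "train" or "test"
    class_name = path.parent.parent.name
    if split not in ("train", "test"):
        # Assumption: returning None tells the loader to ignore the file.
        return None
    return [split], class_name


train_result = dispatch_fn(Path("data/cat/train/img001.png"))
test_result = dispatch_fn(Path("data/dog/test/img042.png"))
skipped = dispatch_fn(Path("data/dog/README.txt"))
```

With these inputs, the first file is dispatched to the dataset [“train”] with class cat, the second to [“test”] with class dog, and the third is skipped.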

deel.datasets.utils.supervised.load_hierarchical_pytorch_image_dataset(folder, dispatch_fn, image_size=None, unique_labels=False, transform=None)

Creates a pytorch image dataset from the given folder and parameters.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • dispatch_fn (Callable[[Path], Optional[Tuple[List[str], str]]]) – A function that should return a 2-tuple where the first element is a list of str identifying the dataset (e.g., [“train”, “a”] for the dataset train/a) and the second element is the class of the file.

  • image_size (Optional[Tuple[int, int]]) – The size of the image, or None to not resize images.

  • unique_labels (bool) – If True, the labels will be unique across all datasets; otherwise, the labels will go from 0 to the number of classes in the dataset - 1.

  • transform (Optional[Callable[[Image], Image]]) – Transformation to apply to the image before the conversion to a torch tensor via ToTensor(). If image_size is not None, the resize transform will be applied before this transformation. To do the opposite, pass None as image_size and add the resize transformation manually.

Returns

A 2-tuple whose first element is another tuple containing two or three datasets corresponding to the training, validation and testing datasets, and whose second element is a mapping from class labels to class names.

deel.datasets.utils.supervised.load_numpy_image_dataset(folder, image_size=None, train_split=0.8, shuffle=True, aggregate_fn=<function <lambda>>, filter_fn=<function <lambda>>)

Creates a numpy image dataset from the given folder and parameters.

The image datasets are 4-dimensional (N, H, W, C) numpy arrays, where N is the number of images, (H, W) is the image size and C is the number of channels. The arrays contain np.uint8 values between 0 and 255.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • image_size (Optional[Tuple[int, int]]) – The size of the image, or None to not resize images. None is only supported if all the images already have the same size.

  • train_split (Union[float, Tuple[float, float]]) – One or two float values. If a single value is specified, two datasets will be returned, one for training (using the fraction train_split of the data) and one for testing. If two values are specified, three datasets will be returned: a training dataset, a validation dataset and a testing dataset.

  • shuffle (Union[bool, int]) – If True, shuffle images before splitting with the default seed. If an int is given, use it as the seed for shuffling. If False, do not shuffle.

  • aggregate_fn (Callable[[str], Optional[str]]) – Callable to aggregate classes. The function should take the name of an original class (subfolder) and return the name of the “parent” class. If the call returns None, the class is discarded.

  • filter_fn (Callable[[str, Path], bool]) – A function to filter out images. This function should take a string (name of the file) and a path to the file and return True if the image should be included, False if it should be excluded.

Returns

A 2-tuple whose first element is another tuple containing two or three 2-tuples of numpy arrays corresponding to the training, validation and testing datasets, and whose second element is a mapping from class labels to class names. Each dataset is a 2-tuple (x, y) where x is a 4-dimensional numpy array containing images and y is a one-dimensional numpy array containing classes.
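The aggregate_fn and filter_fn callables described above can be sketched as follows. The class names (cat, dog, unknown, animal) and the accepted file extensions are made up for illustration; only the signatures come from this page.

```python
from pathlib import Path
from typing import Optional


def aggregate_fn(class_name: str) -> Optional[str]:
    """Merge fine-grained subfolders into parent classes (hypothetical names)."""
    if class_name in ("cat", "dog"):
        return "animal"
    if class_name == "unknown":
        return None  # the class is discarded
    return class_name


def filter_fn(name: str, path: Path) -> bool:
    """Keep only visible PNG and JPEG files."""
    if name.startswith("."):
        return False  # skip hidden files
    return path.suffix.lower() in (".png", ".jpg", ".jpeg")
```

Such callables would be passed as aggregate_fn=aggregate_fn and filter_fn=filter_fn. With a single train_split value, the documented result then unpacks as (train, test), classes.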

deel.datasets.utils.supervised.load_python_image_dataset(folder, shuffle=True, aggregate_fn=<function <lambda>>, filter_fn=<function <lambda>>)

Loads a python “dataset” from the given folder. This method returns a list of paths, labels, and class names from the given folder.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • shuffle (Union[bool, int]) – If True, shuffle images with the default seed. If an int is given, use it as the seed for shuffling. If False, do not shuffle. Shuffling is done using the standard random.shuffle function.

  • aggregate_fn (Callable[[str], Optional[str]]) – Callable to aggregate classes. The function should take the name of an original class (subfolder) and return the name of the “parent” class. If the call returns None, the class is discarded.

  • filter_fn (Callable[[str, Path], bool]) – A function to filter out images. This function should take a string (name of the file) and a path to the file and return True if the image should be included, False if it should be excluded.

Returns

A 3-tuple (paths, labels, classes) where paths is a list of paths, labels is a list of labels (integers) and classes is a dictionary from labels to names.

Return type

Tuple[List[Path], List[int], Dict[int, str]]
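The three modes of the shuffle parameter can be sketched as below. This is an illustration under stated assumptions, not the library's code: the helper maybe_shuffle and the DEFAULT_SEED value are hypothetical (the default seed is not documented here), and the sketch uses a local random.Random instance rather than the module-level random.shuffle so that the global random state is left untouched.

```python
import random
from typing import List, TypeVar, Union

T = TypeVar("T")

# Hypothetical value: the documentation does not state the default seed.
DEFAULT_SEED = 0


def maybe_shuffle(items: List[T], shuffle: Union[bool, int]) -> List[T]:
    """Return a copy of items, shuffled according to the shuffle argument."""
    result = list(items)
    if shuffle is False:
        return result
    # bool is a subclass of int, so test for True before treating the value as a seed.
    seed = DEFAULT_SEED if shuffle is True else shuffle
    random.Random(seed).shuffle(result)
    return result


paths = ["a.png", "b.png", "c.png", "d.png", "e.png"]
unshuffled = maybe_shuffle(paths, False)
seeded_once = maybe_shuffle(paths, 42)
seeded_twice = maybe_shuffle(paths, 42)
```

Passing the same integer twice yields the same order, which is what makes a seeded shuffle reproducible across runs.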

deel.datasets.utils.supervised.load_pytorch_image_dataset(folder, image_size=None, train_split=0.8, shuffle=True, aggregate_fn=<function <lambda>>, filter_fn=<function <lambda>>, transform=None)

Creates a pytorch image dataset from the given folder and parameters.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • image_size (Optional[Tuple[int, int]]) – The size of the image, or None to not resize images.

  • train_split (Union[float, Tuple[float, float]]) – One or two float values. If a single value is specified, two datasets will be returned, one for training (using the fraction train_split of the data) and one for testing. If two values are specified, three datasets will be returned: a training dataset, a validation dataset and a testing dataset.

  • shuffle (Union[bool, int]) – If True, shuffle images before splitting with the default seed. If an int is given, use it as the seed for shuffling. If False, do not shuffle.

  • aggregate_fn (Callable[[str], Optional[str]]) – Callable to aggregate classes. The function should take the name of an original class (subfolder) and return the name of the “parent” class. If the call returns None, the class is discarded.

  • filter_fn (Callable[[str, Path], bool]) – A function to filter out images. This function should take a string (name of the file) and a path to the file and return True if the image should be included, False if it should be excluded.

  • transform (Optional[Callable[[Image], Image]]) – Transformation to apply to the image before the conversion to a torch tensor via ToTensor(). If image_size is not None, the resize transform will be applied before this transformation. To do the opposite, pass None as image_size and add the resize transformation manually.

Returns

A 2-tuple whose first element is another tuple containing two or three datasets corresponding to the training, validation and testing datasets, and whose second element is a mapping from class labels to class names.

deel.datasets.utils.supervised.load_tensorflow_image_dataset(folder, image_size=None, train_split=0.8, shuffle=True, aggregate_fn=<function <lambda>>, filter_fn=<function <lambda>>)

Creates a tensorflow image dataset from the given folder and parameters.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • image_size (Optional[Tuple[int, int]]) – The size of the image, or None to not resize images.

  • train_split (Union[float, Tuple[float, float]]) – One or two float values. If a single value is specified, two datasets will be returned, one for training (using the fraction train_split of the data) and one for testing. If two values are specified, three datasets will be returned: a training dataset, a validation dataset and a testing dataset.

  • shuffle (Union[bool, int]) – If True, shuffle images before splitting with the default seed. If an int is given, use it as the seed for shuffling. If False, do not shuffle.

  • aggregate_fn (Callable[[str], Optional[str]]) – Callable to aggregate classes. The function should take the name of an original class (subfolder) and return the name of the “parent” class. If the call returns None, the class is discarded.

  • filter_fn (Callable[[str, Path], bool]) – A function to filter out images. This function should take a string (name of the file) and a path to the file and return True if the image should be included, False if it should be excluded.

Returns

A 2-tuple whose first element is another tuple containing two or three datasets corresponding to the training, validation and testing datasets, and whose second element is a mapping from class labels to class names.

deel.datasets.utils.supervised.split_datasets_on_label(datasets, labels_in)

Splits each dataset in a list into an in-dataset and an out-dataset according to the given labels. See split_on_label for more details.

Parameters
  • datasets (Sequence[TypeVar(DatasetType, Tuple[ndarray, ndarray], DatasetV2, Dataset)]) – List of numpy, pytorch or tensorflow datasets.

  • labels_in (Sequence[int]) – Array containing ‘normal’ labels.

Return type

Sequence[Tuple[TypeVar(DatasetType, Tuple[ndarray, ndarray], DatasetV2, Dataset), TypeVar(DatasetType, Tuple[ndarray, ndarray], DatasetV2, Dataset)]]

Returns

A list of split datasets ((dataset_in, dataset_out), …, (dataset_in, dataset_out)).

deel.datasets.utils.supervised.split_on_label(dataset, labels_in)

Splits a dataset into an in-dataset and an out-dataset according to the given labels.

Parameters
  • dataset (TypeVar(DatasetType, Tuple[ndarray, ndarray], DatasetV2, Dataset)) – A numpy, pytorch or tensorflow dataset.

  • labels_in (Sequence[int]) – Array containing ‘normal’ labels.

Return type

Tuple[TypeVar(DatasetType, Tuple[ndarray, ndarray], DatasetV2, Dataset), TypeVar(DatasetType, Tuple[ndarray, ndarray], DatasetV2, Dataset)]

Returns

A tuple of split datasets (dataset_in, dataset_out), where dataset_in is the subset of samples whose labels are in labels_in and dataset_out the remaining part of the dataset.

deel.datasets.utils.tensorflow_utils module

deel.datasets.utils.tensorflow_utils.tf_split_on_label(dataset, labels_in)

Splits a tensorflow dataset into an in-dataset and an out-dataset according to labels_in.

Parameters
  • dataset (DatasetV2) – A tensorflow dataset.

  • labels_in (Sequence[int]) – Array of ‘normal’ labels.

Return type

Tuple[DatasetV2, DatasetV2]

Returns

A tuple of split datasets (dataset_in, dataset_out).

deel.datasets.utils.torch_utils module

class deel.datasets.utils.torch_utils.ImageDataset(files, labels=None, transform=None)

Bases: Dataset

loader(path)
Return type

Image

class deel.datasets.utils.torch_utils.OptionalToTensor

Bases: object

Optional call to ToTensor() if the object is not already a tensor.

deel.datasets.utils.torch_utils.torch_split_on_label(dataset, labels_in)

Splits a torch dataset into an in-dataset and an out-dataset according to labels_in.

Parameters
  • dataset (Dataset) – A torch dataset.

  • labels_in (Sequence[int]) – Array of ‘normal’ labels.

Return type

Tuple[Dataset, Dataset]

Returns

A tuple of split datasets (dataset_in, dataset_out).

deel.datasets.utils.unsupervised module

deel.datasets.utils.unsupervised.load_hierarchical_python_image_dataset(folder, dispatch_fn)

Walks the given folder, applying the given function to determine the dataset with which each file should be associated.

The function should return a list of parts. The number of parts can be different for each file.

Parameters
  • folder (Path) – The folder to look for files.

  • dispatch_fn (Callable[[Path], Optional[List[str]]]) – A function that should return a list of str identifying the dataset (e.g., [“train”, “a”] for the dataset train/a).

Returns

A dictionary mapping the dataset hierarchy to lists of paths (a dictionary of dictionaries, indexed by strings, whose leaves are lists of paths).
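The nested structure described above can be illustrated with a small sketch. The helper build_hierarchy, the example dispatch_fn and the file names are hypothetical. Skipping a file when dispatch_fn returns None is an assumption inferred from the Optional return type, and the sketch does not handle the case where one file's parts end at a key that another file uses as an intermediate level.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional


def build_hierarchy(
    paths: Iterable[Path],
    dispatch_fn: Callable[[Path], Optional[List[str]]],
) -> Dict[str, Any]:
    """Group paths into nested dictionaries keyed by the parts from dispatch_fn."""
    root: Dict[str, Any] = {}
    for path in paths:
        parts = dispatch_fn(path)
        if not parts:
            continue  # assumption: None (or an empty list) means "skip this file"
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node.setdefault(parts[-1], []).append(path)  # leaves are lists of paths
    return root


def dispatch_fn(path: Path) -> Optional[List[str]]:
    # Hypothetical layout: <split>/<group>/<file>, e.g. train/a/img1.png.
    if path.suffix != ".png":
        return None
    return [path.parent.parent.name, path.parent.name]


files = [
    Path("train/a/img1.png"),
    Path("train/a/img2.png"),
    Path("test/a/img3.png"),
    Path("train/notes.txt"),
]
tree = build_hierarchy(files, dispatch_fn)
```

With these inputs, tree["train"]["a"] holds the two training images, tree["test"]["a"] holds the test image, and the text file is ignored.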

deel.datasets.utils.unsupervised.load_hierarchical_pytorch_image_dataset(folder, dispatch_fn, image_size=None, transform=None)

Creates a pytorch image dataset from the given folder and parameters.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • dispatch_fn (Callable[[Path], Optional[List[str]]]) – A function that should return a list of str identifying the dataset (e.g., [“train”, “a”] for the dataset train/a).

  • image_size (Optional[Tuple[int, int]]) – The size of the image, or None to not resize images.

  • transform (Optional[Callable[[Image], Image]]) – Transformation to apply to the image before the conversion to a torch tensor via ToTensor(). If image_size is not None, the resize transform will be applied before this transformation. To do the opposite, pass None as image_size and add the resize transformation manually.

Returns

A tuple containing two or three datasets corresponding to the training, validation and testing datasets.

deel.datasets.utils.unsupervised.load_numpy_image_dataset(folder, image_size=None, train_split=0.8, shuffle=True, filter_fn=<function <lambda>>)

Creates a numpy image dataset from the given folder and parameters.

The image datasets are 4-dimensional (N, H, W, C) numpy arrays, where N is the number of images, (H, W) is the image size and C is the number of channels. The arrays contain np.uint8 values between 0 and 255.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • image_size (Optional[Tuple[int, int]]) – The size of the image, or None to not resize images. None is only supported if all the images already have the same size.

  • train_split (Union[float, Tuple[float, float]]) – One or two float values. If a single value is specified, two datasets will be returned, one for training (using the fraction train_split of the data) and one for testing. If two values are specified, three datasets will be returned: a training dataset, a validation dataset and a testing dataset.

  • shuffle (Union[bool, int]) – If True, shuffle images before splitting with the default seed. If an int is given, use it as the seed for shuffling. If False, do not shuffle.

  • filter_fn (Callable[[str, Path], bool]) – A function to filter out images. This function should take a string (name of the file) and a path to the file and return True if the image should be included, False if it should be excluded.

Returns

A tuple containing two or three numpy arrays corresponding to training, validation and testing datasets.
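How train_split determines the number of returned datasets can be sketched as follows. The helper split_sizes is hypothetical. The interpretation of the two-value form (training fraction, then validation fraction, with the remainder used for testing) and the truncation of fractional sizes are assumptions, since this page does not specify them.

```python
from typing import List, Tuple, Union


def split_sizes(n: int, train_split: Union[float, Tuple[float, float]]) -> List[int]:
    """Return the number of samples per dataset for n samples."""
    if isinstance(train_split, (int, float)):
        # Single value: a training set and a testing set.
        n_train = int(n * train_split)
        return [n_train, n - n_train]
    # Assumption: the two values are the training and validation fractions,
    # and the testing set receives whatever remains.
    train_fraction, val_fraction = train_split
    n_train = int(n * train_fraction)
    n_val = int(n * val_fraction)
    return [n_train, n_val, n - n_train - n_val]


two_way = split_sizes(100, 0.8)
three_way = split_sizes(100, (0.5, 0.25))
```

Under these assumptions, 100 images with the default train_split of 0.8 give 80 training and 20 testing images, while (0.5, 0.25) gives 50 training, 25 validation and 25 testing images.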

deel.datasets.utils.unsupervised.load_python_image_dataset(folder, shuffle=True, filter_fn=<function <lambda>>)

Loads a python “dataset” from the given folder. This method returns a list of paths from the given folder.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • shuffle (Union[bool, int]) – If True, shuffle images with the default seed. If an int is given, use it as the seed for shuffling. If False, do not shuffle. Shuffling is done using the standard random.shuffle function.

  • filter_fn (Callable[[str, Path], bool]) – A function to filter out images. This function should take a string (name of the file) and a path to the file and return True if the image should be included, False if it should be excluded.

Returns

A list of paths to the images found in the given folder.

Return type

List[Path]

deel.datasets.utils.unsupervised.load_pytorch_image_dataset(folder, image_size=None, train_split=0.8, shuffle=True, aggregate_fn=<function <lambda>>, filter_fn=<function <lambda>>, transform=None)

Creates a pytorch image dataset from the given folder and parameters.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • image_size (Optional[Tuple[int, int]]) – The size of the image, or None to not resize images.

  • train_split (Union[float, Tuple[float, float]]) – One or two float values. If a single value is specified, two datasets will be returned, one for training (using the fraction train_split of the data) and one for testing. If two values are specified, three datasets will be returned: a training dataset, a validation dataset and a testing dataset.

  • shuffle (Union[bool, int]) – If True, shuffle images before splitting with the default seed. If an int is given, use it as the seed for shuffling. If False, do not shuffle.

  • filter_fn (Callable[[str, Path], bool]) – A function to filter out images. This function should take a string (name of the file) and a path to the file and return True if the image should be included, False if it should be excluded.

  • transform (Optional[Callable[[Image], Image]]) – Transformation to apply to the image before the conversion to a torch tensor via ToTensor(). If image_size is not None, the resize transform will be applied before this transformation. To do the opposite, pass None as image_size and add the resize transformation manually.

Returns

A tuple containing two or three datasets corresponding to training, validation and testing datasets.

deel.datasets.utils.unsupervised.load_tensorflow_image_dataset(folder, image_size=None, train_split=0.8, shuffle=True, aggregate_fn=<function <lambda>>, filter_fn=<function <lambda>>)

Creates a tensorflow image dataset from the given folder and parameters.

Parameters
  • folder (Path) – The folder containing the dataset. The folder should contain, for each class, a subfolder with only images inside.

  • image_size (Optional[Tuple[int, int]]) – The size of the image, or None to not resize images.

  • train_split (Union[float, Tuple[float, float]]) – One or two float values. If a single value is specified, two datasets will be returned, one for training (using the fraction train_split of the data) and one for testing. If two values are specified, three datasets will be returned: a training dataset, a validation dataset and a testing dataset.

  • shuffle (Union[bool, int]) – If True, shuffle images before splitting with the default seed. If an int is given, use it as the seed for shuffling. If False, do not shuffle.

  • filter_fn (Callable[[str, Path], bool]) – A function to filter out images. This function should take a string (name of the file) and a path to the file and return True if the image should be included, False if it should be excluded.

Returns

A tuple containing two or three datasets corresponding to training, validation and testing datasets.