pytoda.datasets.utils.utils module

Utils for the dataset module.

Summary

Functions:

concatenate_file_based_datasets

Concatenate file-based datasets into a single one, with the ability to get the source dataset of items.

indexed

Returns a mutated shallow copy of the passed dataset instance whose indexing behavior is changed to additionally return the index.

keyed

Returns a mutated shallow copy of the passed dataset instance whose indexing behavior is changed to additionally return the key.

pad_item

Padding function for a single item of a batch.

sizeof_fmt

Human readable file size.

Reference

sizeof_fmt(num, suffix='B')

Human readable file size. Source: https://stackoverflow.com/a/1094933
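
The formatting approach in the linked answer can be sketched as follows. This is a minimal illustration and not necessarily the exact pytoda implementation; the name sizeof_fmt_sketch is hypothetical, and binary (1024-based) prefixes are assumed.

```python
def sizeof_fmt_sketch(num, suffix="B"):
    """Format a byte count with binary prefixes (hypothetical sketch)."""
    for unit in ("", "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi"):
        # Stop at the first unit in which the value is below 1024.
        if abs(num) < 1024.0:
            return f"{num:3.1f}{unit}{suffix}"
        num /= 1024.0
    # Anything larger is reported in yobi-units.
    return f"{num:.1f}Yi{suffix}"


print(sizeof_fmt_sketch(1536))
```

Each loop iteration divides by 1024, so the returned string uses the largest unit that keeps the number below 1024.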

concatenate_file_based_datasets(filepaths, dataset_class, **kwargs)

Concatenate file-based datasets into a single one, with the ability to get the source dataset of items.

Parameters
  • filepaths (Files) – list of filepaths.

  • dataset_class (type) – dataset class reading from file. Supports KeyDataset and DatasetDelegator. For pure torch.utils.data.Dataset the returned instance can still be used like a pytoda.datasets.TransparentConcatDataset, but methods depending on key lookup will fail.

  • kwargs (dict) – additional args for dataset_class.__init__(filepath, **kwargs).

Returns

the concatenated dataset.

Return type

ConcatKeyDataset
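
The idea of concatenating datasets while keeping track of each item's source can be illustrated without pytoda. The sketch below is conceptual only: the class ConcatWithSourceSketch and its method get_index_pair are hypothetical names chosen for this example, and plain lists stand in for file-based datasets.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatWithSourceSketch:
    """Concatenate sized, indexable datasets (hypothetical sketch)."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # Running totals of dataset lengths, e.g. [2, 3, 6].
        self.cumulative_sizes = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def get_index_pair(self, index):
        """Map a global index to (source dataset index, local index)."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        dataset_index = bisect_right(self.cumulative_sizes, index)
        offset = self.cumulative_sizes[dataset_index - 1] if dataset_index else 0
        return dataset_index, index - offset

    def __getitem__(self, index):
        dataset_index, sample_index = self.get_index_pair(index)
        return self.datasets[dataset_index][sample_index]


concat = ConcatWithSourceSketch([["a", "b"], ["c"], ["d", "e", "f"]])
print(concat[2], concat.get_index_pair(2))
```

The cumulative sizes make the source lookup a binary search, so finding the dataset that holds an item stays cheap even when many files are concatenated.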

indexed(dataset)

Returns a mutated shallow copy of the passed dataset instance whose indexing behavior is changed to additionally return the index.

Return type

Union[KeyDataset, DatasetDelegator, ConcatKeyDataset]
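
One way to obtain a shallow copy with changed indexing behavior is to copy the instance and swap in a dynamically created subclass. The sketch below only illustrates that pattern; it is not claimed to be how pytoda implements it, and indexed_sketch and ListDataset are hypothetical names.

```python
import copy


class ListDataset:
    """Tiny stand-in dataset used only for this example."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]


def indexed_sketch(dataset):
    """Return a shallow copy whose __getitem__ also returns the index."""
    base = type(dataset)

    def getitem(self, index):
        return base.__getitem__(self, index), index

    # Special methods are looked up on the type, so a subclass is needed.
    new_cls = type(f"Indexed{base.__name__}", (base,), {"__getitem__": getitem})
    clone = copy.copy(dataset)  # shallow: attributes are shared, not duplicated
    clone.__class__ = new_cls
    return clone


ds = ListDataset(["a", "b"])
print(indexed_sketch(ds)[1], ds[1])
```

Because the copy is shallow, the underlying data is shared with the original instance, while the original keeps its usual indexing behavior.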

keyed(dataset)

Returns a mutated shallow copy of the passed dataset instance whose indexing behavior is changed to additionally return the key.

Return type

Union[KeyDataset, DatasetDelegator, ConcatKeyDataset]

pad_item(item, padding_modes, padding_values, max_length)

Padding function for a single item of a batch.

Parameters
  • item (Tuple) – Tuple returned by the __getitem__ function of a Dataset class.

  • padding_modes (List[str]) – The type of padding to perform for each datum in item. Options are ‘constant’ for constant value padding, and ‘range’ to fill the tensor with a range of values.

  • padding_values (List) – The values with which to fill the background tensor for padding. Can be a constant value or a range depending on the datum to pad in item.

  • max_length (int) – The maximum length to which the datum should be padded.

Returns

Tuple of tensors padded according to the given specifications.

Return type

Tuple

NOTE: pad_item uses the trailing dimensions as the repetitions argument for range_tensor(), since the ‘length’ of the set is covered by the value_range. That is, if a tensor of shape (5,) is required for padding_mode ‘range’, then () is passed as the shape into range_tensor, which repeats range(5) exactly once and thus yields a (5,) tensor.
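
The two padding modes can be illustrated with a simplified, one-dimensional sketch that uses plain lists instead of tensors. Several things here are assumptions and not taken from pytoda: the name pad_item_sketch, the choice to write each datum at the start of the background (right-padding), and the requirement that a ‘range’ value has exactly max_length elements.

```python
def pad_item_sketch(item, padding_modes, padding_values, max_length):
    """Pad each 1-D datum in item to max_length (hypothetical sketch)."""
    padded = []
    for datum, mode, value in zip(item, padding_modes, padding_values):
        if mode == "constant":
            # Background filled with a single constant value.
            background = [value] * max_length
        elif mode == "range":
            # Background filled with a range of values, e.g. range(5).
            background = list(value)
            if len(background) != max_length:
                raise ValueError("range must contain max_length values")
        else:
            raise ValueError(f"unknown padding mode: {mode}")
        if len(datum) > max_length:
            raise ValueError("datum is longer than max_length")
        # Assumption: the datum overwrites the start of the background.
        background[: len(datum)] = datum
        padded.append(background)
    return tuple(padded)


item = ([1, 2, 3], [7, 8])
print(pad_item_sketch(item, ["constant", "range"], [0, range(5)], 5))
```

In this sketch the first datum is filled up with zeros, while the second keeps the tail of range(5) as its padding.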