pytoda.data_splitter module

Data splitting utilities.

Summary

Functions:

csv_data_splitter

Function for generic splitting into train and test data in csv format.

Reference

csv_data_splitter(data_filepaths, save_path, data_type, mode, seed=42, test_fraction=0.1, number_of_columns=12, **kwargs)

Function for generic splitting into train and test data in csv format. This is an eager splitter: it attempts to load the entire dataset into memory before splitting.

Parameters
  • data_filepaths (Files) – a list of .csv files that contain the data.

  • save_path (str) – folder to store the training/testing dataset.

  • data_type (str) – data type (only used as prefix for the saved files).

  • mode (str) – how to split the data, either “random” or “file”. “random” does a random split across all samples in all files; “file” randomly splits the files themselves into training and testing.

  • seed (int) – random seed used for the split. Defaults to 42.

  • test_fraction (float) – portion of samples in testing data. Defaults to 0.1.

  • number_of_columns (int) – number of columns used to create the hash. Defaults to 12.

  • kwargs (dict) – additional parameters for pd.read_csv.
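
The difference between the two modes can be illustrated with a minimal standard-library sketch. This is not pytoda's implementation: the helper name `sketch_split`, the in-memory `rows_by_file` mapping, and the rounding rule for the test size are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Row = Tuple[str, ...]


def sketch_split(
    rows_by_file: Dict[str, List[Row]],
    mode: str,
    seed: int = 42,
    test_fraction: float = 0.1,
) -> Tuple[List[Row], List[Row]]:
    """Return (train_rows, test_rows); a hypothetical stand-in, not pytoda code."""
    rng = random.Random(seed)  # local generator, so the split is reproducible
    names = sorted(rows_by_file)  # fixed order before shuffling

    if mode == "random":
        # Pool every sample from every file, then draw the test set at random.
        pool = [row for name in names for row in rows_by_file[name]]
        rng.shuffle(pool)
        n_test = round(len(pool) * test_fraction)  # assumed rounding rule
        return pool[n_test:], pool[:n_test]

    if mode == "file":
        # Shuffle the files and assign each one entirely to train or test.
        rng.shuffle(names)
        n_test = max(1, round(len(names) * test_fraction))  # assumed rule
        train = [row for name in names[n_test:] for row in rows_by_file[name]]
        test = [row for name in names[:n_test] for row in rows_by_file[name]]
        return train, test

    raise ValueError(f"unknown mode: {mode!r}")
```

In this sketch, "random" can place samples from one file on both sides of the split, while "file" keeps each file's samples together.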

Returns

a tuple with the paths to the train and test files, in that order.

Return type

Tuple[str, str]
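
A call might look like the following sketch, which uses only the signature documented above. The wrapper name `split_interaction_data` and the `"interactions"` prefix are hypothetical; the import is deferred so the snippet can be loaded even where pytoda is not installed.

```python
from typing import List, Tuple


def split_interaction_data(csv_paths: List[str], out_dir: str) -> Tuple[str, str]:
    """Split the given .csv files and return the train and test file paths."""
    # Deferred import: only needed when the function is actually called.
    from pytoda.data_splitter import csv_data_splitter

    train_path, test_path = csv_data_splitter(
        data_filepaths=csv_paths,
        save_path=out_dir,
        data_type="interactions",  # hypothetical prefix for the saved files
        mode="random",  # or "file" to split by whole files
        seed=42,
        test_fraction=0.1,
    )
    return train_path, test_path
```

For example, `split_interaction_data(["a.csv", "b.csv"], "splits")` would split two hypothetical input files and write the results under a hypothetical `splits` folder. Extra keyword arguments for pd.read_csv could be forwarded in the same call.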