pytoda.data_splitter module

Data splitting utilities.

Summary

Functions:

csv_data_splitter – Function for generic splitting into train and test data in csv format.

csv_data_splitter(data_filepaths, save_path, data_type, mode, seed=42, test_fraction=0.1, number_of_columns=12, **kwargs)

Function for generic splitting into train and test data in csv format. This is an eager splitter: it tries to load the entire dataset into memory before splitting, so the combined input files should fit in available memory.

Parameters

data_filepaths (Files) – a list of .csv files that contain the data.
save_path (str) – folder to store the training/testing dataset.
data_type (str) – data type (only used as prefix for the saved files).
mode (str) – how to split the data, either “random” or “file”.
    - random: does a random split across all samples in all files.
    - file: randomly splits the files into training and testing.
seed (int) – random seed used for the split. Defaults to 42.
test_fraction (float) – portion of samples in testing data. Defaults to 0.1.
number_of_columns (int) – number of columns used to create the hash. Defaults to 12.
kwargs (dict) – additional parameters for pd.read_csv.
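
The difference between the two modes can be illustrated with a minimal sketch that uses only the standard library. This is not pytoda's implementation: the helper `split_rows`, the `rows_per_file` mapping, and the rule of always assigning at least one file to the test set are assumptions made for illustration. Only the parameter names `mode`, `seed`, and `test_fraction` and their defaults come from the signature above.

```python
import random
from typing import Dict, List, Tuple

Row = Tuple[str, int]  # (source file name, row index within that file)


def split_rows(
    rows_per_file: Dict[str, int],
    mode: str,
    seed: int = 42,
    test_fraction: float = 0.1,
) -> Tuple[List[Row], List[Row]]:
    """Illustrative split into (train, test); hypothetical helper, not pytoda code."""
    rng = random.Random(seed)  # a seeded generator makes the split reproducible
    all_rows = [
        (name, i) for name, n in sorted(rows_per_file.items()) for i in range(n)
    ]

    if mode == "random":
        # Shuffle every sample from every file, then cut off the test share.
        shuffled = all_rows[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]

    if mode == "file":
        # Shuffle the file names and assign whole files to the test set.
        names = sorted(rows_per_file)
        rng.shuffle(names)
        # Assumption: keep at least one file for testing.
        n_test_files = max(1, round(len(names) * test_fraction))
        test_names = set(names[:n_test_files])
        train = [row for row in all_rows if row[0] not in test_names]
        test = [row for row in all_rows if row[0] in test_names]
        return train, test

    raise ValueError(f"unknown mode: {mode!r}")


counts = {"a.csv": 50, "b.csv": 30, "c.csv": 20}
train_random, test_random = split_rows(counts, mode="random")
train_file, test_file = split_rows(counts, mode="file")
```

In "random" mode, samples from the same file can land on both sides of the split. In "file" mode, every file ends up entirely in either the training or the testing set, which matters when rows within a file are correlated.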

Returns

a tuple with the paths to the train and test files.

Return type

Tuple[str, str]
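
As a rough sketch of the return contract, the snippet below writes two subsets to a folder and returns their paths as a `Tuple[str, str]`. The helper `save_split` and the `<data_type>_<subset>.csv` naming scheme are hypothetical. The documentation above only states that `data_type` is used as a prefix for the saved files, so the actual file names produced by `csv_data_splitter` may differ.

```python
import csv
import os
import tempfile
from typing import List, Sequence, Tuple


def save_split(
    train_rows: Sequence[Sequence[str]],
    test_rows: Sequence[Sequence[str]],
    header: Sequence[str],
    save_path: str,
    data_type: str,
) -> Tuple[str, str]:
    """Write both subsets into save_path and return (train_file, test_file)."""
    paths: List[str] = []
    for subset, rows in (("train", train_rows), ("test", test_rows)):
        # Hypothetical naming scheme: "<data_type>_<subset>.csv".
        path = os.path.join(save_path, f"{data_type}_{subset}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        paths.append(path)
    return paths[0], paths[1]


out_dir = tempfile.mkdtemp()
train_path, test_path = save_split(
    train_rows=[["s1", "0.5"], ["s2", "0.7"]],
    test_rows=[["s3", "0.1"]],
    header=["sample", "value"],
    save_path=out_dir,
    data_type="demo",
)
```

Returning paths rather than loaded data lets the caller decide how to read each file back.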