Data Preparation

class DataPrep

Prepares data for training.

Based on a samples file and an image directory, DataPrep creates a train, validation, and test set. Before data preparation starts, the samples file and the image directory are validated.

DataPrep creates a job directory containing the following files:

  • class_mapping.json

  • train_samples.json

  • val_samples.json

  • test_samples.json

These files are required for subsequent components (training and evaluation).
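As a rough illustration of the resulting job directory, the sketch below writes the four files with the standard library. The helper name save_job_files and the JSON layout of each file are assumptions made for this example; they are not taken from the DataPrep source.

```python
import json
import tempfile
from pathlib import Path


def save_job_files(job_dir, class_mapping, train, val, test):
    """Hypothetical helper: write the four job files as JSON and return their names."""
    job_dir = Path(job_dir)
    job_dir.mkdir(parents=True, exist_ok=True)
    outputs = {
        "class_mapping.json": class_mapping,
        "train_samples.json": train,
        "val_samples.json": val,
        "test_samples.json": test,
    }
    for filename, payload in outputs.items():
        with open(job_dir / filename, "w") as handle:
            json.dump(payload, handle, indent=2)
    return sorted(outputs)


# Usage with a throwaway directory and toy data (all values are made up).
with tempfile.TemporaryDirectory() as tmp:
    written = save_job_files(
        Path(tmp) / "job",
        class_mapping={"Class 1": 0},
        train=[{"image_id": "image_1.jpg", "label": 0}],
        val=[],
        test=[],
    )
    print(written)
```

Keeping every output in one directory means later components only need the job_dir path to find their inputs.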

The samples file must be a JSON list of objects, each with the keys image_id and label:

[ { "image_id": "image_1.jpg", "label": "Class 1" }, { "image_id": "image_2.jpg", "label": "Class 1" }, ... ]
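A structural check of this format could look like the following minimal sketch. The function check_samples is hypothetical and only inspects the JSON structure; the validation DataPrep actually performs (for example, checking that the image files exist) may differ.

```python
import json

REQUIRED_KEYS = {"image_id", "label"}


def check_samples(samples):
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(samples, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for index, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append(f"entry {index} is not an object")
            continue
        missing = REQUIRED_KEYS - sample.keys()
        if missing:
            problems.append(f"entry {index} is missing keys: {sorted(missing)}")
    return problems


# The second entry deliberately lacks a label.
raw = '[{"image_id": "image_1.jpg", "label": "Class 1"}, {"image_id": "image_2.jpg"}]'
print(check_samples(json.loads(raw)))
```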

DataPrep accepts the following parameters:

  • image_dir: path to the image directory.

  • job_dir: path to job directory with samples.

  • samples_file: path to samples file.

  • min_class_size: minimum number of samples per label (default 2).

  • test_size: proportion of the dataset to include in the test set (default 0.2).

  • val_size: proportion of the dataset to include in the validation set (default 0.1).

  • part_size: proportion of the dataset to use across all sets (default 1).
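To make the size parameters concrete, here is a small sketch of the arithmetic under two assumptions that are not confirmed by the text above: labels with fewer than min_class_size samples are dropped, and test_size and val_size are both fractions of the dataset that remains after part_size is applied. The function names are hypothetical.

```python
from collections import Counter


def drop_small_classes(samples, min_class_size=2):
    """Assumed behaviour: keep only samples whose label occurs often enough."""
    counts = Counter(sample["label"] for sample in samples)
    return [s for s in samples if counts[s["label"]] >= min_class_size]


def split_counts(n_samples, test_size=0.2, val_size=0.1, part_size=1.0):
    """Assumed behaviour: sizes are fractions of the part_size subset."""
    n_used = round(n_samples * part_size)
    n_test = round(n_used * test_size)
    n_val = round(n_used * val_size)
    return {"train": n_used - n_test - n_val, "val": n_val, "test": n_test}


samples = [
    {"image_id": "image_1.jpg", "label": "Class 1"},
    {"image_id": "image_2.jpg", "label": "Class 1"},
    {"image_id": "image_3.jpg", "label": "Class 2"},  # only one sample of Class 2
]
print(len(drop_small_classes(samples)))  # 2
print(split_counts(1000))                 # {'train': 700, 'val': 100, 'test': 200}
print(split_counts(1000, part_size=0.5))  # {'train': 350, 'val': 50, 'test': 100}
```

With the defaults, roughly 70% of the usable samples end up in the training set under these assumptions.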


def __init__(job_dir, image_dir, samples_file, min_class_size, test_size, val_size, part_size, **kwargs)

Initializes the data preparation component.

Loads the samples file and initializes the variables used by later operations: valid_image_ids, class_mapping, train_samples, val_samples, test_samples.


def run(resize)

Executes all steps of data preparation.

  • Validates samples and images

  • Creates class-mapping (string to integer)

  • Applies class-mapping on samples

  • Splits samples into train, validation, and test sets

  • Resizes images

  • Saves files (class-mapping, train-, validation- and test-set)

  • resize: boolean; if True, creates a subfolder of resized images (default False).
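The two class-mapping steps listed above can be sketched as follows. This is an illustrative version only: the function names are hypothetical, and sorting the labels before numbering them is an assumption made here so that the mapping is reproducible.

```python
def create_class_mapping(samples):
    """Map each distinct label string to an integer (sorted order is an assumption)."""
    labels = sorted({sample["label"] for sample in samples})
    return {label: index for index, label in enumerate(labels)}


def apply_class_mapping(samples, class_mapping):
    """Return new sample dicts whose string labels are replaced by integers."""
    return [
        {"image_id": s["image_id"], "label": class_mapping[s["label"]]}
        for s in samples
    ]


samples = [
    {"image_id": "image_1.jpg", "label": "Class 2"},
    {"image_id": "image_2.jpg", "label": "Class 1"},
    {"image_id": "image_3.jpg", "label": "Class 2"},
]
mapping = create_class_mapping(samples)
print(mapping)  # {'Class 1': 0, 'Class 2': 1}
print(apply_class_mapping(samples, mapping))
```

Building new dictionaries rather than mutating the inputs leaves the original samples available with their string labels.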