

Data Preparation

class DataPrep

Prepares data for training.

Creates train, validation, and test sets from a samples file and an image directory. The samples file and the image directory are validated before data preparation starts.

DataPrep creates a job directory containing the following files:

class_mapping.json
train_samples.json
val_samples.json
test_samples.json

These files are required for subsequent components (training and evaluation).
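
As an illustration of how a downstream step might pick these files up, here is a minimal sketch that checks a job directory for the four files and parses them. The helper name load_job_files is hypothetical and not part of DataPrep, and the internal structure of each file is not specified here, so the sketch only parses the JSON and returns it unchanged.

```python
import json
from pathlib import Path

# File names as listed above.
JOB_FILES = (
    "class_mapping.json",
    "train_samples.json",
    "val_samples.json",
    "test_samples.json",
)


def load_job_files(job_dir):
    """Return {file name: parsed JSON} for each expected file (hypothetical helper)."""
    job_path = Path(job_dir)
    # Fail early with a clear message if any expected file is absent.
    missing = [name for name in JOB_FILES if not (job_path / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {job_path}: {missing}")
    contents = {}
    for name in JOB_FILES:
        with open(job_path / name, encoding="utf-8") as f:
            contents[name] = json.load(f)
    return contents
```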

The samples file must be in JSON format: a list of objects, each with the keys image_id and label:

[ { "image_id": "image_1.jpg", "label": "Class 1" }, { "image_id": "image_2.jpg", "label": "Class 1" }, ... ]
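
To check a samples file against this format before handing it to DataPrep, a small standalone check along these lines may help. This is only a sketch: validate_samples and load_samples are hypothetical helpers, and the validation that DataPrep itself performs may be stricter (for example, it also validates the image directory).

```python
import json

REQUIRED_KEYS = {"image_id", "label"}


def validate_samples(samples):
    """Check that samples is a list of dicts that each carry the required keys."""
    if not isinstance(samples, list):
        raise ValueError("samples must be a JSON list")
    for i, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"sample {i} is not a JSON object")
        missing = REQUIRED_KEYS - sample.keys()
        if missing:
            raise ValueError(f"sample {i} is missing keys: {sorted(missing)}")
    return samples


def load_samples(samples_file):
    """Read a samples file from disk and validate its structure."""
    with open(samples_file, encoding="utf-8") as f:
        return validate_samples(json.load(f))
```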

Attributes

image_dir: path of image directory.
job_dir: path to job directory with samples.
samples_file: path to samples file.
min_class_size: minimal number of samples per label (default 2).
test_size: proportion of the dataset to include in the test set (default 0.2).
val_size: proportion of the dataset to include in the validation set (default 0.1).
part_size: proportion of the full dataset to use across all sets combined (default 1).
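
To make the size attributes concrete, the sketch below shows one plausible reading of how they interact. It assumes that part_size is applied first, that test_size and val_size are both fractions of the reduced dataset, and that labels with fewer than min_class_size samples are dropped; the exact order and rounding used by DataPrep may differ. Both function names are hypothetical.

```python
from collections import Counter


def drop_small_classes(samples, min_class_size=2):
    """Keep only samples whose label occurs at least min_class_size times."""
    counts = Counter(s["label"] for s in samples)
    return [s for s in samples if counts[s["label"]] >= min_class_size]


def split_sizes(n_samples, test_size=0.2, val_size=0.1, part_size=1.0):
    """Return the number of samples per set under the assumptions stated above."""
    n_used = int(n_samples * part_size)   # portion of the dataset that is used at all
    n_test = round(n_used * test_size)
    n_val = round(n_used * val_size)
    return {"train": n_used - n_test - n_val, "val": n_val, "test": n_test}
```

Under these assumptions, 1000 samples with the default values yield 700 training, 100 validation, and 200 test samples; with part_size set to 0.5 the counts become 350, 50, and 100.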

__init__

def __init__(job_dir, image_dir, samples_file, min_class_size, test_size, val_size, part_size, **kwargs)

Inits data preparation component.

Loads samples file. Initializes variables for further operations: valid_image_ids, class_mapping, train_samples, val_samples, test_samples.

run

def run(resize)

Executes all steps of data preparation.

Validates samples and images
Creates class-mapping (string to integer)
Applies class-mapping on samples
Splits samples into train, validation, and test sets
Resizes images (only if resize is set)
Saves files (class mapping, train, validation, and test sets)

Args

resize: boolean (creates a subfolder of resized images, default False).
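
The class-mapping and splitting steps of run can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the actual implementation: the function names are hypothetical, the mapping direction (label string to integer) follows the step description above, and the split is a simple seeded shuffle, whereas DataPrep may split differently (for example, stratified by label).

```python
import random


def create_class_mapping(samples):
    """Map each distinct label string to an integer (sorted for determinism)."""
    labels = sorted({s["label"] for s in samples})
    return {label: idx for idx, label in enumerate(labels)}


def apply_class_mapping(samples, class_mapping):
    """Return copies of the samples with string labels replaced by integers."""
    return [
        {"image_id": s["image_id"], "label": class_mapping[s["label"]]}
        for s in samples
    ]


def split_samples(samples, test_size=0.2, val_size=0.1, seed=0):
    """Shuffle with a fixed seed, then slice into train, validation, and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_val = round(len(shuffled) * val_size)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

Sorting the labels before enumerating them keeps the mapping stable across runs, which matters because later components read the saved mapping back.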