Data Preparation
class DataPrep
Prepares data for training.
Creates train, validation, and test sets from a samples file and an image directory. Both the samples file and the image directory are validated before data preparation starts.
DataPrep creates a job directory containing the class mapping and the train, validation, and test sets.
These files are required by the subsequent components (training and evaluation).
The samples file must be in JSON format: a list of objects, each with the keys image_id and label:
[
    { "image_id": "image_1.jpg", "label": "Class 1" },
    { "image_id": "image_2.jpg", "label": "Class 1" },
    ...
]
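As an illustration of what checking this format could look like, here is a minimal sketch that parses a samples file and verifies that every entry carries both keys. The helper name load_samples and the error messages are hypothetical and not part of DataPrep; the component's own validation may differ.

```python
import json

# Keys every sample entry must provide, per the format described above.
REQUIRED_KEYS = {"image_id", "label"}


def load_samples(text):
    """Parse samples JSON text and check the structure of each entry.

    Hypothetical helper for illustration only; not DataPrep's own validation.
    """
    samples = json.loads(text)
    if not isinstance(samples, list):
        raise ValueError("samples file must contain a JSON list")
    for index, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"sample {index} is not a JSON object")
        missing = REQUIRED_KEYS - sample.keys()
        if missing:
            raise ValueError(f"sample {index} is missing keys: {sorted(missing)}")
    return samples
```

Calling load_samples on the example above (without the "..." placeholder) returns the list of sample dictionaries; an entry without a label raises ValueError.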
image_dir: path of image directory.
job_dir: path to job directory with samples.
samples_file: path to samples file.
min_class_size: minimum number of samples per label (default 2).
test_size: proportion of the dataset to include in the test set (default 0.2).
val_size: proportion of the dataset to include in the validation set (default 0.1).
part_size: proportion of the dataset to use across all sets (default 1).
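To make the effect of min_class_size concrete, the sketch below drops every sample whose label occurs fewer than min_class_size times. The function name filter_small_classes is hypothetical, and treating undersized classes as "dropped" is an assumption about how the threshold is enforced.

```python
from collections import Counter


def filter_small_classes(samples, min_class_size=2):
    """Keep only samples whose label occurs at least min_class_size times.

    Hypothetical helper; assumes undersized classes are simply removed.
    """
    counts = Counter(sample["label"] for sample in samples)
    return [s for s in samples if counts[s["label"]] >= min_class_size]
```

With the default of 2, a label that appears on a single image is removed, since one sample cannot be shared between several sets.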
def __init__(job_dir, image_dir, samples_file, min_class_size, test_size, val_size, part_size, **kwargs)
Initializes the data preparation component.
Loads the samples file and initializes the variables used by later operations: valid_image_ids, class_mapping, train_samples, val_samples, test_samples.
def run(resize)
Executes all steps of data preparation.
Validates samples and images
Creates class-mapping (string to integer)
Applies class-mapping on samples
Splits samples into train, validation, and test sets
Resizes images
Saves files (class-mapping, train-, validation- and test-set)
- resize: boolean; if True, creates a subfolder of resized images (default False).
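The mapping and splitting steps listed above can be sketched in plain Python as follows. All three function names are hypothetical, and the sketch rests on assumptions the documentation does not confirm: labels are indexed in sorted order, test_size and val_size are both measured against the full dataset, and the split is a simple seeded shuffle rather than a stratified one.

```python
import random


def build_class_mapping(samples):
    """Map each distinct label string to an integer (sorted order is an assumption)."""
    labels = sorted({sample["label"] for sample in samples})
    return {label: index for index, label in enumerate(labels)}


def apply_class_mapping(samples, mapping):
    """Return copies of the samples with string labels replaced by integers."""
    return [
        {"image_id": s["image_id"], "label": mapping[s["label"]]}
        for s in samples
    ]


def split_samples(samples, test_size=0.2, val_size=0.1, seed=0):
    """Shuffle and cut the samples into train, validation, and test lists.

    Assumes both proportions refer to the full dataset; not stratified.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_val = round(len(shuffled) * val_size)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

A fixed seed keeps the split reproducible between runs, which matters when the saved sets are reused by later training and evaluation steps.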