Data Preparation
class DataPrep
Prepares data for training.
Based on a samples file and an image directory, a train, a validation, and a test set are created. Before data preparation starts, the samples file and the image directory are validated.
DataPrep creates a job directory with the following set of files:
- class_mapping.json
- train_samples.json
- val_samples.json
- test_samples.json
These files are required for subsequent components (training and evaluation).
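Before handing a job directory to a later component, it can help to confirm that all four files are present. The following is a minimal sketch; `missing_job_files` is a hypothetical helper, not part of DataPrep, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_JOB_FILES = (
    "class_mapping.json",
    "train_samples.json",
    "val_samples.json",
    "test_samples.json",
)


def missing_job_files(job_dir):
    """Return the expected file names that are absent from job_dir (hypothetical helper)."""
    job_path = Path(job_dir)
    return [name for name in EXPECTED_JOB_FILES if not (job_path / name).is_file()]
```

An empty list means the job directory contains every file that training and evaluation expect.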
The samples file must be in JSON format: a list of objects, each with the keys image_id and label.
[
    { "image_id": "image_1.jpg", "label": "Class 1" },
    { "image_id": "image_2.jpg", "label": "Class 1" },
    ...
]
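A samples file in this shape can be produced with the standard `json` module. The sketch below assumes the labels are available as (image_id, label) pairs; `write_samples_file` is a hypothetical helper name used for illustration.

```python
import json
from pathlib import Path


def write_samples_file(samples, path):
    """Write (image_id, label) pairs as a samples JSON file (hypothetical helper)."""
    records = [{"image_id": image_id, "label": label} for image_id, label in samples]
    Path(path).write_text(json.dumps(records, indent=2))
    return records
```

For example, `write_samples_file([("image_1.jpg", "Class 1"), ("image_2.jpg", "Class 1")], "samples.json")` writes the two records shown above.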
Attributes
- image_dir: path to the image directory.
- job_dir: path to the job directory with samples.
- samples_file: path to the samples file.
- min_class_size: minimum number of samples per label (default 2).
- test_size: proportion of the dataset to include in the test set (default 0.2).
- val_size: proportion of the dataset to include in the validation set (default 0.1).
- part_size: proportion of the dataset to use across all sets (default 1).
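To illustrate what min_class_size means in practice, here is a minimal sketch that drops labels with too few samples. It assumes under-represented labels are removed rather than causing an error, which this section does not state; `drop_small_classes` is a hypothetical name and not DataPrep's own code.

```python
from collections import Counter


def drop_small_classes(samples, min_class_size=2):
    """Keep only samples whose label occurs at least min_class_size times.

    Assumption: small classes are filtered out silently.
    """
    counts = Counter(sample["label"] for sample in samples)
    return [s for s in samples if counts[s["label"]] >= min_class_size]
```

With the default of 2, a label that appears on a single image is excluded, since one sample cannot be shared between the train, validation, and test sets.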
__init__
def __init__(job_dir, image_dir, samples_file, min_class_size, test_size, val_size, part_size, **kwargs)
Initializes the data preparation component.
Loads the samples file. Initializes variables for further operations: valid_image_ids, class_mapping, train_samples, val_samples, test_samples.
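As a rough idea of how a list such as valid_image_ids might be derived, the sketch below keeps the image ids of samples that carry both required keys and point to an existing file. The real validation may check more (for example, whether the file can be opened as an image); `find_valid_image_ids` is a hypothetical function written for illustration.

```python
from pathlib import Path


def find_valid_image_ids(samples, image_dir):
    """Return image_ids of well-formed samples whose file exists in image_dir.

    Assumption: "valid" means both keys are present and the file exists.
    """
    image_path = Path(image_dir)
    return [
        sample["image_id"]
        for sample in samples
        if "image_id" in sample
        and "label" in sample
        and (image_path / sample["image_id"]).is_file()
    ]
```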
run
def run(resize)
Executes all steps of data preparation.
- Validates samples and images
- Creates the class mapping (string to integer)
- Applies the class mapping to the samples
- Splits the samples into train, validation, and test sets
- Resizes images (if resize is True)
- Saves files (class mapping, train, validation, and test sets)
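The mapping and splitting steps above can be sketched in plain Python. This is an illustration under stated assumptions, not DataPrep's implementation: labels are numbered in sorted order, the split is a seeded shuffle rather than a stratified split, and test_size and val_size are treated as fractions of the full sample list. All function names are hypothetical; validation, resizing, and saving are left out.

```python
import random


def build_class_mapping(samples):
    """Map each distinct label string to an integer, in sorted label order."""
    labels = sorted({sample["label"] for sample in samples})
    return {label: index for index, label in enumerate(labels)}


def apply_class_mapping(samples, class_mapping):
    """Return copies of the samples with string labels replaced by integers."""
    return [
        {"image_id": s["image_id"], "label": class_mapping[s["label"]]}
        for s in samples
    ]


def split_samples(samples, test_size=0.2, val_size=0.1, seed=0):
    """Shuffle the samples and cut them into train, val, and test lists.

    Assumption: both sizes are fractions of the whole list (not stratified).
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    n_val = round(len(shuffled) * val_size)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

With the default sizes, ten samples end up as seven train, one validation, and two test samples. A stratified split would additionally keep label proportions similar across the three sets.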
Args
- resize: boolean (creates a subfolder of resized images, default False).
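A typical call sequence is to construct the component and then call run. Because this section does not give the import path for DataPrep, the sketch below takes the class as a parameter; `prepare_data` is a hypothetical wrapper, and the keyword values are the defaults listed under Attributes.

```python
def prepare_data(data_prep_cls, job_dir, image_dir, samples_file, resize=False):
    """Instantiate a DataPrep-like class with the documented defaults and run it.

    data_prep_cls is expected to be the DataPrep class; how it is imported
    depends on the installation and is not covered here.
    """
    data_prep = data_prep_cls(
        job_dir=job_dir,
        image_dir=image_dir,
        samples_file=samples_file,
        min_class_size=2,
        test_size=0.2,
        val_size=0.1,
        part_size=1,
    )
    # resize=True asks run to create a subfolder of resized images.
    data_prep.run(resize=resize)
    return data_prep
```

After the call returns, the job directory should contain the class mapping and the three sample files.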