Getting Started

Installation

Latest Stable Version (pip installation):

pip install imageatm

Bleeding Edge Version (manual installation):

git clone https://github.com/idealo/imageatm.git
cd imageatm
python setup.py install
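
To check that the installation worked you can, for example, display the CLI help (assuming the imageatm command is on your PATH; the exact output depends on the installed version):

imageatm --help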

Folder structure and file formats

The starting folder structure is:

root
├── config_file.yml
├── data.json
└── images
    ├── image_1.jpg
    ├── image_2.jpg
    ├── image_3.jpg
    └── image_4.jpg

data.json is a file containing a mapping between the images and their labels. This file must be in the following format:

[
    {
        "image_id": "image_1.jpg",
        "label": "Class 1"
    },
    {
        "image_id": "image_2.jpg",
        "label": "Class 1"
    },
    {
        "image_id": "image_3.jpg",
        "label": "Class 2"
    },
    {
        "image_id": "image_4.jpg",
        "label": "Class 2"
    },
    ...
]

In the next sections we will use the cats and dogs dataset to showcase all examples. Our starting structure will therefore look as follows:

root
 ├── data.json
 ├── cats_and_dogs_job_dir
 └── cats_and_dogs
     └── train
         ├── cat.0.jpg
         ├── cat.1.jpg
         ├── dog.0.jpg
         └── dog.1.jpg  

Download and prepare the cats and dogs dataset as follows:

wget --no-check-certificate \
    https://storage.googleapis.com/mledu-datasets/cats_and_dogs_filtered.zip \
    -O cats_and_dogs_filtered.zip

unzip cats_and_dogs_filtered.zip

mkdir -p cats_and_dogs/train
mv cats_and_dogs_filtered/train/cats/* cats_and_dogs/train
mv cats_and_dogs_filtered/train/dogs/* cats_and_dogs/train

Convert the data into our input format (data.json):

[
    {
        "image_id": "cat.0.jpg",
        "label": "Cat"
    },
    {
        "image_id": "cat.1.jpg",
        "label": "Cat"
    },
    {
        "image_id": "dog.0.jpg",
        "label": "Dog"
    },
    {
        "image_id": "dog.1.jpg",
        "label": "Dog"
    },
    ...
]

You can use the following code to create data.json:

import os
import json

# List all image files in the training folder
filenames = os.listdir('cats_and_dogs/train')

# Build one entry per image, deriving the label from the file name
sample_json = []
for filename in filenames:
    sample_json.append({
        'image_id': filename,
        'label': 'Cat' if 'cat' in filename else 'Dog'
    })

with open('data.json', 'w') as outfile:
    json.dump(sample_json, outfile, indent=4, sort_keys=True)
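
As an optional sanity check (not part of imageatm itself), you can reload the file and count the samples per label:

import json
from collections import Counter

with open('data.json') as infile:
    samples = json.load(infile)

# Prints the number of samples per label, e.g. Counter({'Cat': ..., 'Dog': ...})
print(Counter(sample['label'] for sample in samples))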

Simple Example for local training

Train with CLI

Define your config_file.yml:

image_dir: cats_and_dogs/train/
job_dir: cats_and_dogs_job_dir/

dataprep:
  run: True
  samples_file: data.json
  resize: True

train:
  run: True

evaluate:
  run: True
  report:
    create: True
    kernel_name: imageatm
    export_html: True
    export_pdf: True

These configurations will run three components in a pipeline: data preparation, training, and evaluation. If you set create: True, a Jupyter notebook with the evaluation report is created. Be sure to set a proper kernel_name. If you don't have an IPython kernel available, create one.
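
One way to register such a kernel (this is a general IPython recipe, not specific to imageatm) is:

pip install ipykernel
# creates a kernel called 'imageatm', matching kernel_name above
python -m ipykernel install --user --name imageatm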

If you would like to export the report to HTML, set export_html: True.

If you would like to export the report to PDF, set export_pdf: True. The PDF is generated via LaTeX, so be sure to have all required LaTeX packages installed.
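
As a rough guide, on Debian/Ubuntu the tooling commonly needed for notebook-to-PDF conversion can be installed like this (the package names are an assumption and differ on other systems):

sudo apt-get install pandoc texlive-xetex texlive-latex-extra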

Then run:

imageatm pipeline config_file.yml

The resulting folder structure will look like this:

root
├── config_file.yml
├── data.json
├── cats_and_dogs
│   └── train
│       ├── cat.0.jpg
│       ├── cat.1.jpg
│       ├── dog.0.jpg
│       └── dog.1.jpg
└── cats_and_dogs_job_dir
    ├── class_mapping.json
    ├── test_samples.json
    ├── train_samples.json
    ├── val_samples.json
    ├── logs
    ├── models
    │   ├── model_1.hdf5
    │   ├── model_2.hdf5
    │   └── model_3.hdf5
    └── evaluation
        ├── evaluation_report.html
        ├── evaluation_report.ipynb
        └── evaluation_report.pdf

Train without CLI

Run the data preparation:

from imageatm.components import DataPrep

dp = DataPrep(
    image_dir='cats_and_dogs/train',
    samples_file='data.json',
    job_dir='cats_and_dogs_job_dir'
)

dp.run(resize=True)

Run the training:

from imageatm.components import Training

trainer = Training(dp.image_dir, dp.job_dir)
trainer.run()

Run the evaluation:

from imageatm.components import Evaluation

evaluater = Evaluation(image_dir=dp.image_dir, job_dir=dp.job_dir)
evaluater.run()

Simple Example for cloud training

Initial cloud set-up

To train your model in the cloud you'll need an existing S3 bucket where the content of your local job_dir and image_dir, as well as the trained models, will be stored.

If you don't have an S3 bucket, you'll have to create one.
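
One way to do this is via the AWS CLI (the bucket name and region below are placeholders, matching the example config further down):

aws s3 mb s3://example-bucket --region eu-west-1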

Attach the following IAM policies to your AWS user account (a CLI example follows the policy documents):

  • IAM policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1508403745000",
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:CreateRole",
                "iam:GetPolicy",
                "iam:GetRole",
                "iam:GetPolicyVersion",
                "iam:CreateInstanceProfile",
                "iam:AttachRolePolicy",
                "iam:ListRolePolicies",
                "iam:GetInstanceProfile",
                "iam:ListEntitiesForPolicy",
                "iam:ListPolicyVersions",
                "iam:CreatePolicyVersion",
                "iam:RemoveRoleFromInstanceProfile",
                "iam:DetachRolePolicy",
                "iam:DeleteInstanceProfile",
                "iam:DeletePolicyVersion",
                "iam:ListInstanceProfilesForRole",
                "iam:DeletePolicy",
                "iam:DeleteRole",
                "iam:AddRoleToInstanceProfile",
                "iam:PassRole"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
  • EC2 policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "ec2:*",
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "elasticloadbalancing:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "autoscaling:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": [
                        "autoscaling.amazonaws.com",
                        "ec2scheduled.amazonaws.com",
                        "elasticloadbalancing.amazonaws.com",
                        "spot.amazonaws.com",
                        "spotfleet.amazonaws.com",
                        "transitgateway.amazonaws.com"
                    ]
                }
            }
        }
    ]
}
  • S3 policy:
{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": "s3:*",
           "Resource": "*"
       }
   ]
}

Configure your AWS client

In your shell, run `aws configure` and enter:

- AWS Access Key ID
- AWS Secret Access Key
- Default region name (for example `eu-central-1`)
- Default output format (for example `json`)
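
A session might look roughly like this (the key values are AWS's documented placeholder examples, not real credentials):

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: eu-central-1
Default output format [None]: json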

Now you are ready to kick off the cloud training.

Train with CLI

Define your config_file.yml:

image_dir: cats_and_dogs/train
job_dir: cats_and_dogs_job_dir/

dataprep:
  run: True
  samples_file: data.json
  resize: True

train:
  run: True
  cloud: True

cloud:
  cloud_tag: image_atm_example_tag
  provider: aws
  tf_dir: cloud/aws
  region: eu-west-1
  vpc_id: vpc-example
  instance_type: p2.xlarge
  bucket: s3://example-bucket
  destroy: True

These configurations will run the data preparation locally, then launch an AWS EC2 instance, run the training on it, and copy the content of your local job_dir and image_dir to the pre-defined S3 bucket. Since we set destroy: True in this example, the EC2 instance and all its dependencies will also be destroyed once the training has finished. If you want to reuse the instance for multiple experiments, set destroy: False.

Then run imageatm in your shell:

imageatm pipeline config_file.yml

The resulting S3 bucket structure will look like this:

s3://example-bucket
├── image_dirs
│   └── train_resized
│       ├── cat.0.jpg
│       ├── cat.1.jpg
│       ├── dog.0.jpg
│       └── dog.1.jpg
└── job_dirs
    └── cats_and_dogs_job_dir
        ├── class_mapping.json
        ├── test_samples.json
        ├── train_samples.json
        ├── val_samples.json
        └── models
            ├── model_1.hdf5
            ├── model_2.hdf5
            └── model_3.hdf5

The resulting local structure will look like this:

root
├── cats_and_dogs
│   ├── train
│   │   ├── cat.0.jpg
│   │   ├── cat.1.jpg
│   │   ├── dog.0.jpg
│   │   └── dog.1.jpg
│   └── train_resized
│       ├── cat.0.jpg
│       ├── cat.1.jpg
│       ├── dog.0.jpg
│       └── dog.1.jpg
└── cats_and_dogs_job_dir
    ├── class_mapping.json
    ├── test_samples.json
    ├── train_samples.json
    ├── val_samples.json
    ├── logs
    └── models
        ├── model_1.hdf5
        ├── model_2.hdf5
        └── model_3.hdf5

Train without CLI

Make sure you've got your cloud setup ready as described in the cloud training introduction.

Run the data preparation:

from imageatm.components import DataPrep

dp = DataPrep(
    image_dir='cats_and_dogs/train',
    samples_file='data.json',
    job_dir='cats_and_dogs_job_dir'
)

dp.run(resize=True)

Run the training:

from imageatm.components import AWS

cloud = AWS(
    tf_dir='cloud/aws',
    region='eu-west-1',
    instance_type='p2.xlarge',
    vpc_id='vpc-example',
    s3_bucket='s3://example-bucket',
    job_dir=dp.job_dir,
    cloud_tag='image_atm_example_tag',
)

cloud.init()
cloud.apply()
cloud.train(image_dir=dp.image_dir)

Once the training is completed, you have to destroy the EC2 instance and its dependencies manually:

cloud.destroy()