
CNN

class CNN

Find duplicates using CNN and/or generate CNN encodings given a single image or a directory of images.

The module can be used for two purposes: encoding generation and duplicate detection.

  • Encoding generation: Propagate an image through a convolutional neural network architecture and generate encodings, which can be used later for deduplication. The 'encode_image' method obtains the CNN encoding for a single image, while 'encode_images' obtains encodings for all images in a directory.

  • Duplicate detection: Find duplicates either using the encoding map generated previously by 'encode_images' or using the path to a directory containing the images to be deduplicated. The 'find_duplicates' and 'find_duplicates_to_remove' methods accomplish these tasks.

__init__

def __init__(verbose, model_config)

Initialize a PyTorch MobileNetV3 model sliced at the last convolutional layer, and set the batch size for the PyTorch dataloader to 64 samples.

Args
  • verbose: Display progress bar if True else disable it. Default value is True.

  • model_config: A CustomModel that can be used to initialize a custom PyTorch model along with the corresponding transform.

apply_preprocess

def apply_preprocess(im_arr)

Apply the MobileNet preprocessing function to images.

Args
  • im_arr: Image typecast to numpy array.
Returns
  • transformed_image_tensor: Transformed images returned as a pytorch tensor.
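The exact transform is not shown in this reference. As a rough sketch, the per-channel normalization step commonly used for MobileNet looks like the following (the standard ImageNet mean/std constants are an assumption, and the resize/crop steps and tensor conversion are omitted):

```python
import numpy as np

# Standard ImageNet normalization constants (an assumption: the actual
# transform used by the library may differ in resize/crop details).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_imagenet(im_arr: np.ndarray) -> np.ndarray:
    """Scale a uint8 HxWx3 image to [0, 1] and normalize per channel."""
    scaled = im_arr.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD
```

In the library itself this step returns a PyTorch tensor rather than a numpy array.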

encode_image

def encode_image(image_file, image_array)

Generate CNN encoding for a single image.

Args
  • image_file: Path to the image file.

  • image_array: Optional, used instead of image_file. Image typecast to numpy array.

Returns
  • encoding: Encoding for the image as a numpy array.
Example usage
from imagededup.methods import CNN
myencoder = CNN()
encoding = myencoder.encode_image(image_file='path/to/image.jpg')
OR
encoding = myencoder.encode_image(image_array=<numpy array of image>)

encode_images

def encode_images(image_dir, recursive, num_enc_workers)

Generate CNN encodings for all images in a given directory.

Args
  • image_dir: Path to the image directory.

  • recursive: Optional, find images recursively in a nested image directory structure, set to False by default.

  • num_enc_workers: Optional, number of CPU cores to use for multiprocessing encoding generation (supported only on Linux), set to 0 by default. 0 disables multiprocessing.

Returns
  • dictionary: Contains a mapping of filenames and corresponding numpy array of CNN encodings.
Example usage
from imagededup.methods import CNN
myencoder = CNN()
encoding_map = myencoder.encode_images(image_dir='path/to/image/directory')
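To illustrate what the recursive flag controls, here is a hypothetical helper (the function name, extension list, and glob patterns are illustrative, not the library's internals) that collects image files from a directory, optionally descending into subdirectories:

```python
from pathlib import Path

# Illustrative extension set; the library's accepted formats may differ.
IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.bmp'}

def list_images(image_dir: str, recursive: bool = False):
    """Collect image files from image_dir; '**/*' descends into subdirs."""
    root = Path(image_dir)
    pattern = '**/*' if recursive else '*'
    return sorted(p for p in root.glob(pattern)
                  if p.is_file() and p.suffix.lower() in IMAGE_EXTS)
```

With recursive=False only top-level images are found; with recursive=True the whole directory tree is scanned.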

find_duplicates

def find_duplicates(image_dir, encoding_map, min_similarity_threshold, scores, outfile, recursive, num_enc_workers, num_sim_workers)

Find duplicates for each file. Takes either the path to the image directory or an encoding dictionary in which duplicates are to be detected above the given threshold. Returns a dictionary with each filename as key and a list of its duplicate file names as value. Optionally, cosine similarity scores can be returned along with the duplicate filenames for each query file.

Args
  • image_dir: Path to the directory containing all the images.

  • encoding_map: Optional, used instead of image_dir, a dictionary containing mapping of filenames and corresponding CNN encodings.

  • min_similarity_threshold: Optional, threshold value (must be a float between -1.0 and 1.0). Default is 0.9.

  • scores: Optional, boolean indicating whether similarity scores are to be returned along with retrieved duplicates.

  • outfile: Optional, name of the file to save the results; must be a JSON file. Default is None.

  • recursive: Optional, find images recursively in a nested image directory structure, set to False by default.

  • num_enc_workers: Optional, number of CPU cores to use for multiprocessing encoding generation (supported only on Linux), set to 0 by default. 0 disables multiprocessing.

  • num_sim_workers: Optional, number of CPU cores to use for multiprocessing similarity computation, set to the number of CPUs in the system by default. 0 disables multiprocessing.

Returns
  • dictionary: If scores is True, a dictionary of the form {'image1.jpg': [('image1_duplicate1.jpg', score), ('image1_duplicate2.jpg', score)], 'image2.jpg': [], ...}. If scores is False, a dictionary of the form {'image1.jpg': ['image1_duplicate1.jpg', 'image1_duplicate2.jpg'], 'image2.jpg': ['image1_duplicate1.jpg', ...], ...}.
Example usage
from imagededup.methods import CNN
myencoder = CNN()
duplicates = myencoder.find_duplicates(image_dir='path/to/directory', min_similarity_threshold=0.85, scores=True,
outfile='results.json')

OR

from imagededup.methods import CNN
myencoder = CNN()
duplicates = myencoder.find_duplicates(encoding_map=<mapping filename to cnn encodings>,
min_similarity_threshold=0.85, scores=True, outfile='results.json')
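Conceptually, the thresholding step can be sketched with plain NumPy: unit-normalize each encoding, compute pairwise cosine similarities, and keep pairs at or above min_similarity_threshold. This is an illustrative sketch, not the library's implementation:

```python
import numpy as np

def find_duplicates_sketch(encoding_map, min_similarity_threshold=0.9):
    """Pairwise cosine-similarity thresholding over an encoding map
    (a conceptual sketch of what duplicate detection computes)."""
    names = list(encoding_map)
    mat = np.stack([encoding_map[n] for n in names]).astype(np.float32)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize rows
    sims = mat @ mat.T  # cosine similarity matrix
    result = {}
    for i, name in enumerate(names):
        result[name] = [(names[j], float(sims[i, j]))
                        for j in range(len(names))
                        if j != i and sims[i, j] >= min_similarity_threshold]
    return result
```

The real method adds batching, multiprocessing, and optional JSON output on top of this core idea.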

find_duplicates_to_remove

def find_duplicates_to_remove(image_dir, encoding_map, min_similarity_threshold, outfile, recursive, num_enc_workers, num_sim_workers)

Return a list of image file names to remove based on the similarity threshold. Does not delete the files themselves.

Args
  • image_dir: Path to the directory containing all the images.

  • encoding_map: Optional, used instead of image_dir, a dictionary containing mapping of filenames and corresponding CNN encodings.

  • min_similarity_threshold: Optional, threshold value (must be a float between -1.0 and 1.0). Default is 0.9.

  • outfile: Optional, name of the file to save the results; must be a JSON file. Default is None.

  • recursive: Optional, find images recursively in a nested image directory structure, set to False by default.

  • num_enc_workers: Optional, number of CPU cores to use for multiprocessing encoding generation (supported only on Linux), set to 0 by default. 0 disables multiprocessing.

  • num_sim_workers: Optional, number of CPU cores to use for multiprocessing similarity computation, set to the number of CPUs in the system by default. 0 disables multiprocessing.

Returns
  • duplicates: List of image file names that should be removed.
Example usage
from imagededup.methods import CNN
myencoder = CNN()
duplicates = myencoder.find_duplicates_to_remove(image_dir='path/to/images/directory',
min_similarity_threshold=0.85)

OR

from imagededup.methods import CNN
myencoder = CNN()
duplicates = myencoder.find_duplicates_to_remove(encoding_map=<mapping filename to cnn encodings>,
min_similarity_threshold=0.85, outfile='results.json')
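Since the method only reports candidates and does not delete anything, acting on its output is left to the caller. A hypothetical helper (not part of the library) that deletes the reported files, with a dry-run safeguard, might look like:

```python
import os

def remove_files(duplicates, image_dir, dry_run=True):
    """Delete the files named in `duplicates` from `image_dir`.
    With dry_run=True, only report what would be removed."""
    removed = []
    for name in duplicates:
        path = os.path.join(image_dir, name)
        if os.path.isfile(path):
            if not dry_run:
                os.remove(path)
            removed.append(name)
    return removed
```

Running with dry_run=True first is a sensible safeguard before irreversibly deleting images.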