Hashing

class Hashing

Find duplicate images using hashing algorithms, and/or generate hashes for a single image or a directory of images.

The module can be used for two purposes: encoding generation and duplicate detection.

  • Encoding generation: Generate hashes using a specific hashing method; the hashes can be stored and reused later for deduplication. The 'encode_image' method of a hashing object returns the hash for a single image, while 'encode_images' returns hashes for all images in a directory.

  • Duplicate detection: Find duplicates either from the encoding map generated previously with 'encode_images' or from a path to the directory containing the images to be deduplicated. The 'find_duplicates' and 'find_duplicates_to_remove' methods accomplish these tasks; a minimal end-to-end sketch follows this list.
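For orientation, here is a minimal sketch of that flow using PHash (any of the hashing classes below follows the same pattern; the directory path is a placeholder):

from imagededup.methods import PHash

phasher = PHash()

# Step 1: generate a hash for every image in the directory.
encodings = phasher.encode_images(image_dir='path/to/directory')

# Step 2: reuse those hashes to detect duplicates.
duplicates = phasher.find_duplicates(encoding_map=encodings)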

__init__

def __init__(verbose)

Initialize hashing class.

Args
  • verbose: Display a progress bar if True, disable it otherwise. Default value is True.

hamming_distance

def hamming_distance(hash1, hash2)

Calculate the hamming distance between two hashes. If a hash is shorter than 64 bits, it is zero-padded to 64 bits before the distance is calculated.

Args
  • hash1: hash string

  • hash2: hash string

Returns
  • hamming_distance: Hamming distance between the two hashes.
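The padding-and-count logic described above can be sketched as follows; this is an illustrative standalone function, not the library's internal code:

# Standalone sketch; not imagededup's implementation.
def hamming_distance_sketch(hash1, hash2):
    # Interpret each hexadecimal hash as binary and zero-pad
    # on the left so both strings are 64 bits long.
    bits1 = bin(int(hash1, 16))[2:].zfill(64)
    bits2 = bin(int(hash2, 16))[2:].zfill(64)
    # Count the positions where the bits differ.
    return sum(b1 != b2 for b1, b2 in zip(bits1, bits2))

hamming_distance_sketch('0000000000000000', '000000000000000f')  # -> 4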

encode_image

def encode_image(image_file, image_array)

Generate hash for a single image.

Args
  • image_file: Path to the image file.

  • image_array: Optional, used instead of image_file. Image as a numpy array.

Returns
  • hash: A 16 character hexadecimal string hash for the image.
Example usage
from imagededup.methods import <hash-method>
myencoder = <hash-method>()
myhash = myencoder.encode_image(image_file='path/to/image.jpg')
OR
myhash = myencoder.encode_image(image_array=<numpy array of image>)

encode_images

def encode_images(image_dir, recursive, num_enc_workers)

Generate hashes for all images in a given directory of images.

Args
  • image_dir: Path to the image directory.

  • recursive: Optional, find images recursively in a nested image directory structure, set to False by default.

  • num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation, set to number of CPUs in the system by default. 0 disables multiprocessing.

Returns
  • dictionary: A dictionary mapping each filename to its 16 character hexadecimal hash string, such as {'Image1.jpg': 'hash_string1', 'Image2.jpg': 'hash_string2', ...}
Example usage
from imagededup.methods import <hash-method>
myencoder = <hash-method>()
mapping = myencoder.encode_images('path/to/directory')

find_duplicates

def find_duplicates(image_dir, encoding_map, max_distance_threshold, scores, outfile, search_method, recursive, num_enc_workers, num_dist_workers)

Find duplicates for each file. Takes in a path to the directory or an encoding dictionary in which duplicates are to be detected. All images with a hamming distance less than or equal to max_distance_threshold are regarded as duplicates. Returns a dictionary whose keys are filenames and whose values are lists of duplicate file names. Optionally, the hamming distances of the retrieved duplicates can be returned along with the filenames for each query file.

Args
  • image_dir: Path to the directory containing all the images.

  • encoding_map: Optional, used instead of image_dir, a dictionary containing mapping of filenames and corresponding hashes.

  • max_distance_threshold: Optional, maximum hamming distance between two images for them to be considered duplicates (must be an int between 0 and 64). Default is 10.

  • scores: Optional, boolean indicating whether hamming distances should be returned along with retrieved duplicates.

  • outfile: Optional, name of the file to save the results; must be a JSON file. Default is None.

  • search_method: Optional, algorithm used to retrieve duplicates. Defaults to 'brute_force_cython' on Unix systems, 'bktree' otherwise.

  • recursive: Optional, find images recursively in a nested image directory structure, set to False by default.

  • num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation, set to number of CPUs in the system by default. 0 disables multiprocessing.

  • num_dist_workers: Optional, number of cpu cores to use for multiprocessing distance computation, set to number of CPUs in the system by default. 0 disables multiprocessing.

Returns
  • duplicates dictionary: If scores is True, a dictionary of the form {'image1.jpg': [('image1_duplicate1.jpg', score), ('image1_duplicate2.jpg', score)], 'image2.jpg': [], ...}. If scores is False, a dictionary of the form {'image1.jpg': ['image1_duplicate1.jpg', 'image1_duplicate2.jpg'], 'image2.jpg': ['image1_duplicate1.jpg', ...], ...}
Example usage
from imagededup.methods import <hash-method>
myencoder = <hash-method>()
duplicates = myencoder.find_duplicates(image_dir='path/to/directory', max_distance_threshold=15, scores=True,
outfile='results.json')

OR

from imagededup.methods import <hash-method>
myencoder = <hash-method>()
duplicates = myencoder.find_duplicates(encoding_map=<mapping filename to hashes>,
max_distance_threshold=15, scores=True, outfile='results.json')

find_duplicates_to_remove

def find_duplicates_to_remove(image_dir, encoding_map, max_distance_threshold, outfile, recursive, num_enc_workers, num_dist_workers)

Return a list of image file names to remove, based on the hamming distance threshold. The files themselves are not removed; only their names are returned.

Args
  • image_dir: Path to the directory containing all the images.

  • encoding_map: Optional, used instead of image_dir, a dictionary containing mapping of filenames and corresponding hashes.

  • max_distance_threshold: Optional, maximum hamming distance between two images for them to be considered duplicates (must be an int between 0 and 64). Default is 10.

  • outfile: Optional, name of the file to save the results; must be a JSON file. Default is None.

  • recursive: Optional, find images recursively in a nested image directory structure, set to False by default.

  • num_enc_workers: Optional, number of cpu cores to use for multiprocessing encoding generation, set to number of CPUs in the system by default. 0 disables multiprocessing.

  • num_dist_workers: Optional, number of cpu cores to use for multiprocessing distance computation, set to number of CPUs in the system by default. 0 disables multiprocessing.

Returns
  • duplicates: List of image file names that are found to be duplicates of some other file in the directory.
Example usage
from imagededup.methods import <hash-method>
myencoder = <hash-method>()
duplicates = myencoder.find_duplicates_to_remove(image_dir='path/to/images/directory',
max_distance_threshold=15)

OR

from imagededup.methods import <hash-method>
myencoder = <hash-method>()
duplicates = myencoder.find_duplicates_to_remove(encoding_map=<mapping filename to hashes>,
max_distance_threshold=15, outfile='results.json')
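
Since the method only returns file names, deleting the files is left to the caller. A possible follow-up, assuming the returned names resolve against the original image directory:

import os

for filename in duplicates:
    os.remove(os.path.join('path/to/images/directory', filename))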

class PHash

Inherits from Hashing base class and implements perceptual hashing (Implementation reference: http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html).

Offers all the functionality mentioned in the Hashing class.
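
At a conceptual level, perceptual hashing shrinks the image, takes a 2-D DCT, and thresholds the low-frequency coefficients at their median. A minimal sketch of that idea, assuming PIL, NumPy, and SciPy are available (illustrative only; imagededup's own preprocessing and DCT details may differ):

# Conceptual sketch of DCT-based perceptual hashing; not the library's code.
import numpy as np
from PIL import Image
from scipy.fftpack import dct

def phash_sketch(path, hash_size=8, highfreq_factor=4):
    size = hash_size * highfreq_factor
    # Grayscale, downscale, then a 2-D DCT.
    pixels = np.asarray(Image.open(path).convert('L').resize((size, size)), dtype=np.float64)
    coeffs = dct(dct(pixels, axis=0), axis=1)
    # Keep the low-frequency corner and threshold it at its median.
    lowfreq = coeffs[:hash_size, :hash_size]
    bits = (lowfreq > np.median(lowfreq)).flatten()
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return f'{value:016x}'  # 64 bits -> 16 hex characters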

Example usage
# Perceptual hash for images
from imagededup.methods import PHash
phasher = PHash()
perceptual_hash = phasher.encode_image(image_file='path/to/image.jpg')
OR
perceptual_hash = phasher.encode_image(image_array=<numpy image array>)
OR
perceptual_hashes = phasher.encode_images(image_dir='path/to/directory')  # for a directory of images

# Finding duplicates:
from imagededup.methods import PHash
phasher = PHash()
duplicates = phasher.find_duplicates(image_dir='path/to/directory', max_distance_threshold=15, scores=True)
OR
duplicates = phasher.find_duplicates(encoding_map=encoding_map, max_distance_threshold=15, scores=True)

# Finding duplicates to return a single list of duplicates in the image collection
from imagededup.methods import PHash
phasher = PHash()
files_to_remove = phasher.find_duplicates_to_remove(image_dir='path/to/images/directory',
max_distance_threshold=15)
OR
files_to_remove = phasher.find_duplicates_to_remove(encoding_map=encoding_map, max_distance_threshold=15)

__init__

def __init__(verbose)

Initialize perceptual hashing class.

Args
  • verbose: Display a progress bar if True, disable it otherwise. Default value is True.

class AHash

Inherits from Hashing base class and implements average hashing. (Implementation reference: http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)

Offers all the functionality mentioned in the Hashing class.
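
Conceptually, average hashing compares each pixel of a downscaled grayscale image to the overall mean brightness. A minimal sketch, assuming PIL and NumPy (illustrative only, not the library's exact implementation):

# Conceptual sketch of average hashing; not the library's code.
import numpy as np
from PIL import Image

def ahash_sketch(path, hash_size=8):
    # Shrink to hash_size x hash_size grayscale; each bit records
    # whether a pixel is brighter than the mean.
    pixels = np.asarray(Image.open(path).convert('L').resize((hash_size, hash_size)))
    bits = (pixels > pixels.mean()).flatten()
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return f'{value:016x}'  # 64 bits -> 16 hex characters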

Example usage
# Average hash for images
from imagededup.methods import AHash
ahasher = AHash()
average_hash = ahasher.encode_image(image_file='path/to/image.jpg')
OR
average_hash = ahasher.encode_image(image_array=<numpy image array>)
OR
average_hashes = ahasher.encode_images(image_dir='path/to/directory')  # for a directory of images

# Finding duplicates:
from imagededup.methods import AHash
ahasher = AHash()
duplicates = ahasher.find_duplicates(image_dir='path/to/directory', max_distance_threshold=15, scores=True)
OR
duplicates = ahasher.find_duplicates(encoding_map=encoding_map, max_distance_threshold=15, scores=True)

# Finding duplicates to return a single list of duplicates in the image collection
from imagededup.methods import AHash
ahasher = AHash()
files_to_remove = ahasher.find_duplicates_to_remove(image_dir='path/to/images/directory',
max_distance_threshold=15)
OR
files_to_remove = ahasher.find_duplicates_to_remove(encoding_map=encoding_map, max_distance_threshold=15)

__init__

def __init__(verbose)

Initialize average hashing class.

Args
  • verbose: Display a progress bar if True, disable it otherwise. Default value is True.

class DHash

Inherits from Hashing base class and implements difference hashing. (Implementation reference: http://www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html)

Offers all the functionality mentioned in the Hashing class.
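
Conceptually, difference hashing records, for each pixel of a downscaled grayscale image, whether it is brighter than its right-hand neighbour (a horizontal gradient). A minimal sketch, assuming PIL and NumPy (illustrative only, not the library's exact implementation):

# Conceptual sketch of difference hashing; not the library's code.
import numpy as np
from PIL import Image

def dhash_sketch(path, hash_size=8):
    # Shrink to (hash_size + 1) x hash_size grayscale; each bit records
    # whether a pixel is brighter than the pixel to its right.
    pixels = np.asarray(Image.open(path).convert('L').resize((hash_size + 1, hash_size)))
    bits = (pixels[:, 1:] > pixels[:, :-1]).flatten()
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return f'{value:016x}'  # 64 bits -> 16 hex characters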

Example usage
# Difference hash for images
from imagededup.methods import DHash
dhasher = DHash()
difference_hash = dhasher.encode_image(image_file='path/to/image.jpg')
OR
difference_hash = dhasher.encode_image(image_array=<numpy image array>)
OR
difference_hashes = dhasher.encode_images(image_dir='path/to/directory')  # for a directory of images

# Finding duplicates:
from imagededup.methods import DHash
dhasher = DHash()
duplicates = dhasher.find_duplicates(image_dir='path/to/directory', max_distance_threshold=15, scores=True)
OR
duplicates = dhasher.find_duplicates(encoding_map=encoding_map, max_distance_threshold=15, scores=True)

# Finding duplicates to return a single list of duplicates in the image collection
from imagededup.methods import DHash
dhasher = DHash()
files_to_remove = dhasher.find_duplicates_to_remove(image_dir='path/to/images/directory',
max_distance_threshold=15)
OR
files_to_remove = dhasher.find_duplicates_to_remove(encoding_map=encoding_map, max_distance_threshold=15)

__init__

def __init__(verbose)

Initialize difference hashing class.

Args
  • verbose: Display a progress bar if True, disable it otherwise. Default value is True.

class WHash

Inherits from Hashing base class and implements wavelet hashing. (Implementation reference: https://fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5)

Offers all the functionality mentioned in the Hashing class.
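
Conceptually, wavelet hashing applies a Haar wavelet decomposition to a downscaled grayscale image and thresholds the low-frequency approximation at its median. A minimal sketch, assuming PIL, NumPy, and PyWavelets (illustrative only, not the library's exact implementation):

# Conceptual sketch of wavelet hashing; not the library's code.
import numpy as np
import pywt
from PIL import Image

def whash_sketch(path, hash_size=8, image_scale=64):
    pixels = np.asarray(Image.open(path).convert('L').resize((image_scale, image_scale)),
                        dtype=np.float64) / 255
    # Number of decomposition levels needed to reach hash_size x hash_size.
    level = int(np.log2(image_scale // hash_size))  # 64 -> 8 needs 3 levels
    coeffs = pywt.wavedec2(pixels, 'haar', level=level)
    lowfreq = coeffs[0]  # approximation coefficients, hash_size x hash_size
    bits = (lowfreq > np.median(lowfreq)).flatten()
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return f'{value:016x}'  # 64 bits -> 16 hex characters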

Example usage
# Wavelet hash for images
from imagededup.methods import WHash
whasher = WHash()
wavelet_hash = whasher.encode_image(image_file='path/to/image.jpg')
OR
wavelet_hash = whasher.encode_image(image_array=<numpy image array>)
OR
wavelet_hashes = whasher.encode_images(image_dir='path/to/directory')  # for a directory of images

# Finding duplicates:
from imagededup.methods import WHash
whasher = WHash()
duplicates = whasher.find_duplicates(image_dir='path/to/directory', max_distance_threshold=15, scores=True)
OR
duplicates = whasher.find_duplicates(encoding_map=encoding_map, max_distance_threshold=15, scores=True)

# Finding duplicates to return a single list of duplicates in the image collection
from imagededup.methods import WHash
whasher = WHash()
files_to_remove = whasher.find_duplicates_to_remove(image_dir='path/to/images/directory',
max_distance_threshold=15)
OR
files_to_remove = whasher.find_duplicates_to_remove(encoding_map=encoding_map, max_distance_threshold=15)

__init__

def __init__(verbose)

Initialize wavelet hashing class.

Args
  • verbose: Display a progress bar if True, disable it otherwise. Default value is True.