
Benchmarks

To give an idea of the speed and accuracy of the implemented algorithms, benchmarks were run on the UKBench dataset (the zip file titled 'UKBench image collection', ~1.5 GB in size) and on some variations derived from it.

Datasets

Three datasets were used:

  1. Near duplicate dataset (UKBench dataset): This dataset has near duplicates arranged in groups of 4. There are a total of 2550 such groups, amounting to 10200 RGB images. Each image is 640 x 480 and has a jpg extension.

  2. Transformed dataset derived from the UKBench dataset: One image was taken from each of 1800 different groups of the UKBench dataset, and the following 5 transformations were applied to each original image (these transformations are sketched in code after the dataset list):

    • Random crop preserving the original aspect ratio (new size: 560 x 420)
    • Horizontal flip
    • Vertical flip
    • 25-degree rotation
    • Resizing with a change in aspect ratio (new aspect ratio: 1:1)

    Thus, each group has a total of 6 images (the original plus the 5 transformed versions). A total of 1800 such groups were created, totalling 10800 images in the dataset.

  3. Exact duplicate dataset: An image from each of the 2550 image groups of the UKBench dataset was taken and an exact duplicate of it was created, for a total of 5100 images.
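
The transformations of dataset 2 can be reproduced along the following lines. This is a minimal Pillow sketch, assuming a fixed crop offset in place of the benchmark's random one and a 480 x 480 target for the 1:1 resize (the exact parameters used to build the dataset are not specified); the input filename is a placeholder.

```python
from PIL import Image, ImageOps

img = Image.open("ukbench00000.jpg")      # 640 x 480 original (placeholder name)

# Crop to 560 x 420, preserving the 4:3 aspect ratio; fixed offsets stand in
# for the random ones used when building the dataset.
left, top = 40, 30
cropped = img.crop((left, top, left + 560, top + 420))

flipped_h = ImageOps.mirror(img)          # horizontal flip
flipped_v = ImageOps.flip(img)            # vertical flip
rotated = img.rotate(25, expand=True)     # 25-degree rotation
squared = img.resize((480, 480))          # aspect ratio changed to 1:1 (size assumed)

for name, im in [("crop", cropped), ("hflip", flipped_h), ("vflip", flipped_v),
                 ("rot25", rotated), ("square", squared)]:
    im.save(f"ukbench00000_{name}.jpg")
```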

Environment

The benchmarks were performed on an AWS EC2 r5.xlarge instance with 4 vCPUs and 32 GB of memory. The instance does not have a GPU, so all runs were done on CPUs.

Metrics

The metrics used here are classification metrics as explained in the documentation.

class-0 refers to non-duplicate image pairs.

class-1 refers to duplicate image pairs.

The reported numbers are rounded to 3 decimal places.
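
For reference, the per-class numbers in the tables below can be produced with the package's evaluation helper. A minimal sketch, assuming the evaluate function from imagededup.evaluation and toy ground-truth/retrieved maps (dicts mapping each filename to a list of duplicate filenames):

```python
from imagededup.evaluation import evaluate

# Toy maps: each filename maps to the list of its (true / retrieved) duplicates.
ground_truth_map = {"a.jpg": ["b.jpg"], "b.jpg": ["a.jpg"], "c.jpg": []}
retrieved_map = {"a.jpg": ["b.jpg"], "b.jpg": ["a.jpg"], "c.jpg": []}

# 'classification' reports the class-0/class-1 precision and recall used below.
metrics = evaluate(ground_truth_map, retrieved_map, metric="classification")
print(metrics)
```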

Timings

The times are reported in seconds and comprise the time taken to generate encodings and to find duplicates. The time taken to perform the evaluation task is NOT included.
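
As an illustration, a timed run covers exactly these two steps. A sketch assuming the PHash method and a placeholder image directory:

```python
import time

from imagededup.methods import PHash

hasher = PHash()

start = time.perf_counter()
encodings = hasher.encode_images(image_dir="ukbench/")          # encoding step
duplicates = hasher.find_duplicates(encoding_map=encodings,
                                    max_distance_threshold=10)  # search step
print(f"encode + find: {time.perf_counter() - start:.3f} s")    # the reported time
```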

Threshold selection

For each method, 3 different thresholds were selected.

For hashing methods, the following max_distance_threshold values were used:

  • 0: Indicates that exactly the same hash should be generated for the image pairs to be considered duplicates.
  • 10: Default.
  • 32: Halfway between the minimum and maximum values (0 and 64).

For the cnn method, the following min_similarity_threshold values were used (a usage sketch for both parameters follows this list):

  • 1.0: Indicates that exactly the same cnn embeddings should be generated for the image pairs to be considered duplicates.
  • 0.9: Default.
  • 0.5: A threshold that allows a large deviation between image pairs.
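
A sketch of how the two parameters are passed, assuming placeholder image directories; max_distance_threshold applies to the hashing methods and min_similarity_threshold to the cnn method:

```python
from imagededup.methods import CNN, DHash

# Hashing: image pairs whose 64-bit hashes differ by at most
# max_distance_threshold bits (Hamming distance) count as duplicates.
dhasher = DHash()
hash_dups = dhasher.find_duplicates(image_dir="ukbench/",
                                    max_distance_threshold=10)  # default

# CNN: image pairs whose embeddings have a cosine similarity of at least
# min_similarity_threshold count as duplicates.
cnn = CNN()
cnn_dups = cnn.find_duplicates(image_dir="ukbench/",
                               min_similarity_threshold=0.9)    # default
```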

Results

Near Duplicate dataset

Method  Threshold  Time (s)  class-0 precision  class-1 precision  class-0 recall  class-1 recall
dhash   0          35.570    0.999              0.0                1.0             0.0
dhash   10         35.810    0.999              0.018              0.999           0.046
dhash   32         106.670   0.998              0.0                0.326           0.884
phash   0          40.073    0.999              1.0                1.0             0.0
phash   10         39.056    0.999              0.498              0.999           0.016
phash   32         98.835    0.998              0.0                0.343           0.856
ahash   0          36.171    0.999              0.282              0.999           0.002
ahash   10         36.560    0.999              0.012              0.996           0.193
ahash   32         97.170    0.999              0.000              0.448           0.932
whash   0          51.710    0.999              0.112              0.999           0.002
whash   10         51.940    0.999              0.008              0.993           0.199
whash   32         112.560   0.999              0.0                0.416           0.933
cnn     0.5        379.680   0.999              0.0                0.856           0.999
cnn     0.9        377.157   0.999              0.995              0.999           0.127
cnn     1.0        379.570   0.999              0.0                1.0             0.0

Observations

  • The cnn method with a threshold between 0.5 and 0.9 would work best for finding near duplicates. This is indicated by the extreme values that class-1 precision and recall take at the two thresholds: 0.9 gives high precision but low recall, while 0.5 gives high recall but near-zero precision. An intermediate threshold can be found by sweeping, as sketched after this list.
  • Hashing methods do not perform well for finding near duplicates.
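
A sketch of such a sweep, assuming encodings are computed once and reused, and a ground-truth map built from the UKBench groups (left as a placeholder here):

```python
from imagededup.evaluation import evaluate
from imagededup.methods import CNN

cnn = CNN()
encodings = cnn.encode_images(image_dir="ukbench/")  # placeholder path

ground_truth_map = {}  # fill with filename -> true duplicates from the UKBench groups

# Intermediate thresholds trade class-1 precision against class-1 recall.
for t in (0.6, 0.7, 0.8):
    retrieved = cnn.find_duplicates(encoding_map=encodings,
                                    min_similarity_threshold=t)
    print(t, evaluate(ground_truth_map, retrieved, metric="classification"))
```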

Transformed dataset

Method  Threshold  Time (s)  class-0 precision  class-1 precision  class-0 recall  class-1 recall
dhash   0          25.360    0.999              1.0                1.0             0.040
dhash   10         25.309    0.999              0.138              0.999           0.117
dhash   32         108.960   0.990              0.0                0.336           0.872
phash   0          28.069    0.999              1.0                1.0             0.050
phash   10         28.075    0.999              0.341              0.999           0.079
phash   32         107.079   0.990              0.003              0.328           0.847
ahash   0          25.270    0.999              0.961              0.999           0.058
ahash   10         25.389    0.999              0.035              0.997           0.216
ahash   32         93.084    0.990              0.0                0.441           0.849
whash   0          40.390    0.999              0.917              0.999           0.061
whash   10         41.260    0.999              0.023              0.996           0.203
whash   32         109.630   0.990              0.0                0.410           0.853
cnn     0.5        397.380   0.999              0.003              0.852           0.999
cnn     0.9        392.090   0.999              0.999              0.990           0.384
cnn     1.0        396.250   0.990              0.0                1.0             0.0

Observations

  • The cnn method with threshold 0.9 seems to work best for finding transformed duplicates. A slightly lower min_similarity_threshold value could lead to a higher class-1 recall.
  • Hashing methods do not perform well for finding transformed duplicates. Resized images are found easily, but all the other transformations lead to poor performance for the hashing methods.

Exact duplicates dataset

Method  Threshold  Time (s)  class-0 precision  class-1 precision  class-0 recall  class-1 recall
dhash   0          18.380    1.0                1.0                1.0             1.0
dhash   10         18.410    1.0                0.223              0.999           1.0
dhash   32         34.602    1.0                0.0                0.327           1.0
phash   0          19.780    1.0                1.0                1.0             1.0
phash   10         20.012    1.0                0.980              0.999           1.0
phash   32         34.054    1.0                0.0                0.344           1.0
ahash   0          18.180    1.0                0.998              0.999           1.0
ahash   10         18.228    1.0                0.044              0.995           1.0
ahash   32         31.961    1.0                0.0                0.448           1.0
whash   0          26.097    1.0                0.980              0.999           1.0
whash   10         26.056    1.0                0.029              0.993           1.0
whash   32         39.408    1.0                0.0                0.417           1.0
cnn     0.5        192.050   1.0                0.001              0.860           1.0
cnn     0.9        191.024   1.0                1.0                1.0             1.0
cnn     1.0        194.270   0.999              1.0                1.0             0.580*

* The value is lower than the expected 1.0 because the cosine_similarity function from scikit-learn (used within the package) sometimes computes the similarity to be slightly less than 1.0 even when the vectors are identical. The snippet below reproduces the effect.
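
A minimal reproduction of the effect; the exact printed value depends on the platform and the vector:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Compare a float32 vector with itself: floating-point rounding can push
# the cosine similarity marginally below 1.0, so min_similarity_threshold=1.0
# can miss pairs that are in fact exact duplicates.
v = np.random.rand(1, 1024).astype("float32")
sim = cosine_similarity(v, v)[0, 0]
print(sim)          # e.g. 0.99999994 rather than exactly 1.0
print(sim >= 1.0)   # may be False
```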

Observations

  • Difference hashing with max_distance_threshold 0 is the fastest.
  • When using hashing methods for exact duplicates, keep max_distance_threshold at a low value. A value of 0 is good, but a slightly higher value should also work fine.
  • When using the cnn method, keep min_similarity_threshold at a high value. The default value of 0.9 seems to work well. A slightly higher value can also be used.

Summary

  • Near duplicate dataset: use cnn with an appropriate min_similarity_threshold.
  • Transformed dataset: use cnn with min_similarity_threshold of around 0.9 (default).
  • Exact duplicates dataset: use Difference hashing with 0 max_distance_threshold.
  • For hashing methods, a higher max_distance_threshold leads to a higher execution time. The cnn method's runtime does not seem much affected by min_similarity_threshold (though a lower value adds a few seconds to the execution time, as can be seen in all the runs above).
  • Generally speaking, the cnn method takes longer to run than the hashing methods on all datasets. If a GPU is available, the cnn method should be much faster.