Python Duplicate Image Cleaner

Find and delete duplicate images from a directory.

TL;DR

You can download the script below (or copy and paste it into a Python file) and run it against a directory. Note that it has only been tested on Windows. There might be false positives, so be sure to create a backup first. (The script also sends removed files to the recycle bin, if available.)

Background

My external hard drive, which held a backup of most of my photos, failed, and I had to run some recovery software to get the photos back. The recovery worked, but unfortunately shadow copies of the photos were recovered as well. With over a thousand photos, I didn’t want to check manually which files were duplicated.

First, we can use the imagehash library to measure how similar two images are. The copies are not exactly the same, so there will be some differences, but imagehash still works even if the images are resized, compressed, saved in different file formats, or have adjusted contrast and colors.

The hash (or fingerprint, really) is derived from an 8×8 monochrome thumbnail of the image. But even with such a reduced sample, the similarity comparisons give quite accurate results. Adjust the cutoff to find a balance between false positives and false negatives that is acceptable.

from PIL import Image
import imagehash
hash0 = imagehash.average_hash(Image.open('quora_photo.jpg')) 
hash1 = imagehash.average_hash(Image.open('twitter_photo.jpeg')) 
cutoff = 5

if hash0 - hash1 < cutoff:
  print('images are similar')
else:
  print('images are not similar')
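To make the fingerprint idea more concrete, here is a minimal pure-Python sketch of an average-style hash and the bit-difference count that the subtraction above stands for. It is a simplified illustration, not imagehash's actual implementation: the helper names `average_hash_bits` and `hamming_distance` and the sample pixel grids are hypothetical, and a real thumbnail would come from resizing and converting the image first.

```python
# Simplified illustration of an average hash; not imagehash's actual code.

def average_hash_bits(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(bits_a, bits_b):
    """Count the positions where two equal-length fingerprints differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 8x8 grayscale thumbnail: dark left half, bright right half.
original = [[20] * 4 + [200] * 4 for _ in range(8)]

# The same picture, uniformly brightened (as a brightness tweak might do).
brightened = [[min(255, value + 30) for value in row] for row in original]

# A different picture: dark top half, bright bottom half.
different = [[20] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]

distance_same = hamming_distance(
    average_hash_bits(original), average_hash_bits(brightened))
distance_diff = hamming_distance(
    average_hash_bits(original), average_hash_bits(different))

print(distance_same, distance_diff)
```

Because each bit only records whether a pixel is above or below the thumbnail's own mean, a uniform brightness change leaves the fingerprint untouched, while a different layout flips many bits. The cutoff decides how many flipped bits still count as "the same picture".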

The problem with that code is that it only compares two images. To compare many images, I can build a list of (path, hash) tuples, one for each image.

imgs = []
index = 0
for img in imgList:
  index = index + 1
  drawProgressBar(index / totalImg)

  # tuple of (path, hash)
  try:
    imgs.append((img, getImageHash(img)))
  except Exception:
    # skip files that cannot be opened or hashed
    print("\nWARNING: Cannot open file", img)
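The loop above assumes that imgList already holds the paths to check. One possible way to build that list from a directory is sketched below using only the standard library. The function name `list_image_paths` and the set of extensions are assumptions made for illustration, not part of the original script.

```python
import os

# Hypothetical set of extensions; adjust to the formats you actually have.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def list_image_paths(root):
    """Walk root recursively and return sorted paths with image extensions."""
    paths = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            # Compare extensions case-insensitively (".JPG" is common).
            extension = os.path.splitext(name)[1].lower()
            if extension in IMAGE_EXTENSIONS:
                paths.append(os.path.join(folder, name))
    return sorted(paths)
```

With this in place, something like imgList = list_image_paths(some_directory) and totalImg = len(imgList) would feed the hashing loop. Files that merely carry an image extension but cannot be opened are still caught by the try/except there.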

Now we can traverse the list of image hashes and compare them to find any similarities. I realize the running time may not be great, since every image is compared against every image kept so far, but I couldn’t think of any optimizations for now. It runs well enough even on a 6th-generation Intel i5 computer anyway. Feel free to comment on GitHub.

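from send2trash import send2trash

seen = []
for img in imgs:
  isDuplicate = False

  for seenImg in seen:
    # compare the hashes stored in the (path, hash) tuples
    if img[1] - seenImg[1] < cutoff:
      isDuplicate = True
      # send the newly found duplicate to the recycle bin
      send2trash(img[0])
      # one match is enough; avoid trashing the same file twice
      break

  if not isDuplicate:
    seen.append(img)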
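The same loop can be tried out without touching any files. The sketch below is a dry-run variant under a few assumptions: hashes are plain integers rather than imagehash objects, the distance is the number of differing bits, and the function name `find_duplicates` and the sample fingerprints are hypothetical. It only reports which paths would be removed.

```python
def find_duplicates(items, cutoff=5):
    """Split (path, hash) tuples into kept items and duplicate paths.

    Hashes are plain integers here; the distance between two hashes
    is the number of bits in which they differ.
    """
    kept = []
    duplicates = []
    for path, fingerprint in items:
        for _kept_path, kept_fingerprint in kept:
            if bin(fingerprint ^ kept_fingerprint).count("1") < cutoff:
                duplicates.append(path)
                break
        else:
            # No earlier image was close enough, so this one is new.
            kept.append((path, fingerprint))
    return kept, duplicates


# Hypothetical fingerprints: b differs from a by one bit, c by sixteen bits.
items = [("a.jpg", 0x0F0F), ("b.jpg", 0x0F0E), ("c.jpg", 0xF0F0)]
kept, duplicates = find_duplicates(items)
print(duplicates)
```

Reviewing the list of duplicates before sending anything to the recycle bin is a cheap way to tune the cutoff and spot false positives.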

When I did my first run, I found that duplicate images were not always the same resolution, and the script did not take that into account when choosing which copy to remove. Now, before removing either image, I compare the resolutions and keep the larger one.

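      img1Res = getImageResolution(seenImg[0])
      img2Res = getImageResolution(img[0])
      if img1Res[0] < img2Res[0] and img1Res[1] < img2Res[1]:
        # seenImg is smaller in both dimensions: trash it and keep img
        send2trash(seenImg[0])
        seen.remove(seenImg)
        seen.append(img)
      else:
        # otherwise keep seenImg and trash the new image
        send2trash(img[0])
      # seen may have changed, so stop scanning it for this image
      break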
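The snippet above compares width and height separately. An alternative, shown here purely as an assumption and not as the script's rule, is to compare total pixel counts, which also settles cases where one image is wider but shorter than the other. The function name `pick_file_to_trash` is hypothetical, and resolutions are taken to be (width, height) tuples.

```python
def pick_file_to_trash(seen_path, seen_res, new_path, new_res):
    """Return the path of the lower-resolution image.

    Resolutions are (width, height) tuples. On a tie, the earlier
    (seen) image is kept and the new one is returned for removal.
    """
    seen_pixels = seen_res[0] * seen_res[1]
    new_pixels = new_res[0] * new_res[1]
    return seen_path if seen_pixels < new_pixels else new_path


# Hypothetical file names and sizes, for illustration only.
loser = pick_file_to_trash("old.jpg", (640, 480), "new.jpg", (1920, 1080))
print(loser)
```

Keeping the decision in a small function that returns a path, rather than calling send2trash directly, makes the rule easy to check before any file is moved.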
