Pictures downloaded from the Internet by crawlers often contain duplicates. Checking them by eye is laborious, and duplicates are easy to miss. This article computes and compares the MD5 value of each picture to decide whether it is a duplicate, so that the cleaned picture set can be used later.
MD5 (MD5 Message-Digest Algorithm) is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value. It is commonly used to check that information has been transmitted completely and consistently.
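As a quick illustration of the digest size and of why equal digests indicate equal content, the sketch below hashes a few byte strings with the standard-library `hashlib` module. The sample byte strings are made up for illustration; in practice the input would be the bytes of an image file.

```python
import hashlib

# Sample byte strings, made up for illustration; real input would be file contents.
data_a = b"same picture bytes"
data_b = b"same picture bytes"
data_c = b"different picture bytes"

digest_a = hashlib.md5(data_a).hexdigest()
digest_b = hashlib.md5(data_b).hexdigest()
digest_c = hashlib.md5(data_c).hexdigest()

# An MD5 digest is 16 bytes, which is 32 hexadecimal characters.
print(len(hashlib.md5(data_a).digest()), len(digest_a))  # 16 32

# Identical input always gives the same digest; these two different inputs do not.
print(digest_a == digest_b, digest_a == digest_c)  # True False
```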
The Python code is as follows:
import os
import shutil
import hashlib
# Compute the MD5 value of an image file
def compute_md5(image_path):
    # Read the file in binary mode; the with block closes it even if reading fails
    with open(image_path, 'rb') as img:
        md5 = hashlib.md5(img.read())
    return md5.hexdigest()
# MD5 values seen so far (a set gives fast membership checks)
seen_md5 = set()
# Directory where duplicate pictures are moved
result_dir = "results"
os.makedirs(result_dir, exist_ok=True)
# Directory of pictures to check for duplicates
image_dir = "images"
image_list = os.listdir(image_dir)
for image_name in image_list:
    image_path = os.path.join(image_dir, image_name)
    # Skip subdirectories and anything else that is not a regular file
    if not os.path.isfile(image_path):
        continue
    md5 = compute_md5(image_path)
    # If the MD5 value has been seen before, move the picture to result_dir
    if md5 not in seen_md5:
        seen_md5.add(md5)
    else:
        print(image_name)
        save_path = os.path.join(result_dir, image_name)
        shutil.move(image_path, save_path)
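The compute_md5 function above reads each file into memory in one go, which is fine for ordinary pictures. For very large files, a possible variant is to feed the hash in chunks. The following is a minimal sketch; the name `compute_md5_chunked` and the default `chunk_size` of 1 MiB are illustrative choices, not part of the original code.

```python
import hashlib

def compute_md5_chunked(image_path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so it is never fully loaded into memory.

    The function name and the 1 MiB default chunk size are illustrative assumptions.
    """
    md5 = hashlib.md5()
    with open(image_path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

It returns the same hexadecimal string as hashing the whole file at once, so it could be swapped in for compute_md5 without changing the rest of the script.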
The code above only detects pictures that are exactly identical, byte for byte. It does not detect pictures that are merely similar. Similarity calculation or feature point matching can be added later to find similar pictures as well.
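One simple form of similarity calculation is an average hash compared by Hamming distance. This technique is not used in the code above; the sketch below is only an illustration of the idea. It assumes each picture has already been converted to a small grayscale grid (for example 8x8) by an image library of your choice. The function names and the tiny 2x2 sample grids are hypothetical.

```python
def average_hash(pixels):
    """Return a tuple of bits: 1 where a pixel is brighter than the mean, else 0.

    `pixels` is a small grayscale grid (a list of rows of 0-255 integers),
    assumed to be already downscaled by an image library.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if value > mean else 0 for value in flat)

def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))

# Hypothetical 2x2 "images": b is a slightly brightened copy of a, c is roughly inverted.
a = [[10, 200], [220, 30]]
b = [[12, 205], [225, 28]]
c = [[245, 55], [35, 225]]

print(hamming_distance(average_hash(a), average_hash(b)))  # 0
print(hamming_distance(average_hash(a), average_hash(c)))  # 4
```

A small distance suggests similar pictures, while MD5 would treat a and b as entirely different because their bytes differ. The distance threshold that counts as "similar" would need to be tuned on real data.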