
Fast Near-Duplicate Image Search and Delete

  • Author: Umberto Griffo
  • Twitter: @UmbertoGriffo

This Python script is a command-line tool for visualizing, checking and deleting near-duplicate images from a target directory. To find similar images, the script hashes them using pHash from the ImageHash library, adds the hashes to a KDTree, and performs a nearest-neighbours search. In addition, near-duplicate images can be visualized by generating a t-SNE (t-distributed Stochastic Neighbor Embedding) plot from a feature vector derived from each image's pHash.

I take no responsibility for bugs in this script or accidentally deleted pictures. Use it at your own risk, and make sure you back up your pictures before using it. This algorithm is intended to find near-duplicate images; it is NOT intended to find images that are conceptually similar.


pHash definition

Features in the image are used to generate a distinct (but not unique) fingerprint, and these fingerprints are comparable. Perceptual hashes are a different concept from cryptographic hash functions like MD5 and SHA-1.

[figure: pHash example]

With cryptographic hashes, the hash values are random. The data used to generate the hash acts like a random seed, so the same data will generate the same result, but different data will create different results. Comparing two SHA-1 hash values really only tells you two things. If the hashes are different, then the data is different. And if the hashes are the same, then the data is likely the same. (Since there is a possibility of a hash collision, having the same hash values does not guarantee the same data.) In contrast, perceptual hashes can be compared, giving you a sense of similarity between the two data sets. With pHash, images can be scaled larger or smaller, have different aspect ratios, or even have minor coloring differences (contrast, brightness, etc.), and they will still be matched as similar.
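The comparison property can be illustrated with a toy perceptual hash. The sketch below implements an average hash, a simpler relative of pHash (real pHash additionally applies a DCT to capture low-frequency structure), on an 8x8 grayscale grid. A small brightness shift barely moves the fingerprint, while a structural change moves it a lot:

```python
# Simplified perceptual hash (average hash) on an 8x8 grayscale grid.
# Illustrative only: pHash additionally applies a DCT before thresholding.

def average_hash(pixels):
    """pixels: 8x8 list of grayscale values -> 64-bit fingerprint as an int."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        # One bit per pixel: brighter than the mean or not.
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits: a small distance means similar images."""
    return bin(h1 ^ h2).count("1")

# A gradient "image", a slightly brighter copy, and its negative.
img = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
brighter = [[min(255, p + 10) for p in row] for row in img]
inverted = [[255 - p for p in row] for row in img]

print(hamming_distance(average_hash(img), average_hash(brighter)))  # small
print(hamming_distance(average_hash(img), average_hash(inverted)))  # large
```

Because the hash thresholds against the image's own mean, a uniform brightness change leaves the bit pattern almost untouched, which is exactly the robustness the paragraph above describes.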

KDTree definition

A KDTree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. In particular, a KDTree helps organize and partition the data points based on specific conditions. KDTrees are useful in several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest-neighbour searches).

Complexity (Average)

Space   Search     Insert     Delete
O(n)    O(log n)   O(log n)   O(log n)

where n is the number of points.
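The logarithmic search cost is what makes the tool fast. A minimal sketch using SciPy's cKDTree (one of the two --tree-type options), assuming each 64-bit hash has been unpacked into a binary feature vector:

```python
# Index hash vectors in a cKDTree and run a nearest-neighbour query.
# Assumes SciPy is available; the random vectors stand in for real hashes.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
hashes = rng.integers(0, 2, size=(100, 64))  # 100 fake 64-bit hash vectors

tree = cKDTree(hashes, leafsize=40)          # cf. the --leaf-size argument
# k=5 nearest neighbours of the first hash; p=1 is the Manhattan metric
# (cf. --nearest-neighbors and --distance-metric manhattan above).
dist, idx = tree.query(hashes[0], k=5, p=1)

print(idx)  # indices of the 5 nearest hashes; the query point itself is first
```

A query against n indexed hashes costs O(log n) on average, instead of the O(n) a brute-force pairwise comparison would need.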

Search

[figure: search example]

Deletion

[figure: deletion example]

Installation

Check INSTALL.md for installation instructions.

How to use the Makefile

Prerequisites

Install Python 3 and virtualenv (see Option 2 in INSTALL.md).

  • All-in-one: make all
    • Setup, test and package.
  • Setup: make setup-env
    • Installs all dependencies.
  • Export dependencies of the environment: make export_env
Exports a requirements.txt file containing the detailed dependencies of the environment.
  • Test: make test
    • Runs all tests.
    • Using pytest
  • Clean: make clean
    • Removes the environment.
    • Removes all cached files.
  • Check: make check
Use it to check that which pip3 and which python3 point to the right paths.
  • Lint: make lint
    • Checks PEP8 conformance and code smells using pylint.
  • Package: make package
    • Creates a bundle of software to be installed.

Note: run Setup as your initial command (or after Clean).

Usage

Arguments

  <command>             delete or show or search.

  --images-path /path/to/images/
                        The Directory containing images.
  --output-path /path/to/output/
                        The Directory containing results.
  -q /path/to/image/, --query /path/to/image/
                        Path to the query image
  --tree-type {KDTree,cKDTree}
  --leaf-size LEAF_SIZE
                        Leaf size of the tree.
  --hash-algorithm {average_hash,dhash,phash,whash}
                        Hash algorithm to use.
  --hash-size HASH_SIZE
                        Hash size to use.
  -d {euclidean,l2,minkowski,p,manhattan,cityblock,l1,chebyshev,infinity}, --distance-metric {euclidean,l2,minkowski,p,manhattan,cityblock,l1,chebyshev,infinity}
                        Distance metric to use
  --nearest-neighbors NEAREST_NEIGHBORS
                        # of nearest neighbors.
  --threshold THRESHOLD
                        Threshold.
  --parallel [parallel]
                        Whether to parallelize the computation.
  --batch-size BATCH_SIZE
                        The batch size is used when parallel is set to true.
  --backup-keep [BACKUP_KEEP]
                        Whether to save the image to keep into a folder.
  --backup-duplicate [BACKUP_DUPLICATE]
                        Whether to save the duplicates into a folder.
  --safe-deletion [SAFE_DELETION]
                        Whether to perform the deletion as a dry run,
                        without actually deleting anything.
  --image-w IMAGE_W     Width to which the source image is resized (down
                        or up).
  --image-h IMAGE_H     Height to which the source image is resized (down
                        or up).

Delete near-duplicate images from the target directory

$ deduplication delete --images-path <target_dir> --output-path <output_dir> --tree-type KDTree

For example:

$ deduplication delete \
--images-path datasets/potatoes_multi_folder  \
--output-path outputs \
--tree-type KDTree \
--threshold 40 \
--parallel y \
--nearest-neighbors 5 \
--hash-algorithm phash \
--hash-size 8 \
--distance-metric manhattan \
--backup-keep y \
--backup-duplicate y \
--safe-deletion y
Building the dataset...
	Parallel mode has been enabled...
	CPU: 16
	delegate work...
100%|██████████| 1/1 [00:00<00:00, 2231.01it/s]
	get the results...
100%|██████████| 1/1 [00:00<00:00, 773.57it/s]
Building the KDTree...
Finding duplicates and/or near duplicates...
	 Max distance: 33.0
	 Min distance: 0.0
	 number of files to remove: 28
	 number of files to keep: 4
28 duplicates or near duplicates has been founded in 0.0027272701263427734 seconds
We have found 28/32 duplicates in folder
Backuping images...
100%|██████████| 28/28 [00:00<00:00, 4087.45it/s]
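The delete pipeline above can be sketched as a greedy pass over the KDTree: keep the first image encountered, and mark everything within --threshold of a kept image as a near-duplicate. This is an illustrative simplification, not the tool's actual implementation; find_duplicates is a hypothetical helper.

```python
# Greedy near-duplicate grouping over a cKDTree (illustrative sketch).
import numpy as np
from scipy.spatial import cKDTree

def find_duplicates(vectors, threshold):
    """Return (keep, remove) index lists over the hash vectors."""
    tree = cKDTree(vectors)
    keep, remove, seen = [], [], set()
    for i, v in enumerate(vectors):
        if i in seen:
            continue
        keep.append(i)
        seen.add(i)
        # Everything within `threshold` (Manhattan distance, p=1) of a
        # kept vector is treated as a near-duplicate of it.
        for j in tree.query_ball_point(v, r=threshold, p=1):
            if j not in seen:
                remove.append(j)
                seen.add(j)
    return keep, remove

# Toy 2-D "hashes": the first two points are near-duplicates of each other.
vectors = np.array([[0, 0], [1, 0], [10, 10], [50, 50]])
keep, remove = find_duplicates(vectors, threshold=5)
print(keep, remove)
```

With --backup-keep and --backup-duplicate, the real tool copies the two groups into separate folders before anything is deleted, and --safe-deletion skips the deletion step entirely.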

Find near-duplicated images from an image you specified

$ deduplication search \
 --images-path <target_dir> \
 --output-path <output_dir> \
 --query <specify a query image file>

For example:

$ deduplication search \
--images-path datasets/potatoes \
--output-path outputs \
--tree-type KDTree \
--threshold 40 \
--parallel f \
--nearest-neighbors 5 \
--hash-algorithm phash \
--hash-size 8 \
--distance-metric manhattan \
--query datasets/potatoes/2018-12-11-15-031193.png

[figure: search results]

Show near-duplicate images from the target directory with t-SNE

$ deduplication show --images-path <target_dir> --output-path <output_dir>

For example:

$ deduplication show \
--images-path datasets/potatoes \
--output-path outputs \
--parallel y \
--image-w 32 \
--image-h 32

[figure: t-SNE visualization]
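The visualization step can be sketched with scikit-learn's t-SNE (an assumption; the project may use a different t-SNE implementation). Each image's hash bits become a feature vector, and t-SNE projects the vectors to 2-D so that near-duplicates land close together:

```python
# Project hash-derived feature vectors to 2-D with t-SNE (sketch).
# Assumes scikit-learn is available; the two synthetic clusters stand in
# for groups of near-duplicate images.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

def noisy_copies(base, n, flip_prob=0.05):
    """n near-duplicates of `base`, each bit flipped with prob. flip_prob."""
    return np.array([np.where(rng.random(base.size) < flip_prob,
                              1 - base, base) for _ in range(n)])

cluster_a = noisy_copies(rng.integers(0, 2, size=64), 10)
cluster_b = noisy_copies(rng.integers(0, 2, size=64), 10)
features = np.vstack([cluster_a, cluster_b]).astype(float)

embedding = TSNE(n_components=2, perplexity=5,
                 random_state=0).fit_transform(features)
print(embedding.shape)  # one 2-D point per image, ready to scatter-plot
```

In the real tool, --image-w and --image-h control the thumbnail size drawn at each embedded point in the output plot.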


