
ImageHash

An image hashing library written in Python. ImageHash supports:

  • Average hashing
  • Perceptual hashing
  • Difference hashing
  • Wavelet hashing
  • HSV color hashing (colorhash)
  • Crop-resistant hashing


Rationale

Image hashes tell whether two images look nearly identical. This is different from cryptographic hashing algorithms (like MD5 or SHA-1), where tiny changes in the input produce completely different hashes. In image fingerprinting, we actually want similar inputs to produce similar output hashes.

The image hash algorithms (average, perceptual, difference, wavelet) analyse the image structure on luminance (without color information). The color hash algorithm analyses the color distribution and black & gray fractions (without position information).

Installation

ImageHash is based on PIL/Pillow, numpy, and scipy.fftpack (the latter only for pHash). Easy installation through PyPI:

pip install imagehash

Basic usage

>>> from PIL import Image
>>> import imagehash
>>> hash = imagehash.average_hash(Image.open('tests/data/imagehash.png'))
>>> print(hash)
ffd7918181c9ffff
>>> otherhash = imagehash.average_hash(Image.open('tests/data/peppers.png'))
>>> print(otherhash)
9f172786e71f1e00
>>> print(hash == otherhash)
False
>>> print(hash - otherhash)  # hamming distance
33

Each algorithm can also have its hash size adjusted (or in the case of colorhash, its binbits). Increasing the hash size allows an algorithm to store more detail in its hash, increasing its sensitivity to changes in detail.
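To make the size relationship concrete, here is a small sketch (pure Python, not calling the library): a hash of size n stores n × n bits, so its hex string has n²/4 characters.

```python
def hex_digits_for_hash_size(hash_size: int) -> int:
    """Number of hex digits in the string form of an n x n binary hash."""
    bits = hash_size * hash_size  # one bit per cell of the n x n array
    return bits // 4              # four bits per hex digit

# The default hash_size=8 yields 64 bits, i.e. the 16-digit strings shown above.
```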

The demo script find_similar_images illustrates how to find similar images in a directory.

Source hosted at GitHub: https://github.com/JohannesBuchner/imagehash

Examples

To help evaluate how different hashing algorithms behave, below are a few hashes applied to two datasets. This gives a sense of which images an algorithm considers basically identical.

Example 1: Icon dataset

Source: 7441 free icons on GitHub (see examples/github-urls.txt).

The following pages show groups of images with the same hash (the hashing method sees them as the same).

The hashes use hash_size=8; colorhash uses binbits=3. You may want to adjust the hash_size or allow some Hamming distance (hash1 - hash2 < threshold).
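A minimal sketch of such threshold-based grouping, operating on the hex strings produced by str(hash); the thresholds here are illustrative choices, not library recommendations:

```python
def hamming(hex1: str, hex2: str) -> int:
    """Hamming distance between two equal-length hex hash strings."""
    return bin(int(hex1, 16) ^ int(hex2, 16)).count("1")

def group_near_duplicates(hex_hashes, threshold=5):
    """Greedy grouping: each hash joins the first group within the threshold."""
    groups = []
    for h in hex_hashes:
        for g in groups:
            if hamming(h, g[0]) <= threshold:
                g.append(h)
                break
        else:
            groups.append([h])
    return groups
```

With threshold=0 this reproduces the exact-match grouping used for the example pages above.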

Example 2: Art dataset

Source: 109259 art pieces from https://www.parismuseescollections.paris.fr/en/recherche/image-libre/.

The following pages show groups of images with the same hash (the hashing method sees them as the same).

For understanding hash distances, check out these excellent blog posts:

  • https://tech.okcupid.com/evaluating-perceptual-image-hashes-at-okcupid-e98a3e74aa3a
  • https://content-blockchain.org/research/testing-different-image-hash-functions/

Storing hashes

As illustrated above, hashes can be turned into strings. The strings can be turned back into an ImageHash object as follows.

For single perceptual hashes:

>>> original_hash = imagehash.phash(Image.open('tests/data/imagehash.png'))
>>> hash_as_str = str(original_hash)
>>> print(hash_as_str)
ffd7918181c9ffff
>>> restored_hash = imagehash.hex_to_hash(hash_as_str)
>>> print(restored_hash)
ffd7918181c9ffff
>>> assert restored_hash == original_hash
>>> assert str(restored_hash) == hash_as_str

For crop_resistant_hash:

>>> original_hash = imagehash.crop_resistant_hash(Image.open('tests/data/imagehash.png'), min_segment_size=500, segmentation_image_size=1000)
>>> hash_as_str = str(original_hash)
>>> restored_hash = imagehash.hex_to_multihash(hash_as_str)
>>> assert restored_hash == original_hash
>>> assert str(restored_hash) == hash_as_str

For colorhash:

>>> original_hash = imagehash.colorhash(Image.open('tests/data/imagehash.png'), binbits=3)
>>> hash_as_str = str(original_hash)
>>> restored_hash = imagehash.hex_to_flathash(hash_as_str, hashsize=3)

Efficient database search

For storing the hashes in a database and using fast Hamming distance searches, see the pointers at #127 (a blog post on how to do this would be a great contribution!).

@KDJDEV points to https://github.com/KDJDEV/imagehash-reverse-image-search-tutorial and writes: In this tutorial I use PostgreSQL and this extension, and show how you can create a reverse image search using hashes generated by this library.
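Before reaching for an indexed solution, a plain linear scan over the hex strings produced by str(hash) goes a long way (roughly up to the 10^5 range); the max_distance default below is illustrative:

```python
def hamming(hex1: str, hex2: str) -> int:
    """Hamming distance between two equal-length hex hash strings."""
    return bin(int(hex1, 16) ^ int(hex2, 16)).count("1")

def search(needle_hex, stored_hexes, max_distance=10):
    """Return (hash, distance) pairs within max_distance, nearest first."""
    matches = ((h, hamming(needle_hex, h)) for h in stored_hexes)
    return sorted((m for m in matches if m[1] <= max_distance), key=lambda m: m[1])
```

For millions of hashes, the indexed approaches linked above (e.g. a PostgreSQL extension or a BK-tree) avoid comparing against every stored hash.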

Changelog

  • 4.3: typing annotations by @Avasam @SpangleLabs and @nh2
  • 4.2: Cropping-Resistant image hashing added by @SpangleLabs
  • 4.1: Add examples and colorhash
  • 4.0: Changed binary to hex implementation, because the previous one was broken for various hash sizes. This change breaks compatibility to previously stored hashes; to convert them from the old encoding, use the "old_hex_to_hash" function.
  • 3.5: Image data handling speed-up
  • 3.2: whash now also handles smaller-than-hash images
  • 3.0: dhash had a bug: It computed pixel differences vertically, not horizontally.
    I modified it to follow dHashref. The old function is available as dhash_vertical.
  • 2.0: Added whash
  • 1.0: Initial ahash, dhash, phash implementations.

Contributing

Pull requests and new features are warmly welcome.

If you encounter a bug or have a question, please open a GitHub issue. You can also try Stack Overflow.

imagehash's Issues

Any reason the ImageHash.hash's shape is (1, N^2) rather than (N^2,)?

When I look at this, all of the computations of ImageHash require flattening the hash. When I was profiling (for hash_size=32), each flatten adds 0.5 µs of overhead that could be avoided.
It might be small, but I have code where I need to subtract one hash from 10,000 stored hashes for every frame I'm processing, and it adds more than 10 ms per frame. This forces me to copy-paste the arithmetic of ImageHash into my own code (instead of calling some_hash - other_hash).

I don't see any reason not to store the flattened version in the first place. In other words:

def __init__(self, binary_array):
	self.hash = binary_array.flatten()

Image similarity for million images

Hi all,
using the code below, I tried with several images to find the similarity, but the differences between the hash values don't match what I expect:

hash = imagehash.phash(Image.open('1.jpg'))
otherhash = imagehash.phash(Image.open('2.jpg'))
print(hash == otherhash)
print(hash - otherhash)

Please suggest how I can get the similarity.

Please also suggest how we can search a million images in a database based on the hash value.

Thanks
vijay

Having ImageHash instances be iterable would be awesome!

A nice tweak would be to make it possible to iterate over the values in ImageHash instances.

This makes things like converting to non-hex string representations much easier.

For example, I currently want to convert the hash value into a string of "1" or "0" values, for use in a database (I can use the DB's fuzzy text matching with such a string).

Currently, I have to:

dHash = "".join(["1" if val else "0" for val in np.nditer(dHash.hash, order='C') ])

which means I need numpy imported in my own code, and it's just ugly.

I think this could be as simple as just adding:

    def __iter__(self):
        return np.nditer(self.hash, order='C')

to the ImageHash class (I'm not totally sure you can return an iterator object for the iterator class method), but even if not, it should only be a few lines.

At that point, my own code is much cleaner:

dHash = "".join(["1" if val else "0" for val in dHash ])

whash-db4 shape problem: "TypeError: 'ImageHashes must be of the same shape.', (13L, 13L), (14L, 14L)"

I'm trying to find the difference between the hash value of the following two images. When I use any hash method other than whash-db4 it works fine:

import imagehash
from PIL import Image

hashNum1_w = imagehash.whash(Image.open('test1.jpg'))
hashNum2_w = imagehash.whash(Image.open('test2.jpg'))
print(hashNum1_w - hashNum2_w)  # prints 10

But for whash-db4 it produces the error in the title:

hashNum1_wdb4 = imagehash.whash(Image.open('test1.jpg'), mode='db4')
hashNum2_wdb4 = imagehash.whash(Image.open('test2.jpg'), mode='db4')
print(hashNum1_wdb4 - hashNum2_wdb4)

Note that

print(hashNum1_w)
print(hashNum2_w)
print(hashNum1_wdb4)
print(hashNum2_wdb4)

Gives

80cfe3c1e7cfa501
00cfe9c9efcf6c00
0a0b500fcfd602801fc7fe0ff07fcffcf8025406b03
f807e01f807e01f80c000fe3ff83fe0ff8ffe7c000f80ff03

[test1 and test2 images attached in the original issue]

_binary_array_to_hex gives wrong value

Correct me if I am wrong, but _binary_array_to_hex gives wrong values.

Python 2.7.13 (default, Feb 16 2017, 19:11:00) 
[GCC 5.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from PIL import Image
>>> import imagehash
>>> img = Image.open('sample_image.png')
>>> ph = imagehash.phash(img)
>>> bool_array = ph.hash.flatten()
>>> bool_array
array([ True, False, False,  True,  True,  True,  True, False, False,
       False,  True,  True,  True,  True, False, False,  True,  True,
       False, False, False, False, False,  True,  True,  True,  True,
        True, False, False, False, False,  True,  True,  True,  True,
       False, False, False, False,  True,  True, False, False, False,
       False,  True,  True,  True,  True,  True, False, False,  True,
        True,  True,  True,  True, False, False, False, False, False, False], dtype=bool)
>>> bit_array = 1*bool_array
>>> bit_array
array([1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
>>> bit_string = ''.join(str(b) for b in bit_array)
>>> bit_string
'1001111000111100110000011111000011110000110000111110011111000000'
>>> int(bit_string, 2)
11402201597170935744L
>>> hex(int(bit_string, 2))
'0x9e3cc1f0f0c3e7c0L' # This is the expected hex value for the given bit array (phash)
>>> str(ph)
'793c830f0fc3e703' # and here is the actual value.

If I made no mistake, let me know and I will submit a PR.

Identifies similar images in the directory.

Hi, I am new to ML coding. Can you please suggest which parameters I need to fill in to get output? I have two images, 1.jpg and 2.jpg, and I want to find the similarity between them. How shall I do it? I would be grateful. Thanks

def usage():
    sys.stderr.write("""SYNOPSIS: %s [ahash|phash|dhash|...] []
Identifies similar images in the directory.
Method:
  ahash: Average hash
  phash: Perceptual hash
  dhash: Difference hash
  whash-haar: Haar wavelet hash
  whash-db4: Daubechies wavelet hash
(C) Johannes Buchner, 2013-2017
""" % sys.argv[0])
    sys.exit(1)

Which alternative is best?

Which hash function would be best to use if I compare a photo from my phone to a set of photos stored in a catalog on my computer?

I'm trying to achieve basic image searching.

hex_to_hash only works on 16 char hashes

This works as expected:

h8 = imagehash.average_hash(img, hash_size=8)
print(h8)
>> 6d39bc91d0d0f131
imagehash.hex_to_hash("6d39bc91d0d0f131")
>> array([[...

This fails

h16 = imagehash.average_hash(img, hash_size=16)
print(h16)
>> f33df33dc21bc297e29f7ace12c782e782e282f782f312f30273027f822f830f
imagehash.hex_to_hash("f33df33dc21bc297e29f7ace12c782e782e282f782f312f30273027f822f830f'")

with traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-9c753def512a> in <module>()
----> 1 imagehash.hex_to_hash("f33df33dc21bc297e29f7ace12c782e782e282f782f312f30273027f822f830f'")

/home/mhradile/tests/distribution/Library/imaging/imagehash.pyc in hex_to_hash(hexstr)
     84         l = []
     85         if len(hexstr) != 16:
---> 86                 raise ValueError('The hex string has the wrong length')
     87         for i in range(8):
     88                 h = hexstr[i*2:i*2+2]

ValueError: The hex string has the wrong length
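A standalone workaround is to decode the hex string yourself. The sketch below assumes the MSB-first bit order of the 4.x hex encoding and a square hash (note the 4.0 changelog above: newer releases changed the hex implementation precisely because the old one was broken for various hash sizes):

```python
import numpy as np

def hex_to_bool_array(hexstr: str) -> np.ndarray:
    """Decode a hex hash of any square size into an n x n bool array.
    Assumes MSB-first bit order, matching imagehash's 4.x hex encoding."""
    bits = len(hexstr) * 4
    n = int(round(bits ** 0.5))
    if n * n != bits:
        raise ValueError('hash must encode a square bit matrix')
    value = int(hexstr, 16)
    flat = [(value >> (bits - 1 - i)) & 1 for i in range(bits)]
    return np.array(flat, dtype=bool).reshape(n, n)
```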

No need to pad HEX

Why do you pad the conversion with zeros?
There is no need to do this.
binary_array = '{:0>{width}b}'.format(int(hexstr, 16), width = hash_size * hash_size)
to get rid of 0b like:
bin(int('3838b59d0c1c7fd8', 16))

just do
'{:b}'.format(int('3838b59d0c1c7fd8', 16), width = hash_size * hash_size)

or
format((int('3838b59d0c1c7fd8', 16)),'08b')

Performance tests / thoughts (~10e6 hashes)

Firstly, congratulations on imagehash!

I'm using it for an application in which I have a few million ImageHash objects in a pandas DataFrame. I have a web server (a "microservice") which loads all of these hashes in memory once (from a pickled file), and then outputs closest matches for a given hash.

i.e., this server allows other services to call it like so: http://example.com/?phash=... passing it the phash of an image (the needle) that's then compared to the millions of stored/pre-computed hashes.

As these millions of hash comparisons were taking > 15 seconds on a pretty good machine, this made me look under the hood. Here's what I uncovered using line_profiler:

Total time: 40.0547 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    66                                            @profile
    67                                            def __sub__(self, other):
    68   6585208    2452786.0      0.4      6.1     if other is None:
    69                                                raise TypeError('Other hash must not be None.')
    70   6585208    4903316.0      0.7     12.2     if self.hash.size != other.hash.size:
    71                                                raise TypeError('ImageHashes must be of the same shape.', self.hash.shape, other.hash.shape)
    72                                              # original code below, split up to profile each separate instruction
    73   6585208    8010861.0      1.2     20.0     flattened_h = self.hash.flatten()
    74   6585208    6635342.0      1.0     16.6     flattened_other_h = other.hash.flatten()
    75   6585208    6499394.0      1.0     16.2     sub_calc = flattened_h != flattened_other_h
    76   6585208    9317760.0      1.4     23.3     non_zero = numpy.count_nonzero(sub_calc)
    77   6585208    2235231.0      0.3      5.6     return non_zero

(the reported total time is slower than when running this code without the profiler, but the percentage values still hold)

Interestingly enough, the first two sanity if checks take up 18% of the time. First question: is it worth disabling those (but leaving users with more obscure error messages when __sub__ is called with incompatible arguments)? Would it be worth considering having a separate, "optimized" version of __sub__ that assumes that the user is passing correct values to it..?

The second, and most important finding, is that both .flatten() operations take up close to 40% of the running time.

I've modified my version of imagehash to pre-compute self.hash_flat once in __init__, and removed both sanity checks. Here's the optimized result:

Total time: 16.0691 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    67                                            @profile
    68                                            def __sub__(self, other):
    69                                              # if other is None:
    70                                              #   raise TypeError('Other hash must not be None.')
    71                                              # if self.hash.size != other.hash.size:
    72                                              #   raise TypeError('ImageHashes must be of the same shape.', self.hash.shape, other.hash.shape)
    73                                           
    74                                              # optimized code
    75   6585208    7297545.0      1.1     45.4     sub_calc = self.hash_flat != other.hash_flat
    76   6585208    8771542.0      1.3     54.6     return numpy.count_nonzero(sub_calc)

Much better... but is this only better for my specific, "weird" application? 😄

I have some other thoughts/questions around numpy.count_nonzero, but perhaps we can save it for later/another issue.

Thanks again! Looking forward to reading your thoughts.
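A sketch of taking this one step further (an editorial illustration, not the library's code): stack all stored hashes into one (N, bits) bool matrix once at load time, so a single vectorized comparison replaces millions of individual __sub__ calls.

```python
import numpy as np

def batch_hamming(needle_bits, stored_bits):
    """needle_bits: (bits,) bool array (e.g. some_hash.hash.flatten());
    stored_bits: (N, bits) bool matrix of all stored hashes.
    Returns an (N,) array of Hamming distances in one vectorized pass."""
    return np.count_nonzero(stored_bits != needle_bits, axis=1)

# Example with random stand-in hashes (64 bits each, as for hash_size=8):
rng = np.random.default_rng(0)
stored = rng.integers(0, 2, size=(100_000, 64)).astype(bool)
distances = batch_hamming(stored[42], stored)
```

This moves the per-comparison Python overhead (attribute lookups, flattening, function calls) out of the loop entirely, which is usually a much larger win than trimming the sanity checks.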

`whash` sometimes gives invalid results

I frequently see this hash when doing a whash: ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff

Visually inspecting the images, they are not at all similar. Here are two images:

test1
test2

Here's my test code:

import os

import imagehash
from PIL import Image


class ImageHasher(object):

    HASH_SIZE = 32
    IMAGE_SCALE = 2048

    def __call__(self, f):

        image = Image.open(f)

        hash_ = imagehash.whash(
            image,
            hash_size=self.HASH_SIZE,
            image_scale=self.IMAGE_SCALE
        )

        return str(hash_)


image_hasher = ImageHasher()

def main():
    for filename in os.listdir('images'):
        path = os.path.join('images', filename)
        with open(path, 'rb') as f:
             print(image_hasher(f))


main()

Any clues as to what's going on?

Thanks.

Why intentionally shorten the hash value in __hash__() of the ImageHash class?

It looks like the current implementation of __hash__() in the ImageHash class divides the internal 64-bit hash value into several 8-bit values, then adds up those 8-bit values.

The question is: why do this? It makes '0000ffff0000eeee' and 'ffff0000eeee0000' return the same hash value, leading to collisions.

In [61]: hash(imagehash.hex_to_hash('0000ffff0000eeee'))
Out[61]: 986

In [62]: hash(imagehash.hex_to_hash('ffff0000eeee0000'))
Out[62]: 986

Unable to install the library: "--single-version-externally-managed --compile" failed with error code 1

OS: Windows 10
Python Version: 3.5.0
Pip Version: 9.0.1

I am getting following error message while trying to install (pip install ImageHash) the library.

Command "c:\misc\python\python35-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\DANCE2~1\\AppData\\Local\\Temp\\pip-build-otq1zs_f\\scipy\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\DANCE2~1\AppData\Local\Temp\pip-ei4og5fl-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\DANCE2~1\AppData\Local\Temp\pip-build-otq1zs_f\scipy\

Here is the full message. Let me know should you require other information.

c:\Users\dance2die>pip install ImageHash
Collecting ImageHash
  Using cached ImageHash-3.1.tar.gz
Requirement already satisfied: numpy in c:\misc\python\python35-32\lib\site-packages (from ImageHash)
Collecting scipy (from ImageHash)
  Using cached scipy-0.18.1.tar.gz
Collecting pillow (from ImageHash)
  Using cached Pillow-3.4.2-cp35-cp35m-win32.whl
Collecting PyWavelets (from ImageHash)
  Using cached PyWavelets-0.5.1-cp35-none-win32.whl
Installing collected packages: scipy, pillow, PyWavelets, ImageHash
  Running setup.py install for scipy ... error
	Complete output from command c:\misc\python\python35-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\DANCE2~1\\AppData\\Local\\Temp\\pip-build-otq1zs_f\\scipy\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\DANCE2~1\AppData\Local\Temp\pip-ei4og5fl-record\install-record.txt --single-version-externally-managed --compile:

	Note: if you need reliable uninstall behavior, then install
	with pip instead of using `setup.py install`:

	  - `pip install .`       (from a git repo or downloaded source
							   release)
	  - `pip install scipy`   (last SciPy release on PyPI)


	lapack_opt_info:
	openblas_lapack_info:
	  libraries openblas not found in ['c:\\misc\\python\\python35-32\\lib', 'C:\\', 'c:\\misc\\python\\python35-32\\libs']
	  NOT AVAILABLE

	lapack_mkl_info:
	  libraries mkl_rt not found in ['c:\\misc\\python\\python35-32\\lib', 'C:\\', 'c:\\misc\\python\\python35-32\\libs']
	  NOT AVAILABLE

	atlas_3_10_threads_info:
	Setting PTATLAS=ATLAS
	c:\misc\python\python35-32\lib\site-packages\numpy\distutils\system_info.py:639: UserWarning: Specified path C:\projects\numpy-wheels\windows-wheel-builder\atlas-builds\atlas-3.10.1-sse2-32\lib is invalid.
	  warnings.warn('Specified path %s is invalid.' % d)
	<class 'numpy.distutils.system_info.atlas_3_10_threads_info'>
	  NOT AVAILABLE

	atlas_3_10_info:
	<class 'numpy.distutils.system_info.atlas_3_10_info'>
	  NOT AVAILABLE

	atlas_threads_info:
	Setting PTATLAS=ATLAS
	<class 'numpy.distutils.system_info.atlas_threads_info'>
	  NOT AVAILABLE

	atlas_info:
	<class 'numpy.distutils.system_info.atlas_info'>
	  NOT AVAILABLE

	c:\misc\python\python35-32\lib\site-packages\numpy\distutils\system_info.py:1532: UserWarning:
		Atlas (http://math-atlas.sourceforge.net/) libraries not found.
		Directories to search for the libraries can be specified in the
		numpy/distutils/site.cfg file (section [atlas]) or by setting
		the ATLAS environment variable.
	  warnings.warn(AtlasNotFoundError.__doc__)
	lapack_info:
	  libraries lapack not found in ['c:\\misc\\python\\python35-32\\lib', 'C:\\', 'c:\\misc\\python\\python35-32\\libs']
	  NOT AVAILABLE

	c:\misc\python\python35-32\lib\site-packages\numpy\distutils\system_info.py:1543: UserWarning:
		Lapack (http://www.netlib.org/lapack/) libraries not found.
		Directories to search for the libraries can be specified in the
		numpy/distutils/site.cfg file (section [lapack]) or by setting
		the LAPACK environment variable.
	  warnings.warn(LapackNotFoundError.__doc__)
	lapack_src_info:
	  NOT AVAILABLE

	c:\misc\python\python35-32\lib\site-packages\numpy\distutils\system_info.py:1546: UserWarning:
		Lapack (http://www.netlib.org/lapack/) sources not found.
		Directories to search for the sources can be specified in the
		numpy/distutils/site.cfg file (section [lapack_src]) or by setting
		the LAPACK_SRC environment variable.
	  warnings.warn(LapackSrcNotFoundError.__doc__)
	  NOT AVAILABLE

	Running from scipy source directory.
	Traceback (most recent call last):
	  File "<string>", line 1, in <module>
	  File "C:\Users\DANCE2~1\AppData\Local\Temp\pip-build-otq1zs_f\scipy\setup.py", line 415, in <module>
		setup_package()
	  File "C:\Users\DANCE2~1\AppData\Local\Temp\pip-build-otq1zs_f\scipy\setup.py", line 411, in setup_package
		setup(**metadata)
	  File "c:\misc\python\python35-32\lib\site-packages\numpy\distutils\core.py", line 135, in setup
		config = configuration()
	  File "C:\Users\DANCE2~1\AppData\Local\Temp\pip-build-otq1zs_f\scipy\setup.py", line 335, in configuration
		config.add_subpackage('scipy')
	  File "c:\misc\python\python35-32\lib\site-packages\numpy\distutils\misc_util.py", line 1000, in add_subpackage
		caller_level = 2)
	  File "c:\misc\python\python35-32\lib\site-packages\numpy\distutils\misc_util.py", line 969, in get_subpackage
		caller_level = caller_level + 1)
	  File "c:\misc\python\python35-32\lib\site-packages\numpy\distutils\misc_util.py", line 906, in _get_configuration_from_setup_py
		config = setup_module.configuration(*args)
	  File "scipy\setup.py", line 15, in configuration
		config.add_subpackage('linalg')
	  File "c:\misc\python\python35-32\lib\site-packages\numpy\distutils\misc_util.py", line 1000, in add_subpackage
		caller_level = 2)
	  File "c:\misc\python\python35-32\lib\site-packages\numpy\distutils\misc_util.py", line 969, in get_subpackage
		caller_level = caller_level + 1)
	  File "c:\misc\python\python35-32\lib\site-packages\numpy\distutils\misc_util.py", line 906, in _get_configuration_from_setup_py
		config = setup_module.configuration(*args)
	  File "scipy\linalg\setup.py", line 20, in configuration
		raise NotFoundError('no lapack/blas resources found')
	numpy.distutils.system_info.NotFoundError: no lapack/blas resources found

	----------------------------------------
Command "c:\misc\python\python35-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\DANCE2~1\\AppData\\Local\\Temp\\pip-build-otq1zs_f\\scipy\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\DANCE2~1\AppData\Local\Temp\pip-ei4og5fl-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\DANCE2~1\AppData\Local\Temp\pip-build-otq1zs_f\scipy\

Maximum possible difference

Hi Johannes,

This is a quick question more than an issue, though it would be helpful for you to add it to your docs. What is the maximum possible difference between two images using the hashing approach?

max(hash1 - hash2) = ?

I'm interested in normalizing the value to give a result between 0 and 1, though this may not make sense. I'd be interested to know your thoughts.

Thanks,

Shane
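For reference, the maximum possible difference is the total number of bits in the hash: hash_size × hash_size, i.e. 64 for the default hash_size=8, reached when every bit differs. Dividing by the bit count therefore gives a value in [0, 1]; a sketch on the hex strings produced by str(hash):

```python
def normalized_distance(hex1: str, hex2: str) -> float:
    """Hamming distance divided by the total bit count (hash_size ** 2)."""
    bits = len(hex1) * 4
    differing = bin(int(hex1, 16) ^ int(hex2, 16)).count("1")
    return differing / bits
```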

Make scipy an optional dependency

Currently both numpy and scipy are needed to install imagehash, but this only makes sense when you're using the phash function. If you're using average_hash or dhash, then you only need numpy.

Hence scipy should be marked as an optional dependency, both in the code and in setup.py.

People can then install this library as needed by using, e.g., pip install ImageHash or pip install ImageHash[pHash] using the optional dependency functionality.
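A sketch of what the packaging change could look like; the extra names and the exact dependency split are illustrative, not the project's actual setup.py:

```python
from setuptools import setup

# Hypothetical setup.py fragment: scipy and PyWavelets become optional extras.
setup(
    name='ImageHash',
    install_requires=['numpy', 'pillow'],  # needed by every hash function
    extras_require={
        'pHash': ['scipy'],        # scipy.fftpack is only used by phash
        'wHash': ['PyWavelets'],   # wavelets are only used by whash
    },
)
```

Users would then run, e.g., pip install "ImageHash[pHash]" to pull in scipy only when they need it.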

Cannot install on Ubuntu

I tried to install imagehash on Ubuntu 14.04 using a blank virtualenv running Python 2.7 and pip.
First it complained that numpy was not installed. After installing it, I tried again and got this:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/pc/.testenv/build/scipy/setup.py", line 415, in <module>
    setup_package()
  File "/home/pc/.testenv/build/scipy/setup.py", line 411, in setup_package
    setup(**metadata)
  File "/home/pc/.testenv/local/lib/python2.7/site-packages/numpy/distutils/core.py", line 135, in setup
    config = configuration()
  File "/home/pc/.testenv/build/scipy/setup.py", line 335, in configuration
    config.add_subpackage('scipy')
  File "/home/pc/.testenv/local/lib/python2.7/site-packages/numpy/distutils/misc_util.py", line 1000, in add_subpackage
    caller_level = 2)
  File "/home/pc/.testenv/local/lib/python2.7/site-packages/numpy/distutils/misc_util.py", line 969, in get_subpackage
    caller_level = caller_level + 1)
  File "/home/pc/.testenv/local/lib/python2.7/site-packages/numpy/distutils/misc_util.py", line 906, in _get_configuration_from_setup_py
    config = setup_module.configuration(*args)
  File "scipy/setup.py", line 15, in configuration
    config.add_subpackage('linalg')
  File "/home/pc/.testenv/local/lib/python2.7/site-packages/numpy/distutils/misc_util.py", line 1000, in add_subpackage
    caller_level = 2)
  File "/home/pc/.testenv/local/lib/python2.7/site-packages/numpy/distutils/misc_util.py", line 969, in get_subpackage
    caller_level = caller_level + 1)
  File "/home/pc/.testenv/local/lib/python2.7/site-packages/numpy/distutils/misc_util.py", line 906, in _get_configuration_from_setup_py
    config = setup_module.configuration(*args)
  File "scipy/linalg/setup.py", line 20, in configuration
    raise NotFoundError('no lapack/blas resources found')
numpy.distutils.system_info.NotFoundError: no lapack/blas resources found

----------------------------------------
Cleaning up...
Command /home/pc/.testenv/bin/python -c "import setuptools, tokenize;__file__='/home/pc/.testenv/build/scipy/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-AQ6kBT-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/pc/.testenv/include/site/python2.7 failed with error code 1 in /home/pc/.testenv/build/scipy
Storing debug log for failure in /home/pc/.pip/pip.log

How can I fix this?

How to encode an <class 'imagehash.ImageHash'> object into MongoDB?

Hello,

I am new to python/pymongo and I am in trouble when I try to save imagehash infos to a mongodb database...

File "/usr/local/lib/python2.7/site-packages/pymongo/collection.py", line 409, in insert
gen(), check_keys, self.uuid_subtype, client)
bson.errors.InvalidDocument: Cannot encode object: array([[False, True, True, True, True, False, True, True],
[ True, False, False, False, True, True, False, False],
[False, False, True, False, False, True, True, False],
[False, True, True, True, True, True, True, True],
[ True, True, True, True, True, True, True, True],
[ True, True, False, False, False, False, False, False],
[False, False, False, False, False, False, False, True],
[ True, True, True, True, True, True, False, False]], dtype=bool)

Sorry if it is a dumb question, but could you help me solve this? :/

thanks
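The usual fix is to store the hex string form (str(hash)) instead of the raw bool array, and convert back with imagehash.hex_to_hash() after fetching. The self-contained sketch below mimics that round trip with plain numpy so it needs no image files; the document fields are illustrative.

```python
import numpy as np

def bools_to_hex(arr):
    """Pack a bool array into a hex string (BSON/JSON-safe), MSB first."""
    bit_string = ''.join('1' if b else '0' for b in arr.flatten())
    return '{:0>{w}x}'.format(int(bit_string, 2), w=len(bit_string) // 4)

def hex_to_bools(hexstr, shape):
    """Inverse of bools_to_hex: unpack a hex string back into a bool array."""
    n_bits = len(hexstr) * 4
    bit_string = '{:0>{w}b}'.format(int(hexstr, 16), w=n_bits)
    return np.array([c == '1' for c in bit_string], dtype=bool).reshape(shape)

# Stand-in for an 8x8 hash array; with imagehash itself this is simply
# doc = {'phash': str(image_hash)} on insert and
# imagehash.hex_to_hash(doc['phash']) on read.
hash_array = np.eye(8, dtype=bool)
doc = {'filename': 'photo.jpg', 'phash': bools_to_hex(hash_array)}
```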

PHash() is not throwing out high-frequency image data.

As of commit: 481148a

The phash() and phash_simple() algorithms are no longer tossing out high-frequency image data.

The rest of this issue description is the same as the comment I wrote on that commit:


On average_hash and dhash the hash_size parameter is used to indicate how small to make the sub-sampled image, which directly becomes the number of bits in the resulting hash value.

But in phash (and phash_simple) there are two steps of subsampling, not one. The first subsample is when we reduce the image, and the second is when we select which DCT coefficients to use. The second subsample determines the number of bits in the resulting hash value.

These two steps really need to be done at different sizes. The whole reason phash works is because it tosses out high-frequency image data, that which is most likely to be different in similar-looking images, and it keeps only the low-frequency data. This happened in the past because we downsized the image to 32x32 pixels (subsample 1) which generated 32x32 DCT coefficients, and then we kept only 8x8 of the DCT coefficients (subsample 2).

But after this commit, the algorithm is keeping ALL the DCT coefficients. It downsizes to 8x8 pixels and keeps all 8x8 coefficients. It no longer throws out high-frequency data.

This means that similar images that would have hashed to similar phash values are now much more likely to differ by more bits in their phash values, because that high-frequency data is being included.

(You can also easily show that phash_simple returns fewer bits than expected because it effectively runs off the edge of the DCT matrix.)
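The two-step subsampling described above can be sketched in a few lines. This is an illustrative reconstruction using scipy, not the library's actual code; `phash_sketch` and its parameters are hypothetical names:

```python
import numpy as np
from scipy.fftpack import dct

def phash_sketch(pixels, hash_size=8):
    """Two-step subsample: take a big DCT, then keep only low frequencies.

    `pixels` is assumed to be a 32x32 grayscale float array (subsample 1,
    done when the image is resized); keeping the top-left hash_size x
    hash_size coefficients is subsample 2, which discards high frequencies.
    """
    coeffs = dct(dct(pixels, axis=0), axis=1)   # 32x32 DCT coefficients
    low = coeffs[:hash_size, :hash_size]        # keep only the 8x8 low-freq terms
    return low > np.median(low)                 # threshold into 64 bits

rng = np.random.default_rng(0)
bits = phash_sketch(rng.random((32, 32)))
assert bits.shape == (8, 8)
```

Dropping the `coeffs[:hash_size, :hash_size]` slice (i.e. computing the DCT of an already 8x8 image) is precisely the regression described above: all coefficients survive and no high-frequency data is thrown out.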

Rule-of-thumb threshold for phash

What would be an ideal value to use as a threshold for phash? This blog did some research on it, but I couldn't make out much from it, especially as it didn't provide any conclusive threshold value. I'm fairly new to this, and a rule-of-thumb threshold might benefit newcomers.

How to compare hashes?

I'm not clear on how two hashes should be compared to determine how similar the images are.

  • Integer difference?
  • Hamming difference?
  • Difference of each byte/nybble?
  • XOR?
  • Number of 1 bits?
  • Is hash 0xFFFF.... very close to 0x0000... or very far?
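For reference, the README's basic-usage example shows that subtracting two ImageHash objects returns the Hamming distance (the number of differing bits). A minimal numpy sketch of what that comparison computes:

```python
import numpy as np

# ImageHash.__sub__ counts the positions where the two bit arrays differ,
# i.e. the Hamming distance -- not an integer difference or XOR value
h1 = np.array([True, False, True, True])
h2 = np.array([True, True, False, True])
hamming = np.count_nonzero(h1 != h2)
assert hamming == 2
```

So 0xFFFF... vs 0x0000... is maximally far apart (every bit differs), and identical hashes have distance 0.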

phash() exhibits a strong bias for repetitive bit patterns

The hashes generated by imagehash.phash() tend to exhibit a strong bias for repetitive bit patterns, thus shrinking the space of plausible hash values to something much smaller than the full 64-bit space.

  • Pull request coming momentarily with my changes

Here's an example: I have a pile of 1442 test images. I generated phashes for each and bucketed the results. There were a number of hash values that identified more than one image. Upon visual inspection most of the collided images didn't have much visual similarity.

f4f4f4f4f4f4f4f4    - 3 images
f5f5f5f5f5f5f5f5    - 9 images
fcfcfcfcfcfcfcf4    - 2 images
fefefefefefefefe    - 2 images
6c6c6c6c6c6c6c6c    - 2 images
0101090505050501    - 2 images
fcf8f8f8f8fcfcfc    - 2 images
5555555555555555    - 3 images
0303030303030303    - 9 images
0101010101010101    - 6 images
9c918404e6dec7c2    - 2 images
f4d4f4fefefefafe    - 2 images
fafafafafafafafa    - 2 images
fcfcfcf4f4f4f4f4    - 3 images
9dcdcdddcddcd9dc    - 2 images
f4f4f4f4f4fcfcfc    - 4 images
fcfcfcfcfcfcfcfc    - 9 images

Note the repetitive bit-pattern nature of all these hashes.

I propose four changes to the imagehash code:

Change 0 -- fix binary_array_to_int()

The implementation takes a modulus where it shouldn't and returns mathematically wrong values. This change doesn't affect the output of the library in any way (the function is only used in ImageHash.hash()), but it's a useful function for diagnostic reasons and there's no reason not to fix it.

Change 1-3 -- changes to the phash algorithm

1: Use a bi-directional DCT, as per [1]

2: Use the median of coefficients instead of the mean

  • as described at [2]
  • this is what phash.org's ph_dct_imagehash() does

3: Use the upper-left 8x8 coefficients, instead of skipping the left-most column

  • This change invites more debate
  • David Starkweather wrote on [3] that you want to skip the first coefficient
  • phash.org's ph_dct_imagehash() skips both the left-most column AND the top-most row. If I understand their code, they're only using 49 coefficients instead of 64 (or 63) coefficients. Cite: [4]
  • I would argue only the top-left-most coefficient is the DC term, and skipping the entire column or row doesn't make any sense. If you skip the DC term then you either have a wasted bit in your hash or you get to pick some 64th coefficient to fill it -- which makes you decide if you like the horizontal or vertical axis more. But if you include the DC term it's only going to affect a single bit of the hash, and intuitively images with good visual similarity would correlate on their DC term as well as the frequency terms.

[1] = http://stackoverflow.com/questions/15978468/using-the-scipy-dct-function-to-create-a-2d-dct-ii
[2] = http://hackerlabs.org/blog/2012/07/30/organizing-photos-with-duplicate-and-similarity-checking/
[3] = http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html
[4] = (pHash.cpp:386) CImg subsec = dctImage.crop(1,1,8,8).unroll('x');;

After implementing these changes, I re-ran my 1442 image files and had only one bucket with more than one image:

19b2664d98da7396    - 2 images

Upon inspection these images were nearly identical -- about a 1-degree camera rotation difference.

Pulling sample images from my library and looking for their closest phash neighbors did, of course, still yield some false positives, but in general the results empirically felt much more accurate.

I also used a small set of images to do more detailed experiments to help justify each change.

Details below:

Experiment Driver Code

from PIL import Image
import imagehash

def PHash(filepath):
    im = Image.open(filepath)
    return imagehash.phash(im)

def HashToString(phash):
    i = imagehash.binary_array_to_int(phash.hash)
    return '{:016x}({:02d})'.format(i, '{:b}'.format(i).count('1'))

def RunExperiment(hashes):
    '''Prints an N-by-N matrix comparing each hash value to each other hash value.'''
    s = ''
    for y in range(len(hashes)):
        s += '\n{}: '.format(y)
        for x in range(len(hashes)):
            s += '  {:2d}'.format(hashes[y] - hashes[x])
        s += '    hash(on-bits)=' + HashToString(hashes[y])
    print('Comparison matrix:\n      ' + '   '.join(str(x) for x in range(len(hashes))) + s)

hashes = (PHash('../phash1.jpeg'), PHash('../phash2.jpeg'),
          PHash('../phash3a.jpeg'), PHash('../phash3b.jpeg'))
RunExperiment(hashes)

Step 0: Fix the binary_array_to_int() function

>>> import numpy as np
>>> testarray = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1])
>>> list(testarray).count(1)
6

Before:

>>> '{:016b}'.format(binary_array_to_int(testarray))
'0000000101100010'
>>> '{:016b}'.format(binary_array_to_int(testarray)).count('1')
4

After:

>>> '{:016b}'.format(binary_array_to_int(testarray))
'1010000111000001'
>>> '{:016b}'.format(binary_array_to_int(testarray)).count('1')
6
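A fixed binary_array_to_int consistent with the "After" output above could look like this (a reconstruction for illustration, not the actual patch):

```python
import numpy as np

def binary_array_to_int(arr):
    # Treat index 0 of the flattened array as the least-significant bit;
    # this packing reproduces the corrected '1010000111000001' output above.
    return sum(1 << i for i, v in enumerate(arr.flatten()) if v)

testarray = np.array([1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1])
assert '{:016b}'.format(binary_array_to_int(testarray)) == '1010000111000001'
```

The key property is that the integer's popcount now matches the array's count of set bits (6), which the buggy modulus-taking version violated.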

Next: Try to eliminate high levels of repetitive bit-patterns

Baseline: original implementation

Comparison matrix:
      0   1   2   3
0:    0   0   7  10    hash(on-bits)=fefefefefefefefe(56)
1:    0   0   7  10    hash(on-bits)=fefefefefefefefe(56)
2:    7   7   0   3    hash(on-bits)=fefefefefed4f4f4(49)
3:   10  10   3   0    hash(on-bits)=fefafafefed4f4d4(46)

Very compressed hash space, the majority of bits are set

Experiment 1: use a bi-directional DCT

Comparison matrix:
      0   1   2   3
0:    0  36  29  36    hash(on-bits)=61bdf66a81857efa(36)
1:   36   0  31  34    hash(on-bits)=af42f41ce32e30c7(32)
2:   29  31   0  21    hash(on-bits)=b0d82f76e3996866(33)
3:   36  34  21   0    hash(on-bits)=f8d11f864f3149c6(32)

Much higher-entropy hashes (yay!), about half the bits are set

Experiment 2: use median instead of mean

Comparison matrix:
      0   1   2   3
0:    0  28  30  30    hash(on-bits)=a4a4b4f4f4f8e8e0(32)
1:   28   0  30  32    hash(on-bits)=aaaaaaaaaaaaaaaa(32)
2:   30  30   0   2    hash(on-bits)=fabafafa58505050(32)
3:   30  32   2   0    hash(on-bits)=fafafafa50505050(32)

Exactly half the bits set (yay!) but low-entropy hashes again

Experiment 3: use both a bi-directional DCT and the median instead of mean

Comparison matrix:
      0   1   2   3
0:    0  34  28  34    hash(on-bits)=21bdf66a81056afa(32)
1:   34   0  30  34    hash(on-bits)=af42f41ce32e30c7(32)
2:   28  30   0  22    hash(on-bits)=b0d82f76e3992866(32)
3:   34  34  22   0    hash(on-bits)=f8d11f864f3149c6(32)

Exactly half the bits set on (yay!), high entropy (yay!)

So using a bi-directional DCT yields more random-looking hashes with fewer repeating bit patterns, and using the median makes the number of set bits exactly half the available bit space. Both seem like very good changes.

Finally, as to which coefficients to take...

Experiment C0: skip first column, use 64 coefficients (original implementation)

Comparison matrix:
      0   1   2   3
0:    0  34  28  34    hash(on-bits)=21bdf66a81056afa(32)
1:   34   0  30  34    hash(on-bits)=af42f41ce32e30c7(32)
2:   28  30   0  22    hash(on-bits)=b0d82f76e3992866(32)
3:   34  34  22   0    hash(on-bits)=f8d11f864f3149c6(32)

Experiment C1: skip first row, use 64 coefficients

Comparison matrix:
      0   1   2   3
0:    0  32  30  36    hash(on-bits)=06437bedd4020bfd(32)
1:   32   0  30  36    hash(on-bits)=e05ec5ed38c65c61(32)
2:   30  30   0  20    hash(on-bits)=e661b05eedc63251(32)
3:   36  36  20   0    hash(on-bits)=adf1a23e0d8e6293(32)

Experiment C2: skip first row and column, use 64 coefficients

Comparison matrix:
      0   1   2   3
0:    0  34  32  36    hash(on-bits)=0b61bdf66a81057e(32)
1:   34   0  32  36    hash(on-bits)=70af62f61ce32e30(32)
2:   32  32   0  22    hash(on-bits)=f3b0d80f66e39928(32)
3:   36  36  22   0    hash(on-bits)=56f8d11f864f3149(32)

Experiment C3: skip first row and column, use 49 coefficients (as per phash.org)

Comparison matrix:
      0   1   2   3
0:    0  26  22  26    hash(on-bits)=000085ef6d4042fa(24)
1:   26   0  24  28    hash(on-bits)=0000be174398d730(24)
2:   22  24   0  16    hash(on-bits)=0000c2c2fed8cca8(24)
3:   26  28  16   0    hash(on-bits)=0001e289f0d1d8c9(24)

Experiment C4: use all 64 lowest-frequency coefficients (no skipping)

Comparison matrix:
      0   1   2   3
0:    0  32  30  34    hash(on-bits)=437bed94020bc5f5(32)
1:   32   0  30  34    hash(on-bits)=5e85e938c65c618f(32)
2:   30  30   0  18    hash(on-bits)=61b05eedc63251cd(32)
3:   34  34  18   0    hash(on-bits)=f1a23e0d9e62938d(32)

Image 3 is similar to image 2 (same person, similar camera angle, different pose). Image 0 and 1 are similar to 2 or 3 only in that they all have a sky, a ground, and a horizon.

The best spread we got between similar and not-similar results was 16: cases C1 and C4

Case C3 had a spread of 10, but used 77% of the bits to do it.

Normalizing C3 as if it used 64 bits yields a spread of 13 (= 10*64/49)

My preference is to go with C4. C3 may be the phash.org implementation, but I see no benefit to throwing out 15 bits worth of data. C2 is arguably what phash.org might have wanted instead, using 15 higher-frequency coefficients instead of the 15 lowest-frequency ones, but it doesn't perform as well as C4. C1 and C0 seem arbitrary to me, favoring horizontal data over vertical data or vice versa.

Of course it would make sense to run large-scale experiments on thousands of human-classified test samples. But that's resources I don't have access to at this time.

Pull Request

A pull request with my changes should be coming along momentarily.

How should I compute hashes when two pictures are similar but have different sizes?

I know your functions resize a picture (by default to 8x8, 8x4, or 8x9) when computing the hash. But I have two pictures that are very similar yet have different sizes, and their Hamming distance is too big. My function is this:

import logging

def Hamming_distance(hash1, hash2):
    """
    :param hash1: hash1
    :param hash2: hash2
    :return: distance
    """
    distance = 0
    try:
        for index in range(len(hash1)):
            if hash1[index] != hash2[index]:
                distance += 1
        return distance
    except Exception as e:
        logging.error('hamming distance fail %s', e)

How can I fix this?

Hash comparison performance

Hi Johannes - thanks for an elegant implementation of these perceptual hash algorithms.

A performance enhancement suggestion for hash comparison (__sub__):

My understanding is that:

self.hash.flatten() != other.hash.flatten()

will do a boolean compare for each element in the hash array

Optimising this via the clever bitCount() below (from https://wiki.python.org/moin/BitManipulation) could be useful when comparing large hash sets:

def bitCount(int_type):
    count = 0
    while int_type:
        int_type &= int_type - 1
        count += 1
    return count
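As a complement, when one query hash is compared against many stored hashes, numpy can count differing bits row-wise without a Python-level loop (a sketch with hypothetical random data, not part of the library):

```python
import numpy as np

# N stored 64-bit hashes as an (N, 64) bool matrix, plus one query hash;
# counting mismatches per row gives all N Hamming distances in one call
stored = np.random.default_rng(1).random((1000, 64)) > 0.5
query = stored[0]                                  # query equals the first row
distances = np.count_nonzero(stored != query, axis=1)
assert distances[0] == 0                           # identical hash -> distance 0
```

Which approach wins depends on whether the hashes live as packed integers (bitCount shines) or as numpy arrays (vectorized comparison shines).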

pillow Image.resize

Note that there appears to be a change in the pillow implementation of the Image.resize method between versions 3.3.1 and 3.4.0+ such that (some) images will produce subtly different image-hash values.

I've seen this using imagehash.phash and I suspect that it extends to all algorithms that use Image.resize.

This difference is independent of using Image.ANTIALIAS.

Flipping an image doesn't flip the hash (phash, dhash)

For example, with this image originally and flipped horizontally as input:
input
average_hash outputs hashes that satisfyingly look like this (# = 1, . = 0):
output_average_hash
While phash outputs hashes that look like this. When shown side by side like this I would expect them to be symmetrical.
output_phash

I'm assuming this is a bug? If not, could you please explain this behaviour? I find it quite strange.

  • average_hash - valid
  • dhash_vertical - valid
  • whash - valid
  • phash - invalid
  • phash_simple - invalid
  • dhash - invalid

Here's a minimal example:

import numpy as np
from PIL import Image
from imagehash import *

def hash_to_text(a):
    return "\n".join(" ".join(("#" if x else ".") for x in y) for y in a)

im_orig = Image.open("samples/lion.jpg")
im_horz = im_orig.transpose(Image.FLIP_LEFT_RIGHT)

for func in (average_hash, phash, phash_simple, dhash, dhash_vertical, whash):
    hash_orig = func(im_orig).hash
    hash_horz = func(im_horz).hash
    name = func.__name__
    print("==", name, "==")
    if np.all(hash_orig == np.fliplr(hash_horz)):
        print("CORRECT RESULT")
    else:
        print(name + "(im_orig).hash:\n" +
              hash_to_text(hash_orig))
        print("np.fliplr(" + name + "(im_horz).hash):\n" +
              hash_to_text(np.fliplr(hash_horz)))

Small images on whash

If whash(image, hash_size=8, image_scale=None, mode='haar', remove_max_haar_ll=True) is called on a very small image, it gives an "AssertionError: hash_size in a wrong range".
This is hard to track down, since it occurs even with default parameter values, and the docstring states:

@hash_size must be a power of 2 and less than @image_scale.
@image_scale must be power of 2 and less than image size. By default is equal to max power of 2 for an input image.

I suggest a different AssertionError message for small images when image_scale is None.

scipy error: dtype uint8 not supported

With the latest code, using phash results in this error:

$ ./find_similar_images.py phash .
Traceback (most recent call last):
  File "./find_similar_images.py", line 61, in <module>
    find_similar_images(userpath=userpath, hashfunc=hashfunc)
  File "./find_similar_images.py", line 21, in find_similar_images
    hash = hashfunc(Image.open(img))
  File "/code/imagehash/imagehash/__init__.py", line 157, in phash
    dct = scipy.fftpack.dct(scipy.fftpack.dct(pixels, axis=0), axis=1)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/fftpack/realtransforms.py", line 124, in dct
    return _dct(x, type, n, axis, normalize=norm, overwrite_x=overwrite_x)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/scipy/fftpack/realtransforms.py", line 223, in _dct
    raise ValueError("dtype %s not supported" % tmp.dtype)
ValueError: dtype uint8 not supported
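Casting the pixel array to float before calling scipy's DCT is one plausible workaround (a guess based on the error message, not a confirmed fix):

```python
import numpy as np
from scipy.fftpack import dct

pixels = np.arange(64, dtype=np.uint8).reshape(8, 8)  # uint8, as PIL produces
# casting to float64 avoids "dtype uint8 not supported" on affected scipy versions
out = dct(dct(pixels.astype(np.float64), axis=0), axis=1)
assert out.shape == (8, 8)
```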

Getting completely different hashes of almost identical images

Hi! I'm trying to compare an unmodified image, as taken by the camera, with a photoshopped version (tweaked histogram and slightly changed white balance), and I get distances above 25 when using the code from the examples:
hash = imagehash.phash(Image.open(path))

But if i modify code like this:

img = cv2.imread(path)
img = Image.fromarray(img)
hash = imagehash.phash(img)

I get a distance of 0.
It looks like this might be caused by different color spaces, or something like that.
I hope this info helps somebody.
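One likely explanation (my assumption, not confirmed in this report) is that OpenCV's imread returns channels in BGR order while PIL expects RGB, so the two paths grayscale-convert different data. A small numpy sketch of the channel reversal:

```python
import numpy as np

# OpenCV stores the blue channel first (BGR); PIL's Image.fromarray assumes RGB
bgr = np.zeros((2, 2, 3), dtype=np.uint8)
bgr[..., 0] = 255                 # pure blue in OpenCV's BGR layout
rgb = bgr[:, :, ::-1]             # same effect as cv2.cvtColor(..., COLOR_BGR2RGB)
assert rgb[0, 0, 2] == 255        # blue now sits in the last (RGB) position
```

So hashing consistently from one loader, or converting BGR to RGB before Image.fromarray, should make the two code paths agree.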

Speed optimizations for wavelet hash

This particular line: https://github.com/JohannesBuchner/imagehash/blob/master/imagehash/__init__.py#L249

can be replaced with np.asarray(image).

Timing info for the original line:
118 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing for np.asarray:
89.6 µs ± 97.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each).

This is a significant speedup for the most expensive operation in the method. Even the wavelet decompositions and reconstructions for removing the lowest frequency take 18.4 ms ± 285 µs and 18.4 ms ± 24.6 µs respectively.

Memory leak in wavelet hash

Looks like the wavelet hash implementation has a memory leak. I'm processing 600k images and memory usage keeps going up. With the other hash implementations, memory usage does not accumulate.

Using sizes other than 8

I love this project and am currently using it in one of mine. The thing is, you allow me to set hash_size to something other than 8, but hex_to_hash can't be used on the hex generated with a size greater than 8. Am I doing something wrong, or was this an oversight?
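For illustration, here is a manual decode of a larger hash (hash_size=16, i.e. 256 bits / 64 hex chars) from its hex string. This is a hypothetical workaround sketch for versions whose hex_to_hash assumes an 8x8 hash, not the library's implementation:

```python
import numpy as np

hex_str = 'ab' * 32                       # hypothetical 64-char hex for a 16x16 hash
bits = bin(int(hex_str, 16))[2:].zfill(16 * 16)   # zero-pad to the full 256 bits
arr = np.array([c == '1' for c in bits]).reshape(16, 16)
assert arr.shape == (16, 16)
```

The same idea works for any hash_size, as long as the stored string records (or implies) the shape to reshape back into.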

Cannot install on Centos w/ python 3.6

bash-4.2# /usr/bin/pip3.6 install imagehash --no-cache-dir
Collecting imagehash
  Downloading ImageHash-3.5.tar.gz (294kB)
    100% |################################| 296kB 298kB/s
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-7g_5kb6z/imagehash/setup.py", line 10, in <module>
        long_description = f.read()
      File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
        return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 261: ordinal not in range(128)

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-7g_5kb6z/imagehash/

imagehash from memory object saves corrupted file

When I run imagehash against an upload that exists in memory but has not yet been written to disk, the resulting file is corrupted after disk write. For example:

f = form.image.data
filename = secure_filename(f.filename)
imghash = imagehash.dhash(Image.open(f))
f.save(os.path.join(imgdir, filename))

The saved file is missing bytes and isn't a valid image.

If I run imagehash against an existing file on disk, it works as expected. I would like to check the hash against a db record before writing anything to disk however.

I have also tested simply opening the image using PIL's Image.open() to make sure that wasn't trashing the file, but it successfully writes the image to disk afterwards. It is only after running imagehash against the PIL-opened file that I lose bytes and write an incomplete file to disk.

Is this a bug or am I missing something here?

I'm using Pillow (4.2.1) and ImageHash (3.4). This is a Flask (0.12.2) site where the uploaded file is wrapped in the Werkzeug FileStorage class.
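A plausible cause (my assumption, not stated in the report) is that Image.open() plus hashing consumes the upload stream, leaving its position at the end, so the later f.save() writes a truncated file; calling f.seek(0) before saving would rewind it. A stdlib-only sketch of the effect:

```python
import io

buf = io.BytesIO(b'fake image bytes')  # stand-in for the in-memory upload
buf.read()                  # simulates Image.open() + hashing consuming the stream
assert buf.read() == b''    # a save from here would write a truncated file
buf.seek(0)                 # rewinding restores access to the full content
content = buf.read()
assert content == b'fake image bytes'
```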

AssertionError: ('ImageHashes must be of the same shape!', (64,), (8, 8))

I need to save the hex hash string to my database and read it later. But after I create an ImageHash instance from the hex string with the hex_to_hash() method, 'h2 - h1' raises an exception. I wonder if it is a bug in the hex_to_hash() method? I have no idea; I'm not good at NumPy.

from PIL import Image
import imagehash

i1 = Image.open('1.jpg')
i2 = Image.open('1.png')

h1 = imagehash.dhash(i1)
print h1
# d3b7cf4ebd183140
h2 = imagehash.dhash(i2)
print h2
# d3b7cf4ebd183140

print h2 - h1
#0

h3 = imagehash.hex_to_hash('d3b7cf4ebd183140')
print h3 - h2
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/usr/lib/python2.7/site-packages/imagehash/__init__.py", line 64, in __sub__
#     assert self.hash.shape == other.hash.shape, ('ImageHashes must be of the same shape!', self.hash.shape, other.hash.shape)
# AssertionError: ('ImageHashes must be of the same shape!', (64,), (8, 8))

print h3 - h1
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/usr/lib/python2.7/site-packages/imagehash/__init__.py", line 64, in __sub__
#     assert self.hash.shape == other.hash.shape, ('ImageHashes must be of the same shape!', self.hash.shape, other.hash.shape)
# AssertionError: ('ImageHashes must be of the same shape!', (64,), (8, 8))

Create a gitter.im community

@JohannesBuchner Would you consider fostering a community around https://gitter.im/ ?

It's easy to create and manage a community and also associate it with this repo. It might prove very useful for collaboration, rather than chatter on PRs or issues ...

I've already checked and there isn't an existing community called imagehash ... I'd recommend grabbing it whilst you can, and more importantly if you do, please invite me to join!

Add an __int__() function

I need to access the imagehash as an integer to store it in a database, and would like to write int(imageHash) without converting to a string and then back to an int.
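Until an __int__() method exists, one workaround (a sketch, not library API) is to round-trip through the hex string that str(hash) already produces:

```python
# hypothetical workaround: store the integer form of the hex string
hex_hash = 'ffd7918181c9ffff'                 # example hash value from the README
as_int = int(hex_hash, 16)                    # store this integer in the database
assert '{:016x}'.format(as_int) == hex_hash   # recover the hex on read-back
```

The recovered hex string can then be fed to imagehash.hex_to_hash() to rebuild the ImageHash object.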

Note about Greyscale?

Would it be possible to add a note that the hash is computed on the greyscale image rather than on all 3/4 channels of an RGB/RGBA image? I didn't realize that until later.
