jesjimher / imgdupes
Checks for duplicated images in a directory tree, ignoring metadata
License: GNU General Public License v3.0
Have you thought about doing whole-file MD5 for other image types such as PNG and NEF?
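As a rough illustration of what whole-file hashing could look like, here is a minimal sketch. The function name whole_file_md5 and the chunk size are assumptions made for this example, not part of imgdupes. Unlike the JPEG path, a whole-file hash does not ignore metadata, so two files with identical pixels but different tags would not match.

```python
import hashlib


def whole_file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of the entire file at `path`.

    Hypothetical helper: the file is read in chunks so that large
    raw files (e.g. NEF) do not have to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files are then candidates for being duplicates when their digests are equal.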
I have forked your code and made some adjustments. Some of my files crash imgdupes because they contain "truncated jpg block" data. imgdupes also seems to list the same file multiple times for HDR files re-developed by Shotwell; choosing one to keep then fails with an error saying that it cannot delete the extras, e.g.
If you are still interested in this project, I'm planning to send you some PRs for:
Hello,
I get an error after a certain number of files have been analysed:
Traceback (most recent call last):
  File "/home/gilles/bin/imgdupes.py", line 259, in <module>
    'hash':hashcalc(ruta,pool,args.method),
  File "/home/gilles/bin/imgdupes.py", line 63, in hashcalc
    results=pool.map(phash,lista)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
struct.error: integer out of range for 'H' format code
If I relaunch the command, it continues from where it stopped until the next error.
$ python --version
Python 2.7.6
and
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.2 LTS
Release: 14.04
Codename: trusty
Thank you!
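For context on the struct.error reported above: the 'H' format code of Python's struct module packs an unsigned 16-bit integer, so any value outside 0..65535 is rejected. The traceback does not show which call inside the worker packs the value, so the sketch below only reproduces the class of error; pack_unsigned_short is a hypothetical helper, not code from imgdupes.

```python
import struct

USHORT_MAX = 0xFFFF  # largest value the 'H' format code can hold


def pack_unsigned_short(value):
    """Pack `value` as a little-endian unsigned 16-bit integer.

    Hypothetical helper: it checks the range first so that the caller
    gets a ValueError naming the offending value instead of a bare
    struct.error.
    """
    if not 0 <= value <= USHORT_MAX:
        raise ValueError("%d does not fit in an unsigned 16-bit field" % value)
    return struct.pack("<H", value)


try:
    struct.pack("<H", USHORT_MAX + 1)  # one past the limit
except struct.error as exc:
    print("struct.error:", exc)
```

The exact wording of the struct.error message differs between Python versions.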
tostring() has been deprecated, and it hangs on execution...
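The report does not say which object's tostring() is meant. Python's array.array, Pillow images and NumPy arrays all replaced tostring() with tobytes(), so a compatibility shim along the following lines may help; to_bytes is a hypothetical name, and this sketch does not explain the hang itself.

```python
from array import array


def to_bytes(obj):
    """Serialise `obj` to bytes, preferring the modern tobytes() method.

    Hypothetical helper: falls back to the legacy tostring() only when
    tobytes() is not available on the object.
    """
    if hasattr(obj, "tobytes"):
        return obj.tobytes()
    return obj.tostring()


# Works with any object exposing tobytes(), e.g. a stdlib array.
print(to_bytes(array("B", [1, 2, 3])))
```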
Hi, I tried running this script (Linux Mint).
I had to add
    import gi
    gi.require_version('GExiv2', '0.10')
before the line
    from gi.repository import GExiv2
in order to avoid an error.
On the next attempt, the error below appeared at the top of the output, and the script simply finished after crawling some folders:
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libturbojpeg.a(libturbojpeg_la-turbojpeg.o): relocation R_X86_64_32 against `.data' can not be used when making a shared object; recompile with -fPIC
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libturbojpeg.a: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
Hi,
Thanks for notifying me of your updates and posting them on PyPI.
For the Arch Linux AUR package, you should publish the releases on GitHub. The AUR package downloads its sources from GitHub, not from PyPI. I've already created a jpegdupes-git package, which uses your latest commit as its source, but the non-git version of the AUR package would require a release on GitHub.
For simplicity's sake, it would also be nice to rename the GitHub repo to jpegdupes. People will be wondering what line 4 in my PKGBUILD is.
Thanks!
Pieter
I just did a
$ jpegdupes -d /home/turgut/Pictures/
and got:
(...)
Exploring ./2018/07
Exploring ./2018/07/06
Exploring ./2018/07/07
Exploring ./2018/07/08
Exploring ./2018/07/24
Exploring ./2018/07/27
Exploring ./2018/08
Exploring ./2018/08/19
Exploring ./2018/08/21
Exploring ./2018/08/23
Exploring ./2018/08/11
Exploring ./2018/08/12
Exploring ./2018/08/17
Exploring ./2018/08/18
Exploring ./2018/08/20
Exploring ./2018/09
Exploring ./2018/09/07
Exploring ./2018/09/08
Exploring ./2018/09/09
Exploring ./2018/09/14
Exploring ./2018/09/16
Exploring ./2009
Exploring ./2009/09
Exploring ./2009/09/22
Traceback (most recent call last):
  File "/usr/local/bin/jpegdupes", line 11, in
    load_entry_point('jpegdupes==2.0.13', 'console_scripts', 'jpegdupes')()
  File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 337, in main
  File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 337, in
  File "/usr/local/lib/python3.6/site-packages/jpegdupes-2.0.13-py3.6.egg/jpegdupes/jpegdupes.py", line 145, in metadata_summary
AttributeError: 'Metadata' object has no attribute 'get_tags'
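One possible workaround for the missing get_tags, sketched under the assumption that the installed GExiv2 binding still exposes the per-family getters get_exif_tags, get_iptc_tags and get_xmp_tags. The helper name all_tags is hypothetical, and the function is duck-typed so that it can be tried without GExiv2 installed.

```python
def all_tags(metadata):
    """Return every tag name known to a GExiv2-style metadata object.

    Hypothetical helper: uses get_tags() when the binding provides it,
    otherwise concatenates the EXIF, IPTC and XMP tag lists.
    """
    get_tags = getattr(metadata, "get_tags", None)
    if callable(get_tags):
        return list(get_tags())
    return (
        list(metadata.get_exif_tags())
        + list(metadata.get_iptc_tags())
        + list(metadata.get_xmp_tags())
    )
```

Calling all_tags(metadata) in place of metadata.get_tags() would then work with either flavour of the binding, if the assumption above holds.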
imgdupes stops on this image:
Calculating hash of ./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg...
Traceback (most recent call last):
  File "/root/imgdupes/imgdupes.py", line 256, in <module>
    'hash':hashcalc(ruta,pool,args.method),
  File "/root/imgdupes/imgdupes.py", line 66, in hashcalc
    results=pool.map(phash,lista)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
NameError: global name 'path' is not defined
root@nas:/yyyyy/yyyyy/yyyyyy/yyyyyy# ls -l "./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg"
-rwxrwxr-x+ 1 root root 2960434 Jun 8 2007 ./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg
root@nas:/yyyyy/yyyyy/yyyyyy/yyyyyy# file "./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg"
./_xxxxxxxxxx/xxxxxx xxx xxxxxxx/xxxxxx xxxxxx/xxxxxx_xxxxxxxx_668.jpg: JPEG image data, Exif standard: [TIFF image data, big-endian, direntries=11, manufacturer=CASIO COMPUTER CO.,LTD , model=QV-R61 , orientation=upper-left, xresolution=178, yresolution=186, resolutionunit=2, software=1.00 , datetime=2007:02:01 22:11:18]
When a JPEG file has been rotated using jpegtran or any other lossless JPEG rotation utility, imgdupes can't find duplicates, because this kind of rotation alters the original image data. "Standard" rotation (switching the EXIF rotation tag) is fully detected.
One way to detect this kind of transformation would be to generate and store in .signatures up to 4 hashes (one per possible rotation) instead of just one. This would slow things down quite a bit, although perhaps not that much, since the image data would already be in memory and imgdupes is usually I/O bound. Some multiprocessing would help.
One thing to note is that jpegtran can also flip images losslessly, so in theory imgdupes should store all 4 possible rotations, 2 possible flips (horizontal and vertical), and all possible combinations of rotation and flip. Since that is obviously impractical, I think flipping can be ignored for the moment. After all, it is not as common an operation as rotation.
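The four-hash idea can be sketched as follows. This is only an illustration on a plain grid of 8-bit pixel values, which is an assumption: imgdupes hashes JPEG data, and all function names here (rotate90, grid_hash, rotation_hashes, same_up_to_rotation) are hypothetical.

```python
import hashlib


def rotate90(grid):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def grid_hash(grid):
    """MD5 over the grid's dimensions and its 8-bit pixel values."""
    digest = hashlib.md5()
    # Include the shape so a 2x3 and a 3x2 grid never collide trivially.
    digest.update(("%dx%d:" % (len(grid), len(grid[0]))).encode("ascii"))
    for row in grid:
        digest.update(bytes(bytearray(row)))
    return digest.hexdigest()


def rotation_hashes(grid):
    """Return the hashes of the grid at 0, 90, 180 and 270 degrees."""
    hashes = []
    for _ in range(4):
        hashes.append(grid_hash(grid))
        grid = rotate90(grid)
    return hashes


def same_up_to_rotation(a, b):
    """True when `a` equals `b` in one of its four rotations."""
    return grid_hash(a) in rotation_hashes(b)
```

Only one of the two images being compared needs all four hashes; the other can be looked up by its single hash. The same loop could be extended to flips, at the cost of more hashes per image.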
I got a result like this:
(all 4 entries are the SAME file)
... dupes that are ok ...
./2018-09-17/IMG_20180819_193752.jpg
./2018-09-17/IMG_20180819_193752.jpg
./2018-09-17/IMG_20180819_193752.jpg
./2018-09-17/IMG_20180819_193752.jpg
... some more dupes that are ok ...
I don't know why, but it worked perfectly (it detected all the dupes it should have detected), except that it considered this one file to be a dupe of itself... weird.
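The cause of the self-duplicate is not established in this report. If it turns out that the same file is being collected more than once (for instance through different spellings of one path, or through symlinks), a guard like the following could collapse the entries before grouping. unique_paths is a hypothetical helper, not part of imgdupes.

```python
import os


def unique_paths(paths):
    """Drop entries that resolve to a file already seen, keeping order.

    Hypothetical helper: paths are compared by os.path.realpath(), so
    "./a.jpg" and "a.jpg" (or a symlink and its target) count as one.
    """
    seen = set()
    unique = []
    for path in paths:
        key = os.path.realpath(path)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique
```

A group that shrinks to a single entry after this step is not a set of duplicates and can be skipped.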
root@myhostname:/tmp/imgdupes# apt-get install python3-dev libjpeg-dev gir1.2-gexiv2-0.10 jpeginfo
...
root@myhostname:/tmp/imgdupes# python3 setup.py build
...
root@myhostname:/tmp/imgdupes# python3 setup.py install
ModuleNotFoundError: No module named 'cffi'