israel-lugo / capidup
Quickly find duplicate files in directories
License: GNU General Public License v3.0
We should work in both Python 2 and Python 3. Most of the code should be rather version agnostic, except perhaps for file interaction (bytes vs. str) and things like dict.iterkeys.
Any portions of code that can't work with both Python versions can be moved out to version-specific modules, to be imported accordingly.
Right now, python setup.py bdist or python setup.py bdist_wheel don't include the docs/ directory. We should include it for our users.
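One possible approach (a sketch, not the project's actual setup.py; the share/doc install path is an assumption) is to list the documentation as data_files so bdist and bdist_wheel pick it up. Source distributions would additionally need the directory listed in MANIFEST.in.

import glob
from setuptools import setup

setup(
    name="capidup",
    # ... existing metadata and package arguments stay as they are ...
    data_files=[("share/doc/capidup", glob.glob("docs/*"))],
)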
It might be advantageous to use multiprocessing, to calculate multiple hashes at the same time. The bottleneck will probably be storage performance, but we might be comparing directories on different drives, so it may pay off.
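A rough sketch of what that could look like (assuming Python 3 and MD5 for the full comparison; this is not the project's actual code):

import hashlib
import multiprocessing

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return (path, hex MD5 digest) for a single file, reading in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return path, md5.hexdigest()

def hash_files(paths, num_workers=4):
    """Hash several files concurrently; returns {path: digest}."""
    with multiprocessing.Pool(num_workers) as pool:
        return dict(pool.map(md5_of_file, paths))

Whether this actually wins depends on whether the files share one spindle or sit on different drives.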
Make sure we don't break: handle weird differences in locales, unencodable filenames when printing, etc.
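For the printing part, one possible approach (a sketch assuming Python 3 str filenames, not existing capidup code) is to degrade gracefully instead of raising UnicodeEncodeError:

import sys

def safe_print(filename):
    """Print a filename, escaping anything the terminal encoding can't represent."""
    encoding = sys.stdout.encoding or "utf-8"
    printable = filename.encode(encoding, "backslashreplace").decode(encoding)
    print(printable)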
Maybe we could create a paranoid setting, where files are hashed with two algorithms. Is this even worthwhile? It's not as if we're expecting forced collisions in SHA-512 anytime soon.
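If we do it, the two digests can share a single pass over the file; a minimal sketch (the MD5 + SHA-512 pairing is just an example):

import hashlib

def paranoid_digest(path, chunk_size=1024 * 1024):
    """Return (md5_hex, sha512_hex) computed in one pass over the file."""
    md5 = hashlib.md5()
    sha512 = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha512.update(chunk)
    return md5.hexdigest(), sha512.hexdigest()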
README.md looks nice on GitHub, but it's not practical for users who download our code or install the package.
A README.txt can be generated using pandoc:
pandoc -t plain -f markdown README.md > README.txt
But this doesn't support all formatting. Namely, it doesn't support HTML <sup> tags, which we use to represent 2⁶⁴, leaving us with "1 in 264" instead of "1 in 2⁶⁴". And that's just not the same thing.
We could change to AsciiDoc. That supports superscripting natively.
Now that find_duplicates_in_dirs has the follow_dirlinks parameter (see #16), we need a way to detect symlink loops. If there is a symlink pointing to ., or to a parent directory, we will go into a loop.
Fortunately, os.walk() seems to stop after several levels of recursion. But still, it's probably undefined behavior. See what commands like find or rsync do.
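One common way to detect the loop (a sketch, not what capidup does today, under the assumption that a directory's identity can be keyed by (st_dev, st_ino)) is to remember every directory already entered and prune repeats:

import os

def walk_without_loops(top):
    """Like os.walk(top, followlinks=True), but never enters a directory twice."""
    visited = set()
    for dirpath, dirnames, filenames in os.walk(top, followlinks=True):
        st = os.stat(dirpath)
        key = (st.st_dev, st.st_ino)
        if key in visited:
            dirnames[:] = []    # already seen this directory: don't descend again
            continue
        visited.add(key)
        yield dirpath, dirnames, filenames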
Some files may be very similar at the start, but different at the end, e.g. large ISOs of two similar operating system versions. Also, some media formats add their metadata headers at the end, instead of the start (cf. ID3v1 tags for MP3).
The partial test could read X/2 KB from the start and X/2 from the end. This would also help against collisions, by mixing it up.
We should performance-test this. It will cause more seeking on mechanical disks. Will it make a noticeable difference?
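A sketch of what the head-plus-tail read could look like (the 16 KiB budget is arbitrary, and files are assumed to be grouped by size already, as the size index does):

import os

def partial_signature(path, budget=16 * 1024):
    """Return (head, tail) bytes: budget/2 from the start and budget/2 from the end."""
    half = budget // 2
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(half)
        if size > half:
            f.seek(max(half, size - half))   # don't re-read bytes already in head
            tail = f.read(half)
        else:
            tail = b""
    return head, tail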
Installing with pip install capidup doesn't include the documentation files. We should make sure to include the documentation, so our users can know how to use the package.
Files included:
$ pip show -f capidup
---
Metadata-Version: 2.0
Name: capidup
Version: 1.0.1
Summary: Quickly find duplicate files in directories
Home-page: https://github.com/israel-lugo/capidup
Author: Israel G. Lugo
Author-email: [email protected]
License: GPLv3+
Location: /tmp/asdf/lib/python2.7/site-packages
Requires:
Files:
capidup-1.0.1.dist-info/DESCRIPTION.rst
capidup-1.0.1.dist-info/METADATA
capidup-1.0.1.dist-info/RECORD
capidup-1.0.1.dist-info/WHEEL
capidup-1.0.1.dist-info/metadata.json
capidup-1.0.1.dist-info/top_level.txt
capidup/__init__.py
capidup/__init__.pyc
capidup/finddups.py
capidup/finddups.pyc
capidup/py3compat.py
capidup/py3compat.pyc
capidup/version.py
capidup/version.pyc
index_files_by_size isn't catching some errors. It is possible for os.lstat() to fail inside the loop that iterates filenames, and that exception will not be caught.
This is a corner case from a race condition: the file could exist when os.walk() lists the directory, but already be removed when we execute os.lstat(). To trigger this, we can scan /proc:
>>> capidup.finddups.find_duplicates_in_dirs(['/proc'])
...
error listing '/proc/14690/fd': Permission denied
error listing '/proc/14690/fdinfo': Permission denied
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/tmp/virtenv/local/lib/python2.7/site-packages/capidup/finddups.py", line 270, in find_duplicates_in_dirs
sub_errors = index_files_by_size(directory, files_by_size)
File "/tmp/virtenv/local/lib/python2.7/site-packages/capidup/finddups.py", line 121, in index_files_by_size
file_info = os.lstat(full_path)
OSError: [Errno 2] No such file or directory: '/proc/14823/task/14823/fd/3'
>>>
The PID in question is the Python interpreter itself.
Solution: wrap os.lstat() with a try: ... except and call _print_error() for consistency.
Must create a test case (how?).
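A sketch of the fix (the message wording and the error counting are assumptions; _print_error stands in for the existing helper named above):

import os
import sys

def _print_error(msg):
    # stand-in for capidup's existing error printer
    sys.stderr.write(msg + "\n")

def index_one_file(full_path, files_by_size):
    """lstat one file and index it by size; return 1 on error, 0 on success."""
    try:
        file_info = os.lstat(full_path)
    except OSError as e:
        _print_error("error getting size of '%s': %s" % (full_path, e.strerror))
        return 1
    files_by_size.setdefault(file_info.st_size, []).append(full_path)
    return 0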
It would be nice to have the possibility of following symbolic links to (sub)directories. Currently, we don't follow them.
find_duplicates_in_dirs could have a new follow_links parameter, defaulting to False for compatibility.
Talk about capidup.finddups.
As per subject.
Hi,
Just a suggestion:
It would be useful to have an option to exclude directories. This would make it even more flexible.
Thanks a lot for this nice tool!
test_dups_full.py:test_find_dups_in_dirs() (introduced initially in b415c70) can fail with the following error:
tmpdir = local('/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02'), file_groups = (('', '', ''), ('a', 'a', 'a', 'a', 'a', 'a', ...)), num_index_errors = 0, num_read_errors = 0
flat = False
@pytest.mark.parametrize("file_groups", file_groups_data)
@pytest.mark.parametrize("num_index_errors", index_errors_data)
@pytest.mark.parametrize("num_read_errors", read_errors_data)
@pytest.mark.parametrize("flat", [True, False])
def test_find_dups_in_dirs(tmpdir, file_groups, num_index_errors,
num_read_errors, flat):
"""Test find_duplicates_in_dirs with multiple files.
[...]
# Check that duplicate groups match. The files may have been traversed
# in a different order from how we created them; sort both lists.
dup_groups.sort()
expected_dup_groups.sort()
for i in range(len(dup_groups)):
> assert len(dup_groups[i]) == len(expected_dup_groups[i])
E AssertionError: assert 200 == 3
E + where 200 = len(['/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file140/g1', '/tmp/pytest-of-capi/pytest-28/test_find_..._find_dups_in_dirs_False_02/file40/g1', '/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file21/g1', ...])
E + and 3 = len(['/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file0/g0', '/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file1/g0', '/tmp/pytest-of-capi/pytest-28/test_find_dups_in_dirs_False_02/file2/g0'])
capidup/tests/test_dups_full.py:208: AssertionError
This is seen on a Debian 9.1 system, kernel 4.9.0-3-amd64, Python versions 2.7.13 and 3.5.3, pytest-3.2.2. It is not, however, seen on the Travis build server or on my own home PC running Gentoo Linux.
Create a command-line option to let the user select between MD5 and other (more secure) hashing algorithms, e.g. SHA-1, SHA-256 or SHA-512.
This way, the user can select a more collision-resistant hash for when security is a greater concern (e.g. comparing software installers which might have been tampered with, and so on).
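A sketch of what the option could look like (the flag name and choices are illustrative, not existing capidup options; hashlib.new() accepts any algorithm name the interpreter supports):

import argparse
import hashlib

parser = argparse.ArgumentParser(prog="capidup")
parser.add_argument("--hash", default="md5",
                    choices=["md5", "sha1", "sha256", "sha512"],
                    help="hash algorithm used to compare file contents")
args = parser.parse_args()

hasher = hashlib.new(args.hash)   # fresh digest object of the chosen type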
Currently, capidup.finddups.find_duplicates_in_dirs() returns a boolean to indicate whether there were any errors.
Do you have plans to have this function return a list of those errors?
Thanks
We need to implement tests for the directory loop detection, which was implemented in issue #17.
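A possible shape for such a test (a sketch only; the follow_dirlinks keyword comes from issue #16, and the point of the test is simply that the call terminates instead of recursing forever):

import os
from capidup import finddups

def test_symlink_loop_terminates(tmpdir):
    top = tmpdir.mkdir("top")
    top.join("data").write("same content")
    os.symlink(str(top), str(top.join("loop")))   # top/loop -> top, a directory loop

    # Must return even when following directory symlinks.
    finddups.find_duplicates_in_dirs([str(top)], follow_dirlinks=True)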