Introduction

jdupes is a program for identifying and taking actions upon duplicate files such as deleting, hard linking, symlinking, and block-level deduplication (also known as "dedupe" or "reflink"). It is faster than most other duplicate scanners. It prioritizes data safety over performance while also giving expert users access to advanced (and sometimes dangerous) features.

Please consider financially supporting continued development of jdupes using the links on my home page (Ko-fi, PayPal, SubscribeStar, etc.):

https://www.jodybruchon.com/

Why use jdupes instead of the original fdupes or other duplicate finders?

The biggest reason is raw speed. In testing on various data sets, jdupes is over 7 times faster than fdupes-1.51 on average.

jdupes provides a native Windows port. Most duplicate scanners built on Linux and other UNIX-like systems do not compile for Windows out-of-the-box and even if they do, they don't support Unicode and other Windows-specific quirks and features.

jdupes is generally stable. All releases of jdupes are compared against known-working reference versions of fdupes or jdupes to be certain that output does not change. You get the benefits of an aggressive development process without putting your data at increased risk.

Code in jdupes is written with data loss avoidance as the highest priority. If a choice must be made between being aggressive or careful, the careful way is always chosen.

jdupes includes features that are not always found elsewhere. Examples of such features include block-level data deduplication and control over which file is kept when a match set is automatically deleted. jdupes is not afraid of dropping features of low value; a prime example is the -1 switch which outputs all matches in a set on one line, a feature which was found to be useless in real-world tests and therefore thrown out.

While jdupes maintains some degree of compatibility with fdupes from which it was originally derived, there is no guarantee that it will continue to maintain such compatibility in the future. However, compatibility will be retained between minor versions, i.e. jdupes-1.6 and jdupes-1.6.1 should not have any significant differences in results with identical command lines.

If the program eats your dog or sets fire to your lawn, the authors cannot be held responsible. If you notice a bug, please report it.

What jdupes is not: a tool for finding similar (but not identical) files

Please note that jdupes ONLY works on 100% exact matches. It does not have any sort of "similarity" matching, nor does it know anything about any specific file formats such as images or sounds. Something as simple as a change in embedded metadata such as the ID3 tags in an MP3 file or the EXIF information in a JPEG image will not change the sound or image presented to the user when opened, but technically it makes the file no longer identical to the original.

Plenty of excellent tools already exist to "fuzzy match" specific file types using knowledge of their file formats to help. There are no plans to add this type of matching to jdupes.

There are some match options available in jdupes that enable dangerous file matching based on partial or likely but not 100% certain matching. These are considered expert options for special situations and are clearly and loudly documented as being dangerous. The -Q and -T options are notable examples, and the extreme danger of the -T option is safeguarded by a requirement to specify it twice so it can't be used accidentally.

How can I do stuff with jdupes that isn't supported by fdupes?

The standard output format of jdupes is extremely simple. Match sets are presented with one file path per line, and match sets are separated by a blank line. This is easy to process with fairly simple shell scripts. You can find example shell scripts in the "example scripts" directory in the jdupes source code. The main example script, "example.sh", is easy to modify to take basic actions on each file in a match set. These scripts are used by piping the standard jdupes output to them:

jdupes dir1 dir2 dir3 | example.sh scriptparameters
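
If you only need to take a simple per-file action, a tiny handler script is easy to write. The following is a minimal sketch (not the bundled example.sh; the delete action is left commented out). It assumes the default output format described above and, like any line-based parser, it will misbehave on file names that contain newlines:

    #!/bin/sh
    # Minimal sketch: read jdupes output from stdin, track match sets
    # separated by blank lines, and act on every file after the first
    # one in each set.
    first=""
    while IFS= read -r path; do
      if [ -z "$path" ]; then
        first=""              # blank line: next path starts a new match set
        continue
      fi
      if [ -z "$first" ]; then
        first="$path"         # keep the first file in each set
      else
        echo "would remove duplicate: $path (keeping: $first)"
        # rm -- "$path"       # uncomment to actually delete
      fi
    done

Invoke it the same way as example.sh above, e.g. jdupes dir1 dir2 | sh handle-dupes.sh (the script name here is just a placeholder).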

Usage

Usage: jdupes [options] DIRECTORY...

Duplicate file sets will be printed by default unless a different action option is specified (delete, summarize, link, dedupe, etc.)

 -@ --loud              output annoying low-level debug info while running
 -0 --print-null        output nulls instead of CR/LF (like 'find -print0')
 -1 --one-file-system   do not match files on different filesystems/devices
 -A --no-hidden         exclude hidden files from consideration
 -B --dedupe            do a copy-on-write (reflink/clone) deduplication
 -C --chunk-size=#      override I/O chunk size in KiB (min 4, max 262144)
 -d --delete            prompt user for files to preserve and delete all
                        others; important: under particular circumstances,
                        data may be lost when using this option together
                        with -s or --symlinks, or when specifying a
                        particular directory more than once; refer to the
                        documentation for additional information
 -D --debug             output debug statistics after completion
 -e --error-on-dupe     exit on any duplicate found with status code 255
 -f --omit-first        omit the first file in each set of matches
 -h --help              display this help message
 -H --hard-links        treat any linked files as duplicate files. Normally
                        linked files are treated as non-duplicates for safety
 -i --reverse           reverse (invert) the match sort order
 -I --isolate           files in the same specified directory won't match
 -j --json              produce JSON (machine-readable) output
 -l --link-soft         make relative symlinks for duplicates w/o prompting
 -L --link-hard         hard link all duplicate files without prompting
                        Windows allows a maximum of 1023 hard links per file
 -m --summarize         summarize dupe information
 -M --print-summarize   will print matches and --summarize at the end
 -N --no-prompt         together with --delete, preserve the first file in
                        each set of duplicates and delete the rest without
                        prompting the user
 -o --order=BY          select sort order for output, linking and deleting:
                        by mtime (BY=time) or filename (BY=name, the default)
 -O --param-order       sort output files in order of command line parameter
                        sequence; parameter order takes precedence over the
                        -o sort, which only applies when several files share
                        the same parameter order
 -p --permissions       don't consider files with different owner/group or
                        permission bits as duplicates
 -P --print=type        print extra info (partial, early, fullhash)
 -q --quiet             hide progress indicator
 -Q --quick             skip byte-by-byte duplicate verification. WARNING:
                        this may delete non-duplicates! Read the manual first!
 -r --recurse           for every directory, process its subdirectories too
 -R --recurse:          for each directory given after this option follow
                        subdirectories encountered within (note the ':' at
                        the end of the option, manpage for more details)
 -s --symlinks          follow symlinks
 -S --size              show size of duplicate files
 -t --no-change-check   disable security check for file changes (aka TOCTTOU)
 -T --partial-only      match based on partial hashes only. WARNING:
                        EXTREMELY DANGEROUS paired with destructive actions!
                        -T must be specified twice to work. Read the manual!
 -u --print-unique      print only a list of unique (non-matched) files
 -U --no-trav-check     disable double-traversal safety check (BE VERY CAREFUL)
                        This fixes a Google Drive File Stream recursion issue
 -v --version           display jdupes version and license information
 -X --ext-filter=x:y    filter files based on specified criteria
                        Use '-X help' for detailed extfilter help
 -y --hash-db=file      use a hash database text file to speed up repeat runs
                        Passing '-y .' will expand to  '-y jdupes_hashdb.txt'
 -z --zero-match        consider zero-length files to be duplicates
 -Z --soft-abort        If the user aborts (i.e. CTRL-C) act on matches so far
                        You can send SIGUSR1 to the program to toggle this


Detailed help for jdupes -X/--extfilter options
General format: jdupes -X filter[:value][size_suffix]

noext:ext1[,ext2,...]           Exclude files with certain extension(s)

onlyext:ext1[,ext2,...]         Only include files with certain extension(s)

size[+-=]:size[suffix]          Only include files matching size criteria
                                Size specs: + larger, - smaller, = equal to
                                Specs can be mixed, e.g. size+=:100k will
                                only include files 100 KiB or more in size.

nostr:text_string               Exclude all paths containing the string
onlystr:text_string             Only allow paths containing the string
                                HINT: you can use these for directories:
                                -X nostr:/dir_x/  or  -X onlystr:/dir_x/
newer:datetime                  Only include files newer than specified date
older:datetime                  Only include files older than specified date
                                Date/time format: "YYYY-MM-DD HH:MM:SS"
                                Time is optional (remember to escape spaces!)

Some filters take no value or multiple values. Filters that can take
a numeric option generally support the size multipliers K/M/G/T/P/E
with or without an added iB or B. Multipliers are binary-style unless
the B suffix is used (e.g. 16kb), which selects decimal multipliers. For example,
16k or 16kib = 16384; 16kb = 16000. Multipliers are case-insensitive.

Filters have cumulative effects: jdupes -X size+:99 -X size-:101 will
cause only files of exactly 100 bytes in size to be included.

Extension matching is case-insensitive.
Path substring matching is case-sensitive.
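
A couple of hedged examples of combining these filters (the paths and extensions below are placeholders, not recommendations):

    # Recurse, but only consider JPEG/PNG files of 1 MiB or larger
    jdupes -r -X onlyext:jpg,jpeg,png -X size+=:1M ~/Pictures

    # Recurse, skip any path containing /cache/, and only include
    # files newer than 2020-01-01
    jdupes -r -X nostr:/cache/ -X newer:2020-01-01 /srv/data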

The -U/--no-trav-check option disables the double-traversal protection. In the VAST MAJORITY of circumstances, this SHOULD NOT BE DONE, as it protects against several dangerous user errors, including specifying the same files or directories twice causing them to match themselves and potentially be lost or irreversibly damaged, or a symbolic link to a directory making an endless loop of recursion that will cause the program to hang indefinitely. This option was added because Google Drive File Stream presents directories in the virtual hard drive used by GDFS with identical device:inode pairs despite the directories actually being different. This triggers double-traversal prevention against every directory, effectively blocking all recursion. Disabling this check will reduce safety, but will allow duplicate scanning inside Google Drive File Stream drives. This also results in a very minor speed boost during recursion, but the boost is unlikely to be noticeable.

The -t/--no-change-check option disables file change checks during/after scanning. This opens a security vulnerability that is called a TOCTTOU (time of check to time of use) vulnerability. The program normally runs checks immediately before scanning or taking action upon a file to see if the file has changed in some way since it was last checked. With this option enabled, the program will not run any of these checks, making the algorithm slightly faster, but also increasing the risk that the program scans a file, the file is changed after the scan, and the program still acts like the file was in its previous state. This is particularly dangerous when considering actions such as linking and deleting. In the most extreme case, a file could be deleted during scanning but match other files prior to that deletion; if the file is the first in the list of duplicates and auto-delete is used, all of the remaining matched files will be deleted as well. This option was added due to user reports of some filesystems (particularly network filesystems) changing the reported file information inappropriately, rendering the entire program unusable on such filesystems.

The -n/--no-empty option was removed for safety. Matching zero-length files as duplicates now requires explicit use of the -z/--zero-match option instead.

Duplicate files are listed together in groups with each file displayed on a separate line. The groups are then separated from each other by blank lines.

The -s/--symlinks option will treat symlinked files as regular files, but direct symlinks will be treated as if they are hard linked files and the -H/--hard-links option will apply to them in the same manner.

When using -d or --delete, care should be taken to guard against accidental data loss. While no information will be immediately lost, using this option together with -s or --symlinks can lead to confusing information being presented to the user when prompted for files to preserve. Specifically, a user could accidentally preserve a symlink while deleting the file it points to. A similar problem arises when specifying a particular directory more than once: all files within that directory will be listed as their own duplicates, leading to data loss should a user preserve a file without its "duplicate" (the file itself!).

Using -1 or --one-file-system prevents matches that cross filesystems, but a more relaxed form of this option may be added that allows cross-matching for all filesystems that each parameter is present on.

-Z or --soft-abort used to be --hard-abort in jdupes prior to v1.5 and had the opposite behavior. Defaulting to taking action on abort is probably not what most users would expect. The decision to invert rather than reassign to a different option was made because this feature was still fairly new at the time of the change.

On non-Windows platforms that support SIGUSR1, you can toggle the state of the -Z option by sending a SIGUSR1 to the program. This is handy if you want to abort jdupes, didn't specify -Z, and changed your mind and don't want to lose all the work that was done so far. Just do 'killall -USR1 jdupes' and you will be able to abort with -Z. This works in reverse: if you want to prevent a -Z from happening, a SIGUSR1 will toggle it back off. That's a lot less useful because you can just stop and kill the program to get the same effect, but it's there if you want it for some reason. Sending the signal twice while the program is stopped will behave as if it was only sent once, as per normal POSIX signal behavior.

The -O or --param-order option allows the user greater control over what appears in the first position of a match set, specifically for keeping the -N option from deleting all but one file in a set in a seemingly random way. All directories specified on the command line will be used as the sorting order of result sets first, followed by the sorting algorithm set by the -o or --order option. This means that the order of all match pairs for a single directory specification will retain the old sorting behavior even if this option is specified.
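
For example, to prefer keeping the copies under a particular directory during automatic deletion, something like the following should work (the directory names are placeholders; test with plain printing before adding -N -d):

    jdupes -r -O -N -d master/ mirror/

Because master/ is given first and -O sorts match sets by parameter order, the file from master/ lands first in each set and is the one preserved by -N.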

The -Q or --quick option only reads each file once, hashes it, and performs comparisons based solely on the hashes. There is a small but significant risk of a hash collision which is the purpose of the failsafe byte-for-byte comparison that this option explicitly bypasses. Do not use it on ANY data set for which any amount of data loss is unacceptable. You have been warned!

The -T or --partial-only option produces results based on a hash of the first block of file data in each file, ignoring everything else in the file. Partial hash checks have always been an important exclusion step in the jdupes algorithm, usually hashing the first 4096 bytes of data and allowing files that are different at the start to be rejected early. In certain scenarios it may be a useful heuristic for a user to see that a set of files has the same size and the same starting data, even if the remaining data does not match; one example of this would be comparing files with data blocks that are damaged or missing such as an incomplete file transfer or checking a data recovery against known-good copies to see what damaged data can be deleted in favor of restoring the known-good copy. This option is meant to be used with informational actions and can result in EXTREME DATA LOSS if used with options that delete files, create hard links, or perform other destructive actions on data based on the matching output. Because of the potential for massive data destruction, this option MUST BE SPECIFIED TWICE to take effect and will error out if it is only specified once.
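
A hedged example of the intended informational-only usage, comparing a recovered tree against known-good data (the paths are placeholders; do not combine this with -d, -l, -L, -B or any other destructive action):

    jdupes -r -S -T -T recovered/ known-good/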

The -I/--isolate option attempts to block matches that are contained in the same specified directory parameter on the command line. Due to the underlying nature of the jdupes algorithm, a lot of matches will be blocked by this option that probably should not be. This code could use improvement.

The -C/--chunk-size option overrides the size of the I/O "chunk" used for all file operations. Larger numbers will increase the amount of data read at once from each file and may improve performance when scanning lots of files that are larger than the default chunk size by reducing "thrashing" of the hard disk heads. Smaller numbers may increase algorithm speed depending on the characteristics of your CPU but will usually increase I/O and system call overhead as well. The number also directly affects memory usage: I/O chunk size is used for at least three allocations in the program, so using a chunk size of 16777216 (16 MiB) will require 48 MiB of RAM. The default is usually between 32768 and 65536 which results in the fastest raw speed of the algorithm and generally good all-around performance. Feel free to experiment with the number on your data set and report your experiences (preferably with benchmarks and info on your data set.)
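
As a hedged example, a 1 MiB chunk could be tried on a data set of mostly large files stored on rotating media; -C takes a value in KiB, so:

    jdupes -r -C 1024 /mnt/backup

Per the note above, memory use scales with the chunk size (roughly three buffers of this size), so keep that in mind on low-memory systems.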

Using -P/--print will cause the program to print extra information that may be useful but will pollute the output in a way that makes scripted handling difficult. Its current purpose is to reveal more information about the file matching process by printing match pairs that pass certain steps of the process prior to full file comparison. This can be useful if you have two files that are passing early checks but failing after full checks.

The -y/--hash-db feature creates and maintains a text file with a list of file paths, hashes, and other metadata that enables jdupes to "remember" file data across runs. Specifying a period '.' as the database file name will use a name of "jdupes_hashdb.txt" instead; this alias makes it easy to use the hash database feature without typing a descriptive name each time. THIS FEATURE IS CURRENTLY UNDER DEVELOPMENT AND HAS MANY QUIRKS. USE IT AT YOUR OWN RISK. In particular, one of the biggest problems with this feature is that it stores every path exactly as specified on the command line; if any paths are passed into jdupes on a subsequent run with a different prefix then they will not be recognized and they will be treated as totally different files. For example, running jdupes -y . foo/ is not the same as jdupes -y . ./foo nor the same as (from a sibling directory) jdupes -y ../foo. You must run jdupes from the same working directory and with the same path specifications to take advantage of the hash database feature. When used correctly, a fully populated hash database can reduce subsequent runs with hundreds of thousands of files that normally take a very long time to run down to the directory scanning time plus a couple of seconds. If the directory data is already in the OS disk cache, this can make subsequent runs with over 100K files finish in under one second.
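
A hedged example of consistent usage (the directory names are placeholders); note that both runs use the same working directory and the same path spelling:

    cd /srv
    jdupes -y . -r photos/    # slow first run; writes jdupes_hashdb.txt in /srv
    jdupes -y . -r photos/    # fast repeat run reusing the stored hashes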

Hard and soft (symbolic) linking status symbols and behavior

A set of arrows are used in file linking to show what action was taken on each link candidate. These arrows are as follows:

----> File was hard linked to the first file in the duplicate chain

-@@-> File was symlinked to the first file in the chain

-##-> File was cloned from the first file in the chain

-==-> Already a hard link to the first file in the chain

-//-> File linking failed due to an error during the linking process

If your data set has linked files and you do not use -H to always consider them as duplicates, you may still see linked files appear together in match sets. This is caused by a separate file that matches with linked files independently and is the correct behavior. See notes below on the "triangle problem" in jdupes for technical details.

Microsoft Windows platform-specific notes

Windows has a hard limit of 1024 hard links per file. There is no way to change this. The documentation for CreateHardLink() states: "The maximum number of hard links that can be created with this function is 1023 per file. If more than 1023 links are created for a file, an error results." (The number is actually 1024, but they're ignoring the first file.)

The current jdupes algorithm's "triangle problem"

Pairs of files are excluded individually based on how the two files compare. For example, if --hard-links is not specified then two files which are hard linked will not match one another for duplicate scanning purposes. The problem with only examining files in pairs is that certain circumstances will lead to the exclusion being overridden.

Let's say we have three files with identical contents:

a/file1
a/file2
a/file3

and a/file1 is linked to a/file3. Here's how jdupes a/ sees them:


    Are 'a/file1' and 'a/file2' matches? Yes
    [point a/file1->duplicates to a/file2]

    Are 'a/file1' and 'a/file3' matches? No (hard linked already, `-H` off)

    Are 'a/file2' and 'a/file3' matches? Yes
    [point a/file2->duplicates to a/file3]

Now you have the following duplicate list:

a/file1->duplicates ==> a/file2->duplicates ==> a/file3

The solution is to split match sets into multiple sets, but doing this will also remove the guarantee that files will only ever appear in one match set and could result in data loss if handled improperly. In the future, options for "greedy" and "sparse" may be introduced to switch between allowing triangle matches to be in the same set vs. splitting sets after matching finishes without the "only ever appears once" guarantee.

Does jdupes meet the "Good Practice when Deleting Duplicates" by rmlint?

Yes. If you've not read this list of cautions, it is available at http://rmlint.readthedocs.io/en/latest/cautions.html

Here's a breakdown of how jdupes addresses each of the items listed.

"Backup your data"/"Measure twice, cut once"

These guidelines are for the user of duplicate scanning software, not the software itself. Back up your files regularly. Use jdupes to print a list of what is found as duplicated and check that list very carefully before automatically deleting the files.

"Beware of unusual filename characters"

The only character that poses a concern in jdupes is a newline \n and that is only a problem because the duplicate set printer uses them to separate file names. Actions taken by jdupes are not parsed like a command line, so spaces and other weird characters in names aren't a problem. Escaping the names properly if acting on the printed output is a problem for the user's shell script or other external program.

"Consider safe removal options"

This is also an exercise for the user.

"Traversal Robustness"

jdupes tracks each directory traversed by dev:inode pair to avoid adding the contents of the same directory twice. This prevents the user from being able to register all of their files twice by duplicating an entry on the command line. Symlinked directories are only followed if they weren't already followed earlier. Files are renamed to a temporary name before any linking is done and if the link operation fails they are renamed back to the original name.

"Collision Robustness"

jdupes uses xxHash for file data hashing. This hash is extremely fast with a low collision rate, but it still encounters collisions as any hash function will ("secure" or otherwise) due to the pigeonhole principle. This is why jdupes performs a full-file verification before declaring a match. It's slower than matching by hash only, but the pigeonhole principle puts all data sets larger than the hash at risk of collision, meaning a false duplicate detection and data loss. The slower completion time is not as important as data integrity. Checking for a match based on hashes alone is irresponsible, and using secure hashes like MD5 or the SHA families is orders of magnitude slower than xxHash while still suffering from the same pigeonhole risk. An example of this problem: if you have 365 days in a year and 366 people, at least two of them are guaranteed to share a birthday; likewise, even though SHA-512 is a 512-bit (64-byte) hash, collisions are guaranteed to exist once the data streams being hashed are 65 bytes (520 bits) or larger.

"Unusual Characters Robustness"

jdupes does not protect the user from putting ASCII control characters in their file names; they will mangle the output if printed, but they can still be operated upon by the actions (delete, link, etc.) in jdupes.

"Seek Thrash Robustness"

jdupes uses an I/O chunk size that is optimized for reading as much as possible from disk at once to take advantage of high sequential read speeds in traditional rotating media drives while balancing against the significantly higher rate of CPU cache misses triggered by an excessively large I/O buffer size. Enlarging the I/O buffer further may allow for lots of large files to be read with less head seeking, but the CPU cache misses slow the algorithm down and memory usage increases to hold these large buffers. jdupes is benchmarked periodically to make sure that the chosen I/O chunk size is the best compromise for a wide variety of data sets.

"Memory Usage Robustness"

This is a very subjective concern considering that even a cell phone in someone's pocket has at least 1 GB of RAM; however, it still applies in the embedded device world, where 32 MB of RAM might be all that you can have. Even when processing a data set with over a million files, jdupes memory usage (tested on Linux x86-64 with -O3 optimization) doesn't exceed 2 GB. A low-memory mode can be chosen at compile time to reduce overall memory usage with a small performance penalty.

How does a duplicate scanner algorithm work?

The most naive way to look for files that are the same is to compare every file to every other file using a tool like the cmp command on Linux/macOS/BSD or the fc command on Windows/DOS. This works, but it is extremely slow and wastes a lot of time. The number of comparisons grows quadratically with the number of files (the formula is n(n-1)/2 for the discrete math nerds):

Files     Compares
    2            1
    3            3
    4            6
    5           10
   10           45
  100         4950
 1000       499500
 5000     12497500
10000     49995000
14142     99991011

Let's say that every file is 1,000 bytes in size and you have 10,000 files for a total size of 10,000,000 bytes (about 9.53 MiB). Using this naive comparison approach means the actual amount of data to compare is around 47,679 MiB. You should be able to see how extreme this can get--especially with larger files.
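
If you want to reproduce the numbers in the table above, the n(n-1)/2 formula is easy to check from a shell (plain POSIX arithmetic, nothing jdupes-specific):

    for n in 2 3 4 5 10 100 1000 5000 10000 14142; do
      echo "$n files -> $(( n * (n - 1) / 2 )) comparisons"
    done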

A slightly smarter approach is to use file hashes as a substitute for the full file contents. A hash is a number computed from the data fed into a hash function, and the number is always the same when the same data is fed in. If the hashes of two files are different then the contents of those files are guaranteed to be different; if the hashes are the same then the data might be the same, though this is not guaranteed: because of the pigeonhole principle, the hash is much smaller than the data it represents, so there will always be many different inputs that produce the same hash value. Files with matching hash values must still be compared to be sure that they are 100% identical.

49,995,000 comparisons can be done much quicker when you're only comparing a single big number every time instead of thousands or millions of bytes. This makes a big difference in performance since the only files being compared are files that look likely to be identical.

Fast exclusion of non-duplicates is the main purpose of duplicate scanners.

jdupes uses a lot of fast exclusion techniques beyond this. A partial list of these in the order they're performed is as follows:

  1. Files that the user asks the program to exclude are skipped entirely
  2. Files with different sizes can't be identical, so they're not compared
  3. The first 4 KiB is hashed and compared which avoids reading full files
  4. Entire files are hashed and compared which avoids comparing data directly
  5. Finally, actual file data is compared to verify that they are duplicates

The vast majority of non-duplicate file pairs never make it past the partial (4 KiB) hashing step. This reduces the amount of data read from disk and time spent comparing things to the smallest amount possible.
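
As a rough illustration of the same size-then-hash idea, here is a sketch using common shell tools. This is NOT how jdupes works internally (jdupes is written in C, uses xxHash, hashes only the first 4 KiB before committing to a full hash, and finishes with a byte-for-byte check); the sketch assumes GNU find/coreutils and breaks on file names containing tabs or newlines:

    # 1) list size<TAB>path, 2) keep only paths whose size occurs more
    # than once, 3) hash those candidates, 4) print groups of identical
    # hashes separated by blank lines.
    find . -type f -printf '%s\t%p\n' \
      | awk -F'\t' '{ size[NR] = $1; path[NR] = $2; count[$1]++ }
                    END { for (i = 1; i <= NR; i++)
                            if (count[size[i]] > 1) print path[i] }' \
      | tr '\n' '\0' \
      | xargs -0 sha256sum \
      | sort \
      | uniq -w64 --all-repeated=separate

Even this crude pipeline would still need a final cmp pass before any "duplicates" it reports could be trusted, which is exactly the role of step 5 above.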

v1.20.0 specific: most long options have changed and -n has been removed

Long options now have consistent hyphenation to separate the words used in the option names. Run jdupes -h to see the correct usage. Legacy options will remain in place until the next major or minor release (v2.0 or v1.21.0) for compatibility purposes. Users should change any scripts using the old options to use the new ones...or better yet, stop using long options in your scripts in the first place, because it's unnecessarily verbose and wasteful to do so.

v1.15+ specific: Why is the addition of single files not working?

If a file was added through recursion and also added explicitly, that file would end up matching itself. This issue can be seen in v1.14.1 or older versions that support single file addition using a command like this in the jdupes source code directory:

/usr/src/jdupes$ jdupes -rH testdir/isolate/1/ testdir/isolate/1/1.txt testdir/isolate/1/1.txt testdir/isolate/1/1.txt testdir/isolate/1/2.txt

Even worse, using the special dot directory will make it happen without the -H option, which is how I discovered this bug:

/usr/src/jdupes/testdir/isolate/1$ jdupes . 1.txt ./1.txt ./2.txt 1.txt

This works for any path with a single dot directory anywhere in the path, so it has a good deal of potential for data loss in some use cases. As such, the best option was to shove out a new minor release with this feature turned off until some additional checking can be done, e.g. by making sure the canonical paths aren't identical between any two files.

A future release will fix this safely.

Website and contact information

General program information, help, and tech info: https://www.jdupes.com/

Development, source code, and releases: https://codeberg.org/jbruchon/jdupes

Have a bug report or questions? contact Jody Bruchon [email protected]

Legal information and software license

jdupes is Copyright (C) 2015-2023 by Jody Bruchon [email protected]

Derived from the original 'fdupes' 1.51 (C) 1999-2014 by Adrian Lopez

The MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

jdupes's People

Contributors

adrianlopezroche, alebcay, alexhunsley, anthonyfok, artoria2e5, asureus, atifaziz, calinou, calumapplepie, chenyueg, cyril-the-real, damenleeturks, eribertomota, frafra, gelma, glensc, janklabacka, jbruchon, joooostb, juergenhoetzel, jwilk, kattjevfel, maxyz, medovarsky, mmitch, moshiba, mvyskocil, rpendleton, stefanbruens, tjmnmk

jdupes's Issues

Work around 1,023 hard links limitation on Windows

I don't know why, but for lots of duplicates (30%), I get:

warning: unable to hard link 'C:\Program Files (x86)\Sage\CRMv66\CRMv66\Library/Sandrine CLEMENT/Comm99145/Telecopie Confirmation de cde.pdf' -> 'C:\Program Files (x86)\Sage\CRMv66\CRMv66\Library/Catherine AMRO/Comm146658/Telecopie Confirmation de cde.pdf': No error

Is there any way to know what happened?

By the way, could it be possible to add a carriage return in the printf on line 1690 of jdupes.c

Regards,

Switch to ignore files below certain size

We are trying to dedupe some of our backup data (about 1B files). It's not really feasible to run jdupes over all that data, much of which is already hard linked (rsnapshot). We know, though, that lots of user data is duplicated, especially in large image (as in TIFF) files.

Do you think creating a switch to skip files smaller than x would considerably speed up operations? Would you, should we implement it, be willing to merge that feature?

Thanks,
Daniel

Exclusion of selected directories from automatic deletion

I've always wished fdupes could use the --delete --no-prompt for removing files only in a specific folder (or the inverse: retain files in a specific folder).

example 1:
I find duplicates in folders ./X/, ./Y/, and ./Z/; I run fdupes and wish to delete all the dupes NOT in ./Y/. This is different from --omitfirst, because the files I wish to keep are not always listed first.

example 2:
I find duplicates in folders ./E/, ./F/, and ./G/; I run fdupes and wish to delete all the dupes ONLY in ./G/. Again, --omitfirst is obviously not the solution.

Output sort by hard link reference count

No error, but more of a question:

I plan to use jdupes to dedupe some rsnapshot backup trees by using hardlinks (rsnapshot uses hardlinks for unchanged files between backup generations, but does not catch moved or renamed files). I don't want to regularly scan all old backup generations, but rather only the current backup against the preceding one.

It is important that when a duplicate is found the file in the current backup is replaced by a hardlink to the file in the preceding backup (and not the other way round) because the file from the preceding backup might already be hardlinked multiple times into even older backup generations.

eg. I want the two hardlink groups (gen0) (gen1 gen2 gen3) to become a single hardlink group (gen0 gen1 gen2 gen3) rather than two different groups (gen0 gen1) (gen2 gen3) when I instruct jdupes to only check gen0 and gen1.

Is there any kind of stable sort order regarding which file of a duplicate set is retained and which file is replaced by a hardlink?
Could this be achieved by using the --param-order parameter?

jdupes.c:1112:12: warning: return discards ‘const’ qualifier from pointer target type

Is this a safe warning to ignore?

cc -Wall -Wextra -Wwrite-strings -Wcast-align -Wstrict-aliasing -pedantic -Wstrict-overflow -std=gnu99 -O2 -g -D_FILE_OFFSET_BITS=64 -fstrict-aliasing -pipe -I. -DNO_FLOAT -DHAVE_BTRFS_IOCTL_H -c -o jdupes.o jdupes.c
jdupes.c: In function ‘dedupeerrstr’:
jdupes.c:1112:12: warning: return discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
return "BTRFS_SAME_DATA_DIFFERS (data modified in the meantime?)";
^

jdupes v1.2.1
Makefile: HAVE_BTRFS_IOCTL_H = -DHAVE_BTRFS_IOCTL_H
btrfs-progs v4.4.1+20160307

Linux: openSUSE 20160401
Kernel~4.5.0-2-default x86_64

Absolutely love jdupes! A++++

Compile failure with -DENABLE_BTRFS

Compiling from source with no options is fine, however compiling with -DENABLE_BTRFS fails with an undefined reference:

cc -Wall -Wextra -Wwrite-strings -Wcast-align -Wstrict-aliasing -Wstrict-overflow -Wstrict-prototypes -Wpointer-arith -Wundef -Wshadow -Wfloat-equal -Wstrict-overflow=5 -Waggregate-return -Wcast-qual -Wswitch-default -Wswitch-enum -Wconversion -Wunreachable-code -Wformat=2 -Winit-self -std=gnu99 -O2 -g -D_FILE_OFFSET_BITS=64 -fstrict-aliasing -pipe -DENABLE_BTRFS   -c -o act_printmatches.o act_printmatches.c
cc -Wall -Wextra -Wwrite-strings -Wcast-align -Wstrict-aliasing -Wstrict-overflow -Wstrict-prototypes -Wpointer-arith -Wundef -Wshadow -Wfloat-equal -Wstrict-overflow=5 -Waggregate-return -Wcast-qual -Wswitch-default -Wswitch-enum -Wconversion -Wunreachable-code -Wformat=2 -Winit-self -std=gnu99 -O2 -g -D_FILE_OFFSET_BITS=64 -fstrict-aliasing -pipe -DENABLE_BTRFS   -c -o act_summarize.o act_summarize.c
cc -Wall -Wextra -Wwrite-strings -Wcast-align -Wstrict-aliasing -Wstrict-overflow -Wstrict-prototypes -Wpointer-arith -Wundef -Wshadow -Wfloat-equal -Wstrict-overflow=5 -Waggregate-return -Wcast-qual -Wswitch-default -Wswitch-enum -Wconversion -Wunreachable-code -Wformat=2 -Winit-self -std=gnu99 -O2 -g -D_FILE_OFFSET_BITS=64 -fstrict-aliasing -pipe -DENABLE_BTRFS  -o jdupes jdupes.o jody_hash.o jody_paths.o jody_sort.o jody_win_unicode.o string_malloc.o jody_cacheinfo.o act_deletefiles.o act_linkfiles.o act_printmatches.o act_summarize.o 
jdupes.o: In function `main':
/usr/local/src/jdupes/jdupes.c:2019: undefined reference to `dedupefiles'
collect2: error: ld returned 1 exit status
Makefile:120: recipe for target 'jdupes' failed
make: *** [jdupes] Error 1

out of memory

I tried running jdupes on two operating system snapshots (as a test) and I got an "out of memory" error message:

Examining 557979 files, 62369 dirs (in 2 specified)

out of memory

Is jdupes storing a lot of information in memory?

Introducing Caching / read from previous reads...

It would be extremely useful (and fast) if jdupes could reuse data from previous runs. If jdupes created an index file (path, size, time, hash, inode, etc.) during one run, then instead of reading the whole path again it could read from that index file on later runs, possibly (ideally) verifying the information is still correct before taking any action!

Auto-delete only if file basenames identical (otherwise prompt)

Hello,

I would like to ask for a feature. I would like to automatically delete a file which has the same filename and the same checksum as another file. If the checksums are identical but the filenames differ, I want to decide which file should be deleted.

Thanks! :)

Segmentation fault 11 in macOS Sierra

Hello,

I've compiled your jdupes (latest stable release) and I'm getting segmentation faults during the scanning process:

Scanning: 72365 files, 4588 dirs (in 1 specified)Segmentation fault: 11

I have attached diagnostic reports ...

HTH

Regards,

Alex

cr2.txt
cr1.txt

jdupes is getting a warning: cannot move hard link target to a temporary name, not linking

We have upgraded jdupes from version 1.1.1 to 1.7, and now jdupes has become much slower and produces warnings.
this is the command I am using: jdupes.exe -r -L folderName
and receiving a warning for about 20 files, like this:
Examining 6560 files, 733 dirs (in 1 specified)
[SRC] D:\folderName\somepath\123\filename.ttf
warning: cannot move hard link target to a temporary name, not linking:
-//-> D:\folderName\somepath\456\filename.ttf
warning: cannot move hard link target to a temporary name, not linking:
-//-> D:\folderName\somepath\789\filename.ttf

the version I am using:
jdupes.exe -v
jdupes 1.7 (2016-12-28)

Also, compared to the previous version we had, it is very slow: instead of 1.5 minutes it takes about 14 minutes to complete.

please help me troubleshoot, thank you!

Segfault on invalid option

Installing https://github.com/jbruchon/jdupes/releases/download/v1.8/jdupes-1.8.c4f7b45_win64.zip
and running:
jdupes -a
or any invalid option, produced this segfault:

Problem signature:
  Problem Event Name:	APPCRASH
  Application Name:	jdupes.exe
  Application Version:	0.0.0.0
  Application Timestamp:	58a1fad6
  Fault Module Name:	jdupes.exe
  Fault Module Version:	0.0.0.0
  Fault Module Timestamp:	58a1fad6
  Exception Code:	c0000005
  Exception Offset:	000000000000741d
  OS Version:	6.1.7601.2.1.0.256.48
  Locale ID:	1033
  Additional Information 1:	12e5
  Additional Information 2:	12e578c7bfeb6bbbf39795d6991499c4
  Additional Information 3:	b250
  Additional Information 4:	b25042c0d330c4b80653ce6cf1462d87

Read our privacy statement online:
  http://go.microsoft.com/fwlink/?linkid=104288&clcid=0x0409

If the online privacy statement is not available, please read our privacy statement offline:
  C:\Windows\system32\en-US\erofflps.txt

Please, honor PROGRAM_NAME variable

Hi!

jdupes can be compiled with -DENABLE_BTRFS to add support for the btrfs filesystem. However, this option uses the file linux/btrfs.h, which is not available for the hurd-i386 and kfreebsd-* architectures in Debian.

My intent when packaging jdupes was to provide a program to convert duplicate files into relative symlinks, to help Debian packagers reduce package size (an issue pointed out by lintian). Considering all facts, I want to make jdupes available on all architectures, which makes -DENABLE_BTRFS unviable. The solution for this issue was to create an extra package with an extra executable, called jdupes-btrfs, specific to linux-any architectures, keeping jdupes (without btrfs) alive on all architectures. To do this, I tried to use the PROGRAM_NAME variable to produce jdupes-btrfs. However, the variable exists in the Makefile but is not used.

I think that the attached patch will fix this issue.

Thanks in advance!

Cheers,

Eriberto

10_use-program-name-variable.patch.txt

btrfs dedupe fails with >128 identical small files

Creating and deduplicating 129 identical small files results in
dedupe failed against file 'filename' (128 matches): Cannot allocate memory

Steps to reproduce

  • get a Debian sid VM
  • run the attached reproduce.sh as root
    The script creates 129 files containing foobar\n and deduplicates with
    jdupes -B -r -D -- directory
    The attached output.log was produced on a VM with 8GiB of RAM.

.zip (GitHub does not like .sh): reproduce_and_output.zip

Feature request: Add ability to open duplicate files.

When jdupes finds duplicates of images and other files, I think it would be good if there was an option to open all the duplicate files (using the "open" command on macOS, for example, so they open in whatever program is associated with that file type) so that you could check whether they are in fact duplicates.

Directory name cut off after first character in messages on Windows

In error messages, the directory is cut off after the first character, e.g.:

C:\Temp>jdupes NOPE

could not stat dir N

C:\Temp>jdupes NOPE 2>err.txt

C:\Temp>hexdump err.txt

0000  0d 0a 63 6f 75 6c 64 20  6e 6f 74 20 73 74 61 74  ..could  not stat
0010  20 64 69 72 20 4e 00 4f  00 50 00 45 00 0d 00 0a   dir N.O .P.E....
0020  00 0d 0a                                          ...

I'm using the binary from jdupes_1.8-win32.zip on Windows 7 (32-bit).

Progress gets slower over time

I ran fdupes on a collection of 300,000 files and noticed that it takes much more time to process the last 10% than it takes to process the first 10%, which seems counterintuitive and makes it difficult to guess how long it will take to finish the task.

My guess is that this is because each file needs to be matched against all other files processed thus far, which means the number of operations for each subsequent file grows linearly.

If that is the case, I propose to change the reported percentage value to use non-linear scale, for example, like this...

  1. Assuming n files, total number of operations = 0+1+2+3+4+...+n = n(n + 1)/2
  2. For i-th file, total number of operations done = 0+1+2+3+4+...+i = i(i + 1)/2
  3. So for the i-th file, rather than reporting the progress as i/n, report it as (i(i+1)/2) / (n(n+1)/2) = (i(i+1))/(n(n+1)), which could probably be approximated to i²/n².

The difference would look as follows (assuming n = 100):

[screenshot: progress indicator with the current linear scale (before)]

[screenshot: progress indicator with the proposed non-linear scale (after)]

jody_block_hash tail only works for little-endian

While jody_hash.h mentions endian conversion, I found nothing about endian conversion in jody_hash.c; the tail_mask arrays only work for little-endian. In the case of a file larger than PARTIAL_HASH_SIZE this wouldn't matter, because for two identical files it would take some bytes from the previous complete block of the file, and as the files are identical, so is the previous block and therefore the hash computed for the last block. But for smaller files, the content depends on the file previously read, and is therefore undefined. So it is possible that two small, identical files will generate different hashes and therefore not be considered equal.

There are different ways to fix this:

  • clear the rest of the buffer, then adjust the size to be hashed in the caller to the next multiple of sizeof(hash_t)
  • clear the remainder in jody_block_hash. This would violate the const attribute of the data pointer
  • set element to 0, then memcpy the bytes into element. Note that partial_salt seems also to imply little-endian format. It would probably be best to just omit the mask from partial_salt and use JODY_HASH_CONSTANT.
  • The easiest solution is to just ignore the tail data. The function of the hash is just to find potentially equal files. It is only important that identical files produce identical hashes. It is also desirable that different files produce different hashes. So the question is, is it that likely to have many files that differ only in the last at most 7 bytes?

It seems for some reason I can't upload files here ("Something went really wrong, and we can’t process that file."), so here is my suggested diff.

I also have some suggestions for the main program: there is no need to use stdio when reading the files, and it leads to additional overhead. What would be the preferred way to introduce them? Try to upload them here as a patch, or should I try to create a pull request?

--- a/jody_hash.c
+++ b/jody_hash.c
@@ -11,6 +11,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <string.h>
 #include "jody_hash.h"
 
 /* DO NOT modify the shift unless you know what you're doing.
@@ -33,35 +34,12 @@
 /* Set hash parameters based on requested hash width */
 #if JODY_HASH_WIDTH == 64
 #define JODY_HASH_CONSTANT 0x1f3d5b79U
-static const hash_t tail_mask[] = {
-       0x0000000000000000,
-       0x00000000000000ff,
-       0x000000000000ffff,
-       0x0000000000ffffff,
-       0x00000000ffffffff,
-       0x000000ffffffffff,
-       0x0000ffffffffffff,
-       0x00ffffffffffffff,
-       0xffffffffffffffff
-};
 #endif /* JODY_HASH_WIDTH == 64 */
 #if JODY_HASH_WIDTH == 32
 #define JODY_HASH_CONSTANT 0x1f3d5b79U
-static const hash_t tail_mask[] = {
-       0x00000000,
-       0x000000ff,
-       0x0000ffff,
-       0x00ffffff,
-       0xffffffff,
-};
 #endif /* JODY_HASH_WIDTH == 32 */
 #if JODY_HASH_WIDTH == 16
 #define JODY_HASH_CONSTANT 0x1f5bU
-static const hash_t tail_mask[] = {
-       0x0000,
-       0x00ff,
-       0xffff,
-};
 #endif /* JODY_HASH_WIDTH == 16 */
 
 @@ -77,7 +55,6 @@ extern hash_t jody_block_hash(const hash_t * restrict data,
 {
        hash_t hash = start_hash;
        hash_t element;
-       hash_t partial_salt;
        size_t len;
 
        /* Don't bother trying to hash a zero-length block */
@@ -96,19 +73,21 @@ extern hash_t jody_block_hash(const hash_t * restrict data,
                data++;
        }
 
+#if 01 /* Ignore tail, it's just a hash */
        /* Handle data tail (for blocks indivisible by sizeof(hash_t)) */
        len = count & (sizeof(hash_t) - 1);
        if (len) {
-               partial_salt = JODY_HASH_CONSTANT & tail_mask[len];
-               element = *data & tail_mask[len];
+               element = 0;
+               /* 0 < len < sizeof (hash_t) */
+               memcpy (&element, data, len);
                hash += element;
-               hash += partial_salt;
+               hash += JODY_HASH_CONSTANT;
                hash = (hash << JODY_HASH_SHIFT) | hash >> (sizeof(hash_t) * 8 - JODY_HASH_SHIFT);
                hash ^= element;
                hash = (hash << JODY_HASH_SHIFT) | hash >> (sizeof(hash_t) * 8 - JODY_HASH_SHIFT);
-               hash ^= partial_salt;
+               hash ^= JODY_HASH_CONSTANT;
                hash += element;
        }
-
+#endif
        return hash;
 }

Feature request: stay on one filesystem

When deduplicating with -B (BTRFS support), jdupes provides no mechanism to limit its filesystem traversal to a single filesystem. This produces errors during the deduplication phase, if (for instance) there's a file on a mounted CDROM that is identical to a file on the btrfs root filesystem.

It would be great if there was a --one-file-system option, similar to what rsync, du and find provide, to limit directory recursion.

Keep only the longest or shortest filename

Being able to have -dN keep the longest- or shortest-named file in a set can be helpful, especially in a directory with a bunch of accidental copies whose names end in extra text, or when you want to preserve a long and detailed name over a shorter and meaningless one.

"Birthday Problem" in README

The example given in the README is not the birthday problem, but the pigeonhole principle. The birthday problem concerns situations where the number of hashes computed is smaller than the number of possible hashes, but the probability of a collision is high. For example, the classic birthday problem says that 30 randomly selected birthdays (out of 365) have a >70% chance of a collision.

Debian packaging

Hi,

As said in #18, I packaged jdupes in Debian. I have some considerations and suggestions.

  1. Can you remove or move the debian/ directory? It will make easier the packaging.
  2. In some files you say that license is MIT or GPL-2 or GPL-3. If you want GPL-2+ or GPL-3+, you need to declare it explicitly.
  3. I used a patch to provide full GCC hardening. Can you apply this patch? I attached the patch.

10_add-GCC-hardening.patch.txt

  4. There are some warnings when building. I attached the build log.

jdupes_1.5+git20161026.d58c0bc-1_amd64.build.txt

Thanks a lot for your work.

Cheers,

Eriberto

Documentation is misleading

Documentation states:

Various build options are available and can be turned on at compile time by setting CFLAGS_EXTRA or by passing it to 'make':

make CFLAGS_EXTRA=-DYOUR_OPTION
make CFLAGS_EXTRA='-DYOUR_OPTION_ONE -DYOUR_OPTION_TWO'

If all options require "=1" to enable, this should probably read:

make CFLAGS_EXTRA=-DYOUR_OPTION=1
make CFLAGS_EXTRA='-DYOUR_OPTION_ONE=1 -DYOUR_OPTION_TWO=1'

Thanks for clarifying the issue I had and putting in a check for it!

The --reverse option

Why was this removed from the program? It's a very useful feature. Thanks for your time.

Add to Macports

Hi there @jbruchon !

Thanks for your amazing fork. What a great contribution to the world of filesystem maintenance.

Have you considered adding jdupes to the Macports directory? I'm not really sure what it entails, though I think it involves creating something called a "portfile".

Interactive mode for links and dedupes

Hi,

I notice for symlinks it says

" -l --linksoft make relative symlinks for duplicates w/o prompting "

Is there any way to prompt for those too?

Cheers
Chris

Compilation failed on Debian Jessie

I haven't been able to compile the project with -DENABLE_BTRFS because of the errors below.
I am working with Debian Stable (Jessie), on amd64 and kernel 4.7.
However, compiling without this flag works without problems.

cc -Wall -Wextra -Wwrite-strings -Wcast-align -Wstrict-aliasing -pedantic -Wstrict-overflow -std=gnu99 -O2 -g -D_FILE_OFFSET_BITS=64 -fstrict-aliasing -pipe -I.  -DENABLE_BTRFS    -c -o jdupes.o jdupes.c
jdupes.c: In function ‘get_relative_name’:
jdupes.c:344:3: warning: return from incompatible pointer type
   return &rel_path;
   ^
jdupes.c:335:22: warning: unused variable ‘p’ [-Wunused-variable]
   static const char *p;
                      ^
jdupes.c:334:14: warning: unused variable ‘depthcnt’ [-Wunused-variable]
   static int depthcnt = 0;
              ^
jdupes.c:333:25: warning: unused variable ‘p2’ [-Wunused-variable]
   static char p1[4096], p2[4096], rel_path[4096];
                         ^
jdupes.c:333:15: warning: unused variable ‘p1’ [-Wunused-variable]
   static char p1[4096], p2[4096], rel_path[4096];
               ^
jdupes.c: In function ‘dedupeerrstr’:
jdupes.c:1302:14: error: ‘BTRFS_SAME_DATA_DIFFERS’ undeclared (first use in this function)
   if (err == BTRFS_SAME_DATA_DIFFERS) {
              ^
jdupes.c:1302:14: note: each undeclared identifier is reported only once for each function it appears in
jdupes.c: In function ‘dedupefiles’:
jdupes.c:1326:24: error: invalid application of ‘sizeof’ to incomplete type ‘struct btrfs_ioctl_same_args’
   same = calloc(sizeof(struct btrfs_ioctl_same_args) +
                        ^
jdupes.c:1327:24: error: invalid application of ‘sizeof’ to incomplete type ‘struct btrfs_ioctl_same_extent_info’
                 sizeof(struct btrfs_ioctl_same_extent_info) * max_dupes, 1);
                        ^
jdupes.c:1352:15: error: dereferencing pointer to incomplete type
           same->info[cur_info].fd = fd;
               ^
jdupes.c:1353:15: error: dereferencing pointer to incomplete type
           same->info[cur_info].logical_offset = 0;
               ^
jdupes.c:1358:11: error: dereferencing pointer to incomplete type
       same->logical_offset = 0;
           ^
jdupes.c:1359:11: error: dereferencing pointer to incomplete type
       same->length = files->size;
           ^
jdupes.c:1360:11: error: dereferencing pointer to incomplete type
       same->dest_count = n_dupes;
           ^
jdupes.c:1369:23: error: ‘BTRFS_IOC_FILE_EXTENT_SAME’ undeclared (first use in this function)
       ret = ioctl(fd, BTRFS_IOC_FILE_EXTENT_SAME, same);
                       ^
jdupes.c:1380:27: error: dereferencing pointer to incomplete type
         if ((status = same->info[cur_info].status) != 0) {
                           ^
jdupes.c:1388:23: error: dereferencing pointer to incomplete type
         if (close(same->info[cur_info].fd) == -1) {
                       ^
<builtin>: recipe for target 'jdupes.o' failed
make: *** [jdupes.o] Error 1

Add control over the "triangle problem"

When two file duplicate pairs share a common file between them, the pairs will be merged into a set of three matched files even if the files weren't directly matched. This is the "triangle problem" and it has two simple solutions: ignore it and let matches be "greedy" or run additional passes that can split match sets. A third possibility is to record match pairings in a list instead of as a linked list of duplicates; this would enable more diverse options for processing the matches into sets, but will also require a lot of code rewriting.

[FIXED] Windows: various slashes in paths

Dear Jody,

jdupes --recurse --size D:\Data\Library\Text

outputs as follows

1590 bytes each:
D:\Data\Library\Text/Lyrics/Delerium - After All.txt
D:\Data\Library\Text/Lyrics/Delerium - After All (Edit).txt

However, I should not see forward slashes (/) on Windows, right?
The correct separator is \, so let's make the slashes in paths consistent.
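
A minimal sketch of the requested normalization (hypothetical, not the actual jdupes fix): convert any forward slashes to backslashes before printing a path on Windows.

    /* Hypothetical output normalization for Windows: make every separator '\' */
    #include <stdio.h>

    static void win_slash_fix(char *path)
    {
      for (; *path != '\0'; path++)
        if (*path == '/') *path = '\\';
    }

    int main(void)
    {
      char path[] = "D:\\Data\\Library\\Text/Lyrics/Delerium - After All.txt";

      win_slash_fix(path);
      printf("%s\n", path);   /* D:\Data\Library\Text\Lyrics\Delerium - After All.txt */
      return 0;
    }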

Improvements from Debian (1.8 version)

Hi,

Today, I will upload jdupes 1.8 to Debian. I have two suggestions:

  1. Remove the act_dedupefiles.o file when cleaning (Makefile).
  2. When building in Debian Stretch, I can see the following warning:
jdupes.c: In function ‘get_filehash’:
jdupes.c:878:9: warning: assuming signed overflow does not occur when simplifying conditional [-Wstrict-overflow]
   while (fsize > 0) {
         ^

Thanks in advance!

Regards,

Eriberto
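
Regarding the -Wstrict-overflow warning above: a minimal sketch (not the actual get_filehash() code) of a chunked read loop that tracks the remaining byte count as an unsigned size_t, so the loop condition never depends on signed arithmetic and the warning does not apply.

    /* Sketch: read a file in fixed-size chunks using an unsigned byte counter */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    #define CHUNK_SIZE 65536

    static int read_in_chunks(const char *path, off_t filesize)
    {
      char *buf = malloc(CHUNK_SIZE);
      size_t remaining = (size_t)filesize;  /* unsigned remaining-byte counter */
      FILE *fp;

      if (buf == NULL) return -1;
      fp = fopen(path, "rb");
      if (fp == NULL) { free(buf); return -1; }

      while (remaining > 0) {
        size_t want = remaining < CHUNK_SIZE ? remaining : CHUNK_SIZE;
        size_t got = fread(buf, 1, want, fp);
        if (got == 0) break;                /* short read or I/O error */
        /* ... feed buf[0..got) into the hash function here ... */
        remaining -= got;
      }

      fclose(fp);
      free(buf);
      return 0;
    }

    int main(int argc, char **argv)
    {
      struct stat s;

      if (argc < 2 || stat(argv[1], &s) != 0) return 1;
      return read_in_chunks(argv[1], s.st_size);
    }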

--dedupe not working

I currently have a problem trying to deduplicate found duplicates with --dedupe:

jdupes -r --dedupe test duplicate
Examining 4 files, 2 dirs (in 2 specified)
Dedupe [1/2] 50% Couldn't dedupe duplicate/DSC_5440.NEF => test/DSC_5440.NEF: Invalid argument
Dedupe [2/2] 100% Couldn't dedupe duplicate/DSC_5440.JPG => test/DSC_5440.JPG: Invalid argument

Invalid argument [-22]

I've tried on my whole system and I only got a few errors of that type:

couldn't dedupe /home/user/.local/share/flatpak/app/org.pitivi.Pitivi/x86_64/master/a7db8b814077a01ef7f8d4686469f938d4dd96d375643a9fe1f169f2eb84d656/files/lib/python3.4/site-packages/pytz/zoneinfo/GMT => /home/user/.local/share/flatpak/repo/objects/ef/b0bc86b3e31cc839e0033feb237d34014f283f196f3dd33155f39c7c2afb07.file: Invalid argument [-22]

Do you have any idea what could cause that?
I've tried a diff on these two files and they are indeed the same.
Perhaps the name is too long?
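
A stand-alone repro sketch can help narrow this down: call the same extent-same ioctl directly on one failing pair and see whether the kernel itself returns EINVAL. This assumes headers that declare the interface (see the fallback declarations earlier in this document otherwise); the program names are illustrative.

    /* Sketch: dedupe the whole of SRC into DEST and print the kernel's answer */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/btrfs.h>

    int main(int argc, char **argv)
    {
      struct btrfs_ioctl_same_args *same;
      struct stat s;
      int src, dest;

      if (argc != 3) { fprintf(stderr, "usage: %s SRC DEST\n", argv[0]); return 1; }
      src = open(argv[1], O_RDONLY);
      dest = open(argv[2], O_RDWR);     /* destination opened for writing here */
      if (src == -1 || dest == -1 || fstat(src, &s) == -1) { perror("open"); return 1; }

      same = calloc(1, sizeof(*same) + sizeof(struct btrfs_ioctl_same_extent_info));
      if (same == NULL) return 1;
      same->logical_offset = 0;
      same->length = (__u64)s.st_size;  /* whole file */
      same->dest_count = 1;
      same->info[0].fd = dest;
      same->info[0].logical_offset = 0;

      if (ioctl(src, BTRFS_IOC_FILE_EXTENT_SAME, same) == -1)
        perror("BTRFS_IOC_FILE_EXTENT_SAME");
      else
        printf("status=%d, bytes_deduped=%llu\n", same->info[0].status,
               (unsigned long long)same->info[0].bytes_deduped);
      free(same);
      return 0;
    }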

Feature: Search for files between two sizes

Hi,

I tried using two -x parameters, like -x 10M -x +100M, to search for files between two sizes, but it seems to only take the last parameter into consideration.

cheers
Chris
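
For reference, a range filter boils down to checking two thresholds together instead of letting the last -x win; a hypothetical sketch (the names are illustrative, not jdupes internals):

    /* Hypothetical inclusive size-range filter: keep a file only if its size
     * falls between a lower and an (optional) upper bound. */
    #include <stdio.h>

    static int size_in_range(long long size, long long min_size, long long max_size)
    {
      if (size < min_size) return 0;
      if (max_size > 0 && size > max_size) return 0;  /* 0 means "no upper bound" */
      return 1;
    }

    int main(void)
    {
      /* keep files between 10 MiB and 100 MiB */
      long long min_size = 10LL << 20, max_size = 100LL << 20;
      long long sizes[] = { 5LL << 20, 50LL << 20, 200LL << 20 };

      for (int i = 0; i < 3; i++)
        printf("%lld bytes: %s\n", sizes[i],
               size_in_range(sizes[i], min_size, max_size) ? "kept" : "excluded");
      return 0;
    }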

Cache hash information across invocations using a hash database

I have a large set of files (300k files, 80 GB) that is slowly but constantly expanded with new files, while the existing files are not modified. It takes a very long time to scan the root directory for duplicates each time. Having a switch for incremental deduplication that stores the previous scan data in ~/.cache/ and takes the user's word that the existing files weren't modified since the last run would be a great help - the time would be spent only on checking the recent files.
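
A minimal sketch of the proposed trust-the-user check (the record layout and cache location are assumptions, not an existing jdupes feature): a stored hash is reused only when the file's size and mtime still match the cached record, otherwise the file is rehashed.

    /* Sketch of a cache-validity check for a previously hashed file */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <time.h>

    struct cache_record {
      char path[4096];
      off_t size;
      time_t mtime;
      uint64_t hash;          /* previously computed full-file hash */
    };

    /* Returns 1 and fills *hash if the cached record can still be trusted */
    static int cached_hash(const struct cache_record *rec, uint64_t *hash)
    {
      struct stat s;

      if (stat(rec->path, &s) != 0) return 0;
      if (s.st_size != rec->size || s.st_mtime != rec->mtime) return 0;
      *hash = rec->hash;      /* trust the user: contents assumed unchanged */
      return 1;
    }

    int main(void)
    {
      struct cache_record rec = { "/etc/hostname", 0, 0, 0 };  /* placeholder values */
      uint64_t hash;

      if (cached_hash(&rec, &hash))
        printf("cache hit: %016llx\n", (unsigned long long)hash);
      else
        printf("cache miss: rehash %s\n", rec.path);
      return 0;
    }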

Add an equivalent of the fdupes "-1"/"--sameline" option

One use case for the original fdupes -1 / --sameline option is generating downstream scripts based on grep-style filtering of the output. It gives lazy folks like me a way to process each group of duplicates without scripting a walk of the output that detects when a "record" (match group) ends.
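
In the meantime, the default output can be re-joined externally. A small hypothetical filter that reads the blank-line-separated groups from stdin and prints each group on a single line (note that filenames containing spaces become ambiguous in this format):

    /* Sketch: join each blank-line-separated group from stdin onto one line */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
      char line[8192];
      int first = 1;

      while (fgets(line, sizeof(line), stdin) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';       /* strip the newline */
        if (line[0] == '\0') {                    /* blank line = end of group */
          if (!first) putchar('\n');
          first = 1;
          continue;
        }
        if (!first) putchar(' ');
        fputs(line, stdout);
        first = 0;
      }
      if (!first) putchar('\n');
      return 0;
    }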

Feature request: create relative symlinks to duplicate files

Hi,

I am a Debian Developer and I am searching for a good solution to resolve these lintian warnings[1].

[1] https://lintian.debian.org/tags/duplicate-files.html

Considering that the Debian packaging process uses relative symlinks, I currently use the following approach to solve the issue:

    # Using rdfind and symlinks to transform duplicated files into symlinks
    rdfind -makesymlinks true $(DOCPATH)
    symlinks -cr $(DOCPATH)

Can you add a feature to solve this? If so, I saw an ITP sent by you to Debian; I can sponsor your package or, if you prefer, I can package jdupes for you.

Thanks a lot in advance.

Regards,

Eriberto
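
For reference, a relative link target can be derived from two absolute, canonicalized paths roughly the way GNU ln --relative does. A hypothetical sketch (not jdupes code; names and buffer sizes are illustrative):

    /* Sketch: compute a relative symlink target from two absolute paths
     * (both assumed already canonicalized, e.g. via realpath()). */
    #include <stdio.h>
    #include <string.h>

    static void relative_target(const char *target, const char *linkpath,
                                char *out, size_t outlen)
    {
      const char *t = target, *l = linkpath;

      out[0] = '\0';
      /* skip the common leading characters... */
      while (*t != '\0' && *t == *l) { t++; l++; }
      /* ...then back up so we never split a path component */
      while (t > target && *(t - 1) != '/') { t--; l--; }

      /* climb one level for every directory left in the link's own path */
      for (const char *p = l; *p != '\0'; p++)
        if (*p == '/') strncat(out, "../", outlen - strlen(out) - 1);

      /* then descend into what remains of the target path */
      strncat(out, t, outlen - strlen(out) - 1);
    }

    int main(void)
    {
      char rel[4096];

      relative_target("/usr/share/doc/otherpkg/README",
                      "/usr/share/doc/pkg/README", rel, sizeof(rel));
      /* prints "../otherpkg/README"; the duplicate at the link path would be
       * removed before calling symlink(rel, linkpath) */
      printf("%s\n", rel);
      return 0;
    }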

Invalid option in Unicode mode causes segfault

When compiled with -municode, the getopt() code in MinGW-w64 causes a segmentation fault. We need to include our own subset of a getopt() implementation to work around this problem, and perhaps rework some of the argument-handling code in the process. Functionally, this only causes a problem when an invalid option is passed, which results in an error exit anyway, so it is not a serious bug.

Make -m hard link aware and clarify what "occupying X bytes" means

While using jdupes to further dedupe my rsnapshot backups, I came upon some very high occupied-space numbers from the -m parameter (e.g. 3 GB), but when I deduplicated the files with -L, only some 100 MB were freed.

How is the "occupied space" shown in the -m statistics calculated?

I've done this small test:

~/git/jdupes$ mkdir tdir
~/git/jdupes$ echo -n 0123456789 > tdir/A
~/git/jdupes$ echo -n 0123456789 > tdir/B
~/git/jdupes$ ln tdir/A tdir/A1
~/git/jdupes$ ln tdir/A tdir/A2
~/git/jdupes$ ln tdir/B tdir/B1
~/git/jdupes$ ln tdir/B tdir/B2
~/git/jdupes$ ln tdir/B tdir/B3
~/git/jdupes$ ls -lin tdir/
total 84
1310944 -rw-r--r-- 3 1000 1000 10 Mar 21 22:10 A
1310944 -rw-r--r-- 3 1000 1000 10 Mar 21 22:10 A1
1310944 -rw-r--r-- 3 1000 1000 10 Mar 21 22:10 A2
1310945 -rw-r--r-- 4 1000 1000 10 Mar 21 22:10 B
1310945 -rw-r--r-- 4 1000 1000 10 Mar 21 22:10 B1
1310945 -rw-r--r-- 4 1000 1000 10 Mar 21 22:10 B2
1310945 -rw-r--r-- 4 1000 1000 10 Mar 21 22:10 B3
~/git/jdupes$ ./jdupes -m tdir/
Scanning: 7 files, 1 dirs (in 1 specified)
4 duplicate files (in 1 sets), occupying 40 bytes    

I would have expected -m to show either 10 bytes, if "size that can be freed" is meant, or 20 bytes, if the number shows the total occupied size.
As there are only two inodes in use with 10 bytes each, the reported size of 40 bytes seems a bit high.

…Thinking about this: Is this an instance of the triangle problem?

If yes, could you please still tell me which size -m prints?
(my guess: "size that can be freed")

Strange size issue with btrfs dedupe

With the latest checkout of jdupes on Debian Stretch (with kernel 4.9.0-3-amd64 and btrfs-tools 4.9.1-1~bpo9+1) on a btrfs filesystem, I got some strange size issues when deduping some files with jdupes.

root@system:/tmp/jdupes# ls -l
total 47488
-rwxrwxr-x 1 user user 48627712 déc.  15  2016 random
root@system:/tmp/jdupes# btrfs fi du .
     Total   Exclusive  Set shared  Filename
  46.38MiB    46.38MiB           -  ./random
  46.38MiB    46.38MiB       0.00B  .
root@system:/tmp/jdupes# /bin/cp -a random random2
root@system:/tmp/jdupes# btrfs fi du .
     Total   Exclusive  Set shared  Filename
  46.38MiB    46.38MiB           -  ./random
  46.38MiB    46.38MiB           -  ./random2
  92.75MiB    92.75MiB       0.00B  .
root@system:/tmp/jdupes# rm random2
root@system:/tmp/jdupes# /bin/cp -a --reflink=auto random random2
root@system:/tmp/jdupes# btrfs fi du .
     Total   Exclusive  Set shared  Filename
  46.38MiB       0.00B           -  ./random
  46.38MiB       0.00B           -  ./random2
  92.75MiB       0.00B    46.38MiB  .
root@system:/tmp/jdupes# rm random2
root@system:/tmp/jdupes# /bin/cp -a random random2
root@system:/tmp/jdupes# btrfs fi du .
     Total   Exclusive  Set shared  Filename
  46.38MiB    46.38MiB           -  ./random
  46.38MiB    46.38MiB           -  ./random2
  92.75MiB    92.75MiB       0.00B  .
root@system:/tmp/jdupes# jdupes -1 -r -B -Z .
Scanning: 2 files, 1 dirs (in 1 specified)
Deduplication done (1 files processed)                      
root@system:/tmp/jdupes# btrfs fi du .
     Total   Exclusive  Set shared  Filename
  46.38MiB    26.09MiB           -  ./random
  46.38MiB    30.38MiB           -  ./random2
  92.75MiB    56.46MiB    20.29MiB  .
root@system:/tmp/jdupes# btrfs fi du .
     Total   Exclusive  Set shared  Filename
  46.38MiB       0.00B           -  ./random
  46.38MiB       0.00B           -  ./random2
  92.75MiB       0.00B    76.75MiB  .

When copying with reflink, it seems that I get a perfectly deduplicated copy.
Using jdupes, it seems that I only get a partially deduplicated copy, and then after a while I get some strange results.

Is there something I can do to check that jdupes really deduplicated correctly?
Perhaps it's just an error in btrfs fi du.

Thanks
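
One way to check sharing independently of btrfs fi du is to compare the physical extents each file maps to; a minimal sketch using the generic FIEMAP ioctl (an illustrative check, not something jdupes provides). If both files report the same physical offsets, with the shared flag set, their data blocks are genuinely shared.

    /* Sketch: print each file's physical extents via the FIEMAP ioctl */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    #define MAX_EXTENTS 64

    static void print_extents(const char *path)
    {
      struct fiemap *fm;
      int fd = open(path, O_RDONLY);

      if (fd == -1) { perror(path); return; }
      fm = calloc(1, sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent));
      if (fm == NULL) { close(fd); return; }

      fm->fm_start = 0;
      fm->fm_length = ~0ULL;               /* map the whole file */
      fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush pending writes first */
      fm->fm_extent_count = MAX_EXTENTS;

      if (ioctl(fd, FS_IOC_FIEMAP, fm) == -1) {
        perror("FS_IOC_FIEMAP");
      } else {
        printf("%s:\n", path);
        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
          printf("  logical %llu -> physical %llu, length %llu%s\n",
                 (unsigned long long)fm->fm_extents[i].fe_logical,
                 (unsigned long long)fm->fm_extents[i].fe_physical,
                 (unsigned long long)fm->fm_extents[i].fe_length,
                 (fm->fm_extents[i].fe_flags & FIEMAP_EXTENT_SHARED) ? " (shared)" : "");
      }
      free(fm);
      close(fd);
    }

    int main(int argc, char **argv)
    {
      for (int i = 1; i < argc; i++) print_extents(argv[i]);
      return 0;
    }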
