Comments (15)
Hi @Berryamin,
Looks like a bug. I reproduced it locally. Please apply this small one-line fix and give it another try.
diff --git a/dduper b/dduper
index 521d2c5..693cf92 100755
--- a/dduper
+++ b/dduper
@@ -242,7 +242,7 @@ def dedupe_files(file_list, dry_run):
def validate_file(filename):
global run_len
- if os.path.exists == False:
+ if os.path.exists(filename) == False:
return False
file_stat = os.stat(filename)
# Verify its a unique regular file
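For context on why this one-character-class change matters: the original line compared the function object `os.path.exists` itself to `False`, which is never equal, so missing files were never rejected by `validate_file()`. A minimal illustration:

```python
import os

# Buggy form: compares the function object itself to False.
# A function object never equals False, so this check never fires
# and nonexistent files slip through validate_file().
buggy_check = (os.path.exists == False)
print(buggy_check)  # False, regardless of any filename

# Fixed form: actually calls the function with the path.
print(os.path.exists("/nonexistent/path/for/demo"))  # False for a missing file
```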
from btrfs-progs.
I've applied the patch; when I run it, I get a bunch of "Skipped" messages, which look OK, like:
Skipped /home/ida/.python_history not unique regular files or file size < 4kb or size < 32768
But then I get many errors like the one below:
query various internal information
btrfs inspect-internal: unknown token 'dump-csum'
usage: btrfs inspect-internal

    btrfs inspect-internal inode-resolve [-v]
        Get file system paths for the given inode
    btrfs inspect-internal logical-resolve [-Pv] [-s bufsize]
        Get file system paths for the given logical address
    btrfs inspect-internal subvolid-resolve
        Get file system paths for the given subvolume ID.
    btrfs inspect-internal rootid
        Get tree ID of the containing subvolume of path.
    btrfs inspect-internal min-dev-size [options]
        Get the minimum size the device can be shrunk to. The
    btrfs inspect-internal dump-tree [options] device
        Dump tree structures from a given device
    btrfs inspect-internal dump-super [options] device [device...]
        Dump superblock from a device in a textual form
    btrfs inspect-internal tree-stats [options]
        Print various stats for trees
Reverting to previous version does not produce the later errors.
Those errors indicate your btrfs binary is missing the dump-csum option for btrfs inspect-internal.
Please share the output of these two commands:
stat cmds-inspect-dump-csum.c
grep -r 'os.path.exists' dduper
Also, have you installed btrfs-progs from your current source tree (kalou@myvps:~/btrfs-progs)? Remember, dump-csum is not yet merged into the official btrfs-progs.
kalou@myvps:~/btrfs-progs$ stat cmds-inspect-dump-csum.c
  File: cmds-inspect-dump-csum.c
  Size: 6168       Blocks: 16         IO Block: 4096   regular file
Device: 33h/51d    Inode: 366195      Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1001/   kalou)   Gid: ( 1001/   kalou)
Access: 2020-04-02 11:13:33.859008254 +0200
Modify: 2020-04-02 11:13:33.859008254 +0200
Change: 2020-04-02 11:13:33.859008254 +0200
 Birth: -
kalou@myvps:~/btrfs-progs$ grep -r 'os.path.exists' dduper
    if os.path.exists(filename) == False:
btrfs tools are the ones from Debian 10;
root@myvps:~# dpkg -l | grep btrfs
ii  btrfs-progs   4.20.1-2  amd64  Checksumming Copy on Write Filesystem utilities
ii  libbtrfs-dev  4.20.1-2  amd64  Checksumming Copy on Write Filesystem utilities (development headers)
ii  libbtrfs0     4.20.1-2  amd64  Checksumming Copy on Write Filesystem utilities (runtime library)
btrfs tools are the ones from Debian 10;
That's the problem: they don't have the dump-csum option, since it is not merged upstream yet.
You have two options:
Option-1:
- Build btrfs-progs
./autogen.sh && ./configure && make
- Then install it under custom location, say for example /opt/btrfs-progs with command like:
sudo mkdir -p /opt/btrfs-progs
sudo make DESTDIR=/opt/btrfs-progs install
Now if you run
/opt/btrfs-progs/usr/local/bin/btrfs inspect-internal dump-csum
it should print its usage info.
- Finally update dduper file to use above custom location:
Find these two lines:
out1 = subprocess.Popen(['btrfs', 'inspect-internal', 'dump-csum', src_file, device_name],
out2 = subprocess.Popen(['btrfs', 'inspect-internal', 'dump-csum', dst_file, device_name],
and modify them as:
out1 = subprocess.Popen(['/opt/btrfs-progs/usr/local/bin/btrfs', 'inspect-internal', 'dump-csum', src_file, device_name],
out2 = subprocess.Popen(['/opt/btrfs-progs/usr/local/bin/btrfs', 'inspect-internal', 'dump-csum', dst_file, device_name],
Now run the dduper command.
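Rather than editing the hard-coded path in two places, the binary location could be kept in one variable. This is a hypothetical sketch, not actual dduper code; `BTRFS_BIN` and `build_dump_csum_cmd` are names invented here for illustration:

```python
# Hypothetical refactor: keep the custom btrfs location in one place
# so both subprocess.Popen calls stay in sync.
BTRFS_BIN = "/opt/btrfs-progs/usr/local/bin/btrfs"

def build_dump_csum_cmd(target_file, device_name):
    """Argument vector for 'btrfs inspect-internal dump-csum'."""
    return [BTRFS_BIN, "inspect-internal", "dump-csum", target_file, device_name]

# Each Popen call would then become, e.g.:
#   subprocess.Popen(build_dump_csum_cmd(src_file, device_name), stdout=subprocess.PIPE)
print(build_dump_csum_cmd("/home/ida/f1", "/dev/mapper/data-home"))
```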
Option-2:
If you are not worried about your existing btrfs-progs from Debian 10:
- Uninstall them.
- Build and install binaries from this source: ./autogen.sh && ./configure && make && make install
- Run the dduper command.
Ok. I had to pass --prefix at configure time instead of DESTDIR at install time, otherwise I'd get a weird nested path like /opt/btrfs-progs/usr/local/bin
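That nested path is expected autotools behaviour: DESTDIR is prepended to the configured prefix (which defaults to /usr/local) at install time, while --prefix replaces the prefix itself. A small sketch of the path composition (simplified, names invented here):

```python
# Simplified autotools install-path composition:
#   DESTDIR staging:  DESTDIR + PREFIX + /bin  -> the "weird" nested path
#   --prefix build:   PREFIX + /bin            -> the path you actually wanted
DEFAULT_PREFIX = "/usr/local"

def install_bindir(prefix=DEFAULT_PREFIX, destdir=""):
    return destdir + prefix + "/bin"

print(install_bindir(destdir="/opt/btrfs-progs"))  # make DESTDIR=/opt/btrfs-progs install
print(install_bindir(prefix="/opt/btrfs-progs"))   # ./configure --prefix=/opt/btrfs-progs
```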
Now, using
export PATH=/opt/btrfs-progs/bin:$PATH
kalou@myvps:~/btrfs-progs$ python dduper --device=/dev/mapper/data-home --dir=/home --recurse | egrep -v ^Skipped
************************
Dedupe completed for /home/ida/.cache/pip/http/0/4/1/8/c/0418c83b80f7f7bfaec2738bfbbee53d2c1562196c0781702f6eddc8:/home/ida/.local/lib/python2.7/site-packages/pandas/core/window.pyc
Dedupe validation successful /home/ida/.cache/pip/http/0/4/1/8/c/0418c83b80f7f7bfaec2738bfbbee53d2c1562196c0781702f6eddc8:/home/ida/.local/lib/python2.7/site-packages/pandas/core/window.pyc
Summary
blk_size : 4 chunksize : 32
/home/ida/.cache/pip/http/0/4/1/8/c/0418c83b80f7f7bfaec2738bfbbee53d2c1562196c0781702f6eddc8 has 3 chunks
/home/ida/.local/lib/python2.7/site-packages/pandas/core/window.pyc has 3 chunks
Matched chunks : 0
Unmatched chunks: 3
  File "dduper", line 330, in <module>
    main(results)
  File "dduper", line 281, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "dduper", line 272, in dedupe_dir
    dedupe_files(file_list, dry_run)
  File "dduper", line 233, in dedupe_files
    if validate_files(src_file, dst_file, processed_files) is True:
  File "dduper", line 206, in validate_files
    dst_stat = os.stat(dst_file)
OSError: [Errno 2] No such file or directory: '/home/ida/.local/lib/python2.7/site-packages/pandas/core/algorithms.pyc.__superduper'
root@myvps:/home/kalou/btrfs-progs# ls -ld /home/ida/.local/lib/python2.7/site-packages/pandas/core/algorithms.pyc.__superduper
ls: cannot access '/home/ida/.local/lib/python2.7/site-packages/pandas/core/algorithms.pyc.__superduper': No such file or directory
root@myvps:/home/kalou/btrfs-progs# grep -r 'os.path.exists' dduper
    if os.path.exists(filename) == False:
The output of dduper suggests that deduplication has taken place, but df before and after does not show much improvement.
Maybe I should tweak the block size threshold ?
I wonder what's going on with this:
OSError: [Errno 2] No such file or directory: '/home/ida/.local/lib/python2.7/site-packages/pandas/core/algorithms.pyc.__superduper'
Let's perform a simple test to find out whether dedupe works with two files. I assume /tmp uses tmpfs.
dd if=/dev/urandom of=/tmp/f1 bs=1M count=100
# Now copy to your btrfs partition:
cp /tmp/f1 /home/ida/f1
df -h
cp /tmp/f1 /home/ida/f2 # you must copy from /tmp
df -h
# Now run de-dupe on these two files:
python dduper --device=/dev/mapper/data-home --files /home/ida/f1 /home/ida/f2
sleep 5 && sync && sleep 5
df -h # should show 100M less.
If this fails, there is no way --dir will work, and other tweaks are not required.
Sample demo: http://giis.co.in/btrfs_dedupe.gif
Simple test worked great, dedup ok.
nice thanks :)
Back to --dir: what was your /home directory size? The df output above said just 1.1GB, is that correct?
I think that's too low, and you probably don't have much duplicate content in there?
I have more than 4 GB used on /home.
I expect the duplicate stuff to be mostly JavaScript, downloaded with "npm install" ...
I just did a little report, here is how it goes:
207003 out of 214724 files were skipped due to "file size < 4kb or size < 32768".
Do I need to change the block size of the whole btrfs filesystem to benefit from deduplication in my case?
Okay. Yes, dduper by default searches for files greater than 32KB.
python dduper --device /dev/sda1 --files /mnt/f1 /mnt/f2 --chunk-size 1024 # will search in 1M chunks
(see: https://gitlab.collabora.com/laks/btrfs-progs/blob/dump_csum/Documentation/dduper_usage.md#changing-dedupe-chunk-size)
You can give it a try with --chunk-size 16 and see whether it makes any difference.
Seems that <32k is not supported:
root@myvps:/home/kalou/btrfs-progs# python dduper --device=/dev/mapper/data-home --dir=/home --recurse --chunk-size 16 | egrep -v ^Skipped
Ensure chunk size is of multiple 32KB. (32,64,128 etc)
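The check behind that message can be sketched as follows. This is an assumption inferred from the error text, not the actual dduper source; `chunk_size_ok` is a name invented here:

```python
# Assumed validation, inferred from the message:
# "Ensure chunk size is of multiple 32KB. (32,64,128 etc)"
def chunk_size_ok(size_kb):
    return size_kb >= 32 and size_kb % 32 == 0

print(chunk_size_ok(16))    # False -> rejected, as seen above
print(chunk_size_ok(32))    # True
print(chunk_size_ok(1024))  # True -> the 1M example earlier
```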
TBH, I was not really thinking about de-duping smaller files during dduper development.
I don't know whether it is going to make any difference, but to get past the above error message, try changing the code below from 32 to 16: https://gitlab.collabora.com/laks/btrfs-progs/-/blob/dump_csum/dduper#L54
Ok, I don't have the right type of data at hand to benefit from dedup:
kalou@myvps:~$ for sz in 4 8 16 32 ; do echo -n "sz=${sz}k :"; sudo find /home/ -type f -size +${sz}k | wc -l; done
sz=4k :52778
sz=8k :32165
sz=16k :18582
sz=32k :8615
kalou@myvps:~$ for sz in 4 8 16 32 ; do echo -n "sz=${sz}k :"; sudo find /home/ -type f -name *js -size +${sz}k | wc -l; done
sz=4k :5992
sz=8k :3282
sz=16k :1721
sz=32k :862
However, thank you so much for the help; I think we may close the ticket for now.
If you need more feedback, just ask.
By the way: congratulations on your ext2/ext3 undelete tool "giis" !!!
Yes, right. I think something like https://github.com/pauldreik/rdfind may be useful to find duplicate files and replace them with symlinks. Thank you for testing the patch and for the reports. I'll go ahead and close this issue. Thanks and stay safe!
If you need more feedback, just ask.
Sure, will do 👍
By the way: congratulations on your ext2/ext3 undelete tool "giis" !!!
thank you :)