Comments (15)

Lakshmipathi commented on September 25, 2024

Hi @Berryamin,

Looks like a bug; I reproduced it locally. Please apply this small one-line fix and give it another try.

diff --git a/dduper b/dduper
index 521d2c5..693cf92 100755
--- a/dduper
+++ b/dduper
@@ -242,7 +242,7 @@ def dedupe_files(file_list, dry_run):
 
 def validate_file(filename):
         global run_len
-        if os.path.exists == False:
+        if os.path.exists(filename) == False:
             return False
         file_stat = os.stat(filename)
         # Verify its a unique regular file
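
For context on why this is a one-line fix: without the call parentheses, os.path.exists is a function object, which never compares equal to False, so the guard could never trigger. A minimal demonstration:

import os

# Without the call parentheses, os.path.exists is a function object, so
# "os.path.exists == False" is always False and the guard never fires.
print(os.path.exists == False)          # False: a function never equals False
print(os.path.exists("/no/such/file"))  # False: the intended existence check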

karawitan commented on September 25, 2024

I've applied the patch. When I run it, I get a bunch of "Skipped" messages, which look OK, like:

Skipped /home/ida/.python_history not unique regular files or             file size < 4kb or size < 32768

But then I get many errors like the one below:

query various internal information
btrfs inspect-internal: unknown token 'dump-csum'
usage: btrfs inspect-internal <command> <args>

    btrfs inspect-internal inode-resolve [-v] <inode> <path>
        Get file system paths for the given inode
    btrfs inspect-internal logical-resolve [-Pv] [-s bufsize] <logical> <path>
        Get file system paths for the given logical address
    btrfs inspect-internal subvolid-resolve <subvolid> <path>
        Get file system paths for the given subvolume ID.
    btrfs inspect-internal rootid <path>
        Get tree ID of the containing subvolume of path.
    btrfs inspect-internal min-dev-size [options] <path>
        Get the minimum size the device can be shrunk to. The
    btrfs inspect-internal dump-tree [options] device
        Dump tree structures from a given device
    btrfs inspect-internal dump-super [options] device [device...]
        Dump superblock from a device in a textual form
    btrfs inspect-internal tree-stats [options] <path>
        Print various stats for trees

Reverting to the previous version does not produce the latter errors.

Lakshmipathi commented on September 25, 2024

Those errors indicate that your btrfs binary is missing the dump-csum option for btrfs inspect-internal.

Please share the output of these two commands:
stat cmds-inspect-dump-csum.c
grep -r 'os.path.exists' dduper

Also, have you installed btrfs-progs from your current source tree (kalou@myvps:~/btrfs-progs)? Remember, dump-csum is not yet merged into the official btrfs-progs.

karawitan commented on September 25, 2024
kalou@myvps:~/btrfs-progs$ stat cmds-inspect-dump-csum.c
  File: cmds-inspect-dump-csum.c
  Size: 6168            Blocks: 16         IO Block: 4096   regular file
Device: 33h/51d Inode: 366195      Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1001/   kalou)   Gid: ( 1001/   kalou)
Access: 2020-04-02 11:13:33.859008254 +0200
Modify: 2020-04-02 11:13:33.859008254 +0200
Change: 2020-04-02 11:13:33.859008254 +0200
 Birth: -
kalou@myvps:~/btrfs-progs$ grep -r 'os.path.exists' dduper
        if os.path.exists(filename) == False:

The btrfs tools are the ones from Debian 10:

root@myvps:~# dpkg -l | grep btrfs
ii  btrfs-progs                          4.20.1-2                           amd64        Checksumming Copy on Write Filesystem utilities
ii  libbtrfs-dev                         4.20.1-2                           amd64        Checksumming Copy on Write Filesystem utilities (development headers)
ii  libbtrfs0                            4.20.1-2                           amd64        Checksumming Copy on Write Filesystem utilities (runtime library)

Lakshmipathi commented on September 25, 2024

The btrfs tools are the ones from Debian 10:

That's the problem: they don't have the dump-csum option, since it is not merged upstream.

You have two options:

Option-1:

  1. Build btrfs-progs:

./autogen.sh && ./configure && make

  2. Then install it under a custom location, say /opt/btrfs-progs, with commands like:

sudo mkdir -p /opt/btrfs-progs
sudo make DESTDIR=/opt/btrfs-progs install

Now if you run

/opt/btrfs-progs/usr/local/bin/btrfs inspect-internal dump-csum

it should print the usage info.

  3. Finally, update the dduper file to use the above custom location:

Find these two lines:
out1 = subprocess.Popen(['btrfs', 'inspect-internal', 'dump-csum', src_file, device_name],
out2 = subprocess.Popen(['btrfs', 'inspect-internal', 'dump-csum', dst_file, device_name],

and modify them as follows (a more flexible variant is sketched after these steps):

out1 = subprocess.Popen(['/opt/btrfs-progs/usr/local/bin/btrfs', 'inspect-internal', 'dump-csum', src_file, device_name],
out2 = subprocess.Popen(['/opt/btrfs-progs/usr/local/bin/btrfs', 'inspect-internal', 'dump-csum', dst_file, device_name],

Now run the dduper command.
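
As an alternative to hardcoding the path in two places, here is a small sketch of a configurable variant, assuming a hypothetical DDUPER_BTRFS environment variable (this is not part of dduper):

import os
import subprocess

# Hypothetical tweak (not part of dduper): resolve the btrfs binary once,
# overridable via an environment variable, instead of hardcoding it twice.
BTRFS_BIN = os.environ.get("DDUPER_BTRFS", "btrfs")

def dump_csum(path, device):
    # Same invocation as the two Popen calls above, with the configurable path.
    return subprocess.Popen(
        [BTRFS_BIN, "inspect-internal", "dump-csum", path, device],
        stdout=subprocess.PIPE)

With that in place, running dduper with DDUPER_BTRFS=/opt/btrfs-progs/usr/local/bin/btrfs would pick up the custom build without editing the script again.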

Option-2:

If you are not worried about keeping your existing btrfs-progs from Debian 10:

  1. Uninstall them.

  2. Build and install the binaries from this source: ./autogen.sh && ./configure && make && make install

  3. Run the dduper command.

karawitan commented on September 25, 2024

Ok.

I had to use --prefix at configure time instead of DESTDIR at install time; otherwise I'd get a weird path like /opt/btrfs-progs/usr/local/bin. (DESTDIR only stages the default /usr/local prefix under the target directory, whereas --prefix=/opt/btrfs-progs makes that the real install location.)

Now, using
export PATH=/opt/btrfs-progs/bin:$PATH
kalou@myvps:~/btrfs-progs$ python dduper --device=/dev/mapper/data-home --dir=/home --recurse | egrep -v ^Skipped

************************
Dedupe completed for /home/ida/.cache/pip/http/0/4/1/8/c/0418c83b80f7f7bfaec2738bfbbee53d2c1562196c0781702f6eddc8:/home/ida/.local/lib/python2.7/site-packages/pandas/core/window.pyc
Dedupe validation successful /home/ida/.cache/pip/http/0/4/1/8/c/0418c83b80f7f7bfaec2738bfbbee53d2c1562196c0781702f6eddc8:/home/ida/.local/lib/python2.7/site-packages/pandas/core/window.pyc
Summary
blk_size : 4 chunksize : 32
/home/ida/.cache/pip/http/0/4/1/8/c/0418c83b80f7f7bfaec2738bfbbee53d2c1562196c0781702f6eddc8 has 3 chunks
/home/ida/.local/lib/python2.7/site-packages/pandas/core/window.pyc has 3 chunks
Matched chunks : 0
Unmatched chunks: 3
  File "dduper", line 330, in 
    main(results)
  File "dduper", line 281, in main
    dedupe_dir(results.dir_path, results.dry_run, results.recurse)
  File "dduper", line 272, in dedupe_dir
    dedupe_files(file_list, dry_run)
  File "dduper", line 233, in dedupe_files
    if validate_files(src_file, dst_file, processed_files) is True:
  File "dduper", line 206, in validate_files
    dst_stat = os.stat(dst_file)
OSError: [Errno 2] No such file or directory: '/home/ida/.local/lib/python2.7/site-packages/pandas/core/algorithms.pyc.__superduper'

root@myvps:/home/kalou/btrfs-progs# ls -ld /home/ida/.local/lib/python2.7/site-packages/pandas/core/algorithms.pyc.__superduper
ls: cannot access '/home/ida/.local/lib/python2.7/site-packages/pandas/core/algorithms.pyc.__superduper': No such file or directory

root@myvps:/home/kalou/btrfs-progs# grep -r 'os.path.exists' dduper
        if os.path.exists(filename) == False:

The output of dduper suggests that deduplication has taken place, but df before and after does not show much improvement.
Maybe I should tweak the block size threshold?

Lakshmipathi commented on September 25, 2024

I wonder what's going on with this:

OSError: [Errno 2] No such file or directory: '/home/ida/.local/lib/python2.7/site-packages/pandas/core/algorithms.pyc.__superduper'

Let's perform a simple test to find out whether dedupe works with two files. I assume /tmp uses tmpfs.

dd if=/dev/urandom of=/tmp/f1 bs=1M count=100
# Now copy to your btrfs partition:
cp /tmp/f1 /home/ida/f1
df -h
cp /tmp/f1 /home/ida/f2 # you must copy from /tmp 
df -h 
# Now run de-dupe on these two files:
python dduper --device=/dev/mapper/data-home  --files /home/ida/f1 /home/ida/f2 
sleep 5 && sync && sleep 5
df -h # should show 100M less. 

If this fails, there is no way --dir will work, and no other tweaks will help.
Sample demo: http://giis.co.in/btrfs_dedupe.gif

karawitan commented on September 25, 2024

The simple test worked great; dedup OK.

Lakshmipathi commented on September 25, 2024

Nice, thanks :)

Back to --dir: what is your /home directory size? The df output above said just 1.1GB; is that correct?
I think that's too low, and you probably don't have much duplicate content there.

karawitan commented on September 25, 2024

I have more than 4GB used on /home.

I expect the duplicates to be mostly JavaScript files, downloaded with "npm install"...

I just did a little report; here is how it goes:
207003 out of 214724 files skipped due to "file size < 4kb or size < 32768"

Do I need to change the block size of the whole btrfs filesystem to benefit from deduplication in my case?

Lakshmipathi commented on September 25, 2024

Okay. Yes, dduper by default searches for files greater than 32KB.

python dduper --device /dev/sda1 --files /mnt/f1 /mnt/f2 --chunk-size 1024 # will search in 1MB chunks
(see: https://gitlab.collabora.com/laks/btrfs-progs/blob/dump_csum/Documentation/dduper_usage.md#changing-dedupe-chunk-size)

You can give it a try with --chunk-size 16 and see whether it makes any difference.

karawitan commented on September 25, 2024

Seems that <32k is not supported:

root@myvps:/home/kalou/btrfs-progs# python dduper --device=/dev/mapper/data-home --dir=/home --recurse --chunk-size 16 | egrep -v ^Skipped
Ensure chunk size is of multiple 32KB. (32,64,128 etc)

Lakshmipathi commented on September 25, 2024

TBH, I was not really thinking about de-duping smaller files during dduper development.

I don't know whether this is going to make any difference, but try changing the code below from 32 to 16 to skip the above error message: https://gitlab.collabora.com/laks/btrfs-progs/-/blob/dump_csum/dduper#L54
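
The check is something along these lines (a simplified sketch, not the verbatim source):

import sys

# Simplified sketch of the chunk-size validation near dduper#L54; the real
# code may differ. Changing 32 to 16 here would let --chunk-size 16 through.
def validate_chunk_size(chunk_size):
    if chunk_size % 32 != 0:
        print("Ensure chunk size is of multiple 32KB. (32,64,128 etc)")
        sys.exit(1)

validate_chunk_size(64)  # a multiple of 32: passes silently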

karawitan commented on September 25, 2024

OK, I don't have the right type of data at hand to benefit from dedup...

kalou@myvps:~$ for sz in 4 8 16 32 ; do echo -n "sz=${sz}k :"; sudo find /home/ -type f -size +${sz}k | wc -l; done
sz=4k :52778
sz=8k :32165
sz=16k :18582
sz=32k :8615

kalou@myvps:~$ for sz in 4 8 16 32 ; do echo -n "sz=${sz}k :"; sudo find /home/ -type f -name *js -size +${sz}k | wc -l; done
sz=4k :5992
sz=8k :3282
sz=16k :1721
sz=32k :862

However, thank you so much for the help. I think we may close the ticket for now.
If you need more feedback, just ask.

By the way: congratulations on your ext2/ext3 undelete tool "giis"!!!

Lakshmipathi commented on September 25, 2024

Yes, right. I think something like https://github.com/pauldreik/rdfind may be useful to find duplicate files and replace them with symlinks. Thank you for testing the patch and for the reports. I'll go ahead and close this issue. Thanks and stay safe!

If you need more feedback, just ask.

Sure, will do 👍

By the way: congratulations on your ext2/ext3 undelete tool "giis"!!!

thank you :)
