
dm-dedup

Device-mapper's dedup target provides transparent data deduplication of block devices. Every write coming to a dm-dedup instance is deduplicated against previously written data. For datasets that contain many duplicates scattered across the disk (e.g., virtual machine disk image collections, backups, home directory servers), deduplication can provide significant space savings.

Construction Parameters

<meta_dev> <data_dev> <block_size>
<hash_algo> <backend> <flushrq>
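
For example, a complete table line for a hypothetical 20 GiB (41943040-sector) target could look like the following (the device paths and size are illustrative):

0 41943040 dedup /dev/sdb1 /dev/sdc1 4096 md5 cowbtree 100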

<meta_dev> This is the device where dm-dedup's metadata resides. Metadata typically includes the hash index, block mappings, and reference counters. It should be specified as a path, like "/dev/sdaX".

<data_dev> This is the device where the actual data blocks are stored. It should be specified as a path, like "/dev/sdaX".

<block_size> This is the size of a single block on the data device, in bytes. A block is both the unit of deduplication and the unit of storage. Supported values are between 4096 and 1048576 (1MB) and must be a power of two.
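A candidate block size can be sanity-checked in the shell (a minimal sketch; the bounds follow the constraints above):

# Check that BS is a power of two within [4096, 1048576].
BS=65536
if (( BS >= 4096 && BS <= 1048576 && (BS & (BS - 1)) == 0 )); then
    echo "$BS is a valid dm-dedup block size"
else
    echo "$BS is not a valid dm-dedup block size"
fi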

<hash_algo> This specifies which hashing algorithm dm-dedup will use for detecting identical blocks, e.g., "md5" or "sha256". Any hash algorithm supported by the running kernel can be used (see "/proc/crypto" file).
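The set of hash algorithms available on a given system can be listed from /proc/crypto, for example:

# Print the names of all algorithms registered with the kernel crypto API.
awk '/^name/ {print $3}' /proc/crypto | sort -u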

<backend> This is the backend that dm-dedup will use to store metadata. Currently supported values are "cowbtree" and "inram". The cowbtree backend uses persistent Copy-on-Write (COW) B-trees to store metadata. The inram backend stores all metadata in RAM, which is lost after a system reboot; consequently, the inram backend should typically be used only for experiments. Note that although the inram backend does not use the metadata device, the <meta_dev> parameter must still be specified on the command line.

<flushrq> This parameter specifies how many writes to the target should occur before dm-dedup flushes its buffered metadata to the metadata device. In other words, in the event of a power failure, one can lose up to this number of most recent writes. Note that dm-dedup also flushes its metadata when it sees the REQ_FLUSH or REQ_FUA flags in I/O requests. In particular, these flags are set by file systems at appropriate points in time to ensure file system consistency.

During construction, dm-dedup checks whether the first 4096 bytes of the metadata device are equal to zero. If they are, a completely new dm-dedup instance is initialized, with the metadata and data devices considered "empty". If, however, the first 4096 bytes are not zero, dm-dedup will try to reconstruct the target based on the existing information on the metadata and data devices.
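One can check in advance which path dm-dedup will take (a sketch using GNU cmp; $META_DEV is set as in the example below):

# Exit status 0 means the first 4096 bytes are all zero ("empty" device).
if cmp -s -n 4096 "$META_DEV" /dev/zero; then
    echo "metadata device looks empty: a new dm-dedup instance will be created"
else
    echo "existing metadata found: dm-dedup will attempt reconstruction"
fi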

Theory of Operation

We provide an overview of the dm-dedup design in this section. The detailed design and a performance evaluation can be found in the following paper:

V. Tarasov, D. Jain, G. Kuenning, S. Mandal, K. Palanisami, P. Shilane, and S. Trehan. Dmdedup: Device Mapper Target for Data Deduplication. Ottawa Linux Symposium, 2014. http://www.fsl.cs.stonybrook.edu/docs/ols-dmdedup/dmdedup-ols14.pdf

To quickly identify duplicates, dm-dedup maintains an index of hashes for all written blocks. A block is a user-configurable unit of deduplication and storage. Dm-dedup's index, along with other deduplication metadata, resides on a separate block device, which we refer to as the metadata device. The blocks themselves are stored on the data device. Although the metadata device can be any block device, e.g., an HDD or a partition thereof, for higher performance we recommend using SSDs to store metadata.

For every block that is written to the target, dm-dedup computes its hash using the configured <hash_algo>. It then looks up the resulting hash in the hash index. If a match is found, the write is considered a duplicate.

Dm-dedup's hash index is essentially a mapping between a hash and the physical address of a block on the data device. In addition, dm-dedup maintains a mapping between logical block addresses on the target and physical block addresses on the data device (the LBN-PBN mapping). When a duplicate is detected, there is no need to write the actual data to disk, and only the LBN-PBN mapping is updated.

When non-duplicate data is written, a new physical block on the data device is allocated and written, and the corresponding hash is added to the index.

On read, the LBN-PBN mapping allows dm-dedup to quickly locate the required block on the data device. If an LBN was never written before, a zero block is returned.
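
The write path can be summarized with a toy userspace sketch; bash associative arrays stand in for the on-disk key-value stores, and the function name is illustrative, not the kernel's:

# Toy model of dm-dedup's two mappings (illustration only, not kernel code).
declare -A hash_index    # hash -> PBN (physical block number)
declare -A lbn_to_pbn    # LBN (logical block number) -> PBN
next_pbn=0

write_block() {          # usage: write_block <lbn> <file-with-one-block>
    local lbn=$1 h
    h=$(md5sum < "$2" | cut -d' ' -f1)
    if [[ -n ${hash_index[$h]} ]]; then
        # Duplicate: update the LBN-PBN mapping only; no data is written.
        lbn_to_pbn[$lbn]=${hash_index[$h]}
    else
        # Unique: allocate a new PBN, write the block, update both mappings.
        hash_index[$h]=$next_pbn
        lbn_to_pbn[$lbn]=$next_pbn
        next_pbn=$((next_pbn + 1))
    fi
}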

Target Size

When using device-mapper, one needs to specify the target size in advance. To get deduplication benefits, the target size should be larger than the data device size (otherwise one could just use the data device directly). Because a dataset's deduplication ratio is not known in advance, one has to use an estimate.

Usually, a deduplication ratio of up to 1.5 is a safe assumption for a primary dataset. For backup datasets, however, the deduplication ratio can be as high as 100.

Estimating the deduplication ratio of an existing dataset with the fs-hasher package from http://tracer.filesystems.org/ can give a good starting point for a specific dataset.

If one overestimates the deduplication ratio, the data device can run out of free space. This situation can be monitored using the dmsetup status command (described below). After the data device is full, dm-dedup stops accepting writes until free space becomes available on the data device again.
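A minimal way to watch for this (assuming the status line begins with <start> <end> <type>, so that <dfree>, described below, is the fifth field; adjust if your dmsetup prints the device name as well):

# Print the number of free (unallocated) blocks on the data device.
dmsetup status mydedup | awk '{print "free data blocks:", $5}'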

Backends

Dm-dedup's core logic treats the hash index and the LBN-PBN mapping as plain key-value stores with an extended API described in

drivers/md/dm-dedup-backend.h

Different backends can provide this key-value store API. We implemented a cowbtree backend that uses device-mapper's persistent metadata framework to store metadata consistently. Details on this framework and its on-disk layout can be found here:

Documentation/device-mapper/persistent-data.txt

By using persistent COW B-trees, the cowbtree backend guarantees consistency in the event of a power failure.

In addition, we also provide an inram backend that stores all metadata in RAM. Hash tables with linear probing are used for storing the index and the LBN-PBN mapping. The inram backend does not store metadata persistently and should usually be used only for experiments.

Dmsetup Status

Dm-dedup exports various statistics via the dmsetup status command. The line returned by dmsetup status contains the following values, in this order:

<name> <start> <end> <type>
<dtotal> <dfree> <dused> <dactual>
<dblock> <ddisk> <mddisk>
<writes> <uniqwrites> <dupwrites>
<readonwrites> <overwrites> <newwrites>

<name>, <start>, <end>, and <type> are generic fields printed by the dmsetup tool for any target.

<dtotal>       - total number of blocks on the data device
<dfree>        - number of free (unallocated) blocks on the data device
<dused>        - number of used (allocated) blocks on the data device
<dactual>      - number of allocated logical blocks (written at least once)
<dblock>       - block size in bytes
<ddisk>        - data disk's major:minor
<mddisk>       - metadata disk's major:minor
<writes>       - total number of writes to the target
<uniqwrites>   - the number of writes that weren't duplicates (were unique)
<dupwrites>    - the number of writes that were duplicates
<readonwrites> - the number of times dm-dedup had to read data from the data device because a write was misaligned (read-on-write effect)
<overwrites>   - the number of writes to a logical block that was written before at least once
<newwrites>    - the number of writes to a logical address that was not written before even once

To compute the deduplication ratio, one needs to divide <dactual> by <dused>.
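
As a sketch, assuming the status line begins with <start> <end> <type> (so <dused> is the sixth field and <dactual> the seventh; adjust if your dmsetup prints the device name as well):

# Compute the deduplication ratio as dactual / dused.
dmsetup status mydedup | awk '{printf "deduplication ratio: %.2f\n", $7 / $6}'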

Example

Decide on metadata and data devices:

META_DEV=/dev/sdX

DATA_DEV=/dev/sdY

Compute target size assuming 1.5 dedup ratio:

DATA_DEV_SIZE=$(blockdev --getsz $DATA_DEV)

TARGET_SIZE=$(expr $DATA_DEV_SIZE \* 15 / 10)

Reset metadata device:

dd if=/dev/zero of=$META_DEV bs=4096 count=1

Set up the target:

echo "0 $TARGET_SIZE dedup $META_DEV $DATA_DEV 4096 md5 cowbtree 100" | dmsetup create mydedup
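
After creation, the mapping and the initial statistics can be inspected:

# Show the target's table line and its current status counters.
dmsetup table mydedup
dmsetup status mydedup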

Authors

dm-dedup was developed in the File systems and Storage Lab (FSL) at Stony Brook University's Computer Science Department, in collaboration with Harvey Mudd College and EMC.

Key people involved in the project were Vasily Tarasov, Geoff Kuenning, Sonam Mandal, Karthikeyani Palanisami, Philip Shilane, Sagar Trehan, and Erez Zadok.

We also acknowledge the help of several students involved in the deduplication project: Teo Asinari, Deepak Jain, Mandar Joshi, Atul Karmarkar, Meg O'Keefe, Gary Lent, Amar Mudrankit, Ujwala Tulshigiri, and Nabil Zaman.


dmdedup4.8's Issues

High dedup ratio may cause "mount" failure.

My data device and metadata device are both 4G.
The recommended dedup ratio is 1.5. When I make an ext4 file system on "mydedup", "mount" fails with a "read superblock" error. I tried changing the dedup ratio to 1.4, 1.3, 1.2, and 1.1, and "mount" succeeds with all of them.
But I don't know why.

Question: dmdedup invoking handle_read and handle_write without any explicit I/O, even during a dd command

I set up a dmdedup target with the inram backend, and I am writing to it using the following dd command:

dd if=/dev/zero of=/dev/mapper/mydedup bs=4096 count=5

I have observed a lot of read requests (i.e., bios with the read direction), and I am unable to comprehend what is going on.

Specifically, with count=5 in the dd command, there is 1 write request and 5 read requests.

As far as I understand, the above dd command should write to my device-mapper target. But why is it reading from it? Is it because I am reading from /dev/zero? But /dev/zero should not be part of the device-mapper path, right? What am I missing here?

I inserted some debug messages using DMINFO() and later used dmesg to see the output.

Debug output from dmesg:
debug_output_after_dd.txt

insmod: ERROR:

insmod: ERROR: could not insert module dm-dedup.ko: Unknown symbol in module

[ 790.030181] dm_dedup: Unknown symbol dm_btree_insert_notify (err 0)
[ 790.030184] dm_dedup: Unknown symbol dm_bm_read_lock (err 0)
[ 790.030188] dm_dedup: Unknown symbol dm_tm_create_with_sm (err 0)
[ 790.030192] dm_dedup: Unknown symbol dm_sm_disk_create (err 0)
[ 790.030196] dm_dedup: Unknown symbol dm_bm_write_lock_zero (err 0)
[ 790.030200] dm_dedup: Unknown symbol dm_btree_lookup (err 0)
[ 790.030203] dm_dedup: Unknown symbol dm_tm_destroy (err 0)
[ 790.030210] dm_dedup: Unknown symbol dm_bm_write_lock (err 0)
[ 790.030215] dm_dedup: Unknown symbol dm_btree_find_lowest_key (err 0)
[ 790.030221] dm_dedup: Unknown symbol dm_block_manager_destroy (err 0)
[ 790.030225] dm_dedup: Unknown symbol dm_bm_checksum (err 0)
[ 790.030230] dm_dedup: Unknown symbol dm_btree_empty (err 0)
[ 790.030234] dm_dedup: Unknown symbol dm_sm_disk_open (err 0)
[ 790.030237] dm_dedup: Unknown symbol dm_tm_commit (err 0)
[ 790.030243] dm_dedup: Unknown symbol dm_bm_block_size (err 0)
[ 790.030246] dm_dedup: Unknown symbol dm_btree_find_highest_key (err 0)
[ 790.030249] dm_dedup: Unknown symbol dm_btree_lookup_next (err 0)
[ 790.030255] dm_dedup: Unknown symbol dm_btree_remove (err 0)
[ 790.030258] dm_dedup: Unknown symbol dm_bm_unlock (err 0)
[ 790.030263] dm_dedup: Unknown symbol dm_tm_open_with_sm (err 0)

poor write performance (continued)

(this is the continuation of dmdedup/dmdedup3.19#44)

I redid the test on kernel 4.9.34-29.el7.x86_64 (4 CPUs, data and metadata both on the same NVMe device). Specifically I used three deduplication ratios (50%, 75%, and 100%) and compared performance against the native NVMe device. I populated the dedup target as follows (where XXX is the deduplication ratio):

fio --filename=... --ioengine=libaio --direct=1 --name=foo --blocksize=4k --filesize=4G --rw=write --dedupe_percentage=XXX --numjobs=4 --iodepth=8

And then tested it as follows:

fio --filename=... --ioengine=libaio --direct=1 --name=foo --blocksize=4k --filesize=4G --rw=randwrite --dedupe_percentage=XXX --numjobs=4 --iodepth=64 --time_based=1 --runtime=30 --group_reporting=1

For the 50% and 75% deduplication ratios I got 20-21K IOPS, while for the native NVMe case I got 43.4K IOPS. In the dedup case the NVMe device was only 35% utilised and the CPUs were largely underutilised.

I also tested using a 100% deduplication ratio and got 77K IOPS. Since all blocks were duplicates, the NVMe device wasn't active at all (all metadata fit in the cache), but the CPUs were again underutilised. I also tested using a zero DM target instead of the NVMe device for storing data and got similar results. Writing directly to the zero target achieves 455K IOPS.

So my understanding is that there is some kind of unnecessary serialisation? Is there something else I could tune? Is there some other test that could saturate either the NVMe device or the CPU?

I'd like to use dmdedup but I have some questions.

I built the module and produced the '.ko', then loaded it with insmod. How do I use it from here? Will I need to upgrade 'lvcreate' to the newest version? And how can I use dmdedup on linux-4.4, please?

kernel bug when using block size larger than 4KB

In kernel 4.9.34-29.el7.x86_64 I create a 10 GB LV for metadata and a 100 GB LV for data. I then create a dm-dedup target using a block size larger than 4 KB (e.g., 8, 16, or 64 KB) and finally create an ext3 file system, which hangs. In dmesg I see the following:

[ 1032.481954] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 1032.481969] IP: [<ffffffff816edb71>] list_get_page+0x11/0x30
[ 1032.481982] PGD 785c9067 [ 1032.481985] PUD 74c62067
PMD 0 [ 1032.481991]
[ 1032.481995] Oops: 0000 [#1] SMP
[ 1032.482002] Modules linked in: dm_dedup(O) xen_netfront xen_blkfront virtio_pci virtio_net virtio_mmio virtio_blk virtio_balloon nbd megaraid_sas eoa(O) crct10dif_pclmul dm_thin_pool dm_bio_prison dm_persistent_data libcrc32c xen_pcifront nvme nvme_core virtio_ring virtio ahci_platform libahci_platform
[ 1032.482040] CPU: 1 PID: 2648 Comm: kworker/u8:2 Tainted: G           O    4.9.34-29.el7.x86_64 #1
[ 1032.482051] Workqueue: dm-dedup do_work [dm_dedup]
[ 1032.482057] task: ffff8800040f0000 task.stack: ffffc900421c8000
[ 1032.482062] RIP: e030:[<ffffffff816edb71>]  [<ffffffff816edb71>] list_get_page+0x11/0x30
[ 1032.482071] RSP: e02b:ffffc900421cbb78  EFLAGS: 00010206
[ 1032.482076] RAX: 0000000000000000 RBX: 0000000000000078 RCX: ffffc900421cbbec
[ 1032.482081] RDX: ffffc900421cbbf8 RSI: ffffc900421cbbf0 RDI: ffffc900421cbc78
[ 1032.482088] RBP: ffffc900421cbb78 R08: 0000000000000000 R09: 0000000002400000
[ 1032.482094] R10: ffff88007f49d288 R11: ffffc900421cbd00 R12: ffff880005699100
[ 1032.482099] R13: ffffc900421cbc78 R14: ffffc900421cbd08 R15: ffff88005598b7c0
[ 1032.482109] FS:  0000000000000000(0000) GS:ffff88007f480000(0000) knlGS:0000000008e1b830
[ 1032.482117] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1032.482122] CR2: 0000000000000008 CR3: 0000000074ef8000 CR4: 0000000000042660
[ 1032.482129] Stack:
[ 1032.482133]  ffffc900421cbc28 ffffffff816ee0c6 0000000000000000 0000000000000000
[ 1032.482142]  ffff880078629820 ffffffff816edb60 00000000421cbbb8 ffffffff816edb90
[ 1032.482153]  0000020000000001 0000000000000000 0000000000000000 0000000000000000
[ 1032.482163] Call Trace:
[ 1032.482170]  [<ffffffff816ee0c6>] dispatch_io+0x216/0x3e0
[ 1032.482176]  [<ffffffff816edb60>] ? dm_copy_name_and_uuid+0xb0/0xb0
[ 1032.482182]  [<ffffffff816edb90>] ? list_get_page+0x30/0x30
[ 1032.482188]  [<ffffffff816ee4fb>] dm_io+0x26b/0x2e0
[ 1032.482193]  [<ffffffff816edb60>] ? dm_copy_name_and_uuid+0xb0/0xb0
[ 1032.482200]  [<ffffffff816edb90>] ? list_get_page+0x30/0x30
[ 1032.482206]  [<ffffffffc00f2214>] prepare_bio_on_write+0x174/0x220 [dm_dedup]
[ 1032.482216]  [<ffffffff811c63de>] ? mempool_kfree+0xe/0x10
[ 1032.482222]  [<ffffffffc00f32f2>] do_work+0x2b2/0x613 [dm_dedup]
[ 1032.482229]  [<ffffffff810bf182>] ? pwq_activate_delayed_work+0x42/0xb0
[ 1032.482235]  [<ffffffff810c0d1e>] process_one_work+0x14e/0x3f0
[ 1032.482242]  [<ffffffff810c18bb>] worker_thread+0x12b/0x4a0
[ 1032.482249]  [<ffffffff81878b84>] ? __schedule+0x224/0x680
[ 1032.482254]  [<ffffffff810c1790>] ? max_active_store+0x60/0x60
[ 1032.482261]  [<ffffffff810c1790>] ? max_active_store+0x60/0x60
[ 1032.482267]  [<ffffffff810c6c97>] kthread+0xd7/0xf0
[ 1032.482273]  [<ffffffff810c6bc0>] ? kthread_park+0x60/0x60
[ 1032.482279]  [<ffffffff81003ccb>] ? do_fast_syscall_32+0xab/0x220
[ 1032.482286]  [<ffffffff8187d8d5>] ret_from_fork+0x25/0x30
[ 1032.482291] Code: 96 07 d1 ff eb b0 45 31 ed eb ab b8 fa ff ff ff eb b3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 18 44 8b 47 10 48 89 e5 <48> 8b 40 08 48 89 06 44 89 c6 b8 00 10 00 00 48 29 f0 48 89 02
[ 1032.482351] RIP  [<ffffffff816edb71>] list_get_page+0x11/0x30
[ 1032.482358]  RSP <ffffc900421cbb78>
[ 1032.482361] CR2: 0000000000000008
[ 1032.482368] ---[ end trace e16ff557165924b9 ]---
[ 1032.482417] BUG: unable to handle kernel paging request at ffffffffffffffd8
[ 1032.482425] IP: [<ffffffff810c78c0>] kthread_data+0x10/0x20
[ 1032.482431] PGD 1e0a067 [ 1032.482434] PUD 1e0c067
PMD 0 [ 1032.482439]
[ 1032.482442] Oops: 0000 [#2] SMP
[ 1032.482446] Modules linked in: dm_dedup(O) xen_netfront xen_blkfront virtio_pci virtio_net virtio_mmio virtio_blk virtio_balloon nbd megaraid_sas eoa(O) crct10dif_pclmul dm_thin_pool dm_bio_prison dm_persistent_data libcrc32c xen_pcifront nvme nvme_core virtio_ring virtio ahci_platform libahci_platform
[ 1032.482481] CPU: 1 PID: 2648 Comm: kworker/u8:2 Tainted: G      D    O    4.9.34-29.el7.x86_64 #1
[ 1032.482497] task: ffff8800040f0000 task.stack: ffffc900421c8000
[ 1032.482503] RIP: e030:[<ffffffff810c78c0>]  [<ffffffff810c78c0>] kthread_data+0x10/0x20
[ 1032.482511] RSP: e02b:ffffc900421cbe78  EFLAGS: 00010002
[ 1032.482516] RAX: 0000000000000000 RBX: ffff88007f499780 RCX: 0000000000000001
[ 1032.482522] RDX: ffff880078c15180 RSI: ffff8800040f0000 RDI: ffff8800040f0000
[ 1032.482528] RBP: ffffc900421cbe78 R08: 0000000000000000 R09: 0000000000004c00
[ 1032.482533] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000019780
[ 1032.482539] R13: ffff8800040f0000 R14: ffff8800040f09c0 R15: 0000000000000000
[ 1032.482547] FS:  0000000000000000(0000) GS:ffff88007f480000(0000) knlGS:0000000008e1b830
[ 1032.482554] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1032.482560] CR2: 0000000000000028 CR3: 0000000074ef8000 CR4: 0000000000042660
[ 1032.482567] Stack:
[ 1032.482570]  ffffc900421cbe88 ffffffff810c1cae ffffc900421cbed0 ffffffff81878d86
[ 1032.482581]  ffff8800055fdde0 ffffc900421cbed0 ffff8800040f0000 ffff8800040f0898
[ 1032.482588]  ffffc900421cbac8 ffff8800040f0000 ffff880078ffcb80 ffffc900421cbee8
[ 1032.482599] Call Trace:
[ 1032.482604]  [<ffffffff810c1cae>] wq_worker_sleeping+0xe/0x90
[ 1032.482610]  [<ffffffff81878d86>] __schedule+0x426/0x680
[ 1032.482618]  [<ffffffff810d3fe8>] do_task_dead+0x38/0x40
[ 1032.482624]  [<ffffffff810ac588>] do_exit+0x638/0xaf0
[ 1032.482630]  [<ffffffff8187ef47>] rewind_stack_do_exit+0x17/0x20
[ 1032.482636] Code: 55 be 01 00 00 00 48 89 e5 e8 fd fe ff ff 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 87 48 09 00 00 48 89 e5 <48> 8b 40 d8 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
[ 1032.482701] RIP  [<ffffffff810c78c0>] kthread_data+0x10/0x20
[ 1032.482707]  RSP <ffffc900421cbe78>
[ 1032.482710] CR2: ffffffffffffffd8
[ 1032.482715] ---[ end trace e16ff557165924ba ]---
[ 1032.482719] Fixing recursive fault but reboot is needed!
