
X-Ryl669 commented on June 24, 2024

You are right.
I've considered using HMAC-SHA256 and even a simple Hash(Multichunk || RandomSalt), but I was stuck on re-identification of the data from the hash. Basically, using an HMAC or a random salt would make the hash change for the same input data.

This was painful while developing, as I wanted to assert that the process was deterministic (using the same master key, of course) and always gave the same output for a given input vector.

Now that I'm quite confident in the code (I have used it to restore a crashed server twice in the past year), I might add either of those features to the system. This would however require re-backing up your data folder, since changing the format will break the ability to read back the previous backup.

Concerning the two possible solutions, I'm afraid that using the master key to generate the HMAC key might weaken the master key (because we'll have to disclose how we generated it for each multichunk), and add an additional point of failure (since you must store the additional data). It also slows down hashing by at least a factor of 4, because of the 2 hashes + xor of the blocks.
Using a random salt still requires storing it, but is more lightweight on the hashing performance (adding 32 bytes to a 25 MB buffer is not a big overhead).

from frost.

Vayu commented on June 24, 2024

I've considered using HMAC-SHA256 and even a simple Hash(Multichunk || RandomSalt), but I was stuck on re-identification of the data from the hash. Basically, using an HMAC or a random salt would make the hash change for the same input data.

HMAC-SHA256 is deterministic given the same key and data. No salt or randomness is involved. The only difference from plain SHA256 is that the key is needed to verify integrity (which incidentally protects confidentiality), and in addition it also verifies authenticity.
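This determinism is easy to check with the Python standard library (the key and data below are illustrative placeholders, not Frost's actual format):

```python
import hmac, hashlib

key = b"repository-hmac-key"           # hypothetical derived key
data = b"multichunk contents"

# Two independent computations over the same key and data...
tag1 = hmac.new(key, data, hashlib.sha256).hexdigest()
tag2 = hmac.new(key, data, hashlib.sha256).hexdigest()
assert tag1 == tag2                    # ...are identical: HMAC is deterministic

# Without the key, the tag cannot be recomputed or verified.
other = hmac.new(b"some other key", data, hashlib.sha256).hexdigest()
assert other != tag1
```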

I'm afraid that using the master key to generate the HMAC key might weaken the master key (because we'll have to disclose how we generated it for each multichunk), and add an additional point of failure (since you must store the additional data).

One can use any standard KDF (e.g. PBKDF2) to derive HMAC keys from the master key; this should not weaken the master key in any way. There is no need for a per-chunk HMAC key; one per backup repository is enough.
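A minimal sketch of this derivation with Python's standard hashlib/hmac modules; the salt string and key sizes are illustrative assumptions, not Frost's actual parameters:

```python
import hashlib, hmac, os

master_key = os.urandom(32)   # stands in for the high-entropy master key

# Derive one repository-wide HMAC key from the master key via PBKDF2.
# A single iteration is enough here, since master_key already has full entropy.
hmac_key = hashlib.pbkdf2_hmac("sha256", master_key, b"hmac-key-salt", 1)

def chunk_tag(chunk: bytes) -> str:
    # The same derived key is reused for every chunk, so deduplication is
    # preserved: identical chunks map to identical tags.
    return hmac.new(hmac_key, chunk, hashlib.sha256).hexdigest()

assert chunk_tag(b"same data") == chunk_tag(b"same data")
```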

It also slows down hashing by at least a factor of 4, because of the 2 hashes + xor of the blocks.

HMAC is only about 2x slower than plain hashing, which is an acceptable tradeoff in most situations.

... simple Hash(Multichunk || RandomSalt) ... Using a random salt still requires storing it, but is more lightweight on the hashing performance (adding 32 bytes to a 25 MB buffer is not a big overhead)

Such a salted hash will not significantly improve confidentiality, since the salt is public.


X-Ryl669 commented on June 24, 2024

What you are asking for is to have a single HMAC key for the whole repository?
I'm afraid that, in that case, this does not seem right to me.
HMAC is defined as:

HMAC(m, k) = h((k xor opad) || h((k xor ipad) || m))

The parts k xor opad and k xor ipad are constant (just like a salt) if the key is the same for the whole archive. Thus this reduces to HMAC(m, k) = h(constantSalt1 || h(constantSalt2 || m)), which is, IMHO, a lot less secure than h(m || salt), provided the salt is modified for every m. The idea behind this thinking is that if one can break h once, one can break it twice, so there is no additional protection in hashing recursively.

Said differently, let me take an example. Say you have a string of zeros (or any known value) forming 2 different messages; running both through HMAC with a constant key will give the same output for both messages. An attacker can then figure out that the messages are the same (but not their content, obviously). If the key evolves, this is impossible to deduce.
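The observation can be sketched as follows (illustrative key and message values, not Frost's format):

```python
import hmac, hashlib

key = b"constant repository key"       # illustrative constant HMAC key
m1 = b"\x00" * 32                      # two messages with identical content
m2 = b"\x00" * 32

t1 = hmac.new(key, m1, hashlib.sha256).digest()
t2 = hmac.new(key, m2, hashlib.sha256).digest()

# Equal messages under the same key give equal tags, so an observer can
# tell that the two blocks are identical (though not what they contain).
assert t1 == t2
```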

As I understand it, a key derived with PBKDF2 (like in Frost) is not unbreakable, it's just harder to break. Once you have the master key, the PBKDF2 is no longer a protection. If I were an attacker, I'd focus on this.

However, if what you want is true message authentication, then Frost must create a new key for each instance of HMAC (or at least at regular intervals).

Please note that the process of chunking and deduplication means it is very unlikely that the same string of data goes through the hashing process twice (that's the whole idea of deduplication: never store the same data twice). So, in the end, it's also very unlikely that an attack by "identifying the hash" would work, even without a salt.

I was wrong about the additional time for hashing with HMAC. I just measured it, and it's only 1.6x slower than plain hashing. Salting is negligible. The PBKDF2 takes time, however, so it would be hard to run for every block. I wonder if a simple counter on the key would work as well, as in:
HMAC(m, k) with k = KDF(masterKey, blockIndex), the KDF function being a simple hash of the master key with the blockIndex.
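A sketch of this counter-based derivation, assuming SHA-256 as the KDF hash (the key material and index encoding are illustrative choices):

```python
import hmac, hashlib

def block_key(master_key: bytes, block_index: int) -> bytes:
    # k = KDF(masterKey, blockIndex): just a hash of the master key
    # concatenated with the block index.
    return hashlib.sha256(master_key + block_index.to_bytes(8, "big")).digest()

def block_mac(master_key: bytes, block_index: int, data: bytes) -> bytes:
    return hmac.new(block_key(master_key, block_index), data, hashlib.sha256).digest()

master = b"\x42" * 32   # placeholder master key

# The same data at two different block indices yields different tags, yet
# nothing new has to be stored: the block index is already known.
assert block_mac(master, 0, b"data") != block_mac(master, 1, b"data")
```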

In that case, it would be possible to implement your suggestion, since we don't need to add anything to the file format: everything can be deduced from the information we already have.


Vayu commented on June 24, 2024

What you are asking for is to have a single HMAC key for the whole repository?
I'm afraid that, in that case, this does not seem right to me.

It has to be the same if one wants deduplication.

The idea behind this thinking is that if one can break h once, one can break it twice, so there is no additional protection in hashing recursively.

If someone can break h, neither option will be secure. The double hashing is there for protection against length-extension and collision attacks.

Said differently, let me take an example. Say you have a string of zeros (or any known value) forming 2 different messages; running both through HMAC with a constant key will give the same output for both messages. An attacker can then figure out that the messages are the same (but not their content, obviously). If the key evolves, this is impossible to deduce.

Our "messages" are blocks in a deduplicated backup; it is not a secret that they are the same, by definition of deduplication.

As I understand it, a key derived with PBKDF2 (like in Frost) is not unbreakable, it's just harder to break. Once you have the master key, the PBKDF2 is no longer a protection. If I were an attacker, I'd focus on this.

password -> PBKDF2 -> key is breakable because the password is a short, low-entropy value, and we need many KDF rounds to slow down brute force. But master-key -> PBKDF2 -> derived-keys is fine even with 1 round, because the master key is a long, high-entropy value.

Please note that the process of chunking and deduplication means it is very unlikely that the same string of data goes through the hashing process twice (that's the whole idea of deduplication: never store the same data twice). So, in the end, it's also very unlikely that an attack by "identifying the hash" would work, even without a salt.

Chunking is deterministic, so if anyone wants to see whether a specific file is present in the repository, they can simply run chunking on it and compute/compare plain hashes. With HMAC (a keyed hash), this is not possible without knowing the HMAC key.

If you really want a per-block HMAC key with deduplication, you can borrow ideas from convergent encryption schemes. For instance, at the cost of one extra hash: mac = HMAC(m, k) with k = KDF(masterKey, H(m)), where H is a plain hash like SHA-256. But this doesn't add any security over a constant HMAC key.
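This convergent scheme could be sketched as follows, using HMAC-SHA256 as the KDF (the key material and names are illustrative):

```python
import hmac, hashlib

def convergent_mac(master_key: bytes, m: bytes) -> bytes:
    # Per-block key derived from the master key and the plain hash of the
    # block itself: k = KDF(masterKey, H(m)), one extra hash per block.
    h = hashlib.sha256(m).digest()
    k = hmac.new(master_key, h, hashlib.sha256).digest()
    return hmac.new(k, m, hashlib.sha256).digest()

master = b"\x07" * 32   # placeholder master key

# Identical blocks still yield identical tags, so deduplication survives.
assert convergent_mac(master, b"block") == convergent_mac(master, b"block")
assert convergent_mac(master, b"block") != convergent_mac(master, b"other")
```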


X-Ryl669 commented on June 24, 2024

Ok, now I understand what you are saying.
In fact, Frost uses a rolling checksum to find chunk boundaries (which means that the chunk size is not constant), and a SHA-1 function to hash the chunks, used as a key in an index table to quickly locate each chunk. Note that this function must be fast and produce a length-limited output (20 bytes is already quite large), because otherwise the index becomes large (and thus slow to use/query).
Using HMAC here might be possible, but it would be HMAC-SHA1, not HMAC-SHA256.
Deduplication works at this level; that is, if a chunk's hash matches one already recorded, the chunk is not processed further (not saved, not compressed, not encoded).
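A toy illustration of content-defined chunking plus hash-indexed deduplication; the rolling function below is a stand-in, not Frost's actual rolling checksum, and the mask/window sizes are arbitrary:

```python
import hashlib

def chunk_boundaries(data: bytes, mask: int = 0x3FF, min_size: int = 64) -> list:
    # Toy content-defined chunker: a rolling value over the bytes decides
    # the boundaries, so chunk sizes vary with the content.
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling = ((rolling << 1) + b) & 0xFFFFFFFF
        if (rolling & mask) == mask and i - start >= min_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Deduplication at the chunk level: index each chunk by its SHA-1 and skip
# any chunk whose hash is already recorded (no save/compress/encode step).
index = {}
for chunk in chunk_boundaries(b"example data " * 1000):
    key = hashlib.sha1(chunk).hexdigest()   # 20-byte digest keeps the index small
    if key not in index:
        index[key] = chunk                  # only new chunks are processed further
```

Splicing the chunks back together reconstructs the original data, which is what restore relies on.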

Those unique chunks are then copied into a multichunk, which is hashed with SHA-256. This is the hash I was talking about. This is the most intensive work here, since multichunks are large and have to be hashed, compressed, and encoded. Since the last 2 steps can be skipped, there is a (very low) risk that one could find out the multichunk content from its hash.

What you said makes sense now.
I can add an option to use HMAC-SHA1 instead of SHA-1 in the chunker's algorithm.

Please note, however, that currently the index is not encrypted, so it still contains the file names and metadata. I consider the index as sensitive as the keyvault. And because it is required for searching for chunks, it shouldn't be far away; it should be in the local "safe" zone, like the keyvault.


Vayu commented on June 24, 2024

Thanks for the explanation.

Is there an encrypted copy of the index in the repository? Basically, my question is: if we lose all local data but have the key and the repository, can everything be restored from that?


X-Ryl669 commented on June 24, 2024

The short answer is no.
You need 3 things to restore a backup: the key vault, the index file, and the data. If any of them is missing, you are stuck.

The long answer is: yes, you can probably decrypt & decompress the data with the master key from the keyvault; however, without the index, you'll get a bunch of data in which you can probably find your files. But because of deduplication, the data is no longer linear, and it will be very, very hard to figure out how to rebuild a complete file, even one whose content you know.

I'm not storing the index in the data set, since I designed the system so the data set can be copied to a hostile environment without giving any clues on how to use it, yet remain fast to back up. If you want to save the index remotely, you'll need to encrypt it (this is a planned feature I'm going to add in the next version, basically pre & post backup/restore actions).


Vayu commented on June 24, 2024

Now I see, thanks. If the index is sensitive and not stored remotely in plain text, then HMAC is not necessary.

It is probably worth mentioning that, in its current state, this is not a complete backup solution, because losing the index is equal to losing the backup. Ideally, one should be able to restore with just a key and access to the remote repository.


X-Ryl669 commented on June 24, 2024

Ok, I'm adding an additional option here to encrypt/decrypt the index file and store the encrypted index in the backup's dataset.
So it can now be part of the backup set safely.
Pro:

  1. You only need to store the keyvault in a trusted location; the index file is no longer required.
  2. If you lose your local (unencrypted) index file, you can still rebuild it from the remote one.

Con:

  1. Encrypting the index file is an additional step in the backup process, and unfortunately it must happen for the complete file each time (since the index can change anywhere). The index is on average 1% of the size of the dataset, so it means encrypting that much data for each backup (unless no file is modified...). For a 500 GB dataset, that's a 5 GB encryption task.
  2. I'm still using a local (clear) index that is rebuilt from the encrypted index if and only if it's not found (you can see it as a "cache"), so in normal usage one does not have to pay the decryption cost to back up.
  3. If not done already, it takes 1% more space on the remote server to store the encrypted index.


X-Ryl669 commented on June 24, 2024

Ok, done in the v3 branch. The new options are --safeindex and --decryptindex (the latter if you need to use Frost's FUSE version). The only feature missing before release is support for ZSTD compression (easier to deal with than BSC).

