Comments (6)

mxmlnkn commented on August 19, 2024

The progress messages are broken for compressed archives because the progress is basically calculated from the position in the decompressed stream relative to the compressed file size: I don't have the decompressed file size yet, and computing it beforehand would add too much overhead. I could, however, try to estimate the uncompressed file size by scaling the compressed file size up proportionally, based on the current positions in the compressed and decompressed streams.
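
As a rough sketch of that estimate (hypothetical names, not ratarmount's actual code):

# Assume the compression ratio seen so far also holds for the rest of the file.
def estimate_uncompressed_size(compressed_pos, decompressed_pos, compressed_size):
    if compressed_pos == 0:
        return compressed_size  # nothing decoded yet, fall back to the compressed size
    return compressed_size * decompressed_pos / compressed_pos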

Read errors should never happen, but they might be more likely with compressed files because the bz2 module is rather new. I'll take a look at it soon.

rickhg12hs commented on August 19, 2024

The progress messages are broken for compressed archives because the progress is basically calculated from the position in the decompressed stream relative to the compressed file size: I don't have the decompressed file size yet, and computing it beforehand would add too much overhead. I could, however, try to estimate the uncompressed file size by scaling the compressed file size up proportionally, based on the current positions in the compressed and decompressed streams.

Maybe for compressed archives a simple ratio of compressed bytes read to total compressed size would be OK.
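
Something like this minimal sketch, just to illustrate the suggestion (hypothetical names):

# Report progress as compressed bytes read over the total compressed file size.
def progress(compressed_pos, compressed_size):
    return compressed_pos / compressed_size  # 0.0 .. 1.0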

mxmlnkn commented on August 19, 2024

I can reproduce the file difference problem, and it's related to the custom bz2 decoder... I'm glad you found that and sorry for the time it cost you to find the issue. That's unfortunately what happens when modifying free versions of bzip2 decoders instead of using the canonical version. But I had to modify the one I based mine on quite a lot to add seeking support and even buffered output support (as opposed to writing all data in one go directly to file descriptors).

When comparing the hexdumps, it's visible that the decoder fails to output some characters in sequences of repeated characters.

diff <( hexdump -C CTU-13-Dataset/11/*.pcap ) \
     <( hexdump -C CTU-13-Dataset.mounted/CTU-13-Dataset/11/*.pcap ) > CTU-13-11-bz2-bug.diff
14309719,14381940c14309719,14380586
< 8bd41150  e3 e3 e3 e3 e3 e3 e3 e3  e3 e3 e3 73 19 4d 4e 8c  |...........s.MN.|
< 8bd41160  94 04 00 2a 04 00 00 2a  04 00 00 00 1e 49 db 19  |...*...*.....I..|
< 8bd41170  c3 08 00 27 b5 b7 19 08  00 45 00 04 1c 01 00 00  |...'.....E......|
< 8bd41180  00 80 01 5a b6 93 20 54  a5 93 20 60 45 ab 5a 00  |...Z.. T.. `E.Z.|
< 8bd41190  00 98 00 01 00 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |................|
< 8bd411a0  e1 e1 e1 e1 e1 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |................|
< *
< 8bd41590  e1 e1 e1 e1 e1 73 19 4d  4e 8f 94 04 00 2a 04 00  |.....s.MN....*..|
< 8bd415a0  00 2a 04 00 00 00 1e 49  db 19 c3 08 00 27 e2 09  |.*.....I.....'..|
< 8bd415b0  2d 08 00 45 00 04 1c 01  00 00 00 80 01 5a 9c 93  |-..E.........Z..|
< 8bd415c0  20 54 bf 93 20 60 45 c3  d0 00 00 6a 00 01 00 35  | T.. `E....j...5|
< 8bd415d0  35 35 35 35 35 35 35 35  35 35 35 35 35 35 35 35  |5555555555555555|
---
> 8bd41120  e3 e3 e3 e3 e3 e3 e3 e3  73 19 4d 4e 8c 94 04 00  |........s.MN....|
> 8bd41130  2a 04 00 00 2a 04 00 00  00 1e 49 db 19 c3 08 00  |*...*.....I.....|
> 8bd41140  27 b5 b7 19 08 00 45 00  04 1c 01 00 00 00 80 01  |'.....E.........|
> 8bd41150  5a b6 93 20 54 a5 93 20  60 45 ab 5a 00 00 98 00  |Z.. T.. `E.Z....|
> 8bd41160  01 00 e1 e1 e1 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |................|
> 8bd41170  e1 e1 e1 e1 e1 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |................|
> *
> 8bd41560  e1 e1 73 19 4d 4e 8f 94  04 00 2a 04 00 00 2a 04  |..s.MN....*...*.|
> 8bd41570  00 00 00 1e 49 db 19 c3  08 00 27 e2 09 2d 08 00  |....I.....'..-..|
> 8bd41580  45 00 04 1c 01 00 00 00  80 01 5a 9c 93 20 54 bf  |E.........Z.. T.|
> 8bd41590  93 20 60 45 c3 d0 00 00  6a 00 01 00 35 35 35 35  |. `E....j...5555|
> 8bd415a0  35 35 35 35 35 35 35 35  35 35 35 35 35 35 35 35  |5555555555555555|

In the above case three e3 bytes are missing and the rest is shifted accordingly.

It seems like my unit tests don't cover repeated sequences well enough because they use random data. After adding some tests with variable-length sequences of repeated characters, I can reproduce the bug in the tests.
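
A rough sketch of the kind of test input that triggers it (not the actual unit test code):

import bz2
import os
import random

# Mix random bytes with runs of repeated characters of varying lengths;
# purely random data rarely exercises bzip2's run-length encoding paths.
def make_test_data(num_chunks=1000):
    chunks = []
    for _ in range(num_chunks):
        if random.random() < 0.5:
            chunks.append(os.urandom(random.randint(1, 64)))
        else:
            chunks.append(bytes([random.randrange(256)]) * random.randint(1, 600))
    return b''.join(chunks)

original = make_test_data()
compressed = bz2.compress(original)
# In the real test, 'compressed' would be decoded with the custom seekable bz2
# decoder and compared byte-for-byte against 'original'; the reference module
# round-trip below just sanity-checks the generator.
assert bz2.decompress(compressed) == original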

The problem was that some characters were not flushed out when the internal decoding buffer was empty but the output buffer was not.
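
Roughly, the failure mode looks like this toy sketch (simplified, not the actual decoder code):

# Stopping as soon as the decode source runs dry drops whatever still sits in
# the output buffer; the read loop must drain the output buffer before exiting.
def read_all(decoded_blocks):
    output_buffer = bytearray()
    result = bytearray()
    blocks = iter(decoded_blocks)
    while True:
        if output_buffer:                 # flush already-decoded bytes first
            result += output_buffer
            output_buffer.clear()
            continue
        block = next(blocks, None)
        if block is None:                 # stop only once BOTH buffers are empty
            break
        output_buffer += block            # refill from the (simulated) decoder
    return bytes(result)

assert read_all([b'abc', b'\xe3' * 11]) == b'abc' + b'\xe3' * 11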

I tested the new version on the full CTU-13 dataset with:

ratarmount CTU-13-Dataset.tar.bz2 CTU-13-Dataset.mounted
tar -xf CTU-13-Dataset.tar.bz2
diff -r CTU-13-Dataset/ CTU-13-Dataset.mounted/CTU-13-Dataset/

and it seems to work.

Version 0.3.3 is up. I hope it works now.

rickhg12hs commented on August 19, 2024

Thank you! I will use/test it later today.

ratarmount is orders of magnitude faster than archivemount. I really appreciate the speed-up, and I simply don't have the drive space for all the data if I had to expand it.

I'm glad you found that and sorry for the time it cost you to find the issue.

I'm glad you were able to diagnose and fix it so quickly! I can't promise my neighbors that they won't hear me screaming at wireshark for different reasons now though. 😊

mxmlnkn commented on August 19, 2024

I forgot to mention: you should force the index to be created anew with the -c option or simply delete the *.index.sqlite file because the calculated decompressed offsets are wrong.

rickhg12hs commented on August 19, 2024

With my example .tar.bz2 file, everything looks good now! Thanks!

I don't have space to fully expand the tar file, so I used checksums to verify integrity.

$ for f in $(find CTU-13-Dataset/ -type f); do sha256sum $f;tar -jxf CTU-13-Dataset.tar.bz2 --to-stdout $f | sha256sum; done

All checksums are identical!
