Comments (6)

mxmlnkn commented on August 19, 2024

The progress messages are broken for compressed archives because the progress is basically calculated from the position in the decompressed stream relative to the compressed file size: I don't have the decompressed file size yet, and computing it beforehand would add too much overhead. I could, however, try to estimate the uncompressed file size by scaling the compressed file size up proportionally, based on the current positions in the compressed and decompressed streams.
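
As a rough sketch of that estimate (hypothetical names, not ratarmount's actual code):

# Assume the compression ratio seen so far also holds for the rest of the file.
def estimate_uncompressed_size(compressed_pos, decompressed_pos, compressed_size):
    if compressed_pos == 0:
        return compressed_size  # nothing decoded yet, fall back to the compressed size
    return compressed_size * decompressed_pos / compressed_pos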

Read errors should never happen, but they might be more likely with compressed files because the bz2 module is rather new. I'll take a look at it soon.

rickhg12hs commented on August 19, 2024

The progress messages are broken for compressed archives because the progress is basically calculated from the position in the decompressed stream relative to the compressed file size: I don't have the decompressed file size yet, and computing it beforehand would add too much overhead. I could, however, try to estimate the uncompressed file size by scaling the compressed file size up proportionally, based on the current positions in the compressed and decompressed streams.

Maybe for compressed archives a simple ratio of compressed bytes read to total compressed size would be OK.
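
Something like this minimal sketch, just to illustrate the suggestion (hypothetical names):

# Report progress as compressed bytes read over the total compressed file size.
def progress(compressed_pos, compressed_size):
    return compressed_pos / compressed_size  # 0.0 .. 1.0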

mxmlnkn commented on August 19, 2024

I can reproduce the file difference problem, and it's related to the custom bz2 decoder... I'm glad you found that and sorry for the time it cost you to find the issue. That's unfortunately what happens when modifying free versions of bzip2 decoders instead of using the canonical version. But I had to modify the one I based mine on quite a lot to add seeking support and even buffered output support (as opposed to writing all data in one go directly to file descriptors).

When comparing the hexdumps, it's visible that the decoder fails to output some characters in sequences of repeated characters.

diff <( hexdump -C CTU-13-Dataset/11/*.pcap ) \
     <( hexdump -C CTU-13-Dataset.mounted/CTU-13-Dataset/11/*.pcap ) > CTU-13-11-bz2-bug.diff
14309719,14381940c14309719,14380586
< 8bd41150  e3 e3 e3 e3 e3 e3 e3 e3  e3 e3 e3 73 19 4d 4e 8c  |...........s.MN.|
< 8bd41160  94 04 00 2a 04 00 00 2a  04 00 00 00 1e 49 db 19  |...*...*.....I..|
< 8bd41170  c3 08 00 27 b5 b7 19 08  00 45 00 04 1c 01 00 00  |...'.....E......|
< 8bd41180  00 80 01 5a b6 93 20 54  a5 93 20 60 45 ab 5a 00  |...Z.. T.. `E.Z.|
< 8bd41190  00 98 00 01 00 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |................|
< 8bd411a0  e1 e1 e1 e1 e1 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |................|
< *
< 8bd41590  e1 e1 e1 e1 e1 73 19 4d  4e 8f 94 04 00 2a 04 00  |.....s.MN....*..|
< 8bd415a0  00 2a 04 00 00 00 1e 49  db 19 c3 08 00 27 e2 09  |.*.....I.....'..|
< 8bd415b0  2d 08 00 45 00 04 1c 01  00 00 00 80 01 5a 9c 93  |-..E.........Z..|
< 8bd415c0  20 54 bf 93 20 60 45 c3  d0 00 00 6a 00 01 00 35  | T.. `E....j...5|
< 8bd415d0  35 35 35 35 35 35 35 35  35 35 35 35 35 35 35 35  |5555555555555555|
---
> 8bd41120  e3 e3 e3 e3 e3 e3 e3 e3  73 19 4d 4e 8c 94 04 00  |........s.MN....|
> 8bd41130  2a 04 00 00 2a 04 00 00  00 1e 49 db 19 c3 08 00  |*...*.....I.....|
> 8bd41140  27 b5 b7 19 08 00 45 00  04 1c 01 00 00 00 80 01  |'.....E.........|
> 8bd41150  5a b6 93 20 54 a5 93 20  60 45 ab 5a 00 00 98 00  |Z.. T.. `E.Z....|
> 8bd41160  01 00 e1 e1 e1 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |................|
> 8bd41170  e1 e1 e1 e1 e1 e1 e1 e1  e1 e1 e1 e1 e1 e1 e1 e1  |................|
> *
> 8bd41560  e1 e1 73 19 4d 4e 8f 94  04 00 2a 04 00 00 2a 04  |..s.MN....*...*.|
> 8bd41570  00 00 00 1e 49 db 19 c3  08 00 27 e2 09 2d 08 00  |....I.....'..-..|
> 8bd41580  45 00 04 1c 01 00 00 00  80 01 5a 9c 93 20 54 bf  |E.........Z.. T.|
> 8bd41590  93 20 60 45 c3 d0 00 00  6a 00 01 00 35 35 35 35  |. `E....j...5555|
> 8bd415a0  35 35 35 35 35 35 35 35  35 35 35 35 35 35 35 35  |5555555555555555|

In the above case three e3 bytes are missing and the rest is shifted accordingly.

It seems like my unit tests don't cover repeated sequences well enough because they use random data. After adding some tests with variable-length sequences of repeated characters, I can reproduce the bug in the tests.
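
A rough sketch of the kind of test input that triggers it (not the actual unit test code):

import bz2
import os
import random

# Mix random bytes with runs of repeated characters of varying lengths;
# purely random data rarely exercises bzip2's run-length encoding paths.
def make_test_data(num_chunks=1000):
    chunks = []
    for _ in range(num_chunks):
        if random.random() < 0.5:
            chunks.append(os.urandom(random.randint(1, 64)))
        else:
            chunks.append(bytes([random.randrange(256)]) * random.randint(1, 600))
    return b''.join(chunks)

original = make_test_data()
compressed = bz2.compress(original)
# In the real test, 'compressed' would be decoded with the custom seekable bz2
# decoder and compared byte-for-byte against 'original'; the reference module
# round-trip below just sanity-checks the generator.
assert bz2.decompress(compressed) == original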

The problem was that some characters were not flushed out when the internal decoding buffer was empty but the output buffer was not.
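
Roughly, the failure mode looks like this toy sketch (simplified, not the actual decoder code):

# Stopping as soon as the decode source runs dry drops whatever still sits in
# the output buffer; the read loop must drain the output buffer before exiting.
def read_all(decoded_blocks):
    output_buffer = bytearray()
    result = bytearray()
    blocks = iter(decoded_blocks)
    while True:
        if output_buffer:                 # flush already-decoded bytes first
            result += output_buffer
            output_buffer.clear()
            continue
        block = next(blocks, None)
        if block is None:                 # stop only once BOTH buffers are empty
            break
        output_buffer += block            # refill from the (simulated) decoder
    return bytes(result)

assert read_all([b'abc', b'\xe3' * 11]) == b'abc' + b'\xe3' * 11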

I tested the new version on the full CTU-13 dataset with:

ratarmount CTU-13-Dataset.tar.bz2 CTU-13-Dataset.mounted
tar -xf CTU-13-Dataset.tar.bz2
diff -r CTU-13-Dataset/ CTU-13-Dataset.mounted/CTU-13-Dataset/

and it seems to work.

Version 0.3.3 is up. I hope it works now.

rickhg12hs commented on August 19, 2024

Thank you! I will use/test it later today.

ratarmount is orders of magnitude faster than archivemount. I really appreciate the speed-up, and I simply don't have the drive space for all the data if I had to expand it.

I'm glad you found that and sorry for the time it cost you to find the issue.

I'm glad you were able to diagnose and fix it so quickly! I can't promise my neighbors that they won't hear me screaming at wireshark for different reasons now though. 😊

mxmlnkn commented on August 19, 2024

I forgot to mention: you should force the index to be created anew with the -c option or simply delete the *.index.sqlite file because the calculated decompressed offsets are wrong.

rickhg12hs commented on August 19, 2024

With my example .tar.bz2 file, everything looks good now! Thanks!

I don't have space to fully expand the tar file, so I used checksums to verify integrity.

$ for f in $(find CTU-13-Dataset/ -type f); do sha256sum $f;tar -jxf CTU-13-Dataset.tar.bz2 --to-stdout $f | sha256sum; done

All checksums are identical!
