mxmlnkn / indexed_bzip2
Fast parallel random access to bzip2 and gzip files in Python
License: Apache License 2.0
Hi,
Thanks for your package, it runs perfectly on macOS.
However, I have trouble installing the package on Windows. Here is what I did: I added the '-DMS_WIN64' flag in setup.py to avoid a compilation error. The build then succeeded, but when I import the module in Python, it shows:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: DLL load failed while importing indexed_bzip2: The specified module could not be found.
I wonder whether there are any dependencies missing on Windows, or what else could cause the trouble here? Thanks in advance!
I'd like to save the block offsets in a "index file" and read this back in before actually seeking into a zip bzip2 file. Would it be possible to add an example of doing this? E.g. can a blockOffsets object be pickled and saved somehow?
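A minimal sketch of what such an example could look like, assuming the offsets are exposed as a plain picklable dict; the method names `block_offsets()` and `set_block_offsets()` below are assumptions for illustration, not confirmed API:

```python
import os
import pickle
import tempfile

# Stand-in for the offsets of an IndexedBzip2File: assumed to be a plain dict
# mapping compressed offsets (in bits) to uncompressed offsets (in bytes).
# offsets = bz2reader.block_offsets()   # hypothetical method name
offsets = {0: 0, 7_200_000: 900_000}

index_path = os.path.join(tempfile.gettempdir(), "huge.bz2.index")

# Save the index to a sidecar file ...
with open(index_path, "wb") as f:
    pickle.dump(offsets, f)

# ... and load it back later to skip the expensive full first pass.
with open(index_path, "rb") as f:
    restored = pickle.load(f)
# bz2reader.set_block_offsets(restored)  # hypothetical method name

assert restored == offsets
```

As long as the offsets are a dict of plain integers, any serialization format works; pickle is just the shortest to write down.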
Sorry to keep opening issues! Is there any way to check whether an IndexedBzip2File has had its block offsets calculated yet? If it takes 8 hours to calculate them for my huge file, it would be worth checking before running, e.g., a seek(), and outputting a warning that the operation might take some time.
Consider:
# File with gigabytes of data.
bz2reader.read( 100 )  # Decodes only the first 100 bytes.
bz2reader.seek( 10 )   # Triggers building the full index!
bz2reader.read( 10 )
That seek would trigger building the whole index, which can take hours for gigabytes of data even though we actually already have all the necessary information to seek back to position 10.
In the current design this is hard to do because the "first read", which creates the block offset list, is special. For example, there is a counter for the number of decoded bytes, and the block and stream CRCs, which are calculated during that first read, would become wrong when seeking back a bit.
I assume that, if the block offsets have been calculated, it is very fast to get the total number of bytes in a bzip2 file using e.g. file.seek(0, 2); file.tell()? Is this the fastest way to get the uncompressed number of bytes, and might it be useful to add a method to the IndexedBzip2File class that returns the uncompressed size directly?
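The seek-to-end idiom works on any seekable file-like object, so the same pattern can be demonstrated with the standard library's bz2.BZ2File, which also supports seeking (albeit by re-decompressing forward):

```python
import bz2
import io

# Compress a known payload entirely in memory.
payload = b"x" * 123_456
reader = bz2.BZ2File(io.BytesIO(bz2.compress(payload)))

reader.seek(0, io.SEEK_END)  # io.SEEK_END == 2, as in file.seek(0, 2)
uncompressed_size = reader.tell()

assert uncompressed_size == len(payload)
```

With a prebuilt block-offset index, the same two calls would be nearly free, since the decoder could jump straight to the last block instead of re-decompressing everything.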
In order to read from faulty media, it might be helpful not to just "crash" with an exception when a bad CRC is encountered.
Currently I simply quit, because when a CRC is wrong I can no longer be sure that the next bits belong to the next block.
However, for the parallel version I added a block finder, which can search for the magic bit strings that start bzip2 blocks.
I could use that to recover from bad blocks.
Incidentally, this is not the only new feature that I get out of the box from the parallelized design.
How should error reporting work then, a simple message to stderr?
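To illustrate the block-finder idea: each bzip2 block starts with the 48-bit magic value 0x314159265359, which can sit at an arbitrary bit offset, so recovery amounts to scanning the stream for that bit pattern. A naive, unoptimized sketch of such a scan (the real block finder is of course much faster):

```python
BZIP2_BLOCK_MAGIC = 0x314159265359  # 48-bit magic at the start of every block

def find_block_magic(data: bytes, start_bit: int = 0):
    """Return the first bit offset >= start_bit where the magic occurs, else None."""
    value = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << 48) - 1
    for offset in range(start_bit, total_bits - 48 + 1):
        # Extract the 48 bits starting at this bit offset and compare.
        if (value >> (total_bits - 48 - offset)) & mask == BZIP2_BLOCK_MAGIC:
            return offset
    return None

# Embed the magic at bit offset 4 of a 7-byte buffer (4 prefix bits, 4 trailing bits).
buffer = ((0xA << 52) | (BZIP2_BLOCK_MAGIC << 4) | 0x5).to_bytes(7, "big")
assert find_block_magic(buffer) == 4
```

After a CRC failure, such a scan could resynchronize on the next block boundary and continue decoding, trading a possibly corrupt block for the rest of the file.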
E.g. use afl (American Fuzzy Lop) for fuzz testing the decoder.
It seems that seeking inside lz4 should be easily possible in the same way as for gzip. The only difference is that the lz4 window size is 64 KiB instead of 32 KiB. Supporting the space-saving ideas mentioned in mxmlnkn/rapidgzip#17 would help keep the index size down. In contrast to gzip, it should be easy to support creating windows inside an lz4 block, even though lz4 block sizes are supposed to be limited to 4 MiB for wider support; in general, they can be arbitrarily large (64-bit size).
Seeking inside large lz4 files would be nice for ratarmount and other applications. Specialized subset formats already make it possible, but that imposes requirements on the compressor used. The same applies to the lz4 frame content-size flag, which is only optional.
Parallel decompression is another matter entirely.
Also add lzo support.
Things to parallelize:
Hello,
if this package is installed as a dependency of ratarmount, the extension loading chain causes a weird abort. This does not seem to happen under an Anaconda environment on Ubuntu 20.04, and I don't know whether that is related to Anaconda or to Ubuntu 20.04. I can reproduce it with an internal codebase, but can't come up with an MWE by simply importing ratarmountcore or indexed_bzip2 from the interpreter.
Any idea on what could be the reason?
Thanks!
I was investigating a bug that someone reported in indexed_zstd and I realized the same issue also happens in indexed_bzip2.
Here is a simple script that reproduces the problem: IndexedBzip2FileRaw works without any problem, while IndexedBzip2File seems to fail.
import bz2
from tempfile import NamedTemporaryFile

import numpy as np
from indexed_bzip2 import IndexedBzip2File
from indexed_bzip2 import IndexedBzip2FileRaw

# Write a bzip2-compressed .npy file.
file = NamedTemporaryFile()
A = np.random.random((100, 100))
handler = bz2.open(file.name, "wb")
np.save(handler, A)
handler.close()

# Reading it back through IndexedBzip2File fails ...
f = IndexedBzip2File(file.name)
# ... while IndexedBzip2FileRaw works:
# f = IndexedBzip2FileRaw(file.name)
B = np.load(f)
assert np.array_equal(A, B)
The Traceback is:
Traceback (most recent call last):
File "/tmp/test.py", line 16, in <module>
B = np.load(f)
File "/usr/lib/python3/dist-packages/numpy/lib/npyio.py", line 439, in load
return format.read_array(fid, allow_pickle=allow_pickle,
File "/usr/lib/python3/dist-packages/numpy/lib/format.py", line 771, in read_array
array.shape = shape
ValueError: cannot reshape array of size 9592 into shape (100,100)
I have yet to find a fix for indexed_zstd, but since the problem occurs in both libraries, something deep in the C++ code can be ruled out.
This is more or less a prerequisite for #1, as that isn't fun to do with C++11.
It would be really nice to be able to mount single files in a directory so that the file is exposed decompressed, without it having to be a tar.
Would it be difficult to support IndexedBzip2File.readline? I can roll my own version, but perhaps it would be useful to incorporate it into the main library?
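For reference, a roll-your-own readline can be a thin buffering wrapper around any binary reader that exposes only read(); this is a generic sketch, not part of the indexed_bzip2 API:

```python
import io

class ReadlineWrapper:
    """Adds readline() on top of any binary reader exposing read()."""

    def __init__(self, raw, chunk_size=64 * 1024):
        self._raw = raw
        self._buffer = b""
        self._chunk_size = chunk_size

    def readline(self):
        # Refill the buffer until it contains a newline or the stream ends.
        while b"\n" not in self._buffer:
            chunk = self._raw.read(self._chunk_size)
            if not chunk:  # EOF: return whatever is left.
                line, self._buffer = self._buffer, b""
                return line
            self._buffer += chunk
        line, _, self._buffer = self._buffer.partition(b"\n")
        return line + b"\n"

# Usage with any file-like object, e.g. an in-memory stream:
reader = ReadlineWrapper(io.BytesIO(b"first line\nsecond line\n"))
assert reader.readline() == b"first line\n"
assert reader.readline() == b"second line\n"
assert reader.readline() == b""
```

Reading in large chunks keeps the number of calls into the underlying decoder low, which matters when each read() crosses the Python/C++ boundary.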
Hi @mxmlnkn
I have prepared the recipe for staging.
Could you confirm in the PR here conda-forge/staged-recipes#21258 that you agree to be a maintainer?
Cheers,
Andreas 😃