mxmlnkn / indexed_bzip2
Fast parallel random access to bzip2 and gzip files in Python
License: Apache License 2.0
Hi,
Thanks for your package, it runs perfectly on macOS.
However, I have trouble installing the package on Windows. Here is what I did: I added the '-DMS_WIN64' flag in setup.py to avoid a compilation error. The build then succeeded, but when I import the module in Python, it shows:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: DLL load failed while importing indexed_bzip2: The specified module could not be found.
I wonder whether there are any dependencies missing on Windows, or what else could cause the trouble here? Thanks in advance!
I'd like to save the block offsets in a "index file" and read this back in before actually seeking into a zip bzip2 file. Would it be possible to add an example of doing this? E.g. can a blockOffsets object be pickled and saved somehow?
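A minimal sketch of what such an example could look like, assuming the offsets are exposed as a plain picklable dict; the method names `block_offsets()` and `set_block_offsets()` below are assumptions for illustration, not confirmed API:

```python
import os
import pickle
import tempfile

# Stand-in for the offsets of an IndexedBzip2File: assumed to be a plain dict
# mapping compressed offsets (in bits) to uncompressed offsets (in bytes).
# offsets = bz2reader.block_offsets()   # hypothetical method name
offsets = {0: 0, 7_200_000: 900_000}

index_path = os.path.join(tempfile.gettempdir(), "huge.bz2.index")

# Save the index to a sidecar file ...
with open(index_path, "wb") as f:
    pickle.dump(offsets, f)

# ... and load it back later to skip the expensive full first pass.
with open(index_path, "rb") as f:
    restored = pickle.load(f)
# bz2reader.set_block_offsets(restored)  # hypothetical method name

assert restored == offsets
```

As long as the offsets are a dict of plain integers, any serialization format works; pickle is just the shortest to write down.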
Sorry to keep opening issues! Is there any way to check whether an IndexedBzip2File has had its block offsets calculated yet? If it takes 8 hours to calculate them for my huge file, it would be worth checking before running, e.g., a seek(), and outputting a warning that the operation might take some time.
Consider:
# File with gigabytes of data.
bz2reader.read( 100 )  # Decodes only the first 100 bytes.
bz2reader.seek( 10 )   # Triggers building the full index!
bz2reader.read( 10 )
That seek would trigger building the whole index, which can take hours for gigabytes of data even though we actually already have all the necessary information to seek back to position 10.
In the current design this is hard to do because the "first read", which creates the block offset list, is special. For example, there is a counter for the number of decoded bytes, and the block and stream CRCs, which are calculated during that first read, would become wrong when seeking back a bit.
I assume that, if the block offsets have been calculated, it is very fast to get the total number of bytes in a bzip2 file using e.g. file.seek(0, 2); file.tell()? Is this the fastest way to get the uncompressed number of bytes, and might it be useful to add a method to the IndexedBzip2File class that returns the uncompressed size directly?
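The seek-to-end idiom works on any seekable file-like object, so the same pattern can be demonstrated with the standard library's bz2.BZ2File, which also supports seeking (albeit by re-decompressing forward):

```python
import bz2
import io

# Compress a known payload entirely in memory.
payload = b"x" * 123_456
reader = bz2.BZ2File(io.BytesIO(bz2.compress(payload)))

reader.seek(0, io.SEEK_END)  # io.SEEK_END == 2, as in file.seek(0, 2)
uncompressed_size = reader.tell()

assert uncompressed_size == len(payload)
```

With a prebuilt block-offset index, the same two calls would be nearly free, since the decoder could jump straight to the last block instead of re-decompressing everything.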
In order to read from faulty media, it might be helpful not to just "crash" with an exception when a bad CRC is encountered.
Currently I simply quit, because when a CRC is wrong I can no longer be sure that the next bits belong to the next block.
However, for the parallel version I added a block finder, which can search for the magic bit strings that start bzip2 blocks.
I could use that to recover from bad blocks.
Incidentally, this is not the only new feature that I get out of the box from the parallelized design.
How should error reporting work then, a simple message to stderr?
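To illustrate the block-finder idea: each bzip2 block starts with the 48-bit magic value 0x314159265359, which can sit at an arbitrary bit offset, so recovery amounts to scanning the stream for that bit pattern. A naive, unoptimized sketch of such a scan (the real block finder is of course much faster):

```python
BZIP2_BLOCK_MAGIC = 0x314159265359  # 48-bit magic at the start of every block

def find_block_magic(data: bytes, start_bit: int = 0):
    """Return the first bit offset >= start_bit where the magic occurs, else None."""
    value = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << 48) - 1
    for offset in range(start_bit, total_bits - 48 + 1):
        # Extract the 48 bits starting at this bit offset and compare.
        if (value >> (total_bits - 48 - offset)) & mask == BZIP2_BLOCK_MAGIC:
            return offset
    return None

# Embed the magic at bit offset 4 of a 7-byte buffer (4 prefix bits, 4 trailing bits).
buffer = ((0xA << 52) | (BZIP2_BLOCK_MAGIC << 4) | 0x5).to_bytes(7, "big")
assert find_block_magic(buffer) == 4
```

After a CRC failure, such a scan could resynchronize on the next block boundary and continue decoding, trading a possibly corrupt block for the rest of the file.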
E.g. use afl (American Fuzzy Lop) for fuzz testing the decoder.
It seems that seeking inside lz4 should be easily possible in the same way as for gzip. The only difference is that the lz4 window size is 64 KiB instead of 32 KiB. Supporting the space-saving ideas mentioned in mxmlnkn/rapidgzip#17 would help keep the index size down. In contrast to gzip, it should be easy to support creating windows inside an lz4 block, even though lz4 block sizes are supposed to be limited to 4 MiB for wider support; in general, they can be arbitrarily large (64-bit size).
Seeking inside large lz4 files would be nice for ratarmount and other applications. Specialized subset formats already make it possible, but that imposes requirements on the compressor used. The same applies to the lz4 frame content-size flag, which is only optional.
Parallel decompression is another matter entirely.
Also add lzo support.
Things to parallelize:
Hello,
if this package is installed as a dependency of ratarmount, the extension loading chain causes a weird abort. This does not seem to happen under an Anaconda environment on Ubuntu 20.04, and I don't know whether that is related to Anaconda or to Ubuntu 20.04. I can reproduce it with an internal codebase, but can't come up with an MWE by simply importing ratarmountcore or indexed_bzip2 from the interpreter.
Any idea on what could be the reason?
Thanks!
I was investigating a bug that someone reported in indexed_zstd and I realized the same issue also happens in indexed_bzip2.
Here is a simple script that reproduces the problem: IndexedBzip2FileRaw works without any problem, while IndexedBzip2File seems to fail.
import bz2
from tempfile import NamedTemporaryFile

import numpy as np
from indexed_bzip2 import IndexedBzip2File
from indexed_bzip2 import IndexedBzip2FileRaw

# Write a bzip2-compressed .npy file.
file = NamedTemporaryFile()
A = np.random.random((100, 100))
handler = bz2.open(file.name, "wb")
np.save(handler, A)
handler.close()

# Reading it back through IndexedBzip2File fails ...
f = IndexedBzip2File(file.name)
# ... while IndexedBzip2FileRaw works:
# f = IndexedBzip2FileRaw(file.name)
B = np.load(f)
assert np.array_equal(A, B)
The Traceback is:
Traceback (most recent call last):
File "/tmp/test.py", line 16, in <module>
B = np.load(f)
File "/usr/lib/python3/dist-packages/numpy/lib/npyio.py", line 439, in load
return format.read_array(fid, allow_pickle=allow_pickle,
File "/usr/lib/python3/dist-packages/numpy/lib/format.py", line 771, in read_array
array.shape = shape
ValueError: cannot reshape array of size 9592 into shape (100,100)
I have yet to find a fix for indexed_zstd, but since the problem occurs in both libraries, something deep in the C++ code can be ruled out.
This is more or less a prerequisite for #1, as that isn't fun to do with C++11.
It would be really nice to be able to mount single files in a directory so that the file is exposed decompressed, without it having to be a tar.
Would it be difficult to support IndexedBzip2File.readline? I can roll my own version, but perhaps it would be useful to incorporate it into the main library?
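For reference, a roll-your-own readline can be a thin buffering wrapper around any binary reader that exposes only read(); this is a generic sketch, not part of the indexed_bzip2 API:

```python
import io

class ReadlineWrapper:
    """Adds readline() on top of any binary reader exposing read()."""

    def __init__(self, raw, chunk_size=64 * 1024):
        self._raw = raw
        self._buffer = b""
        self._chunk_size = chunk_size

    def readline(self):
        # Refill the buffer until it contains a newline or the stream ends.
        while b"\n" not in self._buffer:
            chunk = self._raw.read(self._chunk_size)
            if not chunk:  # EOF: return whatever is left.
                line, self._buffer = self._buffer, b""
                return line
            self._buffer += chunk
        line, _, self._buffer = self._buffer.partition(b"\n")
        return line + b"\n"

# Usage with any file-like object, e.g. an in-memory stream:
reader = ReadlineWrapper(io.BytesIO(b"first line\nsecond line\n"))
assert reader.readline() == b"first line\n"
assert reader.readline() == b"second line\n"
assert reader.readline() == b""
```

Reading in large chunks keeps the number of calls into the underlying decoder low, which matters when each read() crosses the Python/C++ boundary.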
Hi @mxmlnkn
I have prepared the recipe for staging.
Could you confirm in the PR here conda-forge/staged-recipes#21258 that you agree to be a maintainer?
Cheers,
Andreas 😃