Comments (17)

mxmlnkn commented on June 18, 2024

> Hey, that's really neat, and it does seem to simplify the code too. In my limited testing, it seems to work fine. I can also try it on my 56 GB compressed / 1 TB uncompressed bzip2 file and compare read performance if you like?

It would definitely make me feel safer before I push a new 1.1.3 release. The automated test for readline in particular is very short for now, e.g., it does not cross bzip2 block boundaries because the test file is so small. However, maybe I don't need many tests there because I'm using the Python 3 default implementation, so as long as my readinto implementation is sufficiently tested, readline should also work in all cases.

> If you are interested in actually testing performance regularly, then for one of the main projects I'm involved with (https://github.com/tskit-dev/msprime) we have set up ASV (airspeed velocity) benchmarks for performance. I didn't set it up myself, though, and don't know how much of a hassle it was.

That sounds interesting, but I have no experience with that yet, so it might be a hassle to set up. The manylinux containers are already a hassle. I think I would first start with parallelizing because it is more fun. I already have some ideas for the design.

mxmlnkn commented on June 18, 2024

So, I have a first very rudimentary parallel version in the parallel branch, which can at least decode an input file (not yet stdin) once (no seeking support yet). My test was with tools/pbzcat.cpp. On my 12-core Ryzen 3900X, the comparison for a 60 MB bz2 file is 4 s parallel vs. 35 s serial, so roughly a speedup of 9. I guess it could be better, but it's still helpful. Let's see when I get this polished up and tested sufficiently to release it. I'm thinking about two different classes, so that bugs introduced by the parallel version cannot appear when the arguments specify only one core, i.e., the serial version.

mxmlnkn commented on June 18, 2024

I see. Then you are right: you would only profit from the parallelism when creating the block offsets, because you are already parallelizing the reading yourself.

mxmlnkn commented on June 18, 2024

I guess it would be useful. How would you do it? I think I need to read up on how to best implement something like io.BufferedReader; then the Python implementation should provide many things automatically, like peek, readline, readinto, ...

mxmlnkn commented on June 18, 2024

In 35fab99, I use BufferedReader. This actually simplifies the code a bit and might even improve performance because, for the most part, I only have to implement readinto, which does not even need to allocate memory for the returned bytes.
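
The pattern looks roughly like this (a minimal sketch, not the library's actual code; the decoder object and its decode_into method are hypothetical stand-ins for the C++ decoder):

import io

class RawBzip2Reader(io.RawIOBase):
    """Raw stream that only implements readinto; wrapping it in
    io.BufferedReader provides read, peek, readline, iteration, ... for free."""

    def __init__(self, decoder):
        self.decoder = decoder  # hypothetical: decode_into(buffer) -> number of bytes written

    def readable(self):
        return True

    def readinto(self, buffer):
        # Decompress directly into the caller-provided buffer, so no
        # intermediate bytes object has to be allocated for the result.
        return self.decoder.decode_into(buffer)

class NullDecoder:
    """Stand-in so the sketch runs; the real decoder would decompress bzip2."""
    def decode_into(self, buffer):
        return 0  # 0 signals EOF to the raw stream

reader = io.BufferedReader(RawBzip2Reader(NullDecoder()), buffer_size=1024 * 1024)
line = reader.readline()  # supplied by BufferedReader on top of readinto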

hyanwong commented on June 18, 2024

Hey, that's really neat, and it does seem to simplify the code too. In my limited testing, it seems to work fine. I can also try it on my 56 GB compressed / 1 TB uncompressed bzip2 file and compare read performance if you like?

If you are interested in actually testing performance regularly, then for one of the main projects I'm involved with (https://github.com/tskit-dev/msprime) we have set up ASV (airspeed velocity) benchmarks for performance. I didn't set it up myself, though, and don't know how much of a hassle it was.

hyanwong commented on June 18, 2024

Here's a very small test, using the code you posted at #5 (comment) but adjusted to read 100 MiB chunks. As you can see, in this small test, the new version is a fraction faster for a standard read(): roughly 2668 MiB/min in the old version vs. 2830 MiB/min in the new. I'll decode the entire 1.1 TB and see how long it takes with the new version.

(Added: obviously, differences could be due to load on the system, etc., but I think it can be concluded that the new version is not much slower, and possibly a little faster.)
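
The timing loop was essentially of the following form (a reconstruction, since the exact script is only linked above; the file name is a placeholder):

import time
import indexed_bzip2

with indexed_bzip2.IndexedBzip2File("large-file.bz2") as file:
    total = 0
    t0 = time.time()
    while True:
        data = file.read(100 * 1024 * 1024)  # 100 MiB chunks
        if not data:
            break
        total += len(data)
        print(f"Decoding {total / 1024**2} MiB took {time.time() - t0}s")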

Old version (current master):

Decoding 100.0 MiB took 2.1957297325134277s
Decoding 200.0 MiB took 4.4403698444366455s
Decoding 300.0 MiB took 6.67558741569519s
Decoding 400.0 MiB took 8.936365127563477s
Decoding 500.0 MiB took 11.18674635887146s
Decoding 600.0 MiB took 13.468021631240845s
Decoding 700.0 MiB took 15.733132362365723s
Decoding 800.0 MiB took 18.002384901046753s
Decoding 900.0 MiB took 20.24819779396057s
Decoding 1000.0 MiB took 22.476165533065796s
Decoding 1100.0 MiB took 24.65657949447632s
Decoding 1200.0 MiB took 26.926076650619507s
Decoding 1300.0 MiB took 29.167692184448242s
Decoding 1400.0 MiB took 31.430949211120605s
Decoding 1500.0 MiB took 33.70232009887695s
Decoding 1600.0 MiB took 35.97959613800049s
Decoding 1700.0 MiB took 38.25622272491455s
Decoding 1800.0 MiB took 40.533286809921265s
Decoding 1900.0 MiB took 42.78500437736511s
Decoding 2000.0 MiB took 45.04755663871765s
Decoding 2100.0 MiB took 47.32207798957825s
Decoding 2200.0 MiB took 49.559104681015015s
Decoding 2300.0 MiB took 51.791510581970215s
Decoding 2400.0 MiB took 54.030834674835205s
Decoding 2500.0 MiB took 56.256428480148315s
Decoding 2600.0 MiB took 58.480509757995605s
Decoding 2700.0 MiB took 60.72967195510864s
...

New version (develop branch):

Decoding 100.0 MiB took 2.2382559776306152s
Decoding 200.0 MiB took 4.418438911437988s
Decoding 300.0 MiB took 6.51515793800354s
Decoding 400.0 MiB took 8.634415626525879s
Decoding 500.0 MiB took 10.74658751487732s
Decoding 600.0 MiB took 12.888115167617798s
Decoding 700.0 MiB took 15.003849506378174s
Decoding 800.0 MiB took 17.126858949661255s
Decoding 900.0 MiB took 19.243683099746704s
Decoding 1000.0 MiB took 21.356130599975586s
Decoding 1100.0 MiB took 23.405311346054077s
Decoding 1200.0 MiB took 25.542148113250732s
Decoding 1300.0 MiB took 27.653825044631958s
Decoding 1400.0 MiB took 29.77428960800171s
Decoding 1500.0 MiB took 31.9023859500885s
Decoding 1600.0 MiB took 34.03745102882385s
Decoding 1700.0 MiB took 36.16895079612732s
Decoding 1800.0 MiB took 38.30633306503296s
Decoding 1900.0 MiB took 40.42850923538208s
Decoding 2000.0 MiB took 42.54588508605957s
Decoding 2100.0 MiB took 44.675206422805786s
Decoding 2200.0 MiB took 46.77887749671936s
Decoding 2300.0 MiB took 48.874287366867065s
Decoding 2400.0 MiB took 50.98031258583069s
Decoding 2500.0 MiB took 53.07324266433716s
Decoding 2600.0 MiB took 55.154353857040405s
Decoding 2700.0 MiB took 57.266836643218994s
Decoding 2800.0 MiB took 59.386892795562744s
Decoding 2900.0 MiB took 61.475019454956055s

hyanwong commented on June 18, 2024

And here's what it's like using 1 GiB chunks on the same file. I also ran this the other way round, so that the new (develop) version was run first, in case there was an order effect due to file caching or something. Again, the develop version is a fraction faster, but it is probably within the range of error.

New (develop) version:

Decoding 1000.0 MiB took 21.40721082687378s
Decoding 2000.0 MiB took 42.65827751159668s
Decoding 3000.0 MiB took 63.725568532943726s
Decoding 4000.0 MiB took 84.73346877098083s
Decoding 5000.0 MiB took 105.8410234451294s
Decoding 6000.0 MiB took 126.5337426662445s
Decoding 7000.0 MiB took 147.51092338562012s
Decoding 8000.0 MiB took 168.19302701950073s
Decoding 9000.0 MiB took 188.58807253837585s
Decoding 10000.0 MiB took 207.90019512176514s

Old (master) version:

Decoding 1000.0 MiB took 21.85849618911743s
Decoding 2000.0 MiB took 43.538970947265625s
Decoding 3000.0 MiB took 65.02622151374817s
Decoding 4000.0 MiB took 86.4333746433258s
Decoding 5000.0 MiB took 107.95172548294067s
Decoding 6000.0 MiB took 129.04518508911133s
Decoding 7000.0 MiB took 150.4252119064331s
Decoding 8000.0 MiB took 171.49680089950562s
Decoding 9000.0 MiB took 192.27367043495178s
Decoding 10000.0 MiB took 211.96346759796143s

hyanwong commented on June 18, 2024

I also tried decoding the file by iterating with readline() calls, both with the develop version of indexed_bzip2 and with the standard bz2 library, which also supports readline(). I interrupted both after roughly 60 seconds, which decoded over 2 GB; I guess that should be enough to cross bzip2 block boundaries. Both worked fine, and the byte totals for each line were the same, so no data seems to be going missing, but it looks like your version of readline() is faster!
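
The comparison was along these lines (a sketch, not the exact script; the file name is a placeholder):

import bz2
import time
import indexed_bzip2

for name, opener in (("indexed_bzip2", indexed_bzip2.IndexedBzip2File), ("bz2", bz2.BZ2File)):
    total = 0
    t0 = time.time()
    with opener("large-file.bz2") as file:
        # Iterate line by line until EOF (b"") or until ~60 seconds have passed.
        for line in iter(file.readline, b""):
            total += len(line)
            if time.time() - t0 > 60:
                break
    print(f"{name}.readline(): Decoding {total / 1024**2} MiB took {time.time() - t0}s")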

indexed_bzip2.readline(): Decoding 2597.463005065918 MiB took 60.501546144485474s
bz2.readline(): Decoding 2113.1387910842896 MiB took 60.48660755157471s

mxmlnkn commented on June 18, 2024

Thanks for all the tests, and especially the one comparing against the bz2 module; I hadn't really done such a comparison yet. It's a positively surprising result. And yes, 2 GiB is enough: bzip2 blocks are smaller than 1 MB, so 2 GiB crosses thousands of block boundaries.

hyanwong commented on June 18, 2024

> Thanks for all the tests, and especially the one comparing against the bz2 module; I hadn't really done such a comparison yet. It's a positively surprising result. And yes, 2 GiB is enough: bzip2 blocks are smaller than 1 MB, so 2 GiB crosses thousands of block boundaries.

No problem. I think that the variation may be within the error caused by disk access, etc. FWIW, decoding the full 1.1 TB was a fraction faster too (develop branch), at roughly 3008 MiB/min:

Decoding 1145213.418343544 MiB took 22837.333334684372s

hyanwong commented on June 18, 2024

Personally, I would only use the parallel version to get the offsets. Once I have those, the lines in my bzip2 file are independent, so my idea is to divide the decoding itself among multiple workers, each of which opens the same file independently but reads a different subset of the lines in the file. A rough version looks like this (not tested):

import math

import indexed_bzip2


class PartialBzip2(indexed_bzip2.IndexedBzip2File):
    """
    Open a bzip2 file, with the ability to only return certain lines if an index is given
    """
    def __init__(self, filename, block_offsets=None, start_after_byte=None, stop_including_byte=None):
        """
        We assume that the block_offsets are correct for the bzip2 file, as it is expensive
        to check. The returned object has readline() functionality that returns any whole
        lines that start after byte position `start_after_byte`, up to and including the
        line that contains `stop_including_byte`. To include the first line, set
        `start_after_byte` to -1 or None (if it is set to 0, the first line will be
        skipped).
        """
        self.start_after_byte = -1 if start_after_byte is None else start_after_byte
        self.stop_including_byte = math.inf if stop_including_byte is None else stop_including_byte
        if block_offsets is None and self.start_after_byte >= 0:
            raise ValueError("Can't set a start point if no index is given")
        super().__init__(filename)
        if block_offsets is not None:
            self.set_block_offsets(block_offsets)
        self.reset()

    def reset(self):
        if self.start_after_byte >= 0:
            self.seek(self.start_after_byte)
            super().readline()  # skip the partial line and advance to the first whole line
        elif self.tell() > 0:
            self.seek(0)

    def readline(self):
        if self.tell() <= self.stop_including_byte:
            return super().readline()
        return b""

Used like this:

import itertools
import multiprocessing
import pickle


def run_partial(params):
    # imap_unordered passes each parameter tuple as a single argument, so unpack it here.
    filename, block_offsets, start, end = params
    with PartialBzip2(filename, block_offsets, start, end) as partial_file:
        for line in partial_file:
            ...  # do stuff with each line
    return stuff  # whatever was accumulated in the loop above


with open(offsets_filename, "rb") as f:
    block_offsets = pickle.load(f)
n_chunks = 40  # or however many workers you want
chunksize = max(block_offsets.values()) / n_chunks
divs = [None] + [a * chunksize for a in range(1, n_chunks)] + [math.inf]

params_iter = zip(
    itertools.repeat(filename),
    itertools.repeat(block_offsets),
    divs[:-1],  # start_after_byte
    divs[1:],   # stop_including_byte
)
with multiprocessing.Pool(processes=n_chunks) as pool:
    for ret_val in pool.imap_unordered(run_partial, params_iter):
        ...  # do stuff with ret_val

hyanwong commented on June 18, 2024

I know you are busy with parallelisation, but it would be great if the readline() ability could be merged into the master branch soon?

mxmlnkn commented on June 18, 2024

I'll do it this weekend. Pushing to master should be easier than publishing a new PyPI version, for which I would want to do more checking first.

hyanwong commented on June 18, 2024

Thanks. It's useful to be able to do pip install git+https://github.com/mxmlnkn/indexed_bzip2/ without having to specify a bespoke branch.

mxmlnkn commented on June 18, 2024

I double-checked, added the other suggested methods, and pushed everything to master.

hyanwong commented on June 18, 2024

Great, thanks so much!
