python-idzip

Seekable, gzip compatible, compression format

Gzip allows extra fields to be stored in the gzip header. Idzip stores the offsets needed for efficient seeking there.
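As a rough illustration of where such data can live, the sketch below builds a gzip member with one extra subfield and reads it back, following the header layout in RFC 1952. The subfield ID `ZZ` and its payload are made up for the example; they are not the identifiers idzip actually writes, and the helper names are hypothetical.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04  # FLG bit that signals an extra field (RFC 1952)


def build_member(payload, subfield_id, subfield_data):
    """Build one gzip member that carries a single extra subfield."""
    # Subfield layout: 2-byte ID, 2-byte little-endian length, then the data.
    extra = subfield_id + struct.pack("<H", len(subfield_data)) + subfield_data
    # Fixed 10-byte header: magic, CM=8 (deflate), FLG, MTIME, XFL, OS=255 (unknown).
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255)
    header += struct.pack("<H", len(extra)) + extra
    # Negative wbits produces a raw deflate stream without a zlib wrapper.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    body = compressor.compress(payload) + compressor.flush()
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF,
                          len(payload) & 0xFFFFFFFF)
    return header + body + trailer


def read_extra_subfields(data):
    """Return {subfield_id: subfield_data} from the first gzip member in data."""
    magic1, magic2, _method, flags = struct.unpack_from("<BBBB", data, 0)
    if (magic1, magic2) != (0x1F, 0x8B):
        raise ValueError("not a gzip member")
    subfields = {}
    if not flags & FEXTRA:
        return subfields
    (xlen,) = struct.unpack_from("<H", data, 10)
    pos, end = 12, 12 + xlen
    while pos < end:
        sub_id = data[pos:pos + 2]
        (length,) = struct.unpack_from("<H", data, pos + 2)
        subfields[sub_id] = data[pos + 4:pos + 4 + length]
        pos += 4 + length
    return subfields


payload = b"hello idzip " * 100
member = build_member(payload, b"ZZ", struct.pack("<HH", 1, 2))
assert gzip.decompress(member) == payload  # still readable by plain gzip
print(read_extra_subfields(member))
```

The extra field is skipped by ordinary gzip readers, which is why a file carrying it stays gzip compatible.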

Install

python setup.py install

or

[python-idzip RHEL6 signed RPM](http://pkgs.bauman.in/repoview/python-idzip.html)

Acknowledgement

Based on https://code.google.com/p/idzip/

The file format was designed by Rik Faith for dictzip. Idzip just uses multiple gzip members to have no file size limit.
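The multi-member idea can be seen with the standard library alone: concatenated gzip members still form a valid gzip stream. This is only a sketch of that property using Python's `gzip` module, not of how idzip lays out its members.

```python
import gzip

# Two independently compressed members...
first = gzip.compress(b"first member\n")
second = gzip.compress(b"second member\n")

# ...concatenated into one stream.
combined = first + second

# gzip.decompress reads every member and joins the results.
restored = gzip.decompress(combined)
assert restored == b"first member\nsecond member\n"
```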

Idzip means Improved Dictzip.

This version adds a Writer class.

Sizing

Downloaded http://textfiles.com/stories/bureau.txt and concatenated several copies, up to 20GB, into input.txt.

gzfile was generated using standard gzip.

dzfile was generated using this library.

    total 50172612
    -rw-rw-r--. 1 dan dan 21313751280 May 10 15:58 input.txt
    -rw-rw-r--. 1 dan dan  8576570661 May 10 17:21 dzfile.txt.dz
    -rw-rw-r--. 1 dan dan  8076548622 May 10 16:28 gzfile.txt.gz

The idzip file is about 6% larger than the standard gzip file, so the size is almost the same.
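The comparison can be checked with a little arithmetic on the byte counts from the listing above. The variable names are only for illustration.

```python
# Byte counts copied from the directory listing.
input_size = 21313751280
dz_size = 8576570661
gz_size = 8076548622

dz_ratio = dz_size / input_size          # idzip output relative to the input
gz_ratio = gz_size / input_size          # gzip output relative to the input
overhead = (dz_size - gz_size) / gz_size  # extra size of idzip over gzip

print(f"idzip output: {dz_ratio:.1%} of input")
print(f"gzip output:  {gz_ratio:.1%} of input")
print(f"idzip is {overhead:.1%} larger than gzip")
```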

Seek Timing

    from time import time

    seekpos = 21313751280 - 15  # 15 bytes before the end of input.txt

    # Baseline: seek in the uncompressed file
    start = time()
    original = open("/home/dan/ziptest/input.txt", "rb")
    original.seek(seekpos)
    original.close()
    print("Raw Seek to end", time() - start, "seconds")

    # Standard gzip has to decompress everything before seekpos
    import gzip
    start = time()
    verify = gzip.open("/home/dan/ziptest/gzfile.txt.gz", "rb")
    verify.seek(seekpos)
    verify.close()
    print("Standard GZIP Seek to end", time() - start, "seconds")

    # idzip uses the offsets stored in the gzip header
    import idzip
    start = time()
    verify = idzip.open("/home/dan/ziptest/input.txt.dz")
    verify.seek(seekpos)
    verify.close()
    print("idzip Seek to end", time() - start, "seconds")

Output:

    Raw Seek to end 0.000866889953613 seconds
    Standard GZIP Seek to end 255.133864164 seconds
    idzip Seek to end 0.0381989479065 seconds
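The gap between the gzip and idzip timings comes from being able to jump straight to the compressed chunk that holds the target position. The sketch below shows that general technique with plain `zlib`: data is compressed in independent fixed-size chunks and an offset table is kept alongside. It is a simplified illustration under assumed names (`CHUNK_LEN`, `compress_chunks`, `read_at`), not idzip's actual on-disk format or code.

```python
import zlib
from itertools import accumulate

CHUNK_LEN = 1024  # hypothetical uncompressed chunk size


def compress_chunks(data, chunk_len=CHUNK_LEN):
    """Compress data chunk by chunk; return the blob and chunk start offsets."""
    chunks = [zlib.compress(data[i:i + chunk_len])
              for i in range(0, len(data), chunk_len)]
    # offsets[i] is where compressed chunk i starts; the last entry is the blob end.
    offsets = [0] + list(accumulate(len(c) for c in chunks))
    return b"".join(chunks), offsets


def read_at(blob, offsets, pos, size, chunk_len=CHUNK_LEN):
    """Read size bytes at uncompressed position pos, touching only needed chunks."""
    out = bytearray()
    index = pos // chunk_len  # chunk that contains pos
    skip = pos % chunk_len    # bytes to skip inside that chunk
    while size > 0 and index < len(offsets) - 1:
        chunk = zlib.decompress(blob[offsets[index]:offsets[index + 1]])
        piece = chunk[skip:skip + size]
        out += piece
        size -= len(piece)
        index += 1
        skip = 0
    return bytes(out)


data = bytes(range(256)) * 40  # 10240 bytes of sample data
blob, offsets = compress_chunks(data)
assert read_at(blob, offsets, 5000, 3000) == data[5000:8000]
```

Only the chunks overlapping the requested range are decompressed, so the cost of a seek does not grow with the position in the file.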

Stream Writer

The Writer class allows streaming.

    from idzip import Writer

    outfile = "/home/dan/ziptest/input1.txt.dz"
    writer = Writer(outfile, sync_size=1048576*100)
    infile = open("/home/dan/ziptest/input.txt", "rb")
    while True:
        data = infile.read(1048576+1)
        if not data:
            break
        writer.write(data)
    writer.close()
    infile.close()

Alternatively, you can open an IdzipFile in write mode and accomplish the same task:

    import idzip

    infile = open("/home/dan/ziptest/input.txt", "rb")
    writer = idzip.IdzipFile("/home/dan/ziptest/input1.txt.dz", "wb", sync_size=1048576*100)

    with infile, writer:
        while True:
            data = infile.read(1048576 + 1)
            if not data:
                break
            writer.write(data)
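The read and write loop used in both examples can be factored into a small helper that works with any pair of file-like objects. The helper name `copy_in_blocks` is hypothetical, and the demonstration uses in-memory `io.BytesIO` objects so that it runs without idzip installed.

```python
import io


def copy_in_blocks(src, dst, block_size=1048576):
    """Copy src to dst block by block; return the number of bytes copied."""
    copied = 0
    # iter() with a sentinel stops as soon as read() returns an empty bytes object.
    for block in iter(lambda: src.read(block_size), b""):
        dst.write(block)
        copied += len(block)
    return copied


source = io.BytesIO(b"x" * 2500000)
target = io.BytesIO()
total = copy_in_blocks(source, target)
assert target.getvalue() == source.getvalue()
```

With the objects from the example above, the call would be `copy_in_blocks(infile, writer)` inside the `with` block.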
