Hi Tatu, I think using wide instructions via Unsafe looks very promi

Parallel LZF about compress HOT 6 CLOSED

ning commented on July 21, 2024

Parallel LZF

from compress.

Comments (6)

cowtowncoder commented on July 21, 2024

Interesting. Yes, this sounds like a very interesting idea. Thank you for suggesting it. I had actually seen a reference to this, but had forgotten to read more.

Btw, I was unable to use Unsafe tricks for single-threaded compression so far -- I tried, but in the end failed. :-)

Another sort-of-related thing that I have been toying with (the idea, that is, not an implementation) is the ability to define a "blindly splittable" format, meaning that one would be able to find a block boundary starting from an arbitrary point in the file. This is theoretically possible, with some caveats: for one, the compressor must guarantee that a specific byte sequence never occurs in its output (easiest to guarantee for a sequence that would always be compressed, like a run of 4 or more instances of the same byte); for another, all blocks must use the compressor (otherwise "raw" data could contain such a sequence).
But that warrants another issue obviously.
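
A minimal sketch of what resynchronizing at an arbitrary offset could look like. It assumes a hypothetical marker of four zero bytes that precedes every block and that the compressor guarantees never appears inside a block; neither the marker nor the class and method names come from LZF or this library.

```java
/** Illustrative only: scans for a block marker from an arbitrary offset. */
final class BlindSplitSketch {

    // Hypothetical marker; a real format would have to guarantee that this
    // sequence never occurs inside compressed (or raw) block contents.
    static final byte[] MARKER = {0, 0, 0, 0};

    /**
     * Returns the index of the first complete marker at or after 'from',
     * or -1 if no marker is found before the end of the data.
     */
    static int findNextBoundary(byte[] data, int from) {
        outer:
        for (int i = Math.max(from, 0); i + MARKER.length <= data.length; i++) {
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return i; // all marker bytes matched
        }
        return -1;
    }
}
```

A reader dropped into the middle of a file would call `findNextBoundary(data, offset)` and start decoding at the returned index; landing inside a marker simply means skipping ahead to the next one.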

whoschek commented on July 21, 2024

Interesting, I once wrote a byte stuffing codec that might be useable for such a "blindly splittable" format. I still haven't figured out how to attach code to this issue tracking system. I can email you the two small (BSD licensed) java files, if you're interested.

/**

Encoder/Decoder implementing Consistent Overhead Byte Stuffing (COBS) for
efficient, reliable, unambiguous packet framing regardless of packet content,
making it easy for applications to recover from malformed packet payloads.
For details, see the <a
href="http://www.stuartcheshire.org/papers/COBSforToN.pdf">paper</a>. In
case the link is broken, get it from the <a
href="http://www.stuartcheshire.org">paper's author</a>.
Quoting from the paper: "When packet data is sent over any serial medium, a
protocol is needed by which to demarcate packet boundaries. This is done by
using a special bit-sequence or character value to indicate where the
boundaries between packets fall. Data stuffing is the process that transforms
the packet data before transmission to eliminate any accidental occurrences
of that special framing marker, so that when the receiver detects the marker,
it knows, without any ambiguity, that it does indeed indicate a boundary
between packets.
COBS takes an input consisting of bytes in the range [0,255] and produces an
output consisting of bytes only in the range [1,255]. Having eliminated all
zero bytes from the data, a zero byte can now be used unambiguously to mark
boundaries between packets.
This allows the receiver to synchronize reliably with the beginning of the
next packet, even after an error. It also allows new listeners to join a
broadcast stream at any time and without failing to receive and decode the
very next error free packet.
With COBS all packets up to 254 bytes in length are encoded with an overhead
of exactly one byte. For packets over 254 bytes in length the overhead is at
most one byte for every 254 bytes of packet data. The maximum overhead is
therefore roughly 0.4% of the packet size, rounded up to a whole number of
bytes. COBS encoding has low overhead (on average 0.23% of the packet size,
rounded up to a whole number of bytes) and furthermore, for packets of any
given length, the amount of overhead is virtually constant, regardless of the
packet contents."
This class implements the original COBS algorithm, not the COBS/ZPE variant.
The following holds: decode(encode(src)) = src.
Performance Note: The JDK 1.5 server VM runs decode(encode(src))
at about 125 MB/s throughput on a commodity PC (2 GHz Pentium 4). Encoding is
the bottleneck, decoding is extremely cheap. Obviously, this is way more
efficient than Base64 encoding or similar application level byte stuffing
mechanisms.
@author [email protected]
@author $Author: hoschek3 $
@version $Revision: 1.4 $, $Date: 2005/06/09 22:44:05 $
*/
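
For readers who have not seen COBS before, here is a minimal sketch of the basic (non-ZPE) scheme described in the Javadoc above. It is not the code offered in this comment: the class and method names are hypothetical, and this simple variant emits an extra trailing code byte when the input ends on a full run of 254 non-zero bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Illustrative COBS codec; not the implementation referred to above. */
final class CobsSketch {

    /** Encodes src so that the result contains no zero bytes. */
    static byte[] encode(byte[] src) {
        // Worst case for this variant: one code byte per 254 data bytes, plus one.
        byte[] dst = new byte[src.length + src.length / 254 + 1];
        int codeIdx = 0; // position reserved for the current run's code byte
        int out = 1;     // next write position
        int code = 1;    // 1 + number of non-zero bytes in the current run
        for (byte b : src) {
            if (b == 0) {
                dst[codeIdx] = (byte) code; // close the run; the zero is implied
                codeIdx = out++;
                code = 1;
            } else {
                dst[out++] = b;
                if (++code == 0xFF) {       // 254 data bytes: close without implied zero
                    dst[codeIdx] = (byte) code;
                    codeIdx = out++;
                    code = 1;
                }
            }
        }
        dst[codeIdx] = (byte) code;         // close the final run
        return Arrays.copyOf(dst, out);
    }

    /** Reverses encode(); performs only a basic sanity check on the input. */
    static byte[] decode(byte[] enc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(enc.length);
        int i = 0;
        while (i < enc.length) {
            int code = enc[i++] & 0xFF;
            if (code == 0 || i + code - 1 > enc.length) {
                throw new IllegalArgumentException("malformed COBS data");
            }
            out.write(enc, i, code - 1);    // copy the run's non-zero bytes
            i += code - 1;
            // A code below 0xFF stands for an implied zero, except at the very end.
            if (code != 0xFF && i < enc.length) {
                out.write(0);
            }
        }
        return out.toByteArray();
    }
}
```

For example, `CobsSketch.encode(new byte[]{1, 0, 2})` yields `{2, 1, 2, 2}`, which a framing layer could then terminate with a single zero byte.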

cowtowncoder commented on July 21, 2024

Cool, thanks. I'll have a look. For what it's worth, it looks like the Snappy format (alas!) might actually work with a simple sequence of 4 zero bytes... for LZF, a change or two might be needed. But it, too, has a couple of unused values for the first byte of each sequence.

javabean commented on July 21, 2024

Hi all,

I may have a go at writing a parallel version of LZF if no one has started working on one yet.
If street cred is required: I have implemented a parallel GZip compressor in Java (similar to pigz); hope this is enough! :-) (Ping me for the URL; I am not writing here to advertise.)
Tatu, would you be interested in such a contribution?

cowtowncoder commented on July 21, 2024

I would be absolutely thrilled to get such a contribution! Please let me know if you need help with block-level handling or such. And obviously you can add accessors if/as necessary.
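
To make "block-level handling" concrete, here is a rough sketch of the general pigz-style approach: split the input into fixed-size blocks, compress each block independently on a thread pool, and collect the results in input order. It uses `java.util.zip.Deflater` purely as a stand-in for the LZF chunk encoder so that it runs without this library; the class name, method names, and block size are assumptions, not part of the contribution discussed here.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Illustrative only: independent blocks compressed in parallel. */
final class ParallelBlockSketch {

    /** Compresses each block on a pool and returns the results in input order. */
    static List<byte[]> compressBlocks(byte[] input, int blockSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> pending = new ArrayList<>();
            for (int off = 0; off < input.length; off += blockSize) {
                byte[] block = Arrays.copyOfRange(input, off, Math.min(input.length, off + blockSize));
                Callable<byte[]> task = () -> deflate(block);
                pending.add(pool.submit(task));
            }
            List<byte[]> result = new ArrayList<>();
            for (Future<byte[]> f : pending) {
                result.add(f.get()); // iterating in submission order preserves block order
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("block compression failed", e);
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in block compressor (Deflate, not LZF). */
    static byte[] deflate(byte[] block) {
        Deflater d = new Deflater();
        try {
            d.setInput(block);
            d.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!d.finished()) {
                out.write(buf, 0, d.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            d.end();
        }
    }

    /** Decompresses the blocks one by one and concatenates the output. */
    static byte[] inflateAll(List<byte[]> blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (byte[] b : blocks) {
            Inflater inf = new Inflater();
            try {
                inf.setInput(b);
                while (!inf.finished()) {
                    int n = inf.inflate(buf);
                    if (n == 0 && inf.needsInput()) {
                        throw new IllegalArgumentException("truncated block");
                    }
                    out.write(buf, 0, n);
                }
            } catch (DataFormatException e) {
                throw new IllegalArgumentException("corrupt block", e);
            } finally {
                inf.end();
            }
        }
        return out.toByteArray();
    }
}
```

Because every block is self-contained, decompression could be parallelized the same way; the cost is that matches cannot reference data in earlier blocks, so the ratio is slightly worse than with one continuous stream.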

I would also be interested in a link to the project, if that's OK; maybe tweet it to 'cowtowncoder'? I am OK with adding the link to this issue as well, unless you'd rather not.

Finally: this package has a small gzip wrapper, so if you have improvements to that, those would be welcome.
But it's mostly just there for my own use (I handle smallish payloads with gzip, larger ones with LZF).

cowtowncoder commented on July 21, 2024

Will be in 0.9.9.
