Comments (6)
Interesting. Yes, this sounds like a very interesting idea. Thank you for suggesting it. I had actually seen a reference to this, but had forgotten to read more.
Btw, I was unable to use Unsafe tricks for single-threaded compression so far -- I tried, but in the end failed. :-)
Another sort-of-related thing that I have been toying with (the idea, that is, not an implementation) is the ability to define a "blindly splittable" format, meaning that one would be able to find a block boundary given an arbitrary point in the file. This is theoretically possible, with some caveats: for one, the compressor must guarantee that a specific byte sequence never occurs (easiest to guarantee for a sequence that would always be compressed, like a run of 4 or more instances of the same byte); for another, all blocks must use the compressor (otherwise "raw" data could contain such a sequence).
But that warrants another issue, obviously.
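To make the "blindly splittable" idea concrete, here is a minimal sketch of the lookup side only. It assumes a hypothetical reserved marker that the compressor guarantees never appears inside a block and that cannot overlap the bytes around it; the class and method names are illustrative and are not part of this library.

```java
/** Sketch only: locates the next block boundary from an arbitrary offset. */
final class BlockBoundaryScanner {

    /**
     * Scans forward from 'from' for the reserved marker and returns the index
     * of the first byte after it (the assumed start of the next block), or -1
     * if no complete marker is found.
     * Assumption: the marker never occurs inside block data.
     */
    static int findNextBlockStart(byte[] data, int from, byte[] marker) {
        if (marker.length == 0) {
            throw new IllegalArgumentException("marker must not be empty");
        }
        for (int i = Math.max(0, from); i + marker.length <= data.length; i++) {
            int j = 0;
            while (j < marker.length && data[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                return i + marker.length;
            }
        }
        return -1;
    }
}
```

A reader that starts in the middle of a file would call `findNextBlockStart` once and then decode whole blocks from the returned offset onward.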
from compress.
Interesting, I once wrote a byte stuffing codec that might be usable for such a "blindly splittable" format. I still haven't figured out how to attach code to this issue tracking system. I can email you the two small (BSD licensed) Java files, if you're interested.
/**
 * Encoder/Decoder implementing Consistent Overhead Byte Stuffing (COBS) for
 * efficient, reliable, unambiguous packet framing regardless of packet content,
 * making it easy for applications to recover from malformed packet payloads.
 *
 * For details, see the <a
 * href="http://www.stuartcheshire.org/papers/COBSforToN.pdf">paper</a>. In
 * case the link is broken, get it from the <a
 * href="http://www.stuartcheshire.org">paper's author</a>.
 *
 * Quoting from the paper: "When packet data is sent over any serial medium, a
 * protocol is needed by which to demarcate packet boundaries. This is done by
 * using a special bit-sequence or character value to indicate where the
 * boundaries between packets fall. Data stuffing is the process that transforms
 * the packet data before transmission to eliminate any accidental occurrences
 * of that special framing marker, so that when the receiver detects the marker,
 * it knows, without any ambiguity, that it does indeed indicate a boundary
 * between packets.
 *
 * COBS takes an input consisting of bytes in the range [0,255] and produces an
 * output consisting of bytes only in the range [1,255]. Having eliminated all
 * zero bytes from the data, a zero byte can now be used unambiguously to mark
 * boundaries between packets.
 *
 * This allows the receiver to synchronize reliably with the beginning of the
 * next packet, even after an error. It also allows new listeners to join a
 * broadcast stream at any time and without failing to receive and decode the
 * very next error free packet.
 *
 * With COBS all packets up to 254 bytes in length are encoded with an overhead
 * of exactly one byte. For packets over 254 bytes in length the overhead is at
 * most one byte for every 254 bytes of packet data. The maximum overhead is
 * therefore roughly 0.4% of the packet size, rounded up to a whole number of
 * bytes. COBS encoding has low overhead (on average 0.23% of the packet size,
 * rounded up to a whole number of bytes) and furthermore, for packets of any
 * given length, the amount of overhead is virtually constant, regardless of the
 * packet contents."
 *
 * This class implements the original COBS algorithm, not the COBS/ZPE variant.
 *
 * There holds: <code>decode(encode(src)) = src</code>.
 *
 * Performance Note: The JDK 1.5 server VM runs <code>decode(encode(src))</code>
 * at about 125 MB/s throughput on a commodity PC (2 GHz Pentium 4). Encoding is
 * the bottleneck; decoding is extremely cheap. Obviously, this is way more
 * efficient than Base64 encoding or similar application level byte stuffing
 * mechanisms.
 *
 * @author [email protected]
 * @author $Author: hoschek3 $
 * @version $Revision: 1.4 $, $Date: 2005/06/09 22:44:05 $
 */
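For readers who want to see the byte stuffing step itself, the following is a minimal sketch of plain COBS encoding and decoding as described in the Javadoc above. It is not the class that the Javadoc belongs to (that code was not attached to this thread); the names `CobsSketch`, `encode` and `decode` are illustrative assumptions, and the sketch omits the trailing zero delimiter that a framing layer would add between packets.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Sketch only: plain COBS, with no COBS/ZPE handling and no frame delimiter. */
final class CobsSketch {

    /** Encodes src so that the result contains no zero bytes. */
    static byte[] encode(byte[] src) {
        // Worst case: one code byte up front plus one per 254 input bytes.
        byte[] dst = new byte[src.length + src.length / 254 + 1];
        int codeIdx = 0; // position of the code byte for the block being built
        int out = 1;     // next free position in dst
        int code = 1;    // block length so far, counting the code byte itself
        for (int i = 0; i < src.length; i++) {
            if (src[i] != 0) {
                dst[out++] = src[i];
                code++;
            }
            boolean fullBlock = (code == 0xFF);
            if (src[i] == 0 || fullBlock) {
                dst[codeIdx] = (byte) code;
                if (fullBlock && i == src.length - 1) {
                    // Input ends exactly on a full block: no further code byte.
                    return Arrays.copyOf(dst, out);
                }
                codeIdx = out++;
                code = 1;
            }
        }
        dst[codeIdx] = (byte) code;
        return Arrays.copyOf(dst, out);
    }

    /** Reverses encode(); throws IllegalArgumentException on malformed input. */
    static byte[] decode(byte[] enc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(enc.length);
        int i = 0;
        while (i < enc.length) {
            int code = enc[i++] & 0xFF;
            if (code == 0 || i + code - 1 > enc.length) {
                throw new IllegalArgumentException("malformed COBS data");
            }
            out.write(enc, i, code - 1);
            i += code - 1;
            // Every block except a full (0xFF) or final one implies a zero byte.
            if (code != 0xFF && i < enc.length) {
                out.write(0);
            }
        }
        return out.toByteArray();
    }
}
```

Because the encoded bytes never contain zero, a writer could append a single zero byte after each encoded block, and a reader starting at an arbitrary offset could skip to the next zero to resynchronize.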
from compress.
Cool, thanks. I'll have a look. For what it's worth, it looks like the Snappy format (alas!) might actually work with a simple sequence of 4 zero bytes... for LZF, a change or two might be needed. But it too does have a couple of unused byte values for the first byte of each sequence.
from compress.
Hi all,
I may have a go at writing a parallel version of LZF if no one has started working on one yet.
If street creds are required, I have implemented a parallel GZip compressor in Java (similar to pigz); hope this is enough! :-) (Ping me for the URL, I am not writing here for advertisement.)
Tatu, would you be interested in such a contribution?
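For context, block-parallel compression of the kind mentioned here can be sketched roughly as follows. This is not the contributor's implementation and does not use this library's LZF classes: as a stand-in it compresses each block as an independent gzip member with the JDK's `GZIPOutputStream` (pigz itself works differently in detail), and the class name, method names and parameters are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch only: compresses fixed-size blocks in parallel, preserving order. */
final class ParallelBlockCompressor {

    static List<byte[]> compressBlocks(byte[] input, int blockSize, int threads) {
        if (blockSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("blockSize and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per block; the list keeps submission order.
            List<Future<byte[]>> pending = new ArrayList<>();
            for (int off = 0; off < input.length; off += blockSize) {
                final int start = off;
                final int len = Math.min(blockSize, input.length - off);
                Callable<byte[]> task = () -> gzip(input, start, len);
                pending.add(pool.submit(task));
            }
            // Collect results in block order, regardless of completion order.
            List<byte[]> blocks = new ArrayList<>();
            for (Future<byte[]> f : pending) {
                blocks.add(f.get());
            }
            return blocks;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("block compression failed", e);
        } finally {
            pool.shutdown();
        }
    }

    static byte[] gzip(byte[] data, int off, int len) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data, off, len);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] block) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(block))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

Writing the returned blocks out in list order yields the original data when each block is decompressed in turn; the same structure would apply if each task produced an LZF chunk instead.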
from compress.
I would be absolutely thrilled to get such a contribution! Please let me know if you need help with block-level handling or such. And obviously you can add accessors if/as necessary.
I would also be interested in a link to the project if that's ok; maybe tweet to 'cowtowncoder'? I am ok with adding the link in this issue as well, unless you don't want to.
Finally: this package has a small gzip wrapper, so if you have improvements to that, those would be welcome.
But it's mostly just added for my own use (I handle smallish payloads with gzip, larger ones with lzf).
from compress.
Will be in 0.9.9.
from compress.