composewell / streamly-lz4
Streamly combinators for LZ4 compression.
License: Apache License 2.0
Improve the Array API instead.
See this document for the lz4 container format: https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md .
We can write a parser (a terminating fold) to decode a stream of frames into a stream of compressed blocks: decode the frame header and then decompress appropriately using the options/flags found in it. Similarly, we can write a serializer that emits a stream of compressed blocks as an LZ4 frame using user-supplied options.
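For instance, the frame parser's first step is to validate the magic number; the LZ4 frame format begins with 0x184D2204 stored little-endian. A minimal, list-based sketch of that check (the real implementation would work on streamly arrays, not lists):

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8, Word32)

-- Little-endian decode of a 32-bit word from the first four bytes.
le32 :: [Word8] -> Word32
le32 = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0 . take 4

-- The LZ4 frame format begins with this magic number (stored little-endian).
lz4FrameMagic :: Word32
lz4FrameMagic = 0x184D2204

-- A frame parser would check this before reading the frame descriptor.
isLZ4Frame :: [Word8] -> Bool
isLZ4Frame bs = length bs >= 4 && le32 bs == lz4FrameMagic
```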
Also, we have to handle the following in the block headers:
Test the following cases:
See #29 (comment)
compressChunk :: Int -> Fold m (Array.Array Word8) (Array.Array Word8)
-- Accept properly resized chunks
decompressChunk :: Fold m (Array.Array Word8) (Array.Array Word8)
We can then implement compress and decompress in terms of these folds.
In the code above, Ptr C_LZ4Stream would be freed ONLY along the following path (listed in reverse order):
c_freeStream was called
CompressDone ctx
Stream.Stop from the inner stream (the only way to produce CompressDone ctx)
Does that mean we'd get a memory leak in code like:
readFile f & compressChunks cfg speed & take 0 & fold drain :: IO ()
since we'll never get Stream.Stop from readFile, or am I missing another path that would clean up Ptr C_LZ4Stream properly?
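The usual fix for this class of leak is to tie the C allocation to a bracket so the release action runs even when the consumer stops early (streamly provides bracket-style stream combinators for this). A base-only sketch of the idea, with an IORef merely standing in for a real Ptr C_LZ4Stream:

```haskell
import Control.Exception (bracket)
import Data.IORef (IORef, newIORef, writeIORef, readIORef)

-- Sketch: bracket runs the release action even when the body consumes
-- only part of the input, which is what take 0 does to the stream.
-- The IORef Bool only simulates the alive/freed state of the C handle.
withCStream :: (IORef Bool -> IO a) -> IO a
withCStream = bracket alloc release
  where
    alloc = newIORef True               -- stands in for c_createStream
    release ref = writeIORef ref False  -- stands in for c_freeStream
```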
In streamly, we use the following CPP helpers,
#define INLINE_EARLY INLINE [2]
#define INLINE_NORMAL INLINE [1]
#define INLINE_LATE INLINE [0]
This does not work well with a .hsc file. This might not even be the right way to define these helpers in a .hsc file. The closest mechanism I can see in the documentation is #let.
This code is currently commented out but we should figure out a way to write informative INLINE pragmas.
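An untested sketch of how #let might express these helpers in a .hsc file; the macro names and the stringification of the argument are assumptions, not verified against hsc2hs:

```haskell
#let inline_early  f = "{-# INLINE [2] %s #-}", #f
#let inline_normal f = "{-# INLINE [1] %s #-}", #f
#let inline_late   f = "{-# INLINE [0] %s #-}", #f

-- Hypothetical usage at a definition site:
-- #{inline_normal compressChunk}
```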
We should use a specific byte ordering instead of the machine byte ordering; otherwise the code will not work across machines with different endianness. We can raise an issue for this and address it separately.
For example:
else if ((fromIntegral :: Int32 -> Int) srcLen < Array.byteLength arr)
Annotating the conversion makes it easier to verify safety and forces us to think about whether it is safe. Always upcast the smaller type rather than downcasting the larger one.
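To make serialized lengths portable we can write them in an explicit little-endian byte order rather than as machine words. A base-only sketch (helper names are illustrative, not from the library):

```haskell
import Data.Bits (shiftR)
import Data.Int (Int32)
import Data.Word (Word8)

-- Serialize a 32-bit length in an explicit little-endian byte order,
-- so the output is identical regardless of the host's endianness.
encodeLE32 :: Int32 -> [Word8]
encodeLE32 n = [ fromIntegral (n `shiftR` s) | s <- [0, 8, 16, 24] ]

-- Upcasting the 32-bit length to Int for comparisons never truncates
-- under GHC, where Int is at least 32 bits wide.
toLen :: Int32 -> Int
toLen = fromIntegral
```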
We should not use the whole absolute path of the corpus file in the benchmark name.
compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/large/bible.txt.normalized time 41.77 ms
compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/large/world192.txt.normalized time 33.97 ms
compress/files/bufsize(65536)/compress 5//home/harendra/composewell/streamly-lz4/corpora/cantrbry/alice29.txt.normalized time 39.86 ms
Format vs Config vs Options
All combinators fuse if they are the only combinators in the pipeline.
Adding any other combinator breaks fusion.
The problem seems to be with stuff not getting inlined.
Installation and Benchmarks
Compress:
compress: a convenience API that uses a default speedup factor and a default block size for compression
compressWith: ability to specify the speedup and block size
compressRaw: compress the arrays in the input stream as is, without resizing
Decompress:
decompress: resize the input arrays and decompress
decompressRaw: assume the arrays are already of the right size
Hi, as far as I know, streamly-lz4 still uses streamly == 0.8.2 currently. Do we have any plan to migrate streamly-lz4 to the new streamly API?
Check the size of the input array and fail if it is more than the max block size set in BlockConfig. Otherwise decompression would fail.
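A minimal sketch of such a guard; the BlockConfig type and its field name here are stand-ins, not the library's actual definition:

```haskell
-- Stand-in config; the real BlockConfig lives in streamly-lz4.
newtype BlockConfig = BlockConfig { maxBlockSize :: Int }

-- Reject chunks larger than the configured maximum block size up front,
-- instead of letting decompression fail later with an opaque error.
checkChunkSize :: BlockConfig -> Int -> Either String Int
checkChunkSize cfg len
  | len > maxBlockSize cfg =
      Left ("chunk of " ++ show len ++ " bytes exceeds max block size of "
            ++ show (maxBlockSize cfg))
  | otherwise = Right len
```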
For decompression, we could flatten the compressed input arrays into a stream and parse the stream into compressed chunks using a terminating fold. Check how this performs in comparison to array resizing.
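The parsing step can be sketched over plain lists; a streamly version would be a terminating parser/fold over the flattened byte stream. This sketch ignores the high bit of the size word, which the frame format uses to flag uncompressed blocks:

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8)

-- Split a flattened byte stream into compressed blocks using the 4-byte
-- little-endian block-size prefix from the LZ4 frame format. A zero size
-- is the EndMark terminating the frame.
splitBlocks :: [Word8] -> [[Word8]]
splitBlocks bs
  | length hdr < 4 = []   -- truncated header: stop
  | len == 0 = []         -- EndMark
  | otherwise = block : splitBlocks rest
  where
    (hdr, body) = splitAt 4 bs
    len = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) (0 :: Int) hdr
    (block, rest) = splitAt len body
```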
For compression we could chunk the stream into arrays of 64K by default and provide an option to change the default.
We can provide both array as well as stream based APIs if arrays show some perf advantage otherwise we can keep just stream based ones.
The CI for GHC 8.8.4 on Linux failed once. Unfortunately, the logs were not recorded and I don't know the right conditions to reproduce the error.
It was probably a memory issue but I'm not sure.
The compress/decompress APIs work on chunks (arrays). All the APIs in streamly that work on arrays are suffixed with chunk; we should use the same convention here as well:
compress
decompress