For now, the DecoderState and <code class="notranslat

I had a brief look, the <a href="https://github.com/rust-lang/rust/issues/74878" data-

I've been doing some experiments in <a href="https://github.com/chyyran/lzma-rs/commit

Vec2D from <a class="issue-link js-issue-link" data-e

Use const-generics about lzma-rs HOT 4 CLOSED

gendx commented on August 22, 2024

Use const-generics

from lzma-rs.

Comments (4)

npmccallum commented on August 22, 2024

Const generics are now stable in 1.51.

from lzma-rs.

gendx commented on August 22, 2024

I had a brief look, the min_const_generics feature was indeed stabilized in 1.51, but we still need the more complex const_evaluatable_checked feature due to the 1 << num_bits expression here.

In the meantime, it might be worth experimenting with it on nightly to see if benchmarks would improve with const generics.

from lzma-rs.

chyyran commented on August 22, 2024

I've been doing some experiments in https://github.com/chyyran/lzma-rs/commits/feature-perf-experiments and found that const generics in BitTree and RangeCoder doesn't actually contribute that much performance benefit, except for large dictionaries.

Without const generics (but with some additional optimizations) SnowflakePowered@ecc71a0

running 8 tests
test compress_65536                  ... bench:   1,372,700 ns/iter (+/- 32,501)
test compress_empty                  ... bench:         713 ns/iter (+/- 35)
test compress_hello                  ... bench:       1,021 ns/iter (+/- 50)
test decompress_after_compress_65536 ... bench:   2,463,569 ns/iter (+/- 77,148)
test decompress_after_compress_empty ... bench:       1,720 ns/iter (+/- 146)
test decompress_after_compress_hello ... bench:       2,219 ns/iter (+/- 50)
test decompress_big_file             ... bench:   3,959,745 ns/iter (+/- 3,155,824)
test decompress_huge_dict            ... bench:       2,221 ns/iter (+/- 136)

With const generics SnowflakePowered@ad7d096

running 8 tests
test compress_65536                  ... bench:   1,379,341 ns/iter (+/- 160,163)
test compress_empty                  ... bench:         711 ns/iter (+/- 57)
test compress_hello                  ... bench:       1,016 ns/iter (+/- 30)
test decompress_after_compress_65536 ... bench:   2,442,160 ns/iter (+/- 125,797)
test decompress_after_compress_empty ... bench:         544 ns/iter (+/- 110)
test decompress_after_compress_hello ... bench:       1,003 ns/iter (+/- 47)
test decompress_big_file             ... bench:   3,939,209 ns/iter (+/- 177,650)
test decompress_huge_dict            ... bench:       1,065 ns/iter (+/- 40)

Probably worth pursuing in the future because of that benefit but SnowflakePowered@ad7d096 uses kind of a nasty workaround to get around the lack of #[feature(generic_const_exprs)] by putting 1 << N into a const fn and passing it in at the constructor site.

from lzma-rs.

chyyran commented on August 22, 2024

Vec2D from #77 lets us easily experiment with different backing stores for the literal_probs array which is the main allocation.
I used arrayvec which gives us a nice API over a raw array.

Since DecoderState needs to be able to update lclppb at runtime, we can't just parameterize it with const generics unlike BitTree, so we have to account for the theoretical maximum parameters which is sizeof(u16) * 0x300 * { 1 << (4 + 8) } = 3145728 * 2. An array of 6 megabytes is way too large to fit on the stack and I was unable to actually decompress anything without a stack overflow.

For LZMA2 streams, the theoretical maximum parameters for the literal_probs array is sizeof(u16) * 0x300 * { 1 << 4 } = 12288 * 2. While this isn't great, 24KB is way smaller than 6MB so I was able to actually run some benchmarks, which are not great.

These benchmarks are with an allocating BitTree, i.e. without const-generics.

ArrayVec<u16, 0x300 * { 1 << 4 }>

running 8 tests
test compress_65536                  ... bench:   1,323,468 ns/iter (+/- 174,348)
test compress_empty                  ... bench:         695 ns/iter (+/- 20)
test compress_hello                  ... bench:       1,035 ns/iter (+/- 90)
test decompress_after_compress_65536 ... bench:   1,336,165 ns/iter (+/- 57,379)
test decompress_after_compress_empty ... bench:       6,166 ns/iter (+/- 359)
test decompress_after_compress_hello ... bench:       6,531 ns/iter (+/- 224)
test decompress_big_file             ... bench:   3,716,763 ns/iter (+/- 121,865)
test decompress_huge_dict            ... bench:       6,572 ns/iter (+/- 315)

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 0 filtered out; finished in 13.64s

Box<[u16]>

running 8 tests
test compress_65536                  ... bench:   1,334,964 ns/iter (+/- 1,371,984)
test compress_empty                  ... bench:         709 ns/iter (+/- 436)
test compress_hello                  ... bench:       1,033 ns/iter (+/- 64)
test decompress_after_compress_65536 ... bench:   1,145,697 ns/iter (+/- 82,369)
test decompress_after_compress_empty ... bench:       1,890 ns/iter (+/- 92)
test decompress_after_compress_hello ... bench:       2,231 ns/iter (+/- 295)
test decompress_big_file             ... bench:   3,620,380 ns/iter (+/- 152,660)
test decompress_huge_dict            ... bench:       2,271 ns/iter (+/- 75)

test result: ok. 0 passed; 0 failed; 0 ignored; 8 measured; 0 filtered out; finished in 7.76s

The huge array on the stack probably causes a lot of cache misses and substantially reduces performance. It seems the only benefit to using const generics here is with BitTree, where it does show an improvement in the benchmarks as above.

from lzma-rs.

Use const-generics about lzma-rs HOT 4 CLOSED

Comments (4)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent