If I make multiple calls to Write, the resulting compressed data stream length is much

Possibly related? <a class="issue-link js-issue-link" data-error-text="Failed to l

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

The zstd bug referenced above (<a class="issue-link js-issue-link" data-error-text="Fa

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Multiple calls to Write() has unexpected overheads? about zstd HOT 10 CLOSED

datadog commented on September 22, 2024 1

Multiple calls to Write() has unexpected overheads?

from zstd.

Comments (10)

jimsmart commented on September 22, 2024

Possibly related?
facebook/zstd#206

from zstd.

jimsmart commented on September 22, 2024

Ok, the best thing to do if one cannot ensure good-sized calls to Write, is to use a bufio.Writer like this:-

func main() {

	b := &bytes.Buffer{}
	for i := 0; i < 500; i++ {
		b.Write([]byte("Hello World! "))
	}
	data1 := b.Bytes()
	fmt.Println("data len", len(data1))

	// Compress 1
	buffer1 := &bytes.Buffer{}
	w1 := zstd.NewWriterLevel(buffer1, CompressionLevel)
	w1.Write(data1)
	w1.Close()

	fmt.Println("Buffer1 len", buffer1.Len())

	// Compress 2
	buffer2 := &bytes.Buffer{}
	w2 := zstd.NewWriterLevel(buffer2, CompressionLevel)
	bw := bufio.NewWriter(w2) // default buffer size = 4k
	// bw := bufio.NewWriterSize(w2, 8192) // buffer size = 8k
	for i := 0; i < 500; i++ {
		bw.Write([]byte("Hello World! "))
	}
	bw.Flush()
	w2.Close()

	fmt.Println("Buffer2 len", buffer2.Len())
}

Output:

data len 6500
Buffer1 len 33
Buffer2 len 44

It's not so elegant, but ¯\(ツ)/¯

— Hope that helps someone!

from zstd.

valyala commented on September 22, 2024

@jimsmart , try gozstd.Writer. It uses another underlying zstd API, which should have lower overhead.

from zstd.

dmoklaf commented on September 22, 2024

The zstd bug referenced above (facebook/zstd#206) has been closed. Is this issue still ongoing? If yes, the need to wrap the writer with a buffer shall be documented, this is a pretty subtle usage advice.

from zstd.

Viq111 commented on September 22, 2024

Hi @rgeronimi,

I checked again previous result and indeed you'd currently have the same results.
zstd (this lib) & gozstd (from @valyala above) use 2 slightly different C zstd API with slightly different directions.

(this) zstd library uses ZSTD_compressContinue which basically use buffer-less zstd streaming compression, meaning we have complete control over memory at the expense of needing to do buffers Go-side if you want to optimize for compression size on small inputs.

gozstd uses ZSTD_compressStream which abstract that buffer logic into the C code (at the cost of having less control over memory consumption C land)

Hope this helps!

from zstd.

valyala commented on September 22, 2024

The following limitations for ZSTD_compressContinue look scary:

ZSTD_compressContinue() presumes prior input is still accessible and unmodified (up to maximum distance size, see WindowLog). It remembers all previous contiguous blocks, plus one separated memory segment (which can itself consists of multiple contiguous blocks)

ZSTD_compressContinue() detects that prior input has been overwritten when src buffer overlaps. In which case, it will "discard" the relevant memory section from its history.

As I understand, they mean two things:

zstd could use garbage as a dictionary from the previous buffers used in the ZSTD_compressContinue call if address of these buffers doesn't match the address of the current buffer. This also may lead to segmentation fault if the underlying memory of the previous buffer has been unmapped from the process address space.
zstd may have bad compression rate, since it discards dictionary data from the previously compressed block if the buffer passed to the func is re-used.

cc'ing @Cyan4973 for further clarification.

from zstd.

Cyan4973 commented on September 22, 2024

@Viq111 explanations are correct.

ZSTD_compressContinue() is a fairly low level function, designed for systems which need absolute control over memory allocation. It requires a fairly good control over buffer content and lifetime. To be fair, it's more targeted at embedded environments than managed languages, but I'm not a qualified expert to tell if this is a good fit or not for go.

When in doubt, prefer using ZSTD_compressStream(). It's safer to use, and abstract all the machinery, at the cost of also managing its own internal buffers.

from zstd.

dmoklaf commented on September 22, 2024

ZSTD_compressContinue() is a fairly low level function, designed for systems which need absolute control over memory allocation. It requires a fairly good control over buffer content and lifetime. To be fair, it's more targeted at embedded environments than managed languages, but I'm not a qualified expert to tell if this is a good fit or not for go.

This depends on what the Go wrapper code does. I just checked it and it transmits directly the user-provided buffer as a C pointer. If I understand what @valyala wrote, this could be a critical bug as the zstd C code is expecting this buffer to remain accessible after the function returns. If true, this would have the potential for data corruptions, process crashes, and hard-to-reproduce cases.

When in doubt, prefer using ZSTD_compressStream(). It's safer to use, and abstract all the machinery, at the cost of also managing its own internal buffers.

from zstd.

Viq111 commented on September 22, 2024

Reading back at the code, we started implementing the go wrapper at zstd v0.5 which only had the ZBUFF_decompressContinue methods indeed: https://github.com/facebook/zstd/blob/201433a7f713af056cc7ea32624eddefb55e10c8/lib/zstd_buffered.h#L79

It may actually be also the issue for #39

If anyone could put up a PR for migrating to ZSTD_compressStream, we are accepting all contributions!

Otherwise I can also look into it as it seems it could bit a couple of people using the streaming interface

from zstd.

dmoklaf commented on September 22, 2024

We don't have the skillset to zoom into that soon. For storage tasks (e.g., blob storage in DB or compressed custom backup) this bug is a showstopper unfortunately.

from zstd.

Multiple calls to Write() has unexpected overheads? about zstd HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent