Comments (13)
For storing/mmap-ing on demand, it looks like we want to improve this on the WAL side of TSDB anyway (e.g. https://matrix.to/#/!WaUKIfoqfiyWQhenET:matrix.org/$1568229168815166NkByc:matrix.org?via=matrix.org). We should start a design discussion on Prometheus at some point. However, this will not solve querying, as you mentioned.
For querying those two facts that you mentioned are crucial:
One instance generates 10 series. This is - at least in my experience - a lot less than the usual metric series count per instance.
What's the instance here? Anyway - yea the cardinality will be smaller for sure.
One datapoint is comparatively much bigger than with metrics. A cpu/mem profile is tens of kilobytes, a trace is tens of megabytes - as opposed to a float being 64 bits.
Let’s say we want to query long timespans. At query time, we are only interested in metadata. So the datapoint sizes will negatively impact the hot-path performance, even though we don’t need them here (as opposed to metric-scraping prometheus, where we are aggregating data over time while querying).
I think those summarize very well the problem we are solving here. No aggregations needed, huge sample data and low cardinality.
To me, it means that:
- Lowering the min block size would help a lot. We have a smaller index due to the lower cardinality, so it might actually bring some benefits.
- Fetching on demand from the WAL would help, but even then, while constructing a block we would probably need to keep all or most of it in memory.
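To put rough numbers on "huge sample data and low cardinality", a back-of-envelope sketch (the per-profile size and scrape interval here are assumptions based on the figures above, not measurements):

```go
package main

import "fmt"

// estimateDailyBytes returns the bytes one scrape target produces per
// day, given its series count, the scrape interval in minutes and an
// average sample size in bytes.
func estimateDailyBytes(series, intervalMin, sampleBytes int) int {
	samplesPerDay := (24 * 60 / intervalMin) * series
	return samplesPerDay * sampleBytes
}

func main() {
	// Assumption: ~10 series per target, one profile per minute,
	// ~50KB per profile (a cpu/mem profile is tens of kilobytes).
	profiles := estimateDailyBytes(10, 1, 50*1024)
	metrics := estimateDailyBytes(10, 1, 8) // a float64 sample is 8 bytes
	fmt.Printf("profiles: %.1f GB/day, metrics: %d KB/day\n",
		float64(profiles)/(1<<30), metrics/1024)
	// prints: profiles: 0.7 GB/day, metrics: 112 KB/day
}
```

Even with only 10 series per target, the payload size dominates: the same sample rate stored as plain float64 metrics is nearly four orders of magnitude smaller.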
I think the idea you propose - storing only references to the profiles in other storage (e.g. an object store) instead of the profiles themselves - might make sense. We might want to look at how Loki does this, as it solves a pretty similar problem (metadata + payload).
from parca.
(I'm about to go on vacation, so I apologize in advance for delayed responses after this one)
First of all, I think it's super awesome that you are getting involved and want to improve things. I highly appreciate the effort you are putting in, and I want to work with you to improve this situation! :)
The way I've been thinking about it is: Prometheus TSDB has some of the exact same problems, just on a much smaller scale in regards to the sample value size. Prometheus TSDB WAL already writes segments of 128MB on disk, but keeps the data in-memory. My idea was that if we could mmap the segments instead of just using them to re-create the in-memory representation of the WAL, then we would only have things in memory that we are actually querying + at most 128MB. To be clear, I realize that "just mmap-ing" is unlikely to be enough; it might require us to rethink the segment format a bit to allow for a strategy like this, as well as adapt the read path to potentially be aware of it, kind of as a way of layered caches.
@bwplotka @krasi-georgiev you know the TSDB and WAL code better than I do, do you think this could be feasible?
from parca.
Just saying, I'm also concerned about the read path, because a lot of raw bytes are put into the data structure we use for querying even though they don't need to be there (later we do a single get for the single datapoint of interest).
(reiterating this as I'd like to hear their thoughts on this too, because maybe I'm trying to fix a problem that just isn't there)
It would also hugely impact query performance if using some kind of remote storage if I understand those mechanisms correctly. Downloading multiple TB from object storage and scanning them, even though we only really care about metadata at this stage, is fairly suboptimal.
from parca.
Could you save me a bit of time digging through the conprof code and give an example of the data format that you are saving in tsdb at the moment?
Also, what are the most common queries to get that data?
from parca.
I think that's the code conprof/tsdb@70f0d4a
We're saving go pprof/trace files, which are, from the tsdb point of view, byte slices ranging from a few tens of kilobytes up to a few tens of megabytes.
I'm not sure what you mean by most common queries:
Usually you want to scan some small timespan for datapoints with the given labels, but you are not interested in the actual data stored there. Then you select a datapoint to open the pprof. That's when you get an iterator for the wanted series, seek to the single timestamp you're interested in, and read the bytes there.
storageFetcher := func(_ string, _, _ time.Duration) (*profile.Profile, string, error) {
    q, err := p.db.Querier(0, math.MaxInt64)
    if err != nil {
        level.Error(p.logger).Log("err", err)
        return nil, "", err
    }
    defer q.Close()

    ss, err := q.Select(m...)
    if err != nil {
        level.Error(p.logger).Log("err", err)
        return nil, "", err
    }
    if !ss.Next() {
        return nil, "", errors.New("no series matched the given matchers")
    }
    it := ss.At().Iterator()

    t, err := stringToInt(timestamp)
    if err != nil {
        return nil, "", err
    }
    if !it.Seek(t) {
        return nil, "", errors.New("no sample found at the requested timestamp")
    }
    _, buf := it.At()
    prof, err := profile.Parse(bytes.NewReader(buf))
    return prof, "", err
}
Currently there are no aggregations.
from parca.
What's the instance here? Anyway - yea the cardinality will be smaller for sure.
By one instance I mean one scrape target. One scrape target has ~10 different kinds of profiles to collect, each of which creates one series.
from parca.
Yeah, I think the simplest solution would be to just keep a reference to the file path in tsdb and, when needed, open the file from disk rather than keeping the raw bytes in memory.
from parca.
I think that's actually pretty reasonable - essentially using tsdb just as an index. In order for that to work, though, we absolutely need isolation/MVCC in tsdb, otherwise we're gonna have a bad time. I know it's been on our plate for a long time; maybe it's time to finally finish it.
from parca.
Why would isolation/MVCC be required? The access to the files would be read-only, so more than a single process can read from the same file.
from parca.
No, when inserting we need isolation between the series showing up in the index and the samples being appended to it.
from parca.
Could you please expand on the problem? As I see it you'd first write the file to some storage layer (file system, object storage, etc.) and after it's closed (and immutable) write its file path to TSDB, the same way you write a byte slice now.
from parca.
Yes you're right, doing it in that order should solve that problem.
from parca.
We did a major re-work of the storage as part of the Conprof -> Parca re-brand: https://www.parca.dev/docs/storage
from parca.