thanos-io / thanos Goto Github PK

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.

License: Apache License 2.0

Makefile 0.61% Go 86.92% Shell 0.39% CSS 0.32% HTML 0.60% Dockerfile 0.03% Jsonnet 2.64% TypeScript 8.15% jq 0.01% SCSS 0.33% JavaScript 0.01%

prometheus google-cloud-storage high-availability prometheus-ha-pairs thanos s3 storage cncf prometheus-setup go metrics monitoring observability hacktoberfest

thanos's Introduction

📢 ThanosCon is happening on 19th March as a co-located half-day on KubeCon EU in Paris. Join us there! 🤗 CFP is open until 3rd December!

Overview

Thanos is a set of components that can be composed into a highly available metric system with unlimited storage capacity, which can be added seamlessly on top of existing Prometheus deployments.

Thanos is a CNCF Incubating project.

Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric data in any object storage while retaining fast query latencies. Additionally, it provides a global query view across all Prometheus installations and can merge data from Prometheus HA pairs on the fly.

Concretely the aims of the project are:

Global query view of metrics.
Unlimited retention of metrics.
High availability of components, including Prometheus.

Getting Started

Features

Global querying view across all connected Prometheus servers
Deduplication and merging of metrics collected from Prometheus HA pairs
Seamless integration with existing Prometheus setups
Any object storage as its only, optional dependency
Downsampling historical data for massive query speedup
Cross-cluster federation
Fault-tolerant query routing
Simple gRPC "Store API" for unified data access across all metric data
Easy integration points for custom metric providers

Architecture Overview

Deployment with Sidecar for Kubernetes:

Deployment with Receive in order to scale out or implement with other remote write compatible sources:

Thanos Philosophy

The philosophy of Thanos and our community is borrowing much from UNIX philosophy and the golang programming language.

Each subcommand should do one thing and do it well
- e.g. thanos query proxies incoming calls to known store API endpoints merging the result
Write components that work together
- e.g. blocks should be stored in native prometheus format
Make it easy to read, write, and, run components
- e.g. reduce complexity in system design and implementation

Releases

Main branch should be stable and usable. Every commit to main builds docker image named main-<date>-<sha> in quay.io/thanos/thanos and thanosio/thanos dockerhub (mirror)

We also perform minor releases every 6 weeks.

During that, we build tarballs for major platforms and release docker images.

See release process docs for details.

Contributing

Contributions are very welcome! See our CONTRIBUTING.md for more information.

Community

Thanos is an open source project and we value and welcome new contributors and members of the community. Here are ways to get in touch with the community:

Slack: #thanos
Issue Tracker: GitHub Issues

Adopters

See Adopters List.

Maintainers

See MAINTAINERS.md

thanos's People

Contributors

Stargazers

Watchers

Forkers

gouthamve krasi-georgiev v3ckt0r dadux claudiouzelac mattbostock bwplotka sjanulonoks goodwinos vonneudeck nicolasbernard africanzoe zeari alvaroaleman markjacksonfishing ravisantoshgudimetla linecode raghu999 lulzzz brancz nmaupu mxinden juliusv agora-t simonpasquier huydx dosire featherking monad-one yaspee shaunstanislauslau sysbot bossjones xjewer benoitknecht imskyer fantomx1-github danigrmartinez vdemonchy elvis-cai tandd1123 yuanfeng0905 awesome-nfv wjam povilasv nisxiya everesio danhcole lostick jakekeeys yuecong zexuanhuang colstuwjx twasa meua satchel9 deanwei danielqsj servicefoundation yanghongkjxy huyaowen g-research radishgz pugna0 xyy0807 cofyc yuwentao iamjarvo wutz liujinok irinafakotakis harisekhon wanyinglong ieasydevops tjsampson mbrukman jojohappy guhuajun msuterski jeanfav khanchan kjm0001 navyseal8 ccakes davidquarles wyatt88 jerryhax 40a spfz mgicode dyhpoon gunner-me tgxworld chrischdi johannesfrey sylr emhaye swollo mibc nollbit

thanos's Issues

Latency increasing notably with result size

The result size of the a final PromQL query seems to have very significant effect on total latency.

For example, apiserver_request_count{job="apiserver"} takes at the order of 15-20 seconds. If wrapped with a sum like sum(apiserver_request_count{job="apiserver"}), latency sharply decreases to 2-3 seconds.

The data queried from GCS and flowing into query nodes is the same. The query engine is actually doing more work.
So the whole delta seems to come from sending the result after data fetching and processing completed successfully. The result is bigger, so more JSON encoding happens and more data is sent to the browser. The difference seems rather extreme nonetheless.

Query node is in the US, end client in EU. Still seems rather unusually high.
The query node is accessed via an SSH tunnel, which might play a role as well.

Implement proxy to back store API by Prometheus servers

Since the most recent data is on Prometheus servers, we must be able to fan out to those for queries. For overall simplicity of the system it is feasible to have the sidecar implement the store API on top of Prometheus's remote read API or HTTP v1 API. Probably either works fine but the HTTP v1 API has proper stability guarantees.

Global deployment model

Current test deployments within a single cluster/region work very well. Time to give global deployment options some thought.

Generally speaking, a full setup in every region is probably desirable for reliability and data locality for almost all setups. It also provides trivial means of sharding, which isolates failures and increases scalability.

Currently there seem to be two options:

A) Expand the gossip network globally. All store node, Prometheus servers, and queriers are interconnected.

Pros:

Seamlessly applies our existing model to global scale.
Virtually not additional code/logic
Conceptually works equally well with a single object storage bucket as well as a bucket per region or other splits.

Cons:

Probably need some gossip fine-tuning to support WAN latencies. That would decrease responsiveness to changes in LAN. Probably no big issue overall.
Probably operational complexity for most users (routing and securing gossip traffic)
Do we need/want the same strong reliability properties at global scale?

B) Keep Thanos cluster regional or even more fine-grained. Additional global federation layer across smaller Thanos clusters.

Query nodes would be made aware of each other through regular service discovery mechanism and act as federation proxies for the Store API.

Pros:

Cross-cluster communication is a lot simpler. Probably most organisations have appropriate infrastructure in place.
Query nodes becoming Store API providers for their cluster seems generally elegant.
Extensible for distributed query evaluation.
Since it's basically an extension, we don't block anyone from doing A). Modulo some potential configurables we are lacking.

Cons:

Federated queries are no longer independent from the external service discovery system
Additional logical layer adds code and mental overhead
Conceptually only makes real sense with full Thanos clusters with dedicated buckets in each region.

[ruler] Ruler data is bit off.

Deduplicated query gives 2 series for single series (with and without replica label).

Duplicated and out-of-order compacted data

During debugging for downsampling last week I discovered at least one block that contained duplicated and out of order data.
For about 10% of its series there was a sequence of 3 chunks that was repeated about 180 times. At the end of that repetition were two chunks for times before that sequence. I could not come up with an explanation how this could have happened. Generally, the original block with those 3 chunks might not have been GC'd and then accidentally been re-compacted in over several iterations. But this should not get us to 180.

No data loss occurred as far as I can tell and this should be fully recoverable.
Even though the duplication is very high, the total data blow up is fairly minimal AFAICT.

As we have no explanation of what caused this, the best way to address this seems to be:

Add handling to our normal reads and downsampling that account for the issue so it is not user facing
Add verification logic to our compactor that aborts if it detects such a case again. This way we can detect the issue right away and will have a chance to properly debug it rather than after several more compaction iterations
Add a thanos bucket check command that walks existing blocks and detects the issue. It can also be extended to re-write affected blocks properly.

Create release

Please consider creating a release, so that people using Thanos can reference version numbers.

[UI] Search by metric name does not work after some time.

Looks like just UI issue: Nothing wrong with LabelValues. I can call /api/v1/label/__name__/values?_=1513164422596 as aggressively as I want.

[query] Add a `no-dedup` query parameter and UI support for it.

Related to #107

[query <-> store] too large gRPC message.

Querying go_goroutines for 2w is fine

for 4w I am getting data only from sidecar. Request to store node ends up with:

query ts=2017-12-14T12:50:14.379715022Z caller=querier.go:185 msg="single select failed. Ignoring this result." err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4635734 vs. 4194304)"

Rule evaluations results are scattered (deduplication)?

Query-HA:

On normal query it looks better, but somehow cannot access the newest data:

Manual test real alertmanager with Thanos ruler

Panic on during high mem contention and concurrent store request for the same metrics

After couple of times this is happening:

level=debug name=thanos-store ts=2017-12-08T13:36:30.559067909Z caller=bucket.go:1054 msg="preloaded range" block=01C0MSNM5GS6J537VA4K822STW file=0 numOffsets=21 length=5807 duration=4.399890956s
level=debug name=thanos-store ts=2017-12-08T13:36:30.55943186Z caller=bucket.go:400 msg="preload postings" duration=4.889970819s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb6a652]
goroutine 478309 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Next(0xc469b264c0, 0x1e3df52f)
    <autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Next(0xc4b6652630, 0x2158fed9fd6d1ba0)
    /home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:351 +0x208
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Next(0xc4b6652690, 0x2158fed9)
    /home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:351 +0x208
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc4b66526c0, 0x246dfb94a3f)
    /home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:302 +0x33
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc420229490, 0x12c37e0, 0xc529da4b00, 0xc4bbe7a7e0, 0xc4dc2c8cc0, 0x2, 0x2, 0x15fee39c7b3, 0x16036576393, 0x0, ...)
    /home/bartek/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:404 +0x619
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x0)
    /home/bartek/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:511 +0xa7
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc4de432360, 0xc4d8a71920, 0xc4274cb130)
    /home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
    /home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8

How to handle Alertmanager discovery?

With #92 an initial version of the rule component is largely complete.

What's still missing is sending out the alerts to Alertmanagers. Upstream uses the standard service discovery integrations and designated config section for that.
Some upstream dependency decoupling work I did last week however makes the notification sending properly pluggable.

Prometheus' service discovery is quite useful for extracting meta data and doing advanced filtering. Neither really seem to be a concern when sending alerts to a small set of Alertmanagers. DNS is basically in place everywhere and seems sufficient after all.

I'd like to consider going back to a simple --alertmanagers flag, that just allows a few specifics regarding used DNS records and ports.

http://alertmanager.example.org:8080 – just sends alerts to that specific address using HTTP
dns+https://alertmanager.example.org:8080 – sends alerts to each A record for the hostname name using port 8080 for each, uses HTTPS
dnssrv+https://alertmanager.example.org – like 2. but queries SRV records and uses the port specified there
dns+https://alertmanager.example.org:8080/path/prefix/ – like 2. but uses a path prefix

The flag could be repeatable.
That solves virtually all sane use cases well enough with minimal learning curve.

The alternative is pulling in the whole dependency chain of the Prometheus discovery framework and requiring configuration files with relabeling for the rule component.

Another thing upstream supports are relabeling rules for alerts. The only reasonable use case I can think of is dropping "replica" labels from alerts. Since replicas however are a first-class concept in Thanos, we'd handle that case in a first-class manner.

[query] dedup query param should let specify dedup labels not just true/false

We are basically move the source of truth for replica labels to the client.

Benefits:

We allow to specify replica label on UI per each query
No more fixed replica label on startup
Allowance for many replica labels across cluster (This is actually downside ;p)

Cons:

if you have the rulers out of sync regarding the replica label -> it's bad

Do not call sidecars that do not match with label matchers given in query

Right now we are asking every store API for series with given time range and labels.
Having external labels from sidecars, we can filter out sidecars which does not match with the external label specified in query (if any). E.g:

With given query (time range does not matter in this example):

{ _name_ : "a", external: "prom1", "label1": "1"}

And having these peers with peerType=store

{ external: "prom1" }
{ external: "prom2" }
{ } // Store node will likely does not have external labels propagated.

We can certainly say that there is no need to call 2 store peer for data. This can give some benefits when having lots of sidecards in system.

HA handling for store nodes

Store nodes are currently generally run as a single replica. It's not super critical to have HA in general since several hours or even days of recent data are HA via the Prometheus servers. But for some scenarios it might still be preferable.

Two could simply be deployed and the query node would take care of deduplication/merging just like for Prometheus HA pairs. But unlike Prometheus servers, the underlying data is truly the same in this case and fetching twice the amount is unnecessary overhead.

Some simple logic could be added to the query node to recognize real duplicates (Prometheus HA pairs are actually different through a replica label) and to only query one of them.

Series fetch issue

querying store failed: receive series: rpc error: code = Aborted desc = fetch series for block 01C18S2S39ZQ4VK43SKVYBRKSR: invalid remaining size 4096, expected 13485

From time to time.

Block shipper test failing

Found potential issue of the block shipper test is failing for me locally:

https://github.com/improbable-eng/promlts/blob/master/pkg/shipper/shipper_test.go#L95

We use DB.Blocks() to determine how many blocks should be there. However that number only updates when the DB gets reloaded (unexported method right now). Snapshots are also in another directory and thus not possibly visible for DB.

The number of blocks in the mem store varies in my failing tests.
The tests passing on CI before might be an indicator that the number in the mem store has thus been 0 before. The test would then pass as DB.Blocks also reports zero.

Those are just guesses as I was about wrap it up for today.

sidecar: Allow Thanos backup when local compaction is enabled

Hey @Bplotka @fabxc
Still pretty new to the Thanos code base. Going through it, one thing I've noticed with the backup behaviour is it seems to only dump the initial 2 hourly blocks. In my instance I've stood up thanos against an existing Prom server. My data dir looks as follows:

data]# ls -ltr
total 108
drwxr-xr-x 3 root    root       4096 Jan 28 19:00 01C4Z28176WR17K7PH37K7FG9V
drwxr-xr-x 3 root    root       4096 Jan 29 13:00 01C5101JDEX1TK8CMSC4NQK8KP
drwxr-xr-x 3 root    root       4096 Jan 30 07:00 01C52XV3T2ETVSG93K73HVP6D1
drwxr-xr-x 3 root    root       4096 Jan 31 01:00 01C54VMN5R94ZM7N7F08J20DDA
drwxr-xr-x 3 root    root       4096 Jan 31 19:00 01C56SE59DGWBV587GG9M2W99W
drwxr-xr-x 3 root    root       4096 Feb  1 13:00 01C58Q7Q5G4R0DFY5HDGD3XC9Y
drwxr-xr-x 3 root    root       4096 Feb  2 07:00 01C5AN1885H7B9J1W11DXB137A
drwxr-xr-x 3 root    root       4096 Feb  3 01:00 01C5CJTST135PPXDYDWTKEPYTD
drwxr-xr-x 3 root    root       4096 Feb  3 19:00 01C5EGMASFVT63NC0QZTJSABFJ
drwxr-xr-x 3 root    root       4096 Feb  4 13:00 01C5GEDW21PTYQ8WNKYRCAVQJX
drwxr-xr-x 3 root    root       4096 Feb  5 07:00 01C5JC7DGMKPJVQD2DZH0JT7QJ
drwxr-xr-x 3 root    root       4096 Feb  6 01:00 01C5MA0ZC22NSD0H7S8G207JFF
-rw------- 1 root    root          6 Feb  6 12:29 lock
drwxr-xr-x 3 root    root       4096 Feb  6 19:00 01C5P7TGEMP97FSN779S5E5AYH
drwxr-xr-x 3 root    root       4096 Feb  7 13:00 01C5R5M1ANW6RYY81S91VC0F75
drwxr-xr-x 3 root    root       4096 Feb  8 07:00 01C5T3DJX67W411JZ9FP745B2Q
drwxr-xr-x 3 root    root       4096 Feb  9 01:00 01C5W173WNJA1F01TVJJPT5B93
drwxr-xr-x 3 root    root       4096 Feb  9 19:00 01C5XZ0MQ5BSY2W5K9KKAGF5N2
drwxr-xr-x 3 root    root       4096 Feb 10 13:00 01C5ZWT693CHSK8KEKBW04SSDX
drwxr-xr-x 3 root    root       4096 Feb 11 07:00 01C61TKQF3YXY1XT01JN2X0A3W
drwxr-xr-x 3 root    root       4096 Feb 12 01:00 01C63RD8T9Y2Z2C7G98YX9RBXV
drwxr-xr-x 3 root    root       4096 Feb 12 07:00 01C64D0D1Y2B9P3QHHH9XCF8NV
drwxr-xr-x 3 root    root       4096 Feb 12 09:00 01C64KW317054P6TNH8DCNGRP9
drwxr-xr-x 3 root    root       4096 Feb 12 11:00 01C64TQT974JFMQV24CH9060XW
drwxrwxrwx 2 root    root       4096 Feb 12 11:56 wal

When standing up Thanos I see:

./thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /opt/prometheus/promv2/data/ --s3.bucket=thanos --s3.endpoint=xxxxxxxx --s3.access-key=xxxxxxx --s3.secret-key=xxxxxx
level=info ts=2018-02-12T12:25:49.329654785Z caller=sidecar.go:293 msg="starting sidecar" peer=01C64ZMYG5728WQFXTVCD0F70V
level=info ts=2018-02-12T12:25:49.652116167Z caller=shipper.go:179 msg="upload new block" id=01C64KW317054P6TNH8DCNGRP9
level=info ts=2018-02-12T12:25:51.747570129Z caller=shipper.go:179 msg="upload new block" id=01C64TQT974JFMQV24CH9060XW

only the last two blocks are uploaded. From the flags for thanos sidecar I don't see a mechanism for specifying a period for backdating. Perhaps I am doing something wrong? Is this intentional for some reason (compute/performance)? Or am I simply listing a feature request here?

Thanks.

HTTP related memory leak in store nodes

The store node is leaking memory very heavily. For the high request frequency I currently have, each query bumps the resident memory by 200-500MB.

The memory seems to go somewhere in net/http but is neither reused nor freed. From the heap profile I cannot tell whether it comes from our own gRPC or from requests to GCS.
We are not opening connections ourselves for either but only instantiate a constant number of clients. So possibly it's something we have to change in how we configure those.

Default Make target installs binaries from the Internet

Running the default Make target invokes the install-tools target, which has the unexpected side-effect of performing a 'git pull' on repositories under the GOPATH:
https://github.com/improbable-eng/thanos/blob/b77877f42f9edf67a4413a9602054850a1a26afa/Makefile#L25-L31

We might want to consider using vg instead, which supports using dep metadata to specify which binaries are required. This would also allow for versions to be pinned, to make the builds repeatable.

The default Make target should not install binaries on a user's system.

[ruler] Should ruler scrape duplicated data? Or HA should be transparent for it?

I think scraping deduplicated data is actually the correct answer. I think we had idea to add deduplication batch job for old data that will reduce data in obj store.

[store node] Add cost-wise metric to understand a cost of running GCS store node

ui: Add store targets page for querier.

We wanted to do that with some UI upgrade, but that is extremely useful now -> Setting gossip cluster and ensuring all labels are ok might be tricky.

K8s manifests for sidecar, query node and (?) store node

Basically we need these to run on any k8s cluster.

The question is if store node, which is bit stateful is useful to be provisioned on k8s or just separate VM. I guess being able to run it on k8s as well is not a harm.

[store] Improve LRU cache with additional features

For instance:

Removal of item when block is deleted
Time-based retention (?)

Come up with convention for partial errors

When fanning out requests to multiple Prometheus instances or store nodes, partial failures will happen and should be handled gracefully. We need a consistent way to express partial failures for internal and external error handling.

[store] Improve startup time.

Define "store" API

We need a well-defined gRPC API to query series data by time range and a set of label selectors. It should return compressed chunks. The compression format is known and query layer or other clients can decompress chunks as needed after retrieval.

This API will not be compatible with the Prometheus remote read API

[shipper] Warning about .tmp Prometheus files

Got

level=warn name=sidecar ts=2017-11-16T15:30:06.554929827Z caller=shipper.go:71 msg="open file failed" err="stat /prometheus-data/01BZ2Q6ZJYT2R3N1ZM4P54XT7T.tmp: no such file or directory"

We should not care about all the irrelevant files Prometheus or someone can add

Incorrect regexp works on store node, eg: `go_goroutines{job=~"thanos-*"}

go_goroutines{job=~"thanos-*"} works for store node, but it should not
(proper) go_goroutines{job=~"thanos-.*"} works for both prometheus and store node.

Thanos image update

Hey @Bplotka @fabxc,

I'm using the following to stand up a thanos store gateway

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: thanos-store
spec:
  serviceName: "thanos-store"
  replicas: 1
  selector:
    matchLabels:
      app: thanos
      thanos-peer: "true"
  template:
    metadata:
      labels:
        app: thanos
        thanos-peer: "true"
    spec:
      containers:
      - name: thanos-store
        image: improbable/thanos:latest
        env:
        - name: SECRET_ACCESS
          valueFrom:
            secretKeyRef:
              name: secret-s3
              key: secret_access_s3
        - name: SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: secret-s3
              key: secret_key_s3
        args:
        - "store"
        - "--log.level=debug"
        - "--tsdb.path=/var/thanos/store"
        - "--cluster.peers=thanos-peers.default.svc.cluster.local:10900"
        - "--s3.bucket=<bucket_name>"
        - "--s3.endpoint=s3-<az>.amazonaws.com"
        - "--s3.access-key=$(SECRET_ACCESS)"
        - "--s3.secret-key=$(SECRET_KEY)"
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        - name: cluster
          containerPort: 10900
        volumeMounts:
        - name: data
          mountPath: /var/thanos/store
      volumes:
      - name: data
        emptyDir: {}

However the image is failing to spin up:

Error parsing commandline arguments: required flag --gcs.bucket not provided
usage: thanos store --gcs.bucket=<bucket> [<flags>]

I suspect the improbable/thanos:latest image hasn't been updated and is an older image, as I can see in the master branch that the --gcs.bucket flag has been changed so it is no longer Required()

// registerStore registers a store command.
func registerStore(m map[string]setupFunc, app *kingpin.Application, name string) {
	cmd := app.Command(name, "store node giving access to blocks in a GCS bucket")

	grpcAddr := cmd.Flag("grpc-address", "listen address for gRPC endpoints").
		Default(defaultGRPCAddr).String()

	httpAddr := cmd.Flag("http-address", "listen address for HTTP endpoints").
		Default(defaultHTTPAddr).String()

	dataDir := cmd.Flag("tsdb.path", "data directory of TSDB").
		Default("./data").String()

	gcsBucket := cmd.Flag("gcs.bucket", "Google Cloud Storage bucket name for stored blocks. If empty sidecar won't store any block inside Google Cloud Storage").
		PlaceHolder("<bucket>").String()

Some series's oldest data are not available (?)

So for go_goroutines we have oldest datapoint at 26 Nov:

But for example for go_goroutines{job="thanos-store"} it is better 21 Nov (!)

The metric without any matcher should show these old datapoints...

Skip preloading of irrelevant postings

It just occurred to me that once we have blocks from many sources, it can rather easily happen that we preload a lot of irrelevant postings.

Suppose you get up{job="A"}. The up metric will be present in virtually every single block for the given time range. However, job="A" will probably be present in just a few. Currently if there's no postings list of job="A" for a block, we will return an EmptyPostings transparently and still register the postings list for __name__="up" for preloading for all blocks.

If we were more explicit about the non-existant postings, we could skip that work. That however would go pretty deep down into index fetching internals and we can only infer this by looking at the operations between postings (intersection vs. merge).

Probably not something to optimize quite right now yet. Especially since we've ongoing upstream work in that area.
But we should keep it in mind in case we run into performance issues later.

Implement store layer

The store layer is an HA pair of nodes that connect to the GCS bucket containing all block data. They load all index files to local disk and mmap them to quickly resolve queries of time range + label matchers.
They dynamically load chunks via range queries from GCS and potentially store them in an in-memory LRU cache.

Chunks for the same series are sequential within a block file. Thus multiple chunks can be loaded with a single range request. Similarly, chunks for multiple series are co-located and can be loaded in a single request as well.
Some research needs to be done on how to consolidate multiple chunk queries. From a latency perspective it can make sense to load significant amount of 'unnecessary' data (gaps between required chunks).

Sharding of store nodes is a non-goal initially. See the design doc for calculations leading to that decision.

Remove overlap when querying data that both Source and Store have.

Basically we don't want to ask for the same data twice. Since we know that:

Source:
- Produces some data and pushes that to some storage
- Makes that data accessible for short time (aggressive retention)
- Faster
Store:
- Does not produce any data just enable browsing data from storage
- Probably slower

As we agreed, it would nice to get as much as possible from Source and rest from store.
As an example:

0───2───4───6───8───10
└────Store───┘
         └───Source──┘
    └────Query───┘

When we query for 2-8 we can make naive algorithm that gets max(minTime) from Source (4 here) and we can query
2-4 + 1 for safeguard so 2-5 from store
4-8 + 1 for safeguard so 3-8 from source.
We need safeguards, since the propagation is eventually consistent.

Problems:

There could be multiple sources that can have different minTime:

0───2───4───6───8───10
└────Store───┘
         └───Source1─┘
           └─Source2─┘
    └────Query───┘

And even worse in this case for store.. are we sure that 0-6 include actually 4-6 from Source1?

Potential solution: We need minTime-maxTime ranges for each store -> PER source.

There could be multiple stores that can use different backends.. (e.g different buckets -> "global queue" case)

Potential solution: minTime->maxTime per store or storageID

As we can see, any solution requires to introduce a lot of complexity (passing some IDs to match time ranges), but for what benefit? In reality it in worst case it will be ~12h of unnecessary data for each source in system. That sounds a lot if you want to have rule component and HA (so 3*12h)

Potential complexity reduction:
Since store nodes have peer info about source, they can limit returned data to the data still available from source node. But still we need some ID matching.

build and compile

Hey @Bplotka,

Am I missing something here? 🤔

So I am seeing errors for incorrect arguments when doing go get:

$ go get github.com/improbable-eng/thanos/...
# github.com/improbable-eng/thanos/pkg/block
go1/src/github.com/improbable-eng/thanos/pkg/block/index.go:152:31: not enough arguments in call to index.NewFileReader
	have (string)
	want (string, int)
# github.com/improbable-eng/thanos/pkg/objstore
go1/src/github.com/improbable-eng/thanos/pkg/objstore/objstore.go:207:32: cannot use b.opsDuration.WithLabelValues(op) (type prometheus.Observer) as type prometheus.Histogram in argument to newTimingReadCloser:
	prometheus.Observer does not implement prometheus.Histogram (missing Collect method)
go1/src/github.com/improbable-eng/thanos/pkg/objstore/objstore.go:222:32: cannot use b.opsDuration.WithLabelValues(op) (type prometheus.Observer) as type prometheus.Histogram in argument to newTimingReadCloser:
	prometheus.Observer does not implement prometheus.Histogram (missing Collect method)

and when doing a make I get:

$ make
>> fetching goimports
>> fetching promu
>> fetching dep
>> dep ensure
>> formatting code
make: goimports: Command not found
make: *** [format] Error 127

Go versions: go version go1.9.4 linux/amd64

resolvePeers() overrides peer lookup behaviour in memberlist library

The resolvePeers() function in the cluster package tries to resolve the IP addresses for the cluster peers:
https://github.com/improbable-eng/thanos/blob/19ff509a4f658887f90ab543c79cc821762f64cd/pkg/cluster/cluster.go#L423-L476

It does this using the Go DNS resolver.

The memberlist library does the same thing, but uses TCP first in a bid to retrieve as many peers as possible:
https://github.com/hashicorp/memberlist/blob/ce8abaa0c60c2d6bee7219f5ddf500e0a1457b28/memberlist.go#L303-L306

Go's resolver may try TCP if the UDP query results in a truncated response, so maybe this is not an issue. But it seems like Thanos is doing more than it needs to - there's a lot of overlap between the code in these two places.

Should Thanos delegate DNS resolution for peers to the memberlist library, and is there a compelling reason for it to do its own resolution as it is currently?

Define store API behavior on returned time range

When we query the store API, we define a time window. When we retrieve chunks from block data, they do not align with that query window. For chunks crossing the query window at beginning and wend we could always decode the chunk, drop irrelevant samples, and encode new chunks.

This however can be quite a few CPU cycles and allocations. The bandwidth overhead of just sending the full chunks is minor AFAICS. It might be a good idea to just send chunks as they are and have the query layer ensure we skip over irrelevant samples.

For the current sidecar implementation, we guarantee to return exactly the relevant data since we query raw datapoints and encode them as a chunk. But this might not be the case indefinitely – we eventually might want to get a chunk-based API upstream that's similar to the store API.

Any concerns with this proposal?

[query] Difference between direct Prom and query graph (~staleness)

Somehow missing data points are ... represented differently?

(Removed image because it could contain sensitive data)

Upload blocks from Prometheus nodes to GCS

This is implemented through a sidecar, which watches Prometheus's data directory for new blocks and uploads them with the existing directory structure into a single GCS bucket.

Implement querying layer

The querying layer is a stateless and horizontally scalable component that implements PromQL on top of the store API. It parses PromQL queries and loads relevant series and their chunks through the store API.

It is TBD whether the store nodes should proxy to the sidecars or the querying layer knows about store nodes and all Prometheus servers. Once we implement pre-filtering of blocks/Prometheus servers based on well-known label sets it might be preferable to not pull this complexity in the querying layer itself.

It should query just a single random store node and only fail over if that one does not respond. However, capability to merge and deduplicate overlapping data should be a direct capbility of the query layer.

The query layer implements the Prometheus HTTP v1 query API to be compatible with existing tools such as Grafana.

Add rule service

The overall system provides us with a global view of the data. We want to make use of that for recording and alerting rules as well. This requires a service which evaluates those rules against the querying layer and writes this data to disk.

The component should appear like yet another Prometheus server to the outside, i.e. it writes down data from recording rules in the Prometheus format. The sidecar is deployed along with it, uploads completed blocks, and responds to queries for recent data.
Firing alerts are sent out in the same way as from a regular Prometheus server.

Since the expensive querying is outsourced to the querying layer, we probably don't need any sharding here either and can stick with an HA pair.

To retain some reliability properties of the Prometheus model, it may be a good idea to only evaluate rules in this component that actually require a global data view. Rules that just need data scoped to a single collector Prometheus could still be evaluated there. Finding some decision rules and potentially and automated process for this would be a valuable extension.
There are caveats to this for recording rules. Maybe it is only feasible for alerts.

Cache LabelValues

We should keep an eye on the LabelValues RPC to store nodes. It will always hit every single block, which can get particularly costly on GCS store nodes.

This data does rarely change and is only used for things like auto-completion. But for that it is called whenever the query UI opens up. Probably the query nodes themselves should already cache them.

Implement data merging/deduplicating

Merging and deduplicating of data is non-trivial and has no hard truth. We need to find some heuristic we want to follow. This will be used at query time but also to consolidate data by rewriting it. So it has a significant and irreversible impact.

Implement a way to validate completed block upload

Currently when we upload blocks to the object storage, if we encounter an error we clean up after ourselves and delete all files that were uploaded so far. This is however not 100% safe yet.

If the uploader has a hard crash, orphaned objects are left behind. Files are uploaded in order so a block will only be validly considered if the meta.json exists, which is uploaded last. This guards against for example, considering a block during garbage collection even though it is only partial. This is however implicit and makes us rely on upload order, i.e. we cannot parallelise it.

We should come up with a way to ensure this. This could either be a dedicated done file for each block or setting of a metadata/label onto the meta.json file in the object storage. Either way, our code would do an explicit finalization after successful upload of all block data.
We should also add a background cleanup procedure that detects and purges partial blocks.

Small and unbalanced downsampled chunks produced

I'm currently debugging a case where a downsampled series is producing a lot of small chunks rather than few sufficiently large ones. A small fraction of series are also finalized with chunks with a timestamp range of [-1, -1] for which there's no logical reason in the base data.

This does likely not affect correctness of query results but was caught by the new validations procedures added in #194. It's probably really worth keeping those around forever to catch regressions into behavior like this early.

[query] grpc Resource Exhausted error or specific queries

Query: topk(10, count by (__name__)({__name__=~".+"}))
Query node response:

Error executing query: fetch series: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (10277563 vs. 4194304)

Time:

Load time: 3412ms 
Resolution: 14s 
Total time series: 0

Doing the same on Prometheus gives proper data with time:

Load time: 651ms 
Resolution: 14s 
Total time series: 10
Execute

Refactor AWS object store auth to be as simple and flexible as Google `FindCredentials`.

While running in AWS, you can use the instance profile to get an auth token, but currently Thanos doesn't configure the bucket if no credentials are provided.

minio supports IAM :

iam := credentials.NewIAM("")
s3Client, err := minio.NewWithCredentials("s3.amazonaws.com", iam, true, "")

Would you consider a PR ?

Huge memory usage on average store node load.

Doing simple query over not really long period can OOM the pod (with given 6GB of RAM). Obviously we can increase the box size, but can we improve that?

Initial profiling shows that postings:

  .   144.70GB    393:    if err := indexr.preloadPostings(); err != nil {
         .          .    394:        return nil, err
         .          .    395:    }

Is increasing really fast with our ruler service scraping it every 15s.

In comparison with compressed series:

         .   446.41MB    407:	if err := indexr.preloadSeries(ps); err != nil {
         .          .    408:		return nil, err
         .          .    409:	}

Memory:

Mem allocs per sec:

thanos-io / thanos Goto Github PK

thanos's Introduction

Overview

Getting Started

Features

Architecture Overview

Thanos Philosophy

Releases

Contributing

Community

Adopters

Maintainers

thanos's People

Contributors

Stargazers

Watchers

Forkers

thanos's Issues

Recommend Projects

Recommend Topics

Recommend Org