cortexproject / cortex
A horizontally scalable, highly available, multi-tenant, long term Prometheus.
Home Page: https://cortexmetrics.io/
License: Apache License 2.0
To give users some amount of initial feedback, it'd be great for Cortex to show some per-user stats in the user's Cortex dashboard:
This will require adding backend support for getting those stats from the ingesters. Probably the distributor will have to query all ingesters for their current metadata for a given user and then aggregate those stats for the UI.
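A hedged sketch of that fan-out-and-aggregate path in Go; IngesterClient, UserStats, and the replication-factor correction are all assumptions, not the actual Cortex API:

```go
package stats

import "fmt"

// UserStats is a hypothetical per-user summary one ingester could return.
type UserStats struct {
	NumSeries  int
	IngestRate float64 // samples per second
}

// IngesterClient stands in for whatever RPC client the distributor holds.
type IngesterClient interface {
	UserStats(userID string) (UserStats, error)
}

// aggregateUserStats queries every ingester and sums the results. With a
// replication factor > 1 each series lives on several ingesters, so the
// naive sum over-counts and is divided back down.
func aggregateUserStats(userID string, ingesters []IngesterClient, replicationFactor int) (UserStats, error) {
	var total UserStats
	for _, ing := range ingesters {
		s, err := ing.UserStats(userID)
		if err != nil {
			return UserStats{}, fmt.Errorf("querying ingester: %v", err)
		}
		total.NumSeries += s.NumSeries
		total.IngestRate += s.IngestRate
	}
	total.NumSeries /= replicationFactor
	total.IngestRate /= float64(replicationFactor)
	return total, nil
}
```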
When I've shown our Prometheus functionality to people at conferences, they have asked about custom dashboards. Specifically, they've asked about:
The last case is from someone who is already running Prometheus & Grafana, but has their own proprietary data source that they use to add extra information to their graphs.
To help with debugging.
Should be exposed on ingesters and distributors.
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#1
This repo:
On service:
Other
To verify:
- git grep prism returns only unrelated vendored results
- git ls-files | grep prism returns empty
- git grep prism returns empty
- git ls-files | grep prism returns empty
- git grep prism returns empty
- git ls-files | grep prism returns empty

From @tomwilkie
We need this so that when ingesters go away, we don't lose their data.
Copied from original issue: tomwilkie/frankenstein#11
From @tomwilkie
To allow for non-metric-name queries, and improve load balancing in ingesters
Copied from original issue: tomwilkie/frankenstein#12
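Assuming the change here is to shard series by a hash of the full label set rather than only the metric name (an interpretation of the description above, not confirmed by the issue), a minimal sketch:

```go
package sharding

import (
	"hash/fnv"
	"sort"
)

// Label is an illustrative stand-in for a Prometheus label pair.
type Label struct{ Name, Value string }

// shardToken hashes the entire (sorted) label set rather than just
// __name__, so series of one hot metric spread across ingesters and
// placement doesn't depend on the metric name at all.
func shardToken(labels []Label) uint32 {
	sort.Slice(labels, func(i, j int) bool { return labels[i].Name < labels[j].Name })
	h := fnv.New32a()
	for _, l := range labels {
		h.Write([]byte(l.Name))
		h.Write([]byte{0}) // separator so ("ab","c") != ("a","bc")
		h.Write([]byte(l.Value))
		h.Write([]byte{0})
	}
	return h.Sum32()
}
```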
Currently, when identifying ingesters to write data to, we include those that are in the process of shutting down. These ingesters (correctly) refuse to accept new writes, as they are busy writing everything they have to dynamo. This causes writes to fail, which causes a loss of user data.
Instead, we should exclude ingesters that are shutting down when picking write targets, as sketched below.
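A minimal sketch of that filtering; IngesterDesc and State are illustrative, not the real ring types:

```go
package ring

// State and IngesterDesc are illustrative stand-ins for the real ring types.
type State int

const (
	Active State = iota
	Leaving // shutting down, flushing to the store; refuses new writes
)

type IngesterDesc struct {
	Addr  string
	State State
}

// writeTargets keeps only ingesters that will still accept new samples.
func writeTargets(all []IngesterDesc) []IngesterDesc {
	var out []IngesterDesc
	for _, ing := range all {
		if ing.State == Active {
			out = append(out, ing)
		}
	}
	return out
}
```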
time="2016-10-10T13:17:29Z" level=error msg="Failed to flush chunks for series: RequestError: send request failed\ncaused by: Put https://weaveworks-prod-chunks.s3.amazonaws.com/2/17142991289174007630%3A1476104540889%3A1476105230889: dial tcp: lookup weaveworks-prod-chunks.s3.amazonaws.com on 10.0.0.10:53: dial udp 10.0.0.10:53: connect: network is unreachable" source="cortex.go:440"
This might be a cause of data loss.
When we roll out new versions of the ingesters, they sometimes fail to correctly deregister themselves from consul, causing user-visible errors manifesting as 500s on /push requests.
Distributor logs will have messages like:
time="2016-09-12T14:12:22Z" level=error msg="error sending request: Post http://10.244.9.10:80/push: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" source="server.go:107"
And the retrieval queues will likely fill up.
You can confirm this is a problem by looking at the distributor logs and observing:
time="2016-09-12T15:00:31Z" level=info msg="Got update to ring - 6 ingesters, 512 tokens" source="ring.go:85"
Where 6 is higher than the number of actual ingester pods.
The workaround is to restart consul like so:
kubectl delete pod --namespace frankenstein -l name=consul
After a minute or so, the errors in the distributor log should clear.
On Weave Cloud dev & prod, getting:
Error executing query: too few successful reads, last error was: <nil>
when executing queries.
I haven't had a chance to investigate, but my guess is that deployments no longer visibly break the cluster (per #19) but they do silently break it, partly because we're not yet alerting on 500s for cortex at weaveworks (cf. weaveworks/service#904).
This is probably something that will be fixed by clean shut-downs (which, at weaveworks, are blocked on weaveworks/service-conf#111), but I wanted to track something symptom-oriented so I could have something to search for, and also make notes on other things I found along the way.
Debugging this would be made easier by addressing #57, #58, #59, and #60.
We are getting a fairly constant stream of 500s from Dynamo when we attempt to write to it (~3-4qps on dev).
We should understand what's causing this & fix it. https://github.com/weaveworks/monitoring/issues/12 will help.
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#7
Even just by accident, it's really easy for a user to overload or hotspot an ingester right now, by either sending too many series with the same metric name, or just too many time series in general (especially when accidentally misusing a label with unbounded value cardinality).
We should have some way of limiting the user's number of series.
Related to #58
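A minimal sketch of what such a limit could look like, assuming an in-memory count of active series per user (all names here are illustrative):

```go
package limits

import (
	"errors"
	"sync"
)

var errTooManySeries = errors.New("per-user series limit exceeded")

// seriesLimiter tracks active series per user; the in-memory map and the
// flat limit are illustrative simplifications.
type seriesLimiter struct {
	mtx       sync.Mutex
	perUser   map[string]int
	maxSeries int
}

func newSeriesLimiter(maxSeries int) *seriesLimiter {
	return &seriesLimiter{perUser: map[string]int{}, maxSeries: maxSeries}
}

// addSeries records one new series for userID, rejecting it if the user is
// already at the limit. Samples for existing series are unaffected.
func (l *seriesLimiter) addSeries(userID string) error {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	if l.perUser[userID] >= l.maxSeries {
		return errTooManySeries
	}
	l.perUser[userID]++
	return nil
}
```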
The new metrics defined to track replicated queries are not being exported at all.
Some are prism, others prometheus. I don't care much which we pick as long as we're consistent.
A couple of times today, we've got:
time="2016-10-25T16:47:27Z" level=info msg="POST /push (500) 420.17µs"
time="2016-10-25T16:47:29Z" level=error msg="append err: sample timestamp out of order" source="server.go:125"
These represent invalid input from the client and should be reported as 400s rather than 500s.
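A sketch of mapping known client errors onto 400s; the sentinel error and handler shape are assumptions rather than the actual code:

```go
package api

import (
	"errors"
	"net/http"
)

// errOutOfOrderSample is a stand-in; the real append path returns its own
// error values for out-of-order and duplicate samples.
var errOutOfOrderSample = errors.New("sample timestamp out of order")

// appendSamples is a stub representing the real ingestion path.
func appendSamples(r *http.Request) error { return nil }

func pushHandler(w http.ResponseWriter, r *http.Request) {
	switch err := appendSamples(r); {
	case err == nil:
		w.WriteHeader(http.StatusOK)
	case errors.Is(err, errOutOfOrderSample):
		// Invalid input from the client, not a server failure: 400.
		http.Error(w, err.Error(), http.StatusBadRequest)
	default:
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}
```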
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#2
The vendored prometheus vendors a bunch of stuff, and it shouldn't.
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#5
As part of understanding & fixing #61 and making sure it doesn't happen again, we want to gather data on:
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#6
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#16
time="2016-09-20T20:50:48Z" level=info msg="Flushing 1 chunks" source="frankenstein.go:468"
panic: runtime error: slice bounds out of range
goroutine 98074 [running]:
panic(0xd68a20, 0xc820014030)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushSeries(0xc820114540, 0x7fc41d6d9c90, 0xc822b09050, 0xc820256c60, 0x9842fa55f203a838, 0xc820321420, 0xc820641e00, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:476 +0x5f8
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushAllSeries.func1(0xc820114540, 0x7fc41d6d9c90, 0xc822b09050, 0xc820256c60, 0xc821466860, 0x0, 0xc821466850)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:439 +0x84
created by github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushAllSeries
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:444 +0x181
From @tomwilkie
Currently 10mins, should be 1hr.
Copied from original issue: tomwilkie/frankenstein#10
A couple of times over the last few days I've been pinged by people who have been confused by the cortex retrieval example in the sockshop, which hardcodes a dummy, broken value for the Weave Cloud access token.
It might be nice, as part of Cortex or something else, to provide an image that takes the access token as a command-line argument or environment variable, and a set of Kubernetes configurations that use that image rather than prometheus.
Pro:
Con:
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#8
Bugs around that logic have bitten us a couple of times now, so it'd be good to have at least some basic tests.
time="2016-09-22T09:41:54Z" level=error msg="Error CASing collectors/ring: Unexpected response code: 413 (Value exceeds 524288 byte limit)" source="consul_client.go:146"
time="2016-09-22T09:41:54Z" level=fatal msg="Failed to pick tokens in consul: failed to CAS collectors/ring" source="ingester_lifecycle.go:113"
From @tomwilkie
Could probably also be upstreamed.
Copied from original issue: tomwilkie/frankenstein#4
When I've shown our hosted Prometheus stuff to people at conferences, they almost immediately ask about alerting, and being able to define custom alerts. We should implement that.
TODO before KubeCon:
TODO later:
When we attempt to write to dynamo, we will sometimes get errors. Currently, we abort on error.
This is almost never the right behaviour, because failing to write to dynamo (esp. during shut down) means loss of user data. Instead, we should retry.
Note that one major source of errors is being throttled by Dynamo. In this case, we should also retry, but we should be especially careful that we back off, lest we cause a cascading failure.
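A minimal sketch of that retry-with-capped-backoff behaviour (retry count, base, and cap are illustrative):

```go
package backoff

import (
	"context"
	"time"
)

// writeWithBackoff retries a failing write with capped exponential backoff,
// so throttling doesn't become data loss or a retry storm.
func writeWithBackoff(ctx context.Context, write func() error) error {
	const (
		maxRetries = 10
		maxBackoff = 10 * time.Second
	)
	backoff := 100 * time.Millisecond
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = write(); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return err
}
```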
We sometimes get a bunch of messages like these:
time="2016-10-18T16:16:52Z" level=error msg="Failed to flush chunks for series: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.\n\tstatus code: 400, request id: REDACTED" source="ingester.go:442"
Can we handle this case better?
9m log entries in 8hrs, which is about 40% of the cluster's output (another 40% being kube-dns, which is now fixed).
66% of the log lines from the ingester are "Flushing chunks".
From @tomwilkie
Should reduce index size, super important for scalability.
Copied from original issue: tomwilkie/frankenstein#9
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#15
From @tomwilkie
And get a handle (ie way to measure) on the chunk backlog.
Copied from original issue: tomwilkie/frankenstein#13
When debugging the distributor, ring state is pretty important. We have some metrics like prometheus_distributor_ingesters_total, prometheus_distributor_ingester_clients, and prometheus_distributor_ingester_ownership_percent, but these don't distinguish between ingesters that have had a recent heartbeat and those that haven't.
It's possible that the "fix" for this is to just clean up old ingesters properly, but I think even then I'd like either:
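One of those options might look roughly like the following: a gauge partitioned by heartbeat health. The metric name, label, and ring type here are assumptions for the sketch:

```go
package ringmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ingesterDesc is an illustrative ring entry.
type ingesterDesc struct{ LastHeartbeat time.Time }

// The metric name is hypothetical, chosen only for this sketch.
var ringIngesters = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "distributor_ring_ingesters",
	Help: "Ingesters in the ring, partitioned by heartbeat health.",
}, []string{"healthy"})

func init() { prometheus.MustRegister(ringIngesters) }

// updateRingMetrics splits the ring into live and stale entries, so a
// dashboard can spot stale entries without cross-referencing logs.
func updateRingMetrics(ring []ingesterDesc, heartbeatTimeout time.Duration) {
	healthy, stale := 0, 0
	for _, ing := range ring {
		if time.Since(ing.LastHeartbeat) < heartbeatTimeout {
			healthy++
		} else {
			stale++
		}
	}
	ringIngesters.WithLabelValues("true").Set(float64(healthy))
	ringIngesters.WithLabelValues("false").Set(float64(stale))
}
```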
panic: inconsistent label cardinality
goroutine 512 [running]:
panic(0xd22c00, 0xc8200156d0)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).WithLabelValues(0xc82026c300, 0xc820905dd8, 0x2, 0x2, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus/vec.go:135 +0xae
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus.(*HistogramVec).WithLabelValues(0xc820156248, 0xc820905dd8, 0x2, 0x2, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus/histogram.go:337 +0x55
github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument.TimeRequestHistogramStatus(0xeac790, 0x4, 0xc820156248, 0x1062dc8, 0xc820905e80, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument/instrument.go:39 +0x1f7
github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument.TimeRequestHistogram(0xeac790, 0x4, 0xc820156248, 0xc820905e80, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument/instrument.go:24 +0x57
github.com/weaveworks/frankenstein.(*Distributor).sendSamples(0xc82027e2a0, 0x7f06e68a6e50, 0xc8201eb7a0, 0xc820236ad0, 0xe, 0xc8201a6800, 0x1a, 0x20, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/distributor.go:179 +0x12b
github.com/weaveworks/frankenstein.(*Distributor).Append.func1(0xc82006e5a0, 0xc82027e2a0, 0x7f06e68a6e50, 0xc8201eb7a0, 0xc820236ad0, 0xe, 0xc8201a6800, 0x1a, 0x20)
/go/src/github.com/weaveworks/frankenstein/distributor.go:160 +0x7d
created by github.com/weaveworks/frankenstein.(*Distributor).Append
/go/src/github.com/weaveworks/frankenstein/distributor.go:161 +0x4c3
We got some great feedback from a new user, @bboreham, who said:
The Prometheus Getting Started page tells me to write a prometheus.yml file which is essentially identical to the one that came in the distribution (where "distribution" is the tarball provided at https://prometheus.io/download/)
and
The Configuration page linked to in the cortex README is too complicated; not suitable for "getting started".
(I've paraphrased for context, adding links as appropriate).
Given that we hope to provide something useful to people without deep Prometheus expertise, we should think about how we can provide a smoother initial experience for those who are unfamiliar with it.
One possible solution would be to provide a container with some preset configuration.
Tracking ticket for stuff that didn't get a proper review:
We should only write to the minimum required.
This should improve tail latencies & reduce load on the system.
See https://github.com/weaveworks/cortex/pull/80/files#r85902648
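If "the minimum required" refers to a write quorum (an assumption; the linked review thread has the real context), one common shape is to send to all replicas but return once a quorum has acknowledged:

```go
package replication

import "errors"

// replica stands in for a client to one ingester or storage replica.
type replica interface {
	Push(samples []byte) error
}

// quorumWrite sends to every replica concurrently but returns as soon as
// minSuccess of them acknowledge, so one slow replica no longer sets the
// tail latency. It fails fast once a quorum has become impossible.
func quorumWrite(replicas []replica, samples []byte, minSuccess int) error {
	results := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r replica) { results <- r.Push(samples) }(r)
	}
	succeeded, failed := 0, 0
	for range replicas {
		if err := <-results; err == nil {
			if succeeded++; succeeded >= minSuccess {
				return nil
			}
		} else {
			if failed++; failed > len(replicas)-minSuccess {
				return errors.New("too few successful writes")
			}
		}
	}
	return errors.New("too few successful writes")
}
```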
If we merge samples from many chunks for the same series, this could get slow.
Related to #9
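For concreteness, this is the kind of merge in question (illustrative types); the repeated scan over every chunk head is what makes it slow with many chunks per series:

```go
package merge

// sample is an illustrative timestamp/value pair.
type sample struct {
	TimestampMs int64
	Value       float64
}

// mergeSamples merges already-sorted per-chunk sample slices by repeatedly
// scanning every chunk head for the smallest timestamp, dropping duplicate
// timestamps. The scan makes it O(chunks x samples).
func mergeSamples(chunks [][]sample) []sample {
	var out []sample
	for {
		best, bestIdx := int64(1)<<62, -1
		for i, c := range chunks {
			if len(c) > 0 && c[0].TimestampMs < best {
				best, bestIdx = c[0].TimestampMs, i
			}
		}
		if bestIdx < 0 {
			return out
		}
		s := chunks[bestIdx][0]
		chunks[bestIdx] = chunks[bestIdx][1:]
		if len(out) == 0 || out[len(out)-1].TimestampMs != s.TimestampMs {
			out = append(out, s)
		}
	}
}
```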
Right now, the only way an ingester is ever removed from the ring is if it shuts down gracefully. In any other circumstance (crash, network partition, asteroid strike) nothing will ever invalidate the old entries, until such a time as an ingester with the same hostname reconnects. This in turn means a large number of failing operations, requiring operator intervention.
This issue is less pronounced in most environments, where a crashed ingester typically will be swiftly restarted with an identical hostname. It could easily occur if an error occurred during, say, a rolling upgrade in k8s (new pod = new hostname). Note however that I haven't seen it in the wild. Yet.
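A sketch of one possible remedy, expiring ring entries whose heartbeat has gone stale; the types, the timeout, and the CAS note are assumptions about how this would slot in:

```go
package ringcleanup

import "time"

// entry is an illustrative ring member record.
type entry struct {
	Hostname      string
	LastHeartbeat time.Time
}

// pruneStale drops entries whose heartbeat is older than timeout, so a
// crashed ingester eventually disappears without operator intervention.
// In the real ring this would need to run inside the same CAS loop that
// updates Consul, so concurrent updates aren't lost.
func pruneStale(ring []entry, timeout time.Duration) []entry {
	live := ring[:0]
	for _, e := range ring {
		if time.Since(e.LastHeartbeat) < timeout {
			live = append(live, e)
		}
	}
	return live
}
```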