cortexproject / cortex
A horizontally scalable, highly available, multi-tenant, long term Prometheus.
Home Page: https://cortexmetrics.io/
License: Apache License 2.0
To give users some amount of initial feedback, it'd be great for Cortex to show some per-user stats in the user's Cortex dashboard:
This will require adding backend support for getting those stats from the ingesters. Probably the distributor will have to query all ingesters for their current metadata for a given user and then aggregate those stats for the UI.
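A hedged sketch of that fan-out-and-aggregate path in Go; IngesterClient, UserStats, and the replication-factor correction are all assumptions, not the actual Cortex API:

```go
package stats

import "fmt"

// UserStats is a hypothetical per-user summary one ingester could return.
type UserStats struct {
	NumSeries  int
	IngestRate float64 // samples per second
}

// IngesterClient stands in for whatever RPC client the distributor holds.
type IngesterClient interface {
	UserStats(userID string) (UserStats, error)
}

// aggregateUserStats queries every ingester and sums the results. With a
// replication factor > 1 each series lives on several ingesters, so the
// naive sum over-counts and is divided back down.
func aggregateUserStats(userID string, ingesters []IngesterClient, replicationFactor int) (UserStats, error) {
	var total UserStats
	for _, ing := range ingesters {
		s, err := ing.UserStats(userID)
		if err != nil {
			return UserStats{}, fmt.Errorf("querying ingester: %v", err)
		}
		total.NumSeries += s.NumSeries
		total.IngestRate += s.IngestRate
	}
	total.NumSeries /= replicationFactor
	total.IngestRate /= float64(replicationFactor)
	return total, nil
}
```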
When I've shown our Prometheus functionality to people at conferences, they have asked about custom dashboards. Specifically, they've asked about:
The last case is from someone who is already running Prometheus & Grafana, but has their own proprietary data source that they use to add extra information to their graphs.
To help with debugging.
Should be exposed on ingesters and distributors.
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#1
This repo:
On service:
Other
To verify:
- git grep prism returns only unrelated vendored results
- git ls-files | grep prism returns empty
- git grep prism returns empty
- git ls-files | grep prism returns empty
- git grep prism returns empty
- git ls-files | grep prism returns empty

From @tomwilkie
We need this so that when ingesters go away, we don't lose their data.
Copied from original issue: tomwilkie/frankenstein#11
From @tomwilkie
To allow for non-metric-name queries, and improve load balancing in ingesters
Copied from original issue: tomwilkie/frankenstein#12
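Assuming the change here is to shard series by a hash of the full label set rather than only the metric name (an interpretation of the description above, not confirmed by the issue), a minimal sketch:

```go
package sharding

import (
	"hash/fnv"
	"sort"
)

// Label is an illustrative stand-in for a Prometheus label pair.
type Label struct{ Name, Value string }

// shardToken hashes the entire (sorted) label set rather than just
// __name__, so series of one hot metric spread across ingesters and
// placement doesn't depend on the metric name at all.
func shardToken(labels []Label) uint32 {
	sort.Slice(labels, func(i, j int) bool { return labels[i].Name < labels[j].Name })
	h := fnv.New32a()
	for _, l := range labels {
		h.Write([]byte(l.Name))
		h.Write([]byte{0}) // separator so ("ab","c") != ("a","bc")
		h.Write([]byte(l.Value))
		h.Write([]byte{0})
	}
	return h.Sum32()
}
```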
Currently, when identifying ingesters to write data to, we include those that are in the process of shutting down. These ingesters (correctly) refuse to accept new writes, as they are busy writing everything they have to dynamo. This causes writes to fail, which causes a loss of user data.
Instead, we should exclude ingesters that are shutting down when picking write targets, as sketched below.
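A minimal sketch of that filtering; IngesterDesc and State are illustrative, not the real ring types:

```go
package ring

// State and IngesterDesc are illustrative stand-ins for the real ring types.
type State int

const (
	Active State = iota
	Leaving // shutting down, flushing to the store; refuses new writes
)

type IngesterDesc struct {
	Addr  string
	State State
}

// writeTargets keeps only ingesters that will still accept new samples.
func writeTargets(all []IngesterDesc) []IngesterDesc {
	var out []IngesterDesc
	for _, ing := range all {
		if ing.State == Active {
			out = append(out, ing)
		}
	}
	return out
}
```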
time="2016-10-10T13:17:29Z" level=error msg="Failed to flush chunks for series: RequestError: send request failed\ncaused by: Put https://weaveworks-prod-chunks.s3.amazonaws.com/2/17142991289174007630%3A1476104540889%3A1476105230889: dial tcp: lookup weaveworks-prod-chunks.s3.amazonaws.com on 10.0.0.10:53: dial udp 10.0.0.10:53: connect: network is unreachable" source="cortex.go:440"
This might be a cause of data loss.
When we roll out new versions of the ingesters, they sometimes fail to correctly deregister themselves from consul, causing user-visible errors manifesting as 500s on /push requests.
Distributor logs will have messages like:
time="2016-09-12T14:12:22Z" level=error msg="error sending request: Post http://10.244.9.10:80/push: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" source="server.go:107"
And the retrieval queues will likely fill up.
You can confirm this is a problem by looking at the distributor logs and observing:
time="2016-09-12T15:00:31Z" level=info msg="Got update to ring - 6 ingesters, 512 tokens" source="ring.go:85"
Where 6 is higher than the number of actual ingester pods.
The workaround is to restart consul like so:
kubectl delete pod --namespace frankenstein -l name=consul
After a minute or so, the errors in the distributor log should clear.
On Weave Cloud dev & prod, getting:
Error executing query: too few successful reads, last error was: <nil>
when executing queries.
I haven't had a chance to investigate, but my guess is that deployments no longer visibly break the cluster (per #19) but they do silently break it, partly because we're not yet alerting on 500s for cortex at weaveworks (cf. weaveworks/service#904).
This is probably something that will be fixed by clean shut-downs (which, at weaveworks, are blocked on weaveworks/service-conf#111), but I wanted to track something symptom-oriented so I could have something to search for, and also make notes on other things I found along the way.
Debugging this would be made easier by addressing #57, #58, #59, and #60.
We are getting a fairly constant stream of 500s from Dynamo when we attempt to write to it (~3-4qps on dev).
We should understand what's causing this & fix it. https://github.com/weaveworks/monitoring/issues/12 will help.
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#7
Even just by accident, it's really easy for a user to overload or hotspot an ingester right now, by either sending too many series with the same metric name, or just too many time series in general (especially when accidentally misusing a label with unbounded value cardinality).
We should have some way of limiting the user's number of series.
Related to #58
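A minimal sketch of what such a limit could look like, assuming an in-memory count of active series per user (all names here are illustrative):

```go
package limits

import (
	"errors"
	"sync"
)

var errTooManySeries = errors.New("per-user series limit exceeded")

// seriesLimiter tracks active series per user; the in-memory map and the
// flat limit are illustrative simplifications.
type seriesLimiter struct {
	mtx       sync.Mutex
	perUser   map[string]int
	maxSeries int
}

func newSeriesLimiter(maxSeries int) *seriesLimiter {
	return &seriesLimiter{perUser: map[string]int{}, maxSeries: maxSeries}
}

// addSeries records one new series for userID, rejecting it if the user is
// already at the limit. Samples for existing series are unaffected.
func (l *seriesLimiter) addSeries(userID string) error {
	l.mtx.Lock()
	defer l.mtx.Unlock()
	if l.perUser[userID] >= l.maxSeries {
		return errTooManySeries
	}
	l.perUser[userID]++
	return nil
}
```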
The new metrics defined to track replicated queries are not being exported at all.
Some are prism, others prometheus. I don't care much which we pick as long as we're consistent.
A couple of times today, we've got:
time="2016-10-25T16:47:27Z" level=info msg="POST /push (500) 420.17µs"
time="2016-10-25T16:47:29Z" level=error msg="append err: sample timestamp out of order" source="server.go:125"
These represent invalid input from the client and should be reported as 400s rather than 500s.
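A sketch of mapping known client errors onto 400s; the sentinel error and handler shape are assumptions rather than the actual code:

```go
package api

import (
	"errors"
	"net/http"
)

// errOutOfOrderSample is a stand-in; the real append path returns its own
// error values for out-of-order and duplicate samples.
var errOutOfOrderSample = errors.New("sample timestamp out of order")

// appendSamples is a stub representing the real ingestion path.
func appendSamples(r *http.Request) error { return nil }

func pushHandler(w http.ResponseWriter, r *http.Request) {
	switch err := appendSamples(r); {
	case err == nil:
		w.WriteHeader(http.StatusOK)
	case errors.Is(err, errOutOfOrderSample):
		// Invalid input from the client, not a server failure: 400.
		http.Error(w, err.Error(), http.StatusBadRequest)
	default:
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}
```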
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#2
The vendored prometheus vendors a bunch of stuff, and it shouldn't.
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#5
As part of understanding & fixing #61 and making sure it doesn't happen again, we want to gather data on:
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#6
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#16
time="2016-09-20T20:50:48Z" level=info msg="Flushing 1 chunks" source="frankenstein.go:468"
panic: runtime error: slice bounds out of range
goroutine 98074 [running]:
panic(0xd68a20, 0xc820014030)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushSeries(0xc820114540, 0x7fc41d6d9c90, 0xc822b09050, 0xc820256c60, 0x9842fa55f203a838, 0xc820321420, 0xc820641e00, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:476 +0x5f8
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushAllSeries.func1(0xc820114540, 0x7fc41d6d9c90, 0xc822b09050, 0xc820256c60, 0xc821466860, 0x0, 0xc821466850)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:439 +0x84
created by github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushAllSeries
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:444 +0x181
From @tomwilkie
Currently 10mins, should be 1hr.
Copied from original issue: tomwilkie/frankenstein#10
A couple of times over the last few days I've been pinged by people who have been confused by the cortex retrieval example in the sockshop, which hardcodes a dummy, broken value for the Weave Cloud access token.
It might be nice, as part of Cortex or something else, to provide an image that takes the access token as a command-line argument or environment variable, and a set of Kubernetes configurations that use that image rather than prometheus.
Pro:
Con:
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#8
Bugs around that logic have bitten us a couple of times now, so it'd be good to have at least some basic tests.
time="2016-09-22T09:41:54Z" level=error msg="Error CASing collectors/ring: Unexpected response code: 413 (Value exceeds 524288 byte limit)" source="consul_client.go:146"
time="2016-09-22T09:41:54Z" level=fatal msg="Failed to pick tokens in consul: failed to CAS collectors/ring" source="ingester_lifecycle.go:113"
From @tomwilkie
Could probably also be upstreamed.
Copied from original issue: tomwilkie/frankenstein#4
When I've shown our hosted Prometheus stuff to people at conferences, they almost immediately ask about alerting, and being able to define custom alerts. We should implement that.
TODO before KubeCon:
TODO later:
When we attempt to write to dynamo, we will sometimes get errors. Currently, we abort on error.
This is almost never the right behaviour, because failing to write to dynamo (esp. during shut down) means loss of user data. Instead, we should retry.
Note that one major source of errors is being throttled by Dynamo. In this case, we should also retry, but we should be especially careful that we back off, lest we cause a cascading failure.
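A minimal sketch of that retry-with-capped-backoff behaviour (retry count, base, and cap are illustrative):

```go
package backoff

import (
	"context"
	"time"
)

// writeWithBackoff retries a failing write with capped exponential backoff,
// so throttling doesn't become data loss or a retry storm.
func writeWithBackoff(ctx context.Context, write func() error) error {
	const (
		maxRetries = 10
		maxBackoff = 10 * time.Second
	)
	backoff := 100 * time.Millisecond
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = write(); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return err
}
```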
We sometimes get a bunch of messages like these:
time="2016-10-18T16:16:52Z" level=error msg="Failed to flush chunks for series: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.\n\tstatus code: 400, request id: REDACTED" source="ingester.go:442"
Can we handle this case better?
9m log entries in 8hrs, which is about 40% of the cluster's output (another 40% being kube-dns, which is now fixed).
66% of the log lines from the ingester are "Flushing chunks".
From @tomwilkie
Should reduce index size, super important for scalability.
Copied from original issue: tomwilkie/frankenstein#9
From @tomwilkie
Copied from original issue: tomwilkie/frankenstein#15
From @tomwilkie
And get a handle (ie way to measure) on the chunk backlog.
Copied from original issue: tomwilkie/frankenstein#13
When debugging the distributor, ring state is pretty important. We have some metrics like prometheus_distributor_ingesters_total, prometheus_distributor_ingester_clients, and prometheus_distributor_ingester_ownership_percent, but these don't distinguish between ingesters that have had a recent heartbeat and those that haven't.
It's possible that the "fix" for this is to just clean up old ingesters properly, but I think even then I'd like either:
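One of those options might look roughly like the following: a gauge partitioned by heartbeat health. The metric name, label, and ring type here are assumptions for the sketch:

```go
package ringmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ingesterDesc is an illustrative ring entry.
type ingesterDesc struct{ LastHeartbeat time.Time }

// The metric name is hypothetical, chosen only for this sketch.
var ringIngesters = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "distributor_ring_ingesters",
	Help: "Ingesters in the ring, partitioned by heartbeat health.",
}, []string{"healthy"})

func init() { prometheus.MustRegister(ringIngesters) }

// updateRingMetrics splits the ring into live and stale entries, so a
// dashboard can spot stale entries without cross-referencing logs.
func updateRingMetrics(ring []ingesterDesc, heartbeatTimeout time.Duration) {
	healthy, stale := 0, 0
	for _, ing := range ring {
		if time.Since(ing.LastHeartbeat) < heartbeatTimeout {
			healthy++
		} else {
			stale++
		}
	}
	ringIngesters.WithLabelValues("true").Set(float64(healthy))
	ringIngesters.WithLabelValues("false").Set(float64(stale))
}
```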
panic: inconsistent label cardinality
goroutine 512 [running]:
panic(0xd22c00, 0xc8200156d0)
/usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).WithLabelValues(0xc82026c300, 0xc820905dd8, 0x2, 0x2, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus/vec.go:135 +0xae
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus.(*HistogramVec).WithLabelValues(0xc820156248, 0xc820905dd8, 0x2, 0x2, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus/histogram.go:337 +0x55
github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument.TimeRequestHistogramStatus(0xeac790, 0x4, 0xc820156248, 0x1062dc8, 0xc820905e80, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument/instrument.go:39 +0x1f7
github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument.TimeRequestHistogram(0xeac790, 0x4, 0xc820156248, 0xc820905e80, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument/instrument.go:24 +0x57
github.com/weaveworks/frankenstein.(*Distributor).sendSamples(0xc82027e2a0, 0x7f06e68a6e50, 0xc8201eb7a0, 0xc820236ad0, 0xe, 0xc8201a6800, 0x1a, 0x20, 0x0, 0x0)
/go/src/github.com/weaveworks/frankenstein/distributor.go:179 +0x12b
github.com/weaveworks/frankenstein.(*Distributor).Append.func1(0xc82006e5a0, 0xc82027e2a0, 0x7f06e68a6e50, 0xc8201eb7a0, 0xc820236ad0, 0xe, 0xc8201a6800, 0x1a, 0x20)
/go/src/github.com/weaveworks/frankenstein/distributor.go:160 +0x7d
created by github.com/weaveworks/frankenstein.(*Distributor).Append
/go/src/github.com/weaveworks/frankenstein/distributor.go:161 +0x4c3
We got some great feedback from a new user, @bboreham, who said:
The Prometheus Getting Started page tells me to write a prometheus.yml file which is essentially identical to the one that came in the distribution (where "distribution" is the tarball provided at https://prometheus.io/download/)
and
The Configuration page linked to in the cortex README is too complicated; not suitable for "getting started".
(I've paraphrased for context, adding links as appropriate).
Given that we hope to provide something useful to people without deep Prometheus expertise, we should think about how we can provide a smoother initial experience for those who are unfamiliar with it.
One possible solution would be to provide a container with some preset configuration.
Tracking ticket for stuff that didn't get a proper review:
We should only write to the minimum required.
This should improve tail latencies & reduce load on the system.
See https://github.com/weaveworks/cortex/pull/80/files#r85902648
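If "the minimum required" refers to a write quorum (an assumption; the linked review thread has the real context), one common shape is to send to all replicas but return once a quorum has acknowledged:

```go
package replication

import "errors"

// replica stands in for a client to one ingester or storage replica.
type replica interface {
	Push(samples []byte) error
}

// quorumWrite sends to every replica concurrently but returns as soon as
// minSuccess of them acknowledge, so one slow replica no longer sets the
// tail latency. It fails fast once a quorum has become impossible.
func quorumWrite(replicas []replica, samples []byte, minSuccess int) error {
	results := make(chan error, len(replicas))
	for _, r := range replicas {
		go func(r replica) { results <- r.Push(samples) }(r)
	}
	succeeded, failed := 0, 0
	for range replicas {
		if err := <-results; err == nil {
			if succeeded++; succeeded >= minSuccess {
				return nil
			}
		} else {
			if failed++; failed > len(replicas)-minSuccess {
				return errors.New("too few successful writes")
			}
		}
	}
	return errors.New("too few successful writes")
}
```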
If we merge samples from many chunks for the same series, this could get slow.
Related to #9
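For concreteness, this is the kind of merge in question (illustrative types); the repeated scan over every chunk head is what makes it slow with many chunks per series:

```go
package merge

// sample is an illustrative timestamp/value pair.
type sample struct {
	TimestampMs int64
	Value       float64
}

// mergeSamples merges already-sorted per-chunk sample slices by repeatedly
// scanning every chunk head for the smallest timestamp, dropping duplicate
// timestamps. The scan makes it O(chunks x samples).
func mergeSamples(chunks [][]sample) []sample {
	var out []sample
	for {
		best, bestIdx := int64(1)<<62, -1
		for i, c := range chunks {
			if len(c) > 0 && c[0].TimestampMs < best {
				best, bestIdx = c[0].TimestampMs, i
			}
		}
		if bestIdx < 0 {
			return out
		}
		s := chunks[bestIdx][0]
		chunks[bestIdx] = chunks[bestIdx][1:]
		if len(out) == 0 || out[len(out)-1].TimestampMs != s.TimestampMs {
			out = append(out, s)
		}
	}
}
```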
Right now, the only way an ingester is ever removed from the ring is if it shuts down gracefully. In any other circumstance (crash, network partition, asteroid strike) nothing will ever invalidate the old entries, until such a time as an ingester with the same hostname reconnects. This in turn means a large number of failing operations, requiring operator intervention.
This issue is less pronounced in most environments, where a crashed ingester typically will be swiftly restarted with an identical hostname. It could easily occur if an error occurred during, say, a rolling upgrade in k8s (new pod = new hostname). Note however that I haven't seen it in the wild. Yet.
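A sketch of one possible remedy, expiring ring entries whose heartbeat has gone stale; the types, the timeout, and the CAS note are assumptions about how this would slot in:

```go
package ringcleanup

import "time"

// entry is an illustrative ring member record.
type entry struct {
	Hostname      string
	LastHeartbeat time.Time
}

// pruneStale drops entries whose heartbeat is older than timeout, so a
// crashed ingester eventually disappears without operator intervention.
// In the real ring this would need to run inside the same CAS loop that
// updates Consul, so concurrent updates aren't lost.
func pruneStale(ring []entry, timeout time.Duration) []entry {
	live := ring[:0]
	for _, e := range ring {
		if time.Since(e.LastHeartbeat) < timeout {
			live = append(live, e)
		}
	}
	return live
}
```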