cortex's Issues

Display per-user metrics metadata

To give users some initial feedback, it'd be great for Cortex to show some per-user stats in the user's Cortex dashboard:

  • number of metrics
  • number of time series
  • samples/s

This will require adding backend support for getting those stats from the ingesters. Probably the distributor will have to query all ingesters for their current metadata for a given user and then aggregate those stats for the UI.
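
A minimal sketch of the aggregation step in the distributor, assuming a hypothetical UserStats struct and ingester-client interface (the real Cortex types will differ):

// Hypothetical per-user stats aggregation in the distributor.
// UserStats and IngesterStatsClient are assumptions, not actual Cortex types.
package distributor

import "context"

// UserStats is the per-user metadata we want to surface in the UI.
type UserStats struct {
    NumMetrics    uint64
    NumSeries     uint64
    IngestionRate float64 // samples/s
}

// IngesterStatsClient is a stand-in for whatever RPC client the
// distributor uses to talk to an ingester.
type IngesterStatsClient interface {
    UserStats(ctx context.Context, userID string) (UserStats, error)
}

// AggregateUserStats fans out to every ingester and sums the results.
// With replication, the sums over-count by roughly the replication
// factor, so a real implementation would divide by it.
func AggregateUserStats(ctx context.Context, userID string, ingesters []IngesterStatsClient) (UserStats, error) {
    var total UserStats
    for _, ing := range ingesters {
        stats, err := ing.UserStats(ctx, userID)
        if err != nil {
            return UserStats{}, err
        }
        total.NumMetrics += stats.NumMetrics
        total.NumSeries += stats.NumSeries
        total.IngestionRate += stats.IngestionRate
    }
    return total, nil
}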

Custom dashboards

When I've shown our Prometheus functionality to people at conferences, they have asked about custom dashboards. Specifically, they've asked about:

  • using grafana
  • being able to provide their own data sources for grafana

The second request is from someone who is already running Prometheus & Grafana, but has their own proprietary data source that they use to add extra information to their graphs.

Rename to Cortex

This repo:

  • Rename the GitHub repository
  • Update code to use "cortex" package, not "prism"
  • Update code to use "cortex", not "prism"
  • Revise README to ensure naming makes sense
  • Create new repository on Docker Hub
  • Update CI to work with new names

On service:

  • Update deployment config for new image names (weaveworks/service-conf#274)
  • Roll out new image names
    • to dev
    • to prod
  • Update the feature flag
    • UI code
    • database (probably need 2 updates to UI code for smooth migration)
  • Update the production dashboards
  • Update playbook

Other

  • rename slack channel
  • rename meeting

To verify:

  • this repo
    • git grep prism returns only unrelated vendored results
    • git ls-files | grep prism returns empty
  • service
    • git grep prism returns empty
    • git ls-files | grep prism returns empty
  • service-conf
    • git grep prism returns empty
    • git ls-files | grep prism returns empty

Rename to prism

  • Rename the GitHub repository
  • Update code to use "prism" package, not "frankenstein"
  • Update prometheus branch to use "prism" package
  • Update code to use "prism", not "frank"
  • Revise README to ensure naming makes sense
  • Create new repository on Docker Hub
  • Update CI to work with new names
  • Update deployment config for new image names
  • Roll out new image names
    • to dev
    • to prod
  • Update prism to depend on prometheus branch w/ prism rename
  • Update the feature flag
    • UI code
    • database (probably need 2 updates to UI code for smooth migration)
  • Update the production dashboards
  • Update playbook

Don't try to write to "leaving" ingesters

Currently, when identifying ingesters to write data to, we include those that are in the process of shutting down. These ingesters (correctly) refuse to accept new writes, as they are busy writing everything they have to dynamo. This causes writes to fail, which causes a loss of user data.

Instead, we should (a rough sketch of the selection logic follows the list):

  • mark "leaving" ingesters as such
  • skip over them when writing, not counting them toward the replica total
  • include them when querying, counting them toward the replica total
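
A sketch of that selection logic, with hypothetical ring types; the real IngesterDesc/state handling in Cortex will differ:

// Hypothetical sketch of picking ingesters for an operation based on
// their state in the ring.
package ring

type State int

const (
    Active State = iota
    Leaving
)

type IngesterDesc struct {
    Hostname string
    State    State
}

type Op int

const (
    Write Op = iota
    Read
)

// filter returns the ingesters that should take part in op.
// Leaving ingesters are skipped for writes (they refuse new samples
// while flushing) but still consulted for reads, since they may hold
// data that has not reached the chunk store yet.
func filter(ingesters []IngesterDesc, op Op) []IngesterDesc {
    var out []IngesterDesc
    for _, ing := range ingesters {
        if op == Write && ing.State == Leaving {
            continue
        }
        out = append(out, ing)
    }
    return out
}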

Ingesters get network errors when writing dynamo data during shutdown

time="2016-10-10T13:17:29Z" level=error msg="Failed to flush chunks for series: RequestError: send request failed\ncaused by: Put https://weaveworks-prod-chunks.s3.amazonaws.com/2/17142991289174007630%3A1476104540889%3A1476105230889: dial tcp: lookup weaveworks-prod-chunks.s3.amazonaws.com on 10.0.0.10:53: dial udp 10.0.0.10:53: connect: network is unreachable" source="cortex.go:440"

This might be a cause of data loss.

Roll out sometimes breaks deployment

When we roll out new versions of the ingesters, they sometimes fail to correctly deregister themselves from consul, causing user-visible errors, manifesting as 500s on /push requests.

Distributor logs will have messages like:

time="2016-09-12T14:12:22Z" level=error msg="error sending request: Post http://10.244.9.10:80/push: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" source="server.go:107" 

And the retrieval queues will likely fill up.

You can confirm this is a problem by looking at the distributor logs and observing:

time="2016-09-12T15:00:31Z" level=info msg="Got update to ring - 6 ingesters, 512 tokens" source="ring.go:85"

Here, 6 is higher than the actual number of ingester pods.

The workaround is to restart consul like so:

kubectl delete pod --namespace frankenstein -l name=consul

After a minute or so, the errors in the distributor log should clear.

Frequently getting errors when querying cortex

On Weave Cloud dev & prod, getting:

Error executing query: too few successful reads, last error was: <nil>

when executing queries.

I haven't had a chance to investigate, but my guess is that deployments no longer visibly break the cluster (per #19) but they do silently break it, partly because we're not yet alerting on 500s for cortex at weaveworks (c.f. weaveworks/service#904).

This is probably something that will be fixed by clean shut-downs (which, at weaveworks, are blocked on weaveworks/service-conf#111), but I wanted to track something symptom-oriented so I could have something to search for and also make notes on other things I found along the way.

Debugging this would be made easier by addressing #57, #58, #59, and #60.

Limit per-user metric cardinality

It's really easy right now for a user to overload or hotspot an ingester, even just by accident: by sending too many series with the same metric name, or just too many time series in general (especially when misusing a label with unbounded value cardinality).

We should have some way of limiting the user's number of series.
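
A minimal sketch of such a check in the ingester, with hypothetical names rather than the actual Cortex implementation:

// Per-user series limit: refuse to create new series once a user is at
// their limit; existing series keep accepting samples.
package ingester

import (
    "errors"
    "sync"
)

var errTooManySeries = errors.New("per-user series limit exceeded")

type userState struct {
    mtx       sync.Mutex
    series    map[string]struct{} // keyed by series fingerprint/labels
    maxSeries int
}

func newUserState(maxSeries int) *userState {
    return &userState{
        series:    map[string]struct{}{},
        maxSeries: maxSeries,
    }
}

func (u *userState) getOrCreateSeries(key string) error {
    u.mtx.Lock()
    defer u.mtx.Unlock()
    if _, ok := u.series[key]; ok {
        return nil
    }
    if len(u.series) >= u.maxSeries {
        return errTooManySeries
    }
    u.series[key] = struct{}{}
    return nil
}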

Write order errors should be 400, not 500

A couple of times today, we've seen:

time="2016-10-25T16:47:27Z" level=info msg="POST /push (500) 420.17µs"
time="2016-10-25T16:47:29Z" level=error msg="append err: sample timestamp out of order" source="server.go:125"

These represent invalid input from the client and should be returned as 400s, rather than 500s.
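
A minimal sketch of the status-code mapping, keying off the error message shown above; the real Prometheus/Cortex error types may allow a cleaner check than string matching:

// httpStatusFor picks a status code for an append error: client-side
// problems (out-of-order samples) become 400s, everything else stays a 500.
package ingester

import (
    "net/http"
    "strings"
)

func httpStatusFor(err error) int {
    if err == nil {
        return http.StatusOK
    }
    if strings.Contains(err.Error(), "out of order") {
        return http.StatusBadRequest
    }
    return http.StatusInternalServerError
}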

Better monitoring for chunk state

As part of understanding & fixing #61 and making sure it doesn't happen again, we want to gather data on (one possible set of metrics is sketched after the list):

  • how big the backlog of chunk flushes is
  • (maybe) how many chunks there are in the system
  • how many chunks we're dropping
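
A minimal sketch using the Prometheus Go client; metric names are suggestions, not necessarily what Cortex will end up exporting:

// Chunk-state metrics for the ingester.
package ingester

import "github.com/prometheus/client_golang/prometheus"

var (
    flushQueueLength = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "ingester_flush_queue_length",
        Help: "Number of chunks queued for flushing.",
    })
    memoryChunks = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "ingester_memory_chunks",
        Help: "Number of chunks currently held in memory.",
    })
    droppedChunks = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "ingester_dropped_chunks_total",
        Help: "Total number of chunks dropped without being flushed.",
    })
)

func init() {
    prometheus.MustRegister(flushQueueLength, memoryChunks, droppedChunks)
}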

Ingesters crashlooping

time="2016-09-20T20:50:48Z" level=info msg="Flushing 1 chunks" source="frankenstein.go:468" 
panic: runtime error: slice bounds out of range

goroutine 98074 [running]:
panic(0xd68a20, 0xc820014030)
    /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushSeries(0xc820114540, 0x7fc41d6d9c90, 0xc822b09050, 0xc820256c60, 0x9842fa55f203a838, 0xc820321420, 0xc820641e00, 0x0, 0x0)
    /go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:476 +0x5f8
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushAllSeries.func1(0xc820114540, 0x7fc41d6d9c90, 0xc822b09050, 0xc820256c60, 0xc821466860, 0x0, 0xc821466850)
    /go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:439 +0x84
created by github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local.(*Ingester).flushAllSeries
    /go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/prometheus/storage/local/frankenstein.go:444 +0x181

Retriever config for Kubernetes that doesn't hardcode secret

A couple of times over the last few days I've been pinged by people who have been confused by the cortex retrieval example in the sockshop, which hardcodes a dummy, broken value for the Weave Cloud access token.

It might be nice, as part of Cortex or something else, to provide an image that takes the access token as a command-line argument or environment variable, and a set of Kubernetes configurations that use that image rather than plain Prometheus.

Pro:

  • easier onboarding for newcomers to prometheus

Con:

  • masks the fact that retrievers really are vanilla prometheus
  • maybe makes our messaging more confusing for existing prometheus users

Ingesters Crashlooping on consul value size error

time="2016-09-22T09:41:54Z" level=error msg="Error CASing collectors/ring: Unexpected response code: 413 (Value exceeds 524288 byte limit)" source="consul_client.go:146"
time="2016-09-22T09:41:54Z" level=fatal msg="Failed to pick tokens in consul: failed to CAS collectors/ring" source="ingester_lifecycle.go:113"

Alerting

When I've shown our hosted Prometheus stuff to people at conferences, they almost immediately ask about alerting, and being able to define custom alerts. We should implement that.

Design doc: https://docs.google.com/a/weave.works/document/d/1ds8-9s3jj-m4r0ZXTBrbRfyBhRaquwI-oUxiuc3CNvc/edit?usp=drive_web

TODO before KubeCon:

  • Single tenant ruler -> alertmanager setup
  • Add user ID header to rule -> alertmanager RPCs
  • Make alert manager understand user ID - make it multitenant
  • Store per tenant alert manager config in config service (see https://github.com/weaveworks/service-ui/issues/301) (davkal)
  • backend validation features for AM config editing UI (see https://github.com/weaveworks/service-ui/issues/301) (julius, jml)
  • Actually try out HA mode (jml)
  • Better metrics (merge / register the metrics from multiple AM user instances) (julius)
  • Ruler notifier actually sending multitenant requests

TODO later:

  • Figure out Scale Up story for alert manager
  • Notification templating support
  • Persist silences & notification logs better
  • Figure out how to show the user Alertmanager-related errors (like failing to send a notification to Slack, etc.)
  • Do synchronous config applications (AM, rules) and show errors to users right away?

Retry on errors from Dynamo

When we attempt to write to dynamo, we will sometimes get errors. Currently, we abort on error.

This is almost never the right behaviour, because failing to write to dynamo (esp. during shut down) means loss of user data. Instead, we should retry.

Note that one major source of errors is being throttled by Dynamo. In this case, we should also retry, but we should be especially careful that we back off, lest we cause a cascading failure.
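
An illustrative retry loop with exponential backoff; writeChunks and the tuning constants are placeholders, not the actual Cortex flush code:

// flushWithRetries keeps retrying a failed write, doubling the backoff
// each time so that a throttled DynamoDB table isn't hammered harder.
package ingester

import (
    "context"
    "time"
)

const (
    maxRetries     = 5
    initialBackoff = 100 * time.Millisecond
    maxBackoff     = 5 * time.Second
)

func flushWithRetries(ctx context.Context, writeChunks func(context.Context) error) error {
    backoff := initialBackoff
    var err error
    for i := 0; i < maxRetries; i++ {
        if err = writeChunks(ctx); err == nil {
            return nil
        }
        select {
        case <-time.After(backoff):
        case <-ctx.Done():
            return ctx.Err()
        }
        if backoff *= 2; backoff > maxBackoff {
            backoff = maxBackoff
        }
    }
    return err
}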

Handle provisioning failures better?

We sometimes get a bunch of messages like these:

time="2016-10-18T16:16:52Z" level=error msg="Failed to flush chunks for series: ProvisionedThroughputExceededException: The level of configured provisioned throughput for the table was exceeded. Consider increasing your provisioning level with the UpdateTable API.\n\tstatus code: 400, request id: REDACTED" source="ingester.go:442"

Can we handle this case better? (A rough sketch of the first two ideas follows the list.)

  • back off?
  • instrument this with metrics?
  • increase our provisioning level?
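
A sketch of recognising and counting throttling errors so the caller can back off, using the awserr package from aws-sdk-go; the metric name is illustrative:

// isThrottlingError reports whether err is DynamoDB telling us to slow
// down, in which case the caller should back off and retry rather than
// drop the chunks.
package chunkstore

import (
    "github.com/aws/aws-sdk-go/aws/awserr"
    "github.com/prometheus/client_golang/prometheus"
)

var dynamoThrottled = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "dynamo_throttled_writes_total",
    Help: "Total number of DynamoDB writes rejected for exceeding provisioned throughput.",
})

func init() {
    prometheus.MustRegister(dynamoThrottled)
}

func isThrottlingError(err error) bool {
    if awsErr, ok := err.(awserr.Error); ok {
        if awsErr.Code() == "ProvisionedThroughputExceededException" {
            dynamoThrottled.Inc()
            return true
        }
    }
    return false
}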

Ingesters log too much

9m log entries in 8hrs, which is about 40% of the cluster's log output (another 40% being kube-dns, which is now fixed).

66% of the log lines from the ingester are "Flushing chunk".
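
Most of that volume would go away if the per-flush message were logged at debug level. A minimal sketch, assuming the ingester uses logrus (as the log format suggests) and that the log level is set from a flag; names are illustrative:

// Demote the per-flush message to debug so it disappears at the default
// info level, removing the bulk of the ingester's log volume.
package ingester

import (
    "flag"

    log "github.com/sirupsen/logrus"
)

var logLevel = flag.String("log.level", "info", "debug, info, warn, error")

func setupLogging() {
    // Assumes flag.Parse() has already been called.
    level, err := log.ParseLevel(*logLevel)
    if err != nil {
        level = log.InfoLevel
    }
    log.SetLevel(level)
}

func logFlush(numChunks int) {
    log.Debugf("Flushing %d chunks", numChunks)
}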

Insufficient visibility into distributor ring state

When debugging the distributor, ring state is pretty important. We have some metrics like prometheus_distributor_ingesters_total, prometheus_distributor_ingester_clients, and prometheus_distributor_ingester_ownership_percent, but these don't distinguish between ingesters that have had a recent heartbeat and those that haven't.

It's possible that the "fix" for this is to just clean up old ingesters properly, but I think even then I'd like either (a rough sketch of the first option follows the list):

  • an extra dimension to the vectors indicating ingester 'liveness' or state or both
  • an admin page that presents the current state of the ring with some relevant timestamps and extra stuff
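
A sketch of the extra dimension, assuming the distributor tracks last-heartbeat times for ring entries; the metric and state names are illustrative:

// A GaugeVec keyed by ingester state, refreshed whenever the
// distributor sees a ring update.
package distributor

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var ingestersByState = prometheus.NewGaugeVec(prometheus.GaugeOpts{
    Name: "prometheus_distributor_ingesters",
    Help: "Number of ingesters in the ring, by liveness state.",
}, []string{"state"})

func init() {
    prometheus.MustRegister(ingestersByState)
}

// updateRingMetrics classifies ingesters by how recently they
// heartbeated and exports the counts.
func updateRingMetrics(lastHeartbeats []time.Time, heartbeatTimeout time.Duration) {
    healthy, stale := 0, 0
    for _, t := range lastHeartbeats {
        if time.Since(t) < heartbeatTimeout {
            healthy++
        } else {
            stale++
        }
    }
    ingestersByState.WithLabelValues("healthy").Set(float64(healthy))
    ingestersByState.WithLabelValues("stale").Set(float64(stale))
}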

Distributor is crashing

panic: inconsistent label cardinality

goroutine 512 [running]:
panic(0xd22c00, 0xc8200156d0)
    /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).WithLabelValues(0xc82026c300, 0xc820905dd8, 0x2, 0x2, 0x0, 0x0)
    /go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus/vec.go:135 +0xae
github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus.(*HistogramVec).WithLabelValues(0xc820156248, 0xc820905dd8, 0x2, 0x2, 0x0, 0x0)
    /go/src/github.com/weaveworks/frankenstein/vendor/github.com/prometheus/client_golang/prometheus/histogram.go:337 +0x55
github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument.TimeRequestHistogramStatus(0xeac790, 0x4, 0xc820156248, 0x1062dc8, 0xc820905e80, 0x0, 0x0)
    /go/src/github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument/instrument.go:39 +0x1f7
github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument.TimeRequestHistogram(0xeac790, 0x4, 0xc820156248, 0xc820905e80, 0x0, 0x0)
    /go/src/github.com/weaveworks/frankenstein/vendor/github.com/weaveworks/scope/common/instrument/instrument.go:24 +0x57
github.com/weaveworks/frankenstein.(*Distributor).sendSamples(0xc82027e2a0, 0x7f06e68a6e50, 0xc8201eb7a0, 0xc820236ad0, 0xe, 0xc8201a6800, 0x1a, 0x20, 0x0, 0x0)
    /go/src/github.com/weaveworks/frankenstein/distributor.go:179 +0x12b
github.com/weaveworks/frankenstein.(*Distributor).Append.func1(0xc82006e5a0, 0xc82027e2a0, 0x7f06e68a6e50, 0xc8201eb7a0, 0xc820236ad0, 0xe, 0xc8201a6800, 0x1a, 0x20)
    /go/src/github.com/weaveworks/frankenstein/distributor.go:160 +0x7d
created by github.com/weaveworks/frankenstein.(*Distributor).Append
    /go/src/github.com/weaveworks/frankenstein/distributor.go:161 +0x4c3

Getting started requires non-trivial Prometheus knowledge

We got some great feedback from a new user, @bboreham, who said:

The Prometheus Getting Started page tells me to write a prometheus.yml file which is essentially identical to the one that came in the distribution (where "distribution" is the tarball provided at https://prometheus.io/download/)

and

The Configuration page linked to in the cortex README is too complicated; not suitable for "getting started".

(I've paraphrased for context, adding links as appropriate).

Given that we hope to provide something useful to people without deep Prometheus expertise, we should think about how we can provide a smoother initial experience for people who are unfamiliar with it.

One possible solution would be to provide a container with some preset configuration.

Post merge reviews

Tracking ticket for stuff that didn't get a proper review:

Day 1

  • #96
  • #93
  • #98
  • #99
  • weaveworks/monitoring#26
  • weaveworks/monitoring#27
  • weaveworks/service-conf#322
  • #102
  • #103
  • weaveworks/monitoring#29
  • #104
  • weaveworks/monitoring#31
  • #106
  • #107

Day 2

  • #117 — Use age of oldest chunk when deciding to flush
  • #118 — Make max chunk size
  • #119 — Remove chunk store parallelisation; add backoff
  • #116 — Parameterise the number of concurrent flushers

Day 3

More code organisational cleanups

  • rebase tomwilkie/prometheus:frankenstein against master, and make just a small set of commits for the ingester
  • remove use of old generic API datatypes for read path
  • copy ingester code into frankenstein and vendor vanilla prometheus (when chunk API is public)

Ingesters never disappear from ring if shut down ungracefully

Related to #9

Right now, the only way an ingester is ever removed from the ring is if it shuts down gracefully. In any other circumstance (crash, network partition, asteroid strike) nothing will ever invalidate the old entries, until such a time as an ingester with the same hostname reconnects. This in turn means a large number of failing operations, requiring operator intervention.

This issue is less pronounced in most environments, where a crashed ingester will typically be restarted quickly with an identical hostname. It could easily occur, though, if something went wrong during, say, a rolling upgrade in k8s (new pod = new hostname). Note however that I haven't seen it in the wild. Yet.
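
One possible fix is heartbeat-based pruning of ring entries; a hypothetical sketch follows (types and field names are illustrative, and a real implementation would run inside the consul CAS loop so concurrent updates are not lost):

// pruneStale removes entries that have not heartbeated within timeout,
// so ungracefully killed ingesters eventually disappear without
// operator intervention.
package ring

import "time"

type entry struct {
    Hostname      string
    LastHeartbeat time.Time
}

func pruneStale(entries []entry, timeout time.Duration, now time.Time) []entry {
    kept := entries[:0]
    for _, e := range entries {
        if now.Sub(e.LastHeartbeat) <= timeout {
            kept = append(kept, e)
        }
    }
    return kept
}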
