Cortex Ops

This repository was created to build a community around operating Cortex. We encourage you to share your learnings, improvement proposals, challenges and failures so that we can learn from each other and ultimately produce a solid guide to operating Cortex.

In this repository we seek to provide:

  • A production-ready setup (created for GKE + BigTable) which should suit most companies' needs in terms of scale and performance
  • Operational knowledge (FAQs, clarifying the microservice architecture)
  • Premade monitoring solutions (Grafana dashboards)

Most of the information provided here is meant to serve as inspiration for your own deployment. Given the complexity and the large variety of possible configurations, it is very likely that this setup will not suit all of your needs.

Note: This setup stores all chunks and indexes in Google's BigTable, without additional bucket storage.

Kubernetes deployment

All deployment files and the corresponding documentation can be found in ./kubernetes. The deployment has been tested on GKE and is built to use BigTable as the underlying time series database.

As of now the deployment manifests cover the following components:

  • Jaeger (Agent DaemonSet, Collector Deployment, UI Deployment)
  • Prometheus, which is responsible for sending metrics from tenants to your Cortex cluster using the remote write API
  • Elasticsearch (using Elastic's Kubernetes Operator)
  • Cortex, which includes:
    • Distributor
    • Gateway - Auth Gateway (not part of Cortex)
    • Ingester
    • Memcached
    • Prometheus (exclusively used to monitor all Cortex components - just in case Cortex is down)
    • Querier
    • Query Frontend
    • Table Manager

Deployments which are not covered (yet):

  • Consul (required for Cortex components)
  • Cortex components required for Alertmanager (ruler, configs API, Postgres DB)
  • Other underlying time series databases for Cortex (such as Cassandra or DynamoDB)
  • Grafana deployment with provisioned datasources querying Cortex

We can recommend Elastic's Kubernetes Operator for creating Elasticsearch clusters.

Grafana Monitoring

To be created.

Ops Guides

To be created.

Tenant Authentication with Cortex Gateway

In order to store and query tenants' metrics separately from each other, Cortex requires a user ID, which must be set in the X-Scope-OrgID header. Since you cannot provide custom headers in the Prometheus Remote Write API, nor in the Grafana UI for a datasource, you could deploy an NGINX cluster in each of your tenants which sets the header for you and proxies these requests to your Cortex Gateway.
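The header-injecting proxy described above can be sketched in a few lines. The following is a minimal illustration in Go rather than an NGINX config, built on the standard library's reverse proxy. The upstream address, the tenant ID and the helper names are assumptions made for the example and are not part of this repository.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newTenantProxy returns a reverse proxy that forwards every request to
// upstream and stamps it with a fixed tenant ID.
func newTenantProxy(upstream *url.URL, tenantID string) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	rewrite := proxy.Director // default director: rewrites scheme, host and path
	proxy.Director = func(r *http.Request) {
		rewrite(r)
		// Set (not Add) so that a value supplied by the client is overwritten.
		r.Header.Set("X-Scope-OrgID", tenantID)
	}
	return proxy
}

// forwardedTenant runs one request through the proxy's rewrite step, without
// any network traffic, and reports the tenant ID the upstream would receive.
func forwardedTenant(tenantID, clientSupplied string) string {
	upstream, _ := url.Parse("http://cortex-gateway.example") // hypothetical address
	proxy := newTenantProxy(upstream, tenantID)

	req, _ := http.NewRequest(http.MethodPost, "http://proxy.local/api/prom/push", nil)
	if clientSupplied != "" {
		req.Header.Set("X-Scope-OrgID", clientSupplied)
	}
	proxy.Director(req)
	return req.Header.Get("X-Scope-OrgID")
}

func main() {
	// The client tries to pass its own tenant ID; the proxy replaces it.
	fmt.Println(forwardedTenant("tenant-a", "someone-else"))
	// To serve real traffic, something along these lines would be needed:
	//   http.ListenAndServe(":8080", newTenantProxy(upstream, "tenant-a"))
}
```

Overwriting the header instead of appending to it matters here: a client behind the proxy should not be able to choose a different tenant by sending its own X-Scope-OrgID value.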

Cortex completely trusts the X-Scope-OrgID header, and it is your responsibility to ensure that its value is correct. This may work fine in a trusted, self-controlled environment, but in most cases you want more control than that: you may want to issue JSON Web Tokens for your tenants, do additional validation based on the claims, invalidate tokens, and so on. The Cortex Gateway helps you with that. Because it is still very rudimentary at the time of writing, you can easily replace it with your own implementation.

In the Prometheus remote write config you can specify the bearer token. In Grafana you must provision a new datasource in which you specify the bearer token as well. Once grafana/grafana#17846 is merged you will also be able to specify the bearer token for your datasource in the Grafana UI. Take a look at the Cortex Gateway documentation to learn more about our approach to multitenancy.
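To make the token-based approach more concrete, here is a minimal sketch of what a gateway of this kind could do: verify an HS256-signed JSON Web Token, check its expiry, and translate a claim into the X-Scope-OrgID header. It is not the Cortex Gateway's actual implementation. The claim name tenant_id, the shared secret and all function names are assumptions chosen for illustration, and a real deployment would more likely use a maintained JWT library.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// claims is the subset of the token payload this sketch looks at.
// "tenant_id" is a hypothetical claim name.
type claims struct {
	TenantID string `json:"tenant_id"`
	Expiry   int64  `json:"exp"`
}

var b64 = base64.RawURLEncoding

func sign(input string, secret []byte) []byte {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(input))
	return mac.Sum(nil)
}

// issueToken builds an HS256-signed token; it only exists to keep the demo self-contained.
func issueToken(c claims, secret []byte) string {
	payload, _ := json.Marshal(c)
	input := b64.EncodeToString([]byte(`{"alg":"HS256","typ":"JWT"}`)) + "." + b64.EncodeToString(payload)
	return input + "." + b64.EncodeToString(sign(input, secret))
}

// tenantFromToken verifies algorithm, signature and expiry, then returns the tenant ID.
func tenantFromToken(token string, secret []byte, nowUnix int64) (string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return "", errors.New("malformed token")
	}
	var header struct {
		Alg string `json:"alg"`
	}
	rawHeader, err := b64.DecodeString(parts[0])
	if err != nil || json.Unmarshal(rawHeader, &header) != nil || header.Alg != "HS256" {
		return "", errors.New("unsupported or malformed header")
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil || !hmac.Equal(sig, sign(parts[0]+"."+parts[1], secret)) {
		return "", errors.New("invalid signature")
	}
	var c claims
	rawPayload, err := b64.DecodeString(parts[1])
	if err != nil || json.Unmarshal(rawPayload, &c) != nil || c.TenantID == "" {
		return "", errors.New("missing or malformed claims")
	}
	if nowUnix >= c.Expiry {
		return "", errors.New("token expired")
	}
	return c.TenantID, nil
}

// authenticate rejects requests without a valid bearer token and otherwise
// sets X-Scope-OrgID from the token before calling next.
func authenticate(next http.Handler, secret []byte) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		tenant, err := tenantFromToken(token, secret, time.Now().Unix())
		if err != nil {
			http.Error(w, err.Error(), http.StatusUnauthorized)
			return
		}
		r.Header.Set("X-Scope-OrgID", tenant)
		next.ServeHTTP(w, r)
	})
}

func main() {
	secret := []byte("demo-secret") // hypothetical shared secret
	token := issueToken(claims{TenantID: "tenant-a", Expiry: time.Now().Add(time.Hour).Unix()}, secret)
	fmt.Println(tenantFromToken(token, secret, time.Now().Unix()))
}
```

With this shape, the tenant never chooses its own X-Scope-OrgID value; it only presents a token, and the gateway derives the header from a signed claim. Token invalidation, which the text also mentions, is left out of the sketch.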

cortex-ops's People

Contributors

metost, weeco

cortex-ops's Issues

Use Kubernetes manifests instead of Terraform modules

In order to be more open to the community we should use Kubernetes manifests for the deployment rather than Terraform modules. This way we can ensure everyone understands the deployment configs and therefore can also contribute improvements.

On top of that, we would no longer need to wait for the Terraform Kubernetes provider to catch up on missing features.

Investigate consistent hashing for memcache

We should investigate and clarify whether and under what circumstances it makes sense to enable or disable the consistent hashing feature for Memcache. From the memcache docs:

Consistent Hashing is a model that allows for more stable distribution of keys given addition or removal of servers. [...]

https://github.com/memcached/memcached/wiki/ConfiguringClient#consistent-hashing

We have flags to enable it:

-memcached.consistent-hash=true
-store.index-cache-read.memcached.consistent-hash=true

I am not sure on which Cortex components this should be enabled, or how the migration from non-consistent hashing to consistent hashing works. Can we simply switch it on without worrying, or should we clear the cache after enabling it?
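As background for this question, the effect of consistent hashing can be illustrated with a small, generic sketch. It compares naive modulo placement with a hash ring using virtual nodes and counts how many keys change servers when a fourth server is added, and also how many change when switching from one scheme to the other. This is not the algorithm used by Cortex's memcached client; the server names, key names, hash function and number of virtual nodes are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// moduloServer picks a server the naive way: hash modulo the server count.
func moduloServer(key string, servers []string) string {
	return servers[hash(key)%uint32(len(servers))]
}

// ring is a minimal consistent-hash ring with virtual nodes.
type ring struct {
	points []uint32          // sorted, unique positions on the ring
	owner  map[uint32]string // position -> server
}

func newRing(servers []string, vnodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, s := range servers {
		for i := 0; i < vnodes; i++ {
			p := hash(fmt.Sprintf("%s#%d", s, i))
			if _, taken := r.owner[p]; !taken {
				r.points = append(r.points, p)
			}
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// server returns the owner of the first ring position at or after the key's hash.
func (r *ring) server(key string) string {
	h := hash(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around
	}
	return r.owner[r.points[i]]
}

// moved counts how many of n generated keys map to a different server under
// the two placement functions.
func moved(n int, before, after func(string) string) int {
	count := 0
	for i := 0; i < n; i++ {
		key := fmt.Sprintf("chunk-%d", i)
		if before(key) != after(key) {
			count++
		}
	}
	return count
}

func main() {
	old := []string{"memcached-0", "memcached-1", "memcached-2"}
	grown := append(append([]string{}, old...), "memcached-3")
	modOld := func(k string) string { return moduloServer(k, old) }
	modGrown := func(k string) string { return moduloServer(k, grown) }
	ringOld, ringGrown := newRing(old, 100), newRing(grown, 100)

	fmt.Println("remapped by adding a server (modulo):", moved(10000, modOld, modGrown))
	fmt.Println("remapped by adding a server (ring):  ", moved(10000, ringOld.server, ringGrown.server))
	fmt.Println("remapped by switching modulo -> ring:", moved(10000, modOld, ringOld.server))
}
```

In a ring like this one, adding a server only moves keys onto the new server, whereas modulo placement reshuffles keys among all servers. The third number is the one relevant to the migration question: if a client changes its placement scheme, keys that map to a different server will miss until they are written again. Whether and how this applies to the Cortex flags above is exactly what still needs to be investigated.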

Add docs about memcache tuning

Memcache can have a great positive impact on performance, but if it is set up incorrectly you won't benefit much from it. It's not easy to tell whether you are properly utilizing memcache. For instance:

  1. Is your Memcache cluster oversized / undersized?
  2. Should you reserve some memcache storage for the different components?
    a) Is it more important to cache Query Frontend stuff than index writes?
    b) If yes, does it make sense to have multiple memcache clusters, one per component, so that each has a dedicated storage size? Or is it possible to configure memcache so that each component gets its own dedicated storage within a single cluster?

Slack discussion: https://cloud-native.slack.com/archives/CCYDASBLP/p1574096262229600
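One possible starting point for the sizing question is the output of memcached's stats command. The sketch below parses the "STAT <name> <value>" lines of such a response and derives the hit ratio from the get_hits and get_misses counters. The sample values are made up, the helper names are hypothetical, and how to interpret the numbers for a given Cortex component remains an open question for these docs.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseStats turns "STAT <name> <value>" lines into a map of numeric counters.
// Non-numeric values (such as the version string) and the END line are skipped.
func parseStats(raw string) map[string]float64 {
	stats := map[string]float64{}
	scanner := bufio.NewScanner(strings.NewReader(raw))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 3 || fields[0] != "STAT" {
			continue
		}
		if v, err := strconv.ParseFloat(fields[2], 64); err == nil {
			stats[fields[1]] = v
		}
	}
	return stats
}

// hitRatio returns get_hits / (get_hits + get_misses), or 0 if there were no gets.
func hitRatio(stats map[string]float64) float64 {
	total := stats["get_hits"] + stats["get_misses"]
	if total == 0 {
		return 0
	}
	return stats["get_hits"] / total
}

func main() {
	// Made-up sample of a stats response, for illustration only.
	sample := "STAT version 1.5.20\r\n" +
		"STAT get_hits 750\r\n" +
		"STAT get_misses 250\r\n" +
		"STAT evictions 42\r\n" +
		"END\r\n"

	stats := parseStats(sample)
	fmt.Printf("hit ratio: %.2f, evictions: %.0f\n", hitRatio(stats), stats["evictions"])
}
```

As a rough heuristic, a low hit ratio combined with a steadily growing evictions counter may hint at an undersized cache, while a cache that never evicts may have room to spare. Both readings depend heavily on the workload.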
