
Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.

Home Page: https://grafana.com/oss/mimir/

License: GNU Affero General Public License v3.0

Topics: prometheus, metrics, tsdb, opentelemetry, otlp, observability

mimir's Introduction

Grafana Mimir

Grafana Mimir logo

Grafana Mimir is an open source software project that provides scalable long-term storage for Prometheus. Some of the core strengths of Grafana Mimir include:

  • Easy to install and maintain: Grafana Mimir’s extensive documentation, tutorials, and deployment tooling make it quick to get started. Using its monolithic mode, you can get Grafana Mimir up and running with just one binary and no additional dependencies. Once deployed, the best-practice dashboards, alerts, and runbooks packaged with Grafana Mimir make it easy to monitor the health of the system.
  • Massive scalability: You can run Grafana Mimir's horizontally-scalable architecture across multiple machines, resulting in the ability to process orders of magnitude more time series than a single Prometheus instance. Internal testing shows that Grafana Mimir handles up to 1 billion active time series.
  • Global view of metrics: Grafana Mimir enables you to run queries that aggregate series from multiple Prometheus instances, giving you a global view of your systems. Its query engine extensively parallelizes query execution, so that even the highest-cardinality queries complete with blazing speed.
  • Cheap, durable metric storage: Grafana Mimir uses object storage for long-term data storage, allowing it to take advantage of this ubiquitous, cost-effective, high-durability technology. It is compatible with multiple object store implementations, including AWS S3, Google Cloud Storage, Azure Blob Storage, OpenStack Swift, as well as any S3-compatible object storage.
  • High availability: Grafana Mimir replicates incoming metrics, ensuring that no data is lost in the event of machine failure. Its horizontally scalable architecture also means that it can be restarted, upgraded, or downgraded with zero downtime, which means no interruptions to metrics ingestion or querying.
  • Natively multi-tenant: Grafana Mimir’s multi-tenant architecture enables you to isolate data and queries from independent teams or business units, making it possible for these groups to share the same cluster. Advanced limits and quality-of-service controls ensure that capacity is shared fairly among tenants.

Migrating to Grafana Mimir

If you're migrating to Grafana Mimir, refer to the following documents:

Deploying Grafana Mimir

For information about how to deploy Grafana Mimir, refer to Deploy Grafana Mimir.

Getting started

If you’re new to Grafana Mimir, read the Get started guide.

Before deploying Grafana Mimir in a production environment, read:

  1. An overview of Grafana Mimir’s architecture
  2. Configure Grafana Mimir
  3. Run Grafana Mimir in production

Documentation

Refer to the following links to access Grafana Mimir documentation:

Contributing

To contribute to Grafana Mimir, refer to Contributing to Grafana Mimir.

Join the Grafana Mimir discussion

If you have any questions or feedback regarding Grafana Mimir, join the Grafana Mimir Discussion. Alternatively, consider joining the monthly Grafana Mimir Community Call.

Your feedback is always welcome, and you can also share it via the #mimir Slack channel.

License

Grafana Mimir is distributed under AGPL-3.0-only.

mimir's People

Contributors

56quarters, aknuds1, bboreham, charleskorn, chri2547, codesome, colega, csmarchbanks, cyriltovena, dependabot[bot], dimitarvdimitrov, duricanikolic, flxbk, gotjosh, gouthamve, grafanabot, jdbaldry, jml, jtlisi, juliusv, krajorama, osg-grafana, pracucci, pstibrany, renovate[bot], replay, sandeepsukhani, simonswine, stevesg, tomwilkie


mimir's Issues

/ready message after /shutdown should be clearer

After calling /shutdown, an ingester removes itself from the ring and returns 503 to /ready.
However, the body returned over HTTP is not very helpful if you don't know what happened:

Some services are not Running:
Running: 4
Terminated: 1

It would be better to say something like "ingester has been shut down", or at least list which service has been terminated.
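As a rough illustration of the requested change, here is a minimal Go sketch (not Mimir's actual readiness code; the ingester type and flag are made up) of a /ready handler that reports the shutdown explicitly:

package main

import (
	"fmt"
	"log"
	"net/http"
)

// ingester stands in for the real component; shutdownRequested is an assumed
// flag set when /shutdown is called.
type ingester struct {
	shutdownRequested bool
}

func (i *ingester) readyHandler(w http.ResponseWriter, r *http.Request) {
	if i.shutdownRequested {
		// Explain why the instance is unready instead of only counting services.
		http.Error(w, "ingester has been shut down via /shutdown and removed from the ring",
			http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ready")
}

func main() {
	ing := &ingester{shutdownRequested: true}
	http.HandleFunc("/ready", ing.readyHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}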

Lost data during compaction on Swift

Describe the bug

The compactor somehow failed to upload the block's index file to Swift, but still deleted the source blocks. There are warnings in the logs, but the compactor does not seem to be aware of them. We lost one day of metrics for our main tenant. (I was hoping to be able to re-generate the index file from the chunks, but that doesn't seem possible as the chunk files only have samples, not the labels themselves.)

We opened a bug in Thanos (thanos-io/thanos#3958), but we're wondering if Cortex would be the more relevant place for it?

To Reproduce

We're not sure how it happens, so here's our best attempt at recollection:

Running Cortex 1.7.0, the Compactor compacted a series of blocks. It then uploaded all resulting files to Swift, but the index file never made it to Swift. In Swift's own logs, there is no trace of the index file ever being uploaded. We think an error might have been detected by "CloseWithLogOnErr", but it never made its way back to the Compactor (since it runs deferred) and was thus ignored.

See logs below.

Expected behavior

The Compactor should retry uploading a file if there is an error.
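To make the failure mode concrete, here is a small Go sketch (illustrative only, not the actual Thanos/Cortex code) contrasting the "log and forget" close pattern described above with a variant that propagates the deferred close error, so the caller could retry instead of going on to delete the source blocks:

package example

import (
	"fmt"
	"io"
	"log"
)

// closeWithLogOnErr mirrors the "log and forget" pattern: the caller never
// sees the error, so a failed close of the upload goes unnoticed.
func closeWithLogOnErr(c io.Closer) {
	if err := c.Close(); err != nil {
		log.Printf("detected close error: %v", err)
	}
}

// uploadObject shows one way to propagate the deferred close error instead.
// The writer and upload flow are placeholders for illustration.
func uploadObject(w io.WriteCloser, data []byte) (err error) {
	defer func() {
		if cerr := w.Close(); cerr != nil && err == nil {
			err = fmt.Errorf("upload object close: %w", cerr)
		}
	}()
	_, err = w.Write(data)
	return err
}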

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helmfile

Storage Engine

  • Blocks
  • Chunks

Additional Context

Compactor logs:

{  
  "caller": "runutil.go:124",
  "err": "upload object close: Timeout when reading or writing data",
  "level": "warn",
  "msg": "detected close error",
  "ts": "2021-03-20T05:12:44.877771796Z"
}
{
  "bucket": "tracing: cortex-tsdb-prod04",
  "caller": "objstore.go:159",
  "component": "compactor",
  "dst": "01F16ZRT8TYA08VJQR1ZPCC5EP/index",
  "from": "data/compact/0@14583055817248146110/01F16ZRT8TYA08VJQR1ZPCC5EP/index",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "debug",
  "msg": "uploaded file",
  "org_id": "1",
  "ts": "2021-03-20T05:12:44.877834603Z"
}
{
  "caller": "compact.go:810",
  "component": "compactor",
  "duration": "4m41.662527735s",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "info",
  "msg": "uploaded block",
  "org_id": "1",
  "result_block": "01F16ZRT8TYA08VJQR1ZPCC5EP",
  "ts": "2021-03-20T05:12:45.140243007Z"
}
{
  "caller": "compact.go:832",
  "component": "compactor",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "info",
  "msg": "marking compacted block for deletion",
  "old_block": "01F15H6D6CXE1ASE788HQECHM4",
  "org_id": "1",
  "ts": "2021-03-20T05:12:45.627586825Z"
}
$ openstack object list cortex-tsdb-prod04 --prefix 1/01F16ZRT8TYA08VJQR1ZPCC5EP
+--------------------------------------------+
| Name                                       |
+--------------------------------------------+
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000001 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000002 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000003 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000004 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000005 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000006 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000007 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000008 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000009 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000010 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000011 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000012 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000013 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000014 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000015 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000016 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000017 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000018 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000019 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000020 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000021 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000022 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000023 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/meta.json     |
+--------------------------------------------+

Submitted by: ubcharron
Cortex Issue Number: 4055

Duplicate Issue

Currently, Cortex only implements a time series deletion API for chunk storage. We would like to have the same functionality with blocks storage. Ideally, the API for deleting series should be the same as currently in Prometheus and in Cortex with chunk storage.

Motivation:

  • Confidential or accidental data might have been incorrectly pushed and needs to be removed.
  • GDPR regulations require data to be eventually deleted

I am currently working on the design doc, and will link it soon.
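For reference, a hedged Go sketch of what calling a Prometheus-style deletion endpoint could look like (the base URL, matcher, and time range are placeholders; on Prometheus itself the admin API must be enabled with --web.enable-admin-api, and the blocks storage equivalent proposed here does not exist yet):

package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
)

// deleteSeries issues a Prometheus-style series deletion request.
func deleteSeries(baseURL, matcher, start, end string) error {
	q := url.Values{}
	q.Set("match[]", matcher)
	q.Set("start", start)
	q.Set("end", end)

	req, err := http.NewRequest(http.MethodPost,
		baseURL+"/api/v1/admin/tsdb/delete_series?"+q.Encode(), nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	// Example: delete an accidentally pushed metric over one day.
	err := deleteSeries("http://localhost:9090",
		`accidental_metric{namespace="prod"}`,
		"2021-03-01T00:00:00Z", "2021-03-02T00:00:00Z")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("deletion request accepted")
}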

Submitted by: ilangofman
Cortex Issue Number: 4267

Compactors can't keep up with the load

Describe the bug
The number of blocks per tenant increases over time instead of going down.
At any given time some compactors are idle (and basically remain idle until an eventual restart), even though many compactions are still needed for tenants that are not currently being compacted.

To Reproduce
Steps to reproduce the behavior:

  1. 9 tenants, about 42M active time series per tenant
  2. 12 compactors
  3. In the compactor v1.6 dashboard, both number of blocks per each tenant and the average number of blocks are increasing over time

Expected behavior

  • the number of blocks for every tenant and the overall average to decrease over time
  • if there are X tenants and X compactors, all X compactors should be busy compacting

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet
  • AWS, S3
  • 12 compactors with 2CPUs and 5GB of RAM (50GB limit)
  • 9 tenants with a total of 381M active time series and 6M reqps. The metrics are evenly split between tenants.

Storage Engine

  • [x] Blocks
  • Chunks

Additional Context
There are 3 issues:

  1. A compactor does not keep up with the load of one tenant if the tenant is big enough. I tried initially with 7 tenants and 55M active series per tenant, but even when a tenant was being compacted by a compactor, its blocks kept increasing. So I tried splitting the 381M active time series between 9 tenants, reducing the number of active time series per tenant to 42M, but the number of blocks per tenant is still increasing over time.
  2. If there are a few tenants and a few compactors, the chance that one compactor is not responsible for any tenant while another compactor is responsible for more than one is high, probably because the hash-based distribution does not work well when the number of tenants is relatively low (see the sketch after this list).
  3. It's not clear from logs or dashboards how to find the bottleneck or whether there is anything wrong.
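The following toy Go program (tenant IDs, shard count, and hashing scheme are made up for illustration) shows how hash-based tenant-to-compactor assignment can leave some compactors idle when the number of tenants is small:

package main

import (
	"fmt"
	"hash/fnv"
)

// Assign each tenant to the compactor whose index matches the tenant's hash
// modulo the number of compactors, then print the resulting distribution.
func main() {
	tenants := []string{
		"tenant-1", "tenant-2", "tenant-3", "tenant-4", "tenant-5",
		"tenant-6", "tenant-7", "tenant-8", "tenant-9",
	}
	const compactors = 12

	perCompactor := make(map[uint32][]string)
	for _, t := range tenants {
		h := fnv.New32a()
		h.Write([]byte(t))
		shard := h.Sum32() % compactors
		perCompactor[shard] = append(perCompactor[shard], t)
	}
	for i := uint32(0); i < compactors; i++ {
		fmt.Printf("compactor %2d -> %v\n", i, perCompactor[i])
	}
	// With only 9 tenants spread over 12 shards, several compactors end up
	// with no tenant at all while others own more than one.
}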


Submitted by: agardiman
Cortex Issue Number: 3753

Test Issue 1

TEST ISSUE Submitted by: pracucci
Cortex Issue Number: 4047
Issue Body: Is your feature request related to a problem? Please describe.
We have a Cortex cluster with a large number of tenants (tens of thousands). Discovering tenants in the bucket is very slow. The bucket listing is done by every compactor, but we could reduce the load on the bucket by configuring bucket caching for the compactor as well.

Describe the solution you'd like
Bucket caching for the compactor was not allowed because it's generally not safe, but there are a few operations (like tenant discovery) for which it's expected to be safe, and I believe we should enable it for those.

Inefficient blocks storage index lookup when running query-frontend due to Prometheus lookback

Given a "large" time range query (even "Last 24h"), the query-frontend split queries by day and then the querier receives a query whose time interval interval is [00:00:00, 23:59:59].

The problem is that the PromQL engine does a 5m lookback (by default) so the query we end up executing is actually [prev day 23:55:00, 23:59:59].

In the blocks storage, we do have daily blocks, so we have to query 2 daily blocks to execute each query and querying the block for the previous day wastes a bunch of resources just to fetch the "last 5m samples" of the block.
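A small Go illustration of the effect described above (the dates are arbitrary and the 5m value is PromQL's default lookback delta):

package main

import (
	"fmt"
	"time"
)

// After the query-frontend splits a range query by day, PromQL's default 5m
// lookback pulls the effective start time back into the previous day's block.
func main() {
	const lookbackDelta = 5 * time.Minute // PromQL default

	splitStart := time.Date(2021, 3, 20, 0, 0, 0, 0, time.UTC)  // 00:00:00
	splitEnd := time.Date(2021, 3, 20, 23, 59, 59, 0, time.UTC) // 23:59:59

	effectiveStart := splitStart.Add(-lookbackDelta)
	fmt.Println("split query range:  ", splitStart, "->", splitEnd)
	fmt.Println("effective TSDB range:", effectiveStart, "->", splitEnd)
	// effectiveStart is 2021-03-19 23:55:00 UTC, so the previous day's block
	// must be opened just to fetch its last 5 minutes of samples.
}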

The Cortex chunks storage suffers from it as well, because we shard the index by day, but I haven't measured its impact on query times.

Submitted by: pracucci
Cortex Issue Number: 2652

Remove support for non-streaming based querying

Queriers can query ingesters either with gRPC streaming or without it. Over time we've found that gRPC streaming is the better approach, so we could remove support for non-streaming querying and focus on gRPC streaming only.

Compactor should not generate blocks covering a time range wider than the max configured blocks range

Describe the bug
Compactor (blocks storage) currently compacts together any overlapping blocks, even if the resulting block (after compaction) will cover a time range wider than the maximum configured block range (24h by default).

For example, if the storage contains a block with a time range of 30 days (for any reason, such as a bug), all blocks that overlap this very large block will be compacted together.

Expected behavior
The compactor should exclude from compaction any blocks whose min/max timestamps are not aligned to the expected boundaries and/or whose time range is wider than the maximum configured block range (24h by default).
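A sketch of the kind of check this implies, in Go (illustrative only, not Mimir's actual compaction filter; the function name is made up):

package main

import (
	"fmt"
	"time"
)

// skipFromCompaction reports whether a block should be excluded: either it
// already spans more than the configured maximum block range, or its min
// timestamp is not aligned to a range boundary.
func skipFromCompaction(minT, maxT time.Time, maxBlockRange time.Duration) bool {
	rangeMs := maxBlockRange.Milliseconds()
	minMs, maxMs := minT.UnixMilli(), maxT.UnixMilli()

	tooWide := maxMs-minMs > rangeMs
	misaligned := minMs%rangeMs != 0 // block does not start on a range boundary
	return tooWide || misaligned
}

func main() {
	dayStart := time.Date(2021, 3, 20, 0, 0, 0, 0, time.UTC)
	fmt.Println(skipFromCompaction(dayStart, dayStart.Add(30*24*time.Hour), 24*time.Hour)) // true: 30d block
	fmt.Println(skipFromCompaction(dayStart, dayStart.Add(24*time.Hour), 24*time.Hour))    // false: aligned 24h block
}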

Storage Engine

  • Blocks
  • Chunks

Submitted by: pracucci
Cortex Issue Number: 3331

Remove query-frontend without query-scheduler

Currently the query-frontend can run with or without the query-scheduler. The query-scheduler is the better solution because it allows the query-frontend to scale indefinitely. To simplify deployment options and converge towards a single model, we could remove support for running the query-frontend without the query-scheduler.

Remove alternative ways to iterate samples in the querier

The querier currently supports multiple ways to iterate samples. I suggest offering only one way to iterate samples and removing support for the others. This way the code will be simplified and we can focus our optimisation efforts.

Ingesters latency and in-flight requests spike right after startup with empty TSDB

Describe the bug
When starting a brand new ingester (empty disk, running blocks storage), as soon as the ingester is registered to the ring and its state switches to ACTIVE, it suddenly receives a bunch of new series. If you target each ingester to have about 1.5M active series, it will have to add 1.5M series to TSDB in a matter of a few seconds.

Today, while scaling out a large number of ingesters (50), a few of them hit very high latency and a high number of in-flight requests. The high number of in-flight requests caused memory to increase until some of these ingesters were OOMKilled.

I've been able to profile the affected ingesters and the following is what I found so far.

1. The number of in-flight push requests skyrockets right after ingester startup

2. The number of TSDB appenders skyrockets too

3. The average cortex_ingester_tsdb_appender_add_duration_seconds skyrockets too

4. Lock contention in Head.getOrCreateWithID()

With no big surprise, looking at the number of active goroutines, 99.9% were blocked in Head.getOrCreateWithID() due to lock contention.

To Reproduce
I haven't found a way to easily reproduce it yet, either locally or with a stress test, but unfortunately it looks like it's not that difficult to reproduce in production (where debugging is harder).

Storage Engine

  • Blocks
  • Chunks

Submitted by: pracucci
Cortex Issue Number: 3349

Document blocks storage per-tenant retention

We should document how the storage retention works in Mimir, specifically this configuration option:

# Delete blocks containing samples older than the specified retention period.
# Also used by query-frontend to avoid querying beyond the retention period. 0
# to disable.
# CLI flag: -compactor.blocks-retention-period
[compactor_blocks_retention_period: <duration> | default = 0s]
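As an illustration of what this option implies (a simplified sketch under stated assumptions, not Mimir's actual code), the retention period translates into a deletion cutoff computed from the current time, with 0 disabling it:

package main

import (
	"fmt"
	"time"
)

// retentionCutoff returns the timestamp before which blocks become eligible
// for deletion (and beyond which the query-frontend could refuse to query).
// A retention of 0 disables the cutoff entirely.
func retentionCutoff(now time.Time, retention time.Duration) (time.Time, bool) {
	if retention == 0 {
		return time.Time{}, false // retention disabled
	}
	return now.Add(-retention), true
}

func main() {
	now := time.Date(2021, 6, 1, 0, 0, 0, 0, time.UTC)
	if cutoff, ok := retentionCutoff(now, 90*24*time.Hour); ok {
		fmt.Println("delete blocks entirely older than:", cutoff)
	}
}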

Allow to configure bucket caching for compactor

Is your feature request related to a problem? Please describe.
We have a Cortex cluster with a large number of tenants (tens of thousands). Discovering tenants in the bucket is very slow. The bucket listing is done by every compactor, but we could reduce the load on the bucket by configuring bucket caching for the compactor as well.

Describe the solution you'd like
Bucket caching for the compactor was not allowed because it's generally not safe, but there are a few operations (like tenant discovery) for which it's expected to be safe, and I believe we should enable it for those.

Submitted by: pracucci
Cortex Issue Number: 4047

Store-gateway: high memory allocations caused by per-tenant Prometheus registry

Describe the bug
To be able to use the Thanos BucketStore while supporting Cortex multi-tenancy, we need to create a BucketStore for each tenant, passing a dedicated Prometheus registry to each one and then aggregating metrics from all registries.

Due to this, the Prometheus metrics collection causes high memory allocations (on the order of 50MB/s in a store-gateway with 7.5K tenants). The allocated memory is not retained, but it still puts pressure on the GC.


In a cluster with low QPS, 95% of store-gateway memory allocations are caused by metrics collection.
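A minimal Go sketch of the per-tenant registry pattern described above, using prometheus.Gatherers to aggregate them on each scrape (the tenant names and the metric are made up):

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// One registry per tenant, aggregated on every metrics collection: gathering
// thousands of registries on each scrape is what drives the allocation rate.
func main() {
	tenants := []string{"tenant-a", "tenant-b", "tenant-c"}

	var gatherers prometheus.Gatherers
	for _, t := range tenants {
		reg := prometheus.NewRegistry() // dedicated registry per tenant BucketStore
		c := prometheus.NewCounter(prometheus.CounterOpts{
			Name:        "bucket_store_operations_total",
			Help:        "Example per-tenant counter.",
			ConstLabels: prometheus.Labels{"tenant": t},
		})
		reg.MustRegister(c)
		gatherers = append(gatherers, reg)
	}

	// Every scrape walks all per-tenant registries and merges their families.
	mfs, err := gatherers.Gather()
	if err != nil {
		fmt.Println("gather error:", err)
	}
	fmt.Println("metric families gathered:", len(mfs))
}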

Submitted by: pracucci
Cortex Issue Number: 3697

The meta-syncer directory is not cleaned up when enabling bucket index

Describe the bug
The meta-syncer directory is no longer used by the store-gateway when the bucket index is enabled; however, it's not cleaned up if the store-gateway disk is persisted before and after enabling the bucket index.


The impact of meta-syncer on disk utilisation in this cluster is atypical, because the cluster is made of a large number (7.5K) of very small tenants.

Workaround

To manually fix it:

find /data -name 'meta-syncer' | xargs -I {} rm -fr {}

Expected behavior
I'm not sure Cortex should really take care of this cleanup, but it should be at least documented.

Submitted by: pracucci
Cortex Issue Number: 3696

Gaps in chunks don't get filled by data from other chunks

While working on a feature I noticed that in main we currently have a behavior which I believe could be considered a bug:

If there are multiple uncompacted chunks which cover the same or overlapping time ranges and one of them has a gap in its data, then I think the expected behavior should be that we fill that gap by taking samples from another chunk that has them. But currently we don't do that: depending on the order in which the chunks are decoded, it's possible that we return the gap to the user.

I noticed this when adding a unit test for this case; note that this commit is based on a relatively recent commit in main:

d478766#diff-ad0732f7c88d9207016a50d8d3f9455940fb90329dccb77e0f63ba749508b03aR195-R213

When running this test it fails; I believe it should succeed.

I think fixing this behavior would be relatively simple, but it would mean that we'd always have to decode all chunks that contain data for the queried time range, because otherwise there is no good way to detect gaps as far as I know. This would likely have a performance impact.
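A simplified Go sketch of the expected merge behaviour (plain timestamp/value pairs instead of encoded chunks, deduplication by timestamp; not the querier's actual code):

package main

import (
	"fmt"
	"sort"
)

// sample is a simplified (timestamp, value) pair. To fill a gap in one chunk,
// all overlapping chunks are decoded and samples are taken from whichever
// chunk has them, deduplicating by timestamp.
type sample struct {
	ts  int64
	val float64
}

func mergeOverlapping(chunks ...[]sample) []sample {
	byTS := map[int64]float64{}
	for _, c := range chunks {
		for _, s := range c {
			if _, ok := byTS[s.ts]; !ok {
				byTS[s.ts] = s.val
			}
		}
	}
	out := make([]sample, 0, len(byTS))
	for ts, v := range byTS {
		out = append(out, sample{ts, v})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ts < out[j].ts })
	return out
}

func main() {
	chunkA := []sample{{10, 1}, {20, 2} /* gap at ts=30 */, {40, 4}}
	chunkB := []sample{{10, 1}, {20, 2}, {30, 3}, {40, 4}}
	fmt.Println(mergeOverlapping(chunkA, chunkB)) // the gap at ts=30 is filled from chunkB
}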

Blocks storage unable to ingest samples older than 1h after an outage

TSDB doesn't allow appending samples whose timestamp is older than the last block cut from the head. Given that a block is cut from the head up to half the block range before the head's max timestamp, and given the default block range period of 2h, this means that the blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Let's consider this scenario:

  • Multiple Prometheus servers remote writing to the same Cortex tenant
  • Some Prometheus servers stop remote writing to Cortex (for any reason, e.g. a networking issue) and fall behind by more than 1h
  • When those Prometheus servers come back online, Cortex will discard any sample whose timestamp is older than 1h, because the max timestamp in the TSDB head is close to "now" (due to the working Prometheus servers, which never stopped writing series) while the failing ones try to catch up by writing samples older than 1h

We recently had an outage in our staging environment which triggered this condition and we should find a way to solve it.
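The rough arithmetic behind the 1h limit, as a small Go illustration of the constraint (not TSDB's actual code; the timestamps are arbitrary):

package main

import (
	"fmt"
	"time"
)

// With the default 2h block range, the head keeps accepting samples only down
// to roughly "max timestamp - 1h".
func main() {
	const blockRange = 2 * time.Hour

	headMaxTime := time.Date(2021, 3, 20, 12, 0, 0, 0, time.UTC) // kept close to "now" by healthy Prometheus servers
	oldestAcceptable := headMaxTime.Add(-blockRange / 2)

	lagging := headMaxTime.Add(-90 * time.Minute) // a Prometheus server that fell 1h30m behind
	fmt.Println("oldest acceptable sample:", oldestAcceptable)
	fmt.Println("lagging sample accepted: ", !lagging.Before(oldestAcceptable)) // false: it gets discarded
}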

@bwplotka You may be interested, given I think this issue affects Thanos receive too.

Submitted by: pracucci
Cortex Issue Number: 2366

Consider removing `blocks-storage.bucket-store.max-chunk-pool-bytes` limits

Is your feature request related to a problem? Please describe.
blocks-storage.bucket-store.max-chunk-pool-bytes is used to configure the maximum size of the chunk pool in the store-gateway bucket store. The chunk byte pool is used to reduce allocations, but enforcing the limit can cause queries to fail when the pool is exhausted. I believe the purpose of enforcing the limit is to prevent store-gateway OOM kills, but since you end up needing to set the limit to a much higher value to satisfy queries (e.g. 80% of the store-gateway's requested memory), you don't really gain any extra protection.

Describe the solution you'd like
Remove the blocks-storage.bucket-store.max-chunk-pool-bytes flag and its associated configuration, and use a chunk pool with no maximum.

Submitted by: jdbaldry
Cortex Issue Number: 3793

Memberlist: Aggressive scale down can cause lost tombstones

Describe the bug
When scaling down extremely fast, a tombstone can still go missing. The TestSingleBinaryWithMemberlistScaling test can reproduce this on occasion with the default values, e.g.

integration_memberlist_single_binary_test.go:212: cortex-1: cortex_ring_members=4.000000 memberlist_client_kv_store_value_tombstones=16.000000

memberlist-tombstone-with-debug.log

What appears to be happening is that the final messages from the instance being scaled down are being sent the expected number of times, but the intended recipients are also shutting down. This is not trivial to fix because we do not get any feedback from memberlist as to whether our messages were actually received. Possible solutions:

  • Somehow monitor for failed sends and re-send until some number of successful sends are achieved
  • Send out tombstone messages more times (e.g. a form of retransmit multiplier specifically for tombstones)

To Reproduce
Run the TestSingleBinaryWithMemberlistScaling a few times.

make ./cmd/cortex/.uptodate
go test -timeout=1h -count=20 -v -tags=requires_docker ./integration -run "^TestSingleBinaryWithMemberlistScaling$"

Tweaking the scaling numbers in the test makes it fail more often:

maxCortex := 8
minCortex := 1

Expected behavior
The test doesn't fail.

Environment:

  • Infrastructure: N/A
  • Deployment tool: N/A

Additional Context

(Origin: cortexproject/cortex#4360)

Proposal: introduce read-only mode for ingesters

Recently we had to deal with a Cortex outage during which ingesters (running the blocks storage) were failing to compact the head due to an in-memory corruption in TSDB (which has already been fixed in prometheus/prometheus#7560). During the outage we needed to stop ingesting samples on some ingesters while keeping them running for the queriers, in order not to lose series when querying, but unfortunately we couldn't find any way to do it.

As a follow-up action from this outage, I would like to propose introducing in the ring the ability to mark an ingester as read-only. When manually marked as read-only, the ingester is ignored by distributors on the write path, while queriers continue to query it.
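A minimal Go sketch of the proposal (the types and names are illustrative, not the actual ring API): read-only instances stay in the ring for the read path but are skipped when selecting ingesters for a write.

package main

import "fmt"

type instanceState int

const (
	active instanceState = iota
	readOnly
)

type instance struct {
	addr  string
	state instanceState
}

// writableInstances filters out read-only instances on the write path; the
// read path would keep using the full ring.
func writableInstances(ring []instance) []instance {
	var out []instance
	for _, ing := range ring {
		if ing.state == readOnly {
			continue // still queryable, but skipped for writes
		}
		out = append(out, ing)
	}
	return out
}

func main() {
	ring := []instance{
		{"ingester-0:9095", active},
		{"ingester-1:9095", readOnly}, // manually marked during the incident
		{"ingester-2:9095", active},
	}
	fmt.Println(writableInstances(ring))
}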

Thoughts?

Submitted by: pracucci
Cortex Issue Number: 2931

Reduce number of bucket API calls done by compactor

The work done to introduce the bucket index has reduced the number of baseline bucket API calls done by queriers and store-gateways. However, the compactor is still doing a large number of bucket API calls.

I did a brief analysis, and below is the breakdown of the bucket API calls, with some details about the compactor. This analysis could be a baseline for discussing further improvements.

Analysis of a cluster with 7.2K tenants

We're running a Cortex cluster (blocks storage) with 7.2K tenants and a very low QPS. Shuffle sharding is enabled, and each tenant's series are ingested by 3 ingesters. Below is an analysis of the bucket API calls over the last 24h (pricing based on Google GCS).

  • 6,754,205 class A ops/day at $0.05 per 10k = $33.7/day = $1,013/month
  • 21,937,274 class B ops/day at $0.004 per 10k = $8.7/day = $261/month

compactor:

  • 5676115 class A ops (84% of the cluster)
    • upload 26% (legit, needs to upload compacted blocks)
    • delete 30% (legit, needs to delete source blocks)
    • iter 43%
  • 14314950 class B ops (65% of the cluster)
    • attributes 2.4%
    • exists 44%
    • get 53% (legit, needs to download blocks to compact)

Estimated compactor calls to Exists():

  • Tenant deletion mark
    • 1x tenant every compaction_interval
    • 1x tenant every cleanup_interval
  • block.MarkForDeletion()
    • 1x deleted block
  • block.Delete()
    • 1x deleted block
  • BaseFetcher.loadMeta()
    • 1x block every compaction "run" (can happen multiple times for the same tenant over the same compaction_interval)

Estimated compactor calls to Iter() (about 2.4M/day):

  • Blocks cleaner
    • [not significant] 1x every 1K tenants every cleanup_interval (to discover tenants)
  • Compactor
    • [not significant] 1x every 1K tenants every compaction_interval (to discover tenants)
    • [estimated ~250k/day] 1x tenant every compaction "run" (BaseFetcher.fetchMetadata(), called at least twice)
  • [estimated ~1.3M/day] Bucket index updater (used by blocks cleaner)
    • 1x tenant every cleanup_interval (to list blocks)
    • 1x tenant every cleanup_interval (to list deletion marks)
  • [estimated ~550K/day] block.Delete()
    • 1x every block to delete (to list <block-id>/)
    • 1x every sub-path in the block structure of a block to delete (to list <block-id>/chunks/)

In this specific cluster, compactor is configured with block range periods 12h,24h,168h,672h.

Submitted by: pracucci
Cortex Issue Number: 3633

Race condition between queries and closing TSDB in ingester

Describe the bug
When the ingester is shutting down (graceful stop), TSDB is closed but we don't wait until in-flight queries are completed before closing it. This leads to error logs like the following:

level=warn ts=2021-01-18T10:47:58.080281469Z caller=grpc_logging.go:55 duration=8.779921ms method=/cortex.Ingester/QueryStream err="open querier for block REDACTED: open chunk reader: can't read from a closed head" msg="gRPC\n"

Expected behavior
We should wait until any read/write from/to TSDB is completed before closing it.
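A minimal Go sketch of that expected behaviour (not the ingester's actual code): track in-flight reads with a wait group and make the close path wait for them before tearing down TSDB.

package main

import (
	"errors"
	"fmt"
	"sync"
)

type userTSDB struct {
	mtx      sync.RWMutex
	closed   bool
	inFlight sync.WaitGroup
}

func (db *userTSDB) query() error {
	db.mtx.RLock()
	if db.closed {
		db.mtx.RUnlock()
		return errors.New("TSDB is closing")
	}
	db.inFlight.Add(1)
	db.mtx.RUnlock()
	defer db.inFlight.Done()

	// ... read from the head/blocks here ...
	return nil
}

func (db *userTSDB) close() {
	db.mtx.Lock()
	db.closed = true
	db.mtx.Unlock()

	db.inFlight.Wait() // wait for in-flight reads before closing TSDB
	// ... actually close the underlying TSDB here ...
}

func main() {
	db := &userTSDB{}
	fmt.Println(db.query()) // <nil>
	db.close()
	fmt.Println(db.query()) // TSDB is closing
}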

Storage Engine

  • Blocks
  • Chunks

Submitted by: pracucci
Cortex Issue Number: 3704

Fix goroutine leaks in unit tests

We have some unit tests leaking goroutines. Typically it's not a big deal, considering they're short-lived, but in some circumstances they could significantly increase the memory allocated by tests.

To give an example, in #69 we fixed a goroutine leak in the consul in-memory KV store which was accounting for > 1GB when running pkg/storegateway.

As a chore activity, it would be great to expand the usage of goleak in unit tests, to eventually spot more goroutine leaks (see #69).
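For reference, the standard goleak pattern looks like this (a per-package TestMain; options can be added to ignore known long-lived goroutines):

package example

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain verifies at the end of the package's tests that no goroutines
// are leaked.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}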

Remove -ingester.min-ready-duration

I propose to remove -ingester.min-ready-duration.

The min ready duration was originally introduced in Cortex PR 260 by @tomwilkie and, to my understanding, was required because ingesters were deployed as Kubernetes Deployments. Assuming we'll remove the chunks storage (#5), the blocks storage requires ingesters to run as StatefulSets, so I believe we don't need the "min ready duration" anymore.

Store-gateway broken GCS connection does not recover

We had a malfunctioning store-gateway instance logging "write: broken pipe" each time it tried to fetch chunks from GCS. It looks like the issue started with a broken TCP connection (which is OK in itself), but the client then didn't recover by closing it and opening a new one. Restarting the store-gateway fixed the issue, but we should fix the root cause.

Logs like the following were continuously repeated:

level=warn ts=2020-06-10T07:15:35.181518285Z caller=grpc_logging.go:55 method=/gatewaypb.StoreGateway/Series duration=11.744754ms err="rpc error: code = Aborted desc = fetch series for block 01E95411Z8PHCFPNMCAHD91VAQ: preload chunks: read range for 133: get range reader: failed to get object attributes: 10428/01E95411Z8PHCFPNMCAHD91VAQ/chunks/000134: Get https://storage.googleapis.com/storage/v1/b/REDACTED-BUCKET/o/10428%2F01E95411Z8PHCFPNMCAHD91VAQ%2Fchunks%2F000134?alt=json&prettyPrint=false&projection=full: write tcp REDACTED-LOCAL-IP:46668->REDACTED-GCS-IP:443: write: broken pipe" msg="gRPC\n"

Submitted by: pracucci
Cortex Issue Number: 2703

Panic in compactor

Seen a single occurrence of this panic in the compactor.

fatal error: concurrent map iteration and map write

goroutine 131064 [running]:
runtime.throw(0x2c85468, 0x26)
	/usr/local/go/src/runtime/panic.go:1117 +0x72 fp=0xc00067ab68 sp=0xc00067ab38 pc=0x438652
runtime.mapiternext(0xc00067ac78)
	/usr/local/go/src/runtime/map.go:858 +0x54c fp=0xc00067abe8 sp=0xc00067ab68 pc=0x410d2c
github.com/prometheus/client_golang/prometheus.(*constHistogram).Write(0xc000b3b100, 0xc0007ac380, 0x2, 0x2)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/histogram.go:556 +0x179 fp=0xc00067ace8 sp=0xc00067abe8 pc=0x8c8939
github.com/prometheus/client_golang/prometheus.processMetric(0x3157fa8, 0xc000b3b100, 0xc00067b058, 0xc00067b088, 0x0, 0x0, 0x1)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/registry.go:598 +0xa2 fp=0xc00067ae10 sp=0xc00067ace8 pc=0x8cdbc2
github.com/prometheus/client_golang/prometheus.(*Registry).Gather(0xc0000b4910, 0x0, 0x0, 0x0, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/registry.go:492 +0x9da fp=0xc00067b270 sp=0xc00067ae10 pc=0x8cd57a
github.com/prometheus/client_golang/prometheus/promhttp.HandlerFor.func1(0x316e820, 0xc0006435e0, 0xc0012e1b00)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/promhttp/http.go:126 +0x99 fp=0xc00067b420 sp=0xc00067b270 pc=0xbaefd9
net/http.HandlerFunc.ServeHTTP(0xc0007910a0, 0x316e820, 0xc0006435e0, 0xc0012e1b00)
	/usr/local/go/src/net/http/server.go:2069 +0x44 fp=0xc00067b448 sp=0xc00067b420 pc=0x715b84
github.com/gorilla/mux.(*Router).ServeHTTP(0xc000316cc0, 0x316e820, 0xc0006435e0, 0xc0012e1800)
	/backend-enterprise/vendor/github.com/gorilla/mux/mux.go:210 +0xd3 fp=0xc00067b580 sp=0xc00067b448 pc=0xb99ef3
github.com/weaveworks/common/middleware.Instrument.Wrap.func1.2(0x316e820, 0xc0006435e0)
	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/instrument.go:68 +0x4c fp=0xc00067b5b0 sp=0xc00067b580 pc=0xd188ac
github.com/felixge/httpsnoop.CaptureMetricsFn(0x316a0b0, 0xc00134f860, 0xc00067b790, 0x2, 0x31ac158, 0xc0005a8580)
	/backend-enterprise/vendor/github.com/felixge/httpsnoop/capture_metrics.go:81 +0x24b fp=0xc00067b690 sp=0xc00067b5b0 pc=0xd0658b
github.com/weaveworks/common/middleware.Instrument.Wrap.func1(0x316a0b0, 0xc00134f860, 0xc0012e1800)
	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/instrument.go:67 +0x325 fp=0xc00067b820 sp=0xc00067b690 pc=0xd18be5
net/http.HandlerFunc.ServeHTTP(0xc000784730, 0x316a0b0, 0xc00134f860, 0xc0012e1800)
	/usr/local/go/src/net/http/server.go:2069 +0x44 fp=0xc00067b848 sp=0xc00067b820 pc=0x715b84
github.com/weaveworks/common/middleware.Log.Wrap.func1(0x316e940, 0xc00134f810, 0xc0012e1800)
	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/logging.go:52 +0x1a9 fp=0xc00067b9a0 sp=0xc00067b848 pc=0xd19449
net/http.HandlerFunc.ServeHTTP(0xc000bf3380, 0x316e940, 0xc00134f810, 0xc0012e1800)
	/usr/local/go/src/net/http/server.go:2069 +0x44 fp=0xc00067b9c8 sp=0xc00067b9a0 pc=0x715b84
net/http.Handler.ServeHTTP-fm(0x316e940, 0xc00134f810, 0xc0012e1800)
	/usr/local/go/src/net/http/server.go:87 +0x56 fp=0xc00067b9f8 sp=0xc00067b9c8 pc=0x73f6d6
github.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5(0x316c7e0, 0xc000210540, 0xc0012e1700)
	/backend-enterprise/vendor/github.com/opentracing-contrib/go-stdlib/nethttp/server.go:154 +0x5e4 fp=0xc00067bb48 sp=0xc00067b9f8 pc=0xcf67c4
net/http.HandlerFunc.ServeHTTP(0xc000bf3400, 0x316c7e0, 0xc000210540, 0xc0012e1700)
	/usr/local/go/src/net/http/server.go:2069 +0x44 fp=0xc00067bb70 sp=0xc00067bb48 pc=0x715b84
net/http.serverHandler.ServeHTTP(0xc000210460, 0x316c7e0, 0xc000210540, 0xc0012e1700)
	/usr/local/go/src/net/http/server.go:2887 +0xa3 fp=0xc00067bba0 sp=0xc00067bb70 pc=0x719243
net/http.(*conn).serve(0xc001d09b80, 0x317d680, 0xc001d13b00)
	/usr/local/go/src/net/http/server.go:1952 +0x8cd fp=0xc00067bfc8 sp=0xc00067bba0 pc=0x71466d
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc00067bfd0 sp=0xc00067bfc8 pc=0x472701
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3013 +0x39b

goroutine 1 [select, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*Manager).AwaitStopped(0xc000a074a0, 0x317d610, 0xc00005c040, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/manager.go:145 +0x8b
github.com/cortexproject/cortex/pkg/cortex.(*Cortex).Run(0xc000b40800, 0xc00077e5c0, 0x4)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/cortex.go:473 +0x953
github.com/grafana/backend-enterprise/pkg/enterprise/cortex/init.(*CortexEnterprise).Run(...)
	/backend-enterprise/pkg/enterprise/cortex/init/cortex.go:147
main.main()
	/backend-enterprise/cmd/metrics-enterprise/main.go:212 +0x1105

goroutine 57 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc000394080)
	/backend-enterprise/vendor/go.opencensus.io/stats/view/worker.go:276 +0xcd
created by go.opencensus.io/stats/view.init.0
	/backend-enterprise/vendor/go.opencensus.io/stats/view/worker.go:34 +0x68

goroutine 240 [select]:
github.com/uber/jaeger-client-go/utils.(*reconnectingUDPConn).reconnectLoop(0xc000a025b0, 0x6fc23ac00)
	/backend-enterprise/vendor/github.com/uber/jaeger-client-go/utils/reconnecting_udp_conn.go:70 +0xc8
created by github.com/uber/jaeger-client-go/utils.newReconnectingUDPConn
	/backend-enterprise/vendor/github.com/uber/jaeger-client-go/utils/reconnecting_udp_conn.go:60 +0x10c

goroutine 160 [chan receive, 1139 minutes]:
github.com/cortexproject/cortex/pkg/alertmanager.init.0.func1()
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/alertmanager/alertmanager.go:127 +0x55
created by github.com/cortexproject/cortex/pkg/alertmanager.init.0
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/alertmanager/alertmanager.go:123 +0x35

goroutine 298 [select]:
github.com/uber/jaeger-client-go.(*RemotelyControlledSampler).pollControllerWithTicker(0xc0009bb860, 0xc000784410)
	/backend-enterprise/vendor/github.com/uber/jaeger-client-go/sampler_remote.go:153 +0xa5
github.com/uber/jaeger-client-go.(*RemotelyControlledSampler).pollController(0xc0009bb860)
	/backend-enterprise/vendor/github.com/uber/jaeger-client-go/sampler_remote.go:148 +0x73
created by github.com/uber/jaeger-client-go.NewRemotelyControlledSampler
	/backend-enterprise/vendor/github.com/uber/jaeger-client-go/sampler_remote.go:87 +0x125

goroutine 302 [select, 1139 minutes]:
github.com/cortexproject/cortex/pkg/cortex.NewServerService.func1(0x317d5d8, 0xc000acc080, 0x2, 0xc00036ff68)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/server_service.go:28 +0xd3
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc00047b540)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 305 [select]:
github.com/uber/jaeger-client-go.(*remoteReporter).processQueue(0xc00029e7e0)
	/backend-enterprise/vendor/github.com/uber/jaeger-client-go/reporter.go:296 +0xfe
created by github.com/uber/jaeger-client-go.NewRemoteReporter
	/backend-enterprise/vendor/github.com/uber/jaeger-client-go/reporter.go:237 +0x1a5

goroutine 347 [chan receive, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener.func1(0xc000a07500, 0x3184cf8, 0xc000799da0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:344 +0x66
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:343 +0x11d

goroutine 348 [chan receive, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener.func1(0xc000a07560, 0x3184cf8, 0xc000799db8)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:344 +0x66
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:343 +0x11d

goroutine 349 [chan receive, 1137 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener.func1(0xc000a075c0, 0x3184cf8, 0xc000799dd0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:344 +0x66
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:343 +0x11d

goroutine 350 [chan receive, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener.func1(0xc000a07620, 0x3184cf8, 0xc000799de8)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:344 +0x66
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:343 +0x11d

goroutine 351 [chan receive, 1137 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*Manager).AddListener.func1(0xc000a076e0, 0x3165040, 0xc000799e30)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/manager.go:244 +0x66
created by github.com/cortexproject/cortex/pkg/util/services.(*Manager).AddListener
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/manager.go:243 +0xfd

goroutine 352 [select, 1139 minutes]:
github.com/weaveworks/common/signals.(*Handler).Loop(0xc000aed020)
	/backend-enterprise/vendor/github.com/weaveworks/common/signals/signals.go:48 +0x1bb
github.com/cortexproject/cortex/pkg/cortex.(*Cortex).Run.func4(0xc000aed020, 0xc000a074a0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/cortex.go:462 +0x2b
created by github.com/cortexproject/cortex/pkg/cortex.(*Cortex).Run
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/cortex.go:461 +0x745

goroutine 353 [select, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).awaitState(0xc00047b540, 0x317d5d8, 0xc000a101c0, 0x4, 0xc00015a660, 0xc000a07500, 0xc0001231b0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:294 +0x98
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AwaitTerminated(0xc00047b540, 0x317d5d8, 0xc000a101c0, 0x0, 0xc00047b5e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:290 +0x51
github.com/cortexproject/cortex/pkg/util.(*moduleService).run(0xc000bf3480, 0x317d5d8, 0xc000a101c0, 0xc00047b5f8, 0x2dd10e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/module_service.go:71 +0x48
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc00047b5e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 354 [select, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).awaitState(0xc000258d20, 0x317d5d8, 0xc000a10200, 0x4, 0xc000025ec0, 0xc000602601, 0xc000a84e88)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:294 +0x98
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AwaitTerminated(0xc000258d20, 0x317d5d8, 0xc000a10200, 0xc000a07560, 0xc0001231b8)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:290 +0x51
github.com/cortexproject/cortex/pkg/util.(*moduleService).run(0xc0009d7e00, 0x317d5d8, 0xc000a10200, 0xc0002592d8, 0x2dd10e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/module_service.go:71 +0x48
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc0002592c0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 355 [select, 1137 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).awaitState(0xc000259360, 0x317d5d8, 0xc000a10240, 0x4, 0xc000054180, 0xc000b04e01, 0xc000b04e88)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:294 +0x98
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AwaitTerminated(0xc000259360, 0x317d5d8, 0xc000a10240, 0xc000a075c0, 0xc0001231c0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:290 +0x51
github.com/cortexproject/cortex/pkg/util.(*moduleService).run(0xc000a10100, 0x317d5d8, 0xc000a10240, 0xc0002595f8, 0x2dd10e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/module_service.go:71 +0x48
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc0002595e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 356 [select, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).awaitState(0xc00047a3c0, 0x317d5d8, 0xc000a10280, 0x4, 0xc00015a240, 0xc000602601, 0xc000a88e88)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:294 +0x98
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AwaitTerminated(0xc00047a3c0, 0x317d5d8, 0xc000a10280, 0xc000a07620, 0xc0001231c8)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:290 +0x51
github.com/cortexproject/cortex/pkg/util.(*moduleService).run(0xc000bf2a40, 0x317d5d8, 0xc000a10280, 0xc00047a518, 0x2dd10e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/module_service.go:71 +0x48
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc00047a500)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 357 [select, 4 minutes]:
github.com/grafana/backend-enterprise/pkg/cloud/bi.(*DistributorWriteEventProcessor).run(0xc000790070, 0x317d5d8, 0xc000a10300, 0x0, 0x0)
	/backend-enterprise/pkg/cloud/bi/wrapper.go:109 +0x14f
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc00047a3c0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 369 [syscall, 1139 minutes]:
os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:168 +0xa5
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
	/usr/local/go/src/os/signal/signal.go:151 +0x45

goroutine 303 [chan receive, 1139 minutes]:
github.com/weaveworks/common/server.(*Server).Run(0xc00013c2c0, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/weaveworks/common/server/server.go:424 +0x15b
github.com/cortexproject/cortex/pkg/cortex.NewServerService.func1.1(0xc000a1f0e0, 0xc00013c2c0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/server_service.go:25 +0x57
created by github.com/cortexproject/cortex/pkg/cortex.NewServerService.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/server_service.go:23 +0x5e

goroutine 304 [chan receive, 1139 minutes]:
github.com/cortexproject/cortex/pkg/cortex.ignoreSignalHandler.Loop(0xc00015a480)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/cortex/server_service.go:65 +0x34
github.com/weaveworks/common/server.(*Server).Run.func1(0xc00013c2c0, 0xc00029ef00)
	/backend-enterprise/vendor/github.com/weaveworks/common/server/server.go:384 +0x3a
created by github.com/weaveworks/common/server.(*Server).Run
	/backend-enterprise/vendor/github.com/weaveworks/common/server/server.go:383 +0x6b

goroutine 385 [IO wait]:
internal/poll.runtime_pollWait(0x7fb60e3a8588, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc000294518, 0x72, 0x0, 0x0, 0x2c1b3a3)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Accept(0xc000294500, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:401 +0x212
net.(*netFD).accept(0xc000294500, 0x203000, 0x7fb60d06ec70, 0x0)
	/usr/local/go/src/net/fd_unix.go:172 +0x45
net.(*TCPListener).accept(0xc000798828, 0x7eebe452, 0xf10e4d4d8f7f128b, 0x0)
	/usr/local/go/src/net/tcpsock_posix.go:139 +0x32
net.(*TCPListener).Accept(0xc000798828, 0x1aa845c700f027b0, 0x60f90b32, 0xc0005a4df0, 0x4e5766)
	/usr/local/go/src/net/tcpsock.go:261 +0x65
github.com/weaveworks/common/middleware.(*countingListener).Accept(0xc0000db320, 0xc0005a4e40, 0x18, 0xc000ae8300, 0x71973b)
	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/counting_listener.go:22 +0x37
net/http.(*Server).Serve(0xc000210460, 0x316a0e0, 0xc0000db320, 0x0, 0x0)
	/usr/local/go/src/net/http/server.go:2981 +0x285
github.com/weaveworks/common/server.(*Server).Run.func2(0xc00013c2c0, 0xc00029ef00)
	/backend-enterprise/vendor/github.com/weaveworks/common/server/server.go:394 +0x126
created by github.com/weaveworks/common/server.(*Server).Run
	/backend-enterprise/vendor/github.com/weaveworks/common/server/server.go:391 +0x97

goroutine 386 [IO wait, 1139 minutes]:
internal/poll.runtime_pollWait(0x7fb60e3a84a0, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc000294598, 0x72, 0x0, 0x0, 0x2c1b3a3)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Accept(0xc000294580, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:401 +0x212
net.(*netFD).accept(0xc000294580, 0x41521b, 0xc00051c8c8, 0xc000a90710)
	/usr/local/go/src/net/fd_unix.go:172 +0x45
net.(*TCPListener).accept(0xc000798858, 0x285b580, 0xc00051c8c8, 0xc000a90710)
	/usr/local/go/src/net/tcpsock_posix.go:139 +0x32
net.(*TCPListener).Accept(0xc000798858, 0xc000a90658, 0x40e2f8, 0xc00051c8c0, 0xc00051c948)
	/usr/local/go/src/net/tcpsock.go:261 +0x65
github.com/weaveworks/common/middleware.(*countingListener).Accept(0xc0000db340, 0xc000a0ab70, 0xc000a90710, 0xc00051c948, 0x0)
	/backend-enterprise/vendor/github.com/weaveworks/common/middleware/counting_listener.go:22 +0x37
google.golang.org/grpc.(*Server).Serve(0xc000318c40, 0x316a0e0, 0xc0000db340, 0x0, 0x0)
	/backend-enterprise/vendor/google.golang.org/grpc/server.go:732 +0x27f
github.com/weaveworks/common/server.(*Server).Run.func3(0xc00013c2c0, 0xc00029ef00)
	/backend-enterprise/vendor/github.com/weaveworks/common/server/server.go:413 +0x4e
created by github.com/weaveworks/common/server.(*Server).Run
	/backend-enterprise/vendor/github.com/weaveworks/common/server/server.go:412 +0x13b

goroutine 358 [select, 1139 minutes]:
github.com/cortexproject/cortex/pkg/ring/kv/memberlist.(*KVInitService).running(0xc000a02a10, 0x317d5d8, 0xc000a10380, 0xc000258d38, 0x2dd10e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ring/kv/memberlist/kv_init_service.go:74 +0xa9
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc000258d20)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 359 [runnable]:
runtime.CallersFrames(...)
	/usr/local/go/src/runtime/symtab.go:66
runtime.Caller(0x3, 0x7888bd, 0x2963540, 0xc0005db998, 0x2963540, 0xc0008f3a10)
	/usr/local/go/src/runtime/extern.go:205 +0xa5
github.com/go-kit/kit/log.Caller.func1(0x2963540, 0xc0008f3a10)
	/backend-enterprise/vendor/github.com/go-kit/kit/log/value.go:86 +0x2e
github.com/go-kit/kit/log.bindValues(0xc0016ddd00, 0x8, 0x10)
	/backend-enterprise/vendor/github.com/go-kit/kit/log/value.go:20 +0x79
github.com/go-kit/kit/log.(*context).Log(0xc0008f39e0, 0xc018f39e00, 0x4, 0x4, 0x2, 0x3122de0)
	/backend-enterprise/vendor/github.com/go-kit/kit/log/log.go:122 +0x1cb
github.com/cortexproject/cortex/pkg/compactor.(*Compactor).compactUsers(0xc000789500, 0x317d5d8, 0xc000a10440)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/compactor/compactor.go:514 +0x1bbd
github.com/cortexproject/cortex/pkg/compactor.(*Compactor).running(0xc000789500, 0x317d5d8, 0xc000a10440, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/compactor/compactor.go:451 +0xf9
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc000259360)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 424 [chan receive, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener.func1(0xc00082b1a0, 0x3184cf8, 0xc00000dcb0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:344 +0x66
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:343 +0x11d

goroutine 425 [chan receive, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener.func1(0xc00082b200, 0x3184cf8, 0xc00000dcc8)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:344 +0x66
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).AddListener
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:343 +0x11d

goroutine 426 [chan receive, 1139 minutes]:
github.com/cortexproject/cortex/pkg/util/services.(*Manager).AddListener.func1(0xc00082b260, 0x3165040, 0xc00000dce0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/manager.go:244 +0x66
created by github.com/cortexproject/cortex/pkg/util/services.(*Manager).AddListener
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/manager.go:243 +0xfd

goroutine 427 [select]:
github.com/cortexproject/cortex/pkg/ring.(*Lifecycler).loop(0xc000232700, 0x317d5d8, 0xc000a11b80, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ring/lifecycler.go:400 +0x1fc
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc000259d60)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 428 [select]:
net/http.(*persistConn).roundTrip(0xc00077d9e0, 0xc001582300, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/transport.go:2610 +0x765
net/http.(*Transport).roundTrip(0xc00004cf00, 0xc001e1a500, 0x2bb30e0, 0x2af5101, 0xc001e1a500)
	/usr/local/go/src/net/http/transport.go:592 +0xacb
net/http.(*Transport).RoundTrip(0xc00004cf00, 0xc001e1a500, 0xc00004cf00, 0xc03660b24dbfe89b, 0x3e3a181c7c9b)
	/usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc001e1a200, 0x3129dc0, 0xc00004cf00, 0xc03660b24dbfe89b, 0x3e3a181c7c9b, 0x4455960, 0xc00084a320, 0xc03660b24dbfe89b, 0x1, 0x0)
	/usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc000594c90, 0xc001e1a200, 0xc03660b24dbfe89b, 0x3e3a181c7c9b, 0x4455960, 0xc00084a320, 0x0, 0x1, 0x2af5101)
	/usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc000594c90, 0xc001e1a200, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
	/usr/local/go/src/net/http/client.go:585
github.com/hashicorp/consul/api.(*Client).doRequest(0xc00077d0e0, 0xc00248a240, 0xc000b17638, 0xc0011ee340, 0x10, 0xc00248a240)
	/backend-enterprise/vendor/github.com/hashicorp/consul/api/api.go:880 +0xbe
github.com/hashicorp/consul/api.(*KV).getInternal(0xc000123328, 0x2c1fe00, 0x9, 0x0, 0xc002a400b0, 0x7fb635070108, 0xb0, 0xc002a400b0, 0x3bd17804f03077ad)
	/backend-enterprise/vendor/github.com/hashicorp/consul/api/kv.go:131 +0x2b4
github.com/hashicorp/consul/api.(*KV).Get(0xc000123328, 0x2c1fe00, 0x9, 0xc002a400b0, 0x0, 0x0, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/hashicorp/consul/api/kv.go:65 +0xa5
github.com/cortexproject/cortex/pkg/ring/kv/consul.consulMetrics.Get.func1(0x317d680, 0xc001e8e450, 0x3, 0xc03660ad4dbfa161)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ring/kv/consul/metrics.go:44 +0x133
github.com/weaveworks/common/instrument.CollectedRequest(0x317d680, 0xc000aea9f0, 0x2c14d92, 0x3, 0x316a080, 0xc00028c030, 0x2dce458, 0xc000b17988, 0xc000b179f0, 0x40db9b)
	/backend-enterprise/vendor/github.com/weaveworks/common/instrument/instrument.go:152 +0x271
github.com/cortexproject/cortex/pkg/ring/kv/consul.consulMetrics.Get(0x3185138, 0xc000123328, 0x2c1fe00, 0x9, 0xc002a400b0, 0xa8, 0x2b9e900, 0x1, 0xc002a40000)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ring/kv/consul/metrics.go:41 +0x145
github.com/cortexproject/cortex/pkg/ring/kv/consul.(*Client).WatchKey(0xc000191a40, 0x317d680, 0xc000aea9f0, 0x2c1fe00, 0x9, 0xc0003419a0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ring/kv/consul/client.go:229 +0x2d6
github.com/cortexproject/cortex/pkg/ring/kv.metrics.WatchKey.func1(0x317d680, 0xc000aea9f0, 0x8, 0xc0361de186a4239a)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ring/kv/metrics.go:99 +0x5e
github.com/weaveworks/common/instrument.CollectedRequest(0x317d5d8, 0xc000a11bc0, 0x2c1d81f, 0x8, 0x316a080, 0xc000123340, 0x2dce458, 0xc000b17df0, 0x10, 0x28752c0)
	/backend-enterprise/vendor/github.com/weaveworks/common/instrument/instrument.go:152 +0x271
github.com/cortexproject/cortex/pkg/ring/kv.metrics.WatchKey(0x3199d90, 0xc000191a40, 0xc000123340, 0x317d5d8, 0xc000a11bc0, 0x2c1fe00, 0x9, 0xc0003419a0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ring/kv/metrics.go:98 +0x105
github.com/cortexproject/cortex/pkg/ring.(*Ring).loop(0xc000485680, 0x317d5d8, 0xc000a11bc0, 0xc000259e18, 0x2dd10e0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/ring/ring.go:284 +0x95
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc000259e00)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 396 [IO wait]:
internal/poll.runtime_pollWait(0x7fb60e3a83b8, 0x72, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc000295998, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc000295980, 0xc00097f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc000295980, 0xc00097f000, 0x1000, 0x1000, 0x43b37c, 0xc000b07c38, 0x46a040)
	/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc00000f128, 0xc00097f000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:183 +0x91
net/http.(*persistConn).Read(0xc000a0c7e0, 0xc00097f000, 0x1000, 0x1000, 0xc000a9c540, 0xc000b07d40, 0x405755)
	/usr/local/go/src/net/http/transport.go:1922 +0x77
bufio.(*Reader).fill(0xc00029fe00)
	/usr/local/go/src/bufio/bufio.go:101 +0x108
bufio.(*Reader).Peek(0xc00029fe00, 0x1, 0x0, 0x1, 0x4, 0x1, 0x3)
	/usr/local/go/src/bufio/bufio.go:139 +0x4f
net/http.(*persistConn).readLoop(0xc000a0c7e0)
	/usr/local/go/src/net/http/transport.go:2083 +0x1a8
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1743 +0xc77

goroutine 374 [select]:
net/http.(*persistConn).writeLoop(0xc00077d9e0)
	/usr/local/go/src/net/http/transport.go:2382 +0xf7
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1744 +0xc9c

goroutine 502 [select]:
net/http.(*http2ClientConn).roundTrip(0xc004986300, 0xc001447700, 0x0, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/h2_bundle.go:7668 +0x9c5
net/http.(*http2Transport).RoundTripOpt(0xc00014a4d0, 0xc001447700, 0x27a2300, 0xc000aebce0, 0xc000940d20, 0x5)
	/usr/local/go/src/net/http/h2_bundle.go:6981 +0x1a5
net/http.(*http2Transport).RoundTrip(...)
	/usr/local/go/src/net/http/h2_bundle.go:6942
net/http.http2noDialH2RoundTripper.RoundTrip(0xc00014a4d0, 0xc001447700, 0x312d4e0, 0xc00014a4d0, 0x0)
	/usr/local/go/src/net/http/h2_bundle.go:9197 +0x3e
net/http.(*Transport).roundTrip(0xc0004b1cc0, 0xc001447700, 0x26b76e0, 0xc000e10001, 0xc000e10160)
	/usr/local/go/src/net/http/transport.go:537 +0xdec
net/http.(*Transport).RoundTrip(0xc0004b1cc0, 0xc001447700, 0x2c21f96, 0xa, 0xc000e10200)
	/usr/local/go/src/net/http/roundtrip.go:17 +0x35
google.golang.org/api/transport/http.(*parameterTransport).RoundTrip(0xc000a104c0, 0xc001447500, 0x5, 0x5, 0x0)
	/backend-enterprise/vendor/google.golang.org/api/transport/http/dial.go:147 +0x248
go.opencensus.io/plugin/ochttp.(*traceTransport).RoundTrip(0xc00310b040, 0xc001447500, 0xc0021b8480, 0x1, 0x1)
	/backend-enterprise/vendor/go.opencensus.io/plugin/ochttp/trace.go:84 +0x47c
go.opencensus.io/plugin/ochttp.statsTransport.RoundTrip(0x3126720, 0xc00310b040, 0xc001447300, 0xc001122070, 0xc001122070, 0x29006e0)
	/backend-enterprise/vendor/go.opencensus.io/plugin/ochttp/client_stats.go:57 +0x5f7
go.opencensus.io/plugin/ochttp.(*Transport).RoundTrip(0xc000785400, 0xc001447300, 0x0, 0x0, 0xc0017d4ac8)
	/backend-enterprise/vendor/go.opencensus.io/plugin/ochttp/client.go:99 +0x206
golang.org/x/oauth2.(*Transport).RoundTrip(0xc000510cc0, 0xc001447200, 0x0, 0x0, 0x0)
	/backend-enterprise/vendor/golang.org/x/oauth2/transport.go:55 +0x15d
net/http.send(0xc001447200, 0x31269c0, 0xc000510cc0, 0x0, 0x0, 0x0, 0xc0009bc898, 0x1e, 0x1, 0x0)
	/usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc000aed6b0, 0xc001447200, 0x0, 0x0, 0x0, 0xc0009bc898, 0x0, 0x1, 0x0)
	/usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc000aed6b0, 0xc001447200, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
	/usr/local/go/src/net/http/client.go:585
google.golang.org/api/internal/gensupport.send(0x317d5d8, 0xc000accdc0, 0xc000aed6b0, 0xc001447000, 0x448bca0, 0x0, 0x0)
	/backend-enterprise/vendor/google.golang.org/api/internal/gensupport/send.go:35 +0x10f
google.golang.org/api/internal/gensupport.SendRequest(0x317d5d8, 0xc000accdc0, 0xc000aed6b0, 0xc001447000, 0xc0017d5188, 0x9f, 0x0)
	/backend-enterprise/vendor/google.golang.org/api/internal/gensupport/send.go:28 +0x90
google.golang.org/api/storage/v1.(*ObjectsListCall).doRequest(0xc0017d5628, 0x2c1616f, 0x4, 0x0, 0x5, 0xc0017d5290)
	/backend-enterprise/vendor/google.golang.org/api/storage/v1/storage-gen.go:10474 +0x789
google.golang.org/api/storage/v1.(*ObjectsListCall).Do(0xc0017d5628, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/backend-enterprise/vendor/google.golang.org/api/storage/v1/storage-gen.go:10486 +0xa5
cloud.google.com/go/storage.(*ObjectIterator).fetch.func1(0x7fb60e402978, 0x82)
	/backend-enterprise/vendor/cloud.google.com/go/storage/bucket.go:1168 +0x71
cloud.google.com/go/storage.runWithRetry.func1(0x7fb635072688, 0xc0017d5305, 0xc008efc820)
	/backend-enterprise/vendor/cloud.google.com/go/storage/invoke.go:28 +0x2a
cloud.google.com/go/internal.retry(0x317d5d8, 0xc000accdc0, 0x0, 0x0, 0x0, 0x0, 0xc0017d5410, 0x2dcc558, 0x10, 0x26b76e0)
	/backend-enterprise/vendor/cloud.google.com/go/internal/retry.go:38 +0x56
cloud.google.com/go/internal.Retry(...)
	/backend-enterprise/vendor/cloud.google.com/go/internal/retry.go:31
cloud.google.com/go/storage.runWithRetry(0x317d5d8, 0xc000accdc0, 0xc0017d5600, 0x9, 0xc000b6f9a8)
	/backend-enterprise/vendor/cloud.google.com/go/storage/invoke.go:27 +0x87
cloud.google.com/go/storage.(*ObjectIterator).fetch(0xc00144c680, 0x0, 0xc008efc520, 0xc, 0xc008efc520, 0xc, 0x0, 0x0)
	/backend-enterprise/vendor/cloud.google.com/go/storage/bucket.go:1167 +0x6d6
google.golang.org/api/iterator.(*PageInfo).fill(0xc0025da9b0, 0x0, 0x1dc3c88, 0xc008efc7c5)
	/backend-enterprise/vendor/google.golang.org/api/iterator/iterator.go:139 +0x49
google.golang.org/api/iterator.(*PageInfo).next(0xc0025da9b0, 0xc0017d5790, 0x406ffa)
	/backend-enterprise/vendor/google.golang.org/api/iterator/iterator.go:118 +0xb4
cloud.google.com/go/storage.(*ObjectIterator).Next(0xc00144c680, 0xc0005c0e40, 0x0, 0x0)
	/backend-enterprise/vendor/cloud.google.com/go/storage/bucket.go:1140 +0x2f
github.com/thanos-io/thanos/pkg/objstore/gcs.(*Bucket).Iter(0xc000a10600, 0x317d5d8, 0xc000accdc0, 0x0, 0x0, 0xc000859c80, 0x0, 0x0, 0x0, 0x20, ...)
	/backend-enterprise/vendor/github.com/thanos-io/thanos/pkg/objstore/gcs/gcs.go:117 +0x165
github.com/thanos-io/thanos/pkg/objstore.(*metricBucket).Iter(0xc000a10640, 0x317d5d8, 0xc000accdc0, 0x0, 0x0, 0xc000859c80, 0x0, 0x0, 0x0, 0x44324a, ...)
	/backend-enterprise/vendor/github.com/thanos-io/thanos/pkg/objstore/objstore.go:368 +0x103
github.com/thanos-io/thanos/pkg/objstore.TracingBucket.Iter.func1(0x317d5d8, 0xc000accdc0, 0x31b1c80, 0x4488e60)
	/backend-enterprise/vendor/github.com/thanos-io/thanos/pkg/objstore/tracing.go:27 +0x185
github.com/thanos-io/thanos/pkg/tracing.DoWithSpan(0x317d5d8, 0xc000accdc0, 0x2c248d3, 0xb, 0xc00053da28, 0x0, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/thanos-io/thanos/pkg/tracing/tracing.go:80 +0xde
github.com/thanos-io/thanos/pkg/objstore.TracingBucket.Iter(0x31ac228, 0xc000a10640, 0x317d5d8, 0xc000accdc0, 0x0, 0x0, 0xc000859c80, 0x0, 0x0, 0x0, ...)
	/backend-enterprise/vendor/github.com/thanos-io/thanos/pkg/objstore/tracing.go:25 +0x128
github.com/cortexproject/cortex/pkg/storage/tsdb/bucketindex.(*globalMarkersBucket).Iter(0xc000c01b40, 0x317d5d8, 0xc000accdc0, 0x0, 0x0, 0xc000859c80, 0x0, 0x0, 0x0, 0x319a600, ...)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/storage/tsdb/bucketindex/markers_bucket_client.go:85 +0x99
github.com/cortexproject/cortex/pkg/storage/tsdb.(*UsersScanner).ScanUsers(0xc0000201e0, 0x317d5d8, 0xc000accdc0, 0x1, 0x1, 0x319a600, 0xc00082b500, 0x4, 0x1, 0x2c1612f, ...)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/storage/tsdb/users_scanner.go:37 +0xe2
github.com/cortexproject/cortex/pkg/compactor.(*BlocksCleaner).cleanUsers(0xc00077c6c0, 0x317d5d8, 0xc000accdc0, 0x0, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/compactor/blocks_cleaner.go:156 +0x5e
github.com/cortexproject/cortex/pkg/compactor.(*BlocksCleaner).runCleanup(0xc00077c6c0, 0x317d5d8, 0xc000accdc0, 0xc000e19e00)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/compactor/blocks_cleaner.go:142 +0x18a
github.com/cortexproject/cortex/pkg/compactor.(*BlocksCleaner).ticker(...)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/compactor/blocks_cleaner.go:133
github.com/cortexproject/cortex/pkg/util/services.NewTimerService.func1(0x317d5d8, 0xc000accdc0, 0x0, 0x0)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/services.go:33 +0x13c
github.com/cortexproject/cortex/pkg/util/services.(*BasicService).main(0xc000259b80)
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:190 +0x402
created by github.com/cortexproject/cortex/pkg/util/services.(*BasicService).StartAsync.func1
	/backend-enterprise/vendor/github.com/cortexproject/cortex/pkg/util/services/basic_service.go:119 +0xb3

goroutine 373 [IO wait]:
internal/poll.runtime_pollWait(0x7fb60e3a82d0, 0x72, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc000394a18, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc000394a00, 0xc0005c6000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc000394a00, 0xc0005c6000, 0x1000, 0x1000, 0x43b37c, 0xc0005a5c38, 0x46a040)
	/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc00028c5a8, 0xc0005c6000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:183 +0x91
net/http.(*persistConn).Read(0xc00077d9e0, 0xc0005c6000, 0x1000, 0x1000, 0xc0005c0360, 0xc0005a5d40, 0x405755)
	/usr/local/go/src/net/http/transport.go:1922 +0x77
bufio.(*Reader).fill(0xc000119aa0)
	/usr/local/go/src/bufio/bufio.go:101 +0x108
bufio.(*Reader).Peek(0xc000119aa0, 0x1, 0x0, 0x1, 0x4, 0x1, 0x3)
	/usr/local/go/src/bufio/bufio.go:139 +0x4f
net/http.(*persistConn).readLoop(0xc00077d9e0)
	/usr/local/go/src/net/http/transport.go:2083 +0x1a8
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1743 +0xc77

goroutine 397 [select]:
net/http.(*persistConn).writeLoop(0xc000a0c7e0)
	/usr/local/go/src/net/http/transport.go:2382 +0xf7
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1744 +0xc9c

goroutine 127542 [IO wait]:
internal/poll.runtime_pollWait(0x7fb60e3a8018, 0x72, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc0021e4598, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0021e4580, 0xc0010b4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc0021e4580, 0xc0010b4000, 0x1000, 0x1000, 0x0, 0x0, 0x7fb60e3a8020)
	/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0021fc088, 0xc0010b4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:183 +0x91
net/http.(*connReader).Read(0xc0021e6e70, 0xc0010b4000, 0x1000, 0x1000, 0xc0017d1b90, 0xd1ad3a, 0xc0021fc088)
	/usr/local/go/src/net/http/server.go:800 +0x1b9
bufio.(*Reader).fill(0xc000119b00)
	/usr/local/go/src/bufio/bufio.go:101 +0x108
bufio.(*Reader).Peek(0xc000119b00, 0x4, 0x3e4e4faea4fa, 0x4455960, 0x0, 0x0, 0x4455960)
	/usr/local/go/src/bufio/bufio.go:139 +0x4f
net/http.(*conn).serve(0xc0021b6c80, 0x317d680, 0xc0010b0200)
	/usr/local/go/src/net/http/server.go:1977 +0xa47
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3013 +0x39b

goroutine 13037825 [IO wait]:
internal/poll.runtime_pollWait(0x7fb60ce90990, 0x72, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc001b93a18, 0x72, 0x6400, 0x6458, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc001b93a00, 0xc002467500, 0x6458, 0x6458, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc001b93a00, 0xc002467500, 0x6458, 0x6458, 0x6453, 0xc000b02860, 0x53aef5)
	/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0001230f0, 0xc002467500, 0x6458, 0x6458, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:183 +0x91
crypto/tls.(*atLeastReader).Read(0xc0010de018, 0xc002467500, 0x6458, 0x6458, 0x200000003, 0xc0014ae000, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:776 +0x63
bytes.(*Buffer).ReadFrom(0xc0020365f8, 0x3120da0, 0xc0010de018, 0x40b925, 0x2802a60, 0x2b53600)
	/usr/local/go/src/bytes/buffer.go:204 +0xbe
crypto/tls.(*Conn).readFromUntil(0xc002036380, 0x3129cc0, 0xc0001230f0, 0x5, 0xc0001230f0, 0x27)
	/usr/local/go/src/crypto/tls/conn.go:798 +0xf3
crypto/tls.(*Conn).readRecordOrCCS(0xc002036380, 0x0, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:605 +0x115
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:573
crypto/tls.(*Conn).Read(0xc002036380, 0xc000b66000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:1276 +0x165
bufio.(*Reader).Read(0xc004aa0ea0, 0xc000210738, 0x9, 0x9, 0x11, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:227 +0x222
io.ReadAtLeast(0x3120b40, 0xc004aa0ea0, 0xc000210738, 0x9, 0x9, 0x9, 0xc000b02d08, 0x59e69b, 0xc000776a50)
	/usr/local/go/src/io/io.go:328 +0x87
io.ReadFull(...)
	/usr/local/go/src/io/io.go:347
net/http.http2readFrameHeader(0xc000210738, 0x9, 0x9, 0x3120b40, 0xc004aa0ea0, 0x0, 0x0, 0xc004986468, 0x0)
	/usr/local/go/src/net/http/h2_bundle.go:1477 +0x89
net/http.(*http2Framer).ReadFrame(0xc000210700, 0xc000a16000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/http/h2_bundle.go:1735 +0xa5
net/http.(*http2clientConnReadLoop).run(0xc000b02fa8, 0x0, 0x0)
	/usr/local/go/src/net/http/h2_bundle.go:8322 +0xd8
net/http.(*http2ClientConn).readLoop(0xc004986300)
	/usr/local/go/src/net/http/h2_bundle.go:8244 +0x6f
created by net/http.(*http2Transport).newClientConn
	/usr/local/go/src/net/http/h2_bundle.go:7208 +0x6c5

goroutine 13457454 [chan send]:
github.com/prometheus/client_golang/prometheus.(*metricMap).Collect(0xc000021740, 0xc003a4fec0)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/vec.go:312 +0x111
github.com/prometheus/client_golang/prometheus.(*MetricVec).Collect(0xc000021140, 0xc003a4fec0)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/vec.go:109 +0x38
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/registry.go:446 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/registry.go:457 +0x5ce

goroutine 13457453 [IO wait]:
internal/poll.runtime_pollWait(0x7fb60e3a7f30, 0x72, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc001a35b98, 0x72, 0x0, 0x1, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc001a35b80, 0xc001d54e81, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc001a35b80, 0xc001d54e81, 0x1, 0x1, 0xc00172ec00, 0xc00003a000, 0x28029e0)
	/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc0009bdfa8, 0xc001d54e81, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:183 +0x91
net/http.(*connReader).backgroundRead(0xc001d54e70)
	/usr/local/go/src/net/http/server.go:692 +0x58
created by net/http.(*connReader).startBackgroundRead
	/usr/local/go/src/net/http/server.go:688 +0xd5

goroutine 1146783 [select]:
net/http.(*persistConn).writeLoop(0xc000a0c480)
	/usr/local/go/src/net/http/transport.go:2382 +0xf7
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1744 +0xc9c

goroutine 13457455 [semacquire]:
sync.runtime_Semacquire(0xc000513558)
	/usr/local/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc000513550)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func2(0xc000513550, 0xc003a4fec0, 0xc003a4ff20)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/registry.go:463 +0x2b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/registry.go:462 +0x60d

goroutine 1146782 [IO wait]:
internal/poll.runtime_pollWait(0x7fb60e3a81e8, 0x72, 0xffffffffffffffff)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc00005e198, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc00005e180, 0xc000da9000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc00005e180, 0xc000da9000, 0x1000, 0x1000, 0x43b37c, 0xc0005a7c38, 0x46a040)
	/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc00182c008, 0xc000da9000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:183 +0x91
net/http.(*persistConn).Read(0xc000a0c480, 0xc000da9000, 0x1000, 0x1000, 0xc000a9cea0, 0xc0005a7d40, 0x405755)
	/usr/local/go/src/net/http/transport.go:1922 +0x77
bufio.(*Reader).fill(0xc002186060)
	/usr/local/go/src/bufio/bufio.go:101 +0x108
bufio.(*Reader).Peek(0xc002186060, 0x1, 0x0, 0x1, 0x4, 0x1, 0x3)
	/usr/local/go/src/bufio/bufio.go:139 +0x4f
net/http.(*persistConn).readLoop(0xc000a0c480)
	/usr/local/go/src/net/http/transport.go:2083 +0x1a8
created by net/http.(*Transport).dialConn
	/usr/local/go/src/net/http/transport.go:1743 +0xc77

goroutine 13457456 [chan send]:
github.com/prometheus/client_golang/prometheus.(*metricMap).Collect(0xc000020ed0, 0xc003a4fec0)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/vec.go:312 +0x111
github.com/prometheus/client_golang/prometheus.(*MetricVec).Collect(0xc000020ea0, 0xc003a4fec0)
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/vec.go:109 +0x38
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/registry.go:446 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/backend-enterprise/vendor/github.com/prometheus/client_golang/prometheus/registry.go:538 +0xe4d

Add integration tests for compactor

The compactor is not covered by any integration test. We should cover at least basic usage with integration tests.

Submitted by: pracucci
Cortex Issue Number: 3096

Proposal: make bucket index mandatory for the blocks storage

Describe the solution you'd like
I would like to propose making the bucket index mandatory when running the blocks storage.

Cortex 1.7.0 is introducing support for the bucket index (doc), which is a per-tenant .json file containing the list of the tenant's blocks and deletion marks. The bucket index is periodically written by the compactor and read by the querier/ruler/store-gateway in order to have an almost up-to-date view of the bucket without having to scan it (e.g. running "list objects" API calls, which are slow and expensive).
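
For illustration only, a minimal Go sketch of what such a per-tenant index could contain; the type and field names here are assumptions for this example, not the exact Cortex definitions:

// Hypothetical shape of the per-tenant bucket-index.json written by the
// compactor and read by the querier/ruler/store-gateway.
type BucketIndex struct {
	Version            int            `json:"version"`
	Blocks             []Block        `json:"blocks"`
	BlockDeletionMarks []DeletionMark `json:"block_deletion_marks"`
	UpdatedAt          int64          `json:"updated_at"` // Unix timestamp of the last compactor update.
}

type Block struct {
	ID      string `json:"block_id"`
	MinTime int64  `json:"min_time"`
	MaxTime int64  `json:"max_time"`
}

type DeletionMark struct {
	ID           string `json:"block_id"`
	DeletionTime int64  `json:"deletion_time"`
}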

Making it mandatory would simplify the Cortex code by removing the need to keep a dual implementation (either scan the bucket in the querier/ruler/store-gateway or read the bucket index), and it would also offer a (hopefully) better default setup for Cortex users.

At Grafana Labs we've been running the bucket index in production for a month and haven't had any issues so far, while it has reduced the number of API calls to the object store and allowed us to scale up queriers with no waiting time (no initial bucket scanning required anymore).

Submitted by: pracucci
Cortex Issue Number: 3814

Flusher doesn't report flush problems via exit code.

Flusher is a Cortex component designed to flush local data (chunks WAL, blocks WAL and unshipped blocks) to long term storage.

Currently, when the Flusher finishes, it triggers a clean shutdown of Cortex. This happens even if there were errors during the flush; errors are only reported in the log.

It would be desirable to use the exit code as an indicator of whether the Flusher succeeded or not.
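
A minimal sketch of the desired behavior, assuming a hypothetical flushAllUsers helper standing in for the real flush logic:

package main

import (
	"errors"
	"log"
	"os"
)

// flushAllUsers is a hypothetical stand-in for the Flusher's real flush logic.
func flushAllUsers() error {
	return errors.New("example: flushing blocks for tenant 'tenant-1' failed")
}

func main() {
	if err := flushAllUsers(); err != nil {
		log.Printf("flush failed: %v", err)
		os.Exit(1) // Non-zero exit code lets operators and tooling detect the failed flush.
	}
	// Exit code 0 only when all data was flushed successfully.
}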

Submitted by: pstibrany
Cortex Issue Number: 2879

Count active metrics that match given query within each tenant

Is your feature request related to a problem? Please describe.
We need a gauge metric that provides the number of active series within each tenant that match a pre-configured query.
I.e., we want to know how many active metrics match {foo="bar",...}

Describe the solution you'd like
We want to add a configuration to the Ingester, called active_matching_series, which is a slice of structs:

type ActiveMatchingSeriesConfig struct {
	Name    string `yaml:"name"`
	Matcher string `yaml:"matcher"`
}

The same configuration can be provided by using multiple ingester.active-matching-series flags, each one with a value of the form <name>:<matcher>, for example: --ingester.active-matching-series='foobar:{foo="bar"}'

A new gauge would be defined in ingesterMetrics as:

		// Not registered automatically, but only if activeSeriesEnabled is true.
		activeMatchingSeriesPerUser: prometheus.NewGaugeVec(prometheus.GaugeOpts{
			Name: "cortex_ingester_active_matching_series",
			Help: "Number of currently active series matching a pre-configured matcher per user.",
		}, []string{"user", "matcher"}),

When the ingester is instantiated, we'll instantiate an implementation of this interface:

type ActiveSeriesMatcher interface {
	// Matches provides a slice of bools indicating whether the provided labels.Labels match each one of the matchers.
	// The length of the returned slice is the same as the length of the Labels() slice, and matchers are applied in the same order.
	Matches(labels.Labels) []bool
	// Labels provides the values of the `matcher` label for each matcher.
	Labels() []string
}
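
A minimal sketch of an implementation backed by Prometheus label matchers could look like the following; the asMatchers type is hypothetical and would be built from the parsed active_matching_series configuration:

import "github.com/prometheus/prometheus/pkg/labels"

// asMatchers is a hypothetical ActiveSeriesMatcher implementation.
type asMatchers struct {
	names    []string            // Value of the `matcher` label for each configured matcher.
	matchers [][]*labels.Matcher // Parsed matchers, in the same order as names.
}

func (m *asMatchers) Matches(lbls labels.Labels) []bool {
	out := make([]bool, len(m.matchers))
	for i, ms := range m.matchers {
		matched := true
		for _, matcher := range ms {
			if !matcher.Matches(lbls.Get(matcher.Name)) {
				matched = false
				break
			}
		}
		out[i] = matched
	}
	return out
}

func (m *asMatchers) Labels() []string {
	return m.names
}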

We would modify the activeSeriesStripe to contain a slice of counters, one for each of the matchers:

// activeSeriesStripe holds a subset of the series timestamps for a single tenant.
type activeSeriesStripe struct {
	// ...
	active         int   // Number of active entries in this stripe. Only decreased during purge or clear.
	activeMatching []int // Number of active entries matching the pre-configured matchers in this stripe. Only decreased during purge or clear. Has the same length as the number of matchers provided.
}

And we'll add a matches []bool field to the activeSeriesEntry struct, which is the most critical part as this is where we'll consume memory. Given n the number of matchers, we would consume 8*3 = 24 bytes (slice pointer, length, and capacity) plus n more bytes per series, plus the slice-pointer pressure on the garbage collector. With 1M series per ingester, this would be 24MB per ingester plus one megabyte per matcher provided, which isn't a big deal.

This slice would be filled once, when creating the series, using the previous interface, as ActiveSeries would hold an instance of it. We'll use this bool mask to increase the counters in activeSeriesStripe.activeMatching just like we do with active; that's O(n*m) with n series and m matchers (see the sketch below).
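
As a rough sketch (the helper name is made up; activeSeriesStripe and its fields are the ones defined above):

// Hypothetical helper called when a series becomes active in a stripe: bump
// the per-matcher counters using the bool mask computed at series creation.
func (s *activeSeriesStripe) incrementActive(matches []bool) {
	s.active++
	for i, matched := range matches {
		if matched {
			s.activeMatching[i]++
		}
	}
}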

Then we'll modify the signature of func (c *ActiveSeries) Active() int to func (c *ActiveSeries) Active() (int, []int) so that it also returns the number of series matching each one of the matchers. Counting those would be O(m) with m the number of matchers, as each one of the stripes would provide its own count.

Describe alternatives you've considered

  • Just put metrics in different tenants: this is not how we want the product to work.
  • Use uint64 bitmask instead of []byte to save memory: the amount of memory saved would be negligible, but we'd be limited to just 64 matchers. We don't expect more than that, but since it's cheap to remove the limitation, we can just go with the bool slice.

Questions
Do we need to modify the userState? I don't have enough context on that one yet (maybe I'll see better once I start implementing this).

Test issue 2

Is your feature request related to a problem? Please describe.
We have a Cortex cluster with a large number of tenants (tens of thousands). Discovering tenants in the bucket is very slow. The bucket listing is done by every compactor, but we could reduce the load on the bucket by configuring bucket caching for the compactor as well.

Describe the solution you'd like
Bucket caching for the compactor was not allowed because it's generally not safe, but there are a few operations (like tenant discovery) for which it's expected to be safe, and I believe we should enable it for those.

Submitted by: pracucci
Cortex Issue Number: 4047

Compactor stalls when failing to compact a blocks group due to corrupted source blocks

We experienced the compactor stalling (not compacting any blocks) because of a corrupted source block. The compactor was continuously failing while trying to compact the affected blocks group:

level=error ts=2020-07-12T17:35:05.516823471Z caller=compactor.go:339 component=compactor msg="failed to compact user blocks" user=REDACTED err="compaction: group 0@6672437747845546250: block with not healthy index found /data/compact/0@6672437747845546250/REDACTED; Compaction level 1; Labels: map[__org_id__:REDACTED]: 1/1183085 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"

While investigating and fixing the root cause of the out-of-order chunks should be done, it's out of the scope of this issue. The compactor should ideally either skip the corrupted block and compact the other ones, or move on to compacting other non-overlapping blocks (e.g. other non-overlapping time ranges) if available; otherwise the compactor just stalls even if other work could be done.

Similarly to what we do with deletion-mark.json, we could also consider marking a block as corrupted (e.g. with a corruption-mark.json), automatically excluding blocks marked as corrupted from compaction, and alerting on it. An operator can investigate it offline and, if a repair tool is available, the compactor will compact the block once it's fixed and unmarked as corrupted.
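
A hedged sketch of what writing such a marker could look like using the Thanos objstore.Bucket interface; the CorruptionMark type, its fields, and the corruption-mark.json filename are assumptions taken from the idea above:

import (
	"bytes"
	"context"
	"encoding/json"
	"path"
	"time"

	"github.com/oklog/ulid"
	"github.com/thanos-io/thanos/pkg/objstore"
)

// CorruptionMark is a hypothetical marker, analogous to deletion-mark.json.
type CorruptionMark struct {
	ID       ulid.ULID `json:"id"`
	MarkedAt int64     `json:"marked_at"`
	Reason   string    `json:"reason"`
}

// markBlockCorrupted uploads <block>/corruption-mark.json so the compactor can
// skip the block until an operator repairs and unmarks it.
func markBlockCorrupted(ctx context.Context, bkt objstore.Bucket, blockID ulid.ULID, reason string) error {
	mark := CorruptionMark{ID: blockID, MarkedAt: time.Now().Unix(), Reason: reason}
	data, err := json.Marshal(mark)
	if err != nil {
		return err
	}
	return bkt.Upload(ctx, path.Join(blockID.String(), "corruption-mark.json"), bytes.NewReader(data))
}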

/cc @bwplotka @codesome @pstibrany

Submitted by: pracucci
Cortex Issue Number: 2866

Compacting TSDB head on every ingester shutdown to speed up startup

Describe the solution you'd like
The typical use case of an ingester shutdown is during a rolling update. We currently close the TSDB and, at the subsequent startup, replay the WAL before the ingester is ready. Replaying the WAL is slow, and we recently found out that compacting the head and shipping it to storage on /shutdown is actually faster than replaying the WAL.

Idea: what if we always compact the TSDB head and ship it to storage at shutdown?
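
For reference, a minimal sketch of compacting the whole head with the Prometheus TSDB API, assuming db is the open *tsdb.DB and that the existing shipper uploads the resulting block afterwards:

import "github.com/prometheus/prometheus/tsdb"

// compactHeadOnShutdown is a hypothetical helper: persist everything currently
// in the in-memory head as a block on disk, so the shipper can upload it
// before the ingester exits.
func compactHeadOnShutdown(db *tsdb.DB) error {
	h := db.Head()
	rh := tsdb.NewRangeHead(h, h.MinTime(), h.MaxTime())
	return db.CompactHead(rh)
}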

Question:

  • If we compact the TSDB head (up until the head max time) on shutdown, what's the last checkpoint created, and what's actually replayed from the WAL at startup?

Pros:

  • The ingester rollout may be faster
  • Scaling down wouldn't be a snowflake operation anymore (currently it requires calling the /shutdown API beforehand)

Cons (potential blockers):

  • At ingester startup, can the ingester ingest samples with a timestamp lower than the last samples ingested before shutting down?

Let's discuss it.

Submitted by: pracucci
Cortex Issue Number: 3723

Store-gateway blocks resharding during rollout

When running the blocks storage, store-gateways reshard blocks whenever the ring topology changes. This means that during a rollout of the store-gateways (i.e. deploying a config change or a version upgrade) blocks are resharded across instances.

This is highly inefficient in a cluster with a large number of tenants or a few very large tenants. Ideally, no blocks resharding should occur during a rollout (if the blocks replication factor is > 1).

Rollouts

We could improve the system to avoid the blocks resharding when the following conditions are met:

  • -experimental.store-gateway.replication-factor is > 1 (so that while a store-gateway restarts all of its blocks are replicated to at least another instance)
  • -experimental.store-gateway.tokens-file-path is configured (so that previous tokens are picked up on restart)
  • The store-gateway instance ID is stable across restarts (i.e. Kubernetes StatefulSets)

To avoid blocks resharding during a store-gateway rollout, we need the restarting store-gateway instance to not be unregistered from the ring during the restart.

When a store-gateway shuts down, the instance could be left in the LEAVING state within the ring, and we could change BlocksReplicationStrategy.ShouldExtendReplicaSet() to not extend the replica set if an instance is in the LEAVING state.
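
A hedged sketch of the proposed rule (the real method lives on BlocksReplicationStrategy and its exact signature may differ; ring.ACTIVE and ring.LEAVING are the existing ring states):

import "github.com/cortexproject/cortex/pkg/ring"

// shouldExtendReplicaSet sketches the proposed behavior: a LEAVING instance no
// longer causes the replica set to be extended, so a restarting store-gateway
// keeps owning its blocks and only N-1 replicas serve them while it restarts.
func shouldExtendReplicaSet(instance ring.IngesterDesc) bool {
	switch instance.State {
	case ring.ACTIVE, ring.LEAVING:
		return false // Keep the instance in the replica set.
	default:
		return true // PENDING/JOINING: extend to another instance.
	}
}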

This means that during the rollout, for the blocks held by the restarting replica there will be N-1 replicas (instead of the N configured replicas). Once the instance restarts, it will have the same instance ID and the same tokens (assuming tokens-file-path is configured) and thus will switch its state from LEAVING to JOINING within the ring.

Scale down

There's no way to distinguish between a rollout and a scale down: the process just receives a termination signal.

This means that during a scale down, the instance would be left in the LEAVING state within the ring. However, the store-gateway has an auto-forget feature which removes unhealthy instances after 10x heartbeat timeouts (default: 1m timeout = 10m before an unhealthy instance is forgotten).

A scale down of a number of instances smaller than the replication factor could leverage the auto-forget. However, there's no easy way to have a smooth scale down unless we have a way to signal the process whether it's shutting down because of a scale down or a rollout.

Crashes

In case a store-gateway crashes, there would be no difference compared to today.

Submitted by: pracucci
Cortex Issue Number: 2823

Ingester can end up with an incorrect TSDB state when it fails to remove all files while closing an idle TSDB.

Describe the bug
Under some conditions, the ingester can close a TSDB and delete its local files, e.g. when the TSDB has been idle for too long. After cleanup, the ingester removes the in-memory entry for the user from the i.TSDBState.dbs map.

However, if the ingester fails to delete the local files, it still deletes this entry. This can be problematic if the same user starts sending new samples: the ingester will try to reopen the existing TSDB, but since some files have been deleted, the TSDB will be broken, which can lead to weird errors.

Instead, we suggest keeping the entry in the i.TSDBState.dbs map if the files deletion fails, but in a closed state, or possibly some error state.
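
A minimal sketch of such a per-user state, assuming a hypothetical userTSDBState type (the real change would live in the ingester's TSDB bookkeeping):

// Hypothetical states for an entry kept in i.TSDBState.dbs instead of removing it.
type userTSDBState int

const (
	userTSDBActive userTSDBState = iota
	userTSDBClosed      // Closed and local files removed successfully.
	userTSDBCloseFailed // Closed, but deleting local files failed; don't silently reopen.
)

A push for that user could then be rejected (or trigger a retry of the cleanup) instead of reopening a partially deleted TSDB.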

Submitted by: pstibrany
Cortex Issue Number: 4082

Higher latency on write and read path while rolling out ingesters (-ingester.unregister-on-shutdown=false)

Describe the bug
We're running ingesters with -ingester.unregister-on-shutdown=false and -distributor.extend-writes=false. This means that, while rolling out ingesters one by one, the restarting ingester is left in the ring in the LEAVING state.

We've observed that, while rolling out ingesters with this configuration, the latency is significantly higher on both the write and read paths.

Expected behavior
Ideally, no impact on latency when rolling out ingesters.

Storage Engine

  • Blocks
  • Chunks

Submitted by: pracucci
Cortex Issue Number: 4035

Compactor block index size limited to 64GiB

Describe the bug
During compaction, if the block index to be written by the compactor would be larger than 64GiB, the process fails.

To Reproduce

  1. Run Mimir with a lot of metrics in a single tenant, like 1B
  2. Configure compaction to be 2h0m0s,12h0m0s,24h0m0s
  3. When the compaction eventually tries to compact a large block, the compactor will fail with an "exceeding max size of 64GiB" error

Expected behavior
The index file should be allowed to be written larger than 64GiB.

Environment:

  • Infrastructure: k8s
  • Deployment tool: jsonnet

Time series deletion API for block storage

Currently, Mimir only implements a time series deletion API for chunk storage. We would like to have the same functionality for blocks storage. Ideally, the API for deleting series should be the same as the one currently in Prometheus.

Motivation:

  • Confidential or accidental data might have been incorrectly pushed and needs to be removed.
  • GDPR regulations require data to be eventually deleted.

I am currently working on the design doc, and will link it soon.

Submitted by: ilangofman
Cortex Issue Number: 4267

Distributor rate-limit message should give global limit not local

Describe the bug
Example:

Jul 22 20:11:16 grafana-agent[1754]: ts=2021-07-22T20:11:16.386587507Z agent=prometheus [...] msg="non-recoverable error" count=284 exemplarCount=0 err="server returned HTTP status 429 Too Many Requests: ingestion rate limit (200) exceeded while adding 284 samples and 0 metadata"

The code

return nil, httpgrpc.Errorf(http.StatusTooManyRequests, "ingestion rate limit (%v) exceeded while adding %d samples and %d metadata", d.ingestionRateLimiter.Limit(now, userID), validatedSamples, len(validatedMetadata))
just fetches a number for the limit, and that number doesn't take into account the "strategy", which might have divided the global limit by the number of distributors to arrive at the local rate limit.

Expected behavior
I think it should quote the global limit, so the end-user sees a consistent message regardless of how many distributors are running.
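
A hedged sketch of the suggested change; the limits accessor shown here is an assumption standing in for wherever the per-tenant (global) limit is configured:

// Sketch: quote the configured per-tenant global limit in the 429 message
// instead of the per-distributor local limit computed by the strategy.
globalLimit := d.limits.IngestionRate(userID) // assumed accessor for the tenant's configured limit
return nil, httpgrpc.Errorf(http.StatusTooManyRequests,
	"ingestion rate limit (%v) exceeded while adding %d samples and %d metadata",
	globalLimit, validatedSamples, len(validatedMetadata))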

Skip blocks with out-of-order chunks in compactor

PR #142 upgraded Thanos, including a feature in the compactor that can skip blocks with out-of-order chunks. The feature is based on the Thanos no-compact marker, which is currently not supported by Mimir; we should add support for it.

Some things to keep in mind to add support:

  • Add no-compact marker support to global location (see BucketWithGlobalMarkers)
  • Add no-compact marker support to bucket-index (see pkg/storage/tsdb/bucketindex)
  • Add option to enable "skip blocks with out-of-order chunks" and pass it to NewBucketCompactor
  • Add a new metric to track the number of blocks marked for no-compact and pass it to NewDefaultGrouper (a minimal sketch of such a metric follows this list)
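
A minimal sketch of the metric from the last bullet; the metric name is a suggestion, not an agreed-upon name:

import "github.com/prometheus/client_golang/prometheus"

// Suggested counter for blocks excluded from compaction via the no-compact
// marker; it would be registered by the compactor and passed to the grouper.
var blocksMarkedForNoCompaction = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "cortex_compactor_blocks_marked_for_no_compaction_total",
	Help: "Total number of blocks marked for no compaction (e.g. due to out-of-order chunks).",
})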

Run shellcheck as part of the lint target

PR #29 cleaned up some bash scripts based on shellcheck output, but we're not running shellcheck in CI, so they will likely drift over time.

We should add shellcheck to the build-image and run it as part of our lint Makefile target, so that it runs in CI too.
