Giter Club home page Giter Club logo

Comments (25)

Sh4d1 avatar Sh4d1 commented on June 24, 2024 1

Seems it's working now after the night! I'll take it! Thanks for the help!

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

Hello, it seems that block validation step (done by Mimir after block is fully uploaded, but before it is "accepted") has failed, and now block cannot be uploaded or validated anymore. Such block is treated as incomplete-upload, and will be eventually deleted by Mimir. After that, you can try upload it again.

Alternatively, you can try to delete validation.json file from inside uploaded block in the Mimir's storage. After that, you can run mimirtool to upload the block again, and run new validation. (You can also disable validation step by setting -compactor.block-upload-validation-enabled=false on compactor, and then upload will finish successfully, without doing block-validation. However this allows you to upload any "garbage" as TSDB block, and may cause errors in Mimir later).

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

Ah good to know thanks! Though it appears I get this issue on every block I try to import 🤔

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

Is there any way to see how the validation failed? It's a block taken from a classic prometheus instance, I figure it should work without issues? (though maybe it's too old in the past ? )

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

You should be able to see failure reasons in compactor logs. Based on mimirtool output, these blocks look quite big (40 GB?), so validation can take a while, and compactor must be running during the process without interruption.

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

Yeah I only get that:

ts=2024-05-21T12:36:57.248928356Z caller=block_upload.go:360 level=error component=compactor user=anonymous feature="block upload" block=01HGC2YXSZ86TQK4Z9G8A2DKZN operation="com
plete block upload" request_id=17d18125344c3e74 msg="error while validating block" err="failed to download block: context canceled"

Guess there is some timeout to tweak somewhere?

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024
ts=2024-05-21T12:39:07.165217295Z caller=block_upload.go:105 level=info component=compactor user=anonymous feature="block upload" block=01HGXF4Z4HTZ22TSNTH9AZXKNN operation="star
t block upload" request_id=17d18157c78e4bb4 msg="started block upload"
ts=2024-05-21T12:46:25.950370338Z caller=block_upload.go:172 level=info component=compactor user=anonymous feature="block upload" block=01HGXF4Z4HTZ22TSNTH9AZXKNN operation="comp
lete block upload" request_id=17d181bdef52b395 msg="validation process started"
tts=2024-05-21T12:47:44.330682597Z caller=block_upload.go:360 level=error component=compactor user=anonymous feature="block upload" block=01HGXF4Z4HTZ22TSNTH9AZXKNN operation="com
plete block upload" request_id=17d181bdef52b395 msg="error while validating block" err="failed to download block: context canceled"

to be precise*, so it seems to fail after 80s ?

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

I don't know about any timeout on the compactor side for validation / downloading blocks, but individual storage clients can have operation timeouts (eg. Swift storage has request timeout config option).

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

@pstibrany so you think the error comes from the storage provider? I'm using GCS, and doesn't seem to be any timeouts configurable :/

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

My thinking was that client-side could have some timeout, but in fact I wouldn't expect that to be logged as context canceled. We use GCS client internally, and don't see timeouts when reading data from it.

I'm wondering ... Is that the full compactor log, or is there possibly more messages that could be related to validation (block upload feature)? Any other error logged by compactor that would explain what's happening?

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

The 3 lines were next to each other in the compactor logs. So I guess there is no other messages related to this upload ?

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

The 3 lines were next to each other in the compactor logs. So I guess there is no other messages related to this upload ?

Could you try grepping on block ID, maybe something still shows up?

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

I just tested with another block, live tailing the logs only gets me that:

ts=2024-05-22T09:04:52.177159745Z caller=block_upload.go:105 level=info component=compactor user=anonymous feature="block upload" block=01HHMMGVFCC3J5EY8VTDK1JAH3 operation="start block upload" request_id=17d1c43b4e8d4ad3 msg="started block upload"
ts=2024-05-22T09:06:56.208641698Z caller=block_upload.go:172 level=info component=compactor user=anonymous feature="block upload" block=01HHMMGVFCC3J5EY8VTDK1JAH3 operation="complete block upload" request_id=17d1c4582f908ec3 msg="validation process started"
ts=2024-05-22T09:08:14.223566493Z caller=block_upload.go:360 level=error component=compactor user=anonymous feature="block upload" block=01HHMMGVFCC3J5EY8VTDK1JAH3 operation="complete block upload" request_id=17d1c4582f908ec3 msg="error while validating block" err="failed to download block: context canceled"

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

To unblock you, you can disable validation (-compactor.block-upload-validation-enabled=false), then any blocks will be accepted without further checks. If you are uploading blocks from Prometheus, that should be fine.

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

Yes it looks like it's uploaded correctly. It's not yet shown in Mimir though, I guess I need to wait some time? Can I refresh reset some cache somewhere?

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

Yes it looks like it's uploaded correctly. It's not yet shown in Mimir though, I guess I need to wait some time? Can I refresh reset some cache somewhere?

It can take about 30mins to block to become queryable -- compactor needs to include the block in bucket-index, and then queriers and store-gateways need to reload the bucket index and fetch parts of the block (store-gateways do that). If you try to query it before, empty results may be cached in the results cache. You can flush results cache by restarting memcached.

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

Okay I might see some empty cached results, though I have deployed mimir without memcached enabled, is it still used somewhere?

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

Well I restarted my 3 store gateways, but still nothing. I'll wait a bit more, but it's been over 1h already 🤔

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

Well I restarted my 3 store gateways, but still nothing. I'll wait a bit more, but it's been over 1h already 🤔

What time range does the uploaded block cover?

Can you see if store-gateways loaded your new block?

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

The upload block cover roughly from january 2024 to a few weeks ago.

I can see in the store gateway:

ts=2024-05-22T14:07:01.823562954Z caller=bucket.go:407 level=info user=anonymous msg="loaded new block" elapsed=403.665477ms id=01HMQ525MXFV8SR05WT2BMXBT2
ts=2024-05-22T14:22:39.628135866Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="loading sparse index-header from full index-header" id=01HMQ525MXFV8SR05WT2BMXBT2
ts=2024-05-22T14:22:39.669792351Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="loaded sparse index-header from full index-header" id=01HMQ525MXFV8SR05WT2BMXBT2 elapsed=41.658729ms
ts=2024-05-22T14:22:39.669820198Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="writing sparse index-header to disk" id=01HMQ525MXFV8SR05WT2BMXBT2 path=/data/tsdb-sync/anonymous/01HMQ525MXFV8SR05WT2BMXBT2/sparse-index-header
ts=2024-05-22T14:22:39.68422412Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="wrote sparse index-header to disk" id=01HMQ525MXFV8SR05WT2BMXBT2 path=/data/tsdb-sync/anonymous/01HMQ525MXFV8SR05WT2BMXBT2/sparse-index-header elapsed=14.394569ms

If I grep a block I uploaded. Another for instance:

ts=2024-05-22T14:06:13.466412037Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="loading from sparse index-header from disk" id=01HK2ZTGZXM7543W0ZJE5DXDQ2 path=/data/tsdb-sync/anonymous/01HK2ZTGZXM7543W0ZJE5DXDQ2/sparse-index-header
ts=2024-05-22T14:06:13.468317073Z caller=spanlogger.go:109 user=anonymous method=indexheader.NewStreamBinaryReader level=info msg="loaded sparse index-header from disk" id=01HK2ZTGZXM7543W0ZJE5DXDQ2 path=/data/tsdb-sync/anonymous/01HK2ZTGZXM7543W0ZJE5DXDQ2/sparse-index-header elapsed=1.907567ms
ts=2024-05-22T14:06:13.49157358Z caller=bucket.go:407 level=info user=anonymous msg="loaded new block" elapsed=28.30006ms id=01HK2ZTGZXM7543W0ZJE5DXDQ2

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

👍 That shows that blocks are now in bucket-index, and store-gateways know about them. Queriers should too, since they load the same bucket-index.

Since you don't have any cache ("I have deployed mimir without memcached enabled"), I don't see other reason why your queries should not work anymore. You can try something like count({__name__!=""}) to see how many series there are between january and now in your Mimir.

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

Well I'm getting timeout with this query 😅 even on a 5 minutes interval if it's a few weeks ago. In the querier logs:

ts=2024-05-22T15:22:08.666957598Z caller=scheduler_processor.go:240 level=error user=anonymous msg="error notifying scheduler about finished query after the scheduler stream previously failed and returned error" err=EOF addr=10.52.1.244:9095 previousErr="rpc error: code = Canceled desc = context canceled: frontend cancelled query"

Do you have any ideas?

from mimir.

Sh4d1 avatar Sh4d1 commented on June 24, 2024

Hum it seems I have a lot of consistency checks on these blocks
image

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

Hum it seems I have a lot of consistency checks on these blocks image

Consistency checks mean that querier tried to query some block, but no store-gateway that was supposed to have it, replied with data from the block. Perhaps store-gateways were still loading new blocks?

from mimir.

pstibrany avatar pstibrany commented on June 24, 2024

Well I'm getting timeout with this query 😅 even on a 5 minutes interval if it's a few weeks ago. In the querier logs:

ts=2024-05-22T15:22:08.666957598Z caller=scheduler_processor.go:240 level=error user=anonymous msg="error notifying scheduler about finished query after the scheduler stream previously failed and returned error" err=EOF addr=10.52.1.244:9095 previousErr="rpc error: code = Canceled desc = context canceled: frontend cancelled query"

Do you have any ideas?

Unfortunately I'm not familiar with this one.

from mimir.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.