Giter Club home page Giter Club logo

Comments (6)

tjkirch avatar tjkirch commented on June 3, 2024 1

I did learn a few things from the extra instrumentation, running with the 1GB file.

Retry counts:

  • Not quite half of runs required any retries, and typically just 1 retry for 1 block.
  • If there are multiple blocks that require retries, the block numbers tended to be clumped, e.g. blocks 142, 143, 144, 146, 148, 154, 155, 157, and one outlier at 873. Another example: 133, 139, 147, 148, 150, 151, 153, 154, and 646. A handful more like this.
  • 8 of 127 runs required a second retry for some block.
  • 1 of 127 runs would have required a fourth retry, but we kill the run after 3.

Retry causes:

  • Almost all of the retries were caused by a "connection closed".
  • Two were caused by "Couldn't find AWS credentials"; I use an instance IAM role for creds, so it could have been a temporary issue there.

It'd be great to figure out why we get the connection closed errors. It's tempting to raise the retry limit to 5 to work around it.

Obviously, none of these errors are the same as the original one in this issue, where EBS thinks the snapshot ID doesn't exist. Increasing retries seems less likely to fix that issue, since it's almost surely a timing issue, and we're not waiting between retries. It's tempting to add a wait.

In the past, when faced with similar resource timing issues, we've used the strategy of building a separate client object, ideally one that talks to a different endpoint, and checking the existence of the resource with that. It's unfortunate but it might help here.

from coldsnap.

tjkirch avatar tjkirch commented on June 3, 2024 1

We do still occasionally see this error, even with the retries from #56 - twice in the past two weeks, I believe, during CI runs.

from coldsnap.

tjkirch avatar tjkirch commented on June 3, 2024

Another one today. Most of the same block numbers...

Failed to upload snapshot: Failed to put 7 blocks for snapshot 'snap-bcdefg': 
Failed to put block 3 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 258 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 514 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 770 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 1282 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 1794 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 2047 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.

from coldsnap.

tjkirch avatar tjkirch commented on June 3, 2024

While trying to reproduce this, I also saw the following error; not sure if it's related, but I'd guess not. I'm putting it here because the same fix might help, if we have to impose waits / retries.

Failed to put block 1258 for snapshot 'snap-x': Error during dispatch: connection closed before message completed

from coldsnap.

tjkirch avatar tjkirch commented on June 3, 2024

I ran 5 upload tests each with a 10MB file and a ~1GB file, each time rerunning coldsnap until I got a failure.

For the 10MB file, it failed after 31, 177, 68, 48, and 968 attempts respectively, all but one with the "snapshot does not exist" error; the last gave Failed to complete snapshot 'snap-x': The snapshot 'snap-x' does not exist., something I hadn't seen before.

For the 1GB file, it failed after 27, 7, 4, 33, and 130 attempts respectively, always with the "connection closed before message completed" error. I hadn't seen this before, but now it seems consistent, at least when running somewhat intensive tests with the bigger file. The connection could be closed for any number of reasons...

(For reference, I did also see Failed to start snapshot: Number of concurrent backups exceeded: Limit: 100. a handful of times, which seems unrelated, so I just waited a bit and restarted things when I saw it.)

I'm going to try to instrument the error case a bit more thoroughly to see if I can get better clues about the state of the connection and the snapshot at the point of error. For example, we only keep the error message from the last retry for each block, and I'll try keeping all of those; I'm also going to check the result of the start_snapshot call, which we currently just check was successful, but there's a separate 'status' field inside that could be error - you never know!

from coldsnap.

tjkirch avatar tjkirch commented on June 3, 2024

As a test, I bumped block retry count from 3 to 5, and added an increasing wait time after each each block retry - 1 second, 2, 3, 4, 5. I was finally able to catch the interesting case, "snapshot does not exist." (I don't think it was related to those changes, I just happened to catch it.)

As you'd expect, when you see that error, you see it a lot - hundreds of blocks continued to fail with that error after >15s of total delay, killing the upload. However, very near the end, a handful of later blocks did manage to succeed in uploading after 1 not-found error. (I only logged ones that had some kind of failure, not currently the status of every block, so there could have been more.)

This implies that we could get some mileage out of confirming we can describe the snapshot after starting it, before we start uploading. However, I'm still not confident that it's a linear ordering, meaning that once we see an upload without a not-found error, I think we can still see a not-found error. I believe rusoto uses a connection pool, and we're not necessarily sending requests on the same connection that saw the resource... but in combination with backoff-retries it may be enough.

[edit] Actually, I'm fairly sure the ordering isn't linear, and that the waits wouldn't be enough in combination. I had another failed run with only one block failing due to the snapshot not being found. The per-connection results seem pretty inconsistent.

from coldsnap.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.