Comments (6)
I did learn a few things from the extra instrumentation, running with the 1GB file.
Retry counts:
- Not quite half of runs required any retries, and typically just 1 retry for 1 block.
- If there are multiple blocks that require retries, the block numbers tended to be clumped, e.g. blocks 142, 143, 144, 146, 148, 154, 155, 157, and one outlier at 873. Another example: 133, 139, 147, 148, 150, 151, 153, 154, and 646. A handful more like this.
- 8 of 127 runs required a second retry for some block.
- 1 of 127 runs would have required a fourth retry, but we kill the run after 3.
Retry causes:
- Almost all of the retries were caused by a "connection closed".
- Two were caused by "Couldn't find AWS credentials"; I use an instance IAM role for creds, so it could have been a temporary issue there.
It'd be great to figure out why we get the connection closed errors. It's tempting to raise the retry limit to 5 to work around it.
Obviously, none of these errors are the same as the original one in this issue, where EBS thinks the snapshot ID doesn't exist. Increasing retries seems less likely to fix that issue, since it's almost surely a timing issue, and we're not waiting between retries. It's tempting to add a wait.
In the past, when faced with similar resource timing issues, we've used the strategy of building a separate client object, ideally one that talks to a different endpoint, and checking the existence of the resource with that. It's unfortunate but it might help here.
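A minimal sketch of that existence check, assuming the lookup itself is supplied by the caller. The names `wait_until_visible`, `exists`, `max_checks`, and `delay` are hypothetical and are not part of coldsnap or rusoto; `exists` stands in for whatever describe-style call the separate client would make.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Polls `exists` until it reports the resource as visible, or until
/// `max_checks` checks have been made.
/// Returns the number of checks it took, or `None` if it never showed up.
/// (Hypothetical helper; `exists` would wrap a describe call on a separate client.)
fn wait_until_visible<F>(mut exists: F, max_checks: u32, delay: Duration) -> Option<u32>
where
    F: FnMut() -> bool,
{
    for check in 1..=max_checks {
        if exists() {
            return Some(check);
        }
        // No point sleeping after the final check; we are about to give up.
        if check < max_checks {
            sleep(delay);
        }
    }
    None
}
```

A caller would run this once after starting the snapshot and only begin uploading blocks on `Some(_)`. A positive answer on one connection may not guarantee that later requests on other connections see the snapshot, so this would likely complement block retries rather than replace them.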
from coldsnap.
We do still occasionally see this error, even with the retries from #56 - twice in the past two weeks, I believe, during CI runs.
from coldsnap.
Another one today. Most of the same block numbers...
Failed to upload snapshot: Failed to put 7 blocks for snapshot 'snap-bcdefg':
Failed to put block 3 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 258 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 514 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 770 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 1282 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 1794 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
Failed to put block 2047 for snapshot 'snap-bcdefg': The snapshot 'snap-bcdefg' does not exist.
from coldsnap.
While trying to reproduce this, I also saw the following error; not sure if it's related, but I'd guess not. I'm putting it here because the same fix might help, if we have to impose waits / retries.
Failed to put block 1258 for snapshot 'snap-x': Error during dispatch: connection closed before message completed
from coldsnap.
I ran 5 upload tests each with a 10MB file and a ~1GB file, each time rerunning coldsnap until I got a failure.
For the 10MB file, it failed after 31, 177, 68, 48, and 968 attempts respectively, all but one with the "snapshot does not exist" error; the last gave "Failed to complete snapshot 'snap-x': The snapshot 'snap-x' does not exist.", something I hadn't seen before.
For the 1GB file, it failed after 27, 7, 4, 33, and 130 attempts respectively, always with the "connection closed before message completed" error. I hadn't seen this before, but now it seems consistent, at least when running somewhat intensive tests with the bigger file. The connection could be closed for any number of reasons...
(For reference, I did also see "Failed to start snapshot: Number of concurrent backups exceeded: Limit: 100." a handful of times, which seems unrelated, so I just waited a bit and restarted things when I saw it.)
I'm going to try to instrument the error case a bit more thoroughly to see if I can get better clues about the state of the connection and the snapshot at the point of error. For example, we only keep the error message from the last retry for each block, and I'll try keeping all of those; I'm also going to check the result of the start_snapshot call, which we currently just check was successful, but there's a separate 'status' field inside that could be "error". You never know!
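Keeping every error per block, rather than only the last one, could look roughly like the sketch below. This is an illustration under assumptions, not coldsnap's actual retry code; `retry_keeping_errors`, `op`, and `max_attempts` are hypothetical names, and `op` stands in for a single block upload attempt.

```rust
/// Runs `op` up to `max_attempts` times, passing it the 1-based attempt number.
/// On success, returns the value plus the errors from the attempts that failed
/// before it. On failure, returns every error seen, in order.
/// (Hypothetical helper; `op` would wrap one block upload request.)
fn retry_keeping_errors<T, E, F>(mut op: F, max_attempts: u32) -> Result<(T, Vec<E>), Vec<E>>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut errors = Vec::new();
    for attempt in 1..=max_attempts {
        match op(attempt) {
            // Hand back the earlier errors too, so they can still be logged.
            Ok(value) => return Ok((value, errors)),
            Err(e) => errors.push(e),
        }
    }
    Err(errors)
}
```

Returning the earlier errors on success as well makes it possible to log blocks that eventually uploaded but needed a retry, which is the case the instrumentation is meant to expose.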
from coldsnap.
As a test, I bumped block retry count from 3 to 5, and added an increasing wait time after each block retry - 1 second, 2, 3, 4, 5. I was finally able to catch the interesting case, "snapshot does not exist." (I don't think it was related to those changes, I just happened to catch it.)
As you'd expect, when you see that error, you see it a lot - hundreds of blocks continued to fail with that error after >15s of total delay, killing the upload. However, very near the end, a handful of later blocks did manage to succeed in uploading after 1 not-found error. (I only logged ones that had some kind of failure, not currently the status of every block, so there could have been more.)
This implies that we could get some mileage out of confirming we can describe the snapshot after starting it, before we start uploading. However, I'm still not confident that it's a linear ordering: even after we see one upload succeed without a not-found error, I think a later upload can still hit a not-found error. I believe rusoto uses a connection pool, and we're not necessarily sending requests on the same connection that saw the resource... but in combination with backoff-retries it may be enough.
[edit] Actually, I'm fairly sure the ordering isn't linear, and that the waits wouldn't be enough in combination. I had another failed run with only one block failing due to the snapshot not being found. The per-connection results seem pretty inconsistent.
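For reference, the test change described in this comment (up to 5 retries with waits of 1, 2, 3, 4, and 5 seconds) can be sketched as follows. This is a sketch under assumptions and not the code that was actually run; `linear_delay`, `retry_with_linear_backoff`, `op`, and `sleep_fn` are hypothetical names. The sleep function is passed in so the schedule can be checked without really waiting.

```rust
use std::time::Duration;

/// Delay before retry number `retry` (1-based): step, 2*step, 3*step, ...
fn linear_delay(retry: u32, step: Duration) -> Duration {
    step * retry
}

/// Calls `op` once, then retries up to `max_retries` more times, waiting a
/// linearly increasing delay before each retry. Returns the last error if
/// every attempt fails.
/// (Hypothetical helper; `op` would wrap one block upload request.)
fn retry_with_linear_backoff<T, E, F, S>(
    mut op: F,
    max_retries: u32,
    step: Duration,
    mut sleep_fn: S,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    S: FnMut(Duration),
{
    let mut retry = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of retries: surface the most recent error.
            Err(e) if retry >= max_retries => return Err(e),
            Err(_) => {
                retry += 1;
                sleep_fn(linear_delay(retry, step));
            }
        }
    }
}
```

In real use, `std::thread::sleep` could be passed as `sleep_fn` with a `step` of one second. With 5 retries the waits add up to 15 seconds, so a block that is still failing after that has had at least 15 seconds for the snapshot to become visible.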
from coldsnap.