Comments (11)
It doesn't make ECS unusable; it just means you have to schedule your own cleanups:
$ cat /etc/cron.d/docker-cleanup
# Remove exited containers three times an hour
10,30,50 * * * * root docker rm -v $(docker ps -a -q -f status=exited) > /dev/null 2>&1
# Remove dangling images three times an hour, offset from the job above
0,20,40 * * * * root docker rmi $(docker images -f 'dangling=true' -q) > /dev/null 2>&1
We like option 1 (keeping N stopped containers per task family) as it's nice and flexible.
Our demo of container autoscaling / prioritization is running on ECS. Since we're rapidly starting and stopping containers, we see up to 700 stopped containers build up once a container instance has been running for 3 hours.
We have two task families that are long-running services. For these it's really useful to be able to access the stopped containers for troubleshooting. But most of the stopped containers are our demo containers, and for those we don't need the 3-hour history.
Although Force12 is just a demo for now, I can see this being useful in real-world situations.
This just caught me out too. I'm interested in a solution to this, other than throwing disk at the problem and hoping the 3-hour cleanup happens first.
+1. Our containers are pretty big and this becomes a problem much sooner.
Ditto; our containers are pretty big and this makes ECS half-unusable.
@matalo33 It actually gets a bit more complex than that. We have many fairly fast-running containers. If we wait 10 minutes, the Docker data space fills up and the entire instance dies: it cannot start new tasks and cannot finish old ones. So if we use a fixed waiting period (or even listen for the 'die' event from Docker and then wait a few minutes), we end up with occasional dead instances, depending on how quickly our containers run and how many of them there are.
On the other hand, if we remove the containers as fast as we can (immediately after the Docker die event), the ECS agent starts having problems. It monitors the containers to update task status. If a container is deleted while the ECS agent thinks it's still running, the agent tries to transition it from RUNNING -> NONE, but refuses because that's going backwards; it then tries to stop the container and move it to STOPPED, but that fails because the container no longer exists. In the end we somehow end up with a permanently RUNNING task that the ECS agent cannot clean up. (This seems like a separate bug, but one triggered by us removing previously running containers before ECS notices they have stopped.)
So we need to find the right balance between cleaning up too quickly and too slowly, and that balance depends on the current workload, among other things. We're still trying to figure out how to handle this; one possibility we're looking at now is cleaning up containers immediately and getting rid of the permanently RUNNING tasks by "spoofing" the ECS agent and sending our own SubmitContainerStateChange and SubmitTaskStateChange calls.
Either way, it has been very frustrating, and I think it's fair to say it has made ECS unusable for our workload without significant workarounds.
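For concreteness, here's a minimal sketch of the event-driven variant (illustrative only: it assumes a Docker version where docker events supports --format, i.e. 1.10+, and GRACE_SECONDS is a made-up knob to tune per workload):
#!/bin/sh
# Listen for container 'die' events and remove each container after a
# grace period, giving the ECS agent time to observe the stop first.
GRACE_SECONDS=120
docker events --filter 'event=die' --format '{{.ID}}' |
while read -r container_id; do
  (
    sleep "$GRACE_SECONDS"
    # -v also removes the container's anonymous volumes
    docker rm -v "$container_id" > /dev/null 2>&1
  ) &
done
Set the grace period too low and you reproduce the RUNNING -> NONE confusion above; set it too high and the disk fills up under heavy churn.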
We have a similar scenario: large containers due to running a JVM (it's about a 640 MB image). We run the containers frequently, and they generally finish quickly. Throwing disk space at the problem extends the runway a little, but with Docker 1.9.1 it appears that Docker becomes more and more unresponsive as the containers build up during that 3-hour window.
Like @gibiansky, we tried to clean up containers along the lines of what @matalo33 mentioned and hit similar problems, with the agent getting confused when we removed stopped containers out from under it. Ideally we'd rather not touch the container instances at all and just let ECS do all the dirty Docker work. :)
I also opened a support case with AWS that includes a reproducible cluster, so it should be easier to see exactly what the cause is.
As a side note, I see pull request #286, which may resolve the issue for us.
@mbaumbach I can confirm that #286 fixed that issue for us.
Should this issue be closed now that #302 is merged and released in 1.8.0?
Only one of the four possible options (and the least powerful of them, at that) has been implemented. It took a while to get there, but I'm going to hold out hope for one of the other options, such as watching disk space and just doing the right thing automatically, as sketched below.
If the other options are never going to happen, this can be closed, but if they might, I think it makes sense to keep tracking them here.
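To illustrate, the disk-watching option could look something like this rough cron-driven sketch (the 80% threshold and the /var/lib/docker path are assumptions, and it presumes a storage driver backed by a mounted filesystem; with devicemapper the data pool isn't a mount, so you'd parse docker info instead):
#!/bin/sh
# Clean up only when the filesystem backing /var/lib/docker is nearly full.
# Requires GNU df for --output=pcent.
THRESHOLD=80
USED=$(df --output=pcent /var/lib/docker | tail -n 1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
  docker rm -v $(docker ps -a -q -f status=exited) > /dev/null 2>&1
  docker rmi $(docker images -f 'dangling=true' -q) > /dev/null 2>&1
fi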
Closing this issue, since it was addressed by ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION.
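For anyone finding this later, a minimal example of using that setting (the 10m value is just an illustration; the default is the 3-hour window discussed above):
# /etc/ecs/ecs.config on each container instance
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m
The agent reads its configuration at startup, so restart the ECS agent for the change to take effect.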