Comments (11)
It doesn't make ECS unusable; it just means you have to schedule your own cleanups:
$ cat /etc/cron.d/docker-cleanup
# Remove exited containers three times an hour
10,30,50 * * * * root docker rm -v $(docker ps -a -q -f status=exited) > /dev/null 2>&1
# Remove dangling images three times an hour, offset from the job above
0,20,40 * * * * root docker rmi $(docker images -f 'dangling=true' -q) > /dev/null 2>&1
We like option 1 (keeping N stopped containers per task family) as it's nice and flexible.
Our demo of container autoscaling / prioritization is running on ECS. Since we're rapidly starting and stopping containers, we see up to 700 stopped containers build up once a container instance has been running for 3 hours.
We have two task families that are long-running services. For these it's really useful to be able to access the stopped containers for troubleshooting. But most of the stopped containers are our demo containers, and for those we don't need the 3-hour history.
Although Force12 is just a demo for now, I can see this being useful in real-world situations.
This just caught me out too. I'm interested in a solution to this, other than throwing disk at the problem and hoping the 3-hour cleanup happens first.
+1. Our containers are pretty big and this becomes a problem much sooner.
Ditto; our containers are pretty big and this makes ECS half-unusable.
@matalo33 It actually gets a bit more complex than that. We have many fairly fast-running containers. If we wait 10 minutes, the Docker data space fills up and the entire instance dies: it cannot start new tasks and cannot finish old ones. So if we use a fixed waiting period (or even listen for the 'die' event from Docker and then wait a few minutes), we end up with occasional dead instances, depending on how quickly our containers run and how many of them there are.
On the other hand, if we remove the containers as fast as we can (immediately after the Docker die event), the ECS agent starts having problems. It monitors the containers to update task status. If a container is deleted while the ECS agent thinks it's still running, the agent tries to transition it from RUNNING -> NONE, but refuses because that's going backwards; it then tries to stop the container and move it to STOPPED, but that fails because the container no longer exists. In the end we somehow end up with a permanently RUNNING task that the ECS agent cannot clean up. (This seems like a separate bug, but one triggered by us removing previously running containers before ECS notices they have stopped.)
So we need to find the right balance between cleaning up too quickly and too slowly, and that balance depends on the current workload, among other things. We're still trying to figure out how to handle this; one possibility we're looking at now is cleaning up containers immediately and getting rid of the permanently RUNNING tasks by "spoofing" the ECS agent and sending our own SubmitContainerStateChange and SubmitTaskStateChange calls.
Either way, it has been very frustrating, and I think it's fair to say it has made ECS unusable for our workload without significant workarounds.
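For concreteness, here's a minimal sketch of the event-driven variant (illustrative only: it assumes a Docker version where docker events supports --format, i.e. 1.10+, and GRACE_SECONDS is a made-up knob to tune per workload):
#!/bin/sh
# Listen for container 'die' events and remove each container after a
# grace period, giving the ECS agent time to observe the stop first.
GRACE_SECONDS=120
docker events --filter 'event=die' --format '{{.ID}}' |
while read -r container_id; do
  (
    sleep "$GRACE_SECONDS"
    # -v also removes the container's anonymous volumes
    docker rm -v "$container_id" > /dev/null 2>&1
  ) &
done
Set the grace period too low and you reproduce the RUNNING -> NONE confusion above; set it too high and the disk fills up under heavy churn.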
We have a similar scenario: large containers due to running a JVM (it's about a 640 MB image). We run the containers frequently, and they generally finish quickly. Throwing disk space at the problem extends the runway a little, but with Docker 1.9.1 it appears that Docker becomes more and more unresponsive as the containers build up during that 3-hour window.
Like @gibiansky, we tried to clean up containers along the lines of what @matalo33 mentioned and hit similar problems, with the agent getting confused when we removed stopped containers out from under it. Ideally we'd rather not touch the container instances at all and just let ECS do all the dirty Docker work. :)
I also opened a support case with AWS that includes a reproducible cluster, so it should be easier to see exactly what the cause is.
As a side note, I see pull request #286, which may resolve the issue for us.
@mbaumbach I can confirm that #286 fixed that issue for us.
Should this issue be closed now that #302 is merged and released in 1.8.0?
Only one of the four possible options (and the least powerful of them, at that) has been implemented. It took a while to get there, but I'm going to hold out hope for one of the other options, such as watching disk space and just doing the right thing automatically, as sketched below.
If the other options are never going to happen, this can be closed, but if they might, I think it makes sense to keep tracking them here.
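To illustrate, the disk-watching option could look something like this rough cron-driven sketch (the 80% threshold and the /var/lib/docker path are assumptions, and it presumes a storage driver backed by a mounted filesystem; with devicemapper the data pool isn't a mount, so you'd parse docker info instead):
#!/bin/sh
# Clean up only when the filesystem backing /var/lib/docker is nearly full.
# Requires GNU df for --output=pcent.
THRESHOLD=80
USED=$(df --output=pcent /var/lib/docker | tail -n 1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
  docker rm -v $(docker ps -a -q -f status=exited) > /dev/null 2>&1
  docker rmi $(docker images -f 'dangling=true' -q) > /dev/null 2>&1
fi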
Closing this issue, since it was addressed by ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION.
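For anyone finding this later, a minimal example of using that setting (the 10m value is just an illustration; the default is the 3-hour window discussed above):
# /etc/ecs/ecs.config on each container instance
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m
The agent reads its configuration at startup, so restart the ECS agent for the change to take effect.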