Comments (11)

matalo33 commented on August 16, 2024

It doesn't make ECS unusable; it just means you have to schedule your own cleanups:

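$ cat /etc/cron.d/docker-cleanup
# Remove exited containers and their volumes three times an hour
10,30,50 * * * * root docker rm -v $(docker ps -a -q -f status=exited) > /dev/null 2>&1
# Remove dangling images, offset by ten minutes from the container cleanup
0,20,40 * * * * root docker rmi $(docker images -f 'dangling=true' -q) > /dev/null 2>&1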

from amazon-ecs-agent.

rossf7 commented on August 16, 2024

We like option 1 (keeping N stopped containers per task family) as it's nice and flexible.

Our demo of container autoscaling / prioritization is running on ECS. Since we're rapidly starting and stopping containers, we're seeing up to 700 stopped containers build up once a container instance has been running for 3 hours.

We have 2 task families that are long-running services. For these it's really useful to be able to access the stopped containers for troubleshooting. But most of the stopped containers are our demo containers, and for these we don't need the 3 hours of history.

Although Force12 is just a demo for now, I can see this being useful in real-world situations.
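
For illustration, option 1 (keep the N most recent stopped containers per task family) could be approximated on the host roughly as follows. This is only a sketch: `excess_stopped_containers` is a hypothetical helper, it assumes its input has one `family container_id` pair per line with the newest containers first, and the `docker ps` wiring shown in the trailing comment assumes the agent labels containers with `com.amazonaws.ecs.task-definition-family`.

```shell
#!/bin/sh
# Hypothetical helper: reads "family container_id" lines (newest first) on
# stdin and prints the container IDs beyond the first N seen for each family.
excess_stopped_containers() {
    keep="$1"
    awk -v keep="$keep" '{ if (++seen[$1] > keep) print $2 }'
}

# Possible wiring (assumption: containers carry the task family label and
# docker ps lists the newest containers first):
#   docker ps -a -f status=exited \
#       --format '{{.Label "com.amazonaws.ecs.task-definition-family"}} {{.ID}}' \
#     | excess_stopped_containers 5 | xargs -r docker rm -v
```

As described further down the thread, removing containers behind the agent's back can confuse it, so treat this as an illustration of the policy rather than a recommended fix.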

matalo33 commented on August 16, 2024

This just caught me out too. Interested in a solution to this, other than just throwing disk at the problem and hoping the 3 hour cleanup happens first.

chenliu0831 commented on August 16, 2024

+1. Our containers are pretty big and this becomes a problem much sooner.

gibiansky commented on August 16, 2024

Ditto; our containers are pretty big and this makes ECS half-unusable.

gibiansky commented on August 16, 2024

@matalo33 It actually gets a bit more complex than that. We have many fairly fast-running containers. If we wait 10 minutes, the Docker data space fills up and the entire instance dies: it cannot start new tasks and cannot finish old ones. So if we clean up on a fixed period (or even listen for the 'die' event from Docker and then wait a few minutes), we end up with occasional dead instances, depending on how quickly our containers run and how many of them there are.

On the other hand, if we remove the containers as fast as we can (immediately after the Docker 'die' event), the ECS agent starts having problems. It monitors the containers to update task status. If a container is deleted while the ECS agent thinks it's still running, the agent tries to transition it from RUNNING -> NONE, but refuses because that's going backwards; it then tries to stop the container and move it to STOPPED, but that doesn't work because the container doesn't exist. In the end we somehow end up with an infinitely RUNNING task that the ECS agent cannot pick up. (This seems like a separate bug, but one triggered by us removing previously running containers before ECS picks up on it.)

So we need to somehow find the right "balance" between cleaning up too quickly and too slowly, and the balance depends on the current workload and other things. We're still trying to figure out how to handle this; one possibility we're looking at now is cleaning up containers immediately and getting rid of the infinitely RUNNING tasks by "spoofing" the ECS container and sending our own SubmitContainerStateChange and SubmitTaskStateChange events.

Either way, it has been very frustrating, and I think it's fair to say that it's made ECS unusable for our workload without a significant workaround.

mbaumbach commented on August 16, 2024

We have a similar scenario. Our containers are large because they run a JVM (the image is about 640 MB). We run the containers frequently, and they generally finish quickly. Throwing disk space at the problem seems to buy a little time, but with Docker 1.9.1 it appears that Docker becomes more and more unresponsive as the containers build up in that 3 hour window.

Like @gibiansky, we tried to clean up containers along the lines of what @matalo33 mentioned and hit similar problems with the agent being confused by us coming in and removing stopped containers. Ideally, we'd rather not even touch the container instances anyway and just let ECS do all the dirty Docker work. :)

I also opened a support case with AWS that includes a reproducible cluster, which should make it easier to see exactly what the cause is.

As a side note, I see pull #286 which may resolve the issue for us.

gibiansky commented on August 16, 2024

@mbaumbach I can confirm that #286 fixed that issue for us.

gregwebs commented on August 16, 2024

Should this issue be closed now that #302 is merged and released in 1.8.0?

euank commented on August 16, 2024

Only one of the 4 possible options (and the least powerful of them at that) is there. It took a while to get there, but I'm going to hold out hope for one of the other options, such as watching disk space and just doing the right thing automatically.

If the other options will never happen, this can be closed, but if not I think it makes sense to keep tracking it here.

adnxn commented on August 16, 2024

Closing this issue since it was addressed by ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION.
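
For reference, that setting is an agent environment variable, typically placed in /etc/ecs/ecs.config on the container instance. A minimal example, assuming a 10 minute retention suits your workload (the value is purely illustrative; the 3 hour window discussed above is the agent's default):

```shell
# /etc/ecs/ecs.config
# Illustrative value: clean up stopped task containers after 10 minutes
# instead of the default 3 hours.
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=10m
```

The agent needs to be restarted to pick up the change.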
