Comments (6)

garden-gnome commented on July 21, 2024

Hi there!

We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.

The current status is as follows:

  • #129667821 Investigate: runc "no space left on device" cgroup errors on concourse worker

This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.

from garden-runc-release.

teddyking commented on July 21, 2024

Thanks for the clear description and the info on how to access the env, @JesseTAlford!
I'm having difficulty SSHing to the worker. Is the SSH port for the BOSH director locked down to a specific set of IP addresses? I'm getting an Errno::ETIMEDOUT from here.


anEXPer commented on July 21, 2024

Yeah, you hit the nail on the head: there's a security group rule that restricts access, and it only allows traffic from SF.

I've added a rule that allows traffic from 80.169.160.158/30, which IOPS tells me is the Pivotal London office. If that doesn't work, you might try VPNing into the SF office, or letting me know what range you need a rule for.

Please let me know if that gets you in or not!
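For anyone checking whether their own address falls inside that rule, here is a minimal sketch using Python's standard ipaddress module. The helper name is_allowed is hypothetical, and how the security group itself treats a CIDR with host bits set is an assumption; the sketch only shows which addresses the /30 block contains.

```python
import ipaddress

# The rule as quoted above. The address has host bits set, so strict=False
# is needed to normalise it to the enclosing /30 block.
rule = ipaddress.ip_network("80.169.160.158/30", strict=False)


def is_allowed(client_ip, network=rule):
    """Hypothetical helper: True if client_ip falls inside the rule's block."""
    return ipaddress.ip_address(client_ip) in network


print(rule)                    # 80.169.160.156/30
print([str(a) for a in rule])  # the four addresses .156 through .159
```

A /30 covers only four addresses, so an office with a wider egress range would need a broader rule.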


teddyking commented on July 21, 2024

Hi @JesseTAlford,

Thanks for sorting that out. We've now been able to access the worker VM and take a look around.
We have some good news and some not-so-good news.
The good news is that we've found the cause of the "no space left on device" error you're seeing.
The maximum number of memory cgroups has been reached on the VM:

$ cat /proc/cgroups
#subsys_name    hierarchy       num_cgroups     enabled
cpuset          1               1524            1
cpu             2               168             1
cpuacct         3               168             1
memory          4               **65535**       1

Unfortunately, we're not sure whether this can be fixed immediately.
Long story short, we believe we need this feature to be implemented in runc to prevent this from happening again.
It looks like memory cgroups are not being cleaned up as efficiently as they could be, so environments that create lots and lots of containers can reach the limit shown above.
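A rough way to watch for this before it bites is to parse the same /proc/cgroups output and flag any subsystem whose count is approaching the ceiling. The sketch below is only illustrative: the helper names and the 90% threshold are made up, and 65535 is assumed to be the ceiling simply because it is the value seen on this VM.

```python
# Sketch: parse /proc/cgroups-style text and flag subsystems near a ceiling.
# ASSUMPTION: 65535 is treated as the ceiling because it is the value
# reported in this thread; the 90% threshold is arbitrary.
ASSUMED_CGROUP_CEILING = 65535

# Sample copied from the output above (emphasis markers removed).
SAMPLE = """\
#subsys_name    hierarchy       num_cgroups     enabled
cpuset          1               1524            1
cpu             2               168             1
cpuacct         3               168             1
memory          4               65535           1
"""


def parse_proc_cgroups(text):
    """Return {subsystem_name: num_cgroups}, skipping the header line."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[0].startswith("#"):
            continue
        name, _hierarchy, num_cgroups, _enabled = fields
        counts[name] = int(num_cgroups)
    return counts


def subsystems_near_ceiling(counts, ceiling=ASSUMED_CGROUP_CEILING, threshold=0.9):
    """Return the sorted subsystems whose count is at least threshold * ceiling."""
    return sorted(name for name, n in counts.items() if n >= ceiling * threshold)


counts = parse_proc_cgroups(SAMPLE)
print(subsystems_near_ceiling(counts))  # prints ['memory']
```

On a live worker, the same parser could be fed the contents of /proc/cgroups instead of SAMPLE.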

An interim solution is to upgrade the stemcell for the deployment, which, at the very least, will reset the count back down to 0.
It's also possible that the kernel in the newer stemcell will help with the memory cgroup cleanup (but we're not 100% sure about that).

Thanks,
Ed & Petar


garden-gnome commented on July 21, 2024

Hello again!

All stories related to this issue have been accepted, so I'm going to automatically close this issue.

At the time of writing, the following stories have been accepted:

  • #129667821 Investigate: runc "no space left on device" cgroup errors on concourse worker

If you feel there is still more to be done, or if you have any questions, leave a comment and we'll reopen if necessary!


anEXPer commented on July 21, 2024

Cool, we used bosh recreate worker 1 to resolve the problem. We'll just keep applying that balm until that runc-feature-based fix lands.
