Comments (14)

tleyden commented on August 30, 2024

@wilsondy Looks to me like you are correct -- I tried launching the process the same way we do in the run script, and when I send a SIGTERM signal via Ctrl-C, it responds with this:

# ./couchbase-server -- -kernel global_enable_tracing false -noinput
^C
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution

I believe that's an artifact of the way we are launching in Erlang, and agree that it should be fixed.

from docker.

tleyden commented on August 30, 2024

Possible workaround: This is a unix wrapper around the erlang vm. (gist)

tleyden commented on August 30, 2024

@wilsondy Changed your title to something a little less graphic ;-)

tleyden commented on August 30, 2024

From @dave-finlay:

I don't believe we are eating SIGTERM
docker stop sends SIGTERM, for instance
and if that doesn't work after a specified timeout sends SIGKILL
when i run "docker stop" on my docker container with a 10 s time out it stops my container immediately
https://docs.docker.com/engine/reference/commandline/stop/#description
Your ctrl-c test is incorrect. ctrl-c sends a SIGINT by default which does interrupt the erlang vm and cause it to enter into its "break" routine

I just tried this test:

  • In one shell, run docker run [couchbase-container] /bin/bash and then start Couchbase Server via ./couchbase-server -- -kernel global_enable_tracing false -noinput. The shell will be blocked while the server runs.
  • In another shell, run docker exec and get another shell in that same container, and run:
    • ps auxww | grep -i erl and look for the first beam.smp PID
    • kill -SIGTERM 441 (replace 441 with your actual PID)

In the first shell, Couchbase Server will exit immediately and the call will unblock on the shell. Also ps auxww | grep -i erl returns no results.
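
The check in the second shell amounts to sending SIGTERM to a PID and confirming the process is gone. A minimal sketch of that pattern, using sleep as a stand-in for the beam.smp process (the stand-in and the variable names are assumptions for illustration, not part of the Couchbase image):

```shell
#!/bin/sh
# Stand-in for the long-running server process (hypothetical; the real test
# targets the first beam.smp PID found via `ps auxww | grep -i erl`).
sleep 60 &
pid=$!

# Send SIGTERM explicitly. Ctrl-C would send SIGINT instead.
kill -TERM "$pid"

# Reap the process and capture its exit status without aborting under `set -e`.
status=0
wait "$pid" || status=$?

# In bash and dash, a process killed by signal N reports status 128 + N,
# so SIGTERM (signal 15) shows up as 143.
echo "exit status: $status"

# kill -0 sends no signal; it only checks whether the PID still exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "process still running"
else
    echo "process exited"
fi
```

A status of 143 here indicates the process was terminated by SIGTERM rather than exiting on its own, which is a quick way to tell the two cases apart when repeating the test.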

@wilsondy there might still be an issue somewhere .. can you try to isolate it down further and re-open or file a fresh ticket?

ceejatec commented on August 30, 2024

If you start the container normally (rather than running /bin/bash and starting it manually), it uses runit as a process supervisor. I'm not sure whether runit passes SIGTERM on to the child processes or not, but quite likely not.

tleyden commented on August 30, 2024

Ok thanks @ceejatec -- will need to retest then

tleyden commented on August 30, 2024

This might be relevant:

https://peter.bourgon.org/blog/2015/09/24/docker-runit-and-graceful-termination.html

ceejatec commented on August 30, 2024

Ok, I've spent a while chasing after this. Here's what I think is true now.

First off, no, our container isn't swallowing SIGTERM. PID 1 is the command runsvdir, which is documented here: http://manpages.ubuntu.com/manpages/xenial/man8/runsvdir.8.html

As documented, when that process receives SIGTERM, it immediately exits with return code 0. This causes the container to exit and kill/reap any other processes in the container. As Dave's experiment shows, "docker stop" (which sends SIGTERM to process 1 in the container) causes the container to exit right away, which is consistent. I don't know enough about Kubernetes to understand the original question raised, but I don't think our container is doing anything wrong regarding SIGTERM as a member of a larger world.

However, this does mean that "docker stop" prevents Couchbase from gracefully exiting. Ideally we would want to call the script /opt/couchbase/bin/couchbase-server -k to gracefully close everything down including epmd. As far as I can tell, this isn't possible with our current configuration, because runsvdir is too limited regarding signal handling. Starting with Docker 1.11 it is possible to configure the container to send a signal other than SIGTERM when it is stopped. But that doesn't help because the only other signal runsvdir is programmed to respond to is SIGHUP, which causes it to send SIGTERM to all child processes and then immediately exit.

All child processes of runsvdir are runsv (http://manpages.ubuntu.com/manpages/xenial/man8/runsv.8.html), and they respond to SIGTERM the way we would want - we can modify the configuration to cleanly shut down Couchbase when that signal is received. However, in practice runsvdir still exits before this cleanup is complete, which causes the container to exit and Docker to kill everything immediately.

Soooo... we don't actually need runsvdir, because we don't launch more than one service. (And as a "proper" Docker container, we really never should.) Locally I've mocked up a variation of the Couchbase Docker container which runs runsv as PID 1, and configured it to shut Couchbase down cleanly when SIGTERM is received. And that actually works great - "docker stop" of the container now takes about 5 seconds to run, which is happening because Server is gracefully shutting itself down. This seems like a win. I will propose that change, but I'm not sure who should take the lead to do more thorough testing.
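
One way such a setup could be laid out, as a sketch only: runsv(8) documents custom control scripts, where an executable control/t in the service directory is run before runsv would send TERM to the service (this also applies to the exit command that runsv performs when it receives SIGTERM itself). The service directory name below is an assumption, and this is not necessarily how the mocked-up image was configured. The two Couchbase command lines are the ones quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch: generate a runsv service directory for a single Couchbase service.
# SVC_DIR is a hypothetical location chosen for this example.
SVC_DIR=${SVC_DIR:-./couchbase-runsv}
mkdir -p "$SVC_DIR/control"

# run: the program runsv starts and supervises (command line from this thread).
cat > "$SVC_DIR/run" <<'EOF'
#!/bin/sh
exec /opt/couchbase/bin/couchbase-server -- -kernel global_enable_tracing false -noinput
EOF

# control/t: custom control script that runsv runs before it would send TERM
# to the service. Exiting 0 tells runsv not to send the signal itself;
# a non-zero exit makes runsv fall back to sending TERM.
cat > "$SVC_DIR/control/t" <<'EOF'
#!/bin/sh
exec /opt/couchbase/bin/couchbase-server -k
EOF

chmod +x "$SVC_DIR/run" "$SVC_DIR/control/t"
```

With this layout, the container's command would be something like exec runsv ./couchbase-runsv, so that runsv itself is PID 1 and receives the SIGTERM from "docker stop". Whether couchbase-server -k blocks until shutdown has finished is not established here and would need checking.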

ceejatec commented on August 30, 2024

Actually, one thought, @wilsondy - you said that the preStop hook is "removing the node". How is it doing that? If you do it via "docker stop" or equivalent, then the previous conversation is relevant. However, if you're doing something inside the container - stopping the Couchbase processes directly, for instance - then indeed you'll find that doesn't shut the container down. In fact, all the Couchbase processes will restart right away, because they're being managed by runsv. That would explain why Kubernetes is not seeing the container exit and waiting for the full grace period.

wilsondy commented on August 30, 2024

I run this:
couchbase-cli rebalance --cluster=$COUCHBASE_SERVICE_HOST:$COUCHBASE_SERVICE_PORT -u $CB_REST_USERNAME -p $CB_REST_PASSWORD --server-remove=$COUCHNAME

In limited testing I don't think this kills the Couchbase process, but I can double check later today.

ceejatec commented on August 30, 2024

Yes, I wouldn't expect that command to stop any processes, just to move data around. A bit of Googling shows me that the Kubernetes controller should also be sending SIGTERM to the main process in the containers, though, and that should indeed cause the container to exit immediately.

Can you confirm that the rebalance command you're invoking is actually completing? I know that even on a relatively small cluster, rebalancing can be a lengthy operation (almost certainly longer than the default 30 second grace period). I don't know, however, whether the couchbase-cli command normally waits for the operation to complete, or just triggers the operation and returns.

wilsondy commented on August 30, 2024

I initially set the Kubernetes terminationGracePeriodSeconds to 1 hour and the rebalance did complete. Then the pod/container was held in the Terminating state for the rest of the hour... so the bug I linked above led me to believe there might be this SIGTERM issue. As I understand it, the preStop hook runs (remove/rebalance), finishes, and then the SIGTERM is sent.

tleyden commented on August 30, 2024

> And that actually works great - "docker stop" of the container now takes about 5 seconds to run, which is happening because Server is gracefully shutting itself down. This seems like a win.

That would be a huge win!

I can envision stacking on shut-down hooks that depend on the fact that Couchbase Server has successfully shut itself down -- either in the process itself, if we ever allow user-provided post shutdown hooks, or in a container orchestrator context.

Are you planning to put up a PR for that change?

tleyden commented on August 30, 2024

Had some more offline discussion w/ @dave-finlay on this, and based on:

> First off, no, our container isn't swallowing SIGTERM.

and

> As documented, when that process receives SIGTERM, it immediately exits with return code 0.

I'm going to close out this ticket.

@ceejatec if you see an area to improve, it's probably better to file this as a separate ticket.

@wilsondy if you can narrow this down more to your specific problem, please open a new ticket.
