Giter Club home page Giter Club logo

Comments (14)

aj-jester avatar aj-jester commented on May 24, 2024

There's also a problem where if the leave works on a client, the kill will fail, resulting in a "FAILED" response from $retcode.

I also have observed some cases of clients in a "failed" state where they should have left, which I think is down to a race condition between issuing a leave and the subsequent 'kill -9'.

@pforman What would happen if leave_on_terminate is set to true, keep consul leave, then execute kill -TERM right after? Would that cause a race condition as well? Would the state of the client be failed or left?

Edit: Updated question for clarity

from puppet-consul.

pforman avatar pforman commented on May 24, 2024

If leave_on_terminate is true, then a TERM should issue another leave if it actually runs. Normally leave_on_terminate is false. (There's another default but modifiable behavior around SIGINT as well, basically SIGINT will cause a leave unless specified not to).

Normally the leave works correctly. In some rare cases, I've seen it not work, but I can't reliably duplicate it (yet). Luckily my test cluster just triggered it for me.

Here's what I got:

[root@ip-10-103-65-99 ~]# /etc/init.d/consul stop
Shutting down consul: Error leaving: client closed

Here's the log:

    2015/08/21 03:25:39 [INFO] agent: Synced check 'service:nginx'
    2015/08/21 03:31:53 [INFO] agent: Synced check 'service:nginx'
    2015/08/21 03:34:34 [INFO] agent.rpc: Accepted client: 127.0.0.1:49830
    2015/08/21 03:34:34 [INFO] agent.rpc: Accepted client: 127.0.0.1:49831
    2015/08/21 03:34:34 [INFO] agent.rpc: Graceful leave triggered
    2015/08/21 03:34:34 [INFO] consul: client starting leave
    2015/08/21 03:34:34 [INFO] serf: EventMemberLeave: ip-10-103-65-99 10.103.65.99
    2015/08/21 03:34:34 [INFO] agent: requesting shutdown
    2015/08/21 03:34:34 [INFO] consul: shutting down client
    2015/08/21 03:34:34 [INFO] agent: shutdown complete

This looks identical to any other client leave in the logs.

So it looks like the shutdown code returned "1", but then shut down correctly anyway. Checking from another client gives a "left" status:

[zinc@ip-10-103-67-243 ~]$ consul members
Node              Address             Status  Type    Build  Protocol  DC
ip-10-103-66-5    10.103.66.5:8301    alive   server  0.5.2  2         pfdemo-us-west-2
ip-10-103-65-5    10.103.65.5:8301    alive   server  0.5.2  2         pfdemo-us-west-2
ip-10-103-65-99   10.103.65.99:8301   left    client  0.5.2  2         pfdemo-us-west-2
ip-10-103-67-243  10.103.67.243:8301  alive   client  0.5.2  2         pfdemo-us-west-2
ip-10-103-67-5    10.103.67.5:8301    alive   server  0.5.2  2         pfdemo-us-west-2

This may just be a consul thing from internal goroutines and the shutdown order. I'll look over on their issues. As far as I can tell, the TERM is just sort of superfluous, unless you really want to force a "failed" state instead of left.

I haven't ever seen a leave actually fail, but with the old script I did see some "failed" states come about when clients should have left. I think that's because the SIGKILL can't be caught, and the process just terminates abruptly.

Seems like having clients simply do "consul leave" and having servers do "kill -TERM" is probably the best. Having clients use INT (which the handler will generate a leave from) is also an option.

Not sure if that's more or less confusing. I spent some quality time reading the consul signals code today to get what understanding I do have.
It's at https://github.com/hashicorp/consul/blob/master/command/agent/command.go

I'll dig in a little more, and go check the issues of consul to see if they've noticed this behavior.

from puppet-consul.

aj-jester avatar aj-jester commented on May 24, 2024

As far as I can tell, the TERM is just sort of superfluous, unless you really want to force a "failed" state instead of left.

Seems like having clients simply do "consul leave" and having servers do "kill -TERM" is probably the best. Having clients use INT (which the handler will generate a leave from) is also an option.

@pforman So then why not remove consul leave and just use INT? Isin't that what we want, consul to leave gracefully? That way we don't have to deal with any race condition with consul leave.

from puppet-consul.

pforman avatar pforman commented on May 24, 2024

I think that's viable. Let me beat on it a bit and see if I can get it to misbehave in any way.

The nice part about using INT is that it's controllable by config_hash, using the "skip_leave_on_interrupt" parameter.

All of this discussion about leave-vs-TERM is currently only the case for "sysv", though. Everything else appears to just use "consul leave". Seems like a lot of us with weird needs are using sysv-based systems...

I'm not sure if consistency across the init types is a good goal or not.

from puppet-consul.

aj-jester avatar aj-jester commented on May 24, 2024

@pforman Thank You! The more testing the better 😄

Currently consul leave is only implemented for sysv, debian and upstart scripts. So if you can confirm INT isin't doing any funny business, I will update the upstart and debian scripts to use INT too.

So please keep us posted 👍

from puppet-consul.

pforman avatar pforman commented on May 24, 2024

Initial results with INT look good. We get this in the log (turned up to DEBUG):

==> Caught signal: interrupt
==> Gracefully shutting down agent...
    2015/08/21 05:05:15 [INFO] consul: client starting leave
    2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
    2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
    2015/08/21 05:05:15 [DEBUG] serf: messageLeaveType: ip-10-103-65-99
    2015/08/21 05:05:15 [INFO] serf: EventMemberLeave: ip-10-103-65-99 10.103.65.99
    2015/08/21 05:05:15 [DEBUG] memberlist: Initiating push/pull sync with: 10.103.65.5:8301
    2015/08/21 05:05:15 [DEBUG] http: Shutting down http server (127.0.0.1:8500)
    2015/08/21 05:05:15 [INFO] agent: requesting shutdown
    2015/08/21 05:05:15 [INFO] consul: shutting down client
    2015/08/21 05:05:15 [INFO] agent: shutdown complete

The one oddity is that the script contains this:

        rm -f /var/lock/subsys/consul $PID_FILE

That's too fast, if the rm is allowed to remove $PID_FILE, then consul itself complains in the logs when it subsequently tries to. So the signal is sent to consul, but then the script marches on. Turns out the script is quicker at issuing "rm" than consul is at shutting down.

    2015/08/21 04:58:03 [WARN] agent: could not delete pid file  Could not remove pid file: stat /var/run/consul/consul.pid: no such file or directory

Still looking into the original condition that causes "consul leave" to occasionally generate a return code of 1. Seems like that should be sent upstream. "consul leave" is a bit more readable than "kill -INT", so I'd hate to obfuscate everything if we can find the root issue.

from puppet-consul.

aj-jester avatar aj-jester commented on May 24, 2024

@pforman 💡 We can do one of the following:

  1. We can check if the process exists pid -0 $PID (possibly fixed loop count with a one second back off)?
  • This raises another question. If after x amount of tries the process is still running then we have to decide whether to exit right then and echo "hey something went wrong, pid is still running" or forcefully kill -9 and clean up the pid and lock files?

OR

  1. We can just issue INT and assume consul took care of everything else but on start up ensure the pid is not running (if it does do kill -9) and clean up the lock and pid files.

Also, I kinda understand your point about consul leave being more clear. But the log doesn't really show where the leave was triggered from. It would been nice if consul added a flag called -reason or something so we could do consul leave -reason 'triggered from init' or something. Right now its quite ambiguous.

from puppet-consul.

aj-jester avatar aj-jester commented on May 24, 2024

@pforman I checked the other init scripts and most are sleeping/waiting before removing the pid file, so I think its fine to do that here as well (option 1).

Upstart is the exception to that rule, but I will add support to it as well.

from puppet-consul.

pforman avatar pforman commented on May 24, 2024

I opened hashicorp/consul#1189 about this. We'll see what happens.

I think INT and backoff is satisfactory, and if they fix this issue upstream going back to consul leave might be preferable then.

from puppet-consul.

aj-jester avatar aj-jester commented on May 24, 2024

@pforman Awesome. Looking forward to your PR 😎

from puppet-consul.

hopperd avatar hopperd commented on May 24, 2024

Was just thinking/working on this exact same issue today. I'd prefer if we can let the leave_on_terminate property determine the behavior of the cluster leave/failing otherwise there are situations in which this can cause a node to leave the cluster when you didn't really want it to because of a stop/start.

@pforman have you made headway on this and have a PR ready soon?

from puppet-consul.

pforman avatar pforman commented on May 24, 2024

PR is #181, waiting for merge.

The behavior for both clients and servers is somewhat configurable with the Consul properties, because the script just sends INT or TERM respectively.

If the signal isn't handled in a reasonable time it escalates the signal, eventually to KILL.

Consul is pretty well behaved as a daemon so I haven't seen that happen.

from puppet-consul.

solarkennedy avatar solarkennedy commented on May 24, 2024

Merged in #181.
I look forward to the day when we can just "use the upstream init script" :(

from puppet-consul.

aj-jester avatar aj-jester commented on May 24, 2024

Something like this could be used to standardize init scripts https://github.com/jordansissel/pleaserun

from puppet-consul.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.