Comments (14)
Hmm. Interesting problem @jberkus. I agree; I don't think I want to add an "execute arbitrary code" bit to the main codebase. Currently you can poll the /fsm/leader endpoint from the API and check for changes, so that's a possible stop-gap for now.
Example output from that is:

    $ curl http://localhost:5000/fsm/leader | json_pp
    {
       "data" : {
          "exists" : true,
          "leader" : {
             "connection_string" : "postgres://replication:[email protected]:5432/postgres",
             "name" : "postgresql0"
          }
       }
    }

In the long term, we might look to implement something in the API similar to etcd's Watchers: https://coreos.com/etcd/docs/latest/api.html#waiting-for-a-change
But in general I can't think of a good way to have the core code execute something on an event without bloating it beyond what I'm comfortable with.
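The polling stop-gap described above could be sketched roughly as follows. This is a minimal sketch, not code from Governor: the struct layout is inferred only from the sample JSON in this comment, and `parseLeaderName`, `fetchLeaderName`, and `watchLeader` are hypothetical helper names. It assumes the endpoint keeps returning the same shape shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// leaderResponse mirrors the sample /fsm/leader output shown in the thread.
// The field set is an assumption based on that one example.
type leaderResponse struct {
	Data struct {
		Exists bool `json:"exists"`
		Leader struct {
			ConnectionString string `json:"connection_string"`
			Name             string `json:"name"`
		} `json:"leader"`
	} `json:"data"`
}

// parseLeaderName returns the leader's name, or "" when no leader exists.
func parseLeaderName(body []byte) (string, error) {
	var r leaderResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	if !r.Data.Exists {
		return "", nil
	}
	return r.Data.Leader.Name, nil
}

// fetchLeaderName performs a single GET against the endpoint.
func fetchLeaderName(client *http.Client, url string) (string, error) {
	resp, err := client.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return parseLeaderName(body)
}

// watchLeader polls up to maxPolls times and calls onChange whenever the
// observed leader name differs from the previous successful observation.
func watchLeader(url string, interval time.Duration, maxPolls int, onChange func(prev, cur string)) {
	client := &http.Client{Timeout: 2 * time.Second}
	last := ""
	for i := 0; i < maxPolls; i++ {
		if name, err := fetchLeaderName(client, url); err == nil && name != last {
			onChange(last, name)
			last = name
		}
		time.Sleep(interval)
	}
}

func main() {
	// Offline demonstration using a payload shaped like the sample output.
	sample := []byte(`{"data":{"exists":true,"leader":{"connection_string":"postgres://example","name":"postgresql0"}}}`)
	name, err := parseLeaderName(sample)
	fmt.Println(name, err)
	// Against a live node one might call, for example:
	// watchLeader("http://localhost:5000/fsm/leader", time.Second, 60, func(prev, cur string) { ... })
}
```

The poll interval sets a lower bound on how quickly a leader change is noticed, which is the latency cost of any polling approach.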
from governor.
Why not an option to run a script built into the code so that it's just an extension point for integration? (Serious question - just wondering what the rationale is for not doing it)
Dj
On 18 Nov 2016, at 05:39, Joshua Deare wrote:
> Hmm. Interesting problem @jberkus . I agree I don't think I want to add an "execute arbitrary code" bit to the main codebase. [...]
from governor.
My reasoning is that I want governor to be a tool that provides auto failover and gives endpoints for external processes to query the state. Those processes can use that information to do what they wish.
I'm always rather cautious of giving a process the ability to say "When an event happens, fork and execute this user-provided string". I know buffer overflows are supposed to be impossible with standard Go code. But there have been exploits published where you can exploit certain data race conditions, the garbage collector, and the internals of a slice to induce a malicious buffer overflow. So I'm a little hesitant to expose Governor to the possibility of an arbitrary code execution exploit.
from governor.
Well, any approach which requires polling has two major drawbacks:
- It requires an additional container, or an additional daemon inside the Governor/Postgres container, and I thought one goal of this new approach was to reduce external requirements?
- It adds latency between the time failover happens and the time the orchestration system learns that failover has happened, thus adding additional downtime.
from governor.
@jberkus Agreed with the argument against polling. I don't like polling either. Since adding the ability to run a script on failover is quite a bit easier than implementing a "watchable" API endpoint, I'll add the ability to run a script on failover. But it may be deprecated in the future if we add that endpoint. I'm probably too cautious with security for my own good.
from governor.
Just make the requirement that the external script has to be in a certain directory and owned by the same user as governor. There's no real point in restricting it further; if someone can shell in as the user who owns the governor process, they can do anything they want anyway.
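The restriction suggested above (script in a fixed directory, owned by the same user as the governor process) might look something like this in Go. It is only a sketch under stated assumptions: `withinDir`, `checkScript`, and `runHook` are hypothetical names, the hook directory path is made up, and the owner check relies on `syscall.Stat_t`, so it only works on Unix-like systems. Symlinks are not resolved here; a stricter version could call `filepath.EvalSymlinks` first.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
	"syscall"
)

// withinDir reports whether path, once cleaned, lies strictly inside dir.
func withinDir(dir, path string) bool {
	rel, err := filepath.Rel(filepath.Clean(dir), filepath.Clean(path))
	if err != nil {
		return false
	}
	if rel == "." || rel == ".." {
		return false
	}
	return !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

// checkScript verifies the script sits in the allowed directory, is a
// regular file, and is owned by the user running this process.
func checkScript(dir, path string) error {
	if !withinDir(dir, path) {
		return fmt.Errorf("%s is outside %s", path, dir)
	}
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	if !info.Mode().IsRegular() {
		return errors.New("hook is not a regular file")
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return errors.New("cannot determine file owner on this platform")
	}
	if int(st.Uid) != os.Getuid() {
		return fmt.Errorf("%s is not owned by uid %d", path, os.Getuid())
	}
	return nil
}

// runHook validates the script and then runs it with the new state as its
// only argument. No shell is involved, so the state is never interpreted.
func runHook(dir, path, state string) error {
	if err := checkScript(dir, path); err != nil {
		return err
	}
	cmd := exec.Command(path, state)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// The directory below is a made-up example; this call is rejected
	// because the cleaned path escapes the hook directory.
	err := runHook("/etc/governor/hooks", "/etc/governor/hooks/../evil.sh", "leader")
	fmt.Println(err)
}
```

Passing the state as an argument to `exec.Command` rather than building a shell string keeps the "user-provided string" limited to a file path that has already been checked.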
from governor.
Sounds good. I'm on an on-call shift this Sunday, so I'll probably shell this out to pass the time during that. Expect a release probably Sunday evening
from governor.
So at a minimum, we'd need to pass the external script the result of the election, which would be one of three things:
- leader
- follower
- down (as in, no longer part of a valid cluster)
from governor.
Cool. That fits with what I was going to do. There are a few reasons a failover could occur (dead man switch, governor recognizes the underlying PG is unrecoverable for some reason, loss of quorum). I wasn't planning on revealing this to the script, just passing the script the current state (follower, leader, no leader available).
from governor.
Discussion: do we want two different failure states?
- Can't join cluster
- Local instance is inoperable (postgres won't start, etc.)
Just thinking that the system might react differently to those two different situations.
from governor.
Relevant to this is that in failure state (1), the node probably can't message the orchestration system anyway. The only time that state would really be relevant is if cluster networking is messed up, or if you have a "no valid lead" situation.
from governor.
Good point. Although "can't join cluster" could mean 2 things:
- Network issues for the node
- Loss of quorum due to other nodes dropping out on their own (for whatever reason)
As I'm looking through the possible state changes there are actually 5 possible states in a failover. Right now there's
- I am the new leader
- I am a new follower - This only happens when the governor process launches.
- Read only following no leader - When a leader step-down occurs but this particular node's WAL log isn't progressed enough to claim the leader position (set in the maximum_lag_on_failover flag). The node is waiting for a new leader to be elected to follow.
- Read only following no leader - When there is no quorum so no leader can exist until the cluster regains quorum (caused by network issues or other nodes dying)
- PG has become unhealthy - Currently if PG isn't healthy and something with pg_ctl errors when attempting to fix it then Governor gets killed. There are a couple dangerous situations with outcomes like this I need to fix and I'll create issues for that
That being said I see 3 failure states I think are worth reporting:
- maximum_lag_on_failover condition not met when trying to gain leadership (this will auto-resolve after another node declares leadership)
- Quorum issues with the cluster
- Unhealthy PG
from governor.
Hmmm. Given that mess of states, I'm starting to think that we just want two failure states:
- Fail: any condition in which somehow Governor is up, but Postgres isn't taking connections, and
- Orphan: any condition in which the node isn't part of a cluster, but is up for read-only access.
Really only those two states matter in terms of the orchestration system; "Fail" status should be taken out of any connection pool, and "Orphan" would be according to local policy (some might want orphans in a pool, some might not).
For any other status refinements, I think it makes sense to just poll the HTTP status from the individual node, which would tell us all of the above other information.
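To make the proposed reduction concrete, here is one way the states handed to the script could be collapsed into leader, follower, Orphan, and Fail. This is a sketch of the idea in this comment, not Governor's actual implementation: `NodeState`, `nodeHealth`, its fields, and `classify` are all hypothetical names, and the string values passed to the script are assumptions.

```go
package main

import "fmt"

// NodeState is the single value that would be handed to the external script.
type NodeState string

const (
	StateLeader   NodeState = "leader"
	StateFollower NodeState = "follower"
	StateOrphan   NodeState = "orphan" // up for read-only access, not part of a cluster
	StateFail     NodeState = "fail"   // Governor is up, Postgres isn't taking connections
)

// nodeHealth is a hypothetical snapshot of what the node knows about itself.
type nodeHealth struct {
	PGAcceptingConnections bool
	InCluster              bool // a leader exists and this node leads or follows it
	IsLeader               bool
}

// classify collapses the finer-grained conditions from the thread into the
// four reported states. "Fail" wins over everything else, then "Orphan".
func classify(h nodeHealth) NodeState {
	switch {
	case !h.PGAcceptingConnections:
		return StateFail
	case !h.InCluster:
		return StateOrphan
	case h.IsLeader:
		return StateLeader
	default:
		return StateFollower
	}
}

func main() {
	samples := []nodeHealth{
		{PGAcceptingConnections: true, InCluster: true, IsLeader: true},
		{PGAcceptingConnections: true, InCluster: true, IsLeader: false},
		{PGAcceptingConnections: true, InCluster: false, IsLeader: false},
		{PGAcceptingConnections: false, InCluster: true, IsLeader: true},
	}
	for _, h := range samples {
		fmt.Printf("%+v -> %s\n", h, classify(h))
	}
}
```

Under this mapping both "read only following no leader" cases (lagging WAL and lost quorum) report as Orphan, and the distinction between them is left to the HTTP status endpoint.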
from governor.
Agreed. I found an issue with how it handled the maximum_lag_on_failover flag that I'm working on correcting. After that I'll get back to working on script execution.
from governor.