compose / governor
Runners to orchestrate a high-availability PostgreSQL
License: MIT License
So I've been whacking this about a bit, and there's no good way to automatically build the new branch/project: "go get" isn't compatible with GitHub branches.
Given that the old governor will exist on the master branch indefinitely, maybe golang-custom-raft should actually be a new GitHub project?
Make governor a Python module, so you can script governor into other projects more easily.
Given that etcd is the proper location for leader/follower state, haproxy_status.sh should respond after checking leader information from etcd instead of checking for leadership in PostgreSQL.
This will reduce the chance of writing data to a PostgreSQL that has lost its lock on the leader key but has not yet failed over.
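A minimal sketch of what such a check could look like, assuming an etcd v2 keyspace with a leader key under /v2/keys/service/<cluster>/leader (the key layout here is an assumption; adjust it to whatever your governor deployment actually writes):

```python
import json
import urllib.request

def leader_matches(etcd_url, cluster, member_name, timeout=2):
    """Return True if etcd says `member_name` currently holds the leader key.

    `etcd_url`, `cluster`, and the key path are illustrative assumptions;
    adapt them to the keyspace your governor deployment uses.
    """
    key_url = "%s/v2/keys/service/%s/leader" % (etcd_url, cluster)
    try:
        with urllib.request.urlopen(key_url, timeout=timeout) as resp:
            node = json.load(resp)["node"]
        return node["value"] == member_name
    except Exception:
        # etcd unreachable or key missing: report "not leader" so haproxy
        # stops routing writes here rather than guessing from local state.
        return False
```

haproxy_status.sh would then return 200 only when leader_matches(...) is true for the local member.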
Exception not caught:
return self.etcd.touch_member(self.state_handler.ip)
File "/mnt/bludata0/blumeta0/home/db2inst1/governor/helpers/etcd.py", line 130, in touch_member
self.put_client_path("/members/%s" % value, {"value": value, "ttl": self.ttl})
File "/mnt/bludata0/blumeta0/home/db2inst1/governor/helpers/etcd.py", line 67, in put_client_path
urllib2.urlopen(request, timeout=self.timeout).read()
File "/usr/local/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/local/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/local/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/local/lib/python2.7/urllib2.py", line 1240, in https_open
context=self._context)
File "/usr/local/lib/python2.7/urllib2.py", line 1200, in do_open
r = h.getresponse(buffering=True)
File "/usr/local/lib/python2.7/httplib.py", line 1073, in getresponse
response.begin()
File "/usr/local/lib/python2.7/httplib.py", line 415, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python2.7/httplib.py", line 371, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/local/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
File "/usr/local/lib/python2.7/ssl.py", line 714, in recv
return self.read(buflen)
File "/usr/local/lib/python2.7/ssl.py", line 608, in read
v = self._sslobj.read(len or 1024)
ssl.SSLError: ('The read operation timed out',)
I think it is better to make governor.py more verbose so its current status is visible. Something like this:
if etcd.race("/initialize", postgresql.name):
    logging.info("initializing")
    postgresql.initialize()
    etcd.take_leader(postgresql.name)
    postgresql.start()
else:
    logging.info("else")
    synced_from_leader = False
    while not synced_from_leader:
        logging.info("waiting for sync")
        leader = etcd.current_leader()
        if not leader:
            logging.info("i'm not a leader, sleeping")
            time.sleep(5)
            continue
        if postgresql.sync_from_leader(leader):
            logging.info("syncing from leader")
            postgresql.write_recovery_conf(leader)
            postgresql.start()
            synced_from_leader = True
        else:
            logging.info("else sleeping")
            time.sleep(5)
This adds support for etcd as a cluster, plus nicer syntax. I'm also hitting a problem with the current library reading the members list, so I'll go for a refactor and submit a pull request.
I am interested in Governor, but am curious about how it handles the following HA components:
Hi,
Postgres server crashes as a node tries to write a value in etcd and gets 500.
self.put_client_path("/members/%s" % member, {"value": connection_string, "ttl": self.ttl})
File "/home/i074560/shashank/governor/helpers/etcd.py", line 40, in put_client_path
opener.open(request)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
LOG: received fast shutdown request
Hi, we just started to use governor and it seems to be more or less working.
However I saw in the log files these exceptions:
Traceback (most recent call last):
File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python2.7/SocketServer.py", line 651, in init
self.finish()
File "/usr/lib/python2.7/SocketServer.py", line 710, in finish
self.wfile.close()
File "/usr/lib/python2.7/socket.py", line 279, in close
self.flush()
File "/usr/lib/python2.7/socket.py", line 303, in flush
self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
Is this a known issue?
Hello,
I'm no pg expert, but I'm a bit surprised you don't try to rewind an ex-leader trying to rejoin the cluster. That's why pg 9.5 introduced the pg_rewind command; before that you basically had to use pg_basebackup.
The problem, as I understand it, is that when the leader is disconnected from the cluster but didn't have time to replicate the most recent pages, and a secondary takes the role of leader, the previous leader can't rejoin unless it "rewinds" its most recent pages, because the ex-leader and the new leader have diverged since their last common point in time.
What's your take on this?
Thank you,
Laurent Debacker
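For reference, a rejoin step built on pg_rewind (PostgreSQL 9.5+) might be sketched like this. The paths, connection string, and function names are placeholders; real code would also have to verify pg_rewind's prerequisites (the old primary shut down cleanly, and wal_log_hints=on or data checksums enabled on the source):

```python
import subprocess

def build_rewind_command(data_dir, leader_conninfo):
    """Build a pg_rewind invocation to resync a demoted primary.

    `data_dir` and `leader_conninfo` are placeholders. pg_rewind requires
    the target cluster to be cleanly shut down, and the source to have
    wal_log_hints=on or data checksums enabled.
    """
    return [
        "pg_rewind",
        "--target-pgdata", data_dir,
        "--source-server", leader_conninfo,
    ]

def rewind(data_dir, leader_conninfo):
    # If pg_rewind fails, the fallback is a full pg_basebackup resync.
    return subprocess.call(build_rewind_command(data_dir, leader_conninfo)) == 0
```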
Hi,
my 2-node cluster ran for almost 5 hrs, after which I got this error:
HTTP Error 404: Not Found
Exception occurred in reaching leader
Traceback (most recent call last):
File "./governor.py", line 72, in
logging.info(ha.run_cycle())
File "/home/i074560/shashank/governor/helpers/ha.py", line 47, in run_cycle
if self.state_handler.is_healthiest_node(self.etcd):
File "/home/i074560/shashank/governor/helpers/postgresql.py", line 137, in is_healthiest_node
if (state_store.last_leader_operation() - self.xlog_position()) > self.config["maximum_lag_on_failover"]:
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
waiting for server to shut down....LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
. done
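A defensive version of that lag comparison might look like the sketch below; treating a missing WAL position on either side as "not a candidate" is one reasonable policy. The names mirror the traceback but this is illustrative, not the project's code:

```python
def within_failover_lag(last_leader_operation, my_xlog_position,
                        maximum_lag_on_failover):
    """Return True if this node is close enough to the last known leader
    WAL position to be a failover candidate.

    Either value can be None (no /optime/leader key in etcd yet, or the
    local xlog query failed); in that case we conservatively refuse to
    promote rather than raise a TypeError.
    """
    if last_leader_operation is None or my_xlog_position is None:
        return False
    return (last_leader_operation - my_xlog_position) <= maximum_lag_on_failover
```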
We're using the hostnames of the PostgreSQL servers as replication slot names. A new naming schema introduced dashes in the hostnames, which leads to failures.
There is an unused function in the postgresql class which would solve this problem: replication_slot_name.
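Replication slot names may contain only lower-case letters, digits, and underscores, so a helper along the lines of the unused replication_slot_name could normalize hostnames like this (a sketch; the actual function in postgresql.py may differ):

```python
import re

def slot_name_from_hostname(hostname):
    """Turn a hostname into a valid replication slot name.

    Slot names accept only [a-z0-9_], so dashes, dots, and any other
    characters become underscores and the result is lower-cased.
    """
    return re.sub(r"[^a-z0-9_]", "_", hostname.lower())
```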
haproxy will load-balance any connection to any node, master or slave. Without knowing the type of query (read-only or write) there's no way of actually doing proper load-balancing, and AFAIK haproxy will not be the best choice.
Either I'm missing something here, or there is a need for a more thought-out solution.
maximum_lag_on_failover incorrectly tries to query an existing leader for the WAL position. The problem is that at the time maximum_lag_on_failover should be checked, a leader wouldn't exist. A solution would be to store historic WAL position information on the member keys.
In order to enable Kubernetes services which communicate with the PostgreSQL master, I need to update some annotations in the Kubernetes API. As such, I want to execute a script each time we finish a leader election in golang-custom-raft.
Thoughts on what's the best way to add this? Obviously we could add it to the Go code, but we don't necessarily want the dependencies in the main codebase, do we?
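One dependency-free pattern is a configurable callback: after an election, invoke an operator-supplied script with the new role and leader address as arguments, and let that script talk to Kubernetes. A sketch (the option name and argument convention are hypothetical, not an existing governor feature):

```python
import subprocess

def notify_leader_change(callback_path, role, leader_address):
    """Invoke an operator-supplied hook script after a leader election.

    `callback_path` would come from configuration (a hypothetical
    `on_role_change` option); `role` and `leader_address` are passed as
    positional arguments. Returns the script's exit status so the caller
    can log failures without crashing.
    """
    return subprocess.call([callback_path, role, leader_address])
```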
if self.has_lock():
    self.update_lock()
This is not atomic, and update_lock would fail if it is too slow (HTTP 412 Precondition Failed); the check and the update could be combined into a single compare-and-swap.
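With etcd's v2 API the two steps collapse into one compare-and-swap PUT by passing prevValue, so the refresh fails atomically if another member has taken the key. A sketch of building such a request (the key layout is an assumption):

```python
import urllib.parse

def build_update_lock_request(base_url, cluster, member_name, ttl):
    """Build a single etcd v2 compare-and-swap PUT that refreshes the
    leader key's TTL only if we still own it, replacing the separate
    has_lock()/update_lock() pair. The key path is illustrative.

    etcd answers 412 Precondition Failed if prevValue no longer matches,
    i.e. if we lost the lock in the meantime.
    """
    url = "%s/v2/keys/service/%s/leader" % (base_url, cluster)
    body = urllib.parse.urlencode({
        "value": member_name,
        "ttl": ttl,
        "prevValue": member_name,  # the compare part of the compare-and-swap
    })
    return url, body
```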
I have set up a PostgreSQL cluster (with one master node and one slave/standby node) using Governor. I want to use HAProxy in front of my cluster.
I think in this case HAProxy itself could be a single point of failure.
To avoid this problem, suppose I use multiple nodes for HAProxy. Then I'm not sure how the client will handle the connection if the HAProxy IP it is currently connected to fails
(in other words, how the client/client app will switch over between the available HAProxy IPs).
Some of the scenarios to consider are:
The first question to answer is: should the Postgres cluster go read-only if etcd fails? Or should the Postgres cluster keep the current primary, but lose automatic failover functionality?
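Either policy is easy to express as an explicit decision point in the HA loop; a sketch of the two options (the names are illustrative, not existing governor configuration):

```python
def action_on_dcs_failure(is_leader, policy="freeze"):
    """Decide what to do when etcd is unreachable.

    'readonly' demotes the current primary, guaranteeing no writes can be
    lost while the cluster is blind; 'freeze' keeps the current primary
    but suspends automatic failover until etcd returns. Both policy names
    are illustrative.
    """
    if policy == "readonly" and is_leader:
        return "demote"
    return "noop"
```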
Hi Guys,
Great work. I like the direction you guys are going, and I just created a new repo with hapg as the project name. I cloned the golang-custom-raft branch and made it compatible with the gb build tool, which in my opinion benefits this project in the long run (visit https://getgb.io/ for more info). Currently, you can build it after git clone https://github.com/oneness/hapg once you have gb installed.
At PGConf, you guys mentioned that a new project based on governor is under consideration, and I was wondering if this is OK with you.
Hi,
I'm working on a local psql cluster for horizontal scalability. I'm playing with Docker (docker-compose especially) and using @miketonks' fork to achieve my goals, and almost everything plays smoothly.
The cluster is built from a docker-compose.yml config file. When it starts for the first time, the election phase 'emits' master and slaves correctly. However, when I shut down the whole cluster (with etcd as well) and then start it once again, a new master will likely be elected from a standby, which is fine. The problem is with the old master (turned into a standby), which basically crashes with these logs:
LOG: entering standby mode
FATAL: requested timeline 3 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/6000028 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/5014B50.
LOG: startup process (PID 21) exited with exit code 1
LOG: aborting startup due to startup process failure
I've read this article, which explains what happens and how to recover from it. I'm reviewing the governor.py code, looking at the if/else block, and wondering how to safely recover the old master. I mean, what was the purpose, assuming the data exists, of having the old master node follow_no_leader? Could you elaborate on that?
Hi there @Winslett ,
I'm curious about this line:
postgresql.write_recovery_conf({"address": "postgres://169.0.0.1:5432"})
I assume it should get the current leader from etcd instead of being hardcoded.
What is the context around this, please?
Thanks in advance!
Found a typo in helpers/etcd.py:
self.put_client_path("/optime/leader", {"value": state_handler.last_operation()})
Line should read:
self.put_client_path("/optime/leader", {"value": state_handler.last_leader_operation()})
primary_conninfo = 'user=None password=None host=None port=None sslmode=prefer sslcompression=1'
The connection info contains "None". After debugging the function write_recovery_conf of postgresql.py:
leader = urlparse(leader_hash["address"])
f.write("""
primary_conninfo = 'user=%(user)s password=%(password)s host=%(hostname)s port=%(port)s sslmode=prefer sslcompression=1'
""" % {"user": leader.username, "password": leader.password, "hostname": leader.hostname, "port": leader.port})
The debug info:
leader_hash["address"]==>postgres://repmgr:[email protected]:5432/postgres
leader==>ParseResult(scheme='postgres', netloc='',
path='//repmgr:[email protected]:5432/postgres', params='', query='', fragment='')
So leader.username is None.
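The empty netloc happens because, on the affected Python 2 versions, urlparse only splits the netloc for schemes registered in urlparse.uses_netloc, and postgres isn't one of them; the common workaround is urlparse.uses_netloc.append('postgres') before parsing. On Python 3 any scheme followed by // gets a netloc. A sketch of what the parse should yield (the URL here is a made-up example):

```python
from urllib.parse import urlparse

def parse_leader_address(address):
    """Extract connection fields from a leader address such as
    'postgres://user:secret@db1.example:5432/postgres'.

    On Python 3 this works for any scheme; on the affected Python 2
    versions, first run: urlparse.uses_netloc.append('postgres').
    """
    leader = urlparse(address)
    return {
        "user": leader.username,
        "password": leader.password,
        "hostname": leader.hostname,
        "port": leader.port,
    }
```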
Currently, if an underlying PG process is misbehaving, the pg service interface issues a command to pg_ctl to fix it, and that command has a non-zero exit status, then governor is killed. This causes issues with quorum. Governor should handle these errors gracefully and report them rather than erroring out.
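One way to keep a failing pg_ctl call from taking governor down is to wrap the invocation, log the failure, and report a boolean the HA loop can act on while it keeps refreshing its etcd lease. A sketch (`pg_ctl_args` stands for whatever command the service interface already builds):

```python
import logging
import subprocess

def run_pg_ctl(pg_ctl_args):
    """Run a pg_ctl command, reporting failure instead of crashing governor.

    Returns True on exit status 0; logs and returns False on a non-zero
    exit or a missing binary, so the HA loop can keep its etcd lease
    (preserving quorum) and retry on the next cycle.
    """
    try:
        subprocess.check_output(pg_ctl_args, stderr=subprocess.STDOUT)
        return True
    except subprocess.CalledProcessError as e:
        logging.error("pg_ctl failed (exit %s): %s", e.returncode, e.output)
        return False
    except OSError as e:
        logging.error("could not run pg_ctl: %s", e)
        return False
```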
When I run a clean install I get no output after executing ./governor.py postgres1.yml. However, if I ctrl+c the application I receive the following error:
postgres@sql1:~/governor$ ./governor.py postgres1.yml
^CTraceback (most recent call last):
File "./governor.py", line 48, in <module>
time.sleep(5)
KeyboardInterrupt
pg_ctl: directory "data/postgres" does not exist
Why is the data directory not being created during the first-time run?
Did some startup and shutdown checks, and eventually landed in the following state:
LOG: database system was shut down in recovery at 2015-08-27 11:58:44 CEST
WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
HINT: The database server will regularly poll the pg_xlog subdirectory to check for files placed there.
LOG: entering standby mode
FATAL: requested timeline 8 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/15000028 on timeline 7, but in the history of the requested timeline, the server forked off from that timeline at 0/14000198.
LOG: startup process (PID 2147) exited with exit code 1
LOG: aborting startup due to startup process failure
Can we somehow come back from this?
I had to change some things to get governor to work with something other than localhost.
First I had to change etcd so it uses init and my own config, in /etc/init/etcd.override:
# Override file for etcd Upstart script providing some environment variables
env ETCD_INITIAL_CLUSTER="sql1=http://10.0.0.75:2380,sql2=http://10.0.0.76:2380,etcd=http://10.0.0.77:2380"
env ETCD_INITIAL_CLUSTER_STATE="new"
env ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-01"
env ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.0.0.75:2380"
env ETCD_DATA_DIR="/var/lib/postgresql/governor/data/etcd"
env ETCD_LISTEN_PEER_URLS="http://10.0.0.75:2380"
env ETCD_LISTEN_CLIENT_URLS="http://10.0.0.75:2379"
env ETCD_ADVERTISE_CLIENT_URLS="http://10.0.0.75:2379"
env ETCD_NAME="sql1"
In helpers/postgresql.py I added the following line:
f.write("host all all %(self)s trust\n" % {"self": self.replication["self"]})
right after:
def write_pg_hba(self):
    f = open("%s/pg_hba.conf" % self.data_dir, "a")
and lastly added:
self: 10.0.0.0/24
to the postgresX.yml files. This is not ideal, as now the whole 10.0.0.0/24 network is trusted, but it did the trick for now. I would like to know your thoughts on this.
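A tighter alternative to trusting the whole subnet is to write one pg_hba.conf host line per known member address, e.g. one /32 per member pulled from the members list in etcd. A sketch (function name and auth method choice are illustrative):

```python
def pg_hba_lines(member_ips, method="md5"):
    """Generate one pg_hba.conf host line per member IP, instead of
    trusting a whole subnet.

    Using md5 rather than trust also requires setting passwords for the
    replication/application users; `method` is configurable here to make
    that trade-off explicit.
    """
    return ["host all all %s/32 %s" % (ip, method) for ip in member_ips]
```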
What does this mean?
FATAL: database system identifier differs between the primary and standby
DETAIL: The primary's identifier is 6150202608832133854, the standby's identifier is 6150207001805245765.