compose / governor
Runners to orchestrate a high-availability PostgreSQL
License: MIT License
So I've been whacking this about a bit, and there's no good way to automatically build the new branch/project: "go get" isn't compatible with GitHub branches.
Given that the old governor will exist on the master branch indefinitely, maybe golang-custom-raft should actually be a new GitHub project?
Make governor a Python module, so you can script governor into other projects more easily.
Given that etcd is the proper location for leader/follower state, haproxy_status.sh should respond after checking leader information from etcd instead of checking for leadership in PostgreSQL.
This will reduce the chance of writing data to a PostgreSQL that has lost its lock on the leader key but has not yet failed over.
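A minimal sketch of what such a check could look like, assuming an etcd v2 keyspace with a leader key under /v2/keys/service/<cluster>/leader (the key layout here is an assumption; adjust it to whatever your governor deployment actually writes):

```python
import json
import urllib.request

def leader_matches(etcd_url, cluster, member_name, timeout=2):
    """Return True if etcd says `member_name` currently holds the leader key.

    `etcd_url`, `cluster`, and the key path are illustrative assumptions;
    adapt them to the keyspace your governor deployment uses.
    """
    key_url = "%s/v2/keys/service/%s/leader" % (etcd_url, cluster)
    try:
        with urllib.request.urlopen(key_url, timeout=timeout) as resp:
            node = json.load(resp)["node"]
        return node["value"] == member_name
    except Exception:
        # etcd unreachable or key missing: report "not leader" so haproxy
        # stops routing writes here rather than guessing from local state.
        return False
```

haproxy_status.sh would then return 200 only when leader_matches(...) is true for the local member.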
Exception not caught:
return self.etcd.touch_member(self.state_handler.ip)
File "/mnt/bludata0/blumeta0/home/db2inst1/governor/helpers/etcd.py", line 130, in touch_member
self.put_client_path("/members/%s" % value, {"value": value, "ttl": self.ttl})
File "/mnt/bludata0/blumeta0/home/db2inst1/governor/helpers/etcd.py", line 67, in put_client_path
urllib2.urlopen(request, timeout=self.timeout).read()
File "/usr/local/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/local/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/local/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/local/lib/python2.7/urllib2.py", line 1240, in https_open
context=self._context)
File "/usr/local/lib/python2.7/urllib2.py", line 1200, in do_open
r = h.getresponse(buffering=True)
File "/usr/local/lib/python2.7/httplib.py", line 1073, in getresponse
response.begin()
File "/usr/local/lib/python2.7/httplib.py", line 415, in begin
version, status, reason = self._read_status()
File "/usr/local/lib/python2.7/httplib.py", line 371, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/local/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
File "/usr/local/lib/python2.7/ssl.py", line 714, in recv
return self.read(buflen)
File "/usr/local/lib/python2.7/ssl.py", line 608, in read
v = self._sslobj.read(len or 1024)
ssl.SSLError: ('The read operation timed out',)
I think it is better to make governor.py more verbose so its current status is visible. Something like this:
if etcd.race("/initialize", postgresql.name):
    logging.info("initializing")
    postgresql.initialize()
    etcd.take_leader(postgresql.name)
    postgresql.start()
else:
    logging.info("else")
    synced_from_leader = False
    while not synced_from_leader:
        logging.info("waiting for sync")
        leader = etcd.current_leader()
        if not leader:
            logging.info("i'm not a leader, sleeping")
            time.sleep(5)
            continue
        if postgresql.sync_from_leader(leader):
            logging.info("syncing from leader")
            postgresql.write_recovery_conf(leader)
            postgresql.start()
            synced_from_leader = True
        else:
            logging.info("else sleeping")
            time.sleep(5)
This adds support for etcd as a cluster, plus nicer syntax. I'm also hitting a problem with the current library reading the members list, so I'll go for a refactor and submit a pull request.
I am interested in Governor, but am curious about how it handles the following HA components:
Hi,
Postgres server crashes as a node tries to write a value in etcd and gets 500.
self.put_client_path("/members/%s" % member, {"value": connection_string, "ttl": self.ttl})
File "/home/i074560/shashank/governor/helpers/etcd.py", line 40, in put_client_path
opener.open(request)
File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
LOG: received fast shutdown request
Hi, we just started to use governor and it seems to be more or less working.
However I saw in the log files these exceptions:
Traceback (most recent call last):
File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python2.7/SocketServer.py", line 651, in init
self.finish()
File "/usr/lib/python2.7/SocketServer.py", line 710, in finish
self.wfile.close()
File "/usr/lib/python2.7/socket.py", line 279, in close
self.flush()
File "/usr/lib/python2.7/socket.py", line 303, in flush
self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
Is this a known issue?
Hello,
I'm no pg expert, but I'm a bit surprised you don't try to rewind an ex-leader trying to rejoin the cluster. That's why pg 9.5 introduced the pg_rewind command; before that you basically had to use pg_basebackup.
The problem, as I understand it, is that when the leader is disconnected from the cluster but didn't have time to replicate the most recent pages, and a secondary takes the role of leader, the previous leader can't rejoin unless it "rewinds" its most recent pages, because the ex-leader and the new leader have diverged since their last common point in time.
What's your take on this?
Thank you,
Laurent Debacker
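For reference, a rejoin step built on pg_rewind (PostgreSQL 9.5+) might be sketched like this. The paths, connection string, and function names are placeholders; real code would also have to verify pg_rewind's prerequisites (the old primary shut down cleanly, and wal_log_hints=on or data checksums enabled on the source):

```python
import subprocess

def build_rewind_command(data_dir, leader_conninfo):
    """Build a pg_rewind invocation to resync a demoted primary.

    `data_dir` and `leader_conninfo` are placeholders. pg_rewind requires
    the target cluster to be cleanly shut down, and the source to have
    wal_log_hints=on or data checksums enabled.
    """
    return [
        "pg_rewind",
        "--target-pgdata", data_dir,
        "--source-server", leader_conninfo,
    ]

def rewind(data_dir, leader_conninfo):
    # If pg_rewind fails, the fallback is a full pg_basebackup resync.
    return subprocess.call(build_rewind_command(data_dir, leader_conninfo)) == 0
```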
Hi,
my 2-node cluster ran for almost 5 hrs, after which I got this error:
HTTP Error 404: Not Found
Exception occurred in reaching leader
Traceback (most recent call last):
File "./governor.py", line 72, in
logging.info(ha.run_cycle())
File "/home/i074560/shashank/governor/helpers/ha.py", line 47, in run_cycle
if self.state_handler.is_healthiest_node(self.etcd):
File "/home/i074560/shashank/governor/helpers/postgresql.py", line 137, in is_healthiest_node
if (state_store.last_leader_operation() - self.xlog_position()) > self.config["maximum_lag_on_failover"]:
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
waiting for server to shut down....LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
. done
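A defensive version of that lag comparison might look like the sketch below; treating a missing WAL position on either side as "not a candidate" is one reasonable policy. The names mirror the traceback but this is illustrative, not the project's code:

```python
def within_failover_lag(last_leader_operation, my_xlog_position,
                        maximum_lag_on_failover):
    """Return True if this node is close enough to the last known leader
    WAL position to be a failover candidate.

    Either value can be None (no /optime/leader key in etcd yet, or the
    local xlog query failed); in that case we conservatively refuse to
    promote rather than raise a TypeError.
    """
    if last_leader_operation is None or my_xlog_position is None:
        return False
    return (last_leader_operation - my_xlog_position) <= maximum_lag_on_failover
```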
We're using the hostnames of the PostgreSQL servers as replication slot names. A new naming schema introduced dashes in the hostnames, which leads to failures.
There is an unused function in the postgresql class which would solve this problem: replication_slot_name.
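Replication slot names may contain only lower-case letters, digits, and underscores, so a helper along the lines of the unused replication_slot_name could normalize hostnames like this (a sketch; the actual function in postgresql.py may differ):

```python
import re

def slot_name_from_hostname(hostname):
    """Turn a hostname into a valid replication slot name.

    Slot names accept only [a-z0-9_], so dashes, dots, and any other
    characters become underscores and the result is lower-cased.
    """
    return re.sub(r"[^a-z0-9_]", "_", hostname.lower())
```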
haproxy will load-balance any connection to any node, master or slave. Without knowing the type of query (read-only or write) there's no way of actually doing proper load-balancing, and AFAIK haproxy will not be the best choice.
Either I'm missing something here, or there is a need for a more thought-out solution.
maximum_lag_on_failover incorrectly tries to query an existing leader for the WAL position. The problem is that at the time maximum_lag_on_failover should be checked, a leader wouldn't exist. A solution would be to store historic WAL position information on the member keys.
In order to enable Kubernetes services which communicate with the PostgreSQL master, I need to update some annotations in the Kubernetes API. As such, I want to execute a script each time we finish a leader election in golang-custom-raft.
Thoughts on what's the best way to add this? Obviously we could add it to the Go code, but we don't necessarily want the dependencies in the main codebase, do we?
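One dependency-free pattern is a configurable callback: after an election, invoke an operator-supplied script with the new role and leader address as arguments, and let that script talk to Kubernetes. A sketch (the option name and argument convention are hypothetical, not an existing governor feature):

```python
import subprocess

def notify_leader_change(callback_path, role, leader_address):
    """Invoke an operator-supplied hook script after a leader election.

    `callback_path` would come from configuration (a hypothetical
    `on_role_change` option); `role` and `leader_address` are passed as
    positional arguments. Returns the script's exit status so the caller
    can log failures without crashing.
    """
    return subprocess.call([callback_path, role, leader_address])
```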
if self.has_lock():
    self.update_lock()
This is not atomic, and update_lock would fail if it is too slow (HTTP 412 Precondition Failed); the check and the update could be combined into a single compare-and-swap.
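With etcd's v2 API the two steps collapse into one compare-and-swap PUT by passing prevValue, so the refresh fails atomically if another member has taken the key. A sketch of building such a request (the key layout is an assumption):

```python
import urllib.parse

def build_update_lock_request(base_url, cluster, member_name, ttl):
    """Build a single etcd v2 compare-and-swap PUT that refreshes the
    leader key's TTL only if we still own it, replacing the separate
    has_lock()/update_lock() pair. The key path is illustrative.

    etcd answers 412 Precondition Failed if prevValue no longer matches,
    i.e. if we lost the lock in the meantime.
    """
    url = "%s/v2/keys/service/%s/leader" % (base_url, cluster)
    body = urllib.parse.urlencode({
        "value": member_name,
        "ttl": ttl,
        "prevValue": member_name,  # the compare part of the compare-and-swap
    })
    return url, body
```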
I have set up a PostgreSQL cluster (with one master node and one slave/standby node) using Governor. I want to use HAProxy in front of my cluster.
I think in this case HAProxy itself could be a single point of failure.
To avoid this problem, suppose I use multiple nodes for HAProxy. Then I'm not sure how the client will handle the connection if the HAProxy IP it is currently connected to fails
(in other words, how the client/client app will switch over between the available HAProxy IPs).
Some of the scenarios to consider are:
The first question to answer is: should the Postgres cluster go read-only if etcd fails? Or should the Postgres cluster keep the current primary, but lose automatic failover functionality?
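Either policy is easy to express as an explicit decision point in the HA loop; a sketch of the two options (the names are illustrative, not existing governor configuration):

```python
def action_on_dcs_failure(is_leader, policy="freeze"):
    """Decide what to do when etcd is unreachable.

    'readonly' demotes the current primary, guaranteeing no writes can be
    lost while the cluster is blind; 'freeze' keeps the current primary
    but suspends automatic failover until etcd returns. Both policy names
    are illustrative.
    """
    if policy == "readonly" and is_leader:
        return "demote"
    return "noop"
```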
Hi Guys,
Great work. I like the direction you guys are going, and I just created a new repo with hapg as the project name. I cloned the golang-custom-raft branch and made it compatible with the gb build tool, which in my opinion benefits this project in the long run (visit https://getgb.io/ for more info). Currently, you can build it after git clone https://github.com/oneness/hapg once you have gb installed.
At PGConf, you guys mentioned that a new project based on governor is under consideration, and I was wondering if this is OK with you.
Hi,
I'm working on a local psql cluster for horizontal scalability. I'm playing with Docker (docker-compose especially) and using @miketonks' fork to achieve my goals, and almost everything plays smoothly.
The cluster is built from a docker-compose.yml config file. When it starts for the first time, the election phase 'emits' master and slaves correctly. However, when I shut down the whole cluster (with etcd as well) and then start it once again, a new master will likely be elected from a standby, which is fine. The problem is with the old master (turned into a standby), which basically crashes with these logs:
LOG: entering standby mode
FATAL: requested timeline 3 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/6000028 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/5014B50.
LOG: startup process (PID 21) exited with exit code 1
LOG: aborting startup due to startup process failure
I've read this article, which explains what happens and how to recover from it. I'm reviewing the governor.py code, looking at the if/else block, and wondering how to safely recover the old master. I mean, what was the purpose, assuming the data exists, of having the old master node follow_no_leader? Could you elaborate on that?
Hi there @Winslett ,
I'm curious about this line:
postgresql.write_recovery_conf({"address": "postgres://169.0.0.1:5432"})
I assume it should get the current leader from etcd instead of being hardcoded.
What is the context around this, please?
Thanks in advance!
Found a typo in helpers/etcd.py:
self.put_client_path("/optime/leader", {"value": state_handler.last_operation()})
Line should read:
self.put_client_path("/optime/leader", {"value": state_handler.last_leader_operation()})
primary_conninfo = 'user=None password=None host=None port=None sslmode=prefer sslcompression=1'
The connection info contains "None". After debugging the function write_recovery_conf of postgresql.py:
leader = urlparse(leader_hash["address"])
f.write("""
primary_conninfo = 'user=%(user)s password=%(password)s host=%(hostname)s port=%(port)s sslmode=prefer sslcompression=1'
""" % {"user": leader.username, "password": leader.password, "hostname": leader.hostname, "port": leader.port})
The debug info:
leader_hash["address"]==>postgres://repmgr:[email protected]:5432/postgres
leader==>ParseResult(scheme='postgres', netloc='',
path='//repmgr:[email protected]:5432/postgres', params='', query='', fragment='')
So leader.username is None.
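The empty netloc happens because, on the affected Python 2 versions, urlparse only splits the netloc for schemes registered in urlparse.uses_netloc, and postgres isn't one of them; the common workaround is urlparse.uses_netloc.append('postgres') before parsing. On Python 3 any scheme followed by // gets a netloc. A sketch of what the parse should yield (the URL here is a made-up example):

```python
from urllib.parse import urlparse

def parse_leader_address(address):
    """Extract connection fields from a leader address such as
    'postgres://user:secret@db1.example:5432/postgres'.

    On Python 3 this works for any scheme; on the affected Python 2
    versions, first run: urlparse.uses_netloc.append('postgres').
    """
    leader = urlparse(address)
    return {
        "user": leader.username,
        "password": leader.password,
        "hostname": leader.hostname,
        "port": leader.port,
    }
```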
Currently, if an underlying PG process is misbehaving, the pg service interface issues a command to pg_ctl to fix it, and that command has a non-zero exit status, then governor is killed. This causes issues with quorum. Governor should handle these errors gracefully and report them rather than erroring out.
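One way to keep a failing pg_ctl call from taking governor down is to wrap the invocation, log the failure, and report a boolean the HA loop can act on while it keeps refreshing its etcd lease. A sketch (`pg_ctl_args` stands for whatever command the service interface already builds):

```python
import logging
import subprocess

def run_pg_ctl(pg_ctl_args):
    """Run a pg_ctl command, reporting failure instead of crashing governor.

    Returns True on exit status 0; logs and returns False on a non-zero
    exit or a missing binary, so the HA loop can keep its etcd lease
    (preserving quorum) and retry on the next cycle.
    """
    try:
        subprocess.check_output(pg_ctl_args, stderr=subprocess.STDOUT)
        return True
    except subprocess.CalledProcessError as e:
        logging.error("pg_ctl failed (exit %s): %s", e.returncode, e.output)
        return False
    except OSError as e:
        logging.error("could not run pg_ctl: %s", e)
        return False
```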
When I run a clean install I get no output after executing ./governor.py postgres1.yml. However, if I ctrl+c the application I receive the following error:
postgres@sql1:~/governor$ ./governor.py postgres1.yml
^CTraceback (most recent call last):
File "./governor.py", line 48, in <module>
time.sleep(5)
KeyboardInterrupt
pg_ctl: directory "data/postgres" does not exist
Why is the data directory not being created during the first-time run?
Did some startup and shutdown checks, and eventually landed in the following state:
LOG: database system was shut down in recovery at 2015-08-27 11:58:44 CEST
WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
HINT: The database server will regularly poll the pg_xlog subdirectory to check for files placed there.
LOG: entering standby mode
FATAL: requested timeline 8 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/15000028 on timeline 7, but in the history of the requested timeline, the server forked off from that timeline at 0/14000198.
LOG: startup process (PID 2147) exited with exit code 1
LOG: aborting startup due to startup process failure
Can we somehow come back from this?
I had to change some things to get governor to work with something other than localhost.
First I had to change etcd so it uses init and my own config, in /etc/init/etcd.override:
# Override file for etcd Upstart script providing some environment variables
env ETCD_INITIAL_CLUSTER="sql1=http://10.0.0.75:2380,sql2=http://10.0.0.76:2380,etcd=http://10.0.0.77:2380"
env ETCD_INITIAL_CLUSTER_STATE="new"
env ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-01"
env ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.0.0.75:2380"
env ETCD_DATA_DIR="/var/lib/postgresql/governor/data/etcd"
env ETCD_LISTEN_PEER_URLS="http://10.0.0.75:2380"
env ETCD_LISTEN_CLIENT_URLS="http://10.0.0.75:2379"
env ETCD_ADVERTISE_CLIENT_URLS="http://10.0.0.75:2379"
env ETCD_NAME="sql1"
In helpers/postgresql.py I added the following line:
f.write("host all all %(self)s trust\n" % {"self": self.replication["self"]})
right after:
def write_pg_hba(self):
    f = open("%s/pg_hba.conf" % self.data_dir, "a")
and lastly added:
self: 10.0.0.0/24
to the postgresX.yml files. This is not ideal, as now the whole 10.0.0.0/24 network is trusted, but it did the trick for now. I would like to know your thoughts on this.
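A tighter alternative to trusting the whole subnet is to write one pg_hba.conf host line per known member address, e.g. one /32 per member pulled from the members list in etcd. A sketch (function name and auth method choice are illustrative):

```python
def pg_hba_lines(member_ips, method="md5"):
    """Generate one pg_hba.conf host line per member IP, instead of
    trusting a whole subnet.

    Using md5 rather than trust also requires setting passwords for the
    replication/application users; `method` is configurable here to make
    that trade-off explicit.
    """
    return ["host all all %s/32 %s" % (ip, method) for ip in member_ips]
```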
What does this mean?
FATAL: database system identifier differs between the primary and standby
DETAIL: The primary's identifier is 6150202608832133854, the standby's identifier is 6150207001805245765.