Comments (24)
What I think is happening is that the operator tried to remove the finalizer after deleting the database but that failed for some reason so it requeued the request. But then when it tries to drop the roles again it has to connect to the database it just deleted. This should be handled more gracefully.
Is there no other error in the log?
from postgres-operator.
@arnarg Thanks for looking into this - no others as far as I can tell - but it could be that some error was encountered on a previous run and is now 'lost' behind many, many restarts.
If it's of any further help, I've set the env var POSTGRES_DEFAULT_DATABASE on the Operator deployment to postgres. I was assuming that, with this set, the Operator would always connect to the maintenance DB and would not need the resource-specified DB to exist in order to operate.
Yeah that's probably the case. You can manually edit the Postgres object and remove the finalizers list, and Kubernetes will finish deleting the object. Then you can start using the operator again normally. Has this happened more than once?
> I've set the env. var on the Operator deployment to POSTGRES_DEFAULT_DATABASE to postgres -- I was assuming that with this set, the Operator would always connect to the maintenance DB and not need the resource-specified DB to exist in order to operate.
The thing is, in order to drop a role we first need to reassign all of its objects to the operator's user, and in order to do that we need to connect to the database where the objects are.
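For readers following along, the ordering described above can be sketched roughly as follows. This is a minimal Go sketch, not the operator's actual code: `quoteIdent`, `dropRoleStatements`, and the role names are hypothetical, and whether the operator also issues `DROP OWNED BY` is an assumption. `REASSIGN OWNED BY`, `DROP OWNED BY`, and `DROP ROLE` are standard PostgreSQL statements.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent quotes a PostgreSQL identifier by wrapping it in double quotes
// and doubling any double quotes embedded in the name.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// dropRoleStatements (hypothetical helper) returns the statements in the
// order they have to run. The first two act on objects in the database the
// session is connected to, which is why that database must still exist.
func dropRoleStatements(role, newOwner string) []string {
	return []string{
		fmt.Sprintf("REASSIGN OWNED BY %s TO %s", quoteIdent(role), quoteIdent(newOwner)),
		fmt.Sprintf("DROP OWNED BY %s", quoteIdent(role)),
		fmt.Sprintf("DROP ROLE %s", quoteIdent(role)),
	}
}

func main() {
	// Both names below are made up for illustration.
	for _, stmt := range dropRoleStatements("example-temp-db-group", "operator") {
		fmt.Println(stmt + ";")
	}
}
```

The point of the sketch is the dependency: if the database has already been dropped, the first two statements have nowhere to run, and only the final `DROP ROLE` can still be issued from the maintenance database.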
> Yes, this behavior is consistent.
Can you then recreate this and capture the logs before the first crash?
> Are you saying that the current implementation of dropping databases is fundamentally broken, or is there still a viable solution?
If my theory is correct, then removing the finalizer is failing for some reason, and the operator just needs to handle this more gracefully. But first I'll need to be able to recreate this myself or see some logs.
What does the ClusterRole in use by the pod look like?
Ah ha! I tinkered with the Operator's Pod template so that it hangs around for a while after crashing. After repeating the process, I witnessed this in the logs:
{"level":"info","ts":1579092991.258128,"logger":"controller_postgres","msg":"Dropped database example-temp-db","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
{"level":"error","ts":1579092991.2638128,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"postgres-controller","request":"tmt/example-temp-db","error":"Operation cannot be fulfilled on postgres.db.movetokube.com \"example-temp-db\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
{"level":"info","ts":1579092992.264065,"logger":"controller_postgres","msg":"Reconciling Postgres","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
2020/01/15 12:56:32 failed to connect to PostgreSQL server: pq: database "example-temp-db" does not exist
At this point, the process has exited and would be restarted, entering the crash loop described before.
The ClusterRole this Pod is running as is effectively the same as the example given in the deploy directory:
rules:
- apiGroups:
  - ''
  resources:
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - deployments
  - daemonsets
  - replicasets
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - apps
  resourceNames:
  - dev-postgresql-provisioner-ext-postgresql-operator
  resources:
  - deployments/finalizers
  verbs:
  - update
- apiGroups:
  - db.movetokube.com
  resources:
  - '*'
  verbs:
  - '*'
Interesting. When trying to remove the finalizer, you get this error: Operation cannot be fulfilled on postgres.db.movetokube.com "example-temp-db": the object has been modified; please apply your changes to the latest version and try again. The Kubernetes API is not happy about something. The Postgres object will never be garbage collected if we can't remove the finalizer.
What version of Kubernetes are you running?
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
I'm not super-knowledgeable on RBAC setup; is it possible that .rules[2].resourceNames is incorrect? In the examples given in deploy, it's simply ext-postgresql-operator. However, I'm looking at use-cases where we can provision against different PostgreSQL servers based on the tier the instance is running on (e.g. an in-cluster, shared PostgreSQL instance for dev, another for pre-prod, and RDS for prod).
Does .rules[2].resourceNames have to be literally ext-postgresql-operator, or is it OK for it to vary by instance name instead?
OK, I have the same version of EKS running, so I might be able to recreate this situation there.
This error seems to suggest that, between the time the controller starts processing the request and gets the Postgres object from the API and the time it removes the finalizer and tries to update the object, something else is changing the Postgres object. Is there something else in your cluster that messes with the Postgres object?
> Does .rules[2].resourceNames have to be literally ext-postgresql-operator, or is it OK for it to vary by instance name instead?
It should be the name of the deployment for postgres-operator. But that rule is for the deployment finalizer and is therefore not related to the finalizer in the Postgres object; the last rule covers that.
Quite possibly - the Postgres resource is defined in a Helm chart that is installed into the cluster via Argo CD.
I'll create and delete a Postgres resource outside of these tools to see if the behaviour is any different.
Confirmed, it's likely Argo modifying the resource in some way that the operator doesn't like. Manually creating and then deleting a Resource works as expected with no crash loop.
OK, incidentally we also use ArgoCD, but I have never included a Postgres resource in a chart yet. Thanks for reporting this; I want to get this use case sorted as well.
Thanks for sounding me out and helping to isolate the cause so quickly.
I'm glad you guys sorted this out
I ran some tests with and without ArgoCD. I didn't manage to crash the operator, but I did sometimes see this error: Operation cannot be fulfilled on postgres.db.movetokube.com "my-db": the object has been modified; please apply your changes to the latest version and try again.
I noticed that when deleting the Postgres object in ArgoCD a foregroundDeletion finalizer was added, while deleting with kubectl delete postgres my-db does not add one. I don't think this is the cause but I just thought I'd add it here.
The controller uses a cached getter client. It could be getting a cached Postgres object with an older resourceVersion than what is stored in Kubernetes' datastore. I think this is a more likely cause.
I think what we should do is add a check when dropping a role: before trying to connect, verify that the database where we think the role still owns some resources actually exists. That way, if removing the finalizer fails, the operator should be able to remove it in the next reconcile loop without crashing. We might even want to send a patch instead of an update to remove the finalizer.
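The existence check proposed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the operator's code: the `pgAdmin` interface, `fakeServer`, and `dropRoles` are hypothetical names, and the fake stands in for a real server so the example runs on its own. Against a real server, the existence check could be a query on the `pg_database` catalog by `datname`.

```go
package main

import "fmt"

// pgAdmin is a hypothetical abstraction over the PostgreSQL server.
type pgAdmin interface {
	DatabaseExists(name string) (bool, error)
	ReassignOwned(database, role string) error // needs a connection to database
	DropRole(role string) error
}

// fakeServer is an in-memory stand-in used only to make the sketch runnable.
type fakeServer struct {
	databases map[string]bool
	roles     map[string]bool
}

func (f *fakeServer) DatabaseExists(name string) (bool, error) {
	return f.databases[name], nil
}

func (f *fakeServer) ReassignOwned(database, role string) error {
	if !f.databases[database] {
		return fmt.Errorf("database %q does not exist", database)
	}
	return nil
}

func (f *fakeServer) DropRole(role string) error {
	delete(f.roles, role)
	return nil
}

// dropRoles skips the reassign step when the database is already gone, so a
// requeued request can still drop the roles and go on to remove the finalizer.
// Errors are returned to the caller rather than ending the process.
func dropRoles(pg pgAdmin, database string, roles []string) error {
	exists, err := pg.DatabaseExists(database)
	if err != nil {
		return err
	}
	for _, role := range roles {
		if exists {
			if err := pg.ReassignOwned(database, role); err != nil {
				return err
			}
		}
		if err := pg.DropRole(role); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The database was dropped on an earlier reconcile; only the role remains.
	srv := &fakeServer{
		databases: map[string]bool{},
		roles:     map[string]bool{"my-db-group": true},
	}
	err := dropRoles(srv, "my-db", []string{"my-db-group"})
	fmt.Println("error:", err, "roles left:", len(srv.roles))
}
```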
Commenting from the peanut gallery, it seems that patching the current in-cluster resource to remove the finalizer would be more sensible than sending an update against the current in-memory version of the resource.
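The difference between the two approaches can be illustrated with a toy model. This is only a sketch of optimistic concurrency, assuming a simplified in-memory `store`; it does not use the Kubernetes client libraries, real resourceVersion values are opaque strings rather than integers, and the finalizer name is made up.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict mimics the API server's conflict error seen in the logs.
var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// object is a heavily simplified stand-in for a Postgres custom resource.
type object struct {
	ResourceVersion int
	Finalizers      []string
}

// store is a toy API server holding a single object.
type store struct{ current object }

// Update replaces the whole object and is rejected when the caller's copy
// carries an older resourceVersion than the stored one.
func (s *store) Update(obj object) error {
	if obj.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	obj.ResourceVersion++
	s.current = obj
	return nil
}

// Patch applies a change to whatever the server currently holds.
func (s *store) Patch(mutate func(*object)) {
	mutate(&s.current)
	s.current.ResourceVersion++
}

// without returns list with every occurrence of name removed.
func without(list []string, name string) []string {
	var out []string
	for _, item := range list {
		if item != name {
			out = append(out, item)
		}
	}
	return out
}

func main() {
	const ours = "example.com/postgres-finalizer" // hypothetical finalizer name
	s := &store{current: object{ResourceVersion: 1, Finalizers: []string{ours}}}

	// The controller reads the object, possibly from a cache.
	stale := s.current

	// Meanwhile something else changes the object on the server.
	s.Patch(func(o *object) { o.Finalizers = append(o.Finalizers, "foregroundDeletion") })

	// An update based on the stale copy is rejected.
	stale.Finalizers = without(stale.Finalizers, ours)
	fmt.Println("update:", s.Update(stale))

	// A patch removes only our finalizer from the server's current copy.
	s.Patch(func(o *object) { o.Finalizers = without(o.Finalizers, ours) })
	fmt.Println("after patch:", s.current.Finalizers)
}
```

In this model the patch also preserves the finalizer that the other actor added, whereas a successful full update from the in-memory copy would have overwritten it.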
@wms As I can't reproduce the crash, it's a bit difficult to test this on my end. I have made some changes in branch bugfix/35 that I believe fix this bug.
Are you able to build this and test?
Yeah! Looks like that new Patch logic does the trick, and no crashes! I've attached the operator's logs for this test if you're curious...
Great!
Regarding that last error, postgres.db.movetokube.com "devdb-test-deletes" not found: I've been seeing this as well on my EKS cluster. It seems to be garbage collecting objects even though there is a finalizer attached to them. AFAIK that's not how things are supposed to work, and it seems to be a bug with EKS.
The operator is trying to remove the finalizer but Kubernetes has already garbage collected the object, so it doesn't exist anymore. This error does not matter at all if dropping the roles and database was successful, but if anything goes wrong and the request needs to be requeued, the remaining objects will be left in the database.
EDIT: Disregard most of this comment. What is happening is that it's trying to patch the status after the finalizer has been removed and kubernetes has garbage collected it.
@arnarg Thanks for taking the time to diagnose and fix this, I really appreciate it! I figured there was some weird out-of-order stuff happening here but did check in the PostgreSQL server to check that the DB was really gone, and it was!
At this time, I only intend to use this operator for our short-lived, non-production tier(s) to allow us to dynamically create and destroy databases on a single server. However, I'll continue to keep an eye on this and let you know if we encounter any strange situations where the operator thinks the DB is gone but failed.
The fix for this was released with version 0.4.2.