arnarg commented on June 19, 2024

What I think is happening is that the operator tried to remove the finalizer after deleting the database, but that failed for some reason, so it requeued the request. When it then tries to drop the roles again, it has to connect to the database it just deleted. This should be handled more gracefully.

https://github.com/movetokube/postgres-operator/blob/master/pkg/controller/postgres/postgres_controller.go#L103-L130

Is there no other error in the log?

from postgres-operator.

wms commented on June 19, 2024

@arnarg Thanks for looking into this - no others as far as I can tell - but it could be that some error was encountered on a previous run and is now 'lost' behind many, many restarts.

If it's of any further help, I've set the env. var POSTGRES_DEFAULT_DATABASE on the Operator deployment to postgres -- I was assuming that with this set, the Operator would always connect to the maintenance DB and not need the resource-specified DB to exist in order to operate.

arnarg commented on June 19, 2024

Yeah that's probably the case. You can manually edit the Postgres object and remove the finalizers list and kubernetes will finish deleting the object. Then you can start using the operator again normally. Has this happened more than once?

> I've set the env. var POSTGRES_DEFAULT_DATABASE on the Operator deployment to postgres -- I was assuming that with this set, the Operator would always connect to the maintenance DB and not need the resource-specified DB to exist in order to operate.

The thing is, in order to drop roles we first need to reassign all of their objects to the operator's user, and to do that we need to connect to the database where those objects are.
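For illustration, here is a minimal Go sketch of the statement sequence this implies. It is not taken from the operator's source: the helper names (quoteIdent, dropRoleStatements) and the role names are hypothetical, and the DROP OWNED step is an assumption about what a complete cleanup might include. REASSIGN OWNED only affects objects in the database the session is connected to (plus shared objects), which is why a connection to the specific database is needed.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent quotes a PostgreSQL identifier by wrapping it in double quotes
// and doubling any embedded double quotes.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// dropRoleStatements returns, in order, the statements for dropping a role.
// The first two must run while connected to each database in which the role
// owns objects; DROP ROLE itself is cluster-wide.
// Hypothetical helper, not part of the operator.
func dropRoleStatements(role, newOwner string) []string {
	return []string{
		fmt.Sprintf("REASSIGN OWNED BY %s TO %s", quoteIdent(role), quoteIdent(newOwner)),
		fmt.Sprintf("DROP OWNED BY %s", quoteIdent(role)),
		fmt.Sprintf("DROP ROLE %s", quoteIdent(role)),
	}
}

func main() {
	// Role names below are made up for the example.
	for _, stmt := range dropRoleStatements("example-temp-db-owner", "operator") {
		fmt.Println(stmt + ";")
	}
}
```

If the database has already been dropped, its objects are gone with it, so the first two statements have nothing to do there and only the DROP ROLE step remains.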

wms commented on June 19, 2024

arnarg commented on June 19, 2024

> Yes, this behavior is consistent.

Can you then recreate this and capture the logs before the first crash?

> Are you saying that the current implementation of dropping databases is fundamentally broken, or is there still a viable solution?

If my theory is correct, removing the finalizer is failing for some reason and the operator just needs to handle that more gracefully. But first I'll need to be able to recreate this myself or see some logs.

What does the ClusterRole in use by the pod look like?

wms commented on June 19, 2024

Ah ha! I tinkered with the Operator's Pod template so that it hangs around for a while after crashing. After repeating the process, I witnessed this in the logs:

{"level":"info","ts":1579092991.258128,"logger":"controller_postgres","msg":"Dropped database example-temp-db","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
{"level":"error","ts":1579092991.2638128,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"postgres-controller","request":"tmt/example-temp-db","error":"Operation cannot be fulfilled on postgres.db.movetokube.com \"example-temp-db\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88"}
{"level":"info","ts":1579092992.264065,"logger":"controller_postgres","msg":"Reconciling Postgres","Request.Namespace":"tmt","Request.Name":"example-temp-db"}
2020/01/15 12:56:32 failed to connect to PostgreSQL server: pq: database "example-temp-db" does not exist

At this point, the process has exited and would be restarted, entering the crash loop described before.

The ClusterRole this Pod is running as is effectively the same as the example given in the deploy directory:

rules:
  - apiGroups:
      - ''
    resources:
      - pods
      - services
      - endpoints
      - persistentvolumeclaims
      - events
      - configmaps
      - secrets
    verbs:
      - '*'
  - apiGroups:
      - apps
    resources:
      - deployments
      - daemonsets
      - replicasets
      - statefulsets
    verbs:
      - '*'
  - apiGroups:
      - apps
    resourceNames:
      - dev-postgresql-provisioner-ext-postgresql-operator
    resources:
      - deployments/finalizers
    verbs:
      - update
  - apiGroups:
      - db.movetokube.com
    resources:
      - '*'
    verbs:
      - '*'

arnarg commented on June 19, 2024

Interesting. When trying to remove the finalizer, you get the error: Operation cannot be fulfilled on postgres.db.movetokube.com "example-temp-db": the object has been modified; please apply your changes to the latest version and try again. The Kubernetes API is rejecting the update for some reason, and the Postgres object will never be garbage collected if we can't remove the finalizer.

What version of Kubernetes are you running?

wms commented on June 19, 2024

Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-c0eccc", GitCommit:"c0eccca51d7500bb03b2f163dd8d534ffeb2f7a2", GitTreeState:"clean", BuildDate:"2019-12-22T23:14:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

wms commented on June 19, 2024

I'm not super-knowledgeable on RBAC setup; is it possible that .rules[2].resourceNames is incorrect? In the examples given in deploy, it's simply ext-postgresql-operator. However, I'm looking at use-cases where we can provision against different PostgreSQL servers based on the tier the instance is running on (e.g. an in-cluster, shared PostgreSQL instance for dev, another for pre-prod, and RDS for prod).

Does .rules[2].resourceNames have to be literally ext-postgresql-operator, or is it OK to name it by instance instead?

arnarg commented on June 19, 2024

Ok I have the same version of EKS running where I might be able to try to recreate this situation.

This error seems to suggest that between the time the controller starts processing the request and gets the Postgres object from the API, and the time it removes the finalizer and tries to update the object, something else is changing the Postgres object. Is there something else in your cluster that modifies the Postgres object?

> Does .rules[2].resourceNames have to be literally ext-postgresql-operator, or is it OK to name it by instance instead?

It should be the name of the deployment for postgres-operator. But that rule is for the deployment finalizer and is therefore not related to the finalizer on the Postgres object; the last rule covers that.

wms commented on June 19, 2024

Quite possibly - the Postgres resource is defined in a Helm chart that is installed into the cluster via Argo CD.

wms commented on June 19, 2024

I'll create and delete a Postgres resource outside of these tools to see if the behaviour is any different.

wms commented on June 19, 2024

Confirmed, it's likely Argo modifying the resource in some way that the operator doesn't like. Manually creating and then deleting a Resource works as expected with no crash loop.

arnarg commented on June 19, 2024

Ok, incidentally we also use ArgoCD, but I have not yet included a Postgres resource in a chart. Thanks for reporting this; I want to get this use case sorted as well.

wms commented on June 19, 2024

Thanks for sounding me out and helping to isolate the cause so quickly.

hitman99 commented on June 19, 2024

I'm glad you guys sorted this out

arnarg commented on June 19, 2024

I ran some tests with and without ArgoCD. I didn't manage to crash the operator, but I did sometimes see the error Operation cannot be fulfilled on postgres.db.movetokube.com "my-db": the object has been modified; please apply your changes to the latest version and try again.

I noticed that deleting the Postgres object in ArgoCD adds a foregroundDeletion finalizer, while deleting it with kubectl delete postgres my-db does not. I don't think this is the cause, but I thought I'd mention it here.

The controller uses a cached getter client. It could be getting a cached Postgres with an older resourceVersion than what is stored in kubernetes' datastore. I think this is a more likely cause.

I think we should add a check when dropping a role: before connecting, verify that the database in which we think the role still owns some resources actually exists. That way, if removing the finalizer fails, the next reconcile loop can remove it without crashing. We might even want to send a patch instead of an update to remove the finalizer.
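A rough Go sketch of that existence check follows. The catalog interface, fakeCatalog, and reassignIfDatabaseExists are names made up for this example and are not the operator's API; a real implementation could answer DatabaseExists by querying pg_database on the maintenance database, for example with SELECT 1 FROM pg_database WHERE datname = $1. The fake here only exists so the sketch runs without a server.

```go
package main

import "fmt"

// catalog abstracts the two PostgreSQL operations this sketch needs.
// Hypothetical interface, not taken from the operator.
type catalog interface {
	// DatabaseExists could be backed by a pg_database lookup made
	// over a connection to the maintenance database.
	DatabaseExists(name string) (bool, error)
	// ReassignOwned connects to database and reassigns role's objects.
	ReassignOwned(database, role, newOwner string) error
}

// reassignIfDatabaseExists skips the reassign step when the database is
// already gone instead of failing on the connection attempt.
// It reports whether the reassign step actually ran.
func reassignIfDatabaseExists(c catalog, database, role, newOwner string) (bool, error) {
	exists, err := c.DatabaseExists(database)
	if err != nil {
		return false, err
	}
	if !exists {
		// Nothing to reassign: the objects were dropped with the database.
		return false, nil
	}
	return true, c.ReassignOwned(database, role, newOwner)
}

// fakeCatalog is an in-memory stand-in used only for the example.
type fakeCatalog struct {
	databases  map[string]bool
	reassigned []string
}

func (f *fakeCatalog) DatabaseExists(name string) (bool, error) {
	return f.databases[name], nil
}

func (f *fakeCatalog) ReassignOwned(database, role, newOwner string) error {
	f.reassigned = append(f.reassigned, database)
	return nil
}

func main() {
	c := &fakeCatalog{databases: map[string]bool{"postgres": true}}

	// The database was dropped in an earlier reconcile, so the step is skipped.
	ran, err := reassignIfDatabaseExists(c, "example-temp-db", "example-role", "operator")
	fmt.Println(ran, err)
}
```

With this shape, a requeued request that arrives after the database was dropped can go straight to dropping the role and removing the finalizer.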

wms commented on June 19, 2024

Commenting from the peanut gallery, it seems that patching the current in-cluster resource to remove the finalizer would be more sensible than sending an update against the current in-memory version of the resource.
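As a sketch of what such a patch body could look like, the Go snippet below builds a JSON merge patch that sets metadata.finalizers to the list with one entry removed. The function name finalizerRemovalPatch and the finalizer string finalizer.db.movetokube.com are assumptions made for the example, not values confirmed by the operator's source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finalizerRemovalPatch builds a JSON merge patch that replaces
// metadata.finalizers with the current list minus one entry.
// Hypothetical helper for illustration only.
func finalizerRemovalPatch(current []string, remove string) ([]byte, error) {
	remaining := make([]string, 0, len(current))
	for _, f := range current {
		if f != remove {
			remaining = append(remaining, f)
		}
	}
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"finalizers": remaining,
		},
	}
	return json.Marshal(patch)
}

func main() {
	// The operator's finalizer name below is assumed, not confirmed.
	current := []string{"foregroundDeletion", "finalizer.db.movetokube.com"}
	patch, err := finalizerRemovalPatch(current, "finalizer.db.movetokube.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
	// prints {"metadata":{"finalizers":["foregroundDeletion"]}}
}
```

One trade-off to keep in mind: a merge patch replaces the whole finalizers array, and a patch that carries no resourceVersion is not subject to the optimistic-concurrency check, so a finalizer added by another client between the read and the patch could be overwritten.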

arnarg commented on June 19, 2024

@wms As I can't reproduce the crash, it's a bit difficult to test this on my end. I have made some changes in branch bugfix/35 that I believe fix this bug.

Are you able to build this and test?

wms commented on June 19, 2024

wms commented on June 19, 2024

Yeah! Looks like that new Patch logic does the trick, and no crashes! I've attached the operator's logs for this test if you're curious...

postgresql-operator-logs.txt

arnarg commented on June 19, 2024

Great!

Regarding that last error, postgres.db.movetokube.com "devdb-test-deletes" not found: I've been seeing this as well on my EKS cluster. It seems to be garbage collecting objects even though there is a finalizer attached to them. AFAIK that's not how things are supposed to work, and it seems to be a bug with EKS.

The operator is trying to remove the finalizer but Kubernetes has already garbage collected the object, so it doesn't exist anymore. This error does not matter at all if dropping the roles and database was successful, but if anything goes wrong and the request needs to be requeued, the remaining objects will be left in the database.

EDIT: Disregard most of this comment. What is happening is that the operator is trying to patch the status after the finalizer has been removed and Kubernetes has garbage collected the object.

wms commented on June 19, 2024

@arnarg Thanks for taking the time to diagnose and fix this, I really appreciate it! I figured there was some weird out-of-order stuff happening here, but I did check the PostgreSQL server to confirm that the DB was really gone, and it was!

At this time, I only intend to use this operator for our short-lived, non-production tier(s) to allow us to dynamically create and destroy databases on a single server. However, I'll continue to keep an eye on this and let you know if we encounter any strange situations where the operator thinks the DB is gone but the drop actually failed.

arnarg commented on June 19, 2024

The fix for this was released with version 0.4.2.
