
Comments (15)

galp commented on June 26, 2024

Thanks, it looks like there is a new release from yesterday with the PR you made; I'll give it a try soon.

* thanks, I rechecked, the above shouldn't have been logged if you are above k8s 1.17 (I see you are at 1.20 based on the issue description)

I am on kubernetes 1.27.13

from kadalu.

leelavg commented on June 26, 2024

Second was briefly offline.

  • I believe there should be an option in the kadalu CLI to initiate a full heal, k kadalu healinfo --trigger-full-heal; could you please try that once?

local variable 'sock' referenced before assignment

  • This seems to be a legitimate issue at the kadalu layer as well. I think the code checks that all servers are reachable, but there seems to be a coding error in that check.
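
For readers unfamiliar with this class of Python error, here is a minimal sketch of how a reachability check can raise it. This is not kadalu's actual code: the function names, the host list, and the port handling are hypothetical, and the real check may look quite different.

```python
import socket


def buggy_all_reachable(hosts, port, timeout=1.0):
    """Hypothetical check that reproduces the error class seen in the log.

    `sock` is only bound when create_connection() succeeds, yet the
    `finally` block uses it unconditionally.
    """
    for host in hosts:
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
        except OSError:
            pass  # server unreachable; `sock` was never assigned
        finally:
            sock.close()  # raises UnboundLocalError if the first connect failed
    return True


def all_reachable(hosts, port, timeout=1.0):
    """Same idea, but `sock` is always bound before it is used."""
    for host in hosts:
        sock = None
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
        except OSError:
            return False  # one unreachable server is enough to report failure
        finally:
            if sock is not None:
                sock.close()
    return True
```

With one server down, the first variant fails with an UnboundLocalError (worded as "local variable 'sock' referenced before assignment" on older Python versions), while the second simply returns False.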


galp commented on June 26, 2024

Running the heal command fails:

kubectl exec -nkadalu kadalu-csi-provisioner-0 -c kadalu-provisioner -- /kadalu/client_heal.sh
No storage pool added to kadalu yet!
command terminated with exit code 1

k kadalu healinfo:
Giving heal summary of volume ssd-storage:
Volume ssd-storage does not exist

(I can see it with the storage-list command.)
Thank you for your help.


galp commented on June 26, 2024

I have a second replica3 set on which I can run the command:

kadalu healinfo --name flash-storage --trigger-full-heal
Applying full client-side self-heal on volume flash-storage by crawling recursively:

no more output though


leelavg commented on June 26, 2024

Unfortunately, your CSI container might be getting restarted a lot due to the exception, sorry about that.

no more output though

  • I'd expect at least some info to be shown here; however, maybe due to the above the volume mount didn't happen, so /mnt/$vol is empty inside the provisioner 🤔


galp commented on June 26, 2024

What can I do to get things working again? I can get the lost node restarted in two hours or so but I am also happy to leave it down if it helps debugging.

Is the PR you made something that could help?


leelavg commented on June 26, 2024

Yes, the CSI container restarts should be fixed by the linked PR, as they are a deterrent to further debugging.

No problem, you can continue; there is nothing more to debug. Unless you are willing to follow some hacks or build containers with the fix, I don't think you can get things working at the moment.


galp commented on June 26, 2024

I would like to add another piece of info about the socket issue. kubelet logs these every second; it has been going on for a while, but I didn't pay attention until my logs got full.

W0523 07:21:44.769244 1784671 logging.go:59] [core] [Channel #133 SubChannel #134] grpc: addrConn.createTransport failed to connect to {Addr: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", ServerName: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused"
May 23 07:21:45 node5 kubelet[1784671]: E0523 07:21:45.654462 1784671 kubelet_volumes.go:89] "Pod found, but error occurred during checking mounted volumes from disk" err="fail to check mount point \"/var/lib/kubelet/pods/067adfdc-9b4d-4816-bf58-9dec838c6a0f/volumes/kubernetes.io~csi/pvc-949ada71-8393-455b-bf8c-1d0073e88d40/mount\": stat /var/lib/kubelet/pods/067adfdc-9b4d-4816-bf58-9dec838c6a0f/volumes/kubernetes.io~csi/pvc-949ada71-8393-455b-bf8c-1d0073e88d40/mount: transport endpoint is not connected" podUID=067adfdc-9b4d-4816-bf58-9dec838c6a0f

May 23 07:25:06 node1 kubelet[1004]: E0523 07:25:06.682607 1004 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/kadalu/csi.sock" failed. No retries permitted until 2024-05-23 07:27:08.682510275 +0000 UTC m=+71276.598805835 (durationBeforeRetry 2m2s). Error: RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: context deadline exceeded
And the socket:

root@node1:~# tail /var/lib/kubelet/plugins_registry/kadalu/csi.sock
tail: cannot open '/var/lib/kubelet/plugins_registry/kadalu/csi.sock' for reading: No such device or address
root@node1:~# ls -la /var/lib/kubelet/plugins_registry/kadalu/csi.sock
srwxr-xr-x 1 root root 0 Mar 10 16:50 /var/lib/kubelet/plugins_registry/kadalu/csi.sock


leelavg commented on June 26, 2024

Yes, we did fix that, as far as I recall. Basically, it was about where kubelet checks for registering the plugin. Anyway, I will recheck.

Check against https://kubernetes-csi.github.io/docs/deploying.html#driver-volume-mounts once

Maybe an incorrect fix in #845


galp commented on June 26, 2024

thanks for the help @leelavg

Check against https://kubernetes-csi.github.io/docs/deploying.html#driver-volume-mounts once

is there something for me to mess about in there? I have never touched any csi settings in the past.


leelavg commented on June 26, 2024

is there something for me to mess about in there?

  • Thanks, I rechecked; the above shouldn't have been logged if you are above k8s 1.17 (I see you are at 1.27 based on the issue description)

Let me split and expand on the errors:

Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused [...]
[...] failed to dial socket

  • Probably the nodeplugin is not ready for the sidecar to register the driver

transport endpoint is not connected

  • nothing new, the nodeplugin might've been rebooted

ls -la /var/lib/kubelet/plugins_registry/kadalu/csi.sock

  • The sock file exists, and you can't tail a socket anyway; all good
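
To make the distinctions above concrete, here is a small sketch that classifies a Unix socket path the way the errors do. The helper name and its return labels are hypothetical and not part of kadalu or kubelet. It relies on two standard behaviours: opening a socket file as a regular file fails with "No such device or address" (which is why tail cannot read it), and connecting to a socket file with no listener behind it fails with "connection refused".

```python
import os
import socket
import stat


def probe_unix_socket(path, timeout=1.0):
    """Return 'missing', 'not-a-socket', 'refused', or 'listening'.

    Other errors (for example, permission denied) are left to propagate.
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        try:
            client.connect(path)
        except ConnectionRefusedError:
            # The file exists, but no process is accepting connections on it.
            return "refused"
    return "listening"
```

Run as root on the node, probe_unix_socket("/var/lib/kubelet/plugins_registry/kadalu/csi.sock") returning "refused" would be consistent with the kubelet log: the socket file is present, but the nodeplugin is not currently serving on it.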


galp commented on June 26, 2024

Hello, 1.3.0 should improve my situation a bit. I have just upgraded.
I noticed that there is no csi-nodeplugin.yaml included in the release. Is this correct? Does it mean that csi-node-plugin will stay at 1.2.0?


leelavg commented on June 26, 2024

No, it is packaged with the operator itself and comes up only if the existing nodeplugin on a given node is deleted, giving the admin more flexibility.

It seems I didn't update the doc properly; just follow this, except for the part about explicitly deploying the nodeplugin: https://github.com/kadalu/kadalu/blob/devel/doc/upgrade.adoc#apply-kadalu-csi-nodeplugin


galp commented on June 26, 2024

OK, I get hundreds of these:
[2024-06-23 19:28:24.502760 +0000] E [name.c:383:af_inet_client_get_remote_sockaddr] 0-ssd-storage-client-1: DNS resolution failed on host server-ssd-storage-1-0.ssd-storage

server-ssd-storage-1-0.ssd-storage has been down for a while. Is it normal to get so many of these?

Also a gazillion of these:

Jun 23 19:32:00 node1 kubelet[2073089]: W0623 19:32:00.610079 2073089 logging.go:59] [core] [Channel #149 SubChannel #150] grpc: addrConn.createTransport failed to connect to {Addr: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", ServerName: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused"
Jun 23 19:32:01 node1 kubelet[2073089]: E0623 19:32:01.414177 2073089 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/kadalu/csi.sock" failed. No retries permitted until 2024-06-23 19:34:03.414081975 +0000 UTC m=+2191.682847888 (durationBeforeRetry 2m2s). Error: RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: context deadline exceeded

If any of this is useful to you, I can send more.


leelavg commented on June 26, 2024

node down

  • Yes, all bricks work in a quorum; they try to reach every other brick and log an error when they can't

addrConn.createTransport failed to connect to

  • I believe the log is from the sidecar; what does the Kadalu-nodeplugin container log in this pod?

Unfortunately, I can't assist with data-layer issues; I'm restricted to Kubernetes ops and haven't been able to invest in learning Gluster.

