Not sure this is a bug but your help is appreciated. One of my three

storage unavailable after loss of replica nodes about kadalu HOT 15 OPEN

galp commented on June 26, 2024

storage unavailable after loss of replica nodes

from kadalu.

Comments (15)

galp commented on June 26, 2024

thanks, it looks like there is a new release with the PR you did yesterday, I shall give it a try soon.

thanks, I rechecked again, above shouldn't have been logged if you are above k8s 1.17 (I see you are at 1.20 based on issue desc)

I am on kubernetes 1.27.13


leelavg commented on June 26, 2024

Second was briefly offline.

I believe there should be an option in the kadalu cli to initiate a full heal, k kadalu healinfo --trigger-full-heal, could you pls try that once?

local variable 'sock' referenced before assignment

this seems to be a legit issue at the kadalu layer as well. I think the code checks that all servers are reachable, but there seems to be a coding error in that check.
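For context, that message is the text Python (up to 3.10) attaches to an UnboundLocalError. The sketch below is not kadalu's code, only a minimal illustration of how a reachability check can end up raising it. The function names `is_reachable_buggy` and `is_reachable` are hypothetical.

```python
import socket


def is_reachable_buggy(host, port, timeout=1.0):
    """Hypothetical check that reproduces the UnboundLocalError."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        return True
    except OSError:
        return False
    finally:
        # If create_connection() raised, `sock` was never bound, so this line
        # raises UnboundLocalError and hides the real connection failure.
        sock.close()


def is_reachable(host, port, timeout=1.0):
    """Same check, but the socket is only closed when it was actually opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With one server down, the buggy variant raises instead of returning False, which would fit a container that keeps restarting while a replica is unreachable.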


galp commented on June 26, 2024

Running the heal command fails:

kubectl exec -nkadalu kadalu-csi-provisioner-0 -c kadalu-provisioner -- /kadalu/client_heal.sh
No storage pool added to kadalu yet!
command terminated with exit code 1

k kadalu healinfo :
Giving heal summary of volume ssd-storage:
Volume ssd-storage does not exist

(I can see it with the storage-list command)
Thank you for your help.


galp commented on June 26, 2024

I have a second replica3 set on which I can run the command:

kadalu healinfo --name flash-storage --trigger-full-heal
Applying full client-side self-heal on volume flash-storage by crawling recursively:

no more output though


leelavg commented on June 26, 2024

Unfortunately your CSI container might be getting restarted a lot due to the exception, sry for that.

no more output though

I'd expect at least some info to be shown here. However, maybe due to the above exception the volume mount didn't happen, and so /mnt/$vol is empty inside the provisioner 🤔


galp commented on June 26, 2024

What can I do to get things working again? I can get the lost node restarted in two hours or so but I am also happy to leave it down if it helps debugging.

Is the PR you made something that could help?


leelavg commented on June 26, 2024

Yes, the CSI container restarting should be fixed by the linked PR, as it's a deterrent to further debugging.

Np, you can continue, nothing more to debug. Unless you are willing to follow some hacks or build containers w/ the fix, I don't think you can get things working atm.


galp commented on June 26, 2024

I would like to add another piece of info about the socket issue. kubelet logs these every second. It has been going on for a while, but I didn't pay attention until my logs got full.

W0523 07:21:44.769244 1784671 logging.go:59] [core] [Channel #133 SubChannel #134] grpc: addrConn.createTransport failed to connect to {Addr: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", ServerName: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused"
May 23 07:21:45 node5 kubelet[1784671]: E0523 07:21:45.654462 1784671 kubelet_volumes.go:89] "Pod found, but error occurred during checking mounted volumes from disk" err="fail to check mount point \"/var/lib/kubelet/pods/067adfdc-9b4d-4816-bf58-9dec838c6a0f/volumes/kubernetes.io~csi/pvc-949ada71-8393-455b-bf8c-1d0073e88d40/mount\": stat /var/lib/kubelet/pods/067adfdc-9b4d-4816-bf58-9dec838c6a0f/volumes/kubernetes.io~csi/pvc-949ada71-8393-455b-bf8c-1d0073e88d40/mount: transport endpoint is not connected" podUID=067adfdc-9b4d-4816-bf58-9dec838c6a0f

May 23 07:25:06 node1 kubelet[1004]: E0523 07:25:06.682607 1004 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/kadalu/csi.sock" failed. No retries permitted until 2024-05-23 07:27:08.682510275 +0000 UTC m=+71276.598805835 (durationBeforeRetry 2m2s). Error: RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: context deadline exceeded
and the socket:

root@node1:~# tail /var/lib/kubelet/plugins_registry/kadalu/csi.sock
tail: cannot open '/var/lib/kubelet/plugins_registry/kadalu/csi.sock' for reading: No such device or address
root@node1:~# ls -la /var/lib/kubelet/plugins_registry/kadalu/csi.sock
srwxr-xr-x 1 root root 0 Mar 10 16:50 /var/lib/kubelet/plugins_registry/kadalu/csi.sock


leelavg commented on June 26, 2024

ya, we did fix that afair. Basically, it was about where kubelet checks for registering the plugin. Anyways, will recheck.

Check against https://kubernetes-csi.github.io/docs/deploying.html#driver-volume-mounts once

Maybe an incorrect fix in #845


galp commented on June 26, 2024

thanks for the help @leelavg

Check against https://kubernetes-csi.github.io/docs/deploying.html#driver-volume-mounts once

is there something for me to mess about in there? I have never touched any csi settings in the past.


leelavg commented on June 26, 2024

is there something for me to mess about in there?

thanks, I rechecked again, above shouldn't have been logged if you are above k8s 1.17 (I see you are at ~~1.20~~ 1.27 based on issue desc)

Let me split and expand on the errors:

Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused [...]
[...] failed to dial socket

probably the nodeplugin is not ready for the sidecar to register the driver

transport endpoint is not connected

nothing new, the nodeplugin might've been rebooted

ls -la /var/lib/kubelet/plugins_registry/kadalu/csi.sock

sock file exists and you can't tail anyways, all good
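A socket file that exists on disk can still have nothing listening behind it, which is what "connection refused" points at. As a rough sketch (the helper name `probe_unix_socket` and its return labels are made up for illustration, nothing here is part of kadalu or kubelet), a connect attempt tells the two cases apart better than `ls` or `tail`:

```python
import errno
import os
import socket
import stat


def probe_unix_socket(path, timeout=1.0):
    """Return 'missing', 'not-a-socket', 'refused', 'listening' or 'error:<NAME>'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(path)
        return "listening"
    except ConnectionRefusedError:
        # The file is there but no process is accepting connections on it.
        return "refused"
    except OSError as exc:
        return "error:" + errno.errorcode.get(exc.errno, "unknown")
    finally:
        client.close()
```

Run on the node as root, e.g. `probe_unix_socket("/var/lib/kubelet/plugins_registry/kadalu/csi.sock")`. A result of "refused" would match the kubelet log lines above.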


galp commented on June 26, 2024

Hello, 1.3.0 should improve my situation a bit. I have just upgraded.
I noticed that there is no csi-nodeplugin.yaml included in the release. Is this correct? Does it mean that csi-node-plugin will stay at 1.2.0?


leelavg commented on June 26, 2024

No, it is packaged w/ the operator itself and comes up only if the existing nodeplugin on a given node is deleted, giving the admin more flexibility.

Seems I didn't update the doc properly, just follow this, except for the part about explicitly deploying the nodeplugin: https://github.com/kadalu/kadalu/blob/devel/doc/upgrade.adoc#apply-kadalu-csi-nodeplugin


galp commented on June 26, 2024

OK, I get 100s of these:
[2024-06-23 19:28:24.502760 +0000] E [name.c:383:af_inet_client_get_remote_sockaddr] 0-ssd-storage-client-1: DNS resolution failed on host server-ssd-storage-1-0.ssd-storage

server-ssd-storage-1-0.ssd-storage has been down for a while. Is it normal to get so many of these?

Also a gazillion of these:

Jun 23 19:32:00 node1 kubelet[2073089]: W0623 19:32:00.610079 2073089 logging.go:59] [core] [Channel #149 SubChannel #150] grpc: addrConn.createTransport failed to connect to {Addr: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", ServerName: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused"
Jun 23 19:32:01 node1 kubelet[2073089]: E0623 19:32:01.414177 2073089 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/kadalu/csi.sock" failed. No retries permitted until 2024-06-23 19:34:03.414081975 +0000 UTC m=+2191.682847888 (durationBeforeRetry 2m2s). Error: RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: context deadline exceeded

If there is anything else useful for you, I can send it.
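To get a feel for the volume without scrolling through everything, a small tally along these lines might help. It is only a sketch: the labels are made up, and the substrings are copied from the log lines quoted in this thread.

```python
from collections import Counter

# Substrings taken from the log lines quoted in this thread; labels are arbitrary.
PATTERNS = {
    "dns-resolution-failed": "DNS resolution failed on host",
    "csi-sock-refused": "connect: connection refused",
    "register-plugin-timeout": "RegisterPlugin error",
}


def summarize(lines):
    """Count log lines per known pattern; everything else is counted as 'other'."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Passing an open log file (or `sys.stdin`) to `summarize` gives one count per message type.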


leelavg commented on June 26, 2024

node down

Ya, all bricks work in a quorum, so they try to reach every other brick and log an error when one is unreachable.
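As a back-of-the-envelope illustration of quorum, assuming a plain majority rule (the actual quorum settings of this pool are not shown in the thread and may differ), the check boils down to:

```python
def has_quorum(bricks_up, replica_count):
    """Majority rule (an assumption): strictly more than half the replicas must be up."""
    return 2 * bricks_up > replica_count


# Replica 3 with one brick down still has a majority; with two down it does not.
one_down = has_quorum(bricks_up=2, replica_count=3)
two_down = has_quorum(bricks_up=1, replica_count=3)
```

Under that assumption a replica3 pool should keep working with a single node lost, while the surviving bricks keep logging failed attempts to reach the missing one.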

addrConn.createTransport failed to connect to

I believe the log is from the sidecar. What do the kadalu-nodeplugin container logs in this pod show?

Unfortunately, I can't assist with data layer issues. I'm limited to kubernetes ops and haven't been able to invest in learning gluster.
