Comments (15)
thanks, it looks like there is a new release with the PR you did yesterday; I shall give it a try soon.
* thanks, I rechecked again, the above shouldn't have been logged if you are above k8s 1.17 (I see you are at 1.20 based on issue desc)
I am on kubernetes 1.27.13
from kadalu.
Second was briefly offline.
- I believe there should be an option in the kadalu CLI to initiate a full heal:
k kadalu healinfo --trigger-full-heal
Could you please try that once?
local variable 'sock' referenced before assignment
- this seems to be a legit issue at the kadalu layer as well. I think the code checks that all servers are reachable, but there seems to be a coding error in that check.
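For readers unfamiliar with this class of error, here is a minimal sketch of how a reachability check can end up raising "local variable 'sock' referenced before assignment". It is not kadalu's actual code: the function names and the injectable `connect` parameter are hypothetical and exist only for illustration. Newer Python versions word the message differently, but the exception type, `UnboundLocalError`, is the same.

```python
import socket


def is_reachable_buggy(addr, timeout=0.5, connect=socket.create_connection):
    """Hypothetical check: breaks when the connection attempt itself fails."""
    try:
        sock = connect(addr, timeout=timeout)
        return True
    except OSError:
        return False
    finally:
        # If connect() raised, 'sock' was never assigned, so this line raises
        # UnboundLocalError and replaces the 'return False' above.
        sock.close()


def is_reachable_fixed(addr, timeout=0.5, connect=socket.create_connection):
    """Same check, with 'sock' bound before the try block."""
    sock = None
    try:
        sock = connect(addr, timeout=timeout)
        return True
    except OSError:
        return False
    finally:
        # Only close what was actually opened.
        if sock is not None:
            sock.close()
```

With one server down, the buggy variant raises instead of reporting "unreachable", which would fit a container crashing as soon as one node is lost.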
from kadalu.
Running the heal command fails:
kubectl exec -nkadalu kadalu-csi-provisioner-0 -c kadalu-provisioner -- /kadalu/client_heal.sh
No storage pool added to kadalu yet!
command terminated with exit code 1
k kadalu healinfo:
Giving heal summary of volume ssd-storage:
Volume ssd-storage does not exist
(I can see it with the storage-list command)
Thank you for your help.
from kadalu.
I have a second replica3 set on which I can run the command:
kadalu healinfo --name flash-storage --trigger-full-heal
Applying full client-side self-heal on volume flash-storage by crawling recursively:
no more output though
from kadalu.
Unfortunately, your CSI container might be getting restarted a lot due to the exception, sorry for that.
no more output though
- I'd expect at least some info to be shown here; however, maybe due to the exception above the volume mount didn't happen, and so /mnt/$vol is empty inside the provisioner 🤔
from kadalu.
What can I do to get things working again? I can get the lost node restarted in two hours or so but I am also happy to leave it down if it helps debugging.
Is the PR you made something that could help?
from kadalu.
Yes, the CSI container restarts should be fixed by the linked PR, as they are a deterrent to further debugging.
Np, you can continue, nothing more to debug. Unless you are willing to follow some hacks or build containers with the fix, I don't think you can get things working atm.
from kadalu.
I would like to add another piece of info about the socket issue. kubelet logs these every second; it has been going on for a while, but I didn't pay attention until my logs got full.
W0523 07:21:44.769244 1784671 logging.go:59] [core] [Channel #133 SubChannel #134] grpc: addrConn.createTransport failed to connect to {Addr: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", ServerName: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused"
May 23 07:21:45 node5 kubelet[1784671]: E0523 07:21:45.654462 1784671 kubelet_volumes.go:89] "Pod found, but error occurred during checking mounted volumes from disk" err="fail to check mount point \"/var/lib/kubelet/pods/067adfdc-9b4d-4816-bf58-9dec838c6a0f/volumes/kubernetes.io~csi/pvc-949ada71-8393-455b-bf8c-1d0073e88d40/mount\": stat /var/lib/kubelet/pods/067adfdc-9b4d-4816-bf58-9dec838c6a0f/volumes/kubernetes.io~csi/pvc-949ada71-8393-455b-bf8c-1d0073e88d40/mount: transport endpoint is not connected" podUID=067adfdc-9b4d-4816-bf58-9dec838c6a0f
May 23 07:25:06 node1 kubelet[1004]: E0523 07:25:06.682607 1004 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/kadalu/csi.sock" failed. No retries permitted until 2024-05-23 07:27:08.682510275 +0000 UTC m=+71276.598805835 (durationBeforeRetry 2m2s). Error: RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: context deadline exceeded
and the socket:
root@node1:~# tail /var/lib/kubelet/plugins_registry/kadalu/csi.sock
tail: cannot open '/var/lib/kubelet/plugins_registry/kadalu/csi.sock' for reading: No such device or address
root@node1:~# ls -la /var/lib/kubelet/plugins_registry/kadalu/csi.sock
srwxr-xr-x 1 root root 0 Mar 10 16:50 /var/lib/kubelet/plugins_registry/kadalu/csi.sock
from kadalu.
ya, we did fix that afair. Basically, it was about where kubelet checks for registering the plugin. Anyways, will recheck.
Check against https://kubernetes-csi.github.io/docs/deploying.html#driver-volume-mounts once
Maybe an incorrect fix in #845
from kadalu.
thanks for the help @leelavg
Check against https://kubernetes-csi.github.io/docs/deploying.html#driver-volume-mounts once
is there something for me to mess about in there? I have never touched any csi settings in the past.
from kadalu.
is there something for me to mess about in there?
- thanks, I rechecked again, the above shouldn't have been logged if you are above k8s 1.17 (I see you are at 1.27 based on issue desc)
Let me split and expand on the errors:
Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused [...]
[...] failed to dial socket
- probably nodeplugin is not ready for sidecar to register the driver
transport endpoint is not connected
- nothing new, the nodeplugin might've been rebooted
ls -la /var/lib/kubelet/plugins_registry/kadalu/csi.sock
- sock file exists and you can't tail anyways, all good
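The distinction drawn above (the socket file exists, but nothing is accepting connections on it) can be checked directly. Below is a minimal sketch, assuming Linux and Python 3; `probe_unix_socket` is a hypothetical helper, not part of kadalu or kubelet. It relies on two facts: a unix socket file cannot be opened for reading like a regular file (which is why `tail` fails with "No such device or address"), and connecting to a socket file with no listener behind it fails with `ECONNREFUSED`.

```python
import errno
import os
import socket
import stat


def probe_unix_socket(path, timeout=1.0):
    """Return (is_socket_file, state) for a unix socket path.

    state is "listening", "missing", or an errno name such as "ECONNREFUSED".
    """
    try:
        is_sock = stat.S_ISSOCK(os.stat(path).st_mode)
    except FileNotFoundError:
        return False, "missing"

    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(path)
        return is_sock, "listening"
    except OSError as exc:
        # A stale socket file (no process listening) typically gives ECONNREFUSED.
        return is_sock, errno.errorcode.get(exc.errno, str(exc))
    finally:
        client.close()
```

Run on the node against `/var/lib/kubelet/plugins_registry/kadalu/csi.sock`, a result of `(True, "ECONNREFUSED")` would match the kubelet messages quoted earlier: the file is there, but no nodeplugin process is serving it at that moment.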
from kadalu.
Hello, 1.3.0 should be improving my situation a bit. I have just upgraded.
I noticed that there is no csi-nodeplugin.yaml included in the release. Is this correct? Does it mean that csi-node-plugin will stay at 1.2.0?
from kadalu.
No, it is packaged with the operator itself and comes up only if the existing nodeplugin on a given node is deleted, giving the admin more flexibility.
Seems I didn't update the doc properly; just follow this, except for the part about explicitly deploying the nodeplugin: https://github.com/kadalu/kadalu/blob/devel/doc/upgrade.adoc#apply-kadalu-csi-nodeplugin
from kadalu.
OK I get 100s of those:
[2024-06-23 19:28:24.502760 +0000] E [name.c:383:af_inet_client_get_remote_sockaddr] 0-ssd-storage-client-1: DNS resolution failed on host server-ssd-storage-1-0.ssd-storage
server-ssd-storage-1-0.ssd-storage has been down for a while. Is it normal to get so many of those?
Also a gazillion of these:
Jun 23 19:32:00 node1 kubelet[2073089]: W0623 19:32:00.610079 2073089 logging.go:59] [core] [Channel #149 SubChannel #150] grpc: addrConn.createTransport failed to connect to {Addr: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", ServerName: "/var/lib/kubelet/plugins_registry/kadalu/csi.sock", }. Err: connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/plugins_registry/kadalu/csi.sock: connect: connection refused"
Jun 23 19:32:01 node1 kubelet[2073089]: E0623 19:32:01.414177 2073089 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins_registry/kadalu/csi.sock" failed. No retries permitted until 2024-06-23 19:34:03.414081975 +0000 UTC m=+2191.682847888 (durationBeforeRetry 2m2s). Error: RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: failed to dial socket /var/lib/kubelet/plugins_registry/kadalu/csi.sock, err: context deadline exceeded
If there is anything useful for you in there, I can send it.
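When the same few messages repeat hundreds of times, a tally per message type and target is often easier to share than raw logs. The following is a rough sketch in Python; the three patterns are copied from the log lines quoted in this thread, while the labels and the `tally` helper are made up for illustration.

```python
import re
from collections import Counter

# Patterns taken from the kubelet and gluster log lines quoted in this thread.
PATTERNS = {
    "connection refused": re.compile(r"dial unix (\S+): connect: connection refused"),
    "register timeout": re.compile(r"RegisterPlugin error -- dial failed at socket (\S+),"),
    "dns failure": re.compile(r"DNS resolution failed on host (\S+)"),
}


def tally(lines):
    """Count matching log lines, keyed by (label, socket path or host)."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(label, match.group(1))] += 1
                break  # one label per line is enough
    return counts
```

For example, `tally(open("kubelet.log"))` (the file name is a placeholder) gives counts that can be pasted into the issue instead of the full log.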
from kadalu.
node down
- Ya, all bricks work in a quorum; they try to reach every other brick and log an error when one is unreachable
addrConn.createTransport failed to connect to
- I believe the log is from the sidecar; what does the Kadalu-nodeplugin container log in this pod?
Unfortunately, I can't assist with data layer issues; I'm restricted to kubernetes ops and couldn't invest in learning gluster.
from kadalu.