Comments (23)
Let me try to explain. The way I understand it, the lock is already acquired in NodeStageVolume on line 170, meaning we are already in a locked state when entering the function maybeUnlockFileEncryption. The reason this does not prevent the race condition is that the VolumeLocks map is local to each node, and multiple nodes can stage & encrypt a volume simultaneously in the RWX case.
Do you agree with this understanding?
Yes, if VolumeLocks do not store their state in Rados, they are definitely not usable.
I tested the Rados omap locking with a simple proof of concept, and it appears to be working well!
During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.
Great work! Sounds really good 👏
However, I believe this can be further improved by creating a dedicated interface for encryption tracking.
If you have a generic "Rados lock/mutex" function, we might be able to use it for other things too. I can't immediately think of anything other than encryption, though. Making it encryption-specific is fine, but ideally it should be easily adaptable later on to be more generic (if we find a use case for it).
Thanks again!
from ceph-csi.
A lock that can time-out would be ideal, I guess. Depending on how complex you want to make it, this could be an approach:
- obtain the lock
- set owner id of the lock (useful for logging)
- set timeout timestamp
- do operation
- (optional) update timeout timestamp to inform others the lock is still active
- do more operation(s)
- release the lock
- blocked while obtaining the lock
- report the owner of the lock
- retry getting the lock, take timeout in account
- if timed-out, warn about it, including the owner
- if the lock was released, just report an informational message
from ceph-csi.
Hey everyone,
I've just opened a draft PR (#4688) that addresses the concurrency issue we've been facing. While it's still a bit rough around the edges and missing some functionality, it implements what is needed to resolve the concurrency issue. I made a checklist in the PR of what could still be done or looked into.
I opened the PR in this state now because I'm going on vacation for the next four weeks and don't want to lose momentum on this issue. I'll be working to finalize as much as possible today, but any remaining tasks are open for anyone who feels inspired to take over.
From our side, @Sunnatillo will be available to assist with this.
Thanks!
@NymanRobin Thank you for taking care of it, Enjoy your vacation
from ceph-csi.
@Sunnatillo yes, I think LockExclusive from rados (https://github.com/ceph/go-ceph/blob/ee25db94c6595ad3dca6cd66d7f290df3baa7c9a/rados/ioctx.go#L463-L508) does what @nixpanic described above. I am fine with using it instead of reinventing the same thing. @nixpanic can you please check this as well?
from ceph-csi.
Hi @Madhu-1 @nixpanic. Should we use LockExclusive from go-ceph, or is @NymanRobin's current implementation fine?
The simpler it can be done, the better!
from ceph-csi.
@z2000l Instead of reopening the issue, can you share more specific details about the problem?
- The fix is not part of any release yet.
- Are you also testing cephfs encryption?
- Have you tried with the canary cephcsi image?
from ceph-csi.
Yes, we're trying to test cephfs encryption. We used Rook to orchestrate our Ceph cluster and applied the patch to Ceph CSI 3.10.2. The patch improved fscrypt a lot, but we still experienced problems when the replica count is high, especially when pods were running on multiple nodes: a subset of the replicas ran without a problem, while some were stuck in ContainerCreating with error messages:
MountVolume.MountDevice failed for volume "pvc-ad63a27d-4651-42ae-a96c-ca21de963a38" : rpc error: code = Internal desc = fscrypt: unsupported state metadata=true kernel_policy=false
We suspect that when multiple nodes try to set up the fscrypt policy at the same time, the failed nodes don't refresh or sync to get the policy and get stuck.
And which canary cephcsi image would you like us to try? Thanks.
from ceph-csi.
@z2000l Thanks for the response, please try quay.io/cephcsi/cephcsi:canary.
Can you also provide the logs and steps to reproduce, so that others can try to reproduce it?
It could be a problem with the fscrypt library we are using; we need to open issues with fscrypt if we find any.
@NymanRobin is this still happening, or is it fixed?
from ceph-csi.
Thanks for notifying me, @Madhu-1. I thought this was fixed, but in my testing I did not use multiple nodes, so this might still be an issue, as pointed out by @z2000l. I will try with multiple nodes and see if I can reproduce it.
from ceph-csi.
Thanks for notifying me, @Madhu-1. I thought this was fixed, but in my testing I did not use multiple nodes, so this might still be an issue, as pointed out by @z2000l. I will try with multiple nodes and see if I can reproduce it.
Thank you 👍🏻
from ceph-csi.
I can confirm that this is consistently reproducible when using a deployment with multiple replicas spread over nodes and RWX accessMode for the PVC. I will start looking into a solution; I am not sure yet whether the problem is in fscrypt or in the ceph-csi implementation.
from ceph-csi.
I now understand the error clearly. It seems that the fscrypt client handles metadata and policy differently, leading to this error. When checking for metadata, if it doesn't exist, we create it. However, this isn't the case for policy checks. This results in scenarios where, for instance, if three pods are created simultaneously, the first creates metadata, and the subsequent two find it. However, none of these pods find the policy, leading to a mismatch because the subsequent two have metadata but no policy.
The solution is to properly set up the policy and metadata initially. Once this is done, the race condition won't occur. As a temporary workaround, one can start the deployment with one replica, wait for it to be in the running state, and then scale up without issue.
I believe a permanent solution shouldn't be too difficult. The best idea I currently have is to check whether the initial setup is done; this check needs to be behind a lock, and the lock can be released once the check passes or the setup is done. I will still do some more thinking about the best course of action, but how does this sound initially to you, @Madhu-1?
EDIT: Okay, actually the pods are spread over nodes, so a mutex won't solve this; it will need a lease or some other mechanism to ensure the setup is run only once.
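To make the interleaving above concrete, here is a small Go sketch of the race as described: metadata is created on demand, but the policy check only reads, so a pod that runs its checks between another pod's two steps sees metadata without a policy. volumeState and its methods are hypothetical stand-ins, not the real fscrypt client; and as the EDIT notes, across nodes the guard must be a lease or cluster-side lock rather than a local mutex:

```go
// Deterministic reconstruction of the metadata/policy race described above.
package main

import (
	"fmt"
	"sync"
)

type volumeState struct {
	mu       sync.Mutex
	metadata bool
	policy   bool
}

// ensureMetadata mirrors the create-if-missing metadata check.
func (v *volumeState) ensureMetadata() { v.metadata = true }

// checkPolicy mirrors the policy check, which never creates anything.
func (v *volumeState) checkPolicy() error {
	if !v.policy {
		return fmt.Errorf("fscrypt: unsupported state metadata=%v kernel_policy=%v",
			v.metadata, v.policy)
	}
	return nil
}

func (v *volumeState) applyPolicy() { v.policy = true }

// guardedSetup holds a lock across the whole check-and-create sequence,
// so a later caller either performs the full setup or finds it complete.
func (v *volumeState) guardedSetup() error {
	v.mu.Lock()
	defer v.mu.Unlock()
	if !v.metadata {
		v.ensureMetadata()
		v.applyPolicy()
	}
	return v.checkPolicy()
}

func main() {
	v := &volumeState{}
	v.ensureMetadata()           // pod A creates the metadata...
	v.ensureMetadata()           // ...pod B finds it already there...
	fmt.Println(v.checkPolicy()) // ...but no policy yet: the reported error
	v.applyPolicy()              // pod A finishes too late for pod B
}
```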
/assign
from ceph-csi.
Thanks for taking this issue! Let us know if you have any questions!
from ceph-csi.
The issue you are trying to assign to yourself is already assigned.
from ceph-csi.
I believe a permanent solution shouldn't be too difficult. The best idea I currently have is to check whether the initial setup is done; this check needs to be behind a lock, and the lock can be released once the check passes or the setup is done. I will still do some more thinking about the best course of action, but how does this sound initially to you, @Madhu-1?
EDIT: Okay, actually the pods are spread over nodes, so a mutex won't solve this; it will need a lease or some other mechanism to ensure the setup is run only once.
Thanks for exploring it. Yes, as you mentioned, internal locking might not be good enough; we need to see if we can have something else for this one.
from ceph-csi.
Hi @NymanRobin,
You might be able to use VolumeLocks from internal/util/idlocker.go for locking too, once you add a new fscrypt operation.
Thanks for looking into this!
from ceph-csi.
Hey @nixpanic!
Thanks for the suggestion but based on our last discussion I still think the Rados omap is the way to go.
Let me try to explain. The way I understand it, the lock is already acquired in NodeStageVolume on line 170, meaning we are already in a locked state when entering the function maybeUnlockFileEncryption. The reason this does not prevent the race condition is that the VolumeLocks map is local to each node, and multiple nodes can stage & encrypt a volume simultaneously in the RWX case.
Do you agree with this understanding?
I think my best idea currently would be to implement an interface similar to the reftracker that would track the encryption of the CephFS filesystem, to ensure the encryption is only set up once.
from ceph-csi.
I tested the Rados omap locking with a simple proof of concept, and it appears to be working well!
During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.
However, I believe this can be further improved by creating a dedicated interface for encryption tracking.
from ceph-csi.
Let me try to explain. The way I understand it, the lock is already acquired in NodeStageVolume on line 170, meaning we are already in a locked state when entering the function maybeUnlockFileEncryption. The reason this does not prevent the race condition is that the VolumeLocks map is local to each node, and multiple nodes can stage & encrypt a volume simultaneously in the RWX case. Do you agree with this understanding?
Yes, if VolumeLocks do not store their state in Rados, they are definitely not usable.
I tested the Rados omap locking with a simple proof of concept, and it appears to be working well!
During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly.
Great work! Sounds really good 👏
@NymanRobin what about the case where we took the lock, csi restarted, and the app pod scaled down to 1, where it needs to run on a specific node instead of two nodes? Will this also help there?
However, I believe this can be further improved by creating a dedicated interface for encryption tracking.
If you have a generic "Rados lock/mutex" function, we might be able to use it for other things too. I can't immediately think of something else than encryption, though. Making it encryption specific is fine, but ideally it is easily adaptable later on to be more generic (if we find a usecase for it).
Thanks again!
from ceph-csi.
@Madhu-1, I currently do not see any problems with the scale-down scenario, but maybe I am missing something?
I also do not foresee any issues with scaling down, as ideally, once a node has set up the encryption, there is no need to acquire the lock again. This is why I considered a specific encryption interface that first does a lookup to check whether the volume has already been encrypted, avoiding the need to lock it again in that case. This approach could be an extension of the lock mechanism. A normal mutex could also work, but it would be slightly slower during large-scale operations; the key point is that the lock will only be active for a very short period of time. On another note, if the update/delete call for the lock fails, my thought is that we need to maintain a list of orphaned locks and attempt to clean them up periodically.
That said, I understand the concern about:
Acquire lock -> pod restarts -> Orphaned lock
This is a valid point I hadn't considered earlier. Handling a pod restart in a locked state is more challenging, and I do not have a perfect solution off the top of my head for this scenario. Maybe implementing an expiry time for the lock could solve this? What do you think, @Madhu-1 and @nixpanic?
from ceph-csi.
Thanks a lot @Madhu-1!
However, I now noticed that the go-ceph library already supports LockExclusive, so it's probably not the best idea to reinvent the wheel as I have done here; we should use that instead. On an initial look, the locking supports the functionality we need. Oh well, if this remains, I will fix my mess after the vacation 😄
from ceph-csi.