Comments (5)

serhii-shnurenko commented on June 28, 2024

Yeah, I just hit this issue too, but on detach.

I assume the reason is that our pod has two mounted CSI volumes, and while the first one is detaching, the server is locked for a short time.

level=debug ts=2019-09-26T16:12:07.571214176Z component=grpc-server msg="handling request" req="volume_id:\"volume_id\" node_id:\"server_id_here\" "
level=info ts=2019-09-26T16:12:07.571564951Z component=api-volume-service msg="detaching volume from server" volume-id=volume_id server-id=server_id_here
level=info ts=2019-09-26T16:12:07.705827634Z component=api-volume-service msg="failed to detach volume" volume-id=volume_id err="cannot perform operation because server is locked (locked)"
level=error ts=2019-09-26T16:12:07.705971793Z component=grpc-server msg="handler failed" err="rpc error: code = Internal desc = failed to unpublish volume: cannot perform operation because server is locked (locked)"

@thcyron
I assume it should be handled here. Or do you have a better idea for fixing this bug for both the Detach and Attach actions?

if err := s.volumeService.Detach(ctx, volume, server); err != nil {
	code := codes.Internal
	switch err {
	case volumes.ErrVolumeNotFound:
		code = codes.NotFound
	case volumes.ErrServerNotFound:
		code = codes.NotFound
	case volumes.ErrLockedServer:
		code = codes.Aborted
	}
	return nil, status.Error(code, fmt.Sprintf("failed to unpublish volume: %s", err))
}

from csi-driver.

serhii-shnurenko commented on June 28, 2024

As far as I understand from this specification:

The ABORTED code means that an operation has already been invoked and is still running for this volume. But the "server_locked" error doesn't mean an operation is already running for this volume; it can be another volume currently attaching to or detaching from the server that causes it to be locked.

I have no experience in Go, but to me it looks like if we return the ABORTED code, the Publish/Unpublish action will not be retried, and the volume will be treated as detached.

I believe it should work as follows:
verify that the server isn't locked, waiting until it is unlocked -> unpublish/publish the volume -> if a "server locked" error comes back, go to the beginning, with some meaningful retry limit of course.

A little test
I decided to test whether the error is reproducible using the hcloud CLI tool. For this purpose I created 1 server and 3 volumes in the empty project we use for testing (IP redacted because some time ago Hetzner assigned the IP of a previously destroyed instance to a new machine):

$ hcloud server list
ID        NAME                        STATUS    IPV4             IPV6                     DATACENTER
3360348   testing-volume-concurency   running   116.203.***.**   2a01:***:***:****::/64   nbg1-dc3
$ hcloud volume list
ID        NAME        SIZE    SERVER   LOCATION
3323542   test-vol1   10 GB   -        nbg1
3323544   test-vol2   10 GB   -        nbg1
3323545   test-vol3   10 GB   -        nbg1

Here is the snippet I used to verify that this error can occur due to concurrent operations running against the same instance:

hcloud volume attach --server <server_id> <vol_name_1> &
hcloud volume attach --server <server_id> <vol_name_2> &
hcloud volume attach --server <server_id> <vol_name_3> &
wait

And here is what I got in the output:

$ hcloud volume attach --server 3360348 test-vol1 &
[1] 8153
$ hcloud volume attach --server 3360348 test-vol2 &
[2] 8154
$ hcloud volume attach --server 3360348 test-vol3 &
[3] 8155
$ wait
hcloud: cannot perform operation because server is locked (locked)
hcloud: cannot perform operation because server is locked (locked)
[1]   Exit 1                  hcloud volume attach --server 3360348 test-vol1
[2]-  Exit 1                  hcloud volume attach --server 3360348 test-vol2
   1s [====================================================================] 100%
Volume 3323545 attached to server testing-volume-concurency
[3]+  Done                    hcloud volume attach --server 3360348 test-vol3

Results: only test-vol3 got attached to the server; the test-vol1 and test-vol2 attachments were rejected by Hetzner due to the "server is locked" error.

To keep this message from getting too long, I'll just mention that the same behavior is observed when detaching multiple volumes concurrently.

thcyron commented on June 28, 2024

According to the code you posted, we already return the Aborted error code when the server is locked:

	case volumes.ErrLockedServer:
		code = codes.Aborted

But that doesn’t seem to work. I would need to debug this.

thcyron commented on June 28, 2024

I think you’re right. We don’t use the ABORTED error code correctly. I’ve created #63 to address this.

LKaemmerling commented on June 28, 2024

This was fixed with #84.
