
Comments (22)

itamarst commented on June 18, 2024

My use case, BTW, is running automated functional tests for our project's geard integration. But this is also worrying for eventual production use.

from geard.

smarterclayton commented on June 18, 2024

@mfojtik can you look into this?


mfojtik commented on June 18, 2024

@smarterclayton sure, I will be looking into this today


mfojtik commented on June 18, 2024

I can reproduce this as well, the same way @itamarst describes. I'm looking at the queue @smarterclayton described in the linked issue.


mfojtik commented on June 18, 2024

@smarterclayton the problem I see is that when you do what @itamarst describes, geard seems to get stuck after a job START:

Jun 12 12:08:06 dev.origin.com gear[29383]: 2014/06/12 12:08:06 204 19.58ms PUT /container/busybox/started
Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.stopContainer, uDKMHJGCBO2wmUASfq4new: &{StoppedContainerStateRe...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END   uDKMHJGCBO2wmUASfq4new
Jun 12 12:08:21 dev.origin.com gear[29383]: 2014/06/12 12:08:21 204 7.73ms PUT /container/busybox/stopped
Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.startContainer, TUVnGWCLizCWHGcT4qBlFA: &{StartedContainerStateR...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END   TUVnGWCLizCWHGcT4qBlFA
Jun 12 12:08:21 dev.origin.com gear[29383]: 2014/06/12 12:08:21 204 7.66ms PUT /container/busybox/started
Jun 12 12:08:27 dev.origin.com gear[29383]: job START *linux.stopContainer, sxSSmdVKKX_GQ7hdtx40Cw: &{StoppedContainerStateRe...ce5f0}

... no END here; the stopContainer job hangs ...
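
To spot a hang like this in a longer log, the START and END lines can be paired up by job ID. A minimal Go sketch, assuming the log format shown above; the function name unfinishedJobs and the regular expressions are illustrative and not part of geard:

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

var (
	// Matches "job START <type>, <id>:" and captures the type and the ID.
	startRe = regexp.MustCompile(`job START (\S+), ([^:\s]+):`)
	// Matches "job END   <id>" and captures the ID.
	endRe = regexp.MustCompile(`job END\s+(\S+)`)
)

// unfinishedJobs returns "<type> <id>" for every job that logged a START
// but no matching END, in the order the STARTs appeared.
func unfinishedJobs(log string) []string {
	open := map[string]string{} // job ID -> job type
	var order []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		line := sc.Text()
		if m := startRe.FindStringSubmatch(line); m != nil {
			open[m[2]] = m[1]
			order = append(order, m[2])
		} else if m := endRe.FindStringSubmatch(line); m != nil {
			delete(open, m[1])
		}
	}
	var out []string
	for _, id := range order {
		if typ, ok := open[id]; ok {
			out = append(out, typ+" "+id)
		}
	}
	return out
}

func main() {
	// Last three job lines from the log above.
	log := `Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.startContainer, TUVnGWCLizCWHGcT4qBlFA: &{StartedContainerStateR...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END   TUVnGWCLizCWHGcT4qBlFA
Jun 12 12:08:27 dev.origin.com gear[29383]: job START *linux.stopContainer, sxSSmdVKKX_GQ7hdtx40Cw: &{StoppedContainerStateRe...ce5f0}`
	for _, j := range unfinishedJobs(log) {
		fmt.Println("no END for", j)
	}
}
```

Run against the excerpt above, this should flag only the final stopContainer job.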


mfojtik commented on June 18, 2024

@smarterclayton fyi, setting Concurrent to 8 does not fix this :-( and you are right, it is the StopUnitJob that is hanging... I'm debugging it now.

UPDATE: It might be a locking problem... go-systemd's startJob method hangs when trying to acquire jobListener.Lock()

I'm on it :-) jobComplete seems to take the lock and under some circumstances never releases it... It seems like StartContainer causes it.


smarterclayton commented on June 18, 2024

Destructive job queuing, maybe (start is using "replace"). Maybe the previous systemd job never completes, so control never returns to the event subscription loop.


mfojtik commented on June 18, 2024

@smarterclayton yeah, that might be it... maybe adding a sleep to jobComplete, to give systemd time to finish the job, would fix this.


smarterclayton commented on June 18, 2024

I don't see how sleeping helps - I just meant that there may be a flawed assumption that jobComplete is ever going to happen.


mfojtik commented on June 18, 2024

@smarterclayton actually both start and stop use 'replace'


mfojtik commented on June 18, 2024

@smarterclayton so adding:

time.Sleep(1 * time.Second)

to both stopContainer and startContainer after calling systemd.Connection().StopUnitJob fixes this.


smarterclayton commented on June 18, 2024

Sure - we just wouldn't use that to fix it. Guess I'm asking what the underlying race condition in go-systemd is.


mfojtik commented on June 18, 2024

@smarterclayton sure, but it proves there is a race between stop/start in go-systemd, so I will do more digging :-)

Btw, I just found that the StartUnit action returns 'canceled' instead of 'done' when you execute it right after a StopUnit action.


mfojtik commented on June 18, 2024

@smarterclayton how about:

err := systemd.Connection().StopUnitJob(unitName, "replace-irreversibly")

seems to work perfectly fine :-)

If "replace-irreversibly" is specified, operate like "replace", but also mark the new jobs as irreversible. This prevents future conflicting transactions from replacing these jobs (or even being enqueued while the irreversible jobs are still pending). Irreversible jobs can still be cancelled using the cancel command.


smarterclayton commented on June 18, 2024

Not what we want, if I'm reading correctly. Subsequent start/stop should cancel a stop/start.


mfojtik commented on June 18, 2024

@smarterclayton well, we can detect whether the start/stop failed with a "Transaction is destructive" error. systemd will reject the operation if another operation is currently being executed. In that case we can just wait and retry.


mfojtik commented on June 18, 2024

@smarterclayton nvmd, just got the 'cancel' part ;-)


smarterclayton commented on June 18, 2024

We don't want to wait and retry - we want to make systemd handle destructive cleanly. There is a bug in go-systemd and it needs to be fixed.

EDIT: since I was the one who added the start job, I'm pretty sure this is just a bug in the event listener logic introduced by the new concept - go-systemd just never had to deal with a cancelled job before (an admin could cause this today by running systemctl at the same time).
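
To make the failure mode concrete, here is a small Go model of a job listener that treats a cancelled job like any other completion. It is a sketch under stated assumptions, not go-systemd's actual code: the jobListener type, its watch and jobComplete methods, and the job paths are all hypothetical. The point is only that the waiter gets woken for a "canceled" result too, and that the lock is released on every path.

```go
package main

import (
	"fmt"
	"sync"
)

// jobListener is a hypothetical model of a job-completion listener.
type jobListener struct {
	mu   sync.Mutex
	jobs map[string]chan string // job path -> channel awaiting the result
}

func newJobListener() *jobListener {
	return &jobListener{jobs: make(map[string]chan string)}
}

// watch registers interest in a job and returns a channel that will
// receive its result string.
func (l *jobListener) watch(path string) <-chan string {
	// Buffered so jobComplete never blocks while holding the lock.
	ch := make(chan string, 1)
	l.mu.Lock()
	l.jobs[path] = ch
	l.mu.Unlock()
	return ch
}

// jobComplete delivers whatever result arrived, including "canceled",
// and releases the lock on every return path via defer.
func (l *jobListener) jobComplete(path, result string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if ch, ok := l.jobs[path]; ok {
		ch <- result
		delete(l.jobs, path)
	}
}

func main() {
	l := newJobListener()
	start := l.watch("/job/1") // a start job
	stop := l.watch("/job/2")  // a stop job queued with "replace"

	// The stop replaces the pending start, so the start is reported
	// as cancelled rather than done.
	l.jobComplete("/job/1", "canceled")
	l.jobComplete("/job/2", "done")

	fmt.Println(<-start, <-stop)
}
```

With this shape, a caller waiting on the start job sees "canceled" and can decide what to do, instead of waiting forever for a "done" that never comes.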


mfojtik commented on June 18, 2024

@smarterclayton I think I have a fix, PR on the way


itamarst commented on June 18, 2024

@mfojtik how's it going? I really want to take the sleep(1) out of my code 😁


smarterclayton commented on June 18, 2024

This was merged - let us know if it's fixed


itamarst commented on June 18, 2024

My functional tests no longer lock up geard, so I think it's good. Thank you!

