Comments (22)
My use case, BTW, is running automated functional tests for our project's geard integration. But this is also worrying for eventual production use.
from geard.
@mfojtik can you look into this?
from geard.
@smarterclayton sure, I will be looking into this today
from geard.
I can reproduce this as well, exactly as @itamarst describes. I'm looking at the queue @smarterclayton described in the linked issue.
from geard.
@smarterclayton the problem I see is that when you do what @itamarst describes, `geard` seems to get stuck in a job START:
Jun 12 12:08:06 dev.origin.com gear[29383]: 2014/06/12 12:08:06 204 19.58ms PUT /container/busybox/started
Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.stopContainer, uDKMHJGCBO2wmUASfq4new: &{StoppedContainerStateRe...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END uDKMHJGCBO2wmUASfq4new
Jun 12 12:08:21 dev.origin.com gear[29383]: 2014/06/12 12:08:21 204 7.73ms PUT /container/busybox/stopped
Jun 12 12:08:21 dev.origin.com gear[29383]: job START *linux.startContainer, TUVnGWCLizCWHGcT4qBlFA: &{StartedContainerStateR...ce5f0}
Jun 12 12:08:21 dev.origin.com gear[29383]: job END TUVnGWCLizCWHGcT4qBlFA
Jun 12 12:08:21 dev.origin.com gear[29383]: 2014/06/12 12:08:21 204 7.66ms PUT /container/busybox/started
Jun 12 12:08:27 dev.origin.com gear[29383]: job START *linux.stopContainer, sxSSmdVKKX_GQ7hdtx40Cw: &{StoppedContainerStateRe...ce5f0}
.... no END
here the stopContainer will hang...
from geard.
@smarterclayton fyi, setting `Concurrent` to 8 does not fix this :-( and you are right, it is the `StopUnitJob` that is hanging... I'm debugging it now.
UPDATE: It might be a locking problem... the go-systemd `startJob` method hangs when trying to acquire `jobListener.Lock()`.
I'm on it :-) The `jobComplete` seems to take a lock and under some circumstances it never releases it... It seems like StartContainer causes it.
from geard.
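The kind of hang described above can be sketched in isolation. This is a minimal illustration, not go-systemd's actual code: the `listener` type, `startJobBuggy`, and `secondCallerCanLock` are hypothetical names, and the sketch simply assumes a caller that waits for a job-completion signal while still holding the listener lock. If that signal never arrives (for example because the job was replaced), every later caller blocks on the lock. It uses `sync.Mutex.TryLock`, so it needs Go 1.18 or newer.

```go
package main

import (
	"fmt"
	"sync"
)

// listener is a hypothetical stand-in for a job listener: it maps job IDs
// to channels that are signalled when the job completes.
type listener struct {
	sync.Mutex
	jobs map[string]chan string
}

// startJobBuggy registers a waiter and then blocks on it while still
// holding the lock. If no completion is ever delivered, the lock is
// never released. The locked channel is closed once the lock is held,
// only so the demo below is deterministic.
func (l *listener) startJobBuggy(id string, locked chan<- struct{}) string {
	l.Lock()
	defer l.Unlock()
	ch := make(chan string, 1)
	l.jobs[id] = ch
	close(locked)
	return <-ch // blocks forever: nobody signals this job in the demo
}

// secondCallerCanLock reports whether a second caller could take the
// lock while the first one is stuck waiting for its job.
func secondCallerCanLock() bool {
	l := &listener{jobs: map[string]chan string{}}
	locked := make(chan struct{})
	go l.startJobBuggy("stop-1", locked)
	<-locked // the first caller now holds the lock
	if l.TryLock() {
		l.Unlock()
		return true
	}
	return false
}

func main() {
	fmt.Println("second caller acquired lock:", secondCallerCanLock())
}
```

In a real daemon the second caller would call `Lock()` rather than `TryLock()` and hang, which matches the missing `job END` line in the log.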
Destructive job queuing maybe (start is using "replace"), maybe the previous systemd job never completes to return to the event subscription loop.
from geard.
@smarterclayton yeah that might be it... maybe putting a sleep into jobComplete to give systemd time to finish the job would fix this.
from geard.
I don't see how sleeping helps - I just meant that there may be a flawed assumption that jobComplete is ever going to happen.
from geard.
@smarterclayton actually both start and stop use 'replace'
from geard.
@smarterclayton so adding `time.Sleep(1 * time.Second)` to both `stopContainer` and `startContainer` after calling `systemd.Connection().StopUnitJob` fixes this.
from geard.
Sure - we just wouldn't use that to fix it. Guess I'm asking what the underlying race condition in go-systemd is.
from geard.
@smarterclayton sure, but it proves there is a race between stop/start in go-systemd, so I will do more digging :-)
Btw, I just found that the `StartUnit` action returns 'canceled' instead of 'done' when you execute it right after the `StopUnit` action.
from geard.
@smarterclayton how about:
err := systemd.Connection().StopUnitJob(unitName, "replace-irreversibly")
seems to work perfectly fine :-)
If "replace-irreversibly" is specified, operate like "replace", but also mark the new jobs as irreversible. This prevents future conflicting transactions from replacing these jobs (or even being enqueued while the irreversible jobs are still pending). Irreversible jobs can still be cancelled using the cancel command.
from geard.
Not what we want, if I'm reading correctly. Subsequent start/stop should cancel a stop/start.
from geard.
@smarterclayton well, we can detect whether the start/stop failed due to a `Transaction is destructive` error. systemd will prevent the operation if another operation is currently being executed. In that case we can just wait and retry.
from geard.
@smarterclayton nvmd, just got the 'cancel' part ;-)
from geard.
We don't want to wait and retry - we want to make systemd handle destructive cleanly. There is a bug in go-systemd and it needs to be fixed.
EDIT: since I was the one who added the start job, I'm pretty sure this is just a bug in the event listener logic introduced by the new concept. go-systemd just never had to deal with a cancelled job before (an admin could cause this today by running systemctl at the same time).
from geard.
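One way to picture an event listener that copes with cancelled jobs is a tracker that hands every waiter a result string, whatever that result is. The sketch below is an illustration under assumptions, not the fix that was later merged: `jobTracker`, `watch`, and `complete` are hypothetical names, and the result strings `done` and `canceled` are taken from the values mentioned earlier in the thread.

```go
package main

import (
	"fmt"
	"sync"
)

// jobTracker is a hypothetical listener that delivers a result to each
// waiter for every finished job, including jobs that were cancelled.
type jobTracker struct {
	mu   sync.Mutex
	jobs map[string]chan string
}

func newJobTracker() *jobTracker {
	return &jobTracker{jobs: make(map[string]chan string)}
}

// watch registers interest in a job and returns a channel that will
// receive its result exactly once.
func (t *jobTracker) watch(id string) <-chan string {
	ch := make(chan string, 1) // buffered so complete never blocks
	t.mu.Lock()
	t.jobs[id] = ch
	t.mu.Unlock()
	return ch
}

// complete is meant to be called from the event loop for every job that
// leaves the queue. The lock is released before the result is sent, so
// a slow or absent waiter cannot stall other callers.
func (t *jobTracker) complete(id, result string) {
	t.mu.Lock()
	ch, ok := t.jobs[id]
	delete(t.jobs, id)
	t.mu.Unlock()
	if ok {
		ch <- result
	}
}

func main() {
	t := newJobTracker()
	stop := t.watch("job-1")
	start := t.watch("job-2")

	// A later start replaces the pending stop, so the stop is cancelled.
	t.complete("job-1", "canceled")
	t.complete("job-2", "done")

	fmt.Println(<-stop, <-start) // prints "canceled done"
}
```

The point of the design is that the waiter on the replaced job is woken with `canceled` instead of waiting for a `done` that never comes.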
@smarterclayton I think I have a fix, PR on the way
from geard.
@mfojtik how's it going? I really want to take the `sleep(1)` out of my code 😁
from geard.
This was merged - let us know if it's fixed
from geard.
My functional tests no longer lock up geard, so I think it's good. Thank you!
from geard.