
sge's People

Contributors

0xaf1f, bodgerer, danpovey, dumain, kunzol, loveshack, njoly, opoplawski


sge's Issues

Revive this?

@njoly @bodgerer I'm wondering how much you guys know about GridEngine internals, or if you know anyone else who might? Dave Love seems to have disappeared again, and I'm wondering who else might have a deep knowledge of GridEngine and be willing to maintain a GitHub-based version of the project?

Deleting exec-host with jobs in 'dr' state is allowed

See email chain pasted below.
The basic issue, I believe, is that you can do
qconf -de some_host
when there are jobs in state 'dr' on that host. That crashes the gridengine master, and restarting it is not possible: message in /var/spool/gridengine/qmaster/messages is:
11/10/2018 16:23:27| main|deb8qmaster|C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

I'm not sure which part of the code deals with this; it should probably be fixed.
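
Until the qmaster itself refuses such a request, a small wrapper around `qconf -de` could guard against it. The following is only a minimal sketch, not a fix for the underlying bug. It assumes that `qstat -u '*' -q '*@HOST'` lists every job still attached to queue instances on the host, including jobs in 'dr' state. The function names `jobs_on_host` and `safe_delete_host` and the `QSTAT`/`QCONF` override variables are hypothetical.

```shell
#!/bin/sh
# Sketch: refuse to delete an exec host while jobs are still listed on it.
# QSTAT and QCONF can be overridden; both variable names are illustrative.
QSTAT=${QSTAT:-qstat}
QCONF=${QCONF:-qconf}

# Print the IDs of all jobs (any user) in queue instances on host $1.
# Returns 2 if qstat itself fails, so a dead qmaster is not read as "no jobs".
jobs_on_host() {
    out=$("$QSTAT" -u '*' -q "*@$1") || return 2
    # Job lines start with a numeric job ID; header lines do not.
    printf '%s\n' "$out" | awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Delete exec host $1 only if no jobs are listed for it.
safe_delete_host() {
    jobs=$(jobs_on_host "$1") || { echo "qstat failed; not deleting $1" >&2; return 2; }
    if [ -n "$jobs" ]; then
        echo "refusing to delete $1; jobs still present:" $jobs >&2
        return 1
    fi
    "$QCONF" -de "$1"
}
```

Usage would be `safe_delete_host some_host` in place of the bare `qconf -de some_host`.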

I was able to fix it, although I suspect that my fix may have been disruptive to the jobs.

Firstly, I believe the problem was that gridengine does not handle a job in deleted state on a host that has itself been deleted, and the qmaster dies when it encounters one. Presumably the real bug is that deleting the host was allowed in the first place.

Anyway, my fix (after backing up the directory /var/spool/gridengine) was to move the file /var/spool/gridengine/spooldb/sge_job to a temporary location, restart the qmaster, add the host back with qconf -ah, stop the qmaster, restore the old database  /var/spool/gridengine/spooldb/sge_job, and restart the qmaster.
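
The sequence above can be written down as a script. This is a rough sketch of those steps, not a tested recovery tool. It is dry-run by default and only prints the commands unless `DRY_RUN=0` is set. The paths and the `gridengine-qmaster` service name are the ones mentioned in this report. The `recover_qmaster` and `run` helpers, the `.bak` backup name and the default temporary location are hypothetical. The script keeps `qconf -ah` as written above, although the qconf man page describes `-ah` as adding an administrative host and `-ae` as adding an execution host, so check which one applies.

```shell
#!/bin/sh
# Sketch of the recovery steps described above. Dry-run by default:
# commands are printed, not executed, unless DRY_RUN=0.
SPOOL=${SPOOL:-/var/spool/gridengine}
JOBDB=$SPOOL/spooldb/sge_job
SERVICE=${SERVICE:-gridengine-qmaster}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# $1: host to add back; $2: temporary location for the job database
# (the default below is a made-up path).
# Steps: back up spool, move job db aside, start qmaster, re-add host,
# stop qmaster, restore job db, start qmaster again.
recover_qmaster() {
    host=$1
    aside=${2:-/root/sge_job.aside}
    run cp -a "$SPOOL" "$SPOOL.bak" &&
    run mv "$JOBDB" "$aside" &&
    run systemctl start "$SERVICE" &&
    run qconf -ah "$host" &&
    run systemctl stop "$SERVICE" &&
    run mv "$aside" "$JOBDB" &&
    run systemctl start "$SERVICE"
}
```

Running `recover_qmaster some_host` with the default settings prints the seven commands in order, which makes it easy to review them before setting `DRY_RUN=0`.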

Before doing that whole procedure, to stop the hosts getting confused, I stopped all the gridengine-exec services. That probably wasn't optimal, because clients like qsub and qstat would still have been able to access the queue in the interim, and it definitely would have confused them and killed some processes. Unfortunately I had to do this on short notice and wasn't sure how to use iptables to close off those ports from outside the qmaster while I did the maintenance; that would have been a better solution.
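
For reference, closing the qmaster port to everything except the local machine might look roughly like the sketch below. It assumes the qmaster listens on TCP port 6444, the usual default for sge_qmaster, so check the local configuration first. It also assumes that rejecting traffic that does not arrive on the loopback interface is an acceptable way to fence off the maintenance window. The helper names are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: temporarily reject non-local connections to the qmaster port.
# 6444 is assumed to be the sge_qmaster port; override QMASTER_PORT if not.
QMASTER_PORT=${QMASTER_PORT:-6444}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# $1 is -I (insert the rule) or -D (delete the same rule again).
qmaster_firewall() {
    run iptables "$1" INPUT -p tcp --dport "$QMASTER_PORT" ! -i lo -j REJECT
}

close_qmaster_port() { qmaster_firewall -I; }
open_qmaster_port()  { qmaster_firewall -D; }
```

The idea is to call `close_qmaster_port` before the maintenance and `open_qmaster_port` afterwards; because the delete uses the same rule specification, it removes exactly the rule that was inserted.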

Also, I hit a hiccup: `systemctl stop gridengine-qmaster` didn't actually work the second time. The process was still running with the old database, so I had to kill it manually and retry.

Anyway this whole episode is making me think more seriously about moving to Univa GridEngine.  I've known for a long time that the free version has a lot of bugs, and I just don't have time to deal with this type of thing.



On Sat, Nov 10, 2018 at 4:49 PM Marshall2, John (SSC/SPC) <[email protected]> wrote:
Hi,

I've never seen this but I would start with:
1) strace qmaster during restart to try to see at which point it is dying (e.g.,
loading a config file)
2) look for any reference to the name of the host you deleted in the spool
area and do some cleanup
3) clean out the jobs spool area

HTH,
John

On Sat, 2018-11-10 at 16:23 -0500, Daniel Povey wrote:
Has anyone found this error, and managed to fix it?
I am in a very difficult situation.
I deleted a host (qconf -de hostname) thinking that the machine no longer existed, but it did exist, and there was a job in 'dr' state there.
After I attempted to force-delete that job (qdel -f job-id), the queue master died with out-of-memory, and now I can't restart qmaster.

So now I don't know how to fix it.  Am I just completely lost now?

Dan

Document the Debian package build process (+other build processes)

@0xaf1f, I am hoping you can help with this. We probably need to do this before we do anything else, because the compilation instructions in this repo are pretty hard to follow.

What I am thinking of is this: it would be good to have instructions that document how to build the Debian package, starting from what the base machine is (let's assume people can use cloud services to spin up a particular base machine) and what needs to be installed first. If you get compilation errors, I likely already have patches for those, as I did manage to compile.

More generally, going forward I think we need clear documentation of the build process on different platforms, preferably with scripts that check dependencies and automate that process. The existing tools make it very unclear. I also want to identify the build processes that "matter", and work on those first. I imagine those are:

  • build Debian package
  • build RedHat package

but after that, we should find out whether it's possible to build this stuff on Mac, BSD, or Windows (those are probably lower priority).
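
As a starting point for the dependency-checking scripts mentioned above, something like the following could report missing build tools before the build is attempted. It is a minimal sketch. The function names are hypothetical, and the actual list of required tools per platform still needs to be worked out, so any list passed to it is only an illustration.

```shell
#!/bin/sh
# Sketch: report which required build tools are missing from PATH.

# Print each of the given command names that cannot be found.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Succeed only if every named tool is available.
check_build_deps() {
    missing=$(missing_commands "$@")
    if [ -n "$missing" ]; then
        echo "missing build tools:" $missing >&2
        return 1
    fi
    echo "all build tools found"
}
```

A build script could call, for example, `check_build_deps cc make` and stop early with a readable message instead of failing halfway through compilation.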

Importing https://gitlab.com/loveshack/sge.git

Per list discussions here
http://gridengine.org/pipermail/users/2018-May/thread.html
(search for 'Son of GridEngine succession'), I am creating this organization and repo, and attempting to import https://gitlab.com/loveshack/sge.git. Unfortunately, due to GitLab bugs, that repo has errors ('git fsck' fails due to an issue mentioned here https://stackoverflow.com/questions/21971941/invalid-author-committer-line-missing-space-before-email).
I am running the fix suggested on that page. It may end up changing the commit hashes.

sge_qmaster segfaulting

Discussed in issue #3, but making a separate issue as it's a separate problem.
This is the problem @entn-at had. I will try to look at @Kunzol's patch over the weekend.

But note: this may actually be linked in a different way to issue #3 because @entn-at was not able to build from source in the first place! Anyway we should fix the bug.

hwloc/autogen/config.h error

Dear developers,
I am trying to install SGE on my Linux 18.04 machine.
When I launch ./aimk -no-java -no-jni -no-secure -spool-classic I encounter the following error:

cc -DSGE_ARCH_STRING=\"lx-amd64\" -O2 -Wstrict-prototypes -DLINUXAMD64 -DLINUXAMD64 -D_GNU_SOURCE -DGETHOSTBYNAME_R6 -DGETHOSTBYADDR_R8  -DTARGET_64BIT  -DSGE_PQS_API -DSPOOLING_classic  -DHAVE_HWLOC=1 -DNO_JNI -DCOMPILE_DC -D__SGE_COMPILE_WITH_GETTEXT__  -D__SGE_NO_USERMAPPING__ -I../common -I../libs -I../libs/uti -I../libs/juti -I../libs/gdi -I../libs/japi -I../libs/sgeobj -I../libs/cull -I../libs/comm -I../libs/comm/lists -I../libs/sched -I../libs/evc -I../libs/evm -I../libs/mir -I../daemons/common -I../daemons/qmaster -I../daemons/execd -I../clients/common -I. -fPIC -c ../libs/sgeobj/sge_binding.c
In file included from ../libs/uti/sge_binding_hlp.h:45:0,
                from ../libs/sgeobj/sge_binding.h:43,
                from ../libs/sgeobj/sge_binding.c:39:
../libs/uti/hwloc.h:49:10: fatal error: hwloc/autogen/config.h: No file or folder of this kind
#include <hwloc/autogen/config.h>
         ^~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
../libs/sgeobj/Makefile:184: recipe for target 'sge_binding.o' failed
make: *** [sge_binding.o] Error 1
not done

Could you help me resolve this problem ?

Marie
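
The compiler error says that the header `hwloc/autogen/config.h` cannot be found on the include path. A quick check like the one below shows whether that header is installed at all. It is only a diagnostic sketch. The `find_header` name is hypothetical, and the directories passed to it are an assumption about where headers usually live.

```shell
#!/bin/sh
# Sketch: look for a header file under a list of include directories.
# $1: header path relative to the include directory; remaining args: directories.
# Prints the first match and returns 0, or returns 1 if none is found.
find_header() {
    header=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$header" ]; then
            echo "$dir/$header"
            return 0
        fi
    done
    return 1
}
```

For example, `find_header hwloc/autogen/config.h /usr/include /usr/local/include || echo "hwloc headers not found"`. If nothing is found, the hwloc development headers are probably not installed; on Debian-style systems that is likely the `libhwloc-dev` package, but that package name is an assumption to verify.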

Importing the original repo

Making a separate issue for this.

@Kunzol told me by email:

as written on Github, here is the link to the Git repo I created from the DARCS repo of SGE.

https://gitlab.bfabric.org/schmidt/sge

As far as I can see this is equal to the DARCS repo (diff). 

I tried importing his repo and it's the same as this repo where expected.

 README                                                               |  7 -------
 README.md                                                            | 15 +++++++++++++++
 debian/rules                                                         |  0
 ... all other diffs are empty.

Unfortunately that's a problem, because I previously determined that this repo is out of sync with the release tarballs that the Debian people were using, and it has been out of sync for some time, at least since 8.1.3. Basically, the release tags in this repo don't line up with the release tarballs. It seems Dave had two repos. This repo may correspond to his "master" version, but I believe there was another "release" version, or something like that. Do you think you could try to do your same process for the release version? I think it's more efficient than my process.

Once you have that, I can test whether it lines up with what the Debian people have, and try to figure out whether his master repo had important differences from the release repo that we need to keep.
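
The "does it line up" test can be as simple as a recursive diff between a checkout of a release tag and the unpacked release tarball. The sketch below assumes both trees already exist as directories. The `trees_match` name and the directory names in the usage line are hypothetical.

```shell
#!/bin/sh
# Sketch: compare a checkout of a release tag against an unpacked tarball.
# $1 and $2 are the two directory trees. Git metadata is ignored.
# Returns 0 if the trees are identical, otherwise lists the differing files.
trees_match() {
    diff -rq -x .git "$1" "$2"
}
```

For example, `trees_match sge-at-tag sge-from-tarball && echo "tag matches tarball"`. Any output lists the files that differ or exist on only one side, which is the set that would need a closer look.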
