
sjm's Introduction

Simple Job Manager (SJM)

Summary

SJM is a program for managing a group of related jobs running on a compute cluster. It provides a convenient method for specifying dependencies between jobs and the resource requirements for each job (e.g. memory, CPU cores). It monitors the status of the jobs so you can tell when the whole group is done. If any of the jobs fails (e.g. due to a compute node crashing) SJM allows you to resume without rerunning the jobs that completed successfully. Finally, SJM provides a portable way to submit jobs to different job schedulers such as Sun Grid Engine or Platform LSF.

Software Requirements

Installing SJM requires the following prerequisites:

  1. GCC to compile the program.

  2. The Boost regex library (http://www.boost.org). If you don't already have it on your system you must install boost before installing SJM. The easiest way is to install a prebuilt package, but you can download the source release from the boost web site and build it yourself. See the instructions here: http://www.boost.org/doc/libs/1_49_0/more/getting_started/index.html Be sure to follow the build instructions in the section called "Prepare to Use a Boost Library Binary".

  3. The TCLAP library (http://tclap.sourceforge.net/). Download and install it before installing SJM.

  4. A job scheduler: either Sun Grid Engine or Platform LSF. SJM doesn't currently support other schedulers but it is relatively easy to add a new one (see Other Job Schedulers below).

  5. For Sun Grid Engine: a parallel environment for running multi-core (threaded) jobs on a single node of the cluster. Common names for this PE are "smp" or "shm". Here is a sample SGE configuration:

$ qconf -sp smp
pe_name            smp
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

  6. Environment modules (optional). SJM has optional features to integrate with environment modules, a system for managing the user environment (see http://modules.sourceforge.net/). It is a convenient way for users to switch between different versions of a software package and to use software packages installed in non-standard directories.

Installation

First download the release. Then unpack it:

$ tar xvzf scg-1.2.0.tgz

Finally, to compile and install sjm into /usr/local/bin run:

$ cd scg-1.2.0
$ ./configure
$ make
$ sudo make install

That's it unless you need to customize the installation. The rest of this section describes various additional options you can use.

To install in a different location use the --prefix option:

$ ./configure --prefix=/destination/directory

The configure script will try to identify the job scheduler on your system, but you can force it to use a particular scheduler using the --with-scheduler option. Supported values include SGE and LSF:

$ ./configure --with-scheduler=SGE

or

$ ./configure --with-scheduler=LSF
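
Before forcing a scheduler it can help to check which scheduler commands are on your PATH. The sketch below is not how the configure script detects the scheduler (that logic is not described here). It assumes that the presence of qconf indicates Sun Grid Engine and the presence of bsub indicates Platform LSF, and the function name guess_scheduler is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: guess the scheduler from the commands found on PATH.
# Assumption: qconf implies Sun Grid Engine, bsub implies Platform LSF.
guess_scheduler() {
    if command -v qconf >/dev/null 2>&1; then
        echo SGE
    elif command -v bsub >/dev/null 2>&1; then
        echo LSF
    else
        echo unknown
    fi
}

scheduler=$(guess_scheduler)
if [ "$scheduler" = unknown ]; then
    echo "No supported scheduler command found on PATH" >&2
else
    # Print the command rather than running it, so it can be reviewed first.
    echo "./configure --with-scheduler=$scheduler"
fi
```

If both commands are present the sketch reports SGE, so pass --with-scheduler explicitly on hosts where more than one scheduler is installed.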

For Sun Grid Engine you must specify the parallel environment for multi-core jobs using the --with-pe option:

$ ./configure --with-pe=smp

If you wish to disable the environment modules integration then use the flag --without-modules:

$ ./configure --without-modules

If the boost library and/or job scheduler library and header files are installed in a non-standard place you need to add compiler options to find these files. Use CPPFLAGS to specify header file locations:

$ ./configure CPPFLAGS="-I/boost/include/dir -I/scheduler/include/dir"

Use LDFLAGS to specify library locations:

$ ./configure LDFLAGS="-L/boost/lib/dir -L/scheduler/lib/dir"

Note that if you use shared libraries and they are in a non-standard location then you may have to set LD_LIBRARY_PATH prior to running sjm.
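
CPPFLAGS and LDFLAGS can be passed in a single configure call, and the same library directories can then be prepended to LD_LIBRARY_PATH at run time. A minimal sketch, assuming the placeholder directories from the examples above; the variable names and the helper prepend_ld_path are hypothetical.

```shell
#!/bin/sh
# Placeholder directories from the examples above; replace with real paths.
BOOST_INC=/boost/include/dir
BOOST_LIB=/boost/lib/dir
SCHED_INC=/scheduler/include/dir
SCHED_LIB=/scheduler/lib/dir

# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH without
# leaving a trailing ':' when the variable was previously empty.
prepend_ld_path() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Show the combined configure command (printed, not executed).
echo "./configure CPPFLAGS=\"-I$BOOST_INC -I$SCHED_INC\" LDFLAGS=\"-L$BOOST_LIB -L$SCHED_LIB\""

# Make the shared libraries visible to the dynamic linker before running sjm.
prepend_ld_path "$SCHED_LIB"
prepend_ld_path "$BOOST_LIB"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

The helper avoids an empty entry in LD_LIBRARY_PATH, which the dynamic linker would otherwise treat as the current directory.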

To see all of the available options run:

$ ./configure --help

To build from a git repository, you must have GNU autoconf and automake installed. First run the following command and then continue with the instructions above:

$ sh bootstrap.sh
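
A quick way to confirm that the autotools prerequisites are present before bootstrapping is sketched below. The helper name missing_tools is hypothetical, and the check only looks for the commands on PATH; it does not verify their versions.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools autoconf automake)
if [ -n "$missing" ]; then
    echo "Install these before running bootstrap.sh:" $missing >&2
else
    echo "autoconf and automake found; run: sh bootstrap.sh"
fi
```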

Usage

See doc/MANUAL.txt for usage instructions. You can run a quick test like this:

$ sjm -i doc/example.sjm
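
The job description file passed to sjm lists the jobs and the ordering between them. The sketch below writes a two-job file modeled on the job files quoted in the issue reports further down this page; doc/MANUAL.txt remains the authoritative reference for the syntax. The queue name normal, the file name two_jobs.sjm and the scratch directory are assumptions, and the script only prints the sjm command instead of submitting anything.

```shell
#!/bin/sh
# Write a two-job SJM file into a scratch directory (location is an assumption).
workdir=$(mktemp -d)
jobfile="$workdir/two_jobs.sjm"

# Keywords below are copied from job files shown elsewhere on this page;
# consult doc/MANUAL.txt for the full syntax.
cat > "$jobfile" <<'EOF'
job_begin
name jobA
time 1h
memory 500m
queue normal
cmd echo "hello from job jobA"
job_end

job_begin
name jobB
time 30m
memory 1g
queue normal
cmd echo "hello from job jobB"
job_end

order jobB after jobA
EOF

echo "Wrote $jobfile"
# Print the command to run rather than submitting jobs automatically.
echo "sjm $jobfile"
```

The final order line is what expresses the dependency: jobB is not started until jobA has finished.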

Note that the integration with environment modules requires a helper script called run_with_env that is installed as part of the package. The installation directory is hardwired into the sjm binary at compile time and must be the same on your compute nodes. If you need to override that value you can set the environment variable RUN_WITH_ENV to the full path of the script prior to running sjm. For example:

$ export RUN_WITH_ENV=/full/path/for/run_with_env
$ sjm ...
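
Because the override must point at an existing script, it can be worth validating the path before exporting it. A minimal sketch follows; the function name set_run_with_env is hypothetical, and the check only confirms that the file is executable on the current host, not on the compute nodes.

```shell
#!/bin/sh
# Hypothetical helper: export RUN_WITH_ENV only if the argument is a full
# path to an executable file on this host.
set_run_with_env() {
    case "$1" in
        /*) ;;
        *) echo "RUN_WITH_ENV must be a full path: $1" >&2; return 1 ;;
    esac
    if [ ! -x "$1" ] || [ -d "$1" ]; then
        echo "not an executable file: $1" >&2
        return 1
    fi
    RUN_WITH_ENV=$1
    export RUN_WITH_ENV
}

# Example call with the placeholder path from above; it reports an error
# unless that file really exists.
set_run_with_env /full/path/for/run_with_env || echo "RUN_WITH_ENV left unchanged"
```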

Other Job Schedulers

If you want to use SJM with a job scheduler that isn't currently supported you will need to write a C++ adapter class. Use src/sge.hh and src/sge.cc as an example. Then add references to the new adapter class in configure.ac, src/Makefile.am and src/job_mon.hh. Please contribute your changes so other people can benefit!

License

SJM is distributed under the BSD 3-clause license. See COPYING for the full license.

Contact

Please send comments, suggestions, bug reports and bug fixes to [email protected].


sjm's Issues

make failed

Hi,
I need your help. The build fails with make.
The failure output is below:
######--------------------------------###########
make
make all-recursive
make[1]: Entering directory `/GPFS01/home/feih/bin/SJM-1.2.0'
Making all in src
make[2]: Entering directory `/GPFS01/home/feih/bin/SJM-1.2.0/src'
g++ -DHAVE_CONFIG_H -I. -I.. -DBINDIR="/GPFS01/home/feih/bin/SJM/bin" -I/GPFS01/ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/lib/../include -g -O2 -MT sjm-sjm.o -MD -MP -MF .deps/sjm-sjm.Tpo -c -o sjm-sjm.o `test -f 'sjm.cc' || echo './'`sjm.cc
mv -f .deps/sjm-sjm.Tpo .deps/sjm-sjm.Po
g++ -DHAVE_CONFIG_H -I. -I.. -DBINDIR="/GPFS01/home/feih/bin/SJM/bin" -I/GPFS01/ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/lib/../include -g -O2 -MT sjm-job_graph.o -MD -MP -MF .deps/sjm-job_graph.Tpo -c -o sjm-job_graph.o `test -f 'job_graph.cc' || echo './'`job_graph.cc
mv -f .deps/sjm-job_graph.Tpo .deps/sjm-job_graph.Po
g++ -DHAVE_CONFIG_H -I. -I.. -DBINDIR="/GPFS01/home/feih/bin/SJM/bin" -I/GPFS01/ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/lib/../include -g -O2 -MT sjm-job_mon.o -MD -MP -MF .deps/sjm-job_mon.Tpo -c -o sjm-job_mon.o `test -f 'job_mon.cc' || echo './'`job_mon.cc
mv -f .deps/sjm-job_mon.Tpo .deps/sjm-job_mon.Po
g++ -DHAVE_CONFIG_H -I. -I.. -DBINDIR="/GPFS01/home/feih/bin/SJM/bin" -I/GPFS01/ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/lib/../include -g -O2 -MT sjm-batch.o -MD -MP -MF .deps/sjm-batch.Tpo -c -o sjm-batch.o `test -f 'batch.cc' || echo './'`batch.cc
mv -f .deps/sjm-batch.Tpo .deps/sjm-batch.Po
g++ -DHAVE_CONFIG_H -I. -I.. -DBINDIR="/GPFS01/home/feih/bin/SJM/bin" -I/GPFS01/ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/lib/../include -g -O2 -MT sjm-lsf.o -MD -MP -MF .deps/sjm-lsf.Tpo -c -o sjm-lsf.o `test -f 'lsf.cc' || echo './'`lsf.cc
lsf.cc: In member function ‘virtual void LsfBatchSystem::init()’:
lsf.cc:61:23: warning: deprecated conversion from string constant to ‘char*’ [-Wwrite-strings]
if (lsb_init("sjm") < 0) {
^
mv -f .deps/sjm-lsf.Tpo .deps/sjm-lsf.Po
g++ -g -O2 -L/GPFS01/ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/lib -o sjm sjm-sjm.o sjm-job_graph.o sjm-job_mon.o sjm-batch.o sjm-lsf.o -lboost_regex -lbat -llsf -lnsl
/GPFS01/ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/lib/liblsf.so: undefined reference to `shm_open'
/GPFS01/ibm/lsf/10.1/linux3.10-glibc2.17-x86_64/lib/liblsf.so: undefined reference to `shm_unlink'
collect2: error: ld returned 1 exit status
make[2]: *** [sjm] Error 1
make[2]: Leaving directory `/GPFS01/home/feih/bin/SJM-1.2.0/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/GPFS01/home/feih/bin/SJM-1.2.0'
make: *** [all] Error 2

LSF10 incompatible and cannot submit jobs

Hi,
We recently upgraded our LSF version from 8 to 10, and SJM no longer works.

There is no error message, but a core dump occurs: a binary file such as "core.123456" is produced, and it is much larger than the original job file.
$ cat example4.sjm
job_begin
name jobA
time 1h
memory 500m
queue normal
project CompBio
cmd_begin
echo "hello from job jobA";
cmd_end
job_end

job_begin
name jobB
time 30m
memory 1g
queue normal
project CompBio
cmd_begin
echo "hello from job jobB";
cmd_end
job_end

order jobB after jobA

$ ~/app/SJM/src/sjm example4.sjm
Status file: example4.sjm.status
Log file: example4.sjm.status.log
Running jobs in the background....

$ ls -lrt
total 7200
...
-rw-r--r-- 1 xxxxx xxxxx 305 Mar 6 14:05 example4.sjm
-rw-r--r-- 1 xxxxx xxxxx 339 Mar 7 12:14 example4.sjm.status
-rw-r--r-- 1 xxxxx xxxxx 83 Mar 7 12:14 example4.sjm.status.log
-rw------- 1 xxxxx xxxxx 11247616 Mar 7 12:14 core.52031

We did some debugging, and the problem seems to be in the job submission call:
LS_LONG_INT jobId = lsb_submit(&req, &reply);
but we are still not sure what exactly went wrong.

I wonder if you can provide some insights?

Thanks a lot,

SJM in crontab

I have a problem when I run SJM from crontab.
It never runs correctly, and the log file contains only a single line like this:
Fri Nov 15 16:52:01 2019: sjm process ID: 9299.

So, can you give me some advice about using SJM in crontab?

Thank you in advance.

Looking forward to your reply.

LSF system, stdout/stderr always goes to my email

I am on the LSF system. I have a job file like this:

job_begin
name jobA
time 1h
memory 500m
queue normal
project CompBio
sge_options -A swang -e example2.jobA.e.txt -o example2.jobA.o.txt
cmd echo "hello from job jobA"
job_end

job_begin
name jobB
time 30m
memory 1g
queue normal
project CompBio
cmd echo "hello from job jobB";
job_end

order jobB after jobA
log_dir /home/iiiit/app/SJM/doc/log

No matter what I do, neither "-e example2.jobA.e.txt -o example2.jobA.o.txt" nor "log_dir /home/iiiit/app/SJM/doc/log" has any effect. All notifications go to my email (for both jobA and jobB).
Can you help?
Thanks,

LSF system, can not get stderr/stdout

Hi,
I can't get stderr/stdout on an LSF system with the jobs below (SJM works well on an SGE system). How can I get my stderr/stdout on LSF?

job_begin
name haha
queue test
status waiting
sched_options -R "rusage[mem=1M]" -o haha.sh.o%J -e haha.sh.e%J
cmd_begin
sh haha.sh
cmd_end
job_end
job_begin
name heihei
queue test
status waiting
sched_options -R "rusage[mem=1M]" -o heihei.sh.o%J -e heihei.sh.e%J
cmd_begin
sh heihei.sh
cmd_end
job_end
order heihei after haha
log_dir ./Test/log
