ics-mu / westlife-cloudify-gromacs

Gromacs portal Cloudify blueprint
When a job is manually aborted in the portal, it is not properly finished in Torque: it is marked as completed on the Torque server, but it is still running on the MOM. Even momctl -d3 doesn't list the job. This is probably a bug in Torque.
14110 ? SLsl 0:19 /usr/sbin/pbs_mom
7083 ? Ss 0:00 \_ -bash
7102 ? S 0:00 | \_ /bin/bash /var/lib/torque/mom_priv/jobs/1.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f gmx
8702 ? S 0:00 | \_ /bin/bash /var/lib/torque/mom_priv/jobs/1.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f
8718 ? Rl 44:46 | \_ /opt/gromacs/bin/mdrun -nice 0 -deffnm gmx-671651-MD-PRE -c gmx-671651-MD-PRE.gro
8792 ? Ss 0:00 \_ -bash
8811 ? S 0:00 \_ /bin/bash /var/lib/torque/mom_priv/jobs/2.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f gmx
9212 ? Rl 1:36 \_ /opt/gromacs/bin/mdrun -nice 0 -deffnm gmx-766933-NPT-1.01325-0.5 -c gmx-766933-NPT-1
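A possible mitigation (a sketch only; the job-script path pattern is taken from the process listing above, and the actual kill/cleanup action is deliberately left out) would be a periodic reaper that compares the jobs the server reports as completed with the job scripts still running under pbs_mom:

```python
import re

def stale_jobs(server_completed, mom_cmdlines):
    """Return job IDs that the server marks as completed but whose job
    script (.../mom_priv/jobs/<jobid>.SC) is still running on the MOM."""
    running = set()
    for cmdline in mom_cmdlines:
        # job scripts appear in ps output as .../mom_priv/jobs/<jobid>.SC
        m = re.search(r'mom_priv/jobs/(\S+)\.SC', cmdline)
        if m:
            running.add(m.group(1))
    return sorted(set(server_completed) & running)
```

The intersection gives exactly the leaked jobs: completed server-side, alive on the node.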
# momctl -d3
Host: stoor42.meta.zcu.cz/stoor42.meta.zcu.cz Version: 4.2.10 PID: 14110
Server[0]: stoor44.meta.zcu.cz (147.228.242.44:15001)
Last Msg From Server: 45 seconds (StatusJob)
WARNING: no messages sent to server
HomeDirectory: /var/lib/torque/mom_priv
stdout/stderr spool directory: '/var/lib/torque/spool/' (1402295 blocks available)
NOTE: syslog enabled
MOM active: 240697 seconds
Check Poll Time: 45 seconds
Server Update Interval: 45 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model: TCP
MemLocked: TRUE (mlock)
TCP Timeout: 60 seconds
Prolog: /var/lib/torque/mom_priv/prologue (disabled)
Alarm Time: 0 of 10 seconds
Trusted Client List: 127.0.0.1:0,127.0.0.1:15003,147.228.242.44:0,147.228.242.45:15003: 0
Copy Command: /usr/bin/scp -rpB
job[1.stoor44.meta.zcu.cz] state=RUNNING cput=0 mem=0 vmem=0 mempressure=0 sidlist=7083
Assigned CPU Count: 0
diagnostics complete
20170107:01/07/2017 16:07:01;0001; pbs_mom.14110;Job;TMomFinalizeJob3;job 2.stoor44.meta.zcu.cz started, pid = 8792
20170107:01/07/2017 16:10:05;0001; pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;job recycled into exiting on SIGNULL/KILL from substate 42
20170107:01/07/2017 16:10:05;0008; pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;job was terminated
20170107:01/07/2017 16:10:05;0080; pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;obit sent to server
20170107:01/07/2017 16:10:07;0080; pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;removed job script
Gromacs version 5.1.4, available via http://www.gromacs.org/Downloads
Build process
$ mkdir gromacs-build && cd gromacs-build
$ cmake -DCMAKE_INSTALL_PREFIX=/opt/gromacs/or/wherever/else -DGMX_MPI=on \
    -DGMX_GPU=on -DGMX_BUILD_OWN_FFTW=on -DGMX_DEFAULT_SUFFIX=OFF -DGMX_SIMD=SSE4.1 \
    ../gromacs-5.1.4
$ make -j as-many-as-you-dare
$ cmake -DCMAKE_INSTALL_PREFIX=/opt/gromacs/or/wherever/else -DGMX_MPI=on \
    -DGMX_GPU=on -DGMX_BUILD_OWN_FFTW=on -DGMX_SIMD=AVX2_256 \
    -DGMX_DEFAULT_SUFFIX=OFF -DGMX_BINARY_SUFFIX=_avx2 -DGMX_LIBS_SUFFIX=_mpiavx \
    -DGMX_BUILD_MDRUN_ONLY=on ../gromacs-5.1.4
$ make -j as-many-as-you-dare
If a scale-out/in workflow fails with a fatal error (e.g., a VM fails to boot and is manually cleaned up), the Cloudify Manager gets stuck in an infinite loop of retries.
Timestamp | Event Type | Log Level | Operation | Node Name | Node ID | Message
---|---|---|---|---|---|---
2018-01-04 20:45:41 | Task failed | - | cloudify.interfaces.lifecycle.start | workerNode | workerNode_jmlkc1 | Task failed 'cloudify_occi_plugin.tasks.start' -> Failed to run occi: F, [2018-01-04T20:45:41.693050 #21070] FATAL -- : [rOCCI-cli] An error occurred! Message: Net::HTTP::Get with ID["85dc1002-3764-45f9-950f-f4c22a57ad3e"] failed! HTTP Response status: [404] Not Found : "Instance with ID 114237 does not exist!" [retry 2463]
2018-01-04 20:45:41 | - | INFO | cloudify.interfaces.lifecycle.start | workerNode | workerNode_jmlkc1 | Exited with code=1
2018-01-04 20:45:39 | - | INFO | cloudify.interfaces.lifecycle.start | workerNode | workerNode_jmlkc1 | Executing ['/usr/local/bin/occi', '--output-format', 'json', '--endpoint', u'https://carach5.ics.muni.cz:11443', '--auth', u'x509', '--user-cred', u'/tmp/x509up_u1000', '--voms', '--action', 'describe', '--resource', u'https://carach5.ics.muni.cz:11443/compute/114237']
2018-01-04 20:45:39 | - | INFO | cloudify.interfaces.lifecycle.start | workerNode | workerNode_jmlkc1 | Starting node
2018-01-04 20:45:38 | Task started | - | cloudify.interfaces.lifecycle.start | workerNode | workerNode_jmlkc1 | Task started 'cloudify_occi_plugin.tasks.start' [retry 2463]
2018-01-04 20:45:38 | Task sent | - | cloudify.interfaces.lifecycle.start | workerNode | workerNode_jmlkc1 | Sending task 'cloudify_occi_plugin.tasks.start' [retry 2463]
Note the 2463 retries of checking a resource that no longer exists.
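A saner retry policy would cap the number of attempts and treat HTTP 404 as a permanent failure. A minimal sketch of such a policy (the status-code classification below is an assumption for illustration, not the occi plugin's actual API):

```python
def should_retry(http_status, attempt, max_retries=10):
    """Decide whether a failed lifecycle task is worth retrying."""
    if http_status == 404:          # resource is gone for good: give up now
        return False
    if attempt >= max_retries:      # bound the retries instead of looping forever
        return False
    return http_status >= 500       # retry only server-side/transient errors
```

With this policy the 404 "Instance does not exist" case above would fail fast instead of retrying thousands of times.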
Setting up CUDA on all machines takes far too long. Make it optional and disabled by default.
On a worker node:
# du -sh /scratch
924M /scratch
# du -sh /home/gromacs/
74G /home/gromacs/
Most of this data should live in /scratch, and old data should be cleaned up.
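The cleanup could be age-based. A sketch of the selection logic (pure function for clarity; in practice the (path, mtime) pairs would come from os.walk over /home/gromacs, and the retention period is an assumed example value):

```python
import time

def cleanup_candidates(entries, max_age_days=7, now=None):
    """Given (path, mtime) pairs, return paths older than max_age_days.
    The caller is expected to collect the pairs via os.walk/getmtime
    and to perform the actual deletion."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return [path for path, mtime in entries if mtime < cutoff]
```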
Currently, the second computational phase is limited to 1 core (a workaround for Torque exclusive-node allocation). This must be fixed; make the number of cores an optional command-line parameter.
Sometimes I see the following errors:
Info: Class[Torque::Server::Service]: Scheduling refresh of Service[pbs_sched]
Info: Class[Torque::Server::Service]: Scheduling refresh of Service[pbs_server]
Notice: /Stage[main]/Torque::Server::Service/Service[pbs_sched]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Torque::Server::Service/Service[pbs_server]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns:
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: Unable to communicate with localhost(127.0.0.1)
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: Cannot connect to specified server host 'localhost'.
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: qmgr: cannot connect to server localhost (errno=111) Connection refused
Error: 'qmgr -a -c 'set server scheduler_iteration = 30' localhost' returned 3 instead of one of [0]
Error: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: change from notrun to 0 failed: 'qmgr -a -c 'set server scheduler_iteration = 30' localhost' returned 3 instead of one of [0]
These errors don't break the deployment; it still works despite them.
@ljocha, could we have the Gromacs portal code easily available from the private GitHub repo, please? (So that I don't have to search for it somewhere in your home directory.)
Currently, scaling works only for n in the range 0 to 1, not for more instances.
2016-12-20 10:15:18 CFY <local> [gromacsPortal_10051] Configuring node
2016-12-20 10:15:18 CFY <local> [torqueMom_bb566->torqueServer_0f83e|preconfigure] Task failed 'fabric_plugin.tasks.run_script' -> Unable to evaluate get_attribute function: More than one node instance found for node "workerNode". Cannot resolve a node instance unambiguously. [retry 1/10]
Information about exclusively allocated nodes is lost after the Torque server is restarted.
$ pbsnodes
cloud255-20.cerit-sc.cz
state = job-exclusive
power_state = Running
np = 8
ntype = cluster
jobs = 0/1.cloud255-21.cerit-sc.cz
...
cloud255-23.cerit-sc.cz
state = job-exclusive
power_state = Running
np = 8
ntype = cluster
jobs = 0/9.cloud255-21.cerit-sc.cz
...
$ systemctl restart pbs_server
$ pbsnodes
cloud255-20.cerit-sc.cz
state = free
power_state = Running
np = 8
ntype = cluster
jobs = 0/1.cloud255-21.cerit-sc.cz
...
cloud255-23.cerit-sc.cz
state = free
power_state = Running
np = 8
ntype = cluster
jobs = 0/9.cloud255-21.cerit-sc.cz
...
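A repair step could detect this inconsistency by parsing pbsnodes output and flagging nodes that still have jobs assigned but are no longer job-exclusive. A sketch (the parsing assumes the record layout shown above: an unindented node name followed by indented `attr = value` lines, records separated by blank lines):

```python
def inconsistent_nodes(pbsnodes_output):
    """Return nodes that still have jobs assigned but whose state is
    'free', i.e. exclusivity was lost (e.g. after a pbs_server restart)."""
    bad = []
    node = state = jobs = None
    for line in pbsnodes_output.splitlines() + [""]:
        stripped = line.strip()
        if line and not line[0].isspace():          # start of a node record
            node, state, jobs = stripped, None, None
        elif stripped.startswith("state ="):
            state = stripped.split("=", 1)[1].strip()
        elif stripped.startswith("jobs ="):
            jobs = stripped.split("=", 1)[1].strip()
        elif not stripped and node is not None:     # blank line ends the record
            if jobs and state == "free":
                bad.append(node)
            node = None
    return bad
```

Such nodes could then be re-marked (e.g. set offline or re-flagged exclusive) by a follow-up qmgr/pbsnodes call.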
A node ran out of disk space. Torque switched it to the down state with an explanatory note. After the error condition cleared, the node did not return to up/free automatically; the pbs_mom on that node had to be restarted manually to repair the state.
$ pbsnodes
node1.localdomain
state = down
power_state = Running
np = 8
ntype = cluster
status = opsys=linux,uname=...,sessions=10062 10109 10274,nsessions=3,nusers=1,idletime=1444966,totmem=8010380kb,availmem=7536480kb,physmem=8010380kb,ncpus=8,loadave=0.00,message=ERROR: torque spool filesystem full,gres=,netload=15776691386,state=free,varattr= ,cpuclock=Fixed,version=6.1.1.1,rectime=1515096744,jobs=
note = ERROR: torque spool filesystem full
mom_service_port = 15002
mom_manager_port = 15003
Consider a proactive operation that fixes such stale states automatically.
Long-term down nodes should also be monitored as part of #10.
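The stale state above is detectable: the node is reported down, yet its own MOM status string already says state=free. A sketch of the check (it assumes pbsnodes records have already been parsed into dicts; the automatic repair, e.g. a pbs_mom restart, is left to the caller):

```python
def stale_down_nodes(nodes):
    """Given parsed pbsnodes records as dicts with 'name', 'state' and the
    raw 'status' attribute string, return nodes marked down whose MOM
    status already reports state=free (candidates for automatic repair)."""
    return [n["name"] for n in nodes
            if "down" in n["state"] and "state=free" in n.get("status", "")]
```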
Cloudify Manager should monitor critical situations and alert on them. This could perhaps be implemented as a dedicated execution workflow triggered when monitoring thresholds are exceeded.
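The threshold check itself is simple. A sketch (metric names and limits below are made-up examples; in a real deployment the alert would trigger the dedicated workflow rather than return strings):

```python
def check_thresholds(metrics, thresholds):
    """Compare current metric values against alert thresholds and return
    one human-readable alert per exceeded metric."""
    return ["%s: %s exceeds threshold %s" % (name, metrics[name], limit)
            for name, limit in sorted(thresholds.items())
            if name in metrics and metrics[name] > limit]
```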