westlife-cloudify-gromacs's Issues

Torque doesn't abort the job properly

When a job is manually aborted from the portal, it is not properly finished in Torque: it is marked as completed on the Torque server, but it is still running on the mom. Even momctl -d3 doesn't list the job. This is probably a bug in Torque.

14110 ?        SLsl   0:19 /usr/sbin/pbs_mom
 7083 ?        Ss     0:00  \_ -bash
 7102 ?        S      0:00  |   \_ /bin/bash /var/lib/torque/mom_priv/jobs/1.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f gmx
 8702 ?        S      0:00  |       \_ /bin/bash /var/lib/torque/mom_priv/jobs/1.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f
 8718 ?        Rl    44:46  |           \_ /opt/gromacs/bin/mdrun -nice 0 -deffnm gmx-671651-MD-PRE -c gmx-671651-MD-PRE.gro
 8792 ?        Ss     0:00  \_ -bash
 8811 ?        S      0:00      \_ /bin/bash /var/lib/torque/mom_priv/jobs/2.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f gmx
 9212 ?        Rl     1:36          \_ /opt/gromacs/bin/mdrun -nice 0 -deffnm gmx-766933-NPT-1.01325-0.5 -c gmx-766933-NPT-1
# momctl -d3

Host: stoor42.meta.zcu.cz/stoor42.meta.zcu.cz   Version: 4.2.10   PID: 14110
Server[0]: stoor44.meta.zcu.cz (147.228.242.44:15001)
  Last Msg From Server:   45 seconds (StatusJob)
  WARNING:  no messages sent to server
HomeDirectory:          /var/lib/torque/mom_priv
stdout/stderr spool directory: '/var/lib/torque/spool/' (1402295blocks available)
NOTE:  syslog enabled
MOM active:             240697 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE  (mlock)
TCP Timeout:            60 seconds
Prolog:                 /var/lib/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:  127.0.0.1:0,127.0.0.1:15003,147.228.242.44:0,147.228.242.45:15003:  0
Copy Command:           /usr/bin/scp -rpB
job[1.stoor44.meta.zcu.cz]  state=RUNNING cput=0 mem=0 vmem=0 mempressure=0 sidlist=7083
Assigned CPU Count:     0

diagnostics complete
20170107:01/07/2017 16:07:01;0001;   pbs_mom.14110;Job;TMomFinalizeJob3;job 2.stoor44.meta.zcu.cz started, pid = 8792
20170107:01/07/2017 16:10:05;0001;   pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;job recycled into exiting on SIGNULL/KILL from substate 42
20170107:01/07/2017 16:10:05;0008;   pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;job was terminated
20170107:01/07/2017 16:10:05;0080;   pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;obit sent to server
20170107:01/07/2017 16:10:07;0080;   pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;removed job script
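
A possible manual cleanup, as a sketch (the job ID and session ID are taken from the listings above; momctl -c asks the mom to purge a stale job):

# momctl -c 1.stoor44.meta.zcu.cz   # purge the stale job on this mom
# kill -- -7083                     # kill the leftover session (sid from sidlist above)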

Build and deploy Gromacs

Version 5.1.4, available via http://www.gromacs.org/Downloads

Build process

$ mkdir gromacs-build && cd gromacs-build
# first pass: full build with a conservative SIMD level
# (install prefix /opt/gromacs, or wherever else)
$ cmake -DCMAKE_INSTALL_PREFIX=/opt/gromacs -DGMX_MPI=on \
    -DGMX_GPU=on -DGMX_BUILD_OWN_FFTW=on -DGMX_DEFAULT_SUFFIX=OFF -DGMX_SIMD=SSE4.1 \
    ../gromacs-5.1.4
$ make -j<N>   # with as many jobs as you dare
# second pass: AVX2 mdrun only, installed with the _avx2 suffix
$ cmake -DCMAKE_INSTALL_PREFIX=/opt/gromacs -DGMX_MPI=on \
    -DGMX_GPU=on -DGMX_BUILD_OWN_FFTW=on -DGMX_SIMD=AVX2_256 \
    -DGMX_DEFAULT_SUFFIX=OFF -DGMX_BINARY_SUFFIX=_avx2 -DGMX_LIBS_SUFFIX=_mpiavx \
    -DGMX_BUILD_MDRUN_ONLY=on ../gromacs-5.1.4
$ make -j<N>
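
Presumably each pass is followed by an install step (not shown above), after which the two binaries can be sanity-checked; the _avx2 suffix comes from GMX_BINARY_SUFFIX:

$ make install                          # installs into CMAKE_INSTALL_PREFIX
$ /opt/gromacs/bin/mdrun -version       # prints build info incl. SIMD level
$ /opt/gromacs/bin/mdrun_avx2 -version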

Execution workflows loop infinitely on fatal errors with Cloudify Manager

If a scale-out/in workflow fails on a fatal error (e.g., a VM fails to boot and is manually cleaned up), the Cloudify Manager gets stuck in an infinite loop of retries.

  Timestamp Event Type Log Level Operation Node Name Node ID Message
  2018-01-04 20:45:41 Task failed - cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Task failed 'cloudify_occi_plugin.tasks.start' -> Failed to run occi: F, [2018-01-04T20:45:41.693050 #21070] FATAL -- : [rOCCI-cli] An error occurred! Message: Net::HTTP::Get with ID["85dc1002-3764-45f9-950f-f4c22a57ad3e"] failed! HTTP Response status: [404] Not Found : "Instance with ID 114237 does not exist!" [retry 2463]
  2018-01-04 20:45:41 - INFO cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Exited with code=1
  2018-01-04 20:45:39 - INFO cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Executing ['/usr/local/bin/occi', '--output-format', 'json', '--endpoint', u'https://carach5.ics.muni.cz:11443', '--auth', u'x509', '--user-cred', u'/tmp/x509up_u1000', '--voms', '--action', 'describe', '--resource', u'https://carach5.ics.muni.cz:11443/compute/114237']
  2018-01-04 20:45:39 - INFO cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Starting node
  2018-01-04 20:45:38 Task started - cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Task started 'cloudify_occi_plugin.tasks.start' [retry 2463]
  2018-01-04 20:45:38 Task sent - cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Sending task 'cloudify_occi_plugin.tasks.start' [retry 2463]

Note the 2463 retries spent just checking a resource that no longer exists.
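
A possible mitigation, as a sketch: cap the retry policy instead of the default infinite (-1) retries. For local executions the Cloudify CLI takes explicit limits, and the manager-wide default was an input of the 3.x manager blueprints (workflow_task_retries); verify both against the installed version. The workflow name below is hypothetical:

$ cfy local execute -w scale_out --task-retries 5 --task-retry-interval 30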

Fix second phase to run across all cores

Currently, the second computational phase is limited to one core (due to a workaround for Torque's exclusive node allocation). This must be fixed by making the number of cores an optional command-line argument; see the sketch below.
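
A minimal sketch of the fix inside the wrapper job script, assuming an optional core count can be passed via the environment (NCORES and PREFIX are hypothetical names) and defaulting to everything Torque assigned:

NCORES=${NCORES:-$(wc -l < "$PBS_NODEFILE")}   # all cores Torque gave the job
mpirun -np "$NCORES" /opt/gromacs/bin/mdrun -nice 0 -deffnm "$PREFIX" -c "$PREFIX.gro"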

Qmgr used before server is ready

Sometimes the following errors show up:

Info: Class[Torque::Server::Service]: Scheduling refresh of Service[pbs_sched]
Info: Class[Torque::Server::Service]: Scheduling refresh of Service[pbs_server]
Notice: /Stage[main]/Torque::Server::Service/Service[pbs_sched]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Torque::Server::Service/Service[pbs_server]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: 
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: Unable to communicate with localhost(127.0.0.1)
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: Cannot connect to specified server host 'localhost'.
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: qmgr: cannot connect to server localhost (errno=111) Connection refused
Error: 'qmgr -a -c 'set server scheduler_iteration = 30' localhost' returned 3 instead of one of [0]
Error: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: change from notrun to 0 failed: 'qmgr -a -c 'set server scheduler_iteration = 30' localhost' returned 3 instead of one of [0]

This doesn't break the deployment; it still works despite these errors.
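
A possible fix, as a sketch: wait until pbs_server accepts connections before applying the qmgr attributes. In Puppet, the Exec parameters tries/try_sleep would achieve the same effect:

for i in $(seq 1 30); do    # wait up to ~60 s for pbs_server to come up
  qmgr -c 'list server' localhost >/dev/null 2>&1 && break
  sleep 2
done
qmgr -a -c 'set server scheduler_iteration = 30' localhost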

Fix 1:n deployment schema

Currently this works only for n ≤ 1, not for more worker instances.

2016-12-20 10:15:18 CFY <local> [gromacsPortal_10051] Configuring node
2016-12-20 10:15:18 CFY <local> [torqueMom_bb566->torqueServer_0f83e|preconfigure] Task failed 'fabric_plugin.tasks.run_script' -> Unable to evaluate get_attribute function: More than one node instance found for node "workerNode". Cannot resolve a node instance unambiguously. [retry 1/10]

Torque server restart loses exclusive node allocations

Information about the exclusively allocated nodes is lost after the Torque server is restarted.

$ pbsnodes
cloud255-20.cerit-sc.cz
     state = job-exclusive
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/1.cloud255-21.cerit-sc.cz
...

cloud255-23.cerit-sc.cz
     state = job-exclusive
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/9.cloud255-21.cerit-sc.cz
...

$ systemctl restart pbs_server

$ pbsnodes
cloud255-20.cerit-sc.cz
     state = free
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/1.cloud255-21.cerit-sc.cz
...

cloud255-23.cerit-sc.cz
     state = free
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/9.cloud255-21.cerit-sc.cz
...
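
A quick consistency check after a restart, as a sketch: list nodes that dropped back to free while still holding job assignments.

$ pbsnodes | awk '
    /^[a-zA-Z0-9]/                { node = $1; free = 0 }   # node-name line
    $1 == "state" && $3 == "free" { free = 1 }
    $1 == "jobs" && free          { print node, "is free but has jobs:", $3 }
  '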

Torque nodes left down after the error condition disappeared

A node ran out of disk space. Torque switched it to the down state with an explanatory note. After the error condition disappeared, the node did not come back to up/free automatically; pbs_mom on the particular node had to be restarted manually to repair the state.

$ pbsnodes
node1.localdomain
     state = down
     power_state = Running
     np = 8
     ntype = cluster
     status = opsys=linux,uname=...,sessions=10062 10109 10274,nsessions=3,nusers=1,idletime=1444966,totmem=8010380kb,availmem=7536480kb,physmem=8010380kb,ncpus=8,loadave=0.00,message=ERROR: torque spool filesystem full,gres=,netload=15776691386,state=free,varattr= ,cpuclock=Fixed,version=6.1.1.1,rectime=1515096744,jobs=
     note = ERROR: torque spool filesystem full
     mom_service_port = 15002
     mom_manager_port = 15003

Consider a proactive operation to fix stale states automatically (see the sketch below).
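
One possible shape of such an operation, run periodically on each mom, as a sketch (the note text matches the output above; the 1 GiB threshold is made up):

note=$(pbsnodes "$(hostname -f)" | sed -n 's/^ *note = //p')
avail=$(df --output=avail /var/lib/torque/spool | tail -1)   # KiB free
if [ "$note" = "ERROR: torque spool filesystem full" ] && [ "$avail" -gt 1048576 ]; then
  systemctl restart pbs_mom   # clears the stale down state, per above
fi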

Also, long-term down nodes should be monitored as part of #10.

Monitoring and alerting

Cloudify Manager should monitor critical situations and alert on them, e.g.:

  • low disk space
  • long-term Torque node down
  • ... ?

Maybe as a dedicated execution workflow triggered when monitoring thresholds are exceeded.
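
A sketch of the checks such a workflow might run (thresholds made up; pbsnodes -l lists down/offline nodes and -n appends their notes):

pcent=$(df --output=pcent /var/lib/torque/spool | tail -1 | tr -dc '0-9')
[ "$pcent" -ge 90 ] && echo "ALERT: spool filesystem ${pcent}% full"
pbsnodes -ln   # any down/offline node, with its note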
