westlife-cloudify-gromacs's Issues

Torque doesn't abort the job properly

When a job is manually aborted from the portal, it is not properly finished in Torque: it is marked as completed on the Torque server, but it is still running on the mom. Even momctl -d3 doesn't list the job. This is probably a bug in Torque.

14110 ?        SLsl   0:19 /usr/sbin/pbs_mom
 7083 ?        Ss     0:00  \_ -bash
 7102 ?        S      0:00  |   \_ /bin/bash /var/lib/torque/mom_priv/jobs/1.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f gmx
 8702 ?        S      0:00  |       \_ /bin/bash /var/lib/torque/mom_priv/jobs/1.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f
 8718 ?        Rl    44:46  |           \_ /opt/gromacs/bin/mdrun -nice 0 -deffnm gmx-671651-MD-PRE -c gmx-671651-MD-PRE.gro
 8792 ?        Ss     0:00  \_ -bash
 8811 ?        S      0:00      \_ /bin/bash /var/lib/torque/mom_priv/jobs/2.stoor44.meta.zcu.cz.SC -d 2.25 -ttau 0.1 -f gmx
 9212 ?        Rl     1:36          \_ /opt/gromacs/bin/mdrun -nice 0 -deffnm gmx-766933-NPT-1.01325-0.5 -c gmx-766933-NPT-1
# momctl -d3

Host: stoor42.meta.zcu.cz/stoor42.meta.zcu.cz   Version: 4.2.10   PID: 14110
Server[0]: stoor44.meta.zcu.cz (147.228.242.44:15001)
  Last Msg From Server:   45 seconds (StatusJob)
  WARNING:  no messages sent to server
HomeDirectory:          /var/lib/torque/mom_priv
stdout/stderr spool directory: '/var/lib/torque/spool/' (1402295blocks available)
NOTE:  syslog enabled
MOM active:             240697 seconds
Check Poll Time:        45 seconds
Server Update Interval: 45 seconds
LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
Communication Model:    TCP
MemLocked:              TRUE  (mlock)
TCP Timeout:            60 seconds
Prolog:                 /var/lib/torque/mom_priv/prologue (disabled)
Alarm Time:             0 of 10 seconds
Trusted Client List:  127.0.0.1:0,127.0.0.1:15003,147.228.242.44:0,147.228.242.45:15003:  0
Copy Command:           /usr/bin/scp -rpB
job[1.stoor44.meta.zcu.cz]  state=RUNNING cput=0 mem=0 vmem=0 mempressure=0 sidlist=7083
Assigned CPU Count:     0

diagnostics complete
20170107:01/07/2017 16:07:01;0001;   pbs_mom.14110;Job;TMomFinalizeJob3;job 2.stoor44.meta.zcu.cz started, pid = 8792
20170107:01/07/2017 16:10:05;0001;   pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;job recycled into exiting on SIGNULL/KILL from substate 42
20170107:01/07/2017 16:10:05;0008;   pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;job was terminated
20170107:01/07/2017 16:10:05;0080;   pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;obit sent to server
20170107:01/07/2017 16:10:07;0080;   pbs_mom.14110;Job;2.stoor44.meta.zcu.cz;removed job script
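
A possible manual cleanup, as a sketch (the job ID and session ID are taken from the listings above; momctl -c asks the mom to purge a stale job):

# momctl -c 1.stoor44.meta.zcu.cz   # purge the stale job on this mom
# kill -- -7083                     # kill the leftover session (sid from sidlist above)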

Build and deploy Gromacs

Version 5.1.4, available via http://www.gromacs.org/Downloads

Build process

$ mkdir gromacs-build && cd gromacs-build
# first pass: full build with a conservative SIMD level
# (install prefix /opt/gromacs, or wherever else)
$ cmake -DCMAKE_INSTALL_PREFIX=/opt/gromacs -DGMX_MPI=on \
    -DGMX_GPU=on -DGMX_BUILD_OWN_FFTW=on -DGMX_DEFAULT_SUFFIX=OFF -DGMX_SIMD=SSE4.1 \
    ../gromacs-5.1.4
$ make -j<N>   # with as many jobs as you dare
# second pass: AVX2 mdrun only, installed with the _avx2 suffix
$ cmake -DCMAKE_INSTALL_PREFIX=/opt/gromacs -DGMX_MPI=on \
    -DGMX_GPU=on -DGMX_BUILD_OWN_FFTW=on -DGMX_SIMD=AVX2_256 \
    -DGMX_DEFAULT_SUFFIX=OFF -DGMX_BINARY_SUFFIX=_avx2 -DGMX_LIBS_SUFFIX=_mpiavx \
    -DGMX_BUILD_MDRUN_ONLY=on ../gromacs-5.1.4
$ make -j<N>
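
Presumably each pass is followed by an install step (not shown above), after which the two binaries can be sanity-checked; the _avx2 suffix comes from GMX_BINARY_SUFFIX:

$ make install                          # installs into CMAKE_INSTALL_PREFIX
$ /opt/gromacs/bin/mdrun -version       # prints build info incl. SIMD level
$ /opt/gromacs/bin/mdrun_avx2 -version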

Execution workflows loop infinitely on fatal errors with Cloudify Manager

If a scale-out/in workflow fails on a fatal error (e.g., a VM fails to boot and is manually cleaned up), the Cloudify Manager gets stuck in an infinite loop of retries.

  Timestamp Event Type Log Level Operation Node Name Node ID Message
  2018-01-04 20:45:41 Task failed - cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Task failed 'cloudify_occi_plugin.tasks.start' -> Failed to run occi: F, [2018-01-04T20:45:41.693050 #21070] FATAL -- : [rOCCI-cli] An error occurred! Message: Net::HTTP::Get with ID["85dc1002-3764-45f9-950f-f4c22a57ad3e"] failed! HTTP Response status: [404] Not Found : "Instance with ID 114237 does not exist!" [retry 2463]
  2018-01-04 20:45:41 - INFO cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Exited with code=1
  2018-01-04 20:45:39 - INFO cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Executing ['/usr/local/bin/occi', '--output-format', 'json', '--endpoint', u'https://carach5.ics.muni.cz:11443', '--auth', u'x509', '--user-cred', u'/tmp/x509up_u1000', '--voms', '--action', 'describe', '--resource', u'https://carach5.ics.muni.cz:11443/compute/114237']
  2018-01-04 20:45:39 - INFO cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Starting node
  2018-01-04 20:45:38 Task started - cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Task started 'cloudify_occi_plugin.tasks.start' [retry 2463]
  2018-01-04 20:45:38 Task sent - cloudify.interfaces.lifecycle.start workerNode workerNode_jmlkc1 Sending task 'cloudify_occi_plugin.tasks.start' [retry 2463]

Note the 2463 retries spent just checking a resource that no longer exists.
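
A possible mitigation, as a sketch: cap the retry policy instead of the default infinite (-1) retries. For local executions the Cloudify CLI takes explicit limits, and the manager-wide default was an input of the 3.x manager blueprints (workflow_task_retries); verify both against the installed version. The workflow name below is hypothetical:

$ cfy local execute -w scale_out --task-retries 5 --task-retry-interval 30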

Fix second phase to run across all cores

Currently, the second computational phase is limited to one core (due to a workaround for Torque's exclusive node allocation). This must be fixed by making the number of cores an optional command-line argument; see the sketch below.
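
A minimal sketch of the fix inside the wrapper job script, assuming an optional core count can be passed via the environment (NCORES and PREFIX are hypothetical names) and defaulting to everything Torque assigned:

NCORES=${NCORES:-$(wc -l < "$PBS_NODEFILE")}   # all cores Torque gave the job
mpirun -np "$NCORES" /opt/gromacs/bin/mdrun -nice 0 -deffnm "$PREFIX" -c "$PREFIX.gro"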

Qmgr used before server is ready

Sometimes the following errors show up:

Info: Class[Torque::Server::Service]: Scheduling refresh of Service[pbs_sched]
Info: Class[Torque::Server::Service]: Scheduling refresh of Service[pbs_server]
Notice: /Stage[main]/Torque::Server::Service/Service[pbs_sched]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Torque::Server::Service/Service[pbs_server]: Triggered 'refresh' from 1 events
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: 
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: Unable to communicate with localhost(127.0.0.1)
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: Cannot connect to specified server host 'localhost'.
Notice: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: qmgr: cannot connect to server localhost (errno=111) Connection refused
Error: 'qmgr -a -c 'set server scheduler_iteration = 30' localhost' returned 3 instead of one of [0]
Error: /Stage[main]/Main/Torque::Qmgr::Attribute[server scheduler_iteration]/Exec[qmgr -a -c 'set server scheduler_iteration = 30' localhost]/returns: change from notrun to 0 failed: 'qmgr -a -c 'set server scheduler_iteration = 30' localhost' returned 3 instead of one of [0]

This doesn't break the deployment; it still works despite these errors.
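
A possible fix, as a sketch: wait until pbs_server accepts connections before applying the qmgr attributes. In Puppet, the Exec parameters tries/try_sleep would achieve the same effect:

for i in $(seq 1 30); do    # wait up to ~60 s for pbs_server to come up
  qmgr -c 'list server' localhost >/dev/null 2>&1 && break
  sleep 2
done
qmgr -a -c 'set server scheduler_iteration = 30' localhost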

Fix 1:n deployment schema

Currently this works only for n ≤ 1, not for more worker instances.

2016-12-20 10:15:18 CFY <local> [gromacsPortal_10051] Configuring node
2016-12-20 10:15:18 CFY <local> [torqueMom_bb566->torqueServer_0f83e|preconfigure] Task failed 'fabric_plugin.tasks.run_script' -> Unable to evaluate get_attribute function: More than one node instance found for node "workerNode". Cannot resolve a node instance unambiguously. [retry 1/10]

Torque server restart loses exclusive node allocations

Information about the exclusively allocated nodes is lost after the Torque server is restarted.

$ pbsnodes
cloud255-20.cerit-sc.cz
     state = job-exclusive
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/1.cloud255-21.cerit-sc.cz
...

cloud255-23.cerit-sc.cz
     state = job-exclusive
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/9.cloud255-21.cerit-sc.cz
...

$ systemctl restart pbs_server

$ pbsnodes
cloud255-20.cerit-sc.cz
     state = free
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/1.cloud255-21.cerit-sc.cz
...

cloud255-23.cerit-sc.cz
     state = free
     power_state = Running
     np = 8
     ntype = cluster
     jobs = 0/9.cloud255-21.cerit-sc.cz
...
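
A quick consistency check after a restart, as a sketch: list nodes that dropped back to free while still holding job assignments.

$ pbsnodes | awk '
    /^[a-zA-Z0-9]/                { node = $1; free = 0 }   # node-name line
    $1 == "state" && $3 == "free" { free = 1 }
    $1 == "jobs" && free          { print node, "is free but has jobs:", $3 }
  '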

Torque nodes left down after the error condition disappeared

A node ran out of disk space. Torque switched it to the down state with an explanatory note. After the error condition disappeared, the node did not come back to up/free automatically; pbs_mom on the particular node had to be restarted manually to repair the state.

$ pbsnodes
node1.localdomain
     state = down
     power_state = Running
     np = 8
     ntype = cluster
     status = opsys=linux,uname=...,sessions=10062 10109 10274,nsessions=3,nusers=1,idletime=1444966,totmem=8010380kb,availmem=7536480kb,physmem=8010380kb,ncpus=8,loadave=0.00,message=ERROR: torque spool filesystem full,gres=,netload=15776691386,state=free,varattr= ,cpuclock=Fixed,version=6.1.1.1,rectime=1515096744,jobs=
     note = ERROR: torque spool filesystem full
     mom_service_port = 15002
     mom_manager_port = 15003

Consider a proactive operation to fix stale states automatically (see the sketch below).
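
One possible shape of such an operation, run periodically on each mom, as a sketch (the note text matches the output above; the 1 GiB threshold is made up):

note=$(pbsnodes "$(hostname -f)" | sed -n 's/^ *note = //p')
avail=$(df --output=avail /var/lib/torque/spool | tail -1)   # KiB free
if [ "$note" = "ERROR: torque spool filesystem full" ] && [ "$avail" -gt 1048576 ]; then
  systemctl restart pbs_mom   # clears the stale down state, per above
fi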

Also, long-term down nodes should be monitored as part of #10.

Monitoring and alerting

Cloudify Manager should monitor critical situations and alert on them, e.g.:

  • low disk space
  • long-term Torque node down
  • ... ?

Maybe as a dedicated execution workflow triggered when monitoring thresholds are exceeded.
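
A sketch of the checks such a workflow might run (thresholds made up; pbsnodes -l lists down/offline nodes and -n appends their notes):

pcent=$(df --output=pcent /var/lib/torque/spool | tail -1 | tr -dc '0-9')
[ "$pcent" -ge 90 ] && echo "ALERT: spool filesystem ${pcent}% full"
pbsnodes -ln   # any down/offline node, with its note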
