
Introduction

CloudMan is a cloud infrastructure and application manager, primarily for Galaxy.


Installation

CloudMan is intended to be installed via the CloudMan Helm chart.

Run locally for development

git clone https://github.com/galaxyproject/cloudman.git
cd cloudman
pip install -r requirements.txt
python cloudman/manage.py migrate
gunicorn --log-level debug cloudman.wsgi

The CloudMan API will be available at http://127.0.0.1:8000/cloudman/api/v1/
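As a quick smoke test, here is a minimal sketch using the requests package (an assumption; any HTTP client works) to confirm the development server is serving the API:

import requests

# Query the API root of the local development server. Depending on the
# configured authentication, this returns either a JSON listing of the
# available endpoints or an authentication error.
response = requests.get("http://127.0.0.1:8000/cloudman/api/v1/")
print(response.status_code)
print(response.text)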

To add the UI, see https://github.com/cloudve/cloudman-ui

Build Docker image

To build a Docker image, run:

docker build -t galaxy/cloudman:latest .

Push it to Docker Hub with:

docker login
docker push galaxy/cloudman:latest

People

Contributors

afgane, almahmoud, blankenberg, chapmanb, dannon, ddavidovic, gregorydavidlong, jmchilton, kaktus42, martenson, matthewralston, mdehollander, nuwang, piotrszul, razrichter, sbelluzzo, supernifty, ykowsar


Issues

Assign cluster name to EBS volumes

Currently, there is a clusterName tag, but I think it would be more helpful to (also) use the Name tag (for both the Galaxy and root volumes), since it makes it possible to see at a glance in the AWS console which volume belongs to which cluster.
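A minimal sketch of the suggestion, using boto3 (the clusterName tag comes from this report; the cluster value and the Name format shown here are hypothetical):

import boto3

ec2 = boto3.resource("ec2")
cluster = "my-cluster"  # hypothetical cluster name

# Find volumes already tagged with clusterName and add a human-readable
# Name tag so each volume's cluster is visible at a glance in the
# AWS console.
for volume in ec2.volumes.filter(
        Filters=[{"Name": "tag:clusterName", "Values": [cluster]}]):
    volume.create_tags(Tags=[{"Key": "Name",
                              "Value": "galaxy-{0}".format(cluster)}])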

Securing Cloudman Galaxy master...

I'm tasked with ensuring that the cloud-based Galaxy instance launched from launch.usegalaxy.org passes the same security scans a permanent server would. I'm using the Cluster with Galaxy option. I've dug around quite a lot, but haven't quite managed to figure out the special magic that would resolve the following issues. I think I hit most of these issues with the 15.x AMI/Galaxy version, but the details below were uncovered on the 16.x AMI/Galaxy version.

  1. Start up in HTTPS mode from the get-go (rather than logging into the admin user interface over HTTP and then toggling SSL mode).

  2. Update the Ubuntu instance with the latest security patches. I run "sudo apt-get update; sudo unattended-upgrades", which seems to be the recommended recipe, and reboot the cluster (via the cloudman/admin link).

    1. It is unfortunate that there is no hook for a script to run before any Galaxy services start, which might save the reboot.

    2. Unfortunately, apache2 is upgraded and configured to start at reboot; its links must be removed from /etc/rc*.d to be consistent with the "before" state of the instance.

    3. Unfortunately, nginx is upgraded too, and as best I can make out, the upgraded version of Ubuntu's nginx does not provide the nginx upload module used by Galaxy, which makes it fail at startup. The error is "nginx: [emerg] unknown directive "upload_store" in /etc/nginx/..." in /var/log/cloudman/cm_boot.log.

  3. If I stop ProFTPd, this change gets written to the S3 configuration bucket, no problem. However, after the reboot, the Galaxy service is never started by the supervisor, since it is waiting for ProFTPd to be started (and it never is, because it was stopped).

  4. Where/how can I manipulate the ports opened by the security group? What is on the other end of each of these ports? There seem to be a lot. I'd like to shut down everything other than HTTPS and SSH, at least to the outside world (a sketch is included at the end of this issue).

  5. SSH access is permitted with either the password or the key pair. Can I turn off password access and require the key pair?

  6. (Not strictly a security issue, but observed when rebooting...) A new galaxyIndices volume gets created with each reboot.

Thanks,

-- n
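
Regarding question 4, ingress rules can also be tightened from outside CloudMan. A hedged sketch using boto3 (the group ID and the port list are placeholders, not CloudMan's actual security group layout):

import boto3

ec2 = boto3.resource("ec2")
sg = ec2.SecurityGroup("sg-0123456789abcdef0")  # placeholder group ID

# Revoke world-open ingress on everything except HTTPS (443) and SSH (22).
for port in (80, 8080, 9001):  # hypothetical list of currently open ports
    sg.revoke_ingress(IpProtocol="tcp", FromPort=port, ToPort=port,
                      CidrIp="0.0.0.0/0")

For question 5, setting PasswordAuthentication no in /etc/ssh/sshd_config and restarting the ssh service is the standard OpenSSH way to require the key pair.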

Select last tool not found in GVL 4.4.0 instance

I launched a GVL instance on AWS using Cloudlaunch today. The info from Cloudlaunch is

Appliance: Genomics Virtual Lab (GVL)
Version: GVL 4.4.0 RC1 (Galaxy 18.05)
Cloud: amazon-us-east-n-virginia

When I try to run the Select last tool, the middle panel is replaced with a red box with a red x and this text:

Could not find tool with id 'Show+tail1'.

Cloudman removes headers needed for Galaxy application

Important headers that are being 'eaten' by CloudMan/nginx include content-length, content-type, access-control-allow-origin, etc.

The loss of access-control-allow-origin, in particular, breaks some external display applications, such as Phinch for biom1 datatype.

For example, Galaxy sets these headers (trans.response.headers): {'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json', 'access-control-allow-origin': 'http://www.bx.psu.edu', 'content-length': 140392}

But only these headers are returned to the browser:

Connection:keep-alive
Date:Fri, 24 Jun 2016 14:24:08 GMT
ETag:"576432a4-22468"
Last-Modified:Fri, 17 Jun 2016 17:25:56 GMT
Server:nginx/1.4.6 (Ubuntu)

For example: http://aws_url/display_application/x1234/biom_simple/phinch_dan/x567/data/galaxy_x1234.biom
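
A quick way to see which headers survive the proxy, as a minimal sketch with requests (the URL is the placeholder from above, not a real endpoint):

import requests

# Fetch the dataset through the proxy and print the response headers;
# comparing them against the set Galaxy reports in trans.response.headers
# shows which ones the proxy dropped.
r = requests.get("http://aws_url/display_application/x1234/biom_simple/phinch_dan/x567/data/galaxy_x1234.biom")
for name, value in sorted(r.headers.items()):
    print("{0}: {1}".format(name, value))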

Unable to install tools from Tool Shed

I am able to request that a tool be installed on my local instance, but the tool does not finish installing.

In the Admin -> Tool Management -> Manage Tools window, the requested tool hangs at the "Cloning" step and displays a question mark icon in the 2nd column for the tool. Hovering over that shows:

"Unable to get information from the Tool Shed"

Jobs aren't evenly distributed across workers.

When starting a workflow or a handful of jobs, they initially run evenly across available worker nodes, but over time all of the jobs end up running on only one worker.

Trying to add additional nodes while jobs are ongoing or queued results in #88.

This behavior was observed using GVL 4.4.0 RC2 (Galaxy 18.05).

REST API Status

Sorry to post this question here, but can someone please tell me what the status is of a REST API for Cloudman? If one exists, where is the documentation located?

Conda error message during Galaxy restart

I've launched a cluster based on the cloudman-dev bucket, added a few items to job_conf.xml, and restarted Galaxy using the Admin console. Galaxy appears to work fine, but I've noticed the following messages in main.log:

galaxy.tools.deps DEBUG 2016-10-28 16:48:39,575 Unable to find config file './dependency_resolvers_conf.xml'
/bin/sh: 1: Syntax error: Unterminated quoted string
galaxy.tools.deps.resolvers.conda WARNING 2016-10-28 16:48:39,581 Conda installation requested and failed.
/bin/sh: 1: Syntax error: Unterminated quoted string
galaxy.tools.deps.resolvers.conda WARNING 2016-10-28 16:48:39,586 Conda installation requested and failed.
galaxy.tools.search DEBUG 2016-10-28 16:48:39,626 Starting to build toolbox index.
docutils WARNING 2016-10-28 16:48:39,925 <string>:38: (ERROR/3) Document may not end with a transition.
docutils WARNING 2016-10-28 16:48:40,815 <string>:61: (ERROR/3) Document may not end with a transition.
docutils WARNING 2016-10-28 16:48:42,493 <string>:35: (ERROR/3) Document may not end with a transition.
docutils WARNING 2016-10-28 16:48:42,965 <string>:16: (ERROR/3) Unexpected indentation.
docutils WARNING 2016-10-28 16:48:43,335 <string>:22: (ERROR/3) Document may not end with a transition.
docutils WARNING 2016-10-28 16:48:43,346 <string>:30: (WARNING/2) Duplicate explicit target name: "krona".
docutils WARNING 2016-10-28 16:48:43,347 <string>:4: (ERROR/3) Duplicate target name, cannot be used as a unique reference: "krona".
docutils WARNING 2016-10-28 16:48:43,348 <string>:28: (ERROR/3) Duplicate target name, cannot be used as a unique reference: "krona".
docutils WARNING 2016-10-28 16:48:43,443 <string>:16: (ERROR/3) Document may not end with a transition.
docutils WARNING 2016-10-28 16:48:43,574 <string>:50: (ERROR/3) Document may not end with a transition.
galaxy.tools.search DEBUG 2016-10-28 16:48:44,360 Toolbox index finished. It took: 0:00:04.734109

I'm not sure if this is of any consequence.

Unclear where to report an error on a GVL instance.

I'm running a GVL instance on AWS, launched from CloudLaunch:

Appliance: Genomics Virtual Lab (GVL) 
Version: GVL 4.4.0 RC1 (Galaxy 18.05)
Cloud: amazon-us-east-n-virginia 

And I got the error reported in issue #82. It wasn't at all clear what I should do to report it. Clicking on the Help pulldown reveals an item called Support, which leads to https://www.gvl.org.au/help/, and from there I could not find anything about how to report bugs.

Tool sam_to_bam fails to install in CloudMan 16.10

I've tried installing a ChIP-seq workflow, along with its dependencies, into a CloudMan 16.10 dev instance, but the sam_to_bam tool installation did not advance past the Cloning stage, with the following error message in the Galaxy log:

tool_shed.util.tool_dependency_util DEBUG 2016-10-25 21:44:57,434 Creating a new record for version 1.2 of tool dependency samtools for revision d04d9f1c6791 of repository sam_to_bam.  The status is being set to Never installed.
galaxy.tools.data DEBUG 2016-10-25 21:44:57,443 Could not parse existing tool data table config, assume no existing elements: [Errno 2] No such file or directory: u'/mnt/galaxy/galaxy-app/tool-data/shed_tool_data/toolshed.g2.bx.psu.edu/repos/devteam/sam_to_bam/d04d9f1c6791/tool_data_table_conf.xml'
galaxy.tools.data DEBUG 2016-10-25 21:44:57,444 Loading another instance of data table 'fasta_indexes', attempting to merge content.
134.174.183.88 - - [25/Oct/2016:21:44:54 +0000] "POST /admin_toolshed/manage_repositories HTTP/1.0" 500 - "http://galaxy-dev.aws.stemcellcommons.org/admin_toolshed/prepare_for_install" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"
Error - <type 'exceptions.IOError'>: [Errno 2] No such file or directory: u'/cvmfs/data.galaxyproject.org/fasta_indexes.loc.sample'
URL: http://galaxy-dev.aws.stemcellcommons.org/admin_toolshed/manage_repositories
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/middleware/error.py', line 151 in __call__
  app_iter = self.application(environ, sr_checker)
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/paste/recursive.py', line 85 in __call__
  return self.application(environ, start_response)
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/paste/httpexceptions.py', line 640 in __call__
  return self.application(environ, start_response)
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/base.py', line 135 in __call__
  return self.handle_request( environ, start_response )
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/base.py', line 194 in handle_request
  body = method( trans, **kwargs )
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/decorators.py', line 89 in decorator
  return func( self, trans, *args, **kwargs )
File '/mnt/galaxy/galaxy-app/lib/galaxy/webapps/galaxy/controllers/admin_toolshed.py', line 750 in manage_repositories
  reinstalling=reinstalling,
File '/mnt/galaxy/galaxy-app/lib/tool_shed/galaxy_install/install_manager.py', line 853 in install_repositories
  tool_panel_section_mapping=tool_panel_section_mapping )
File '/mnt/galaxy/galaxy-app/lib/tool_shed/galaxy_install/install_manager.py', line 905 in install_tool_shed_repository
  tool_panel_section_mapping=tool_panel_section_mapping )
File '/mnt/galaxy/galaxy-app/lib/tool_shed/galaxy_install/install_manager.py', line 550 in __handle_repository_contents
  tool_util.copy_sample_files( self.app, tool_index_sample_files, tool_path=tool_path )
File '/mnt/galaxy/galaxy-app/lib/tool_shed/util/tool_util.py', line 81 in copy_sample_files
  copy_sample_file( app, filename, dest_path=dest_path )
File '/mnt/galaxy/galaxy-app/lib/tool_shed/util/tool_util.py', line 57 in copy_sample_file
  shutil.copy( full_source_path, full_destination_path )
File '/usr/lib/python2.7/shutil.py', line 119 in copy
  copyfile(src, dst)
File '/usr/lib/python2.7/shutil.py', line 83 in copyfile
  with open(dst, 'wb') as fdst:
IOError: [Errno 2] No such file or directory: u'/cvmfs/data.galaxyproject.org/fasta_indexes.loc.sample'


CGI Variables
-------------
  CONTENT_LENGTH: '17210'
  CONTENT_TYPE: 'application/x-www-form-urlencoded; charset=UTF-8'
  HTTP_ACCEPT: 'text/html, */*; q=0.01'
  HTTP_ACCEPT_ENCODING: 'gzip, deflate'
  HTTP_ACCEPT_LANGUAGE: 'en-US,en;q=0.8'
  HTTP_AUTHORIZATION: 'Basic dWJ1bnR1OnNucjJkZQ=='
  HTTP_CONNECTION: 'close'
  HTTP_COOKIE: 'galaxysession=c6ca0ddb55be603aa5fcb29e1552bb30267c7047a3b360a8ad55130b5ce038db1f428c9bb7e92128; SESS1be5e3fb3c4d312b5d6226cd168ec726=KfSJIBO9bwdXmULJ6tijJcus_gmCR7T9cCmycHtWi50; _ga=GA1.2.1555562963.1440513880'
  HTTP_DNT: '1'
  HTTP_HOST: 'galaxy-dev.aws.stemcellcommons.org'
  HTTP_ORIGIN: 'http://galaxy-dev.aws.stemcellcommons.org'
  HTTP_REFERER: 'http://galaxy-dev.aws.stemcellcommons.org/admin_toolshed/prepare_for_install'
  HTTP_USER_AGENT: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
  HTTP_X_FORWARDED_FOR: '134.174.183.88'
  HTTP_X_FORWARDED_HOST: 'galaxy-dev.aws.stemcellcommons.org'
  HTTP_X_REQUESTED_WITH: 'XMLHttpRequest'
  ORGINAL_HTTP_HOST: 'galaxy_app'
  ORGINAL_REMOTE_ADDR: '127.0.0.1'
  PATH_INFO: '/admin_toolshed/manage_repositories'
  REMOTE_ADDR: '134.174.183.88'
  REQUEST_METHOD: 'POST'
  SERVER_NAME: '127.0.0.1'
  SERVER_PORT: '8080'
  SERVER_PROTOCOL: 'HTTP/1.0'


WSGI Variables
--------------
  application: <paste.recursive.RecursiveMiddleware object at 0x7f0c29ef1450>
  controller_action_key: u'web.admin_toolshed.manage_repositories'
  is_api_request: False
  paste.cookies: (<SimpleCookie: SESS1be5e3fb3c4d312b5d6226cd168ec726='KfSJIBO9bwdXmULJ6tijJcus_gmCR7T9cCmycHtWi50' _ga='GA1.2.1555562963.1440513880' galaxysession='c6ca0ddb55be603aa5fcb29e1552bb30267c7047a3b360a8ad55130b5ce038db1f428c9bb7e92128'>, 'galaxysession=c6ca0ddb55be603aa5fcb29e1552bb30267c7047a3b360a8ad55130b5ce038db1f428c9bb7e92128; SESS1be5e3fb3c4d312b5d6226cd168ec726=KfSJIBO9bwdXmULJ6tijJcus_gmCR7T9cCmycHtWi50; _ga=GA1.2.1555562963.1440513880')
  paste.expected_exceptions: [<class 'paste.httpexceptions.HTTPException'>]
  paste.httpexceptions: <paste.httpexceptions.HTTPExceptionHandler object at 0x7f0c29ef13d0>
  paste.httpserver.proxy.host: 'dummy'
  paste.httpserver.proxy.scheme: 'http'
  paste.httpserver.thread_pool: <paste.httpserver.ThreadPool object at 0x7f0c20048710>
  paste.recursive.forward: <paste.recursive.Forwarder from />
  paste.recursive.include: <paste.recursive.Includer from />
  paste.recursive.include_app_iter: <paste.recursive.IncluderAppIter from />
  paste.recursive.script_name: ''
  paste.throw_errors: True
  request_id: '43e7d82c9afc11e6abac0ee21881b804'
  webob._body_file: (<_io.BufferedReader>, <socket._fileobject object at 0x7f0bfc6c0850 length=17210>)
  webob._parsed_post_vars: (MultiDict([('operation', u'install'), ('tool_shed_repository_ids', u"['0397e7c5778be5ee', '07a38ebd55a6989d', '56fc5a09f8ae2546', '21b91b9198fe5ccf']"), ('encoded_kwd', u'acd705e72f23d5df34c89743d9be7b197da22dd0:7b22737461747573223a2022646f6e65222c20226861735f7265706f7369746f72795f646570656e64656e63696573223a2066616c73652c2022696e636c756465735f746f6f6c735f666f725f646973706c61795f696e5f746f6f6c5f70616e656c223a20747275652c2022746f6f6c5f736865645f7265706f7369746f72795f696473223a205b2230333937653763353737386265356565222c202230376133386562643535613639383964222c202235366663356130396638616532353436222c202232316239316239313938666535636366225d2c2022736865645f746f6f6c5f636f6e66223a20222f6d6e742f67616c6178792f67616c6178792d6170702f636f6e6669672f736865645f746f6f6c5f636f6e662e786d6c222c2022696e7374616c6c5f7265706f7369746f72795f646570656e64656e63696573223a20747275652c2022746f6f6c5f70617468223a20222e2e2f736865645f746f6f6c73222c20227265706f5f696e666f5f6469637473223a205b7b226368697... 0x7f0c2016aae0>)
  webob._parsed_query_vars: (GET([]), '')
  webob.is_body_seekable: True
  wsgi process: 'Multithreaded'

I was able to install this workflow in CloudMan 16.03 and 16.05.

Galaxy FS usage and size displayed incorrectly for cloned clusters

Steps to reproduce

Clone a shared cluster that has a Galaxy FS volume of size 200 GB

Observed results

CloudMan Console shows:
Disk status: 5.7G / 20G (31%)
CloudMan Admin Console shows under File systems (correctly):
galaxy 26 GB/199.9 GB (13%)

Expected results

Disk status should be displayed correctly in the CloudMan Console

Which clouds are supported?

The description of this repo is "Easily create and manage compute clusters on any Cloud." I was wondering whether either GCE or MS Azure is supported, and if not, what steps would need to be taken to support them.

Handle BotoServerError in get_desc()

CloudMan 16.05. AWS is having some issues in N. Virginia at the moment and I've noticed the following traceback in the CM log:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/cm/cm/master.py", line 2877, in __monitor
    "for a while; checking now.".format(w_instance.get_desc()))
  File "/mnt/cm/cm/instance.py", line 250, in get_desc
    return "'{id}; {ip}; {sn}'".format(id=self.get_id(), ip=self.get_public_ip(),
  File "/mnt/cm/cm/util/decorators.py", line 42, in df
    return fn(*args, **kwargs)
  File "/mnt/cm/cm/instance.py", line 453, in get_public_ip
    inst.update()
  File "/home/ubuntu/.virtualenvs/CM/local/lib/python2.7/site-packages/boto/ec2/instance.py", line 413, in update
    rs = self.connection.get_all_reservations([self.id], dry_run=dry_run)
  File "/home/ubuntu/.virtualenvs/CM/local/lib/python2.7/site-packages/boto/ec2/connection.py", line 682, in get_all_reservations
    [('item', Reservation)], verb='POST')
  File "/home/ubuntu/.virtualenvs/CM/local/lib/python2.7/site-packages/boto/connection.py", line 1166, in get_list
    response = self.make_request(action, params, path, verb)
  File "/home/ubuntu/.virtualenvs/CM/local/lib/python2.7/site-packages/boto/connection.py", line 1112, in make_request
    return self._mexe(http_request)
  File "/home/ubuntu/.virtualenvs/CM/local/lib/python2.7/site-packages/boto/connection.py", line 1025, in _mexe
    raise BotoServerError(response.status, response.reason, body)
BotoServerError: BotoServerError: 503 Service Unavailable
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>Unavailable</Code><Message>The service is unavailable. Please try again shortly.</Message></Error></Errors><RequestID>0a7779ff-f5ce-46ea-8cdc-9e3a812a10fd</RequestID></Response>
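
One possible shape for the fix, as a sketch (w_instance.get_desc() comes from the traceback above; the retry parameters are arbitrary):

from time import sleep

from boto.exception import BotoServerError

def get_desc_with_retry(w_instance, retries=3, delay=5):
    # Retry transient 5xx errors (such as the 503 Service Unavailable
    # above) instead of letting them kill the monitor thread.
    for attempt in range(retries):
        try:
            return w_instance.get_desc()
        except BotoServerError as e:
            if e.status < 500 or attempt == retries - 1:
                raise  # client errors and the final attempt propagate
            sleep(delay * (attempt + 1))  # simple linear backoff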

Trouble adding nodes to cluster via cloudman

Hello again,
I am having trouble adding a new node to the cluster. I can connect to the EC2 node and examine logs, etc. How can I troubleshoot the issue? It seems that the autorun script is not retrieving the correct user data and is instead retrieving resources from the default CloudMan bucket. Is the autorun script supposed to retrieve the user data, or does it simply launch CloudMan (and start services) from the default bucket?
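
One way to see exactly what the worker received, as a minimal sketch (the metadata URL is the one queried in the cm_autorun.log excerpt below):

import requests

# Run on the worker node. An empty body here matches the
# "User data not found" line in cm_autorun.log.
r = requests.get("http://169.254.169.254/latest/user-data", timeout=5)
print(r.status_code)
print(r.text or "<empty>")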

Worker node

>tail cm_autorun.log 
[INFO] cm_autorun:79 2015-11-17 15:09:30,984: Getting user data from 'http://169.254.169.254/latest/user-data', attempt 0
[INFO] cm_autorun:90 2015-11-17 15:09:30,987: User data not found. Setting it to empty.
[INFO] cm_autorun:461 2015-11-17 15:09:30,987: Received empty user data; assuming default contextualization
[DEBUG] cm_autorun:356 2015-11-17 15:09:30,993: Default bucket url: http://s3.amazonaws.com/cloudman
[DEBUG] cm_autorun:465 2015-11-17 15:09:30,993: Resorting to the default bucket to get the boot script: http://s3.amazonaws.com/cloudman/cm_boot.py
[INFO] cm_autorun:240 2015-11-17 15:09:30,993: Getting boot script from 'http://s3.amazonaws.com/cloudman/cm_boot.py' and saving it locally to '/opt/cloudman/boot/cm_boot.py'
[ERROR] cm_autorun:251 2015-11-17 15:09:30,997: Boot script at 'http://s3.amazonaws.com/cloudman/cm_boot.py' not found.
[INFO] cm_autorun:315 2015-11-17 15:09:30,998: Running boot script 'HOME=/home/galaxy /opt/cloudman/boot/cm_boot.py'
[DEBUG] cm_autorun:319 2015-11-17 15:09:34,936: Successfully ran boot script 'HOME=/home/galaxy /opt/cloudman/boot/cm_boot.py'
[INFO] cm_autorun:622 2015-11-17 15:09:34,936: ---> /usr/bin/cm_autorun.py done <---
>hostname
ip-1-2-3-5.cloud.example.com
>hostname --long
hostname: Unknown host

Cloudman log

14:37:13 - Adding 1 on-demand instance(s)
14:44:04 - Rebooting instance 'i-54321; 1.2.3.5; w2' (reboot #1).
14:52:01 - ---> PROBLEM, running command '/usr/local/bin/scontrol update NodeName=w2 Reason="CloudMan-disabled" State=DOWN' returned code '1', the following stderr: 'slurm_update error: Invalid node name specified ' and stdout: ''
14:52:01 - Terminating instance i-54321

Specify VPC? [Enhancement]

Hello again,
I'm working under a corporate AWS account with an individual IAM user. I can't specify a default VPC, and I have to run Galaxy in a particular VPC. Could VPC selection be implemented in CloudMan and/or CloudLaunch? I see in the changelog that support for VPC was added in some sense in August, but I'm not sure how to leverage this and/or specify it while using CloudLaunch. Will this be part of the larger October Galaxy release?

Better handling for InsufficientInstanceCapacity error

I've just started a brand new cluster using the Dev 11/01 flavor and requested to add a c4.4xlarge worker node using the CM console. However, this failed with the following error message:

2016-11-07 16:50:23,834 ERROR            ec2:530  boto server error when starting an instance: BotoServerError: 500 Internal Server Error
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InsufficientInstanceCapacity</Code><Message>We currently do not have sufficient c4.4xlarge capacity in the Availability Zone you requested (us-east-1c). Our system will be working on provisioning additional capacity. You can currently get c4.4xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1d, us-east-1e.</Message></Error></Errors><RequestID>80586c44-65e6-4676-80f1-e96a47206379</RequestID></Response>

Also, the requested instance appears to be stuck as a blue square in the CM console while the "Remove worker nodes" button is disabled. It would be great to either update the status in the UI or implement automatic retries.
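
A sketch of one retry strategy, following the error message's own suggestion of dropping the Availability Zone (boto 2 style, to match the log; the function name and retry counts are illustrative):

from time import sleep

from boto.exception import BotoServerError

def run_with_capacity_fallback(conn, image_id, instance_type,
                               placement, retries=3):
    # Try the requested zone first; on InsufficientInstanceCapacity,
    # retry without a zone so EC2 can pick one that has capacity.
    for attempt in range(retries):
        try:
            return conn.run_instances(image_id,
                                      instance_type=instance_type,
                                      placement=placement)
        except BotoServerError as e:
            if e.error_code != 'InsufficientInstanceCapacity':
                raise
            placement = None  # let EC2 choose the zone
            sleep(10 * (attempt + 1))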

Failed to get file snaps.yaml from bucket cloudman-dev

Error message after a "Dev 11/01" cluster restart in CM log:

2016-11-08 16:39:30,893 DEBUG         config:147  filesystem_templates not found in config {'static_images_dir': 'static/images', 'cluster_templates': [{'filesystem_templates': [{'archive_url': 'http://s3.amazonaws.com/cloudman/fs-archives/galaxyFS-20161101.tar.gz', 'type': u'volume', 'name': 'galaxy', 'roles': 'galaxyTools,galaxyData', 'size': u'10'}, {'mount_point': '/cvmfs/data.galaxyproject.org', 'type': 'cvmfs', 'name': 'galaxyIndices', 'roles': 'galaxyIndices'}], 'name': 'Galaxy'}, {'filesystem_templates': [{'name': 'galaxy'}], 'name': 'Data'}], 'ec2_port': None, 'storage_type': u'volume', 'iops': u'', 'is_secure': True, 'cluster_storage_type': u'volume', 'password': '*', 's3_port': None, 'log_level': 'DEBUG', 'cluster_type': u'Galaxy', 'initial_cluster_type': u'Galaxy', 'cloud_type': u'ec2', 'static_scripts_dir': 'static/scripts', 'cluster_name': u'cm-16.11-dev', 'freenxpass': u'*', 'machine_image_id': 'ami-3be8cd2c', 'role': 'master', 'bucket_cluster': 'cm-085dd2d743c466efbf1af5854b35dca5', 'boot_script_path': '/opt/cloudman/boot', 'bucket_default': 'cloudman-dev', 'ec2_conn_path': u'/', 's3_host': u's3.amazonaws.com', 'region_name': u'us-east-1', 'region_endpoint': u'ec2.amazonaws.com', 'persistent_data_version': 3, 'static_favicon_dir': 'static/favicon.ico', 'deployment_version': 2, 'storage_size': u'10', 'use_translogger': 'False', 'boot_script_name': 'cm_boot.py', 'services': [{'name': 'Postgres', 'roles': ['Postgres']}, {'name': 'ProFTPd', 'roles': ['ProFTPd']}, {'name': 'Slurmd', 'roles': ['Slurmd']}, {'name': 'Nginx', 'roles': ['Nginx']}, {'name': 'Supervisor', 'roles': ['Supervisor']}, {'name': 'PSS', 'roles': ['PSS']}, {'name': 'Slurmctld', 'roles': ['Slurmctld', 'Job manager']}, {'name': 'NodeJSProxy', 'roles': ['NodeJSProxy']}, {'home': '/mnt/galaxy/galaxy-app', 'name': 'Galaxy', 'roles': ['Galaxy']}], 'static_dir': 'static', 'custom_image_id': u'', 'use_lint': 'false', 'cloudman_file_name': 'cm.tar.gz', 'access_key': u'*', 'global_conf': {'__file__': '/mnt/cm/cm_wsgi.ini', 'here': '/mnt/cm'}, 'filesystems': [{'kind': 'cvmfs', 'mount_point': '/cvmfs/data.galaxyproject.org', 'name': 'galaxyIndices', 'roles': ['galaxyIndices']}, {'kind': 'volume', 'mount_point': '/mnt/galaxy', 'name': 'galaxy', 'roles': ['galaxyTools', 'galaxyData'], 'ids': [u'vol-04b4ba5293e434aa4']}], 'placement': 'us-east-1c', 'template_path': 'templates', 'cloud_name': u'Amazon - Virginia', 'static_style_dir': 'static/style', 'cloudman_home': '/mnt/cm', 'static_cache_time': '360', 'custom_instance_type': u'c4.large', 'debug': 'true', 'secret_key': u'*', 's3_conn_path': u'/', 'static_enabled': 'True', 'worker_initial_count': u''}; loading legacy snapshot data.
2016-11-08 16:39:30,930 DEBUG           misc:607  Failed to get file 'snaps.yaml' from bucket 'cloudman-dev': S3ResponseError: 403 Forbidden
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>4A5059CD060DCB35</RequestId><HostId>Jg1VaqmzlnYXqcPEh2cqxV2SD6w6jm/4vMN/dxBE3Gr2CQcLlUfCzMoP0XUdQN4TiEFvym/MetI=</HostId></Error>
2016-11-08 16:39:30,930 DEBUG           misc:808  Fetching file https://s3.amazonaws.com:None/cloudman-dev/snaps.yaml and saving it as cm_snaps.yaml
2016-11-08 16:39:30,931 DEBUG           misc:821  Could not fetch file from s3 public url: https://s3.amazonaws.com:None/cloudman-dev/snaps.yaml due to exception: Failed to parse: s3.amazonaws.com:None
2016-11-08 16:39:30,931 DEBUG         master:283  Couldn't get legacy snaps.yaml from default bucket. Assuming it's present in user data, since user data will override it anyway.

Install Tools not showing results in Admin view and/or is astonishingly slow.

I'm running a GVL instance on AWS, launched with CloudLaunch:

Appliance: Genomics Virtual Lab (GVL) 
Version: GVL 4.4.0 RC1 (Galaxy 18.05)
Cloud: amazon-us-east-n-virginia 

I can't get either "Install new tools" or "Install new tools (Beta)" to work, or they are mind-numbingly slow.

If the slowness is due to something on the Tool Shed end, then, um, close this request.

Error uploading large files into histories and libraries

AMI: ami-b45e59de, Galaxy 16.01.

Steps to reproduce

in Python interpreter:

from bioblend.galaxy.objects import GalaxyInstance

# Connect to the instance, create a new history, and upload a
# multi-gigabyte file into it
gi_aws = GalaxyInstance("galaxy-dev.aws.stemcellcommons.org", "api_key")
h_aws = gi_aws.histories.create()
h_aws.upload_dataset('/path/to/multigigabytefile')

Observed results

in Python interpreter:

ConnectionError: Unexpected response from galaxy: 504: <html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx/1.4.6 (Ubuntu)</center>
</body>
</html>

Nginx log:

2016/04/22 19:03:19 [error] 12825#0: *14311 upstream timed out (110: Connection timed out) while sending request to upstream, client: 134.174.183.88, server: , request: "POST /api/tools HTTP/1.1", upstream: "http://127.0.0.1:8080/api/tools", host: "galaxy-dev.aws.stemcellcommons.org"

Galaxy log:

134.174.183.88 - - [22/Apr/2016:19:01:56 +0000] "POST /api/tools HTTP/1.0" 500 - "-" "python-requests/2.9.1"
Error - <class 'webob.request.DisconnectionError'>: The client disconnected while sending the POST/PUT body (1183825676 more bytes were expected)
URL: http://galaxy-dev.aws.stemcellcommons.org/api/tools
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/middleware/error.py', line 151 in __call__
  app_iter = self.application(environ, sr_checker)
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/paste/recursive.py', line 85 in __call__
  return self.application(environ, start_response)
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/paste/httpexceptions.py', line 640 in __call__
  return self.application(environ, start_response)
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/base.py', line 126 in __call__
  return self.handle_request( environ, start_response )
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/base.py', line 153 in handle_request
  trans = self.transaction_factory( environ )
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/webapp.py', line 66 in <lambda>
  self.set_transaction_factory( lambda e: self.transaction_chooser( e, galaxy_app, session_cookie ) )
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/webapp.py', line 97 in transaction_chooser
  return GalaxyWebTransaction( environ, galaxy_app, self, session_cookie )
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/webapp.py', line 193 in __init__
  self.error_message = self._authenticate_api( session_cookie )
File '/mnt/galaxy/galaxy-app/lib/galaxy/web/framework/webapp.py', line 308 in _authenticate_api
  api_key = self.request.params.get('key', None)
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/webob/request.py', line 853 in params
  params = NestedMultiDict(self.GET, self.POST)
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/webob/request.py', line 789 in POST
  self.make_body_seekable()
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/webob/request.py', line 943 in make_body_seekable
  self.copy_body()
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/webob/request.py', line 963 in copy_body
  did_copy = self._copy_body_tempfile()
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/webob/request.py', line 980 in _copy_body_tempfile
  data = input.read(min(todo, 65536))
File '/mnt/galaxy/galaxy-app/.venv/local/lib/python2.7/site-packages/webob/request.py', line 1549 in readinto
  + "(%d more bytes were expected)" % self.remaining
DisconnectionError: The client disconnected while sending the POST/PUT body (1183825676 more bytes were expected)

CGI Variables
-------------
  CONTENT_LENGTH: '4628837012'
  CONTENT_TYPE: 'multipart/form-data; boundary=cf3af77a8a3445f98ce91060761385a7'
  HTTP_ACCEPT: '*/*'
  HTTP_ACCEPT_ENCODING: 'gzip, deflate'
  HTTP_CONNECTION: 'close'
  HTTP_HOST: 'galaxy-dev.aws.stemcellcommons.org'
  HTTP_USER_AGENT: 'python-requests/2.9.1'
  HTTP_X_FORWARDED_FOR: '134.174.183.88'
  HTTP_X_FORWARDED_HOST: 'galaxy-dev.aws.stemcellcommons.org'
  ORGINAL_HTTP_HOST: 'galaxy_app'
  ORGINAL_REMOTE_ADDR: '127.0.0.1'
  PATH_INFO: '/api/tools'
  REMOTE_ADDR: '134.174.183.88'
  REQUEST_METHOD: 'POST'
  SERVER_NAME: '127.0.0.1'
  SERVER_PORT: '8080'
  SERVER_PROTOCOL: 'HTTP/1.0'

WSGI Variables
--------------
  application: <paste.recursive.RecursiveMiddleware object at 0x7fcce108fcd0>
  is_api_request: True
  paste.expected_exceptions: [<class 'paste.httpexceptions.HTTPException'>]
  paste.httpexceptions: <paste.httpexceptions.HTTPExceptionHandler object at 0x7fcce108fc50>
  paste.httpserver.proxy.host: 'dummy'
  paste.httpserver.proxy.scheme: 'http'
  paste.httpserver.thread_pool: <paste.httpserver.ThreadPool object at 0x7fcce0aabb90>
  paste.recursive.forward: <paste.recursive.Forwarder from />
  paste.recursive.include: <paste.recursive.Includer from />
  paste.recursive.include_app_iter: <paste.recursive.IncluderAppIter from />
  paste.recursive.script_name: ''
  paste.throw_errors: True
  request_id: 'af50af2608bc11e6b6a90a4d98d6c597'
  webob._body_file: (<_io.BufferedReader>, <socket._fileobject object at 0x7fccc43f62d0 length=4628837012>)
  webob._parsed_query_vars: (GET([]), '')
  wsgi process: 'Multithreaded'

Expected results

Large file upload should succeed

Notes

Large file uploads work in a non-cloud Galaxy 16.01 instance, not running behind Nginx.
/etc/nginx/sites-enabled/default.server:

    # This file is maintained by CloudMan.
    # Changes will be overwritten!


    upstream galaxy_app {
        server 127.0.0.1:8080;
    }
    server {
        listen                  80;
        client_max_body_size    10G;
        proxy_read_timeout      1200s;

        include /etc/nginx/sites-enabled/*.locations;
    }

I've tried adding proxy_send_timeout 1200s; (and restarting Nginx from the command line to preserve the config file edits), but I still got the same error, though after a much longer period of time.

Update Galaxy to 18.05

Do a (GVL) release with CloudMan 1 to include Galaxy 18.05:

  • Build a new updated machine image
  • Update CloudMan code to work with uwsgi
  • Allow manual editing of Galaxy config and not have CloudMan overwrite it
  • Base all tool installs on Conda; however, most likely do not install the dependencies (due to size) and instead install only the wrappers, with the deps being filled in on the first job run (potentially preinstall a few common deps)
  • Build and upload the file system archive; create CloudFront distribution; update refs
  • 10 other things I'm forgetting or will emerge

AMQP connection fails on CentOS

I'm seeing that the application launches pretty well except for the AMQP connection. Towards the bottom of the log, a service failure is mentioned where the "hostname" service fails to restart (I've read this is different on RedHat machines). Is this causing the connection to fail? Alternatively, I am behind a proxy, which could be disrupting the connection. Could someone suggest how to enable the AMQP connection behind my proxy? Is there a line in a profile shell script or in /etc/hosts that I can edit to enable this?

On a separate note, I'm running on CentOS 6. Why does "Running on Ubuntu: True" get set?

Python version:  (2, 7)
2015-11-02 12:49:43,132 DEBUG            app:80   Initializing app
2015-11-02 12:49:43,132 DEBUG            ec2:115  Gathering instance zone, attempt 0
2015-11-02 12:49:43,136 DEBUG            ec2:121  Instance zone is 'us-east-1a'
2015-11-02 12:49:43,137 DEBUG            ec2:44   Gathering instance ami, attempt 0
2015-11-02 12:49:43,139 DEBUG            app:83   Running on 'ec2' type of cloud in zone 'us-east-1a' using image 'ami-123456789'.
2015-11-02 12:49:43,139 DEBUG            app:106  Looking for existing cluster persistent data (PD).
2015-11-02 12:49:43,139 DEBUG            ec2:357  No S3 Connection, creating a new one.
2015-11-02 12:49:43,141 DEBUG            ec2:361  Got boto S3 connection.
2015-11-02 12:49:43,857 DEBUG           misc:597  Failed to get file 'persistent_data.yaml' from bucket 'my_company_bucket': S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>persistent_data.yaml</Key><RequestId>bobloblaw</RequestId><HostId>123456789</HostId></Error>
2015-11-02 12:49:43,858 DEBUG            app:125  No PD to go by. Setting deployment_version to 2.
2015-11-02 12:49:43,858 DEBUG            app:132  Master process starting.
2015-11-02 12:49:43,938 DEBUG         master:51   Initializing console manager - cluster start time: 2015-11-02 17:49:43.938619
2015-11-02 12:49:43,938 DEBUG           comm:30   Setting up a new AMQP connection
2015-11-02 12:49:43,939 DEBUG           comm:49   AMQP Connection Failure:  [Errno 111] Connection refused
2015-11-02 12:49:43,940 DEBUG         master:938  Trying to discover any worker instances associated with this cluster...
2015-11-02 12:49:43,940 DEBUG            ec2:336  Establishing boto EC2 connection
2015-11-02 12:49:44,472 DEBUG            ec2:328  Got region as 'RegionInfo:us-east-1'
2015-11-02 12:49:45,391 DEBUG            ec2:345  Got boto EC2 connection for region 'us-east-1'
2015-11-02 12:49:45,507 DEBUG            ec2:82   Gathering instance id, attempt 0
2015-11-02 12:49:45,510 DEBUG            ec2:88   Instance ID is 'i-c1234567'
2015-11-02 12:49:45,837 DEBUG            ec2:384  Adding tag 'clusterName:cloudlaunch-test-launch-v5' to resource 'i-c1234567'
2015-11-02 12:49:46,086 DEBUG            ec2:384  Adding tag 'role:master' to resource 'i-c1234567'
2015-11-02 12:49:46,341 DEBUG            ec2:384  Adding tag 'Name:master: cloudlaunch-test-launch-v5' to resource 'i-1234567'
2015-11-02 12:49:46,636 DEBUG       registry:164  Initiating loading of services
2015-11-02 12:49:46,637 DEBUG       registry:128  Loading service class in module 'cm/services/autoscale.py'
2015-11-02 12:49:46,890 DEBUG       registry:145  Importing service name AutoscaleService as module cm.services.autoscale
2015-11-02 12:49:46,893 DEBUG       registry:170  Loaded service Autoscale
2015-11-02 12:49:46,895 DEBUG       registry:128  Loading service class in module 'cm/services/apps/clouderamanager.py'
2015-11-02 12:49:47,906 DEBUG       registry:145  Importing service name ClouderaManagerService as module cm.services.apps.clouderamanager
2015-11-02 12:49:48,261 DEBUG       registry:170  Loaded service ClouderaManager
2015-11-02 12:49:48,262 DEBUG       registry:128  Loading service class in module 'cm/services/apps/pulsar.py'
2015-11-02 12:49:48,280 DEBUG       registry:145  Importing service name PulsarService as module cm.services.apps.pulsar
2015-11-02 12:49:48,286 DEBUG       registry:170  Loaded service Pulsar
2015-11-02 12:49:48,286 DEBUG       registry:128  Loading service class in module 'cm/services/apps/htcondor.py'
2015-11-02 12:49:48,306 DEBUG       registry:145  Importing service name HTCondorService as module cm.services.apps.htcondor
2015-11-02 12:49:48,310 DEBUG       htcondor:24   Condor is preparing
2015-11-02 12:49:48,310 DEBUG       registry:170  Loaded service HTCondor
2015-11-02 12:49:48,310 DEBUG       registry:128  Loading service class in module 'cm/services/apps/nginx.py'
2015-11-02 12:49:48,323 DEBUG       registry:145  Importing service name NginxService as module cm.services.apps.nginx
2015-11-02 12:49:48,327 DEBUG       registry:170  Loaded service Nginx
2015-11-02 12:49:48,327 DEBUG       registry:128  Loading service class in module 'cm/services/apps/nodejsproxy.py'
2015-11-02 12:49:48,332 DEBUG       registry:145  Importing service name NodejsProxyService as module cm.services.apps.nodejsproxy
2015-11-02 12:49:48,334 DEBUG       registry:170  Loaded service NodeJSProxy
2015-11-02 12:49:48,334 DEBUG       registry:128  Loading service class in module 'cm/services/apps/supervisor.py'
2015-11-02 12:49:48,457 DEBUG       registry:145  Importing service name SupervisorService as module cm.services.apps.supervisor
2015-11-02 12:49:48,461 DEBUG       registry:170  Loaded service Supervisor
2015-11-02 12:49:48,461 DEBUG       registry:128  Loading service class in module 'cm/services/apps/pss.py'
2015-11-02 12:49:48,470 DEBUG       registry:145  Importing service name PSSService as module cm.services.apps.pss
2015-11-02 12:49:48,473 DEBUG            pss:29   Configured PSS as master
2015-11-02 12:49:48,473 DEBUG       registry:170  Loaded service PSS
2015-11-02 12:49:48,473 DEBUG       registry:128  Loading service class in module 'cm/services/apps/postgres.py'
2015-11-02 12:49:48,487 DEBUG       registry:145  Importing service name PostgresService as module cm.services.apps.postgres
2015-11-02 12:49:48,516 DEBUG       registry:170  Loaded service Postgres
2015-11-02 12:49:48,517 DEBUG       registry:128  Loading service class in module 'cm/services/apps/hadoop.py'
2015-11-02 12:49:48,721 DEBUG       registry:145  Importing service name HadoopService as module cm.services.apps.hadoop
2015-11-02 12:49:48,737 DEBUG       registry:170  Loaded service Hadoop
2015-11-02 12:49:48,738 DEBUG       registry:128  Loading service class in module 'cm/services/apps/galaxy.py'
2015-11-02 12:49:48,750 DEBUG       registry:145  Importing service name GalaxyService as module cm.services.apps.galaxy
2015-11-02 12:49:48,755 DEBUG       registry:170  Loaded service Galaxy
2015-11-02 12:49:48,755 DEBUG       registry:128  Loading service class in module 'cm/services/apps/galaxyreports.py'
2015-11-02 12:49:48,761 DEBUG       registry:145  Importing service name GalaxyReportsService as module cm.services.apps.galaxyreports
2015-11-02 12:49:48,783 DEBUG       registry:170  Loaded service GalaxyReports
2015-11-02 12:49:48,784 DEBUG       registry:128  Loading service class in module 'cm/services/apps/migration.py'
2015-11-02 12:49:48,802 DEBUG       registry:145  Importing service name MigrationService as module cm.services.apps.migration
2015-11-02 12:49:48,808 DEBUG       registry:170  Loaded service Migration
2015-11-02 12:49:48,808 DEBUG       registry:128  Loading service class in module 'cm/services/apps/proftpd.py'
2015-11-02 12:49:48,813 DEBUG       registry:145  Importing service name ProFTPdService as module cm.services.apps.proftpd
2015-11-02 12:49:48,815 DEBUG        proftpd:26   Initializing ProFTPdService
2015-11-02 12:49:48,815 DEBUG       registry:170  Loaded service ProFTPd
2015-11-02 12:49:48,816 DEBUG       registry:128  Loading service class in module 'cm/services/apps/cloudgene.py'
2015-11-02 12:49:48,821 DEBUG       registry:145  Importing service name CloudgeneService as module cm.services.apps.cloudgene
2015-11-02 12:49:48,822 DEBUG       registry:170  Loaded service Cloudgene
2015-11-02 12:49:48,823 DEBUG       registry:128  Loading service class in module 'cm/services/apps/jobmanagers/sge.py'
2015-11-02 12:49:48,861 DEBUG       registry:145  Importing service name SGEService as module cm.services.apps.jobmanagers.sge
2015-11-02 12:49:48,879 DEBUG       registry:170  Loaded service SGE
2015-11-02 12:49:48,880 DEBUG       registry:128  Loading service class in module 'cm/services/apps/jobmanagers/slurmd.py'
2015-11-02 12:49:48,884 DEBUG       registry:145  Importing service name SlurmdService as module cm.services.apps.jobmanagers.slurmd
2015-11-02 12:49:48,885 DEBUG       registry:170  Loaded service Slurmd
2015-11-02 12:49:48,886 DEBUG       registry:128  Loading service class in module 'cm/services/apps/jobmanagers/slurmctld.py'
2015-11-02 12:49:49,225 DEBUG       registry:145  Importing service name SlurmctldService as module cm.services.apps.jobmanagers.slurmctld
2015-11-02 12:49:49,229 DEBUG          paths:228  Warning: Returning default transient file system path
2015-11-02 12:49:49,230 DEBUG       registry:170  Loaded service Slurmctld
2015-11-02 12:49:49,230 DEBUG         master:2401 Loaded services: {'NodeJSProxy': NodeJSProxy, 'Postgres': Postgres, 'ProFTPd': ProFTPd, 'Autoscale': Autoscale, 'GalaxyReports': Galaxy Reports service on port 9001, 'Hadoop': Hadoop, 'ClouderaManager': ClouderaManager, 'Slurmd': Slurmd, 'Nginx': Nginx, 'HTCondor': HTCondor, 'Supervisor': Supervisor, 'PSS': PSS, 'Pulsar': Pulsar, 'Slurmctld': Slurmctld, 'SGE': SGE, 'Cloudgene': Cloudgene, 'Galaxy': Galaxy, 'Migration': Migration}
2015-11-02 12:49:49,230 DEBUG         master:2721 Starting __monitor thread
2015-11-02 12:49:49,233 DEBUG         master:360  Config Data at manager start (with the following keys filtered out ['password', 'freenxpass', 'access_key', 'secret_key']): {'static_images_dir': 'static/images', 'storage_type': u'volume', 'is_secure': True, 's3_port': None, 'cloud_type': u'ec2', 'static_cache_time': '360', 'debug': 'true', 'initial_cluster_type': u'Galaxy', 'static_scripts_dir': 'static/scripts', 'cluster_name': u'cloudlaunch-test-launch-v5', 'role': 'master', 'bucket_cluster': 'bmsrd-ngs-galaxy', 'boot_script_path': '/opt/cloudman/boot', 'ec2_conn_path': u'/', 'region_name': u'us-east-1', 'region_endpoint': u'ec2.amazonaws.com', 'ec2_port': None, 'static_favicon_dir': 'static/favicon.ico', 'storage_size': u'10', 'use_translogger': 'False', 'boot_script_name': 'cm_boot.py', 'log_level': 'DEBUG', 'custom_image_id': u'ami-1234567', 'use_lint': 'false', 'cloudman_file_name': 'cm.tar.gz', 'global_conf': {'__file__': '/mnt/cm/cm_wsgi.ini', 'here': '/mnt/cm'}, 'template_path': 'templates', 'cloud_name': u'Amazon', 'static_dir': 'static', 'static_style_dir': 'static/style', 'cloudman_home': '/mnt/cm', 'bucket_default': 'cloudman', 'custom_instance_type': u'', 's3_host': u's3.amazonaws.com', 's3_conn_path': u'/', 'static_enabled': 'True'}
2015-11-02 12:49:49,233 DEBUG         master:2254 Generating root user's public key...
2015-11-02 12:49:49,247 DEBUG           base:57   Enabling 'root' controller, class: CM
2015-11-02 12:49:49,268 DEBUG       buildapp:93   Enabling 'httpexceptions' middleware
2015-11-02 12:49:49,270 DEBUG       buildapp:99   Enabling 'recursive' middleware
2015-11-02 12:49:49,273 DEBUG       buildapp:119  Enabling 'print debug' middleware
2015-11-02 12:49:49,281 DEBUG       buildapp:133  Enabling 'error' middleware
2015-11-02 12:49:49,281 DEBUG       buildapp:143  Enabling 'config' middleware
2015-11-02 12:49:49,283 DEBUG       buildapp:147  Enabling 'x-forwarded-host' middleware
Starting server in PID 23047.
serving on 0.0.0.0:42284 view at http://127.0.0.1:42284
2015-11-02 12:49:49,387 DEBUG         master:2257 Successfully generated root user's public key.
2015-11-02 12:49:49,387 DEBUG         master:2265 Successfully retrieved root user's public key from file.
2015-11-02 12:49:49,388 DEBUG         master:113  Activating service Migration
2015-11-02 12:49:49,388 DEBUG         master:157  ADD dependencies for service Migration
2015-11-02 12:49:49,388 DEBUG         master:113  Activating service Nginx
2015-11-02 12:49:49,388 DEBUG         master:157  ADD dependencies for service Nginx
2015-11-02 12:49:49,388 DEBUG         master:113  Activating service Supervisor
2015-11-02 12:49:49,388 DEBUG         master:157  ADD dependencies for service Supervisor
2015-11-02 12:49:49,413 DEBUG           misc:840  'cat /etc/lsb-release | grep DISTRIB_RELEASE | cut -f2 -d'='' command OK
2015-11-02 12:49:49,414 DEBUG         master:391  Running on Ubuntu True; using SGE as the cluster job manager
2015-11-02 12:49:49,415 DEBUG         master:113  Activating service SGE
2015-11-02 12:49:49,415 DEBUG         master:157  ADD dependencies for service SGE
2015-11-02 12:49:49,415 DEBUG       registry:99   Removing service Slurmctld from the registry
2015-11-02 12:49:49,415 DEBUG       registry:99   Removing service Slurmd from the registry
2015-11-02 12:49:49,416 DEBUG     filesystem:32   Instantiating Filesystem object transient_nfs with service roles: 'TransientNFS'
2015-11-02 12:49:49,416 DEBUG     filesystem:599  Configuring instance transient storage at /mnt/transient_nfs with NFS.
2015-11-02 12:49:49,416 DEBUG         master:107  Adding a new file system service into the registry: transient_nfs
2015-11-02 12:49:49,416 DEBUG       registry:111  Registering service transient_nfs with the registry
2015-11-02 12:49:49,416 DEBUG         master:113  Activating service transient_nfs
2015-11-02 12:49:49,416 DEBUG         master:157  ADD dependencies for service transient_nfs
2015-11-02 12:49:49,417 DEBUG         master:113  Activating service PSS
2015-11-02 12:49:49,417 DEBUG         master:157  ADD dependencies for service PSS
2015-11-02 12:49:49,417 DEBUG       registry:99   Removing service HTCondor from the registry
2015-11-02 12:49:49,417 DEBUG         master:483  Checking for and adding any previously defined cluster services
2015-11-02 12:49:49,417 DEBUG         master:489  Processing filesystems in an existing cluster config
2015-11-02 12:49:49,417 DEBUG         master:960  Trying to discover any volumes attached to this instance...
2015-11-02 12:49:49,686 DEBUG         master:977  Attached volumes: [Volume:vol-123456]
2015-11-02 12:49:49,687 DEBUG            ec2:384  Adding tag 'clusterName:cloudlaunch-test-launch-v5' to resource 'vol-123456'
2015-11-02 12:49:49,921 DEBUG         master:559  Activating previously-available application services from an existing cluster config.
2015-11-02 12:49:49,921 DEBUG         master:1335 initialize_cluster_with_custom_settings: cluster_type=Galaxy, storage_type=volume, storage_size=10
2015-11-02 12:49:49,921 INFO          master:1378 Initializing 'Galaxy' cluster type with storage type 'volume'. Please wait...
2015-11-02 12:49:49,922 DEBUG         master:1435 Checking for cluster template definitions.
2015-11-02 12:49:50,829 DEBUG           misc:591  Retrieved file 'snaps.yaml' from bucket 'cloudman' on host 's3.amazonaws.com' to 'cm_snaps.yaml'.
2015-11-02 12:49:50,847 DEBUG            ec2:309  Got region name as 'us-east-1'
2015-11-02 12:49:50,847 DEBUG         master:310  Loaded default snapshot data for cloud amazon: [{'snap_id': 'snap-e6e1c04a', 'name': 'galaxy', 'roles': 'galaxyTools,galaxyData'}, {'snap_id': 'snap-4b20f451', 'name': 'galaxyIndices', 'roles': 'galaxyIndices'}]
2015-11-02 12:49:50,848 DEBUG         master:1461 Processing file system templates: [{'snap_id': 'snap-e6e1c04a', 'name': 'galaxy', 'roles': 'galaxyTools,galaxyData'}, {'snap_id': 'snap-4b20f451', 'name': 'galaxyIndices', 'roles': 'galaxyIndices'}]
2015-11-02 12:49:50,848 DEBUG         master:960  Trying to discover any volumes attached to this instance...
2015-11-02 12:49:51,020 DEBUG         master:977  Attached volumes: [Volume:vol-123456]
2015-11-02 12:49:51,020 DEBUG            ec2:384  Adding tag 'clusterName:cloudlaunch-test-launch-v5' to resource 'vol-123456'
2015-11-02 12:49:51,226 DEBUG         master:1464 Processing file system template: galaxy
2015-11-02 12:49:51,226 DEBUG     filesystem:32   Instantiating Filesystem object galaxy with service roles: 'galaxyTools,galaxyData'
2015-11-02 12:49:51,226 DEBUG         master:584  Checking if vol 'vol-167c56ec' is file system 'galaxy'
2015-11-02 12:49:51,226 DEBUG            ec2:409  Getting tag 'clusterName' on resource 'vol-167c56ec'
2015-11-02 12:49:51,226 DEBUG            ec2:409  Getting tag 'filesystem' on resource 'vol-167c56ec'
2015-11-02 12:49:51,226 DEBUG         master:1476 There are no volumes already attached for file system galaxy
2015-11-02 12:49:51,226 DEBUG     filesystem:576  Adding Volume (id=None, size=10, snap=snap-e6e1c04a) into Filesystem galaxy FS
2015-11-02 12:49:51,227 DEBUG         volume:64   Volume service object created; creating the actual volume.
2015-11-02 12:49:51,353 DEBUG         volume:349  Creating a new volume of size '10' in zone 'us-east-1a' from snapshot 'snap-e6e1c04a' for galaxy FS.
2015-11-02 12:49:51,637 DEBUG         volume:360  Created a new volume of size '10' from snapshot 'snap-e6e1c04a' with ID 'vol-4b5c76b1' in zone 'us-east-1a' for galaxy FS.
2015-11-02 12:49:51,638 DEBUG            ec2:384  Adding tag 'clusterName:cloudlaunch-test-launch-v5' to resource 'vol-4b5c76b1'
2015-11-02 12:49:51,829 DEBUG            ec2:384  Adding tag 'bucketName:my_company_bucket' to resource 'vol-4b5c76b1'
2015-11-02 12:49:52,040 DEBUG         volume:148  Getting snaps derived from volume vol-4b5c76b1.
2015-11-02 12:50:33,275 DEBUG         volume:159  Got snaps derived from volume vol-4b5c76b1 in 41.2352890968 seconds: []
2015-11-02 12:50:33,278 DEBUG         master:107  Adding a new file system service into the registry: galaxy
2015-11-02 12:50:33,278 DEBUG       registry:111  Registering service galaxy with the registry
2015-11-02 12:50:33,278 DEBUG         master:113  Activating service galaxy
2015-11-02 12:50:33,279 DEBUG         master:157  ADD dependencies for service galaxy
2015-11-02 12:50:33,279 DEBUG         master:1464 Processing file system template: galaxyIndices
2015-11-02 12:50:33,279 DEBUG     filesystem:32   Instantiating Filesystem object galaxyIndices with service roles: 'galaxyIndices'
2015-11-02 12:50:33,279 DEBUG         master:584  Checking if vol 'vol-167c56ec' is file system 'galaxyIndices'
2015-11-02 12:50:33,279 DEBUG            ec2:409  Getting tag 'clusterName' on resource 'vol-167c56ec'
2015-11-02 12:50:33,279 DEBUG            ec2:409  Getting tag 'filesystem' on resource 'vol-167c56ec'
2015-11-02 12:50:33,279 DEBUG         master:1476 There are no volumes already attached for file system galaxyIndices
2015-11-02 12:50:33,280 DEBUG     filesystem:576  Adding Volume (id=None, size=0, snap=snap-4b20f451) into Filesystem galaxyIndices FS
2015-11-02 12:50:33,280 DEBUG         volume:64   Volume service object created; creating the actual volume.
2015-11-02 12:50:34,088 DEBUG         volume:349  Creating a new volume of size '700' in zone 'us-east-1a' from snapshot 'snap-4b20f451' for galaxyIndices FS.
2015-11-02 12:50:34,289 DEBUG         volume:360  Created a new volume of size '700' from snapshot 'snap-4b20f451' with ID 'vol-455d77bf' in zone 'us-east-1a' for galaxyIndices FS.
2015-11-02 12:50:34,289 DEBUG            ec2:384  Adding tag 'clusterName:cloudlaunch-test-launch-v5' to resource 'vol-455d77bf'
2015-11-02 12:50:34,475 DEBUG            ec2:384  Adding tag 'bucketName:my_company_bucket' to resource 'vol-455d77bf'
2015-11-02 12:50:34,908 DEBUG         volume:148  Getting snaps derived from volume vol-455d77bf.
2015-11-02 12:51:02,842 DEBUG         volume:159  Got snaps derived from volume vol-455d77bf in 27.9343369007 seconds: []
2015-11-02 12:51:02,844 DEBUG         master:107  Adding a new file system service into the registry: galaxyIndices
2015-11-02 12:51:02,844 DEBUG       registry:111  Registering service galaxyIndices with the registry
2015-11-02 12:51:02,844 DEBUG         master:113  Activating service galaxyIndices
2015-11-02 12:51:02,844 DEBUG         master:157  ADD dependencies for service galaxyIndices
2015-11-02 12:51:02,844 DEBUG         master:113  Activating service Postgres
2015-11-02 12:51:02,844 DEBUG         master:157  ADD dependencies for service Postgres
2015-11-02 12:51:02,844 DEBUG         master:113  Activating service ProFTPd
2015-11-02 12:51:02,845 DEBUG         master:157  ADD dependencies for service ProFTPd
2015-11-02 12:51:02,845 DEBUG         master:113  Activating service Galaxy
2015-11-02 12:51:02,845 DEBUG         master:157  ADD dependencies for service Galaxy
2015-11-02 12:51:02,845 DEBUG         master:113  Activating service GalaxyReports
2015-11-02 12:51:02,845 DEBUG         master:157  ADD dependencies for service GalaxyReports
2015-11-02 12:51:02,845 DEBUG         master:113  Activating service NodeJSProxy
2015-11-02 12:51:02,846 DEBUG         master:157  ADD dependencies for service NodeJSProxy
2015-11-02 12:51:02,846 DEBUG            ec2:223  Gathering instance private IP, attempt 0
2015-11-02 12:51:02,849 DEBUG            ec2:242  Gathering instance local hostname, attempt 0
2015-11-02 12:51:02,905 DEBUG           misc:840  'cp /etc/hosts /etc/hosts.orig' command OK
2015-11-02 12:51:02,933 DEBUG           misc:840  'cp /tmp/tmpZ2hwc2 /etc/hosts' command OK
2015-11-02 12:51:02,958 DEBUG           misc:840  'chmod 644 /etc/hosts' command OK
2015-11-02 12:51:02,958 DEBUG           misc:1081 Added the following line to /etc/hosts: 1.3.3.7 ip-1-3-3-7.cloud.domain.com ip-1-3-3-7 master

2015-11-02 12:51:03,006 ERROR           misc:849  ---> PROBLEM, running command 'service hostname restart' returned code '1', the following stderr: 'hostname: unrecognized service
' and stdout: ''
2015-11-02 12:51:03,007 INFO          master:460  Completed the initial cluster startup process. Configuring a predefined cluster of type Galaxy.
2015-11-02 12:51:03,007 DEBUG         master:2726 Monitor started; manager started
2015-11-02 12:51:08,008 DEBUG         master:2737 Trying to setup AMQP connection; conn = '<cm.util.comm.CMMasterComm object at 0x7f55a4c53110>'
2015-11-02 12:51:08,010 DEBUG           comm:30   Setting up a new AMQP connection
2015-11-02 12:51:08,011 DEBUG           comm:49   AMQP Connection Failure:  [Errno 111] Connection refused
2015-11-02 12:51:13,013 DEBUG         master:2737 Trying to setup AMQP connection; conn = '<cm.util.comm.CMMasterComm object at 0x7f55a4c53110>'
2015-11-02 12:51:13,014 DEBUG           comm:30   Setting up a new AMQP connection
2015-11-02 12:51:13,015 DEBUG           comm:49   AMQP Connection Failure:  [Errno 111] Connection refused
2015-11-02 12:51:18,015 DEBUG         master:2737 Trying to setup AMQP connection; conn = '<cm.util.comm.CMMasterComm object at 0x7f55a4c53110>'
2015-11-02 12:51:18,016 DEBUG           comm:30   Setting up a new AMQP connection
2015-11-02 12:51:18,016 DEBUG           comm:49   AMQP Connection Failure:  [Errno 111] Connection refused
2015-11-02 12:51:23,018 DEBUG         master:2737 Trying to setup AMQP connection; conn = '<cm.util.comm.CMMasterComm object at 0x7f55a4c53110>'
2015-11-02 12:51:23,019 DEBUG           comm:30   Setting up a new AMQP connection
2015-11-02 12:51:23,020 DEBUG           comm:49   AMQP Connection Failure:  [Errno 111] Connection refused
2015-11-02 12:51:28,021 DEBUG         master:2737 Trying to setup AMQP connection; conn = '<cm.util.comm.CMMasterComm object at 0x7f55a4c53110>'
2015-11-02 12:51:28,022 DEBUG           comm:30   Setting up a new AMQP connection
2015-11-02 12:51:28,022 DEBUG           comm:49   AMQP Connection Failure:  [Errno 111] Connection refused

17.09 release todo

17.09 will still be a minor-update release; the following known issues from earlier releases are slated to be fixed in it:

  • Update IE images being staged
  • Include /mnt/galaxy/galaxy-app/config/plugins/interactive_environments/jupyter/config/jupyter.ini to set use_volumes = False (unless the error when loading a BAM file is resolved: docker: Error response from daemon: Invalid bind mount spec "/mnt/galaxy/files/000/dataset_9.dat:/import/[9] MergeSamFiles on data 7, data 6, and data 5: Merged BAM dataset.bam:rw": Invalid volume specification: '/mnt/galaxy/files/000/dataset_9.dat:/import/[9] MergeSamFiles on data 7, data 6, and data 5: Merged BAM dataset.bam:rw'.)
  • Set command_inject = -e DEFAULT_CONTAINER_RUNTIME=900 in jupyter.ini (a sketch of the resulting file follows this list)
  • Include /mnt/galaxy/galaxy-app/config/plugins/interactive_environments/rstudio/config/rstudio.ini and set use_volumes = False
  • Update galaxy.ini proxy settings as per https://docs.galaxyproject.org/en/master/admin/interactive_environments.html#configuring-the-proxy
  • Update nginx's conf (galaxy.locations) to fix the incorrect url prefix for IEs (it includes /galaxy/...)
  • Update conda_channels in galaxy.ini to iuc,bioconda,conda-forge,defaults,r
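
For reference, a sketch of what the added jupyter.ini might look like; the [docker] section and option names are assumed to follow the stock IE template shipped with Galaxy, so treat this as a starting point rather than a verified config:

[docker]
# Assumed option from the stock template: don't mount datasets as Docker volumes
use_volumes = False
# Extra arguments injected into the docker invocation
command_inject = -e DEFAULT_CONTAINER_RUNTIME=900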

Decouple toggling master as exec host from autoscaling

We are exploring an option of running clusters on demand with a large master node (for better EBS/NFS throughput). Since these large master nodes have a few spare CPU cores available, we would like to allow the master to run jobs.

However, it appears to be impossible to switch the master to run jobs when autoscaling is enabled (message in the Admin console when clicking on "Switch master to run jobs": "Master is not an execution host"; message in the Cluster info log: "The master instance is set to not execute jobs. To manually change this, use the CloudMan Admin panel.").

Would it be possible to decouple toggling master as exec host from autoscaling?
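
In the meantime, a possible manual workaround is to flip the Slurm node state directly on the master, using the same scontrol command CloudMan itself logs; a sketch (CloudMan will likely revert this on its next Slurm reconfiguration):

# Put the master back into service so Slurm schedules jobs on it
sudo /usr/bin/scontrol update NodeName=master State=RESUME
# ...and to take it back out of service later
sudo /usr/bin/scontrol update NodeName=master Reason="manually-disabled" State=DRAIN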

Root volume space utilization increases during file uploads

/etc/nginx/sites-enabled/galaxy.locations contains upload_store /mnt/galaxy/upload_store;. However, no files appear in /mnt/galaxy/upload_store during file uploads, while usage of the root volume increases:

ubuntu@ip-172-31-15-47:/etc/nginx$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            1.9G   12K  1.9G   1% /dev
tmpfs           377M  8.5M  369M   3% /run
/dev/xvda1       20G  5.9G   13G  32% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            1.9G     0  1.9G   0% /run/shm
none            100M     0  100M   0% /run/user
cm_processes    1.9G     0  1.9G   0% /run/cloudera-scm-agent/process
/dev/xvdf       200G  9.8G  191G   5% /mnt/galaxy
/dev/xvdg        80G   65G   16G  81% /mnt/galaxyIndices

ubuntu@ip-172-31-15-47:/etc/nginx$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            1.9G   12K  1.9G   1% /dev
tmpfs           377M  8.5M  368M   3% /run
/dev/xvda1       20G   11G  8.6G  55% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none            1.9G     0  1.9G   0% /run/shm
none            100M     0  100M   0% /run/user
cm_processes    1.9G     0  1.9G   0% /run/cloudera-scm-agent/process
/dev/xvdf       200G   11G  190G   6% /mnt/galaxy
/dev/xvdg        80G   65G   16G  81% /mnt/galaxyIndices

This is potentially problematic because the total size of the root volume is 20GB and genomic data files can be up to 10GB each or more.

Setting client_body_temp_path to /mnt/galaxy/upload_store avoids filling up the root partition, but this may or may not be the right solution.
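
For concreteness, this is the kind of change meant above. client_body_temp_path is a stock nginx directive (valid at the http, server, or location level), so where exactly it belongs in CloudMan's generated config is an open question:

# /etc/nginx/sites-enabled/galaxy.locations (excerpt)
client_body_temp_path /mnt/galaxy/upload_store;
upload_store          /mnt/galaxy/upload_store;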

Cloning cluster fails due to missing permissions

The target AWS account must have the List permission on the source S3 bucket. The target account must also have permission to access the galaxyIndicesFS snapshot.

Neither permission is set by the automated sharing process (unlike the Open/Download permission on the source bucket objects). This results in the cluster failing to start, with errors like the following ('snap-7b992575' is the galaxyIndicesFS snapshot):

22:10:01 - Shared cluster's bucket 'cm-5b88aa0dbe0f7fc5765ccdc7dc116101' does not exist or is not accessible!
22:20:10 - Created a data volume 'vol-30e69299' of size 200GB from shared cluster's snapshot 'snap-497a6b49'
22:20:11 - EC2ResponseError retrieving snapshot IDs ['snap-7b992575']: EC2ResponseError: 400 Bad Request InvalidSnapshot.NotFoundThe snapshot 'snap-7b992575' does not exist.81ea690c-4ff9-4171-8029-7572a36816f3
22:20:11 - Did not retrieve Snapshot object for snap-7b992575; aborting.
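
A sketch of granting the two missing permissions by hand with the AWS CLI (account IDs are placeholders; note that put-bucket-acl replaces the entire bucket ACL, so the owner's grant has to be restated):

# Let the target account create volumes from the shared snapshot
aws ec2 modify-snapshot-attribute --snapshot-id snap-7b992575 \
    --attribute createVolumePermission --operation-type add \
    --user-ids <target-account-id>

# Grant the target account READ (i.e., List) on the source bucket
aws s3api put-bucket-acl --bucket cm-5b88aa0dbe0f7fc5765ccdc7dc116101 \
    --grant-read id=<target-canonical-user-id> \
    --grant-full-control id=<owner-canonical-user-id>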

Adjusting autoscaling parameters

Hi everyone,
I am finding CloudMan's autoscaling feature to be too conservative. I have a situation where I don't mind the extra cost of firing up new nodes and I simply want to clear the queue as quickly as possible. How can I adjust CloudMan to fire up new instances whenever it sees stalled jobs in the queue? Can someone shed light on the formula or algorithm that CloudMan uses to decide when to launch a new node?

Empty disk usage for FS 'transient_nfs'

AMI: ami-d5246abf, Name: Galaxy-CloudMan-1449500413.

Looks like a spurious warning message cluttering the Cluster info log:

17:29:12 - Empty disk usage for FS 'transient_nfs'
17:30:18 - Empty disk usage for FS 'transient_nfs'
17:31:21 - Empty disk usage for FS 'transient_nfs'
17:32:21 - Empty disk usage for FS 'transient_nfs'

A corresponding entry from the CloudMan log:

2016-02-12 17:29:13,109 DEBUG         master:2761 S&S: AS..OK; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..Unstarted; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-02-12 17:30:18,567 WARNING   filesystem:487  Empty disk usage for FS 'transient_nfs'

However, everything seems to work fine. This usually (but not always) starts occurring after a cluster is brought back up after a shutdown (never with a brand new cluster).

Cloning cluster doesn't complete

The process appears to hang with the following messages in the CloudMan log:

2016-03-01 22:40:53,458 DEBUG           misc:205  Checking if bucket 'cm-3e863b9434af2d0e55c834cf200efc50' exists... it does not.
2016-03-01 22:40:53,458 DEBUG           misc:217  Creating bucket 'cm-3e863b9434af2d0e55c834cf200efc50'.
2016-03-01 22:40:53,612 DEBUG           misc:219  Created bucket 'cm-3e863b9434af2d0e55c834cf200efc50'.
2016-03-01 22:40:53,661 DEBUG           misc:591  Retrieved file 'shared/2016-03-01--22-05/shared_instance_file_list.txt' from bucket 'cm-5b88aa0dbe0f7fc5765ccdc7dc116101' on host 's3.amazonaws.com' to 'shared_instance_file_list.txt'.
2016-03-01 22:40:53,671 DEBUG           misc:643  Establishing handle with key object 'shared/2016-03-01--22-05/persistent_data.yaml'
2016-03-01 22:40:53,671 DEBUG           misc:647  Copying file 'cm-5b88aa0dbe0f7fc5765ccdc7dc116101/shared/2016-03-01--22-05/persistent_data.yaml' to file 'cm-3e863b9434af2d0e55c834cf200efc50/persistent_data.yaml'
2016-03-01 22:40:53,876 DEBUG           misc:643  Establishing handle with key object 'shared/2016-03-01--22-05/cm.tar.gz'
2016-03-01 22:40:53,876 DEBUG           misc:647  Copying file 'cm-5b88aa0dbe0f7fc5765ccdc7dc116101/shared/2016-03-01--22-05/cm.tar.gz' to file 'cm-3e863b9434af2d0e55c834cf200efc50/cm.tar.gz'
2016-03-01 22:40:54,259 DEBUG           misc:643  Establishing handle with key object 'shared/2016-03-01--22-05/cm_boot.py'
2016-03-01 22:40:54,259 DEBUG           misc:647  Copying file 'cm-5b88aa0dbe0f7fc5765ccdc7dc116101/shared/2016-03-01--22-05/cm_boot.py' to file 'cm-3e863b9434af2d0e55c834cf200efc50/cm_boot.py'
2016-03-01 22:40:54,454 DEBUG           misc:591  Retrieved file 'persistent_data.yaml' from bucket 'cm-3e863b9434af2d0e55c834cf200efc50' on host 's3.amazonaws.com' to 'shared_p_d.yaml'.
2016-03-01 22:40:54,462 DEBUG         master:1618 Initializing Galaxy cluster type from shared cluster
2016-03-01 22:40:54,693 INFO          master:1644 Created a data volume 'vol-153123ca' of size 200GB from shared cluster's snapshot 'snap-497a6b49'
2016-03-01 22:40:54,693 DEBUG         master:1650 Dumping scpd to file cm_cluster_config.yaml (which will become persistent_data.yaml): {'placement': 'us-east-1b', 'persistent_data_version': 3, 'cluster_type': u'Galaxy', 'deployment_version': 2, 'filesystems': [{'kind': 'snapshot', 'mount_point': '/mnt/galaxyIndices', 'ids': ['snap-7b992575'], 'roles': ['galaxyIndices'], 'name': 'galaxyIndices'}, {'kind': 'volume', 'mount_point': '/mnt/galaxy', 'ids': [u'vol-153123ca'], 'roles': ['galaxyTools', 'galaxyData'], 'name': 'galaxy'}], 'cluster_name': u'cloudman-test', 'machine_image_id': 'ami-d5246abf', 'services': [{'name': 'Postgres', 'roles': ['Postgres']}, {'name': 'ProFTPd', 'roles': ['ProFTPd']}, {'name': 'GalaxyReports', 'roles': ['GalaxyReports']}, {'name': 'Slurmd', 'roles': ['Slurmd']}, {'name': 'Nginx', 'roles': ['Nginx']}, {'name': 'NodeJSProxy', 'roles': ['NodeJSProxy']}, {'name': 'Supervisor', 'roles': ['Supervisor']}, {'name': 'Slurmctld', 'roles': ['Slurmctld', 'Job manager']}, {'home': '/mnt/galaxy/galaxy-app', 'name': 'Galaxy', 'roles': ['Galaxy']}], 'cluster_storage_type': u'volume'}
2016-03-01 22:40:54,729 DEBUG           misc:615  Saved file 'cm_cluster_config.yaml' of size 929B as 'persistent_data.yaml' to bucket 'cm-3e863b9434af2d0e55c834cf200efc50'
2016-03-01 22:40:54,776 DEBUG           misc:591  Retrieved file 'persistent_data.yaml' from bucket 'cm-3e863b9434af2d0e55c834cf200efc50' on host 's3.amazonaws.com' to 'pd.yaml'.
2016-03-01 22:40:54,883 DEBUG         master:483  Checking for and adding any previously defined cluster services
2016-03-01 22:40:54,884 DEBUG         master:489  Processing filesystems in an existing cluster config
2016-03-01 22:40:54,884 DEBUG         master:960  Trying to discover any volumes attached to this instance...
2016-03-01 22:40:55,275 DEBUG         master:977  Attached volumes: [Volume:vol-e80c1e37]
2016-03-01 22:40:55,276 DEBUG            ec2:389  Adding tag 'clusterName:cloudman-dev' to resource 'vol-e80c1e37'
2016-03-01 22:40:55,356 DEBUG     filesystem:32   Instantiating Filesystem object galaxyIndices with service roles: 'galaxyIndices'
2016-03-01 22:40:55,356 DEBUG         master:584  Checking if vol 'vol-e80c1e37' is file system 'galaxyIndices'
2016-03-01 22:40:55,356 DEBUG            ec2:414  Getting tag 'clusterName' on resource 'vol-e80c1e37'
2016-03-01 22:40:55,356 DEBUG            ec2:414  Getting tag 'filesystem' on resource 'vol-e80c1e37'
2016-03-01 22:40:55,356 DEBUG     filesystem:576  Adding Volume (id=None, size=0, snap=snap-7b992575) into Filesystem galaxyIndices FS
2016-03-01 22:40:55,356 DEBUG         volume:64   Volume service object created; creating the actual volume.
2016-03-01 22:40:55,521 DEBUG         volume:349  Creating a new volume of size '80' in zone 'us-east-1c' from snapshot 'snap-7b992575' for galaxyIndices FS.
2016-03-01 22:40:55,652 DEBUG         volume:360  Created a new volume of size '80' from snapshot 'snap-7b992575' with ID 'vol-8131235e' in zone 'us-east-1c' for galaxyIndices FS.
2016-03-01 22:40:55,652 DEBUG            ec2:389  Adding tag 'clusterName:cloudman-dev' to resource 'vol-8131235e'
2016-03-01 22:40:55,786 DEBUG            ec2:389  Adding tag 'bucketName:cm-3e863b9434af2d0e55c834cf200efc50' to resource 'vol-8131235e'
2016-03-01 22:40:55,903 DEBUG         volume:148  Getting snaps derived from volume vol-8131235e.
2016-03-01 22:40:57,878 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..Unstarted; GalaxyReports..Unstarted; Migration..Completed; Nginx..OK; NodeJSProxy..Unstarted; PSS..Unstarted; Postgres..Unstarted; ProFTPd..Unstarted; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; transient_nfs FS..OK;
2016-03-01 22:40:57,878 DEBUG         master:2618 Monitor adding service 'PSS'
2016-03-01 22:40:57,878 DEBUG            pss:88   Galaxy service not running yet; will wait.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/mnt/cm/cm/master.py", line 2790, in __monitor
    config_changed = self._start_services()
  File "/mnt/cm/cm/master.py", line 2621, in _start_services
    if service.add():
  File "/mnt/cm/cm/services/apps/pss.py", line 95, in add
    (break_srvc.get_full_name(), break_srvc.state, self.name))
UnboundLocalError: local variable 'break_srvc' referenced before assignment

2016-03-01 22:41:00,490 DEBUG         volume:159  Got snaps derived from volume vol-8131235e in 4.58710408211 seconds: []
2016-03-01 22:41:00,491 DEBUG         master:542  Adding a previously existing filesystem 'galaxyIndices' of kind 'snapshot'
2016-03-01 22:41:00,491 DEBUG         master:107  Adding a new file system service into the registry: galaxyIndices
2016-03-01 22:41:00,491 DEBUG       registry:111  Registering service galaxyIndices with the registry
2016-03-01 22:41:00,491 DEBUG         master:113  Activating service galaxyIndices
2016-03-01 22:41:00,492 DEBUG         master:157  ADD dependencies for service galaxyIndices
2016-03-01 22:41:00,492 DEBUG     filesystem:32   Instantiating Filesystem object galaxy with service roles: 'galaxyTools,galaxyData'
2016-03-01 22:41:00,492 DEBUG     filesystem:576  Adding Volume (id=vol-153123ca, size=0, snap=None) into Filesystem galaxy FS
2016-03-01 22:41:00,492 DEBUG         volume:169  Getting an update on volume vol-153123ca (<type 'unicode'>)
2016-03-01 22:41:00,492 DEBUG         volume:172  Retrieving a reference to the Volume object for ID vol-153123ca
2016-03-01 22:41:00,570 DEBUG         volume:191  Updating current `volume` object reference 'None' to a new one 'vol-153123ca'
2016-03-01 22:41:00,570 DEBUG         volume:199  For volume vol-153123ca (galaxy FS) set from_snapshot_id to snap-497a6b49
2016-03-01 22:41:00,575 DEBUG         volume:223  Volume vol-153123ca is not attached.
2016-03-01 22:41:00,575 DEBUG         volume:148  Getting snaps derived from volume vol-153123ca.
2016-03-01 22:41:02,775 DEBUG         volume:159  Got snaps derived from volume vol-153123ca in 2.19990801811 seconds: []
2016-03-01 22:41:02,776 DEBUG         master:542  Adding a previously existing filesystem 'galaxy' of kind 'volume'
2016-03-01 22:41:02,776 DEBUG         master:107  Adding a new file system service into the registry: galaxy
2016-03-01 22:41:02,776 DEBUG       registry:111  Registering service galaxy with the registry
2016-03-01 22:41:02,776 DEBUG         master:113  Activating service galaxy
2016-03-01 22:41:02,776 DEBUG         master:157  ADD dependencies for service galaxy
2016-03-01 22:41:02,776 DEBUG         master:559  Activating previously-available application services from an existing cluster config.
2016-03-01 22:41:02,776 DEBUG         master:113  Activating service Postgres
2016-03-01 22:41:02,776 DEBUG         master:157  ADD dependencies for service Postgres
2016-03-01 22:41:02,776 DEBUG         master:113  Activating service ProFTPd
2016-03-01 22:41:02,777 DEBUG         master:157  ADD dependencies for service ProFTPd
2016-03-01 22:41:02,777 DEBUG         master:113  Activating service GalaxyReports
2016-03-01 22:41:02,777 DEBUG         master:157  ADD dependencies for service GalaxyReports
2016-03-01 22:41:02,777 DEBUG         master:113  Activating service Slurmd
2016-03-01 22:41:02,777 DEBUG         master:157  ADD dependencies for service Slurmd
2016-03-01 22:41:02,777 DEBUG         master:113  Activating service Nginx
2016-03-01 22:41:02,777 DEBUG         master:157  ADD dependencies for service Nginx
2016-03-01 22:41:02,777 DEBUG         master:113  Activating service NodeJSProxy
2016-03-01 22:41:02,777 DEBUG         master:157  ADD dependencies for service NodeJSProxy
2016-03-01 22:41:02,778 DEBUG         master:113  Activating service Supervisor
2016-03-01 22:41:02,778 DEBUG         master:157  ADD dependencies for service Supervisor
2016-03-01 22:41:02,778 DEBUG         master:113  Activating service Slurmctld
2016-03-01 22:41:02,778 DEBUG         master:157  ADD dependencies for service Slurmctld
2016-03-01 22:41:02,778 DEBUG         master:113  Activating service Galaxy
2016-03-01 22:41:02,778 DEBUG         master:157  ADD dependencies for service Galaxy
2016-03-01 22:44:41,768 DEBUG            ec2:205  Gathering instance public keys (i.e., key pairs), attempt 0
2016-03-01 22:44:41,771 DEBUG            ec2:212  Got key pair: 'cloudman_key_pair

Also, galaxyFS and galaxyIndicesFS volumes are not attached to the master instance.

Transient nfs files are not shared

Hi everyone,
I've got a cluster running where the directory /mnt/transient_nfs should be the ephemeral storage used by CloudMan to share the hosts and slurm.conf files. I've noticed that when I touch a file on either the client or server side, the file is not accessible across NFS. Strangely, the files that CloudMan itself created (e.g., the file lock) are available.
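
A few commands that might help narrow down whether the export or the mount is at fault (a diagnostic sketch; the first pair is for the master, the second for a worker):

# On the master: what is actually exported, and to whom
sudo exportfs -v
cat /etc/exports

# On a worker: what is mounted, and from where
mount | grep transient_nfs
showmount -e master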

Sun Grid Engine service still required during Cloudman launch on CentOS

Hello again,
I've been following the Galaxy-Cloudman-playbook to configure my CentOS AMI, and a few crucial services fail to start. One of them is SGE, which is not listed as a required package in the Ansible playbook. I notice that Slurm and Slurm-drmaa are required during the Galaxy-Cloudman playbook and have been installed. SGE does not start (the required SGE packages and versions have not been defined or installed), yet CloudMan does not initialize its Slurm capabilities either. What can I do to fix this?

Autoscaling fails to start worker nodes

Steps to reproduce

  1. Spin up a new or saved cluster (Galaxy 16.01, AMI ami-b45e59de, default bucket)
  2. Turn on autoscaling with min number of nodes = 0 and max number of nodes = 2
  3. Run a job

Observed results

The job doesn't trigger the launch of worker nodes and fails.
Galaxy log:

galaxy.jobs.runners.drmaa DEBUG 2016-05-04 18:59:59,387 (446) native specification is: --ntasks=16 --nodes=1
galaxy.jobs.runners.drmaa WARNING 2016-05-04 18:59:59,389 (446) drmaa.Session.runJob() failed, will retry: code 1: slurm_submit_batch_job: More processors requested than permitted
galaxy.jobs.runners.drmaa WARNING 2016-05-04 19:00:04,395 (446) drmaa.Session.runJob() failed, will retry: code 1: slurm_submit_batch_job: More processors requested than permitted
galaxy.jobs.runners.drmaa WARNING 2016-05-04 19:00:09,401 (446) drmaa.Session.runJob() failed, will retry: code 1: slurm_submit_batch_job: More processors requested than permitted
galaxy.jobs.runners.drmaa WARNING 2016-05-04 19:00:14,405 (446) drmaa.Session.runJob() failed, will retry: code 1: slurm_submit_batch_job: More processors requested than permitted
127.0.0.1 - - [04/May/2016:19:00:17 +0000] "GET / HTTP/1.1" 200 - "-" "Python-urllib/2.7"
galaxy.jobs.runners.drmaa WARNING 2016-05-04 19:00:19,412 (446) drmaa.Session.runJob() failed, will retry: code 1: slurm_submit_batch_job: More processors requested than permitted
galaxy.jobs.runners.drmaa ERROR 2016-05-04 19:00:24,417 (446) All attempts to submit job failed

CloudMan log:

2016-05-04 18:57:22,651 DEBUG           root:691  Turning autoscaling ON
2016-05-04 18:57:22,652 DEBUG         master:114  Activating service Autoscale
2016-05-04 18:57:22,652 DEBUG         master:158  ADD dependencies for service Autoscale
2016-05-04 18:57:22,652 DEBUG         master:966  Setting master not to be an exec host.
2016-05-04 18:57:23,118 DEBUG           misc:850  '/usr/bin/scontrol update NodeName=master Reason="CloudMan-disabled" State=DRAIN' command OK
2016-05-04 18:57:23,118 INFO          master:983  The master instance is set to *not* execute jobs. To manually change this, use the CloudMan Admin panel.
2016-05-04 18:57:25,438 DEBUG      autoscale:151  Checking if cluster too SMALL: minute:57,idle:0,total workers:0,avail workers:0,min:0,max:2
2016-05-04 18:57:25,746 DEBUG         master:2854 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-05-04 18:57:25,746 DEBUG         master:2711 Monitor adding service 'AS'
2016-05-04 18:57:25,746 INFO        __init__:371  AS service prerequisites OK; starting the service.
2016-05-04 18:57:25,747 DEBUG      autoscale:81   Turning autoscaling ON; using instances of type 'c4.4xlarge'
2016-05-04 18:57:25,747 DEBUG         master:2716 Monitor done adding service AS (setting config_changed)
2016-05-04 18:57:25,761 DEBUG         master:2638 Storing cluster configuration to cluster's bucket

Expected results

Jobs should trigger the launch of worker nodes.
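
Until this is fixed, one workaround appears to be keeping the resource request within what an already-configured node provides, so the job queues (giving autoscaling a chance to react) instead of being rejected outright; a hypothetical destination along those lines:

<destination id="slurm_small" runner="slurm">
    <param id="nativeSpecification">--ntasks=1 --nodes=1</param>
</destination>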

How to provide persistent_data.yaml and snaps.yaml and uncaught URL error

I am having challenges getting CloudMan to launch successfully. The cm_boot.py script runs fine with CloudLaunch after some configuration and downloads the tar file "cm.tar.gz" from my bucket. This tar file was pulled from the cloudman bucket on 10/29/15. The cm_boot.py script then extracts it and triggers CloudMan's run.sh script.

2015-10-30 11:29:22,101 DEBUG  cm_boot:25  - Successfully ran '/bin/bash -l -c 'VIRTUALENVWRAPPER_LOG_DIR=/tmp/; HOME=/home/galaxy; . /home/galaxy/.venvburrito/startup.sh; workon CM; cd /mnt/cm; pip install -r /mnt/cm/requirements.txt; sh run.sh --daemon --log-file=/var/log/cloudman/cloudman.log''

I noticed that Cloudlaunch hangs when something goes wrong during Cloudman startup: Cloudlaunch #46

I get some 404s in the middle of run.sh because I do not also have snaps.yaml or persistent_data.yaml in my bucket. I have found a few mentions of "persistent_data.yaml" in the documentation, but no examples or descriptions of what I need to add there. Also, snaps.yaml is not described at all. What do I need to provide here, and do I need to supply these snapshots?

The uncaught URL error occurs next during the parsing of "s3.amazonaws.com:None". My s3 and ec2 ports are both null in the userData.yaml file. The error seems to start in the method "get_file_from_public_bucket" and my default bucket is not public. The same error seems to occur when trying to fetch snaps.yaml but doesn't occur when fetching persistent_data.yaml.

/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/deploy/loadwsgi.py:22: DeprecationWarning: Parameters to load are deprecated.  Call .resolve and .require separately.
  return pkg_resources.EntryPoint.parse("x=" + s).load(False)
Python version:  (2, 6)
Image configuration suports: {'apps': ['cloudman', 'galaxy']}
2015-10-30 11:29:26,124 DEBUG            app:74   Initializing app
2015-10-30 11:29:26,124 DEBUG            ec2:124  Gathering instance zone, attempt 0
2015-10-30 11:29:26,129 DEBUG            ec2:130  Instance zone is 'us-east-1a'
2015-10-30 11:29:26,129 DEBUG            ec2:48   Gathering instance ami, attempt 0
2015-10-30 11:29:26,131 DEBUG            app:77   Running on 'ec2' type of cloud in zone 'us-east-1a' using image 'ami-1234567'.
2015-10-30 11:29:26,131 DEBUG            app:95   Getting pd.yaml
2015-10-30 11:29:26,131 DEBUG            ec2:387  No S3 Connection, creating a new one.
2015-10-30 11:29:26,133 DEBUG            ec2:391  Got boto S3 connection.
2015-10-30 11:29:26,495 DEBUG           misc:578  Failed to get file 'persistent_data.yaml' from bucket 'my_company's_bucket_name': S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>persistent_data.yaml</Key><RequestId>foobar</RequestId><HostId>bobloblaw'slawblog</HostId></Error>
2015-10-30 11:29:26,495 DEBUG            app:102  Setting deployment_version to 2
2015-10-30 11:29:26,495 INFO             app:109  Master starting
2015-10-30 11:29:26,495 DEBUG         master:64   Initializing console manager - cluster start time: 2015-10-30 15:29:26.495641
2015-10-30 11:29:26,496 DEBUG           comm:42   AMQP Connection Failure:  [Errno 111] Connection refused
2015-10-30 11:29:26,496 DEBUG         master:857  Trying to discover any worker instances associated with this cluster...
2015-10-30 11:29:26,496 DEBUG            ec2:366  Establishing boto EC2 connection
2015-10-30 11:29:26,922 DEBUG            ec2:354  Got region as 'RegionInfo:us-east-1'
2015-10-30 11:29:28,012 DEBUG            ec2:375  Got boto EC2 connection for region 'us-east-1'
2015-10-30 11:29:28,305 DEBUG           misc:578  Failed to get file 'snaps.yaml' from bucket 'my_company's_bucket_name': S3ResponseError: 404 Not Found
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>snaps.yaml</Key><RequestId>barbaz</RequestId><HostId>bobloblaw'slawblog</HostId></Error>
Traceback (most recent call last):
  File "./scripts/paster.py", line 24, in <module>
    command.run()
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/script/command.py", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/script/command.py", line 143, in invoke
    exit_code = runner.run(args)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/script/command.py", line 238, in run
    result = self.command()
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/script/serve.py", line 284, in command
    relative_to=base, global_conf=vars)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/script/serve.py", line 321, in loadapp
    **kw)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/deploy/loadwsgi.py", line 247, in loadapp
    return loadobj(APP, uri, name=name, **kw)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/deploy/loadwsgi.py", line 272, in loadobj
    return context.create()
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/deploy/loadwsgi.py", line 710, in create
    return self.object_type.invoke(self)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/deploy/loadwsgi.py", line 229, in invoke
    filtered = context.next_context.create()
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/deploy/loadwsgi.py", line 710, in create
    return self.object_type.invoke(self)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/deploy/loadwsgi.py", line 146, in invoke
    return fix_call(context.object, context.global_conf, **context.local_conf)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/paste/deploy/util.py", line 56, in fix_call
    val = callable(*args, **kw)
  File "/mnt/cm/cm/buildapp.py", line 64, in app_factory
    app.startup()
  File "/mnt/cm/cm/app.py", line 111, in startup
    self.manager = master.ConsoleManager(self)
  File "/mnt/cm/cm/util/master.py", line 87, in __init__
    self.snaps = self._load_snapshot_data()
  File "/mnt/cm/cm/util/decorators.py", line 41, in df
    return fn(*args, **kwargs)
  File "/mnt/cm/cm/util/master.py", line 53, in newFunction
    return f(*args, **kw)
  File "/mnt/cm/cm/util/master.py", line 223, in _load_snapshot_data
    elif misc.get_file_from_public_bucket(self.app.ud, self.app.ud['bucket_default'], 'snaps.yaml', snaps_file):
  File "/mnt/cm/cm/util/misc.py", line 730, in get_file_from_public_bucket
    r = requests.get(url)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/requests/api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/requests/api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/requests/sessions.py", line 454, in request
    prep = self.prepare_request(req)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/requests/sessions.py", line 388, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/requests/models.py", line 293, in prepare
    self.prepare_url(url, params)
  File "/home/galaxy/.virtualenvs/CM/lib/python2.6/site-packages/requests/models.py", line 347, in prepare_url
    raise InvalidURL(*e.args)
requests.exceptions.InvalidURL: Failed to parse: s3.amazonaws.com:None
Removing PID file cm_webapp.pid
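
A guess at a workaround (untested): set the S3 connection values explicitly in userData.yaml instead of leaving the port null, since the failing URL appears to be built from s3_host and s3_port:

# userData.yaml (excerpt) -- keys as they appear in CloudMan's user data
s3_host: s3.amazonaws.com
s3_port: 443
s3_conn_path: /
is_secure: True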

CloudMan 16.01 and 16.04 fail to start

Use either https://beta.launch.usegalaxy.org/catalog/appliance/galaxy-cloud or https://launch.usegalaxy.org/launch to start a new or existing 16.01 or 16.04 cluster (ami-b45e59de).

  • Web interface displays:
    CloudMan is not running.
    It may take a few moments for CloudMan to start after your instance has become available, or there may have been a problem launching it. After a while, you may want to reboot this instance and see if that resolves the problem.
  • There is no /var/log/cloudman/cloudman.log
  • Relevant logs:
/var/log/cloudman/cm_boot.log

2017-10-23 19:52:14,984 INFO   cm_boot:240 - << Starting nginx >>
2017-10-23 19:52:14,987 DEBUG  cm_boot:162 - Reconfiguring nginx conf
2017-10-23 19:52:14,990 DEBUG  cm_boot:259 - Creating tmp dir for nginx /mnt/galaxy/upload_store
2017-10-23 19:52:14,990 DEBUG  cm_boot:58  - /usr/local/sbin/nginx is file: False; it's executable: False
2017-10-23 19:52:14,990 DEBUG  cm_boot:58  - /usr/local/bin/nginx is file: False; it's executable: False
2017-10-23 19:52:14,990 DEBUG  cm_boot:58  - /usr/bin/nginx is file: False; it's executable: False
2017-10-23 19:52:14,990 DEBUG  cm_boot:58  - /usr/sbin/nginx is file: True; it's executable: True
2017-10-23 19:52:14,990 DEBUG  cm_boot:268 - Using '/usr/sbin/nginx' as the nginx executable
2017-10-23 19:52:15,089 ERROR  cm_boot:31  - Error running 'ps xa | grep nginx | grep -v grep'. Process returned code '1' and following stderr: ''
2017-10-23 19:52:15,090 DEBUG  cm_boot:270 - nginx not running; will try and start it now
2017-10-23 19:52:17,549 DEBUG  cm_boot:25  - Successfully ran '/usr/sbin/nginx'
2017-10-23 19:52:17,549 DEBUG  cm_boot:279 - Deleting tmp dir for nginx /mnt/galaxy/upload_store
2017-10-23 19:52:17,551 DEBUG  cm_boot:25  - Successfully ran 'rm -rf /mnt/galaxy/upload_store'
2017-10-23 19:52:17,551 DEBUG  cm_boot:339 - Deleting /mnt/cm dir before download
2017-10-23 19:52:17,553 DEBUG  cm_boot:25  - Successfully ran 'rm -rf /mnt/cm'
2017-10-23 19:52:17,553 INFO   cm_boot:341 - << Downloading CloudMan >>
2017-10-23 19:52:17,553 DEBUG  cm_boot:43  - Checking existence of directory '/mnt/cm'
2017-10-23 19:52:17,553 DEBUG  cm_boot:46  - Creating directory '/mnt/cm'
2017-10-23 19:52:17,553 DEBUG  cm_boot:48  - Directory '/mnt/cm' successfully created.
2017-10-23 19:52:17,553 DEBUG  cm_boot:346 - Using user-provided default bucket: cloudman
2017-10-23 19:52:17,553 INFO   cm_boot:308 - connecting to Amazon S3 at http://s3.amazonaws.com/
2017-10-23 19:52:17,554 DEBUG  cm_boot:333 - Got boto S3 connection: S3Connection:s3.amazonaws.com
2017-10-23 19:52:17,554 DEBUG  cm_boot:203 - Checking if key 'cm.tar.gz' exists in bucket 'cm-298bcca1dd946c4f308a30e9527cdbc9'
2017-10-23 19:52:17,570 DEBUG  cm_boot:183 - Getting file cm.tar.gz from bucket cloudman
2017-10-23 19:52:17,570 DEBUG  cm_boot:187 - Attempting to retrieve file 'cm.tar.gz' from bucket 'cloudman'
2017-10-23 19:52:17,734 INFO   cm_boot:190 - Successfully retrieved file 'cm.tar.gz' from bucket 'cloudman' via connection 's3.amazonaws.com' to '/mnt/cm/cm.tar.gz'
2017-10-23 19:52:17,734 INFO   cm_boot:363 - Retrieved CloudMan (cm.tar.gz) from bucket 'cloudman' via local s3 connection
2017-10-23 19:52:17,734 INFO   cm_boot:407 - << Unpacking CloudMan from /mnt/cm/cm.tar.gz >>
2017-10-23 19:52:17,799 DEBUG  cm_boot:430 - virtualenv does not exist
2017-10-23 19:52:20,596 DEBUG  cm_boot:25  - Successfully ran 'virtualenv /opt/cloudman/boot/.venv'
2017-10-23 19:52:20,596 DEBUG  cm_boot:450 - Copying user data file from '/opt/cloudman/boot/userData.yaml' to '/mnt/cm/userData.yaml'
2017-10-23 19:52:20,596 INFO   cm_boot:453 - << Starting CloudMan in /mnt/cm >>
2017-10-23 19:52:20,596 DEBUG  cm_boot:428 - virtualenv seems to be installed
2017-10-23 19:52:36,363 ERROR  cm_boot:31  - Error running '. /opt/cloudman/boot/.venv/bin/activate && cd /mnt/cm; pip install -r /mnt/cm/requirements.txt; sh run.sh --daemon --log-file=/var/log/cloudman/cloudman.log'. Process returned code '1' and following stderr: 'Traceback (most recent call last):
  File "./scripts/paster.py", line 22, in <module>
    from paste.script import command
ImportError: No module named paste.script
'
2017-10-23 19:52:36,363 INFO   cm_boot:513 - ---> /opt/cloudman/boot/cm_boot.py done <---


/root/.pip/pip.log

  Downloading from URL https://pypi.python.org/packages/bf/da/6a9f49cc7a970380c8235b3adab0c08c7c3d4814876f7383b33e3882a577/cryptography-2.1.1.tar.gz#md5=fa98118b468020349a798776aac6d572 (from https://pypi.python.org/simple/cryptography/)
  Running setup.py (path:/opt/cloudman/boot/.venv/build/cryptography/setup.py) egg_info for package cryptography
    error in cryptography setup command: Invalid environment marker: python_version < '3'
    Complete output from command python setup.py egg_info:
    error in cryptography setup command: Invalid environment marker: python_version < '3'

----------------------------------------
Cleaning up...
  Removing temporary dir /opt/cloudman/boot/.venv/build...
Command python setup.py egg_info failed with error code 1 in /opt/cloudman/boot/.venv/build/cryptography
Exception information:
Traceback (most recent call last):
  File "/opt/cloudman/boot/.venv/local/lib/python2.7/site-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/opt/cloudman/boot/.venv/local/lib/python2.7/site-packages/pip/commands/install.py", line 278, in run
    requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
  File "/opt/cloudman/boot/.venv/local/lib/python2.7/site-packages/pip/req.py", line 1229, in prepare_files
    req_to_install.run_egg_info()
  File "/opt/cloudman/boot/.venv/local/lib/python2.7/site-packages/pip/req.py", line 325, in run_egg_info
    command_desc='python setup.py egg_info')
  File "/opt/cloudman/boot/.venv/local/lib/python2.7/site-packages/pip/util.py", line 697, in call_subprocess
    % (command_desc, proc.returncode, cwd))
InstallationError: Command python setup.py egg_info failed with error code 1 in /opt/cloudman/boot/.venv/build/cryptography

It appears that the error is due to an outdated version of pip and it is possible to update pip and start the CloudMan process manually:

sudo su -
. /opt/cloudman/boot/.venv/bin/activate
pip install -U pip setuptools
pip install -r /mnt/cm/requirements.txt
cd /mnt/cm
sh run.sh --daemon --log-file=/var/log/cloudman/cloudman.log

However, the cluster shows the same error after a restart.

Enable autoscaling based on memory specification

Submitting jobs specifying more memory than available on existing nodes results in failure.

Steps to reproduce

  1. Create a cluster with a c4.large master
  2. Configure autoscaling (min nodes: 0, max nodes: 1, custom type: c4.4xlarge)
  3. Configure a high memory job destination and use it for a tool definition. For example:
<destination id="slurm_high_memory" runner="slurm">
    <param id="nativeSpecification">--mem=25000</param>
</destination>

<tool id="wig_to_bigWig" destination="slurm_high_memory"/>
  4. Run the tool

Observed results

Galaxy log:

galaxy.jobs.runners.drmaa DEBUG 2017-01-27 21:00:42,163 (29) native specification is: --mem=25000
galaxy.jobs.runners.drmaa WARNING 2017-01-27 21:00:42,167 (29) drmaa.Session.runJob() failed, will retry: code 1: slurm_submit_batch_job: Requested node configuration is not available
...
galaxy.jobs.runners.drmaa ERROR 2017-01-27 21:01:07,190 (29) All attempts to submit job failed

Expected results

Job submission should succeed and autoscaling should add an instance of the specified type.

Compatibility with CentOS/Fedora family systems [Enhancement]

@afgane mentioned that CloudMan uses some commands specific to Ubuntu or Debian family systems and suggested searching for misc.run in the CloudMan repo. One hard-coded command I cannot find on my CentOS machine is start-stop-daemon, associated with the slurmd job manager. Otherwise, some of the commands executed are elements of iterable objects run in turn, so I'm not sure what commands could be executed through that mechanism. Can someone point me in the right direction as to which commands are Ubuntu/Debian specific, so I can look for equivalent CentOS commands?
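
For what it's worth, the closest stock RHEL/CentOS equivalents to start-stop-daemon are the daemon and killproc helpers from the init functions library; a sketch of how the slurmd start/stop might map (untested, paths assumed):

#!/bin/sh
# RHEL/CentOS init helper functions (provide daemon and killproc)
. /etc/rc.d/init.d/functions

case "$1" in
    start)
        # Roughly: start-stop-daemon --start --exec /usr/sbin/slurmd
        daemon /usr/sbin/slurmd
        ;;
    stop)
        # Roughly: start-stop-daemon --stop --name slurmd
        killproc slurmd
        ;;
esac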

Custom Galaxy conf templates are not preserved across system shutdowns.

In general it seems to be working well. However, I have experienced issues with retaining job_conf across cluster invocations. For example, if I place a new job_conf in /opt/cloudman/config/conftemplates/ and restart Galaxy in CloudMan, the new job_conf is loaded correctly.

However, if I terminate the cluster and then retrieve it with CloudLaunch, the /opt/cloudman/config/conftemplates/ folder is deleted and the default job_conf is loaded, which seems unusual because all my other changes have persisted.
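
As a crude stopgap, keeping a copy of the template on /mnt/galaxy (assuming the persistent galaxy volume survives the restart, as my other changes did) and restoring it after each launch works around the deletion; the backup location below is hypothetical:

# Restore the template from a copy kept on the persistent volume,
# then restart Galaxy from the CloudMan admin console
sudo mkdir -p /opt/cloudman/config/conftemplates
sudo cp /mnt/galaxy/conf-backup/job_conf.xml /opt/cloudman/config/conftemplates/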

Adding additional worker nodes causes job failure, collections don't error out

When jobs/collections are running and additional worker nodes are added to the CloudMan instance, some of the jobs/collections stop running and produce empty datasets/collections that end up in the green (ok) state. Not all collections exhibit this behavior; some collections are stopped and in the error state.

The two issues:

  • Adding workers interrupts ongoing jobs
  • Interrupted jobs are not always marked in the error state; collections contain empty data items

Behavior observed using the GVL 4.4.0 RC2 (Galaxy 18.05)

Implement EFS support

Galaxy FS and indices EBS volumes on the master node are shared via NFS to the worker nodes and can become a bottleneck for IO intensive jobs.

Amazon EFS is a network file storage service for Amazon EC2 instances that:

  • supports the NFSv4 protocol
  • scales throughput and IOPS as the file system grows
  • grows and shrinks automatically as you add and remove files

Using EFS would increase cluster performance and reduce the processing and network load on the master node.
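
For context, an EFS file system mounts like any NFSv4.1 share, so worker nodes could mount it directly rather than relaying through the master; a sketch with a placeholder file system ID and the mount options AWS documents:

sudo mkdir -p /mnt/galaxy
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/galaxy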

Specifying memory requirements for jobs doesn't have effect

I'm running a single worker node with 32GB of memory to run jobs that can require 20-25GB of memory each (slurm.conf):

# COMPUTE NODES
NodeName=master NodeAddr=127.0.0.1 CPUs=1 RealMemory=3764 Weight=10 State=UNKNOWN
NodeName=w3 NodeAddr=172.31.18.72 CPUs=16 RealMemory=30148 Weight=5 State=UNKNOWN

I'm using --mem 25000 as a destination parameter for these jobs (job_conf.xml):

        <destination id="slurm_memory" runner="slurm">
            <param id="nativeSpecification">--mem=25000</param>
        </destination>
...
<tool id="wig_to_bigWig" destination="slurm_memory"/>
galaxy.jobs.runners DEBUG 2016-03-21 21:34:33,707 (1059) command is: grep -v "^track" /mnt/galaxy/files/001/dataset_1381.dat | wigToBigWig stdin /mnt/galaxy/galaxy-app/tool-data/len/hg19.len /mnt/galaxy/files/001/dataset_1382.dat -clip 2>&1 || echo "Error running wigToBigWig." >&2; return_code=$?; python "/mnt/galaxy/tmp/job_working_directory/001/1059/set_metadata_oOH2bq.py" "/mnt/galaxy/tmp/tmpC8piC7" "/mnt/galaxy/tmp/job_working_directory/001/1059/galaxy.json" "/mnt/galaxy/tmp/job_working_directory/001/1059/metadata_in_HistoryDatasetAssociation_1436_b78YRM,/mnt/galaxy/tmp/job_working_directory/001/1059/metadata_kwds_HistoryDatasetAssociation_1436_FTWZDJ,/mnt/galaxy/tmp/job_working_directory/001/1059/metadata_out_HistoryDatasetAssociation_1436_PdF3Qh,/mnt/galaxy/tmp/job_working_directory/001/1059/metadata_results_HistoryDatasetAssociation_1436_bflH3p,/mnt/galaxy/files/001/dataset_1382.dat,/mnt/galaxy/tmp/job_working_directory/001/1059/metadata_override_HistoryDatasetAssociation_1436_1t3RXZ" 5242880; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2016-03-21 21:34:33,913 (1059) submitting file /mnt/galaxy/tmp/job_working_directory/001/1059/galaxy_1059.sh
galaxy.jobs.runners.drmaa DEBUG 2016-03-21 21:34:33,913 (1059) native specification is: --mem=25000
galaxy.jobs.runners.drmaa INFO 2016-03-21 21:34:33,917 (1059) queued as 10
galaxy.jobs DEBUG 2016-03-21 21:34:33,965 (1059) Persisting job destination (destination id: slurm_memory)
galaxy.jobs.runners.drmaa DEBUG 2016-03-21 21:34:34,134 (1059/10) state change: job is running

However, more than one job is allowed to run at a time, which causes the node to run out of memory and jobs to be killed.
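
If I understand Slurm correctly, --mem only constrains scheduling when memory is configured as a consumable resource; the generated slurm.conf above doesn't show that, which would explain why jobs co-schedule regardless of --mem. A sketch of the settings that might be needed (not verified against CloudMan's config generation):

# slurm.conf -- track CPUs and memory as consumable resources
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory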

Please add "Unique" tools to CloudMan instances

I'm running an instance on AWS (launched from CloudLaunch):

Appliance: Genomics Virtual Lab (GVL) 
Version: GVL 4.4.0 RC1 (Galaxy 18.05)
Cloud: amazon-us-east-n-virginia 

Neither the "Unique" or "Unique Lines" tools are on the instance.

I think these are useful enough to include. However, if this omission was deliberate, feel free to ignore this.

Set master to not run jobs when autoscaling is enabled

We have the following use case: run a small master node continuously (for web UI) and spin up workers on demand (with min: 0 and max: N).

Right now (using cloudman-dev bucket) when I enable autoscaling, the master is still set to run jobs. Also, even if I set the master not to run jobs, when the last idle worker terminates the master is set to run jobs again automatically:

22:58:21 - Initiated requested termination of instance. Terminating 'i-0e0226d1db2ed616b'.
22:58:21 - Initiated requested termination of instances. Terminating '1' instances.
22:58:21 - Terminating instance i-0e0226d1db2ed616b
22:58:24 - Instance 'i-0e0226d1db2ed616b' removed from the internal instance list.
22:58:24 - The master instance is set to execute jobs. To manually change this, use the CloudMan Admin panel.

It would be great to prevent the master from running jobs by default when autoscaling is enabled.

Only one worker added after requesting to add two

Using the main cloudman bucket. I requested two c3.8xlarge workers via the "Add worker nodes" dialog box, but only one was added.

CM log:

2016-03-15 15:21:07,625 DEBUG         master:1263 Adding 2 c3.8xlarge instance(s)
2016-03-15 15:21:07,626 DEBUG         master:916  Toggling master instance as exec host
2016-03-15 15:21:10,170 DEBUG           misc:840  '/usr/bin/scontrol update NodeName=master Reason="CloudMan-disabled" State=DRAIN' command OK
2016-03-15 15:21:10,170 INFO          master:931  The master instance is set to *not* execute jobs. To manually change this, use the CloudMan Admin panel.
2016-03-15 15:21:10,170 INFO             ec2:431  Adding 2 on-demand instance(s)
2016-03-15 15:21:10,188 DEBUG            ec2:176  Fetched security group ids for the first time: ['sg-8b3b99f3']
2016-03-15 15:21:10,189 DEBUG            ec2:463  Starting instance(s) in VPC with the following command : ec2_conn.run_instances( image_id='ami-d5246abf', min_count='1', max_count='2', key_name='cloudman_key_pair', security_group_ids=['sg-8b3b99f3'], user_data(with sensitive info filtered out)=[static_images_dir: static/images
cluster_templates: [{'filesystem_templates': [{'archive_url': 'http://s3.amazonaws.com/cloudman/fs-archives/galaxyFS-20151202.tar.gz', 'type': u'volume', 'name': 'galaxy', 'roles': 'galaxyTools,galaxyData', 'size': u'10'}, {'snap_id': 'snap-c332f2b0', 'name': 'galaxyIndices', 'roles': 'galaxyIndices'}], 'name': 'Galaxy'}, {'filesystem_templates': [{'name': 'galaxy'}], 'name': 'Data'}]
master_hostname_alt: ip-172-31-26-166
storage_type: volume
iops:
is_secure: True
cluster_storage_type: volume
s3_port: None
log_level: DEBUG
static_cache_time: 360
master_public_ip: 54.165.165.225
cluster_type: Galaxy
initial_cluster_type: Galaxy
static_scripts_dir: static/scripts
debug: true
master_ip: 172.31.26.166
cluster_name: galaxy-dev
machine_image_id: ami-d5246abf
role: worker
bucket_cluster: cm-c6d42a39947226f4727ed6a9c1c1d1fc
boot_script_path: /opt/cloudman/boot
master_hostname: ip-172-31-26-166.ec2.internal
ec2_conn_path: /
region_name: us-east-1
region_endpoint: ec2.amazonaws.com
ec2_port: None
static_favicon_dir: static/favicon.ico
deployment_version: 2
storage_size: 10
use_translogger: False
boot_script_name: cm_boot.py
services: [{'name': 'Supervisor', 'roles': ['Supervisor']}, {'name': 'NodeJSProxy', 'roles': ['NodeJSProxy']}, {'name': 'ProFTPd', 'roles': ['ProFTPd']}, {'name': 'GalaxyReports', 'roles': ['GalaxyReports']}, {'name': 'Slurmd', 'roles': ['Slurmd']}, {'name': 'Nginx', 'roles': ['Nginx']}, {'name': 'PSS', 'roles': ['PSS']}, {'name': 'Slurmctld', 'roles': ['Slurmctld', 'Job manager']}, {'home': '/mnt/galaxy/galaxy-app', 'name': 'Galaxy', 'roles': ['Galaxy']}, {'name': 'Postgres', 'roles': ['Postgres']}]
cloud_type: ec2
custom_image_id:
cloudman_file_name: cm.tar.gz
access_key: <redacted>
global_conf: {'__file__': '/mnt/cm/cm_wsgi.ini', 'here': '/mnt/cm'}
filesystems: [{'kind': 'volume', 'mount_point': '/mnt/galaxy', 'name': 'galaxy', 'roles': ['galaxyTools', 'galaxyData'], 'ids': [u'vol-dabaeb79']}, {'kind': 'snapshot', 'mount_point': '/mnt/galaxyIndices', 'name': 'galaxyIndices', 'roles': ['galaxyIndices'], 'ids': ['snap-0457ec13']}]
placement: us-east-1a
template_path: templates
cloud_name: Amazon - Virginia
static_dir: static
persistent_data_version: 3
cloudman_home: /mnt/cm
static_style_dir: static/style
bucket_default: cloudman
custom_instance_type:
s3_host: s3.amazonaws.com
use_lint: false
s3_conn_path: /
static_enabled: True
worker_initial_count: ], instance_type='c3.8xlarge', placement='us-east-1a', subnet_id='subnet-358d556d')
2016-03-15 15:21:13,925 DEBUG            ec2:389  Adding tag 'clusterName:galaxy-dev' to resource 'i-050c49e4523d1b746'
2016-03-15 15:21:14,058 DEBUG            ec2:389  Adding tag 'role:worker' to resource 'i-050c49e4523d1b746'
2016-03-15 15:21:14,181 DEBUG            ec2:389  Adding tag 'Name:Worker: galaxy-dev' to resource 'i-050c49e4523d1b746'
2016-03-15 15:21:14,328 DEBUG            ec2:510  Adding Instance Instance:i-050c49e4523d1b746
2016-03-15 15:21:14,328 DEBUG            ec2:524  Started 2 instance(s)
2016-03-15 15:21:14,328 DEBUG            ec2:526  Setting boto's logger to INFO mode
2016-03-15 15:21:17,698 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:21:17,956 DEBUG       instance:457  Got public IP for instance i-050c49e4523d1b746: 54.175.86.1
2016-03-15 15:21:17,956 DEBUG         master:2788 Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' has been quiet for a while (last check 3 secs ago); will wait a bit longer before a check...
2016-03-15 15:21:33,562 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:21:33,562 DEBUG         master:2788 Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' has been quiet for a while (last check 19 secs ago); will wait a bit longer before a check...
2016-03-15 15:21:49,073 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:21:49,073 DEBUG         master:2783 Have not heard from or checked on instance 'i-050c49e4523d1b746; 54.175.86.1; w1' for a while; checking now.
2016-03-15 15:21:49,185 DEBUG       instance:339  Requested instance 'i-050c49e4523d1b746; 54.175.86.1; w1' update: old state: running; new state: running
2016-03-15 15:21:49,185 DEBUG       instance:115  'Maintaining' instance 'i-050c49e4523d1b746; 54.175.86.1; w1' in 'running' state (last comm before 15:21:49 | last m_state change before 0:00:34 | time_rebooted before 15:21:49
2016-03-15 15:22:05,230 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:22:05,230 DEBUG         master:2788 Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' has been quiet for a while (last check 16 secs ago); will wait a bit longer before a check...
2016-03-15 15:22:15,783 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:22:15,783 DEBUG         master:2788 Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' has been quiet for a while (last check 26 secs ago); will wait a bit longer before a check...
2016-03-15 15:22:31,230 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:22:31,230 DEBUG         master:2783 Have not heard from or checked on instance 'i-050c49e4523d1b746; 54.175.86.1; w1' for a while; checking now.
2016-03-15 15:22:31,299 DEBUG       instance:339  Requested instance 'i-050c49e4523d1b746; 54.175.86.1; w1' update: old state: running; new state: running
2016-03-15 15:22:31,299 DEBUG       instance:115  'Maintaining' instance 'i-050c49e4523d1b746; 54.175.86.1; w1' in 'running' state (last comm before 15:22:31 | last m_state change before 0:01:16 | time_rebooted before 15:22:31
2016-03-15 15:22:46,805 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:22:46,805 DEBUG         master:2788 Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' has been quiet for a while (last check 15 secs ago); will wait a bit longer before a check...
2016-03-15 15:23:02,447 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:23:02,448 DEBUG         master:2783 Have not heard from or checked on instance 'i-050c49e4523d1b746; 54.175.86.1; w1' for a while; checking now.
2016-03-15 15:23:02,514 DEBUG       instance:339  Requested instance 'i-050c49e4523d1b746; 54.175.86.1; w1' update: old state: running; new state: running
2016-03-15 15:23:02,514 DEBUG       instance:115  'Maintaining' instance 'i-050c49e4523d1b746; 54.175.86.1; w1' in 'running' state (last comm before 15:23:02 | last m_state change before 0:01:48 | time_rebooted before 15:23:02
2016-03-15 15:23:17,978 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:23:17,978 DEBUG         master:2788 Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' has been quiet for a while (last check 15 secs ago); will wait a bit longer before a check...
2016-03-15 15:23:33,487 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:23:33,487 DEBUG         master:2783 Have not heard from or checked on instance 'i-050c49e4523d1b746; 54.175.86.1; w1' for a while; checking now.
2016-03-15 15:23:33,559 DEBUG       instance:339  Requested instance 'i-050c49e4523d1b746; 54.175.86.1; w1' update: old state: running; new state: running
2016-03-15 15:23:33,559 DEBUG       instance:115  'Maintaining' instance 'i-050c49e4523d1b746; 54.175.86.1; w1' in 'running' state (last comm before 15:23:33 | last m_state change before 0:02:19 | time_rebooted before 15:23:33
2016-03-15 15:23:49,187 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:23:49,188 DEBUG         master:2788 Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' has been quiet for a while (last check 15 secs ago); will wait a bit longer before a check...
2016-03-15 15:23:59,189 INFO        instance:534  Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' reported alive
2016-03-15 15:23:59,190 DEBUG       instance:555  INSTANCE_ALIVE private_ip: 172.31.26.72 public_ip: 54.175.86.1 zone: us-east-1a type: c3.8xlarge AMI: ami-d5246abf local_hostname: ip-172-31-26-72.ec2.internal, CPUs: 32, hostname: ip-172-31-26-72
2016-03-15 15:23:59,254 DEBUG           misc:840  'cp /etc/hosts /etc/hosts.orig' command OK
2016-03-15 15:23:59,257 DEBUG           misc:840  'cp /tmp/tmp1XafFZ /etc/hosts' command OK
2016-03-15 15:23:59,261 DEBUG           misc:840  'chmod 644 /etc/hosts' command OK
2016-03-15 15:23:59,261 DEBUG           misc:1081 Added the following line to /etc/hosts: 172.31.26.72 w1 ip-172-31-26-72.ec2.internal ip-172-31-26-72

2016-03-15 15:24:04,822 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:24:09,823 DEBUG       instance:564  Got MOUNT_DONE message
2016-03-15 15:24:09,824 DEBUG       instance:574  Got transient_nfs state on w1: 1
2016-03-15 15:24:09,826 DEBUG         master:2313 Instructing all workers to sync /etc/hosts w/ master
2016-03-15 15:24:09,827 DEBUG       instance:501  Sent master public key to worker instance 'i-050c49e4523d1b746'.
2016-03-15 15:24:09,827 DEBUG       instance:502    MT: Message MASTER_PUBKEY ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCY6gKhVVz9qJ2yKuz+vwNyPlDMvW2jcD4Iolxfi/TruRmk1MLzGdp+bCbEGFoseY8NBy1rfwH7sY0eWmXcp3fM+V2+fMw1fMg3ydz87mtbaEEH7eUE4jtxdAvw9ktg8mRml5ApKGLypi+95SaUEM2sEkkE6zkF9mmhc7IG2+xvrX8XmAXCAcyY4YToLqha7XITm1oHlFYWIPSNW5VZnmQZ1bvQ87RBH6Zyxyrx9FY7hnsW21J4HahzKhQZwbguPMefvrnNBwY3q4C/fvqjltLt37ZEUIp+5HdR9oWq80Ws9w7xDYWN5LfHx8jqn/cvNpgrq8dDn1LZ5ldYMVDHQw6l root@ip-172-31-26-166
 sent to 'i-050c49e4523d1b746'
2016-03-15 15:24:19,831 DEBUG       instance:603  Got WORKER_H_CERT message
2016-03-15 15:24:19,831 DEBUG         master:2284 Saving host certificate 'ip-172-31-26-72.ec2.internal ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDaSRt8IcxRJ9WtvQaQqS6ui9CJkf9SDmbWyDnlqID9hqzfKBVFlM6Ss4DQfcWyL9Hkc5IHtHXugqDZyzsXfL5JA0FlcgPqhUdo08dQ85jcDXUlMio5uPnbk53V3A85Mlu5Fyh32naiDwejtykwwmz4ACXkgF8ZxfpqNxWI/CfwuX4F2dxHdhl/v1iIHGXQBXFh1eMsAFkVskJb2z40Ev9GeACbRTfO0l+93hJmQnuegB799bt0d87NxRo6fPiV+zqiDCTzoemT/DiGEOcEJWO3IRWM8SUoLlzOD4LompDxh8d4g/7/VcRtf0tdn8evxWYs6CCmuv71a6ALyMfnCzlV
 '
2016-03-15 15:24:19,831 DEBUG         master:2285 Saving worker host certificate.
2016-03-15 15:24:19,831 DEBUG       instance:607  Worker 'i-050c49e4523d1b746' host certificate received and appended to /root/.ssh/known_hosts
2016-03-15 15:24:19,831 DEBUG      slurmctld:194  Adding node w1 into Slurm cluster
2016-03-15 15:24:19,831 DEBUG      slurmctld:181  Reconfiguring Slurm cluster
2016-03-15 15:24:19,832 DEBUG      slurmctld:148  Setting up /mnt/transient_nfs/slurm/slurm.conf (attempt 0/5)
2016-03-15 15:24:19,832 DEBUG      slurmctld:124  Setting slurm.conf parameters
2016-03-15 15:24:19,832 DEBUG           misc:983  Checking existence of directory '/tmp/slurm'
2016-03-15 15:24:19,832 DEBUG           misc:995  Directory '/tmp/slurm' exists.
2016-03-15 15:24:19,832 DEBUG      slurmctld:120  Worker node names to include in slurm.conf: w1
2016-03-15 15:24:19,839 DEBUG      slurmctld:152  Created slurm.conf as /mnt/transient_nfs/slurm/slurm.conf
2016-03-15 15:24:19,855 DEBUG           misc:840  '/usr/bin/scontrol reconfigure' command OK
2016-03-15 15:24:19,855 DEBUG       instance:506    MT: Sending START_SLURMD message to instance 'i-050c49e4523d1b746; 54.175.86.1; w1', named w1
2016-03-15 15:24:19,856 WARNING     instance:618  Could not get a handle on job manager service to add node 'i-050c49e4523d1b746; 54.175.86.1; w1'
2016-03-15 15:24:19,856 INFO        instance:625  Waiting on worker instance 'i-050c49e4523d1b746; 54.175.86.1; w1' to configure itself.
2016-03-15 15:24:20,454 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:24:35,457 INFO        instance:628  Instance 'i-050c49e4523d1b746; 54.175.86.1; w1' ready
2016-03-15 15:24:35,457 DEBUG            ec2:389  Adding tag 'clusterName:galaxy-dev' to resource 'i-050c49e4523d1b746'
2016-03-15 15:24:35,539 DEBUG            ec2:389  Adding tag 'role:worker' to resource 'i-050c49e4523d1b746'
2016-03-15 15:24:35,618 DEBUG            ec2:389  Adding tag 'alias:w1' to resource 'i-050c49e4523d1b746'
2016-03-15 15:24:35,746 DEBUG            ec2:389  Adding tag 'Name:Worker: galaxy-dev' to resource 'i-050c49e4523d1b746'
2016-03-15 15:24:36,342 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
2016-03-15 15:24:51,888 DEBUG         master:2761 S&S: AS..Unstarted; ClouderaManager..Unstarted; Cloudgene..Unstarted; Galaxy..OK; GalaxyReports..OK; Migration..Completed; Nginx..OK; NodeJSProxy..OK; PSS..Completed; Postgres..OK; ProFTPd..OK; Pulsar..Unstarted; Slurmctld..OK; Slurmd..OK; Supervisor..OK; galaxy FS..OK; galaxyIndices FS..OK; transient_nfs FS..OK;
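
For reference, the worker-join sequence in the log above comes down to: sync /etc/hosts, exchange SSH host keys, rewrite the shared slurm.conf to include the new node, and run scontrol reconfigure. Below is a minimal sketch of that last step, assuming standard Slurm NodeName syntax — the exact template CloudMan renders may differ; the helper name is hypothetical, and the node alias, CPU count, and paths are taken from the log:

import subprocess

SLURM_CONF = "/mnt/transient_nfs/slurm/slurm.conf"  # shared over NFS, per the log

def add_worker(alias="w1", cpus=32):
    """Rewrite slurm.conf to include the new worker, then reload slurmctld.

    Mirrors the slurmctld service messages above ("Setting up
    /mnt/transient_nfs/slurm/slurm.conf" followed by
    "/usr/bin/scontrol reconfigure"). Illustrative only, not CloudMan code.
    """
    with open(SLURM_CONF) as fh:
        lines = fh.read().splitlines()
    node_line = "NodeName={} CPUs={} State=UNKNOWN".format(alias, cpus)
    if node_line not in lines:
        lines.append(node_line)
        with open(SLURM_CONF, "w") as fh:
            fh.write("\n".join(lines) + "\n")
    # Ask the running slurmctld to re-read its config; no restart needed.
    subprocess.check_call(["/usr/bin/scontrol", "reconfigure"])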
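The tagging step (the four "Adding tag" lines) is plain EC2 resource tagging. Something like the boto 2 call below reproduces it from a local machine, e.g. to verify or re-apply a worker's tags; boto 2 matches the vintage of this log, and the tag keys, values, region, and instance id are copied from it:

import boto.ec2

# Region and instance id as reported in the log above.
conn = boto.ec2.connect_to_region("us-east-1")
conn.create_tags(
    ["i-050c49e4523d1b746"],
    {
        "clusterName": "galaxy-dev",
        "role": "worker",
        "alias": "w1",
        "Name": "Worker: galaxy-dev",
    },
)

The log shows CloudMan applying these one tag at a time; create_tags batches them into a single API call, which is equivalent.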
