ehos-python's Issues

library version

Note down which versions of software/libraries were used for the development (a small sketch for capturing them follows the list below).

An incomplete list:
htcondor
openstacksdk
cloud-init
pip
python
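
A minimal sketch of how the versions could be captured at runtime; the package names are the ones listed above, the output format is only a suggestion, and cloud-init is typically an OS package rather than a pip install, so it is left out here:

import platform
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

print(f"python {platform.python_version()}")
for name in ("htcondor", "openstacksdk", "pip"):
    try:
        print(f"{name} {version(name)}")
    except PackageNotFoundError:
        print(f"{name} not installed")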

config in db

Store the daemon config in a database instead of a config file. This makes it possible to tweak the ehos daemon through a UI.
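
A rough sketch, assuming a simple key/value table in SQLite; the table layout and helper names are made up for illustration, not an existing ehos schema:

import sqlite3

def init_config_db(path="ehos.db"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS config (key TEXT PRIMARY KEY, value TEXT)")
    return db

def set_config(db, key, value):
    # a UI could call this to tweak the running daemon
    db.execute("INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)", (key, value))
    db.commit()

def get_config(db, key, default=None):
    row = db.execute("SELECT value FROM config WHERE key = ?", (key,)).fetchone()
    return row[0] if row else default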

Update documentation.

Describe the deployment for a production environment.
Add a new image of where EHOS lives in a stack.

Master node setup script

A dedicated script that configures HTCondor on the master node, instead of having it in the daemon script.
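
A possible shape for such a script, assuming the master only needs a small HTCondor config dropped into config.d followed by a reconfig; the file path and the settings shown are assumptions, not the current daemon behaviour:

import subprocess

MASTER_CONFIG = """
CONDOR_HOST = $(FULL_HOSTNAME)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD
"""

def configure_master(config_file="/etc/condor/config.d/99-ehos-master.conf"):
    # write the master-specific settings and tell condor to re-read its config
    with open(config_file, "w") as fh:
        fh.write(MASTER_CONFIG)
    subprocess.run(["condor_reconfig"], check=True)

if __name__ == "__main__":
    configure_master()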

Add influx db logging

The current dev version only logs node and job counts, plus a single event when the system is started. More metrics might be added in a generic module that can be called from suitable places in the code.
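
A sketch of what such a generic module could look like, assuming the InfluxDB 1.x python client and a database that already exists; names and measurements are illustrative only:

from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="ehos")

def log_counts(nodes_idle, nodes_busy, jobs_queued, jobs_running):
    # one point per polling cycle, called from the main daemon loop
    client.write_points([{
        "measurement": "ehos_status",
        "fields": {
            "nodes_idle": nodes_idle,
            "nodes_busy": nodes_busy,
            "jobs_queued": jobs_queued,
            "jobs_running": jobs_running,
        },
    }])

def log_event(text):
    # single events, e.g. "system started"
    client.write_points([{"measurement": "ehos_event", "fields": {"text": text}}])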

build images for all clouds

Currently the deployment script only builds the image for one cloud; this needs to be done for all of them.

This depends on how to proceed with the staging scenario going forward.
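
A sketch of looping over every cloud defined in clouds.yaml instead of just one; build_image() stands in for whatever the deployment script does today and is purely hypothetical, as are the cloud names:

import openstack

def build_image(conn):
    # boot a base VM, run the cloud-init setup, snapshot it, clean up
    ...

for cloud_name in ("cloud_a", "cloud_b"):
    conn = openstack.connect(cloud=cloud_name)
    build_image(conn)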

async creation of execute nodes

When creating nodes, let the creation process run in the background, but don't create additional resources until the pending ones are up and running.
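
A minimal sketch with openstacksdk: fire off the create call without waiting, remember the pending servers, and only treat a node as available (and allow further creations) once it reaches ACTIVE. Image, flavor and network names are placeholders:

import openstack

conn = openstack.connect(cloud="mycloud")

def launch_node(name):
    # returns immediately; the build continues in the background
    return conn.compute.create_server(
        name=name,
        image_id=conn.compute.find_image("ehos-image").id,
        flavor_id=conn.compute.find_flavor("m1.medium").id,
        networks=[{"uuid": conn.network.find_network("private").id}],
    )

def is_booted(server):
    return conn.compute.get_server(server.id).status == "ACTIVE"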

bigger scratch disk for htcondor to use

The default disk space for the flavors provided on OpenStack is too small for real data analysis.

When creating the VM one needs to attach a volume that can be used as scratch. I believe it might be possible to attach the same volume to a collection of nodes, but this needs to be tested before I can be sure.

The other problem is setting the scratch area in HTCondor; last time I investigated this, the solution I came up with was:

Mounted /var/lib/condor/execute onto an external volume in the end, as I could not find the correct setting to point it somewhere else.
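
A sketch of creating and attaching a per-node scratch volume with openstacksdk; whether one volume can be shared between several nodes still needs the testing mentioned above. On the HTCondor side, the EXECUTE setting may be the knob for pointing the scratch area elsewhere, but mounting the volume on /var/lib/condor/execute as described works around that either way:

import openstack

conn = openstack.connect(cloud="mycloud")

def add_scratch(server, size_gb=100):
    # one volume per node; mount it on /var/lib/condor/execute via cloud-init
    volume = conn.create_volume(size=size_gb, name=f"{server.name}-scratch", wait=True)
    conn.attach_volume(server, volume)
    return volume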

create nodes of different size

Allocate nodes depending on the queued jobs' requirements, e.g. memory and cores.

This will require EHOS to pull information about the flavors available in an OpenStack instance and try to use them in an optimal fashion.

It should be possible to set some minimum requirements in the configuration file so we don't end up trying to spin up servers that don't meet them. (This has been a "feature" before, as it catches openstacksdk out.)
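
A sketch of picking the smallest flavor that satisfies a queued job's requirements, with the minimum limits from the configuration idea above; the numbers are examples only:

import openstack

conn = openstack.connect(cloud="mycloud")

MIN_VCPUS = 1
MIN_RAM_MB = 2048

def pick_flavor(req_cpus, req_ram_mb):
    candidates = [
        f for f in conn.compute.flavors()
        if f.vcpus >= max(req_cpus, MIN_VCPUS) and f.ram >= max(req_ram_mb, MIN_RAM_MB)
    ]
    # smallest flavor that still fits the job, or None if nothing matches
    return min(candidates, key=lambda f: (f.vcpus, f.ram), default=None)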

Place execute somewhere safe

This file is currently put in /tmp, so it gets deleted if the server runs for a long time; it also makes debugging harder, as the filename continuously changes.

Run master outside openstack

Check if it is possible to run the master outside of OpenStack.

This will furthermore be useful when testing data/binary transfer between the master node and the slaves.

restart master node

The initial implementation assumes the master doesn't need to be restarted. It would be nice to be able to restart a master node and have it pick up the "knowledge" about nodes and states again.

Several solutions to this problem:

  • Dynamically pull node information from the condor master (see the sketch after this list)
  • Continually store the state information in the master configuration file
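
A sketch of the first option, rebuilding the node list from the condor collector after a restart using the htcondor python bindings:

import htcondor

def current_nodes():
    coll = htcondor.Collector()
    ads = coll.query(htcondor.AdTypes.Startd, "true", ["Machine", "State", "Activity"])
    return {ad["Machine"]: (ad["State"], ad["Activity"]) for ad in ads}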

Cleanup script

A script that reverts the OpenStack setup to the state it was in before the quick_test (a sketch follows the list).
Delete:
firewall settings
ssh keys
images
VMs
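
A sketch of the cleanup with openstacksdk, assuming the quick_test resources can be recognised by a common name prefix (the prefix is an assumption, not something quick_test currently guarantees):

import openstack

conn = openstack.connect(cloud="mycloud")
PREFIX = "ehos-test"

for server in conn.list_servers():
    if server.name.startswith(PREFIX):
        conn.delete_server(server.id, wait=True)

for image in conn.list_images():
    if image.name.startswith(PREFIX):
        conn.delete_image(image.id)

for key in conn.list_keypairs():
    if key.name.startswith(PREFIX):
        conn.delete_keypair(key.name)

for group in conn.list_security_groups():
    if group.name.startswith(PREFIX):
        conn.delete_security_group(group.id)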

running ehos across multiple openstack regions

Is it possible to run EHOS across multiple OpenStack instances, or across multiple regions within an OpenStack instance?

For this the config file needs to be tweaked slightly to contain multiple regions.

Multiple openstack connectors & base images for each site

Random creation between regions as a first test?
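
A sketch of keeping one connector per site/region and choosing between them at node creation time; the cloud and region names are examples of what the tweaked config file could contain:

import random
import openstack

connections = {
    "site_a": openstack.connect(cloud="site_a"),
    "site_b": openstack.connect(cloud="site_b", region_name="RegionTwo"),
}

def pick_connection():
    # random placement between regions, as a first test
    name = random.choice(list(connections))
    return name, connections[name]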

dynamic config nodes

Currently all nodes are configured as:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = false

It would be better to have this reflect the available number of CPUs etc.

Not sure whether this is to be done by EHOS, or whether HTCondor can do it itself.
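
HTCondor may already cover this with a partitionable slot: the startd advertises the node's actual cpus and memory and carves off dynamic slots matching each job's request, so the config does not need to know the flavor size. A possible (untested here) alternative to the static block above:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true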

Investigate HTCondor + data transfer penalties

Investigate the following questions:

  1. Is it possible to submit a job from an external computer and share data with the master + slave nodes?

  2. What are the data-sharing mechanisms used in HTCondor? (See the sketch after this list.)

  3. How are binary dependencies resolved (if they are)? E.g. try running a dynamically linked binary on a client. If a binary differs between master and slave, which one is run?

  4. How long does it take to transfer a large dataset between nodes?
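
For question 2, HTCondor's built-in file transfer is the obvious mechanism to test first; a sketch of a submit via the python bindings (older transaction-style API), with the executable and file names purely as placeholders:

import htcondor

sub = htcondor.Submit({
    "executable": "analyse.sh",
    "transfer_input_files": "input.dat",
    "should_transfer_files": "YES",
    "when_to_transfer_output": "ON_EXIT",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()
with schedd.transaction() as txn:
    sub.queue(txn)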

cleanup

Get rid of old code, e.g. the Flask frontend and the binary test for HTCondor.
