elixir-no-nels / ehos-python

Python flavour of EHOS

License: MIT License

Python 99.96% HTML 0.04%

htcondor openstack dynamic-scaling cluster-computing cloud-computing-clusters cloud-computing

ehos-python's Issues

confusing naming use of state and status

The OS and condor/job statuses and states are easy to confuse. They should be more distinct, e.g. with a prefix.
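
One way the prefix idea could look, as a minimal sketch: each kind of state gets its own enum, and every member carries a prefix naming what it describes. The class and member names here are illustrative assumptions, not the names used in ehos.

```python
from enum import Enum


class VmState(Enum):
    """State of a VM as seen by the cloud (illustrative names)."""
    VM_BUILDING = "vm_building"
    VM_ACTIVE = "vm_active"
    VM_DELETED = "vm_deleted"


class NodeStatus(Enum):
    """Status of an execute node as seen by condor (illustrative names)."""
    NODE_IDLE = "node_idle"
    NODE_BUSY = "node_busy"


class JobStatus(Enum):
    """Status of a job in the condor queue (illustrative names)."""
    JOB_IDLE = "job_idle"
    JOB_RUNNING = "job_running"


def summarise(vm, node):
    """Combine a VM state and a node status into one unambiguous label."""
    return f"{vm.value}/{node.value}"
```

With separate types, a VM state can no longer be compared to a job status by accident, and log lines show at a glance which system a value came from.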

library version

Note down which versions of the software and libraries were used during development.

An incomplete list:
htcondor
openstacksdk
cloud-init
pip
python

config in db

Store the daemon config in a database instead of a config file. This makes it possible to tweak the ehos daemon from a UI.

Yaml config file validation

Check that the config files are in the correct format.

Some quick google hits:

https://stuartgunter.wordpress.com/2011/05/25/yaml-and-the-importance-of-schema-validation/

http://tekiwiki.blogspot.com/2017/03/normal-0-false-false-false-en-us-x-none_15.html
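
A minimal sketch of such a check, assuming the YAML file has already been parsed into nested dictionaries (for example with PyYAML's `yaml.safe_load`). The schema layout and the key names `daemon`, `nodes_max` and `sleep` are hypothetical, not taken from the ehos config format.

```python
def validate_config(config, schema, path=""):
    """Return a list of human-readable problems; an empty list means valid.

    `schema` maps each required key to either a type or a nested schema dict.
    """
    errors = []
    for key, expected in schema.items():
        full_key = f"{path}{key}"
        if key not in config:
            errors.append(f"missing key: {full_key}")
        elif isinstance(expected, dict):
            # Nested section: recurse with a dotted prefix for error messages.
            if isinstance(config[key], dict):
                errors.extend(validate_config(config[key], expected, full_key + "."))
            else:
                errors.append(f"{full_key}: expected a mapping")
        elif not isinstance(config[key], expected):
            got = type(config[key]).__name__
            errors.append(f"{full_key}: expected {expected.__name__}, got {got}")
    return errors


# Hypothetical schema, for illustration only.
SCHEMA = {"daemon": {"nodes_max": int, "sleep": int}}
```

The daemon could run this at start-up and refuse to start if the returned list is not empty. A dedicated schema library would give richer checks; this version only covers required keys and basic types.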

Update documentation.

Describe the deployment for a production environment.
Add a new image showing where ehos lives in a stack

Master node setup script

A dedicated script that configures condor on the master node instead of having it in the daemon script.

Write to a logfile instead of stdout

start as a service

Make ehos run as a systemd service (managed with systemctl) instead of from the command line.

Add influx db logging

The current dev version only logs node and job counts, plus a single event when the system is started. More could be added in a generic module that can be called from suitable places in the code.

Re-add the option for using a database backend for tracking of vm creation etc.

build images for all clouds

Currently the deployment script only builds the image for one cloud; this needs to be done on all of them.

This depends on how we proceed with the staging scenario going forward.

async creation of execute nodes

When creating nodes, let the creation process run in the background, but don't create additional resources until the pending ones are up and running.
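
A minimal sketch of this behaviour using a thread pool from the standard library. `create_node` is a hypothetical stand-in for the blocking cloud call, and the class name is an assumption; none of this is taken from the ehos code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def create_node(name):
    """Hypothetical stand-in for the blocking call that boots a VM."""
    time.sleep(0.1)
    return name


class BackgroundCreator:
    """Create nodes in the background, one batch at a time."""

    def __init__(self, create_func=create_node, max_workers=4):
        self._create = create_func
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = []

    def pending(self):
        """Forget finished creations and return how many are still running."""
        self._pending = [f for f in self._pending if not f.done()]
        return len(self._pending)

    def request(self, names):
        """Start a new batch unless an earlier batch is still booting.

        Returns True if the batch was started, False if it was refused.
        """
        if self.pending() > 0:
            return False
        self._pending = [self._pool.submit(self._create, n) for n in names]
        return True
```

The daemon loop would call `request()` on every pass and simply carry on when it returns False, so the main loop never blocks on a slow boot. Failed creations count as finished here; a real version would need to inspect the futures for exceptions.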

bigger scratch disk for htcondor to use

The default disk space for the flavors provided on openstack is too small for real data analysis.

When creating the vm one needs to attach a volume that can be used as scratch. I believe it might be possible to attach the same volume to a collection of nodes, but this needs to be tested before I can be sure.

The other problem is setting the scratch area in htcondor. The last time I investigated this, the solution I came up with was:

Mount /var/lib/condor/execute onto an external volume, as I could not find the correct setting to point it somewhere else.

create nodes of different size

Allocate nodes depending on the requirements of the queued jobs, e.g. memory and cores.

This will require EHOS to pull information about the flavors available in an openstack and try to use them in an optimal fashion.

It is probably possible to set some minimum requirements in the configuration file so we don't end up trying to spin up servers that do not meet them. (This has been a "feature" before, as it catches the openstacksdk out.)
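
The selection step could be sketched as follows: pick the smallest flavor that satisfies both the job and the configured minimum. The `Flavor` class, the function name and the flavor names in the usage note are hypothetical; in practice the flavor list would come from the OpenStack API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    """Minimal description of a cloud flavor (illustrative, not the SDK type)."""
    name: str
    cpus: int
    ram_mb: int


def pick_flavor(flavors, need_cpus, need_ram_mb, min_cpus=1, min_ram_mb=0):
    """Return the smallest flavor meeting the job and the configured minimum.

    Returns None when no flavor is large enough.
    """
    # The configured minimum acts as a floor on what the job asks for.
    need_cpus = max(need_cpus, min_cpus)
    need_ram_mb = max(need_ram_mb, min_ram_mb)
    candidates = [
        f for f in flavors if f.cpus >= need_cpus and f.ram_mb >= need_ram_mb
    ]
    if not candidates:
        return None
    # "Smallest" is defined here as fewest CPUs, then least memory.
    return min(candidates, key=lambda f: (f.cpus, f.ram_mb))
```

For example, with flavors of 1 CPU/2 GB, 2 CPU/4 GB and 8 CPU/32 GB, a job asking for 2 cores and 3000 MB would get the 2 CPU/4 GB flavor. Ordering by CPUs first is one possible choice; ordering by price or by quota usage would be just as reasonable.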

change logging

Replace the home-made logging with the Python logging module
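
A minimal sketch of what this could look like with the standard library, writing timestamped lines to a file. The logger name `ehos`, the function name and the format string are assumptions.

```python
import logging


def make_logger(logfile, name="ehos", level=logging.INFO):
    """Return a logger that appends timestamped lines to `logfile`."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Only attach a handler once, so repeated calls do not duplicate lines.
    if not logger.handlers:
        handler = logging.FileHandler(logfile)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `make_logger("/var/log/ehos.log").info("daemon started")` would append one line with a timestamp, the level and the message; the path is only an example.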

Place execute somewhere safe

The execute file is currently put in /tmp, so it gets deleted if the server runs for a long time. It also makes debugging harder, as the filename continuously changes.

Make execute init script optional

In an ehos setup where the VMs are configured/handled by a CM, an init file is not required.

Make the execute file optional

Run master outside openstack

Check if it is possible to run the master outside OpenStack.

This will also be useful when testing data/binary transfer between the master node and the slaves.

Timestamps in logs

restart master node

The initial implementation assumes the master doesn't need to be restarted. It would be nice to be able to restart a master node and have it pick up the "knowledge" about nodes and states.

Several solutions to this problem:

Dynamically pull node information from the condor master
Continually store the state information in the master configuration file
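
The second option could be sketched like this. Note the assumption: the state is written to a separate JSON file rather than into the master configuration file itself, and the function names and node table layout are hypothetical.

```python
import json
import os
import tempfile


def save_nodes(nodes, path):
    """Write the node table atomically so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(nodes, fh, indent=2)
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, path)


def load_nodes(path):
    """Return the saved node table, or an empty one on first start."""
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)
```

The daemon would call `save_nodes()` whenever a node changes state and `load_nodes()` once at start-up. The saved table can be stale after a long outage, so it would still make sense to cross-check it against what condor reports.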

Multiple openstack connectors & base images for each site

Random? creations between regions as a test?

dynamic config nodes

Currently all nodes are configured as:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = false

It would be better to have this reflect the available number of CPUs etc.

It is not clear whether ehos should do this or whether htcondor can do it itself.
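
If ehos were to do it, one possible approach is to render the slot settings from the CPU count of the node. This is a sketch under assumptions: it uses the same configuration knobs as above with one static single-core slot per CPU, the function name is hypothetical, and the output has not been tested against a real pool.

```python
import os


def render_slot_config(cpus=None):
    """Return condor slot settings with one single-core slot per CPU.

    `cpus` defaults to the CPU count of the machine running this code.
    """
    cpus = cpus or os.cpu_count() or 1
    lines = [
        f"NUM_SLOTS = {cpus}",
        f"NUM_SLOTS_TYPE_1 = {cpus}",
        "SLOT_TYPE_1 = cpus=1",
        "SLOT_TYPE_1_PARTITIONABLE = false",
    ]
    return "\n".join(lines) + "\n"
```

The text could be written into the node's condor configuration when the node is initialised. An alternative worth checking is keeping a single slot and marking it partitionable, which would leave the splitting to htcondor.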

Investigate HTCondor + data transfer penalties

Investigate the following questions:

Is it possible to submit a job from an external computer and share data with the master and slave nodes?
What data sharing mechanisms are used in HTCondor?
How are binary dependencies resolved (if they are)? E.g. try to run a dynamically linked binary on a client. If a binary differs between master and slave, which one is run?
How long does it take to transfer a large dataset between nodes?
