elixir-no-nels / ehos-python
Python flavour of EHOS
License: MIT License
The OS and condor/job statuses and states are easy to confuse and should be more distinct, e.g. by giving each group its own prefix.
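One way the prefix idea could look is sketched below with one enum per domain. All class names, member names, and values here are hypothetical and are not taken from the EHOS code base.

```python
from enum import Enum


class NodeState(Enum):
    """State of the VM as seen by the cloud (hypothetical names)."""
    VM_BOOTING = "vm_booting"
    VM_ACTIVE = "vm_active"
    VM_DELETED = "vm_deleted"


class CondorStatus(Enum):
    """Status of the node as seen by condor (hypothetical names)."""
    CONDOR_IDLE = "condor_idle"
    CONDOR_BUSY = "condor_busy"
    CONDOR_LOST = "condor_lost"


class JobStatus(Enum):
    """Status of a queued job (hypothetical names)."""
    JOB_IDLE = "job_idle"
    JOB_RUNNING = "job_running"
    JOB_COMPLETED = "job_completed"


def describe(state: Enum) -> str:
    """Return an unambiguous label such as 'JobStatus.JOB_IDLE'."""
    return f"{type(state).__name__}.{state.name}"
```

With separate enums, an "idle" node and an "idle" job can no longer be mistaken for each other in comparisons or in log output.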
Note down which versions of software/libraries were used for development:
An incomplete list:
htcondor
openstacksdk
cloud-init
pip
python
Store the daemon config in a database instead of a config file. This makes it possible to tweak the EHOS daemon from a UI.
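A minimal sketch of what database-backed settings could look like, using the standard-library sqlite3 module. The table layout, function names, and the `nodes_max` option are assumptions made for illustration only.

```python
import sqlite3


def open_config_db(path=":memory:"):
    """Open (or create) a key/value config table. Hypothetical layout."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config "
        "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    return conn


def set_option(conn, key, value):
    """Insert or overwrite one option; values are stored as text."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
            (key, str(value)),
        )


def get_option(conn, key, default=None):
    """Return the stored text value, or the default if the key is unset."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else default
```

A daemon could re-read the options it cares about on every loop iteration, so a change made through a UI would take effect without a restart.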
Check that the config files are in the correct format.
Some quick Google hits:
https://stuartgunter.wordpress.com/2011/05/25/yaml-and-the-importance-of-schema-validation/
http://tekiwiki.blogspot.com/2017/03/normal-0-false-false-false-en-us-x-none_15.html
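As a rough illustration of such a format check, the sketch below validates an already-parsed config (a nested dict) against a small schema. The section names, keys, and types in `SCHEMA` are hypothetical and do not describe the real EHOS config file.

```python
# Hypothetical schema: section name -> {key: expected type}
SCHEMA = {
    "daemon": {"sleep_min": int, "sleep_max": int, "nodes_max": int},
    "condor": {"host_ip": str},
}


def validate_config(config, schema=SCHEMA):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for section, keys in schema.items():
        block = config.get(section)
        if not isinstance(block, dict):
            errors.append(f"missing or malformed section '{section}'")
            continue
        for key, expected in keys.items():
            if key not in block:
                errors.append(f"missing key '{section}.{key}'")
            elif not isinstance(block[key], expected):
                errors.append(
                    f"'{section}.{key}' should be {expected.__name__}, "
                    f"got {type(block[key]).__name__}"
                )
    return errors
```

Collecting all problems instead of stopping at the first one lets the daemon print a complete report at start-up before refusing to run.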
Describe the deployment for a production environment.
Add a new image showing where EHOS lives in a stack.
A dedicated script that configures condor on the master node instead of having it in the daemon script.
Make EHOS run as a systemd service (managed with systemctl) instead of from the command line.
The current dev version only tracks node and job counts, plus a single event when the system is started. More could be added through a generic module that can be called from suitable places in the code.
Currently the deployment script only builds the image for one cloud; this needs to be done on all of them.
This depends on how to proceed with the staging scenario going forward
When creating nodes, let the creation process run in the background, but don't create additional resources until the nodes are up and running.
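The gating part of this idea can be sketched as a small pure function that the daemon would call once per cycle. The function name and its parameters are hypothetical, and the background creation itself (threads or similar) is not shown.

```python
def nodes_to_create(wanted, running, booting, max_nodes):
    """Return how many new nodes to request in this cycle.

    wanted    -- total number of nodes the queue currently calls for
    running   -- nodes that are up and usable
    booting   -- nodes requested earlier that are not up yet
    max_nodes -- hard upper limit on nodes
    """
    if booting > 0:
        # Earlier requests are still in flight: wait for them first.
        return 0
    room = max_nodes - running
    return max(0, min(wanted - running, room))
```

For example, with 2 running nodes, 5 wanted and 1 still booting, the function returns 0 until the booting node is up, and then returns 3 on a later cycle.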
The default disk space for the flavors provided on OpenStack is too small for real data analysis.
When creating the VM, one needs to attach a volume that can be used as scratch. I believe it might be possible to attach the same volume to a collection of nodes, but this needs to be tested before I can be sure.
The other problem is setting the scratch area in HTCondor. The last time I investigated this, the solution I came up with was:
In the end I mounted /var/lib/condor/execute onto an external volume, as I could not find the correct setting to point it somewhere else.
Allocate nodes depending on the queued job requirements, e.g. memory and cores.
This will require EHOS to pull information about the flavors available in an OpenStack and try to use them in an optimal fashion.
It is probably possible to set some minimum requirements in the configuration file so we don't end up trying to spin up servers that do not meet them. (This has been a "feature" before, as it catches the openstacksdk out.)
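A minimal sketch of the selection step, assuming the available flavors have already been fetched from the cloud and reduced to name, vCPU count, and RAM. The `Flavor` class, `pick_flavor`, and the `min_cores`/`min_ram_mb` settings are hypothetical names; "optimal" is simplified here to "smallest flavor that fits".

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Flavor:
    """Simplified flavor record (hypothetical; not an openstacksdk type)."""
    name: str
    vcpus: int
    ram_mb: int


def pick_flavor(flavors: Iterable[Flavor], cores: int, ram_mb: int,
                min_cores: int = 1, min_ram_mb: int = 0) -> Optional[Flavor]:
    """Return the smallest flavor satisfying the job and the configured minimums."""
    need_cores = max(cores, min_cores)
    need_ram = max(ram_mb, min_ram_mb)
    candidates = [f for f in flavors
                  if f.vcpus >= need_cores and f.ram_mb >= need_ram]
    # Smallest by vCPUs first, then by RAM; None if nothing fits.
    return min(candidates, key=lambda f: (f.vcpus, f.ram_mb), default=None)
```

Returning `None` when nothing fits leaves it to the caller to decide whether the job should stay queued or be reported as unsatisfiable.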
Replace the home-made logging with the Python logging module.
This file is currently put in /tmp, so it gets deleted if the server runs for a long time, and the continuously changing filename makes debugging harder.
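A possible starting point with the standard logging module is sketched below, assuming the file in question is the log file. The logger name "ehos", the size limits, and the format string are assumptions, not existing EHOS settings.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_logger(log_file, level=logging.INFO):
    """Return the shared 'ehos' logger writing to a fixed, rotating file."""
    logger = logging.getLogger("ehos")
    logger.setLevel(level)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = RotatingFileHandler(
            log_file, maxBytes=1_000_000, backupCount=5
        )
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s %(name)s: %(message)s"
        ))
        logger.addHandler(handler)
    return logger
```

A stable path outside /tmp keeps the filename predictable, and the rotating handler bounds disk usage on a long-running server.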
In an EHOS setup where the VMs are configured/handled by a CM, an init file is not required.
Make the execute file optional
Check if it is possible to run the master outside OpenStack.
This will furthermore be useful when testing data/binary transfer between the master node and the slaves
The initial implementation assumes the master doesn't need to be restarted. It would be nice to be able to restart a master node and have it pick up the "knowledge" about nodes and states.
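One simple way to let a restarted master recover its knowledge would be to persist the node table to disk after every change. The sketch below uses a JSON file; the function names and the shape of the `nodes` dict are hypothetical.

```python
import json
import os
import tempfile


def save_state(path, nodes):
    """Write the node table atomically so a crash never leaves half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(nodes, fh)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_state(path):
    """Return the saved node table, or an empty one on first start."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}
```

After loading, the master would still need to reconcile the saved table with what the cloud and condor actually report, since nodes may have changed while it was down.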
Several solutions to this problem:
A script that reverts the OpenStack to the state it was in before the quick_test.
Delete:
firewall settings
ssh keys
images
VMs
On a more general note, make sure resources exist before attempting to use them.
A bug when updating the state for a known node.
Is it possible to run EHOS across multiple OpenStack instances, or across multiple regions within one OpenStack instance?
For this the config file needs to be tweaked slightly to contain multiple regions.
Multiple OpenStack connectors and base images for each site.
Random(?) creation between regions as a test?
Currently all nodes are configured as:
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = false
It would be better to have this reflect the available number of CPUs etc.
Not sure whether this should be done by EHOS or whether HTCondor can do it.
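HTCondor itself has a partitionable-slot mode in which a single slot owns all detected resources and is split per job according to its request. Assuming that mode suits EHOS, the node config could be rendered as in the sketch below; `render_slot_config` is a hypothetical helper, and only the final knob differs from the settings listed above.

```python
def render_slot_config(partitionable=True):
    """Render the slot section of a condor node config as text."""
    lines = [
        "NUM_SLOTS = 1",
        "NUM_SLOTS_TYPE_1 = 1",
        "SLOT_TYPE_1 = cpus=100%",
        # With a partitionable slot, condor carves out per-job slots
        # from the node's detected CPUs instead of one fixed slot.
        f"SLOT_TYPE_1_PARTITIONABLE = {'true' if partitionable else 'false'}",
    ]
    return "\n".join(lines) + "\n"
```

Whether this is sufficient, or whether memory and disk shares also need to be listed in `SLOT_TYPE_1`, would have to be checked against the HTCondor manual for the version in use.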
Investigate the following questions:
Is it possible to submit a job from an external computer and share data with the master and slave nodes?
What are the data sharing mechanisms used in HTCondor?
How are binary dependencies resolved (if they are)? E.g. try to run a dynamically linked binary on a client. If a binary differs between master and slave, which one is run?
How long does it take to transfer a large dataset between nodes?
Use a password for node authentication rather than an IP range/whitelist.
Get rid of old code, e.g. the flask frontend and the binary test for HTCondor.