roced's Introduction

ROCED

Responsive On-Demand Cloud-enabled Deployment (ROCED) is a tool that interfaces with different batch systems (Torque, HTCondor) and cloud sites (Eucalyptus, OpenNebula, OpenStack, Amazon EC2, etc.). It monitors the demand for computing resources in the batch system(s) and dynamically manages virtual machines (starting and terminating them) on different cloud sites.

Announcement

ROCED can manage hundreds of VMs. However, to improve scheduling across multiple heterogeneous resources, we decided to follow a new approach that schedules resources based on their utilization. Our new resource manager, TARDIS (Transparent Adaptive Resource Dynamic Integration System), follows this approach and shows good results in managing resources at multiple resource providers. We are therefore concentrating on the development of TARDIS and have stopped further development of ROCED. We think that this is the correct way to go and invite you to follow us on GitHub (https://github.com/tardis-resourcemanager/tardis). TARDIS also uses the adapter concept to manage resources from different resource providers and to support different batch systems.

Design

ROCED periodically runs a management cycle, where it performs three steps:

  • Monitor batch system's queue and determine demand for machines
  • Boot machines
  • Integrate booted machines into batch system
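
The three steps above can be sketched as a small loop. This is a minimal illustration, not ROCED's actual core: the function names, the callables passed in, and the 60 second interval are all assumptions made for the example.

```python
import time


def run_management_cycle(get_demand, boot_machines, integrate_machines):
    """Run one cycle: monitor demand, boot machines, integrate them."""
    demand = get_demand()               # step 1: how many machines are needed
    booted = boot_machines(demand)      # step 2: request machines at a site
    integrate_machines(booted)          # step 3: add them to the batch system
    return booted


def run_forever(cycle, interval_seconds=60):
    """Repeat the cycle periodically; the interval is an assumed value."""
    while True:
        cycle()
        time.sleep(interval_seconds)
```

In this sketch each step is a plain callable, which mirrors the idea that the steps are delegated to exchangeable adapters.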

Visualisation of management cycle

ROCED consists of five components; everything except the core has a modular structure in order to offer maximum flexibility.
Users can freely combine different adapters to fulfill their requirements or even write their own.

ROCED needs at least one of each component to be useful, and we advise using a Requirement Adapter and an Integration Adapter for the same batch system.

  • Core
  • Requirement Adapters
    Monitor batch system(s) to determine the demand for machines.
  • Site Adapters
    Request machines at cloud site(s)
  • Integration Adapters
    (Dis-)Integrate running machines from/into batch system(s)
  • Broker
    Balance demand across different cloud sites, depending on different metrics (e.g.: cost)
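
As an illustration of the broker's role, the following sketch distributes a demand across sites by cost, cheapest first. It is only one possible policy under stated assumptions: the function name, the tuple layout, and the capacity field are hypothetical and not taken from ROCED.

```python
def distribute_demand(demand, sites):
    """Assign machines to sites, cheapest first, respecting capacity.

    sites: list of (name, cost_per_machine, free_capacity) tuples.
    Returns a dict mapping site name to the number of machines to request.
    """
    allocation = {}
    remaining = demand
    # Visit sites in order of increasing cost.
    for name, _cost, capacity in sorted(sites, key=lambda site: site[1]):
        if remaining <= 0:
            break
        count = min(remaining, capacity)
        if count > 0:
            allocation[name] = count
            remaining -= count
    return allocation
```

Other metrics could be used by changing only the sort key.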

Visualisation of modular components

Requirements/Installation

  • Python 2.7 or 3.3+
    • Python 2 requires the future and the configparser package
    • Various adapters have system/site dependent packages.
      We follow the PEP 8 guideline when listing module imports, so you can easily identify the needed modules for each adapter.
  • Correctly set up batch system
  • VM image(s) which can integrate into batch system(s) as worker node(s).

Contributors

ROCED was developed at the Institut für Experimentelle Kernphysik at the Karlsruhe Institute of Technology.
Further information can be found in the doc folder.

License

ROCED is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

ROCED is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with ROCED.  If not, see <http://www.gnu.org/licenses/>.

roced's People

Contributors

fra-nk, guenthererli, mschnepf, thomashauth

roced's Issues

Make logging more robust / ROCED run independent of logging

Currently ROCED depends on its logging functionality: if CSV/JSON/log files cannot be read, the process hangs.

By design, ROCED holds all the information it needs in memory. Failing to write log files should raise a severe warning, but it should not be treated as a real error or cause a crash.
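
A minimal sketch of the requested behavior, assuming Python 3 and a hypothetical helper name and logger name: I/O failures are turned into warnings so that the caller can continue with its in-memory state.

```python
import json
import logging

logger = logging.getLogger("roced.sketch")  # hypothetical logger name


def write_state_safely(path, state):
    """Try to write state as JSON; warn instead of raising on failure."""
    try:
        with open(path, "w") as handle:
            json.dump(state, handle)
        return True
    except (OSError, TypeError, ValueError) as error:
        # The in-memory state stays authoritative, so the run continues.
        logger.warning("Could not write %s: %s", path, error)
        return False
```

The return value lets the caller count consecutive failures and escalate if needed.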

SSH Implementation

Currently the SSH class relies on the operating system's SSH implementation.
→ switch to Paramiko and get rid of OS dependencies.

Improve MachineRegistry Events propagation

Currently every MachineRegistry event is published to every subscribed adapter.
NewMachineEvent & MachineRemovedEvent are currently rarely used due to missing information (MR doesn't have any information/attributes for these machines). Information on the site and batch system is currently lacking as well.

→ Develop a way to discern a machine's "target" adapters and only publish the information to these adapter(s).
Then NewMachineEvent & MachineRemovedEvent could be used similar to constructors/destructors in those adapters.
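
One way to picture targeted propagation is a small publisher keyed by target. This is a sketch under assumptions: the class name, the string keys such as "site:openstack", and the callback style are hypothetical and do not describe the existing MachineRegistry API.

```python
from collections import defaultdict


class EventBus:
    """Publish machine events only to adapters registered for a target."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # target key -> callbacks

    def subscribe(self, target, callback):
        self._subscribers[target].append(callback)

    def publish(self, targets, event):
        """Deliver event once to each callback subscribed to any target."""
        delivered = []
        for target in targets:
            for callback in self._subscribers.get(target, []):
                if callback not in delivered:
                    callback(event)
                    delivered.append(callback)
        return len(delivered)
```

An adapter that subscribes only to its own site or batch system key never sees events for other machines.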

Check on blocked Jobs in FreiburgSiteAdapter

To reduce the load on the MOAB batch system, a limit on the number of blocked jobs would be nice. Since we want to change to a single XML MOAB request containing all information, this should be realized with a simple check function. This function should block new job submissions if there are too many blocked jobs.
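
Such a check function could look like the sketch below. The function name, the "blocked" state string, and the default limit are assumptions for illustration, not values used by ROCED or MOAB.

```python
def may_submit_jobs(job_states, max_blocked=10):
    """Return True while the number of blocked jobs is within the limit.

    job_states: iterable of state strings taken from a single queue query.
    max_blocked: assumed threshold, not a value from ROCED.
    """
    blocked = sum(1 for state in job_states if state == "blocked")
    return blocked <= max_blocked
```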

(Consistent) support for multiple machine types

Not all parts of ROCED correctly handle the presence of multiple machine types.

→ Add consistent behavior while still allowing different sites to manage individual information for each machine type (image id, etc.).

Account for black hole and idle nodes

ROCED currently primarily monitors the absolute demand for resources.

Reasons to throttle/stop requesting new machines:

  • VM/docker images are broken (-> black hole nodes, that start and cancel lots of jobs in a short time)
  • jobs themselves are broken (-> lots of jobs started and finished very quickly)
  • Batch system is damaged/corrupt/under load (-> lots of machines booting, but RecentJobStarts = 0)
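
The symptoms listed above could feed a simple throttling check. The sketch below is a heuristic under assumptions: the parameter names and the 0.5 threshold are invented for the example, and how the counts are obtained from the adapters is left open.

```python
def should_throttle(booting_machines, recent_job_starts,
                    short_jobs, finished_jobs, max_short_fraction=0.5):
    """Return True when requesting more machines looks counterproductive.

    All thresholds are illustrative assumptions.
    """
    # Machines are booting but the batch system starts no jobs.
    if booting_machines > 0 and recent_job_starts == 0:
        return True
    # Many jobs end almost immediately: broken image or broken jobs.
    if finished_jobs > 0:
        if short_jobs / float(finished_jobs) > max_short_fraction:
            return True
    return False
```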

-> More communication between Requirement and Integration adapter to account for these problems.

Parallelize adapter functions

Adapters in the same category (Requirement, Integration, Site...) should be independent of each other. This applies especially if they are of different types (e.g. communicating with OpenStack and Freiburg), but also if they are of the same type (e.g. two HTCondor adapters).

=> Parallel processing of Spawn/TerminateMachines, Manage, etc.

Possible limitations:

  • Events introduce interdependency (may be greatly reduced with #10)
  • Socket-limits, time-outs, firewalls
  • Thread-safety/locking of logging objects

Python offers a few different options, which have to be explored in more detail:
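
As one illustration (chosen for this example, not named in the issue), the standard library's concurrent.futures module (Python 3.2+) can run adapter calls in a thread pool. The manage() method name and the helper name are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def manage_in_parallel(adapters, max_workers=4):
    """Call manage() on every adapter concurrently and collect outcomes.

    Returns a list of (adapter, result_or_exception) in input order.
    """
    outcomes = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [(adapter, pool.submit(adapter.manage))
                   for adapter in adapters]
        for adapter, future in futures:
            try:
                outcomes.append((adapter, future.result()))
            except Exception as error:
                # One failing adapter must not stop the others.
                outcomes.append((adapter, error))
    return outcomes
```

Threads suit this case because adapter calls mostly wait on network or subprocess I/O; shared objects such as loggers or the machine registry would still need locking.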

FreiburgSiteAdapter should use array submits

Currently each machine is a single job.
MOAB supports array submits to bundle identical jobs into larger clusters.

This should significantly reduce the system load on the scheduler.

Periodically log/export information on sites' states

  • Machine Registry
  • Site Stats
  • Average machine runtime
  • number of "failed" VMs [require machines to traverse states without skipping a single state. Skipping a state can then be easily identified as erroneous?]
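
The idea of flagging skipped states can be sketched as follows. The state names and their order are hypothetical placeholders; ROCED's real machine states may differ.

```python
# Hypothetical state order; ROCED's real state names may differ.
STATE_ORDER = ["requested", "booting", "up", "integrated", "down"]


def skipped_states(history):
    """Return the states a machine skipped, given its observed history."""
    skipped = []
    for previous, current in zip(history, history[1:]):
        start = STATE_ORDER.index(previous)
        end = STATE_ORDER.index(current)
        # Everything strictly between two observed states was skipped.
        skipped.extend(STATE_ORDER[start + 1:end])
    return skipped
```

A machine with a non-empty result could then be counted as "failed" in the exported statistics.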

MachineRegistry "cleanup"

Current situation:

  • Each adapter has to implement machine timeouts/cleanup themselves from scratch.
  • Changing the configuration file can result in Machines being stuck in the MachineRegistry forever, resulting in the need to edit or delete the machine registry JSON.
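
One possible direction, which the issue itself does not spell out, is a single shared cleanup routine. The sketch below assumes a registry shaped as a plain dict with "site" and "last_update" fields; these names are hypothetical.

```python
import time


def remove_stale_machines(registry, known_sites, max_age_seconds, now=None):
    """Drop machines that are too old or belong to unconfigured sites.

    registry: dict of machine id -> {"site": str, "last_update": float}.
    Returns the list of removed machine ids.
    """
    now = time.time() if now is None else now
    removed = []
    for machine_id, info in list(registry.items()):
        too_old = now - info["last_update"] > max_age_seconds
        orphaned = info["site"] not in known_sites
        if too_old or orphaned:
            del registry[machine_id]
            removed.append(machine_id)
    return removed
```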

Refactoring of the MOAB requests in the FreiburgSiteAdapter

There were issues with disappearing MOAB jobs registered by ROCED.

  • Since ROCED issues three showq requests with short delays between them, jobs can disappear between requests. This should be solved by using only one general MOAB request.
  • ROCED should move from regex parsing to the XML output of showq to avoid issues regarding, for example, the number of cores per VM.
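
Parsing a single XML response could look like the sketch below, using the standard library's xml.etree.ElementTree. The sample document is simplified and hypothetical: the element and attribute names (queue, option, job, JobID, State, ReqProcs) are assumptions and should be checked against real showq output.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; real showq XML carries more attributes.
SAMPLE_XML = """
<Data>
  <queue option="active">
    <job JobID="101" State="Running" ReqProcs="4"/>
  </queue>
  <queue option="blocked">
    <job JobID="102" State="Idle" ReqProcs="8"/>
  </queue>
</Data>
"""


def parse_jobs(xml_text):
    """Return {job id: {"queue", "state", "cores"}} from one XML document."""
    jobs = {}
    root = ET.fromstring(xml_text.strip())
    for queue in root.findall("queue"):
        for job in queue.findall("job"):
            jobs[job.get("JobID")] = {
                "queue": queue.get("option"),
                "state": job.get("State"),
                "cores": int(job.get("ReqProcs", "0")),
            }
    return jobs
```

Because all queues come from one response, a job cannot fall between two separate queries, and the core count is read from an attribute instead of a regex match.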
