lobster-tutorial

Introduction

This repo provides the code for a basic tutorial showcasing the use of lobster with a non-CMSSW-based program. It is assumed that you already have a working lobster installation and understand the basics of how to set up and configure a lobster job.

Setup and Usage

First we need to check out the repo from GitHub:

cd ~/Public
git clone https://github.com/Andrew42/lobster-tutorial.git

Next we need to activate our CMS environment in order to use lobster:

cd ~/path/to/some/release/CMSSW_X_Y_Z
cmsenv

Finally, we activate our VOMS proxy and the lobster virtual environment:

voms-proxy-init -voms cms -valid 192:00
source ~/.lobster/bin/activate.csh # or activate.sh if in a bash environment

Now we should be good to run the example:

cd ~/Public/lobster-tutorial
lobster process lobster_config.py

Description

Running lobster with a non-CMSSW executable is not very different from using lobster normally. The setup of the StorageConfiguration object should be largely unchanged:

import datetime

# Timestamp used to give each run its own output directory
tstamp = datetime.datetime.now().strftime('%Y%m%d_%H%M')
storage = StorageConfiguration(
    input=[
        "file://~$USER/Public/lobster-tutorial",
    ],
    output=[
        "hdfs://eddie.crc.nd.edu:19000/store/user/$USER/diceRolling/test/testing_{tstamp}".format(tstamp=tstamp),
    ],
    disable_input_streaming=True
)

The main caveat to keep in mind is that when lobster passes the filenames to your job they might be prefixed with a file transfer protocol identifier, e.g. file:, which should be accounted for in your code in order to properly read the file.
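A minimal sketch of how a job script might account for such a prefix is shown here. The helper name strip_protocol and the exact prefix forms it handles are assumptions made for illustration; they are not part of lobster.

```python
import sys


def strip_protocol(filename, prefixes=("file://", "file:")):
    """Return filename without a leading protocol identifier, if one is present.

    The prefixes are checked in order, so the longer "file://" form is
    tried before the bare "file:" form. (Hypothetical helper, not lobster API.)
    """
    for prefix in prefixes:
        if filename.startswith(prefix):
            return filename[len(prefix):]
    return filename


if __name__ == "__main__":
    # Assumes the input files arrive as command line arguments
    for path in [strip_protocol(arg) for arg in sys.argv[1:]]:
        print(path)
```

Filenames that carry no prefix pass through unchanged, so the helper is safe to apply to every input.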

For non-CMSSW jobs you will almost always be using the plain Dataset object, where you will be specifying all the files and/or directories you want lobster to process. The list object passed to files can contain paths to individual files, or it can be paths to directories containing all the files you wish to run over. The patterns argument can then be used to help make sure that lobster only processes the types of files you want from a given directory. Finally, the files_per_task option will tell lobster how many files should be grouped together in a single task.

dataset=Dataset(
    files=['inputs/test_inputs_{tstamp}'.format(tstamp=tstamp)],   # This needs to be relative to the input path specified in your StorageConfiguration
    patterns=['*.txt'],
    files_per_task=1
)

Keep in mind that this means that the smallest unit of work that lobster can process at a time this way is a single file. So if your inputs aren't split up over a sufficient number of files, you may not be able to take full advantage of the benefits that batch computing provides.
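To make the arithmetic concrete, the sketch below mimics the effect of patterns and files_per_task on a list of filenames. It is only an illustration and not lobster's actual task creation code; the function names and the example filenames are hypothetical.

```python
import fnmatch


def select_files(filenames, patterns):
    """Keep only the filenames that match at least one glob pattern."""
    return [f for f in filenames if any(fnmatch.fnmatch(f, p) for p in patterns)]


def group_into_tasks(filenames, files_per_task):
    """Split filenames into consecutive groups of at most files_per_task."""
    return [
        filenames[i:i + files_per_task]
        for i in range(0, len(filenames), files_per_task)
    ]


# Hypothetical directory contents
candidates = ["roll_0.txt", "roll_1.txt", "roll_2.txt", "notes.md"]
selected = select_files(candidates, ["*.txt"])
tasks = group_into_tasks(selected, 2)
for task in tasks:
    print(task)
```

With three matching files and two files per task, only two tasks can be created, no matter how many workers are available.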

Note: Currently there is a bug in the way dependent workflow tasks update their expected number of remaining units to process. As a result, the files_per_task argument of all parent datasets must be set to 1; otherwise the dependent workflow will get stuck in an endless loop of failing to create tasks due to insufficient units from the parent dataset. If none of your workflows involve a ParentDataset, then you can safely set files_per_task to any value you wish.

The most significant difference from how a normal lobster configuration might be set up is that for non-CMSSW jobs you will generally need to specify all the non-CMSSW code required to run your job. This is done through the extra_inputs argument of the Workflow class:

import os
import subprocess

# Top-level directory of the git repo that the config is run from
GIT_REPO_DIR = subprocess.check_output(['git','rev-parse','--show-toplevel']).strip()
extra_inputs = [
    os.path.join(GIT_REPO_DIR,"python/Dice.py"),
    os.path.join(GIT_REPO_DIR,"python/the_job.py"),
    os.path.join(GIT_REPO_DIR,"python/merge_results.py")
]

A key thing to keep in mind here is that when lobster copies the extra inputs to the worker to run your job, it won't preserve the directory structure of whatever original location you got the extra inputs from. So in the above example, the three extra input files (Dice.py, the_job.py, and merge_results.py) won't be located in a subdirectory called python when the remote worker tries to run your job. Instead, they will all be located in the top-level directory of the task working area. This means that if your code's import statements assume some sort of directory structure, you need to make sure that having all the files in the same top-level directory won't break those imports.
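As a rough sketch of what this flattening implies, the snippet below lists the names the extra inputs would have in the task working area, assuming each file keeps only its basename. The /repo prefix and the function names are hypothetical. The collision check rests on the further assumption that two extra inputs with the same filename would end up at the same location.

```python
import os
from collections import defaultdict


def flattened_names(paths):
    """Return the filename each path would have once its directories are dropped."""
    return [os.path.basename(p) for p in paths]


def find_name_collisions(paths):
    """Map each basename shared by more than one path to the paths that share it."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[os.path.basename(path)].append(path)
    return {name: srcs for name, srcs in by_name.items() if len(srcs) > 1}


# Hypothetical absolute paths standing in for the extra_inputs list
extra_inputs = [
    "/repo/python/Dice.py",
    "/repo/python/the_job.py",
    "/repo/python/merge_results.py",
]
worker_files = flattened_names(extra_inputs)
collisions = find_name_collisions(extra_inputs)
print(worker_files)
print(collisions)
```

Under this layout, the_job.py can import from a module named Dice, whereas an import that goes through a python package would fail on the worker.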

The Workflow object itself isn't much different from normal running. The main difference is that the command argument should correspond to whatever command you would use to run your code interactively. The other important option is globaltag=False, which prevents the autosense.sh code from being run; that code has problems if you're running from a directory not located inside of a CMSSW release:

roll = Workflow(
    label='roll',
    dataset=dataset,
    sandbox=cmssw.Sandbox(release=os.environ["CMSSW_BASE"]),
    category=Category(name='processing',cores=1,memory=2500,disk=2900),
    command='python the_job.py @inputfiles',
    globaltag=False,
    merge_size='1G',
    merge_command='python merge_results.py',
    outputs=['results.json']
)

That should cover most of the major differences that one should expect to encounter when trying to run lobster with a non-CMSSW-based workflow.
