
dig-loam-stream's Introduction

LoamStream

LoamStream is a genomic analysis stack featuring a high-level language, compiler and runtime engine.

Building

Requirements

  • Git
  • Java 8
  • SBT 1.5.0+

To build a runnable LoamStream binary from the master branch:

To run the unit tests:

  • sbt test

To run the integration tests:

  • sbt it:test
  • This is best done from Jenkins, or at a minimum, from a Broad VM that can submit Uger jobs.
    • See [JENKINS.md]

Running

  • See [CLI.md] and [LOAM.md]

dig-loam-stream's People

Contributors

clintatthebroad, curoli, kyuksel, massung, rmkoesterer


Forkers

t2dream

dig-loam-stream's Issues

Before execution, list all jobs that will be run

Print out a comprehensive list of all commands to be run (to be used for reference).

(NB: It will not be possible to output a complete list ahead-of-time, since given upcoming scatter-gather/dynamic-execution changes, we won't know the execution graph in full before beginning execution.)
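For the statically-known portion of the graph, the listing could be a simple dependency-first traversal. A minimal sketch, where Job and its fields are simplified stand-ins rather than LoamStream's actual job type:

```scala
// Simplified stand-in for LoamStream's job type.
final case class Job(name: String, commandLine: String, dependsOn: Seq[Job] = Nil)

object ExecutionPlan {
  // List every command in dependency-first order, without running anything.
  def listPlanned(roots: Seq[Job]): Seq[String] = {
    val seen = scala.collection.mutable.LinkedHashSet.empty[String]

    def visit(job: Job): Unit = {
      job.dependsOn.foreach(visit) // upstream commands print first
      seen += job.commandLine      // LinkedHashSet keeps first-insertion order
    }

    roots.foreach(visit)
    seen.toSeq
  }
}
```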

Jobs being run too many times

Look into why the following is printed out twice:

13:56:48.080 [pool-8-thread-43] INFO  l.m.j.c.CommandLineStringJob - Now running: '/humgen/diabetes2/users/ryank/software/GenotypeHarmonizer-1.4.18/GenotypeHarmonizer.sh         --input qc/CAMP.chr21         --inputType PLINK_BED         --output qc/CAMP.chr21.harm         --outputType PLINK_BED         --ref /humgen/diabetes2/users/dig/loamstream/pipeline/data/kg/1000GP_Phase3_vcf/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz         --refType VCF         --keep         --update-id         --variants 1000         --mafAlign 0.1         --update-id         --update-reference-allele         --debug'

Make Loam files into Scala files

Pros:

  • Enable Loam DSL to be syntax and type-checked in IDEs
  • Get correct locations of compile errors
  • Make stack traces actually traceable
  • Make Loam more intuitive to use, implement and extend

Allow running more than one Hail job at a time at Google

LS currently creates at most one Google cluster (it's created lazily on first use by GoogleCloudChunkRunner) and deletes it at app-shutdown-time. This means only one Google-Hail job can be running at once, something @rmkoesterer has identified as a bottleneck. It would be good to allow creating one cluster per Google-Hail job, to allow running multiple such jobs concurrently.

This should mostly involve GoogleCloudChunkRunner. Any clusters that get created will need to get tracked somehow and registered for shutdown, preferably once they're done being used, but at app-shutdown-time at the very least. (Google bills several ways, including cpu-seconds.)

TODOs/Notes:

  • Determine whether it's possible or desirable to run multiple jobs on the same cluster. (This sure sounds like it wouldn't make sense, but the question has been raised, and we've never dug up a definitive answer. -Clint)
  • Is it desirable to re-use clusters for multiple Hail-Google jobs?
  • We have a few quotas at Google. One notable one is for CPU cores. A .loam file that declares N independent Google-Hail jobs could burn through that quota very fast if N is sufficiently large and enough of the jobs are run concurrently. Some sort of limiting mechanism is probably a good idea for this reason, but isn't necessary for a first pass.
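A sketch of what per-job cluster management with a concurrency cap could look like. ClusterClient, PerJobClusterManager, and every method name below are hypothetical, not GoogleCloudChunkRunner's actual API:

```scala
import java.util.concurrent.Semaphore
import scala.collection.concurrent.TrieMap

// Hypothetical client; stands in for whatever wraps the Google Cloud SDK.
trait ClusterClient {
  def createCluster(id: String): Unit
  def deleteCluster(id: String): Unit
}

final class PerJobClusterManager(client: ClusterClient, maxClusters: Int) {
  private val permits = new Semaphore(maxClusters)
  private val live = TrieMap.empty[String, Unit]

  // Run a job on its own cluster; the cluster is deleted as soon as the job
  // finishes, not at app-shutdown-time, to limit cpu-second billing and to
  // respect the CPU-core quota via the semaphore.
  def withCluster[A](jobId: String)(body: => A): A = {
    permits.acquire() // blocks if maxClusters clusters are already live
    client.createCluster(jobId)
    live.put(jobId, ())
    try body
    finally {
      client.deleteCluster(jobId)
      live.remove(jobId)
      permits.release()
    }
  }

  // Safety net for app shutdown: delete anything still tracked.
  def shutdown(): Unit = live.keys.foreach(client.deleteCluster)
}
```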

LS Needs Simple Integration Tests

Do an end-to-end run of a simple pipeline that copies 1 or 2 files via Uger, for example. Currently, no such end-to-end test exists for the Uger case. This will need a Jenkins build plan as well, since Uger pipelines can't be run from unit tests on developers' machines.

Allow specifying composable chunks of Loam files

It would be nice to have a simple, general way to say "file C depends on the outputs of files A and B" without having to dig around in A and B to see what their "terminal" outputs are (see also here).

Allow display or logging of a Tool's dependencies, and of the Tools that depend on it

For each command currently being executed, show the commands leading to and from it (its inputs and outputs). If the user expects certain things to run in parallel and they aren't, it would save time to kill everything rather than wait for the outcome. Below is an example of a file that LoamStream could output to give the user a picture of the graph surrounding the particular command they are interested in. This may require letting the user name a cmd (or the file could be named after the line number in the .loam file where the cmd appears).

file cmd1.graph

pre: cmdA
     cmdB
     cmdC
     ...
cmd1: cmd1
post: cmdD
     cmdE
     cmdF
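Given an edge-list view of the execution graph, the pre/post sets above could be computed along these lines. The Map-of-edges representation is an assumption for illustration, not LoamStream's actual graph type:

```scala
object ToolGraph {
  // edges: producer -> consumers (cmdA -> Set(cmd1) means cmd1 reads cmdA's output)
  def neighbors(edges: Map[String, Set[String]], cmd: String): (Set[String], Set[String]) = {
    // "pre": every producer whose consumer set includes cmd
    val pre = edges.collect { case (producer, consumers) if consumers(cmd) => producer }.toSet
    // "post": everything cmd feeds into
    val post = edges.getOrElse(cmd, Set.empty)
    (pre, post)
  }

  // Render in the same shape as the cmd1.graph example above.
  def render(cmd: String, pre: Set[String], post: Set[String]): String =
    s"pre: ${pre.toSeq.sorted.mkString(" ")}\n$cmd\npost: ${post.toSeq.sorted.mkString(" ")}"
}
```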

Interact with Google Cloud via REST

It's possible to create a REST client to interact with Google Cloud to manage Spark clusters (create/delete cluster, submit and monitor jobs). This would be advantageous over using the command line SDK, which needs to be installed by LoamStream users (unless we provide Docker images).
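As a sketch, cluster creation via the Dataproc v1 REST API amounts to a POST to a well-known URL. The request below is only built, never sent (sending it would also need an OAuth2 bearer token), and the config field names should be verified against the Dataproc API docs before relying on them:

```scala
// Build (but do not send) a Dataproc v1 cluster-creation request.
// Project, region, and cluster names below are placeholders.
object DataprocRest {
  def createClusterUrl(project: String, region: String): String =
    s"https://dataproc.googleapis.com/v1/projects/$project/regions/$region/clusters"

  def createClusterBody(clusterName: String, numWorkers: Int): String =
    s"""{"clusterName": "$clusterName", "config": {"workerConfig": {"numInstances": $numWorkers}}}"""
}
```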

LS Should Bake the host used for building into jars

LS currently records build date, git branch, git commit, etc when building, to allow printing those things with --version. We should add one more item: the host on which the jar was built.

This will mean modifying buildInfoTask in build.sbt as well as the Versions class. (Versions holds more than version info now, maybe rename it to something like BuildInfo.)
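A minimal sketch of capturing the build host in build.sbt; the fallback value and exactly where the string gets written into Versions/BuildInfo are assumptions:

```scala
// In build.sbt (an sbt build file is plain Scala): capture the build host
// alongside the existing build-date/branch/commit metadata.
// InetAddress.getLocalHost can fail on machines with broken DNS config, so
// fall back to a placeholder rather than failing the build.
val buildHost: String =
  try java.net.InetAddress.getLocalHost.getHostName
  catch { case _: java.net.UnknownHostException => "unknown-host" }

// buildInfoTask would then write something like the following line into the
// generated Versions (or renamed BuildInfo) source file:
//   val builtOnHost: String = "<buildHost>"
```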

Qacct fails often, leading to noisy logs

qacct polling warnings; e.g.

WARN  loamstream.uger.AccountingClient$ - Error invoking 'qacct'; execution node and queue won't be available.
java.lang.RuntimeException: Nonzero exit code: 1
<long stack trace>

LS sometimes hangs when jobs fail repeatedly

LoamStream attempted to run a failing job (a Google Cloud Hail job meant to run last, after all other jobs) 4 times, failing each time with the same error, but never terminated. LS should have seen that it had run the failing job the maximum allowed number of times (5 in this case), transitioned the job to the FailedPermanently status, and ended execution.
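The missing transition amounts to simple retry accounting. A sketch, with simplified stand-ins for LoamStream's actual status types:

```scala
sealed trait RunStatus
case object Failed extends RunStatus            // may be re-queued
case object FailedPermanently extends RunStatus // terminal; stop retrying

object RunStatus {
  // timesRunSoFar counts the run that just failed. Once the quota is spent
  // the job must not be re-queued, or the app never reaches a terminal state.
  def afterFailure(timesRunSoFar: Int, maxRunsAllowed: Int): RunStatus =
    if (timesRunSoFar >= maxRunsAllowed) FailedPermanently else Failed
}
```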

Scatter-gather/dynamic-execution support

Use something like NativeTools/NativeJobs to defer executing Loam code, as in

dynamic {
  import scala.collection.mutable.ArrayBuffer

  val analysisVcfs = ArrayBuffer.empty[Store[VCF]]
  val regions = Files.readFrom(regionsExclude.path).split(System.lineSeparator)

  // Scatter and map
  regions.foreach { region =>
    // Derive a distinct base name per region (regions like "chr1:100-200"
    // contain colons, which we don't want in file names).
    val epactsOutputFilesBaseName = region.replace(':', '.')
    val outputVcf = store[VCF].at(outDir / s"$epactsOutputFilesBaseName.vcf")

    cmd"""epactsTool
      --vcf $inputVcf
      --region $region
      --out $epactsOutputFilesBaseName"""
      .in(inputVcf).out(outputVcf)

    analysisVcfs += outputVcf // NB: +=, not :+, so the buffer is actually mutated
  }

  // Gather
  val finalVcf = store[VCF].at(outDir / "final.vcf")
  cmd"stripHeaderAndConcat $analysisVcfs > $finalVcf".in(analysisVcfs).out(finalVcf)
}.in(inputVcf).out(finalVcf)

LS uses too much memory

Running the QC pipeline with the METSIM data required setting the Java heap to 4+ gigs. Running with FUSION needs even more.

Initial experiments and past profiling point toward (mis)use of RxJava as a big factor. More investigation is needed.

Allow interpolating lists of Stores in cmd"..." strings

Allow custom expansion of lists of stores within cmds. For example, given regions = List(store1, store2, store3), one could write

cmd"cat ${regions} > ${someOtherStore}"

which would then be interpolated as

cmd"cat store1 store2 store3 > ..."

with the delimiter between stores specified by the user.
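A sketch of the desired expansion, with Store reduced to just a path and the delimiter passed explicitly (both are assumptions about the eventual API):

```scala
// Simplified stand-in for a LoamStream store.
final case class Store(path: String)

object StoreInterpolation {
  // Expand a list of stores into a single command-line fragment,
  // joined by a user-specified delimiter (space by default).
  def expand(stores: Seq[Store], delimiter: String = " "): String =
    stores.map(_.path).mkString(delimiter)
}
```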

Omit queue from Uger native specifications

  • Necessary starting October 3, 2017 at 10am.
  • See #220
    From BITS:
Notification of Pending Change CHG0033049 - October 3, 2017 at 10:00AM
 
Problem:
Multicore jobs are currently dispatching at a poor rate. The poor dispatch rate is caused by the scheduler not having enough information to determine when a host will have the requested number of slots available and is therefore not reserving slots on the optimal host.

Solution:
The short and long queues will be consolidated into the “broad” queue.
A default maximum run time (h_rt) of 2 hours (02:00:00) will be added to every job. This is only a default. Any amount of time can be requested.
Backfilling will be enabled.

Impact:
By adding an appropriate h_rt to every job, the scheduler will be able to identify the order and time cores will free up on each execution host. Knowing the order will allow the scheduler to reserve the appropriate slots allowing multicore jobs to dispatch more quickly.

Job backfill will be enabled.  By adding h_rt to every job, the scheduler will know when enough slots will free up on an execution host to allow the multicore job with reserved slots to run.  With this knowledge, the scheduler will be able to backfill short jobs into the reserved slots without interfering with the multicore job’s dispatch time.

The new configuration will allow the scheduler to more effectively and efficiently schedule short jobs as well as multicore jobs.
	
Required actions:
All jobs will need to be submitted to a single queue named “broad”. The “broad” queue will be the default queue, meaning any job submitted after the cutover will no longer need to specify a queue. Any job dispatched to an execution host and running will continue to run to completion. Any job pending at the time of the cutover will need to be moved to the new broad queue. This is accomplished with “qalter -q broad -u ”. Any job not moved over will pend indefinitely. During the cutover, BITS will move any pending job to the new queue, but any jobs submitted after the cutover will need to be altered by the job owner.
All jobs will have an h_rt of 2 hours set by default. You will need to request an h_rt that is appropriate for your job. There is no limit to this length but the longer the h_rt the less likely the job is to benefit from backfilling. The more accurate every h_rt is, the quicker and more efficient the scheduler will dispatch jobs. The format for h_rt is HH:MM:SS.
 

Example:
   qsub -l h_rt=120:15:30 my_script.sh
   #This job will be killed after 120 hours (5 days), 15 minutes, and 30 seconds of runtime.

If you have any questions or concerns we highly encourage you to send an email to [email protected].

Thank you,
BITS Operations

Files backing anonymous stores are created too eagerly, leading to jobs running when they shouldn't.

Consider src/examples/loam/cp.loam:

val fileIn = store.at("does-not-exist").asInput
val fileTmp1 = store //note anon store
val fileTmp2 = store //note anon store
val fileOut1 = store.at("fileOut1.txt")
val fileOut2 = store.at("fileOut2.txt")
val fileOut3 = store.at("fileOut3.txt")

//cmdA
cmd"cp $fileIn $fileTmp1".in(fileIn).out(fileTmp1)

//cmdB
cmd"cp $fileTmp1 $fileTmp2".in(fileTmp1).out(fileTmp2)

//cmd{X,Y,Z}
cmd"cp $fileTmp2 $fileOut1".in(fileTmp2).out(fileOut1)
cmd"cp $fileTmp2 $fileOut2".in(fileTmp2).out(fileOut2)
cmd"cp $fileTmp2 $fileOut3".in(fileTmp2).out(fileOut3)

Here, we expect cmdA to fail, since fileIn points to a file that doesn't exist. Currently, that happens, but cmd{X,Y,Z} all run too. They fail (at least), but they shouldn't run at all.

My hunch is that the fact that LoamFileManager eagerly creates 0-byte files (via java.nio.file.Files.createTempFile) to back the anon stores is part of the problem, but I'm not positive.
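One possible fix is to allocate unique paths for anonymous stores without creating the files, so that a file exists only if its producing job actually wrote it. A sketch; LazyAnonStorePaths is hypothetical, and LoamFileManager's real API differs:

```scala
import java.nio.file.Path
import java.util.concurrent.atomic.AtomicInteger

// Hand out unique paths for anonymous stores without touching the filesystem
// (unlike java.nio.file.Files.createTempFile, which creates a 0-byte file).
// Downstream jobs can then treat a missing file as "input was never produced"
// and be skipped instead of run.
final class LazyAnonStorePaths(dir: Path) {
  private val counter = new AtomicInteger(0)

  def next(): Path = dir.resolve(s"anon-store-${counter.incrementAndGet()}.tmp")
}
```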
