
dig-loam-stream's Introduction

LoamStream

LoamStream is a genomic analysis stack featuring a high-level language, compiler and runtime engine.

Building

Requirements

  • Git
  • Java 8
  • SBT 1.5.0+

To build a runnable LoamStream binary from the master branch:

To run the unit tests:

  • sbt test

To run the integration tests:

  • sbt it:test
  • This is best done from Jenkins, or at a minimum, from a Broad VM that can submit Uger jobs.
    • See [JENKINS.md]

Running

  • See [CLI.md] and [LOAM.md]

dig-loam-stream's People

Contributors

clintatthebroad, curoli, kyuksel, massung, rmkoesterer


Forkers

t2dream

dig-loam-stream's Issues

Before execution, list all jobs that will be run

Print out a comprehensive list of all commands to be run (to be used for reference).

(NB: It will not be possible to output a complete list ahead-of-time, since given upcoming scatter-gather/dynamic-execution changes, we won't know the execution graph in full before beginning execution.)
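For the statically-known portion of the graph, the listing could be a simple dependency-first traversal. A minimal sketch, where Job and its fields are simplified stand-ins rather than LoamStream's actual job type:

```scala
// Simplified stand-in for LoamStream's job type.
final case class Job(name: String, commandLine: String, dependsOn: Seq[Job] = Nil)

object ExecutionPlan {
  // List every command in dependency-first order, without running anything.
  def listPlanned(roots: Seq[Job]): Seq[String] = {
    val seen = scala.collection.mutable.LinkedHashSet.empty[String]

    def visit(job: Job): Unit = {
      job.dependsOn.foreach(visit) // upstream commands print first
      seen += job.commandLine      // LinkedHashSet keeps first-insertion order
    }

    roots.foreach(visit)
    seen.toSeq
  }
}
```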

Jobs being run too many times

Look into why the following is printed out twice:

13:56:48.080 [pool-8-thread-43] INFO  l.m.j.c.CommandLineStringJob - Now running: '/humgen/diabetes2/users/ryank/software/GenotypeHarmonizer-1.4.18/GenotypeHarmonizer.sh         --input qc/CAMP.chr21         --inputType PLINK_BED         --output qc/CAMP.chr21.harm         --outputType PLINK_BED         --ref /humgen/diabetes2/users/dig/loamstream/pipeline/data/kg/1000GP_Phase3_vcf/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz         --refType VCF         --keep         --update-id         --variants 1000         --mafAlign 0.1         --update-id         --update-reference-allele         --debug'

Make Loam files into Scala files

Pros:

  • Enable Loam DSL to be syntax and type-checked in IDEs
  • Get correct locations of compile errors
  • Make stack traces actually traceable
  • Make Loam more intuitive to use, implement and extend

Allow running more than one Hail job at a time at Google

LS currently creates at most one Google cluster (it's created lazily on first use by GoogleCloudChunkRunner) and deletes it at app-shutdown-time. This means only one Google-Hail job can be running at once, something @rmkoesterer has identified as a bottleneck. It would be good to allow creating one cluster per Google-Hail job, to allow running multiple such jobs concurrently.

This should mostly involve GoogleCloudChunkRunner. Any clusters that get created will need to get tracked somehow and registered for shutdown, preferably once they're done being used, but at app-shutdown-time at the very least. (Google bills several ways, including cpu-seconds.)

TODOs/Notes:

  • Determine whether it's possible or desirable to run multiple jobs on the same cluster. (This sure sounds like it wouldn't make sense, but the question has been raised, and we've never dug up a definitive answer. -Clint)
  • Is it desirable to re-use clusters for multiple Hail-Google jobs?
  • We have a few quotas at Google. One notable one is for CPU cores. A .loam file that declares N independent Google-Hail jobs could burn through that quota very fast if N is sufficiently large and enough of the jobs are run concurrently. Some sort of limiting mechanism is probably a good idea for this reason, but isn't necessary for a first pass.
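A sketch of what per-job cluster management with a concurrency cap could look like. ClusterClient, PerJobClusterManager, and every method name below are hypothetical, not GoogleCloudChunkRunner's actual API:

```scala
import java.util.concurrent.Semaphore
import scala.collection.concurrent.TrieMap

// Hypothetical client; stands in for whatever wraps the Google Cloud SDK.
trait ClusterClient {
  def createCluster(id: String): Unit
  def deleteCluster(id: String): Unit
}

final class PerJobClusterManager(client: ClusterClient, maxClusters: Int) {
  private val permits = new Semaphore(maxClusters)
  private val live = TrieMap.empty[String, Unit]

  // Run a job on its own cluster; the cluster is deleted as soon as the job
  // finishes, not at app-shutdown-time, to limit cpu-second billing and to
  // respect the CPU-core quota via the semaphore.
  def withCluster[A](jobId: String)(body: => A): A = {
    permits.acquire() // blocks if maxClusters clusters are already live
    client.createCluster(jobId)
    live.put(jobId, ())
    try body
    finally {
      client.deleteCluster(jobId)
      live.remove(jobId)
      permits.release()
    }
  }

  // Safety net for app shutdown: delete anything still tracked.
  def shutdown(): Unit = live.keys.foreach(client.deleteCluster)
}
```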

LS Needs Simple Integration Tests

Do an end-to-end run of a simple pipeline that copies 1 or 2 files via Uger, for example. Currently, no such end-to-end test exists for the Uger case. This will need a Jenkins build plan as well, since Uger pipelines can't be run from unit tests on developers' machines.

Allow specifying composable chunks of Loam files

It would be nice to have a simple, general way to say "file C depends on the outputs of files A and B" without having to dig around in A and B to see what their "terminal" outputs are (see also here).

Allow display or logging of a Tool's dependencies, and of the Tools that depend on it

For each command currently being executed, show the commands leading to and from it (its inputs and outputs). If the user expects certain things to run in parallel and they aren't, it would save time to kill everything rather than wait for the outcome. Below is an example of a file that LoamStream could output to give the user a picture of the graph surrounding the particular command they are interested in. This may require letting the user name a cmd (or the file could be named after the line number in the .loam file where the cmd appears).

file cmd1.graph

pre: cmdA
     cmdB
     cmdC
     ...
cmd1: cmd1
post: cmdD
     cmdE
     cmdF
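Given an edge-list view of the execution graph, the pre/post sets above could be computed along these lines. The Map-of-edges representation is an assumption for illustration, not LoamStream's actual graph type:

```scala
object ToolGraph {
  // edges: producer -> consumers (cmdA -> Set(cmd1) means cmd1 reads cmdA's output)
  def neighbors(edges: Map[String, Set[String]], cmd: String): (Set[String], Set[String]) = {
    // "pre": every producer whose consumer set includes cmd
    val pre = edges.collect { case (producer, consumers) if consumers(cmd) => producer }.toSet
    // "post": everything cmd feeds into
    val post = edges.getOrElse(cmd, Set.empty)
    (pre, post)
  }

  // Render in the same shape as the cmd1.graph example above.
  def render(cmd: String, pre: Set[String], post: Set[String]): String =
    s"pre: ${pre.toSeq.sorted.mkString(" ")}\n$cmd\npost: ${post.toSeq.sorted.mkString(" ")}"
}
```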

Interact with Google Cloud via REST

It's possible to create a REST client to interact with Google Cloud to manage Spark clusters (create/delete cluster, submit and monitor jobs). This would be advantageous over using the command line SDK, which needs to be installed by LoamStream users (unless we provide Docker images).
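As a sketch, cluster creation via the Dataproc v1 REST API amounts to a POST to a well-known URL. The request below is only built, never sent (sending it would also need an OAuth2 bearer token), and the config field names should be verified against the Dataproc API docs before relying on them:

```scala
// Build (but do not send) a Dataproc v1 cluster-creation request.
// Project, region, and cluster names below are placeholders.
object DataprocRest {
  def createClusterUrl(project: String, region: String): String =
    s"https://dataproc.googleapis.com/v1/projects/$project/regions/$region/clusters"

  def createClusterBody(clusterName: String, numWorkers: Int): String =
    s"""{"clusterName": "$clusterName", "config": {"workerConfig": {"numInstances": $numWorkers}}}"""
}
```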

LS Should Bake the host used for building into jars

LS currently records build date, git branch, git commit, etc when building, to allow printing those things with --version. We should add one more item: the host on which the jar was built.

This will mean modifying buildInfoTask in build.sbt as well as the Versions class. (Versions holds more than version info now, maybe rename it to something like BuildInfo.)
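A minimal sketch of capturing the build host in build.sbt; the fallback value and exactly where the string gets written into Versions/BuildInfo are assumptions:

```scala
// In build.sbt (an sbt build file is plain Scala): capture the build host
// alongside the existing build-date/branch/commit metadata.
// InetAddress.getLocalHost can fail on machines with broken DNS config, so
// fall back to a placeholder rather than failing the build.
val buildHost: String =
  try java.net.InetAddress.getLocalHost.getHostName
  catch { case _: java.net.UnknownHostException => "unknown-host" }

// buildInfoTask would then write something like the following line into the
// generated Versions (or renamed BuildInfo) source file:
//   val builtOnHost: String = "<buildHost>"
```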

Qacct fails often, leading to noisy logs

qacct polling warnings; e.g.

WARN  loamstream.uger.AccountingClient$ - Error invoking 'qacct'; execution node and queue won't be available.
java.lang.RuntimeException: Nonzero exit code: 1
<long stack trace>

LS sometimes hangs when jobs fail repeatedly

LoamStream attempted to run a failing job (a Google Cloud Hail job meant to run last, after all other jobs) 4 times, failing each time with the same error, but never terminated. LS should have seen that it had run the failing job the maximum allowed number of times (5 in this case), transitioned the job to the FailedPermanently status, and ended execution.
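The missing transition amounts to simple retry accounting. A sketch, with simplified stand-ins for LoamStream's actual status types:

```scala
sealed trait RunStatus
case object Failed extends RunStatus            // may be re-queued
case object FailedPermanently extends RunStatus // terminal; stop retrying

object RunStatus {
  // timesRunSoFar counts the run that just failed. Once the quota is spent
  // the job must not be re-queued, or the app never reaches a terminal state.
  def afterFailure(timesRunSoFar: Int, maxRunsAllowed: Int): RunStatus =
    if (timesRunSoFar >= maxRunsAllowed) FailedPermanently else Failed
}
```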

Scatter-gather/dynamic-execution support

Use something like NativeTools/NativeJobs to defer executing Loam code, as in

dynamic {
  import scala.collection.mutable.ArrayBuffer

  val analysisVcfs = ArrayBuffer.empty[Store[VCF]]
  val regions = Files.readFrom(regionsExclude.path).split(System.lineSeparator)

  // Scatter and map
  regions.foreach { region =>
    // Derive a distinct base name per region (regions like "chr1:100-200"
    // contain colons, which we don't want in file names).
    val epactsOutputFilesBaseName = region.replace(':', '.')
    val outputVcf = store[VCF].at(outDir / s"$epactsOutputFilesBaseName.vcf")

    cmd"""epactsTool
      --vcf $inputVcf
      --region $region
      --out $epactsOutputFilesBaseName"""
      .in(inputVcf).out(outputVcf)

    analysisVcfs += outputVcf // NB: +=, not :+, so the buffer is actually mutated
  }

  // Gather
  val finalVcf = store[VCF].at(outDir / "final.vcf")
  cmd"stripHeaderAndConcat $analysisVcfs > $finalVcf".in(analysisVcfs).out(finalVcf)
}.in(inputVcf).out(finalVcf)

LS uses too much memory

Running the QC pipeline with the METSIM data required setting the Java heap to 4+ gigs. Running with FUSION needs even more.

Initial experiments and past profiling point toward (mis)use of RxJava as a big factor. More investigation is needed.

Allow interpolating lists of Stores in cmd"..." strings

Allow custom expansion of lists of stores within cmds. For example, given regions = List(store1, store2, store3), one could write

cmd"cat ${regions} > ${someOtherStore}"

which would then be interpolated as

cmd"cat store1 store2 store3 > ..."

with the delimiter between stores specified by the user.
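A sketch of the desired expansion, with Store reduced to just a path and the delimiter passed explicitly (both are assumptions about the eventual API):

```scala
// Simplified stand-in for a LoamStream store.
final case class Store(path: String)

object StoreInterpolation {
  // Expand a list of stores into a single command-line fragment,
  // joined by a user-specified delimiter (space by default).
  def expand(stores: Seq[Store], delimiter: String = " "): String =
    stores.map(_.path).mkString(delimiter)
}
```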

Omit queue from Uger native specifications

  • Necessary starting October 3, 2017 at 10am.
  • See #220
    From BITS:
Notification of Pending Change CHG0033049 - October 3, 2017 at 10:00AM
 
Problem:
Multicore jobs are currently dispatching at a poor rate. The poor dispatch rate is caused by the scheduler not having enough information to determine when a host will have the requested number of slots available and is therefore not reserving slots on the optimal host.

Solution:
The short and long queues will be consolidated into the “broad” queue.
A default maximum run time (h_rt) of 2 hours (02:00:00) will be added to every job. This is only a default. Any amount of time can be requested.
Backfilling will be enabled.

Impact:
By adding an appropriate h_rt to every job, the scheduler will be able to identify the order and time cores will free up on each execution host. Knowing the order will allow the scheduler to reserve the appropriate slots allowing multicore jobs to dispatch more quickly.

Job backfill will be enabled.  By adding h_rt to every job, the scheduler will know when enough slots will free up on an execution host to allow the multicore job with reserved slots to run.  With this knowledge, the scheduler will be able to backfill short jobs into the reserved slots without interfering with the multicore job’s dispatch time.

The new configuration will allow the scheduler to more effectively and efficiently schedule short jobs as well as multicore jobs.
	
Required actions:
All jobs will need to be submitted to a single queue named “broad”. The “broad” queue will be the default queue, meaning any job submitted after the cutover will no longer need to specify a queue. Any job dispatched to an execution host and running will continue to run to completion. Any job pending at the time of the cutover will need to be moved to the new broad queue. This is accomplished with “qalter -q broad -u ”. Any job not moved over will pend indefinitely. During the cutover, BITS will move any pending job to the new queue, but any jobs submitted after the cutover will need to be altered by the job owner.
All jobs will have an h_rt of 2 hours set by default. You will need to request an h_rt that is appropriate for your job. There is no limit to this length but the longer the h_rt the less likely the job is to benefit from backfilling. The more accurate every h_rt is, the quicker and more efficient the scheduler will dispatch jobs. The format for h_rt is HH:MM:SS.
 

Example:
   qsub -l h_rt=120:15:30 my_script.sh
   #This job will be killed after 120 hours (5 days), 15 minutes, and 30 seconds of runtime.

If you have any questions or concerns we highly encourage you to send an email to [email protected].

Thank you,
BITS Operations

Files backing anonymous stores are created too eagerly, leading to jobs running when they shouldn't.

Consider src/examples/loam/cp.loam:

val fileIn = store.at("does-not-exist").asInput
val fileTmp1 = store //note anon store
val fileTmp2 = store //note anon store
val fileOut1 = store.at("fileOut1.txt")
val fileOut2 = store.at("fileOut2.txt")
val fileOut3 = store.at("fileOut3.txt")

//cmdA
cmd"cp $fileIn $fileTmp1".in(fileIn).out(fileTmp1)

//cmdB
cmd"cp $fileTmp1 $fileTmp2".in(fileTmp1).out(fileTmp2)

//cmd{X,Y,Z}
cmd"cp $fileTmp2 $fileOut1".in(fileTmp2).out(fileOut1)
cmd"cp $fileTmp2 $fileOut2".in(fileTmp2).out(fileOut2)
cmd"cp $fileTmp2 $fileOut3".in(fileTmp2).out(fileOut3)

Here, we expect cmdA to fail, since fileIn points to a file that doesn't exist. Currently, that happens, but cmd{X,Y,Z} all run too. They fail (at least), but they shouldn't run at all.

My hunch is that the fact that LoamFileManager eagerly creates 0-byte files (via java.nio.file.Files.createTempFile) to back the anon stores is part of the problem, but I'm not positive.
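One possible fix is to allocate unique paths for anonymous stores without creating the files, so that a file exists only if its producing job actually wrote it. A sketch; LazyAnonStorePaths is hypothetical, and LoamFileManager's real API differs:

```scala
import java.nio.file.Path
import java.util.concurrent.atomic.AtomicInteger

// Hand out unique paths for anonymous stores without touching the filesystem
// (unlike java.nio.file.Files.createTempFile, which creates a 0-byte file).
// Downstream jobs can then treat a missing file as "input was never produced"
// and be skipped instead of run.
final class LazyAnonStorePaths(dir: Path) {
  private val counter = new AtomicInteger(0)

  def next(): Path = dir.resolve(s"anon-store-${counter.incrementAndGet()}.tmp")
}
```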
