biokepi's People

Contributors

arahuja, armish, ihodes, jburos, julia326, rleonid, ryan-williams, smondet, tavinathanson

biokepi's Issues

Make Picard MarkDuplicates spilling settings configurable

By default it sets:

TMPDIR=/tmp
MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000
SORTING_COLLECTION_SIZE_RATIO=0.25

We can change these to have MarkDuplicates spill less when the memory available is high or at least spill to a different directory.

We also run it with java -jar $PICARD_JAR MarkDuplicates, but can update this to
java -Xmx<MEM_PARAM> -jar $PICARD_JAR MarkDuplicates
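A sketch of what the configurable invocation could look like. Option names follow the Picard MarkDuplicates documentation; $PICARD_JAR, the 8g heap, the tmp directory, and the file names are placeholders. The command is only printed here (dry run):

```shell
# Build a MarkDuplicates invocation with overridable heap and spill settings.
# PICARD_JAR, the heap size, and all paths are placeholders.
MEM=8g
TMP=/scratch/markdup-tmp
CMD="java -Xmx$MEM -jar \$PICARD_JAR MarkDuplicates \
INPUT=input.bam OUTPUT=dedup.bam METRICS_FILE=dedup.metrics \
TMP_DIR=$TMP \
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=16000 \
SORTING_COLLECTION_SIZE_RATIO=0.5"
echo "$CMD"  # dry run; drop the echo (and outer quoting) to execute
```

Raising SORTING_COLLECTION_SIZE_RATIO and pointing TMP_DIR at fast local scratch are the two knobs that matter most when memory is plentiful.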

Example script for local somatic variant calling

I wish there was an example script which

(1) ran locally without configuration
(2) allowed me to align reads & call somatic variants from local FASTQs
(3) didn't require editing any OCaml code to specify input/output filenames

install through OPAM

I wanted a few clarifications on installing through OPAM:

  1. Should opam pin add biokepi . be run inside the biokepi source folder?
  2. When I do that, I get:
[NOTE] You are pinning to /path/to/src/hammerlab/biokepi as a raw path while it seems to be a git repository.
[NOTE] Consider pinning with '-k git' to have OPAM synchronise with your commits and ignore build artefacts.

[biokepi.~unknown] Synchronizing with /path/to/src/hammerlab/biokepi
Installing new package description for biokepi from /path/to/src/hammerlab/biokepi/opam

biokepi needs to be installed.
Your request can't be satisfied:
  - biokepi is not available because the package is pinned to /path/to/src/hammerlab/biokepi, version 0.0.0-alpha.

No solution found, exiting
[NOTE] Pinning command successful, but your installed packages may be out of sync.
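The [NOTE]s in the output above suggest the likely fix: pin with -k git from inside the biokepi checkout, so OPAM tracks commits and ignores build artefacts. Shown as a dry run (only printed):

```shell
# Pin biokepi as a git repository, per the [NOTE] in the opam output.
# Run from inside the biokepi source folder.
PIN_CMD="opam pin add -k git biokepi ."
echo "$PIN_CMD"  # drop the echo (and quoting) to execute
```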

Setup tool versioning

This is important for many of the tools, as some versions are incompatible with other programs.

A good example is bwa: we may want to specify that a downstream tool requires a specific version. A possible way to do this is to enumerate the versions supported by various tools and have the base tools set up the appropriate one. I'll provide a code example shortly of what I'm thinking, for feedback.

I'm not sure of the best way to handle multiple versions of tools not installed by biokepi, e.g. when a workflow needs Python 2 vs. 3.
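One hypothetical layout for "enumerate the versions and set up the appropriate one": install each tool under a versioned prefix and resolve binaries by (tool, version), so a downstream step can demand e.g. bwa 0.7.10 explicitly. All paths here are invented for illustration:

```shell
# Hypothetical versioned tool root; nothing below is a real biokepi path.
TOOLS_ROOT=/opt/biokepi-tools

# tool_path <tool> <version> -> path to that version's binary.
tool_path () {
  printf '%s/%s/%s/bin/%s\n' "$TOOLS_ROOT" "$1" "$2" "$1"
}

tool_path bwa 0.7.10
```

A downstream tool that needs a specific version then resolves its dependency through the same function instead of trusting whatever is on $PATH.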

Ensure result directories exist

The first step of many variant callers is

&& shf "cd %s" Filename.(dirname result_prefix)

where Filename.(dirname result_prefix) may not exist. The open question is whether the task should create the directory or fail when it does not exist.
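A minimal sketch of the create-the-directory option, in plain shell (result_prefix here is an invented placeholder path):

```shell
# Create the result directory before cd-ing into it, so the first step of a
# variant caller cannot fail on a missing directory. Placeholder path.
result_prefix=/tmp/biokepi-example/work/run1/pt0
mkdir -p "$(dirname "$result_prefix")"
cd "$(dirname "$result_prefix")"
```

mkdir -p is idempotent, so prepending it to the existing `cd %s` step is safe even when the directory already exists.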

Document pipeline.ml

Examples are good, but there's room for improvement beyond them. The API is deceptively simple; we should explain what it handles for you.

BQSR keeps hitting missing file

BaseRecalibrator seems to run, but then the recalibration table file is missing. @seb, has this worked for you before?

INFO  13:48:34,132 HelpFormatter - Program Args: -T BaseRecalibrator -I /hpc/users/ahujaa01/ksinai-demeter/work/path/pt189_bwa-mem-gap11-gep4-sorted.bam -R /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/b37.fasta -knownSites /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/dbsnp.vcf -o /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-recal_data.table


INFO  08:11:58,496 BaseRecalibrator - Writing recalibration report...
INFO  08:11:59,798 BaseRecalibrator - ...done!

INFO  08:12:02,994 HelpFormatter - Program Args: -T PrintReads -R /hpc/users/ahujaa01/ksinai-demeter/B37-reference-genome/b37.fasta -I /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-sorted.bam -BQSR /hpc/users/ahujaa01/ksinai-demeter/work/PT189-B37/pt189-bwa-mem-gap11-gep4-recal_data.table -o /hpc/users/ahujaa01/ksinai-demeter/work/PT189-PT189--B37/pt189-bwa-mem-gap11-gep4-bqsr.bam

There are many calls like

INFO  08:12:00,701 HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Network is unreachable
INFO  08:12:00,702 HttpMethodDirector - Retrying request

due to the ET (phone-home) call. We can set the NO_ET flag if we have a key file to use.
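If we do have a key, disabling phone-home is a matter of appending GATK's -et NO_ET and -K flags. Sketch only; $GATK_JAR, the key path, and the truncated argument list are placeholders:

```shell
# Append GATK's no-phone-home flags (requires a key file from the Broad).
# GATK_JAR and the key path are placeholders; "..." stands for the usual args.
NO_ET_FLAGS="-et NO_ET -K /path/to/gatk.key"
CMD="java -jar \$GATK_JAR -T BaseRecalibrator ... $NO_ET_FLAGS"
echo "$CMD"  # dry run
```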

Cycledash submission sends local paths when Cycledash wants HDFS paths

Cycledash requires that the normal/tumor BAMs exist at a location on HDFS.

The current post_vcf function provides local paths to BAMs, which Cycledash won't be able to access. It also requires them to be file_targets instead of strings, so you can't submit an HDFS path without failing.

I've been using an hdfs_file and hdfs_file_target for some things, but I'm not sure where they should live: in biokepi or in ketrew.

let hdfs_file ~run_with path =
  let open Ketrew.EDSL in
  let open Biokepi_run_environment in
  (* Run a shell command on the machine's host. *)
  let shell_on_machine cmd =
    Ketrew.Target.Command.shell ~host:Machine.(as_host run_with) cmd in
  object
    method path = path
    (* `hadoop fs -test -e` exits 0 iff the path exists. *)
    method exists =
      `Command_returns (shell_on_machine (sprintf "hadoop fs -test -e %s" path), 0)
    (* `hadoop fs -test -z` exits 0 only for zero-length files, so we require a
       non-zero exit; the size argument is ignored for now, this only checks
       that the file is non-empty. *)
    method is_bigger_than _n =
      `Command_returns (shell_on_machine (sprintf "hadoop fs -test -z %s" path), 1)
  end

let hdfs_file_target ~run_with ?dependencies ?make ?metadata ?name ?equivalence ?if_fails_activate
    ?success_triggers ?tags path =
  let product = hdfs_file ~run_with path in
  let name = Option.value name ~default:("Make on HDFS:" ^ path) in
  Ketrew_edsl.target ~product ?equivalence ?if_fails_activate ?tags ?success_triggers
    ~done_when:product#exists ?dependencies ?make ?metadata name

Support Guacamole

This requires a new kind of machine, that knows about Hadoop/Yarn.

I've made it work before, the question is how to abstract it to make it portable/configurable.

Edit: typo

Make processors a max_processors argument to a machine

Something we had discussed before: most of the tools take a processors argument. Some tools don't support it, but steps in their pipelines might. A lot of these tools default to 1 or drop the argument at some point, which can lead to oddities: e.g. if I submit a variant-caller run with MuTect, the samtools sort step will happen with 2 threads, while with Strelka it will happen with as many processors as I gave Strelka.
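The idea, sketched in shell: derive every step's thread count from a single machine-level max_processors value instead of per-tool defaults. The variable name and the command line are illustrative only:

```shell
# One machine-wide setting instead of per-tool defaults.
MAX_PROCESSORS=8

# Every threaded step derives its argument from the same value
# (old samtools-0.x "sort" syntax, matching the era of this issue).
SORT_CMD="samtools sort -@ $MAX_PROCESSORS input.bam sorted"
echo "$SORT_CMD"  # dry run
```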

Move bam type to Dna_bam and add Rna_bam

Interested to hear thoughts on this before I do it. In biokepi_pipeline there is a Bam type that the pipeline steps support. I was planning to change this to Dna_bam and to add an equivalent Rna_bam with its own set of supported steps.

document Common.ml

I'm not quite sure where to file this ticket, could also go in internal-tools perhaps.

I'm finding it difficult to get my head around biokepi. Lots of this is me learning ocaml at the same time, so please bear with me.

Examples like: https://github.com/hammerlab/biokepi/blob/c823e3fb1cee3227930e703c9ed23c63a16a3578/src/app/main.ml and https://github.com/hammerlab/internal-tools/blob/master/ksinai/seb.ml are unfortunately not doing it for me. I think I need a more documented example.

How am I, in practice, meant to use biokepi? Is the idea that I would set environment variables like BIOKEPI_DATASET_NAME as described in the biokepi README and then run a tool as it describes? Or is the idea that I write scripts like https://github.com/hammerlab/internal-tools/blob/master/ksinai/rna-alignment.ml and hard-code my dataset paths? I realize the answer is probably "you can do whatever you want, this is just a library", but I think it'd be useful for us to come up with a recommended style for using this library, and a tutorial for beginners like me to get started working in that way, at least for our work in hammer lab.

I'm finding the stuff in Common.ml pretty confusing. It would be really useful to have docs on things like:

  • what's the basic idea behind the KEDSL extension to ketrew's EDSL? The only comment I can find on it is "The idea is carry around a type parameter to have arbitrary products.", which I don't understand.
    • what is the purpose of workflow_node and how does it differ from a ketrew target? Which should I use? If I'm supposed to use workflow_node, how do I specify its dependencies?
  • What's a Dataset and why is that in internal-tools/ksinai instead of biokepi?

I'm also having a tough time understanding the Ketrew EDSL here: https://github.com/hammerlab/ketrew/blob/master/src/lib/ketrew_edsl.mli (or API docs in http://seb.mondet.org/software/ketrew/doc.1.0.0/api/Ketrew_edsl.html). For example, looking at target, I'm not clear what the active parameter means, and what I should set it to.

Thanks in advance for putting up with my ignorance :-)

Reorganizing modules into pipelines

@rleonid suggested re-organizing some of the code here into more easily discoverable pipelines. I'd be interested in the best OCaml strategy for doing this, but in the meantime I'll lay out some of the hierarchy for the tools that have been added.

DNA

Basic pipeline:
{aligners} x {BAM preprocessing} x {variant calling}

aligners: fastq -> BAM = {BWA, bwa-mem, bowtie (not in biokepi yet), SNAP (not in biokepi yet)}
BAM preprocessing: BAM -> BAM = {GATK Indel Realignment, GATK BQSR, Picard MarkDuplicates}
Germline variant calling: BAM -> VCF = {GATK Unified Genotyper, GATK HaplotypeCaller}
Somatic variant calling: (BAM, BAM) -> VCF = {Mutect, Strelka, Varscan2, SomaticSniper}

RNA

Basic pipeline:
{aligners} x {quantification}

aligners: fastq -> BAM = {Hisat, Star}
quantification: fastq/BAM -> Custom text file = {Cufflinks, Stringtie, Kallisto}

BQSR doesn't work with B38 build

More GATK randomness

##### ERROR MESSAGE: Input files /hpc/users/ahujaa01/ksinai-demeter/B38-reference-genome/dbsnp.vcf and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
##### ERROR   /hpc/users/ahujaa01/ksinai-demeter/B38-reference-genome/dbsnp.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
##### ERROR   reference contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, KI270728.1, KI270727.1, KI270442.1, KI270729.1, GL000225.1, KI270743.1, GL000008.2, GL000009.2, KI270747.1, KI270722.1, GL000194.1, KI270742.1, GL000205.2, GL000195.1, KI270736.1, KI270733.1, GL000224.1, GL000219.1, KI270719.1, GL000216.2, KI270712.1, KI270706.1, KI270725.1, KI270744.1, KI270734.1, GL000213.1, GL000220.1, KI270715.1, GL000218.1, KI270749.1, KI270741.1, GL000221.1, KI270716.1, KI270731.1, KI270751.1, KI270750.1, KI270519.1, GL000214.1, KI270708.1, KI270730.1, KI270438.1, KI270737.1, KI270721.1, KI270738.1, KI270748.1, KI270435.1, GL000208.1, KI270538.1, KI270756.1, KI270739.1, KI270757.1, KI270709.1, KI270746.1, KI270753.1, KI270589.1, KI270726.1, KI270735.1, KI270711.1, KI270745.1, KI270714.1, KI270732.1, KI270713.1, KI270754.1, KI270710.1, KI270717.1, KI270724.1, KI270720.1, KI270723.1, KI270718.1, KI270317.1, KI270740.1, KI270755.1, KI270707.1, KI270579.1, KI270752.1, KI270512.1, KI270322.1, GL000226.1, KI270311.1, KI270366.1, KI270511.1, KI270448.1, KI270521.1, KI270581.1, KI270582.1, KI270515.1, KI270588.1, KI270591.1, KI270522.1, KI270507.1, KI270590.1, KI270584.1, KI270320.1, KI270382.1, KI270468.1, KI270467.1, KI270362.1, KI270517.1, KI270593.1, KI270528.1, KI270587.1, KI270364.1, KI270371.1, KI270333.1, KI270374.1, KI270411.1, KI270414.1, KI270510.1, KI270390.1, KI270375.1, KI270420.1, KI270509.1, KI270315.1, KI270302.1, KI270518.1, KI270530.1, KI270304.1, KI270418.1, KI270424.1, KI270417.1, KI270508.1, KI270303.1, KI270381.1, KI270529.1, KI270425.1, KI270396.1, KI270363.1, KI270386.1, KI270465.1, KI270383.1, KI270384.1, KI270330.1, KI270372.1, KI270548.1, KI270580.1, KI270387.1, KI270391.1, KI270305.1, KI270373.1, KI270422.1, KI270316.1, KI270340.1, KI270338.1, KI270583.1, KI270334.1, KI270429.1, KI270393.1, KI270516.1, KI270389.1, KI270466.1, KI270388.1, KI270544.1, KI270310.1, KI270412.1, KI270395.1, KI270376.1, 
KI270337.1, KI270335.1, KI270378.1, KI270379.1, KI270329.1, KI270419.1, KI270336.1, KI270312.1, KI270539.1, KI270385.1, KI270423.1, KI270392.1, KI270394.1]

There are interesting answers on the GATK forum, but without further explanation.

B38 is not a simple drop-in replacement as previous releases were. There is some tool work associated with it before we can release anything.

We could re-sort the FASTA file or find a GATK-appropriate link.
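An alternative to re-sorting the FASTA would be to re-sort the dbSNP VCF against the reference's sequence dictionary, e.g. with Picard's SortVcf tool. Untested sketch; $PICARD_JAR and the file names are placeholders, and the command is only printed:

```shell
# Re-sort the dbSNP VCF to match the reference contig order, using Picard
# SortVcf with the reference's .dict file. Placeholder jar and file names.
CMD="java -jar \$PICARD_JAR SortVcf I=dbsnp.vcf O=dbsnp.b38-sorted.vcf SEQUENCE_DICTIONARY=b38.dict"
echo "$CMD"  # dry run
```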

Make reference-build handling more extensible

Now we know more about where we are going with these:

  • remove the polymorphic-variant
  • centralize the existing references into one map-like datastructure accessed by name, and extensible from outside of the library.

Remove option wrapper around is_done Condition

Having a product where is_done = None does not make sense: you should always be able to determine when a target is_done. We should not use it for target equivalence, since there is already a field for that logic in a target.

allow any variant caller to be run on a series of downsampled reads

For some analyses we're doing, I'd like to be able to run variant callers on randomly downsampled data.

An example analysis would be: run MuTect on subsamples of my BAM with {0.1, 0.2, ..., 1.0} of the reads present, for a total of 10 runs. It would be useful if I could specify an arbitrary schedule of downsampling rates, potentially with duplicates, e.g. {0.1, 0.1, 0.1, 0.5, 0.5, 0.5}, so that we can get replicates with the same fraction of reads sampled (but a different random sample).

The downsampling can be performed on a BAM with "samtools view". This tool is actually fairly clever in that it keeps mate pairs together (either both reads in the pair are sampled or neither). Here's the bash loop I've been using so far to generate downsamples manually:

for i in $(seq 1 9) ; do
  samtools view -s 0.$i -b original.clean.dedup.recal.bam > subsampled-0.$i.bam
done

According to the samtools help, the integer part of the -s option can be used to specify the random seed, which we'd have to do in a careful way to support downsampling schedules with duplicates.
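One careful-seeding sketch: bump the integer (seed) part of -s on every run, so repeated fractions in the schedule draw different random read subsets. File names are placeholders and the commands are only printed, not executed:

```shell
# A downsampling schedule with duplicate fractions (tenths: 0.1 x3, 0.5 x3).
# Incrementing the seed per run gives distinct samples at the same fraction.
schedule="1 1 1 5 5 5"
seed=1
cmds=""
for frac in $schedule; do
  cmds="$cmds
samtools view -s ${seed}.${frac} -b original.bam > sub-seed${seed}-0.${frac}.bam"
  seed=$((seed + 1))
done
echo "$cmds"  # dry run; pipe to sh to execute
```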

mutect dbsnp file gotchas

Just noting this in case it matters later, regarding the dbsnp_alt_url and dbsnp_broad_url defined here to be passed to mutect:

https://github.com/hammerlab/biokepi/blob/master/src/lib/run_environment.ml#L656-L657

  • the dbsnp_alt_url is a dead link: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz
  • I think a copy of that file is available here: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b144_GRCh37p13/VCF/00-All.vcf.gz
  • however, dbsnp_alt_url and dbsnp_broad_url are not the same files: the Broad URL gives a VCF with 61M records; the alt URL gives one with 145M records.

Make the reference-build a property of the BAM

Ran into this a few times but don't have a solution yet. Most of the variant callers take a reference build and a BAM; those are coupled, and it should not be possible to specify the wrong reference build for a BAM.
