kf-alignment-workflow's People

Contributors

bogdang989, dankolbman, danria, dmiller15, haoxuan-jin, migbro, sickler-alex, yuankunzhu, zhangb1

kf-alignment-workflow's Issues

HaplotypeCaller RamMin parameter

We should reduce the RamMin requirement as far as possible so that more jobs can run in parallel, without causing failures from exhausted RAM. A few tests show that reducing the RamMin requirement to 8000 works on c4.8xlarge and c5.9xlarge instances.
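
As a rough illustration of why a lower RamMin allows more parallelism, the sketch below estimates an upper bound on the number of jobs that fit on one instance. It assumes RamMin is expressed in MiB (as CWL's ramMin is), that c4.8xlarge offers 60 GiB and c5.9xlarge 72 GiB of RAM with 36 vCPUs each, and that each job reserves a single core. The function name, the one-core default, and the 16000 comparison value are hypothetical; a real scheduler also reserves some memory for overhead, so actual numbers will be lower.

```python
def max_parallel_jobs(instance_ram_gib, instance_vcpus, ram_min_mib, cores_min=1):
    """Upper bound on concurrent jobs, limited by RAM and by vCPUs."""
    instance_ram_mib = instance_ram_gib * 1024
    by_ram = instance_ram_mib // ram_min_mib
    by_cpu = instance_vcpus // cores_min
    return int(min(by_ram, by_cpu))


# (RAM in GiB, vCPUs); instance specs are assumptions stated in the text above.
INSTANCES = {"c4.8xlarge": (60, 36), "c5.9xlarge": (72, 36)}

for name, (ram_gib, vcpus) in INSTANCES.items():
    # 16000 is a hypothetical "before" value, 8000 is the value from the tests.
    for ram_min in (16000, 8000):
        jobs = max_parallel_jobs(ram_gib, vcpus, ram_min)
        print(f"{name}: RamMin={ram_min} -> up to {jobs} jobs")
```

Under these assumptions the bound is set by RAM rather than by vCPUs, which is why lowering RamMin directly raises the number of parallel jobs.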

Alignment Workflow

@allisonheath commented on Thu Oct 12 2017

Will document specific details as development occurs, but at a high level:

  • Per-read-group alignment with BWA-MEM
  • Picard MarkDuplicates
  • GATK4 BQSR
  • Picard Metrics
  • Final output CRAM

@kellrott commented on Wed Oct 18 2017

Should the pipeline start from FASTQs or unaligned BAMs? With FASTQ pairs we need to package metadata alongside; with unaligned BAMs it's built into the BAM header. PCAWG went with unaligned BAMs.

Also, what metadata should be included as part of the input?
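
To make the "built into the BAM header" point concrete, here is a minimal sketch that parses a SAM/BAM `@RG` header line into a dictionary. The tags used (ID, SM, LB, PL, PU) are standard read-group tags from the SAM specification; the function name and the example values are hypothetical and are not taken from any Kids First file.

```python
def parse_read_group(header_line):
    """Parse one SAM '@RG' header line into a dict of tag -> value."""
    fields = header_line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Each remaining field is 'TAG:value'; split on the first colon only,
    # because values may themselves contain colons.
    return dict(field.split(":", 1) for field in fields[1:])


# Hypothetical header line, as it would appear in `samtools view -H` output.
example = "@RG\tID:rg1\tSM:sample1\tLB:lib1\tPL:ILLUMINA\tPU:flowcell1.lane1"
rg = parse_read_group(example)
print(rg["ID"], rg["SM"], rg["PL"])
```

With FASTQ input, the same fields would have to travel in a separate metadata file or be supplied by the submission system.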


@yuankunzhu commented on Wed Oct 25 2017

We're working on a workflow that takes both FASTQ and uBAM as input.

I personally lean toward uBAM input, since BAM handles metadata well and that would make the whole pipeline much cleaner. I can also see why some people still prefer FASTQs: GZIP files are smaller than BGZF, and you don't necessarily need random access for a uBAM. If a submitter system captured read group info, like the GDC dictionary does, most metadata could be stored there as well. Lastly, if we decide to use bwa mem, we would still need to convert uBAM to an intermediate FASTQ, afaik.

Anyway, we'd like to provide both FASTQ and uBAM as entry points to the alignment pipeline for now. We'd like to hear more thoughts and comments on this.


@yuankunzhu commented on Wed Oct 25 2017

@kellrott do you have a link to, or more details of, the PCAWG pipeline?


@allisonheath commented on Fri Oct 27 2017

@yuankunzhu the alignment workflow is here: https://github.com/ICGC-TCGA-PanCancer/Seqware-BWA-Workflow and I believe the other workflows are generally collected under https://github.com/ICGC-TCGA-PanCancer. The specifications for the unaligned BAMs that were uploaded may also be of interest: https://github.com/ICGC-TCGA-PanCancer/cli/wiki/Preparing-BAM-files.

I think in general we want to use the Gen3 stack as the authoritative source of sequencing metadata, with just the correct IDs kept as part of the header. So a critical decision will be what ID(s) those are, remembering that they typically propagate to downstream files as well.


@allisonheath commented on Thu Dec 21 2017

This now lives in https://github.com/kids-first/kf-alignment-workflow

Optimal Instance Type

Choose an optimal instance type for execution with respect to task cost and duration. By setting the CWL requirement RamMin and specifying the maximum RAM allowed for each tool, we can switch from AWS memory-optimized instances (r-.-xlarge) to compute-optimized instances (c-.-xlarge). Last week, Cavatica deployed support for the newer c5.*xlarge instances, and preliminary tests show that the c5.9xlarge instance (36 vCPU, 72 GB RAM) may be the best bet, as it is cheaper (based on on-demand pricing) and faster (newer and better hardware) than its c4.8xlarge predecessor.

BAM Preprocessing optimization suggestion

Tools in the workflow

  • Picard RevertSam

Current function

Split input BAM/SAM/CRAM into BAM per RG.

Proposed modification

Use Samtools split for this purpose.

Performance improvement

Picard RevertSam time 4h 7m
Samtools Split time 28m

Example command line

/opt/samtools-1.7/samtools split -f '%!.bam' -@ 29 ae3b4fcd963d404081393b9cf038d4d5.bam

Samtools split was tested with a randomly selected BAM from the pilot set. QualityYield metrics show no difference between the two tools' outputs, which is expected since splitting by read group is a simple operation.
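
For reference, the size of the improvement can be worked out from the two timings reported above. The sketch below is only arithmetic on those numbers; the helper name `to_minutes` is hypothetical.

```python
import re


def to_minutes(duration):
    """Convert a string such as '4h 7m' or '28m' to a number of minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", duration.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


picard = to_minutes("4h 7m")   # Picard RevertSam
samtools = to_minutes("28m")   # Samtools split
speedup = picard / samtools
print(f"saved {picard - samtools} min, about {speedup:.1f}x faster")
```

These figures come from a single BAM, so they indicate the order of magnitude rather than a guaranteed speedup.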

alignment optimization suggestions

BWA Mem (scattered by read group)

Current setup:

Picard SamToFastq | BWA | Picard MergeBamAlignment

Proposed setup:

Samtools bam2fq | BWA | Samblaster | Sambamba view (SAM to BAM) | Sambamba sort

Mark duplicates with Samblaster at this point and remove picard_markduplicates. It makes sense to mark duplicates only within the same read group, and piping Samblaster directly after BWA saves a lot of time.
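
A minimal sketch of how the proposed per-read-group pipe could be assembled is shown below. It only builds the command string and does not run anything. The specific flags are assumptions that are not stated in this issue: `bwa mem -p` for interleaved input, `-R` for the read group line, `sambamba view -S -f bam` for SAM-to-BAM conversion, and reading from `/dev/stdin`. The file names, thread count, and function name are hypothetical, and the sketch also assumes that mates are adjacent in the input BAM.

```python
import shlex


def build_alignment_pipeline(input_bam, reference, read_group, output_bam, threads=4):
    """Return a shell pipeline string for one read group (nothing is executed)."""
    stages = [
        # Unaligned BAM -> interleaved FASTQ on stdout.
        ["samtools", "bam2fq", input_bam],
        # Align interleaved pairs read from stdin.
        ["bwa", "mem", "-p", "-t", str(threads), "-R", read_group,
         reference, "/dev/stdin"],
        # Mark duplicates within this read group (stdin -> stdout).
        ["samblaster"],
        # SAM -> BAM.
        ["sambamba", "view", "-S", "-f", "bam", "/dev/stdin"],
        # Coordinate sort and write the final per-read-group BAM.
        ["sambamba", "sort", "-t", str(threads), "-o", output_bam, "/dev/stdin"],
    ]
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in stage) for stage in stages
    )


pipeline = build_alignment_pipeline(
    "rg1.bam", "ref.fa", r"@RG\tID:rg1\tSM:sample1", "rg1.sorted.bam"
)
print(pipeline)
```

Quoting each argument with `shlex.quote` keeps the read group string, which contains backslashes, intact when the pipeline is handed to a shell.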

workflow/tools versioning and organizing

@yuankunzhu commented on Wed Nov 01 2017

Opening this issue to discuss how to better version and organize KF workflows in terms of tool versioning, Docker image builds and tags, CWL file names, workflow and tool IDs and versions, etc., and potentially how to automate the release process.

The initial idea is to create a subrepo for each workflow and link them back to the kf-workflows repo (https://github.com/kids-first/kf-workflows/).

Currently, on the pre-cocleaning branch, we have something like this:

.
├── LICENSE
├── README.md
├── dockerfiles
│   ├── alpine-bwa-samtools
│   │   ├── 0.7.17
│   │   │   └── Dockerfile
│   │   └── latest
│   │       └── Dockerfile
│   ├── alpine-gatk
│   │   ├── 3.8
│   │   │   └── Dockerfile
│   │   └── latest
│   │       └── Dockerfile
│   └── alpine-picard
│       ├── 2.14.0
│       │   └── Dockerfile
│       └── latest
│           └── Dockerfile
├── tools
│   ├── bwa_mem_0.7.17.cwl
│   ├── bwa_mem_latest.cwl
│   ├── picard_markduplicates_2.14.0.cwl
│   ├── picard_markduplicates_latest.cwl
│   ├── picard_sortsam_2.14.0.cwl
│   └── picard_sortsam_latest.cwl
└── workflows
    ├── pre_cocleaning_workflow_1.0.cwl
    └── pre_cocleaning_workflow_latest.cwl

12 directories, 16 files
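
One way to see what this naming scheme implies for automation is to parse the tool name and version back out of each CWL file name. The sketch below assumes the convention `<name>_<version>.cwl` visible in the tree above, with the version always following the last underscore; the function name is hypothetical.

```python
def parse_cwl_name(filename):
    """Split '<name>_<version>.cwl' into (name, version)."""
    suffix = ".cwl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a CWL file: {filename!r}")
    stem = filename[: -len(suffix)]
    # The version is whatever follows the last underscore.
    name, _, version = stem.rpartition("_")
    if not name or not version:
        raise ValueError(f"no version in file name: {filename!r}")
    return name, version


files = [
    "bwa_mem_0.7.17.cwl",
    "bwa_mem_latest.cwl",
    "picard_markduplicates_2.14.0.cwl",
    "pre_cocleaning_workflow_1.0.cwl",
]

# Group the available versions by tool or workflow name.
versions = {}
for f in files:
    name, version = parse_cwl_name(f)
    versions.setdefault(name, []).append(version)
print(versions)
```

A scheme like this breaks down if a tool name ever ends in something that looks like a version, which is one argument for keeping versions in directories or tags instead of file names.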

@allisonheath commented on Thu Dec 21 2017

Now being worked on in https://github.com/kids-first/rfcs/pull/4

BAM sorting optimization suggestion

Tools in the workflow

  • Picard SortSam

Current function

Sort aligned, duplicate marked BAM.

Proposed modification

Use Sambamba Sort+Index for this purpose.

Performance improvement

Picard SortSam time 3h 1m
Picard MarkDuplicates time 4h 50m
Sambamba Merge+Sort+Index time 1h 12m

Example command lines

Sambamba Merge

/opt/sambamba_v0.6.4 merge  -t 31  ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam
/root_bwa_mem_1_s/2895813008.aligned.unsorted.bam 
/root_bwa_mem_2_s/2895813030.aligned.unsorted.bam 
/root_bwa_mem_3_s/2895813316.aligned.unsorted.bam 
....
...
... 
/root_bwa_mem_21_s/2895821901.aligned.unsorted.bam

Sambamba Sort

/opt/sambamba_v0.6.4 sort  -o ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam -t 31 /root_sambamba_merge/ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam

Sambamba Index

mv /root_sambamba_sort/ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam . && /opt/sambamba_v0.6.4 index -t 31 ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bam ae3b4fcd963d404081393b9cf038d4d5.aligned.duplicates_marked.sorted.bai

Sambamba Merge is added in place of Picard MarkDuplicates because the separate per-read-group BAM files were originally merged in that step. Tested with a randomly selected BAM from the pilot set.
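
The three steps above can be sketched as a small command builder. It uses only the sambamba subcommands and flags that appear in the example command lines (`merge -t`, `sort -o ... -t`, `index -t`); the function name, paths, and file names are hypothetical, and the sketch returns argument lists without executing them.

```python
import os


def sambamba_merge_sort_index(sambamba, inputs, merged_bam, sorted_bam, threads):
    """Build the merge, sort, and index commands as argument lists."""
    # Index file name: same base name as the sorted BAM, with a .bai extension.
    bai = os.path.splitext(sorted_bam)[0] + ".bai"
    return [
        [sambamba, "merge", "-t", str(threads), merged_bam, *inputs],
        [sambamba, "sort", "-o", sorted_bam, "-t", str(threads), merged_bam],
        [sambamba, "index", "-t", str(threads), sorted_bam, bai],
    ]


commands = sambamba_merge_sort_index(
    "/opt/sambamba_v0.6.4",
    ["rg1.aligned.unsorted.bam", "rg2.aligned.unsorted.bam"],
    "sample.merged.bam",
    "sample.sorted.bam",
    threads=31,
)
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` in order, so that a failed merge stops the run before the sort starts.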

optimization suggestions list

Opening this issue to aggregate optimization suggestions.

  • Alignment #47
  • Marking duplicates, BQSR, sorting
  • Metrics
  • Validate GVCF
