Competitive genotyping pipeline

This pipeline calls variants (SNP/indel and SV) by aligning assembled contigs to a reference, as well as by aligning the contigs from two haplotypes to each other. The main pipeline is in competitive_genotyping.wdl. The per-sample variant calling and counting takes place in call_assembly_variants.wdl. There is also a process to QC variant calls by aligning reads to the assemblies and counting read support at the variant positions; this is in get_read_support.wdl (it can be called from inside call_assembly_variants.wdl, but that call is currently commented out due to its high computational cost).

Inputs

  • assembly_list: a tab-separated file listing the input assemblies, with the following fields: sample name, haplotype 1 contig path, haplotype 2 contig path, and the path to a file listing FASTQ files for read-support analysis (see the example after this list)
  • dataset_list: a file listing paths to Illumina data that you want to align to the variants, in order to genotype them in another dataset
  • ref: path to the reference (GRCh38, no alt contigs)
  • ref_index: path to the reference index
  • ref_name: reference name
  • segdup_bed: BED track of segmental duplications, used for categorizing variants
  • str_bed: BED track of short tandem repeats (STRs), used for categorizing variants
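
A minimal sketch of the two list files, assuming hypothetical sample names and paths (only the column/line layout described above is required):

    # assembly_list: sample name, hap1 contigs, hap2 contigs, FASTQ list (tab-separated; hypothetical paths)
    NA12878    /assemblies/NA12878.hap1.fa    /assemblies/NA12878.hap2.fa    /assemblies/NA12878.fastqs.txt
    HG00733    /assemblies/HG00733.hap1.fa    /assemblies/HG00733.hap2.fa    /assemblies/HG00733.fastqs.txt

    # dataset_list: paths to Illumina data, assumed one per line (hypothetical paths)
    /data/illumina/cohortA/sample1.cram
    /data/illumina/cohortA/sample2.cram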

Outputs

  • SNP/indel calls from each haploid assembly are under call_small_variants1_ref and call_small_variants2_ref. loose.genotyped.vcf.gz is relative to reference coordinates; loose2.genotyped.vcf.gz is relative to the respective contig coordinates. SNP/indel calls from haplotype 1 vs. haplotype 2 contigs are under call_small_variants_self. loose.genotyped.vcf.gz is relative to haplotype 2 contig coordinates; loose2.genotyped.vcf.gz is relative to haplotype 1 contig coordinates. In all cases, the variant IDs in loose.genotyped.vcf.gz and loose2.genotyped.vcf.gz correspond to the same alignment event in both VCFs. In some cases they do not line up exactly one-to-one, because the "genotyping" process summarizes places where the same variant is called in multiple contigs, or the same contig corresponds to multiple variants.
  • Split-read SV calls from each haploid assembly are under call_sv1_ref and call_sv2_ref. breakpoints.sorted.bedpe contains the putative breakpoints, classified by SV type based on a number of rules. There may be duplicate breakpoints in this file if multiple contigs have alignments that indicate the same breakpoint.
  • SNP/indel calls that have been combined to make diploid genotypes are under combine_small_variants_vcf. The two reference-based haploid callsets are combined into a diploid callset on reference coordinates in small_variants.combined.vcf.gz. Reference-based diploid calls that correspond to calls from haplotype 1 vs haplotype 2 are in ref_nonunique_small_variants.vcf.gz. Reference-based diploid calls that do not correspond to self-calls are in ref_unique_small_variants.vcf.gz. Diploid calls from aligning the haplotypes to each other are sorted into self_novel_small_variants.vcf.gz and self_known_small_variants.vcf.gz, based on whether they correspond to reference-based calls or not.
  • SV calls that have been combined to make diploid genotypes are under combine_sv. Haploid calls from haplotype 1 vs. ref and haplotype 2 vs. ref are compared using pairtopair with 50 bp slop (a sketch of this comparison follows this list). These calls are output in ref1_ref2.bedpe, where the name column contains the SV type, the genotype (homalt=1/1, ref1=1|0, ref2=0|1), and a unique number.
  • Variants are counted and summarized in count_variants and count_self_variants.
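
The exact comparison commands live in the WDL; the following is only a rough sketch of the pairtopair step described above, assuming hypothetical BEDPE file names:

    # Report breakpoint pairs where both ends of a hap1-vs-ref call overlap a hap2-vs-ref call within 50 bp
    bedtools pairtopair -a hap1_ref.breakpoints.sorted.bedpe \
                        -b hap2_ref.breakpoints.sorted.bedpe \
                        -type both -slop 50 \
        > shared_breakpoints.bedpe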

Plotting

I included the Rmd files that I used to produce plots: assembly_variants.Rmd and read_support.Rmd. Their inputs are:

  • assembly_variants.Rmd takes counts.txt, a concatenation of all of the counts.txt files from the count_variants step in call_assembly_variants.wdl, and self_counts.txt, a concatenation of all of the counts.txt files from the count_self_variants step in call_assembly_variants.wdl.
  • read_support.Rmd takes contigs2_contigs1_support.txt, which is the support.txt output of the combine_small_variant_support_contigs1_contigs2 step of get_read_support.wdl.
  • ref1_self.txt and ref2_self.txt are outputs of the correspond_variant_ids step of get_read_support.wdl.
  • contigs2_contigs1.{snps, ins, del}.txt are simply lists of the variant IDs split by type; these can be generated in a number of ways (one possible approach is sketched after this list).
  • contigs2_contigs1.genotypes.txt gives the "genotypes" from the original variant calling step. These are not actual genotypes; they reflect how many contigs were observed with that variant alignment.
  • giab_ref1_match.tsv and giab_ref2_match.tsv are derived from hap.py analysis comparing the reference-based calls to GiaB calls. This only applies to GiaB sample(s), or other samples with truth sets. hap.py generates a VCF, and the tsv file can be derived with the command:
    zcat ref1.happy.vcf.gz | grep -v "^#" | grep -v NOCALL | cut -f 1,2 > giab_ref1_match.tsv
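
The per-type variant ID lists are not tied to a single command; one minimal sketch using bcftools, assuming a hypothetical input VCF with biallelic records:

    # SNP IDs (hypothetical file names; the pipeline may generate these differently)
    bcftools query -i 'TYPE="snp"' -f '%ID\n' contigs2_contigs1.vcf.gz > contigs2_contigs1.snps.txt
    # Split indel IDs into insertions and deletions by comparing REF/ALT lengths
    bcftools query -i 'TYPE="indel"' -f '%ID\t%REF\t%ALT\n' contigs2_contigs1.vcf.gz \
      | awk 'length($3) > length($2) {print $1 > "contigs2_contigs1.ins.txt"}
             length($3) < length($2) {print $1 > "contigs2_contigs1.del.txt"}'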

Notes on running

I have been running this using cromwell on compute1. Configuration details are in https://github.com/hall-lab/cromwell-on-lsf. I start the cromwell server and the workflow using the following command:

    LSF_DOCKER_VOLUMES="/home/aregier:/home/aregier /storage1/fs1/ccdg/Active:/storage1/fs1/ccdg/Active /scratch1/fs1/ccdg:/scratch1/fs1/ccdg" \
    bsub -R "rusage[mem=32000]" -q ccdg -G compute-ccdg \
      -oo /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/ca_round1/logs/%J.log \
      -a 'docker(registry.gsc.wustl.edu/apipe-builder/genome_perl_environment:22)' \
      /usr/bin/java -Xmx16g \
      -Dconfig.file=/storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/cromwell-on-lsf/cromwell.config \
      -Dsystem.input-read-limits.lines=50000 \
      -jar /opt/cromwell.jar run -t wdl \
      -i /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/multiple_competitive_alignment/competitive_genotyping.inputs.json \
      /storage1/fs1/ccdg/Active/analysis/ref_grant/assembly_analysis_20200220/multiple_competitive_alignment/competitive_genotyping.wdl
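
competitive_genotyping.inputs.json maps the workflow inputs listed above to concrete paths. A rough sketch, assuming the workflow block in competitive_genotyping.wdl is named competitive_genotyping (check the WDL for the actual name) and using hypothetical paths:

    {
      "competitive_genotyping.assembly_list": "/path/to/assembly_list.tsv",
      "competitive_genotyping.dataset_list": "/path/to/dataset_list.txt",
      "competitive_genotyping.ref": "/path/to/GRCh38_no_alt.fa",
      "competitive_genotyping.ref_index": "/path/to/GRCh38_no_alt.fa.fai",
      "competitive_genotyping.ref_name": "GRCh38",
      "competitive_genotyping.segdup_bed": "/path/to/segdups.bed",
      "competitive_genotyping.str_bed": "/path/to/strs.bed"
    }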

