Giter Club home page Giter Club logo

fastvar's Introduction

fastvar

Some quick 'n dirty variant "estimation" tools. The intent of this project is not to create accurate genotypes, per se, as there are a great many good softwares for doing this, but to create the fastest possible genotypes that are passable to assess sample integrity, match samples from the same person, and approximate relatedness among individuals.

Installation

Dependencies:

[HTSLib v1.10+] (https://github.com/samtools/htslib/releases/tag/1.13) must be [installed] (https://github.com/samtools/htslib/blob/develop/INSTALL) globally on your machine, or if you don't have root access to your machine, tell GCC where a local htslib is.

For example, if you have built htslib in /home/user/htslib:

export LIBRARY_PATH=/home/usr/htslib:$LIBRARY_PATH
export CPATH=/home/usr/htslib/htslib:$CPATH

Install:

git clone http://github.com/txje/fastvar
cd fastvar

# get klib headers
mkdir incl
cd incl
git clone http://github.com/attractivechaos/klib
cd ..

cd ..
make

Usage

./fasttype <BAM> <BED> <REF FASTA>

BAM: reads aligned to <REF FASTA> or '-' to read SAM format from the input stream

BED: list of variant positions to capture in BED format (like dbSNP's [discontinued] batch query results)

<chrom>  <st> <en> <snp>  <score> <strand>
chr7  6026829 6026830 rs2639  0 +

We actually only care about the first two fields - chrom and start position.

IMPORTANT: in this format, for whatever reason, the canonical reference position of the SNP is the end, and the start is (pos - 1). In practice, we just use the "start" as an index into the 0-indexed reference, so if you're off by one, most of the time all of your loci will appear homozygous and there will be little or no differences between samples. If you see this, double-check your BED file. See data/exomeChip_fingerprint_snps.bed as an example.

REF FASTA: fasta file used for alignments

Or, directly from one or more unaligned read files, using a k-mer approach:

./ktype <BED> <REF FASTA> <READ FASTA/Q> [<READ FASTA/Q> ...]

READ FASTA/Q: may be gzipped

Output

A tab-separated file in the following format:

chrom	start	end	ref_allele	A	C	G	T	N	Unknown

Where the first three fields match the input BED file, followed by the allele from the reference fasta, and counts for each read allele observed at that position, which - in rare cases - may include an "Unknown" that accounts for all manner of sins, including deletions.

Accuracy and efficiency

For both fastttype and ktype, memory usage is typically dominated the size of the reference (~3GB for humans). If you have very many SNPs, the variant bookkeeping will begin to contribute.

Runtime is typically dominated by the BAM or read FASTQ size. In my hands, for high-depth (>100x) exome sequencing, fasttype takes ~2 min per sample and ktype takes ~7 min per sample from gzipped reads (on one thread).

Alignment-based fasttype results correspond very closely to the ktype results in most cases. For the limited trials, with relatively few SNPs, described below, I've tried k = 32 and k = 64, and there's little difference. There are still some undiagnosed differences where, in some cases, ktype finds anywhere from ~1/2 - 2x as many of one or both nucleotides as fasttype. For simple sample matching, this isn't a huge problem because at least it's consistent across samples, but is certainly not the expected behavior.

Notes

For the cleanest and fastest rapid genotyping, here's what I recommend:

git clone https://github.com/lh3/minimap2
cd minimap2
make
cd ..

ref=h38.fa                                             # or something
r1=SAMPLE_A_R1.fastq.gz                                # or something
r2=SAMPLE_A_R2.fastq.gz                                # or something
snp_bed=fastvar/data/exomeChip_fingerprint_snps.bed    # or something

./minimap2/minimap2 -t $(nproc) -ax sr $ref $r1 $r2 | ./fastvar/fasttype - $snp_bed $ref > SAMPLE_A.fast_variants

# -- OR --

./fastvar/ktype $snp_bed $ref $r1 $r2 > SAMPLE_A.k_variants

I've included a small set of common exome genotyping SNPs (data/exomeChip_fingerprint_snps.bed) (see https://genome.sph.umich.edu/wiki/Exome_Chip_Design) that are typically sufficient to match and distinguish samples by individual for fast sanity checks (read barcode mixups).

There is also another, much bigger set of exome SNPs with AF 0.45 - 0.55 from 1000 Genomes data: data/1kgenomes_0.5af_exome_snps.bed

Feel free to submit issues and pull requests.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.