Giter Club home page Giter Club logo

fade's Introduction

alt text

Fragmentase Artifact Detection and Elimination

Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge Anaconda-Server Badge

DNA shearing is a crucial first step in most NGS protocols for Illumina. Enzymatic fragmentation has shown in recent years to be a cost and time effective alternative to physical shearing (i.e. sonication). We discovered that enzymatic fragmentation leads to unexpected alteration of the original DNA source material. We provide fade as a method of identification and removal of enymatic fragmentation artifacts.

Installation

Our documentation has information on installing fade and its prerequisites. Or you can install fade via conda.

conda install -c bioconda fade

Running FADE

fade annotate -b sam1.bam ref.fa > sam1.anno.bam
samtools sort -n sam1.anno.bam > sam1.anno.qsort.bam #recommended but not neccessary
fade out -b sam1.anno.qsort.bam > sam1.filtered.bam

Note: Queryname sorting is suggested in between running fade annotate and running fade out. This is so fade out can eject whole fragments containing an artifact on either R1 or R2. If this step is not performed fade out with simply eject only the read with an artifact (orphaning its mate). The justification behind this behavior is that we assume the whole fragment is biased by the effects of enzymatic fragmentation.

Running FADE via Docker

docker run -v `pwd`:/data blachlylab/fade annotate -b /data/sam1.bam /data/ref.fa > sam1.anno.bam
docker run -v `pwd`:/data blachlylab/fade out -b -c /data/sam1.anno.qsort.bam > sam1.filtered.bam

Windows

docker run -v C:\path\to\folder:/data blachlylab/fade annotate -b /data/sam1.bam /data/ref.fa > sam1.anno.bam
docker run -v C:\path\to\folder:/data blachlylab/fade out -b -c /data/sam1.anno.qsort.bam > sam1.filtered.bam

Note: fade annotate works in parallel. Due to this, fade doesn't necessarily write the output in the same order as the input. Your sorting will be affected. You will likely need to re-sort using samtools sort if you would like to use IGV or samtools index. fade out when the -c flag is used will also affect sorting and mate information as it can modify the starting position of an alignment. If using fade out -c you should consider also running a tool like Picard's FixMateInformation after re-sorting.

Program Details

Command-line parameters

fade

Fragmentase Artifact Detection and Elimination
usage: ./fade [subcommand]
        annotate: marks artifact reads in bam tags (must be done first)
        out: eliminates artifact from reads(may require queryname sorted bam)
        stats: reports extended information about artifact reads
        stats-clip: reports extended information about all soft-clipped reads
        extract: extracts artifacts into a mapped bam
-h --help This help information.

fade annotate

Fragmentase Artifact Detection and Elimination
annotate: performs re-alignment of soft-clips and annotates bam records with bitflag (rs) and realignment tags (am)
usage: ./fade annotate [BAM/SAM input] [Indexed fasta reference]

-t     --threads extra threads for parsing the bam file
    --min-length Minimum number of bases for a soft-clip to be considered for artifact detection
-w --window-size Number of bases considered outside of read or mate region for re-alignment
-b         --bam output bam
-u        --ubam output uncompressed bam
-h        --help This help information.

fade out

Fragmentase Artifact Detection and Elimination
out: removes all read and mates for reads contain the artifact (used after annotate and requires queryname sorted bam) 
or, with the -c flag, hard clips out artifact sequence from reads
usage: ./fade out [BAM/SAM input]

-c    --clip clip reads instead of filtering them
-t --threads extra threads for parsing the bam file
-b     --bam output bam
-u    --ubam output uncompressed bam
-h    --help This help information.

fade stats

Fragmentase Artifact Detection and Elimination
stats: reports extended information about artifact reads (used after annotate)
-t --threads threads for parsing the bam file
-h    --help This help information.

fade stats-clip

Fragmentase Artifact Detection and Elimination
stats-clip: reports extended information about all soft-clipped reads (used after annotate)
-t --threads threads for parsing the bam file
-h    --help This help information.

fade extract

Fragmentase Artifact Detection and Elimination
extract: extracts artifacts into a mapped bam
usage: ./fade extract [BAM/SAM input]

-t --threads extra threads for parsing the bam file
-b     --bam output bam
-u    --ubam output uncompressed bam
-h    --help This help information.

Algorithm

FADE is written in D and uses the htslib library via dhtslib, and the parasail library via dparasail. FADE accepts SAM/BAM/CRAM files containing reads that have been mapped to a reference genome and filters or cleans up artifact-containing reads according to the following procedure.

FADE is designed to determine a sequencing read’s enzymatic artifact status by employing aligner soft-clipping. Soft-clipping is an action performed by the aligner to improve the alignment score of a read to the reference by ignoring a portion on one end of the read. Soft-clipping can help an aligner correctly align a read that has sequencing error on one end of the read or has adapter contamination. FADE employs soft-clipping to identify potentially enzymatic artifact containing reads.

  1. It considers only those reads aligned with soft-clipping.
  2. The reference sequence that the alignment is mapped to is extracted such that there exists 300 nucleotides (nt) of padding on each end of the mapped read.*
  3. The read is reverse-complemented and then aligned via a Smith-Waterman local alignment to the extracted region of reference sequence. We use a scoring matrix with a gap open penalty of 10, a gap extension penalty of 2, a mismatch penalty of 3, and a match score of 2.**

FADE makes available several subcommands that all rely on the algorithm described above.

  1. The annotate subcommand performs the initial analysis and adds BAM tags encoding information concerning artifact status to the alignments, used during filtration to remove the artifacts.
  2. The out subcommand eliminates the artifact. It either removes reads from the output BAM/SAM file completely if they or their mate contain an identified fragmentation artifact or artifact-containing reads are trimmed to remove extraneous sequence originating from the opposite strand. After elimination, FADE reports statistics describing the total number of alignments, the percentage of soft-clipped alignments, and the percentage of enzymatic artifacts found. The filtration step must be run on a queryname-sorted BAM file in order to fully filter out the read, its mate, and any other supplementary or secondary alignments.
  3. The stats subcommand reports extended information on all reads identified by annotate to contain the artifact.
  4. The stats-clip subcommand reports information on all soft-clipped sequences present.
  5. The extract subcommand allows the extraction of the artifact sequences in their remapped state.

* The 300 nt padding on each end of the mapped region provides ample search space for artifact alignment search without being too computationally expensive; most artifacts originate very close to the mapped region and 300 nt was chosen as an optimal tradeoff, but could be adjusted.
** Harsher gap penalties allows the algorithm to be strict in allowing gaps, since we expect the artifact sequences to directly match the reference, except for soft-clipped regions derived from sequencing error. A soft-clipped region is considered to be an artifact if there is a 90% or greater match to the opposite strand sequence.

fade's People

Contributors

charlesgregory avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

fade's Issues

Where to run in GATK workflow

The documentation does not indicate when FADE should be run when following alignment workflows, such as GATK? Immediately after alignment and qname sort, or after recalibration/mark duplicates? Also, how does FADE work when comparing GATK 3 (with local indel realignment) to GATK 4 (no realignment)?

fade missing many reads that appear to be fragmentation artifacts

I ran fade on bam that we believe has high level of fragmentation artifacts. I am seeing that 8% of the reads are soft-clipped, while only 0.01% of the reads are classified by fade as artifacts. When I blat some of the soft clipped reads that remain after running fade out, I see that they show the characteristic alignment of the fragmentation artifact, with part of the read aligning to the forward strand and part to the reverse. When I look for these reads in the fade stats-clip output, I see that they have "artifact_status: false". Here is an IGV screenshot illustrating the issue:
image
and here is the input bam subset to the region shown:
STD347-81.reg.bam.zip:

error running fad stats

I am getting the following error running fade stats:

rt/dwarfeh.d:354: uncaught exception reached top of stack
This might happen if you're missing a top level catch in your fiber or signal handler
core.exception.OutOfMemoryError@core/lifetime.d(137): Memory allocation failed
----------------
src/env/__libc_start_main.c:10 <ERROR: Unable to retrieve function name> [0x0]
Aborted

Here is how I am running fade:

docker run -it -m 8g -v $PWD:/data --entrypoint sh blachlylab/fade
fade stats $BAM

My host machine is a Mac. I've also been getting the same error on a cloud workstation running ubuntu 18.04 with fade installed via conda. I've tried several different bams including one that was trivially sized (0.5G) and have checked that I have plenty of free memory available on both machines. I am able to run fade out on the same bams without any errors.

fade stats-clip output documentation?

Is there description for these values? It's the output for fade stats-clip

  • qname
  • sc_q_scores
  • sc_seq
  • sc_avg_bq
  • avg_bq
  • art_status

Also, do you have a way to produce a histogram of the soft clip read lengths? It would be a great way to identify read quality coming off of the sequencing machine.

invalid pointer issue

I installed fade into my conda env and simply run:
fade annotate -b

and I got this error message:
'''
*** Error in `fade': free(): invalid pointer: 0x000055d982948f68 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81489)[0x2b860cefa489]
/home/yangyxt/anaconda3/bin/../lib/libhts.so.3(hfile_destroy+0x22)[0x2b860b7d70c2]
/home/yangyxt/anaconda3/bin/../lib/libhts.so.3(hclose+0x38)[0x2b860b7d7978]
/home/yangyxt/anaconda3/bin/../lib/libhts.so.3(hts_close+0x106)[0x2b860b7e1d46]
fade(+0x7dcd6)[0x55d982120cd6]
fade(+0x3f3f0)[0x55d9820e23f0]
fade(+0x3e5f9)[0x55d9820e15f9]
fade(+0xb2f0c)[0x55d982155f0c]
fade(+0xb2e08)[0x55d982155e08]
fade(+0xb2c5e)[0x55d982155c5e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b860ce9b3d5]
fade(+0x345c9)[0x55d9820d75c9]
======= Memory map: ========
2b860b78b000-2b860b7ad000 r-xp 00000000 fd:00 69192456 /usr/lib64/ld-2.17.so
2b860b7ad000-2b860b7ae000 rw-p 00000000 00:00 0
2b860b7ae000-2b860b7bc000 r--p 00000000 00:29 85211792 /home/yangyxt/anaconda3/lib/libhts.so.1.12
2b860b7bc000-2b860b874000 r-xp 0000e000 00:29 85211792 /home/yangyxt/anaconda3/lib/libhts.so.1.12
2b860b874000-2b860b893000 r--p 000c6000 00:29 85211792 /home/yangyxt/anaconda3/lib/libhts.so.1.12
2b860b893000-2b860b895000 r--p 000e4000 00:29 85211792 /home/yangyxt/anaconda3/lib/libhts.so.1.12
2b860b895000-2b860b896000 rw-p 000e6000 00:29 85211792 /home/yangyxt/anaconda3/lib/libhts.so.1.12
'''

Warning or error log in fade annotate ?

Hi,
Thanks for your tool, when I test it for my panel data, it shows following log, should I ignore it?

Version: fade.many-linux-x86_64.tar.gz

fade annotate -t 8 -b $filein $ref_hsa > test.fadeanno.1.bam

Here is the log on screen:

[W::fade annotate] Output SAM/BAM will not be sorted (regardless of prior sorting)
[I::dhtslib.sam.reader.SAMReader.__ctor!string.this] 6 CPU cores detected; enabling multithreading
[I::dhtslib.sam.writer.SAMWriter.__ctor!(File).this] 6 CPU cores detected; enabling multithreading
[W::dparasail.alignment.Parasail.this] Couldn't load matrix named ACTGN, assuming alphabet
[W::dparasail.alignment.Parasail.this] Attempting to create matrix from alphabet ACTGN
[I::dhtslib.sam.record.SAMRecord.opIndexAssign!ubyte.opIndexAssign] Undefined issue adding/updating bam tag. Trying again...
[I::dhtslib.sam.record.SAMRecord.opIndexAssign!ubyte.opIndexAssign] Undefined issue adding/updating bam tag. Trying again...
[I::dhtslib.sam.record.SAMRecord.opIndexAssign!ubyte.opIndexAssign] Undefined issue adding/updating bam tag. Trying again...
[I::dhtslib.sam.record.SAMRecord.opIndexAssign!ubyte.opIndexAssign] Undefined issue adding/updating bam tag. Trying again...
[I::dhtslib.sam.record.SAMRecord.opIndexAssign!ubyte.opIndexAssign] Undefined issue adding/updating bam tag. Trying again...
[I::dhtslib.sam.record.SAMRecord.opIndexAssign!ubyte.opIndexAssign] Undefined issue adding/updating bam tag. Trying again...
[I::dhtslib.sam.record.SAMRecord.opIndexAssign!ubyte.opIndexAssign] Undefined issue adding/updating bam tag. Trying again...
[I::dhtslib.sam.record.SAMRecord.opIndexAssign!ubyte.opIndexAssign] Undefined issue adding/updating bam tag. Trying again...

Best,
xiucz

Non-printable characters in ab:Z field

Greetings, I am the author of Genozip, a compression software for BAM/FASTQ/VCF etc (www.genozip.com). One of our users opened a support ticket regarding a failing compression of a FADE-generated BAM file. After much investigation, it turned out that the issue was a single alignment (out of half a billion) which had an ab:Z field that appears to be entirely corrupted - many non-printable characters.

ab:Z:^L!"^!"! !^"""^K^^
^^M^]^R^Z"^L^M!^N"""^M!^N^]^\ESC""^L ^N^O^]^H^P"""^^^^"^^!ESC!^R ^]!!#^
^\ ^O

In addition, you might want to consider adding a @ PG line for FADE in the SAM header - this will make it easier for other software packages (like ours) to understand the properties of data with which we are interacting.

ab:z tag conflicting with samfile format

I've been having issues with the ab:z tag introducing new line characters in the sam file. This hinders downstream analysis and disallows file parsing using samtools in some instances. These discrepancies are present in the sam files (mostly kappa sonicated libraries) deposited in SRA from your article published in NAR genomics and bioinformatics. Although, none of the new line characters in these SRA files interfere with samtools.

PF variants in fade region are not totally filtered

Hi,
After diving into your wonderful tool, I find one more question. After running my bam file (reacalibrate.bam with GATK best practice) with the fade annotate and fade out (without -c) , most PF variants will not show in the bam( The first IGV panel bam, fade.bam).

And the second bam, mt2.bam, which comes from mutect2 --bamout option, let's ignore it.

However, there are still some FP variants which can not be filtered by fade software, I know it is not a bug of fade.

We know that fade can filter/trim reads that meet fade‘s inter threshold. But if a variant contains reads both meet fade‘s inter threshold (read A) and not (read B). Should we remove the read B also?

image

b6cf3463a315c5aea4a6106c205c074

Best,
xiucz

fade out runs extremely slow on a name sorted bam

Using a trivially sized 0.5G bam for testing purposes, i found that it took ~10hrs to run fade out on the output of samtools sort -n. Watching the process on top it appears to have low cpu usage, and spends alot of time in the 'D' state. In comparison, running fade out on the same bam without sorting took about 20sec. I am seeing the same behavior running fade on a cloud workstation with ubuntu 18.04 and fade installed via conda as well as running in the blachlylab/fad docker image on my mac.

Exception when adding the bam tag.

I run into this error when simpy trying to run fade annotate.
Here is the error log as a screenshot:
image

I use GATK ValidateSamFile to check the integrity of BAM file before using them on FADE. So I'm pretty confident my BAM file is not corrupted.

cannot index output file

After running fade annotate, samtools sort, and fade out, I want to index the final bam file. However, I get an error from samtools index:

[E::hts_idx_push] Unsorted positions on sequence #5: 112160519 followed by 112160464
[E::sam_index] Read 'A00814:310:HVJ77DRXX:1:2101:1090:1219:TCAAAGCCA' with ref_name='5', ref_length=180915260, flags=163, pos=112160464 cannot be indexed
samtools index: failed to create index for [filename]

This is the read (output form samtools view):

A00814:310:HVJ77DRXX:1:2101:1090:1219:TCAAAGCCA	83	5	112160519	60	151M	=	112160464	-206	CTCCTTTTCTTTTACCCAAGTGTAAAATGGTGGAATTCCTTGAAGCTCTGCCCTAGGCTCTTTTCCTTTCTCTCTATGCTTCTGTTATAGTTTTTGTTTTTGTTTTCAAATTTACAGCTCAATTTGAGTTTTTTCCCTGAGTTTTAGACAC	FFFFFFFFF,:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:151	MC:Z:15S106M30S	AS:i:151	XS:i:23	rs:i:0

Thank you,

fade out -c fails or produces truncated bam

I recently ran fade out -c on a set of 28 bams. In all cases, I see following warning printed to stderr many times:

[E::bam_write1] Positional data is too large for BAM format

For 16 of the bams, it completes successfully and appears to produce a complete bam. For seven of the bams, it prints the summary statistics, for example:

read count:	265354955
Clipped %:	0.313838
% With Supplementary alns:	0.00732937
Artifact rate:	0.00724173
% With Supplementary alns and artifacts:	0.00151888
Artifact rate left only:	0.00361884
Artifact rate right only:	0.00362289

but then gives the error:

Failed to flush stdout: Bad file descriptor

For the others, it completes with exit status 0, but samtools sort gives:

[E::bgzf_read_block] Invalid BGZF header at offset 27328607662
[E::bgzf_read] Read block operation failed with error 6 after 0 of 4 bytes
samtools sort: truncated file. Aborting

fade out without the -c option completed successfully on all 28 bams without the warning or error. I don't see any of the issues with fade out -c when running on one of these bam subset to a region, so I am not able to share data here to reproduce the issue.

Bam sorting

fade can be cumbersome and unintuitive when it comes to sorting.

Ideally, the input bam to fade out is queryname sorted. This allows fade out to eject an artifact mate pair there is evidence of a read or its mate being an enzymatic fragmentation artifact. The logic behind this is that we cannot trust the mate of a read that contains an artifact because it is expected the whole insert is artifactual.

If the -c flag is provided for only clipping artifact reads, nothing will be done to the mate of an artifact read.

If the bam is not queryname sorted, fade out will only eject the reads individually and not consider them as paired.

Additionally, fade annotate works in parallel, even if the input bam is coordinate or queryname sorted, the output bam will NOT be sorted.

fade out may also NOT result in a coordinate-sorted bam, even if samtools sort is run after fade annotate. fade out modifies the alignment position of artifactual reads if the -c flag is provided. Hard-clipping the artifact regions of the reads changes the alignment position. In the end, we may be looking at fade's overall execution for a pipeline as:

fade annotate -b input.bam ref.fa > input.anno.bam
samtools sort -n input.anno.bam > input.qns.bam
fade out -b input.qns.bam > input.out.bam
samtools sort input.out.bam > final.bam
samtools index final.bam

It may be a significant bit of work, but we should potentially support resorting input and output bam files internally.

# output either is coordinate sorted by fade
# or retains original sorting
fade annotate -b input.bam ref.fa > input.anno.bam 

# fade internally resorts input based on queryname 
# if -c flag is not used
# output is always resorted to coordinate sorting 
fade out -b input.qns.bam > input.out.bam

Alternatively, we could warn the user when their bam file appears unsorted and warn that fade's output is not sorted.

@jblachly thoughts?

Need unittests

I need unittests and CI to help keep me from tagging commits as versions when they are broken or untested.

Question about your use of D

Hello Lawrence here. Noticed you use D a lot for scientific computations. Would you like to have your org featured on the D website?

negative values for 'read count', 'Clipped %', etc

I am seeing a few bams were all of the values printed to STDERR from fade out are negative. For example:

read count:	-1875829843
Clipped %:	-0.297902
% With Supplementary alns:	-0.00544583
Artifact rate:	-0.00885916
% With Supplementary alns and artifacts:	-0.00333769
Artifact rate left only:	-0.00442794
Artifact rate right only:	-0.00443122

Unfortunately, this issue doesn't replicate on the same bam subset to a region, so I'm not able to share data to reproduce the issue.

How to process mutect2 caller after fade out

Dear sir,
I tried to use docker to process fade on my bam file, and the filtered bam cannot indexed by samtools so that I can't call somatic variants by mutect2.
command I used:
sudo docker run -v pwd:/data blachlylab/fade out -b -c /data/input.hg38.fade.bam >out.hg38.fade.filtered.bam
samtools index out.hg38.fade.filtered.bam
[E::hts_idx_push] Unsorted positions on sequence #1: 203008683 followed by 203008623
[E::sam_index] Read 'A00355:191:HJ2TWDSXY:2:1101:1009:14982' with ref_name='chr1', ref_length=248956422, flags=163, pos=2030d

Even I use samtools sort. The same error keep coming.

[W::dhtslib.tagvalue.TagValue.this] SA doesn't exist for this record & rs doesn't exist for this record

  • First, I mapping reads into genome: bwa mem -M -R "@rg\tID:Sample\tLB:Sample\tSM:Sample\tPL:ILLUMINA" -t 30 hg19.fa Sample.R1.paired.fastq.gz Sample.R2.paired.fastq.gz > Sample.sam

  • Next, sort the sam file into bam, and use picard mark duplicates reads, gatk realign indel, I get Sample.realigned.bam

  • Then, I run the command as your recommend: fade annotate -b Sample.realigned.bam hg19.fa > Sample.realigned.anno.bam, I get the WARNING "[W::dhtslib.tagvalue.TagValue.this] SA doesn't exist for this record" full of my screen, but I stall get Sample.realigned.anno.bam

  • Finally, I skip the sort steps run: fade out -b Sample.realigned.anno.bam > Sample.realigned.anno.filter.bam, the command exit and output error:
    [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
    [W::dhtslib.tagvalue.TagValue.this] rs doesn't exist for this record
    core.exception.AssertError@../../.dub/packages/dhtslib-0.10.1/dhtslib/source/dhtslib/tagvalue.d(84): Assertion failure


??:? [0x55b3d5efe590]
??:? [0x55b3d5f0884a]
??:? [0x55b3d5eef80d]
??:? [0x55b3d5ee722c]
.......

I want to know what's going on? ?

Docker: Use mimalloc

musl-c allocation is slow for multi-threaded programs.
https://www.linkedin.com/pulse/testing-alternative-c-memory-allocators-pt-2-musl-mystery-gomes

I tried to use mimalloc but ran into issues. Likely from mixing allocators by not properly linking. I would like to use mimalloc statically linked into fade. I would likely have to compile all dependencies with it too which didn't seem to compile in my testing.

Until this is fixed the static binary releases will not be as performant as they could be. Though this is not a huge priority.

Better estimate on mate region size

end=rec.mate_position()+300+151; //here we can't know the bases covered so estimate 151 bases

Currently we just estimate the mate region to be 151 bases as expected for Illuminia. Either make this configurable or at least make the guess smarter. Could we get mate cigar?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.