getzlab / rnaseqc Goto Github PK

Fast, efficient RNA-Seq metrics for quality control and process optimization

License: Other

C++ 79.54% Makefile 2.63% Dockerfile 0.48% Python 17.35%

rnaseqc's Introduction

RNA-SeQC

RNA-SeQC 2 is described in A. Graubert*, F. Aguet*, A. Ravi, K.G. Ardlie, Gad Getz, "RNA-SeQC 2: efficient RNA-seq quality control and quantification for large cohorts," Bioinformatics, 2021.

Installing

The latest stable build of RNA-SeQC is available on the GitHub Releases page, and contains static binaries for Linux and OSX.

RNA-SeQC is also available as a docker image: gcr.io/broad-cga-aarong-gtex/rnaseqc:latest which is automatically updated with any code change. Older versions of the docker image are tagged using the full commit SHA of any commit which introduced a code change.

To checkout the source of RNA-SeQC run git clone --recursive https://github.com/getzlab/rnaseqc.git. If you do not use the --recursive flag, you'll need to run git submodule update --init --recursive or you will be missing SeqLib.

Unit Tests

Input data for RNA-SeQC's testing suite is not stored in the repository due to size constraints. The current test data is available here, and must be unpacked within the test_data/ directory. Please note that the location of the test data is subject to change. The test resources use ~1.2 GB of space.

You can download and unpack test data with:

cd test_data
wget https://storage.googleapis.com/agraubert/broadinstitute/rnaseqc/test_inputs.tar.gz
tar xzf test_inputs.tar.gz

You can run the unit tests with make test

Usage

NOTE: This tool requires that the provided GTF be collapsed in such a way that there are no overlapping transcripts on the same strand and that each gene have a single transcript whose id matches the parent gene id. This is not a transcript-quantification method. Readcounts and coverage are made towards exons and genes only if all aligned segments of a read fully align to exons of a gene, but keep in mind that coverage may be counted towards multiple transcripts (and its exons) if these criteria are met. Beyond this, no attempt will be made to disambiguate which transcript a read belongs to. You can collapse an existing GTF using the GTEx collapse annotation script

Command Line Usage:

rnaseqc [OPTIONS] gtf bam output

Example: ./rnaseqc test_data/downsampled.gtf test_data/downsampled.bam --bed test_data/downsampled.bed --coverage .

OPTIONS:

  -h, --help                        Display this message and quit

  --version                         Display the version and quit

  gtf                               The input GTF file containing features
                                    to check the bam against

  bam                               The input SAM/BAM file containing reads
                                    to process

  output                            Output directory

  -s[sample], --sample=[sample]     The name of the current sample. Default:
                                    The bam's filename

  --bed=[BEDFILE]                   Optional input BED file containing
                                    non-overlapping exons used for fragment
                                    size calculations

  --fasta=[fasta]                   Optional input FASTA/FASTQ file
                                    containing the reference sequence used
                                    for parsing CRAM files

  --chimeric-distance=[DISTANCE]    Set the maximum accepted distance
                                    between read mates. Mates beyond this
                                    distance will be counted as chimeric
                                    pairs. Default: 2000000 [bp]

  --fragment-samples=[SAMPLES]      Set the number of samples to take when
                                    computing fragment sizes. Requires the
                                    --bed argument. Default: 1000000

  -q[QUALITY],
  --mapping-quality=[QUALITY]       Set the lower bound on read quality for
                                    exon coverage counting. Reads below this
                                    number are excluded from coverage
                                    metrics. Default: 255

  --base-mismatch=[MISMATCHES]      Set the maximum number of allowed
                                    mismatches between a read and the
                                    reference sequence. Reads with more than
                                    this number of mismatches are excluded
                                    from coverage metrics. Default: 6

  --offset=[OFFSET]                 Set the offset into the gene for the 3'
                                    and 5' windows in bias calculation. A
                                    positive value shifts the 3' and 5'
                                    windows towards eachother, while a
                                    negative value shifts them apart.
                                    Default: 150 [bp]

  --window-size=[SIZE]              Set the size of the 3' and 5' windows in
                                    bias calculation. Default: 100 [bp]

  --gene-length=[LENGTH]            Set the minimum size of a gene for bias
                                    calculation. Genes below this size are
                                    ignored in the calculation. Default: 600
                                    [bp]

  --legacy                          Use legacy counting rules. Gene and exon
                                    counts match output of RNA-SeQC 1.1.9

  --stranded=[stranded]             Use strand-specific metrics. Only
                                    features on the same strand of a read
                                    will be considered. Allowed values are
                                    'RF', 'rf', 'FR', and 'fr'

  -v, --verbose                     Give some feedback about what's going
                                    on. Supply this argument twice for
                                    progress updates while parsing the bam

  -t[TAG...], --tag=[TAG...]        Filter out reads with the specified tag.

  --chimeric-tag=[TAG]              Reads maked with the specified tag will
                                    be labeled as Chimeric. Defaults to 'ch'
                                    for STAR

  --exclude-chimeric                Exclude chimeric reads from the read
                                    counts

  -u, --unpaired                    Allow unpaired reads to be quantified.
                                    Required for single-end libraries

  --rpkm                            Output gene RPKM values instead of TPMs

  --coverage                        If this flag is provided, coverage
                                    statistics for each transcript will be
                                    written to a table. Otherwise, only
                                    summary coverage statistics are
                                    generated and added to the metrics table

  --coverage-mask=[SIZE]            Sets how many bases at both ends of a
                                    transcript are masked out when computing
                                    per-base exon coverage. Default: 500bp

  -d[threshold],
  --detection-threshold=[threshold] Number of counts on a gene to consider
                                    the gene 'detected'. Additionally, genes
                                    below this limit are excluded from 3'
                                    bias computation. Default: 5 reads

  "--" can be used to terminate flag options and force all following
  arguments to be treated as positional options

Output files:

The following output files are generated in the output directory you provide:

{sample}.metrics.tsv : A tab-delimited list of (Statistic, Value) pairs of all statistics and metrics recorded.
{sample}.exon_reads.gct : A tab-delimited GCT file with (Exon ID, Gene Name, coverage) tuples for all exons which had at least part of one read mapped.
{sample}.gene_reads.gct : A tab-delimited GCT file with (Gene ID, Gene Name, coverage) tuples for all genes which had at least one read map to at least one of its exons. This file contains the gene-level read counts used, e.g., for differential expression analyses.
{sample}.gene_tpm.gct : A tab-delimited GCT file with (Gene ID, Gene Name, TPM) tuples for all genes reported in the gene_reads.gct file, with expression values in transcript per million (TPM) units. Note: this file is renamed to .gene_rpkm.gct if the --rpkm flag is present.
{sample}.fragmentSizes.txt : A list of fragment sizes recorded, if a BED file was provided
{sample}.coverage.tsv : A tab-delimited list of (Gene ID, Transcript ID, Mean Coverage, Coverage Std, Coverage CV) tuples for all transcripts encountered in the GTF.

Metrics reported:

See Metrics.md for a description of all metrics reported in the metrics.tsv, coverage.tsv, and fragmentSizes.txt files.

Legacy mode differences

The --legacy flag enables compatibility with RNASeQC 1.1.9. This ensures that exon and gene readcounts match exactly the counts which would have been produced by running that version. This also adds an extra condition to classify reads as chimeric (see "Chimeric Reads", above). Any metrics which existed in 1.1.9 will also match within Java's floating point precision.

rnaseqc's People

Contributors

Stargazers

Watchers

Forkers

celalp inambioinfo matheb04 noelnamai yuybio tayaza standardgalactic liupfskygre nyagam wook2014 wjdingman1 gpertea wanying-zhu ldenriquez kachulis qindan2008 naamakashani trinityctat jamestiotio

rnaseqc's Issues

Cram references STILL not working

I was pretty sure we fixed the issue with cram references back in walaj/SeqLib#38, but apparently travis builds are still failing. Somehow, despite a reference being handed to SeqLib and SeqLib setting the reference on the HTSlib bam object we still can't read Crams unless the fasta is in the hts cache or at the location indicated in the bam header.

I unfortunately don't have time to work on this right now, so for the time being I'm adding a warning message to RNA-SeQC and disabling tests on crams

median TPM of gene differs from median TPM of its singleton transcript in GTEx v8

Dear Folks,

I hope all is well, and thanks for all your efforts.

In the transcript expression file, GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_tpm.gct.gz, there is only one transcript, namely ENST00000367976.3, for ENSEMBL gene ENSG00000118523.5 (a.k.a. CTGF), and when I extract the TPM values for said transcript and tissue type 'Artery - Aorta', the resulting median TPM is 935.8 for n=432 (non-zero values).

Correspondingly, in the gene expression file, GTEx_Analysis_2017-06-05_v8_RSEMv1.3.0_transcript_tpm.gct.gz, when I extract the TPM values for said gene and tissue type 'Artery - Aorta', the resulting median TPM is 2043 for n=432 (non-zero values).

Also, please see the attached - and apparently quite similar - violin plots (grouped by donor's age bracket) for the gene and transcript TPM values, respectively.

I've looked at the code briefly, and please excuse that I have yet to explain the this difference (by a factor of ~2.2) of median TPM 2043 for the gene CTGF from 935.8 for the singleton transcript ENST00000367976.3.

Please advise, thanks.
Best,
CB
Rplot.CTGF.ArteryAorta.22_10_20.pdf
Rplot.ENST00000367976.ArteryAorta.12_10_20.pdf

All branch cloud build

On push to any branch (Cloud build, via web ui):

docker build and tag as $BRANCH-NAME

Alternative to collapse transcripts

I am very interested in using RNAseq-QC for our standard pipelines, however I ran into the issue preparing a GTF suitable for the tool. We use ensembl GTFs and the script suggested to prepare the GTF doesn't work for GTFs from this source. I tried to modify the GTF using sed to make it compatible with scritpt, but I still got an error (KeyError: 'level').

My question:

is there is another tool / script / solution you would suggest to prepare an ensembl GTF for RNAseq-QC? If not it would appear that RNAseq-QC is limited to the annotations available from GENCODE, and therefore unsuable for anyone not working with either human or mouse (or unwilling to change the source of annotations). Is this correct?

(I used to use RNAseq-QC a few years ago (JAVA versions), and iirc it worked with any gtf. Is my memory playing tricks?)

Question about reading BAM file in S3 bucket

Hi,

I would like to know if RNA-SeQC supports reading BAM files from AWS S3 bucket?

Thanks

Feature request: additional metric columns

I recommend adding columns indicating:

If reads are PE/SE
chimeric_read_rate and not just read count

Add Duplication Rate vs Genes Detected scatter

Feature request: Interactive plots

The automatic plots generated are great!

I think making at least some of the plots interactive (e.g. using plotly) so users could quickly identify the outlier samples could be very helpful.
It would also be very beneficial for interpreting the PCA plots.

Thanks!

report coverage on all genes

fully masked genes should report 0 coverage instead of being excluded from coverage table

"Duplicate Reads" metric not calculated, always outputs 0.

The "Duplicate Reads" metric is added to the output, however it is never actually calculated. Since there is also a "Mapped Duplicate Reads" metric which is calculated, it would seem reasonable to remove the "Duplicate Reads" metric

--stranded flag reversed

The --stranded flag currently handles opposite of what is expected. A bug fix has been implemented and committed to master, but we are currently performing thorough tests before publishing a new release.

In the meantime, it's safe to continue running RNA-SeQC without providing the --stranded flag

Duplicated exons issue

Dear Authors,

I am using the official annotation from ENSEMBL but the exons are not unique:

grep ENSE00001957285 Homo_sapiens.GRCh38.87.gtf

(This GTF can be downloaded from the official site: ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/ )

Also, the conversion method which you provided does not work, I assume that this happens because the file is ENSEMBL annotation(?) Could you please let me know how should I run rnaseqc with this particular GTF version?

Collapse.py file

@agraubert @joshua-gould @francois-a @dmcgoldrick
Hi, can you please guide how to convert the full length annotation gtf to the input gtf for rnaseqc. I have been using schmidtea_mediterranea.PRJNA379262.WBPS14.canonical_geneset.rnaseqc.gtf downloaded from here.

ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS14/species/schmidtea_mediterranea/PRJNA379262/schmidtea_mediterranea.PRJNA379262.WBPS14.canonical_geneset.gtf.gz

I used collapse.py which produced this error:

python collapse_annotation.py schmidtea_mediterranea.PRJNA379262.WBPS14.canonical_geneset.rnaseqc_type.gtf new.gtf File "collapse_annotation.py", line 100
print('Parsing GTF: {0:d} genes processed\r'.format(len(self.genes)), end='\r')
^
SyntaxError: invalid syntax

And while running rnaseqc command, I'm facing issue of duplicate exon ids.

Failed to parse the GTF: Detected non-unique Exon ID: SMEST026639001.e4

Feature request: support in bioconda

Most bioinformatic tools nowadays are available through bioconda. Would you consider updating the bioconda recipe for RNASeQC from the legacy 1.x series to the current 2.x series?

Thank you

Installation Error

This may be my own error in setting up the Makefile -- apologies if so.

I have gotten this error when running make in the source dir:

$ make
g++ -O3  -o rnaseqc src/BED.o src/Expression.o src/GTF.o src/RNASeQC.o src/Metrics.o src/Fasta.o src/BamReader.o SeqLib/lib/libseqlib.a SeqLib/lib/libhts.a  -lboost_filesystem -lboost_regex -lboost_system -lz -llzma -lbz2 -lpthread
/bin/ld: SeqLib/lib/libseqlib.a(libseqlib_a-BamReader.o): unable to initialize decompress status for section .debug_info
/bin/ld: SeqLib/lib/libseqlib.a(libseqlib_a-BamReader.o): unable to initialize decompress status for section .debug_info
SeqLib/lib/libseqlib.a: error adding symbols: File format not recognized
collect2: error: ld returned 1 exit status
make: *** [rnaseqc] Error 1

Any suggestions?

RNA-seQC output empty

Hi,

I followed the TOPmed pipeline used STAR to get the bam file, picard to mark duplicate, then used RNA-seQC to generate gene tpm. All steps ran without any error, however the gene_tpm.gct file only contain nan for tpm column, and exon_reads.gct and other files contain only 0 for last column, the first two columns (ID and name) looks fine.
head gene_tpm.gct

head exon_reads.gct

Therefore I double checked my work, and noticed a warning from picard mark duplicate step, because the log is very long, I will only paste the head and tail here:

Did this cause the empty output? I've been searching online for solution for several days and did not get any valid results. I tried both STAR BAM file and the BAM file after mark duplicates, both them gave the same results.

Software I used are STAR 2.7.3a, Picard v2.18.17 and rnaseqc.v2.3.6.linux.

Thank you for your help!

rRNA counts differ

Is there any equivalent to the legacy version -BWArRNA using bwa to count rRNA reads? That seems to be a more accurate option.

Proposal: Overhaul of RNASeQC to use SeqLib

https://github.com/walaj/SeqLib

Advantages of using SeqLib over Bamtools:

SeqLib claims to be at least 2x faster when reading. I'll have to benchmark to be sure
SeqLib supports CRAMs
SeqLib implements a chromosome aware interval tree, which may be more efficient than our search
- However, this will need to be benchmarked too. Even though we're doing a regular linear search, it's over a very narrow window, which may still give our implementation the upper hand
- Even if the tree is faster, we'd still have to implement the pruning behavior of the current search. Most of our low memory overhead is due to dropping features (and associated coverage data) when it leaves the search window
SeqLib is actively maintained

Additionally, the refactor would be a good opportunity to implement a multithreading feature I've been playing around with.

Goals and constraints of a refactor:

Must yield smaller memory footprint and/or higher throughput per core
Must not exceed 3Gb memory at any time, when running on a single core
- Should not exceed 1Gb
Must yield identical output to existing version
Static binaries cannot exceed 100Mb

Todo before starting any refactoring:

Build a benchmarking application to compare bam IO
Build a benchmarking application to compare sequential search performance using interval tree vs our narrow linear search.
- Must add pruning functionality to SeqLib interval tree, if not already present
Check if SeqLib is thread-safe or if bam IO would need to be synchronized

Issue with make

Hi,

I downloaded via "git clone --recursive https://github.com/getzlab/rnaseqc.git".
After cd to rnaseqc, I tried to use make command and got hit with this error:

"g++ -Wall -std=c++14 -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -I. -ISeqLib -ISeqLib/htslib/ -c -o src/BED.o src/BED.cpp
g++: error: unrecognized command line option ‘-std=c++14’
make: *** [src/BED.o] Error 1"

Any help with this issue will be much appreciated!

PS. Also, the collapse gtf code is no longer available. Do you mind providing alternative source? Thank you!

Release binaries

On tag (travis, via .travis.yaml):

compile binary and run tests (on multiple platforms)
if tests pass, attach platform binary to release

Gene detection should only use unique reads

Add a separate gene counting mechanism to count only reads without the isDuplicate flag set

Intergenic regions

Hi! Thanks for this tool. I have a lot of reads that align to intergenic regions. I was wondering if there is there a way that I can subset these reads for further analysis?

Thanks!

Library Complexity Estimation question

Hi Aaron,
I find that in many cases the Library Complexity Estimation is higher than # of Total Reads, let alone Mapped Reads. I looked at the Picard's EstimateLibraryComplexity page and couldn't find anything that reconciles this. I may be missing something basic - could you please explain?

Thanks,
Binyamin

in the output *.metrics.tsv file, many metrics are 0/-nan

I am using RNASeQC 2.3.5 to evaluate some RNA-seq data. However, in the output *.metrics.tsv file, many metrics are 0/-nan. I paste some lines in my outputs files here. Here is the command I use:
rnaseqc.v2.3.5 gencode.v33.primary_assembly.genes.gtf *bam outdir
I have tried several samples and all appear this problem. Could you please help to solve the problem?

Genes Detected 0
Estimated Library Complexity 0
Genes used in 3' bias 0
Mean 3' bias 0
Median 3' bias 0
3' bias Std 0
3' bias MAD_Std 0
3' Bias, 25th Percentile 0
3' Bias, 75th Percentile 0
Median of Avg Transcript Coverage 0
Median of Transcript Coverage Std 0
Median of Transcript Coverage CV 0
Median Exon CV 0
Exon CV MAD 0

Rename QC passed plot

It's actually just showing duplication rate again

Optimize docker build

Look over the dockerfile to ensure that images are maximizing shared layers

add dropFeatures overload without baseCoverage argument

For scrinvex

SeqLib: Replace BamTools with SeqLib for BamReader backend

Based on benchmarking, SeqLib can load reads about twice as fast as BamTools (which already accounts for ~30% of our runtime). While the benchmarking tools support both libraries, I don't see any need to do so (no need to implement BamTools versions of the interface)

Add SynchronizedReader and SeqlibReader classes

Feature request: support for MultiQC

Congratulations on reviving this project! All competing RNAseq QC tools have stalled for a long time.

Given that most QC metrics produced by omics pipelines seem to converge to MultiQC these days, would you consider reviving the MultiQC plugin for RNASeQC too? It seems that the current plugin only handles the 1.x legacy series, and discards most of the metrics produced by the tool.

Seems to me that MultiQC support could make a big difference in adoption rates for this tool. Thanks for considering!

Cheers

Add integration tests

Add a test which can be run quickly on a downsampled, public (k562?) bam

Build rnaseqc and verify that an executable was produced
Check that running rnaseqc --version produces output
Run a few test cases:
- Very quick test using a GTF with a single gene, and a bam with a single read pair
- A more thorough test using a downsampled bam on just one chromosome (GTF also limited to one chromosome). Test should take less than 60 seconds total
- A complete test using a downsampled bam, a full GTF, and an intervals bed file. Test should take less than 120 seconds total
- The same test, but in legacy mode
  - Ensure legacy mode output is consistent with java output
- A few expected failures
  - GTF and bam with no chromosomes in common
  - A crafted bam with reads that should get filtered. Ensure output has 0 counts
  - Others?

Make a test runner (python script to check expected outputs), and build test cases into makefile

Master branch cloud build

On push to master (Cloud Build, via .cloudbuild.yaml):

docker build and tag as $COMMIT-SHA
push docker
docker run the integration tests of the image
if tests pass, tag and push the image as latest

Is “-strictMode” flag still valid?

Hi,
I'm using rnaseqc.v2.3.6.linux and follow the the gtex-pipeline (https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq) to get my gene counts. In a recent paper "The GTEx Consortium atlas of genetic regulatory effects across human tissues", it mentioned to add a “-strictMode” flag in RNA-SeQC to filter gene counts, so I tried it with the original pipeline which did not use this option. My code is like: rnaseqc ${genes_gtf} ${bam_file} . -s ${sample_id} --stranded rf -vv -strictMode, however, when I checked the metrics.tsv file, nothing changed, and the count table is the same. Is it expected? Should I use this option?

Thank you!

STAR / Hisat2 bam not incompatible with RNA-SeQC ?

I used RNA-SeQC v2.3.4 with default parameters to process STAR/Hisat2 bam files, but results are weird. My reference genome is GENCODE grch37p13, and gene gtf were followed your advice using GTEx script.
There are some main error results:

Hisat2 / STAR BAM Total Reads is an error, which is different from fastqc
STAR BAM Mapping Rate is 1
Hisat2 BAM High Quality Exonic Rate is -nan
Hisat2 BAM 3' bias is 0

5964 Segmentation fault

Dear All,
I am using RNASEQC tool binary version rnaseqc.v2.3.1.linux from GitHub Releases. When I try to run the tool on human samples. It is throwing the following error,

The output (if any) follows:

/cluster/shadow/.lsbatch/1554797826.1499627.shell: line 1:  5964 Segmentation fault

And the Total memory I request to the run the above tool is cluster was: 80000.00 MB. Any help/suggestions would be much appreciated. I have successfully used the same tool on the mouse samples.

Feature request: reporting Sense Rate in --unpaired mode

It would be great if --unpaired would report the Sense Rate of mapping for the Single End reads. I recommend placing it into "End 1 Sense Rate" rather than making a new column unique to --unpaired mode.

Of note, for the other End 1 & End 2 metrics there are columns that already provide the respective information.

Thanks!

Track high and low quality exonic reads separately

Similarly to how the old tool did it, run exon counts on all mapped reads, but only keep the coverage data for high quality reads, which allows low quality to still be represented in the exonic rate

Closed alignment block end coordinate comparison

Hi,

I just noticed that in Expression.cpp#extractBlocks method, the line that calculates the end position of alignment block is:

block.end = start + current.Length(); // 1-based, closed

I'm wondering if this is inclusive or exclusive. If the block starts from position 1 and length is 1 base, its start and end should be 1 - 2, and 2 is NOT closed which is exclusive.

This may cause some comparison error in GTF.cpp#intersectPoint(const Feature &a, const coord x) where it always compares equity.

Thank you for any suggestions here!

Zheng

Clarity: Rename "Disqualified Reads" to "Ambiguous Reads"

Failed to parse GTF: non-unique Exon ID

I was trying to run the command:

rnaseqc.v2.3.5.linux ${REFDIR}/Macaca_mulatta.Mmul_8.0.1.97.chr.gtf ${ID}.transcript.sorted.bam ${QCDIR}

But get the following error:
"Failed to parse the GTF: Detected non-unique Exon ID: ENSMMUE00000311984"

I grep'ed out the Exon ID, and this is what I see:

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000063154"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000076984"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-202"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000054447"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-203"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";
.
.
.
Although the exon id is the same, the transcript id & name is different.

How should this be handled?

many thanks!

Handling of supplementary alignments

Hi,

Thanks for the tool! I recently ran this on a large cohort and noticed some oddities in the output (e.g., end 1 mapping rates > 1 for a subset of the libraries). The issue seems to be attributable to a large number of supplementary alignments. As far as I can tell RNA-SeQC doesn't check the 0x800 flag denoting supplementary alignments and this throws off some of the numbers -- for example, inflating the 'unique mapping, vendor QC passed reads'.

Given the set of metrics that RNA-SeQC calculates, I would have expected it to exclude supplementary alignments from many of the metrics as it does for secondary alignments. Is there any particular reason for ignoring the supplementary alignment flag?

Happy to put together a pull request to treat supplementary alignments in a similar manner as secondary alignments, if you determine that to be desirable.

3' bias calculation issues

In patch 2.1.2 we fixed an issue which caused 3' bias to tend towards either extreme.

However now we are encountering more issues, which may cause us to re-evaluate the method of 3' bias calculation.

Developments made before opening the issue:

Discovered and resolved 3' bias not accurately modeling coverage of - strand genes
Discovered that 3' bias is frequently effected by genes with unexpressed UTRs

--Unpaired option documentation

Hi @ agraubert,

I noticed a strange behaviour with some of my libraries which I managed to pin down to the --unpaired option. The issue is that with single-end libraries rnaseqc was not reporting any counts / fpm / coverage (zeros for all entries) even though (i) the metrics files did have reasonable numbers, and (ii) STAR was reporting gene counts.

To debug this issue I took the example dataset, converted the mapped alignments to fastq and proceeded to map either in paired mode using R1 and R2, or R1 alone in single mode:

#PE
@CO     user command line: STAR --genomeDir /projects/bioinfo/igenomes/Homo_sapiens/Ensembl_v88_custom/GRCh38/Sequence/StarIndex --readFilesIn downsampled_pe_R1.fastq.gz /lustre/projects/bioinfo/domingue/test_RNA-SeQC/alignments/downsampled_pe_R2.fastq.gz --runThreadN 6 --readFilesCommand zcat --outFileNamePrefix downsampled_pe. --outSAMtype BAM SortedByCoordinate --sjdbGTFfile /projects/bioinfo/igenomes/Homo_sapiens/Ensembl_v88_custom/GRCh38/Annotation/Genes/genes.gtf --quantMode GeneCounts --outSAMattrRGline ID:downsampled_pe PL:illumina SM:downsampled_pe

#SE
@CO     user command line: STAR --genomeDir /projects/bioinfo/igenomes/Homo_sapiens/Ensembl_v88_custom/GRCh38/Sequence/StarIndex --readFilesIn downsampled_se.fastq.gz --runThreadN 6 --readFilesCommand zcat --outFileNamePrefix downsampled_se. --outSAMtype BAM SortedByCoordinate --sjdbGTFfile /projects/bioinfo/igenomes/Homo_sapiens/Ensembl_v88_custom/GRCh38/Annotation/Genes/genes.gtf --quantMode GeneCounts --outSAMattrRGline ID:downsampled_se PL:illumina SM:downsampled_se

I then QC-ed these with and without setting --unpaired:

downsampled_pe/downsampled_pe.gene_reads.gct:ENSG00000075624    ACTB    19
downsampled_pe_unpaired/downsampled_pe.gene_reads.gct:ENSG00000075624   ACTB    19
downsampled_se/downsampled_se.gene_reads.gct:ENSG00000075624    ACTB    0
downsampled_se_unpaired/downsampled_se.gene_reads.gct:ENSG00000075624   ACTB    13

As you can see for that example gene PE mapping always reports counts, however SE only works with the unpaired flag set. It was not clear to me from the help that this flag needs to be set for SE end data:

-u, --unpaired Treat all reads as unpaired, ignoring
filters which require properly paired
reads

I would like to ask if either:

a) it could be made clearer in the help docs that this must be set for single end data, and / or
b) rnaseq would detect if the alignments are single end and proceed accordingly.

Cheers,
António

0 genes parsed

Can you specify what features should be listed on the gtf file? I am trying to run rnaseqc on a genome that we annotated ourselves and I keep getting '0 genes parsed' with my input gtf file.

rRNA rate difference in new version.

Hi,
We have been using RNA-SeQC v1.1.8 for quite long. Recently, we have started using version 2.3.5, but we are getting rRNA rate 100 fold lower than the older version. Is rRNA rate calculated differently in RNA-SeQC2 ?

Legacy Compatability: Mapping Rate

@francois-a

Java tool reports mapping rate as mapped reads / primary, QC passed reads
New tool reports mapping rate as mapped reads / all reads

Should we change the metric in legacy mode?

problem with installation

Hi. I downloaded rnaseqc but the command "rnaseqc" does not exist.

I downloaded with the following command:
git clone --recursive https://github.com/broadinstitute/rnaseqc.git

And got this in my folder:
args.hxx bioio.hpp cloudbuild.yaml Dockerfile LICENSE Makefile Metrics.md python README.md SeqLib src test_data THIRD-PARTY-LICENSES.md

But I didn't see rnaseqc command anywhere in folder or sub-folders.
When I try calling rnaseqc command:

$ ./rnaseqc
-bash: ./rnaseqc: No such file or directory

I also tried downloading Git LFS since it's required but when I unzipped and submitted the command git lfs install it said the command didn't exist. I tried git-lfs which is what I see in the folder, it says "cannot execute binary file".

I have CLIP data and ultimately want to plot read coverage across transcripts to see if there is preference for peaks to be at 5'UTR, CDS, or 3'UTR. I think rnaseqc tool will help me get there, like the figure at the bottom of page 11 of manual:
https://software.broadinstitute.org/cancer/cga/sites/default/files/data/tools/rnaseqc/RNA-SeQC_Help_v1.1.2.pdf

If rnaseqc won't help me get there, please let me know since this is the reason I want to use it.

Thank you.

plot.py error: "dataset input should have multiple elements`

Hi there,

I have successfully run rnaseqc on a BAM file, and now I was attempting to use plot.py from the official Docker container to plot metrics:

root@012f04b3af10:/Users/rspreafico/Desktop/quantseq/out/rnaseqc# python3 /scripts/plot.py . out
Detected 1 samples
Loading metrics
Generating Fragment Size KDE
Generating QC figures
Traceback (most recent call last):
  File "/scripts/plot.py", line 285, in <module>
    main(parser.parse_args())
  File "/scripts/plot.py", line 154, in main
    ms=16
  File "/scripts/metrics.py", line 138, in plot_qc_figures
    metrics_plot(v, cohort_s, ylim=[0, 0.55], title='Intergenic rate', threshold=intergenic_rate, threshold_dir='gt', **metrics_args)
  File "/scripts/metrics.py", line 64, in metrics_plot
    sns.kdeplot(v, ax=dax, vertical=True, legend=False, shade=True)
  File "/usr/local/lib/python3.5/dist-packages/seaborn/distributions.py", line 691, in kdeplot
    cumulative=cumulative, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/seaborn/distributions.py", line 294, in _univariate_kdeplot
    x, y = _scipy_univariate_kde(data, bw, gridsize, cut, clip)
  File "/usr/local/lib/python3.5/dist-packages/seaborn/distributions.py", line 366, in _scipy_univariate_kde
    kde = stats.gaussian_kde(data, bw_method=bw)
  File "/usr/local/lib/python3.5/dist-packages/scipy/stats/kde.py", line 195, in __init__
    raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.

Here is my mapped.bam.metrics.tsv file:

Sample  mapped.bam
Mapping Rate    0.964203
Unique Rate of Mapped   1
Duplicate Rate of Mapped        0
Base Mismatch   0.00281886
End 1 Mapping Rate      0
End 2 Mapping Rate      0
End 1 Mismatch Rate     nan
End 2 Mismatch Rate     nan
Expression Profiling Efficiency 0.834477
High Quality Rate       0
Exonic Rate     0.869553
Intronic Rate   0.00322584
Intergenic Rate 0.0628047
Intragenic Rate 0.872779
Ambiguous Alignment Rate        0.0644161
High Quality Exonic Rate        nan
High Quality Intronic Rate      nan
High Quality Intergenic Rate    nan
High Quality Intragenic Rate    nan
High Quality Ambiguous Alignment Rate   nan
Discard Rate    0
rRNA Rate       0
End 1 Sense Rate        nan
End 2 Sense Rate        nan
Avg. Splits per Read    0.256174
Alternative Alignments  0
Chimeric Reads  0
Duplicate Reads 0
End 1 Antisense 0
End 2 Antisense 0
End 1 Bases     0
End 2 Bases     0
End 1 Mapped Reads      0
End 2 Mapped Reads      0
End 1 Mismatches        0
End 2 Mismatches        0
End 1 Sense     0
End 2 Sense     0
Exonic Reads    11942791
Failed Vendor QC        67407
High Quality Reads      0
Intergenic Reads        862584
Intragenic Reads        11987096
Ambiguous Reads 884716
Intronic Reads  44305
Low Mapping Quality     14244304
Low Quality Reads       13734396
Mapped Duplicate Reads  0
Mapped Reads    13734396
Mapped Unique Reads     13734396
Mismatched Bases        1935764
Reads excluded from exon counts 0
Reads used for Intron/Exon counts       13734396
rRNA Reads      0
Total Bases     686719800
Total Mapped Pairs      0
Total Reads     14311711
Unique Mapping, Vendor QC Passed Reads  14244304
Unpaired Reads  14244304
Read Length     50
Genes Detected  0
Estimated Library Complexity    0
Mean 3' bias    0
Median 3' bias  0
3' bias Std     0
3' bias MAD_Std 0
3' Bias, 25th Percentile        0
3' Bias, 75th Percentile        0
Median of Avg Transcript Coverage       0
Median of Transcript Coverage Std       0
Median of Transcript Coverage CV        0
Median Exon CV  0
Exon CV MAD     0

crams

Add a -f/--fasta option for crams. This keep the GC content computations as a precompiler switch

Strand argument clarification

I should likely be able to interpret this without help, but this isn't clear to me:

--stranded=[stranded] Use strand-specific metrics. Only
features on the same strand of a read
will be considered. Allowed values are
'RF', 'rf', 'FR', and 'fr'

Do ('RF', 'rf') and ('FR', 'fr') mean the same thing? Does RF mean 'reverse strand' and 'FR' mean 'forward strand'?

Genes have "0" coverage in sample.coverage.tsv but definitely not 0 in sample.exon_reads.gct

Hi,
I have rnaseq-qc process a batch of targeted RNA-seq data, but I find some genes have "0" coverage in sample.coverage.tsv but definitely not 0 in sample.exon_reads.gct. All my samples (>10) have the same issue, I hope I can get some help to debug / understand this.

Metrics
Sample	Seraseq
Mapping Rate	0.995594
Unique Rate of Mapped	1
Duplicate Rate of Mapped	0
Duplicate Rate of Mapped, excluding Globins	0
Base Mismatch	0.00219932
End 1 Mapping Rate	0.995782
End 2 Mapping Rate	0.995405
End 1 Mismatch Rate	0.00164327
End 2 Mismatch Rate	0.00275544
Expression Profiling Efficiency	0.693188
High Quality Rate	0.945726
Exonic Rate	0.696256
Intronic Rate	0.0614665
Intergenic Rate	0.146946
Intragenic Rate	0.757723
Ambiguous Alignment Rate	0.0953313
High Quality Exonic Rate	0.721175
High Quality Intronic Rate	0.0573186
High Quality Intergenic Rate	0.123681
High Quality Intragenic Rate	0.778493
High Quality Ambiguous Alignment Rate	0.0978252
Discard Rate	0
rRNA Rate	0
Chimeric Alignment Rate	0
End 1 Sense Rate	0.180894
End 2 Sense Rate	0.822415
Avg. Splits per Read	0.426095
Alternative Alignments	432393
Chimeric Reads	96219
Duplicate Reads	0
End 1 Antisense	1820735
End 2 Antisense	408876
End 1 Bases	211264741
End 2 Bases	211232455
End 1 Mapped Reads	2820704
End 2 Mapped Reads	2819634
End 1 Mismatches	347166
End 2 Mismatches	582039
End 1 Sense	402098
End 2 Sense	1893549
Exonic Reads	3927121
Failed Vendor QC	0
High Quality Reads	5334217
Intergenic Reads	828824
Intragenic Reads	4273813
Ambiguous Reads	537701
Intronic Reads	346692
Low Mapping Quality	286133
Low Quality Reads	306121
Mapped Duplicate Reads	0
Mapped Reads	5640338
Mapped Unique Reads	5640338
Mismatched Bases	929205
Non-Globin Reads	5640338
Non-Globin Duplicate Reads	0
Reads excluded from exon counts	0
Reads used for Intron/Exon counts	5640338
rRNA Reads	0
Total Bases	422497196
Total Mapped Pairs	2820704
Total Reads	6097695
Unique Mapping, Vendor QC Passed Reads	5665302
Unpaired Reads	0
Read Length	75
Genes Detected	325
Estimated Library Complexity	0
Genes used in 3' bias	250
Mean 3' bias	0.481574
Median 3' bias	0.466667
3' bias Std	0.253506
3' bias MAD_Std	0.244011
3' Bias, 25th Percentile	0.317972
3' Bias, 75th Percentile	0.653061
Median of Avg Transcript Coverage	40.5074
Median of Transcript Coverage Std	17.0874
Median of Transcript Coverage CV	0.577808
Median Exon CV	0.194139
Exon CV MAD	0.132782

An example of gene/exon is

Seraseq/Seraseq.coverage.tsv 
ENSG00000134259.3	0	0	nan
Seraseq/Seraseq.exon_reads.gct 
ENSG00000134259.3_1	NGF	205.873161
ENSG00000134259.3_2	NGF	180.986735
ENSG00000134259.3_3	NGF	299.923486
ENSG00000134259.3_4	NGF	327.935234
ENSG00000134259.3_5	NGF	211.807303
ENSG00000134259.3_6	NGF	254.474081

GTF of the gene

1       HAVANA  gene    119441651       119474455       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1";
1       HAVANA  transcript      119441651       119474455       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1";
1       HAVANA  exon    119474242       119474455       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_1; exon_number 1";
1       HAVANA  exon    119469133       119469234       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_2; exon_number 2";
1       HAVANA  exon    119467269       119467440       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_3; exon_number 3";
1       HAVANA  exon    119466059       119466226       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_4; exon_number 4";
1       HAVANA  exon    119456738       119456802       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_5; exon_number 5";
1       HAVANA  exon    119441651       119441748       .       -       .       gene_id "ENSG00000134259.3"; transcript_id "ENSG00000134259.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NGF"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "NGF"; level 2; havana_gene "OTTHUMG00000011880.1"; exon_id "ENSG00000134259.3_6; exon_number 6";

Happy to provide more information, or to share the bam.

Thanks!
Jiaan

getzlab / rnaseqc Goto Github PK

rnaseqc's Introduction

RNA-SeQC

Installing

Unit Tests

Usage

Command Line Usage:

OPTIONS:

Output files:

Metrics reported:

Legacy mode differences

rnaseqc's People

Contributors

Stargazers

Watchers

Forkers

rnaseqc's Issues

Recommend Projects

Recommend Topics

Recommend Org