deweylab / rsem Goto Github PK

View Code? Open in Web Editor NEW

398.0 22.0 118.0 51.88 MB

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data

Home Page: http://deweylab.biostat.wisc.edu/rsem/

License: GNU General Public License v3.0

C++ 83.07% Makefile 0.22% R 2.31% C 10.79% Perl 1.77% Java 0.04% Lua 0.12% Python 0.30% Shell 0.13% M4 0.05% Roff 1.20%

rsem's Introduction

README for RSEM

Bo Li (bli28 at mgh dot harvard dot edu)

Introduction
Compilation & Installation
Usage
- Build RSEM references using RefSeq, Ensembl, or GENCODE annotations
- Build RSEM references for untypical organisms
Example
Simulation
Generate Transcript-to-Gene-Map from Trinity Output
Differential Expression Analysis
Prior-Enhanced RSEM (pRSEM)
Authors
Acknowledgements
License

Introduction

RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides an user-friendly interface, supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, It can generate BAM and Wiggle files in both transcript-coordinate and genomic-coordinate. Genomic-coordinate files can be visualized by both UCSC Genome browser and Broad Institute's Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV. RSEM also has its own scripts to generate transcript read depth plots in pdf format. The unique feature of RSEM is, the read depth plots can be stacked, with read depth contributed to unique reads shown in black and contributed to multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.

Compilation & Installation

To compile RSEM, simply run

make

For Cygwin users, run

make cygwin=true

To compile EBSeq, which is included in the RSEM package, run

make ebseq

To install RSEM, simply put the RSEM directory in your environment's PATH variable. Alternatively, run

make install

By default, RSEM executables are installed to /usr/local/bin. You can change the installation location by setting DESTDIR and/or prefix variables. The RSEM executables will be installed to ${DESTDIR}${prefix}/bin. The default values of DESTDIR and prefix are DESTDIR= and prefix=/usr/local. For example,

make install DESTDIR=/home/my_name prefix=/software

will install RSEM executables to /home/my_name/software/bin.

Note that make install does not install EBSeq related scripts, such as rsem-generate-ngvector, rsem-run-ebseq, and rsem-control-fdr. But rsem-generate-data-matrix, which generates count matrix for differential expression analysis, is installed.

Prerequisites

C++, Perl and R are required to be installed.

To use the --gff3 option of rsem-prepare-reference, Python is also required to be installed.

To take advantage of RSEM's built-in support for the Bowtie/Bowtie 2/STAR/HISAT2 alignment program, you must have Bowtie/Bowtie2/STAR/HISAT2 installed.

Usage

I. Preparing Reference Sequences

RSEM can extract reference transcripts from a genome if you provide it with gene annotations in a GTF/GFF3 file. Alternatively, you can provide RSEM with transcript sequences directly.

Please note that GTF files generated from the UCSC Table Browser do not contain isoform-gene relationship information. However, if you use the UCSC Genes annotation track, this information can be recovered by downloading the knownIsoforms.txt file for the appropriate genome.

To prepare the reference sequences, you should run the rsem-prepare-reference program. Run

rsem-prepare-reference --help

to get usage information or visit the rsem-prepare-reference documentation page.

Build RSEM references using RefSeq, Ensembl, or GENCODE annotations

RefSeq and Ensembl are two frequently used annotations. For human and mouse, GENCODE annotaions are also available. In this section, we show how to build RSEM references using these annotations. Note that it is important to pair the genome with the annotation file for each annotation source. In addition, we recommend users to use the primary assemblies of genomes. Without loss of generality, we use human genome as an example and in addition build Bowtie indices.

For RefSeq, the genome and annotation file in GFF3 format can be found at RefSeq genomes FTP:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

For example, the human genome and GFF3 file locate at the subdirectory vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5. GCF_000001405.31_GRCh38.p5 is the latest annotation version when this section was written.

Download and decompress the genome and annotation files to your working directory:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5/GCF_000001405.31_GRCh38.p5_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5/GCF_000001405.31_GRCh38.p5_genomic.gff.gz

GCF_000001405.31_GRCh38.p5_genomic.fna contains all top level sequences, including patches and haplotypes. To obtain the primary assembly, run the following RSEM python script:

rsem-refseq-extract-primary-assembly GCF_000001405.31_GRCh38.p5_genomic.fna GCF_000001405.31_GRCh38.p5_genomic.primary_assembly.fna

Then type the following command to build RSEM references:

rsem-prepare-reference --gff3 GCF_000001405.31_GRCh38.p5_genomic.gff \
		       --trusted-sources BestRefSeq,Curated\ Genomic \
		       --bowtie \
		       GCF_000001405.31_GRCh38.p5_genomic.primary_assembly.fna \
		       ref/human_refseq

In the above command, --trusted-sources tells RSEM to only extract transcripts from RefSeq sources like BestRefSeq or Curated Genomic. By default, RSEM trust all sources. There is also an --gff3-RNA-patterns option and its default is mRNA. Setting --gff3-RNA-patterns mRNA,rRNA will allow RSEM to extract all mRNAs and rRNAs from the genome. Visit here for more details.

Because the gene and transcript IDs (e.g. gene1000, rna28655) extracted from RefSeq GFF3 files are hard to understand, it is recommended to turn on the --append-names option in rsem-calculate-expression for better interpretation of quantification results.

For Ensembl, the genome and annotation files can be found at Ensembl FTP.

Download and decompress the human genome and GTF files:

ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz

Then use the following command to build RSEM references:

rsem-prepare-reference --gtf Homo_sapiens.GRCh38.83.gtf \
		       --bowtie \
		       Homo_sapiens.GRCh38.dna.primary_assembly.fa \
		       ref/human_ensembl

If you want to use GFF3 file instead, which is unnecessary and not recommended, you should add option --gff3-RNA-patterns transcript because mRNA is replaced by transcript in Ensembl GFF3 files.

GENCODE only provides human and mouse annotations. The genome and annotation files can be found from GENCODE website.

Download and decompress the human genome and GTF files:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh38.primary_assembly.genome.fa.gz
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/gencode.v24.annotation.gtf.gz

Then type the following command:

rsem-prepare-reference --gtf gencode.v24.annotation.gtf \
		       --bowtie \
		       GRCh38.primary_assembly.genome.fa \
		       ref/human_gencode

Similar to Ensembl annotation, if you want to use GFF3 files (not recommended), add option --gff3-RNA-patterns transcript.

Build RSEM references for untypical organisms

For untypical organisms, such as viruses, you may only have a GFF3 file that containing only genes but not any transcripts. You need to turn on --gff3-genes-as-transcripts so that RSEM will make each gene as a unique transcript.

Here is an example command:

rsem-prepare-reference --gff3 virus.gff \
               --gff3-genes-as-transcripts \
               --bowtie \
               virus.genome.fa \
               ref/virus

II. Calculating Expression Values

To calculate expression values, you should run the rsem-calculate-expression program. Run

rsem-calculate-expression --help

to get usage information or visit the rsem-calculate-expression documentation page.

Calculating expression values from single-end data

For single-end models, users have the option of providing a fragment length distribution via the --fragment-length-mean and --fragment-length-sd options. The specification of an accurate fragment length distribution is important for the accuracy of expression level estimates from single-end data. If the fragment length mean and sd are not provided, RSEM will not take a fragment length distribution into consideration.

Using an alternative aligner

By default, RSEM automates the alignment of reads to reference transcripts using the Bowtie aligner. Turn on --bowtie2 for rsem-prepare-reference and rsem-calculate-expression will allow RSEM to use the Bowtie 2 alignment program instead. Please note that indel alignments, local alignments and discordant alignments are disallowed when RSEM uses Bowtie 2 since RSEM currently cannot handle them. See the description of --bowtie2 option in rsem-calculate-expression for more details. Similarly, turn on --star will allow RSEM to use the STAR aligner. Turn on --hisat2-hca will allow RSEM to use the HISAT2 aligner according to Human Cell Atals SMART-Seq2 pipeline. To use an alternative alignment program, align the input reads against the file reference_name.idx.fa generated by rsem-prepare-reference, and format the alignment output in SAM/BAM/CRAM format. Then, instead of providing reads to rsem-calculate-expression, specify the --alignments option and provide the SAM/BAM/CRAM file as an argument.

RSEM requires the alignments of a read to be adjacent. For paired-end reads, RSEM also requires the two mates of any alignment be adjacent. To check if your SAM/BAM/CRAM file satisfy the requirements, run

rsem-sam-validator <input.sam/input.bam/input.cram>

If your file does not satisfy the requirements, you can use convert-sam-for-rsem to convert it into a BAM file which RSEM can process. Run

convert-sam-for-rsem --help

to get usage information or visit the convert-sam-for-rsem documentation page.

Note that RSEM does ** not ** support gapped alignments. So make sure that your aligner does not produce alignments with intersions/deletions. In addition, you should make sure that you use reference_name.idx.fa, which is generated by RSEM, to build your aligner's indices.

III. Visualization

RSEM includes a copy of SAMtools. When --no-bam-output is not specified and --sort-bam-by-coordinate is specified, RSEM will produce these three files:sample_name.transcript.bam, the unsorted BAM file, sample_name.transcript.sorted.bam and sample_name.transcript.sorted.bam.bai the sorted BAM file and indices generated by the SAMtools included. All three files are in transcript coordinates. When users in addition specify the --output-genome-bam option, RSEM will produce three more files: sample_name.genome.bam, the unsorted BAM file, sample_name.genome.sorted.bam and sample_name.genome.sorted.bam.bai the sorted BAM file and indices. All these files are in genomic coordinates.

a) Converting transcript BAM file into genome BAM file

Normally, RSEM will do this for you via --output-genome-bam option of rsem-calculate-expression. However, if you have run rsem-prepare-reference and use reference_name.idx.fa to build indices for your aligner, you can use rsem-tbam2gbam to convert your transcript coordinate BAM alignments file into a genomic coordinate BAM alignments file without the need to run the whole RSEM pipeline.

Usage:

rsem-tbam2gbam reference_name unsorted_transcript_bam_input genome_bam_output

reference_name : The name of reference built by rsem-prepare-reference unsorted_transcript_bam_input : This file should satisfy: 1) the alignments of a same read are grouped together, 2) for any paired-end alignment, the two mates should be adjacent to each other, 3) this file should not be sorted by samtools genome_bam_output : The output genomic coordinate BAM file's name

b) Generating a Wiggle file

A wiggle plot representing the expected number of reads overlapping each position in the genome/transcript set can be generated from the sorted genome/transcript BAM file output. To generate the wiggle plot, run the rsem-bam2wig program on the sample_name.genome.sorted.bam/sample_name.transcript.sorted.bam file.

Usage:

rsem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight]

sorted_bam_input : Input BAM format file, must be sorted
wig_output : Output wiggle file's name, e.g. output.wig
wiggle_name : The name of this wiggle plot
--no-fractional-weight : If this is set, RSEM will not look for "ZW" tag and each alignment appeared in the BAM file has weight 1. Set this if your BAM file is not generated by RSEM. Please note that this option must be at the end of the command line

c) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer(IGV)

For UCSC genome browser, please refer to the UCSC custom track help page.

For integrative genomics viewer, please refer to the IGV home page. Note: Although IGV can generate read depth plot from the BAM file given, it cannot recognize "ZW" tag RSEM puts. Therefore IGV counts each alignment as weight 1 instead of the expected weight for the plot it generates. So we recommend to use the wiggle file generated by RSEM for read depth visualization.

Here are some guidance for visualizing transcript coordinate files using IGV:

Import the transcript sequences as a genome

Select File -> Import Genome, then fill in ID, Name and Fasta file. Fasta file should be reference_name.idx.fa. After that, click Save button. Suppose ID is filled as reference_name, a file called reference_name.genome will be generated. Next time, we can use: File -> Load Genome, then select reference_name.genome.

Load visualization files

Select File -> Load from File, then choose one transcript coordinate visualization file generated by RSEM. IGV might require you to convert wiggle file to tdf file. You should use igvtools to perform this task. One way to perform the conversion is to use the following command:

igvtools tile reference_name.transcript.wig reference_name.transcript.tdf reference_name.genome

d) Generating Transcript Wiggle Plots

To generate transcript wiggle plots, you should run the rsem-plot-transcript-wiggles program. Run

rsem-plot-transcript-wiggles --help

to get usage information or visit the rsem-plot-transcript-wiggles documentation page.

e) Visualize the model learned by RSEM

RSEM provides an R script, rsem-plot-model, for visulazing the model learned.

Usage:

rsem-plot-model sample_name output_plot_file

sample_name: the name of the sample analyzed
output_plot_file: the file name for plots generated from the model. It is a pdf file

The plots generated depends on read type and user configuration. It may include fragment length distribution, mate length distribution, read start position distribution (RSPD), quality score vs observed quality given a reference base, position vs percentage of sequencing error given a reference base and alignment statistics.

fragment length distribution and mate length distribution: x-axis is fragment/mate length, y axis is the probability of generating a fragment/mate with the associated length

RSPD: Read Start Position Distribution. x-axis is bin number, y-axis is the probability of each bin. RSPD can be used as an indicator of 3' bias

Quality score vs. observed quality given a reference base: x-axis is Phred quality scores associated with data, y-axis is the "observed quality", Phred quality scores learned by RSEM from the data. Q = -10log_10(P), where Q is Phred quality score and P is the probability of sequencing error for a particular base

Position vs. percentage sequencing error given a reference base: x-axis is position and y-axis is percentage sequencing error

Alignment statistics: It includes a histogram and a pie chart. For the histogram, x-axis shows the number of isoform-level alignments a read has and y-axis provides the number of reads with that many alignments. The inf in x-axis means number of reads filtered due to too many alignments. For the pie chart, four categories of reads --- unalignable, unique, isoform-levelmulti-mapping, filtered -- are plotted and their percentages are noted. In both the histogram and the piechart, numbers belong to unalignable, unique, multi-mapping, and filtered are colored as green, blue, gray and red.

Example

Suppose we download the mouse genome from UCSC Genome Browser. We do not add poly(A) tails and use /ref/mouse_0 as the reference name. We have a FASTQ-formatted file, mmliver.fq, containing single-end reads from one sample, which we call mmliver_single_quals. We want to estimate expression values by using the single-end model with a fragment length distribution. We know that the fragment length distribution is approximated by a normal distribution with a mean of 150 and a standard deviation of 35. We wish to generate 95% credibility intervals in addition to maximum likelihood estimates. RSEM will be allowed 1G of memory for the credibility interval calculation. We will visualize the probabilistic read mappings generated by RSEM on UCSC genome browser. We will generate a list of transcript wiggle plots (output.pdf) for the genes provided in gene_ids.txt. We will visualize the models learned in mmliver_single_quals.models.pdf

The commands for this scenario are as follows:

rsem-prepare-reference --gtf mm9.gtf --transcript-to-gene-map knownIsoforms.txt --bowtie --bowtie-path /sw/bowtie /data/mm9 /ref/mouse_0
rsem-calculate-expression --bowtie-path /sw/bowtie --phred64-quals --fragment-length-mean 150.0 --fragment-length-sd 35.0 -p 8 --output-genome-bam --calc-ci --ci-memory 1024 /data/mmliver.fq /ref/mouse_0 mmliver_single_quals
rsem-bam2wig mmliver_single_quals.sorted.bam mmliver_single_quals.sorted.wig mmliver_single_quals
rsem-plot-transcript-wiggles --gene-list --show-unique mmliver_single_quals gene_ids.txt output.pdf 
rsem-plot-model mmliver_single_quals mmliver_single_quals.models.pdf

Simulation

RSEM provides users the rsem-simulate-reads program to simulate RNA-Seq data based on parameters learned from real data sets. Run

rsem-simulate-reads

to get usage information or read the following subsections.

Usage:

rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]

reference_name: The name of RSEM references, which should be already generated by rsem-prepare-reference

estimated_model_file: This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, w/o quality score) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using rsem-calculate-expression. The file can be found under the sample_name.stat folder with the name of sample_name.model. model_file_description.txt provides the format and meanings of this file.

estimated_isoform_results: This file contains expression levels for all isoforms recorded in the reference. It can be learned using rsem-calculate-expression from real data. The corresponding file users want to use is sample_name.isoforms.results. If simulating from user-designed expression profile is desired, start from a learned sample_name.isoforms.results file and only modify the TPM column. The simulator only reads the TPM column. But keeping the file format the same is required. If the RSEM references built are aware of allele-specific transcripts, sample_name.alleles.results should be used instead.

theta0: This parameter determines the fraction of reads that are coming from background "noise" (instead of from a transcript). It can also be estimated using rsem-calculate-expression from real data. Users can find it as the first value of the third line of the file sample_name.stat/sample_name.theta.

N: The total number of reads to be simulated. If rsem-calculate-expression is executed on a real data set, the total number of reads can be found as the 4th number of the first line of the file sample_name.stat/sample_name.cnt.

output_name: Prefix for all output files.

--seed seed: Set seed for the random number generator used in simulation. The seed should be a 32-bit unsigned integer.

-q: Set it will stop outputting intermediate information.

Outputs:

output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels estimated by counting where each simulated read comes from. output_name.sim.alleles.results: Allele-specific expression levels estimated by counting where each simulated read comes from.

output_name.fa if single-end without quality score;
output_name.fq if single-end with quality score;
output_name_1.fa & output_name_2.fa if paired-end without quality score;
output_name_1.fq & output_name_2.fq if paired-end with quality score.

Format of the header line: Each simulated read's header line encodes where it comes from. The header line has the format:

{>/@}_rid_dir_sid_pos[_insertL]

{>/@}: Either '>' or '@' must appear. '>' appears if FASTA files are generated and '@' appears if FASTQ files are generated

rid: Simulated read's index, numbered from 0

dir: The direction of the simulated read. 0 refers to forward strand ('+') and 1 refers to reverse strand ('-')

sid: Represent which transcript this read is simulated from. It ranges between 0 and M, where M is the total number of transcripts. If sid=0, the read is simulated from the background noise. Otherwise, the read is simulated from a transcript with index sid. Transcript sid's transcript name can be found in the transcript_id column of the sample_name.isoforms.results file (at line sid + 1, line 1 is for column names)

pos: The start position of the simulated read in strand dir of transcript sid. It is numbered from 0

insertL: Only appear for paired-end reads. It gives the insert length of the simulated read.

Example:

Suppose we want to simulate 50 millon single-end reads with quality scores and use the parameters learned from Example. In addition, we set theta0 as 0.2 and output_name as simulated_reads. The command is:

rsem-simulate-reads /ref/mouse_0 mmliver_single_quals.stat/mmliver_single_quals.model mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads

Generate Transcript-to-Gene-Map from Trinity Output

For Trinity users, RSEM provides a perl script to generate transcript-to-gene-map file from the fasta file produced by Trinity.

Usage:

extract-transcript-to-gene-map-from-trinity trinity_fasta_file map_file

trinity_fasta_file: the fasta file produced by trinity, which contains all transcripts assembled.
map_file: transcript-to-gene-map file's name.

Differential Expression Analysis

Popular differential expression (DE) analysis tools such as edgeR and DESeq do not take variance due to read mapping uncertainty into consideration. Because read mapping ambiguity is prevalent among isoforms and de novo assembled transcripts, these tools are not ideal for DE detection in such conditions.

EBSeq, an empirical Bayesian DE analysis tool developed in UW-Madison, can take variance due to read mapping ambiguity into consideration by grouping isoforms with parent gene's number of isoforms. In addition, it is more robust to outliers. For more information about EBSeq (including the paper describing their method), please visit EBSeq's website.

RSEM includes EBSeq in its folder named EBSeq. To use it, first type

make ebseq

to compile the EBSeq related codes.

EBSeq requires gene-isoform relationship for its isoform DE detection. However, for de novo assembled transcriptome, it is hard to obtain an accurate gene-isoform relationship. Instead, RSEM provides a script rsem-generate-ngvector, which clusters transcripts based on measures directly relating to read mappaing ambiguity. First, it calculates the 'unmappability' of each transcript. The 'unmappability' of a transcript is the ratio between the number of k mers with at least one perfect match to other transcripts and the total number of k mers of this transcript, where k is a parameter. Then, Ng vector is generated by applying Kmeans algorithm to the 'unmappability' values with number of clusters set as 3. This program will make sure the mean 'unmappability' scores for clusters are in ascending order. All transcripts whose lengths are less than k are assigned to cluster 3. Run

rsem-generate-ngvector --help

to get usage information or visit the rsem-generate-ngvector documentation page.

If your reference is a de novo assembled transcript set, you should run rsem-generate-ngvector first. Then load the resulting output_name.ngvec into R. For example, you can use

NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")

. After that, set "NgVector = NgVec" for your differential expression test (either EBTest or EBMultiTest).

For users' convenience, RSEM also provides a script rsem-generate-data-matrix to extract input matrix from expression results:

rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix

The results files are required to be either all gene level results or all isoform level results. You can load the matrix into R by

IsoMat <- data.matrix(read.table(file="output_name.counts.matrix"))

before running either EBTest or EBMultiTest.

Lastly, RSEM provides two scripts, rsem-run-ebseq and rsem-control-fdr, to help users find differential expressed genes/transcripts. First, rsem-run-ebseq calls EBSeq to calculate related statistics for all genes/transcripts. Run

rsem-run-ebseq --help

to get usage information or visit the rsem-run-ebseq documentation page. Second, rsem-control-fdr takes rsem-run-ebseq 's result and reports called differentially expressed genes/transcripts by controlling the false discovery rate. Run

rsem-control-fdr --help

to get usage information or visit the rsem-control-fdr documentation page. These two scripts can perform DE analysis on either 2 conditions or multiple conditions.

Please note that rsem-run-ebseq and rsem-control-fdr use EBSeq's default parameters. For advanced use of EBSeq or information about how EBSeq works, please refer to EBSeq's manual.

Questions related to EBSeq should be sent to Ning Leng.

Prior-Enhanced RSEM (pRSEM)

I. Overview

Prior-enhanced RSEM (pRSEM) uses complementary information (e.g. ChIP-seq data) to allocate RNA-seq multi-mapping fragments. We included pRSEM code in the subfolder pRSEM/ as well as in RSEM's scripts rsem-prepare-reference and rsem-calculate-expression.

II. Demo

To get a quick idea on how to use pRSEM, you can try this demo. It provides a single script, named run_pRSEM_demo.sh, which allows you to run all pRSEM's functions. It also contains detailed descriptions of pRSEM's workflow, input and output files.

III. Installation

To compile pRSEM, type

make pRSEM

Note that you need to first compile RSEM before compiling pRSEM. Currently, pRSEM has only been tested on Linux.

IV. Example

To run pRSEM on the RSEM example above, you need to provide:

ChIP-seq sequencing file(s) in FASTQ format or a ChIP-seq peak file in BED format. They will be used by pRSEM to obtain complementatry information for allocating RNA-seq multi-mapping fragments.
a genome mappability file in bigWig format to let pRSEM build a training set of isoforms to learn prior. Mappability can be obtained from UCSC's ENCODE composite track for human hg19 and mouse mm9. For other genomes, you can generate the mappability file by following [this tutorial] (http://wiki.bits.vib.be/index.php/Create_a_mappability_track#Install_and_run_the_GEM_library_tools).

Assuming you would like to use RNA Pol II's ChIP-seq sequencing files /data/mmliver_PolIIRep1.fq.gz and /data/mmliver_PolIIRep2.fq.gz, with ChIP-seq control /data/mmliver_ChIPseqCtrl.fq.gz. Also, assuming the mappability file for mouse genome is /data/mm9.bigWig and you prefer to use STAR located at /sw/STAR to align RNA-seq fragments and use Bowtie to align ChIP-seq reads. Then, you can use the following commands to run pRSEM:

rsem-prepare-reference --gtf mm9.gtf \
                       --star \
                       --star-path /sw/STAR \
                       -p 8 \
                       --prep-pRSEM \
                       --bowtie-path /sw/bowtie \
                       --mappability-bigwig-file /data/mm9.bigWig \
                       /data/mm9 \
                       /ref/mouse_0

rsem-calculate-expression --star \
                          --star-path /sw/STAR \
                          --calc-pme \
                          --run-pRSEM \
                          --chipseq-target-read-files /data/mmliver_PolIIRep1.fq.gz,/data/mmliver_PolIIRep2.fq.gz \
                          --chipseq-control-read-files /data/mmliver_ChIPseqCtrl.fq.gz \
                          --bowtie-path /sw/bowtie \
                          -p 8 \
                          /data/mmliver.fq \
                          /ref/mouse_0 \
                          mmliver_single_quals

To find out more about pRSEM options and examples, you can use the commands:

rsem-prepare-reference --help

and

rsem-calculate-expression --help

V. System Requirements

Linux
Perl version >= 5.8.8
Python version >= 2.7.3
R version >= 3.3.1
Bioconductor 3.3

VI. Required External Packages

All the following packages will be automatically installed when compiling pRSEM.

data.table 1.9.6: an extension of R's data.frame, heavily used by pRSEM.
GenomicRanges 1.24.3: efficient representing and manipulating genomic intervals, heavily used by pRSEM.
ShortRead 1.30.0: guessing the encoding of ChIP-seq FASTQ file's quality score.
caTools 1.17.1: used for SPP Peak Caller.
SPP Peak Caller: ChIP-seq peak caller. Source code was slightly modified in terms of included headers in order to be compiled under R v3.3.1.
IDR: calculating Irreproducible Discovery Rate to call peaks from multiple ChIP-seq replicates.

Authors

Bo Li and Colin Dewey designed the RSEM algorithm. Bo Li implemented the RSEM software. Peng Liu contributed the STAR aligner options and prior-enhanced RSEM (pRSEM).

Acknowledgements

RSEM uses the Boost C++ and SAMtools libraries. RSEM includes EBSeq for differential expression analysis.

We thank earonesty, Dr. Samuel Arvidsson, John Marshall, and Michael R. Crusoe for contributing patches.

We thank Han Lin, j.miller, Joël Fillon, Dr. Samuel G. Younkin, Malcolm Cook, Christina Wells, Uroš Šipetić, outpaddling, rekado, and Josh Richer for suggesting possible fixes.

Note that bam_sort.c of SAMtools is slightly modified so that samtools sort -n will not move the two mates of paired-end alignments apart. In addition, we turn on the --without-curses option when configuring SAMtools and thus SAMtools' curses-based tview subcommand is not built.

License

RSEM is licensed under the GNU General Public License v3.

rsem's People

Contributors

Stargazers

Watchers

Forkers

earonesty lonsbio sbenz yangming joshuaar hnakhoul lazycrazyowl marvel64 lry198010 bimsbbioinfo lengning pliu55 lran2008 pimentel renzhonglu cuskinfor rob-p lingdudefeiteng hotdogee jmarshall nipuntalwar mr-c yolandaly jianlian92 nmerowsky cwt1 genomicsnx ckmah anna-alemany alenzhao dorothyzh jchenpku pavlidislab mbernste deto mingleiyang dvdhover flyfl1992 inutano scchess josator aemoore62 tw7649116 jenni-westoby ttdtrang noahpieta fengtieshan kdh4win4 mtelpoukhovskaia tintingli homereamk juadiegaitan celalp yasushi-saito bioinfoacademy fanying2015 jmsendoya ccbio kc-lan dayedepps jun-lizst jieyang031 ambarishk dcjones angsim hanlabunlv conciergesoftwaredesign cntalent-jin jiananlin wangccia windlhg bpar1 epdollinger ahoenselaar liupfskygre xjyx hzongyao chrislou-bioinfo jianguozhou3 yuanjingnan fuchunjin ryanyip-kat bai-junyu howtofindme rajesh-ibm-power standardgalactic nh13 wjt0925 michaelcraige kai2june kpatel427 jing-xinxing suhasaii hi-mydays wailz lialmonacid khanghoj xingjianfeng100 dmgatti veeramv

rsem's Issues

rsem-calculate-expression error

Hello,

I am trying to quantify results from STAR but I am getting error using the rsem-calculate-expression command:

This is with version: RSEM v1.2.31

$ rsem-calculate-expression --version
Current version: RSEM v1.2.31

RSEM Reference:

$ ls rsem_ref_hg38
gencode.v23.annotation.gtf  hg38.chrlist  hg38.fa  hg38.grp  hg38.idx.fa  hg38.n2g.idx.fa  hg38.seq  hg38.ti  hg38.transcripts.fa

STAR command:

$ /mnt/isilon/cbmi/variome/bin/star/STAR-2.5.2b/bin/Linux_x86_64/STAR --runThreadN 4 --limitBAMsortRAM 33874361894 --genomeDir /mnt/isilon/cbmi/variome/reference/star_hg38 --outFileNamePrefix rna --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within --quantMode TranscriptomeSAM --outSAMattributes NH HI AS NM MD --outFilterType BySJout --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --readFilesIn CHP134_R1.fastq.gz CHP134_R2.fastq.gz --readFilesCommand zcat

This generated the following bam files:

rnaAligned.sortedByCoord.out.bam
rnaAligned.toTranscriptome.out.bam

RSEM command:

$ rsem-calculate-expression --paired-end --no-bam-output --quiet --no-qualities -p 4 --forward-prob 0.5 --seed-length 25 --fragment-length-mean -1.0 --bam rnaAligned.toTranscriptome.out.bam /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38 rna

rsem-parse-alignments /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38 rna.temp/rna rna.stat/rna rnaAligned.toTranscriptome.out.bam 2 -tag XM -q

rsem-build-read-index 32 0 1 rna.temp/rna_alignable_1.fa rna.temp/rna_alignable_2.fa

rsem-run-em /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38 2 rna rna.temp/rna rna.stat/rna -p 4 -q
rsem-run-em: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by rsem-run-em)
"rsem-run-em /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38 2 rna rna.temp/rna rna.stat/rna -p 4 -q" failed! Plase check if you provide correct parameters/options for the pipeline!

What am I doing wrong here?

My calculations of the TPMs are different from RSEM outputs

Hello,

I used the RSEM output gene effective_length and expected_count to compute the TPM, but the results are different from the TPM output from RSEM. Are there any ways for me to correct the discrepancies? Thank you.

Transcript-to-gene-map and NgVector

The rsem-prepare-reference script allows you to specify the mapping of transcripts to genes. The rsem-run-ebseq script utilizes this mapping information to do an isoform-level gene expression analysis, but requires you to provide this information as an "ngvector" file. I looked at the files generated by the rsem-prepare-reference script, but none of them look like an ngvector file.

I see the documentation for the rsem-generate-ngvector script, but it seems to compute its own mapping rather than use the one provided with the --transcript-to-gene-map option in rsem-prepare-reference.

Am I missing something? If I already have the mapping of transcripts to genes, how do I provide this information to rsem-run-ebseq?

Assertion iter == dict.end() failed

Running under Ubuntu 64 bit, I get the following problem under 1.1.18-modified, but NOT under 1.1.12. The problem seems to be confined to rsem-parse-alignments because if I replace rsem-parse-alignments with that from v1.1.12, everything works. Is this an error in v1.1.18 or is this assertion error somehow meaningful?

fastq-dump SRR020782.lite.sra
rsem-calculate-expression -p 8 SRR020782.fastq rsem-reference/human/human SRR020782

...

# reads with at least one reported alignment: 2217543 (25.14%)
# reads that failed to align: 2936010 (33.28%)
# reads with alignments suppressed due to -m: 3668498 (41.58%)
Reported 89680495 alignments to 1 output stream(s)
[samopen] SAM header is present: 37959 sequences.
rsem-parse-alignments: Transcripts.h:103: void Transcripts::buildMappings(int, char**): Assertion `iter == dict.end()' failed.

Incorrect parameters/options

Hello,

I am trying the same command with two different versions and don't understand what's going wrong:

This is with version: RSEM v1.2.31

$ rsem-prepare-reference \
-p 4 \
--gtf /mnt/isilon/cbmi/variome/reference/gencode/v23/gencode.v23.annotation.gtf \
/mnt/isilon/cbmi/variome/reference/hg38/hg38_no_alt.fa \
/mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38

rsem-extract-reference-transcripts /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38 0 /mnt/isilon/cbmi/variome/reference/gencode/v23/gencode.v23.annotation.gtf None 0 /mnt/isilon/cbmi/variome/reference/hg38/hg38_no_alt.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
Parsed 800000 lines
Parsed 1000000 lines
Parsed 1200000 lines
Parsed 1400000 lines
Parsed 1600000 lines
Parsed 1800000 lines
Parsed 2000000 lines
Parsed 2200000 lines
Parsed 2400000 lines
Parsing gtf File is done!
/mnt/isilon/cbmi/variome/reference/hg38/hg38_no_alt.fa is processed!
198619 transcripts are extracted.
Extracting sequences is done!
Group File is generated!
Transcript Information File is generated!
Chromosome List File is generated!
Extracted Sequences File is generated!

rsem-preref /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38.transcripts.fa 1 /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38
Refs.makeRefs finished!
Refs.saveRefs finished!
/mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38.idx.fa is generated!
/mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38.n2g.idx.fa is generated!

This is with an older version 1.2.25 which I need to replicate this pipeline:

$ /mnt/isilon/cbmi/variome/bin/rsem/RSEM-1.2.25/rsem-prepare-reference \
-p 4 \
--gtf /mnt/isilon/cbmi/variome/reference/gencode/v23/gencode.v23.annotation.gtf \
/mnt/isilon/cbmi/variome/reference/hg38/hg38_no_alt.fa \ 
/mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38

rsem-extract-reference-transcripts /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38 0 /mnt/isilon/cbmi/variome/reference/gencode/v23/gencode.v23.annotation.gtf 0 /mnt/isilon/cbmi/variome/reference/hg38/hg38_no_alt.fa
Usage: rsem-extract-reference-transcripts refName quiet gtfF sources hasMappingFile [mappingFile] chromosome_file_1 [chromosome_file_2 ...]
"rsem-extract-reference-transcripts /mnt/isilon/cbmi/variome/reference/rsem_hg38/hg38 0 /mnt/isilon/cbmi/variome/reference/gencode/v23/gencode.v23.annotation.gtf 0 /mnt/isilon/cbmi/variome/reference/hg38/hg38_no_alt.fa" failed! Plase check if you provide correct parameters/options for the pipeline!

EOF marker is absent

Hi, I was running the following command:

rsem-calculate-expression --paired-end --alignments Aligned.toTranscriptome.out.bam GCA_000001405.15_GRCh38_no_alt sample1

And an error message popped out:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated.

I checked the integrity of Aligned.toTranscriptome.out.bam by samtools quickcheck, but it seemed intact. Should I worry about the error message?

RSEM version: 1.2.31
samtools version: 1.4
STAR version: 2.5.2b

`rsem-run-ebseq` is not installed by `make install`

Hello.

git clone https://github.com/deweylab/RSEM/ 
make && make ebseq && mkdir ../bin && make install DESTDIR=../bin

The commands related to EBSeq are not installed to ../bin.
I want that make install installsrsem-run-ebseq .

Thanks.

Allow Bowtie 2 Local Mapping

In end-to-end mode, reads which map partially beyond the end of a reference sequence are discarded. For example, if there's a 100 base read and positions 11 to 100 in the read match the reference sequence's bases 1 to 90, then positions 1 to 10 of the read are beyond the reference sequence which causes the read to be discarded. If mapping was done in local mode, however, such reads might be kept. This is a problem when the RNA-seq assembly is incomplete or the database of transcripts only records the CDS sequence, omitting the UTRs.

Isoforms Results Contains Large Expected Counts When Unsupported by Reads

I manually inspected the top isoforms and I think I could have found a major bug in RSEM.

Consider a particular transcript quantitated in two samples. In the first sample, RSEM estimates 0 counts.

ENST00000589265.5	ENSG00000030582.16	1632	1468.72	0.00	0.00	0.00	0.00

But, in the second sample, it estimates 1069 counts.

ENST00000589265.5	ENSG00000030582.16	1632	1465.16	1068.58	30.85	21.32	5.18

This result contradicts the contents of the transcript-wise BAM files. All mappings for both samples have mapping quality of 0.

After filtering, it can be observed that all of the reads are removed.

Also, I mapped this dataset to the genome with STAR and viewed the mapping result by IGV's Sashimi Plot and there are 0 reads supporting the junction indicated by pink arrows.

This problem causes many of the top-ranked transcripts to be false positives, because they really should be 0 reads in both conditions. I am using RSEM 1.3 with -p 2 --strandedness reverse --paired-end options. I can share the input files, if needed.

EBSeq error: "argument is of length zero"

I have the following data file, generated by rsem-generate-data-matrix, for a small toy data set.

    "rsem-q1-mrna.genes.results"    "rsem-q2-mrna.genes.results"    "rsem-w1-mrna.genes.results"    "rsem-w2-mrna.genes.results"
"isoform1"  99.85   99.85   507.39  507.39
"isoform2"  474.15  474.15  70.61   70.61

When I run EBSeq, however, I get the following error.

Loading required package: gplots
Loading required package: gtools
Loading required package: gdata
gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.

gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.

Attaching package: ‘gdata’

The following object(s) are masked from ‘package:stats’:

    nobs

The following object(s) are masked from ‘package:utils’:

    object.size

Loading required package: caTools
Loading required package: grid
Loading required package: KernSmooth
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
Loading required package: MASS

Attaching package: ‘gplots’

The following object(s) are masked from ‘package:stats’:

    lowess

Error in if (!length(sizeFactors) == ncol(Data)) sizeFactors = sizeFactors[NotAllZeroNames,  :
  argument is of length zero
Calls: EBTest

Can you provide any insight as to the cause of this error?

Does it need to create the fastq-index files?

Hello,

we try to analyze the expression and read depth of TCGA-Data from ovarian cancer. The alignment was already made by STAR. So we just want to calculate the expression with rsem. Below is the command we used. But the program starts to build an index fastq file ("*_alignable.fastq"). This takes very long. For some of the samples more than 5 days. Is this step necessary even if we don't want to have .bam files in the end?

rsem-calculate-expression -p 5 --paired-end
--alignments
--estimate-rspd
--append-names
--no-bam-output
--calc-ci
--ci-memory 30000
--seed 5
$input_file.Aligned.toTranscriptome.out.bam
$ref_file $output_prefix

Thanks

problem installing under cygwin

when I comiled the source under cygwin, I met the following error:
anything missing?
{ RSEM-master } » make cygwin=true /cygdrive/g/NGS/bin/RSEM-master 2
g++ -pthread -o rsem-parse-alignments parseIt.o samtools-1.3/htslib-1.3/libhts.a -lz
parseIt.o:parseIt.cpp:(.text+0x256e): undefined reference to bam_hdr_destroy' parseIt.o:parseIt.cpp:(.text+0x2576): undefined reference to hts_close'
parseIt.o:parseIt.cpp:(.text+0x257f): undefined reference to bam_destroy1' parseIt.o:parseIt.cpp:(.text+0x26d6): undefined reference to sam_read1'
parseIt.o:parseIt.cpp:(.text+0x29bb): undefined reference to bam_aux_get' parseIt.o:parseIt.cpp:(.text+0x29cc): undefined reference to bam_aux2i'
parseIt.o:parseIt.cpp:(.text+0x2c68): undefined reference to sam_read1' parseIt.o:parseIt.cpp:(.text+0x2eeb): undefined reference to bam_aux_get'
parseIt.o:parseIt.cpp:(.text+0x2efc): undefined reference to bam_aux2i' parseIt.o:parseIt.cpp:(.text+0x317d): undefined reference to sam_read1'
parseIt.o:parseIt.cpp:(.text+0x3195): undefined reference to sam_read1' parseIt.o:parseIt.cpp:(.text+0x37fb): undefined reference to bam_aux_get'
parseIt.o:parseIt.cpp:(.text+0x380c): undefined reference to bam_aux2i' parseIt.o:parseIt.cpp:(.text+0x38bb): undefined reference to bam_aux_get'
parseIt.o:parseIt.cpp:(.text+0x38c8): undefined reference to bam_aux2i' parseIt.o:parseIt.cpp:(.text+0x3d68): undefined reference to sam_read1'
parseIt.o:parseIt.cpp:(.text+0x3d80): undefined reference to sam_read1' parseIt.o:parseIt.cpp:(.text+0x43cb): undefined reference to bam_aux_get'
parseIt.o:parseIt.cpp:(.text+0x43dc): undefined reference to bam_aux2i' parseIt.o:parseIt.cpp:(.text+0x455b): undefined reference to bam_aux_get'
parseIt.o:parseIt.cpp:(.text+0x4568): undefined reference to bam_aux2i' parseIt.o:parseIt.cpp:(.text+0x678d): undefined reference to hts_open'
parseIt.o:parseIt.cpp:(.text+0x67a5): undefined reference to hts_set_fai_filename' parseIt.o:parseIt.cpp:(.text+0x67ad): undefined reference to sam_hdr_read'
parseIt.o:parseIt.cpp:(.text+0x67d5): undefined reference to bam_init1' parseIt.o:parseIt.cpp:(.text+0x67de): undefined reference to bam_init1'
parseIt.o:parseIt.cpp:(.text+0x258e): undefined reference to `bam_destroy1'
collect2.exe: error: ld returned 1 exit status
Makefile:81: recipe for target 'rsem-parse-alignments' failed
make: *** [rsem-parse-alignments] Error 1

rsem-calculate-expression fails to find sam/samtools when run through a symbolic link

if you change line 130 of rsem-calculate-expression

from:
my $dir = "$FindBin::Bin/";

to:
my $dir = "$FindBin::RealBin/";

then if it gets run via a symlink, sam/samtools will be found in line 254, where you:

$command .= " | $dir\sam/samtools view -S -b -o $imdName.bam -";

rsem-parse-alignments error

I am using RSEM v.1.2.31 and I’ve run the following:

rsem/rsem-prepare-reference --num-threads 8 --star --gtf /ref/gencode.v19.annotation.hs37d5_chr.gtf /ref/hs37d5.fa /result/rsem_ref

rsem/rsem-calculate-expression --paired-end --star --num-threads 5 /fastq/20150507_HBR_2_mRNA_L1.LB2_1.clipped.fastq /fastq/20150507_HBR_2_mRNA_L1.LB2_2.clipped.fastq /ref/rsem_ref /output/20150507_HBR_2_mRNA_L1.LB2

The STAR mapping works fine, but the pipeline fails on rsem-parse-alignments (see below). Do you have any suggestions for getting past this step?

rsem-parse-alignments /ref/rsem_ref /output/20150507_HBR_2_mRNA_L1.LB2.temp/20150507_HBR_2_mRNA_L1.LB2 /output/20150507_HBR_2_mRNA_L1.LB2.stat/20150507_HBR_2_mRNA_L1.LB2 /output/20150507_HBR_2_mRNA_L1.LB2.temp/20150507_HBR_2_mRNA_L1.LB2.bam 3 -tag XM
Parsed 1000000 entries
Parsed 2000000 entries
Parsed 3000000 entries
Parsed 4000000 entries
Parsed 5000000 entries
Parsed 6000000 entries
Parsed 7000000 entries
Parsed 8000000 entries
Parsed 9000000 entries
Parsed 10000000 entries
Parsed 11000000 entries
Parsed 12000000 entries
Parsed 13000000 entries
Parsed 14000000 entries
Parsed 15000000 entries
Parsed 16000000 entries
Parsed 17000000 entries
Warning: Detected a read pair whose two mates have different names--HWI:1:X:1:1216:4127:50166 and HWI:1:X:1:1216:4180:50180!
Read HWI:1:X:1:1216:4127:50166: RSEM currently does not support gapped alignments, sorry!
"rsem-parse-alignments /ref/rsem_ref /output/20150507_HBR_2_mRNA_L1.LB2.temp/20150507_HBR_2_mRNA_L1.LB2 /output/20150507_HBR_2_mRNA_L1.LB2.stat/20150507_HBR_2_mRNA_L1.LB2 /output/20150507_HBR_2_mRNA_L1.LB2.temp/20150507_HBR_2_mRNA_L1.LB2.bam 3 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

Mapped Reads to Transcripts Are Sometimes Unsorted

Some BAM output files (sample.transcript.bam) are sorted whereas others aren't, so it causes an issue if samtools index is used immediately after running RSEM.

[E::hts_idx_push] unsorted positions

One third of the BAM files in the dataset I'm working with don't raise such an error, which is strange. They have not been sorted by me. Could RSEM output be made consistent, ideally always sorted?

inconsistent read length

Hello,

I'm running RSEM-1.3.0, rsem-calculate-expression and getting the error 'Paired-end read K00166:58:H7WLVBBXX:4:1223:31751:40965 has alignments with inconsistent mate lengths!'
I'm guessing this originates in read trimming, so am I not supposed to do read trimming before mapping? I've used Skewer for adapter and quality trimming. Should I use a different trimmer or none at all? If I remove read pairs with different lengths I get only 25% of my reads left for that specific sample, which is non-ideal.

Thanks,
Elisabet

RSEM prepare references

Hello,
When i try to build the references for running pRSEM i get the following error while running the RSEM prepare references script:
rsem-prepare-reference
--gtf mm9.refseq.gtf
--star --star-path /software/STAR/bin/Linux_x86_64
--bowtie --bowtie-path /software/bowtie \
--prep-pRSEM --mappability-bigwig-file wgEncodeCrgMapabilityAlign36mer.bigWig
-p 8
mm9_norandom.fa mm9_refseq

results in:
usage: prsem-prepare-reference [-h] [--num-threads NUM_THREADS]
[--bowtie-path BOWTIE_PATH]
[--mappability-bigwig-file MAPPABILITY_BIGWIG_FILE]
[--quiet]
ref_fasta ref_name
prsem-prepare-reference: error: too few arguments

I think there might be a typo in the script?

--Thank you

Update to Samtools 1.4

The latest version has performance improvements.

RSEM Handling of GENCODE Genes

GENCODE Genes sometimes uses the same gene symbol twice or more. For example,

  seqnames    start      end width strand  source type score phase                ID             gene_id      gene_type gene_status gene_name level
1     chr7 63505821 63538927 33107      +  HAVANA gene    NA    NA ENSG00000214652.5 ENSG00000214652.5_2 protein_coding       KNOWN    ZNF727     2
2     chr7 63505821 63538927 33107      + ENSEMBL gene    NA    NA ENSG00000257482.3   ENSG00000257482.3 protein_coding       KNOWN    ZNF727     3

The transcripts associated with these duplicated gene symbols have identically located exons, introns, and UTRs. I notice that even if I see reads mapped to the genome overlapping these kinds of genes, RSEM will calculate an expected count of 0 for both genes in all samples in genes.results. Shouldn't rsem-prepare-reference handle this case better?

rsem-parse-alignments Lacks Adequate Documentation

There's no description of the flags or synopsis of what it does.

$ rsem-parse-alignments -h
Usage : rsem-parse-alignments refName imdName statName alignF read_type [-t fai_file] [-tag tagName] [-q]

In contrast, $ rsem-calculate-expression -h contains well-written documentation of all of the available options.

pRSEM ValueError: max() arg is an empty sequence

Hi, I was testing pRSEM, and got error below.
do you know why is that?

I only provided one ChIP-seq target fastq, so the IDR is complaining?

Thanks!
Tommy

Traceback (most recent call last):
  File "/scratch/genomic_med/apps/RSEM/pRSEM/prsem-calculate-expression", line 74, in <module>
    main()
  File "/scratch/genomic_med/apps/RSEM/pRSEM/prsem-calculate-expression", line 30, in main
    Prsem.genChIPSeqPeakFileBySPPIDR(param)
  File "/scratch/genomic_med/apps/RSEM/pRSEM/Prsem.py", line 53, in genChIPSeqPeakFileBySPPIDR
    cse_target.getPeaksByIDR(cse_control.pooled_tagalign)
  File "/scratch/genomic_med/apps/RSEM/pRSEM/ChIPSeqExperiment.py", line 195, in getPeaksByIDR
    max_npeaks = max(fidr2npeaks.values())
ValueError: max() arg is an empty sequence

Strand Specific Mapping Performs Poorly

I use the --paired-end and --strand-specific options to map to the GENCODE Genes 25 transcriptome with Bowtie.

For --forward-prob 0 I get 2.79 % of reads mapped, for --forward-prob 1 I again get 2.79 % and when all of the strand specific options are left out of the command, I get 79.99 % mapped reads.

I have also mapped with STAR to the genome, and the mapping percentage is high and the reads are clearly strand-specific when viewed in IGV and coloured by strand of the first read in the pair. Why is the mapping percentage so low with the stand-specific approach to RSEM and Bowtie ? Also, using --forward-prob 0 and --forward-prob 1 should not produce identical mapping percentages. Are they being ignored in the code ?

I am using version 1.2.31.

rsem-generate-ngvector( 1.3.0) failed

Hi.

rsem-generate-ngvector( 1.3.0) failed with the flowing message.


# rsem-generate-ngvector ./rna.fa rsem
rsem-for-ebseq-calculate-clustering-info 25 ./rna.fa rsem.ump
The reference is loaded.
All possbile 25 mers are generated.
All 25 mers are sorted.
Clustering information is calculated.

rsem-for-ebseq-generate-ngvector-from-clustering-info rsem.ump rsem.ngvec
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 29593 did not have 2 elements
Calls: read.table -> scan
Execution halted
"rsem-for-ebseq-generate-ngvector-from-clustering-info rsem.ump rsem.ngvec" failed! Plase check if you provide correct parameters/options for the pipeline!
[root@T620 GRCh38.p10]# rsem-for-ebseq-generate-ngvector-from-clustering-info
Usage: rsem-for-ebseq-generate-ngvector-from-clustering-info input_file output_file

#

the line 29593 of rsem.ump  is :
NM_007285.6 Homo sapiens GABA type A receptor associated protein like 2 (GABARAPL2), mRNA       0.0129096

RSEM cannot calculate credibility intervals on Debian 7.8

Hi,
I am using RSEM 1.2.21. When running it on a Debian Wheezy 7.8 installation
it looks like the ci bounds could not be computed. I get the following output in the .genes.results file:

FBgn0002521	FBtr0089204,FBtr0089205	2159.00	1976.37	14.00	17047.35	8586.31	14.00	0.00	12983.33	6862.00	6937.52	19666.4	-1e+30	1e+30

When running the same command with the same data on Debian Jessie 8.6 or Centos 6.7 it seems to work:

FBgn0002521	FBtr0089204,FBtr0089205	2159.00	1976.37	14.00	17047.35	8586.31	14.00	0.00	12983.33	6862.00	6937.52	19666.4	3647.78	10230.2

I tried with different versions of RSEM but the problems still remains.

Do you have any idea about what this could be related to?

Thanks

Assertion failed

I've encountered a bug when running rsem-calculate-expression. If I use the
--no-bam-output option, then I it completes without any errors. I believe
this means that my input files must be correct.

However, if I use the --output-genome-bam option, then I always get a crash
with output as follows. I snipped out paths to my files because they are
irrelevant to this issue.

ROUND = 7130, SUM = 1294266, bChange = 0.00101067, totNum = 1
ROUND = 7131, SUM = 1294266, bChange = 0.00101067, totNum = 1
ROUND = 7132, SUM = 1294266, bChange = 0.00101067, totNum = 1
ROUND = 7133, SUM = 1294266, bChange = 0.00101067, totNum = 1
ROUND = 7134, SUM = 1294266, bChange = 0.000966455, totNum = 0
rsem-run-em: BamWriter.h:135: void BamWriter::work(HitWrapper<PairedEndHit>): Assertion `transcripts.getInternalSid(b->core.tid + 1) == hit->getSid()' failed.
Expression Results are written!
"rsem-run-em <SNIP>" failed! Plase check if you provide correct parameters/options for the pipeline!

Do you have any clues or tips? I don't have time to investigate this
error myself.

I ran rsem-calculate-expression like this (variables are not relevant for
this issue, since I ran successfully without --output-genome-bam):

opt=(
    --paired-end
    --num-threads 4
    --bowtie2
    --bowtie2-path $HOME/src/bowtie2-2.2.2
    --bowtie2-k 25
    --phred33-quals
    --output-genome-bam
    --calc-pme
    --estimate-rspd
    --samtools-sort-mem 3G
    --temporary-folder $tmp/rsem
    --time
    $fq1
    $fq2
    $RAM/hg19
    $out/$library
)
rsem-calculate-expression ${opt[*]}

Thanks for the excellent software! I'm very pleased with the way you've written
documentation and with the capabilities of the tools you provide.

The GTF file might be corrupted! Stop at line : 1

I have encountered a bug while preparing reference sequence of Arabidopsis_thaliana using RSEM (latest developer version).
The fasta and gtf files were downloaded from ftp://ftp.ensemblgenomes.org/pub/release-30/plants/.
Here is the cmd output:

blab@blab: rsem-prepare-reference -gtf filtered.gtf --bowtie2 --bowtie-path /usr/bin/bowtie2 Arabidopsis_thaliana.TAIR10.30.dna.toplevel.fa at.tair10.30
Warning: If Bowtie is not used, no need to set --bowtie-path option!
rsem-extract-reference-transcripts at.tair10.30 0 filtered.gtf None 0 Arabidopsis_thaliana.TAIR10.30.dna.toplevel.fa
The GTF file might be corrupted!
Stop at line : 1    tair    exon    4810488 4811109 .   +   .   gene_id "AT1G14040"; gene_version "1"; transcript_id "AT1G14040.1"; transcript_version "1"; exon_number "1"; gene_name "PHO1;H3"; gene_source "tair"; gene_biotype "protein_coding"; transcript_name "PHO1;H3"; transcript_source "tair"; transcript_biotype "protein_coding"; exon_id "AT1G14040.1.exon1"; exon_version "1";
Error Message: Cannot separate the identifier from the value for attribute H3"!
"rsem-extract-reference-transcripts at.tair10.30 0 filtered.gtf None 0 Arabidopsis_thaliana.TAIR10.30.dna.toplevel.fa" failed! Plase check if you provide correct parameters/options for the pipeline!

What is correct format of gtf file?

failure to build from source with GCC6

Hello,

The next stable release of Debian will include GCC version 6 as the default compiler. Currently RSEM fails to build with the following error:

> sbuild (Debian sbuild) 0.67.0 (26 Dec 2015) on dl580gen9-02.hlinux
...
> g++ -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wl,-z,relro -O3 -Wall scanForPairedEndReads.cpp /usr/lib/libbam.a -lz -lpthread -o rsem-scan-for-paired-end-reads
> buildReadIndex.cpp: In function 'void buildIndex(char*, int, bool)':
> buildReadIndex.cpp:38:32: error: cannot convert 'std::basic_istream<char>' to 'bool' in assignment
>    success = (getline(fin, line));
>                                 ^
> 
> buildReadIndex.cpp:40:32: error: cannot convert 'std::basic_istream<char>' to 'bool' in assignment
>    success = (getline(fin, line));
>                                 ^
> 
> buildReadIndex.cpp:44:33: error: cannot convert 'std::basic_istream<char>' to 'bool' in assignment
>     success = (getline(fin, line));
>                                  ^
> 
> buildReadIndex.cpp:46:33: error: cannot convert 'std::basic_istream<char>' to 'bool' in assignment
>     success = (getline(fin, line));
>                                  ^
> 
> Makefile:59: recipe for target 'rsem-build-read-index' failed
> make[2]: *** [rsem-build-read-index] Error 1
> make[2]: *** Waiting for unfinished jobs....
> In file included from getUnique.cpp:12:0:
> utils.h:26:13: warning: 'verbose' defined but not used [-Wunused-variable]
>  static bool verbose = true; // show detail intermediate outputs
>              ^~~~~~~
> 
> In file included from wiggle.cpp:10:0:
> utils.h:26:13: warning: 'verbose' defined but not used [-Wunused-variable]
>  static bool verbose = true; // show detail intermediate outputs
>              ^~~~~~~
> 
> In file included from samValidator.cpp:12:0:
> utils.h:26:13: warning: 'verbose' defined but not used [-Wunused-variable]
>  static bool verbose = true; // show detail intermediate outputs
>              ^~~~~~~
> 
> In file included from parseIt.cpp:22:0:
> SingleHit.h: In member function 'bool SingleHit::read(std::istream&)':
> SingleHit.h:46:22: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>   return (in>>sid>>pos);
>                       ^
> 
> In file included from parseIt.cpp:23:0:
> PairedEndHit.h: In member function 'bool PairedEndHit::read(std::istream&)':
> PairedEndHit.h:29:34: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>      return (in>>sid>>pos>>insertL);
>                                   ^
> 
> In file included from scanForPairedEndReads.cpp:13:0:
> utils.h:26:13: warning: 'verbose' defined but not used [-Wunused-variable]
>  static bool verbose = true; // show detail intermediate outputs
>              ^~~~~~~
> 
> Makefile:55: recipe for target 'parseIt.o' failed
> make[2]: *** [parseIt.o] Error 1
> In file included from tbam2gbam.cpp:5:0:
> utils.h:26:13: warning: 'verbose' defined but not used [-Wunused-variable]
>  static bool verbose = true; // show detail intermediate outputs
>              ^~~~~~~
> 
> In file included from EM.cpp:24:0:
> SingleHit.h: In member function 'bool SingleHit::read(std::istream&)':
> SingleHit.h:46:22: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>   return (in>>sid>>pos);
>                       ^
> 
> In file included from EM.cpp:25:0:
> PairedEndHit.h: In member function 'bool PairedEndHit::read(std::istream&)':
> PairedEndHit.h:29:34: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>      return (in>>sid>>pos>>insertL);
>                                   ^
> 
> In file included from SingleModel.h:26:0,
>                  from simulation.cpp:22:
> SingleHit.h: In member function 'bool SingleHit::read(std::istream&)':
> SingleHit.h:46:22: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>   return (in>>sid>>pos);
>                       ^
> 
> In file included from PairedEndModel.h:27:0,
>                  from simulation.cpp:24:
> PairedEndHit.h: In member function 'bool PairedEndHit::read(std::istream&)':
> PairedEndHit.h:29:34: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>      return (in>>sid>>pos>>insertL);
>                                   ^
> 
> In file included from SingleModel.h:26:0,
>                  from Gibbs.cpp:15:
> SingleHit.h: In member function 'bool SingleHit::read(std::istream&)':
> SingleHit.h:46:22: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>   return (in>>sid>>pos);
>                       ^
> 
> In file included from SingleModel.h:26:0,
>                  from calcCI.cpp:16:
> SingleHit.h: In member function 'bool SingleHit::read(std::istream&)':
> SingleHit.h:46:22: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>   return (in>>sid>>pos);
>                       ^
> 
> In file included from PairedEndModel.h:27:0,
>                  from Gibbs.cpp:17:
> PairedEndHit.h: In member function 'bool PairedEndHit::read(std::istream&)':
> PairedEndHit.h:29:34: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>      return (in>>sid>>pos>>insertL);
>                                   ^
> 
> In file included from PairedEndModel.h:27:0,
>                  from calcCI.cpp:18:
> PairedEndHit.h: In member function 'bool PairedEndHit::read(std::istream&)':
> PairedEndHit.h:29:34: error: cannot convert 'std::basic_istream<char>::__istream_type {aka std::basic_istream<char>}' to 'bool' in return
>      return (in>>sid>>pos>>insertL);
>                                   ^
> 
> Makefile:104: recipe for target 'simulation.o' failed
> make[2]: *** [simulation.o] Error 1
> Makefile:111: recipe for target 'Gibbs.o' failed
> make[2]: *** [Gibbs.o] Error 1
> Makefile:120: recipe for target 'calcCI.o' failed
> make[2]: *** [calcCI.o] Error 1
> Makefile:84: recipe for target 'EM.o' failed
> make[2]: *** [EM.o] Error 1

(Original report at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=811615#5 )

rsem-find-DE on Mac OS X / R 2.15.1

Hi, I am trying to run the rsem-find-DE program on a system with Mac OS 10.8.4 and R version 2.15.1. When I ran make and make ebseq, I didn't see any problems in compilation. However, when I try to run rsem-find-DE, I get the following error.

Error in library(blockmodeling, lib.loc = path) : 
  there is no package called ‘blockmodeling’
Execution halted

Could you provide any insight into this error?

Errors Generate when Compiling RSEM

I download the latest version of RSEM from RSEM Project website. Then I type command "make" to compile the program. It reports many errors. Did RSEM compile completely? The platform I use is OS X version 10.9.2.

g++ -Wall -O3 extractRef.cpp -o rsem-extract-reference-transcripts
extractRef.cpp:1:17: error: cstdio: No such file or directory
extractRef.cpp:2:18: error: cstring: No such file or directory
extractRef.cpp:3:17: error: cctype: No such file or directory
extractRef.cpp:4:18: error: cstdlib: No such file or directory
extractRef.cpp:5:18: error: fstream: No such file or directory
extractRef.cpp:6:18: error: sstream: No such file or directory
extractRef.cpp:7:14: error: map: No such file or directory
extractRef.cpp:8:17: error: vector: No such file or directory
extractRef.cpp:9:20: error: algorithm: No such file or directory
In file included from extractRef.cpp:11:
utils.h:4:16: error: cmath: No such file or directory
utils.h:5:16: error: ctime: No such file or directory
utils.h:10:18: error: cassert: No such file or directory
utils.h:11:17: error: string: No such file or directory
utils.h:13:19: error: stdint.h: No such file or directory
In file included from extractRef.cpp:12:
my_assert.h:7:17: error: cerrno: No such file or directory
my_assert.h:10:18: error: iomanip: No such file or directory
In file included from extractRef.cpp:11:
utils.h:15: error: ‘uint64_t’ does not name a type
utils.h:16: error: ‘uint64_t’ does not name a type
utils.h: In function ‘bool isZero(double)’:
utils.h:28: error: ‘fabs’ was not declared in this scope
utils.h: In function ‘bool isLongZero(double)’:
utils.h:29: error: ‘fabs’ was not declared in this scope
utils.h: At global scope:
utils.h:34: error: expected initializer before ‘<’ token
utils.h:45: error: expected initializer before ‘<’ token
utils.h: In function ‘int get_base_id(char)’:
utils.h:48: error: ‘base2id’ was not declared in this scope
utils.h:49: error: ‘stderr’ was not declared in this scope
utils.h:49: error: ‘fprintf’ was not declared in this scope
utils.h:50: error: ‘exit’ was not declared in this scope
utils.h:52: error: ‘base2id’ was not declared in this scope
utils.h: At global scope:
utils.h:55: error: expected initializer before ‘<’ token
utils.h:66: error: expected initializer before ‘<’ token
utils.h: In function ‘int get_rbase_id(char)’:
utils.h:69: error: ‘rbase2id’ was not declared in this scope
utils.h:70: error: ‘stderr’ was not declared in this scope
utils.h:70: error: ‘fprintf’ was not declared in this scope
utils.h:71: error: ‘exit’ was not declared in this scope
utils.h:73: error: ‘rbase2id’ was not declared in this scope
utils.h: In function ‘char getOpp(char)’:
utils.h:89: error: ‘stderr’ was not declared in this scope
utils.h:89: error: ‘fprintf’ was not declared in this scope
utils.h:90: error: ‘exit’ was not declared in this scope
utils.h: In function ‘char getCharacter(int)’:
utils.h:102: error: ‘stderr’ was not declared in this scope
utils.h:102: error: ‘fprintf’ was not declared in this scope
utils.h:103: error: ‘exit’ was not declared in this scope
utils.h: At global scope:
utils.h:107: error: expected initializer before ‘<’ token
utils.h:113: error: expected initializer before ‘<’ token
utils.h:115: error: ‘string’ in namespace ‘std’ does not name a type
utils.h: In function ‘void genReadFileNames(const char_, int, int, int&, char (_)[10005])’:
utils.h:132: error: ‘strcpy’ was not declared in this scope
utils.h:135: error: ‘strcpy’ was not declared in this scope
utils.h:140: error: ‘sprintf’ was not declared in this scope
utils.h:144: error: ‘sprintf’ was not declared in this scope
utils.h: At global scope:
utils.h:149: error: expected ‘,’ or ‘...’ before ‘&’ token
utils.h:149: error: ISO C++ forbids declaration of ‘time_t’ with no type
utils.h: In function ‘void printTimeUsed(int)’:
utils.h:150: error: ‘b’ was not declared in this scope
utils.h:150: error: ‘a’ was not declared in this scope
utils.h:154: error: ‘program_name’ was not declared in this scope
utils.h:154: error: ‘printf’ was not declared in this scope
In file included from extractRef.cpp:12:
my_assert.h: At global scope:
my_assert.h:12: error: ‘string’ in namespace ‘std’ does not name a type
my_assert.h:19: error: ‘string’ in namespace ‘std’ does not name a type
my_assert.h:25: error: ‘string’ in namespace ‘std’ does not name a type
my_assert.h:29: error: ‘string’ in namespace ‘std’ does not name a type
my_assert.h:37: error: expected unqualified-id before ‘&’ token
my_assert.h:37: error: expected ‘,’ or ‘...’ before ‘&’ token
my_assert.h: In function ‘void general_report()’:
my_assert.h:38: error: ‘putEnter’ was not declared in this scope
my_assert.h:38: error: ‘printf’ was not declared in this scope
my_assert.h:39: error: ‘stderr’ was not declared in this scope
my_assert.h:39: error: ‘errmsg’ was not declared in this scope
my_assert.h:39: error: ‘fprintf’ was not declared in this scope
my_assert.h:40: error: ‘exit’ was not declared in this scope
my_assert.h: At global scope:
my_assert.h:45: error: expected unqualified-id before ‘&’ token
my_assert.h:45: error: expected ‘,’ or ‘...’ before ‘&’ token
my_assert.h: In function ‘void pthread_report(int)’:
my_assert.h:46: error: ‘stderr’ was not declared in this scope
my_assert.h:46: error: ‘errmsg’ was not declared in this scope
my_assert.h:46: error: ‘fprintf’ was not declared in this scope
my_assert.h:48: error: ‘func_name’ was not declared in this scope
my_assert.h:50: error: ‘EAGAIN’ was not declared in this scope
my_assert.h:53: error: ‘EINVAL’ was not declared in this scope
my_assert.h:56: error: ‘EPERM’ was not declared in this scope
my_assert.h:63: error: ‘EDEADLK’ was not declared in this scope
my_assert.h:66: error: ‘EINVAL’ was not declared in this scope
my_assert.h:69: error: ‘ESRCH’ was not declared in this scope
my_assert.h:76: error: ‘EAGAIN’ was not declared in this scope
my_assert.h:79: error: ‘EDEADLK’ was not declared in this scope
my_assert.h:82: error: ‘EINVAL’ was not declared in this scope
my_assert.h:89: error: ‘EAGAIN’ was not declared in this scope
my_assert.h:92: error: ‘EINVAL’ was not declared in this scope
my_assert.h:95: error: ‘EPERM’ was not declared in this scope
my_assert.h:104: error: ‘exit’ was not declared in this scope
In file included from extractRef.cpp:13:
GTFItem.h: At global scope:
GTFItem.h:31: error: ‘std::string’ has not been declared
GTFItem.h:31: error: expected unqualified-id before ‘&’ token
GTFItem.h:31: error: expected ‘,’ or ‘...’ before ‘&’ token
GTFItem.h:40: error: ‘std::string’ has not been declared
GTFItem.h:87: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:88: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:89: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:93: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:94: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:95: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:96: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:97: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:99: error: expected unqualified-id before ‘&’ token
GTFItem.h:99: error: expected ‘,’ or ‘...’ before ‘&’ token
GTFItem.h:103: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:114: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:115: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:118: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:119: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h:120: error: ‘string’ in namespace ‘std’ does not name a type
GTFItem.h: In constructor ‘GTFItem::GTFItem()’:
GTFItem.h:16: error: ‘seqname’ was not declared in this scope
GTFItem.h:16: error: ‘source’ was not declared in this scope
GTFItem.h:16: error: ‘feature’ was not declared in this scope
GTFItem.h:17: error: ‘score’ was not declared in this scope
GTFItem.h:20: error: ‘frame’ was not declared in this scope
GTFItem.h:21: error: ‘gene_id’ was not declared in this scope
GTFItem.h:21: error: ‘transcript_id’ was not declared in this scope
GTFItem.h:22: error: ‘left’ was not declared in this scope
GTFItem.h: In member function ‘bool GTFItem::operator<(const GTFItem&) const’:
GTFItem.h:26: error: ‘gene_id’ was not declared in this scope
GTFItem.h:26: error: ‘const class GTFItem’ has no member named ‘gene_id’
GTFItem.h:26: error: ‘const class GTFItem’ has no member named ‘gene_id’
GTFItem.h:27: error: ‘transcript_id’ was not declared in this scope
GTFItem.h:27: error: ‘const class GTFItem’ has no member named ‘transcript_id’
GTFItem.h:27: error: ‘const class GTFItem’ has no member named ‘transcript_id’
GTFItem.h: In member function ‘void GTFItem::my_assert(char, int&)’:
GTFItem.h:33: error: ‘stderr’ was not declared in this scope
GTFItem.h:33: error: ‘fprintf’ was not declared in this scope
GTFItem.h:34: error: request for member ‘c_str’ in ‘line’, which is of non-class type ‘int’
GTFItem.h:35: error: ‘msg’ was not declared in this scope
GTFItem.h:36: error: ‘exit’ was not declared in this scope
GTFItem.h: In member function ‘void GTFItem::parse(int)’:
GTFItem.h:41: error: ‘istringstream’ is not a member of ‘std’
GTFItem.h:41: error: expected ;' before ‘strin’ GTFItem.h:42: error: ‘string’ is not a member of ‘std’ GTFItem.h:42: error: expected;' before ‘tmp’
GTFItem.h:44: error: ‘strin’ was not declared in this scope
GTFItem.h:44: error: ‘seqname’ was not declared in this scope
GTFItem.h:44: error: ‘getline’ was not declared in this scope
GTFItem.h:45: error: ‘source’ was not declared in this scope
GTFItem.h:46: error: ‘feature’ was not declared in this scope
GTFItem.h:47: error: ‘tmp’ was not declared in this scope
GTFItem.h:48: error: ‘atoi’ was not declared in this scope
GTFItem.h:51: error: ‘score’ was not declared in this scope
GTFItem.h:55: error: ‘frame’ was not declared in this scope
GTFItem.h:57: error: ‘left’ was not declared in this scope
GTFItem.h:63: error: ‘cleanStr’ was not declared in this scope
GTFItem.h:64: error: ‘size_t’ was not declared in this scope
GTFItem.h:64: error: expected ;' before ‘pos’ GTFItem.h:65: error: ‘pos’ was not declared in this scope GTFItem.h:65: error: ‘std::string’ has not been declared GTFItem.h:66: error: ‘string’ is not a member of ‘std’ GTFItem.h:66: error: expected;' before ‘identifier’
GTFItem.h:68: error: ‘identifier’ was not declared in this scope
GTFItem.h:69: error: no matching function for call to ‘GTFItem::my_assert(bool, int&, const char [31])’
GTFItem.h:31: note: candidates are: void GTFItem::my_assert(char, int&)
GTFItem.h:72: error: ‘gene_id’ was not declared in this scope
GTFItem.h:75: error: no matching function for call to ‘GTFItem::my_assert(bool, int&, const char [37])’
GTFItem.h:31: note: candidates are: void GTFItem::my_assert(char, int&)
GTFItem.h:78: error: ‘transcript_id’ was not declared in this scope
GTFItem.h:83: error: no matching function for call to ‘GTFItem::my_assert(bool&, int&, const char [21])’
GTFItem.h:31: note: candidates are: void GTFItem::my_assert(char, int&)
GTFItem.h:84: error: no matching function for call to ‘GTFItem::my_assert(bool&, int&, const char [27])’
GTFItem.h:31: note: candidates are: void GTFItem::my_assert(char, int&)
GTFItem.h: In member function ‘void GTFItem::setGeneID()’:
GTFItem.h:100: error: ‘class GTFItem’ has no member named ‘gene_id’
GTFItem.h:100: error: ‘gene_id’ was not declared in this scope
In file included from extractRef.cpp:14:
Transcript.h: At global scope:
Transcript.h:36: error: expected unqualified-id before ‘&’ token
Transcript.h:36: error: expected ‘,’ or ‘...’ before ‘&’ token
Transcript.h:37: error: ‘Transcript::Transcript()’ cannot be overloaded
Transcript.h:28: error: with ‘Transcript::Transcript()’
Transcript.h:59: error: ISO C++ forbids declaration of ‘string’ with no type
Transcript.h:59: error: invalid use of ‘::’
Transcript.h:59: error: expected ‘;’ before ‘&’ token
Transcript.h:61: error: expected ;' before ‘const’ Transcript.h:61: error: ISO C++ forbids declaration of ‘string’ with no type Transcript.h:61: error: invalid use of ‘::’ Transcript.h:61: error: expected ‘;’ before ‘&’ token Transcript.h:63: error: expected;' before ‘const’
Transcript.h:63: error: ISO C++ forbids declaration of ‘string’ with no type
Transcript.h:63: error: invalid use of ‘::’
Transcript.h:63: error: expected ‘;’ before ‘&’ token
Transcript.h:65: error: expected ;' before ‘char’ Transcript.h:67: error: ISO C++ forbids declaration of ‘string’ with no type Transcript.h:67: error: invalid use of ‘::’ Transcript.h:67: error: expected ‘;’ before ‘&’ token Transcript.h:69: error: expected;' before ‘int’
Transcript.h:71: error: ISO C++ forbids declaration of ‘vector’ with no type
Transcript.h:71: error: invalid use of ‘::’
Transcript.h:71: error: expected ‘;’ before ‘<’ token
Transcript.h:73: error: expected ;' before ‘void’ Transcript.h:73: error: expected unqualified-id before ‘&’ token Transcript.h:73: error: expected ‘,’ or ‘...’ before ‘&’ token Transcript.h:75: error: ‘std::ifstream’ has not been declared Transcript.h:76: error: ‘std::ofstream’ has not been declared Transcript.h:80: error: ISO C++ forbids declaration of ‘vector’ with no type Transcript.h:80: error: invalid use of ‘::’ Transcript.h:80: error: expected ‘;’ before ‘<’ token Transcript.h:82: error: ‘string’ in namespace ‘std’ does not name a type Transcript.h:83: error: ‘string’ in namespace ‘std’ does not name a type Transcript.h: In constructor ‘Transcript::Transcript()’: Transcript.h:30: error: ‘structure’ was not declared in this scope Transcript.h:32: error: ‘seqname’ was not declared in this scope Transcript.h:32: error: ‘gene_id’ was not declared in this scope Transcript.h:32: error: ‘transcript_id’ was not declared in this scope Transcript.h:33: error: ‘left’ was not declared in this scope Transcript.h: In constructor ‘Transcript::Transcript()’: Transcript.h:38: error: ‘class Transcript’ has no member named ‘structure’ Transcript.h:38: error: ‘structure’ was not declared in this scope Transcript.h:40: error: ‘class Transcript’ has no member named ‘seqname’ Transcript.h:40: error: ‘seqname’ was not declared in this scope Transcript.h:41: error: ‘class Transcript’ has no member named ‘gene_id’ Transcript.h:41: error: ‘gene_id’ was not declared in this scope Transcript.h:42: error: ‘class Transcript’ has no member named ‘transcript_id’ Transcript.h:42: error: ‘transcript_id’ was not declared in this scope Transcript.h:46: error: ‘left’ was not declared in this scope Transcript.h:48: error: ‘class Transcript’ has no member named ‘left’ Transcript.h: In member function ‘bool Transcript::operator<(const Transcript&) const’: Transcript.h:56: error: ‘gene_id’ was not declared in this scope Transcript.h:56: error: ‘const class Transcript’ has no member named ‘gene_id’ Transcript.h:56: error: ‘const class Transcript’ has no member named ‘gene_id’ Transcript.h:56: error: ‘transcript_id’ was not declared in this scope Transcript.h:56: error: ‘const class Transcript’ has no member named ‘transcript_id’ Transcript.h:56: error: ‘const class Transcript’ has no member named ‘gene_id’ Transcript.h:56: error: ‘const class Transcript’ has no member named ‘transcript_id’ Transcript.h:56: error: ‘seqname’ was not declared in this scope Transcript.h:56: error: ‘const class Transcript’ has no member named ‘seqname’ Transcript.h: At global scope: Transcript.h:87: error: expected unqualified-id before ‘&’ token Transcript.h:87: error: expected ‘,’ or ‘...’ before ‘&’ token Transcript.h: In member function ‘void Transcript::extractSeq() const’: Transcript.h:88: error: ‘seq’ was not declared in this scope Transcript.h:89: error: ‘structure’ was not declared in this scope Transcript.h:90: error: ‘size_t’ was not declared in this scope Transcript.h:90: error: expected;' before ‘glen’
Transcript.h:92: error: expected `)' before ‘structure’
Transcript.h:93: error: ‘stderr’ was not declared in this scope
Transcript.h:93: error: ‘transcript_id’ was not declared in this scope
Transcript.h:93: error: ‘seqname’ was not declared in this scope
Transcript.h:93: error: ‘fprintf’ was not declared in this scope
Transcript.h:94: error: ‘exit’ was not declared in this scope
Transcript.h:100: error: ‘gseq’ was not declared in this scope
Transcript.h:106: error: ‘gseq’ was not declared in this scope
Transcript.h:110: error: ‘assert’ was not declared in this scope
Transcript.h:113: error: ‘assert’ was not declared in this scope
Transcript.h: At global scope:
Transcript.h:116: error: variable or field ‘read’ declared void
Transcript.h:116: error: ‘ifstream’ is not a member of ‘std’
Transcript.h:116: error: ‘fin’ was not declared in this scope
utils.h:26: warning: ‘verbose’ defined but not used
make: *** [rsem-extract-reference-transcripts] Error 1

Which RSEM output should be used: FPKM vs pme_FPKM

Hi there,

I am using RSEM to calculate gene expression levels (absolute instead of differential expression) in RNA-seq samples. In the output of RSEM, I don't know how posterior mean estimate, e.g. pme_FPKM and pme_TPM, is calculated. To quantify gene expressions in my case, I wonder which you would suggest to use: FPKM (TPM) or pme_FPKM (pme_TPM)?

Another question is: is it necessary to run quartile normalization on RSEM's output?

Thanks,
Qingguo

RSEM Without Gene Database

The RSEM software appears to require a reference sequence set (e.g. hg19) and a gene database (e.g. RefSeq). Could it be made more flexible so that the gene database is not required for the scenario where the reference sequence set is not a genome but a set of alleles of some highly polymorphic genes, hence it's the number of reads assigned to each of the alleles which is the quantity of interest?

rsem-generate-data-matrix question

Hello Dewey Lab,

I was wondering if there is a simple way to have rsem-generate-data-matrix grab the column corresponding to TPMs or FPKM from the .results file rather than the expected counts? I have no experience with Perl and I was hoping this would be a quick change to the code that someone here could help me with.

Best,
Alon

rsem-calculate-expression error

Hello everyone,
I am runing star-cufflinks-rsem to analyse my transnomic sequencing data, everything is all right until I turn to use rsem to calculate the gene expression.
The RSEM gives me an error called:
rsem-parse-alignments /project/huff/huff/stem_cell_differentiation/test/4.rsem.ref/RSEMref Quant.temp/Quant Quant.stat/Quant b /project
rsem-parse-alignments : No such file or directory!
Please check if you have compiled the associated codes by typing related "make" commands and/or made related executables ready to use.
And I found that it is exactlly no tools called rsem-parse-alignments in my rsem install directory. And even though I installed it twice I didn't find it.
And I wonder if your guys have met this problem? How to solve it?
Any help will be gratefully appreciated. Thank u very much.

process-chipseq.R Error in rbindlist(bplapply(files, guessFqEncodingByFile))

I was running pRSEM #54 , everything was fine until this step:

Error in rbindlist(bplapply(files, guessFqEncodingByFile)) :
  attempt to set an attribute on NULL
Calls: main -> <Anonymous> -> rbindlist

I grep and saw this code is from the process-Chipseq.R, any guess what is possibly wrong here?
Thanks,
Tommy

rsem-plot-model result issue

Hi,

I used the rsem-plot-model to produce some plot after mapping my RNASeq data. My library has 62,438,506 reads (paired-end) in total. The pie chart - "Alignment statistics" - shows I have ~36% unique, 35% multiple. However, the numbers from STAR program (I kept the temp files generated during RSEM mapping) shows 64% unique and 16% multiple. Which are far different from each other. On the other hand in the sample.cnt file, I found the following numbers: 12591831, 31048910, 0, 43640741 - supposed to corresponding to multiple, unique, filtered, total. If the rsem-plot-model used these numbers for plotting, I don't understand how the percentages were calculated? If it used numbers from other filesto plot, which files it used? Thanks!

Allow User Specification of stat Folder Location

Currently, it's always placed in the user's home directory. On the university's server, this has a disk quota of 4 GB and RSEM fails after processing the first sample with the error Fail to create folder S2.stat because the hard quota is reached. There should be an option similar to --temporary-folder which the user can use to change this to another directory.

Disk quotas for user dario (uid 3509): 
Filesystem             blocks   quota   limit   grace   files   quota   limit   grace
como:/users          7995692* 4000000 8000000   7days   60751  500000 1000000 
como:/nobackup     218279232  250000000 350000000       53005  500000 1000000

Misleading GTF Parsing Error

If a GTF file is accidentally overwritten on disk with tabs converted to spaces, rsem complains about the strand being incorrect.

$ rsem-prepare-reference --gtf genes.gtf --bowtie2 genome.fasta testRef
rsem-extract-reference-transcripts testRef 0 genes.gtf None 0 genome.fasta
The GTF file might be corrupted!
Stop at line : s1  None    gene    2   21  .   +   .   gene_id "testGene"; gene_name "XXX";
Error Message: Strand is neither '+' nor '-'!
"rsem-extract-reference-transcripts testRef 0 genes.gtf None 0 genome.fasta" failed! Plase check if you provide correct parameters/options for the pipeline!

Could this error message be corrected to highlight the lack of tabs?

Incorrect Version Number Displayed

I was confused why it seemed I wasn't using 1.3.0, but then realised that the version number seems to not have been updated in the software.

$ /tmp/RSEM-1.3.0/rsem-calculate-expression --version
Current version: RSEM v1.2.31

rsem-prepare-reference Refseq

Hello,

When I run rsem-prepare-reference with refseq it only captures "exon" in my gff3 file. I'm not sure why the gff3 even has exons because I work with a bacterium. I only realized this issue because the output from rsem-calculate-expression was a very short list of genes. Any suggestion on how I can get rsem-prepare-reference to capture "gene" rather than "exon"? Would it be okay just to change the third column values in GFF3 from "gene" to "exon"? The concern that I have with doing that is that all the ones that currently have "exon" in the third column also have and ID=rna#;Parent=gene# in the 9th column, and these are put in the second and first column of the rsem-calculate-expression output, respectively. I'm not sure, but there might also or alternatively be an issue associated with transcript sources. Only lines from the gff3 file with tRNAscan-SE in the second colomn, not RefSeq, are retained after rsem-prepare-reference. These lines are the same lines with "exon" in the third column. Before I start manually editing the gff3 file, I was wondering if I'm missing something critical in the pipeline directions. In case you'd like to look at the gff3 file, it's at the url below. And so are some lines of the gff3 file and all of the lines of the short RSEM.genes.results file. An example of "exon" in the third column is close to the end of the gff3 file excerpt.

Thank you for your help!

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Ruminiclostridium_thermocellum/all_assembly_versions/GCF_000184925.1_ASM18492v1/

gff-version 3

!gff-spec-version 1.21

!processor NCBI annotwriter

!genome-build ASM18492v1

!genome-build-accession NCBI_Assembly:GCF_000184925.1

sequence-region NC_017304.1 1 3561619

species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=637887

NC_017304.1 RefSeq region 1 3561619 . + . ID=id0;Dbxref=taxon:637887;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=DSM 1313
NC_017304.1 RefSeq gene 212 1543 . + . ID=gene0;Name=CLO1313_RS00010;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00010;old_locus_tag=Clo1313_0001
NC_017304.1 Protein Homology CDS 212 1543 . + 0 ID=cds0;Parent=gene0;Dbxref=Genbank:WP_003516465.1;Name=WP_003516465.1;gbkey=CDS;product=chromosomal replication initiator protein DnaA;protein_id=WP_003516465.1;transl_table=11
NC_017304.1 RefSeq gene 1793 2893 . + . ID=gene1;Name=CLO1313_RS00015;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00015;old_locus_tag=Clo1313_0002
NC_017304.1 Protein Homology CDS 1793 2893 . + 0 ID=cds1;Parent=gene1;Dbxref=Genbank:WP_003513339.1;Name=WP_003513339.1;gbkey=CDS;product=DNA polymerase III subunit beta;protein_id=WP_003513339.1;transl_table=11
NC_017304.1 RefSeq gene 2909 3115 . + . ID=gene2;Name=CLO1313_RS00020;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00020;old_locus_tag=Clo1313_0003
NC_017304.1 Protein Homology CDS 2909 3115 . + 0 ID=cds2;Parent=gene2;Dbxref=Genbank:WP_003513337.1;Name=WP_003513337.1;gbkey=CDS;product=RNA-binding protein S4;protein_id=WP_003513337.1;transl_table=11
NC_017304.1 RefSeq gene 3133 4242 . + . ID=gene3;Name=CLO1313_RS00025;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00025;old_locus_tag=Clo1313_0004
NC_017304.1 Protein Homology CDS 3133 4242 . + 0 ID=cds3;Parent=gene3;Dbxref=Genbank:WP_003513335.1;Name=WP_003513335.1;gbkey=CDS;product=DNA recombination protein RecF;protein_id=WP_003513335.1;transl_table=11
NC_017304.1 RefSeq gene 4410 4724 . + . ID=gene4;Name=CLO1313_RS00030;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00030;old_locus_tag=Clo1313_0005
NC_017304.1 Protein Homology CDS 4410 4724 . + 0 ID=cds4;Parent=gene4;Dbxref=Genbank:WP_003513333.1;Name=WP_003513333.1;gbkey=CDS;product=DUF370 domain-containing protein;protein_id=WP_003513333.1;transl_table=11
NC_017304.1 RefSeq gene 4773 6698 . + . ID=gene5;Name=gyrB;gbkey=Gene;gene=gyrB;gene_biotype=protein_coding;locus_tag=CLO1313_RS00035;old_locus_tag=Clo1313_0006
NC_017304.1 Protein Homology CDS 4773 6698 . + 0 ID=cds5;Parent=gene5;Dbxref=Genbank:WP_003513331.1;Name=WP_003513331.1;Note=negatively supercoils closed circular double-stranded DNA;gbkey=CDS;gene=gyrB;product=DNA gyrase subunit B;protein_id=WP_003513331.1;transl_table=11
NC_017304.1 RefSeq gene 7685 8461 . + . ID=gene6;Name=CLO1313_RS00040;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00040;old_locus_tag=Clo1313_0007
NC_017304.1 Protein Homology CDS 7685 8461 . + 0 ID=cds6;Parent=gene6;Dbxref=Genbank:WP_003513308.1;Name=WP_003513308.1;gbkey=CDS;product=sporulation initiation inhibitor Soj;protein_id=WP_003513308.1;transl_table=11
NC_017304.1 RefSeq gene 8458 9327 . + . ID=gene7;Name=CLO1313_RS00045;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00045;old_locus_tag=Clo1313_0008
NC_017304.1 Protein Homology CDS 8458 9327 . + 0 ID=cds7;Parent=gene7;Dbxref=Genbank:WP_003513306.1;Name=WP_003513306.1;gbkey=CDS;product=chromosome partitioning protein ParB;protein_id=WP_003513306.1;transl_table=11
NC_017304.1 RefSeq gene 9889 10398 . + . ID=gene8;Name=CLO1313_RS00050;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00050;old_locus_tag=Clo1313_0009
NC_017304.1 Protein Homology CDS 9889 10398 . + 0 ID=cds8;Parent=gene8;Dbxref=Genbank:WP_003513304.1;Name=WP_003513304.1;gbkey=CDS;product=hypothetical protein;protein_id=WP_003513304.1;transl_table=11
NC_017304.1 RefSeq gene 10419 11159 . + . ID=gene9;Name=CLO1313_RS00055;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00055;old_locus_tag=Clo1313_0010
NC_017304.1 Protein Homology CDS 10419 11159 . + 0 ID=cds9;Parent=gene9;Dbxref=Genbank:WP_003513303.1;Name=WP_003513303.1;gbkey=CDS;product=hypothetical protein;protein_id=WP_003513303.1;transl_table=11
NC_017304.1 RefSeq gene 11323 12594 . + . ID=gene10;Name=CLO1313_RS00060;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00060;old_locus_tag=Clo1313_0011
NC_017304.1 Protein Homology CDS 11323 12594 . + 0 ID=cds10;Parent=gene10;Dbxref=Genbank:WP_003513301.1;Name=WP_003513301.1;gbkey=CDS;product=serine--tRNA ligase;protein_id=WP_003513301.1;transl_table=11
NC_017304.1 RefSeq gene 12849 12939 . + . ID=gene11;Name=CLO1313_RS00065;gbkey=Gene;gene_biotype=tRNA;locus_tag=CLO1313_RS00065;old_locus_tag=Clo1313_R0001
NC_017304.1 tRNAscan-SE tRNA 12849 12939 . + . ID=rna0;Parent=gene11;anticodon=(pos:12884..12886);gbkey=tRNA;product=tRNA-Ser
NC_017304.1 tRNAscan-SE exon 12849 12939 . + . ID=id1;Parent=rna0;anticodon=(pos:12884..12886);gbkey=tRNA;product=tRNA-Ser
NC_017304.1 RefSeq gene 12969 13062 . + . ID=gene12;Name=CLO1313_RS00070;gbkey=Gene;gene_biotype=tRNA;locus_tag=CLO1313_RS00070;old_locus_tag=Clo1313_R0002
NC_017304.1 tRNAscan-SE tRNA 12969 13062 . + . ID=rna1;Parent=gene12;anticodon=(pos:13004..13006);gbkey=tRNA;product=tRNA-Ser
NC_017304.1 tRNAscan-SE exon 12969 13062 . + . ID=id2;Parent=rna1;anticodon=(pos:13004..13006);gbkey=tRNA;product=tRNA-Ser
NC_017304.1 RefSeq gene 13113 14513 . - . ID=gene13;Name=CLO1313_RS00075;gbkey=Gene;gene_biotype=protein_coding;locus_tag=CLO1313_RS00075;old_locus_tag=Clo1313_0012

RSEM.genes.results
gene_id transcript_id(s) length effective_length expected_count TPM FPKM
gene1076 rna30 76.00 27.80 1.00 913.68 773.47
gene11 rna0 91.00 42.80 131.00 77743.97 65814.04
gene1187 rna31 87.00 38.80 15.00 9819.71 8312.86
gene1190 rna32 86.00 37.80 23.00 15455.22 13083.59
gene1191 rna33 76.00 27.80 26.00 23755.69 20110.35
gene1192 rna34 77.00 28.80 15.50 13670.31 11572.58
gene12 rna1 94.00 45.80 14.00 7764.29 6572.85
gene1267 rna35 83.00 34.80 2.00 1459.79 1235.78
gene1342 rna36 76.00 27.80 0.00 0.00 0.00
gene1343 rna37 77.00 28.80 15.50 13670.31 11572.58
gene1346 rna38 375.00 326.80 1462.00 113633.01 96195.85
gene1565 rna39 76.00 27.80 0.00 0.00 0.00
gene1574 rna40 78.00 29.80 1.00 852.36 721.56
gene1591 rna41 75.00 26.80 4.00 3791.09 3209.34
gene1668 rna42 76.00 27.80 2.00 1827.36 1546.95
gene1728 rna43 87.00 38.80 0.00 0.00 0.00
gene1729 rna44 76.00 27.80 0.00 0.00 0.00
gene1919 rna45 76.00 27.80 5.00 4568.40 3867.37
gene1920 rna46 76.00 27.80 4.00 3654.72 3093.90
gene1921 rna47 77.00 28.80 8.00 7055.64 5972.94
gene2037 rna48 84.00 35.80 5.00 3547.53 3003.16
gene2066 rna49 77.00 28.80 2.00 1763.91 1493.24
gene2067 rna50 74.00 25.80 4.00 3938.03 3333.74
gene2150 rna51 74.00 25.80 4.00 3938.03 3333.74
gene2151 rna52 75.00 26.80 9.75 9242.32 7824.07
gene2475 rna53 77.00 28.80 0.00 0.00 0.00
gene254 rna5 90.00 41.80 39.00 23698.86 20062.24
gene255 rna6 76.00 27.80 2.00 1827.36 1546.95
gene2578 rna54 75.00 26.80 0.00 0.00 0.00
gene2579 rna55 75.00 26.80 3.25 3078.73 2606.30
gene2580 rna56 76.00 27.80 26.00 23755.69 20110.35
gene2581 rna57 77.00 28.80 34.50 30427.46 25758.32
gene2582 rna58 76.00 27.80 2.33 2131.92 1804.77
gene2757 rna59 76.00 27.80 1.00 913.68 773.47
gene2758 rna60 74.00 25.80 1.00 984.51 833.43
gene276 rna7 77.00 28.80 5.00 4409.78 3733.09
gene277 rna8 75.00 26.80 18.00 17059.92 14442.05
gene278 rna9 76.00 27.80 2.33 2131.92 1804.77
gene282 rna10 75.00 26.80 10.00 9477.73 8023.36
gene2845 rna61 106.00 57.80 32.00 14062.46 11904.55
gene2883 rna62 76.00 27.80 0.00 0.00 0.00
gene2886 rna63 76.00 27.80 2.33 2131.92 1804.77
gene2888 rna64 76.00 27.80 11.00 10050.49 8508.22
gene3032 rna65 94.00 45.80 0.00 0.00 0.00
gene3071 rna66 117.00 68.80 2.25 830.68 703.21
gene3072 rna67 2914.00 2865.80 10323.83 91434.25 77403.53
gene3073 rna68 76.00 27.80 8.50 7766.28 6574.54
gene3074 rna69 1535.00 1486.80 1740.57 29731.66 25169.29
gene445 rna11 1535.00 1486.80 1740.57 29731.66 25169.29
gene446 rna12 77.00 28.80 10.00 8819.56 7466.18
gene447 rna13 2912.00 2863.80 10252.38 90855.47 76913.56
gene448 rna14 117.00 68.80 2.25 830.68 703.21
gene52 rna2 77.00 28.80 1.00 881.96 746.62
gene625 rna15 1535.00 1486.80 639.69 10929.38 9252.25
gene626 rna16 2912.00 2863.80 8838.44 78469.04 66427.85
gene627 rna17 117.00 68.80 2.25 830.68 703.21
gene657 rna18 76.00 27.80 5.00 4568.40 3867.37
gene658 rna19 75.00 26.80 18.00 17059.92 14442.05
gene659 rna20 77.00 28.80 34.50 30427.46 25758.32
gene660 rna21 76.00 27.80 7.00 6395.76 5414.32
gene661 rna22 85.00 36.80 0.00 0.00 0.00
gene679 rna23 1535.00 1486.80 1766.17 30180.14 25548.95
gene680 rna24 76.00 27.80 8.50 7766.28 6574.54
gene681 rna25 2912.00 2863.80 9156.36 81280.79 68808.13
gene682 rna26 117.00 68.80 2.25 830.68 703.21
gene683 rna27 76.00 27.80 0.00 0.00 0.00
gene747 rna28 77.00 28.80 0.00 0.00 0.00
gene87 rna3 77.00 28.80 0.00 0.00 0.00
gene89 rna4 91.00 42.80 2.00 1186.93 1004.79
gene970 rna29 74.00 25.80 1.00 984.51 833.43

rsem-prepare-reference with STAR using Trusted Sources

I ran rsem-prepare-reference in '-star' mode but also using the '--trusted-sources' to limit the source to "BestRefSeq" and "Curated Genomic". However, when running rsem-calculate-expression, after STAR alignment, there is an error that the BAM file contains many more sequences than RSEM knows.

It looks like the issue is that the --trusted-sources information is not fed into STAR when it creates it's reference, and so it uses many more transcripts than RSEM is keeping in its own files.

I think I'll just try a workaround where I pre-filter my GTF file, but I just thought I'd bring this to your attention so you could either fix or have the tool raise an error message when -star and --trusted-sources are used together.

can rsem-gff3-to-gtf NOT stop when some warning/error in refseq gff?

Hi.

Can we let rsem-gff3-to-gtf continue to process when some error/warning in refseq gff?

# rsem-gff3-to-gtf GCF_000836905.1_ViralProj14035_genomic.gff /dev/null
Line 192 does not have attribute Parent:
NC_001895.1     RefSeq  primary_transcript      24008   24288   .       +       .       ID=id97;Note=putative;gbkey=prim_transcript
# echo $?
255

GCF_000836905.1_ViralProj14035_genomic.gff.zip

Line 30 error report text

Super small correction: Line 30 error warning output text "Plase" should be "Please"

Thought I post because I see this error more than I should... my own doing. ;)

mask/ignore rRNA when rsem-calculate-expression, but not rsem-prepare-reference

Is it posssible to mask/ignore rRNA when rsem-calculate-expression, but not rsem-prepare-reference.

If we add rRNA to ref, we can get the rRNA rate from the result bam file ,
and Less false positive when mapping.

Links to specific help document pages broken

Example: https://github.com/deweylab/RSEM/blob/master/rsem-prepare-reference.html

make ebseq fails to make

Readme states:

To compile EBSeq, which is included in the RSEM package, run
make ebseq

What I see:

[root@vmpr-res-head-node RSEM-1.3.0]# make ebseq
cd EBSeq && make all
make[1]: Entering directory `/config/packages/rsem/RSEM-1.3.0/EBSeq'
./install
/usr/bin/env: Rscript: No such file or directory
make[1]: *** [EBSeq] Error 127
make[1]: Leaving directory `/config/packages/rsem/RSEM-1.3.0/EBSeq'
make: *** [ebseq] Error 2

I'm running CentOS 7.3, Make 3.82

rsem-calculate-expression error

Hi RSEM expert:

When running this command: "rsem-calculate-expression --bam --estimate-rspd --no-bam-output --paired-end Aligned.toTranscriptome.out.bam ./rsem-out Quant", I received an error (see error message below). I wonder what rsem-out.grp is? Could you please give me a hint how to fix it?

Many thanks,
Qingguo

rsem-parse-alignments ./rsem-out Quant.temp/Quant Quant.stat/Quant b Aligned.toTranscriptome.out.bam -t 3 -tag XM
Cannot open ./rsem-out.grp! It may not exist.
"rsem-parse-alignments ./rsem-out Quant.temp/Quant Quant.stat/Quant b Aligned.toTranscriptome.out.bam -t 3 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

Please clarify license of IDR source code in pRSEM directory

Thank you for releasing RSEM & pRSEM as free software!

However there are some problems with the third-party source code included (and in some cases, modified) inside the pRSEM directory:

For example, the version of idr that was modified appears to have been never released under any license I can find, as https://sites.google.com/site/anshulkundaje/projects/idr/idrCode.tar.gz?attredirects=0 has not licensing information nor does https://sites.google.com/site/anshulkundaje/projects/idr

Version 1.0, 1.1, and 1.11 of IDR is licensed LGPL (>= 2.0), but the code doesn't appear to be the version adapted here. The version 1.2 on cran is licensed GPL-2+ https://cran.r-project.org/web/packages/idr/ and version 2.0+ on Github are also licensed GPL-2+

Without proof that the 2010 version of IDR was released under any free or open source license means that Debian cannot distribute pRSEM. Without a license it also means that you can not distribute that code or your modifications.

Can you secure a license to the version of IDR used?
Can you switch to the released and maintained version of IDR? Debian, and others, would appreciate that.

Percentage of Bases Covered By Reads Enhancement

It'd be great if the software could output the percentage of transcript bases covered by the assigned RNA-seq reads. For some genes, these values are not close to 100%, which suggests that the transcript isoforms in the database are incomplete for that gene and that the expected counts should be carefully considered. Having such a number automatically output by RSEM would be convenient rather than having to read and process the short reads yet another time.

Output BAM File Contains Inconsistent Read IDs

I used the default bowtie method for rsem-calculate-expression but I had problems determining the number of reads in the resulting BAM file. I did some investigating and found that the reason is the read IDs have different numbers of fields for the first and second reads of a read pair. For example,

$ samtools view 30588WD_PRE.transcript.bam | grep 700666F:126:C8768ANXX:1:1103:1958:2081
700666F:126:C8768ANXX:1:1103:1958:2081  77      *       0       0       *       *       0       0       ATTTTTTTTCTTTATAAATTACGCAATCTATGGTATTCTCTTATAGCAACAGAAAACAGACTAAGACAACCATATTCCCAGTGCTTAGAACAGCCTCTGT    CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG1F@111<1FCFFGGB>1<CEGGEGGFGGBFCGGG1CCGCF@FFGGGGG0B0F>BGG0E   XM:i:0
700666F:126:C8768ANXX:1:1103:1958:2081 2:N:0:GTCCGC     141     *       0       0       *       *       0       0       GTTTATTCAACAGTTTATTCAAAACACGTTTATTGATCATCTCCTGTGAGACAGAGGCTGTTCTAAGCACTGGGAATATGGTTGTCTTAGTCTGTTTTCT   BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCFGGGGG    XM:i:0

The extra 2:N:0:GTCCGC is only kept for second reads, but not for first. Could this be consistent for both pairs of reads?

I also mapped this data with STAR to the genome and the BAM file has consistent naming formats of all the read IDs.

$ samtools view 30588WD_PREAligned.sortedByCoord.out.bam | grep 700666F:126:C8768ANXX:1:1103:1958:2081
700666F:126:C8768ANXX:1:1103:1958:2081  163     chr22   17560462        255     100M    =       17560512        150     GTTTATTCAACAGTTTATTCAAAACACGTTTATTGATCATCTCCTGTGAGACAGAGGCTGTTCTAAGCACTGGGAATATGGTTGTCTTAGTCTGTTTTCT   BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCFGGGGG    NH:i:1  HI:i:1  AS:i:198        nM:i:0
700666F:126:C8768ANXX:1:1103:1958:2081  83      chr22   17560512        255     100M    =       17560462        -150    ACAGAGGCTGTTCTAAGCACTGGGAATATGGTTGTCTTAGTCTGTTTTCTGTTGCTATAAGAGAATACCATAGATTGCGTAATTTATAAAGAAAAAAAAT   E0GGB>F0B0GGGGGFF@FCGCC1GGGCFBGGFGGEGGEC<1>BGGFFCF1<111@F1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC    NH:i:1  HI:i:1  AS:i:198        nM:i:0

Prepare reference error

I want to use RSEM on human genome using RefSeq gene annotations.
I downloaded the browser from UCSC browser.

I run the following command
rsem-prepare-reference --gtf annotations/hg19-annotations-refseq.gtf
genome.fa
ref/human_gencode

I get the following error:
rsem-extract-reference-transcripts ref/human_gencode 0 annotations/hg19-annotations-refseq.gtf 0 genome.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
Parsed 800000 lines
Parsed 1000000 lines
According to the GTF file given, a transcript has exons on multiple chromosomes!
"rsem-extract-reference-transcripts ref/human_gencode 0 annotations/hg19-annotations-refseq.gtf 0 genome.fa" failed! Plase check if you provide correct parameters/options for the pipeline!

How can I use RefSeq genes with RSEM?

deweylab / rsem Goto Github PK

rsem's Introduction

README for RSEM

Table of Contents

Introduction

Compilation & Installation

Prerequisites

Usage

I. Preparing Reference Sequences

Build RSEM references using RefSeq, Ensembl, or GENCODE annotations

Build RSEM references for untypical organisms

II. Calculating Expression Values

Calculating expression values from single-end data

Using an alternative aligner

III. Visualization

a) Converting transcript BAM file into genome BAM file

b) Generating a Wiggle file

c) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer(IGV)

d) Generating Transcript Wiggle Plots

e) Visualize the model learned by RSEM

Example

Simulation

Usage:

Outputs:

Example:

Generate Transcript-to-Gene-Map from Trinity Output

Usage:

Differential Expression Analysis

Prior-Enhanced RSEM (pRSEM)

I. Overview

II. Demo

III. Installation

IV. Example

V. System Requirements

VI. Required External Packages

Authors

Acknowledgements

License

rsem's People

Contributors

Stargazers

Watchers

Forkers

rsem's Issues

gff-version 3

!gff-spec-version 1.21

!processor NCBI annotwriter

!genome-build ASM18492v1

!genome-build-accession NCBI_Assembly:GCF_000184925.1

sequence-region NC_017304.1 1 3561619

species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=637887

Recommend Projects

Recommend Topics

Recommend Org