This program was designed to evaluate the relative 3’UTR length differences between two user-defined conditions. In order to accomplish this, transcripts are divided into sections defined by annotated alternative cleavage and polyadenylation sites. The read abundance in each section is compared between conditions. The final output of the program is a table with all the relative segment abundances as well as the calculated relative use of the 3’UTR as a whole.
For the hg19 segment library used in this study, the kallisto index is available for download here.
- Segments were defined from polyadenylation (polyA) sites annotated in PolyA_DB.
- Only polyA sites used in greater than 30% of samples in PolyA_DB are represented.
- Transcripts with termination codons not in the last exon were excluded.
- The first segment is from the 5’end of a transcript to the first annotated polyA site, and each subsequent segment is defined by sequential polyA sites (see schematic here). To use TreeFar with other quantification tools or to change the polyA site usage filter from 30%, the full list of segments for hg19 is available here.
- The reference sets provided here are for use with human hg19 annotations.
- For other annotations and species, the user will need to generate a segment library, but TreeFar may still be used to determine the APA outcome. Suggestions for this are found at the end of this user guide.
- TreeFar is easiest to use with quantification generated by kallisto.
- Other quantification tools are also compatible with TreeFar, but the user will have to compile the results into the TreeFar input table. Guidelines for this are found at the end of this user guide.
Packages used for program development (required to run TreeFar)
- Python v3.7.5
- rpy2 v3.2.2
- Numpy v1.17.2
- Pandas v0.24.2
- Scipy v1.4.1
Download the script for TreeFar here.
Use kallisto to quantify the segments using RNA-Seq data. Kallisto outputs a folder containing the following three files 1) abundance.h5, 2) abundance.tsv, 3) run_info.json. TreeFar uses the "target_id" and "tpm" columns from the abundance.tsv file.
- The hg19 kallisto index from this study can be downloaded here (Please unzip the file after downloading) and may be used for pseudoalignment if appropriate.
kallisto quant -i hg19index -o contr1 --bias -b 1000 -t 16 --rf-stranded FASTQ-files
For each RNA-Seq dataset, kallisto outputs a folder that is named by the user. However, the name is not carried through to the files within the folder.
- For each dataset that will be used for TreeFar analysis, move/copy the "abundance.tsv" file into a folder with all of its controls and replicates.
- Prefix each kallisto output with its name. Format the file name as "name_abundance.tsv." (See example folder provided here
Usage: python TreeFar.py [arguments]
Required arguments:
--files_path, -f Directory of kallisto output files for use by TreeFar
--group_control, -g0 Comma separated list of the first set of kallisto output
data files (typically control). The algorithm assumes
the kallisto output files are named along the lines of
"NT1_abundance.tsv" so you would just put whatever comes
before the "_abundance.tsv" in this list.
--group_treatment, -g1 Comma separated list of the second set of kallisto output
data files (typically test condition). Similar naming
convention as described above.
--out, -o File prefix for final output table. The final output name
will be this string + l1 + l0, where the final two
components are automatically appended by the algorithm.
--out_main, -x File prefix for data before filtering and analysis
Optional arguments:
--label_control, -l0 Label for condition in -g0 (default: Control). This
will appear also in the output file name.
--label_treatment, -l1 Label for condition in -g1 (default: Treatment).
This will appear also in the output file name.
--mean_filter Value of the mean -g0 TPM that defines the lower limit for
retaining segments in the analysis. Greater values are
more stringent (default: 0.5).
--missing_filter The maximum number of missing log2(TPM) values, in either
condition, required to retain a segment for analysis (default: 2).
--pval_filter In datasets with 3 or more replicates for each condition,
the P value cutoff used by the decision tree for significant
changes. Greater values are less stringent (default: 0.05).
--cutoff Minimum difference in mean log2(TPM) between conditions used
if number of replicates less than 3 and no t-test is applied.
Greater values are more stringent (default: 0.2)
An example command is
python TreeFar.py --files_path Output_collected_for_TreeFar -g0 contr1,contr2,contr3 -g1 8hDex1,8hDex2,8hDex3 -l0 contr -l1 8hDex -o NMD -x Quant
Two output files are generated.
- The first is quantification of all the segments prior to analysis by TreeFar (example here).
- The second is the final output table (example here. Note that the 3’UTR segments filtered out of the analysis are dropped to the end of the table, with the reason for exclusion in the "FILTER" column. For explanation of segdiff and normdiff calculations, see workflow diagram and Kishor et al. methods.
The output is a comma-separated file with the following columns:
GeneID: extracted from target_id
Segment: extracted from target_id; numbered from 5’-most segment to 3’-most segment, does not re-number if segments are excluded from analysis
target_id: Full segment label
TPM for each -g0 replicate
TPM for each -g1 replicate
GeneName: extracted from target_id
Leftcoord: extracted from target_id
Rightcoord: extracted from target_id
pAsite: extracted from target_id
TC: Termination codon, extracted from target_id
Usage: % use of pA site defined from PolyA_DB, extracted from target_id
Strand: extracted from target_id
log2tpm for each g0 replicate
log2tpm for each g1 replicate
is_first: whether this is the first segment of the transcript (1 if yes, 0 if no)
is_only: whether this is the only segment of the transcript (1 if yes, 0 if no. Note: if a segment is the only one, both is_first and is_last will be 1)
is_last: whether this is the last segment of the transcript (1 if yes, 0 if no)
meantpm g0 replicates
meantpm g1 replicates
meanlog2tpm g0 replicates
meanlog2tpm g1 replicates
difflog2tpm g1 condition – g0 condition
normdiff for each g0 replicate
normdiff for each g1 replicate
segdiff for each g0 replicate
segdiff for each g1 replicate
normdiffmean for the -g0 replicates
normdiffmean for the -g1 replicates
segdiffmean for the -g0 replicates
segdiffmean for the -g1 replicates
g1 condition – g0 condition normdiff
g1 condition – g0 condition segdiff
pvallogtpm
pvalnormdiff
pvalsegdiff
eval: which condition has more representation of the given segment
score: point value for each eval code
verdict: provided at the first record of each transcript, the results of the decision tree for which condition has the longer 3’UTR
lastdiff: copy of the norm diff for the last segment of each transcript in line with the first record of each transcript
FILTER: explanation for why the segment was not included in the analysis
To use TreeFar, transcript segments must be defined and numbered for pseudoalignment.
- The names of each segment should have values in the following format: Transcript GeneID|GeneName|left coordinate|right coordinate|pA site|Termination codon site|frequency of pA site usage|segment of transcript|strand (See "target_id" column in this table for examples of multi-segment transcripts)
- The left and right segment coordinates are specific to each segment (for the first segment, the left will be the beginning of the 5’UTR and the right will be the first polyA site in the 3’UTR; for each subsequent segment, the left will be the previous polyA site and the right will be the current polyA site)
- Termination codon will be the same for all segments of a given transcript
- PolyA_DB 3 provides frequency of usage as a percentage—this may be left blank if a different resource is used for polyA site annotation
- PolyA_DB 3 uses the Gene Name for its annotations, so the user will need to assign a GeneID to each transcript (in this case, from RefSeq). Other databases provide the correspondence between Gene name and RefSeq/Ensembl IDs.
- We recommend using only one transcript body for each gene
The 3’UTR segment list (hg19 used in this study can be found here) will need to be formatted for use by the specific quantification algorithm (eg, called by the kallisto index
command, Salmon index
command, etc).
The quantifications from tools other than kallisto will need to be manually aggregated into a table for use by TreeFar. At a minimum, the table should include columns for GeneID, segment (numerical value), and TPMs for each replicate. Labels for the TPM column should be the experimental condition so that TreeFar can perform comparisons. See here for an example input table and below for the command.
In the command line, instead of --files_path
use --table
to point to the location of the aggregated output table.
python TreeFar.py --table data_table_input.txt -g0 contr1,contr2,contr3 -g1 8hDex1,8hDex2,8hDex3 -l0 contr -l1 8hDex -o NMD -x Quant
In this case, -g0
and -g1
will have the list of the column names for the control and treatment conditions, respectively.
Aparna Kishor, Sarah E Fritz, Nazmul Haque, Zhiyun Ge, Ilker Tunc, Wenjing Yang, Jun Zhu, J Robert Hogg. Activation and inhibition of nonsense-mediated mRNA decay controls the abundance of alternative polyadenylation products. Nucleic Acids Research doi: 10.1093/nar/gkaa491.
Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification Nat Biotechnol. 2016;34(5):525‐527. doi:10.1038/nbt.3519
Wang R, Nambiar R, Zheng D, Tian B. PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes. Nucleic Acids Res. 2018;46(D1):D315‐D319. doi:10.1093/nar/gkx1000