EVOLINC-I: A rapid Long-Intergenic Noncoding RNA (lincRNA) detection pipeline

Introduction

Evolinc-I is a long intergenic noncoding RNA (lincRNA) identification workflow that also facilitates genome browser visualization of identified lincRNAs and downstream differential gene expression analysis.

Evolinc-I minimally requires the following input data

A set of assembled and merged transcripts from Cuffmerge or Cuffcompare in gene transfer format (GTF)
A reference genome (FASTA)
A reference genome annotation (GFF/GTF/GFF3)

Optional input data

Transposable Elements database (FASTA)
Known LincRNA (GFF)
Transcription start site coordinates (BED)

Availability

Using Docker image

Evolinc-I has several dependencies (listed in the Dockerfile), so rather than installing it directly on Linux or macOS, we highly recommend using the available Docker image for Evolinc-I, or building an image from the Dockerfile and using that. Docker can be installed on Linux, macOS, or Windows by following the instructions on the Docker website. You can also try Play-With-Docker to run Evolinc-I using the instructions below.

# Pull the image from CyVerse Dockerhub
docker pull evolinc/evolinc-i:1.7.5

# See the command line help for the Docker image
docker run evolinc/evolinc-i:1.7.5 -h

# Download some sample data 
git clone https://github.com/Evolinc/Evolinc-I.git
cd Evolinc-I/sample_data

# Run Evolinc-I with mandatory files only
docker run --rm -v $(pwd):/working-dir -w /working-dir evolinc/evolinc-i:1.7.5 -c Sample_cuffcompare_out.gtf -g TAIR10_chr1.fasta -u TAIR10_chr1_genes.gff -o test_out -n 4

# Run Evolinc-I with both mandatory and optional files
docker run --rm -v $(pwd):/working-dir -w /working-dir evolinc/evolinc-i:1.7.5 -c Sample_cuffcompare_out.gtf -g TAIR10_chr1.fasta -u TAIR10_chr1_genes.gff -b TE_RNA_transcripts.fa -t Sample_TSS_data.gff -x Sample_known_lncRNAs.gff -o test_out -n 4

Using CyVerse Discovery Environment

The Evolinc-I app (search for "Evolinc-I" in the search box of the Apps window) is integrated into CyVerse's Discovery Environment (DE) and is free for researchers to use. The complete tutorial is available at this CyVerse wiki. CyVerse's DE is a free, easy-to-use GUI that simplifies many aspects of running bioinformatics analyses. If you do not currently have access to a high-performance computing cluster, consider taking advantage of the DE.

Step-by-step walkthroughs

A step-by-step walkthrough for running Evolinc-I on the DE is available here. A step-by-step walkthrough for the command line, with directions on how to change parameters, is coming soon. Information on how to create a Cuffmerge/Cuffcompare input file from one or more SRA IDs within the DE can be found here.

Issues

If you experience any issues with running Evolinc-I (DE app, source code, or Docker image), please open an issue on this GitHub repo. Alternatively, post your query or feature requests in this Google group.

Copyright free

The sources in this GitHub repository are copyright free, so you are allowed to use them in whichever way you like. Here is the full MIT license.

Citing Evolinc-I

If you have used Evolinc-I in your research, please cite it as below.

Andrew D. Nelson*, Upendra K. Devisetty*, Kyle Palos, Asher K. Haug-Baltzell, Eric Lyons, Mark A. Beilstein (2017). "Evolinc: a comparative transcriptomics and genomics pipeline for quickly identifying sequence conserved lincRNAs for functional analysis". Frontiers in Genetics. 1(10)

evolinc-i's Issues

Add rFAM automatic screen

Add rFAM screen to the end of Evolinc-I so that this doesn't have to be done outside of the DE/command line. This will entail adding the rFAM library of RNAs (except snoRNAs).

final_summary_table_gen_evo-I.R sub() function?

What does "AGE_PLUS" refer to in line 422 of the R script?

(422) merge2$V1_2 <- sub("AGE_PLUS", "Yes", merge2$V1_2)

Running evolinc on a cluster

Hi,
I am interested in using the Evolinc pipeline. I work on a cluster and that's where all my data is, so I want to run Evolinc there from the command line. I tried to get Docker, but it is not available on the cluster and I cannot install it myself. Is there a way I can run Evolinc on the command line on the cluster without Docker?

Using merged from gffcompare on DE

I tried running Evolinc-I on a cluster with Singularity and ran into a number of issues, so I am opting to run it on the DE instead. I have the following question: my merged GTF is from gffcompare and not cuffcompare (since cuffcompare is now outdated and gffcompare is basically its newer version). Previously I had been told this was fine as long as I used the -r flag. What can I do when running it on the DE? Is there an option for this?

Questions about Evolinc modifications since publication

Hi,
I have read the paper of the most recent version of Evolinc and I have the following questions:

It says in the paper to run the output FASTA against Rfam, but I see here in the resolved issues github that it says this feature has been added. Do you still suggest that I run my output against Rfam?
Does Evolinc detect only long intergenic non coding RNAs or does it detect other types too?
Since the paper came out, the developers of cuffcompare have also released gffcompare, which is analogous to cuffcompare and, I believe, produces the same output. I would like to use gffcompare because its usage is simpler. Can I use the output GTF from gffcompare in Evolinc, or do you suggest I stick to cuffcompare?
Thank you

Add an option for long read filtering/comparison.

People should be able to run long read transcripts through Evolinc. Alternatively, they should be able to compare their short read derived lncRNAs against any long read transcripts that are available. This would be an optional argument that would provide further support for the lncRNA annotation.

Error in calling unlink in diamondBlast step

As a result, no longest_ORFS_cat.pep.blastp file is produced. It is not clear whether this occurs on all systems or only within a Windows Docker container.

IUPAC ambiguity codes in FASTA file

Error encountered when running Evolinc-I using FASTA files that have IUPAC ambiguity codes (e.g. KeyError: 'R'). Error was not found when ambiguous characters were replaced with 'N' and run with Evolinc-I again.
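The workaround described above (replacing ambiguous characters with 'N' before running Evolinc-I) could be sketched as follows. This is a minimal illustration, not part of Evolinc-I: the function name `mask_ambiguity_codes` is hypothetical, and it assumes the ambiguity codes of interest are the IUPAC nucleotide letters R, Y, S, W, K, M, B, D, H, and V.

```python
# Hypothetical helper (not part of Evolinc-I): map IUPAC nucleotide
# ambiguity codes to N, preserving upper/lower case.
_AMBIGUOUS_TO_N = str.maketrans(
    "RYSWKMBDHVryswkmbdhv",
    "N" * 10 + "n" * 10,
)


def mask_ambiguity_codes(fasta_text: str) -> str:
    """Return FASTA text with ambiguity codes in sequence lines replaced by N."""
    masked = []
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            masked.append(line)  # header lines are left untouched
        else:
            masked.append(line.translate(_AMBIGUOUS_TO_N))
    return "".join(masked)
```

For large genomes the same translation can be applied line by line while streaming the file, rather than reading it into memory at once.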

Modify "updated gff"

Change gene IDs in the updated gff so that they contain no "_" (underscores), as HTSeq version 0.6.1 appears to have an issue with them when they are linked to the gene_id.
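Until such a change is made, the IDs could be rewritten in a post-processing step. The sketch below assumes GTF-style attributes of the form `gene_id "..."`. The function name and the `replacement` parameter are hypothetical, and the updated gff written by Evolinc-I may use a different attribute layout.

```python
import re

# Matches a GTF-style gene_id attribute and captures its quoted value.
_GENE_ID = re.compile(r'(gene_id\s+")([^"]*)(")')


def strip_gene_id_underscores(line: str, replacement: str = "") -> str:
    """Replace underscores inside the gene_id value only (hypothetical helper)."""

    def _fix(match: re.Match) -> str:
        return match.group(1) + match.group(2).replace("_", replacement) + match.group(3)

    return _GENE_ID.sub(_fix, line)
```

Note that removing characters from IDs can make two previously distinct IDs identical (for example `A_1` and `A1`), so the rewritten file should be checked for collisions.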

Offer the option for FPKM filter in Evolinc-I

Can do it in a similar way to the coverage/base filter.

Add AOTs and SOTs to updated GFF file

Users have requested the addition of AOT and SOT lncRNAs to the GFF file in order to perform differential expression.

Replace underscores in Known_lincRNA bed file

There is a known issue when appending the gene ID of a known lincRNA to the final summary table if that known lincRNA has an underscore in its name in the bed/gff file used as input.

Error running Evolinc-I With both mandatory and optional files for sample data

I am getting this error message at the end of the Evolinc run with optional files for the sample data:

Error in as.data.table(newmat) : could not find function "as.data.table"
Calls: cSplit -> is.data.table
Execution halted
cp: cannot stat 'final_Summary_table.tsv': No such file or directory
All necessary files written to test_out
Finished Evolinc-part-I!

Chromosome IDs of Evolinc identified lincRNAs do not match parent annotation.

After running Evolinc 1.7.5 and Evolinc-Merge on the Discovery Environment, the output annotations have chromosome IDs of newly identified lincRNAs that do not match the parent (input) annotation.

The input annotation uses the nomenclature: 'Chr1', 'Chr2', 'Chr3', etc. and 'Scaffold12345'. The 'Final_updated.gtf' from Evolinc-Merge keeps this pattern for existing features, but new lincRNAs will lose the "Chr" identifier or the "Scaffold" identifier in column 1. Additionally, scaffold numbers that begin with a 0 in the parent annotation (e.g. "Scaffold00123") will lose those 0 values and will show "123" as the new chromosome.

Is this an issue for lincRNA identification if Evolinc is not able to assign the lincRNAs to the "known" chromosomes?

I have attached gzipped input and output annotations for your reference.

Thank you for your support.
Final_updated.gtf.gz
Cs_genes_v2.1_annot.gff3.gz
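A mismatch like the one reported above can be detected by comparing the first column of the two annotation files. The following is a minimal sketch with hypothetical function names. It only reports sequence IDs that appear in the output but not in the parent annotation, and it does not attempt to map them back.

```python
def seqids(annotation_text: str) -> set:
    """Collect column-1 sequence IDs from GFF/GTF text, skipping comments."""
    ids = set()
    for line in annotation_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        ids.add(line.split("\t", 1)[0])
    return ids


def unmatched_seqids(parent_text: str, output_text: str) -> list:
    """Return sequence IDs present in the output but absent from the parent."""
    return sorted(seqids(output_text) - seqids(parent_text))
```

An empty result means every sequence ID in the output also occurs in the parent annotation.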

Create "intronic space" parameter

Allow for variable distance for removing gaps and merging hits on similar scaffolds (max gap length currently set to length of query lncRNA).
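The requested behaviour amounts to interval merging with a configurable maximum gap. The sketch below is an illustration under stated assumptions, not the code Evolinc uses: `merge_hits` and `max_gap` are hypothetical names, hits are `(start, end)` pairs already restricted to one scaffold and strand, and the gap is measured as the next hit's start minus the previous merged hit's end.

```python
def merge_hits(hits, max_gap):
    """Merge (start, end) hits whose separation is at most max_gap."""
    merged = []
    for start, end in sorted(hits):
        # Gap here is next start minus previous end (an assumed definition).
        if merged and start - merged[-1][1] <= max_gap:
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged
```

Passing the query lncRNA length as `max_gap` would correspond to the current setting described above, while a user-supplied value would give the variable distance being requested.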

Error in parsing transcripts

Hi,
I was able to run Evolinc with the test data but now I am getting an error when using it on braker genome annotations. This is the error message

Tue Mar 9 17:39:39 UTC 2021
No fasta index found for referencegenome.fa. Rebuilding, please wait..
Fasta index rebuilt.
Generating Number of transcripts
##################################
grep: transcripts.*.fa: No such file or directory
transcripts.*.fa 
##################################
cat: transcripts.*.filter.fa: No such file or directory
[INFO] read file 'transcripts.all.overlapping.filter.fa'
[INFO] Predicting coding potential, please wait ...
[INFO] Running Done!
[INFO] cost time: 0s
[ERROR] putative_intergenic.genes.fa is not a file
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
Can't open putative_intergenic.genes.fa: No such file or directory.
Generating Number of coding and noncoding
##################################
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
putative_intergenic_coding_transcripts
grep: putative_intergenic.genes_cpc2.txt: No such file or directory
putative_intergenic_noncoding_transcripts
overlapping_coding_transcripts 1
overlapping_coding_transcripts 0

It looks like it is not able to extract the transcript sequences and run TransDecoder correctly?
This is the format of my GTF file:

CsWA_scaf115    AUGUSTUS        gene    1563351 1564313 .       -       .       jg29579
CsWA_scaf115    AUGUSTUS        transcript      1563351 1564313 .       -       .       transcript_id "jg29579.t1"; gene_id "jg29579"
CsWA_scaf115    AUGUSTUS        stop_codon      1563351 1563353 .       -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        CDS     1563351 1564313 0.88    -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        exon    1563351 1564313 .       -       .       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_scaf115    AUGUSTUS        start_codon     1564311 1564313 .       -       0       transcript_id "jg29579.t1"; gene_id "jg29579";
CsWA_chr04      AUGUSTUS        gene    6431667 6433016 .       +       .       jg761
CsWA_chr04      AUGUSTUS        transcript      6431667 6433016 .       +       .       transcript_id "jg761.t1"; gene_id "jg761"
CsWA_chr04      AUGUSTUS        start_codon     6431667 6431669 .       +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        CDS     6431667 6433016 0.94    +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        exon    6431667 6433016 .       +       .       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_chr04      AUGUSTUS        stop_codon      6433014 6433016 .       +       0       transcript_id "jg761.t1"; gene_id "jg761";
CsWA_scaf115    AUGUSTUS        gene    4180987 4181720 .       +       .       jg31437
CsWA_scaf115    AUGUSTUS        transcript      4180987 4181720 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437"
CsWA_scaf115    AUGUSTUS        start_codon     4180987 4180989 .       +       0       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        CDS     4180987 4181063 0.59    +       0       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        exon    4180987 4181063 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        intron  4181064 4181137 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        CDS     4181138 4181720 0.54    +       1       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        exon    4181138 4181720 .       +       .       transcript_id "jg31437.t1"; gene_id "jg31437";
CsWA_scaf115    AUGUSTUS        stop_codon      4181718 4181720 .       +       0       transcript_id "jg31437.t1"; gene_id "jg31437";

Is there anything wrong with that?
Thank you in advance
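One thing that may be worth ruling out is the attribute column: in the excerpt above, the `gene` lines carry a bare ID (for example `jg761`) rather than `gene_id "..."` and `transcript_id "..."` pairs. Whether Evolinc-I depends on these attributes on every line is an assumption here, not something confirmed by the source. The sketch below, with the hypothetical function name `lines_missing_ids`, simply lists lines that lack either attribute.

```python
import re

# Matches GTF-style key "value" pairs in the attribute column.
_ATTRIBUTE = re.compile(r'(\w+)\s+"([^"]*)"')


def lines_missing_ids(gtf_text: str) -> list:
    """Return (line_number, message) for lines lacking gene_id or transcript_id."""
    problems = []
    for number, line in enumerate(gtf_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            problems.append((number, "expected 9 tab-separated columns"))
            continue
        keys = {key for key, _ in _ATTRIBUTE.findall(fields[8])}
        missing = {"gene_id", "transcript_id"} - keys
        if missing:
            problems.append((number, "missing " + ", ".join(sorted(missing))))
    return problems
```

If the flagged lines are all `gene` features, filtering them out or rewriting their attribute column before running the pipeline would be one way to test whether they are related to the error.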