Comments (13)
Hi,
it seems that the run was successful. The file final_annotation.gff contains the result in GFF format. However, you need to check whether the results are as you expected, e.g. the number of genes. Several parameters can be used to filter the predictions, especially in GAF. If you run the GeMoMaPipeline with o=true, it also returns the individual predictions for each reference, which allows you to rerun GAF very quickly with different parameter settings.
I don't know which kind of fasta you expect. Of course, the genomic regions, predicted proteins and CDSs could be extracted. However, the main result is the annotation (as gff), and you set p=false, which states that the predicted proteins should not be returned. You can use the module Extractor to create fasta files for genomic regions, ... on your own using the target genome and the predicted annotation. (This is what the GeMoMaPipeline does internally.) If you would like to have such fasta files, you can also specify this in your next GeMoMaPipeline run.
best regards, Jens
from jstacs.
Thank you very much!
I want to get the annotation.gff and the predicted.fasta this way, but the resulting final_annotation.gff shows nothing.
-rw-rw-r-- 1 xiongqian xiongqian 1.8K Dec 20 11:42 final_annotation.gff
-rw-rw-r-- 1 xiongqian xiongqian 25K Dec 20 11:42 protocol_GeMoMaPipeline.txt
-rw-rw-r-- 1 xiongqian xiongqian 507M Dec 20 11:42 unfiltered_predictions_from_species_0.gff
cat final_annotation.gff
##gff-version 3
#SOFTWARE INFO: GeMoMaPipeline 1.9; SIMPLE PARAMETERS: species: pre-extracted; weight: 1.0; tblastn: false; tag: mRNA; RNA-seq evidence: NO; denoise: DENOISE; DenoiseIntrons.maximum intron length: 15000; DenoiseIntrons.minimum expression: 0.01; DenoiseIntrons.context: 10; Extractor.upcase IDs: false; Extractor.repair: false; Extractor.Ambiguity: AMBIGUOUS; Extractor.discard pre-mature stop: true; Extractor.stop-codon excluded from CDS: false; Extractor.full-length: true; GeMoMa.reads: 1; GeMoMa.splice: true; GeMoMa.gap opening: 11; GeMoMa.gap extension: 1; GeMoMa.maximum intron length: 15000; GeMoMa.static intron length: true; GeMoMa.intron-loss-gain-penalty: 25; GeMoMa.reduction factor: 10; GeMoMa.e-value: 100.0; GeMoMa.contig threshold: 0.4; GeMoMa.hit threshold: 0.9; GeMoMa.output: STATIC; GeMoMa.predictions: 10; GeMoMa.avoid stop: true; GeMoMa.approx: true; GeMoMa.protein alignment: true; GeMoMa.verbose: false; GeMoMa.timeout: 3600; GeMoMa.replace unknown: false; GeMoMa.Score: ReAlign; GAF.default attributes: tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce; GAF.kmeans: NO; GAF.filter: start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75); GAF.sorting: sumWeight,score,aa; GAF.alternative transcript filter: tie==1 or sumWeight>1; GAF.common border filter: 0.75; GAF.maximal number of transcripts per gene: 2147483647; GAF.add alternative transcripts: false; GAF.transfer features: false; AnnotationFinalizer.transfer features: false; AnnotationFinalizer.UTR: NO; AnnotationFinalizer.rename: NO; AnnotationFinalizer.name attribute: true; synteny
from jstacs.
Hi,
The annotation file is probably correct. It just contains a very long comment at the beginning showing the parameters.
If you use head or tail instead of cat, you will probably see the typical structure of a gff.
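The same check can also be scripted. The following is only a minimal sketch in Python: it skips empty lines and lines starting with #, and returns the first few feature lines. The function name and the default of five lines are illustrative choices and not part of GeMoMa.

```python
from itertools import islice


def first_feature_lines(gff_path, n=5):
    """Return up to n non-empty, non-comment lines from a GFF file."""
    with open(gff_path, encoding="utf-8") as handle:
        # Comment and header lines in GFF start with '#'.
        features = (
            line.rstrip("\n")
            for line in handle
            if line.strip() and not line.startswith("#")
        )
        return list(islice(features, n))
```

If first_feature_lines("final_annotation.gff") returns an empty list, the file contains only header and comment lines.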
Regarding the fasta, you still did not tell me what the fasta should contain: proteins, CDSs, genomic regions, ...? As mentioned earlier, those fastas can be obtained by setting the corresponding parameters in the GeMoMaPipeline or by running the module Extractor afterwards. If you tell me what exactly you expect, I can be more precise in my answer.
best regards, Jens
from jstacs.
Hi, Jens
Thank you,
I tested several samples, and all the resulting gff files only contained this:
##gff-version 3
#SOFTWARE INFO: GeMoMaPipeline 1.9; SIMPLE PARAMETERS: species: pre-extracted; weight: 1.0; tblastn: false; tag: mRNA; RNA-seq evidence: NO; denoise: DENOISE; DenoiseIntrons.maximum intron length: 15000; Denois
~
I would like to get a fasta file containing the proteins. I tried setting p=false, but the resulting file was empty.
from jstacs.
Hi,
This is interesting. It seems that you have enough unfiltered predictions, as the file size is 507M. However, the final annotation is only 1.8K. Hence, it seems that the GAF module gets enough input, but it probably filters out all the predictions. Could you please check the protocol and report the information for phase 4? It is given after the line beginning with "starting phase 4" ;)
If you would like to have the proteins, you need to set p=true, which is the default value, if I remember correctly. Setting p=false just turns it off. As there is a problem with the final prediction, we first need to solve this.
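If you prefer to extract these numbers automatically, here is a minimal sketch in Python. It assumes that the protocol reports the counts as plain "name: number" lines between the lines "starting phase 4" and "starting phase 5"; the function name is hypothetical and not part of GeMoMa.

```python
import re


def gaf_counts(protocol_text):
    """Collect the 'name: number' lines reported during phase 4 of the protocol."""
    counts = {}
    in_phase_4 = False
    for line in protocol_text.splitlines():
        stripped = line.strip()
        phase = re.match(r"starting phase (\d+)", stripped)
        if phase:
            # Only lines between 'starting phase 4' and the next phase are read.
            in_phase_4 = phase.group(1) == "4"
            continue
        if in_phase_4:
            entry = re.fullmatch(r"([A-Za-z]+):\s*(\d+)", stripped)
            if entry:
                counts[entry.group(1)] = int(entry.group(2))
    return counts
```

Calling gaf_counts(open("protocol_GeMoMaPipeline.txt").read()) returns a dictionary; comparing the entries "all" and "filtered" shows how many predictions passed the GAF filter.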
best regards, Jens
from jstacs.
Hi, Jens
I appreciate your help!
The command is: java -jar /data_group/miaowei/luoshuai/software/miniconda3/envs/gemoma/share/gemoma-1.9-0/GeMoMa-1.9.jar CLI GeMoMaPipeline threads=80 AnnotationFinalizer.r=NO p=true o=true t=/data_group/miaowei/luoshuai/xiongqian/fa_annotation/D2203116335.Euk.IDs.fa outdir=ncbi-D2203116335 s=pre-extracted c=ncbi.Chlamydomonadaceae.fa
The information for phase 4 is:
starting phase 3 (487.448s)
No external annotation given.
Starting: cat for species 0 (487.449s)
Finished: cat for species 0 (489.853s)
starting phase 4 (489.854s)
Starting: GAF (489.855s)
species 0
all: 1393747
filtered: 0
clustered: 0
genes: 0
transcripts: 0
	genes	genes with maxTie=1	transcripts	transcripts with tie=1	transcripts with tie=NA, tpc=1
(max)evidence=1	0	0	0	0	0
Finished: GAF (734.615s)
starting phase 5 (734.615s)
Starting: AnnotationFinalizer (734.616s)
#genes: 0
#warnings: [0, 0]
#mRNAs: 0
#warnings: [0, 0]
#CDSs: 0
#warnings: [0, 0]
#transcripts with 5'-UTR annotation: 0
#transcripts with 3'-UTR annotation: 0
#transcripts with some UTR annotation: 0
#transcripts with 5'- and 3'-UTR annotation: 0
Finished: AnnotationFinalizer (734.643s)
starting phase 6 (734.643s)
Starting: Extractor for final prediction (734.645s)
Finished: Extractor for final prediction (734.737s)
from jstacs.
The species in ncbi.Chlamydomonadaceae.fa and the target sample belong to the same order, so they are closely related.
from jstacs.
Thanks. Your protocol says that all the initial predictions (all: 1393747) were removed (filtered: 0), so the GAF filter was too strict. By default, the filter is start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75). Hence, I would recommend checking how many unfiltered predictions have a start codon and a stop codon (which is the first part of the filter). This can be done easily on the command line using:
grep -F "start=M;stop=*" -c unfiltered_predictions_from_species_0.gff
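To see the start and the stop condition separately, a small script can help. The following is only a sketch in Python: it assumes that start=... and stop=... are stored as semicolon-separated attributes in the ninth GFF column, as the grep pattern suggests, and the function name is made up for illustration.

```python
def count_start_stop(gff_path):
    """Count GFF lines whose attributes contain start=M, stop=*, or both."""
    counts = {"start=M": 0, "stop=*": 0, "both": 0}
    with open(gff_path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header and comment lines
            columns = line.rstrip("\n").split("\t")
            if len(columns) < 9:
                continue  # not a complete GFF feature line
            # Assumption: attributes are separated by ';' in column 9.
            attributes = [item.strip() for item in columns[8].split(";")]
            has_start = "start=M" in attributes
            has_stop = "stop=*" in attributes
            counts["start=M"] += has_start
            counts["stop=*"] += has_stop
            counts["both"] += has_start and has_stop
    return counts
```

Unlike the grep call, "both" does not require the two attributes to be adjacent. A high count for start=M together with a count of zero for stop=* would indicate that only the stop condition of the filter fails.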
from jstacs.
I think I found the problem: You used s=pre-extracted c=ncbi.Chlamydomonadaceae.fa.
If you use pre-extracted, you should run the Extractor beforehand and provide the CDS parts and the assignment file, which are results of the Extractor. It is easier to use own and provide the genome (fasta) and annotation (gff) of the related species (Chlamydomonadaceae?). The GeMoMaPipeline runs the Extractor internally in this case.
from jstacs.
Hi Jens, after running grep "start=M;stop=*" -c unfiltered_predictions_from_species_0.gff, the result is 138357. I also checked the ncbi.Chlamydomonadaceae.fa file: none of the proteins has a stop codon (there is no *). Does this cause the result?
from jstacs.
I'm sorry. I missed the -F in the grep command. I edited the post.
I assume that with -F it will return zero, as the input also has no stop codon. GeMoMa tries to predict the proteins that were given as input. If there is no stop codon in the input, it will probably not predict one in the output either. Using GAF with stop=* will then remove all predictions. As written in my last post, I would recommend using own with a genome (fasta) and an annotation (gff), if possible. This also allows GeMoMa to exploit intron position conservation, which is not possible if you just use protein sequences.
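Whether a protein fasta has M at the beginning and * at the end of its sequences can be checked quickly. The following is a minimal sketch in Python under the assumption of a standard fasta layout (header lines starting with >, sequences possibly wrapped over several lines); the function name is hypothetical.

```python
def summarize_protein_ends(fasta_path):
    """Count proteins that start with 'M' and proteins that end with '*'."""
    counts = {"total": 0, "starts_with_M": 0, "ends_with_stop": 0}

    def tally(sequence):
        counts["total"] += 1
        counts["starts_with_M"] += sequence.startswith("M")
        counts["ends_with_stop"] += sequence.endswith("*")

    parts = None  # sequence lines of the current record, None before the first header
    with open(fasta_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if parts is not None:
                    tally("".join(parts))
                parts = []
            elif line and parts is not None:
                parts.append(line)
    if parts is not None:
        tally("".join(parts))  # last record in the file
    return counts
```

If "ends_with_stop" is zero for the reference proteins, the stop=='*' part of the GAF filter can be expected to reject the predictions derived from them.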
from jstacs.
Thank you very much, Jens.
Does this mean that using protein sequences as input to predict genes with GeMoMa is impossible? I will try to improve the quality of the annotation by using own with a genome (fasta) and an annotation (gff), plus an external gff file from other prediction/annotation software.
I really appreciate your help!
from jstacs.
No, it's not impossible, but it does not let you use the full power of GeMoMa. If you would like to use proteins, you should have full-length proteins with a start codon at the beginning and a stop codon at the end, i.e., the protein sequence has an M at the beginning and a * at the end. If you don't have full-length proteins, but would still like to use GeMoMa, you need to adapt the filter in GAF.
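To illustrate what adapting the filter means, the following sketch restates the default filter from the log above as a Python function and compares it with a relaxed variant that drops the stop condition. This only mimics the logic; it is not GeMoMa's filter syntax, both function names are made up, and whether dropping the stop condition is appropriate depends on your data.

```python
import math


def default_filter(start, stop, score, aa):
    """Python restatement of: start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75)."""
    return start == "M" and stop == "*" and (math.isnan(score) or score / aa >= 0.75)


def relaxed_filter(start, score, aa):
    """Hypothetical variant without the stop condition, for references lacking '*'."""
    return start == "M" and (math.isnan(score) or score / aa >= 0.75)
```

A prediction with start M, a sufficient score, but no stop codon fails default_filter and passes relaxed_filter.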
from jstacs.