Hi, I used GeMoMa with protein (merge several files) as reference, and the comman

Comments (13)

JensKeilwagen commented on June 29, 2024

Hi,

it seems that the run was successful. The final_annotation.gff is the result as GFF. However, you need to check whether the results are as you expected, e.g. the number of genes. There are several parameters that can be used to filter the predictions, especially in GAF. If you run the GeMoMaPipeline with o=true, it also returns the individual predictions for each reference, which allows you to rerun GAF very quickly with different parameter settings.

I don't know which kind of fasta you expect. Of course, the genomic regions, predicted proteins and CDSs could be extracted. However, the main result is the annotation (as gff), and you set p=false, which states that the predicted proteins should not be returned. You can use the module Extractor to create fasta files for genomic regions, ... on your own using the target genome and the predicted annotation. (This is what the GeMoMaPipeline does internally.) If you would like to have such fasta files, you can also specify this in your next GeMoMaPipeline run.

best regards, Jens

from jstacs.

xiongqian123456789 commented on June 29, 2024

Thank you very much!

I want to get the annotation.gff and the predicted.fasta this way, but the resulting final_annotation.gff shows nothing.
-rw-rw-r-- 1 xiongqian xiongqian 1.8K Dec 20 11:42 final_annotation.gff
-rw-rw-r-- 1 xiongqian xiongqian 25K Dec 20 11:42 protocol_GeMoMaPipeline.txt
-rw-rw-r-- 1 xiongqian xiongqian 507M Dec 20 11:42 unfiltered_predictions_from_species_0.gff

cat final_annotation.gff
##gff-version 3
#SOFTWARE INFO: GeMoMaPipeline 1.9; SIMPLE PARAMETERS: species: pre-extracted; weight: 1.0; tblastn: false; tag: mRNA; RNA-seq evidence: NO; denoise: DENOISE; DenoiseIntrons.maximum intron length: 15000; DenoiseIntrons.minimum expression: 0.01; DenoiseIntrons.context: 10; Extractor.upcase IDs: false; Extractor.repair: false; Extractor.Ambiguity: AMBIGUOUS; Extractor.discard pre-mature stop: true; Extractor.stop-codon excluded from CDS: false; Extractor.full-length: true; GeMoMa.reads: 1; GeMoMa.splice: true; GeMoMa.gap opening: 11; GeMoMa.gap extension: 1; GeMoMa.maximum intron length: 15000; GeMoMa.static intron length: true; GeMoMa.intron-loss-gain-penalty: 25; GeMoMa.reduction factor: 10; GeMoMa.e-value: 100.0; GeMoMa.contig threshold: 0.4; GeMoMa.hit threshold: 0.9; GeMoMa.output: STATIC; GeMoMa.predictions: 10; GeMoMa.avoid stop: true; GeMoMa.approx: true; GeMoMa.protein alignment: true; GeMoMa.verbose: false; GeMoMa.timeout: 3600; GeMoMa.replace unknown: false; GeMoMa.Score: ReAlign; GAF.default attributes: tie,tde,tae,iAA,pAA,score,lpm,maxGap,bestScore,maxScore,raa,rce; GAF.kmeans: NO; GAF.filter: start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75); GAF.sorting: sumWeight,score,aa; GAF.alternative transcript filter: tie==1 or sumWeight>1; GAF.common border filter: 0.75; GAF.maximal number of transcripts per gene: 2147483647; GAF.add alternative transcripts: false; GAF.transfer features: false; AnnotationFinalizer.transfer features: false; AnnotationFinalizer.UTR: NO; AnnotationFinalizer.rename: NO; AnnotationFinalizer.name attribute: true; synteny

JensKeilwagen commented on June 29, 2024

Hi,

The annotation file is probably correct. It just contains a very long comment at the beginning showing the parameters.
If you use head or tail instead of cat you probably see the typical structure of a gff.
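
A quick way to check this without paging through the long header is to count comment lines and feature lines separately. The following is a minimal Python sketch and not part of GeMoMa: the function name `summarize_gff` and the example lines are made up for illustration, and the code assumes only the GFF3 convention that comment lines start with `#` and feature lines have tab-separated columns with the feature type in column 3.

```python
from collections import Counter
import io

def summarize_gff(handle):
    """Count comment lines and feature types (column 3) in a GFF3 stream."""
    comments = 0
    features = Counter()
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            comments += 1
            continue
        columns = line.split("\t")
        if len(columns) >= 3:
            features[columns[2]] += 1
    return comments, features

# Invented in-memory example; for a real run pass open("final_annotation.gff").
example = io.StringIO(
    "##gff-version 3\n"
    "#SOFTWARE INFO: (long parameter comment)\n"
    "chr1\tGeMoMa\tgene\t1\t90\t.\t+\t.\tID=g1\n"
    "chr1\tGeMoMa\tmRNA\t1\t90\t.\t+\t.\tID=t1;Parent=g1\n"
    "chr1\tGeMoMa\tCDS\t1\t90\t.\t+\t0\tParent=t1\n"
)
comments, features = summarize_gff(example)
print("comment lines:", comments)
print("features:", dict(features))
```

If the feature counter comes back empty, the file holds only the header and no predictions.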

Regarding the fasta, you still did not tell me, what the fasta should contain. Proteins, CDSs, genomic regions, ... ? As mentioned earlier those fastas can be obtained setting the corresponding parameters in the GeMoMaPipeline or by running the module Extractor afterwards. If you tell me, what exactly you expect, I could be more precise in the answer.

best regards, Jens

xiongqian123456789 commented on June 29, 2024

Hi, Jens

Thank you,

I tested several samples, and all the resulting gff files contained only this:
##gff-version 3
#SOFTWARE INFO: GeMoMaPipeline 1.9; SIMPLE PARAMETERS: species: pre-extracted; weight: 1.0; tblastn: false; tag: mRNA; RNA-seq evidence: NO; denoise: DENOISE; DenoiseIntrons.maximum intron length: 15000; Denois
~
And I want to get a fasta file containing the proteins. I tried setting p=false, but the resulting file was empty.

JensKeilwagen commented on June 29, 2024

Hi,

This is interesting. It seems that you have enough unfiltered predictions, as the file size is 507 MB. However, the final annotation is only 1.8K. Hence, it seems that the GAF module gets enough input, but it probably filters out all the predictions. Could you please check the protocol and report the information for phase 4? It is given after the line "starting phase 4" ;)

If you like to have the proteins you need to set p=true, which is the default value - if I remember correctly. Setting p=false just turns it off. As there is a problem with the final prediction, we first need to solve this.

best regards, Jens

xiongqian123456789 commented on June 29, 2024

Hi, Jens

I appreciate your help!
The command is: java -jar /data_group/miaowei/luoshuai/software/miniconda3/envs/gemoma/share/gemoma-1.9-0/GeMoMa-1.9.jar CLI GeMoMaPipeline threads=80 AnnotationFinalizer.r=NO p=true o=true t=/data_group/miaowei/luoshuai/xiongqian/fa_annotation/D2203116335.Euk.IDs.fa outdir=ncbi-D2203116335 s=pre-extracted c=ncbi.Chlamydomonadaceae.fa

The information for phase 4 is:

starting phase 3 (487.448s)

No external annotation given.
Starting: cat for species 0 (487.449s)
Finished: cat for species 0 (489.853s)

starting phase 4 (489.854s)

Starting: GAF (489.855s)
species 0
all: 1393747
filtered: 0
clustered: 0

genes: 0
transcripts: 0

    genes   genes with maxTie=1     transcripts     transcripts with tie=1  transcripts with tie=NA, tpc=1

(max)evidence=1 0 0 0 0 0
Finished: GAF (734.615s)

starting phase 5 (734.615s)

Starting: AnnotationFinalizer (734.616s)
#genes: 0
#warnings: [0, 0]
#mRNAs: 0
#warnings: [0, 0]
#CDSs: 0
#warnings: [0, 0]

#transcripts with 5'-UTR annotation: 0
#transcripts with 3'-UTR annotation: 0
#transcripts with some UTR annotation: 0
#transcripts with 5'- and 3'-UTR annotation: 0
Finished: AnnotationFinalizer (734.643s)

starting phase 6 (734.643s)

Starting: Extractor for final prediction (734.645s)
Finished: Extractor for final prediction (734.737s)
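
The phase 4 counters above can also be pulled out of the protocol programmatically, which helps when comparing several runs. This is a minimal Python sketch, not a GeMoMa tool: the function name `gaf_counts` is hypothetical, and the parsing assumes the protocol layout is exactly as pasted above (`name: number` lines between `Starting: GAF` and `Finished: GAF`).

```python
import io
import re

def gaf_counts(handle):
    """Collect 'name: number' lines printed between 'Starting: GAF' and 'Finished: GAF'."""
    counts = {}
    inside = False
    for line in handle:
        text = line.strip()
        if text.startswith("Starting: GAF"):
            inside = True
        elif text.startswith("Finished: GAF"):
            break
        elif inside:
            match = re.fullmatch(r"(\w+):\s*(\d+)", text)
            if match:
                counts[match.group(1)] = int(match.group(2))
    return counts

# Excerpt of the protocol shown above; for a real run pass
# open("protocol_GeMoMaPipeline.txt") instead.
protocol = io.StringIO(
    "starting phase 4 (489.854s)\n"
    "\n"
    "Starting: GAF (489.855s)\n"
    "species 0\n"
    "all: 1393747\n"
    "filtered: 0\n"
    "clustered: 0\n"
    "\n"
    "genes: 0\n"
    "transcripts: 0\n"
    "Finished: GAF (734.615s)\n"
)
counts = gaf_counts(protocol)
print(counts)
if counts.get("all", 0) > 0 and counts.get("filtered", 0) == 0:
    print("GAF received predictions but none passed the filter.")
```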

xiongqian123456789 commented on June 29, 2024

Also, the species in ncbi.Chlamydomonadaceae.fa and the target sample belong to the same order, so they are closely related.

JensKeilwagen commented on June 29, 2024

Thanks. Your protocol says that all the initial predictions (all: 1393747) were removed (filtered: 0). So the GAF filter was too strict. By default, the filter is start=='M' and stop=='*' and (isNaN(score) or score/aa>=0.75). Hence, I would recommend checking how many unfiltered predictions have a start codon and a stop codon (which is the first part of the filter). This can be done easily on the command line using:
grep -F "start=M;stop=*" -c unfiltered_predictions_from_species_0.gff
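
To see which start/stop combinations actually occur, rather than only counting the one that passes, the attributes can be tallied. The following Python sketch is an illustration and not part of GeMoMa: the function name `start_stop_tally` and the example lines are invented, and it assumes that the predictions carry `start` and `stop` as `key=value` attributes in column 9, as the grep pattern above suggests.

```python
from collections import Counter
import io

def start_stop_tally(handle):
    """Tally (start, stop) attribute pairs over all feature lines that carry both."""
    tally = Counter()
    for line in handle:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        attributes = {}
        for field in columns[8].split(";"):
            key, sep, value = field.partition("=")
            if sep:
                attributes[key] = value
        if "start" in attributes and "stop" in attributes:
            tally[(attributes["start"], attributes["stop"])] += 1
    return tally

# Invented example lines; for a real run pass
# open("unfiltered_predictions_from_species_0.gff") instead.
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tGeMoMa\tmRNA\t1\t90\t.\t+\t.\tID=t1;start=M;stop=*\n"
    "chr1\tGeMoMa\tCDS\t1\t90\t.\t+\t0\tParent=t1\n"
    "chr1\tGeMoMa\tmRNA\t200\t290\t.\t+\t.\tID=t2;start=M;stop=K\n"
    "chr1\tGeMoMa\tmRNA\t400\t490\t.\t+\t.\tID=t3;start=M;stop=L\n"
)
tally = start_stop_tally(example)
for (start, stop), count in tally.most_common():
    print(f"start={start} stop={stop}: {count}")
```

Only the pair `('M', '*')` satisfies the first part of the default filter, so a tally dominated by other stop values would explain an empty final annotation.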

JensKeilwagen commented on June 29, 2024

I think I found the problem: You used s=pre-extracted c=ncbi.Chlamydomonadaceae.fa.
If you use pre-extracted, you should run the Extractor beforehand and provide the CDS parts and the assignment file, which are results of the Extractor. It is easier to use s=own and provide the genome (fasta) and annotation (gff) of the related species (Chlamydomonadaceae?). GeMoMaPipeline runs the Extractor internally in this case.

xiongqian123456789 commented on June 29, 2024

Hi Jens, after running grep "start=M;stop=*" -c unfiltered_predictions_from_species_0.gff, the result is 138357. I also checked the ncbi.Chlamydomonadaceae.fa file: none of the proteins has a stop codon (no *). Could this cause the result?

JensKeilwagen commented on June 29, 2024

I'm sorry. I missed an -F in the grep command. I edited the post.
I assume that with -F it will return zero, as the input also has no stop codon. GeMoMa tries to predict the proteins that were given as input. If there is no stop codon in the input, it will probably not predict one in the output either. Using GAF with stop=='*' in the filter will then remove all predictions. As written in my last post, I would recommend using s=own with a genome (fasta) and an annotation (gff), if possible. This also allows GeMoMa to exploit intron position conservation, which is not possible if you just use protein sequences.
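
The reason the count without -F was misleading is that `*` is a regular-expression quantifier: in `stop=*` it means "zero or more `=`", so every prediction with start=M matched, whatever its stop value. The small Python sketch below reproduces the effect. The example lines are invented, and Python's `re` module stands in for grep here; for this particular pattern it behaves the same way as grep's default regular expressions.

```python
import re

pattern = "start=M;stop=*"

# Invented attribute strings for illustration.
lines = [
    "ID=t1;start=M;stop=*;score=100",  # M start and stop codon
    "ID=t2;start=M;stop=K;score=90",   # M start, no stop codon
    "ID=t3;start=L;stop=*;score=80",   # stop codon, no M start
]

# Like grep without -F: '=*' is read as "zero or more '=' characters".
regex_hits = [line for line in lines if re.search(pattern, line)]

# Like grep -F: the pattern is compared as a literal string.
literal_hits = [line for line in lines if pattern in line]

print("regex matches:  ", len(regex_hits))
print("literal matches:", len(literal_hits))
```

Here the regex version also counts the second line, which has no stop codon, while the literal version counts only the first.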

xiongqian123456789 commented on June 29, 2024

Thank you very much, Jens.
Does this mean that using protein sequences as input to predict genes with GeMoMa is impossible? I will try to improve the annotation quality by using own with a genome (fasta) and an annotation (gff), together with an external gff file produced by other prediction/annotation software.
Really appreciated for your help!

JensKeilwagen commented on June 29, 2024

No, it's not impossible, but it does not let you use the full power of GeMoMa. If you would like to use proteins, you should have full-length proteins with a start codon at the beginning and a stop codon at the end. That is, each protein sequence should have an M at the beginning and a * at the end. If you don't have full-length proteins but would still like to use GeMoMa, you need to adapt the filter in GAF.
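
Whether a reference protein file meets this requirement can be checked before starting a run. The following is a minimal Python sketch under stated assumptions, not a GeMoMa module: the function names `read_fasta` and `full_length_report` are hypothetical, the example sequences are invented, and the input is assumed to be a plain protein FASTA file.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def full_length_report(handle):
    """Count proteins that start with M and/or end with the stop symbol '*'."""
    report = {"total": 0, "starts_with_M": 0, "ends_with_stop": 0, "both": 0}
    for _, sequence in read_fasta(handle):
        has_start = sequence.upper().startswith("M")
        has_stop = sequence.endswith("*")
        report["total"] += 1
        report["starts_with_M"] += int(has_start)
        report["ends_with_stop"] += int(has_stop)
        report["both"] += int(has_start and has_stop)
    return report

# Invented example; for a real run pass open("ncbi.Chlamydomonadaceae.fa").
example = io.StringIO(
    ">p1 full length\nMKT\nLLV*\n"
    ">p2 no stop symbol\nMKTLLV\n"
    ">p3 no M start\nKTLLV*\n"
)
report = full_length_report(example)
print(report)
```

If `ends_with_stop` is zero for the whole file, as reported for the protein file in this thread, the default stop=='*' condition cannot be satisfied by predictions that mirror those inputs.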

Use GeMoma with protein as reference, the result showed no predicted_proteins.fasta and final_annotation.gff about jstacs HOT 13 CLOSED
