Hi, Jens, after the prediction by the step of "GeMoMa" with several relative genom

usage about the parameter of weight (w) about jstacs HOT 4 CLOSED

jstacs commented on May 27, 2024

usage about the parameter of weight (w)


Comments (4)

JensKeilwagen commented on May 27, 2024

Hi,

Thanks for that interesting question. However, the question is not completely clear to me. I'll try to explain what GAF does and to suggest a possible workaround.

If I understood correctly, your situation is the following: a prediction is perfectly supported by several reference species. You would like to choose the prediction from the reference species with the smallest phylogenetic distance as the representative (ref-gene) and list the remaining predictions as alternatives.

However, GAF does not use weights or any other phylogenetic information at the step of removing redundant predictions. The current implementation uses the score and the ID to determine the representative:

Jstacs/projects/gemoma/GeMoMaAnnotationFilter.java

Lines 531 to 549 in 65a454d

    //combine redundant
    Collections.sort(pred);
    Prediction last = pred.get(pred.size()-1);
    for( int i = pred.size()-2; i >= 0; i-- ){
        Prediction cur = pred.get(i);
        if( cur.compareTo(last) == 0 ) {
            //identical
            if( cur.score > last.score || (cur.score== last.score && cur.id.compareTo(last.id)<0) ) {
                cur.combine(last, addAltTransIDs);
                pred.remove(i+1);
                last=cur;
            } else {
                last.combine(cur, addAltTransIDs);
                pred.remove(i);
            }
        } else {
            last=cur;
        }
    }

The attributes (e.g. score, iAA, ...) of the representative are later used to sort and select the predictions.

This implementation is based on the assumption that the GeMoMa score is a good indicator of whether a prediction is valid. To a certain extent, the score reflects how good the conservation is. The ID is used to avoid different results due to multi-threading. (If several predictions are identical (based on the exon-intron structure) and have the same score, the last prediction would be used if the ID were ignored. The number of threads can influence the order of the predictions. To avoid this, we use the ID to get the same results no matter how many threads are used.)
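The selection rule described above can be illustrated with a minimal, self-contained sketch. It is not the GeMoMa code: `Pred`, `wins` and `representative` are hypothetical names, the score is assumed to be an integer, and all predictions passed in are assumed to share the same exon-intron structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RepresentativeSketch {
    // Hypothetical, simplified stand-in for GeMoMa's Prediction class.
    static final class Pred {
        final String id;
        final int score; // assumption: integer score
        final List<String> alternatives = new ArrayList<>();

        Pred(String id, int score) {
            this.id = id;
            this.score = score;
        }
    }

    // Same rule as in the quoted snippet: the higher score wins,
    // ties are resolved in favour of the lexicographically smaller ID.
    static boolean wins(Pred a, Pred b) {
        return a.score > b.score || (a.score == b.score && a.id.compareTo(b.id) < 0);
    }

    // Picks the representative among structurally identical predictions
    // and records the IDs of all other predictions as alternatives.
    static Pred representative(List<Pred> identical) {
        Pred best = identical.get(0);
        for (int i = 1; i < identical.size(); i++) {
            Pred cur = identical.get(i);
            if (wins(cur, best)) {
                cur.alternatives.addAll(best.alternatives);
                cur.alternatives.add(best.id);
                best = cur;
            } else {
                best.alternatives.add(cur.id);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The same three predictions in two different input orders,
        // as could happen with different numbers of threads.
        List<Pred> order1 = Arrays.asList(
            new Pred("species2_gene7", 410),
            new Pred("species1_gene7", 410),
            new Pred("species3_gene7", 380));
        List<Pred> order2 = Arrays.asList(
            new Pred("species3_gene7", 380),
            new Pred("species1_gene7", 410),
            new Pred("species2_gene7", 410));
        System.out.println(representative(order1).id);
        System.out.println(representative(order2).id);
    }
}
```

Both calls print the same ID, because the tie between the two predictions with score 410 is decided by the ID and not by the input order.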

From the lines of code it becomes clear that the score is the main feature for selecting the representative. This seems reasonable to me. I'll try to give an example. Let's assume you have identical predictions from all reference species, i.e., the exon-intron structure is identical for all these predictions, but they differ in the score (=conservation). Despite having the same structure, the amino acid conservation can differ substantially. Although the conservation should on average be higher in phylogenetically more closely related species, this need not be true for every gene, especially if you're thinking about introgressions from wild species. For this reason, a prediction from the reference species with the smallest phylogenetic distance might have a low score, while another prediction from a phylogenetically more distant reference species has a higher score. Besides introgressions, many other reasons could lead to such situations, e.g., gene families.

For me, it seems reasonable to choose the prediction with the highest score (=conservation). This has a historical background: originally, we only used the score for sorting predictions. However, because the score correlates with other attributes that might be used for sorting, it still seems to be valid.

If you use the parameter weight, the attribute sumWeight of the predictions is altered. This attribute can be used for sorting and filtering predictions. However, the weight does not have an influence on the choice of the representative.
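To make the separation between weight and representative concrete, here is a small sketch. It rests on assumptions that are not confirmed in this thread: that `sumWeight` is the sum of the weights of the reference species supporting a prediction, and that a species without an explicit weight counts as 1.0. `Pred`, `representative` and `sumWeight(...)` are hypothetical helpers, not GeMoMa's API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WeightSketch {
    // Hypothetical, simplified prediction: ID, reference species, score.
    static final class Pred {
        final String id;
        final String species;
        final int score;

        Pred(String id, String species, int score) {
            this.id = id;
            this.species = species;
            this.score = score;
        }
    }

    // Representative by score, then by ID; weights are deliberately not an argument.
    static Pred representative(List<Pred> identical) {
        Pred best = identical.get(0);
        for (Pred cur : identical) {
            if (cur.score > best.score
                    || (cur.score == best.score && cur.id.compareTo(best.id) < 0)) {
                best = cur;
            }
        }
        return best;
    }

    // Assumption: sumWeight adds up one weight per supporting reference species,
    // using 1.0 for species without an explicit weight.
    static double sumWeight(List<Pred> identical, Map<String, Double> weights) {
        double sum = 0.0;
        for (Pred p : identical) {
            sum += weights.getOrDefault(p.species, 1.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Pred> identical = Arrays.asList(
            new Pred("close_gene1", "close", 350),
            new Pred("distant_gene1", "distant", 420));

        Map<String, Double> weights = new HashMap<>();
        weights.put("close", 2.0);
        weights.put("distant", 0.5);

        // The weights change sumWeight, but the representative stays
        // the prediction with the highest score.
        System.out.println(representative(identical).id);
        System.out.println(sumWeight(identical, weights));
    }
}
```

In this sketch, giving the closer species a higher weight raises `sumWeight`, which could then be used for sorting or filtering, while the higher-scoring prediction from the more distant species remains the representative.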

Why would you like to have a representative from the phylogenetically closest species?
If you would like to analyze synteny, we have implemented SyntenyChecker:
http://www.jstacs.de/index.php/GeMoMa-Docs#Synteny_checker
Maybe the output can be beneficial for you. It returns a table with some standard columns plus one column per reference species.

best regards, Jens


CrawlingSponge commented on May 27, 2024

Hi, Jens,
Thanks for your professional answer. The predictions from the different references are very likely influenced by introgression or ILS.

The filter step "GAF" indeed chose the prediction with the best score.

There is still one issue that confuses me: the predictions from different references (species1, species2, species3 and species4) have different numbers of exons. How should I solve this problem?

For example, for the gene RPL38, this prediction has 4 exons,

and this prediction has 1 exon,

yours, sincerely


JensKeilwagen commented on May 27, 2024

Hi,

ce is the number of coding exons (of the prediction)
rce is the number of reference coding exons

In both of your examples ce equals rce.

Introns can be gained or lost. Depending on the phylogenetic distance, this should be rare.

Indeed, the filter in GAF is applied after removing redundancy:

Jstacs/projects/gemoma/GeMoMaAnnotationFilter.java

Lines 551 to 570 in 65a454d

    //filter: user-specified
    for( int i = pred.size()-1; i >= 0; i-- ){
        Prediction p = pred.get(i);
        p.setEvidenceAndWeight( weight );
        if( !Tools.filter(engine, filter, p.hash) ) {
            //System.out.println(p.hash.toString());
            pred.remove(i);
        } else {
            fillEvidence(p.evidence, 0);
            /*
            String[] array = {"aa","raa","score","tie","tpc","pAA","iAA","lpm","maxGap"};
            System.out.print(p.id);
            for( int z=0; z < array.length; z++ ) {
                System.out.print("\t" + p.hash.get(array[z]));
            }
            System.out.println();
            */
        }
    }
    filtered += pred.size();

Hence, alternatives could have a different number of coding exons than the representative. Was this your point?
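The effect of this order can be sketched as follows. This is only an illustration under stated assumptions: `Pred` and `applyFilter` are hypothetical, the filter is modelled as a plain Java `Predicate` instead of GeMoMa's filter expression, and the merge step is assumed to have already attached the alternatives to the representative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class FilterAfterMergeSketch {
    // Hypothetical, simplified prediction with the two attributes discussed above.
    static final class Pred {
        final String id;
        final int ce;  // number of coding exons of the prediction
        final int rce; // number of coding exons of the reference
        final List<Pred> alternatives = new ArrayList<>();

        Pred(String id, int ce, int rce) {
            this.id = id;
            this.ce = ce;
            this.rce = rce;
        }
    }

    // Keeps the representatives that pass the filter.
    // The alternatives attached to a representative are never tested.
    static List<Pred> applyFilter(List<Pred> representatives, Predicate<Pred> filter) {
        List<Pred> kept = new ArrayList<>(representatives);
        for (int i = kept.size() - 1; i >= 0; i--) {
            if (!filter.test(kept.get(i))) {
                kept.remove(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Representative with ce == rce; its alternative has the same structure
        // (same ce) but comes from a reference with a different rce.
        Pred rep = new Pred("species1_geneA", 4, 4);
        rep.alternatives.add(new Pred("species2_geneA", 4, 5));
        Pred other = new Pred("species3_geneB", 1, 3);

        List<Pred> kept = applyFilter(Arrays.asList(rep, other), p -> p.ce == p.rce);
        for (Pred p : kept) {
            System.out.println(p.id + " kept with " + p.alternatives.size() + " alternative(s)");
        }
    }
}
```

In this sketch the alternative with `rce` 5 stays listed under the kept representative, although it would not pass the filter on its own.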

If I understood you correctly, you would only like to have predictions with exactly the same number of ce and rce.
You can tell GeMoMa that gaining or losing an intron is very expensive:
http://www.jstacs.de/index.php/GeMoMa-Docs#GeneModelMapper
Cf. parameter intron-loss-gain-penalty. This might help.

best regards, Jens


CrawlingSponge commented on May 27, 2024

Thanks again, Jens.
Okay, I'll give it a try with a strict value for intron-loss-gain-penalty.
