
Comments (4)

JensKeilwagen commented on May 27, 2024

Hi,

Thanks for that interesting question. However, the question is not completely clear to me. I will try to explain what GAF does and suggest a possible workaround.

If I understood correctly, your situation is the following: a prediction is perfectly supported by several reference species. You would like to choose the prediction from the reference species with the smallest phylogenetic distance as the representative (ref-gene) and list the remaining predictions as alternatives.

However, GAF does not use weights or any other phylogenetic information at the step of removing redundant predictions. The current implementation uses the score and the ID to determine the representative:

    //combine redundant
    Collections.sort(pred);
    Prediction last = pred.get(pred.size()-1);
    for( int i = pred.size()-2; i >= 0; i-- ){
        Prediction cur = pred.get(i);
        if( cur.compareTo(last) == 0 ) {
            //identical
            if( cur.score > last.score || (cur.score== last.score && cur.id.compareTo(last.id)<0) ) {
                cur.combine(last, addAltTransIDs);
                pred.remove(i+1);
                last=cur;
            } else {
                last.combine(cur, addAltTransIDs);
                pred.remove(i);
            }
        } else {
            last=cur;
        }
    }

The attributes (e.g. score, iAA, ...) of the representative are later used to sort and select the predictions.

This implementation is based on the assumption that the GeMoMa score is a good indicator of whether a prediction is valid. To a certain extent, the score reflects how well the prediction is conserved. The ID is used to avoid results that differ because of multi-threading. (If several predictions are identical in their exon-intron structure and have the same score, the last prediction would be used if the ID were ignored. The number of threads can influence the order of the predictions. To avoid this, we use the ID so that the results are the same no matter how many threads are used.)
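The tie-breaking rule can be illustrated with a small, self-contained sketch. It is not GeMoMa code: the class Pred, the method names, the example IDs and the integer score type are all hypothetical stand-ins, and only the comparison (higher score wins, then the lexicographically smaller ID) is taken from the excerpt above. It shows that the chosen representative does not depend on the order in which identical predictions arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RepresentativeDemo {
    // Hypothetical stand-in for a prediction; only the fields needed here.
    // The integer score is an assumption made for this sketch.
    static final class Pred {
        final String id;
        final int score;

        Pred(String id, int score) {
            this.id = id;
            this.score = score;
        }
    }

    // Same rule as in the excerpt: higher score wins; on equal scores
    // the lexicographically smaller ID wins.
    static boolean beats(Pred a, Pred b) {
        return a.score > b.score
                || (a.score == b.score && a.id.compareTo(b.id) < 0);
    }

    // Picks the representative among predictions that are assumed to
    // share the same exon-intron structure.
    static Pred chooseRepresentative(List<Pred> identical) {
        Pred best = identical.get(0);
        for (int i = 1; i < identical.size(); i++) {
            Pred cur = identical.get(i);
            if (beats(cur, best)) {
                best = cur;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Pred> preds = new ArrayList<>(Arrays.asList(
                new Pred("speciesA_gene1", 300),
                new Pred("speciesB_gene7", 350),
                new Pred("speciesC_gene3", 350)));

        System.out.println(chooseRepresentative(preds).id);

        // Simulate a different arrival order, e.g. from another thread count.
        Collections.reverse(preds);
        System.out.println(chooseRepresentative(preds).id);
    }
}
```

Both calls select speciesB_gene7: it shares the top score with speciesC_gene3, and its ID sorts first.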

From the lines of code it becomes clear that the score is the main criterion for selecting the representative. This seems reasonable to me. I'll try to give an example. Let's assume you have identical predictions from all reference species, i.e., the exon-intron structure is identical for all these predictions, but they differ in the score (=conservation). Despite having the same structure, the amino acid conservation can differ substantially. Although conservation should on average be higher in phylogenetically more closely related species, this need not be true for every gene, especially if you're thinking about introgressions from wild species. For this reason, a prediction from the reference species with the smallest phylogenetic distance might have a low score, while another prediction from a phylogenetically more distant reference species has a higher score. Besides introgressions, many other reasons could lead to such situations, e.g., gene families.

To me, it seems reasonable to choose the prediction with the highest score (=conservation). This has a historical background: originally we only used the score for sorting predictions. Because the score correlates with other attributes that might be used for sorting, the choice still seems valid.

If you use the parameter weight, the attribute sumWeight of the predictions is altered. This attribute can be used for sorting and filtering predictions. However, the weight does not have an influence on the choice of the representative.

Why would you like to have a representative from the phylogenetically closest species?
If you would like to analyze synteny, we have implemented SyntenyChecker:
http://www.jstacs.de/index.php/GeMoMa-Docs#Synteny_checker
Maybe the output can be beneficial for you. It returns a table with some standard columns plus one column per reference species.

best regards, Jens

from jstacs.

CrawlingSponge commented on May 27, 2024

Hi Jens,
Thanks for your professional answer. The predictions from the different references are very likely influenced by introgression or ILS.

The filter step "GAF" did indeed choose the prediction with the best score.

There is still one issue that confuses me: the predictions from different references (species1, species2, species3 and species4) have different numbers of exons. How should I solve this problem?

For example, for the gene RPL38, this prediction has 4 exons:
[image]

This prediction has 1 exon:
[image]

Yours sincerely


JensKeilwagen commented on May 27, 2024

Hi,

ce is the number of coding exons (of the prediction)
rce is the number of reference coding exons

In both of your examples ce equals rce.

Introns can be gained or lost. Depending on the phylogenetic distance, this should be rare.

Indeed, the filter in GAF is applied after removing redundancy:

    //filter: user-specified
    for( int i = pred.size()-1; i >= 0; i-- ){
        Prediction p = pred.get(i);
        p.setEvidenceAndWeight( weight );
        if( !Tools.filter(engine, filter, p.hash) ) {
            //System.out.println(p.hash.toString());
            pred.remove(i);
        } else {
            fillEvidence(p.evidence, 0);
            /*
            String[] array = {"aa","raa","score","tie","tpc","pAA","iAA","lpm","maxGap"};
            System.out.print(p.id);
            for( int z=0; z < array.length; z++ ) {
                System.out.print("\t" + p.hash.get(array[z]));
            }
            System.out.println();
            */
        }
    }
    filtered += pred.size();

Hence, alternatives could have a different number of coding exons than the representative. Was this your point?
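A minimal sketch of this order of operations, under stated assumptions: the class Pred, the method applyFilter and the example values are hypothetical, and the Java Predicate only stands in for the user-specified filter expression. The sketch assumes, as the excerpt suggests, that the filter inspects only the attributes of the representative, so an alternative that was merged earlier is never tested itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class FilterAfterMergeDemo {
    // Hypothetical stand-in for a prediction with its merged alternatives.
    static final class Pred {
        final String id;
        final int ce;   // coding exons of the prediction
        final int rce;  // coding exons of the reference
        final List<Pred> alternatives = new ArrayList<>();

        Pred(String id, int ce, int rce) {
            this.id = id;
            this.ce = ce;
            this.rce = rce;
        }
    }

    // Keeps only representatives that pass the filter.
    // Alternatives attached to a representative are not tested.
    static List<Pred> applyFilter(List<Pred> representatives, Predicate<Pred> keep) {
        List<Pred> result = new ArrayList<>(representatives);
        for (int i = result.size() - 1; i >= 0; i--) {
            if (!keep.test(result.get(i))) {
                result.remove(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Redundancy was already removed: rep carries one merged alternative
        // with the same predicted structure but a different rce.
        Pred rep = new Pred("speciesA_gene1", 4, 4);
        rep.alternatives.add(new Pred("speciesB_gene7", 4, 5));
        Pred other = new Pred("speciesC_gene3", 1, 2);

        List<Pred> kept = applyFilter(Arrays.asList(rep, other), p -> p.ce == p.rce);

        for (Pred p : kept) {
            System.out.println(p.id + " ce=" + p.ce + " rce=" + p.rce);
            for (Pred alt : p.alternatives) {
                System.out.println("  alternative " + alt.id
                        + " ce=" + alt.ce + " rce=" + alt.rce);
            }
        }
    }
}
```

In this sketch the representative passes a ce == rce check and is kept, while its alternative with rce=5 stays attached because it is never evaluated on its own.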

If I understood you correctly, you would only like to have predictions where ce and rce are exactly the same.
You can tell GeMoMa that gaining or losing an intron is very expensive:
http://www.jstacs.de/index.php/GeMoMa-Docs#GeneModelMapper
Cf. parameter intron-loss-gain-penalty. This might help.

best regards, Jens


CrawlingSponge commented on May 27, 2024

Thanks again, Jens.
Okay, I'll give it a try with a strict value of intron-loss-gain-penalty.
