Comments (4)
Hi,
Thanks for that interesting question. However, the question is not completely clear to me. I will try to explain what GAF does and suggest a possible workaround.
If I understood it correctly, your situation is the following: A prediction is perfectly supported by several reference species. You would like to choose the prediction from the reference species with the smallest phylogenetic distance as the representative (ref-gene) and list the remaining predictions as alternatives.
However, GAF does not use weights or any other phylogenetic information at the step of removing redundant predictions. The current implementation uses the score and the ID to determine the representative:
Jstacs/projects/gemoma/GeMoMaAnnotationFilter.java, lines 531 to 549 in 65a454d
The attributes (e.g. score, iAA, ...) of the representative are later used to sort and select the predictions.
This implementation is based on the assumption that the GeMoMa score is a good indicator of whether a prediction is valid or not. To a certain extent, the score reflects how good the conservation is. The ID is used to avoid different results due to multi-threading. (If several predictions are identical (based on the exon-intron structure) and have the same score, the last prediction would be used if the ID were ignored. A different number of threads can influence the order of the predictions. To avoid this, we use the ID to get the same results no matter how many threads were used.)
From the lines of code it becomes clear that the score is the main feature for selecting the representative. This seems reasonable to me. I'll try to give an example. Let's assume you have identical predictions from all reference species, i.e., the exon-intron structure is identical for all these predictions, but they differ in the score (=conservation). Despite having the same structure, the amino acid conservation can differ substantially. Although the conservation should on average be higher in phylogenetically more closely related species, this need not be true for every gene, especially if you're thinking about introgressions from wild species. For this reason, a prediction from the reference species with the smallest phylogenetic distance might have a low score, while another prediction from a phylogenetically more distant reference species has a higher score. Besides introgressions, many other reasons could lead to such situations, e.g., gene families.
To me, it seems reasonable to choose the prediction with the highest score (=conservation). This has a historical background: originally we only used the score for sorting predictions. However, because the score correlates with other attributes that might be used for sorting, it still seems to be valid.
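The selection rule described here (score first, ID as a tie-breaker so that the input order does not matter) can be illustrated with a minimal Java sketch. This is not the code from GeMoMaAnnotationFilter.java: the class `RepresentativeSketch`, the `Prediction` type, and the `structure` key are hypothetical, and the direction of the ID tie-break (lexicographically smaller ID wins) is an assumption made only for illustration.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RepresentativeSketch {

    /** Hypothetical stand-in for a prediction; not the real GeMoMa class. */
    public static final class Prediction {
        public final String id;        // prediction ID
        public final String structure; // key describing the exon-intron structure
        public final double score;     // GeMoMa score

        public Prediction(String id, String structure, double score) {
            this.id = id;
            this.structure = structure;
            this.score = score;
        }
    }

    // Higher score comes first; for equal scores the smaller ID comes first
    // (the tie-break direction is an assumption of this sketch).
    static final Comparator<Prediction> BETTER_FIRST =
            Comparator.comparingDouble((Prediction p) -> p.score)
                      .reversed()
                      .thenComparing((Prediction p) -> p.id);

    /** Keeps one representative per identical exon-intron structure. */
    public static Map<String, Prediction> pickRepresentatives(List<Prediction> predictions) {
        Map<String, Prediction> best = new LinkedHashMap<>();
        for (Prediction p : predictions) {
            // Keep whichever of the stored and the new prediction sorts first.
            best.merge(p.structure, p,
                    (a, b) -> BETTER_FIRST.compare(a, b) <= 0 ? a : b);
        }
        return best;
    }

    public static void main(String[] args) {
        List<Prediction> input = List.of(
                new Prediction("C", "s1", 12.0),
                new Prediction("B", "s1", 12.0),
                new Prediction("A", "s1", 10.0));
        // The chosen ID does not depend on the order of the input list.
        System.out.println(pickRepresentatives(input).get("s1").id);
    }
}
```

Because the comparison uses both score and ID, two runs that deliver the same predictions in a different order end up with the same representative.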
If you use the parameter weight, the attribute sumWeight of the predictions is altered. This attribute can be used for sorting and filtering predictions. However, the weight does not have an influence on the choice of the representative.
Why would you like to have a representative from the phylogenetically closest species?
If you would like to analyze synteny, we have implemented SyntenyChecker:
http://www.jstacs.de/index.php/GeMoMa-Docs#Synteny_checker
Maybe the output can be beneficial for you. It returns a table with some standard columns plus one column per reference species.
best regards, Jens
from jstacs.
Hi, Jens,
Thanks for your professional answer. These predictions from different references are very likely influenced by introgression or ILS.
The filter step "GAF" indeed chose the prediction with the best score.
There is still one issue that confuses me: the predictions from different references (species1, species2, species3 and species4) have different numbers of exons. How should I solve this problem?
For example, the gene RPL38, this prediction with 4 exons,
Yours sincerely
Hi,
ce is the number of coding exons (of the prediction)
rce is the number of reference coding exons
In both of your examples ce equals rce.
Introns can be gained or lost. Depending on the phylogenetic distance, this should be rare.
Indeed, the filter in GAF is applied after removing redundancy:
Jstacs/projects/gemoma/GeMoMaAnnotationFilter.java, lines 551 to 570 in 65a454d
Hence, alternatives could have a different number of coding exons than the representative. Was this your point?
If I understood you correctly, you would only like to have predictions with exactly the same number of ce and rce.
You can tell GeMoMa that gaining or losing an intron is very expensive:
http://www.jstacs.de/index.php/GeMoMa-Docs#GeneModelMapper
Cf. parameter intron-loss-gain-penalty. This might help.
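To check after a run whether a prediction has the same number of coding exons as its reference, the ce and rce attributes can be compared directly. The following is a minimal Java sketch, assuming that both attributes appear as key=value pairs separated by semicolons in the attribute column of the GFF output. The class `ExonCountCheck`, its methods, and the example attribute strings are hypothetical and not part of GeMoMa.

```java
import java.util.HashMap;
import java.util.Map;

public class ExonCountCheck {

    /** Splits a GFF attribute column such as "ID=x;ce=4;rce=4" into a map. */
    public static Map<String, String> parseAttributes(String attributeColumn) {
        Map<String, String> attributes = new HashMap<>();
        for (String pair : attributeColumn.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attributes.put(pair.substring(0, eq).trim(),
                               pair.substring(eq + 1).trim());
            }
        }
        return attributes;
    }

    /** Returns true only if both ce and rce are present and equal. */
    public static boolean sameExonCount(String attributeColumn) {
        Map<String, String> attributes = parseAttributes(attributeColumn);
        String ce = attributes.get("ce");
        String rce = attributes.get("rce");
        if (ce == null || rce == null) {
            return false; // attribute missing, so nothing can be compared
        }
        return Integer.parseInt(ce) == Integer.parseInt(rce);
    }

    public static void main(String[] args) {
        // Made-up attribute strings, used only to exercise the check.
        System.out.println(sameExonCount("ID=example_R0;ce=4;rce=4"));
        System.out.println(sameExonCount("ID=example_R1;ce=3;rce=4"));
    }
}
```

Such a check only reports which predictions differ in exon number; it does not change which predictions GeMoMa makes, whereas the penalty influences the prediction step itself.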
best regards, Jens
Thanks again, Jens.
Okay, I'll give it a try with a strict value of intron-loss-gain-penalty.