Comments (5)
Hello @Hocnonsense
I have a few suggestions for you:
You could potentially reduce the weight of kofam in the config file.
By default each reference is given a weight of 0.7; you can reduce kofam's to a lower value with kofam_weight=0.X.
Additionally, you can change the hit processing algorithm (e.g., bpo) and set a threshold for the e-value if the default value is not strict enough. In your case the e-values are already quite low, which means these are not actually low-quality annotations.
Please refer to this extract from HMMER's documentation which explains this quite succinctly:
The E-value is the expected number of false positives (non-homologous sequences) that scored this well or better. The E-value is a measure of statistical significance. The lower the E-value, the more significant the hit. I typically consider sequences with E-values < 10^-3 or so to be significant hits.
You can further restrict the e-value, which will then likely increase precision but lower recall, so please keep that in mind.
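To make the precision/recall trade-off concrete, here is a minimal sketch that applies progressively stricter e-value cut-offs to the seq1 hits reported in this thread. The `filter_hits` helper and the chosen thresholds are hypothetical and only illustrate the idea; this is not how Mantis filters hits internally.

```python
# seq1 hits as reported in this thread: (reference, hit, e-value).
hits = [
    ("kofam_merged", "K07404", 3.00e-09),
    ("kofam_merged", "K20932", 1.08e-05),
    ("NCBIG_merged", "PQQ_ABC_repeats", 1.92e-18),
    ("NOGG_merged", "397945.Aave_1437", 3.06e-125),
    ("Pfam-A", "Cytochrom_D1", 2.31e-06),
]


def filter_hits(hits, evalue_threshold):
    """Keep hits whose e-value is at or below the threshold.

    Hypothetical helper for illustration only, not part of Mantis.
    """
    return [hit for hit in hits if hit[2] <= evalue_threshold]


# Illustrative thresholds: a stricter cut-off keeps fewer hits.
for threshold in (1e-3, 1e-6, 1e-10):
    kept = [hit[1] for hit in filter_hits(hits, threshold)]
    print(f"threshold={threshold:g}: {len(kept)} hits kept -> {kept}")
```

With the loosest cut-off all five hits survive; tightening it removes the weaker kofam and Pfam hits first, which is the recall you give up for precision.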
Keep in mind I also take into account other metrics when evaluating matches:
https://github.com/PedroMTQ/mantis/wiki/Additional-information#what-is-the-e-value-threshold
You can check these options in the wiki https://github.com/PedroMTQ/mantis/wiki/Functionalities#annotate-one-sample
Finally, if you prefer to only annotate with the NCBI reference, you can disable the other references.
Regards,
Pedro
from mantis.
Hello Pedro:
Thank you very much for your quick reply and kind advice!
However, it seems that a much better NOGG annotation is generated (taking seq1 as an example, it is annotated as 397945.Aave_1437, hit [61:361] of 362 with e-value 3.06e-125; the full table is shown below), and I still have doubts because the annotations seem to differ:
- 397945.Aave_1437 points to activity of Methanethiol oxidase (H2O + methanethiol + O2 = formaldehyde + H+ + H2O2 + hydrogen sulfide, Automatic Annotation, EC:1.8.3.4)
- K20932 points to hydrazine synthase subunit [EC:1.7.2.7], while another slightly better kofam hit (K07404, 6-phosphogluconolactonase [EC:3.1.1.31]) was ignored in consensus_output.tsv.
I've read https://github.com/PedroMTQ/mantis/wiki/Additional-information#inter-reference-hit-processing but cannot clearly understand how the processing works. What are "IDs and free-text functional descriptions" for Mantis, and what is the similarity score between them? Can I reproduce the results using mantis/consensus.py with the results from integrated_annotation as input?
| Query | Ref_file | Ref_hit | Ref_hit_accession | evalue | bitscore | Direction | Query_length | Query_hit_start | Query_hit_end | Ref_hit_start | Ref_hit_end | Ref_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| seq1 | kofam_merged | K07404 | - | 3.00E-09 | 42.7 | + | 362 | 129 | 271 | 331 | 476 | 530 |
| seq1 | kofam_merged | K20932 | - | 1.08E-05 | 30.8 | + | 362 | 264 | 359 | 45 | 127 | 377 |
| seq1 | NCBIG_merged | PQQ_ABC_repeats | TIGR03866.1 | 1.92E-18 | 72.8 | + | 362 | 181 | 354 | 9 | 170 | 310 |
| seq1 | NOGG_merged | 397945.Aave_1437 | | 3.06E-125 | 366 | + | 362 | 63 | 361 | 8 | 306 | 306 |
| seq1 | Pfam-A | Cytochrom_D1 | PF02239.19 | 2.31E-06 | 32.8 | + | 362 | 229 | 355 | 9 | 117 | 368 |
| seq2 | kofam_merged | K07404 | - | 6.16E-14 | 58.2 | + | 389 | 115 | 288 | 281 | 460 | 530 |
| seq2 | kofam_merged | K20932 | - | 0.000276989 | 26.1 | + | 389 | 285 | 387 | 44 | 127 | 377 |
| seq2 | NCBIG_merged | PQQ_ABC_repeats | TIGR03866.1 | 1.77E-15 | 63.1 | + | 389 | 212 | 386 | 8 | 170 | 310 |
| seq2 | NOGG_merged | 399795.CtesDRAFT_PD1384 | | 1.56E-143 | 417 | + | 389 | 1 | 388 | 1 | 393 | 393 |
| seq2 | Pfam-A | Lactonase | PF10282.12 | 0.000338543 | 26.3 | + | 389 | 160 | 242 | 249 | 324 | 344 |
I sincerely appreciate your help!
Hello @Hocnonsense ,
Regarding the similarity of annotations (i.e., the IDs and free-text descriptions), keep the following points in mind:
- an ID refers to an identifier, e.g., K07404.
- a free-text description refers to a textual description, e.g., description:Lactonase
When determining the consensus we proceed in 2 ways, depending on whether we are handling IDs, or free text:
Let's imagine we have a match to the NOG IDX, which is mapped to ID1, ID2 and ID3; another match to the Kofam IDY, which is mapped to ID4 and ID1; and finally a match to the Pfam IDZ, which is mapped to ID5.
During consensus, we try to determine the most likely annotation based on the agreement between IDs. In this case, since NOG IDX and Kofam IDY share ID1, we assume that this sequence is more likely to be NOG IDX + Kofam IDY rather than Pfam IDZ, since we have 2 "independent" sources (debatable, since some sources integrate data from other sources) pointing to the same annotation. (This is an oversimplification, since there are other internal calculations at play.)
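The toy example above can be sketched in a few lines of Python. This is only an illustration of "hits that share an ID support each other"; the `agreeing_groups` function is hypothetical, and it ignores reference weights, e-values, coverage and the other internal calculations mentioned above, so it is not Mantis's actual consensus code.

```python
from itertools import combinations

# Toy hits from the example above: each reference hit maps to a set of IDs.
hit_ids = {
    "NOG IDX": {"ID1", "ID2", "ID3"},
    "Kofam IDY": {"ID4", "ID1"},
    "Pfam IDZ": {"ID5"},
}


def agreeing_groups(hit_ids):
    """Merge hits into groups whenever two hits share at least one ID.

    Hypothetical helper for illustration only, not part of Mantis.
    """
    groups = [{name} for name in hit_ids]
    for a, b in combinations(hit_ids, 2):
        if hit_ids[a] & hit_ids[b]:  # the two hits have an ID in common
            group_a = next(g for g in groups if a in g)
            group_b = next(g for g in groups if b in g)
            if group_a is not group_b:
                group_a |= group_b
                groups = [g for g in groups if g is not group_b]
    return groups


groups = agreeing_groups(hit_ids)
best = max(groups, key=len)  # the group backed by the most hits
print(sorted(best))  # ['Kofam IDY', 'NOG IDX']
```

Pfam IDZ ends up alone in its own group, which is why it loses to the two hits that agree on ID1.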
For free-text descriptions the idea is similar, but instead of matching IDs (either it's a match, 1, or it's not, 0) we measure string similarity (from 0 to 1, 1 being very similar). This string similarity is calculated with another package that I've developed: https://github.com/PedroMTQ/UniFunc.
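As a rough illustration of a 0-to-1 string similarity, the sketch below uses Python's standard-library `difflib.SequenceMatcher` as a stand-in. This is an assumption made purely for demonstration: UniFunc uses its own method, and the scores printed here will not match what Mantis computes. The descriptions are the ones that appear in this thread.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Free-text descriptions taken from the hits discussed in this thread.
descriptions = [
    "6-phosphogluconolactonase",
    "hydrazine synthase subunit",
    "Lactonase, 7-bladed beta-propeller",
    "Cytochrome D1 heme domain",
]


def similarity(a, b):
    """Character-level similarity between 0 and 1, case-insensitive.

    Stand-in for UniFunc, for illustration only.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Score every pair of descriptions.
for a, b in combinations(descriptions, 2):
    print(f"{similarity(a, b):.2f}  {a!r} vs {b!r}")
```

The point is only that free-text agreement is graded rather than all-or-nothing, unlike the ID comparison.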
In general, the main idea of Mantis is to leverage "independent" annotation sources to determine a consensus, which we assume is more likely to be true than if we used a single source.
Hope this clears things up.
If it's still unclear I'd recommend you post the Mantis output files (you can trim it down to seq1 and seq2) here and I can try to dig through them to explain what's going on.
Regards,
Pedro
Thanks for your advice! Here are the three files I got in the Mantis output folder:
integrated_annotation.tsv
Query Ref_file Ref_hit Ref_hit_accession evalue bitscore Direction Query_length Query_hit_start Query_hit_end Ref_hit_start Ref_hit_end Ref_length | Links
seq1 Pfam-A Cytochrom_D1 PF02239.19 2.308245e-06 32.8 + 362 229 355 9 117 368 | description:Cytochrome D1 heme domain pfam:Cytochrom_D1 pfam:PF02239
seq1 kofam_merged K07404 - 3.0007185000000003e-09 42.7 + 362 129 271 331 476 530 | cog:COG2706 description:6-phosphogluconolactonase enzyme_ec:3.1.1.31 go:0017057 kegg_ko:K07404
seq1 kofam_merged K20932 - 1.0771810000000001e-05 30.8 + 362 264 359 45 127 377 | cog:COG3391 description:hydrazine synthase subunit enzyme_ec:1.7.2.7 kegg_ko:K20932
seq1 NCBIG_merged PQQ_ABC_repeats TIGR03866.1 1.9235375e-18 72.8 + 362 181 354 9 170 310 | description:PQQ-dependent catabolism-associated beta-propeller protein tigrfam:TIGR03866
seq1 NOGG_merged 397945.Aave_1437 3.056207886e-125 366.0 + 362 63 361 8 306 306 | cog:COG3391 eggnog:1P862 eggnog:2VKXA eggnog:4ACF0 eggnog:COG3391 pfam:PF05694
seq2 NCBIG_merged PQQ_ABC_repeats TIGR03866.1 4.3087240000000004e-15 61.8 + 358 255 347 219 309 310 | description:PQQ-dependent catabolism-associated beta-propeller protein tigrfam:TIGR03866
seq2 NOGG_merged 402626.Rpic_2496 3.6437579292e-229 631.0 + 358 1 357 1 357 357 | cog:COG3391 eggnog:1P862 eggnog:2WEY0 eggnog:COG3391 pfam:PF10282
seq2 Pfam-A Cytochrom_D1 PF02239.19 2.3851864999999997e-07 36.0 + 358 146 302 57 214 368 | description:Cytochrome D1 heme domain pfam:Cytochrom_D1 pfam:PF02239
seq2 kofam_merged K07404 - 7.1555595e-17 67.9 + 358 142 344 283 458 530 | cog:COG2706 description:6-phosphogluconolactonase enzyme_ec:3.1.1.31 go:0017057 kegg_ko:K07404
seq2 kofam_merged K20932 - 1.0002395e-11 50.6 + 358 19 124 51 139 377 | cog:COG3391 description:hydrazine synthase subunit enzyme_ec:1.7.2.7 kegg_ko:K20932
output_annotation.tsv
Query Ref_file Ref_hit Ref_hit_accession evalue bitscore Direction Query_length Query_hit_start Query_hit_end Ref_hit_start Ref_hit_end Ref_length
seq1 Pfam-A Cytochrom_D1 PF02239.19 2.308245e-06 32.8 + 362 229 355 9 117 368
seq1 kofam_merged K07404 - 3.0007185000000003e-09 42.7 + 362 129 271 331 476 530
seq1 kofam_merged K20932 - 1.0771810000000001e-05 30.8 + 362 264 359 45 127 377
seq1 NCBIG_merged PQQ_ABC_repeats TIGR03866.1 1.9235375e-18 72.8 + 362 181 354 9 170 310
seq1 NOGG_merged 397945.Aave_1437 3.056207886e-125 366.0 + 362 63 361 8 306 306
seq2 NCBIG_merged PQQ_ABC_repeats TIGR03866.1 4.3087240000000004e-15 61.8 + 358 255 347 219 309 310
seq2 NOGG_merged 402626.Rpic_2496 3.6437579292e-229 631.0 + 358 1 357 1 357 357
seq2 Pfam-A Cytochrom_D1 PF02239.19 2.3851864999999997e-07 36.0 + 358 146 302 57 214 368
seq2 kofam_merged K07404 - 7.1555595e-17 67.9 + 358 142 344 283 458 530
seq2 kofam_merged K20932 - 1.0002395e-11 50.6 + 358 19 124 51 139 377
consensus_annotation.tsv
Query Ref_Files Ref_Hits Consensus_hits Total_hits | Links
seq1 NOGG_merged;kofam_merged 397945.Aave_1437;K20932 2 5 | cog:COG3391 description:hydrazine synthase subunit eggnog:1P862 eggnog:2VKXA eggnog:4ACF0 eggnog:COG3391 enzyme_ec:1.7.2.7 kegg_ko:K20932 pfam:PF05694
seq2 NOGG_merged;Pfam-A;kofam_merged 399795.CtesDRAFT_PD1384;Lactonase;K20932 3 5 | cog:COG3391 description:Lactonase, 7-bladed beta-propeller description:hydrazine synthase subunit eggnog:1P862 eggnog:2VKXA eggnog:4ACF0 eggnog:COG3391 enzyme_ec:1.7.2.7 kegg_ko:K20932 pfam:Lactonase pfam:PF02239 pfam:PF10282
The database config file was used.
Thanks! I think I've found the connection between the NOGG and KEGG annotations: they are all annotated as COG3391! integrated_annotation.tsv is just a bridge between them!
I further searched the description for 399795.CtesDRAFT_PD1384 in UniProt and for COG3391 in eggNOG, and found inconsistencies between the different databases. I even found an EC annotation of 3.1.1.31 (which points to K07404) in a linked database...
I just hope there is some way to avoid these over-annotated results.
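The COG3391 bridge can be checked directly from the Links column of integrated_annotation.tsv. The sketch below is a minimal illustration using the seq1 rows pasted above; the `id_links` helper is hypothetical, and it assumes links are whitespace-separated `type:value` tokens as they appear in this thread.

```python
# Links for three seq1 hits, copied from integrated_annotation.tsv in this thread.
links = {
    "K07404": "cog:COG2706 description:6-phosphogluconolactonase "
              "enzyme_ec:3.1.1.31 go:0017057 kegg_ko:K07404",
    "K20932": "cog:COG3391 description:hydrazine synthase subunit "
              "enzyme_ec:1.7.2.7 kegg_ko:K20932",
    "397945.Aave_1437": "cog:COG3391 eggnog:1P862 eggnog:2VKXA eggnog:4ACF0 "
                        "eggnog:COG3391 pfam:PF05694",
}


def id_links(link_field):
    """Return the ID links of a Links field, skipping free-text descriptions.

    Hypothetical helper for illustration only, not part of Mantis.
    """
    tokens = link_field.split()
    return {t for t in tokens if ":" in t and not t.startswith("description:")}


nog = id_links(links["397945.Aave_1437"])
for ko in ("K07404", "K20932"):
    shared = sorted(nog & id_links(links[ko]))
    print(f"{ko} shares with the NOGG hit: {shared}")
```

Only K20932 shares an ID (cog:COG3391) with the NOGG hit, which is consistent with it being picked for the consensus over the K07404 hit despite K07404's lower e-value.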
Hey @Hocnonsense ,
I'm glad you found the reason why. Indeed, we sometimes have issues with the reference databases, but unfortunately this is not really something I can address.
If you do have concerns regarding a specific database, feel free to disable it.
Regards,
Pedro