You're absolutely correct about what I was aiming for and about my misinterpretation of the "greengenes_13_8_99 expected taxonomy abundance" file, and yes, your explanation makes perfect sense.
During my trials, my conclusions on parameters were in agreement with yours in the 2015 preprint. It was just the "greengenes_13_8_99 expected taxonomy abundance" file that was driving me nuts.
Thank you very much for your time and explanation.
from mockrobiota.
Expected abundances in a mock community are never replicated 100%, because so many factors (human imprecision, copy-number variation, PCR/sequencing error and bias) skew the relative abundances, and these effects largely cannot be corrected for bioinformatically. I have a more in-depth discussion of this on the QIIME2 forum.
Your results actually look pretty good, and I don't think you are doing anything "wrong". Particularly if you consider that taxa like [Eubacterium] and Eubacterium might be the same thing. Some things that could improve accuracy:
- Use denoising methods (such as DADA2 and Deblur, both implemented in QIIME2) instead of OTU picking. These will remove some erroneous reads that lead to false positives.
- We have a more recent preprint benchmarking different taxonomy classifiers in QIIME1 and QIIME2. Method recommendations have changed. RDP does quite well, but you can try out others. The data for this study are all in tax-credit if you want to take a closer look.
- Mock-2 is not the most accurate mock community that we have (though it does do better than some others). You could try something like mock-12 for a community that tends to get higher accuracy scores. About halfway down this notebook there are heatmaps of precision/recall/F-measure for each classifier method configuration on each individual mock community, so you can see how each method fares, and why I say mock-12 generally scores higher than mock-2.
Have you calculated precision/recall or some other accuracy metric on these data? You can follow the notebooks in tax-credit to run these same evaluations, or we have a method for calculating some of these metrics in QIIME2.
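As a minimal sketch of the kind of evaluation described above, precision/recall/F-measure can be computed on presence/absence of taxa at a given level. This is illustrative only (the taxon names below are invented, not the actual mock-2 composition, and it is not the tax-credit or QIIME2 implementation):

```python
# Hypothetical sketch: presence/absence precision, recall, and F-measure
# for an observed taxonomic composition vs. the expected one.
# Taxon names are illustrative, not the real mock-2 contents.

def evaluate(expected, observed):
    """Compute precision, recall, and F-measure on sets of taxa."""
    expected, observed = set(expected), set(observed)
    tp = len(expected & observed)   # taxa correctly detected
    fp = len(observed - expected)   # spurious taxa (false positives)
    fn = len(expected - observed)   # missed taxa (false negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

expected = ["g__Bacillus", "g__Listeria", "g__Staphylococcus"]
observed = ["g__Bacillus", "g__Listeria", "g__Escherichia"]
p, r, f = evaluate(expected, observed)
```

Denoised data typically improves precision here, since fewer erroneous reads means fewer false-positive taxa.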
I am going to close this issue, since there is not really an error here, but please let me know if you have any more questions or comments!
from mockrobiota.
Thank you for your quick response.
I see that I have created confusion with the word "replicate", apologies for that.
The aim was not to replicate in the lab the source composition of the Mock2 community and run a bioinformatic analysis afterwards, but to analyze the publicly available data of the following repository:
https://github.com/caporaso-lab/mockrobiota/tree/master/data/mock-2
In this repository the following files are provided:
- forward-read
- reverse-read
- index-read
- sample-metadata.tsv
- source taxonomy abundance
- greengenes_13_8_99 expected taxonomy abundance
According to the README file, this mock community is referred to as the B2 dataset in Bokulich et al. 2015.
So what I tried to do was download the data and get as close to the expected abundances as possible. The reason I think I'm doing something wrong is that the provided expected abundance file shows all the genera of the source identified with quite accurate abundances, while my results are quite off. I have been trying to understand the reason, but with no luck so far.
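One simple way to quantify how "off" an observed composition is from the expected one is a total L1 distance over relative abundances (twice the Bray-Curtis dissimilarity for compositions that sum to 1). A minimal sketch, with made-up taxon names and values:

```python
# Hypothetical sketch: L1 distance between expected and observed relative
# abundances. 0.0 = perfect match, 2.0 = completely disjoint compositions.
# All names and numbers below are invented for illustration.

def l1_distance(expected, observed):
    """Sum of absolute abundance differences over the union of taxa."""
    taxa = set(expected) | set(observed)
    return sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0))
               for t in taxa)

expected = {"g__Bacillus": 0.5, "g__Listeria": 0.5}
observed = {"g__Bacillus": 0.6, "g__Listeria": 0.3, "g__Escherichia": 0.1}
dist = l1_distance(expected, observed)
```

A moderate distance here is normal for mock communities, for the reasons discussed above (handling, PCR, and sequencing all skew abundances).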
from mockrobiota.
Either I am still misunderstanding, or there is no misunderstanding. By "replicate", you mean that you are attempting to analyze the mock-2 data and detect the expected genera at the expected abundances. Correct?
I think I have identified the source of confusion. It looks like you are interpreting the greengenes_13_8_99 expected taxonomy abundance file as the abundances observed after analysis. In fact, it is just the source taxonomies reformatted so that the labels are consistent with the Greengenes 13_8 99% OTUs taxonomy; it is not the product of any type of analysis. That is why the expected abundances are so close to the source: they are the source, except that some source taxonomies may be "collapsed" together into a single expected taxonomy where, e.g., a species label does not exist in the Greengenes taxonomy.
Does that make sense?
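The "collapsing" described above can be sketched as truncating semicolon-delimited taxonomy strings to a shared rank and summing abundances that end up with the same label. This is an illustrative simplification (the real relabeling maps source names onto Greengenes 13_8 strings; the taxa and numbers below are invented):

```python
# Hypothetical sketch of "collapsing": truncate taxonomy strings to a
# given number of ranks, then sum abundances sharing the same label.
# Labels and abundances are invented for illustration.
from collections import defaultdict

def collapse(source, level=6):
    """Truncate 'k__;p__;...;s__' strings to `level` ranks and sum."""
    collapsed = defaultdict(float)
    for taxonomy, abundance in source.items():
        label = ";".join(taxonomy.split(";")[:level])
        collapsed[label] += abundance
    return dict(collapsed)

source = {
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;"
    "f__Bacillaceae;g__Bacillus;s__cereus": 0.3,
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;"
    "f__Bacillaceae;g__Bacillus;s__subtilis": 0.2,
}
# Both species collapse into one genus-level entry with abundance 0.5.
collapsed = collapse(source)
```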
So I think you are actually doing everything correctly and your "found" abundances actually look quite good.
For example, by looking at either the 2015 or 2017 preprints, you will see that none of the methods achieve 100% accuracy at the species level for mock communities. This is because taxon abundances are always skewed during sample handling/PCR/sequencing, and no bioinformatic analysis can really correct for this perfectly. Genus-level classifications are quite a lot better, but still not perfect, for the same reason.
from mockrobiota.