Comments (4)

dkioroglou commented on August 10, 2024

You're absolutely correct about what I was aiming for and about my misinterpretation of the "greengenes_13_8_99 expected taxonomy abundance", and yes, your explanation makes complete sense.
During my trials, my conclusions on parameters were consistent with those in your 2015 preprint. It was just the "greengenes_13_8_99 expected taxonomy abundance" file that was driving me nuts.

Thank you very much for your time and explanation.

from mockrobiota.

nbokulich commented on August 10, 2024

Expected abundances in a mock community are never replicated 100%, because so many factors impact the relative abundance (human imprecision, copy number variation, PCR/sequencing error and bias). These skew the results and to a large degree cannot be corrected for bioinformatically. I have a more in-depth discussion on this on the QIIME2 forum.

Your results actually look pretty good, and I don't think you are doing anything "wrong", particularly if you consider that taxa like [Eubacterium] and Eubacterium might be the same thing. Some things that could improve accuracy:

  1. Use denoising methods (like dada2 and deblur, which are implemented in QIIME2) instead of OTU picking. These will remove some erroneous reads that lead to false positives.
  2. We have a more recent preprint benchmarking different taxonomy classifiers in QIIME1 and QIIME2. Method recommendations have changed. RDP does quite well, but you can try out others. The data for this study are all in tax-credit if you want to take a closer look.
  3. Mock-2 is not the most accurate mock community that we have (though it does do better than some others). You could try something like mock-12 for a community that tends to get higher accuracy scores. Around half-way down this notebook there are heatmaps for precision/recall/F-measure for each classifier method configuration on each individual mock community, so you can see how each method fares, and how I am judging that mock-12 tends to get higher accuracy scores in general than mock-2.

Have you calculated precision/recall or some other accuracy metric on these data? You can follow the notebooks in tax-credit to run these same evaluations, or we have a method for calculating some of these metrics in QIIME2.
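As a rough illustration of what such a calculation involves, here is a minimal sketch that scores taxon detection as sets of labels. It is not the tax-credit or QIIME2 implementation: the function names are hypothetical, the taxa lists are made-up placeholders rather than the mock-2 composition, and stripping the square brackets is only one assumed way of treating labels like [Eubacterium] and Eubacterium as the same thing.

```python
def normalize(label):
    """Drop the square brackets around a name, e.g. "[Eubacterium]" -> "Eubacterium"."""
    return label.replace("[", "").replace("]", "").strip()


def precision_recall_f(expected, observed):
    """Score detection of taxa as sets of labels (abundances are ignored)."""
    exp = {normalize(t) for t in expected}
    obs = {normalize(t) for t in observed}
    true_positives = len(exp & obs)
    precision = true_positives / len(obs) if obs else 0.0
    recall = true_positives / len(exp) if exp else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Placeholder genera for illustration only; not the real mock-2 contents.
expected = ["Bacteroides", "Eubacterium", "Clostridium", "Lactobacillus"]
observed = ["Bacteroides", "[Eubacterium]", "Clostridium", "Streptococcus", "Prevotella"]

p, r, f = precision_recall_f(expected, observed)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```

In this toy case three of the five observed genera are expected (precision 0.60) and three of the four expected genera are found (recall 0.75).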

I am going to close this issue, since there is not really an error here, but please let me know if you have any more questions or comments!

dkioroglou commented on August 10, 2024

Thank you for your quick response.
I see that I have created confusion with the word "replicate", apologies for that.
The aim was not to replicate in the lab the source composition of the Mock2 community and run a bioinformatic analysis afterwards, but to analyze the publicly available data of the following repository:

https://github.com/caporaso-lab/mockrobiota/tree/master/data/mock-2

In this repository the following files are provided:

  • forward-read
  • reverse-read
  • index-read
  • sample-metadata.tsv
  • source taxonomy abundance
  • greengenes_13_8_99 expected taxonomy abundance

According to the README file, this mock community is referred to as the B2 dataset in Bokulich et al. 2015.
So what I tried to do was download the data and get as close to the expected abundance as possible. The reason I think I'm doing something wrong is that the provided expected abundance file shows that all the genera of the source have been identified with quite accurate abundances, but my results are quite off. I have been trying to understand the reason, but with no luck so far.

nbokulich commented on August 10, 2024

Either I am still misunderstanding, or there is no misunderstanding. By "replicate", you mean that you are attempting to analyze the mock-2 data and detect the expected genera at the expected abundances. Correct?

I think I have identified the source of confusion. It looks like you are interpreting the greengenes_13_8_99 expected taxonomy abundance to be the abundances observed after analysis. These are in fact just the source taxonomies reformatted to have labels that are consistent with the Greengenes 13_8 99% OTUs taxonomy, not the product of any type of analysis. That is why the expected abundances are so close to the source — they are the source, except that some source taxonomies may be "collapsed" together into a single expected taxonomy where, e.g., the species labels do not exist in the Greengenes taxonomy.
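To make the "collapsing" concrete, here is a minimal sketch of how source abundances could be relabeled and summed into expected abundances. The taxonomy strings, the species names, the `label_map` dictionary, and the `collapse` function are all hypothetical illustrations; this is not the script used to build the mockrobiota files.

```python
from collections import defaultdict


def collapse(source_abundances, label_map):
    """Relabel each source taxon and sum abundances that land on the same label."""
    expected = defaultdict(float)
    for taxon, abundance in source_abundances.items():
        # Taxa without an entry in label_map keep their original label.
        expected[label_map.get(taxon, taxon)] += abundance
    return dict(expected)


# Hypothetical source composition (relative abundances summing to 1).
source = {
    "g__Bacteroides;s__speciesA": 0.25,
    "g__Bacteroides;s__speciesB": 0.25,
    "g__Clostridium;s__speciesC": 0.5,
}

# Assumed case: speciesA and speciesB have no species label in the reference
# taxonomy, so both map to the same genus-level label.
label_map = {
    "g__Bacteroides;s__speciesA": "g__Bacteroides;s__",
    "g__Bacteroides;s__speciesB": "g__Bacteroides;s__",
}

expected = collapse(source, label_map)
print(expected)
```

The total abundance is unchanged; only the labels differ, which is why an expected file built this way tracks the source so closely.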

Does that make sense?

So I think you are actually doing everything correctly and your "found" abundances actually look quite good.

For example, if you look at either the 2015 or 2017 preprints, you will see that none of the methods actually reach 100% accuracy at species level for mock communities. This is because the taxon abundances are always skewed during sample handling/PCR/sequencing, and no bioinformatic analysis can really correct this perfectly. Genus-level classifications are actually quite a lot better but still not perfect, for this same reason.
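If you want to compare your results at genus level rather than species level, one way to do it is to truncate the taxonomy strings before comparing. The sketch below assumes semicolon-delimited taxonomy strings and uses invented species names; the `truncate` helper is hypothetical. It only shows how a species-level mismatch can disappear at genus level, and it does not model the abundance skew described above.

```python
def truncate(taxonomy, depth):
    """Keep only the first `depth` ranks of a semicolon-delimited taxonomy string."""
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    return ";".join(ranks[:depth])


# Invented labels for illustration; three ranks each (family, genus, species).
expected_species = {
    "f__Bacteroidaceae;g__Bacteroides;s__alpha",
    "f__Clostridiaceae;g__Clostridium;s__beta",
}
observed_species = {
    "f__Bacteroidaceae;g__Bacteroides;s__gamma",  # species call differs
    "f__Clostridiaceae;g__Clostridium;s__beta",
}

species_hits = len(expected_species & observed_species)

# Compare again after dropping the species rank (keep family and genus).
expected_genus = {truncate(t, 2) for t in expected_species}
observed_genus = {truncate(t, 2) for t in observed_species}
genus_hits = len(expected_genus & observed_genus)

print(f"matches at species level: {species_hits}, at genus level: {genus_hits}")
```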
