You're absolutely correct about what I was aiming for and about my misinterpretation of the "greengenes_13_8_99 expected taxonomy abundance" file, and yes, your explanation makes perfect sense.
During my trials, my conclusions on parameters were in agreement with yours in the 2015 preprint. It was just the "greengenes_13_8_99 expected taxonomy abundance" file that was driving me nuts.
Thank you very much for your time and explanation.
from mockrobiota.
Expected abundances in a mock community are never replicated 100%, because so many factors (human imprecision, copy-number variation, PCR/sequencing error and bias) skew the relative abundances, and these effects largely cannot be corrected for bioinformatically. I have a more in-depth discussion of this on the QIIME2 forum.
Your results actually look pretty good, and I don't think you are doing anything "wrong". Particularly if you consider that taxa like [Eubacterium] and Eubacterium might be the same thing. Some things that could improve accuracy:
- Use denoising methods (such as DADA2 and Deblur, both implemented in QIIME2) instead of OTU picking. These will remove some erroneous reads that lead to false positives.
- We have a more recent preprint benchmarking different taxonomy classifiers in QIIME1 and QIIME2. Method recommendations have changed. RDP does quite well, but you can try out others. The data for this study are all in tax-credit if you want to take a closer look.
- Mock-2 is not the most accurate mock community that we have (though it does do better than some others). You could try something like mock-12 for a community that tends to get higher accuracy scores. About halfway down this notebook there are heatmaps of precision/recall/F-measure for each classifier method configuration on each individual mock community, so you can see how each method fares, and why I say mock-12 generally scores higher than mock-2.
Have you calculated precision/recall or some other accuracy metric on these data? You can follow the notebooks in tax-credit to run these same evaluations, or we have a method for calculating some of these metrics in QIIME2.
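As a minimal sketch of the kind of evaluation described above, precision/recall/F-measure can be computed on presence/absence of taxa at a given level. This is illustrative only (the taxon names below are invented, not the actual mock-2 composition, and it is not the tax-credit or QIIME2 implementation):

```python
# Hypothetical sketch: presence/absence precision, recall, and F-measure
# for an observed taxonomic composition vs. the expected one.
# Taxon names are illustrative, not the real mock-2 contents.

def evaluate(expected, observed):
    """Compute precision, recall, and F-measure on sets of taxa."""
    expected, observed = set(expected), set(observed)
    tp = len(expected & observed)   # taxa correctly detected
    fp = len(observed - expected)   # spurious taxa (false positives)
    fn = len(expected - observed)   # missed taxa (false negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

expected = ["g__Bacillus", "g__Listeria", "g__Staphylococcus"]
observed = ["g__Bacillus", "g__Listeria", "g__Escherichia"]
p, r, f = evaluate(expected, observed)
```

Denoised data typically improves precision here, since fewer erroneous reads means fewer false-positive taxa.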
I am going to close this issue, since there is not really an error here, but please let me know if you have any more questions or comments!
from mockrobiota.
Thank you for your quick response.
I see that I have created confusion with the word "replicate", apologies for that.
The aim was not to replicate in the lab the source composition of the Mock2 community and run a bioinformatic analysis afterwards, but to analyze the publicly available data of the following repository:
https://github.com/caporaso-lab/mockrobiota/tree/master/data/mock-2
In this repository the following files are provided:
- forward-read
- reverse-read
- index-read
- sample-metadata.tsv
- source taxonomy abundance
- greengenes_13_8_99 expected taxonomy abundance
According to the README file, this mock community is referred to as the B2 dataset in Bokulich et al. 2015.
So what I tried to do was download the data and get as close to the expected abundances as possible. The reason I think I'm doing something wrong is that the provided expected abundance file shows all the genera of the source identified with quite accurate abundances, while my results are quite off. I have been trying to understand the reason, but with no luck so far.
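One simple way to quantify how "off" an observed composition is from the expected one is a total L1 distance over relative abundances (twice the Bray-Curtis dissimilarity for compositions that sum to 1). A minimal sketch, with made-up taxon names and values:

```python
# Hypothetical sketch: L1 distance between expected and observed relative
# abundances. 0.0 = perfect match, 2.0 = completely disjoint compositions.
# All names and numbers below are invented for illustration.

def l1_distance(expected, observed):
    """Sum of absolute abundance differences over the union of taxa."""
    taxa = set(expected) | set(observed)
    return sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0))
               for t in taxa)

expected = {"g__Bacillus": 0.5, "g__Listeria": 0.5}
observed = {"g__Bacillus": 0.6, "g__Listeria": 0.3, "g__Escherichia": 0.1}
dist = l1_distance(expected, observed)
```

A moderate distance here is normal for mock communities, for the reasons discussed above (handling, PCR, and sequencing all skew abundances).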
from mockrobiota.
Either I am still misunderstanding, or there is no misunderstanding. By "replicate", you mean that you are attempting to analyze the mock-2 data and detect the expected genera at the expected abundances. Correct?
I think I have identified the source of confusion. It looks like you are interpreting the greengenes_13_8_99 expected taxonomy abundance file as the abundances observed after analysis. In fact, it is just the source taxonomies reformatted so that the labels are consistent with the Greengenes 13_8 99% OTUs taxonomy; it is not the product of any type of analysis. That is why the expected abundances are so close to the source: they are the source, except that some source taxonomies may be "collapsed" together into a single expected taxonomy where, e.g., a species label does not exist in the Greengenes taxonomy.
Does that make sense?
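The "collapsing" described above can be sketched as truncating semicolon-delimited taxonomy strings to a shared rank and summing abundances that end up with the same label. This is an illustrative simplification (the real relabeling maps source names onto Greengenes 13_8 strings; the taxa and numbers below are invented):

```python
# Hypothetical sketch of "collapsing": truncate taxonomy strings to a
# given number of ranks, then sum abundances sharing the same label.
# Labels and abundances are invented for illustration.
from collections import defaultdict

def collapse(source, level=6):
    """Truncate 'k__;p__;...;s__' strings to `level` ranks and sum."""
    collapsed = defaultdict(float)
    for taxonomy, abundance in source.items():
        label = ";".join(taxonomy.split(";")[:level])
        collapsed[label] += abundance
    return dict(collapsed)

source = {
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;"
    "f__Bacillaceae;g__Bacillus;s__cereus": 0.3,
    "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;"
    "f__Bacillaceae;g__Bacillus;s__subtilis": 0.2,
}
# Both species collapse into one genus-level entry with abundance 0.5.
collapsed = collapse(source)
```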
So I think you are actually doing everything correctly and your "found" abundances actually look quite good.
For example, by looking at either the 2015 or 2017 preprints, you will see that none of the methods achieve 100% accuracy at the species level for mock communities. This is because taxon abundances are always skewed during sample handling/PCR/sequencing, and no bioinformatic analysis can really correct for this perfectly. Genus-level classifications are quite a lot better, but still not perfect, for the same reason.
from mockrobiota.