mockrobiota's People

Contributors

gregcaporaso, isbb-anaerobic, jairideout, nbokulich


mockrobiota's Issues

Mock 11 primers

Hello,

Would it be possible to know the primers used for the library construction of the mock community 11?
Thank you in advance.

Add code to automatically compile dataset metadata into summary table

It would be great if we could automatically compile all dataset metadata tables into one summary table that appears on the repo homepage. This could be a nice way to navigate based on basic specs, rather than requiring users to click through each directory to find the MCs that fit their interests.

Making the table searchable would be ideal.

The table could contain hyperlinks that either redirect to the entry for that mock community on the repo, or download raw data and other files directly from the table.

Is there any way to have the code run automatically every time a new mock community is submitted?
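A minimal sketch of the compilation step, assuming each dataset directory provides a two-column dataset-metadata.tsv (parameter, value); the exact column layout and any hookup to the repo (e.g., a CI job triggered on merge) are assumptions here, not the repository's actual tooling:

```python
# Sketch: compile per-dataset metadata TSVs into one summary table.
# Assumes two tab-separated columns (parameter, value) per file; the
# real mockrobiota layout may differ.
import csv
import io


def load_dataset_metadata(text):
    """Parse one two-column TSV (parameter<TAB>value) into a dict."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def compile_summary(datasets):
    """datasets: {dataset-name: tsv-text}. Returns (header, rows)."""
    parsed = {name: load_dataset_metadata(text) for name, text in datasets.items()}
    # Union of all parameters, in first-seen order, keeps the table stable
    # even when datasets report different fields.
    columns = []
    for meta in parsed.values():
        for key in meta:
            if key not in columns:
                columns.append(key)
    header = ["dataset"] + columns
    rows = [[name] + [meta.get(c, "") for c in columns]
            for name, meta in sorted(parsed.items())]
    return header, rows
```

Running this on every merge (e.g., from CI) and rendering the rows as a markdown table in the homepage README would answer the "run automatically" question.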

How can I distinguish those three mock communities from Mock-10 data?

Hello,

I want to use the Mock-10 data, which contains three mock communities. However, after downloading the fastq data from the link provided in dataset-metadata.tsv, I don't know how to identify which mock community a read sequence belongs to. Perhaps I should run the following python command line provided in the README.md:

split_libraries_fastq.py -i mock-forward-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq.gz --rev_comp_mapping_barcodes

However, this script comes from QIIME 1, and QIIME 2 does not appear to provide it. I also tried to run this command line with the script provided here, but it doesn't work.

Could you please give me some instructions about how I can know which mock community a read belongs to?

Thank you!

correct the precision in expected taxonomy files

@nbokulich, when @jairideout and I were testing this we discovered that the relative abundances are off in some of the expected taxonomy files. We wanted to confirm that the taxa abundances in each sample sum to 1.0 (to seven decimal places, the unittest assertAlmostEqual default). We found that in some cases they're not equal even to two decimal places. For example, in mock-3, the sum of the values in sample HMPMockV1.2.Staggered1 is 1.02. Would you be able to look into this? We have a test file that you can run now which will help you identify these samples (run python tests/check_data_integrity.py in this repository; this is Python 3 only).

Note that this issue is causing the current build to fail. I think it's important that this blocks people from using the data for now.
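The check described above can be sketched as follows; the column layout (taxonomy string first, then one column per sample) is an assumption, not necessarily the exact expected-taxonomy.tsv format:

```python
# Sketch: taxon abundances in each sample column of an expected-taxonomy
# TSV should sum to 1.0 within seven decimal places. Returns the samples
# that fail, with their actual sums.
import csv
import io


def check_abundance_sums(tsv_text, places=7):
    """Return {sample: sum} for samples whose abundances don't sum to 1.0."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    samples = header[1:]            # first column assumed to hold the taxonomy string
    totals = {s: 0.0 for s in samples}
    for row in reader:
        for sample, value in zip(samples, row[1:]):
            totals[sample] += float(value)
    tol = 10 ** -places
    return {s: t for s, t in totals.items() if abs(t - 1.0) > tol}
```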

Permanent "issues" or notes pages for each mock community dataset

We should create a permanent place for users to archive notes/tips/known issues for each dataset. E.g., some "issues" may not be errors, per se, but rather things like whether mapping barcodes need to be reverse complemented for demultiplexing, or whether header lines in raw data cause issues with specific software. Some of these may not be "issues" that need to be corrected, but rather documented in a permanent place for future users to follow. On the other hand, various other observations may be useful to share but are not "issues", e.g., whether specific samples in a dataset have low read counts post-QC and should be ignored.

@gregcaporaso what do you think?

I see 2 possibilities:

  1. Create an issue for each mock community and leave it open permanently. This keeps notes associated with a single MC organized in one place, and more comments can be added as more observations are made. The advantage is that notes are added as comments without the need for a PR, which streamlines the process. The disadvantages are that real issues will be intermixed with notes (even if we separate the "notes" page from real issues, things may get messy, as they already are!), that the issue page cannot be closed when real issues are solved, and that a long-running issue could get long and messy.

  2. Create a notes file in the main directory for each dataset. Users would need to submit a PR to add permanent notes, though this could also help keep things tidy. The disadvantage is that users would need to go looking for this file, while the issues page is where most users will already be searching for known issues.

Automatic taxonomy string extraction

Add code to convert "source" taxonomy files to expected-taxonomy.tsv by extracting full-length taxonomy strings from reference database X.

Similarly, to extract database identifiers, e.g., from GenBank.

The issue with both of these is that manual curation is still very much needed, and database quality can be a major problem. But the first would be approachable and would streamline the process of creating these files.

phenotype data

Is there any mock data with measured metabolites or phenotypes, for benchmarking supervised machine learning methods?

Mock-3 data

There is something amiss in the header files.
I attempted to demultiplex and ran into errors.
The first error:
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: A0A3V120410:1:10:10065:26809/1. This may be because you passed an incorrect value for phred_offset.

I forced the Phred Offset to be 33 after taking a quick look at the file and ran again:

The 2nd error:
qiime.split_libraries_fastq.FastqParseError: Headers of barcode and read do not match. Can't continue. Confirm that the barcode fastq and read fastq that you are passing match one another.

We ran the entire data set through skbio and it did not fail. Walking through one at a time revealed the headers to be identical. Several folks looked at this and were perplexed.
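One way to pin down a mismatch like this is to walk the two files in lockstep and report the first record where the IDs disagree. This is a hypothetical helper, not part of the repository; it assumes plain 4-line fastq records (use gzip.open(..., "rt") for the .gz files):

```python
# Sketch: compare record IDs between a read fastq and its barcode fastq,
# reporting the first position where the headers disagree (what
# split_libraries_fastq.py complains about).
from itertools import zip_longest


def fastq_ids(handle):
    """Yield the ID portion of every '@' header line (4-line records)."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            # Drop '@' and any '/1', '/2' read-pair suffix before comparing.
            yield line[1:].strip().split()[0].rsplit("/", 1)[0]


def first_header_mismatch(read_handle, barcode_handle):
    """Return (record_index, read_id, barcode_id) of the first mismatch, or None."""
    pairs = zip_longest(fastq_ids(read_handle), fastq_ids(barcode_handle))
    for i, (r, b) in enumerate(pairs):
        if r != b:
            return i, r, b
    return None
```

If this returns None but QIIME still reports a header mismatch, the problem may be suffix handling (e.g., /1 vs /3) rather than the IDs themselves.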

Failing to demultiplex mock-3

I am trying to use QIIME 1.9.1 to demultiplex mock-3. When I enter the command listed in the readme (split_libraries_fastq.py -i mock-forward-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq.gz) I get the following error:

MacQIIME PHPMB13:mock3 $ split_libraries_fastq.py -i mock-forward-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq.gz
Error in split_libraries_fastq.py: Some or all barcodes are not valid golay codes. Do they need to be reverse complemented? If these are not golay barcodes pass --barcode_type 12 to disable barcode error correction, or pass --barcode_type # if the barcodes are not 12 base pairs, where # is the size of the barcodes. Invalid codes:
	AATCAACTAGGC CAAATGGTCGTC ACACATAAGTCG TGTACGGATAAC

If you need help with QIIME, see:
http://help.qiime.org

When I muck around with the revcomp options, I can get past that, but get stuck here:

MacQIIME PHPMB13:mock3 $ split_libraries_fastq.py -i mock-forward-read.fastq -b mock-index-read.fastq -o out --store_demultiplexed_fastq -m sample-metadata.tsv  --rev_comp_mapping_barcodes
Traceback (most recent call last):
  File "/macqiime/anaconda/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/macqiime/anaconda/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/macqiime/anaconda/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 317, in process_fastq_single_end_read_file
    parse_fastq(fastq_read_f, strict=False, phred_offset=phred_offset)):
  File "/macqiime/anaconda/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
    seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: A0A3V120410:1:10:10065:26809/1. This may be because you passed an incorrect value for phred_offset.

I've tried specifying phred_offsets of 33 and 64, but that doesn't help.

split_libraries_fastq.py error with golay barcodes (mock-5 & -7)

This must be an issue specific to me since no one else seems to have run into anything similar here. I don't have much experience with golay barcodes, so please mind my ignorance!

I have downloaded mock-5 and corrected the barcode headers as recommended in README.md. When I run split libs, I run into the following error:

mock-5$ split_libraries_fastq.py -i mock-forward-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.corrected.fastq.gz --rev_comp_barcode
Error in split_libraries_fastq.py: Some or all barcodes are not valid golay codes. Do they need to be reverse complemented? If these are not golay barcodes pass --barcode_type 12 to disable barcode error correction, or pass --barcode_type # if the barcodes are not 12 base pairs, where # is the size of the barcodes. Invalid codes:
        AATCAACTAGGC CAAATGGTCGTC ACACATAAGTCG TGTACGGATAAC

If you need help with QIIME, see:
http://help.qiime.org

Similarly, I run into this error with mock-7:

mock-7$split_libraries_fastq.py -i mock-forward-read.fastq -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq --rev_comp_barcode
Error in split_libraries_fastq.py: Some or all barcodes are not valid golay codes. Do they need to be reverse complemented? If these are not golay barcodes pass --barcode_type 12 to disable barcode error correction, or pass --barcode_type # if the barcodes are not 12 base pairs, where # is the size of the barcodes. Invalid codes:
        CAACGCTAGAAT CCATCACATAGG GGCTAAACTATG

If you need help with QIIME, see:
http://help.qiime.org

I'm running QIIME 1.9.1 on Ubuntu 14.04.

Could anyone advise as to what I'm missing? Thank you!
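A quick way to check the orientation question yourself is to count how often each metadata barcode appears in the index reads as-is versus reverse complemented. This is an illustrative helper, not repository code; index reads are passed in as strings here (in practice, take every second line of each 4-line fastq record):

```python
# Sketch: the "not valid golay codes" error often just means the mapping-file
# barcodes need reverse complementing. Count matches in both orientations.
from collections import Counter


def revcomp(seq):
    """Reverse complement of a DNA sequence (A/C/G/T/N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def barcode_orientation(metadata_barcodes, index_reads):
    """Count how often each barcode matches as-is vs. reverse complemented."""
    counts = Counter(index_reads)
    return {bc: {"forward": counts[bc], "revcomp": counts[revcomp(bc)]}
            for bc in metadata_barcodes}
```

If the revcomp counts dominate, --rev_comp_mapping_barcodes is the right flag; if neither orientation matches, the index reads themselves may be the problem (wrong file, or extra bases).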

additional mock communities to include

From Robert Edgar (via feedback on the PeerJ pre-print):

  • Kozich et al. "dual indexing" V4 HMP mock (doi: 10.1128/AEM.01043-13, FASTQ files: http://www.mothur.org/MiSeqDevelopmentData.html). This is yet another HMP V4 run, but is useful because it has a quite different error profile (presumably due to the different library preparation method).
  • The "Extreme" community used for DADA2 validation (doi:10.1038/nmeth.3869, SRA run id SRR2990088). This is useful because it has several species which are >97% similar so cannot be resolved by traditional OTU methods.

Question about barcode length in mock2 and mock6

Hello,

Thanks for explanations in #76.

one more question about mock 2 and 6.

The barcodes indicated in mock 2 and 6 are of length 12 and 6 respectively, but in their mock-index-reads.fastq files the index reads are of length 13 and 7.

example for mock2:
the barcode in sample-metadata.tsv is ATCTGCCTGGAA.
If I search for perfect matches in the index fastq file I find:

     23 AATCTGCCTGGAA
 243167 ATCTGCCTGGAAA
    446 ATCTGCCTGGAAC
     17 ATCTGCCTGGAAG
      1 ATCTGCCTGGAAN
    681 ATCTGCCTGGAAT
     62 TATCTGCCTGGAA

Is the correct barcode ATCTGCCTGGAAA (with an A at the end)?
What is your advice?

Same problem with mock6

      ACCTGT:            ACCTCG:            ACCGCA:
     951 AACCTGT        195 AACCTCG         58 AACCGCA
    2212 ACCTGTA     210433 ACCTCGA       1245 ACCGCAA
  277218 ACCTGTC      36791 ACCTCGC       5589 ACCGCAC
    4911 ACCTGTG       1878 ACCTCGG     312775 ACCGCAG
    1399 ACCTGTT       5707 ACCTCGT       2041 ACCGCAT
      24 CACCTGT         16 CACCTCG          1 GACCGCA
       1 GACCTGT         10 GACCTCG         90 TACCGCA
      46 TACCTGT          1 NACCTCG
                       1092 TACCTCG

Many thanks for these datasets!
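Counts like the ones above can be produced with a short tally over the index reads; this is an illustrative snippet (index sequences passed in as strings; in practice, take every second line of each 4-line fastq record):

```python
# Sketch: tally index-read lengths and the most common index sequences,
# to see whether index reads carry an extra base relative to the
# metadata barcodes.
from collections import Counter


def index_summary(index_reads, top=5):
    """Return (length_counts, most_common_indexes)."""
    lengths = Counter(len(r) for r in index_reads)
    common = Counter(index_reads).most_common(top)
    return lengths, common
```

If one length dominates and it is one base longer than the metadata barcodes, truncating the index reads by one base (or using the dominant (N+1)-mer as the barcode) are the two obvious candidates; which is correct depends on the library preparation.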

additional test in check_data_integrity.py

Should check that linked files have:

  • one of the following names: mock-forward-read.fastq.[gz|zip], mock-reverse-read.fastq.[gz|zip], mock-index-read.fastq.[gz|zip]
  • no duplicated names
  • links exist for forward and index files (reverse is optional)
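The proposed checks can be sketched as below; the actual check_data_integrity.py interface is not assumed, and this validates a plain list of names rather than walking directories:

```python
# Sketch of the proposed checks: linked raw-data filenames must match the
# allowed patterns, must not repeat, and forward + index reads must exist
# (reverse is optional).
import re

NAME_RE = re.compile(r"^mock-(forward|reverse|index)-read\.fastq\.(gz|zip)$")


def validate_linked_files(filenames):
    """filenames: list of linked file names. Returns a list of problems."""
    problems = [f"bad name: {n}" for n in filenames if not NAME_RE.match(n)]
    if len(set(filenames)) != len(filenames):
        problems.append("duplicated names")
    kinds = {m.group(1) for m in map(NAME_RE.match, filenames) if m}
    for required in ("forward", "index"):   # reverse is optional
        if required not in kinds:
            problems.append(f"missing {required} read")
    return problems
```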

Update required metadata

Remove human-readable-description; it is redundant with the dataset README.md pages. See comment in PR#45. Thanks @gregcaporaso for the recommendation!

Also consider removing bokulich2013-id and bokulich2015-id from the metadata, as these are specific to founder datasets. Instead, this information could be added to the dataset descriptions in the README.md files for the relevant datasets.

Shotgun Mock

Hi,
Just wondering if there are any shotgun (WGS) mock communities (e.g., sequenced on a HiSeq at 2x150 bp)?

Thanks and great idea to have this repository.

mock-5 data

For mock-5, there is an issue with the files contained in the raw-data-url: ftp://ftp.microbio.me/pub/illumina-mock-communities-raw-data/Broad3/

There are three files there: combined.read1, combined.read2, and combined.i

The combined.i file is double zipped, unlike the rest, and it has full reads, not barcodes. I am not positive, but it looks like it could be a combination of read1 and read2 in one file.

Error preprocessing mock 7 and 8 : Failed qual conversion

As reported in #57 , I encountered some trouble using mock 7 and 8.

I am using QIIME 1.9.1.

  • mock 7
split_libraries_fastq.py -i mock7-forward-read.fastq -o split_libraries_M7 -m mock7_sample-metadata.tsv -b mock7-index-read.fastq.gz --rev_comp_mapping_barcodes

Traceback (most recent call last):
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 317, in process_fastq_single_end_read_file
    parse_fastq(fastq_read_f, strict=False, phred_offset=phred_offset)):
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
    seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: ILLUMINA_0275:2:1101:1357:1952#ATAGGCGATCNN. This may be because you passed an incorrect value for phred_offset.

  • mock 8
split_libraries_fastq.py -i mock8-forward-read.fastq.gz -o split_librariesM8 -m mock8_sample-metadata.tsv -b mock8-index-read.fastq.gz --rev_comp_mapping_barcodes --phred_offset 64

Traceback (most recent call last):
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/bin/split_libraries_fastq.py", line 365, in <module>
    main()
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/bin/split_libraries_fastq.py", line 344, in main
    for fasta_header, sequence, quality, seq_id in seq_generator:
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/lib/python2.7/site-packages/qiime/split_libraries_fastq.py", line 317, in process_fastq_single_end_read_file
    parse_fastq(fastq_read_f, strict=False, phred_offset=phred_offset)):
  File "/usr/local/genome/VirtualEnv/qiime-1.9.1/lib/python2.7/site-packages/skbio/parse/sequences/fastq.py", line 174, in parse_fastq
    seqid)
skbio.parse.sequences._exception.FastqParseError: Failed qual conversion for seq id: ILLUMINA_0258:3:1101:1184:1974#NNNNNNNNNNNN. This may be because you passed an incorrect value for phred_offset.

It seems that the phred quality score is not valid.

  • for mock 7

The indicated sequence contains an "i" quality character, corresponding to ASCII 105, which is outside the classical score scale.

  • for mock 8

I checked the sequence, but the ASCII values are between 66 and 104, so it should work.

Do you have the same trouble? What can I do to solve this?

A final question: do the sequences still contain primer sequences?
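The by-hand ASCII inspection described above can be automated with a short scan of the quality lines; this is an illustrative snippet (plain 4-line fastq records assumed):

```python
# Sketch: scan a fastq's quality lines for the min/max ASCII values. This
# tells you whether a phred offset of 33 or 64 can possibly apply, and
# flags out-of-range characters like the ASCII-105 'i' reported above.
def quality_ascii_range(lines):
    """lines: iterable of fastq lines. Returns (min_ord, max_ord)."""
    lo, hi = 255, 0
    for i, line in enumerate(lines):
        if i % 4 == 3:                  # every 4th line holds quality characters
            for ch in line.rstrip("\n"):
                o = ord(ch)
                lo, hi = min(lo, o), max(hi, o)
    return lo, hi
```

A minimum around 33-58 suggests Phred+33; a minimum of 64 or more suggests Phred+64. Values above the expected maximum for either scheme (such as 105) indicate corrupt or nonstandard quality strings that no offset setting will fix.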

Potential issue demultiplexing mock-8

I pulled down the forward read and barcode data for mock-8 and attempted to demultiplex it using QIIME 2. For loading into QIIME 2, I'm saving the forward reads as sequences.fastq.gz and specifying a semantic type of EMPSingleEndSequences. The resulting .qza file is approximately 5GB in size.

If I do not reverse complement the sample barcodes (i.e., default use of q2-demux), the resulting .qza archive is approximately 6kb in size. If I reverse complement the barcodes in the sample metadata file, the resulting .qza is approximately 17MB in size. In the RC case, the samples come out with the following number of sequences:

  • Even1 : 127730
  • Even2 : 87659
  • Even3 : 111879

The readme notes that reverse complementing the mapping file barcodes is necessary, and it does seem to yield more sequences, but the output is substantially smaller than I'd expect given the size of the raw data. I couldn't find expected numbers of sequences in the manuscript cited in the readme; are the above numbers in line with what is expected?

Trouble generating a pull request

I'm trying to submit two new mock communities that passed the data integrity test.
I'm having trouble creating a pull request. I have some git experience but have never created one.
When I try to create a pull request on GitHub, it asks me to compare another branch to master, and I'm stuck at that point.
Could you provide more detailed instructions?

mock-7 link to MG-RAST isn't very useful

It's not clear which of the samples in that project are the mock communities, and the "Export metadata" button in MG-RAST just hangs. Do we have other copies of these files that we could post to microbio.me?

Is Mock-1 HiSeq or MiSeq?

Hello!
Was Mock-1 sequenced in MiSeq or HiSeq? Sorry, I got confused, because when I checked on the inventory it says it's MiSeq, but when I check dataset-metadata.tsv of Mock-1 it says HiSeq. Am I looking at the wrong place or is there really a problem with the information?
Thank you!

Unable to replicate community composition of Mock-3

Hi,

I have been trying to benchmark my 16S pipeline (a combination of QIIME and UPARSE) using the mock-3 dataset. However, across several iterations (with different parameter values or databases), the relative abundances are consistently highly skewed towards Staphylococcus (approx. 50% in the "even" and 70% in the "staggered" samples).

The number of OTUs detected is always close to the original number of strains. Even with the values used in your articles, I do not seem to get the expected composition.

new data integrity checks

  • naming of OTU directories should be of the format: <similarity-threshold>-otus
  • expected sequences files should be named expected-sequences.fasta, and no other fasta files should be present in those directories
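A sketch of these two checks, operating on plain names/lists rather than a real directory walk (the check_data_integrity.py interface is not assumed):

```python
# Sketch of the proposed naming checks: OTU directory names must look like
# "<similarity-threshold>-otus", and each must contain exactly one fasta
# file, named expected-sequences.fasta.
import re

OTU_DIR_RE = re.compile(r"^\d+-otus$")


def check_otu_dir(dirname, filenames):
    """Return a list of violations for one OTU directory."""
    problems = []
    if not OTU_DIR_RE.match(dirname):
        problems.append(f"bad directory name: {dirname}")
    fastas = [f for f in filenames if f.endswith((".fasta", ".fa"))]
    if fastas != ["expected-sequences.fasta"]:
        problems.append(f"unexpected fasta files: {fastas}")
    return problems
```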

Specify reference taxonomy files (e.g., %OTU ID) used for annotation of expected-taxonomy.tsv files

Expected composition (expected-taxonomy.tsv) files need not only match the database and version, but the exact ref taxonomy file that is used for taxonomy assignment of observed data. In other words, if using 97 OTUs for taxonomy assignment, a 97 OTUs expected taxonomy file must be generated (that's what we have now). If 99 OTUs, 99 OTU expected taxonomy, etc.

Perhaps we should include this information somewhere. Any ideas how/where to do this? Perhaps changing the directory structure to:
database-name/version/OTU%
or
database-name/version-OTU%

There are issues with specifying this in the directory name: 1) the name can be ambiguous (e.g., "97" is not very specific), and 2) OTU %ID may not be the only difference between file types (e.g., if using a curated subset of reference seqs), and it is specific to marker-gene ref dbs, i.e., it does not apply to metagenome ref dbs. We will need to be very descriptive (e.g., "97-otus" instead of "97") for filenames, or perhaps add a README file to the directory? READMEs could get cumbersome.

Mock2 expected abundance replication issue

Hello,

I'm trying to replicate the Mock2 expected abundance results at 99% (OTU similarity), and more specifically only the expected abundances at the genus level.
From the provided data I used only the forward reads, which I trimmed at position 135 towards the 3' end. I didn't use the reverse reads because the quality didn't seem that good.

I have done various trials using Qiime 1.9.1 and the parameters that brought me closer to the expected results were the following:

split_libraries_fastq.py \
-m ${mapping} \
-i ${raw} \
-b ${index} \
-o splitting_tmp \
-p 0.75 \
-q 19 \
-r 3 \
--rev_comp_mapping_barcodes

pick_open_reference_otus.py \
-i ${fasta_file} \
-o ${output_dir}\
-r $HOME/gg_13_8_otus/rep_set_aligned/99_otus.fasta \
-p ${parameters_file}

Parameters_file:
pick_otus:similarity	0.99
assign_taxonomy:assignment_method	rdp  # version 2.2
assign_taxonomy:id_to_taxonomy	$HOME/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt
assign_taxonomy:reference_seqs_fp	$HOME/gg_13_8_otus/rep_set/99_otus.fasta
align_seqs:template_fp	$HOME/gg_13_8_otus/rep_set_aligned/99_otus.fasta

Genera identified:
In total 28 genera were identified.
5/28 genera were not in the expected results (denoted as misclassified).
23/24 of the expected genera were identified.

Genera abundances:
In most of the cases the abundances were not matching the expected ones. Please see the results at the bottom.

Bokulich et al. 2015
According to the paper, SortMeRNA should give better results in terms of precision, recall, and F-measure. I used the parameters mentioned in the methods for SortMeRNA (0.51:0.8:1:0.8:1) and actually got worse results (8/29 genera misclassified, 21/24 of the expected genera identified).

Could you please tell me what I'm doing wrong?

Results using RDP classifier:

Genera            Found    Expected  Source
Acinetobacter     0.0680   NA        NA
Akkermansia       4.5545   2.1277    2.0833
Alistipes         0.6976   2.1277    2.0833
Anaerococcus      2.5559   2.1277    2.0833
Anaerotruncus     0.4444   2.1277    2.0833
Bacteroides       29.8751  25.5319   25.0000
Bifidobacterium   5.3094   4.2553    4.1667
Blautia           2.0612   4.2553    4.1667
Clostridium       4.5551   8.5106    14.5833
Collinsella       0.1558   2.1277    2.0833
Coprococcus       NA       2.1277    2.0833
Dorea             8.4409   8.5106    4.1667
Edwardsiella      2.0291   2.1277    2.0833
Enterobacter      0.0134   2.1277    2.0833
Escherichia       6.8449   4.2553    4.1667
Eubacterium       NA       NA        6.2500
Faecalibacterium  0.0498   2.1277    2.0833
Lachnospira       2.7171   NA        NA
Lactobacillus     0.3774   NA        NA
Parabacteroides   2.6164   2.1277    2.0833
Proteus           0.0535   2.1277    2.0833
Providencia       0.0123   2.1277    2.0833
Roseburia         0.4342   2.1277    2.0833
Ruminococcus      1.3711   4.2553    6.2500
Shigella          0.0262   NA        NA
Streptococcus     8.9549   2.1277    2.0833
Subdoligranulum   4.8672   2.1277    2.0833
Victivallis       0.0375   NA        NA
[Eubacterium]     0.5718   6.3830    NA
[Ruminococcus]    10.3051  4.2553    NA

Mock 10 reverse reads gz is corrupted

A direct download from the link provided for the mock 10 reverse reads, when gunzipped, results in:
gunzip: SAG_rev.txt.gz: unexpected end of file
gunzip: SAG_rev.txt.gz: uncompress failed

It appears that the file is not complete.
Thanks,
Arron

Demultiplexing Reverse Reads

I'm working with a few mock communities (mock-8 and mock-9) and was easily able to demultiplex the forward reads using

split_libraries_fastq.py -i mock-forward-read.fastq.gz -o split_libraries -m sample-metadata.tsv -b mock-index-read.fastq.gz --rev_comp_mapping_barcodes

I was unable to demultiplex the reverse reads using this command. I have looked around for instructions on how to do this but I haven't been able to find anything.

Do the reverse reads have the same barcodes? Or is there a separate index file that I am missing somewhere?
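One common workaround (not an official mockrobiota recipe) is to demultiplex the forward reads first and then partition the reverse reads by record ID, since forward and reverse mates share headers and the single index read applies to both. An illustrative sketch:

```python
# Sketch of one workaround: after demultiplexing the forward reads, split
# the reverse reads by record ID, since mates share headers and the index
# read applies to both.
def split_reverse_by_id(reverse_lines, id_to_sample):
    """id_to_sample: {record_id: sample} built from the forward-read demux.
    Returns {sample: [fastq record lines]}."""
    out = {}
    record = []
    for line in reverse_lines:
        record.append(line)
        if len(record) == 4:            # one complete fastq record
            # Drop '@' and any '/1', '/2' suffix to get the bare record ID.
            rid = record[0][1:].strip().split()[0].rsplit("/", 1)[0]
            sample = id_to_sample.get(rid)
            if sample is not None:
                out.setdefault(sample, []).extend(record)
            record = []
    return out
```

The id_to_sample mapping can be built from the per-sample fasta/fastq output of the forward-read demultiplexing step.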
