Giter Club home page Giter Club logo

pipesidle's Introduction

Note

The functions of this workflow were added to nf-core/ampliseq in version 2.9.0.

pipesidle

Multipe region amplicon sequencing analysis with SMURF via Sidle in QIIME2.

The pipeline requires output from nf-core/ampliseq and applies Sidle (SMURF Implementation Done to acceLerate Efficiency) within QIIME2 to scaffolding multiple regions along a reference for improved resolution over a single region.

For example, multiple variable regions of the 16S rRNA gene were sequenced with various primers and need to be unified. This leads to one unified abundance and taxonomy profile over all variable regions.

Pipeline summary

  • The pipeline follows recommendations of the Sidle documentation
  • Import of all files in QIIME2
  • Database preparation (region extractions, reconstruction of taxonomies, truncation)
  • Input amplicon sequence preparation (truncation to region length)
  • Reconstruction of abundance table and taxonomic information
  • Phylogenetic tree (sequence reconstitution, phylogenetic placement)

Software requirement

  • Java
  • nextflow
  • singularity / apptainer / docker (tested with singularity)

Input data

Information to computed abundances, sequences, primers, and expected length, for each sequenced region.

Pipesidle was made for output of nf-core/ampliseq (run the pipeline for each region/primer pair separately, potentially use "--illumina_pe_its --cutadapt_min_overlap 10"), but should also accept any other pipeline output in correct format.

The information has to be provided in a tab-separated sample sheet via --input, one row per sequenced region.

Parameter Description
sampleID Unique ID*
asv_table Path to ASV count table (tsv)
asv_seq Path to ASV sequences (fasta)
primerfw Forward primer sequence (string)
primerrv Reverse primer sequence (string)
region_length Minimal region length (integer)

Warning: alpha-numerically sorted sampleID's must reflect sequence on the reference sequence, e.g. region with sampleID 1 must be located before region 2 in the reference database sequences.

Example sample sheet:

sampleID	asv_table	asv_seq	primerfw	primerrv	region_length
1	nfcoreampliseq/results_TGGCGAACGGGTGAGTAA_CCGTGTCTCAGTCCCARTG/dada2/ASV_table.tsv	nfcoreampliseq/results_TGGCGAACGGGTGAGTAA_CCGTGTCTCAGTCCCARTG/dada2/ASV_seqs.fasta	TGGCGAACGGGTGAGTAA	CCGTGTCTCAGTCCCARTG	145
2	nfcoreampliseq/results_ACTCCTACGGGAGGCAGC_GTATTACCGCGGCTGCTG/dada2/ASV_table.tsv	nfcoreampliseq/results_ACTCCTACGGGAGGCAGC_GTATTACCGCGGCTGCTG/dada2/ASV_seqs.fasta	ACTCCTACGGGAGGCAGC	GTATTACCGCGGCTGCTG	135
3	nfcoreampliseq/results_GTGTAGCGGTGRAATGCG_CCCGTCAATTCMTTTGAGTT/dada2/ASV_table.tsv	nfcoreampliseq/results_GTGTAGCGGTGRAATGCG_CCCGTCAATTCMTTTGAGTT/dada2/ASV_seqs.fasta	GTGTAGCGGTGRAATGCG	CCCGTCAATTCMTTTGAGTT	200
4	nfcoreampliseq/results_GGAGCATGTGGWTTAATTCGA_CGTTGCGGGACTTAACCC/dada2/ASV_table.tsv	nfcoreampliseq/results_GGAGCATGTGGWTTAATTCGA_CGTTGCGGGACTTAACCC/dada2/ASV_seqs.fasta	GGAGCATGTGGWTTAATTCGA	CGTTGCGGGACTTAACCC	115
5	nfcoreampliseq/results_GGAGGAAGGTGGGGATGAC_AAGGCCCGGGAACGTATT/dada2/ASV_table.tsv	nfcoreampliseq/results_GGAGGAAGGTGGGGATGAC_AAGGCCCGGGAACGTATT/dada2/ASV_seqs.fasta	GGAGGAAGGTGGGGATGAC	AAGGCCCGGGAACGTATT	150

Taxonomic reference database

Three files are required:

  • fasta file with reference sequences
  • file with aligned reference sequences
  • file with taxonomies of reference sequences

Here code for the use of greengenes 13_8:

wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
tar -zxvf gg_13_8_otus.tar.gz gg_13_8_otus/rep_set/99_otus.fasta
mv gg_13_8_otus/rep_set/99_otus.fasta gg_13_8_otus_rep_set_99_otus.fasta
tar -zxvf gg_13_8_otus.tar.gz gg_13_8_otus/rep_set_aligned/99_otus.fasta
mv gg_13_8_otus/rep_set_aligned/99_otus.fasta gg_13_8_otus_rep_set_aligned_99_otus.fasta
tar -zxvf gg_13_8_otus.tar.gz gg_13_8_otus/taxonomy/99_otu_taxonomy.txt
mv gg_13_8_otus/taxonomy/99_otu_taxonomy.txt gg_13_8_otus_taxonomy_99_otu_taxonomy.txt
rm -r gg_13_8_otus

Alternatively, SILVA 128 has also all required files.

Run the pipeline

As example with the above greengenes 13_8 reference database.

NXF_VER=23.10.0 nextflow run d4straub/pipesidle \
  -profile singularity
  --input 'samplesheet.tsv' \
  --db_sequences 'gg_13_8_otus_rep_set_99_otus.fasta' \
  --db_alignedsequences 'gg_13_8_otus_rep_set_aligned_99_otus.fasta' \
  --db_taxonomy 'gg_13_8_otus_taxonomy_99_otu_taxonomy.txt' \
  --db_tree 'https://data.qiime2.org/2021.4/common/sepp-refs-gg-13-8.qza' \
  --outdir result

The --db_tree parameter is optional but recommended.

Additional parameters are

  • --sidle_save_intermediate: Save a bunch of intermediate files (default: false)
  • --max_memory: maximum RAM allowed per process, to scale to computational resources at hand (default: 1992.GB)
  • --max_cpus: number of maximum cpus per process, to scale to computational resources at hand (default: 128)
  • --max_time: maximum time allowed per process, to scale to computational resources at hand (default: 168.h)

Output

This section describes only the most important output files.

  • reconstructed/

    • reconstructed_feature-table.biom: Unified abundance table in biom format
    • reconstructed_feature-table.tsv: Tab-separated unified abundance table
    • reconstructed_taxonomy.tsv: Tab-separated unified taxonomy table
    • reconstructed_merged.tsv: Tab-separated unified table with merged abundance and taxonomy information
    • reconstructed_tree.nwk: Phylogenetic tree
  • barplot/

    • index.html: Interactive barplot, open in your browser (i.e. double-click it)

Credits

This pipeline was originally written by Daniel Straub (@d4straub) for use at the Quantitative Biology Center (QBiC).

Code was inspired by nf-core/ampliseq (doi: 10.5281/zenodo.1493841) (Straub et al., 2020) of the nf-core collection of workflows (Ewels et al., 2020).

Citations

Please cite all tools that you used, i.e. all listed here, but only one containerization tool (here Singularity or Docker):

  • Nextflow

    Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

  • Singularity

    Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

  • Docker

    Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

  • q2-sidle

    Debelius, J.W.; Robeson, M.; Lhugerth, L.W.; Boulund, F.; Ye, W.; Engstrand, L. "A comparison of approaches to scaffolding multiple regions along the 16S rRNA gene for improved resolution." Preprint in BioRxiv. doi: 10.1101/2021.03.23.436606

  • SMURF

    Fuks, G.; Elgart, M.; Amir, A.; Zeisel, A.; Turnbaugh, P.J., Soen, Y.; and Shental, N. (2018). "Combining 16S rRNA gene variable regions enables high-resolution microbial community profiling." Microbiome. 6: 17. doi: 10.1186/s40168-017-0396-x

  • RESCRIPt

    Robeson MS 2nd, O'Rourke DR, Kaehler BD, Ziemski M, Dillon MR, Foster JT, Bokulich NA. RESCRIPt: Reproducible sequence taxonomy reference database management. PLoS Comput Biol. 2021 Nov 8;17(11):e1009581. doi: 10.1371/journal.pcbi.1009581. PMID: 34748542; PMCID: PMC8601625.

  • SEPP

    Janssen S, McDonald D, Gonzalez A, Navas-Molina JA, Jiang L, Xu ZZ, Winker K, Kado DM, Orwoll E, Manary M, Mirarab S, Knight R. Phylogenetic Placement of Exact Amplicon Sequences Improves Associations with Clinical Information. mSystems. 2018 Apr 17;3(3):e00021-18. doi: 10.1128/mSystems.00021-18. PMID: 29719869; PMCID: PMC5904434.

  • QIIME2

    Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, Alexander H, Alm EJ, Arumugam M, Asnicar F, Bai Y, Bisanz JE, Bittinger K, Brejnrod A, Brislawn CJ, Brown CT, Callahan BJ, Caraballo-Rodríguez AM, Chase J, Cope EK, Da Silva R, Diener C, Dorrestein PC, Douglas GM, Durall DM, Duvallet C, Edwardson CF, Ernst M, Estaki M, Fouquier J, Gauglitz JM, Gibbons SM, Gibson DL, Gonzalez A, Gorlick K, Guo J, Hillmann B, Holmes S, Holste H, Huttenhower C, Huttley GA, Janssen S, Jarmusch AK, Jiang L, Kaehler BD, Kang KB, Keefe CR, Keim P, Kelley ST, Knights D, Koester I, Kosciolek T, Kreps J, Langille MGI, Lee J, Ley R, Liu YX, Loftfield E, Lozupone C, Maher M, Marotz C, Martin BD, McDonald D, McIver LJ, Melnik AV, Metcalf JL, Morgan SC, Morton JT, Naimey AT, Navas-Molina JA, Nothias LF, Orchanian SB, Pearson T, Peoples SL, Petras D, Preuss ML, Pruesse E, Rasmussen LB, Rivers A, Robeson MS 2nd, Rosenthal P, Segata N, Shaffer M, Shiffer A, Sinha R, Song SJ, Spear JR, Swafford AD, Thompson LR, Torres PJ, Trinh P, Tripathi A, Turnbaugh PJ, Ul-Hasan S, van der Hooft JJJ, Vargas F, Vázquez-Baeza Y, Vogtmann E, von Hippel M, Walters W, Wan Y, Wang M, Warren J, Weber KC, Williamson CHD, Willis AD, Xu ZZ, Zaneveld JR, Zhang Y, Zhu Q, Knight R, Caporaso JG. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol. 2019 Aug;37(8):852-857. doi: 10.1038/s41587-019-0209-9. Erratum in: Nat Biotechnol. 2019 Sep;37(9):1091. PMID: 31341288; PMCID: PMC7015180.

pipesidle's People

Contributors

d4straub avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.