NOTE: This pipeline is currently specific to the pathogen genomics workflow at the Public Health laboratory, SA Pathology. It is under development for public use.
This pipeline is designed to process HBV, HCV, and HIV probe-capture, target-enriched whole genome sequencing libraries prepared directly from clinical samples using the Agilent SureSelectXT system. It performs human read removal, initial automated QC assessments, accurate sample-specific consensus construction, and downstream processing to support genotype and drug resistance analyses. It is used by Pathogen Genomics in Public Health and Epidemiology, South Australia, and is automated with Snakemake and Slurm for processing on an HPC cluster. It has a sub-workflow to process negative controls.
The pipeline is started by passing a sample sheet to the shell runner script:
/path/to/decipher/bcl2deciPHEr_runner.sh /path/to/the/SampleSheet.csv
A Snakemake workflow processes FASTQ files from HBV, HCV, and HIV samples and performs the following tasks:
-
read trimming (fastp),
-
human read removal (Bowtie2),
-
sequencing yield assessment (Seqtk packaged as 'fq' from the Nullarbor pipeline),
-
species identification (Kraken2),
-
de novo assemblies (metaSPAdes) and their QC (Seqtk packaged as 'fa' from the Nullarbor pipeline),
-
read mapping to a sample specific reference (using the shiver tool adapted for each virus),
-
consensus construction (iVar),
-
HCV genotyping (abricate using a custom HCVcore database),
-
coverage assessment (QUAST),
-
depth assessment (SAMtools),
-
HIV pol nucleotide variant analysis at drug resistance sites (bammix).
The workflow produces the following outputs:
-
sequencing yield summaries "*_seq_data.tab",
-
species identification summaries "*_species_identification.tab",
-
whole genome and typing region consensus sequences "_consensus/_consensus.fa",
-
depth statistics summaries and histograms "depth_summary_.tsv, _depth_histograms/depth.tab",
-
coverage statistics summaries "summary_coverage_*.tsv"
-
HCV genotyping "HCV_genotype.tab"
-
HIV pol drug resistance site nucleotide variant files "bammix/*_DR_position_base_counts.csv"
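As an illustration of the depth assessment step, the sketch below summarises per-position depth. It assumes the three-column, tab-separated text written by `samtools depth` (reference name, 1-based position, depth). Because zero-depth positions can be missing from that output unless `-a` is used, the reference length is passed in explicitly. The function name `summarise_depth`, the `min_depth` threshold of 10, and the returned keys are hypothetical; this is not the pipeline's own summary code or file format.

```python
def summarise_depth(lines, ref_length, min_depth=10):
    """Summarise `samtools depth`-style lines (ref<TAB>pos<TAB>depth).

    ref_length is supplied by the caller because positions with zero
    depth may be absent from the input. min_depth is an illustrative
    threshold, not a value taken from the pipeline.
    """
    total_depth = 0
    positions_at_min_depth = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _ref, _pos, depth = line.split("\t")
        depth = int(depth)
        total_depth += depth
        if depth >= min_depth:
            positions_at_min_depth += 1
    return {
        "mean_depth": total_depth / ref_length,
        "fraction_at_min_depth": positions_at_min_depth / ref_length,
    }


if __name__ == "__main__":
    # Toy input: position 3 is absent, i.e. it has zero depth.
    example = ["ref\t1\t20", "ref\t2\t5", "ref\t4\t15"]
    print(summarise_depth(example, ref_length=4))
```

Dividing by the reference length rather than by the number of reported lines keeps uncovered positions in the denominator, which matters for low-coverage samples.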
The pipeline includes a sub-workflow for negative controls, which are identified by a sample ID matching "NEG*" in the input sample sheet. For these samples, only the sequencing quality assessment and Kraken2 are run, to check for potential HBV, HCV, or HIV contamination.
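The "NEG*" rule can be sketched as a simple partition of sample IDs. This is a minimal illustration and not the pipeline's own code. The function name `split_negative_controls` and the example IDs are hypothetical, and case-sensitive shell-style matching is an assumption.

```python
from fnmatch import fnmatchcase


def split_negative_controls(sample_ids, pattern="NEG*"):
    """Partition sample IDs into (negative_controls, clinical_samples).

    Matching is case-sensitive and shell-style, so "NEG1" matches
    "NEG*" but "neg1" does not (an assumption of this sketch).
    """
    negatives = [s for s in sample_ids if fnmatchcase(s, pattern)]
    clinical = [s for s in sample_ids if not fnmatchcase(s, pattern)]
    return negatives, clinical


if __name__ == "__main__":
    ids = ["HBV001", "NEG1", "HCV-22", "NEG_run5", "HIV007"]
    negatives, clinical = split_negative_controls(ids)
    print("negative controls:", negatives)
    print("clinical samples:", clinical)
```

In a Snakemake workflow, two lists like these could drive separate target rules, so that negative controls only request the sequencing quality and Kraken2 outputs.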
Human reference genome: GCA_000001405.29
Kraken2 database: k2pluspf (downloaded 20220607)
For HBV whole genome reference database see: /path/to/decipher/HBV_scripts/shbver/hbv_decipher_reference_information.txt
For HCV whole genome reference database see: /path/to/decipher/HCV_scripts/shcver/hcv_decipher_reference_information.txt
For HIV whole genome reference database see: /path/to/decipher/HIV_scripts/shiver/hiv_decipher_reference_information.txt
The HCVcore abricate database was built from sequences downloaded from https://hcv.lanl.gov/components/sequence/HCV/search/searchi.html using the following search settings:
Genotype: Any Genotype
Subtype: Any Subtype
Include recombinants
Confirmed only
Genomic region: core
Exclude related
Format: Fasta
Gap handling: none
Sequence type: Nucleotides
Include genotype reference sequences
Include H77(NC_004102) reference sequence
This pipeline was written by Rosa C. Coldbeck-Shackley.
Acknowledgments to all the authors of tools used in the pipeline.
-
fastp
Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinforma Oxf Engl. 2018 Sep 1;34(17):i884–90.
-
Bowtie2
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012 Mar 4;9(4):357–9.
-
Nullarbor
Seemann T, Goncalves da Silva A, Bulach DM, Schultz MB, Kwong JC, Howden BP.
-
kraken2
Taxonomic sequence classifier that assigns taxonomic labels to DNA sequences.
Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20:257.
-
metaSPAdes
Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017 May;27(5):824–34.
-
shiver
Wymant C, Blanquart F, Golubchik T, Gall A, Bakker M, Bezemer D, et al. Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver. Virus Evol. 2018 Jan;4(1):vey007.
-
ABRicate
Mass screening of contigs for antimicrobial resistance, virulence genes and plasmids.
Seemann T.
-
iVar
Grubaugh ND, Gangavarapu K, Quick J, Matteson NL, De Jesus JG, Main BJ, et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol. 2019 Jan 8;20(1):8.
-
QUAST
Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinforma Oxf Engl. 2013 Apr 15;29(8):1072–5.
-
SAMtools
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinforma Oxf Engl. 2009 Aug 15;25(16):2078–9.
-
bammix
chrisruis.
-
Snakemake
Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, Forster J, Lee S, Twardziok SO, Kanitz A, Wilm A, Holtgrewe M, Rahmann S, Nahnsen S, Köster J. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33.
-
SLURM
Yoo AB, Jette MA, Grondona M. SLURM: Simple Linux Utility for Resource Management. In: Feitelson D, Rudolph L, Schwiegelshohn U, editors. Job Scheduling Strategies for Parallel Processing. JSSPP 2003. Lecture Notes in Computer Science, vol 2862. Springer, Berlin, Heidelberg; 2003. https://doi.org/10.1007/10968987_3