banzai!

🏄

banzai is a BASH (shell) script that links together the disparate programs needed to process the raw sequencing results from an Illumina run into a contingency table of the number of sequences per taxon found in a set of samples. Some preliminary ecological analyses are included as well.

The script should run on Unix and Linux machines. It makes heavy use of Unix command line utilities (such as find, grep, sed, awk, and more) and is written for the BSD versions of those programs as found on standard installations of Mac OS X. I tried to use POSIX-compliant commands wherever possible.

Basic implementation

NEW!!! NOTE that as of 2015-10-09, you must direct banzai.sh to your parameter file. This allows for much easier use when analyzing multiple types of projects. Parameter files can be called whatever you want -- e.g. banzai_params_16s.sh. When you invoke banzai.sh, it sources whatever file you pass as the first argument (separated from the script path by a space). Simply copy the file 'banzai_params.sh' into a new folder, set parameters as desired, then type into a terminal:

bash /Users/user_name/path/to/the/file/banzai.sh   /Users/user_name/path/to/param_file.sh

It's important to use bash rather than sh or . to invoke the script. Someday I'll figure out a better workaround, but for now this was the only way I could guarantee the log file was created in the way I wanted.
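
The mechanism described above, sourcing whatever file is passed as the first argument, can be sketched roughly as follows. This is a minimal illustration and not the actual contents of banzai.sh; the function name `load_params` and the variable `EXAMPLE_PARAM` used in the comment are hypothetical.

```shell
# Hypothetical sketch: source a parameter file given as the first argument.
# load_params is an illustrative name, not a function from banzai.sh.
load_params() {
  local param_file="$1"
  if [ -z "${param_file}" ]; then
    echo "Usage: bash banzai.sh /path/to/param_file.sh" >&2
    return 1
  fi
  if [ ! -r "${param_file}" ]; then
    echo "Cannot read parameter file: ${param_file}" >&2
    return 1
  fi
  # Sourcing runs the file in the current shell, so any variable it sets
  # (e.g. a hypothetical EXAMPLE_PARAM) is visible to the rest of the script.
  . "${param_file}"
}
```

Inside a script, this would be called as `load_params "$1" || exit 1` before any parameter is used.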

Dependencies

Aside from the standard command line utilities (awk, sed, grep, etc) that are already included on Unix machines, this script relies on the following tools:

  • PEAR: merging paired-end reads
  • cutadapt: primer removal (I might replace with awk)
  • vsearch: sequence quality filtering (requires version 1.4.0 or greater); OTU clustering
  • swarm: OTU clustering
  • seqtk: reverse complementing entire fastq/a files
  • python: fast consolidation of duplicate sequences (installed by default on Macs)
  • blast+: taxonomic assignment
  • MEGAN: taxonomic assignment
  • R: ecological analyses. Requires the packages vegan and gtools

Follow the Vagrant-VirtualBox instructions to automatically install your own virtual machine that includes all of these dependencies.

Recommended

  • Compressing and decompressing files can be slow because standard, built-in utilities (gzip) do not run in parallel. Installing the parallel compression tool pigz can yield substantial speedups. Banzai will check for pigz and use it if available.

  • I recommend that before analyzing data, you check and report basic properties of the sequencing runs using fastqc. I have included a script to do this for all the fastq or fastq.gz files in any subdirectory of a directory (run_fastqc.sh).
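
The pigz check mentioned in the first recommendation can be sketched along these lines. This is an assumption about how such a check might look, not a copy of banzai's own logic; `choose_compressor` and `ZIPPER` are hypothetical names.

```shell
# Hypothetical sketch: prefer pigz when it is installed, otherwise fall back to gzip.
# choose_compressor is an illustrative name, not taken from banzai.sh.
choose_compressor() {
  if command -v pigz >/dev/null 2>&1; then
    echo "pigz"
  else
    echo "gzip"
  fi
}

# Both tools accept -d (decompress) and -c (write to standard output),
# so the chosen name can be used interchangeably later on.
ZIPPER="$(choose_compressor)"
echo "Using ${ZIPPER} for compression"
```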

Optional/Deprecated

  • usearch: filtering paired reads on the basis of the sum of the error probabilities (maximum expected errors). This can be turned off, probably without much change in final data quality. We used to do OTU clustering with usearch, but the 32-bit version can't handle larger data sets.

Sequencing Pool Metadata

If you provide a CSV spreadsheet that contains metadata about the samples, banzai can read some of the parameters from it, like the primers and multiplex index sequences. You need to provide the file path to the spreadsheet, and the relevant column names.
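
As a rough illustration of reading a column by name, the sketch below pulls every value from a named column of a CSV file with awk. It is not banzai's own code: `get_column` is a hypothetical helper, and it assumes no quoted fields or embedded commas.

```shell
# Hypothetical sketch: print the values in a named column of a CSV file.
# Assumes plain comma-separated fields with no quoting.
get_column() {
  local csv_file="$1" column_name="$2"
  awk -F',' -v name="${column_name}" '
    NR == 1 {
      # Find the position of the requested column in the header row.
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (!col) { print "Column not found: " name > "/dev/stderr"; exit 1 }
      next
    }
    { print $col }
  ' "${csv_file}"
}
```

For example, `get_column metadata.csv tag_sequence` would list the tag sequences, where both the file name and the column name are placeholders for your own.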

It is VERY important that this file be encoded with UNIX line breaks. You can do this from Excel and TextWrangler. It doesn't appear to be critical that the text is encoded using UTF-8, though this is certainly the safest option. Early in the logfile you can check to be sure the correct number of tags and primer sequences were found.

No field should contain any spaces. That means row names, column names, and cells. Accommodating this would require an advanced degree in bash-quoting judo, which I do not have.
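
Both requirements above (UNIX line breaks, no spaces) can be checked before starting a run. The following is a minimal sketch; `check_metadata` is a hypothetical helper and not part of banzai.

```shell
# Hypothetical sketch: sanity checks for the metadata CSV.
# check_metadata is an illustrative name, not part of banzai.
check_metadata() {
  local csv_file="$1" status=0
  # A carriage return anywhere means Windows (CRLF) or classic Mac (CR) line breaks.
  if LC_ALL=C grep -q "$(printf '\r')" "${csv_file}"; then
    echo "ERROR: ${csv_file} contains carriage returns; convert to UNIX line breaks" >&2
    status=1
  fi
  # Spaces are not allowed in any row name, column name, or cell.
  if grep -q ' ' "${csv_file}"; then
    echo "ERROR: ${csv_file} contains spaces" >&2
    status=1
  fi
  return "${status}"
}
```

For a file with Windows (CRLF) line breaks, `tr -d '\r' < metadata.csv > metadata_unix.csv` is one common way to convert it; the file names here are placeholders.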

LIBRARY NAMES

As of 2015-10-09, libraries no longer have to be named anything in particular (e.g. A, B, lib1, lib2), BUT THEY CANNOT CONTAIN UNDERSCORES or spaces!
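
A library name can be tested against this rule with a simple pattern match. The helper below is a hypothetical sketch, not something banzai provides.

```shell
# Hypothetical sketch: reject library names that are empty or contain
# an underscore or a space.
valid_library_name() {
  case "$1" in
    "" | *_* | *" "*) return 1 ;;
    *) return 0 ;;
  esac
}
```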

Organization of raw data

Your data (fastq files) can be compressed or not, but banzai currently only works with paired-end Illumina data. Thus, the bare minimum input is two fastq files corresponding to the first and second read. Banzai will fail if there are files in your library folders that are not your raw data but have 'fastq' in the filename! For example, if your library contains four files ("R1.fastq", "R1.fastq.gz", "R2.fastq", and "R2.fastq.gz"), banzai will grab the first two (R1.fastq and R1.fastq.gz), try to merge them, and (correctly) fail miserably. Note that while PEAR 0.9.7 merges compressed (*.gz) files directly, PEAR 0.9.6 does not do so correctly. If given compressed files as input, banzai first decompresses them, which will add a little bit of time to the overall analysis.
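
To catch stray files before a run, a library folder can be checked for the number of files with 'fastq' in the name. The sketch below assumes one read pair sits directly inside each library folder, which may not match your layout; both function names are hypothetical.

```shell
# Hypothetical sketch: confirm a library folder holds exactly two files
# with "fastq" in the name (assumes one read pair per folder).
count_fastq_files() {
  find "$1" -maxdepth 1 -type f -name '*fastq*' | wc -l | tr -d ' '
}

check_library_dir() {
  local n
  n="$(count_fastq_files "$1")"
  if [ "${n}" -ne 2 ]; then
    echo "ERROR: expected 2 fastq files in $1, found ${n}" >&2
    return 1
  fi
}
```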

A note on removal of duplicate sequences

(dereplicate_fasta.py)

  • Input: a fasta file (e.g. 'infile.fasta')

  • Output: a file with the same name as the input but with the added extension '.derep' (e.g. 'infile.fasta.derep')

This output file contains each unique DNA sequence from the fasta file, followed by the labels of the reads matching this sequence. Thus, if an input fasta file consisted of three reads with identical DNA sequences:

>READ1
AATAGCGCTACGT
>READ2
AATAGCGCTACGT
>READ3
AATAGCGCTACGT

The output file is as follows:

AATAGCGCTACGT; READ1; READ2; READ3

Note that the original script also output a file of the sequences only (no names), but I removed this functionality on 2015-04-17.
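
The output format shown above can be reproduced with a short awk sketch. This is not the author's dereplicate_fasta.py, only an illustration of the same idea; `dereplicate_fasta` is a hypothetical shell function, and it assumes each sequence sits on a single line.

```shell
# Hypothetical sketch of the dereplication output format using awk.
# Not the author's dereplicate_fasta.py; assumes one line per sequence.
dereplicate_fasta() {
  awk '
    /^>/ { label = substr($0, 2); next }   # remember the read label, minus ">"
    NF {
      if (!($0 in reads)) {
        order[++n] = $0                    # keep sequences in first-seen order
        reads[$0] = label
      } else {
        reads[$0] = reads[$0] "; " label   # append the label to an existing sequence
      }
    }
    END { for (i = 1; i <= n; i++) print order[i] "; " reads[order[i]] }
  ' "$1"
}
```

Redirecting the output, e.g. `dereplicate_fasta infile.fasta > infile.fasta.derep`, gives a file named like the one described above.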

This could take a while...

In Mac OS 10.8 (Mountain Lion) and later, you can override your computer's sleep settings by running the script like so:

caffeinate -i -s bash /Users/user_name/path/to/the/file/banzai.sh   /Users/user_name/path/to/param_file.sh
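
If the same command needs to work on machines without caffeinate (e.g. Linux), a small wrapper can fall back to running the command directly. This is a hypothetical convenience and not part of banzai; `run_awake` is an illustrative name.

```shell
# Hypothetical sketch: keep the machine awake with caffeinate when it exists,
# otherwise just run the command as given.
run_awake() {
  if command -v caffeinate >/dev/null 2>&1; then
    caffeinate -i -s "$@"
  else
    "$@"
  fi
}
```

Usage would look like `run_awake bash /Users/user_name/path/to/the/file/banzai.sh /Users/user_name/path/to/param_file.sh`.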

Known Issues/Bugs

  • Currently awaiting catastrophic finding...

Notes

An alternate hack to have the pipeline print to terminal AND file, in case logging breaks:

sh script.sh 2>&1 | tee ~/Desktop/logfile.txt

  • 2015-10-19 expected error filtering implemented via vsearch. OTU clustering can be done with swarm or usearch.
  • 2015-10-09 read length calculated from raw data. Library names are flexible.
  • 2014-11-12 Noticed that the reverse tag removal step removed the tag label from the sequenceID line of fasta files if the tag sequence is RC-palindromic!
