Pipeline Overview

With this repository you can process your samples to produce a genome assembly and a table of the top BLAST hits for those samples.

All you need to run the script is your raw read files. To try it out, you can use the sample files provided.

The script assumes that human cytomegalovirus (HCMV, NCBI accession NC_006273.2) is the reference used to build the database, but you can change this within the script.
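As a rough illustration of how the reference could be swapped, the sketch below only builds the argument list for `bowtie2-build`, with the reference FASTA and index prefix as parameters. The function name and the file name `NC_006273.2.fasta` are assumptions made for this example; the wrapper script itself may build its index differently.

```python
from pathlib import Path


def bowtie2_build_command(reference_fasta, index_prefix="HCMV_index"):
    """Return the argument list that builds a Bowtie2 index from a reference FASTA.

    reference_fasta: path to the reference sequence (hypothetical file name).
    index_prefix: prefix of the .bt2 index files that Bowtie2 will write.
    """
    return ["bowtie2-build", str(Path(reference_fasta)), index_prefix]


# Default reference (HCMV) and an alternative one, to show what would change.
hcmv_cmd = bowtie2_build_command("NC_006273.2.fasta")
other_cmd = bowtie2_build_command("my_reference.fasta", index_prefix="my_index")
print(" ".join(hcmv_cmd))
print(" ".join(other_cmd))
```

To actually build the index, the list can be passed to `subprocess.run(cmd, check=True)` once Bowtie2 is installed.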

Dependencies

You will need to have the following dependencies installed:

BLAST: Visit the BLAST website to download and install BLAST for your operating system, following the installation instructions in the BLAST documentation.
Bowtie2: Visit the Bowtie2 website to download and install Bowtie2 for your operating system, following the installation instructions in the Bowtie2 documentation.
SPAdes: Visit the SPAdes GitHub repository to download and install SPAdes for your operating system, following the installation instructions in the SPAdes documentation.
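Before running the pipeline, it can help to confirm that the tools are on your PATH. The following is a minimal sketch; the list of executable names is an assumption about what the wrapper calls (these are the standard command names shipped with BLAST+, Bowtie2, and SPAdes), so adjust it to match your installation.

```python
import shutil

# Assumed executable names; the wrapper script may call a different subset.
REQUIRED_TOOLS = ["blastn", "makeblastdb", "bowtie2", "bowtie2-build", "spades.py"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the tools installed.
    """
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing dependencies:", ", ".join(missing))
else:
    print("All dependencies found.")
```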

Data

The sample data used to run this pipeline is a subset of the full sequence data, which makes it faster to process. You may already have your data ready, but for reference I show how I automated the download of my samples. I created a text file listing the NCBI accession numbers of my samples, then ran a Python script to download them and split the files into forward and reverse reads. Both files are provided for reference. File names: accessionList.txt and download_accessions.py
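The download step described above could look roughly like the sketch below. This is not the contents of download_accessions.py, which is not reproduced here: it assumes the accessions are SRA runs, that the list has one accession per line, and that the SRA Toolkit's `fasterq-dump` is used to split paired reads. The accessions in the example are placeholders, not the real samples.

```python
def read_accessions(text):
    """Parse accession IDs, one per line, skipping blank lines and # comments."""
    accessions = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions


def fasterq_dump_command(accession, outdir="."):
    """Return the fasterq-dump argument list that writes <accession>_1.fastq and _2.fastq."""
    return ["fasterq-dump", "--split-files", "--outdir", outdir, accession]


# Placeholder accession list (hypothetical IDs for illustration only).
example_list = "SRR0000001\n\n# second sample\nSRR0000002\n"
for acc in read_accessions(example_list):
    print(" ".join(fasterq_dump_command(acc)))
```

In a real run, the text would come from the accession list file and each command would be executed with `subprocess.run(cmd, check=True)`.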

How to Run the Script

You need only two things to run the script: your FASTQ files (forward and reverse reads, named with the _1.fastq and _2.fastq suffixes) and the python_wrapper.py script.
Run the wrapper with the command: python python_wrapper.py. Alternatively, you can clone the whole repository, which gives you access to all the files needed to run python_wrapper.py.
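If you want to check that your inputs follow the _1.fastq / _2.fastq naming convention before launching the wrapper, a small helper like the one below can list the complete pairs. It is a sketch with a hypothetical function name, and it does not reflect how python_wrapper.py discovers its inputs.

```python
from pathlib import Path


def find_read_pairs(filenames):
    """Map each sample name to its (forward, reverse) FASTQ file names.

    Only samples that have both a _1.fastq and a _2.fastq file are returned.
    """
    names = set(filenames)
    pairs = {}
    for name in sorted(names):
        if name.endswith("_1.fastq"):
            sample = name[: -len("_1.fastq")]
            mate = sample + "_2.fastq"
            if mate in names:
                pairs[sample] = (name, mate)
    return pairs


# Example with made-up file names; s2 is skipped because its reverse file is missing.
example = ["s1_1.fastq", "s1_2.fastq", "s2_1.fastq", "notes.txt"]
print(find_read_pairs(example))
```

To scan the current directory instead, pass `(p.name for p in Path(".").iterdir())` as the argument.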
The script should finish in about 2-3 minutes and produces several output files, described below:

a) SampleName_filtered..fastq: the filtered FASTQ files, which keep only the reads that map to the index (here, the HCMV index).
b) HCMV_index..bt2: the Bowtie2 index files. Bowtie2 splits the index across several files for efficient memory use and faster alignment, and all of them are used during alignment.
c) blast_hits.csv: the top 10 BLAST hits against the database created by the script.
d) temp.log: a temporary log file that is later combined with blast_hits.csv to produce the final log file.
e) PipelineProject.log: the most important output. It reports the number of reads in each sample before and after filtering, the number of contigs longer than 1000 bp in the assembly, the number of base pairs in the assembly, and a table of the top 10 BLAST hits.
f) PipelineProject_fullSample.log: the results of running python_wrapper.py on the original dataset (the full FASTQ files).
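The read and contig counts reported in PipelineProject.log can be reproduced by hand with a few lines of Python. The sketch below is one possible way to compute them, not the wrapper's own code. It assumes standard four-line FASTQ records, and it assumes that the base-pair total covers only the contigs longer than the threshold, which the description above does not specify.

```python
def count_fastq_reads(lines):
    """Count reads in FASTQ text, assuming four non-empty lines per record."""
    non_empty = sum(1 for line in lines if line.strip())
    return non_empty // 4


def contig_lengths(fasta_lines):
    """Return the length of every sequence in FASTA text, in file order."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize_assembly(fasta_lines, min_length=1000):
    """Return (number of contigs longer than min_length, total bp in those contigs)."""
    long_contigs = [n for n in contig_lengths(fasta_lines) if n > min_length]
    return len(long_contigs), sum(long_contigs)


# Tiny in-memory example: one 1500 bp contig and one 200 bp contig.
fasta = [">contig_1", "A" * 1000, "C" * 500, ">contig_2", "G" * 200]
print(summarize_assembly(fasta))
```

With real data, pass an open file handle instead of a list, for example `summarize_assembly(open("contigs.fasta"))`, where the file name is whatever your assembler wrote.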

vicmmer / genome-assembly
