genome-assembly's Introduction

Pipeline Overview

This repository lets you process your sequencing samples to produce a genome assembly and a table of the top BLAST hits for that assembly.

All you need to run the script is your raw read files. If you do not have your own, you can use the sample files provided.

The script assumes that you will use HCMV (NCBI accession NC_006273.2) as the reference to build a database. You can change the reference within the script.
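As a rough illustration of where the reference comes in, the sketch below builds the command for creating a Bowtie2 index from a reference FASTA file. It assumes the database is a Bowtie2 index named HCMV_index, as the output file names later in this README suggest. The function names and the FASTA file name are hypothetical and are not taken from python_wrapper.py.

```python
import subprocess


def build_index_command(reference_fasta, index_name="HCMV_index"):
    """Return the bowtie2-build call as an argument list.

    bowtie2-build takes the reference FASTA first and the index
    basename second; the index files are written as <index_name>.*.bt2.
    """
    return ["bowtie2-build", reference_fasta, index_name]


def build_index(reference_fasta, index_name="HCMV_index"):
    """Run bowtie2-build and raise an error if it fails."""
    subprocess.run(build_index_command(reference_fasta, index_name), check=True)


# Hypothetical usage; "NC_006273.2.fasta" is an assumed local file name:
# build_index("NC_006273.2.fasta")
```

To switch to a different reference in a setup like this, you would change the FASTA file and the index name in one place.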

Dependencies

You will need to have the following dependencies installed:

  1. BLAST: Visit the BLAST website to download and install BLAST for your operating system. Follow the installation instructions in the BLAST documentation.

  2. Bowtie2: Visit the Bowtie2 website to download and install Bowtie2 for your operating system. Follow the installation instructions in the Bowtie2 documentation.

  3. SPAdes: Visit the SPAdes GitHub repository to download and install SPAdes for your operating system. Follow the installation instructions in the SPAdes documentation.
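Before running the pipeline, it can help to confirm that the tools are on your PATH. The following is a minimal sketch of such a check. The executable names are assumptions based on standard installs of BLAST+, Bowtie2 and SPAdes, and the script in this repository may call a different set.

```python
import shutil

# Assumed executable names from standard installs of each dependency.
REQUIRED_TOOLS = ["blastn", "makeblastdb", "bowtie2", "bowtie2-build", "spades.py"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```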

Data

The sample data used to run this pipeline is a subset of the full sequencing data, so that it can be processed more quickly. You may already have your data ready, but to make things easier I show here how I automated the download of my samples. I created a text file listing the NCBI accession numbers of my samples, then ran the Python script download_accessions.py to download them and split them into forward and reverse reads. Both files are provided for reference: accessionList.txt and download_accessions.py.
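The download step described above could look roughly like the sketch below. It is not the contents of download_accessions.py. It assumes the accession list has one accession per line and that the SRA Toolkit's fasterq-dump is installed. With --split-files, fasterq-dump writes paired reads as <accession>_1.fastq and <accession>_2.fastq. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def read_accessions(path):
    """Return the non-empty lines of an accession list, stripped of whitespace."""
    text = Path(path).read_text()
    return [line.strip() for line in text.splitlines() if line.strip()]


def fasterq_dump_command(accession, out_dir="."):
    """Build a fasterq-dump call that splits reads into _1 and _2 files."""
    return ["fasterq-dump", accession, "--split-files", "--outdir", out_dir]


def download_all(accession_file, out_dir="."):
    """Download every accession in the list, stopping at the first failure."""
    for accession in read_accessions(accession_file):
        subprocess.run(fasterq_dump_command(accession, out_dir), check=True)


# Hypothetical usage:
# download_all("accessionList.txt")
```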

How to Run the Script

  1. You need only two things to run the script: your fastq files (forward and reverse reads, in _1.fastq and _2.fastq format) and the python_wrapper.py script.
  2. Run the python_wrapper script with the command: python python_wrapper.py. Alternatively, you can clone the whole repository, which gives you access to all the files needed to run the python_wrapper.py script.
  3. The script should take about 2-3 minutes to run and produces several output files, described below:
  • a) SampleName_filtered.*.fastq: the filtered fastq files, which keep only the reads that map to the index, in this case the HCMV index.
  • b) HCMV_index.*.bt2: Bowtie2 produces multiple index files because it divides the index into parts for efficient memory usage and faster alignment. Each of these index files serves a different purpose and is used in the alignment process.
  • c) blast_hits.csv: contains the top 10 BLAST hits against the NCBI database that the script creates.
  • d) temp.log: a temporary log file that is later combined with blast_hits.csv to produce the final log file.
  • e) PipelineProject.log: the most important file. It reports the number of reads in each sample before and after filtering, the number of contigs in the assembly that are larger than 1000 bp, the number of base pairs in the assembly, and a table with information for the top 10 BLAST hits.
  • f) PipelineProject_fullSample.log: contains the results of running the python_wrapper.py script on the original dataset (full fastq sequences).
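To make the numbers in PipelineProject.log more concrete, here is a minimal sketch of how read counts and contig statistics of this kind can be computed. It is an illustration, not the code in python_wrapper.py. It assumes uncompressed FASTQ files with four lines per read, and it assumes that the base-pair total covers only contigs longer than 1000 bp, which the description above does not specify. All function names are hypothetical.

```python
def count_fastq_reads(path):
    """Count reads in an uncompressed FASTQ file (four lines per read)."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


def contig_lengths(fasta_path):
    """Return the length of each sequence in a FASTA file, in file order."""
    lengths = []
    current = None  # length of the contig being read, None before the first header
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_summary(fasta_path, min_length=1000):
    """Return (number of contigs longer than min_length, total bp in those contigs)."""
    long_contigs = [n for n in contig_lengths(fasta_path) if n > min_length]
    return len(long_contigs), sum(long_contigs)
```

Running count_fastq_reads on a sample's original and filtered fastq files gives the before and after counts for that sample.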

genome-assembly's People

Contributors

vicmmer
