cmu-safari / genstore

GenStore is the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. Described in the ASPLOS 2022 paper by Mansouri Ghiasi et al. at https://people.inf.ethz.ch/omutlu/pub/GenStore_asplos22-arxiv.pdf

License: MIT License

Makefile 0.85% C++ 1.02% Shell 0.34% C 63.46% Roff 2.74% JavaScript 12.01% Cython 1.75% Python 0.62% TeX 15.94% Perl 0.08% Gnuplot 0.31% Verilog 0.83% Tcl 0.04%
read-mapping in-storage-processing pre-alignment-filtering exact-matching long-reads ftl hardware-accelerator sequence-alignment ssd near-data-processing

GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis

What is GenStore?

GenStore is the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different properties such as read lengths and error rates, which highly depend on the sequencing technology, and 2) different degrees of genetic variation compared to the reference genome, which highly depends on the genomes that are being compared.

Watch our full talk video (slides) and lightning talk video (slides) about GenStore!

Citation

If you find this repo useful, please cite the following paper:

Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu, "GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis" Proceedings of the 27th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022

@inproceedings{mansouri2022genstore,
  title={GenStore: a high-performance in-storage processing system for genome sequence analysis},
  author={Mansouri Ghiasi, Nika and Park, Jisung and Mustafa, Harun and Kim, Jeremie and Olgun, Ataberk and Gollwitzer, Arvid and Senol Cali, Damla and Firtina, Can and Mao, Haiyu and Almadhoun Alserr, Nour and others},
  booktitle={Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems},
  year={2022}
}

Prerequisites

The infrastructure has been tested with the following system configuration:

  • g++ v11.1.0
  • Python v3.6

Prerequisites specific to each experiment are listed in their respective subsections.

Preparing Input Data

Real Genomic Read Sets

The read sets used in the paper can be obtained by searching for the read set accession IDs provided in the paper on the European Bioinformatics Institute FTP server.

Synthetic Read Sets

We use mason_simulator (part of the SeqAn package) to simulate short reads with varying degrees of genetic distance from the reference genome.

  1. cd input-generation
  2. Download all files specified in files_to_download.txt to this directory
  3. Create a directory called "index" and generate an index of the reference genome using the command
minimap2 -d index/hg38.mmi hg38.fa
  4. Run run_subsample_pipeline.sh

Baseline Software Exact Match Filter

We implement a baseline exact match filter using SIMD operations integrated in minimap2.

  1. For installation, run make
  2. General usage
minimap2 -d ref.mmi ref.fa                     # indexing
minimap2 -a ref.mmi reads.fq > alignment.sam   # alignment

For more information about minimap2, please refer to its original repo.

Code Walkthrough

  • We implement the exact match filter in exact2_match_sse.c
  • The filter is used in map.c by calling the function exact_match_sse
  • If a read is detected to be an exact match, the mapper skips the expensive alignment step performed in ksw_extz2_sse
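
The control flow described above can be illustrated with a minimal Python sketch. It is not the repository's code: the actual filter is written in C with SIMD (SSE) operations inside minimap2, whereas this sketch uses a plain set of reference substrings and ignores reverse-complement strands. The names build_exact_index, map_reads, and align are hypothetical.

```python
def build_exact_index(reference, read_length):
    """Collect every substring of `reference` with length `read_length`."""
    return {
        reference[i:i + read_length]
        for i in range(len(reference) - read_length + 1)
    }


def map_reads(reads, reference, read_length, align):
    """Filter exact matches; call `align` only for the remaining reads.

    `align` stands in for the expensive alignment step (hypothetical callback).
    Returns (exact_matches, alignment_results).
    """
    index = build_exact_index(reference, read_length)
    exact, aligned = [], []
    for read in reads:
        if len(read) == read_length and read in index:
            exact.append(read)            # exact match: alignment is skipped
        else:
            aligned.append(align(read))   # fall back to full alignment
    return exact, aligned


# Toy data, for illustration only.
reference = "ACGTACGTTAGC"
reads = ["ACGT", "GTTA", "AAAA"]
calls = []  # records which reads reached the alignment callback
exact, aligned = map_reads(reads, reference, 4, lambda r: calls.append(r) or r)
print(exact)  # reads found verbatim in the reference
print(calls)  # reads that still needed alignment
```

Only the reads that fail the cheap membership test reach the alignment callback, which is the saving the filter is designed to provide.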

Software GenStore

Software GenStore is an implementation of the GenStore filter without in-storage support.

Experiment Workflow

  1. Set the environment variables REF_FILE, READ_FILE, HASH_SIZE, and LOG2_NUM_THREADS. For example, to use the provided sample data, set the variables as follows:
REF_FILE=sample_data/NC_000913.3.head1000.fa
READ_FILE=sample_data/reads.fq
HASH_SIZE=48
LOG2_NUM_THREADS=2
  2. Compile the hash sorter and minimap2 by running make in genstore-sw-filter and genstore-sw-filter/minimap2/
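
Before running the later steps, it can help to confirm that all four variables are set. The following is a small sketch, not part of the repository; load_config is a hypothetical helper. It assumes that the tools use 2^LOG2_NUM_THREADS threads, which is consistent with the sample output in this README (LOG2_NUM_THREADS=2 and "num_threads: 4").

```python
import os

REQUIRED = ("REF_FILE", "READ_FILE", "HASH_SIZE", "LOG2_NUM_THREADS")


def load_config(env=None):
    """Read the workflow variables and derive the thread count."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        raise KeyError("unset variables: " + ", ".join(missing))
    return {
        "ref_file": env["REF_FILE"],
        "read_file": env["READ_FILE"],
        "hash_size": int(env["HASH_SIZE"]),
        # Assumption: the thread count is 2 ** LOG2_NUM_THREADS.
        "num_threads": 1 << int(env["LOG2_NUM_THREADS"]),
    }


# Values taken from the sample-data example in this README.
sample = {
    "REF_FILE": "sample_data/NC_000913.3.head1000.fa",
    "READ_FILE": "sample_data/reads.fq",
    "HASH_SIZE": "48",
    "LOG2_NUM_THREADS": "2",
}
config = load_config(sample)
print(config["hash_size"], config["num_threads"])
```

Calling load_config() with no argument reads the variables from the current shell environment instead of the sample dictionary.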

Parse the reference file

  1. Generate logs for the reference using the command
minimap2/minimap2 -w1 -k150 -d $REF_FILE.mmi $REF_FILE >$REF_FILE.log 2>/dev/null
  2. Generate a hash and position table for the reference by running
./gen_hash $REF_FILE.log > $REF_FILE.hashes
  3. Reduce the table to the target hash size using
./generate_index $HASH_SIZE $REF_FILE.hashes > $REF_FILE.$HASH_SIZE.hashes.bin
  4. Index the table using
./index_index $HASH_SIZE $REF_FILE.$HASH_SIZE.hashes.bin $LOG2_NUM_THREADS > $REF_FILE.$HASH_SIZE.hashes.bin.index
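
The four reference-parsing commands chain through intermediate files whose names are derived from REF_FILE and HASH_SIZE. The sketch below only assembles the command strings exactly as they appear above so that the naming scheme is easy to inspect; reference_commands is a hypothetical helper, and the sketch assumes paths without spaces or shell metacharacters.

```python
def reference_commands(ref_file, hash_size, log2_num_threads):
    """Return the four reference-parsing shell commands, in order."""
    hashes = "{}.hashes".format(ref_file)
    table = "{}.{}.hashes.bin".format(ref_file, hash_size)
    return [
        # 1. minimap2 log for the reference
        "minimap2/minimap2 -w1 -k150 -d {0}.mmi {0} >{0}.log 2>/dev/null".format(ref_file),
        # 2. hash and position table
        "./gen_hash {}.log > {}".format(ref_file, hashes),
        # 3. table reduced to the target hash size
        "./generate_index {} {} > {}".format(hash_size, hashes, table),
        # 4. index over the reduced table
        "./index_index {} {} {} > {}.index".format(
            hash_size, table, log2_num_threads, table),
    ]


cmds = reference_commands("sample_data/NC_000913.3.head1000.fa", 48, 2)
for cmd in cmds:
    print(cmd)
```

To execute the commands rather than print them, each string could be passed to subprocess.run(cmd, shell=True, check=True) from the genstore-sw-filter directory.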

Parse the read file

  1. Generate logs for the read file using the command below. This step and step 3 use READ_LENGTH, which is not among the variables set in Experiment Workflow, so set it to the length of the reads first.
minimap2/minimap2 -w1 -k$READ_LENGTH -d $READ_FILE.mmi $READ_FILE >$READ_FILE.log 2>/dev/null
  2. Generate a table for the reads by running
./generate_read_hashes.sh $READ_FILE.log > $READ_FILE.hashes
  3. Reduce the table to the target hash size using
./generate_reads $READ_LENGTH $HASH_SIZE $READ_FILE.hashes > $READ_FILE.$HASH_SIZE.hashes
  4. Index the table using
./index_reads $HASH_SIZE $READ_FILE.$HASH_SIZE.hashes $LOG2_NUM_THREADS > $READ_FILE.$HASH_SIZE.hashes.index

Run the exact match filter

  1. Run the filter using
./check_files_mt $HASH_SIZE $REF_FILE.$HASH_SIZE.hashes.bin $READ_FILE.$HASH_SIZE.hashes

For example, for the provided input set, the output should look like the following:

bit width: 48 num_threads: 4

69782 1001 725 0.724276

where 0.724276 (= 725/1001) is the fraction of reads that exactly match a subsequence of the reference genome.
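
As a sanity check, the reported ratio can be recomputed from the other fields of the result line. The sketch below assumes that the second and third fields are the total and exactly matching read counts; this is inferred from the arithmetic (725/1001 rounds to 0.724276) and is not documented here. The meaning of the first field is not documented, so it is left uninterpreted. parse_filter_output is a hypothetical helper.

```python
def parse_filter_output(line):
    """Split the result line into (total_reads, matched_reads, ratio).

    Assumption: fields 2 and 3 are the total and exactly matching read
    counts; the first field is undocumented and is ignored.
    """
    fields = line.split()
    return int(fields[1]), int(fields[2]), float(fields[3])


total, matched, ratio = parse_filter_output("69782 1001 725 0.724276")
recomputed = matched / total
print(round(recomputed, 6))  # 0.724276
```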

Hardware GenStore

We evaluate hardware configurations using two state-of-the-art simulators to analyze the performance of GenStore. We model DRAM timing with the DDR4 interface in Ramulator, a widely-used, cycle-accurate DRAM simulator. We model SSD performance using MQSim, a widely-used simulator for modern SSDs. We model the end-to-end throughput of GenStore based on the throughput of each GenStore pipeline stage: accessing NAND flash chips, accessing internal DRAM, accelerator computation, and transferring unfiltered data to the host.
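
One simple way to combine per-stage throughputs is a bottleneck model, in which a fully pipelined system runs at the rate of its slowest stage. The sketch below illustrates that idea only. It is an assumption and not the authors' model, and every number in it is a placeholder rather than a measured or published value. end_to_end_throughput and the stage names are hypothetical.

```python
def end_to_end_throughput(stage_throughputs):
    """Return (bottleneck_stage, throughput) under a bottleneck model.

    Assumption: stages are fully pipelined and all throughputs are given
    in the same unit, so the slowest stage bounds the whole pipeline.
    """
    if not stage_throughputs:
        raise ValueError("at least one stage is required")
    bottleneck = min(stage_throughputs, key=stage_throughputs.get)
    return bottleneck, stage_throughputs[bottleneck]


# Placeholder values in GB/s, for illustration only.
stages = {
    "nand_flash_read": 8.0,
    "internal_dram": 12.0,
    "accelerator": 10.0,
    "host_transfer": 16.0,
}
stage, throughput = end_to_end_throughput(stages)
print(stage, throughput)
```

In practice the per-stage values would come from MQSim, Ramulator, and the synthesized accelerator, each converted to a common unit before taking the minimum.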

HDL Implementation

We implement GenStore's accelerator units in Verilog to faithfully measure the throughput of the accelerators as well as their area and power cost. We use Design Compiler version N-2017.09. The implementation can be found in the genstore-hdl folder.

  1. In key-script-command.tcl, path_to_verilog_files is the path to the GenStore Verilog source files, <verilog_module>.v is the name of the file containing the Verilog module to synthesize, and <verilog_module_name> is the name of the module defined in that file
  2. Open the Synopsys command line
  3. Run key-script-command.tcl

We will soon release the scripts used for Ramulator to model DRAM timing and the scripts used for MQSim to model SSD timing.

End-to-end Throughput

We will soon release the script used for modelling the end-to-end throughput of GenStore based on the throughput of each GenStore pipeline stage.

Contact

Nika Mansouri Ghiasi - [email protected]
