Giter Club home page Giter Club logo

cmu-safari / molecules2variations Goto Github PK

View Code? Open in Web Editor NEW
8.0 5.0 0.0 4.42 MB

The first work to provide a comprehensive survey of a prominent set of algorithmic improvement and hardware acceleration efforts for the entire genome analysis pipeline used for the three most prominent sequencing data, short reads (Illumina), ultra-long reads (ONT), and accurate long reads (HiFi). Described in arXiv (2022) by Alser et al. https://arxiv.org/abs/2205.07957

basecalling bioinformatics fast5 fasta fastq genome-analysis hifi illumina minimap2 needleman-wunsch

molecules2variations's Introduction

From Molecules 🧬πŸ§ͺ to Genomic Variations πŸ©ΊπŸ’Š: Accelerating Genome Analysis via Intelligent Algorithms and Architectures

We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. Our paper is the first to provide a comprehensive survey of a prominent set of algorithmic improvement and hardware acceleration efforts for the entire genome analysis pipeline.

We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.

alt text

Table of Contents

The paper covers the following topics in detail with a focus on intelligent algorithms, intelligent hardware accelerators, and intelligent hardware/software co-design.

Computer algorithms and hardware architectures are called intelligent if they are able to efficiently satisfy three principles, data-centric, data-driven, and architecture/algorithm/data-aware.

  • First, we would like to process genomic data efficiently by minimizing data movement and maximizing the efficiency with which data is handled, i.e., stored, accessed, and processed.
  • Second, we would like to take advantage of the vast amounts of genomic data and metadata to continuously improve decision making (self-optimizing decisions) for many different use cases in science, medicine, and technology.
  • Third, we would like to orchestrate the multiple components across the entire analysis system and adapt algorithms by understanding the structure of the underlying hardware, understanding analysis algorithms, and understanding various properties (i.e., the structure of the genome, type of sequencing data, quality of sequencing data) of each piece of data.
  1. Introduction
  2. Obtaining Genomic Sequencing Data
    1. Generating Sequencing Data
    2. Downloading Real Sequencing Data
    3. Simulating Sequencing Data
  3. Types of Genomic Sequencing Data
    1. Short Reads
    2. Ultra-long Reads
    3. Accurate Long Reads
    4. Discussion on Types of Sequencing Reads
  4. Genome Analysis Using Different Types of Sequencing Reads
  5. Basecalling
    1. Illumina
    2. ONT
    3. PacBio
  6. Quality Control
  7. Read Mapping
    1. Accelerating Indexing and Seeding
      1. Sampling Seeds
      2. Improving Data Structures for Seed Lookups
      3. Reducing Data Movement During Indexing
    2. Accelerating Pre-Alignment Filtering
      1. Pigeonhole Principle
      2. Base Counting
      3. q-gram Filtering Approach
      4. Sparse Dynamic Programming
    3. Accelerating Sequence Alignment
      1. Accurate Alignment Accelerators
      2. Alignment Accelerators with Limited Functionality
  8. Variant Calling
  9. Discussion and Future Opportunities

Citing this paper

If you use this paper in your work, please cite:

Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu. "From Molecules to Genomic Variations: Accelerating Genome Analysis via Intelligent Algorithms and Architectures" arXiv preprint arXiv (2022). https://arxiv.org/abs/2205.07957

Below is bibtex format for citation.

@misc{https://doi.org/10.48550/arxiv.2205.07957,
  doi = {10.48550/ARXIV.2205.07957},
  url = {https://arxiv.org/abs/2205.07957},
  author = {Alser, Mohammed and Lindegger, Joel and Firtina, Can and Almadhoun, Nour and Mao, Haiyu and Singh, Gagandeep and Gomez-Luna, Juan and Mutlu, Onur},
  title = {Going From Molecules to Genomic Variations to Scientific Discovery: Intelligent Algorithms and Architectures for Intelligent Genome Analysis},
  publisher = {arXiv},
  year = {2022}
}

molecules2variations's People

Contributors

mealser avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.