Giter Club home page Giter Club logo

daniel-liu-c0deb0t / block-aligner Goto Github PK

View Code? Open in Web Editor NEW
118.0 7.0 6.0 2.39 MB

SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.

Home Page: https://crates.io/crates/block_aligner

License: MIT License

Rust 46.28% Python 0.75% Shell 0.87% Makefile 0.05% C 3.04% Jupyter Notebook 49.02%
rust alignment simd avx2 algorithms wasm bioinformatics webassembly neon

block-aligner's Introduction

Greetings, curious internet dweller! ๐Ÿ‘‹

This is the home of most of my code. Take a look around--something may catch your eye!

If you are extraordinarily bold and curious, check out this cool project that has been piquing my interest recently.

block-aligner's People

Contributors

daniel-liu-c0deb0t avatar milot-mirdita avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

block-aligner's Issues

`rand_mutate` cannot generate consecutive insertions

It seems that rand_mutate is slightly limited in the kinds of mutations it can generate, since it decides up front whether to generate a match/substitution/insertion/deletion for each position. This way, it can never insert two or more consecutive characters.

See this line: For each generated insertion, it directly pushes the relevant character from a after pushing the inserted character.

Something like an insertion followed by a substitution also can't be generated in this model.

Probably this doesn't matter too much in practice, but it's a slight deviation from the common model of generating the mutations one by one, as done in e.g. wfa.

Realignment

It should be pretty easy to add a realignment feature based on a known traceback path (can be approximate). The block will just shift to follow that path.

fuzzy text finding

I have a strange use case that I was hoping to use your project for. Specifically, I would like to use it for text matching rather than DNA sequencing. I understand that this may not be a typical use case for your project, but I was wondering if you could provide a simple example for how I could use it for this purpose.

Below is an example of a similar library in python for my use case.

import sys
import parasail

# extension_penalty must be <= open_penalty
open_penalty = 1
extension_penalty = 1


def calculate_score(ayah, search_term, profile=None):
    if profile is not None:
        result = parasail.sw_scan_profile_16(profile, ayah, open_penalty, extension_penalty)
    else:
        result = parasail.sw_scan_16(ayah, search_term, open_penalty, extension_penalty, parasail.blosum62)
    return result


query = 'hello world'
search_in = 'large-file.txt'

current_profile = parasail.profile_create_16(query, parasail.blosum62)

results = []
for line in open(search_in):
    line = line[:-1]
    result = calculate_score(line, query, current_profile)
    results.append((line, result.score, result.__dict__))

sorted_results = sorted(results, key=lambda item: item[1], reverse=True)

for i in range(0, 5):
    print(sorted_results[i])

Thank you!

question on block aligner score and lowest identity allowed to be accurate

Hello Daniel,

I have a question about the global alignment score of block aligner, is it metric (https://en.wikipedia.org/wiki/Metric_space)? Especially whether score (a, b) = score (b, a) for global alignment.

Also, what is the lowest identity block aligner can handle, I saw the protein example you have down to ~50% identity, how about even lower and for DNA since it is not full dynamic programming. Wavefront is fast only when 2 sequences are highly identical like, 95% identity or above. I am wondering whether this is also the case for block aligner.

Another question is for semi-global alignment, both vsearch and usearch use a much smaller penalty for gaps at both ends of the 2 sequences compare to gaps in the sequences to approximate "semi-global" because in practice semi-global does not penalize gaps at both ends.

Thanks,

Jianshu

Option to reuse memory when block aligner is reallocated

Profiling shows that this could save maybe 20% of time when block aligner is repeatedly created and destroyed. One way to reuse memory could be to allow an old block aligner instance to be moved into a new one, and the new one can repurpose the allocated memory regions of the old one.

Neon and related

Hello Daniel,

Can you please show some example on how to support Neon on MacOS M1 and M1 Pro/Max with this? Another question is related: the simdeez library only supports avx2 and sse2. How to allow support for neon, which library?

THanks,

Jianshu

Investigate performance regressions

  1. For some reason, the prefix scan benchmarks are slower than before. No major changes were made to it.
  2. Traceback is a lot slower after changes to make it more correct when tracing back gap open/extend.

Is there a reason gap_open has to be > gap_extend and not >= gap_extend?

If I try to use

let gap_penalties = Gaps { open: -1, extend: -1 };

I get an assertion error. I understand that the way scoring works that gap-open needs to include the gap extension, since a gap of length one is scored as just the open penalty. However, I would like to be able to do alignments where e.g. a 1bp gap costs -1 and a 2bp gap costs -2, etc. which is not possible with the current check.

Is there any reason to require that open > extend instead of open >= extend?

Question: Syntax for modifying the nucleotide scoring matrix

Hi Daniel,

I confess I haven't quite figured out how to customize scoring for each nucleotide character, but I think I'm getting close. As you'll recall from my issue on triple_accel, I'm looking to compute nucleotide distances for pairs of sequences that ignore characters other than A, T, G, and C. Do I have it right that you can control that in block-aligner with the Matrix trait's set method? For example:

let mut scoring_matrix = NucMatrix::new_simple(1, -1);
scoring_matrix.set(b'N', b'A', 0);
scoring_matrix.set(b'N', b'T', 0);
scoring_matrix.set(b'N', b'G', 0);
scoring_matrix.set(b'N', b'C', 0);

If so, could you provide some guidance on how to use an updated NucMatrix in this portion of your readme example?

// Note that PaddedBytes, Block, and Cigar can be initialized with sequence length
// and block size upper bounds and be reused later for shorter sequences, to avoid
// repeated allocations.
let r = PaddedBytes::from_bytes::<NucMatrix>(b"TTAAAAAAATTTTTTTTTTTT", max_block_size);
let q = PaddedBytes::from_bytes::<NucMatrix>(b"TTTTTTTTAAAAAAATTTTTTTTT", max_block_size);

Thanks for the advice and for the excellent crate!
--Nick

Support Alignment of DNA sequences with C API

I'd like to ask if it would be possible to support the alignment of DNA sequences using the C API? I would like to integrate block-aligner into an existing C/C++ tool but cannot use the exisiting C API to do alignments of DNA sequences.

Thoughts on relation to / comparison with WFA

The WFA (and related BiWFA) algorithm have provided some really nice asymptotic and practical advances for pairwise alignment under a set of realistic scoring functions. Do you have any thoughts on the relation of the block alignment idea with WFA? Could ideas from these two approaches be combined fruitfully? How would we expect their performance to compare?

Is using -0.5 as extend penalty supported?

Hi!

I am newer to Rust so I may be misunderstanding something, but I was wondering if using an extend penalty of -0.5 was supported. I noticed that the Gaps struct takes i8 so I am guessing this is not possible?

Also, just curious, what led you to choose the scoring setup for proteins that you did in the paper?
"For scoring, we use BLOSUM62 with (Gopen = -11, Gext = -1) for proteins"

Thanks for this library!

Support alignment of protein sequences containing "*"

I'm using the C API of block-aligner to align protein sequences from UniProt database. There are *s in some protein sequences. Currently using block-aligner to align sequences containing * will cause a Segmentation Fault. Although the users can resolve it by mapping * to other supported chars, it would be nice if we can support * internally! :)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.