martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.

Home Page: http://martinsos.github.io/edlib

License: MIT License

C++ 67.64% C 6.77% CMake 3.10% Shell 6.86% Makefile 1.49% Python 4.27% Dockerfile 0.50% Meson 1.66% Cython 7.72%
sequence-alignment edit-distance levehnstein-distance library c-plus-plus alignment-path python bioinformatics

edlib's Introduction

I am a computer scientist / software engineer / founder, currently focused on shaping the future of web app development with https://github.com/wasp-lang/wasp.

While I am a generalist and enjoy learning new languages, currently I am having most of the fun while coding in Haskell and JavaScript.

In my free time I enjoy mechanical keyboards, bodyweight exercise and Rocket League.

edlib's Issues

Speed up calculation of Peq

Although calculating Peq normally takes a negligible amount of time compared to the DP calculation, in some cases, such as when using NW for very similar proteins, it takes about half of the execution time! It would be interesting to speed it up in all cases, or at least in such cases.
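For readers unfamiliar with Peq: in Myers' bit-vector algorithm it is a table that holds, for each alphabet symbol, a bitmask of the query positions where that symbol occurs. Below is a minimal sketch of that idea, not edlib's actual code. The function name build_peq is made up for illustration, and a single arbitrary-precision Python integer stands in for the fixed-size blocks a C implementation would use.

```python
def build_peq(query, alphabet):
    """Return {symbol: bitmask}; bit i of the mask is set iff query[i] == symbol.

    Hypothetical helper: one pass over the query, one OR per character.
    """
    peq = {symbol: 0 for symbol in alphabet}
    for i, symbol in enumerate(query):
        peq[symbol] |= 1 << i
    return peq


peq = build_peq("ACCA", "ACGT")
for symbol in "ACGT":
    # Print each mask as a 4-bit binary string (bit 0 is the rightmost digit).
    print(symbol, format(peq[symbol], "04b"))
```

Filling the table by walking the query once costs time proportional to the query length, whereas testing every symbol against every query position costs alphabet size times query length, which matters more for protein alphabets than for DNA.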

Segfault because of unsanitized function parameters

Just to mention, this happens in an older version (I am not sure if the bug is present in the most recent one).
This is GDB's output from running my program:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ff2f5399700 (LWP 9563)]
0x0000000000410f8d in obtainAlignment (maxNumBlocks=3, queryLength=178, targetLength=-13, W=14, bestScore=-1,
position=-1, alignData=0x0, alignment=0x7ff2f5397f60, alignmentLength=0x7ff2f5397f5c)
at src/alignment/myers.cpp:765

At the beginning of the obtainAlignment function, there is an allocation that does not check the values of the input parameters (note targetLength=-13 and alignData=0x0 above).
Also, it would be good to check after every malloc/new whether the pointer is NULL (but please continue handling things in a C-like manner, throwing exceptions is just awful :-) ).

Return starting position of alignment (but no alignment path)

@isovic suggested this as a useful feature, so I should consider adding it!
For NW and SHW the starting position is 0, so that is easy. However, for HW we do not know the starting position. The best way to get it is to run SHW backwards from the ending point of HW, which is what I do now anyway when finding the alignment path.
I should expose some flag for this, and then modify the code for finding the alignment to stop when the starting position is found.
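The backwards-SHW idea can be sketched with a plain (non-bit-parallel) DP. This is only an illustration under assumptions: the names last_row and hw_locations are made up, the DP is the textbook O(nm) recurrence rather than Myers' algorithm, and it assumes the best alignment consumes at least one target character.

```python
def last_row(query, target, free_start):
    """Return D[len(query)][j] for j = 0..len(target) of the edit-distance DP.

    free_start=True gives HW (alignment may start anywhere in target);
    free_start=False gives SHW/NW-style borders (start of target is penalized).
    """
    prev = list(range(len(query) + 1))          # column j = 0
    row = [prev[-1]]
    for j in range(1, len(target) + 1):
        cur = [0 if free_start else j]          # top border D[0][j]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == target[j - 1] else 1
            cur.append(min(prev[i - 1] + cost, prev[i] + 1, cur[i - 1] + 1))
        row.append(cur[-1])
        prev = cur
    return row


def hw_locations(query, target):
    """Return (score, start, end) of the first best HW hit, 0-based inclusive."""
    row = last_row(query, target, free_start=True)
    best = min(row)
    end = row.index(best) - 1
    # Run SHW on the reversed query against the reversed target prefix.
    rev = last_row(query[::-1], target[:end + 1][::-1], free_start=False)
    consumed = rev.index(best)                  # target characters used
    return best, end - consumed + 1, end


print(hw_locations("ACT", "GGACTGG"))
```

Only the prefix of the target up to the end position is revisited, so the second pass is cheap when the query is short relative to the target.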

Combine Myers with Landau Vishkin

Since Landau Vishkin is so fast when query and target are very similar, maybe it would be a good idea to first run Landau Vishkin for a few small values of k, and then switch to Myers if it does not produce a result. The reason is that Myers is actually not very efficient for small k, because it will calculate the whole block anyway, even when the band is very narrow.

Memory leak

There is a memory leak (near line 160) in myersCalcEditDistance: I do not free positionsSHW!

Check for memory leaks

Check if there are memory leaks, especially in aligner. I am pretty sure aligner leaks memory when it exits early, because it skips some housekeeping in that case.

Different comparison with SSW

Right now I am comparing Myers with SSW while using dynamic adjustment of k in Myers. Is that OK? Can k be specified in SSW? If so, then I should try comparing with fixed k. If not, I could maybe still try comparing with fixed k for Myers only.

Remove bool from interface

I introduced bool arguments to edlibCalcEditDistance by mistake. I should replace them with integers, so that it can easily be used from C.

Do profiling

Do some profiling to find out which parts of the code consume the most CPU time, and try to make those parts faster.

Compare with myers code from 1998

I already ran some tests, and it seems my speed is similar to Myers' code from 1998. His is somewhat faster (15%) for small k, but for larger k mine is faster.
I should test my code intensively against Myers (1998) and the other implementations that come with it.

Edlib could be faster if starting with bigger k for dissimilar sequences in HW mode

Tests showed that for HW, when the score is big enough, it may be more beneficial to start with a larger k! For example, when similarity (1 - score / read_length) is < 60%, we can get better results by just running edlib with k = read_length than by using k = -1.

So how can we use this to speed up edlib? If we had some way of very quickly and roughly estimating the similarity of two sequences up front, we could make a decision: "they seem to be pretty dissimilar, so let's use k=read_length instead of k=-1".
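The decision rule itself is simple once an estimate exists. The sketch below covers only that rule; how to estimate similarity quickly is exactly the open question and is left out. The function names and the threshold parameter are hypothetical, with the default taken from the 60% figure above.

```python
def similarity(score, read_length):
    """Similarity as defined above: 1 - score / read_length."""
    return 1.0 - score / read_length


def choose_initial_k(estimated_similarity, read_length, threshold=0.6):
    """Pick the starting k for HW mode from a rough similarity estimate.

    Returns read_length for dissimilar pairs, otherwise -1 (let edlib
    adjust k dynamically). The estimate must come from elsewhere.
    """
    return read_length if estimated_similarity < threshold else -1


print(choose_initial_k(similarity(50, 100), 100))   # dissimilar pair
print(choose_initial_k(similarity(10, 100), 100))   # similar pair
```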

Document API

Looking at the .h file is not user friendly, so I should create some nice documentation for the API. It is probably best to use Doxygen for it.

Enable stating that sequence should be consumed in reverse

Enable the user to state that query and/or target should be read in reverse. This is good because the user does not have to waste time and resources on reversing the sequence. What I do not like about this is that I am not sure how useful it is if the calculation takes much more time than the reversal, which I believe to be the case.

Add progress callback

Sometimes the calculation takes a lot of time and we may want to be informed about its progress/status.
To accomplish that, I can add a progress callback that is called every few columns to report the progress so far.
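A minimal sketch of what such a callback could look like, using a plain column-by-column edit distance DP instead of edlib's block calculation. The function name, the on_progress signature (columns done, total columns) and the report_every parameter are all assumptions made for illustration.

```python
def edit_distance_with_progress(query, target, on_progress=None, report_every=1000):
    """Global (NW) edit distance, reporting progress every report_every columns."""
    prev = list(range(len(query) + 1))              # column j = 0
    for j in range(1, len(target) + 1):
        cur = [j]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == target[j - 1] else 1
            cur.append(min(prev[i - 1] + cost, prev[i] + 1, cur[i - 1] + 1))
        prev = cur
        if on_progress is not None and j % report_every == 0:
            on_progress(j, len(target))             # columns done, total columns
    return prev[-1]


distance = edit_distance_with_progress(
    "kitten", "sitting",
    on_progress=lambda done, total: print(f"{done}/{total} columns"),
    report_every=3,
)
print(distance)
```

Reporting per column count rather than per time interval keeps the check down to one modulo operation per column, so it should not be noticeable next to the column calculation itself.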

Validate function input

I should validate that function input satisfies some basic conditions, for example that array lengths are positive and so on.
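A sketch of the kind of checks meant here, returning a status code instead of raising. The constant names, the function name and the exact set of conditions are assumptions. In the C interface the lengths are separate integer arguments that can be negative, so the length check there would be on those arguments.

```python
STATUS_OK = 0
STATUS_ERROR = 1


def validate_input(query, target, alphabet_length, k):
    """Return STATUS_OK if the input satisfies basic conditions, else STATUS_ERROR.

    Sequences are lists of integers that must lie in 0..alphabet_length-1;
    k is the score limit, where -1 means "no limit".
    """
    if alphabet_length <= 0 or k < -1:
        return STATUS_ERROR
    for sequence in (query, target):
        if len(sequence) == 0:
            return STATUS_ERROR
        if any(not 0 <= symbol < alphabet_length for symbol in sequence):
            return STATUS_ERROR
    return STATUS_OK


print(validate_input([0, 1], [1, 0, 1], alphabet_length=2, k=-1))
print(validate_input([0, 2], [1], alphabet_length=2, k=-1))
```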

Orient blocks parallel with target (horizontally)

Currently blocks are oriented vertically, parallel with the query. Since the target is placed on the top border of the table and is usually much bigger than the query, we may benefit if blocks were parallel with the target! Example: if the query is of length 64 (same as the block size) and the target is much bigger, then we can expect a narrow band. If blocks are parallel with the query, we will have to calculate each cell of the matrix, while if blocks are parallel with the target we will calculate only a small part of the matrix!
I think this is important only when the band is small and the target is much bigger than the query. Otherwise it will not bring much speedup.
How hard is it to implement? I have two ideas:

  1. Transform the problem. If I can do this, that would be great, not much work.
  2. Use horizontal blocks instead of vertical. This would be a major refactoring of the code, and it sounds hard and complicated.

Update comparison with Landau Vishkin

I have new comparison results, much more meaningful, so I should replace the old ones with the new ones!
I should also update the other comparison results (for SSW).

Finding alignment

I could add finding of the alignment, using an idea similar to the one in SSW. For NW everything would have to be calculated again, so that does not really make sense, but for HW and SHW only a part of the matrix would have to be calculated, so it could be used there. Especially for HW, actually. So the idea is to start calculating backwards from the position where the alignment ends, and then obviously only a part of the matrix is calculated.

Optimize adjusting of last block in NW

Right now, when adjusting the band (last block) for NW, I say that the value of the first cell in the block is greater than or equal to Score - W + 1. However, I could do even better: I could say it is greater than or equal to max(Score' - 1, Score - W + 1)! Should I do this, and will it speed things up? I think I should try it.

Obtain alignment path for certain best position

@isovic suggested adding the following feature: the user can choose for which of the best positions to get the alignment path (now the alignment path is returned only for the first position).
I am not sure what the best way to do this is. Should I enable this before searching, or after finding the positions? How can I enable it before, when I do not know how many positions will be returned? Should I just find alignment paths for all positions? I should talk with @isovic more to see what the exact use case is.

Usage of parameter k in aligner

Currently k is always set to -1. Allow aligner to use k when we only want the query with the best score, or the queries with the N best scores. Also allow the user to set k manually.

Name of aligner is too generic

The name of the executable aligner is quite generic and could possibly conflict with a similarly named executable from another package. Consider renaming the executable to edlib-aligner.

Unify all output parameters into one parameter

Currently there are many output parameters in the main function of edlib, like startLocations, endLocations, alignment, and so on. To make the function easier to use, we should pass only one object, which will then be filled with the output. This deprecates #34.
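As an illustration of the single-object idea, here is a sketch using a Python dataclass in place of the C struct that the library itself would need. The class name AlignResult and its field names are hypothetical; only startLocations, endLocations and alignment are named in the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AlignResult:
    """Hypothetical container bundling everything the main function outputs."""
    edit_distance: int = -1                       # -1: no alignment within k
    start_locations: List[int] = field(default_factory=list)
    end_locations: List[int] = field(default_factory=list)
    alignment: Optional[List[int]] = None         # None: path not requested


# A caller receives one object instead of several output parameters.
result = AlignResult(edit_distance=0, start_locations=[2], end_locations=[4])
print(result)
```

New outputs can later be added as extra fields without changing the function signature.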

Improve block calculation

I removed the ifs from the calculation of a block, which is the core of the whole algorithm. That gave a speedup of about 30% - 40%. What I should do next: investigate whether the operations in the block calculation can be simplified further! That would bring more speedup.

Improve cigar format

It is not customary to start an alignment path with insertions in CIGAR: I should replace them with the soft clipping CIGAR operation (S), which can come at the start or end of a read.
Also, I should add support for the standard CIGAR format.
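A sketch of both points, assuming the alignment path is a list of per-position operation codes. The numeric codes, the function name to_cigar and its parameters are made up here and need not match edlib's. "Standard" uses M for both matches and mismatches, while "extended" distinguishes them with = and X.

```python
from itertools import groupby

# Hypothetical operation codes for one step of the alignment path.
MATCH, INSERT, DELETE, MISMATCH = range(4)


def to_cigar(path, extended=False, soft_clip=True):
    """Run-length encode an alignment path into a CIGAR string."""
    symbols = {
        MATCH: "=" if extended else "M",
        MISMATCH: "X" if extended else "M",
        INSERT: "I",
        DELETE: "D",
    }
    letters = [symbols[op] for op in path]
    runs = [[letter, len(list(group))] for letter, group in groupby(letters)]
    if soft_clip and runs:
        # Insertions at either end of the read become soft clipping.
        for idx in (0, -1):
            if runs[idx][0] == "I":
                runs[idx][0] = "S"
    return "".join(f"{count}{letter}" for letter, count in runs)


path = [INSERT, INSERT, MATCH, MISMATCH, MATCH, DELETE, MATCH, INSERT]
print(to_cigar(path))
print(to_cigar(path, extended=True))
```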

Test speed of alignment

Compare the speed of search with alignment and without alignment. Do it for all modes and for different query lengths. The longer the query, the slower search with alignment should be. If the query is very big, I expect search with alignment to be much slower.

aligner: Consider adding characters to show the locations of matches/mismatches

Commonly spaces are used for mismatches and pipes | for matches. Also consider changing underscore _ to hyphen -, as the latter is more typical.

For example:

T: AGATATGCTGCCGC---GGACAGCGTTATCTCTAACTAACAGTCACTATC (0 - 46)
   |||| |||||||||   ||||||||||| |||||| |||||||| |||| 
Q: AGATGTGCTGCCGCCTTGGACAGCGTTACCTCTAA-TAACAGTCCCTATG (0 - 48)
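The middle line in an example like this can be derived from the two gapped rows. A minimal sketch follows; the function name match_line is made up, and both rows are assumed to be already padded with the gap character to equal length.

```python
def match_line(aligned_target, aligned_query, gap="-"):
    """Return '|' where both rows hold the same non-gap character, else ' '."""
    return "".join(
        "|" if t == q and t != gap else " "
        for t, q in zip(aligned_target, aligned_query)
    )


target_row = "AGATATGCTGCCGC---GGACAG"
query_row = "AGATGTGCTGCCGCCTTGGACAG"
print("T: " + target_row)
print("   " + match_line(target_row, query_row))
print("Q: " + query_row)
```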

Make it easier to prepare sequences for usage

Currently sequences have to be transformed into numbers that go from 0 to N-1, where N is the size of the alphabet. This is a boring thing to implement and nobody should have to go through this trouble.

I should either implement a helper function that transforms a char sequence into an unsigned char sequence of numbers from 0 to N-1, or I should put that transformation inside the main function so nobody even knows about it.

What I could also do is create a simpler function that detects the size of the alphabet and also does this transformation.

Finally, if my function keeps accepting numbers as it has so far, I should check that they are in the range 0 to N-1 and report an error if that is not the case!
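A sketch of the simpler variant that detects the alphabet and does the transformation in one go. The function name encode_sequences and its return shape are assumptions. Numbers are assigned in sorted character order so the result is deterministic.

```python
def encode_sequences(query, target):
    """Map characters of both sequences to numbers 0..N-1.

    Returns (encoded_query, encoded_target, N), where N is the number of
    distinct characters occurring in query and target together.
    """
    alphabet = sorted(set(query) | set(target))
    index = {char: number for number, char in enumerate(alphabet)}
    encoded_query = [index[char] for char in query]
    encoded_target = [index[char] for char in target]
    return encoded_query, encoded_target, len(alphabet)


print(encode_sequences("ACCA", "GATT"))
```

Both sequences have to be encoded together, since equal characters must map to equal numbers across query and target.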

Runtime error

edlib.cpp:1062: int obtainAlignment(const unsigned char*, const unsigned char*, int, const unsigned char*, const unsigned char*, int, int, int, unsigned char*, int): Assertion `score_ == bestScore' failed.
