Giter Club home page Giter Club logo

Comments (6)

eseraygun avatar eseraygun commented on June 27, 2024

Sorry for the lack of documentation. I should improve it sometime.

Global sequence alignment uses Needleman-Wunsch algorithm. Local sequence alignment uses Smith-Waterman algorithm.

About your blockers:

  • It does give you the original elements if you decode the aligned sequences. Check out the final loop in the example. As for the indices, you're right; it doesn't give you the original indices. This can definitely be improved.

  • I was a little bit obsessed about generating all alignments with the max score when I wrote the backtracing algorithm. Hence the recursion. It should be pretty easy to write an iterative, constant space backtracer that stops when it founds one solution.

So there seems to be three issues that needs fixing:

  1. Document the code
  2. Store original indices in the alignments or at least provide a way to infer them.
  3. Implement a greedy, constant space backtracing.

I hope I can find time to address these but please feel free to contribute if you have the motivation.

from python-alignment.

de-code avatar de-code commented on June 27, 2024

Thank you for getting back to me and clarifying it. What algorithm is StrictGlobalSequenceAligner meant to be?

Regarding getting the original elements back. I tried that. It didn't work. I realised it is because of the Vocabulary as it is converting it to a number. So I can get back some object that was put into the sequence but not the one from the particular sequence. (e.g. my object might contain the index, or a reference where the element originally came from).

Just some further question before I embark on a PR:

How would you feel about making the vocabulary optional? I can see it being a good optimisation when matching but doesn't seem to be required for the algorithm.

How would you feel about making breaking changes?
e.g. if I was to address it, then I would probably change SequenceAlignment how it stores sequences and the backtracing (the latter is probably more internal).

from python-alignment.

eseraygun avatar eseraygun commented on June 27, 2024

Both GlobalSequenceAligner and StrictGlobalSequenceAligner use Needleman-Wunsch. GlobalSequenceAligner is modified to not include the initial and terminal gaps in the result. This is a common practice in bioinformatics. It is sometimes called semiglobal or glocal alignment.

Elements are assumed to have the equivalence relationship: They represent the same thing if and only if they are equal. If you have non-trivial objects as elements, you can implement __eq__ and __hash__ methods in a way that satisfies the equivalence relationship.

I wrote this library out of necessity and one of my requirements was to compare all pairs of sequences in a set. That's why I needed the vocabulary: You can encode all sequences once, perform the alignments on fast integer arrays and decode only the alignments that you find useful. I think that's a very common use case for alignment tasks and I wouldn't want to lose the performance gain that comes with vocabularies. However, if you think you can generalize the existing algorithms without causing extra lag and without duplicating the code, you can go for it. I'm sure Python's duck typing can handle it :)

I'm not sure how many people are actively using the library but I'm aware that several people used it in the past. I would want to break their code (nor mine). Constant time backtracing can definitely be implemented as an extension to the existing code. If you need to get rid of the vocabulary though, I would like to see the end result before committing to it.

from python-alignment.

de-code avatar de-code commented on June 27, 2024

Okay, you have a PR.

I have to admit I don't know much about bioinformatics. I just want to align text from different sources.

I did use __eq__ and __hash__ - but the issue with the vocabulary is that those elements are then equal (otherwise they wouldn't align). As it will then use the integer representation it won't be possible to translate the integer back to the original sequence object (unless I have the index or it is getting the object from the passed in sequence and not the vocabulary - both will work with the PR)

from python-alignment.

eseraygun avatar eseraygun commented on June 27, 2024

I see your problem. I guess you're right. Vocabulary is not ideal in cases where you cannot simply enumerate all possible elements.

I'm pretty much convinced that an interface that supports arbitrary lists of objects would be more "pythonic" and useful for non-bioinformatics-related problems. So, I suggest we refactor the whole library without caring too much about backwards compatibility and release a version 2.0. I'll try to organize this through issues and labels.

from python-alignment.

de-code avatar de-code commented on June 27, 2024

Okay, sure. I think it would be possible to be largely backward compatible by keeping the Sequence wrappers as an optional extra. But I guess it would be good to allow a fresh start. I was also considering Cython.

If you prefer to allow more time for the changes then I could keep my changes in a branch on my fork for now (that will allow me to continue with what I actually wanted to use it for).

from python-alignment.

Related Issues (7)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.