Giter Club home page Giter Club logo

seqfold's Introduction

seqfold

PyPI version DOI

Predict the minimum free energy structure of nucleic acids.

seqfold is an implementation of the Zuker, 1981 dynamic programming algorithm, the basis for UNAFold/mfold, with energy functions from SantaLucia, 2004 (DNA) and Turner, 2009 (RNA).

Installation

pypy3 (recommended)

pypy3 -m ensurepip
pypy3 -m pip install seqfold

For a 200bp sequence (on my laptop), pypy3 takes 2.5 seconds versus 15 seconds for CPython.

Conda

conda install -c bioconda seqfold

Thank you to @jonas-fuchs for this

pip

pip install seqfold

Usage

Python

from seqfold import fold, dg, dg_cache, dot_bracket

# just returns minimum free energy
dg("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC", temp = 37.0)  # -13.4

# `fold` returns a list of `seqfold.Struct` from the minimum free energy structure
structs = fold("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC")
print(sum(s.e for s in structs))  # -13.4, same as dg()
for struct in structs:
    print(struct)  # prints the i, j, ddg, and description of each structure

# `dg_cache` returns a 2D array where each (i,j) combination returns the MFE from i to j inclusive
cache = dg_cache("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC")

# `dot_bracket` returns a dot_bracket representation of the folding
print(dot_bracket(seq, structs))  # ((((((((.((((......))))..((((.......)))).))))))))

CLI

usage: seqfold [-h] [-t FLOAT] [-d] [-r] [--version] SEQ

Predict the minimum free energy (kcal/mol) of a nucleic acid sequence

positional arguments:
  SEQ                   nucleic acid sequence to fold

optional arguments:
  -h, --help            show this help message and exit
  -t FLOAT, --celcius FLOAT
                        temperature in Celsius
  -d, --dot-bracket     write a dot-bracket of the MFE folding to stdout
  -r, --sub-structures  write each substructure of the MFE folding to stdout
  --version             show program's version number and exit

Examples

$ seqfold GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC --celcius 32
-15.3
$ seqfold GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC --celcius 32 --dot-bracket --sub-structures
GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC
((((((((.((((......))))..((((.......)))).))))))))
   i    j    ddg  description
   0   48   -1.9  STACK:GG/CC
   1   47   -1.9  STACK:GG/CC
   2   46   -1.4  STACK:GA/CT
   3   45   -1.4  STACK:AG/TC
   4   44   -1.9  STACK:GG/CC
   5   43   -1.6  STACK:GT/CA
   6   42   -1.4  STACK:TC/AG
   7   41   -0.5  BIFURCATION:4n/3h
   9   22   -1.1  STACK:TT/AA
  10   21   -0.7  STACK:TA/AT
  11   20   -1.6  STACK:AC/TG
  12   19    3.0  HAIRPIN:CA/GG
  25   39   -1.9  STACK:CC/GG
  26   38   -2.3  STACK:CG/GC
  27   37   -1.9  STACK:GG/CC
  28   36    3.2  HAIRPIN:GT/CT
-15.3

Notes

  • The type of nucleic acid, DNA or RNA, is inferred from the input sequence.
  • seqfold is case-insensitive with the input sequence.
  • The default temperature is 37 degrees Celsius for both the Python and CLI interface.

Motivation

Secondary structure prediction is used for making PCR primers, designing oligos for MAGE, and tuning RBS expression rates.

While UNAFold and mfold are the most widely used applications for nucleic acid secondary structure prediction, their format and license are restrictive. seqfold is meant to be an open-source, minimalist alternative for predicting minimum free energy secondary structure.

seqfold mfold UNAFold
License MIT Academic Non-commercial $200-36,000
OS Linux, MacOS, Windows Linux, MacOS Linux, MacOS, Windows
Format python, CLI python CLI binary CLI binary
Dependencies none (mfold_util) Perl, (gnuplot, glut/OpenGL)
Graphical no yes (output) yes (output)
Heterodimers no yes yes
Constraints no yes yes

Citations

Papers, and how they helped in developing seqfold, are listed below.

Nussinov, 1980

Nussinov, Ruth, and Ann B. Jacobson. "Fast algorithm for predicting the secondary structure of single-stranded RNA." Proceedings of the National Academy of Sciences 77.11 (1980): 6309-6313.

Framework for the dynamic programming approach. It has a conceptually helpful "Maximal Matching" example that demonstrates the approach on a simple sequence with only matched or unmatched bp.

Zuker, 1981

Zuker, Michael, and Patrick Stiegler. "Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information." Nucleic acids research 9.1 (1981): 133-148.

The most cited paper in this space. Extends further than Nussinov, 1980 with a nearest neighbor approach to energies and a consideration of each of stack, bulge, internal loop, and hairpin. Their data structure and traceback method are both more intuitive than Nussinov, 1980.

Jaeger, 1989

Jaeger, John A., Douglas H. Turner, and Michael Zuker. "Improved predictions of secondary structures for RNA." Proceedings of the National Academy of Sciences 86.20 (1989): 7706-7710.

Zuker and colleagues expand on the 1981 paper to incorporate penalties for multibranched loops and dangling ends.

SantaLucia, 2004

SantaLucia Jr, John, and Donald Hicks. "The thermodynamics of DNA structural motifs." Annu. Rev. Biophys. Biomol. Struct. 33 (2004): 415-440.

The paper from which almost every DNA energy function in seqfold comes from (with the exception of multibranch loops). Provides neighbor entropies and enthalpies for stacks, mismatching stacks, terminal stacks, and dangling stacks. Ditto for bulges, internal loops, and hairpins.

Turner, 2009

Turner, Douglas H., and David H. Mathews. "NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure." Nucleic acids research 38.suppl_1 (2009): D280-D282.

Source of RNA nearest neighbor change in entropy and enthalpy parameter data. In /data.

Ward, 2017

Ward, M., Datta, A., Wise, M., & Mathews, D. H. (2017). Advanced multi-loop algorithms for RNA secondary structure prediction reveal that the simplest model is best. Nucleic acids research, 45(14), 8541-8550.

An investigation of energy functions for multibranch loops that validates the simple linear approach employed by Jaeger, 1989 that keeps runtime within O(n³).

seqfold's People

Contributors

jjti avatar leshane avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

seqfold's Issues

Numba compatibility?

I tried to Numba jit the dg function as a longshot and it appears that it failed. Have y'all tried Numba compiling any parts of this to increase the performance?

Add conda to readme

Hello there, seqfold is now on bioconda. You could add this to your readme:

conda install -c bioconda seqfold

Great packages btw!

linear structure probability

hello,
Thank you very much for the tool you developed to predict the secondary structure of a sequence. However, if we need to predict the probability of a sequence maintaining a linear structure, how should we calculate it?

ilead-cong

seqfold not finding folding at certain sequence lengths

Hello,
Thank you for making and maintaining this tool.
I'm trying to figure out the best folding for a sequence of unknown length, so I'm trying all reasonable sequence lengths 2-100 nt
the issue is that for certain lengths, seqfold is unable to find more than a single paired base, even though there are plenty of hairpins in the sequence.
in the attached image, each line is the dotbrackt view of one seqfold run, each with +1nt of the same sequence
image

do you know what this software behaviour could be caused by?

Best

Tue

DOI for citations

Hi! Thank you for implementing this! There is quite a lack of open source tools for energy calculations, so this fills a much-needed niche.

Would you ever consider adding a DOI to make it easier to cite?

I think this repository/package is increasingly being sought out due to a lack of python bindings for other tools, and I imagine some people would like to cite it in papers (perhaps even citing a particular release for reproducibility). If you're interested/willing, Github makes DOI registration fairly straightforward when paired with Zenodo.

folded structure is not the same size of the input sequence and differs significantly from mfold prediction

For example:

from seqfold import fold, dot_bracket, dg 
seq = 'AAAAAAAAAAAGGCCCCCTCTCTTGC'   
structure = dot_bracket(fold(seq))                    
energy = dg(seq) 
len(seq) != len(structure)            # 26 != 19
structure == '..........(((...)))'
energy == 0.4

RNAFold structure:
image
with dG: -0.95

and mfold structure:
image
with dG: -0.27

Am I using it wrong or is something off here?

Thanks!

RNA Linear folding feature

A week later I tried to use seqfold to calculate MFE and secondary structure but it took too much time for an HTTP request through a serverless environment (waiting time is normally limited to 30-60 seconds for performance reasons). My use case was specifically for vaccine RNA sequences, so most of them are greater than 3kbp.

So I used Linear Fold by running a bash script to build the project and putting them inside a folder of my project and using some python wrapping to run the command line LinearFold binary and parsing the output. It was much faster and

I think I can do a better job by really taking the BeamCKYParser function and wrapping using some lib like CPython, however, it will take some time.

What do you think?

Maximum Depth?

When trying to run DG on a sequence that is 1448 base pairs long, I encounter the following error:
RecursionError: maximum recursion depth exceeded in comparison

with the traceback:

Traceback (most recent call last):
File "C:\Users\Yash Lal\Desktop\Coevolution-Informatics\playground2.py", line 5, in
v = dg(s)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 89, in dg
structs = fold(seq, temp)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 69, in fold
v_cache, w_cache = _cache(seq, temp)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 161, in _cache
_w(seq, 0, n - 1, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 198, in _w
w1 = _w(seq, i + 1, j, temp, v_cache, w_cache, emap)
[Previous line repeated 989 more times]
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 191, in _w
if w_cache[i][j] != STRUCT_DEFAULT:
File "C:\Users\Yash Lal\AppData\Local\Programs\Python\Python39\lib\site-packages\seqfold\fold.py", line 24, in eq
return self.e == other.e and self.ij == other.ij
RecursionError: maximum recursion depth exceeded in comparison

Is there anything I can do to make dg work with such a long sequence?

Thank you!

inf values returned from dg() on RNA sequences

Hello,

I'm not sure if this is necessarily a bug, but I was hoping to gain some clarity on why inf values would be returned for some RNA sequences. Note this is not the case if the sequence is converted to DNA.

from seqfold import dg

# RNA sequences
>>> dg("CAGCGCGGCGGGCGGGAGUCCGGCGCGCCCUCCAUCCCCGGCGGCGUCGGCAAGGAGUAG", temp=37)
inf
>>> dg("ACAAGCAGGAGCCUAUAAAUCCCUGAGGGGGCUGCUGGGACAUCACAGAAGGUGAAAGUC", temp=37)
inf

# DNA sequences
>>> dg("CAGCGCGGCGGGCGGGAGTCCGGCGCGCCCTCCATCCCCGGCGGCGTCGGCAAGGAGTAG", temp=37)
-12.6
>>> dg("ACAAGCAGGAGCCTATAAATCCCTGAGGGGGCTGCTGGGACATCACAGAAGGTGAAAGTC", temp=37)
-5.3

Expected result dose not match calculated result

Hi, I was just tried to work with the following example from the README file:

just returns minimum free energy
dg("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC", temp = 37.0) # -12.94

And the result I got is -13.4 (and not -12.94 as stated in the example).
Could this be due to dependencies or something else?

Thanks

Add degenerate base functionality

I have been using seqfold to design oligo constructs, but have run into issues when I use degenerate bases (like in unique molecular identifiers, UMIs). Would there be a straightforward way to allow for degenerate bases? A minimal version of this might be to ignore their interactions, but include them in the backbone? Thanks!

from typing import List (one line I had to add to python demo example in README.md) Also ct format doable?

my own TODO

Great work! I may try to tweak to get output in .vienna format which has the mfe appened on same line as dot-brackets. I want to count bulges, and one way to do so involves converting .vienna dot-bracket to .ct (connect) format. Seems doable ;-)

Single-char Ttpo in README.md where I think seqfold -v -l output should have ddg not dg that is: i j ddg description not i j dg description
BTW what does ddg stand for? May also want to mention version(s) of python 3.x required. I think typing did not start showing up until 3.5 or 3.6, right? And it works differently in 3.9 per https://stackoverflow.com/questions/57505071/nameerror-name-list-is-not-defined/

Also, one line I needed in python demo

Note I had to add one extra line to the python demo in README.md
from typing import List
Adding above line got rid of NameError... as shown below. I just pasted python demo from README.md into fileseqfold-demo.py file.

$ head -4  seqfold-demo.py. # after adding `from typing import List` line
from seqfold import dg, dg_cache, fold, Struct
from typing import List
# just returns minimum free energy

$ python. seqfold-demo.py

-15.200000000000001
   0   48   -2.1  STACK:GG/CC    
   1   47   -2.1  STACK:GG/CC    
   2   46   -1.3  STACK:GA/CT    
   3   45   -1.3  STACK:AG/TC    
   4   44   -2.1  STACK:GG/CC    
   5   43   -1.5  STACK:GT/CA    
   6   42   -1.3  STACK:TC/AG    
   7   41   -0.2  BIFURCATION:4n/3h
   9   22   -1.0  STACK:TT/AA    
  10   21   -0.9  STACK:TA/AT    
  11   20   -1.5  STACK:AC/TG    
  12   19    3.1  HAIRPIN:CA/GG  
  25   39   -2.1  STACK:CC/GG    
  26   38   -2.2  STACK:CG/GC    
  27   37   -2.1  STACK:GG/CC    
  28   36    3.4  HAIRPIN:GT/CT

Without that one extra line I get NameError...

$ head -3  seqfold-demo.py # without `from typing import List` line
from seqfold import dg, dg_cache, fold, Struct

# just returns minimum free energy

$ python seqfold-demo.py
Traceback (most recent call last):
  File "seqfold-demo.py", line 7, in <module>
    structs: List[Struct] = fold("GGGAGGTCGTTACATCTGGGTAACACCGGTACTGATCCGGTGACCTCCC")
NameError: name 'List' is not defined
$ python --version
Python 3.8.8

dot_bracket output length does not match sequence length

Hello,

For the given sequence ATATCTAGGTAGATAGCTAAGTAAATAACTACGTACATACCTCTCGGTCGATCGCTCAGTCAATCACTCCGTCCATCCCGGGGAGGGCGGAAGGACGGCAGGCCGAGAGCGAAAGAACGACAGACCGCGCAAGCACGCCAGCCCAAAACAACCACACCCCTTTTGTTTATTTCTTGGTTGATTGCTTAGTTAATTACTTC, the output of dot_bracket function with temp=55, the resulting dot_bracket result: ........................................((.(((.((..((((((...........((((((.(((..........))).))))))........))))))..)).))).)) does not have the same length

Add function that exposes a fold Cache

Useful for things like primer design where the user is going to query many possible start + end positions, checking each for a high delta g, ie low foldability

Ionic conditions correction

Hello,

Any plans to include oligomer corrections for ionic conditions (Na+ and Mg++), similar to mFold?

Thanks!

Possible bug

Hi, I really like the package. A good alternative to Vienna (which isn't on pypi).

But when verifing your example sequence output against Vienna's RNAfold, it looks like seqfold misses two pair bonds? the lowercased g-c and a-u

         GGGAGGUCgUUACAUCUGGGUAAcaCCGGUACUGAUCCGGuGACCUCCC
RNAfold  (((((((((((((......)))))(((((.......))))))))))))) (-25.10)
seqfold  ((((((((.((((......))))..((((.......)))).)))))))) (-25.5)

Folding visualizations and UNAfold differences

Hello and thanks so much for writing accessible code! It's well needed.

I am coming from using UNAfold online, which is luckily still around on IDT's website right now, but I worry that it won't necessarily be there forever. I'm also interested in coding with these programs so UNAfold is out, since the owner of it hasn't responded to anything and I can't find their code anywhere. But when I was using seqfold, I noticed a bunch of differences. Is there a good explanation for this? I trust you, but I'm just not familiar with exactly where all the ambiguity may be coming from. Do you believe that your numbers are more accurate?

Back onto using UNAfold, is there a feature, where I can see the folding that occurs? When I run the fold command in seqfold I just see the base pairing that will occur. Is this for the most likely structure? When I run UNAfold, I get the dG and base pairing for multiple different folded structures.

I also think a brief manual explaining what i and j are would definitely help a bit. Again thanks for all your code and have a nice day!

Incorrect DNA_NN values in dictionary

Hi, I think your values are entered wrong.

Below are corrected values based on The Thermodynamics of DNA Structural Motifs
Annual Review of Biophysics and Biomolecular Structure

Vol. 33:415-440 (Volume publication date 9 June 2004)
https://doi.org/10.1146/annurev.biophys.32.110601.141800
'''
DNA_NN: BpEnergy = {
"init": (0.2, -5.7),
"init_G/C": (0.0, 0.0),
"init_A/T": (2.2, 6.9),
"sym": (0, -1.4),
"AA/TT": (-7.6, -21.3),
"AT/TA": (-7.2, -20.4),
"TA/AT": (-7.2, -21.3),
"CA/GT": (-8.5, -22.7),
"GT/CA": (-8.4, -22.4),
"CT/GA": (-7.8, -21.0),
"GA/CT": (-8.2, -22.2),
"CG/GC": (-10.6, -27.2),
"GC/CG": (-9.8, -24.4),
"GG/CC": (-8.0, -19.9),
}
'''

If this is not correct could you explain your values?

Thanks for putting this together! It is a great resource!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.