goldenhinges's Introduction

Golden Hinges (full documentation here) is a Python library to find sets of overhangs (also called junctions, or protrusions) for multipart DNA assembly such as Golden Gate assembly.

Given a set of constraints (GC content bounds, differences between overhangs, mandatory and forbidden overhangs), Golden Hinges can find:

  • Maximal sets of valid and inter-compatible overhangs.
  • Sequence decompositions (i.e. position of cuts) which produce valid and inter-compatible overhangs, for type-2S DNA assembly.
  • Sequence mutations (subject to constraints) which enable the sequence decomposition, in extreme cases where the original sequence does not allow for such decomposition.

You can see Golden Hinges in action in this web demo: Design Golden Gate Overhangs

Examples of use

Finding maximal overhang sets

Let us compute a collection of overhangs, as large as possible, where

  • All overhangs have 25-50 GC%
  • There is a 2-basepair difference between any two overhangs (and their reverse-complement)
  • The overhangs ATGC and CCGA are forbidden

Here is the code:

from goldenhinges import OverhangsSelector
selector = OverhangsSelector(
    gc_min=0.25,
    gc_max=0.5,
    differences=2,
    forbidden_overhangs=['ATGC', 'CCGA']
)
overhangs = selector.generate_overhangs_set()
print(overhangs)

Result:

>>> ['AACG', 'CAAG', 'ACAC', 'TGAC', 'ACGA', 'AGGT',
     'TGTG', 'ATCC', 'AAGC', 'AGTC', 'TCTC', 'TAGG',
     'AGCA', 'GTAG', 'TGGA', 'ACTG', 'GAAC', 'TCAG',
     'ATGG', 'TTGC', 'TTCG', 'GATG', 'AGAG', 'TACC']
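To make the constraints above concrete, here is a minimal pure-Python sketch of how a set of overhangs could be checked against them. It does not use Golden Hinges and is not the library's implementation; the helper names (gc_content, are_compatible, is_valid_set) are hypothetical, and it assumes that "differences" means the number of mismatching positions between an overhang and both another overhang and that overhang's reverse-complement.

```python
from itertools import combinations

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def gc_content(overhang):
    """Fraction of G and C bases in the overhang."""
    return sum(base in "GC" for base in overhang) / len(overhang)


def reverse_complement(overhang):
    """Reverse-complement of a DNA string written with A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(overhang))


def n_differences(a, b):
    """Number of positions at which two equal-length overhangs differ."""
    return sum(x != y for x, y in zip(a, b))


def are_compatible(a, b, differences):
    """True if a differs enough from both b and b's reverse-complement."""
    return (
        n_differences(a, b) >= differences
        and n_differences(a, reverse_complement(b)) >= differences
    )


def is_valid_set(overhangs, gc_min, gc_max, differences, forbidden=()):
    """Check the GC bounds, the forbidden list and all pairwise differences."""
    if any(overhang in forbidden for overhang in overhangs):
        return False
    if any(not gc_min <= gc_content(o) <= gc_max for o in overhangs):
        return False
    return all(
        are_compatible(a, b, differences) for a, b in combinations(overhangs, 2)
    )


print(is_valid_set(["AACG", "CAAG"], gc_min=0.25, gc_max=0.5, differences=2))
```

A checker like this can be handy for sanity-checking a hand-picked overhang list before passing it to a design tool.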

In some cases this may take some time to complete, as the algorithm slowly builds collections of increasing sizes. An alternative algorithm, which looks for random maximal sets of compatible overhangs, is much faster but gives suboptimal solutions:

overhangs = selector.generate_overhangs_set(n_cliques=5000)

Result:

>>> ['CAAA', 'GTAA', 'ATTC', 'AATG', 'ACAT', 'ATCA',
     'AGAG', 'GCTT', 'AGTT', 'TCGT', 'CTGA', 'TGGA',
     'TAGG', 'GGTA', 'GACA']
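The idea of sampling random maximal sets can be illustrated without the library. The sketch below is an assumption about the general technique, not the code behind n_cliques: it shuffles the candidate overhangs, greedily keeps each one that is compatible with everything kept so far, and retains the largest set over several trials. The function names and the choice to drop palindromic overhangs are hypothetical.

```python
import random
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def n_differences(a, b):
    return sum(x != y for x, y in zip(a, b))


def compatible(a, b, differences):
    """True if a differs enough from b and from b's reverse-complement."""
    return min(
        n_differences(a, b), n_differences(a, reverse_complement(b))
    ) >= differences


def random_maximal_set(candidates, differences, rng):
    """Greedily grow a set from one random ordering of the candidates."""
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    selected = []
    for overhang in shuffled:
        if all(compatible(overhang, other, differences) for other in selected):
            selected.append(overhang)
    return selected


def best_of_random_sets(candidates, differences, n_trials, seed=0):
    """Run several random trials and keep the largest set found."""
    rng = random.Random(seed)
    trials = (
        random_maximal_set(candidates, differences, rng) for _ in range(n_trials)
    )
    return max(trials, key=len)


# Candidates: all 4-mers with 1 or 2 G/C bases (25-50 GC%), excluding
# palindromes (an assumption made for this illustration).
all_overhangs = ["".join(bases) for bases in product("ACGT", repeat=4)]
candidates = [
    o for o in all_overhangs
    if 1 <= sum(base in "GC" for base in o) <= 2 and o != reverse_complement(o)
]
best = best_of_random_sets(candidates, differences=2, n_trials=50)
print(len(best), best)
```

Each trial yields a set that cannot be extended further, but nothing guarantees it is the largest possible one, which is why this kind of approach is fast yet suboptimal.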

The two approaches can be combined to first find an approximate solution, then attempt to find larger sets:

test_overhangs = selector.generate_overhangs_set(n_cliques=5000)
overhangs = selector.generate_overhangs_set(start_at=len(test_overhangs))

Using experimental annealing data from Potapov 2018

This study by Potapov et al. provides insightful data on overhang annealing, in particular which overhangs have weak general annealing power, and which pairs of overhangs have significant "cross-talk". You can use the data in this paper via the Python tatapov library to identify which overhangs or overhang pairs you want the GoldenHinges OverhangsSelector to exclude:

import tatapov
from goldenhinges import OverhangsSelector

annealing_data = tatapov.annealing_data['37C']['01h']

self_annealings = tatapov.relative_self_annealings(annealing_data)
weak_self_annealing_overhangs = [
    overhang
    for overhang, self_annealing in self_annealings.items()
    if self_annealing < 0.05
]

cross_annealings = tatapov.cross_annealings(annealing_data)
high_cross_annealing_pairs = [
    overhang_pair
    for overhang_pair, cross_annealing in cross_annealings.items()
    if cross_annealing > 0.005
]

selector = OverhangsSelector(
    forbidden_overhangs=weak_self_annealing_overhangs,
    forbidden_pairs=high_cross_annealing_pairs
)

Finding a sequence decomposition

In this example, we find where to cut a 50-kilobasepair sequence to create assemblable fragments with 4-basepair overhangs. We indicate that:

  • There should be 50 fragments, with a minimum of variance in their sizes.
  • The fragments' overhangs should have 25-75 GC% with a 1-basepair difference between any two overhangs (and their reverse-complement). They should also be compatible with the 4-basepair extremities of the sequence.

from Bio import SeqIO
from goldenhinges import OverhangsSelector

# "sequence.gb" is a placeholder path: point it at your own Genbank file
sequence = SeqIO.read("sequence.gb", "genbank")
selector = OverhangsSelector(gc_min=0.25, gc_max=0.75, differences=1)
solution = selector.cut_sequence(
    sequence, equal_segments=50, max_radius=20,
    include_extremities=True
)

This returns a list of dictionaries, each representing one overhang with properties o['location'] (coordinate of the overhang in the sequence) and o['sequence'] (sequence of the overhang).
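As a small illustration of working with such a list, the sketch below computes the spacing between consecutive overhangs. The solution shown is made-up stand-in data, not real output of cut_sequence, and only the 'location' and 'sequence' keys described above are assumed; the helper name overhang_spacings is hypothetical.

```python
# Made-up stand-in for the list returned by cut_sequence().
solution = [
    {"location": 0, "sequence": "ATGC"},
    {"location": 1000, "sequence": "GGTA"},
    {"location": 1990, "sequence": "TCAG"},
]


def overhang_spacings(solution):
    """Distances between the locations of consecutive overhangs."""
    locations = sorted(o["location"] for o in solution)
    return [end - start for start, end in zip(locations, locations[1:])]


for overhang in solution:
    print(overhang["location"], overhang["sequence"])

spacings = overhang_spacings(solution)
print("spacings:", spacings)
```

Comparing the largest and smallest spacing is a quick way to see how far the solution departs from equal-sized fragments.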

This solution can be turned into a full report featuring all sequences to order (with restriction sites added on the left and right flanks), and a graphic of the overhang positions, using the following function:

from goldenhinges.reports import write_report_for_cutting_solution

write_report_for_cutting_solution(
    solution, 'full_report.zip', sequence,
    left_flank='CGTCTCA', right_flank='TGAGACG',
    display_positions=False
)

Sequence mutation and decomposition from a Genbank file

If the input sequence is a Genbank record (or a Biopython record) with locations annotated by features labeled !cut, GoldenHinges will attempt to find a decomposition with exactly one cut in each of these locations (favoring cuts located near the middle of each region).

GoldenHinges can also modify the sequence to enable a decomposition. Note that solutions involving base changes are penalized and solutions preserving the original sequence will always be preferred, so no base change will be suggested unless strictly necessary.

If the input record has DNA Chisel annotations such as @AvoidChanges or @EnforceTranslation, these will be enforced to forbid some mutations.

Here is an example of such a record:

[sequence with constraints]

And here is the code to optimize and decompose it:

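from Bio import SeqIO
from goldenhinges import OverhangsSelector

# genbank_file is the path to the annotated Genbank record
record = SeqIO.read(genbank_file, 'genbank')
selector = OverhangsSelector(gc_min=0.25, gc_max=0.75,
                             differences=2)
solution = selector.cut_sequence(record, allow_edits=True,
                                 include_extremities=True)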

Installation

Install Numberjack's dependencies first:

sudo apt install python-dev swig libxml2-dev zlib1g-dev libgmp-dev

If you have pip installed, just type in a terminal:

pip install goldenhinges

Alternatively, Golden Hinges can be installed by unzipping the source code in a directory and running this command:

sudo python setup.py install

If you have trouble installing Numberjack, you may try using swig v3 (e.g. Ubuntu 20.04 ships swig version 4):

apt-get remove -y swig
apt-get install -y swig3.0
ln /usr/bin/swig3.0 /usr/bin/swig

Then install Numberjack with pip. You may also try and build it from source:

wget https://github.com/Edinburgh-Genome-Foundry/Numberjack/archive/v1.2.0.tar.gz
tar -zxvf v1.2.0.tar.gz
cd Numberjack-1.2.0
python setup.py build -solver Mistral
python setup.py install

Contribute!

Golden Hinges is open-source software originally written at the Edinburgh Genome Foundry by Zulko and released on GitHub under the MIT licence. Everyone is welcome to contribute!

goldenhinges's People

Contributors

veghp, zulko

goldenhinges's Issues

Python3.10 breaks GoldenHinges

There seems to be an implicit call of from collections import Sequence in the code. This works up until Python3.9, but then returns the error ImportError: cannot import name 'Sequence' from 'collections'.

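This is because abstract base classes such as Sequence live in collections.abc, and the deprecated aliases that collections used to provide were removed in Python3.10. A workaround is import collections.abc as collections, although importing the name directly from collections.abc is cleaner.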

Found reference to this issue here: https://stackoverflow.com/questions/69596494/unable-to-import-freegames-python-package-attributeerror-module-collections

As I mentioned this seems to be an implicit import as there is no reference to collections in the code of either GoldenHinges or DnaFeaturesViewer.
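Until the offending dependency is identified and updated, one possible workaround is a small compatibility shim that restores the removed aliases before the dependency is imported. This is a sketch of a stopgap, not a fix within GoldenHinges itself; the list of names below is an assumption and may need extending for a given dependency.

```python
import collections
import collections.abc

# Abstract base classes whose aliases in `collections` were removed in
# Python 3.10 (illustrative subset; extend as needed).
ABC_NAMES = ("Sequence", "Mapping", "MutableMapping", "Iterable", "Callable")

for name in ABC_NAMES:
    # Only add the alias where it is missing, so older Pythons are untouched.
    if not hasattr(collections, name):
        setattr(collections, name, getattr(collections.abc, name))

# Legacy imports such as this one now succeed on Python 3.10 and later.
from collections import Sequence

print(isinstance([1, 2, 3], Sequence))
```

The shim must run before the first import of the package that still uses the old names, for instance at the very top of the entry-point script.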

Numberjack: need 'arm64', have 'x86_64' despite compiling with arch arm64 within GoldenHinges

Hi, this may be a "duplicate" of the previous issue #5, but I get the following error when I try a simple GoldenHinges example on OSX. I used the "latest" Numberjack from GitHub.

(.venv) ➜  goldenhinges python assemble_plasmid.py
/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/Bio/SeqFeature.py:1112: BiopythonParserWarning: Attempting to fix invalid location '7905..951' as it looks like incorrect origin wrapping. Please fix input file, this could have unintended behavior.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/Numberjack-1.2.1-py3.11-macosx-13-arm64.egg/Numberjack/__init__.py", line 910, in load
    lib = __import__(solverstring, fromlist=[solverspkg])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/Numberjack-1.2.1-py3.11-macosx-13-arm64.egg/Numberjack/solvers/Mistral.py", line 10, in <module>
    from . import _Mistral
ImportError: dlopen(/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/Numberjack-1.2.1-py3.11-macosx-13-arm64.egg/Numberjack/solvers/_Mistral.cpython-311-darwin.so, 0x0002): tried: '/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/Numberjack-1.2.1-py3.11-macosx-13-arm64.egg/Numberjack/solvers/_Mistral.cpython-311-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/Numberjack-1.2.1-py3.11-macosx-13-arm64.egg/Numberjack/solvers/_Mistral.cpython-311-darwin.so' (no such file), '/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/Numberjack-1.2.1-py3.11-macosx-13-arm64.egg/Numberjack/solvers/_Mistral.cpython-311-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/harijayaram/hjworkbench/goldenhinges/assemble_plasmid.py", line 31, in <module>
    main()
  File "/Users/harijayaram/hjworkbench/goldenhinges/assemble_plasmid.py", line 15, in main
    solution = selector.cut_sequence(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/goldenhinges/OverhangsSelector.py", line 402, in cut_sequence
    solution = self.cut_sequence(
               ^^^^^^^^^^^^^^^^^^
  File "/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/goldenhinges/OverhangsSelector.py", line 463, in cut_sequence
    choices = self.select_from_sets(sets_list, solutions=solutions,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/goldenhinges/OverhangsSelector.py", line 214, in select_from_sets
    solver = model.load("Mistral", variables)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/harijayaram/hjworkbench/goldenhinges/.venv/lib/python3.11/site-packages/Numberjack-1.2.1-py3.11-macosx-13-arm64.egg/Numberjack/__init__.py", line 913, in load
    raise ImportError(
ImportError: ERROR: Failed during import, wrong module name? (Mistral)

Here is my code, adapted from the GoldenHinges example:

import os
from Bio import SeqIO
from goldenhinges import OverhangsSelector
from goldenhinges.reports import write_report_for_cutting_solution


def main():
    """Example of cutting problem defined by a record.

    This example has a long
    """
    genbank_file = "pcmv-abe7-10-chiselannottest (2).gb"
    record = SeqIO.read(genbank_file, "genbank")
    selector = OverhangsSelector(gc_min=0.25, gc_max=0.75, differences=2)
    solution = selector.cut_sequence(
        record, allow_edits=True, equal_segments=20, include_extremities=True
    )

    print("Writing the report...")
    write_report_for_cutting_solution(
        solution=solution,
        target=os.path.join("results", "with_mutations"),
        sequence=record,
        left_flank="CGTCTCA",
        right_flank="TGAGACG",
    )
    print("Done! See the report in results/with_mutations/")


if __name__ == "__main__":
    main()

The Numberjack installation from this GitHub repo proceeded without any error.
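When diagnosing this kind of mismatch, it can help to confirm which architecture the running interpreter reports, since a compiled extension has to match it. The snippet below is a generic diagnostic using only the standard library; it says nothing about how Numberjack's build chooses its target architecture, and the helper name interpreter_build_info is hypothetical.

```python
import platform
import sys
import sysconfig


def interpreter_build_info():
    """Collect values that hint at which binary extensions can be loaded."""
    return {
        "executable": sys.executable,
        "machine": platform.machine(),
        "build_platform": sysconfig.get_platform(),
    }


for key, value in interpreter_build_info().items():
    print(f"{key}: {value}")
```

If the reported machine differs from the architecture of the compiled .so file named in the traceback, the extension was built for a different target than the interpreter that tries to load it.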

Decomposing circular sequences

  1. Cut similar-length fragments: in a circular sequence the two end fragments are actually joined, so the resulting fragment does not have the desired length.
  2. Cut in featured zones: the issue doesn't apply here. However, this option can be used as a workaround for the above, by specifying !cut regions at regular intervals on the circle.

A simple solution is to add an option to cut_sequence() that adjusts the cut points for circular sequences:

if equal_segments is not None:

A full solution could consider any rotation of the sequence. However, this does not seem to be practically necessary.

Suggestions are welcome.
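One way to picture the adjustment for circular sequences is to place as many cut targets as there are fragments, evenly spaced around the circle, instead of one fewer as for a linear sequence. The sketch below is an illustration under that assumption and is not part of cut_sequence(); both function names are hypothetical.

```python
def circular_cut_targets(sequence_length, n_segments, offset=0):
    """Evenly spaced cut positions around a circular sequence."""
    segment = sequence_length / n_segments
    return [
        int(round(offset + i * segment)) % sequence_length
        for i in range(n_segments)
    ]


def circular_fragment_lengths(cuts, sequence_length):
    """Fragment lengths between consecutive cuts, wrapping past the origin."""
    ordered = sorted(cuts)
    following = ordered[1:] + ordered[:1]
    return [
        (nxt - cur) % sequence_length or sequence_length
        for cur, nxt in zip(ordered, following)
    ]


cuts = circular_cut_targets(1000, 4)
print(cuts)
print(circular_fragment_lengths(cuts, 1000))
```

The offset argument rotates the whole pattern, which gives a cheap way to try a few rotations of the sequence without considering every possible one.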

Preferred overhangs

Hello! I am having difficulty providing a set of overhangs to the selector object.
I have two sets of overhangs, prioritized as set1 and set2. I want to attempt a cutting solution with only overhangs chosen either from set1 (priority) or set2, as such:

from goldenhinges import OverhangsSelector
from goldenhinges.reports import write_report_for_cutting_solution

#define goldenHinges vars
# Overhang sets
set1 = ['CCCT', 'AACG','ATCG','GCTG', 
        'TACA','GAGT','CCGA']

set2 = ['AGTG','CAGG', 'ACTC', 'AAAA', 'AGAC', 'CGAA', 'ATAG', 'AACC', 'TACA',
        'TAGA', 'ATGC', 'GATA', 'CTCC', 'GTAA', 'CTGA', 'ACAA', 
        'AGGA', 'ATTA' , 'ACCG', 'GCGA']


all_preferred = {'set1':set1,'set2': set2}
# List of forbidden 4bp overhangs
forbidden = {"eColi":["TATG", "TGAG"],
             "mam/ins": ["CCAT", "TGAT"]}
#define criteria
gc_min = 0.25
gc_max = 0.75
differences =  2
forbidden_overhangs = forbidden['mam/ins'] 

#define default selector
selector = OverhangsSelector(gc_min=gc_min, gc_max=gc_max, differences=differences)

#search for  overhangs from set 1 else try from set2
for key, value in all_preferred.items():
    #append the set to selector overhangs (something not correct here)
    print(value)
    selector.all_overhangs = value
    print(selector.all_overhangs)
    solution = selector.cut_sequence(record, allow_edits=False, include_extremities=False)
    if solution != None:
        break

This gives:


['CCCT', 'AACG', 'ATCG', 'GCTG', 'TACA', 'GAGT', 'CCGA']
['CCCT', 'AACG', 'ATCG', 'GCTG', 'TACA', 'GAGT', 'CCGA']

IndexError Traceback (most recent call last)
in
16 selector.all_overhangs = value
17 print(selector.all_overhangs)
---> 18 solution = selector.cut_sequence(record, allow_edits=False, include_extremities=False)
19 if solution != None:
20 break

~/miniconda3/envs/keras2/lib/python3.6/site-packages/goldenhinges/OverhangsSelector.py in cut_sequence(self, sequence, intervals, solutions, allow_edits, include_extremities, optimize_score, edit_penalty, equal_segments, max_radius, target_indices)
460 return None
461 choices = self.select_from_sets(sets_list, solutions=solutions,
--> 462 optimize_score=optimize_score)
463
464 def get_solution(choices):

~/miniconda3/envs/keras2/lib/python3.6/site-packages/goldenhinges/OverhangsSelector.py in select_from_sets(self, sets_list, solutions, optimize_score)
223
224 if solutions == 1:
--> 225 returned = get_solution()
226 solver.delete()
227 return returned

~/miniconda3/envs/keras2/lib/python3.6/site-packages/goldenhinges/OverhangsSelector.py in get_solution()
220 return None
221 else:
--> 222 return [self.all_overhangs[v.get_value()] for v in variables]
223
224 if solutions == 1:

~/miniconda3/envs/keras2/lib/python3.6/site-packages/goldenhinges/OverhangsSelector.py in (.0)
220 return None
221 else:
--> 222 return [self.all_overhangs[v.get_value()] for v in variables]
223
224 if solutions == 1:

IndexError: list index out of range

Is there another way to achieve this?

Thanks in advance.
