edinburgh-genome-foundry / dnachisel Goto Github PK

View Code? Open in Web Editor NEW

211.0 10.0 37.0 9.42 MB

:pencil2: A versatile DNA sequence optimizer

Home Page: https://edinburgh-genome-foundry.github.io/DnaChisel/

License: MIT License

Python 99.33% Pug 0.67%

sequence-design dna-optimization synthetic-biology bioinformatics synbio codon-optimization

dnachisel's People

Contributors

Stargazers

Watchers

dnachisel's Issues

AvoidPattern with regex

I ran into an issue recently where I was using a regex in my AvoidPattern and didn't realize that I needed to specify the size. What happened was that even though there plenty of available solutions avoiding the pattern, I was still getting an error that no solution could be found. This was because without a size, regexes dont' optimize on small windows and instead were introducing random perturbations on the entire sequence. Since it was a large sequence, the odds of any of these perturbations fixing the constraint was very low.

The fix was just to specify it as a SequencePattern instead (where I could specify the size) and feed that into AvoidPattern.

However, this really wasn't obvious to me - took quite a while of debugging and reading through the codebase to figure out.

I wanted to file this issue to suggest either a different approach to documenting or warning about this. Maybe either:

A) allow specifying 'size' in AvoidPattern (if providing a string for the pattern instead of SequencePattern object) and document the necessity of 'size' if it's a regex

or B) throw an error when a regex is specified without a size

or C) disallow specifying patterns with strings with AvoidPattern so that users have to go through SequencePattern and are more likely to see the note about how specifying size is needed (still relies on people reading all the docs, though).

Thanks for this tool - it has been incredibly useful and I'm just trying to think of ways to make it better.

Other optimization policies exist?

DnaChisel/dnachisel/builtin_specifications/CodonOptimize.py

Lines 12 to 14 in a318ca5

  Several codon-optimization policies exist. At the moment this Specification 

  implements a method in which codons are replaced by the most frequent 

  codon in the species.

Error if location not specified

Issue for bug mentioned in #52 (comment).
If no location given;

from dnachisel import DnaOptimizationProblem, EnforcePatternOccurence
seq="AAAAAAAAAA"

problem = DnaOptimizationProblem(
    sequence=seq,
    constraints=[
        EnforcePatternOccurence(pattern='CTG', occurences=2, strand=-1)
    ],
)

then:

AttributeError                            Traceback (most recent call last)
<ipython-input-219-82901a020845> in <module>
      5     sequence=seq,
      6     constraints=[
----> 7         EnforcePatternOccurence(pattern='CTG', occurences=2, strand=-1)
      8     ],
      9 )

/path/to/DnaChisel/dnachisel/builtin_specifications/EnforcePatternOccurence.py in __init__(self, pattern, occurences, location, strand, center, boost)
     69             if strand not in [-1, 0, 1]:
     70                 raise ValueError("unknown strand: %s" % strand)
---> 71             self.location.strand = strand
     72         self.strand = strand
     73         self.occurences = occurences

AttributeError: 'NoneType' object has no attribute 'strand'

because in

DnaChisel/dnachisel/builtin_specifications/EnforcePatternOccurence.py

Line 67 in 9c72428

self.location = Location.from_data(location)

from_data() returns a None as location, and a value is assigned to its attribute in:

DnaChisel/dnachisel/builtin_specifications/EnforcePatternOccurence.py

Line 73 in 9c72428

self.location.strand = strand

This issue also exists in other classes.

EnforcePatternOccurence raises IndexError (in some specific parameters)

related to #51

from dnachisel import *
from Bio.SeqFeature import FeatureLocation

seq="ATGCAGAGCAAGGTGCTGCTGGCCGTCGCCCTGTGGCTCTGCGTGGAGACCCGGGCCGCC"

problem = DnaOptimizationProblem(
    sequence=seq,
    constraints=[
        EnforcePatternOccurence( pattern='CTG',occurences=2,strand=-1,location=FeatureLocation(start=0,end=50))
    ],
)

print(problem.constraints_text_summary())
===> FAILURE: 1 constraints evaluations failed
 FAIL ┍ EnforcePatternOccurence[0-50(-)](CTG, occurence, 2)
      │ Failed. Pattern found 1 times instead of 2 wanted, at locations [3-6(-)]

problem.resolve_constraints()

/usr/local/lib/python3.8/site-packages/dnachisel/DnaOptimizationProblem/mixins/ConstraintsSolverMixin.py in resolve_constraints(self, final_check, cst_filter)
    355         ):
    356             try:
--> 357                 self.resolve_constraint(constraint=constraint)
    358             except NoSolutionError as error:
    359                 self.logger(constraint__index=len(constraints))

/usr/local/lib/python3.8/site-packages/dnachisel/DnaOptimizationProblem/mixins/ConstraintsSolverMixin.py in resolve_constraint(self, constraint)
    301                 try:
    302                     if hasattr(constraint, "resolution_heuristic"):
--> 303                         constraint.resolution_heuristic(local_problem)
    304                     else:
    305                         local_problem.resolve_constraints_locally()

/usr/local/lib/python3.8/site-packages/dnachisel/builtin_specifications/EnforcePatternOccurence.py in resolution_heuristic(self, problem)
    183                         mutation_space=problem.mutation_space,
    184                     )
--> 185                     new_occurence_cst.insert_pattern_in_problem(new_problem)
    186                 problem.sequence = new_problem.sequence
    187                 return
constraint:   0%|                                                                 | 0/1 [00:00<?, ?it/s, now=EnforcePatternOccurence[0...]
/usr/local/lib/python3.8/site-packages/dnachisel/builtin_specifications/EnforcePatternOccurence.py in insert_pattern_in_problem(self, problem, reverse)
    132                 sequence=sequence_to_insert, location=new_location
    133             )
--> 134             new_space = MutationSpace.from_optimization_problem(
    135                 problem, new_constraints=[new_constraint]
    136             )

/usr/local/lib/python3.8/site-packages/dnachisel/MutationSpace/MutationSpace.py in from_optimization_problem(problem, new_constraints)
    184             constraints = new_constraints
    185         mutation_choices = sorted(
--> 186             [
    187                 MutationChoice(segment=choice[0], variants=set(choice[1]))
    188                 for cst in constraints

/usr/local/lib/python3.8/site-packages/dnachisel/MutationSpace/MutationSpace.py in <listcomp>(.0)
    187                 MutationChoice(segment=choice[0], variants=set(choice[1]))
    188                 for cst in constraints
--> 189                 for choice in cst.restrict_nucleotides(sequence)
    190             ],
    191             key=lambda choice: (choice.end - choice.start, choice.start),

/usr/local/lib/python3.8/site-packages/dnachisel/builtin_specifications/EnforceSequence.py in restrict_nucleotides(self, sequence, location)
    107         if self.location.strand == -1:
    108             lend = self.location.end
--> 109             return [(i, set(reverse_complement(n) for n in
    110                             IUPAC_NOTATION[self.sequence[lend - i]]))
    111                     for i in range(start, end)]

/usr/local/lib/python3.8/site-packages/dnachisel/builtin_specifications/EnforceSequence.py in <listcomp>(.0)
    108             lend = self.location.end
    109             return [(i, set(reverse_complement(n) for n in
--> 110                             IUPAC_NOTATION[self.sequence[lend - i]]))
    111                     for i in range(start, end)]
    112         else:

IndexError: string index out of range

Error! The optimized sequence is not consist with the original one in protein level.

test code

from dnachisel import *
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna

# DEFINE THE OPTIMIZATION PROBLEM
problem = DnaOptimizationProblem(
    sequence=random_dna_sequence(900),
    constraints=[AvoidPattern(enzyme="BsaI"),
                 EnforceGCContent(mini=0.3, maxi=0.7, window=50)],
    objectives=[CodonOptimize(species='e_coli', location=(0, 3))]
)
# SOLVE THE CONSTRAINTS, OPTIMIZE WITH RESPECT TO THE OBJECTIVE
problem.resolve_constraints()
problem.optimize()
# PRINT SUMMARIES TO CHECK THAT CONSTRAINTS PASS
print(problem.constraints_text_summary())
print(problem.objectives_text_summary())
print(problem.sequence_before)
print(problem.sequence)
print("---------------------------------")
temp1 = str(
    Seq(problem.sequence_before, generic_dna).translate(
        table="Standard",
        stop_symbol="*",
        to_stop=False,
        cds=False,
        gap=None,
    )
)
temp2 = str(
    Seq(problem.sequence, generic_dna).translate(
        table="Standard",
        stop_symbol="*",
        to_stop=False,
        cds=False,
        gap=None,
    )
)
print(temp2 == temp1)
print("---------------------------------")

Why self.sequence = self.sequence_before while doing optimize_by_exhaustive_search?

DnaChisel/dnachisel/DnaOptimizationProblem.py

Line 570 in 78043a3

self.sequence = self.sequence_before

No function solve_all_constraints_one_by_one defined

DnaChisel/examples/plasmid_optimization.py

Line 95 in 50bcd9a

problem.solve_all_constraints_one_by_one()

Allow unknown amino acids denoted by 'X' in `reverse_translate` function

It would be great if the reverse_translate function could handle unknown amino acids in the sequence denoted by 'X', which would be reverse translated to 'NNN'.

Optimization changes encoded sequence-correct way to preserve ?

I really like Dnachisel and use it routinely to design protein expression constructs for E.coli. I recently got tripped up on an optimization of a ~3kb construct where I realized too late ( after the synthetic gene was ordered ) that the way I was doing my optimization mutated my resulting sequence.

I am doing my optimization using the recipe

problem = DnaOptimizationProblem(
sequence=reverse_translate(mysequence),
constraints=[AvoidPattern("SapI_site"),AvoidPattern("BbsI")],
objectives=[CodonOptimize(species="e_coli"),EnforceTranslation(),AvoidNonUniqueSegments(12),EnforceGCContent(0.25, 0.70, window=10),AvoidHairpins()],
)

The translated sequence that results has 1-2 mutations in the ~1000 amino acid sequence. I have tried it with the github version and the pip version of Dnachisel and they both do the same. There does not seem to be a pattern in where the mutations arise.

I have since learnt my lesson and am checking my designs to ensure that they translate to input amino acid sequence. But am hoping there is a way to not have the optimization spit out a mutated sequence and fail to converge.

Attaching a collaboratory notebook link and file showing what I did.

Thanks for your help in advance

Feature: Add "use_breach_form" field in "constraints_breaches_dataframe"

Scenario:
I would like to use constraints_breaches_dataframe for generating my own report and upload SeqRecord in Benchling instead of the .pdf.

Problem:
For our customers, we need a bit more general description of why their sequence doesn't pass the constraints, and currently, I need to copy/paste the whole function instead of re-using it with the use_breach_form=False field.

Unable to load customized `codon_usage_table`

Note that the species can be omited if a ``codon_usage_table`` is provided instead

However, when omitting the species args, this error message will occur.

TypeError: sequence item 0: expected str instance, NoneType found
Exception ignored in: <function tqdm.__del__ at 0x7fd3e5262620>

How to force reproducibility?

Is there a way to force DnaOptimizationProblem to return the same optimized sequence for a given input sequence (particularly when enforcing GC content)? I saw in your tests that you try setting the Numpy random seed numpy.random.seed(123), but that only works when I run problems in a given Python session.

Here is my basic setup.

import random
import numpy
from dnachisel import CodonOptimize, DnaOptimizationProblem, EnforceGCContent, EnforceTranslation

SEED = 123

DNA_SEQ = 'ATGACATTCACCGAGAAGGTATTCAAACAAGTATGCGAATGGTGTATAGGCTTCTATCGTAAATATGCTGTTCCGAGTGAAGAGTGTGACTTCTTCCCTAGATGGATATGGGGGAACAAACGGAACATTCTGACCTTTACATTCAGTACCCCAATTACCTTACTGACGGCCCCGTCGCGACTTCCCGAAGAAGGGTGCCAAATAACCATCTGCGGACAACAGCAGGATTGCTTTGCAGTTGAGCGACACACACTACCGCAGAGCACGAAGATTAACAGATGGTGCGGCCGTTGTGAGCATCATAATCAAATCTATATTGAGCCTACCGAGCCAATTGGTGAACAACACGCGGACCATTTCGATTGCATTGAGAATAACGCTATAGATGACTATCCACCTTCAAATCTACCATGCTTACAAATTCAATCGGTTGAAAATGTCGGCCACCGGGGATGGTGTCAGGTACTGGTAAGATGGTGGGTGGATGCGTACAAAGACGGGGACCCCGTGGCTTGGGGTGACTTTAGGGGCCTGGTGAAATCTGCGGAATGGACCGTCCGCCCCCGGTCCAAGGCTGGAGATGAGACTCACAGAGAGTGGAAACATCCTTCTCTCGGGTGTCCTCACGGAAACATCATCGTTCTTGCTGCACCTGTTTCTTATCATCTCAACATTGATTGCACCGTCAACGTCCATAACCTTCAGAGCGCAGAATATACAGATCCTAGTTACGCTTTCACTCACGTATGTATTACAGCGTTCTTCGCAATACAGGCGATACCGCACGCCCGTATAAGCTTTTCAAAAGAGGCCAATCGGTACTATTATTGGATCATTACCTGTCATTGTTCCAGTTATTGCTGCTTACGTGAATGTTGGATATCGACTCGCAAAGATAAGTGGCATAGCCATAAGTACAAGGACAGCCCTCTATTTGTCGAATTTGGCTGCAAGGAAACTTGGTGCTTAGAACATTATGACAAGGAGGTCTCCCAGACCAAGCGCTACGGTTGCAAGCTCGTTGGAGATCGGTGCGGTTTCCATGAGGCCTTGTTTTTTGAGACCAAATGGAGAAACACGGAACCCCAACGGCTGTACGCATACTGGTCCATAAAAGTAGATTGCTGTAAGATCTATTCGGAAAACAACACAAAGAACGTGGATGAGATAGAATGGGCGACAAGAGAGTGGATTTGGCAAAAAAATATTAAGATACGGTCCAATGTAAGATTCCATTATCAGCCGGACCAAGAACATCCGGACGTCTGGCATGGTTTGTGCGAGCAACTAGTTACTATTGACAAAAATCACGAACTTCCGGACGGCCAGAAAATAATTTGTTGGGGTAAGGAGTGGCATGACGCGATTTACCAGGATATCGCTCCTTTTTACGCGTTAGTGCTGGAATATCGATGGTTTCAGTGGGGCAGATACGAGTGTTGGAAGACCTGGGGTCTCGCGGGCTGCACGGACCTTAATTGGGAGGACCCTTTCAGAGGGACGCACCCCATTCTACACAAGGGACAGGGGTATATTAAAGATGCGGCCTTGTTACGTTGGTTCCGCCACTTACGTTCCTCATTCTTTGCAGGAGTTAAGGCGAAAGACGAAGTCGGATATAATGGTTGCCAGTGGGATTACCGAGATTTACGAGAAGGCTTCGATTGTTGGCAATACACTTGCAGGAACTTTCTTAGTCCTAACCAGGTTAATGACTGTTGGTACAAAGGCGCATGTTGGAAACACACACCCGGATGGTTTGTAAAGGTTAAGTTTTTGTGGTGTACGAATGAAGCTTATTACTGGGGATTAAAGACCTGGCATAAAACACGCCAGATTCATAGTGACGGGACAAACCATGGGCTTACAGAATGCGTTATAACTTCATTGCGCGCAATTGACCAGCGGCCAAATAGCGACAAGGTAGAGAGTCCTGGACGGAAGTGTGAACGCGTGAGTCGAGTCCACTGCAGCTGGCGTAGCCAGGCCAAAGAGAAGGGAAGACAGGAAGCCCATTTCTACCTGGGGTGCGTGGTGGGCATTATAGGACATGACGTGCAAGAGCTTGAGATTGACTTAACGACTGATGACATCAGCAGTAATAAAAATTGCCACCACTGTGATCCTAATGCAAAGTGTCACTTTCACAAGGGGCAGGTAACGGATGAACACGCCCTATGTGATGAATGGGTTGTATTAAGATTTCCATATAATACCCAACGATGGAAAGCCATCCTCAGTTTGGAAGGCTTTTCTCAGCGAGAGTACGACAACTTCCCTCATTTACAAGCCAAACGGTGGCGTAAGGTGGCTCAGATCGATTACACTTTCGAGAATAACTTCTACTGGCATTGGTGGCAGCACCATGGTGGGAAGAAGGTGATCTGGGCGACACCAGCGATTACAGCCGGAATAGATTACCAAGACGCGAGATGCTCACCAACGGTAGCTCATCACAATAAGTTCTCATGTCACTATAATTGGTGGTTTCTTAAGTCCTCAGAGTATGACCAGCACTTCCCTGACACACAGGTAGTTGTTTTAGACTGGGATTGTGCGCCGAAGGGGCAGCTAGAACCGGTTACGTGTATCCAAACGTTTGACCGAATACACGCAGATGTATTCAGACAGTGGAACGCCCAGTGGTGGTGGGATTTCCGATTACAAACGAATTGGTATTGGCACTTCGACAACGATGAATGTCGGCCGTCGCTCCCATGGCGGGATTGTAATCTCCTTCAGTTTTATCTCCAGCTTGGTTGTAAGATCGCATTTCAACGTACCTGGCAGAAAACCCACCATCATCGCGTGTGCCCTGTGCCCGTCAGTTGGTTTCGCTGTATTGACAATCAAATCATTTTCCACTTCATACATGAATCTCAATGGACCCAGTCTCGTCAGAGAACTGTCTACGACCCCGTTCCTCACCGTCATTGGCACTGTTGGCCCTGGGCCATTTGGCGAAATGGACGCTACCATTACAAGTACAGCACACCTTGA'


def get_problem(dna_seq: str, gc_min: float = 0.40, gc_max: float = 0.60, bp_window: int = 60, species: str = 'e_coli'):
    # Force reproducibility
    random.seed(SEED)
    numpy.random.seed(SEED)  

    return DnaOptimizationProblem(
        sequence=dna_seq,
        constraints=[
            EnforceGCContent(mini=gc_min, maxi=gc_max, window=bp_window),
            EnforceTranslation()
        ],
        objectives=[
            CodonOptimize(species=species)
        ],
        logger=None
    )

# This passes!
def test_reproducibility():
    dna_seqs = []

    for i in range(10):
        problem = get_problem(DNA_SEQ)

        problem.resolve_constraints()
        problem.optimize()

        dna_seqs.append(problem.sequence)

    assert len(set(dna_seqs)) == 1


# This fails (optm_dna_seq is copy-pasted from a previous run)
def test_simple_optimization():
    optm_dna_seq = 'ATGACCTTCACCGAGAAGGTGTTTAAGCAGGTGTGCGAATGGTGCATTGGCTTTTATCGCAAATATGCGGTGCCGAGCGAAGAATGCGATTTCTTCCCGCGCTGGATTTGGGGCAACAAACGCAACATTCTGACCTTTACCTTTAGCACCCCAATTACCTTACTGACCGCGCCGAGCCGTTTACCGGAAGAAGGCTGCCAGATTACCATTTGCGGCCAACAGCAAGATTGCTTTGCGGTGGAACGCCATACCCTGCCGCAGAGCACCAAAATTAACCGTTGGTGCGGCCGTTGTGAACATCATAACCAGATTTATATTGAACCGACCGAACCGATTGGCGAACAGCATGCGGATCATTTTGATTGCATTGAGAACAACGCGATTGATGACTACCCGCCGAGCAACCTGCCGTGCTTACAGATTCAAAGCGTTGAAAATGTGGGCCATCGCGGCTGGTGCCAGGTTCTGGTTAGATGGTGGGTGGATGCATATAAAGATGGCGATCCGGTGGCGTGGGGTGATTTTCGCGGTCTGGTTAAATCTGCGGAATGGACTGTGCGCCCACGCAGCAAAGCGGGTGATGAAACCCATCGTGAATGGAAACATCCGAGCCTGGGCTGCCCGCATGGCAACATTATCGTGCTGGCGGCACCAGTTAGCTATCATCTGAACATTGATTGCACCGTGAACGTGCATAACCTGCAGAGCGCGGAATATACCGATCCGAGCTATGCGTTTACCCATGTGTGCATTACCGCGTTTTTTGCGATTCAGGCGATTCCGCATGCGCGCATTAGCTTTAGCAAAGAGGCGAACCGCTATTATTATTGGATCATTACCTGCCACTGCAGCAGCTATTGCTGCCTGCGCGAATGCTGGATTAGCACCCGCAAAGATAAATGGCATAGCCATAAATACAAGGACAGCCCGCTGTTTGTGGAATTTGGCTGCAAAGAAACCTGGTGCCTGGAACATTATGATAAAGAAGTGAGCCAGACCAAACGCTATGGCTGCAAACTGGTGGGCGATCGCTGCGGTTTTCATGAAGCGCTGTTTTTTGAAACCAAATGGCGCAACACCGAACCGCAGCGCCTGTATGCGTATTGGAGCATTAAAGTGGATTGCTGCAAAATCTACAGCGAGAACAACACCAAAAACGTGGATGAGATCGAATGGGCGACCCGCGAATGGATTTGGCAGAAAAACATTAAAATTCGCAGCAACGTGCGTTTTCATTATCAGCCGGACCAGGAACATCCGGATGTGTGGCATGGCCTGTGCGAGCAGCTGGTGACTATTGATAAAAACCATGAACTGCCGGATGGCCAGAAAATTATTTGCTGGGGCAAAGAATGGCATGATGCGATTTACCAGGATATCGCGCCGTTTTATGCGCTGGTGCTGGAATATCGCTGGTTTCAGTGGGGCAGATATGAATGCTGGAAAACCTGGGGCCTGGCGGGCTGTACTGATCTGAATTGGGAAGATCCGTTTCGTGGCACCCATCCGATTCTGCATAAAGGCCAGGGCTATATTAAAGATGCGGCGCTGCTGCGCTGGTTTCGCCATCTGCGTAGCTCTTTTTTTGCGGGTGTTAAAGCGAAAGATGAAGTGGGCTATAACGGCTGCCAGTGGGATTACCGCGATCTGCGCGAAGGCTTCGATTGCTGGCAGTATACCTGCCGCAACTTTCTGAGCCCGAACCAGGTGAACGATTGCTGGTATAAAGGCGCGTGTTGGAAACATACCCCGGGCTGGTTTGTGAAAGTGAAATTTCTGTGGTGCACCAACGAAGCGTATTATTGGGGCCTGAAAACCTGGCATAAAACCCGCCAGATTCATAGCGATGGCACCAACCATGGCCTGACCGAATGCGTGATTACCAGCTTGCGCGCAATTGATCAGCGCCCGAACAGCGATAAAGTGGAAAGCCCAGGTAGAAAATGCGAACGCGTGAGTCGTGTGCATTGCAGCTGGCGTAGCCAGGCGAAAGAAAAAGGCCGTCAGGAAGCGCATTTTTATCTGGGCTGCGTGGTGGGCATTATTGGCCATGACGTGCAGGAACTGGAAATTGACCTGACCACCGATGATATTAGCAGCAACAAAAACTGCCATCATTGCGATCCGAACGCGAAATGCCATTTTCATAAAGGCCAGGTGACCGATGAACATGCGCTGTGCGATGAATGGGTGGTGCTGCGCTTTCCGTATAACACCCAGCGCTGGAAAGCGATTCTGAGCCTGGAAGGCTTTAGCCAGCGCGAATATGATAACTTTCCGCATCTGCAGGCGAAACGCTGGCGCAAAGTGGCGCAGATTGATTATACCTTTGAAAACAACTTTTATTGGCATTGGTGGCAGCACCATGGTGGCAAGAAAGTGATTTGGGCGACCCCGGCGATTACTGCGGGTATTGATTATCAGGATGCGCGCTGCTCTCCAACTGTGGCGCATCACAACAAATTCAGCTGCCATTACAACTGGTGGTTTCTGAAAAGCAGCGAATATGATCAGCATTTCCCGGATACCCAGGTGGTGGTTTTAGATTGGGATTGCGCGCCGAAAGGCCAGCTGGAACCGGTGACCTGCATTCAGACCTTTGATCGCATTCATGCGGATGTGTTTCGCCAGTGGAACGCGCAGTGGTGGTGGGATTTTCGCCTGCAGACCAACTGGTATTGGCATTTTGATAACGATGAATGCCGCCCGAGCCTGCCGTGGCGCGATTGCAACCTGTTACAGTTTTATCTGCAGCTGGGTTGCAAAATTGCGTTTCAGCGTACCTGGCAGAAAACCCATCATCATCGCGTGTGCCCGGTGCCGGTTAGCTGGTTTCGCTGCATTGACAACCAGATTATTTTCCACTTCATTCACGAGAGCCAGTGGACCCAGTCTCGTCAGCGTACTGTGTATGATCCGGTTCCACATCGCCATTGGCATTGTTGGCCGTGGGCGATTTGGCGTAATGGCCGCTATCATTATAAATATAGCACCCCGTAA'

    problem = get_problem(DNA_SEQ)
    problem.resolve_constraints()
    problem.optimize()

    assert problem.sequence == optm_dna_seq

If you run test_reproducibility you get the same sequence for all iterations of the loop. This test passes.

However, I tried running test_simple_optimization once, recording the output and storing it into optm_dna_seq, then running the test again. This almost always fails for large sequences.

I can't tell where the randomization is happening. Is there a different random number generator that I should set the seed for?

Sorry if this is a bit convoluted, couldn't think how else to explain it.

Issue when importing as of Nov. 28, 2022

Seems to be an error with dna_features_viewer or something. Can't import module.

TypeError Traceback (most recent call last)
in
----> 1 from goldenhinges import OverhangsSelector, OverhangsSelector
2 import dnachisel
3 import geneblocks
4 from dnachisel import (EnforceTranslation, Specification, SpecEvaluation,
5 reverse_translate, random_protein_sequence, Location,

5 frames
/usr/local/lib/python3.7/dist-packages/dna_features_viewer/biotools.py in
36
37 aa_short_to_long_form_dict = {
---> 38 _aa1: _aa3[0] + _aa3[1:].lower() for (_aa1, _aa3) in zip(aa1 + "", aa3 + [""])
39 }
40

TypeError: can only concatenate tuple (not "str") to tuple

Solving "high repeat density" in codon optimized sequences

I'm trying to use DnaChisel to perform codon optimization of sequences that will be accepted by Twist for DNA synthesis. However I get an error "high repeat density" and that "more than 15% of your sequence is composed of repeats 9bp or longer". Is there any work-around for this?

def codon_optimize(my_protein):
    """
    Use dnachisel to perform contrained codon optimization
    See: https://www.twistbioscience.com/faq/gene-synthesis/codon-optimization-what-steps-are-taken-maintain-wild-type-protein-expression
    """
    my_sequence = reverse_translate(my_protein, table='Standard')
    problem = DnaOptimizationProblem(
        sequence=my_sequence,
        constraints=[
            AvoidPattern("GGAGG"), # ribosome binding sites
            AvoidPattern("TAAGGAG"), # ribosome binding sites
            AvoidPattern("TTTTT"), # terminator
            AvoidPattern("AAAAA"), # terminator
            UniquifyAllKmers(12),
            AvoidPattern(HomopolymerPattern("A", 10)), # homopolymer runs of 10 or more bases
            AvoidPattern(HomopolymerPattern("C", 10)), # homopolymer runs of 10 or more bases
            AvoidPattern(HomopolymerPattern("G", 10)), # homopolymer runs of 10 or more bases
            AvoidPattern(HomopolymerPattern("T", 10)), # homopolymer runs of 10 or more bases
            AvoidHairpins(location=[0, 48]), # strong hairpins (∆G <-8) in the first 48 base pairs
            EnforceGCContent(mini=0.35, maxi=0.65, window=50), # local GC windows (50 bp) of less than 35% or more than 65%
            EnforceGCContent(mini=0.25, maxi=0.65, window=len(my_sequence)), # global GC% of less than 25% or more than 65% 
            EnforceTranslation(start_codon="keep"), # keep ATG start codon
            AvoidRareCodons(0.08, 'h_sapiens') # rare codons (those with a codon frequency of <8%)
        ],
        objectives=[
            CodonOptimize(species='h_sapiens', method='use_best_codon')
        ],
        logger=None
    )
    problem.resolve_constraints()
    problem.optimize()
    return problem.sequence

Representation of MutationChoice as degenerate sequence

In order to test several constructs which equally satisfy a problem's constraints and select for the optimal/functional ones, it may be desirable to replace a subsequence with degenerate oligonucleotides.
Given the current implementation of the MutationSpace class, it should be fairly easy to provide a toDegenerateSeq method translating the list of choices into such encoding.
I can help with the implementation if you are interested in adding this feature, although there's only partial documentation for the MutationSpace and MutationChoice classes. Feel free to close the issue if you don't think it's worth it.

CodonOptimize mode 'harmonized' only optimizes first 1/3 of the sequence

Using the CodonOptimize 'harmonized' functionality, I do not get a codon distribution similar to the one I specified. It seems that the second 2/3s of a sequence are never optimized.

A minimal example:

from dnachisel import *

problem = DnaOptimizationProblem(
            sequence='GACGACGACAAAAAAAAAAAAAAAAAA',
            constraints=[EnforceTranslation()],
            objectives=[CodonOptimize(species='b_subtilis', mode='harmonized')]
           )

problem.resolve_constraints()
problem.optimize()

print('SEQUENCE:', problem.sequence)

frequencies, positions = biotools.biotools.codons_frequencies_and_positions(problem.sequence)
print(*frequencies.items(), sep='\n')

This sequence encodes DDDKKKKKK and gives as output:

SEQUENCE: GACGATGATAAAAAAAAAAAAAAAAAA
('V', {'total': 0, 'GTA': 0.0, 'GTT': 0.0, 'GTC': 0.0, 'GTG': 0.0})
('L', {'CTT': 0.0, 'TTA': 0.0, 'CTC': 0.0, 'CTG': 0.0, 'CTA': 0.0, 'total': 0, 'TTG': 0.0})
('Q', {'CAG': 0.0, 'total': 0, 'CAA': 0.0})
('G', {'total': 0, 'GGC': 0.0, 'GGA': 0.0, 'GGT': 0.0, 'GGG': 0.0})
('P', {'CCA': 0.0, 'total': 0, 'CCT': 0.0, 'CCG': 0.0, 'CCC': 0.0})
('A', {'total': 0, 'GCG': 0.0, 'GCC': 0.0, 'GCT': 0.0, 'GCA': 0.0})
('T', {'total': 0, 'ACT': 0.0, 'ACA': 0.0, 'ACG': 0.0, 'ACC': 0.0})
('C', {'total': 0, 'TGT': 0.0, 'TGC': 0.0})
('S', {'AGC': 0.0, 'AGT': 0.0, 'TCT': 0.0, 'total': 0, 'TCG': 0.0, 'TCC': 0.0, 'TCA': 0.0})
('N', {'AAC': 0.0, 'total': 0, 'AAT': 0.0})
('H', {'CAC': 0.0, 'total': 0, 'CAT': 0.0})
('I', {'total': 0, 'ATA': 0.0, 'ATT': 0.0, 'ATC': 0.0})
('D', {'GAT': 0.6666666666666666, 'total': 3, 'GAC': 0.3333333333333333})
('M', {'total': 0, 'ATG': 0.0})
('F', {'TTT': 0.0, 'total': 0, 'TTC': 0.0})
('R', {'CGG': 0.0, 'AGG': 0.0, 'CGC': 0.0, 'CGA': 0.0, 'total': 0, 'AGA': 0.0, 'CGT': 0.0})
('K', {'total': 6, 'AAG': 0.0, 'AAA': 1.0})
('E', {'total': 0, 'GAG': 0.0, 'GAA': 0.0})
('*', {'total': 0, 'TAA': 0.0, 'TGA': 0.0, 'TAG': 0.0})
('Y', {'total': 0, 'TAC': 0.0, 'TAT': 0.0})
('W', {'total': 0, 'TGG': 0.0})

It does well for Aspartic Acid (D), as this has a GAT 0.64 / GAC 0.36 ratio, but for K with a AAA 0.7 / AAG 0.3 ratio, it does nothing at all. This is position dependent, as a sequence encoding KKKDDDDDD does the opposite.

While trying to debug, I printed all variants that were used during exhaustive search for this sequence (line 454 of DnaOptimizationProblem.py) , which are:

GACGACGACAAAAAAAAAAAAAAAAAA
GATGACGACAAAAAAAAAAAAAAAAAA
GATGACGACAAAAAAAAAAAAAAAAAA
GATGACGATAAAAAAAAAAAAAAAAAA
GATGACGATAAAAAAAAAAAAAAAAAA
GATGATGATAAAAAAAAAAAAAAAAAA

This again points to the fact that the later positions are never included in the variants to be analysed.

I'm using the latest version, dnachisel 1.1.

It would be great if you could find out where the problem is and come up with a quick fix, so I can use your library for my thesis. Thank you!

DnaChisel CLI - output to genbank not working

Hello,

when using the following command :
>>> dnachisel annotated_record.gb optimized_record.gb
I get the output below but although it says "Result written to optimized_record.gb" no file is created.

Before optimization:

===> FAILURE: 1 constraints evaluations failed
✔PASS ┍ EnforceTranslation[0-507(+)]
│ Enforced by nucleotides restrictions
FAIL ┍ AvoidPattern0-507
│ Failed. Pattern found at positions [493-499(+)]

===> TOTAL OBJECTIVES SCORE: -42.08
-42.08 ┍ MaximizeCAI0-507(+)
│ Codon opt. on window 0-507(+) scored -4.21E+01

Resolving Constraints...
Optimizing...

After optimization:

===> SUCCESS - all constraints evaluations pass
✔PASS ┍ EnforceTranslation[0-507(+)]
│ Enforced by nucleotides restrictions
✔PASS ┍ AvoidPattern0-507
│ Passed. Pattern not found !

===> TOTAL OBJECTIVES SCORE: 0
✔ 0 ┍ MaximizeCAI0-507(+)
│ Codon opt. on window 0-507(+) scored -0.00E+00

Result written to optimized_record.gb

The output files are created correctly if I use the other commands :
>>> dnachisel annotated_record.gb optimization_report/
and
>>> dnachisel annotated_record.gb optimization_report.zip

The original genbank file is also processed correctly using the web app.

Avoid repeats in nucleotide sequence

Hello, a long time ago I think DNA Chisel had an AvoidRepeat spec, but I can't seem to find it anymore.

Is there a way to avoid repeats in the whole nucleotide sequence (e.g. avoid repeats of length 10 - like GGGGGGGGGG or any base - in the entire sequence). I think UniquifyAllKmers is the generalized version of this? But I'm confused as to what values to use to match my use case (sorry, non-biologist here 😳).

Is it just k = 10, leaving reference = None and location = None?

pip install issues

Hello! The library looks great. I like the multi-layered constraints. They would work perfectly for a project I am working on right now.

I am, however, having some issues with pip3 install'ing the package.

python2 pip install totally fails because of some weazyprint version issue (which is fine for me, just sharing)
pip3 install doesn't make the module global. I can't import it elsewhere. I do not know why right now
git cloning the repo and pip3 install -e .'ing it works for adding it as a global dependency, but I still get warnings about versions when importing:

If I get time and find a fix I'll submit a PR. My first approach would be to pin the dependencies (particularly around weazy) in setup.py (move to requirements.txt).

In the Readme the install instructions make the plot package installation look optional, but then in setup.py find_packages is used, which I think is installing weazyprint, etc) even though I have no interest in the visual reports (or the pdf_report tbh). I looks like dna_chisel has close to 35 dependencies right now which is a bit excessive. The rest of my application has ~5. And things like defusedxml, webencodings, CairoSVG are all being added

Geneblocks not displaying gene image on optimization report

Hello!
I have ran dnachisel.reports.write_optimization_report(problem=problem, target=".") and the report consists of every component except for the gene graphs/ geneblocks.

The only way I have been able to see the geneblocks image is by running separate code, shown below, and it is saved as an extra PNG file.
seq_1 = dnachisel.sequence_to_biopython_record(problem.sequence_before) seq_2 = problem.to_record() diff_blocks = DiffBlocks.from_sequences(seq_1, seq_2) ax1, ax2 = diff_blocks.merged().plot(figure_width=8)
But I would really like to see them in the optimization report. Any help would be appreciated. Thanks!

can I generate a sequence with a sequence with 'medium' score?

By default, dnachisel gives a 'best' sequence with score close to 0.
I can get a 'worst' sequence by new_score < score.

If i want a 'medium' sequence , i can set a target score,
and set the if condition to abs(new_score - target) < abs(score - target).

Since scores varies from objectives, I have to get worset score first to determint the target score.

Is there any solution to generate 3 sequences just using optimize() once?

Support for circular sequences

Hi,
Does DnaChisel support circular sequences? I may be missing something, but judging from this line and the behavior on a quick test, I doubt it:

DnaChisel/dnachisel/Location.py

Lines 72 to 74 in 5a8cce7

 def extract_sequence(self, sequence): 

 """Return the subsequence read at the given location.""" 

 result = sequence[self.start:self.end]

Would it be possible to add support for this? I think it is a very useful feature, above all when removing restriction sites, since they could span over the end/start.

Thank you
Simone

avoiding repeats

hey-- great module! I've been using to great effect for awhile now. one feature that'd really help out is repeat avoidance to be able to synthesize DNA at providers with fewer issues. is this possible in the package currently, or would it need to be built? thanks for developing and sharing this software!

I couldnt find the ObjectiveEvaluation and Objective classes mentioned in the examples/non_unique_kmers_minimization.py

I was trying to understand the example included in ".//examples/non_unique_kmers_minimization.py" and they mention two classes "ObjectiveEvaluation" and Objective. I couldnt find these anywhere in the source, can someone point me in the right direction.

Alternatively I am looking to write DNAOptimization objective that will design protein coding repeats without using the same codons for each repeat in an expression plasmid.

Any help in this will be greatly appreciated.

Unable to assign max_random_iters according to the demo example.

DnaChisel/examples/features/example_max_random_iters.py

Line 23 in 50bcd9a

problem.optimize(max_random_iters=10000)

This line will triggered the error.

However, problem.max_random_iters = 10000 will work.

Feature: Add GC content variance constraint

Twist Bioscience has a GC variance constraint. Quote from their website:

Avoid extreme differences in GC content within a gene (i.e. the difference in GC content between the highest and lowest 50 bp stretch should be no greater than 52%).

If a sequence with such abnormality is loaded in their interface the following error occurs:

We located two 50 bp windows with a GC content difference of over 52%. Redesigning these regions to level out the peaks and troughs will improve the likelihood of success. Please note that any changes made to any part of the sequence will trigger rescoring and may highlight new problematic regions. Therefore, if possible, we suggest optimizing the whole sequence.

Using DNA Chisel + DNA Weaver without such constraint might lead to the construction of non-valid sequences.

Question: No boost for AllowPrimer

I noticed some objectives such as "dnachisel.builtin_specifications.AllowPrimer" do not have any boost parameter. How are they balanced with other objectives people might specify concurrently? Many objective options have the boost parameters, but not all?

Best,
Josh

Sequence diversity in output from CodonOptimize

Hello! Thanks for this great tool. I am exploring using it on a project.

Is there an existing way to get dnachisel to return a few codon optimized sequences, maybe some that are slightly sub-optimal so I can have more diversity when ordering codon optimized sequences for DNA synthesis?

Best,
Josh

Issue when using codon optimization with harmonization

I ran into an issue when using codon optimization using mode="harmonized". I get a division by zero error that occurs on line 522 of biotools.py:

data[codon] = 1.0 * value / total

My gene did not have an amino acid ("H") and so that value for "total" for this amino acid was 0, causing the divide by zero error.

Suggested fix:
line 519:
total = max(data['total'], 1)

Suggestion: rename biotools to tools

since there are tools like mathtools,
renaming it and put Location to tools dir will make code more organized.

Uniquify all kmers when include_reverse_complement=False

It crashes due to:
UnboundLocalError: local variable 'L' referenced before assignment
Can be fixed by moving:

DnaChisel/dnachisel/builtin_specifications/UniquifyAllKmers.py

Line 43 in 726b2b0

L = len(sequence)

To above the if statement.

Fix AvoidHairpins to search within last window

I noticed that the AvoidHairpins specification does not search for hairpins within the last window. Intuitively, if I set my window size to be the same as my sequence length, the algorithm should find every hairpin in the sequence. In reality, though, this algorithm will not find any hairpins.

Here is some updated pseudocode that I briefly tested, it seems to be working:

for i in range(len(sequence) - stem_size:
    word = sequence[i : i + stem_size]
    if len(sequence) - i < window:
        window = len(sequence) - i
    if stem_size > window / 2:
        break
    rest = reverse[-(i + window) : -(i + stem_size)]
    if word in rest:
        // follow existing algorithm

Mismatched amino acid sequence

I'm trying to codon optimize a nucleic acid sequence but DnaChisel is returning sequences with different amino acids than the original sequence. Any idea why I am getting this result? I'm looking to codon optimize a nucleic acid sequence for humans.

Here is the relevant part of my code:

some_sequence = "LONG_SEQ(See bottom)"
upper_sequence = some_sequence.upper()
seq_len = len(some_sequence)

problem = DnaOptimizationProblem(
sequence=upper_sequence,
objectives=[CodonOptimize(species='h_sapiens', method="use_best_codon")]
)

problem.optimize()

new sequence = problem.sequence

Here is an example of what I am getting back:
original_aa = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFT...
new_aa = "MFVFLVFFPLVYRQWVKLTTRTHFPPAYTKYFT...

Here is the sequence
some_sequence = "atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgcatacactaattctttcacacgtggtgtttattaccctgacaaagttttcagatcctcagttttacattcaactcaggacttgttcttacctttcttttccaatgttacttggttccatgctatacatgtctctgggaccaatggtactaagaggtttgataaccctgtcctaccatttaatgatggtgtttattttgcttccactgagaagtctaacataataagaggctggatttttggtactactttagattcgaagacccagtccctacttattgttaataacgctactaatgttgttattaaagtctgtgaatttcaattttgtaatgatccatttttgggtgtttattaccacaaaaacaacaaaagttggatggaaagtgagttcagagtttattctagtgcgaataattgcacttttgaatatgtctctcagccttttcttatggaccttgaaggaaaacagggtaatttcaaaaatcttagggaatttgtgtttaagaatattgatggttattttaaaatatattctaagcacacgcctattaatttagtgcgtgatctccctcagggtttttcggctttagaaccattggtagatttgccaataggtattaacatcactaggtttcaaactttacttgctttacatagaagttatttgactcctggtgattcttcttcaggttggacagctggtgctgcagcttattatgtgggttatcttcaacctaggacttttctattaaaatataatgaaaatggaaccattacagatgctgtagactgtgcacttgaccctctctcagaaacaaagtgtacgttgaaatccttcactgtagaaaaaggaatctatcaaacttctaactttagagtccaaccaacagaatctattgttagatttcctaatattacaaacttgtgcccttttggtgaagtttttaacgccaccagatttgcatctgtttatgcttggaacaggaagagaatcagcaactgtgttgctgattattctgtcctatataattccgcatcattttccacttttaagtgttatggagtgtctcctactaaattaaatgatctctgctttactaatgtctatgcagattcatttgtaattagaggtgatgaagtcagacaaatcgctccagggcaaactggaaagattgctgattataattataaattaccagatgattttacaggctgcgttatagcttggaattctaacaatcttgattctaaggttggtggtaattataattacctgtatagattgtttaggaagtctaatctcaaaccttttgagagagatatttcaactgaaatctatcaggccggtagcacaccttgtaatggtgttgaaggttttaattgttactttcctttacaatcatatggtttccaacccactaatggtgttggttaccaaccatacagagtagtagtactttcttttgaacttctacatgcaccagcaactgtttgtggacctaaaaagtctactaatttggttaaaaacaaatgtgtcaatttcaacttcaatggtttaacaggcacaggtgttcttactgagtctaacaaaaagtttctgcctttccaacaatttggcagagacattgctgacactactgatgctgtccgtgatccacagacacttgagattcttgacattacaccatgttcttttggtggtgtcagtgttataacaccaggaacaaatacttctaaccaggttgctgttctttatcaggatgttaactgcacagaagtccctgttgctattcatgcagatcaacttactcctacttggcgtgtttattctacaggttctaatgtttttcaaacacgtgcaggctgtttaataggggctgaacatgtcaacaactcatatgagtgtgacatacccattggtgcaggtatatgcgctagttatcagactcagactaattctcctcggcgggcacgtagtgtagctagtcaatccatcattgcctacactatgtcacttggtgcagaaaattcagttgcttactctaataactctattgccatacccacaaattttactattagtgttaccacagaaattctaccagtgtctatgaccaagacatcagtagattgtacaatgtacatttgtggtgattcaactgaatgcagcaatcttttgttgcaatatggcagtttttgtacacaattaaaccgtgctttaactggaatagctgttgaacaagacaaaaacacccaagaagtttttgcacaagtcaaacaaatttacaaaacaccaccaattaaagattttggtggttttaatttttcacaaatattaccagatccatcaaaaccaagcaagaggtcatttattgaagatctacttttcaacaaagtgacacttgcagatgctggcttcatcaaacaatatggtgattgccttggtgatattgctgctagagacctcatttgtgcacaaaagtttaacggccttactgttttgccacctttgctcacagatgaaatgattgctcaatacacttctgcactgttagcgggtacaatcacttctggttggacctttggtgcaggtgctgcattacaaataccatttgctatgcaaatggcttataggtttaatggtattggagttacacagaatgttctctatgagaaccaaaaattgattgccaaccaatttaatagtgctattggcaaaattcaagactcactttcttccacagcaagtgcacttggaaaacttcaagatgtggtcaaccaaaatgcacaagctttaaacacgcttgttaaacaacttagctccaattttggtgcaatttcaagtgttttaaatgatatcctttcacgtcttgacaaagttgaggctgaagtgcaaattgataggttgatcacaggcagacttcaaagtttgcagacatatgtgactcaacaattaattagagctgcagaaatcagagcttctgctaatcttgctgctactaaaatgtcagagtgtgtacttggacaatcaaaaagagttgatttttgtggaaagggctatcatcttatgtccttccctcagtcagcacctcatggtgtagtcttcttgcatgtgacttatgtccctgcacaagaaaagaacttcacaactgctcctgccatttgtcatgatggaaaagcacactttcctcgtgaaggtgtctttgtttcaaatggcacacactggtttgtaacacaaaggaatttttatgaaccacaaatcattactacagacaacacatttgtgtctggtaactgtgatgttgtaataggaattgtcaacaacacagtttatgatcctttgcaacctgaattagactcattcaaggaggagttagataaatattttaagaatcatacatcaccagatgttgatttaggtgacatctctggcattaatgcttcagttgtaaacattcaaaaagaaattgaccgcctcaatgaggttgccaagaatttaaatgaatctctcatcgatctccaagaacttggaaagtatgagcagtatataaaatggccatggtacatttggctaggttttatagctggcttgattgccatagtaatggtgacaattatgctttgctgtatgaccagttgctgtagttgtctcaagggctgttgttcttgtggatcctgctgcaaatttgatgaagacgactctgagccagtgctcaaaggagtcaaattacattacacataa"

Support for optimization of RNA structure

Even though the AvoidHairpins specification is a good compromise between speed and quality of the results, there are few tools available for optimizing RNA structure accurately (none actually usable programmatically, as far as I know), and implementing it in DnaChisel would make it even more unique than it already is :)
Using an external package like Vienna, it shouldn't be too hard to implement it. d-tailor is an example of how to integrate it into a codon optimization framework.
Feel free to close this issue if you are not interested in supporting this feature!

Thanks
Simone

constraints_text_summary() says AvoidPattern(RepeatedKmerPattern(2, 9)) passed, but it didn't

Hi,
I'm using DNAchisel to encode some difficult to synthesize proteins. I tried removing all kmer repeats, and constraints_text_summary() says pass, but I'm able to find a repeat. What am I not understanding about how to codon optimize?

for example, in the below code, there's a 26nt repeat that doesn't get removed.

thanks for helping me out!
-Lucas

output:

repeated kmers: ('CCGGATCATATGAAACAGCATGACTT', 2) 26
[222, 936]
===> SUCCESS - all constraints evaluations pass
✔PASS ┍ EnforceTranslation[0-2856]
      │ Enforced by nucleotides restrictions
✔PASS ┍ AvoidPattern[0-2856](pattern:BsaI(GGTCTC))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:BbsI(GAAGAC))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:XhoI(CTCGAG))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:NcoI(CCATGG))
      │ Passed. Pattern not found !
✔PASS ┍ AvoidRareCodons[0-2856]
      │ Enforced by nucleotides restrictions
✔PASS ┍ EnforceGCContent[0-2856](mini:0.38, maxi:0.62, window:100)
      │ Passed !
✔PASS ┍ AvoidPattern[0-2856](pattern:2x9mer)
      │ Passed. Pattern not found !


===> TOTAL OBJECTIVES SCORE:    -33.76
    -33.76 ┍ MatchTargetCodonUsage[0-2856](e_coli) 
           │ Codon opt. on window 0-2856 scored -3.38E+01


make sure the encoding is correct:  True

code:

from pathlib import Path 
import pandas as pd
from dnachisel import (
    DnaOptimizationProblem,
    translate,
    reverse_translate,
    CodonOptimize,
    EnforceTranslation,
    AvoidPattern,
    EnforceGCContent,
    RepeatedKmerPattern,
)

from dnachisel import *
import re
from Bio.Restriction import *
from Bio.Seq import Seq
from collections import defaultdict



#### functions to find longest repeated substring ### 
def getsubs(loc, s):
    substr = s[loc:]
    i = -1
    while(substr):
        yield substr
        substr = s[loc:i]
        i -= 1

def longestRepetitiveSubstring(r, minocc=3):
    occ = defaultdict(int)
    # tally all occurrences of all substrings
    for i in range(len(r)):
        for sub in getsubs(i,r):
            occ[sub] += 1

    # filter out all substrings with fewer than minocc occurrences
    occ_minocc = [k for k,v in occ.items() if v >= minocc]

    if occ_minocc:
        maxkey =  max(occ_minocc, key=len)
        return maxkey, occ[maxkey]
    else:
        raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))


#### encode 4x sfGFP ### 

ntseq = 'ATGCGTAAAGGCGAAGAGCTGTTCACTGGTGTCGTCCCTATTCTGGTGGAACTGGATGGTGATGTCAACGGTCATAAGTTTTCCGTGCGTGGCGAGGGTGAAGGTGACGCAACTAATGGTAAACTGACGCTGAAGTTCATCTGTACTACTGGTAAACTGCCGGTACCTTGGCCGACTCTGGTAACGACGCTGACTTATGGTGTTCAGTGCTTTGCTCGTTATCCGGACCATATGAAGCAGCATGACTTCTTCAAGTCCGCCATGCCGGAAGGCTATGTGCAGGAACGCACGATTTCCTTTAAGGATGACGGCACGTACAAAACGCGTGCGGAAGTGAAATTTGAAGGCGATACCCTGGTAAACCGCATTGAGCTGAAAGGCATTGACTTTAAAGAAGACGGCAATATCCTGGGCCATAAGCTGGAATACAATTTTAACAGCCACAATGTTTACATCACCGCCGATAAACAAAAAAATGGCATTAAAGCGAATTTTAAAATTCGCCACAACGTGGAGGATGGCAGCGTGCAGCTGGCTGATCACTACCAGCAAAACACTCCAATCGGTGATGGTCCTGTTCTGCTGCCAGACAATCACTATCTGAGCACGCAAAGCGTTCTGTCTAAAGATCCGAACGAGAAACGCGATCATATGGTTCTGCTGGAGTTCGTAACCGCAGCGGGCATCACGCATGGTATGGATGAACTGTACAAA'
ntseq = ntseq + ntseq + ntseq + ntseq 

constraints=[
            EnforceTranslation(),
            AvoidPattern("BsaI_site"),
            AvoidPattern("BbsI_site"),
            AvoidPattern("XhoI_site"),
            AvoidPattern("NcoI_site"),
            AvoidRareCodons(0.1, species='e_coli'),
            EnforceGCContent(mini=0.38, maxi=0.62, window=100),
            AvoidPattern(RepeatedKmerPattern(2, 9)),
            ]


problem = DnaOptimizationProblem(
    sequence=reverse_translate(protein_sequence=translate(ntseq),table='Standard'),
    constraints=constraints,
    objectives=[CodonOptimize(species='e_coli', method='match_codon_usage')],  
)


problem.max_random_iters = 2000000
problem.resolve_constraints()
problem.optimize()

new_encoding = problem.sequence

# find longest repeated substring
found_this = longestRepetitiveSubstring(new_encoding, minocc=2)
print("repeated kmers:" , found_this, len(found_this[0]))
print([match.start() for match in re.finditer(found_this[0], new_encoding)])

# how does DNAChisel think it did? 
print(problem.constraints_text_summary())
print(problem.objectives_text_summary())

print('make sure the encoding is correct: ', translate(new_encoding) == translate(ntseq))

Avoid TF sites example does not work

I am using Anaconda on a Windows PC. Everything is up to date on my side. I would really love to use this feature but I am not knowledgeable enough to write my own script.

Thanks for helping/fixing!!

I get the following error after executing the example file:

NoSolutionError Traceback (most recent call last)
C:\Users\Public\Documents\Wondershare\CreatorTemp/ipykernel_23256/3391708123.py in
24 )
25
---> 26 problem.resolve_constraints()
27 problem.max_random_iters = 20000
28 problem.to_record("sequence_without_tf_binding_sites.gb")

~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraints(self, final_check, cst_filter)
358 except NoSolutionError as error:
359 self.logger(constraint__index=len(constraints))
--> 360 raise error
361 if final_check:
362 self.perform_final_constraints_check()

~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraints(self, final_check, cst_filter)
355 ):
356 try:
--> 357 self.resolve_constraint(constraint=constraint)
358 except NoSolutionError as error:
359 self.logger(constraint__index=len(constraints))

~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraint(self, constraint)
319 location__message="Cold exit",
320 )
--> 321 raise error
322 else:
323 continue

~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraint(self, constraint)
303 constraint.resolution_heuristic(local_problem)
304 else:
--> 305 local_problem.resolve_constraints_locally()
306 self._replace_sequence(local_problem.sequence)
307 break

~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraints_locally(self)
171 """
172 if self.mutation_space.space_size < self.randomization_threshold:
--> 173 self.resolve_constraints_by_exhaustive_search()
174 else:
175 self.resolve_constraints_by_random_mutations()

~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraints_by_exhaustive_search(self)
78 raise NoSolutionError(
79 "Exhaustive search failed to satisfy all constraints.",
---> 80 problem=self,
81 )
82

NoSolutionError: While solving AvoidPattern0-3711 in 0-6:

Exhaustive search failed to satisfy all constraints.

Objective functions

Hi,
Is there a way to see all possible constraints/objectives? I don't seem to find them, nor on the git, nor in the documentation.
More specifically I'm interested if there is some kind of harmonization possible using this tool?

Package looks really nice by the way.

Best regards,
Pieter

Report generation broken after geneblocks install

Hello,

I have installed DnaChisel using the [reports] suffix to generate reports. When I run DnaChisel on a genbank file or on a random seq (readme example) I do get a report but in the "Report.pdf" file it tells me to "install Geneblocks to see a friendly plot of sequence edits here. (pip install geneblocks) ". However, if I do so when I run my script again the report is not generated and I get the following error :

Traceback (most recent call last):
File "test.py", line 26, in
problem.optimize_with_report(target="report.zip")
File "/usr/local/lib/python3.8/dist-packages/dnachisel/DnaOptimizationProblem/DnaOptimizationProblem.py", line 260, in optimize_with_report
data = write_optimization_report(
File "/usr/local/lib/python3.8/dist-packages/dnachisel/reports/optimization_reports.py", line 366, in write_optimization_report
diffs_ax = plot_optimization_changes(problem)
File "/usr/local/lib/python3.8/dist-packages/dnachisel/reports/optimization_reports.py", line 277, in plot_optimization_changes
diffs = DiffBlocks.from_sequences(sequence_before, sequence_after)
File "/usr/local/lib/python3.8/dist-packages/geneblocks/DiffBlocks/DiffBlocks.py", line 79, in from_sequences
common_blocks = CommonBlocks.from_sequences(
File "/usr/local/lib/python3.8/dist-packages/geneblocks/CommonBlocks/CommonBlocks.py", line 56, in from_sequences
homologies_dict = find_homologies_between_sequences(
File "/usr/local/lib/python3.8/dist-packages/geneblocks/CommonBlocks/commonblocks_tools.py", line 133, in find_homologies_between_sequences
proc = subprocess.Popen(
File "/usr/lib/python3.8/subprocess.py", line 854, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'blastn'

Custom Codon Usage tables for Codon Optimization

Hello,

I just found the DNA Chisel python library and it looks like it's exactly what I was looking for!

I want to codon optimize the sequence of some proteins for expression in plants, but I would like to use "custom" codon usage tables. In the help files I found that "You can also use a TaxID to refer to a species, e.g. species=1423 at which case the codon frequencies will be downloaded from the Kazusa codon usage database (assuming it isn’t down!)". Data from that webpage is vastly outdated (the last data they used to generate the codon usage tables is from 2007, and for some species there are many more sequences available that allow the generation of more reliable codon usage tables). Moreover, I have built codon usage tables for groups of related plant species and would like to test if they work.

My question is, is there any way of "loading" a custom codon usage table (I could prepare them in the same format as the Kazusa webpage). However, I am not proficient in Python and would not know even where to start...

Best,

optimize using usage table cann't get a high score when use alone.

here is my code:

from dnachisel import *

mEGFP_dna = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGCGCGGCGAGGGCGAGGGCGATGCCACCAACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTCCTTCAAGGACGACGGCACCTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTTCAACAGCCACAACGTCTATATCACGGCCGACAAGCAGAAGAACGGCATCAAGGCGAACTTCAAGATCCGCCACAACGTCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCAAGCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA'

problem = DnaOptimizationProblem(
    sequence = mEGFP_dna, 
    constraints= [
        EnforceTranslation(),
#         EnforceGCContent('15-90%/20bp'),
#         EnforceGCContent('15-80%/50bp'),
#         EnforceGCContent('28-76%/100bp'), # GC含量
        AvoidHairpins(stem_size=6, hairpin_window=40,location=(0,40)),
        AvoidPattern(r'(.)\1{8}'),   # v
#         AvoidRareCodons(species='284590',min_frequency = 0.1) # 避免出现usage 小于0.1 的稀有密码子
    ],
    objectives=[
        CodonOptimize(species='284590',method = "match_codon_usage"),
#         EnforceGCContent(target = 0.4005),
#         MinimizeMFE( temperature = 30,location=(0,40)), # RNAfold 计算mfe
        
        
    ], 
)
print("BEFORE OPTIMIZATION:\n\n", problem.constraints_text_summary())
print(problem.objectives_text_summary())

problem.resolve_constraints()
problem.optimize()

print("AFTER OPTIMIZATION:\n\n", problem.constraints_text_summary())
print(problem.objectives_text_summary())

Request: use updated codon usage tables

Hello,

I have a request that should enhance the utility of DNA Chisel for codon optimization.

When selecting Codon Usage Tables for species not included in DNA Chisel, Kazusa Codon tables are used. These codon tables are extremely outdated, generated from Genbank data only up to 2007. For some species, there is only one CDS used, and some codons have a 0 usage (see here for examples).

I suggest that more up-to-date codon tables should be included in DNA chisel. For example, CoCoPUTs is an up-to-date database with codon usage tables generated with each new Genbank or Refseq release every three months. I guess it should not be hard to retrieve the usage tables for each species (there is a huge file with the data, but I assume they could be retrieved "on the fly" too).

I am in no way associated with the people that run CoCoPUTs, but I think it is the most up-to-date resource for codon usage tables.

Remove 'RSCU' from documentation

See #62 (comment)

Issue with AvoidPattern (and potentially other builtin_specifications)

It looks as though AvoidPattern doesn't preserve its strand information when being localized if the original strand information is 'from_location'

DnaChisel/dnachisel/builtin_specifications/AvoidPattern.py

Line 85 in db0b606

def localized(self, location, problem=None, with_righthand=True):

Example illustrating this:

ipdb> cst
    # AvoidPattern[4-1624(-)](pattern:AACAAAT)

ipdb> location
    # 9-10

ipdb> new_cst = cst.localized(location, problem=self)
ipdb> new_cst
    # AvoidPattern[4-16](pattern:AACAAAT)

ipdb> new_cst.location
    # 4-16

ipdb> new_cst.strand
    #  'from_location'

The from_location is strand argument is preserved, but the original strand in the location is lost, so the resulting constraint loses the strandedness. I ran into this today when my optimization kept failing and the violated constraint was being reported on the positive strand even though that constraint should only be on the negative.

This was in version 3.2.6 (though the current code looks unchanged)

To fix this in the short term, it looks like I can specify the strand directly with its own argument when creating the constraint (let me know if this is not good for some reason).

This tool is very useful, thank you for the continued development!

dnachisel not working on AWS Lambda

Importing the library doesn't seem to work in AWS Lambda.

Something like the following:

import dnachisel as dc

def lambda_handler(event, context):  
    return "test"

When run on lambda throws:

{
  "errorMessage": "Unable to import module 'lambda_function': cannot import name '_aligners' from partially initialized module 'Bio.Align' (most likely due to a circular import) (/var/task/Bio/Align/__init__.py)",
  "errorType": "Runtime.ImportModuleError",
  "stackTrace": []
}

Everything works fine if I mock the lambda locally. I uploaded the lambda code and the dependencies as a zip file, but also tried uploading dnachisel as a lambda layer and got the same error.

Python version: 3.8
dnachisel version: 3.2.9

Raise an error if the codon usage table is not found for the given taxonomy ID

I opened a PR in Edinburgh-Genome-Foundry/codon-usage-tables#4, but I think you're using a custom version of this code? I couldn't find it here. I wonder if it would be possible to raise an error when the usage table for a taxonomy ID isn't available on the kazusa website.

Otherwise, DNA Chisel just spits out a KeyError (codon) somewhere downstream when it looks up the codon in an empty table.

suggestion: put builtin_specification to Specification folder and split it into sub folders by specification type

like this:

└── Specification
    ├── builtin_specifications
    │   ├── alignment_related
    │   │   ├── AvoidBlast.py
    │   │   └── AvoidMatches.py
    │   ├── codons_related
    │   ├── pattern_related
    │   │   ├── AvoidChanges.py
    │   │   ├── AvoidPattern.py
    │   │   └── EnforcePattern.py
    │   └── second_structure_related
    │       └── AvoudHairpins.py
    ├── SpecEvaluation
    ├── Specification.py
    └── SpecificationSet.py

Make AvoidPattern() accept multiple patterns

Hi,
When using DnaChisel, I try to set more than one pattern for AvoidPattern() like this, but it seems only the first one can be identified. Is there any way to make AvoidPattern() accept multiple patterns at the same time?

problem = DnaOptimizationProblem(
sequence=some_sequence,
constraints=[
AvoidPattern("BsmBI_site", "BamHI_site"),
EnforceGCContent(mini=0.3, maxi=0.7, window=50),
EnforceTranslation()
],
objectives=[CodonOptimize(species='e_coli')]
)

AvoidPattern with biopython FeatureLocation raises AttributeError (in some specific parameters)

Adding a simple DNA pattern TGAA as an AvoidPattern constraint causes an AttributeError, the stack trace is listed below

"/usr/local/lib/python3.8/site-packages/dnachisel/DnaOptimizationProblem/mixins/ConstraintsSolverMixin.py", line 357, in resolve_constraints
    self.resolve_constraint(constraint=constraint)
  File "/usr/local/lib/python3.8/site-packages/dnachisel/DnaOptimizationProblem/mixins/ConstraintsSolverMixin.py", line 251, in resolve_constraint
    evaluation = this_local_constraint.evaluate(self)
AttributeError: 'NoneType' object has no attribute 'evaluate'

Suggestion: add design method like d-tailor

It useful to design seq for experiment.
Besides, I think it can solve #27

Error if ~harmonize_rca used in GenBank

If the Genbank file has ~harmonize_rca(e_coli => h_sapiens), then running

from dnachisel import DnaOptimizationProblem
problem = DnaOptimizationProblem.from_record("example_sequence.gb")

returns

/path/to/DnaChisel/dnachisel/Specification/FeatureRepresentationMixin.py in from_label(cls, label, location, specifications_dict)
    127                 "Unknown parameter %s for specification %s "
    128                 "at location %s"
--> 129                 % (faulty_parameter, specification, kwargs["location"])
    130             )
    131 

TypeError: Unknown parameter e_coli  for specification harmonize_rca at location 1218-2308(+)

Same error for h_sapiens or h_sapiens_9606.

The function works with CodonOptimize(), e.g. CodonOptimize(species='h_sapiens', location=(0, 99), original_species='e_coli').

	Several codon-optimization policies exist. At the moment this Specification
	implements a method in which codons are replaced by the most frequent
	codon in the species.

	def extract_sequence(self, sequence):
	"""Return the subsequence read at the given location."""
	result = sequence[self.start:self.end]

edinburgh-genome-foundry / dnachisel Goto Github PK

dnachisel's People

Contributors

Stargazers

Watchers

Forkers

dnachisel's Issues

Recommend Projects

Recommend Topics

Recommend Org