edinburgh-genome-foundry / dnachisel Goto Github PK
View Code? Open in Web Editor NEW:pencil2: A versatile DNA sequence optimizer
Home Page: https://edinburgh-genome-foundry.github.io/DnaChisel/
License: MIT License
:pencil2: A versatile DNA sequence optimizer
Home Page: https://edinburgh-genome-foundry.github.io/DnaChisel/
License: MIT License
I ran into an issue recently where I was using a regex in my AvoidPattern and didn't realize that I needed to specify the size. What happened was that even though there plenty of available solutions avoiding the pattern, I was still getting an error that no solution could be found. This was because without a size, regexes dont' optimize on small windows and instead were introducing random perturbations on the entire sequence. Since it was a large sequence, the odds of any of these perturbations fixing the constraint was very low.
The fix was just to specify it as a SequencePattern instead (where I could specify the size) and feed that into AvoidPattern.
However, this really wasn't obvious to me - took quite a while of debugging and reading through the codebase to figure out.
I wanted to file this issue to suggest either a different approach to documenting or warning about this. Maybe either:
A) allow specifying 'size' in AvoidPattern (if providing a string for the pattern instead of SequencePattern object) and document the necessity of 'size' if it's a regex
or B) throw an error when a regex is specified without a size
or C) disallow specifying patterns with strings with AvoidPattern so that users have to go through SequencePattern and are more likely to see the note about how specifying size is needed (still relies on people reading all the docs, though).
Thanks for this tool - it has been incredibly useful and I'm just trying to think of ways to make it better.
DnaChisel/dnachisel/builtin_specifications/CodonOptimize.py
Lines 12 to 14 in a318ca5
Issue for bug mentioned in #52 (comment).
If no location given;
from dnachisel import DnaOptimizationProblem, EnforcePatternOccurence
seq="AAAAAAAAAA"
problem = DnaOptimizationProblem(
sequence=seq,
constraints=[
EnforcePatternOccurence(pattern='CTG', occurences=2, strand=-1)
],
)
then:
AttributeError Traceback (most recent call last)
<ipython-input-219-82901a020845> in <module>
5 sequence=seq,
6 constraints=[
----> 7 EnforcePatternOccurence(pattern='CTG', occurences=2, strand=-1)
8 ],
9 )
/path/to/DnaChisel/dnachisel/builtin_specifications/EnforcePatternOccurence.py in __init__(self, pattern, occurences, location, strand, center, boost)
69 if strand not in [-1, 0, 1]:
70 raise ValueError("unknown strand: %s" % strand)
---> 71 self.location.strand = strand
72 self.strand = strand
73 self.occurences = occurences
AttributeError: 'NoneType' object has no attribute 'strand'
because in
from_data() returns aNone
as location, and a value is assigned to its attribute in:
related to #51
from dnachisel import *
from Bio.SeqFeature import FeatureLocation
seq="ATGCAGAGCAAGGTGCTGCTGGCCGTCGCCCTGTGGCTCTGCGTGGAGACCCGGGCCGCC"
problem = DnaOptimizationProblem(
sequence=seq,
constraints=[
EnforcePatternOccurence( pattern='CTG',occurences=2,strand=-1,location=FeatureLocation(start=0,end=50))
],
)
print(problem.constraints_text_summary())
===> FAILURE: 1 constraints evaluations failed
FAIL ┍ EnforcePatternOccurence[0-50(-)](CTG, occurence, 2)
│ Failed. Pattern found 1 times instead of 2 wanted, at locations [3-6(-)]
problem.resolve_constraints()
/usr/local/lib/python3.8/site-packages/dnachisel/DnaOptimizationProblem/mixins/ConstraintsSolverMixin.py in resolve_constraints(self, final_check, cst_filter)
355 ):
356 try:
--> 357 self.resolve_constraint(constraint=constraint)
358 except NoSolutionError as error:
359 self.logger(constraint__index=len(constraints))
/usr/local/lib/python3.8/site-packages/dnachisel/DnaOptimizationProblem/mixins/ConstraintsSolverMixin.py in resolve_constraint(self, constraint)
301 try:
302 if hasattr(constraint, "resolution_heuristic"):
--> 303 constraint.resolution_heuristic(local_problem)
304 else:
305 local_problem.resolve_constraints_locally()
/usr/local/lib/python3.8/site-packages/dnachisel/builtin_specifications/EnforcePatternOccurence.py in resolution_heuristic(self, problem)
183 mutation_space=problem.mutation_space,
184 )
--> 185 new_occurence_cst.insert_pattern_in_problem(new_problem)
186 problem.sequence = new_problem.sequence
187 return
constraint: 0%| | 0/1 [00:00<?, ?it/s, now=EnforcePatternOccurence[0...]
/usr/local/lib/python3.8/site-packages/dnachisel/builtin_specifications/EnforcePatternOccurence.py in insert_pattern_in_problem(self, problem, reverse)
132 sequence=sequence_to_insert, location=new_location
133 )
--> 134 new_space = MutationSpace.from_optimization_problem(
135 problem, new_constraints=[new_constraint]
136 )
/usr/local/lib/python3.8/site-packages/dnachisel/MutationSpace/MutationSpace.py in from_optimization_problem(problem, new_constraints)
184 constraints = new_constraints
185 mutation_choices = sorted(
--> 186 [
187 MutationChoice(segment=choice[0], variants=set(choice[1]))
188 for cst in constraints
/usr/local/lib/python3.8/site-packages/dnachisel/MutationSpace/MutationSpace.py in <listcomp>(.0)
187 MutationChoice(segment=choice[0], variants=set(choice[1]))
188 for cst in constraints
--> 189 for choice in cst.restrict_nucleotides(sequence)
190 ],
191 key=lambda choice: (choice.end - choice.start, choice.start),
/usr/local/lib/python3.8/site-packages/dnachisel/builtin_specifications/EnforceSequence.py in restrict_nucleotides(self, sequence, location)
107 if self.location.strand == -1:
108 lend = self.location.end
--> 109 return [(i, set(reverse_complement(n) for n in
110 IUPAC_NOTATION[self.sequence[lend - i]]))
111 for i in range(start, end)]
/usr/local/lib/python3.8/site-packages/dnachisel/builtin_specifications/EnforceSequence.py in <listcomp>(.0)
108 lend = self.location.end
109 return [(i, set(reverse_complement(n) for n in
--> 110 IUPAC_NOTATION[self.sequence[lend - i]]))
111 for i in range(start, end)]
112 else:
IndexError: string index out of range
test code
from dnachisel import *
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
# DEFINE THE OPTIMIZATION PROBLEM
problem = DnaOptimizationProblem(
sequence=random_dna_sequence(900),
constraints=[AvoidPattern(enzyme="BsaI"),
EnforceGCContent(mini=0.3, maxi=0.7, window=50)],
objectives=[CodonOptimize(species='e_coli', location=(0, 3))]
)
# SOLVE THE CONSTRAINTS, OPTIMIZE WITH RESPECT TO THE OBJECTIVE
problem.resolve_constraints()
problem.optimize()
# PRINT SUMMARIES TO CHECK THAT CONSTRAINTS PASS
print(problem.constraints_text_summary())
print(problem.objectives_text_summary())
print(problem.sequence_before)
print(problem.sequence)
print("---------------------------------")
temp1 = str(
Seq(problem.sequence_before, generic_dna).translate(
table="Standard",
stop_symbol="*",
to_stop=False,
cds=False,
gap=None,
)
)
temp2 = str(
Seq(problem.sequence, generic_dna).translate(
table="Standard",
stop_symbol="*",
to_stop=False,
cds=False,
gap=None,
)
)
print(temp2 == temp1)
print("---------------------------------")
DnaChisel/dnachisel/DnaOptimizationProblem.py
Line 570 in 78043a3
It would be great if the reverse_translate
function could handle unknown amino acids in the sequence denoted by 'X', which would be reverse translated to 'NNN'.
I really like Dnachisel and use it routinely to design protein expression constructs for E.coli. I recently got tripped up on an optimization of a ~3kb construct where I realized too late ( after the synthetic gene was ordered ) that the way I was doing my optimization mutated my resulting sequence.
I am doing my optimization using the recipe
problem = DnaOptimizationProblem(
sequence=reverse_translate(mysequence),
constraints=[AvoidPattern("SapI_site"),AvoidPattern("BbsI")],
objectives=[CodonOptimize(species="e_coli"),EnforceTranslation(),AvoidNonUniqueSegments(12),EnforceGCContent(0.25, 0.70, window=10),AvoidHairpins()],
)
The translated sequence that results has 1-2 mutations in the ~1000 amino acid sequence. I have tried it with the github version and the pip version of Dnachisel and they both do the same. There does not seem to be a pattern in where the mutations arise.
I have since learnt my lesson and am checking my designs to ensure that they translate to input amino acid sequence. But am hoping there is a way to not have the optimization spit out a mutated sequence and fail to converge.
Attaching a collaboratory notebook link and file showing what I did.
Thanks for your help in advance
Scenario:
I would like to use constraints_breaches_dataframe
for generating my own report and upload SeqRecord
in Benchling
instead of the .pdf
.
Problem:
For our customers, we need a bit more general description of why their sequence doesn't pass the constraints, and currently, I need to copy/paste the whole function instead of re-using it with the use_breach_form=False
field.
Note that the species can be omited if a ``codon_usage_table`` is provided instead
However, when omitting the species args, this error message will occur.
TypeError: sequence item 0: expected str instance, NoneType found
Exception ignored in: <function tqdm.__del__ at 0x7fd3e5262620>
Is there a way to force DnaOptimizationProblem
to return the same optimized sequence for a given input sequence (particularly when enforcing GC content)? I saw in your tests that you try setting the Numpy random seed numpy.random.seed(123)
, but that only works when I run problems in a given Python session.
Here is my basic setup.
import random
import numpy
from dnachisel import CodonOptimize, DnaOptimizationProblem, EnforceGCContent, EnforceTranslation
SEED = 123
DNA_SEQ = 'ATGACATTCACCGAGAAGGTATTCAAACAAGTATGCGAATGGTGTATAGGCTTCTATCGTAAATATGCTGTTCCGAGTGAAGAGTGTGACTTCTTCCCTAGATGGATATGGGGGAACAAACGGAACATTCTGACCTTTACATTCAGTACCCCAATTACCTTACTGACGGCCCCGTCGCGACTTCCCGAAGAAGGGTGCCAAATAACCATCTGCGGACAACAGCAGGATTGCTTTGCAGTTGAGCGACACACACTACCGCAGAGCACGAAGATTAACAGATGGTGCGGCCGTTGTGAGCATCATAATCAAATCTATATTGAGCCTACCGAGCCAATTGGTGAACAACACGCGGACCATTTCGATTGCATTGAGAATAACGCTATAGATGACTATCCACCTTCAAATCTACCATGCTTACAAATTCAATCGGTTGAAAATGTCGGCCACCGGGGATGGTGTCAGGTACTGGTAAGATGGTGGGTGGATGCGTACAAAGACGGGGACCCCGTGGCTTGGGGTGACTTTAGGGGCCTGGTGAAATCTGCGGAATGGACCGTCCGCCCCCGGTCCAAGGCTGGAGATGAGACTCACAGAGAGTGGAAACATCCTTCTCTCGGGTGTCCTCACGGAAACATCATCGTTCTTGCTGCACCTGTTTCTTATCATCTCAACATTGATTGCACCGTCAACGTCCATAACCTTCAGAGCGCAGAATATACAGATCCTAGTTACGCTTTCACTCACGTATGTATTACAGCGTTCTTCGCAATACAGGCGATACCGCACGCCCGTATAAGCTTTTCAAAAGAGGCCAATCGGTACTATTATTGGATCATTACCTGTCATTGTTCCAGTTATTGCTGCTTACGTGAATGTTGGATATCGACTCGCAAAGATAAGTGGCATAGCCATAAGTACAAGGACAGCCCTCTATTTGTCGAATTTGGCTGCAAGGAAACTTGGTGCTTAGAACATTATGACAAGGAGGTCTCCCAGACCAAGCGCTACGGTTGCAAGCTCGTTGGAGATCGGTGCGGTTTCCATGAGGCCTTGTTTTTTGAGACCAAATGGAGAAACACGGAACCCCAACGGCTGTACGCATACTGGTCCATAAAAGTAGATTGCTGTAAGATCTATTCGGAAAACAACACAAAGAACGTGGATGAGATAGAATGGGCGACAAGAGAGTGGATTTGGCAAAAAAATATTAAGATACGGTCCAATGTAAGATTCCATTATCAGCCGGACCAAGAACATCCGGACGTCTGGCATGGTTTGTGCGAGCAACTAGTTACTATTGACAAAAATCACGAACTTCCGGACGGCCAGAAAATAATTTGTTGGGGTAAGGAGTGGCATGACGCGATTTACCAGGATATCGCTCCTTTTTACGCGTTAGTGCTGGAATATCGATGGTTTCAGTGGGGCAGATACGAGTGTTGGAAGACCTGGGGTCTCGCGGGCTGCACGGACCTTAATTGGGAGGACCCTTTCAGAGGGACGCACCCCATTCTACACAAGGGACAGGGGTATATTAAAGATGCGGCCTTGTTACGTTGGTTCCGCCACTTACGTTCCTCATTCTTTGCAGGAGTTAAGGCGAAAGACGAAGTCGGATATAATGGTTGCCAGTGGGATTACCGAGATTTACGAGAAGGCTTCGATTGTTGGCAATACACTTGCAGGAACTTTCTTAGTCCTAACCAGGTTAATGACTGTTGGTACAAAGGCGCATGTTGGAAACACACACCCGGATGGTTTGTAAAGGTTAAGTTTTTGTGGTGTACGAATGAAGCTTATTACTGGGGATTAAAGACCTGGCATAAAACACGCCAGATTCATAGTGACGGGACAAACCATGGGCTTACAGAATGCGTTATAACTTCATTGCGCGCAATTGACCAGCGGCCAAATAGCGACAAGGTAGAGAGTCCTGGACGGAAGTGTGAACGCGTGAGTCGAGTCCACTGCAGCTGGCGTAGCCAGGCCAAAGAGAAGGGAAGACAGGAAGCCCATTTCTACCTGGGGTGCGTGGTGGGCATTATAGGACATGACGTGCAAGAGCTTGAGATTGACTTAACGACTGATGACATCAGCAGTAATAAAAATTGCCACCACTGTGATCCTAATGCAAAGTGTCACTTTCACAAGGGGCAGGTAACGGATGAACACGCCCTATGTGATGAATGGGTTGTATTAAGATTTCCATATAATACCCAACGATGGAAAGCCATCCTCAGTTTGGAAGGCTTTTCTCAGCGAGAGTACGACAACTTCCCTCATTTACAAGCCAAACGGTGGCGTAAGGTGGCTCAGATCGATTACACTTTCGAGAATAACTTCTACTGGCATTGGTGGCAGCACCATGGTGGGAAGAAGGTGATCTGGGCGACACCAGCGATTACAGCCGGAATAGATTACCAAGACGCGAGATGCTCACCAACGGTAGCTCATCACAATAAGTTCTCATGTCACTATAATTGGTGGTTTCTTAAGTCCTCAGAGTATGACCAGCACTTCCCTGACACACAGGTAGTTGTTTTAGACTGGGATTGTGCGCCGAAGGGGCAGCTAGAACCGGTTACGTGTATCCAAACGTTTGACCGAATACACGCAGATGTATTCAGACAGTGGAACGCCCAGTGGTGGTGGGATTTCCGATTACAAACGAATTGGTATTGGCACTTCGACAACGATGAATGTCGGCCGTCGCTCCCATGGCGGGATTGTAATCTCCTTCAGTTTTATCTCCAGCTTGGTTGTAAGATCGCATTTCAACGTACCTGGCAGAAAACCCACCATCATCGCGTGTGCCCTGTGCCCGTCAGTTGGTTTCGCTGTATTGACAATCAAATCATTTTCCACTTCATACATGAATCTCAATGGACCCAGTCTCGTCAGAGAACTGTCTACGACCCCGTTCCTCACCGTCATTGGCACTGTTGGCCCTGGGCCATTTGGCGAAATGGACGCTACCATTACAAGTACAGCACACCTTGA'
def get_problem(dna_seq: str, gc_min: float = 0.40, gc_max: float = 0.60, bp_window: int = 60, species: str = 'e_coli'):
# Force reproducibility
random.seed(SEED)
numpy.random.seed(SEED)
return DnaOptimizationProblem(
sequence=dna_seq,
constraints=[
EnforceGCContent(mini=gc_min, maxi=gc_max, window=bp_window),
EnforceTranslation()
],
objectives=[
CodonOptimize(species=species)
],
logger=None
)
# This passes!
def test_reproducibility():
dna_seqs = []
for i in range(10):
problem = get_problem(DNA_SEQ)
problem.resolve_constraints()
problem.optimize()
dna_seqs.append(problem.sequence)
assert len(set(dna_seqs)) == 1
# This fails (optm_dna_seq is copy-pasted from a previous run)
def test_simple_optimization():
optm_dna_seq = 'ATGACCTTCACCGAGAAGGTGTTTAAGCAGGTGTGCGAATGGTGCATTGGCTTTTATCGCAAATATGCGGTGCCGAGCGAAGAATGCGATTTCTTCCCGCGCTGGATTTGGGGCAACAAACGCAACATTCTGACCTTTACCTTTAGCACCCCAATTACCTTACTGACCGCGCCGAGCCGTTTACCGGAAGAAGGCTGCCAGATTACCATTTGCGGCCAACAGCAAGATTGCTTTGCGGTGGAACGCCATACCCTGCCGCAGAGCACCAAAATTAACCGTTGGTGCGGCCGTTGTGAACATCATAACCAGATTTATATTGAACCGACCGAACCGATTGGCGAACAGCATGCGGATCATTTTGATTGCATTGAGAACAACGCGATTGATGACTACCCGCCGAGCAACCTGCCGTGCTTACAGATTCAAAGCGTTGAAAATGTGGGCCATCGCGGCTGGTGCCAGGTTCTGGTTAGATGGTGGGTGGATGCATATAAAGATGGCGATCCGGTGGCGTGGGGTGATTTTCGCGGTCTGGTTAAATCTGCGGAATGGACTGTGCGCCCACGCAGCAAAGCGGGTGATGAAACCCATCGTGAATGGAAACATCCGAGCCTGGGCTGCCCGCATGGCAACATTATCGTGCTGGCGGCACCAGTTAGCTATCATCTGAACATTGATTGCACCGTGAACGTGCATAACCTGCAGAGCGCGGAATATACCGATCCGAGCTATGCGTTTACCCATGTGTGCATTACCGCGTTTTTTGCGATTCAGGCGATTCCGCATGCGCGCATTAGCTTTAGCAAAGAGGCGAACCGCTATTATTATTGGATCATTACCTGCCACTGCAGCAGCTATTGCTGCCTGCGCGAATGCTGGATTAGCACCCGCAAAGATAAATGGCATAGCCATAAATACAAGGACAGCCCGCTGTTTGTGGAATTTGGCTGCAAAGAAACCTGGTGCCTGGAACATTATGATAAAGAAGTGAGCCAGACCAAACGCTATGGCTGCAAACTGGTGGGCGATCGCTGCGGTTTTCATGAAGCGCTGTTTTTTGAAACCAAATGGCGCAACACCGAACCGCAGCGCCTGTATGCGTATTGGAGCATTAAAGTGGATTGCTGCAAAATCTACAGCGAGAACAACACCAAAAACGTGGATGAGATCGAATGGGCGACCCGCGAATGGATTTGGCAGAAAAACATTAAAATTCGCAGCAACGTGCGTTTTCATTATCAGCCGGACCAGGAACATCCGGATGTGTGGCATGGCCTGTGCGAGCAGCTGGTGACTATTGATAAAAACCATGAACTGCCGGATGGCCAGAAAATTATTTGCTGGGGCAAAGAATGGCATGATGCGATTTACCAGGATATCGCGCCGTTTTATGCGCTGGTGCTGGAATATCGCTGGTTTCAGTGGGGCAGATATGAATGCTGGAAAACCTGGGGCCTGGCGGGCTGTACTGATCTGAATTGGGAAGATCCGTTTCGTGGCACCCATCCGATTCTGCATAAAGGCCAGGGCTATATTAAAGATGCGGCGCTGCTGCGCTGGTTTCGCCATCTGCGTAGCTCTTTTTTTGCGGGTGTTAAAGCGAAAGATGAAGTGGGCTATAACGGCTGCCAGTGGGATTACCGCGATCTGCGCGAAGGCTTCGATTGCTGGCAGTATACCTGCCGCAACTTTCTGAGCCCGAACCAGGTGAACGATTGCTGGTATAAAGGCGCGTGTTGGAAACATACCCCGGGCTGGTTTGTGAAAGTGAAATTTCTGTGGTGCACCAACGAAGCGTATTATTGGGGCCTGAAAACCTGGCATAAAACCCGCCAGATTCATAGCGATGGCACCAACCATGGCCTGACCGAATGCGTGATTACCAGCTTGCGCGCAATTGATCAGCGCCCGAACAGCGATAAAGTGGAAAGCCCAGGTAGAAAATGCGAACGCGTGAGTCGTGTGCATTGCAGCTGGCGTAGCCAGGCGAAAGAAAAAGGCCGTCAGGAAGCGCATTTTTATCTGGGCTGCGTGGTGGGCATTATTGGCCATGACGTGCAGGAACTGGAAATTGACCTGACCACCGATGATATTAGCAGCAACAAAAACTGCCATCATTGCGATCCGAACGCGAAATGCCATTTTCATAAAGGCCAGGTGACCGATGAACATGCGCTGTGCGATGAATGGGTGGTGCTGCGCTTTCCGTATAACACCCAGCGCTGGAAAGCGATTCTGAGCCTGGAAGGCTTTAGCCAGCGCGAATATGATAACTTTCCGCATCTGCAGGCGAAACGCTGGCGCAAAGTGGCGCAGATTGATTATACCTTTGAAAACAACTTTTATTGGCATTGGTGGCAGCACCATGGTGGCAAGAAAGTGATTTGGGCGACCCCGGCGATTACTGCGGGTATTGATTATCAGGATGCGCGCTGCTCTCCAACTGTGGCGCATCACAACAAATTCAGCTGCCATTACAACTGGTGGTTTCTGAAAAGCAGCGAATATGATCAGCATTTCCCGGATACCCAGGTGGTGGTTTTAGATTGGGATTGCGCGCCGAAAGGCCAGCTGGAACCGGTGACCTGCATTCAGACCTTTGATCGCATTCATGCGGATGTGTTTCGCCAGTGGAACGCGCAGTGGTGGTGGGATTTTCGCCTGCAGACCAACTGGTATTGGCATTTTGATAACGATGAATGCCGCCCGAGCCTGCCGTGGCGCGATTGCAACCTGTTACAGTTTTATCTGCAGCTGGGTTGCAAAATTGCGTTTCAGCGTACCTGGCAGAAAACCCATCATCATCGCGTGTGCCCGGTGCCGGTTAGCTGGTTTCGCTGCATTGACAACCAGATTATTTTCCACTTCATTCACGAGAGCCAGTGGACCCAGTCTCGTCAGCGTACTGTGTATGATCCGGTTCCACATCGCCATTGGCATTGTTGGCCGTGGGCGATTTGGCGTAATGGCCGCTATCATTATAAATATAGCACCCCGTAA'
problem = get_problem(DNA_SEQ)
problem.resolve_constraints()
problem.optimize()
assert problem.sequence == optm_dna_seq
If you run test_reproducibility
you get the same sequence for all iterations of the loop. This test passes.
However, I tried running test_simple_optimization
once, recording the output and storing it into optm_dna_seq
, then running the test again. This almost always fails for large sequences.
I can't tell where the randomization is happening. Is there a different random number generator that I should set the seed for?
Sorry if this is a bit convoluted, couldn't think how else to explain it.
Seems to be an error with dna_features_viewer or something. Can't import module.
TypeError Traceback (most recent call last)
in
----> 1 from goldenhinges import OverhangsSelector, OverhangsSelector
2 import dnachisel
3 import geneblocks
4 from dnachisel import (EnforceTranslation, Specification, SpecEvaluation,
5 reverse_translate, random_protein_sequence, Location,
5 frames
/usr/local/lib/python3.7/dist-packages/dna_features_viewer/biotools.py in
36
37 aa_short_to_long_form_dict = {
---> 38 _aa1: _aa3[0] + _aa3[1:].lower() for (_aa1, _aa3) in zip(aa1 + "", aa3 + [""])
39 }
40
TypeError: can only concatenate tuple (not "str") to tuple
I'm trying to use DnaChisel to perform codon optimization of sequences that will be accepted by Twist for DNA synthesis. However I get an error "high repeat density" and that "more than 15% of your sequence is composed of repeats 9bp or longer". Is there any work-around for this?
def codon_optimize(my_protein):
"""
Use dnachisel to perform contrained codon optimization
See: https://www.twistbioscience.com/faq/gene-synthesis/codon-optimization-what-steps-are-taken-maintain-wild-type-protein-expression
"""
my_sequence = reverse_translate(my_protein, table='Standard')
problem = DnaOptimizationProblem(
sequence=my_sequence,
constraints=[
AvoidPattern("GGAGG"), # ribosome binding sites
AvoidPattern("TAAGGAG"), # ribosome binding sites
AvoidPattern("TTTTT"), # terminator
AvoidPattern("AAAAA"), # terminator
UniquifyAllKmers(12),
AvoidPattern(HomopolymerPattern("A", 10)), # homopolymer runs of 10 or more bases
AvoidPattern(HomopolymerPattern("C", 10)), # homopolymer runs of 10 or more bases
AvoidPattern(HomopolymerPattern("G", 10)), # homopolymer runs of 10 or more bases
AvoidPattern(HomopolymerPattern("T", 10)), # homopolymer runs of 10 or more bases
AvoidHairpins(location=[0, 48]), # strong hairpins (∆G <-8) in the first 48 base pairs
EnforceGCContent(mini=0.35, maxi=0.65, window=50), # local GC windows (50 bp) of less than 35% or more than 65%
EnforceGCContent(mini=0.25, maxi=0.65, window=len(my_sequence)), # global GC% of less than 25% or more than 65%
EnforceTranslation(start_codon="keep"), # keep ATG start codon
AvoidRareCodons(0.08, 'h_sapiens') # rare codons (those with a codon frequency of <8%)
],
objectives=[
CodonOptimize(species='h_sapiens', method='use_best_codon')
],
logger=None
)
problem.resolve_constraints()
problem.optimize()
return problem.sequence
In order to test several constructs which equally satisfy a problem's constraints and select for the optimal/functional ones, it may be desirable to replace a subsequence with degenerate oligonucleotides.
Given the current implementation of the MutationSpace class, it should be fairly easy to provide a toDegenerateSeq
method translating the list of choices into such encoding.
I can help with the implementation if you are interested in adding this feature, although there's only partial documentation for the MutationSpace and MutationChoice classes. Feel free to close the issue if you don't think it's worth it.
Using the CodonOptimize 'harmonized' functionality, I do not get a codon distribution similar to the one I specified. It seems that the second 2/3s of a sequence are never optimized.
A minimal example:
from dnachisel import *
problem = DnaOptimizationProblem(
sequence='GACGACGACAAAAAAAAAAAAAAAAAA',
constraints=[EnforceTranslation()],
objectives=[CodonOptimize(species='b_subtilis', mode='harmonized')]
)
problem.resolve_constraints()
problem.optimize()
print('SEQUENCE:', problem.sequence)
frequencies, positions = biotools.biotools.codons_frequencies_and_positions(problem.sequence)
print(*frequencies.items(), sep='\n')
This sequence encodes DDDKKKKKK and gives as output:
SEQUENCE: GACGATGATAAAAAAAAAAAAAAAAAA
('V', {'total': 0, 'GTA': 0.0, 'GTT': 0.0, 'GTC': 0.0, 'GTG': 0.0})
('L', {'CTT': 0.0, 'TTA': 0.0, 'CTC': 0.0, 'CTG': 0.0, 'CTA': 0.0, 'total': 0, 'TTG': 0.0})
('Q', {'CAG': 0.0, 'total': 0, 'CAA': 0.0})
('G', {'total': 0, 'GGC': 0.0, 'GGA': 0.0, 'GGT': 0.0, 'GGG': 0.0})
('P', {'CCA': 0.0, 'total': 0, 'CCT': 0.0, 'CCG': 0.0, 'CCC': 0.0})
('A', {'total': 0, 'GCG': 0.0, 'GCC': 0.0, 'GCT': 0.0, 'GCA': 0.0})
('T', {'total': 0, 'ACT': 0.0, 'ACA': 0.0, 'ACG': 0.0, 'ACC': 0.0})
('C', {'total': 0, 'TGT': 0.0, 'TGC': 0.0})
('S', {'AGC': 0.0, 'AGT': 0.0, 'TCT': 0.0, 'total': 0, 'TCG': 0.0, 'TCC': 0.0, 'TCA': 0.0})
('N', {'AAC': 0.0, 'total': 0, 'AAT': 0.0})
('H', {'CAC': 0.0, 'total': 0, 'CAT': 0.0})
('I', {'total': 0, 'ATA': 0.0, 'ATT': 0.0, 'ATC': 0.0})
('D', {'GAT': 0.6666666666666666, 'total': 3, 'GAC': 0.3333333333333333})
('M', {'total': 0, 'ATG': 0.0})
('F', {'TTT': 0.0, 'total': 0, 'TTC': 0.0})
('R', {'CGG': 0.0, 'AGG': 0.0, 'CGC': 0.0, 'CGA': 0.0, 'total': 0, 'AGA': 0.0, 'CGT': 0.0})
('K', {'total': 6, 'AAG': 0.0, 'AAA': 1.0})
('E', {'total': 0, 'GAG': 0.0, 'GAA': 0.0})
('*', {'total': 0, 'TAA': 0.0, 'TGA': 0.0, 'TAG': 0.0})
('Y', {'total': 0, 'TAC': 0.0, 'TAT': 0.0})
('W', {'total': 0, 'TGG': 0.0})
It does well for Aspartic Acid (D), as this has a GAT 0.64 / GAC 0.36 ratio, but for K with a AAA 0.7 / AAG 0.3 ratio, it does nothing at all. This is position dependent, as a sequence encoding KKKDDDDDD does the opposite.
While trying to debug, I printed all variants that were used during exhaustive search for this sequence (line 454 of DnaOptimizationProblem.py) , which are:
GACGACGACAAAAAAAAAAAAAAAAAA
GATGACGACAAAAAAAAAAAAAAAAAA
GATGACGACAAAAAAAAAAAAAAAAAA
GATGACGATAAAAAAAAAAAAAAAAAA
GATGACGATAAAAAAAAAAAAAAAAAA
GATGATGATAAAAAAAAAAAAAAAAAA
This again points to the fact that the later positions are never included in the variants to be analysed.
I'm using the latest version, dnachisel 1.1.
It would be great if you could find out where the problem is and come up with a quick fix, so I can use your library for my thesis. Thank you!
Hello,
when using the following command :
>>> dnachisel annotated_record.gb optimized_record.gb
I get the output below but although it says "Result written to optimized_record.gb" no file is created.
Before optimization:
===> FAILURE: 1 constraints evaluations failed
✔PASS ┍ EnforceTranslation[0-507(+)]
│ Enforced by nucleotides restrictions
FAIL ┍ AvoidPattern0-507
│ Failed. Pattern found at positions [493-499(+)]===> TOTAL OBJECTIVES SCORE: -42.08
-42.08 ┍ MaximizeCAI0-507(+)
│ Codon opt. on window 0-507(+) scored -4.21E+01Resolving Constraints...
Optimizing...After optimization:
===> SUCCESS - all constraints evaluations pass
✔PASS ┍ EnforceTranslation[0-507(+)]
│ Enforced by nucleotides restrictions
✔PASS ┍ AvoidPattern0-507
│ Passed. Pattern not found !===> TOTAL OBJECTIVES SCORE: 0
✔ 0 ┍ MaximizeCAI0-507(+)
│ Codon opt. on window 0-507(+) scored -0.00E+00Result written to optimized_record.gb
The output files are created correctly if I use the other commands :
>>> dnachisel annotated_record.gb optimization_report/
and
>>> dnachisel annotated_record.gb optimization_report.zip
The original genbank file is also processed correctly using the web app.
Hello, a long time ago I think DNA Chisel had an AvoidRepeat spec, but I can't seem to find it anymore.
Is there a way to avoid repeats in the whole nucleotide sequence (e.g. avoid repeats of length 10 - like GGGGGGGGGG or any base - in the entire sequence). I think UniquifyAllKmers is the generalized version of this? But I'm confused as to what values to use to match my use case (sorry, non-biologist here 😳).
Is it just k = 10
, leaving reference = None
and location = None
?
Hello! The library looks great. I like the multi-layered constraints. They would work perfectly for a project I am working on right now.
I am, however, having some issues with pip3 install
'ing the package.
pip install
totally fails because of some weazyprint version issue (which is fine for me, just sharing)pip3 install
doesn't make the module global. I can't import it elsewhere. I do not know why right nowpip3 install -e .
'ing it works for adding it as a global dependency, but I still get warnings about versions when importing:If I get time and find a fix I'll submit a PR. My first approach would be to pin the dependencies (particularly around weazy) in setup.py
(move to requirements.txt).
In the Readme the install instructions make the plot package installation look optional, but then in setup.py
find_packages
is used, which I think is installing weazyprint, etc) even though I have no interest in the visual reports (or the pdf_report tbh). I looks like dna_chisel
has close to 35 dependencies right now which is a bit excessive. The rest of my application has ~5. And things like defusedxml
, webencodings
, CairoSVG
are all being added
Hello!
I have ran dnachisel.reports.write_optimization_report(problem=problem, target=".")
and the report consists of every component except for the gene graphs/ geneblocks.
The only way I have been able to see the geneblocks image is by running separate code, shown below, and it is saved as an extra PNG file.
seq_1 = dnachisel.sequence_to_biopython_record(problem.sequence_before) seq_2 = problem.to_record() diff_blocks = DiffBlocks.from_sequences(seq_1, seq_2) ax1, ax2 = diff_blocks.merged().plot(figure_width=8)
But I would really like to see them in the optimization report. Any help would be appreciated. Thanks!
By default, dnachisel gives a 'best' sequence with score close to 0.
I can get a 'worst' sequence by new_score < score
.
If i want a 'medium' sequence , i can set a target score,
and set the if condition to abs(new_score - target) < abs(score - target)
.
Since scores varies from objectives, I have to get worset score first to determint the target score.
Is there any solution to generate 3 sequences just using optimize() once?
Hi,
Does DnaChisel support circular sequences? I may be missing something, but judging from this line and the behavior on a quick test, I doubt it:
DnaChisel/dnachisel/Location.py
Lines 72 to 74 in 5a8cce7
Thank you
Simone
hey-- great module! I've been using to great effect for awhile now. one feature that'd really help out is repeat avoidance to be able to synthesize DNA at providers with fewer issues. is this possible in the package currently, or would it need to be built? thanks for developing and sharing this software!
I was trying to understand the example included in ".//examples/non_unique_kmers_minimization.py" and they mention two classes "ObjectiveEvaluation" and Objective. I couldnt find these anywhere in the source, can someone point me in the right direction.
Alternatively I am looking to write DNAOptimization objective that will design protein coding repeats without using the same codons for each repeat in an expression plasmid.
Any help in this will be greatly appreciated.
This line will triggered the error.
However, problem.max_random_iters = 10000
will work.
Twist Bioscience has a GC variance constraint. Quote from their website:
Avoid extreme differences in GC content within a gene (i.e. the difference in GC content between the highest and lowest 50 bp stretch should be no greater than 52%).
If a sequence with such abnormality is loaded in their interface the following error occurs:
We located two 50 bp windows with a GC content difference of over 52%. Redesigning these regions to level out the peaks and troughs will improve the likelihood of success. Please note that any changes made to any part of the sequence will trigger rescoring and may highlight new problematic regions. Therefore, if possible, we suggest optimizing the whole sequence.
Using DNA Chisel + DNA Weaver without such constraint might lead to the construction of non-valid sequences.
I noticed some objectives such as "dnachisel.builtin_specifications.AllowPrimer" do not have any boost parameter. How are they balanced with other objectives people might specify concurrently? Many objective options have the boost parameters, but not all?
Best,
Josh
Hello! Thanks for this great tool. I am exploring using it on a project.
Is there an existing way to get dnachisel to return a few codon optimized sequences, maybe some that are slightly sub-optimal so I can have more diversity when ordering codon optimized sequences for DNA synthesis?
Best,
Josh
I ran into an issue when using codon optimization using mode="harmonized". I get a division by zero error that occurs on line 522 of biotools.py:
data[codon] = 1.0 * value / total
My gene did not have an amino acid ("H") and so that value for "total" for this amino acid was 0, causing the divide by zero error.
Suggested fix:
line 519:
total = max(data['total'], 1)
since there are tools like mathtools,
renaming it and put Location to tools dir will make code more organized.
It crashes due to:
UnboundLocalError: local variable 'L' referenced before assignment
Can be fixed by moving:
To above the if statement.
I noticed that the AvoidHairpins specification does not search for hairpins within the last window. Intuitively, if I set my window size to be the same as my sequence length, the algorithm should find every hairpin in the sequence. In reality, though, this algorithm will not find any hairpins.
Here is some updated pseudocode that I briefly tested, it seems to be working:
for i in range(len(sequence) - stem_size:
word = sequence[i : i + stem_size]
if len(sequence) - i < window:
window = len(sequence) - i
if stem_size > window / 2:
break
rest = reverse[-(i + window) : -(i + stem_size)]
if word in rest:
// follow existing algorithm
I'm trying to codon optimize a nucleic acid sequence but DnaChisel is returning sequences with different amino acids than the original sequence. Any idea why I am getting this result? I'm looking to codon optimize a nucleic acid sequence for humans.
Here is the relevant part of my code:
some_sequence = "LONG_SEQ(See bottom)"
upper_sequence = some_sequence.upper()
seq_len = len(some_sequence)
problem = DnaOptimizationProblem(
sequence=upper_sequence,
objectives=[CodonOptimize(species='h_sapiens', method="use_best_codon")]
)
problem.optimize()
new sequence = problem.sequence
Here is an example of what I am getting back:
original_aa = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFT...
new_aa = "MFVFLVFFPLVYRQWVKLTTRTHFPPAYTKYFT...
Here is the sequence
some_sequence = "atgtttgtttttcttgttttattgccactagtctctagtcagtgtgttaatcttacaaccagaactcaattaccccctgcatacactaattctttcacacgtggtgtttattaccctgacaaagttttcagatcctcagttttacattcaactcaggacttgttcttacctttcttttccaatgttacttggttccatgctatacatgtctctgggaccaatggtactaagaggtttgataaccctgtcctaccatttaatgatggtgtttattttgcttccactgagaagtctaacataataagaggctggatttttggtactactttagattcgaagacccagtccctacttattgttaataacgctactaatgttgttattaaagtctgtgaatttcaattttgtaatgatccatttttgggtgtttattaccacaaaaacaacaaaagttggatggaaagtgagttcagagtttattctagtgcgaataattgcacttttgaatatgtctctcagccttttcttatggaccttgaaggaaaacagggtaatttcaaaaatcttagggaatttgtgtttaagaatattgatggttattttaaaatatattctaagcacacgcctattaatttagtgcgtgatctccctcagggtttttcggctttagaaccattggtagatttgccaataggtattaacatcactaggtttcaaactttacttgctttacatagaagttatttgactcctggtgattcttcttcaggttggacagctggtgctgcagcttattatgtgggttatcttcaacctaggacttttctattaaaatataatgaaaatggaaccattacagatgctgtagactgtgcacttgaccctctctcagaaacaaagtgtacgttgaaatccttcactgtagaaaaaggaatctatcaaacttctaactttagagtccaaccaacagaatctattgttagatttcctaatattacaaacttgtgcccttttggtgaagtttttaacgccaccagatttgcatctgtttatgcttggaacaggaagagaatcagcaactgtgttgctgattattctgtcctatataattccgcatcattttccacttttaagtgttatggagtgtctcctactaaattaaatgatctctgctttactaatgtctatgcagattcatttgtaattagaggtgatgaagtcagacaaatcgctccagggcaaactggaaagattgctgattataattataaattaccagatgattttacaggctgcgttatagcttggaattctaacaatcttgattctaaggttggtggtaattataattacctgtatagattgtttaggaagtctaatctcaaaccttttgagagagatatttcaactgaaatctatcaggccggtagcacaccttgtaatggtgttgaaggttttaattgttactttcctttacaatcatatggtttccaacccactaatggtgttggttaccaaccatacagagtagtagtactttcttttgaacttctacatgcaccagcaactgtttgtggacctaaaaagtctactaatttggttaaaaacaaatgtgtcaatttcaacttcaatggtttaacaggcacaggtgttcttactgagtctaacaaaaagtttctgcctttccaacaatttggcagagacattgctgacactactgatgctgtccgtgatccacagacacttgagattcttgacattacaccatgttcttttggtggtgtcagtgttataacaccaggaacaaatacttctaaccaggttgctgttctttatcaggatgttaactgcacagaagtccctgttgctattcatgcagatcaacttactcctacttggcgtgtttattctacaggttctaatgtttttcaaacacgtgcaggctgtttaataggggctgaacatgtcaacaactcatatgagtgtgacatacccattggtgcaggtatatgcgctagttatcagactcagactaattctcctcggcgggcacgtagtgtagctagtcaatccatcattgcctacactatgtcacttggtgcagaaaattcagttgcttactctaataactctattgccatacccacaaattttactattagtgttaccacagaaattctaccagtgtctatgaccaagacatcagtagattgtacaatgtacatttgtggtgattcaactgaatgcagcaatcttttgttgcaatatggcagtttttgtacacaattaaaccgtgctttaactggaatagctgttgaacaagacaaaaacacccaagaagtttttgcacaagtcaaacaaatttacaaaacaccaccaattaaagattttggtggttttaatttttcacaaatattaccagatccatcaaaaccaagcaagaggtcatttattgaagatctacttttcaacaaagtgacacttgcagatgctggcttcatcaaacaatatggtgattgccttggtgatattgctgctagagacctcatttgtgcacaaaagtttaacggccttactgttttgccacctttgctcacagatgaaatgattgctcaatacacttctgcactgttagcgggtacaatcacttctggttggacctttggtgcaggtgctgcattacaaataccatttgctatgcaaatggcttataggtttaatggtattggagttacacagaatgttctctatgagaaccaaaaattgattgccaaccaatttaatagtgctattggcaaaattcaagactcactttcttccacagcaagtgcacttggaaaacttcaagatgtggtcaaccaaaatgcacaagctttaaacacgcttgttaaacaacttagctccaattttggtgcaatttcaagtgttttaaatgatatcctttcacgtcttgacaaagttgaggctgaagtgcaaattgataggttgatcacaggcagacttcaaagtttgcagacatatgtgactcaacaattaattagagctgcagaaatcagagcttctgctaatcttgctgctactaaaatgtcagagtgtgtacttggacaatcaaaaagagttgatttttgtggaaagggctatcatcttatgtccttccctcagtcagcacctcatggtgtagtcttcttgcatgtgacttatgtccctgcacaagaaaagaacttcacaactgctcctgccatttgtcatgatggaaaagcacactttcctcgtgaaggtgtctttgtttcaaatggcacacactggtttgtaacacaaaggaatttttatgaaccacaaatcattactacagacaacacatttgtgtctggtaactgtgatgttgtaataggaattgtcaacaacacagtttatgatcctttgcaacctgaattagactcattcaaggaggagttagataaatattttaagaatcatacatcaccagatgttgatttaggtgacatctctggcattaatgcttcagttgtaaacattcaaaaagaaattgaccgcctcaatgaggttgccaagaatttaaatgaatctctcatcgatctccaagaacttggaaagtatgagcagtatataaaatggccatggtacatttggctaggttttatagctggcttgattgccatagtaatggtgacaattatgctttgctgtatgaccagttgctgtagttgtctcaagggctgttgttcttgtggatcctgctgcaaatttgatgaagacgactctgagccagtgctcaaaggagtcaaattacattacacataa"
Even though the AvoidHairpins
specification is a good compromise between speed and quality of the results, there are few tools available for optimizing RNA structure accurately (none actually usable programmatically, as far as I know), and implementing it in DnaChisel would make it even more unique than it already is :)
Using an external package like Vienna, it shouldn't be too hard to implement it. d-tailor is an example of how to integrate it into a codon optimization framework.
Feel free to close this issue if you are not interested in supporting this feature!
Thanks
Simone
Hi,
I'm using DNAchisel to encode some difficult to synthesize proteins. I tried removing all kmer repeats, and constraints_text_summary() says pass, but I'm able to find a repeat. What am I not understanding about how to codon optimize?
for example, in the below code, there's a 26nt repeat that doesn't get removed.
thanks for helping me out!
-Lucas
output:
repeated kmers: ('CCGGATCATATGAAACAGCATGACTT', 2) 26
[222, 936]
===> SUCCESS - all constraints evaluations pass
✔PASS ┍ EnforceTranslation[0-2856]
│ Enforced by nucleotides restrictions
✔PASS ┍ AvoidPattern[0-2856](pattern:BsaI(GGTCTC))
│ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:BbsI(GAAGAC))
│ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:XhoI(CTCGAG))
│ Passed. Pattern not found !
✔PASS ┍ AvoidPattern[0-2856](pattern:NcoI(CCATGG))
│ Passed. Pattern not found !
✔PASS ┍ AvoidRareCodons[0-2856]
│ Enforced by nucleotides restrictions
✔PASS ┍ EnforceGCContent[0-2856](mini:0.38, maxi:0.62, window:100)
│ Passed !
✔PASS ┍ AvoidPattern[0-2856](pattern:2x9mer)
│ Passed. Pattern not found !
===> TOTAL OBJECTIVES SCORE: -33.76
-33.76 ┍ MatchTargetCodonUsage[0-2856](e_coli)
│ Codon opt. on window 0-2856 scored -3.38E+01
make sure the encoding is correct: True
code:
from pathlib import Path
import pandas as pd
from dnachisel import (
DnaOptimizationProblem,
translate,
reverse_translate,
CodonOptimize,
EnforceTranslation,
AvoidPattern,
EnforceGCContent,
RepeatedKmerPattern,
)
from dnachisel import *
import re
from Bio.Restriction import *
from Bio.Seq import Seq
from collections import defaultdict
#### functions to find longest repeated substring ###
def getsubs(loc, s):
substr = s[loc:]
i = -1
while(substr):
yield substr
substr = s[loc:i]
i -= 1
def longestRepetitiveSubstring(r, minocc=3):
occ = defaultdict(int)
# tally all occurrences of all substrings
for i in range(len(r)):
for sub in getsubs(i,r):
occ[sub] += 1
# filter out all substrings with fewer than minocc occurrences
occ_minocc = [k for k,v in occ.items() if v >= minocc]
if occ_minocc:
maxkey = max(occ_minocc, key=len)
return maxkey, occ[maxkey]
else:
raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))
#### encode 4x sfGFP ###
ntseq = 'ATGCGTAAAGGCGAAGAGCTGTTCACTGGTGTCGTCCCTATTCTGGTGGAACTGGATGGTGATGTCAACGGTCATAAGTTTTCCGTGCGTGGCGAGGGTGAAGGTGACGCAACTAATGGTAAACTGACGCTGAAGTTCATCTGTACTACTGGTAAACTGCCGGTACCTTGGCCGACTCTGGTAACGACGCTGACTTATGGTGTTCAGTGCTTTGCTCGTTATCCGGACCATATGAAGCAGCATGACTTCTTCAAGTCCGCCATGCCGGAAGGCTATGTGCAGGAACGCACGATTTCCTTTAAGGATGACGGCACGTACAAAACGCGTGCGGAAGTGAAATTTGAAGGCGATACCCTGGTAAACCGCATTGAGCTGAAAGGCATTGACTTTAAAGAAGACGGCAATATCCTGGGCCATAAGCTGGAATACAATTTTAACAGCCACAATGTTTACATCACCGCCGATAAACAAAAAAATGGCATTAAAGCGAATTTTAAAATTCGCCACAACGTGGAGGATGGCAGCGTGCAGCTGGCTGATCACTACCAGCAAAACACTCCAATCGGTGATGGTCCTGTTCTGCTGCCAGACAATCACTATCTGAGCACGCAAAGCGTTCTGTCTAAAGATCCGAACGAGAAACGCGATCATATGGTTCTGCTGGAGTTCGTAACCGCAGCGGGCATCACGCATGGTATGGATGAACTGTACAAA'
ntseq = ntseq + ntseq + ntseq + ntseq
constraints=[
EnforceTranslation(),
AvoidPattern("BsaI_site"),
AvoidPattern("BbsI_site"),
AvoidPattern("XhoI_site"),
AvoidPattern("NcoI_site"),
AvoidRareCodons(0.1, species='e_coli'),
EnforceGCContent(mini=0.38, maxi=0.62, window=100),
AvoidPattern(RepeatedKmerPattern(2, 9)),
]
problem = DnaOptimizationProblem(
sequence=reverse_translate(protein_sequence=translate(ntseq),table='Standard'),
constraints=constraints,
objectives=[CodonOptimize(species='e_coli', method='match_codon_usage')],
)
problem.max_random_iters = 2000000
problem.resolve_constraints()
problem.optimize()
new_encoding = problem.sequence
# find longest repeated substring
found_this = longestRepetitiveSubstring(new_encoding, minocc=2)
print("repeated kmers:" , found_this, len(found_this[0]))
print([match.start() for match in re.finditer(found_this[0], new_encoding)])
# how does DNAChisel think it did?
print(problem.constraints_text_summary())
print(problem.objectives_text_summary())
print('make sure the encoding is correct: ', translate(new_encoding) == translate(ntseq))
I am using Anaconda on a Windows PC. Everything is up to date on my side. I would really love to use this feature but I am not knowledgeable enough to write my own script.
Thanks for helping/fixing!!
I get the following error after executing the example file:
NoSolutionError Traceback (most recent call last)
C:\Users\Public\Documents\Wondershare\CreatorTemp/ipykernel_23256/3391708123.py in
24 )
25
---> 26 problem.resolve_constraints()
27 problem.max_random_iters = 20000
28 problem.to_record("sequence_without_tf_binding_sites.gb")
~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraints(self, final_check, cst_filter)
358 except NoSolutionError as error:
359 self.logger(constraint__index=len(constraints))
--> 360 raise error
361 if final_check:
362 self.perform_final_constraints_check()
~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraints(self, final_check, cst_filter)
355 ):
356 try:
--> 357 self.resolve_constraint(constraint=constraint)
358 except NoSolutionError as error:
359 self.logger(constraint__index=len(constraints))
~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraint(self, constraint)
319 location__message="Cold exit",
320 )
--> 321 raise error
322 else:
323 continue
~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraint(self, constraint)
303 constraint.resolution_heuristic(local_problem)
304 else:
--> 305 local_problem.resolve_constraints_locally()
306 self._replace_sequence(local_problem.sequence)
307 break
~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraints_locally(self)
171 """
172 if self.mutation_space.space_size < self.randomization_threshold:
--> 173 self.resolve_constraints_by_exhaustive_search()
174 else:
175 self.resolve_constraints_by_random_mutations()
~\Anaconda3\lib\site-packages\dnachisel\DnaOptimizationProblem\mixins\ConstraintsSolverMixin.py in resolve_constraints_by_exhaustive_search(self)
78 raise NoSolutionError(
79 "Exhaustive search failed to satisfy all constraints.",
---> 80 problem=self,
81 )
82
NoSolutionError: While solving AvoidPattern0-3711 in 0-6:
Exhaustive search failed to satisfy all constraints.
Hi,
Is there a way to see all possible constraints/objectives? I don't seem to find them, nor on the git, nor in the documentation.
More specifically I'm interested if there is some kind of harmonization possible using this tool?
Package looks really nice by the way.
Best regards,
Pieter
Hello,
I have installed DnaChisel using the [reports] suffix to generate reports. When I run DnaChisel on a genbank file or on a random seq (readme example) I do get a report but in the "Report.pdf" file it tells me to "install Geneblocks to see a friendly plot of sequence edits here. (pip install geneblocks) ". However, if I do so when I run my script again the report is not generated and I get the following error :
Traceback (most recent call last):
File "test.py", line 26, in
problem.optimize_with_report(target="report.zip")
File "/usr/local/lib/python3.8/dist-packages/dnachisel/DnaOptimizationProblem/DnaOptimizationProblem.py", line 260, in optimize_with_report
data = write_optimization_report(
File "/usr/local/lib/python3.8/dist-packages/dnachisel/reports/optimization_reports.py", line 366, in write_optimization_report
diffs_ax = plot_optimization_changes(problem)
File "/usr/local/lib/python3.8/dist-packages/dnachisel/reports/optimization_reports.py", line 277, in plot_optimization_changes
diffs = DiffBlocks.from_sequences(sequence_before, sequence_after)
File "/usr/local/lib/python3.8/dist-packages/geneblocks/DiffBlocks/DiffBlocks.py", line 79, in from_sequences
common_blocks = CommonBlocks.from_sequences(
File "/usr/local/lib/python3.8/dist-packages/geneblocks/CommonBlocks/CommonBlocks.py", line 56, in from_sequences
homologies_dict = find_homologies_between_sequences(
File "/usr/local/lib/python3.8/dist-packages/geneblocks/CommonBlocks/commonblocks_tools.py", line 133, in find_homologies_between_sequences
proc = subprocess.Popen(
File "/usr/lib/python3.8/subprocess.py", line 854, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.8/subprocess.py", line 1702, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'blastn'
Hello,
I just found the DNA Chisel python library and it looks like it's exactly what I was looking for!
I want to codon optimize the sequence of some proteins for expression in plants, but I would like to use "custom" codon usage tables. In the help files I found that "You can also use a TaxID to refer to a species, e.g. species=1423 at which case the codon frequencies will be downloaded from the Kazusa codon usage database (assuming it isn’t down!)". Data from that webpage is vastly outdated (the last data they used to generate the codon usage tables is from 2007, and for some species there are many more sequences available that allow the generation of more reliable codon usage tables). Moreover, I have built codon usage tables for groups of related plant species and would like to test if they work.
My question is, is there any way of "loading" a custom codon usage table (I could prepare them in the same format as the Kazusa webpage). However, I am not proficient in Python and would not know even where to start...
Best,
here is my code:
from dnachisel import *
mEGFP_dna = 'ATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGCGCGGCGAGGGCGAGGGCGATGCCACCAACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTCCTTCAAGGACGACGGCACCTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTTCAACAGCCACAACGTCTATATCACGGCCGACAAGCAGAAGAACGGCATCAAGGCGAACTTCAAGATCCGCCACAACGTCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCAAGCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGTAA'
problem = DnaOptimizationProblem(
sequence = mEGFP_dna,
constraints= [
EnforceTranslation(),
# EnforceGCContent('15-90%/20bp'),
# EnforceGCContent('15-80%/50bp'),
# EnforceGCContent('28-76%/100bp'), # GC含量
AvoidHairpins(stem_size=6, hairpin_window=40,location=(0,40)),
AvoidPattern(r'(.)\1{8}'), # v
# AvoidRareCodons(species='284590',min_frequency = 0.1) # 避免出现usage 小于0.1 的稀有密码子
],
objectives=[
CodonOptimize(species='284590',method = "match_codon_usage"),
# EnforceGCContent(target = 0.4005),
# MinimizeMFE( temperature = 30,location=(0,40)), # RNAfold 计算mfe
],
)
print("BEFORE OPTIMIZATION:\n\n", problem.constraints_text_summary())
print(problem.objectives_text_summary())
problem.resolve_constraints()
problem.optimize()
print("AFTER OPTIMIZATION:\n\n", problem.constraints_text_summary())
print(problem.objectives_text_summary())
Hello,
I have a request that should enhance the utility of DNA Chisel for codon optimization.
When selecting Codon Usage Tables for species not included in DNA Chisel, Kazusa Codon tables are used. These codon tables are extremely outdated, generated from Genbank data only up to 2007. For some species, there is only one CDS used, and some codons have a 0 usage (see here for examples).
I suggest that more up-to-date codon tables should be included in DNA chisel. For example, CoCoPUTs is an up-to-date database with codon usage tables generated with each new Genbank or Refseq release every three months. I guess it should not be hard to retrieve the usage tables for each species (there is a huge file with the data, but I assume they could be retrieved "on the fly" too).
I am in no way associated with the people that run CoCoPUTs, but I think it is the most up-to-date resource for codon usage tables.
See #62 (comment)
It looks as though AvoidPattern doesn't preserve its strand information when being localized if the original strand information is 'from_location'
Example illustrating this:
ipdb> cst
# AvoidPattern[4-1624(-)](pattern:AACAAAT)
ipdb> location
# 9-10
ipdb> new_cst = cst.localized(location, problem=self)
ipdb> new_cst
# AvoidPattern[4-16](pattern:AACAAAT)
ipdb> new_cst.location
# 4-16
ipdb> new_cst.strand
# 'from_location'
The from_location
is strand argument is preserved, but the original strand in the location is lost, so the resulting constraint loses the strandedness. I ran into this today when my optimization kept failing and the violated constraint was being reported on the positive strand even though that constraint should only be on the negative.
This was in version 3.2.6 (though the current code looks unchanged)
To fix this in the short term, it looks like I can specify the strand directly with its own argument when creating the constraint (let me know if this is not good for some reason).
This tool is very useful, thank you for the continued development!
Importing the library doesn't seem to work in AWS Lambda.
Something like the following:
import dnachisel as dc
def lambda_handler(event, context):
return "test"
When run on lambda throws:
{
"errorMessage": "Unable to import module 'lambda_function': cannot import name '_aligners' from partially initialized module 'Bio.Align' (most likely due to a circular import) (/var/task/Bio/Align/__init__.py)",
"errorType": "Runtime.ImportModuleError",
"stackTrace": []
}
Everything works fine if I mock the lambda locally. I uploaded the lambda code and the dependencies as a zip file, but also tried uploading dnachisel as a lambda layer and got the same error.
Python version: 3.8
dnachisel version: 3.2.9
I opened a PR in Edinburgh-Genome-Foundry/codon-usage-tables#4, but I think you're using a custom version of this code? I couldn't find it here. I wonder if it would be possible to raise an error when the usage table for a taxonomy ID isn't available on the kazusa website.
Otherwise, DNA Chisel just spits out a KeyError (codon) somewhere downstream when it looks up the codon in an empty table.
like this:
└── Specification
├── builtin_specifications
│ ├── alignment_related
│ │ ├── AvoidBlast.py
│ │ └── AvoidMatches.py
│ ├── codons_related
│ ├── pattern_related
│ │ ├── AvoidChanges.py
│ │ ├── AvoidPattern.py
│ │ └── EnforcePattern.py
│ └── second_structure_related
│ └── AvoudHairpins.py
├── SpecEvaluation
├── Specification.py
└── SpecificationSet.py
Hi,
When using DnaChisel, I try to set more than one pattern for AvoidPattern() like this, but it seems only the first one can be identified. Is there any way to make AvoidPattern() accept multiple patterns at the same time?
problem = DnaOptimizationProblem(
sequence=some_sequence,
constraints=[
AvoidPattern("BsmBI_site", "BamHI_site"),
EnforceGCContent(mini=0.3, maxi=0.7, window=50),
EnforceTranslation()
],
objectives=[CodonOptimize(species='e_coli')]
)
Adding a simple DNA pattern TGAA
as an AvoidPattern
constraint causes an AttributeError
, the stack trace is listed below
"/usr/local/lib/python3.8/site-packages/dnachisel/DnaOptimizationProblem/mixins/ConstraintsSolverMixin.py", line 357, in resolve_constraints
self.resolve_constraint(constraint=constraint)
File "/usr/local/lib/python3.8/site-packages/dnachisel/DnaOptimizationProblem/mixins/ConstraintsSolverMixin.py", line 251, in resolve_constraint
evaluation = this_local_constraint.evaluate(self)
AttributeError: 'NoneType' object has no attribute 'evaluate'
It useful to design seq for experiment.
Besides, I think it can solve #27
If the Genbank file has ~harmonize_rca(e_coli => h_sapiens), then running
from dnachisel import DnaOptimizationProblem
problem = DnaOptimizationProblem.from_record("example_sequence.gb")
returns
/path/to/DnaChisel/dnachisel/Specification/FeatureRepresentationMixin.py in from_label(cls, label, location, specifications_dict)
127 "Unknown parameter %s for specification %s "
128 "at location %s"
--> 129 % (faulty_parameter, specification, kwargs["location"])
130 )
131
TypeError: Unknown parameter e_coli for specification harmonize_rca at location 1218-2308(+)
Same error for h_sapiens
or h_sapiens_9606
.
The function works with CodonOptimize(), e.g. CodonOptimize(species='h_sapiens', location=(0, 99), original_species='e_coli')
.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.