dnadesign's Issues

DnaDesign data APIs

This builds off of the previous thoughts in bebop/poly#42.

My newer idea is much more in line with LLMs doing a lot of the work that humans would previously do. Basically, how do we build a brain specifically for designing DNA? I.e., kinda like wikicrow, but specifically searching across existing bioinformatic platforms. If a question or query can be answered from scientific literature, that is great; but many questions are more effectively answered by coordinating with other data.

Each parser's data is unique. So, naturally, this means that there would be a different way to query each database.

How I imagine this working is building an SQLite database with sqlite-vss that can be distributed easily. The general steps are:

  1. Establish an SQLite schema with the specific data from the parser (see the sketch after this list).
  2. Create a workflow / pipeline to go from raw data, pulled from online, to an SQLite database with that particular schema.
  3. Create various ways to query / vector search the data, in a way that makes sense for that particular database's schema.
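For step 1, here is a minimal sketch (as a Go constant holding the DDL) of what a schema for a hypothetical "proteins" parser could look like; the table, columns, and embedding size are placeholders, and the vss0 virtual table and vss_search query assume the sqlite-vss extension is loaded.

// A rough sketch only: the names and the 384-dimension embedding are assumptions.
const proteinsSchema = `
CREATE TABLE IF NOT EXISTS proteins (
    id        INTEGER PRIMARY KEY,
    accession TEXT NOT NULL, -- identifier from the raw data
    record    JSONB NOT NULL -- full parsed record as JSON
);

-- step 3: vector search over embeddings of each record, queried like
--   SELECT rowid, distance FROM vss_proteins WHERE vss_search(embedding, ?) LIMIT 10;
CREATE VIRTUAL TABLE IF NOT EXISTS vss_proteins USING vss0(embedding(384));
`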

Then, this is going to take building a final layer on top of each individual service that is able to query information from any one that seems relevant. This will probably complement the lua execution environment, so that the lua code itself can have access to talk to specific APIs, as well as general LLMs gathering information before writing the code.

Change WriterTo to make a builder first

Here is an example written by ChatGPT:

func writeHD(w io.Writer, hd map[string]string) error {
    // Define the order of specific keys
    orderedKeys := []string{"VN", "SO", "GO", "SS"}

    // String builder to accumulate the output
    var sb strings.Builder

    // Write specific keys first if they exist
    for _, key := range orderedKeys {
        if value, exists := hd[key]; exists {
            sb.WriteString(fmt.Sprintf("%s:%s\t", key, value))
        }
    }

    // Write the remaining key-value pairs
    for key, value := range hd {
        // Skip if the key is one of the specific keys
        if key == "VN" || key == "SO" || key == "GO" || key == "SS" {
            continue
        }
        sb.WriteString(fmt.Sprintf("%s:%s\t", key, value))
    }

    // Write to the io.Writer
    _, err := w.Write([]byte(sb.String()))
    return err
}

Basically, accumulate the output in a builder before writing it out to the io.Writer. This has advantages: writing more bytes at once will likely be more efficient than writing very often, and it also decreases the number of tests needed.

This should be implemented across all parsers in a single PR.

murmur3 to crc32

We should switch from murmur3 to crc32 as the default hash in mash. crc32 is in the standard library, which is nice, plus it should be good enough: https://rigtorp.se/notes/hashing/

Generally, we should be using the Hash interface (https://pkg.go.dev/hash#Hash) and allowing users to pick the algorithm, but having crc32 as a basic default seems like a good idea.
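A minimal sketch of a pluggable hasher with crc32 as the default; the Hasher type and names below are hypothetical rather than existing dnadesign APIs, and it assumes a 32-bit hash is enough for mash (imports "hash" and "hash/crc32").

// Hasher constructs a fresh 32-bit hash; users can swap in murmur3.New32 or similar.
type Hasher func() hash.Hash32

// DefaultHasher uses the standard library's IEEE crc32.
var DefaultHasher Hasher = crc32.NewIEEE

// hashKmer hashes a single k-mer with whichever algorithm the caller picked.
func hashKmer(kmer string, newHash Hasher) uint32 {
    h := newHash()
    h.Write([]byte(kmer)) // hash.Hash's Write never returns an error
    return h.Sum32()
}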

Integrate kape

I wrote this a while ago: https://github.com/Koeng101/kape

And it is better than I remember. Once #9 is integrated and we have LLM integrations working (i.e., LLMs that can write and run example code) and plannotate-like features implemented, this plasmid editor could be a new kind of way to work with DNA. Essentially, instead of directly working with the files, all interaction is through a text box. The UI is ONLY for visualization purposes and being able to directly see the sequence itself.

Then, any feature that is added on top will actually live in the underlying DnaDesign codebase, while the visualization stays the same. This keeps the UI extremely simple (because let's be honest, who other than me wants a terminal-user-interface plasmid editor), while making the features that I want to add available to other users through the API. So this will take a while, but is definitely on my mind.

Todo

Various todos while in a very unstable period for DNA design.

  • Rewrite contribution docs
  • Rewrite top level package docs
  • Rewrite tutorials package
  • Reduce dependencies

Wikipedia paperqa

Something I've been thinking about is how to prove out the usefulness of the "synthetic biology oracle" I'd like to build with dnadesign. I think a good basic way would be to implement a vector database (using sqlite-vss) on Wikipedia data, which is only about 22GB and can be encoded in way less (though this is less relevant, because it'll be SQLite anyway).

There is a lot of basic knowledge on Wikipedia that could be useful to pull from and read. For example, if a user wanted to know how blue-white screening works, a quick search of Wikipedia would likely be better than a search of all scientific literature.

The idea would be to pull from the Wikipedia dumps and create a FAISS database against the full article text. These would then be available over an API. This data would be used to get information, similar to how paperqa works. Except, instead of running python code, LLMs would be instructed to write lua code which is then able to do the summarization and such. Essentially, we want to embed the ability of the LLMs to run different kinds of summarization pipelines themselves, i.e., recursion.

Here is some scratch lua code approximating what I'm thinking about:

question = "What does lacZ do?"
wikipedia_entries = search_wikipedia("lacZ function") -- vector search
uniprot_entries = search_uniprot("lacZ function") -- vector search
relevant_entries = select_sources(question, {wikipedia_entries, uniprot_entries}) -- have an LLM sort for most relevant papers
answers = answer_question(question, relevant_entries) -- have an LLM directly answer using relevant entries
full_summary = summarize(question, answers) -- summarize all information from all answers

print(full_summary) -- return summary, which can be used downstream

For example, the select_sources function would itself generate lua code based on the answers it finds, and that lua code could then initiate another search.

Easy performance optimizations for MegaMash

I'm currently working on a CUDA implementation for MegaMash, and as I'm re-implementing it I'm finding ways you could make it more efficient in Go. I'll throw them in this thread as I think of them.

Workflow management

I cannot create two workers for something like "filter data" because the channel can be closed while another worker is processing data.

In reality, each worker should be passed a WaitGroup, something like this:

func FilterData[Data DataTypes](wg *sync.WaitGroup, ctx context.Context, input <-chan Data, output chan<- Data, filter func(a Data) bool) error {
    defer wg.Done()
    
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()

        case data, ok := <-input:
            if !ok {
                return nil // Input channel closed
            }
            if filter(data) {
                output <- data
            }
        }
    }
}

// Usage
var wg sync.WaitGroup

for i := 0; i < numWorkers; i++ {
    wg.Add(1)
    go FilterData(&wg, ctx, input, output, filterFunc)
}

// Wait for all workers to finish
wg.Wait()
close(output)

Gotta think of a good way to do this.

Uniprot API

I would like to create a Uniprot API to start on the idea outlined in #15. Uniprot is one of the most useful biological databases out there, so this will be a useful exercise in building LLM-enabled biological databases.

The absolute key is going to be a reliable deployment environment, so that I can walk away and it keeps working. In that vein, I think the following are key steps:

  1. Make an SQLite schema for Uniprot. I'm thinking of this just being JSONB + entryID for now (see the sketch after these steps).
  2. Make a build process for creating the SQLite database.
  3. Figure out devops to make sure the build process can be reliably run every 8 weeks, uploading to dnadesign.bio/downloads/uniprot_sprot.db
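A minimal sketch of steps 1 and 2, assuming the schema really is just entryID + JSONB; the table name, driver, and the shape of the parsed entry are placeholders, not decisions.

package main

import (
    "database/sql"
    "encoding/json"
    "log"

    _ "modernc.org/sqlite" // pure-Go SQLite driver; any SQLite driver would do
)

const uniprotSchema = `
CREATE TABLE IF NOT EXISTS uniprot (
    entry_id TEXT PRIMARY KEY, -- UniProt accession, e.g. P0A7B8
    data     JSONB NOT NULL    -- full entry serialized as JSON
);`

func main() {
    db, err := sql.Open("sqlite", "uniprot_sprot.db")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
    if _, err := db.Exec(uniprotSchema); err != nil {
        log.Fatal(err)
    }

    // The real build process would range over entries parsed from the
    // Uniprot XML dump; a single fake entry stands in here.
    entry := map[string]any{"accession": "P0A7B8", "name": "example entry"}
    blob, err := json.Marshal(entry)
    if err != nil {
        log.Fatal(err)
    }
    if _, err := db.Exec(
        `INSERT OR REPLACE INTO uniprot (entry_id, data) VALUES (?, ?)`,
        "P0A7B8", blob,
    ); err != nil {
        log.Fatal(err)
    }
}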

From there, the API launch process would be:

  1. Download / mount uniprot_sprot.db from dnadesign.bio/downloads/uniprot_sprot.db.
  2. Start Golang API

In order to make this reliable, I think Kubernetes is going to be the right abstraction layer. I'm going to have to be launching quite a few of these APIs, and this is the only way I can think of to front-load the energy of getting the services to work. I.e., these workflows cannot be pets; they need to be cattle.

I'm thinking DigitalOcean k8s for now, until we want to slurp up RefSeq, which will then require a custom server. Going to think more about this.

Remove gff

Gff is kinda a bad format, and I don't want to maintain the parser, so it should be removed. Use genbank instead.

Plannotate recode

Described previously at bebop/poly#396

Basically, reproduce https://github.com/mmcguffi/pLannotate/tree/master, but in a nice golang environment with an easy API on top. However, I'd like to combine that with something like #15. Essentially, the data sources should be pulled from the various APIs being built there, so that the annotator acts as an in-between service, while not really directly touching any data.

Megamash file

It is useful for me to keep records of megamash matches. I think this should be a file format.

@VN 0.0.1
@KmerSize 16
@MinimialKmerMatches 10
@Threshold 0.2
@Separator |
### START SUBHEADER ###
identifier    sequence
identifier2   sequence
### END SUBHEADER ###
289a197e-4c05-4143-80e6-488e23044378    2    identifier|identifier2    78/150|51/53

The subheader basically has fasta_identifier and sequence as headers, with the actual generated section having query_name (the fastq read name), number of matches, matches separated by the separator, and then coverage. Coverage is expressed as int/int, because the number of kmer matches is relevant information (a high number on both sides means high confidence, while a low total kmer count means a lower-confidence match).

We can also have a complementary JSON implementation, for easy reading after generation. The reason I like having a kind of TSV format is that the most common use of the matches will be streaming to other systems.
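A minimal sketch of how one line from the generated section could be represented in Go; the struct and field names are illustrative, not a final spec.

// MegamashMatch is a hypothetical in-memory form of one generated line.
type MegamashMatch struct {
    QueryName string  // fastq read name, e.g. "289a197e-4c05-4143-80e6-488e23044378"
    Matches   []Match // one entry per matched identifier; len(Matches) is the match count column
}

// Match pairs an identifier from the subheader with its kmer coverage.
type Match struct {
    Identifier   string // fasta_identifier from the subheader
    MatchedKmers int    // numerator of the int/int coverage, e.g. 78
    TotalKmers   int    // denominator of the int/int coverage, e.g. 150
}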

Restriction Enzymes from Rebase

Describe the desired feature/enhancement

I'd like the ability to use many more restriction sites much like biopython's Restriction package.

Is your feature request related to a problem?

I'm implementing a cloning design tool and have need for many more restriction enzymes. There isn't a problem with dnadesign, but it would be nice if it had more restriction sites built in.

Describe the solution you'd like

Well, I'd be happy to contribute a script that downloads the rebase distribution much like Biopython (https://github.com/biopython/biopython/blob/master/Scripts/Restriction/rebase_update.py). It looks like the biopython solution has scripts that pull the rebase files off the FTP site and then generate a python file with a huge dictionary of restriction sites (https://github.com/biopython/biopython/blob/master/Bio/Restriction/Restriction_Dictionary.py). This dictionary file is then committed to the repo as part of the biopython library.

I think that this is a nice solution since the end-user doesn't have to worry about downloading rebase themselves. I looked around for another go package that does this, but it doesn't seem to exist.
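A minimal sketch of what such a generated Go file could look like, assuming a simple standalone Enzyme struct; a real version would presumably reuse dnadesign's existing enzyme type, and the two entries below are well-known examples rather than data pulled from rebase.

// Generated from the rebase distribution, analogous to Biopython's
// Restriction_Dictionary.py; do not edit by hand.
package rebase

// Enzyme is a hypothetical stand-in for dnadesign's enzyme type.
type Enzyme struct {
    Name            string
    RecognitionSite string
}

// Enzymes holds every enzyme parsed from the rebase files.
var Enzymes = map[string]Enzyme{
    "EcoRI": {Name: "EcoRI", RecognitionSite: "GAATTC"},
    "BsaI":  {Name: "BsaI", RecognitionSite: "GGTCTC"},
}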

Describe alternatives you've considered (optional)

If it doesn't make sense to include in dnadesign, I could also just create a separate go package.

Additional context

Let me know if this is something that you'd like in dnadesign and I'll send an MR. Otherwise, I'll just create a separate package, but it would be nice to use the DNADesign Enzyme/EnzymeManager because it seems to have some of the cut site search logic built in already.

Thanks!

Uniprot parser

I need to update the uniprot parser to be compatible with the generic parser interface

vcf

vcf turns out to help with some of my sequencing. I like having pileup as a backup, though, for manual viewing. Generating vcf with bcftools is a good way to get details about possible mutations.

Addgene parser

It'd be really neat to have a scraper for Addgene plasmids to put them all into a database. Ideally, this would go through each plasmid and convert the HTML to a relevant JSON file. We'd need a few levels: for example, we would want to parse the sequences, publication, and depositing lab.
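A minimal sketch of the kind of JSON record the scraper could emit, covering the levels mentioned above; the field names and structure are guesses, not an Addgene schema.

// AddgenePlasmid is a hypothetical record built from one plasmid's HTML page.
type AddgenePlasmid struct {
    ID            int      `json:"id"`             // Addgene plasmid number
    Name          string   `json:"name"`
    Sequences     []string `json:"sequences"`      // full and/or partial sequences
    Publication   string   `json:"publication"`    // citation or DOI
    DepositingLab string   `json:"depositing_lab"`
}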

Document LinearFold+mfe and make linters work

I merged much of linearfold directly from https://github.com/allyourbasepair/rbscalculator - and so, not only are the linters very unhappy, but there are lots of spots missing better documentation (though vivek did very well in most of the code).

I am putting a documentation issue here: squash all linter bugs, add documentation and context for everything in the LinearFold+mfe packages, and generally clean it up to be up to spec with the rest of the project.

I am adding it now because I believe we need to get it into the tree; then we can begin improvements in an iterative process.

fastqindex

I want a binary fastqindex similar to https://hasindu2008.github.io/slow5specs/slow5-v1.0.0.pdf

This would mainly be used when writing a large fastq file to a data store, like S3, while still wanting to seek out specific lines from that fastq file. There would be two modifications; the first is standardization of size:

- (2 byte) uint16: length of read ID 
- (var byte) read ID (UUIDs can be used directly or a hash of the identifier can be used). Often 16 byte for UUID
- (8 byte) uint64: start position
- (4 byte) uint32: length

30 bytes in total for a typical record. If a PromethION flow cell returns 10,000,000 reads, the index file will be approximately 286MB.
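A minimal sketch of writing one index record in the layout above; the IndexRecord type and field names are illustrative, not a final spec.

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

// IndexRecord locates a single read within a large fastq file.
type IndexRecord struct {
    ReadID []byte // e.g. a 16-byte UUID, or a hash of the read identifier
    Start  uint64 // byte offset of the read within the fastq file
    Length uint32 // length of the read's record in bytes
}

// Write encodes the record as: uint16 ID length, ID bytes, uint64 start, uint32 length.
func (r IndexRecord) Write(buf *bytes.Buffer) error {
    if err := binary.Write(buf, binary.LittleEndian, uint16(len(r.ReadID))); err != nil {
        return err
    }
    if _, err := buf.Write(r.ReadID); err != nil {
        return err
    }
    if err := binary.Write(buf, binary.LittleEndian, r.Start); err != nil {
        return err
    }
    return binary.Write(buf, binary.LittleEndian, r.Length)
}

func main() {
    var buf bytes.Buffer
    rec := IndexRecord{ReadID: make([]byte, 16), Start: 1024, Length: 512}
    if err := rec.Write(&buf); err != nil {
        panic(err)
    }
    fmt.Println(buf.Len(), "bytes per record") // 2 + 16 + 8 + 4 = 30
}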

Refseq parser

More than just a genbank parser, we should have a refseq parser.

From https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Halobacterium_salinarum/latest_assembly_versions/GCF_000006805.1_ASM680v1/README.txt

===========================
Data provided per assembly:
===========================
Sequence and other data files provided per assembly are named according to the 
rule:
[assembly accession.version]_[assembly name]_[content type].[optional format]

File formats and content:

   assembly_status.txt
       A text file reporting the current status of the version of the assembly
       for which data is provided. Any assembly anomalies are also reported.
   *_assembly_report.txt file
       Tab-delimited text file reporting the name, role and sequence 
       accession.version for objects in the assembly. The file header contains 
       meta-data for the assembly including: assembly name, assembly 
       accession.version, scientific name of the organism and its taxonomy ID, 
       assembly submitter, and sequence release date.
   *_assembly_stats.txt file
       Tab-delimited text file reporting statistics for the assembly including: 
       total length, ungapped length, contig & scaffold counts, contig-N50, 
       scaffold-L50, scaffold-N50, scaffold-N75, and scaffold-N90
   *_assembly_regions.txt
       Provided for assemblies that include alternate or patch assembly units. 
       Tab-delimited text file reporting the location of genomic regions and 
       listing the alt/patch scaffolds placed within those regions.
   *_assembly_structure directory
       This directory will only be present if the assembly has internal 
       structure. When present, it will contain AGP files that define how 
       component sequences are organized into scaffolds and/or chromosomes. 
       Other files define how scaffolds and chromosomes are organized into 
       non-nuclear and other assembly-units, and how any alternate or patch 
       scaffolds are placed relative to the chromosomes. Refer to the README.txt
       file in the assembly_structure directory for additional information.
   *_cds_from_genomic.fna.gz
       FASTA format of the nucleotide sequences corresponding to all CDS 
       features annotated on the assembly, based on the genome sequence. See 
       the "Description of files" section below for details of the file format.
   *_feature_count.txt.gz
       Tab-delimited text file reporting counts of gene, RNA, CDS, and similar
       features, based on data reported in the *_feature_table.txt.gz file.
       See the "Description of files" section below for details of the file 
       format.
   *_feature_table.txt.gz
       Tab-delimited text file reporting locations and attributes for a subset 
       of annotated features. Included feature types are: gene, CDS, RNA (all 
       types), operon, C/V/N/S_region, and V/D/J_segment. Replaces the .ptt & 
       .rnt format files that were provided in the old genomes FTP directories.
       See the "Description of files" section below for details of the file 
       format.
   *_gene_expression_counts.txt.gz
       Tab-delimited text file with counts of RNA-seq reads mapped to each gene.
       See "Description of files" section below for details of the file format.
   *_gene_ontology.gaf.gz
       Gene Ontology (GO) annotation of the annotated genes in GO Annotation 
       File (GAF) format. Additional information about the GAF format is 
       available at 
       http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/ 
   *_genomic.fna.gz file
       FASTA format of the genomic sequence(s) in the assembly. Repetitive 
       sequences in eukaryotes are masked to lower-case (see below).
       The FASTA title is formatted as sequence accession.version plus 
       description. The genomic.fna.gz file includes all top-level sequences in
       the assembly (chromosomes, plasmids, organelles, unlocalized scaffolds,
       unplaced scaffolds, and any alternate loci or patch scaffolds). Scaffolds
       that are part of the chromosomes are not included because they are
       redundant with the chromosome sequences; sequences for these placed 
       scaffolds are provided under the assembly_structure directory.
   *_genomic.gbff.gz file
       GenBank flat file format of the genomic sequence(s) in the assembly. This
       file includes both the genomic sequence and the CONTIG description (for 
       CON records), hence, it replaces both the .gbk & .gbs format files that 
       were provided in the old genomes FTP directories.
   *_genomic.gff.gz file
       Annotation of the genomic sequence(s) in Generic Feature Format Version 3
       (GFF3). Sequence identifiers are provided as accession.version.
       Additional information about NCBI's GFF files is available at 
       https://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt.
   *_genomic.gtf.gz file
       Annotation of the genomic sequence(s) in Gene Transfer Format Version 2.2
       (GTF2.2). Sequence identifiers are provided as accession.version.
   *_genomic_gaps.txt.gz
       Tab-delimited text file reporting the coordinates of all gaps in the 
       top-level genomic sequences. The gaps reported include gaps specified in
       the AGP files, gaps annotated on the component sequences, and any other 
       run of 10 or more Ns in the sequences. See the "Description of files" 
       section below for details of the file format.
   *_protein.faa.gz file
       FASTA format sequences of the accessioned protein products annotated on
       the genome assembly. The FASTA title is formatted as sequence 
       accession.version plus description.
   *_protein.gpff.gz file
       GenPept format of the accessioned protein products annotated on the 
       genome assembly
   *_rm.out.gz file
       RepeatMasker output; 
       Provided for Eukaryotes 
   *_rm.run file
       Documentation of the RepeatMasker version, parameters, and library; 
       Provided for Eukaryotes 
   *_rna.fna.gz file
       FASTA format of accessioned RNA products annotated on the genome 
       assembly; Provided for RefSeq assemblies as relevant (Note, RNA and mRNA 
       products are not instantiated as a separate accessioned record in GenBank
       but are provided for some RefSeq genomes, most notably the eukaryotes.)
       The FASTA title is provided as sequence accession.version plus 
       description.
   *_rna.gbff.gz file
       GenBank flat file format of RNA products annotated on the genome 
       assembly; Provided for RefSeq assemblies as relevant
   *_rna_from_genomic.fna.gz
       FASTA format of the nucleotide sequences corresponding to all RNA 
       features annotated on the assembly, based on the genome sequence. See 
       the "Description of files" section below for details of the file format.
   *_rnaseq_alignment_summary.txt
       Tab-delimited text file containing counts of alignments that were either
       assigned to a gene or skipped for a specific reason. See "Description of
       files" section below for details of the file format.
   *_rnaseq_runs.txt
       Tab-delimited text file containing information about RNA-seq runs used 
       for gene expression analyses (See *_featurecounts.txt file and *.bw files
       within "RNASeq_coverage_graphs" directory). 
   *_translated_cds.faa.gz
       FASTA sequences of individual CDS features annotated on the genomic 
       records, conceptually translated into protein sequence. The sequence 
       corresponds to the translation of the nucleotide sequence provided in the
       *_cds_from_genomic.fna.gz file. 
   *_wgsmaster.gbff.gz
       GenBank flat file format of the WGS master for the assembly (present only
       if a WGS master record exists for the sequences in the assembly).
   annotation_hashes.txt
       Tab-delimited text file reporting hash values for different aspects
       of the annotation data. See the "Description of files" section below 
       for details of the file format.
   md5checksums.txt file
       file checksums are provided for all data files in the directory

All of this data is provided by refseq for genomes. We should build a parser for getting all of this data into a nicely formatted JSON document.
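A minimal sketch of what the top of that JSON structure could look like, assuming one record per assembly; the fields cover only a few of the files listed above and the names are placeholders.

// Assembly is a hypothetical top-level record for one refseq assembly.
type Assembly struct {
    Accession string            `json:"accession"` // e.g. GCF_000006805.1
    Name      string            `json:"name"`      // e.g. ASM680v1
    Status    string            `json:"status"`    // from assembly_status.txt
    Report    map[string]string `json:"report"`    // header metadata from *_assembly_report.txt
    Stats     map[string]string `json:"stats"`     // from *_assembly_stats.txt
    Genomic   string            `json:"genomic"`   // sequence from *_genomic.fna.gz
    Features  []string          `json:"features"`  // annotations from *_genomic.gff.gz
}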

PCR doesn't amplify

gene := "CGAGACcAAGTCGTCATAGCTGTTTCCTGAGAGCTTGGCAGGTGATGACACACATTAACAAATTTCGTGAGGAGTCTCCAGAAGAATGCCATTAATTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGCTCAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCGTGCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGCGCTTTCTCATAGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTGCACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAAGACACGACTTATCGCCACTGGCAGCAGCCACTGGTAACAGGATTAGCAGAGCGAGGTATGTAGGCGGTGCTACAGAGTTCTTGAAGTGGTGGCCTAACTACGGCTACACTAGAAGAACAGTATTTGGTATCTGCGCTCTGCTGAAGCCAGTTACCTTCGGAAAAAGAGTTGGTAGCTCTTGATCCGGCAAACAAACCACCGCTGGTAGCGGTGGTTTTTTTGTTTGCAAGCAGCAGATTACGCGCAGAAAAAAAGGATCTCAAGAAGGCCTACTATTAGCAACAACGATCCTTTGATCTTTTCTACGGGGTCTGACGCTCAGTGGAACGAAAACTCACGTTAAGGGATTTTGGTCATGAGATTATCAAAAAGGATCTTCACCTAGATCCTTTTAAATTAAAAATGAAGTTTTAAATCAATCTAAAGTATATATGAGTAAACTTGGTCTGACAGTTACCAATGCTTAATCAGTGAGGCACCTATCTCAGCGATCTGTCTATTTCGTTCATCCATAGTTGCCTGACTCCCCGTCGTGTAGATAACTACGATACGGGAGGGCTTACCATCTGGCCCCAGTGCTGCAATGATACCGCGAGAACCACGCTCACCGGCTCCAGATTTATCAGCAATAAACCAGCCAGCCGGAAGGGCCGAGCGCAGAAGTGGTCCTGCAACTTTATCCGCCTCCATCCAGTCTATTAATTGTTGCCGGGAAGCTAGAGTAAGTAGTTCGCCAGTTAATAGTTTGCGCAACGTTGTTGCCATTGCTACAGGCATCGTGGTGTCACGCTCGTCGTTTGGTATGGCTTCATTCAGCTCCGGTTCCCAACGATCAAGGCGAGTTACATGATCCCCCATGTTGTGCAAAAAAGCGGTTAGCTCCTTCGGTCCTCCGATCGTTGTCAGAAGTAAGTTGGCCGCAGTGTTATCACTCATGGTTATGGCAGCACTGCATAATTCTCTTACTGTCATGCCATCCGTAAGATGCTTTTCTGTGACTGGTGAGTACTCAACCAAGTCATTCTGAGAATAGTGTATGCGGCGACCGAGTTGCTCTTGCCCGGCGTCAATACGGGATAATACCGCGCCACATAGCAGAACTTTAAAAGTGCTCATCATTGGAAAACGTTCTTCGGGGCGAAAACTCTCAAGGATCTTACCGCTGTTGAGATCCAGTTCGATGTAACCCACTCGTGCACCCAACTGATCTTCAGCATCTTTTACTTTCACCAGCGTTTCTGGGTGAGCAAAAACAGGAAGGCAAAATGCCGCAAAAAAGGGAATAAGGGCGACACGGAAATGTTGAATACTCATACTCTTCCTTTTTCAATATTATTGAAGCATTTATCAGGGTTATTGTCTCATGAGCGGATACATATTTGAATGTATTTAGAAAAATAAACAAATAGGGGTTCCGCGCACCTGCACCAGTCAGTAAAACGACGGCCAGTGACTTgGTCTCAGTCTCAGTCTCATCTTTCCCTTCGTCATGTGACCTGATATCGGGGGTTAGTTCGTCATCATTGATGAGGGTTGATTATCACAGTTTATTACTCTGAATTGGCTATCCGCGTGTGTACCTCTACCTGGAGTTTTTCCCACGGTGGATATTTCTTCTTGCGCTGAGCGTAAGAGCTATCTGACAGAACAGTTCTTCTTTGCTTCCTCGCCAGTTCGCTCGCTATGCTCGGTTACACGGCTGCGGCGAGCATCACGTGCTATAAAA"
primers := []string{"GTCATCACCTGCCAAGCTCT", "GGGTTATTGTCTCATGAGCGG"}
fragments, _ := pcr.Simulate([]string{gene}, 50.0, true, primers)

fmt.Println(fragments)

This should definitely amplify

slow5 version

slow5 version should be in the Header, not the HeaderValue. In addition, HeaderValues don't handle ordering / multiple read groups right now.

type Header struct {
    HeaderValues []HeaderValue
}

type HeaderValue struct {
    ReadGroupID        uint32
    Slow5Version       string
    Attributes         map[string]string
    EndReasonHeaderMap map[string]int
}
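
A minimal sketch of the restructured types being proposed, assuming the version simply moves up one level; the other fields are unchanged.

// Proposed: the slow5 version lives on the Header itself, since it applies to
// the whole file rather than to any one read group.
type Header struct {
    Slow5Version string
    HeaderValues []HeaderValue // one per read group, kept in read-group order
}

type HeaderValue struct {
    ReadGroupID        uint32
    Attributes         map[string]string
    EndReasonHeaderMap map[string]int
}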

Better Megamash

Describe the desired feature/enhancement

Some suggestions for how to improve the megamash algorithm

Describe the solution you'd like

Just throwing a thread here to brainstorm ideas and fixes - will keep this message short as an intro/title card

Megamash cannot call complicated sequences

seq_recovery=0.4152_0, for example, cannot be processed into a megamash table because it doesn't have any unique 16mers. This is because of a flawed assumption: ESPECIALLY in protein variant libraries, 16mer windows won't necessarily be unique, and even if a sequence is unique, a 16mer sliding window may not be able to pick it up. This is a real issue.

I'm still trying to figure out how to fix this.

templateMap.csv

Genbank improvements

Referencing bebop/poly#434

@carreter is asking for a full rewrite there, but I think I disagree. Useful link to the spec.

  • Feature.GetSequence() always returns a nil error value link This should be fairly easy to fix, it is only referenced once.
  • Gff.AddFeature() code is misleading and mutates Feature state link This doesn't seem like that big of an issue. We can just do a deep copy of the feature and it should be fine.
  • Common Genbank Feature.Type values should be enumerated link This should be pretty easy, just adding strings enums in a few places.

These are all nice improvements, but all are actually kinda simple to implement. The first will take just a couple lines of changes with zero impact on functionality, the second just takes a copy, and the third is just adding some enums (see the sketch below).
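For the third item, a minimal sketch of what those string enums could look like; the constant names and the particular values chosen are illustrative, not the final list.

// A few common GenBank feature types as named constants; Feature.Type stays a
// plain string, so existing code keeps working.
const (
    FeatureTypeSource      = "source"
    FeatureTypeGene        = "gene"
    FeatureTypeCDS         = "CDS"
    FeatureTypeMRNA        = "mRNA"
    FeatureTypeMiscFeature = "misc_feature"
)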

I do think a refactor could be in order: in particular, it might be easy to split the parseChecks into functions. I think there is MASSIVE room for improvement in the test suite as well, but honestly, the genbank parser works pretty darn well right now, so I am hesitant to spend the time on a 4th refactor when I could be using my time on better things. Will implement fixes to those 3 things though.

Fork migration process

Currently forking poly. Here's the todo:

  • Merge all unmerged PRs
  • Update copyright
  • Remove all sponsor information and other linked information (twitter posts, etc)
  • Change go.mod from upstream to current
  • Change project name
