Tool to analyze DNA sequences to identify relevant amino acids.
Initially in Python, with potential for a later version in a compiled language.
- path to directory of files containing DNA sequences
- prompt for sequence direction (forward/coding vs reverse/complement)
  - may need to reverse the sequence, then generate the complementary base sequence
- nucleotide sequence of start codon (or arbitrary string)
  - default: ATG
- nucleotide sequence of tag / stop codon
  - default: GAACAAAAGCTTATTTCTGAAGAGGACTTG [3]
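
The direction handling could look roughly like the sketch below. The function names `reverse_complement` and `orient`, and the `"forward"`/`"reverse"` direction strings, are hypothetical choices made for illustration. The sketch assumes plain A/C/G/T input and does not handle ambiguity codes such as N.

```python
# Map each base to its complement (hypothetical helper, A/C/G/T only).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient(seq: str, direction: str = "forward") -> str:
    """Return the sequence as read on the coding strand.

    Whitespace and line breaks are stripped and bases are upper-cased,
    so a sequence pasted across several lines is accepted.
    """
    cleaned = "".join(seq.split()).upper()
    if direction == "forward":
        return cleaned
    if direction == "reverse":
        return reverse_complement(cleaned)
    raise ValueError(f"unknown direction: {direction!r}")
```

For example, `orient("atg c", "reverse")` cleans the input to `ATGC` and returns `GCAT`.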
Given a DNA sequence (~1k bases), a start codon, and a stop codon...
- Read DNA sequence base-wise in the specified direction
- Identify location of start codon
- Identify location of tag
- Create open reading frame (ORF) between start and tag
- Translate the ORF into amino acids, working backwards from the tag
- Continue translation until the start codon is reached
- Report the following data points:
  - Sample metadata
  - Original and transformed sequence
  - ORF sequence and nucleotide count
  - In frame? yes/no
  - AA sequence from start to stop
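
The steps above can be sketched as follows. This is a minimal illustration under several assumptions, not a settled design: the names `analyze` and `translate_backwards` and the report keys are hypothetical, the input is assumed to be already oriented and upper-case, the first start codon upstream of the tag is used, "in frame" is taken to mean that the start-to-tag distance is a multiple of three, and the reported amino acid sequence stops just before the tag.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

DEFAULT_START = "ATG"
DEFAULT_TAG = "GAACAAAAGCTTATTTCTGAAGAGGACTTG"  # default tag from the notes


def translate_backwards(seq, start_pos, tag_pos):
    """Translate codons from the tag back toward the start codon.

    Codons are read in the tag's frame; unknown codons become 'X'.
    The result is returned in normal reading order (start to tag).
    """
    residues = []
    pos = tag_pos - 3
    while pos >= start_pos:
        residues.append(CODON_TABLE.get(seq[pos:pos + 3], "X"))
        pos -= 3
    return "".join(reversed(residues))


def analyze(seq, start=DEFAULT_START, tag=DEFAULT_TAG, sample="unknown"):
    """Return a report dict, or None if the tag or start is not found."""
    tag_pos = seq.find(tag)
    if tag_pos == -1:
        return None
    # Assumption: use the first start codon lying fully upstream of the tag.
    start_pos = seq.find(start, 0, tag_pos)
    if start_pos == -1:
        return None
    orf = seq[start_pos:tag_pos]
    return {
        "sample": sample,
        "sequence": seq,
        "orf": orf,
        "orf_length": len(orf),
        "in_frame": len(orf) % 3 == 0,
        "aa_sequence": translate_backwards(seq, start_pos, tag_pos),
    }


report = analyze("CC" + "ATG" + "GCTAAA" + DEFAULT_TAG + "TAA", sample="demo")
print(report["orf"], report["orf_length"], report["in_frame"], report["aa_sequence"])
# prints: ATGGCTAAA 9 True MAK
```

When the start codon is out of frame with the tag, the backward walk stops one or two bases short of the start, so the translated sequence does not begin with the start codon's residue. That is one way the "In frame? yes/no" field could be cross-checked.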