biojulia / bio.jl

[DEPRECATED] Bioinformatics and Computational Biology Infrastructure for Julia

Home Page: http://biojulia.dev

License: MIT License

Julia 100.00%

bioinformatics biology genomics dna deprecated biojulia biojulia-packages

bio.jl's Introduction

Bio

This package has been deprecated. Full details are available here. You can still download and use this package, as we don't want old scripts to break. However, going forward, this repository is archived and read-only; no further updates or fixes will be committed. You should use the packages that replace Bio.jl - a list is available here.

bio.jl's Issues

[Seq] Split files for Nucleotides/Aminoacids and respective sequences

Today, nucleotide.jl has almost 1k lines. I think putting amino acid and nucleotide sequences into separate files would improve readability. What do you think?

Importing functions from the Base module.

In Bio.jl, we often extend functions exported from the Base module. But as @dcjones said (#85 (comment)), it is annoying to synchronize method definitions with the import Base statement.

I listed four possible styles for importing names to extend. The pros and cons are my opinion; you may not agree.
The last style is my own suggestion.

1. always add `Base.` to definitions like `function Base.foo()`

pros:
- explicit
- no need to keep synchronizing method definitions with the import statement
cons:
- verbose when defining methods

2. explicitly import methods from `Base` in every module (the current style)

pros:
- explicit
- less verbose when defining methods
cons:
- need to keep synchronizing method definitions with the import statement

3. `importall Base` (suggested by @prcastro #85 (comment))

pros:
- less verbose when defining methods
cons:
- implicit
- possible to accidentally extend methods

4. define an extendable method list shared across `Bio` modules

This is something like this:

macro bio_base_import()
    quote
        # these methods can be extended
        import Base:
            convert,
            length,
            ...
    end
end

and call this macro in each module:

module Bio.Seq

@bio_base_import

...

pros:
- explicit
- less verbose when defining methods
cons:
- still possible to accidentally extend methods, as with the importall Base style, but somewhat less likely
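
As a rough illustration of the two explicit styles (1 and 2) listed above, here is a minimal sketch in current Julia syntax. It is not code from Bio.jl: the `StyleDemo` module and the `Kmer` type are made up for the example. Note also that `importall` was removed in Julia 1.0, so style 3 only applies to the older Julia versions discussed in this issue.

```julia
module StyleDemo

# Style 2: list every Base name that will be extended in an import statement.
import Base: length

# Hypothetical type, used only for this illustration.
struct Kmer
    bases::String
end

# Because `length` was imported above, this adds a method to Base.length.
length(k::Kmer) = ncodeunits(k.bases)

# Style 1: qualify the name at the definition site; no import statement needed.
Base.isempty(k::Kmer) = length(k) == 0

end # module
```

With style 1 the definition itself says which function is being extended, so there is no import list to keep in sync; with style 2 the import list has to be updated whenever a new Base function is extended.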

Add contributor / style guides

Provide an editorconfig file

Editorconfig is a file format for defining style elements that text editors should adhere to (e.g. indent style and size, whitespace trimming, etc.), and a collection of plugins for pretty much every editor that work with the file format. If we provide a file, contributors don't have to remember to switch tab numbers etc. in their editor when working on this project.

Comparison of sequences

seq = DNASequence("AGCTTTT")
println( DNASequence("AGC") == DNASequence("AGC") )
println( DNASequence("AGC") == dnakmer( DNASequence("AGC") ) )
println( seq[1:3] == DNASequence("AGC") )
println( seq[1:3] == dnakmer( DNASequence("AGC") ) )

All these comparisons are false right now. Should they be true?

Implement our own bit vector type.

We use the BitVector type to represent ambiguous nucleotides in a sequence.

Bio.jl/src/seq/nucleotide.jl

Line 185 in 0697dc9

ns::BitVector # 'N' mask

What worries me is that we are touching the internal members of the type. For example, at this line:

Bio.jl/src/seq/nucleotide.jl

Line 279 in 0697dc9

$(ns).chunks[d + 1] |= UInt64(1) << r

It is unlikely that BitVector will change its internal representation significantly, but that depends on the core developers of the standard library.

I think reimplementing the data structure would not cost much, and we could then modify it as we need.
If you agree, I can do that job.
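
To give an idea of the size of such a reimplementation, here is a minimal sketch of a package-owned bit mask, written in current Julia syntax. The name `NMask` and its fields are assumptions made for this example, not the type that was actually proposed; the chunk arithmetic mirrors the `chunks[d + 1] |= UInt64(1) << r` line quoted above.

```julia
# Hypothetical package-owned bit mask; 64 bits are packed into each UInt64 chunk.
struct NMask
    chunks::Vector{UInt64}
    len::Int
end

# Create a mask of n bits, all cleared.
NMask(n::Integer) = NMask(zeros(UInt64, cld(n, 64)), n)

Base.length(m::NMask) = m.len

function Base.getindex(m::NMask, i::Integer)
    1 <= i <= m.len || throw(BoundsError(m, i))
    d, r = divrem(i - 1, 64)          # chunk index (0-based) and bit offset
    return (m.chunks[d + 1] >> r) & UInt64(1) == UInt64(1)
end

function Base.setindex!(m::NMask, v::Bool, i::Integer)
    1 <= i <= m.len || throw(BoundsError(m, i))
    d, r = divrem(i - 1, 64)
    if v
        m.chunks[d + 1] |= UInt64(1) << r       # set the bit
    else
        m.chunks[d + 1] &= ~(UInt64(1) << r)    # clear the bit
    end
    return m
end
```

Because the layout is defined by the package itself, code that manipulates `chunks` directly no longer depends on the internals of a standard library type.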

RFC: Bioinformatics project management with BioJulia

This is to further my idea articulated on Gitter. As the core library becomes even more complete, and as the BioJulia organisation gets more attention, one of the things I think would help set BioJulia apart from other Bio* projects (besides being fast and making the best use of Julia's features), would be if we also made it very easy for casual scripters (By that I mean lab based biologists that have to then turn to do some bioinformatics at the end of their experiment) to manage their Bioinformatics projects: if you're already scripting a few data processing steps with BioJulia tools, why not also manage the project with BioJulia?

What I think would be great to see is being able to start a Bioinformatics project directory from Julia, which is reproducible and self-contained, and so perhaps makes use of its own package repo. It could also make use of both Git and Dat to (A) record code changes to scripts and pipelines in a project, and (B) keep track of and version data produced by scripts in your Bioinformatics project. With regard to pipelines, the ability to script pipelines that make use of Bio.jl structures and algorithms, with a streams/flow-based programming approach, would be awesome. All of this tied together would make it awesome to manage and run a reproducible Bioinformatics project with BioJulia.

So breaking it down a bit, I guess what I'm saying with this brain dump is:

A module for interfacing - reading and writing to and from Dat.
Streams / Flow-based programming (ideally that also makes use of multiple processes - I think nextflow does this).
Project based packages like packrat for R or virtualenv for Python.

Would be cool milestones.

Subsequence of subsequence

The following behavior is counter-intuitive to me and I believe this is a bug, isn't it?

julia> read = dna"AAAATTTT"
8nt DNA Sequence
 AAAATTTT

julia> seed = read[5:end]
4nt DNA Sequence
 TTTT

julia> @assert seed[1:2] == dna"TT"
ERROR: AssertionError: seed[1:2] == @dna_str("TT")

julia> seed[1:2]
2nt DNA Sequence
 AA
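
The expected behaviour can be sketched as follows. This is not the Bio.jl implementation: `SubSeq`, its fields, and `tostring` are hypothetical names for this example, and the assumption is that a subsequence shares the parent's data and stores only the range it covers. Under that assumption, slicing a subsequence has to add the subsequence's own offset to the requested range; indexing the shared data with the raw range instead would give the `AA` result shown above.

```julia
# Hypothetical sequence view: shared data plus the range this view covers.
struct SubSeq
    data::Vector{Char}
    part::UnitRange{Int}
end

SubSeq(s::AbstractString) = SubSeq(collect(s), 1:length(s))

Base.length(s::SubSeq) = length(s.part)
Base.lastindex(s::SubSeq) = length(s)   # lets `s[5:end]` work

function Base.getindex(s::SubSeq, r::UnitRange{Int})
    if !isempty(r) && (first(r) < 1 || last(r) > length(s))
        throw(BoundsError(s, r))
    end
    # Translate the view-relative range into a range over the shared data.
    offset = first(s.part) - 1
    return SubSeq(s.data, (first(r) + offset):(last(r) + offset))
end

# Materialise the view as a String, for display and comparison.
tostring(s::SubSeq) = String(s.data[s.part])
```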

Community expansion: steps to being welcoming and inclusive

We've talked on gitter a bunch of times about this stuff. I'm putting it here so it's more visible, and actionable.

The basic problem is this: open source communities tend to have very low diversity, and to be offputting to a large proportion of potential contributors. As in all things, we should try to do better.

Some background from the open science/open source communities:

http://redmonk.com/dberkholz/2012/10/29/githubs-success-and-low-barriers-to-entry/
http://mozillascience.github.io/leadership-training
http://producingoss.com/en/producingoss.html (this is especially good and thorough)

These are some concrete steps we can take to make the project more welcoming and inclusive:

What have I missed?

Toward short read alignment (JSoC 2015).

This issue is a check list of my progress and discussions about it.

Merge BioSeq.jl into the Seq module

Multiple Sequence Alignment

Would it be useful to have something similar to Clustal Omega, MUSCLE, or MAFFT? I usually do it manually, but maybe an API or something might be useful.

Merge Phylogenetics.jl into phylo submodule

Rethinking about the parsing interface

Thanks to the great effort of @dcjones, we're getting a powerful framework to parse various data formats at the speed of light.
But I think that the interface of the parser can be more flexible and idiomatic in the context of Julia.
The current interface is the read method only, and its behavior is significantly different from the methods defined in the standard library.
In the Base module, read(io, type) always returns an object of type but read(io, FASTA) returns an iterator of biological sequences; this is counterintuitive to me.

I propose the following functions as public APIs on the top of the parser framework:

open(filename, format): open filename and return a stream indexed by format
read(stream, type): read a type object from stream
read!(stream, object): read a record into the preallocated object from stream
each(stream, type): return an iterator of type objects from stream
close(stream): close stream.

open, read, read! and close behave like methods in the standard library, respectively. each is something like Base.eachline.
A stream object is indexed by a file format: it is defined as Stream{F<:FileFormat}, and Stream{FASTA} is, for example, the stream type for the FASTA format.

The rationale behind this idea is that file format and data type are not completely dependent.
For example, we can read DNA sequences from several formats (including FASTA, FASTQ, SAM, etc.) and a SAM file can generate several kinds of data (including DNA sequence, alignment, quality of sequence, etc.).

The next code may illustrate my idea clearly:

s = open("hg38.fa", FASTA)
while !eof(s)
    seq = read(s, DNASequence)
    ...
end
close(s)

open("hg38.fa", FASTA) do s
    for seq in each(s, DNASequence)
        ...
    end
end

If there are a large number of records in a file, you can save memory allocation using read! as follows:

s = open("sample1234.bam", BAM)
seq = DNASequence()
while !eof(s)
    read!(s, seq)
    ...
end
close(s)

Multiple dispatch in Julia would enable this interface very easily; a method read(::Stream{Format}, ::Type{Record}) will be defined only if a Record object can be extracted from a Format stream.

I'd like to know your opinions. Do you think this is reasonable?
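
To make the dispatch idea concrete, here is a minimal self-contained sketch in current Julia syntax (it needs Julia 1.5 or later for the exported `peek`). None of it is the actual Bio.jl parser: `FileFormat`, `FASTA`, `Stream`, and `Record` are stand-in definitions for this example, and the "parser" is a naive line reader rather than the ragel-generated one.

```julia
# Stand-in types for the proposal; not the Bio.jl definitions.
abstract type FileFormat end
struct FASTA <: FileFormat end

# A stream tagged with the file format it carries.
struct Stream{F<:FileFormat}
    io::IO
end

# Hypothetical record type: a name and the raw sequence text.
struct Record
    name::String
    seq::String
end

Base.eof(s::Stream) = eof(s.io)
Base.close(s::Stream) = close(s.io)

# This method exists only for the (FASTA, Record) pair, as in the proposal.
function Base.read(s::Stream{FASTA}, ::Type{Record})
    header = readline(s.io)
    startswith(header, '>') || error("expected a FASTA header line")
    buf = IOBuffer()
    # Collect sequence lines until the next header or the end of the input.
    while !eof(s.io) && peek(s.io) != UInt8('>')
        write(buf, strip(readline(s.io)))
    end
    return Record(String(strip(header[2:end])), String(take!(buf)))
end
```

A loop such as `while !eof(s); rec = read(s, Record); end` then works as in the examples above, and asking for a type that has no `read` method for the stream's format fails with a `MethodError` instead of returning something unexpected.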

Create directory structure to reflect code submodules

- src
 |- seq
 |- align
 |- phylo
 |- ranges
 |- tools
 |- services
- test
 |- seq
 |- align
 |- phylo
 |- ranges
 |- tools
 |- services

Neuro Imaging

Would be nice if Bio.jl offered a framework to Neuro Imaging (bio imaging in general). I do not have the expertise and don't know if this is the place for something like this. I couldn't find any package that offered such capabilities.

I was thinking of something along the lines of MATLAB's FieldTrip to deal with EEG or MEG analysis. Something for MRI would also be very nice.

Is this the right place?

Parallel BGZF

I've written a very basic BAM parser, and after some profiling it's clear that the bottleneck for any halfway decent parser is decompression. I was a little surprised, since zlib decompression is very fast, but since parsing BAM is just reading fixed sized fields, it barely takes any time on top of that.

sambamba is implemented using a parallel BAM parser (really just parallel BGZF compression/decompression), which gets considerable speedup over htslib/samtools. They have a short paper in bioinformatics about it. So they obviously saw the same profiling results as I did.

I spent a lot of time today implementing parallel BGZF decompression, to see if I could also get an advantage over htslib. The result is this gist. It works, but since Julia does not have multithreading it relies on multiple processes. It turns out that even using shared memory for buffers, the overhead involved with sending messages between processes leads to no real performance improvement.

I think this is worth pursuing, but probably only when Julia gets some kind of multithreading. Another possibility is to use multiple processes but restructure the code so processes work on really big chunks of data at a time. That might reduce the overhead, but it will still be worse than a multithreaded version.

In the meantime, I think we can at least match htslib's performance using serial decompression.

Seq: shuffling

This is a very useful algorithm that's not in any other Bio library that I'm aware of, but should be in ours:
http://digital.cs.usu.edu/~mjiang/ushuffle/

Removing Julia v0.3 from Travis CI Tests.

Since we are aiming for Bio.jl to run against version 0.4 of Julia, it has been suggested we only keep the 0.4 version of Julia in the Travis CI tests and build. So, core_devs - yay or nay?

Additional array operations for Sequences (insert, delete, push, pop)

Re: discussion from gitter

Should we have methods to insert and/or delete one or more nucleotides/amino acids to mutable sequences? @Ward9250 said:

as a sequence may be thought of as a specialized string, which can be thought of as an array, insert and delete (and push and pop?) methods are something we should consider implementing as efficiently as possible.

Suggestions?

Add .2bit file format support

It would be nice to have .2bit file format support:

Parsing FASTA files with missing EOL (new line at end of file)

While working with the specimen FASTA files, I noticed that for files which are missing a linefeed at the end of the file, the last line of the sequence is not parsed.

e.g.
File: BioFmtSpecimens/FASTA/dna2.fasta does not have an EOL.
According to index.yml it should be successfully parsed. It is. However reading the file with

for seqrec in read(open("dna2.fasta"), FASTA)
    println(seqrec)
end

results in

[...]
>Test2 
210nt DNA Sequence
 CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAA…CCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG

Whereas the input sequence is:

[...]
>Test2
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

The problem is not related to the show() function, since writing the record out gives the same truncated sequence.

Strictly speaking, the POSIX standard defines a line as:

A sequence of zero or more non- <newline> characters plus a terminating <newline> character.

source: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

However, IMHO I suppose we can't be that strict in practice and have to deal with missing EOLs.

Module files should start with uppercase

Following conventions in other Julia packages, I propose we rename all module files and folders to start with uppercase (e.g. src/seq/seq.jl to src/Seq/Seq.jl).

Basic Ranges functionality

Early on we will need support for annotated genomic ranges.

A basic genomic range type that supports arbitrary annotation.
An interval tree implementation to index a set of ranges
Operations on sets of ranges: union, intersection, merge, etc.
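
A very small sketch of the first and third items might look like this. Everything here is an assumption made for illustration (`GenomicRange`, its field names, `isoverlapping`, `intersection`, and `mergeranges` are not Bio.jl API), coordinates are taken to be 1-based and inclusive, and the merge uses a plain sort rather than the interval tree mentioned in the second item.

```julia
# Hypothetical range type carrying arbitrary metadata of type T.
struct GenomicRange{T}
    seqname::String
    first::Int
    last::Int
    metadata::T
end

# Two ranges overlap if they are on the same sequence and share a position.
isoverlapping(a::GenomicRange, b::GenomicRange) =
    a.seqname == b.seqname && a.first <= b.last && b.first <= a.last

# Intersection as a (seqname, first, last) tuple, or `nothing` if disjoint.
function intersection(a::GenomicRange, b::GenomicRange)
    isoverlapping(a, b) || return nothing
    return (a.seqname, max(a.first, b.first), min(a.last, b.last))
end

# Merge overlapping ranges; metadata is dropped in this sketch.
function mergeranges(ranges)
    sorted = sort(collect(ranges); by = r -> (r.seqname, r.first))
    merged = Tuple{String,Int,Int}[]
    for r in sorted
        if !isempty(merged) && merged[end][1] == r.seqname && r.first <= merged[end][3]
            name, lo, hi = merged[end]
            merged[end] = (name, lo, max(hi, r.last))   # extend the last block
        else
            push!(merged, (r.seqname, r.first, r.last)) # start a new block
        end
    end
    return merged
end
```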

References/inspiration:

bedtools
IRanges
Interval operations supported in Galaxy

RFC: Bio package system

As we approach release of the core, we should start planning the package system.

We have previously decided that all functionality apart from the core will be provided via a package system, and that the package system will have peer review and rules about code reuse, correctness and style to ensure high quality.

Let's use this issue to flesh out those ideas and decide on details that will be important for implementation.

Note: following from discussion in #44.

[RFC] Proposal for Julia Summer of Code 2015

You may already know about Julia Summer of Code (JSoC) 2015.
This seems to be something like the Google Summer of Code: students develop open source projects under support of their mentor funded by Google, Inc.
Please beware that the application deadline is very soon (June 1)!

More detailed announcement is at the website:

Thanks to a generous grant from the Moore Foundation, we are happy to announce the 2015 Julia Summer of Code (JSoC) administered by NumFocus. We realize that this announcement comes quite late in the summer internship process, but we are hoping to fund six projects. The duration of JSoC 2015 will be June 15-September 15. Last date for submitting applications is June 1.

http://julialang.org/blog/2015/05/jsoc-cfp/

I'm thinking of applying to JSoC as a subproject of BioJulia if I can find a good mentor and a project here.
@dcjones kindly said he would be able to be my mentor (thanks a lot!), but I don't have a concrete project idea at the moment.
So if you have a great project in mind and think this is a good chance to make it happen, please let me know! That is why I opened this issue.

If not, I have several abstract project ideas for the candidates:

Implement various parsers for high-throughput sequencing - BAM/SAM/CRAM/VCF/BCF.
Data frames and a variant annotation toolkit like GenomicRanges and VariantAnnotation packages of Bioconductor.
Indexed sequence alignment and search tools like BLAT.

I have to confess that I'm not an expert developer in these fields, but I'm a heavy user of these tools and will continue to use them for a long time.

I look forward to your involvement in the discussion.

Thanks.

Personal background:

I am a graduate school student at the University of Tokyo, Japan.
I study human genetics and handle lots of NGS data.
I have 1.5 years' experience of Julia programming and developed several packages:

Seq: encoding non-standard nucleotides and amino acids

See the discussion in #15.

We need to make a decision regarding how “non-standard” nucleotides and amino acids should be encoded. I think we disagree slightly on whether AA_X (the missing/ambiguous amino acid code) should be encoded as 0 or as 255 (or kept as currently encoded, as 20). We should do some benchmarking to see if either way makes a significant difference in performance (e.g. 255 would increase the size of the lookup table, and may result in more cache misses).

Policy on Getters and Setters for Types in Bio.jl

In Phylo so far I've written getter and setter methods for all the variables in the types of Phylo. This is mostly for the benefit of people who don't know code or Julia very well, so that they can tell from the function name what's happening, and perhaps for someone from a heavily OO background who may be used to variables being private and requiring getter and setter methods. However, I can also see the point that it feels redundant given Julia's Type.Variable notation, which allows you to change variables so long as the type is not immutable.

What are people's views on this? Preferably the same policy should be used for all modules of Bio.jl and this would be a point to go into the documentation or style guide.

Teaching materials for BioJulia

In addition to the usual website, we should start planning some teaching materials for BioJulia.

As with everything else in the project, we should try to go above and beyond. This is the place to discuss whether that's a good idea, and what it would look like.

Phylo submodule review and roadmap

Phylo code review - 20th April 2014

General notes

Overall very good, and a great start for the Bio.jl phylo module. Here I have attempted to map out what would be required to get to a version 1, high quality phylogenetics module to ship with Bio.jl that covers the major use cases.

This review covers:

the API
design of the module
minor style points
testing
repetition
documentation

Most of the review is a to-do list, so that as points are addressed they can be checked off.

1. API

Currently there are methods for the following operations on trees:

parsing/lexing newick format
parsing/lexing PhyloXML (in progress)
getting the root node
getting the children of all nodes in a tree simultaneously
test whether two trees are equal
produce a unique hash from a tree

We should define the minimal API for this library. I suggest it should include (in addition to the existing functionality):

Node/tree manipulation

get and set any property of a node
add child nodes
add sibling nodes
remove child nodes
delete nodes (with or without connecting their descendents to the parent)
detach the node (delete it from the original tree and return it as a new tree)
re-root a (sub)tree

Tree exploration

get the children of a node
check if the node is a leaf (has no children)
get the parent of a node
check if the node is root
get the siblings of a node
get all the descendents of a node
get all the leaves of a (sub)tree (i.e. all terminal descendents of a node)
depth-first search of a (sub)tree
breadth-first search of a (sub)tree

Tree measurement

Get the shortest path between two nodes
Get the distance (aggregate branch length) between two nodes
Check whether a (sub)tree is monophyletic for a given taxonomic rank
Find the midpoint outgroup

Tree visualisation

print a text representation
print an image representation
produce an interactive representation
annotate these visualisations with arbitrary information from the tree

IO

parse/lex nexus
parse/lex NexML

2. Design

The module is structured around the abstract type Phylogeny, representing a phylogeny. It has subtypes Clado and Phylo representing Cladograms and Phylograms respectively.

These types represent the tree as an adjacency list with associated arrays of properties such as node names and edge lengths.

I would like to query two aspects of this design:

For many of the features listed above as desirable to include in the API, a natural representation of trees is as hierarchies of nodes. In this format, a variable containing a tree is really a pointer to the root node, which knows that it is a root and what its children are. Each child node knows its parent, might know how long the branch is from its parent, and knows what its children are. This is how the tree representations in BioPython/ETE and BioRuby work. Benefits of this system are that it is very easy to insert or remove parts of a tree, split a tree into subtrees, and recursively represent a tree in a text format. I can't think of any downsides (but happy to be corrected). We should consider adopting this representation unless there are major constraints on doing so in Julia.
Is it necessary to have separate Cladogram and Phylogram types? It seems to me that a Cladogram is easily represented by a tree where each node has a null branch length. A method on the root node could tell us whether the tree is a cladogram.

I think these two queries should be resolved before any of the remaining sections are addressed.
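
As a rough sketch of what the node-based representation from the first query could look like in current Julia syntax (the type and function names below are assumptions for this example, not the Phylo module's API), with the second query handled by treating a missing branch length as `nothing`:

```julia
# Hypothetical node type: each node knows its parent and its children.
mutable struct PhyNode
    name::String
    branchlength::Union{Float64,Nothing}   # `nothing` means no length recorded
    parent::Union{PhyNode,Nothing}
    children::Vector{PhyNode}
end

PhyNode(name::AbstractString, branchlength = nothing) =
    PhyNode(name, branchlength, nothing, PhyNode[])

isroot(n::PhyNode) = n.parent === nothing
isleaf(n::PhyNode) = isempty(n.children)

function addchild!(parent::PhyNode, child::PhyNode)
    child.parent === nothing || error("node already has a parent")
    child.parent = parent
    push!(parent.children, child)
    return parent
end

# Remove a node from its tree; the node becomes the root of a new tree.
function detach!(n::PhyNode)
    p = n.parent
    if p !== nothing
        filter!(c -> c !== n, p.children)
        n.parent = nothing
    end
    return n
end

# Collect all terminal descendants by depth-first traversal.
function leaves(n::PhyNode, acc = PhyNode[])
    if isleaf(n)
        push!(acc, n)
    else
        for c in n.children
            leaves(c, acc)
        end
    end
    return acc
end

# A (sub)tree is a cladogram if no branch below the given node has a length.
iscladogram(n::PhyNode) =
    all(c -> c.branchlength === nothing && iscladogram(c), n.children)
```

In this layout a subtree is just a node, so detaching, re-attaching, and recursing over parts of a tree need no index bookkeeping, which is the benefit claimed above.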

3. Style points

Method names should be consistent with one another and with the Standard Library. STL uses "$(verb)$(noun)", e.g. joinpath. readtree is consistent whereas treebuild and treewrite are not. Should be buildtree and writetree. Similarly, cladobuild should be buildclado, etc.
Why are file paths restricted to being ASCIIStrings? I think all operating systems now allow unicode paths. If this is true, we should be more liberal with the String type we allow.
Style guide suggests variables should be all lower case, with underscore separators if absolutely necessary. We have, e.g. nodeLabel, tipLabel, treeNames, etc. These should be moved to all-lowercase.
Style guide suggests function names should be lowercase - the names of internal functions of cladobuild, treebuild, newick, etc. should be lowercased. e.g. AddInternal -> addinternal.

4. Testing

Test coverage is quite good - around 95% of the code is hit in tests. --code-coverage shows one constructor and several parts of functions that are not covered - in particular conditional statements inside large functions. We should aim for 100% coverage.

ReducedTopology is not tested.
In cladobuild, the conditional block (if search(tp, ",") == 0:-1) to handle the case where a tree has no commas is not tested.
In GoDown, the same conditional block (if search(tp, ",") == 0:-1) is not tested.
In treewrite, the case where append == true is not tested.
In newick, the conditional if j > N is not tested.
In addTerminal, the case where name == false is not tested.

5. Repetition

There is some code repetition in the multiple versions of the build methods. Eliminating repetition by moving repeated code to functions makes maintenance easier and reduces method complexity. Functions or blocks that are either completely or mostly repeated should be broken out into functions with higher scope so all methods with the shared code can call them.
AddInternal is exactly the same in cladobuild and treebuild.
could treewrite for the case of a single tree simply add the single tree to an array and call the Array version?
the two versions of newick() share a lot of code

6. Documentation

Currently, the documentation is in the form of comments on the main functions. This is sufficient to use the code, but we should expand this to full standalone documentation for all types, functions and methods and a tutorial highlighting common use-cases.

[RFC] More nucleotides

I'd like to share and discuss ideas about DNA or RNA sequences, partially mentioned in #55.

Currently, only five nucleotides (A/C/G/T/N) are defined in this package and a DNA sequence cannot store other ambiguous nucleotides like M (A or C). This is because we use 2-bit encoding for A/C/G/T and 1-bit ambiguity mask for N. This limitation will definitely cause problems in real-world bioinformatics. Since we are creating a general purpose infrastructure for bioinformatics, this would be an intolerable limitation.

I think we should, at least, support all nucleotide base codes defined by IUPAC, which includes all possible combinations of four nucleotides (http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/iupac_nt_abbreviations.html).
If we use 4-bit encoding, we can store any nucleotides listed here. These ambiguous nucleotides are, of course, allowed in the Biostrings package of Bioconductor. An apparent drawback of 4-bit encoding is that it requires more space than the current (2+1)-bit encoding: the data size of a DNA sequence will become 4/3 = 1.333 times larger. But I think this will not be a serious problem in most cases because it is very rare to handle such a large sequence that hits the RAM limit except eukaryotic reference sequences. For such long sequences, we can prepare a reference sequence type that adopts 2-bit encoding for A/C/G/T + compressed N mask, which is already implemented in my FMM.jl package (https://github.com/bicycle1885/FMM.jl/blob/master/src/genomicseq.jl).

One advantage of this 4-bit encoding is that it will improve the performance of random access on DNA sequences. This is because the current implementation stores the N mask in a separate bit vector object, and hence it has a branch (if seq.ns[i] ... else ... end) to access an element in a sequence:

function getindex{T}(seq::NucleotideSequence{T}, i::Integer)
    if i > length(seq) || i < 1
        throw(BoundsError())
    end
    i += seq.part.start - 1
    if seq.ns[i]
        return nnucleotide(T)
    else
        return getnuc(T, seq.data, i)
    end
end

But using 4-bit encoding, we can store all the information in a single vector, so that no such branch is needed any more. That would lead to a significant performance improvement in algorithms that access sequences intensively, like pairwise alignment.
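
To show the idea, here is a minimal sketch of a 4-bit packed sequence in current Julia syntax. It is an illustration under stated assumptions, not the proposed implementation: the type name `Seq4` is made up, and the bit assignment (A = 1, C = 2, G = 4, T = 8, with each ambiguity code being the bitwise OR of the bases it stands for, and 0 used for a gap) is one possible choice. The only branch left in `getindex` is the bounds check; there is no separate N mask to consult.

```julia
# Code (0..15) to IUPAC symbol, with A = 1, C = 2, G = 4, T = 8 as bit flags.
# For example M (A or C) is 1 | 2 = 3, and N (any base) is 15.
const CODE_TO_CHAR = collect("-ACMGRSVTWYHKDBN")

function encode(c::Char)
    i = findfirst(==(uppercase(c)), CODE_TO_CHAR)
    i === nothing && throw(ArgumentError("not an IUPAC nucleotide: $c"))
    return UInt8(i - 1)
end

# Hypothetical packed sequence: 16 nucleotides per 64-bit word.
struct Seq4
    data::Vector{UInt64}
    len::Int
end

function Seq4(s::AbstractString)
    n = length(s)
    data = zeros(UInt64, cld(n, 16))
    for (i, c) in enumerate(s)
        d, r = divrem(i - 1, 16)
        data[d + 1] |= UInt64(encode(c)) << (4 * r)
    end
    return Seq4(data, n)
end

Base.length(s::Seq4) = s.len

function Base.getindex(s::Seq4, i::Integer)
    1 <= i <= s.len || throw(BoundsError(s, i))
    d, r = divrem(i - 1, 16)
    code = (s.data[d + 1] >> (4 * r)) & 0xf   # extract one 4-bit nucleotide
    return CODE_TO_CHAR[code + 1]
end

# With this encoding, a nucleotide is ambiguous when more than one bit is set.
isambiguous(c::Char) = count_ones(encode(c)) > 1
```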

EU Codefest to-do list

Bio.Phylo

Finish working through code review (@Ward9250)
PhyNode constructors: rename label to name (@Ward9250)
Full tests (@blahah)
Consider making Trees inherit from Graphs (both)
All functions inline documented (@blahah)
Getting started tutorial (@blahah)
iJulia notebook walk-through demonstration (both)

New documentation syntax

Julia v0.4 removed the need for @doc macro when documenting stuff (JuliaLang/julia#11836). If we remove it, it would make the code cleaner and easier to read.

I volunteer to change the code in master in the next week. On the other hand, other devs should be aware of the change, and make use of it in new PRs. Therefore, I will also change CONTRIBUTING.md to specify this.

Parser generator framework

The current plan is to use ragel to generate parsers for simple, regular file formats like BED, GFF, GTF, SAM, FASTQ, etc.

Todo:

Julia backend for ragel. I have a working implementation, but have not yet tested it in a systematic way.
Add gotos/labels to Julia. This is necessary for the code generated by ragel. I have a patch which is awaiting review.
Define a parser API. I have a pretty good idea of how this will look, but I need to implement a few parsers before I'm sure.
Write some example parsers.

Travis test integration

Overflow while testing parsers

Got the following message:

Interval Parsing
     - BED Parsing
Cloning into '/home/paulo/.julia/v0.4/Bio/test/BioFmtSpecimens'...
remote: Counting objects: 121, done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 121 (delta 40), reused 113 (delta 40), pack-reused 0
Receiving objects: 100% (121/121), 179.75 KiB | 0 bytes/s, done.
Resolving deltas: 100% (40/40), done.
Checking connectivity... done.
ERROR: LoadError: LoadError: OverflowError()
 in getindex at ./array.jl:317
 in prefix at /home/paulo/.julia/v0.4/YAML/src/buffered_input.jl:49
 in scan_plain_spaces at /home/paulo/.julia/v0.4/YAML/src/scanner.jl:1384
 in scan_plain at /home/paulo/.julia/v0.4/YAML/src/scanner.jl:1365
 in fetch_more_tokens at /home/paulo/.julia/v0.4/YAML/src/scanner.jl:227
 in peek at /home/paulo/.julia/v0.4/YAML/src/scanner.jl:149
 in parse_block_sequence_entry at /home/paulo/.julia/v0.4/YAML/src/parser.jl:347
 in parse_block_sequence_first_entry at /home/paulo/.julia/v0.4/YAML/src/parser.jl:339
 in peek at /home/paulo/.julia/v0.4/YAML/src/parser.jl:47
 in compose_sequence_node at /home/paulo/.julia/v0.4/YAML/src/composer.jl:119
 in compose_node at /home/paulo/.julia/v0.4/YAML/src/composer.jl:79
 in compose_document at /home/paulo/.julia/v0.4/YAML/src/composer.jl:43
 in compose at /home/paulo/.julia/v0.4/YAML/src/composer.jl:31
 in load_file at /home/paulo/.julia/v0.4/YAML/src/YAML.jl:76
 in anonymous at /home/paulo/.julia/v0.4/Bio/test/intervals/test_intervals.jl:348
 in context at /home/paulo/.julia/v0.4/FactCheck/src/FactCheck.jl:341
 in anonymous at /home/paulo/.julia/v0.4/Bio/test/intervals/test_intervals.jl:328
 in facts at /home/paulo/.julia/v0.4/FactCheck/src/FactCheck.jl:315
 in include_from_node1 at ./loading.jl:133
while loading /home/paulo/.julia/v0.4/Bio/test/intervals/test_intervals.jl, in expression starting on line 327
while loading /home/paulo/.julia/v0.4/Bio/test/runtests.jl, in expression starting on line 11

Is anyone having the same issue?

Alignment - Support the Padding operation.

TL;DR - Is anything preventing us from supporting CIGAR Pad operations right now?

In my line of work, alignments with software like Clustal generate reference-less alignments. E.g.

Seq1
CGATCA--GACCGATA
Seq2
CGATCAGAGACCGATA
Seq3
CGATCA-AGACCGATAC

This kind of alignment is important to phylogenetics and evolutionary study.
My perception (anyone may disagree) is that this was common when we study gene models, fasta formatted files output from Clustal in evolutionary study, and when we use programs associated with evolutionary study (MEGA, Clustal, Mr Bayes) they all accepted this kind of "Multiple Alignment Fasta" (I'm told MAF means something slightly different to HT Bioinformaticians) representation of alignments. But with HT Sequencing the reference approach became popular and so when I ask people about multiple alignment fasta files - I get weird looks, even when I say "You know - that thing Clustal outputs!". As an evolutionary biologist this multiple aligned fasta format is natural to me - why should one sequence be a reference to the others? They are all independently evolved! (and people are moving away from reference based thinking as pan-genomics grows).

Recently I tried to experiment and adapt the fasta parser for this kind of gapped fasta file, using the alignment types in the align module, but this required padding operations, and I hit an error saying Padding is not supported yet. Why is it not supported, and what is preventing it from being supported at the moment? What needs to be done (I'm asking because I will do some of it!)?

Integration with BioJS for visualisation

It seems to make sense that rather than try to implement visualisations of our own for our Bio types, we use what the BioJS people are building. This should make integration with iJulia notebooks easy, should be compatible at some level with Gadfly (they can both use D3.js, for example - not sure how much deeper it goes), and means that plotting is as simple as converting our type to a JS-compatible representation, i.e. JSON, then feeding it to BioJS.

Thoughts?

Publish on METADATA

I believe once #37 and #40 are merged, we could publish Bio.jl on Julia repository, and make it available on the pkg manager as soon as possible.

Parsing Infrastructure Iteration

RFC: Using external programs with BioJulia.

Opening as a result of discussion on #36.

Let's discuss how external programs should be used with Bio.jl

[RFC] Next-gen Read QC library

Hi all,

As a bit of a learning experience, I'd like to implement a next-gen read QC library. Feature brainstorming:

sickle-like windowed quality trimming
Needleman-Wunsch based paired-end read merge/trimmer (a la libqc++)
scythe-like Bayesian single-end adaptor trimmer
Fastqc-style quality/bias/gc content/kmer enrichment measurement

Overall features

All reports for a qc pipeline available as a single YAML file, which can be compiled into an HTML report
read-block based parallelism (i.e, chunks of ~2k reads read, processed in parallel, written)

Possible features:

An Illumina v4 (or is it v5) chemistry-specific trimmer, to remove stretches of high phred score poly-G from reads, which you get if a cluster falls off the flow cell, resulting in a very nice clear, dark spot.

Feel free to suggest any other features. The actual implementation of this stuff is probably going to be a bit slow, as I'll be busy for the next month and a bit. But I'll tinker in my spare time 😄.

Cheers,
K

Missing things (before deprecation of BioSeq.jl)

List of things that need to be present on Bio.jl before deprecation of BioSeq.jl

Important:

Gap representation and gapped sequences; replace BioSeq's aminoacid('-')
Alignment representations: in order to replace BioSeq's Array{AminoAcid,2}

Optional:

Matching functions (search, replace and others):
- IUPAC Regex
- PROSITE patterns
8-bit Bit-Level Coding Scheme for Nucleotides

Website

We need a website (not urgent)

Visually enhanced notations

Maybe, in the future, we can add support for other visualizations, different from ASCII-based IUPAC

Visualization

Visualization tools are a great feature.
We need to include some visualization tools like BioPython

Maybe we can use BioJS in order to have interactive graphics on the IPython Notebook.
https://github.com/biojs/biojs

benchmark comparisons

Hi everyone,

Has anyone specifically benchmarked BioJulia against other libraries commonly used in the field of bioinformatics? I think that having such stats would be very useful.

Thanks,
Gideon.

Documentation

We need docs, and at last, Julia has a package that supports literate programming.

We should:

implement literate programming across the whole module using Docile.jl
generate documentation from the Docile annotations using Lexicon.jl
ideally, generate a readTheDocs interface

To-do list:

Add Docile
Use it to annotate the Phylo submodule
Add Lexicon
Use it to write documentation for the Phylo submodule, integrating the Docile annotations
annotate the Seq submodule
write documentation for the Seq submodule
write general documentation for the whole library

Restriction Sites

It would be nice to have a tool indicate how DNA is cut using restriction enzymes. I found a list of common Restriction enzymes with their restriction sites here:

http://www.web-books.com/MoBio/Free/Ch9A3.htm

I am guessing the function would be something like this:

Input: Fasta file or sequence, Restriction enzyme

Output: Number of cuts or number of restriction sites.

Output 2: List containing DNA fragment 1 and DNA fragment 2
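
A first sketch of such a function, in current Julia syntax (Julia 1.3 or later for `findall` on strings), could look like the following. It works on plain strings rather than Bio.jl sequence types, and the function names and the `offset` parameter (the number of bases of the recognition site left on the upstream fragment) are assumptions for this example. EcoRI, which recognises GAATTC and cuts after the first G, is used as the test case.

```julia
# Positions (1-based) of the last base before each cut.
# `offset` is how many bases of the recognition site stay on the left fragment.
function cutsites(seq::AbstractString, site::AbstractString, offset::Integer)
    return [first(r) + offset - 1 for r in findall(site, seq)]
end

# Split the sequence at every cut position and return the fragments in order.
function digest(seq::AbstractString, site::AbstractString, offset::Integer)
    fragments = String[]
    start = 1
    for pos in cutsites(seq, site, offset)
        push!(fragments, seq[start:pos])
        start = pos + 1
    end
    push!(fragments, seq[start:end])
    return fragments
end
```

For example, `digest("AAGAATTCTTGAATTCGG", "GAATTC", 1)` splits the sequence at both EcoRI sites, and `length(cutsites(...))` gives the number of restriction sites. This sketch only handles linear sequences, one strand, and non-overlapping matches of an exact (non-degenerate) recognition site.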

Travis code coverage integration

As described by @dcjones, we can get code coverage analysis with:

julia --code-coverage test.jl

A Travis hook for this that lets us see coverage for the main repo and for pull requests would make reviewing new contributions easier.