Several parts of our webapp, such as the FastaProcessor, will have performance requirements. We need to investigate possible ways to measure and verify them.
Acceptance Criteria
Come up with a way to measure the performance of test code (e.g. ScalaMeter)
FASTA might not be a complex format, but it still poses a lot of challenges for asynchronous parsers. Come up with a prototype of an asynchronous parser that uses Akka streams.
Acceptance Criteria
It should parse "normal" FASTA files. No special format has to be supported as of now.
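The main difficulty for an asynchronous parser is that chunks arrive at arbitrary byte boundaries, so a header or sequence line can be split across two chunks. A minimal sketch of that idea in plain Scala (no Akka), assuming chunks have already been decoded to text. `FastaEntry` and `ChunkedFastaParser` are hypothetical names for illustration and not the actual FastaProcessor; the same `step` function could be driven by a stream stage instead of a fold.

```scala
// Hypothetical model of one FASTA record (illustrative only).
final case class FastaEntry(header: String, sequence: String)

object ChunkedFastaParser {
  // State carried between chunks: an unfinished line and the entry being built.
  final case class State(partialLine: String = "", current: Option[FastaEntry] = None)

  // Consume one chunk; return the new state plus all entries completed by it.
  def step(state: State, chunk: String): (State, Vector[FastaEntry]) = {
    val lines = (state.partialLine + chunk).split("\n", -1) // -1 keeps a trailing ""
    var current = state.current
    val done = Vector.newBuilder[FastaEntry]
    lines.init.foreach { raw =>
      val line = raw.trim // also strips a trailing '\r'
      if (line.startsWith(">")) {
        current.foreach(done += _) // a new header closes the previous entry
        current = Some(FastaEntry(line.drop(1).trim, ""))
      } else if (line.nonEmpty) {
        // Assumption: sequence lines before the first header are ignored.
        current = current.map(e => e.copy(sequence = e.sequence + line))
      }
    }
    (State(lines.last, current), done.result())
  }

  // Flush the last (possibly unterminated) line and the entry still open.
  def finish(state: State): Vector[FastaEntry] = {
    val (last, done) = step(state, "\n")
    done ++ last.current.toList
  }

  def parseAll(chunks: Iterable[String]): Vector[FastaEntry] = {
    val (state, entries) =
      chunks.foldLeft((State(), Vector.empty[FastaEntry])) {
        case ((st, acc), chunk) =>
          val (next, done) = step(st, chunk)
          (next, acc ++ done)
      }
    entries ++ finish(state)
  }
}
```

With this sketch, empty input yields no entries rather than an error, and a header split across two chunks is reassembled before it is interpreted.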
While playing with the FileUpload feature it became clear that the FastaProcessor is not stable enough to be used in production (it is just a prototype). Some edge cases that lead to trouble are:
Empty FASTA file / bytestring
Add more FASTA files to the tests to check the parser against other FASTA styles
The webapp should be almost entirely controllable via REST. As of now, there are only a few controllers but as this number increases we have to document them systematically.
Acceptance Criteria
Investigate how to document our REST API (e.g. Swagger)
Document all present REST controllers with the found solution
This should serve as a mock for a future feature in which users upload a (FASTA) file and search our index against its contents. Because there is no index as of now, it suffices to show the content of the file to check whether our FASTA processor really works/performs.
Acceptance Criteria
When the file is uploaded, all its contents are shown on the page.
The application is starting to have some parameters, so we should investigate how to handle configuration properly in Scala. A good start might be: https://github.com/lightbend/config
To really make use of our super cool FileUpload pipeline, we have to create a sink to pipe all the incoming data into, which in the end is Lucene.
That is why we have to come up with a custom sink for this.
At some point in time we have to care about styling. This issue should prototype a very basic style to look how we can integrate it into our build process.
I (personally) dislike runtime dependency injection. I want the compiler to check that everything is wired correctly. This ticket should migrate the application from the Play default (runtime) to reader monads (compile-time).
This may not be super urgent but it is easier to switch now than when the application is large.
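To illustrate what compile-time wiring with a reader monad could look like, here is a minimal hand-rolled sketch. `Reader`, `Env`, `SequenceStore` and `preview` are hypothetical names for this example only; a library implementation could be used instead. If a dependency is missing from `Env`, the code does not compile, which is the property asked for above.

```scala
object ReaderDemo {
  // Minimal Reader: a computation that needs an environment E to produce an A.
  final case class Reader[E, A](run: E => A) {
    def map[B](f: A => B): Reader[E, B] = Reader(e => f(run(e)))
    def flatMap[B](f: A => Reader[E, B]): Reader[E, B] =
      Reader(e => f(run(e)).run(e))
  }

  // Hypothetical dependencies of the application.
  trait SequenceStore { def lookup(id: String): Option[String] }
  final case class Env(store: SequenceStore, maxLength: Int)

  // Each function declares what it needs through the Env type.
  def fetch(id: String): Reader[Env, Option[String]] =
    Reader(env => env.store.lookup(id))

  val maxLength: Reader[Env, Int] = Reader(env => env.maxLength)

  // Composition never mentions a concrete Env; it is supplied once at the edge.
  def preview(id: String): Reader[Env, Option[String]] =
    for {
      found <- fetch(id)
      limit <- maxLength
    } yield found.map(_.take(limit))
}
```

The environment is provided exactly once, for example `ReaderDemo.preview("seq1").run(env)` at application start-up or in a test with a stub store.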
We need a way of specifying parameters of our project. E.g. our configuration can host production and test databases, but the database that is actually used at runtime still has to be specified.
Acceptance Criteria
Add a command-line parser to the project and evaluate properties like the database to be used.
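A rough sketch of what such a parser could do, written with the standard library only. The flag names `--database` and `--verbose`, the `Settings` class and the set of known databases are assumptions made for this example; a dedicated command-line parsing library would likely replace the hand-written match.

```scala
object CliArgs {
  // Assumed database names, taken from the production/test example above.
  val knownDatabases: Set[String] = Set("production", "test")

  // Hypothetical settings with defaults that are safe for local runs.
  final case class Settings(database: String = "test", verbose: Boolean = false)

  // Walk the argument list; return an error message for anything unexpected.
  @scala.annotation.tailrec
  def parse(args: List[String], acc: Settings = Settings()): Either[String, Settings] =
    args match {
      case Nil =>
        Right(acc)
      case "--database" :: name :: rest if knownDatabases(name) =>
        parse(rest, acc.copy(database = name))
      case "--verbose" :: rest =>
        parse(rest, acc.copy(verbose = true))
      case other :: _ =>
        Left(s"Unknown, incomplete or invalid option: $other")
    }
}
```

Returning an `Either` keeps the decision of how to report a bad invocation (print usage, exit code) with the caller.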
After spending a month digging deep into the Play framework and Akka, I can hardly see an advantage in using Play instead of just using Akka HTTP. A lot of things should get simpler when we switch to Akka HTTP entirely, such as compile-time dependency injection.
Further down the road I want this project to have a UI. This should be done with a webapp. To clearly separate the logic of the webapp from other parts of the project we have to create a subproject.
The Lucene indexing logic (its tokenizers and so on) is independent of Akka streams and our domain models. Therefore it should be separated into a subproject (named index).
Acceptance Criteria
Create a subproject that contains all the Lucene indexing logic
In issue #25 we investigated the possibility of also indexing the reverse complement. It turned out that this is the wrong approach. Instead, we should additionally reverse complement the search query and search with both queries against the same index.
Acceptance Criteria
Add a checkbox in the UI that toggles whether the reverse complement should also be searched for.
Adjust the REST endpoint for searching accordingly.
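The query-side approach could look roughly like the following sketch. `QueryExpansion` and the `includeReverseComplement` flag (standing in for the checkbox state) are hypothetical names; mapping unknown characters to `N` is an assumption, not a decision made in this issue.

```scala
object QueryExpansion {
  private val complement: Map[Char, Char] =
    Map('A' -> 'T', 'C' -> 'G', 'G' -> 'C', 'T' -> 'A')

  // Reverse the sequence, then swap each base for its complement.
  // Assumption: anything that is not A, C, G or T becomes 'N'.
  def reverseComplement(dna: String): String =
    dna.toUpperCase.reverse.map(base => complement.getOrElse(base, 'N'))

  // The queries to run against the single index.
  def queries(query: String, includeReverseComplement: Boolean): List[String] = {
    val forward = query.toUpperCase
    if (includeReverseComplement)
      List(forward, reverseComplement(forward)).distinct // palindromes collapse to one query
    else
      List(forward)
  }
}
```

For example, `QueryExpansion.queries("acgtt", includeReverseComplement = true)` gives `List("ACGTT", "AACGT")`.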
Code formatters are a safety net for the programmer in case they (or their IDE) missed something. They also establish a common standard for everyone who is contributing.
As an initial hosting provider we can use Heroku. Not only is it free, it also offers hooks into our CI.
Additionally, Heroku is something nice to learn about.
Acceptance Criteria
Upon pushing into master, our app should be deployed to Heroku.
The ultimate goal of the project is to provide a (near) real-time search experience against large sequence data sets such as NCBI. To accomplish this, our indexing process must be a lot smarter. Fortunately, Lucene is extremely customizable.
We should write our own tokenizer which follows best practices of algorithmic/biological pattern matching:
Splitting into k-mers
Also considering the reverse complement
As always at this early stage of the project, we don't have to be perfect here. It suffices to make it work without extremely high latencies.
Acceptance Criteria
Create a custom Lucene tokenizer having the 2 mentioned properties
For now, it suffices to only care about DNA sequences and ignore RNA and proteins
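The two properties can be sketched independently of Lucene as a plain function from a DNA sequence to tokens. `KmerTokens` is a hypothetical name, and emitting the k-mers of both strands is only one possible reading of "considering the reverse complement"; an actual implementation would wrap logic like this in a Lucene `Tokenizer` subclass, which is not shown here.

```scala
object KmerTokens {
  private val complement: Map[Char, Char] =
    Map('A' -> 'T', 'C' -> 'G', 'G' -> 'C', 'T' -> 'A')

  // Assumption: DNA only; any other character is mapped to 'N'.
  def reverseComplement(dna: String): String =
    dna.reverse.map(base => complement.getOrElse(base, 'N'))

  // All overlapping substrings of length k, in order of occurrence.
  def kmers(sequence: String, k: Int): List[String] =
    if (k <= 0 || sequence.length < k) Nil
    else sequence.sliding(k).toList

  // k-mers of the forward strand followed by those of the reverse complement.
  def tokens(sequence: String, k: Int): List[String] = {
    val forward = sequence.toUpperCase
    kmers(forward, k) ++ kmers(reverseComplement(forward), k)
  }
}
```

A sequence shorter than `k` produces no tokens in this sketch; whether such sequences should be indexed at all is left open.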
Focusing on ScalaMeter was a premature decision. It turned out not to be a mature benchmark framework. A small investigation into JMH turned out to be very successful. Additionally, JMH seems to be the industry standard.
As a first visual step towards a search engine, a search bar is needed. This issue should also serve as a blueprint for how to integrate the JavaScript ecosystem into the project.
Acceptance Criteria
When loading the website, the user should be greeted with a search bar.