Comments (5)
Hi Colin,
can you confirm that the Docker daemon is running when you try the docker pull
step?
It needs to be installed and running for any Docker operation, and therefore for Companion, to succeed. The error message suggests that it is not running.
You can access Docker without the need for sudo if your user is in the docker
group. The posting https://askubuntu.com/questions/477551/how-can-i-use-docker-without-sudo explains how to do that.
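As a rough way to check both points before pulling the image, here is a minimal Python sketch. It assumes a Unix-like system, that the Docker client is on the PATH, and that the group is named `docker` as in the linked post; the function names are made up for illustration. It relies on `docker info` exiting with a non-zero status when the client cannot talk to the daemon, so a `False` result can also mean a permissions problem on the Docker socket rather than a stopped daemon.

```python
import grp
import os
import subprocess


def docker_daemon_running(timeout: float = 10.0) -> bool:
    """Return True if `docker info` succeeds, i.e. the client can reach a daemon."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (FileNotFoundError, PermissionError, subprocess.TimeoutExpired):
        # Docker client missing, not executable, or not responding in time.
        return False
    return result.returncode == 0


def in_docker_group() -> bool:
    """Return True if the current process belongs to a group named `docker`."""
    try:
        docker_gid = grp.getgrnam("docker").gr_gid
    except KeyError:
        # No group named "docker" exists on this system.
        return False
    return docker_gid in os.getgroups() or docker_gid == os.getegid()


if __name__ == "__main__":
    print("daemon reachable:", docker_daemon_running())
    print("in docker group:", in_docker_group())
```

Note that a newly added group membership usually only takes effect after logging out and back in, so the second check can report `False` right after the user has been added.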
Just let me know if that helped.
Cheers, Sascha
from companion.
Thanks for the tip, and apologies, as I am very new to Docker.
The whole pipeline seems very useful. What is your take on the effort required to extend it to additional use cases, e.g. annotation of plant genomes?
Would it be a case of cloning the Dockerfile and adding or replacing pathogen-specific data resources with plant data sources? I see you have nice documentation on extending some algorithms, e.g. AUGUSTUS, to new references, for example with pathogen-specific hint files. Would the effort required be a lot bigger to extend Companion to plant genomes?
Cheers,
Colin
from companion.
In general, few assumptions are made about the kind of genome, and most things are configurable in one way or another (e.g. weight files). Hints are created, for example, from reference protein alignments, so if there is a valid reference DB, hints will be created and used automatically. It's just a matter of preparing the reference DB from appropriate input.
However, the most problematic point I see is that some of the tools used in the pipeline (e.g. RATT, ABACAS) are potential bottlenecks, as they are not particularly memory efficient. They were written specifically for small (<100 Mb) genomes in the first place. Also, some of the intermediate steps assume that one can just load all of the sequence into main memory. This means that for larger plant genomes such as wheat, or more generally for any plant use case with a genome larger than, for example, that of Arabidopsis, these components would probably need to be looked at and optimised before being usable on a desktop computer. Repeat masking and annotation also come into play massively here and are not directly addressed yet.
We haven't really tried this as it wasn't the focus of the project.
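To get a feel for whether an input falls into the small-genome range mentioned above, one could total the sequence lengths of a FASTA file without loading the sequences themselves into memory. The following is only an illustrative Python sketch and not part of Companion; the function name and the `SMALL_GENOME_BP` constant are made up for this example, with the constant simply mirroring the "<100 Mb" figure.

```python
import io
from typing import Dict, Optional, TextIO

SMALL_GENOME_BP = 100_000_000  # hypothetical threshold mirroring "<100 Mb"


def sequence_lengths(handle: TextIO) -> Dict[str, int]:
    """Stream a FASTA file and return {sequence_id: length}.

    Only the running counts are kept, never the sequences themselves.
    """
    lengths: Dict[str, int] = {}
    current: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Tiny in-memory example; a real run would pass an open file handle instead.
example = io.StringIO(">chr1 test\nACGT\nAC\n>chr2\nGGG\n")
lengths = sequence_lengths(example)
total = sum(lengths.values())
print(lengths, total, total < SMALL_GENOME_BP)
# prints: {'chr1': 6, 'chr2': 3} 9 True
```

The per-sequence lengths also show whether a single chromosome or scaffold would already exceed the range the tools were written for.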
from companion.
OK, thanks for the details. I will look into the pipeline in more detail, given that it is possible in principle. Memory issues should not be a huge problem if you have big-memory servers (as we do) and run the analysis chromosome by chromosome, or even per scaffold or BAC. Tasks of this scale can be very common in the non-model plant world.
As you point out, plant genomes do tend to vary widely. You can however get a lot of mileage out of 0.1 to 1 GB plant genomes, even in the crops.
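The chromosome-by-chromosome approach described above could be prepared with something like the following Python sketch, which splits a multi-FASTA file into one file per sequence so that each piece can be run separately. This is an assumption about how one might stage the input, not something Companion provides; the function names and the `seq_00001.fasta` naming scheme are invented for illustration. Files are numbered rather than named after the headers so that special characters in headers cannot end up in file names.

```python
import io
import os
import tempfile
from typing import Iterator, List, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header, sequence_lines) for one record at a time."""
    header = None
    lines: List[str] = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line[1:], []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield header, lines


def split_fasta(handle: TextIO, out_dir: str) -> List[str]:
    """Write each record to its own numbered file in out_dir; return the paths."""
    paths: List[str] = []
    for index, (header, lines) in enumerate(iter_fasta(handle), start=1):
        path = os.path.join(out_dir, f"seq_{index:05d}.fasta")
        with open(path, "w") as out:
            out.write(f">{header}\n")
            out.write("\n".join(lines) + "\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    demo = io.StringIO(">scaffold1 demo\nACGT\nTTGA\n>scaffold2\nGGCC\n")
    with tempfile.TemporaryDirectory() as tmp:
        for path in split_fasta(demo, tmp):
            print(os.path.basename(path))
```

Only one record is held in memory at a time, so peak memory is bounded by the largest single chromosome or scaffold rather than by the whole genome.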
from companion.
Glad to be of help. Closing this for now; just let me know if there are any more questions.
from companion.