
elvers's Introduction

elvers


                           ___
                        .-'   `'.
                       /         \
                      |           ;
                      |           |           ___.--,
             _.._     |O)  ~  (O) |    _.---'`__.-( (_.       
      __.--'`_.. '.__.\      '--. \_.-' ,.--'`     `""`
     ( ,.--'`   ',__ /./;     ;, '.__.'`    __
     _`) )  .---.__.' / |     |\   \__..--""  """--.,_
    `---' .'.''-._.-'`_./    /\ '.  \_.-~~~````~~~-.__`-.__.'
          | |  .' _.-' |    |  \  \  '.
           \ \/ .'     \    \   '. '-._)
            \/ /        \    \    `=.__`-~-.
            / /\         `)   )     / / `"".`\
      , _.-'.'\ \        /   /     (  (   /  /
       `--~`  )  )    .-'  .'       '.'. |  (
             (/`     (   (`           ) ) `-;
              `       '--;            (' 

elvers started as a snakemake update of the Eel Pond Protocol for de novo RNAseq analysis. It has evolved slightly to enable a number of workflows for (mostly) RNA data, which can all be run via the elvers workflow wrapper. elvers uses snakemake for workflow management and conda for software installation. The code can be found at https://github.com/dib-lab/elvers.

Getting Started

Linux is the recommended OS. Nearly everything also works on macOS, but some programs (fastqc, Trinity) are troublesome.

If you don't have conda yet, install miniconda (the commands below are for Linux x86_64, e.g. an Ubuntu 16.04 Jetstream image):

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Be sure to answer 'yes' to all yes/no questions. You'll need to restart your terminal for conda to be active.

Create a working environment and install elvers!

elvers needs a few programs installed in order to run properly. To handle this, we run elvers within a conda environment that contains all dependencies.

Get the elvers code

git clone https://github.com/dib-lab/elvers.git
cd elvers

When you first get elvers, you'll need to create this environment on your machine:

conda env create --file environment.yml -n elvers-env

Now, activate that environment:

conda activate elvers-env

To deactivate after you've finished running elvers, type conda deactivate. You'll need to reactivate this environment anytime you want to run elvers.

Now, install the elvers package:

pip install -e .

Now you can start running workflows on test data!

Default workflow: Eel Pond Protocol for de novo RNAseq analysis

The Eel Pond protocol (which inspired the elvers name) included line-by-line commands that the user could follow along with using a test dataset provided in the instructions. We have re-implemented the protocol here to enable automated de novo transcriptome assembly, annotation, and quick differential expression analysis on a set of short-read Illumina data using a single command. See more about this protocol here.

To test the default workflow:

elvers examples/nema.yaml default

This will download and run a small set of Nematostella vectensis test data (from Tulin et al., 2013).

Running Your Own Data

To run your own data, you'll need to create one or more files:

  • a yaml file containing basic configuration info

This yaml config file must specify either:

  • a tsv file containing your read sample info, or
  • a reference file input (a .fasta file, plus an optional gene_trans_map)

Generate these files by following instructions here: Understanding and Configuring Workflows.
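
As a rough illustration, a config might look something like the sketch below. The key names here are illustrative, not authoritative ('basename' appears in the real configs; the rest is assumption) -- see examples/nema.yaml for the actual template:

    # hypothetical sketch only -- consult examples/nema.yaml for real keys
    basename: my_experiment     # prefix used for output files
    samples: my_samples.tsv     # tsv file with read sample info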

Available Workflows

  • preprocess: Read Quality Trimming and Filtering (fastqc, trimmomatic)
  • kmer_trim: Kmer Trimming and/or Digital Normalization (khmer)
  • assemble: Transcriptome Assembly (trinity)
  • annotate: Annotate the transcriptome (dammit)
  • sourmash_compute: Build sourmash signatures for the reads and assembly (sourmash)
  • quantify: Quantify transcripts (salmon)
  • diffexp: Conduct differential expression (DESeq2)
  • plass_assemble: Assemble at the protein level (PLASS)
  • paladin_map: Map reads to a protein assembly (paladin)

End-to-end workflows:

  • default: preprocess, kmer_trim, assemble, annotate, quantify
  • protein assembly: preprocess, kmer_trim, plass_assemble, paladin_map

You can see the available workflows (and which programs they run) by using the --print_workflows flag:

elvers examples/nema.yaml --print_workflows

Each included tool can also be run independently, if appropriate input files are provided. This is not always intuitive, so please see our documentation for running each tool (described as "Advanced Usage") for details. To see all available tools, run:

elvers examples/nema.yaml --print_rules

Citation information

This is pre-publication code; a manuscript is in preparation. If you wish to use and cite elvers, please contact the authors for current citation information.

Additional Info

See the built-in help:

elvers -h

elvers's People

Contributors: bluegenes, charlesreid1, ctb, katrinleinweber, maligang

elvers's Issues

Flag for saving DAG

Would be useful to have a --save-dag flag that would dump the snakemake graphviz dot output into a .dot file, then run the right dot command to convert it to a png. dot is not pip-installable, but is likely conda-installable.

This relates to #10 - graphviz dot can be installed with conda (ref: https://anaconda.org/anaconda/graphviz) so it can be added to the environment.yml to ensure that users are able to run this flag.

That would require committing to conda rather than pip, but I think you will find plenty of support around the DIB lab for conda and friends. ;)
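
For reference, the pair of commands such a flag would wrap is roughly the following (a sketch; assumes graphviz's dot is on the PATH):

    snakemake --dag > workflow.dot
    dot -Tpng workflow.dot > workflow.png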

'basename' may not be properly quoted

A student used a space in 'basename' in the config yaml file, and it looks like that will break shell commands that aren't properly quoted. I think single quotes should work.

We can add in some tests that do this for basenames containing $ and :)
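
A minimal sketch of the fix inside a snakemake rule (the rule and file names here are hypothetical; the :q format flag is snakemake's built-in shell-quoting):

    rule rename_assembly:
        input: config["basename"] + ".fasta"
        output: config["basename"] + "_renamed.fasta"
        # {...:q} shell-quotes the interpolated value, so spaces or $ in basename are safe
        shell: "cp {input:q} {output:q}"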

make 'salmon' or 'quantify' an equal citizen

A common use case for transcriptomics doesn't involve de novo assembly, but rather just quantification or (perhaps) annotation + quantification, for semi-model organisms like dog and cat which have good reference genomes but not great transcript annotations.

While eelpond can probably handle both of these, the docs for e.g. salmon really focus on addressing the de novo transcriptome assembly case:

https://github.com/dib-lab/eelpond/blob/master/docs/salmon.md

but I think that eelpond could usefully serve the reference-based case as well.

check replicates for diffexp analysis

  • DESeq2 only works with replicates -- check sample replicate info before attempting the deseq2 workflow (see the sketch below).
  • Add edgeR rule for no-replicate analysis.
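
A sketch of the replicate check, assuming a samples.tsv with 'sample' and 'condition' columns (column names are illustrative, not the confirmed schema):

    import pandas as pd

    samples = pd.read_csv("samples.tsv", sep="\t")
    # count distinct samples per condition; DESeq2 needs at least 2 per group
    replicates = samples.groupby("condition")["sample"].nunique()
    if (replicates < 2).any():
        missing = list(replicates[replicates < 2].index)
        raise SystemExit(f"deseq2 workflow requires replicates; conditions lacking them: {missing}")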

one possible approach to testing: inspect command lines

We now have the basic assembly under CI (yay!), up through annotation. But I'm wary of adding too many more execution tests on Travis, because they take real time and memory and disk space.

eelpond is all about getting parameters from config file(s) to downstream programs.

So, what about inspecting the command line that is output by snakemake -n as a way to determine whether configuration parameters are making it "out" to downstream programs?
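
For example (a sketch; LEADING is an arbitrary trimmomatic parameter chosen for illustration):

    # -n = dry run, -p = print the shell commands that would be executed
    snakemake -n -p | grep 'LEADING'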

make preparing samples.tsv easier

I spent a frustrating time debugging tabs vs spaces vs...

a few thoughts:

  • allow yaml input
  • allow csv input
  • allow excel input (??)
  • provide a utility script for checking it (see the sketch below)
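
A sketch of such a checker (the column names are assumed for illustration, not the confirmed samples.tsv schema):

    import csv, sys

    REQUIRED = {"sample", "unit", "fq1"}   # assumed columns; fq2 optional for single-end

    with open(sys.argv[1], newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            sys.exit(f"missing required columns {sorted(missing)} -- is the file really tab-separated?")
        for lineno, row in enumerate(reader, start=2):
            if any(v and " " in v for v in row.values()):
                print(f"line {lineno}: field contains spaces -- mixed tabs/spaces?")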

fix force options in run_eelpond

--forcetargets doesn't always work as desired, particularly for steps with intermediate targets that are not passed as targets to snakemake (e.g. dammit).

submodule 'databases' fails when `-t 6` specified

See error output below.

Without -t it works fine.

To rerun, I just remove the 'databases' directory under eelpond/ and re-run eelpond.

Installing...
#### Run Tasks
- [ ] download:Pfam-A.hmm.gz:
    * Cmd: `curl -o Pfam-A.hmm.gz ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam28.0/Pfam-A.hmm.gz`
    * Python: function check_hash
- [ ] hmmpress:Pfam-A.hmm:
    * Cmd: `/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/hmmpress /home/titus/eelpond/databases/Pfam-A.hmm`
TaskFailed - taskid:hmmpress:Pfam-A.hmm
Command failed: '/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/hmmpress /home/titus/eelpond/databases/Pfam-A.hmm' returned 1

- [ ] download:Rfam.cm.gz:
    * Cmd: `curl -o Rfam.cm.gz ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.1/Rfam.cm.gz`
    * Python: function check_hash
- [ ] download:aa_seq_euk.fasta.gz:
    * Cmd: `curl -o aa_seq_euk.fasta.gz ftp://cegg.unige.ch/OrthoDB8/Eukaryotes/FASTA/aa_seq_euk.fasta.gz`
    * Python: function check_hash
- [ ] cmpress:Rfam.cm:
    * Cmd: `/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/cmpress /home/titus/eelpond/databases/Rfam.cm`
- [ ] lastdb:aa_seq_euk.fasta:
    * Cmd: `/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/lastdb -p -w3 /home/titus/eelpond/databases/aa_seq_euk.fasta /home/titus/eelpond/databases/aa_seq_euk.fasta`
TaskFailed - taskid:cmpress:Rfam.cm
Command failed: '/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/cmpress /home/titus/eelpond/databases/Rfam.cm' returned 1

TaskFailed - taskid:lastdb:aa_seq_euk.fasta
Command failed: '/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/lastdb -p -w3 /home/titus/eelpond/databases/aa_seq_euk.fasta /home/titus/eelpond/databases/aa_seq_euk.fasta' returned 1

osf downloads do not always work on travis ci

this may be because of rate limits on osf, or network issues on travis build machines. see e.g. https://travis-ci.org/dib-lab/eelpond/builds/483913645, which has a bunch of the following errors:

Traceback (most recent call last):
  File "/home/travis/miniconda/envs/test-env/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 453, in wrap_socket
    cnx.do_handshake()
  File "/home/travis/miniconda/envs/test-env/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1907, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "/home/travis/miniconda/envs/test-env/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1632, in _raise_ssl_error
    raise SysCallError(-1, "Unexpected EOF")
OpenSSL.SSL.SysCallError: (-1, 'Unexpected EOF')

make editing configfile easier

Provide an option to print program parameters to stdout instead of to a file, so it's easier to add them to an existing configfile?

'quantify' target is currently failing.

% ./run_eelpond nema-download.yaml quantify

--------
checking for required files:
--------

        Added default parameters from rule-specific params files.
        Writing full params to .ep_nema-download.yaml
--------
details!
        snakefile: /home/diblions/eelpond/Snakefile
        config: nema-download.yaml
        params: .ep_nema-download.yaml
        targets: ['quantify']
        report: '/home/diblions/eelpond/nema_out/logs/report.html'
--------
Building DAG of jobs...
MissingInputException in line 46 of /home/diblions/eelpond/rules/salmon/salmon.rule:
Missing input files for rule salmon_index:
/home/diblions/eelpond/nema_out/assembly/__assembly__.fasta

support the execution of a bunch of notebooks upon completion

it would be great to have both standard and customized execution of notebooks after a run!

Not sure how to make 'em conditional on the workflow run, but that can be a detail we figure out later.

I think Jupyter Notebooks and RMarkdown notebooks would be the primary things to support here. Would need to parameterize the notebooks e.g. via papermill.
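
Parameterized execution via papermill would look roughly like this (the notebook and parameter names are hypothetical; the CLI shape is papermill's real input/output/-p interface):

    papermill diffexp_report.ipynb nema_out/diffexp_report.ipynb -p quant_dir nema_out/quant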

resource usage!

on a Jetstream m1.medium, second run of nema-test (so, all software installed etc. etc.)

/usr/bin/time -v reported:

        Command being timed: "./run_eelpond nema-test full"
        User time (seconds): 1637.94
        System time (seconds): 706.61
        Percent of CPU this job got: 93%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 41:38.69
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3926492
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 545
        Minor (reclaiming a frame) page faults: 288161380
        Voluntary context switches: 958956
        Involuntary context switches: 161188
        Swaps: 0
        File system inputs: 179680
        File system outputs: 22846136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

41 minutes, 4 GB of RAM, 120 MB of disk space used, plus 14 GB of databases (presumably these are the dammit databases?)

Looking into the software, this is a nearly clean miniconda install followed by the eelpond run, so:

5.6 GB in /opt/miniconda/pkgs - this is probably the downloaded pkgs
1.3 GB in /opt/miniconda/envs/eelpond/ - this is the installed software.

So I think we can probably recommend that you have 25 GB of free space after conda install in order to even begin to run eelpond (b/c of databases), + space for reads and assembly and so on.

remove large files from git commit history

The whole repo history is in the ~100 MB range, while the current master is only a few MB. This makes it unnecessarily slow to check out the git repo :)

Using the command here, it looks like we need to go remove any trace of testing_out/ from the history. Not quite sure how to do that but this github help article looks legit.
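
The standard filter-branch recipe for this (the one the GitHub help article describes), applied to testing_out/, is roughly the following -- run it on a fresh clone and coordinate with other contributors first, since it rewrites history:

    git filter-branch --force --index-filter \
      'git rm -r --cached --ignore-unmatch testing_out/' \
      --prune-empty --tag-name-filter cat -- --all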

how to specify an alternative location for databases?

right now, the databases for dammit are put in databases/ within the run directory, and (if you trash the eelpond repo, as Travis does) they need to be re-downloaded and re-installed.

is it possible to specify an alternative location for databases?

this would also be helpful for system-level installs of eelpond on e.g. HPCs.

maybe rename ep_utils/eelpond_environment.yaml?

A few thoughts on this file, which is hard to find because:

  • it doesn't have -env.yaml in the name, like the other files;
  • it's in the subdirectory ep_utils/;

Looking at repo2docker conventions, it should probably be environment.yml in the root directory.

Thoughts?

citation printing

  1. build a yaml of program: citation(s) (see the sketch below)
  2. enable a --citation option with workflows that will print all citations for that workflow to stdout or to a file.
  3. also make sure all program docs have a Citation section for tool citation(s)
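
A sketch of what that yaml could look like (the structure is hypothetical; the two citations shown are the real ones for these tools):

    trimmomatic:
      - "Bolger AM, Lohse M, Usadel B (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15)."
    salmon:
      - "Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14."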

how do you pass in custom param files?

in the excellent docs at e.g. doc/trimmomatic.md it says,

"Be sure the modified lines go into the config file you're using to run eelpond."

but I don't know how to do that :).

hunting down and testing the answer now...

Jetstream image?

Would be nice to have a Jetstream image with pre-loaded Miniconda, software, dammit db, etc.

`get_data` target assumes input data is gzipped.

Which is fine, but it means that if you are starting with un-gzipped data, the output of get_data is incorrect: it links the ungzipped files to filenames that end with .gz, and then commands like trimmomatic barf because the data isn't gzipped but is named .gz :)
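
One way get_data could guard against this is to sniff the gzip magic bytes instead of trusting the filename (a sketch):

    def is_gzipped(path):
        # gzip files begin with the two magic bytes 0x1f 0x8b
        with open(path, "rb") as fh:
            return fh.read(2) == b"\x1f\x8b"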

dammit step fails after rerunning assembly

If I remove the assembly and then run `run_eelpond trinity default`, the dammit step fails. Output below.

I can get it to work properly by removing the annotation directory. I suspect that there is a stale indexed file or another dammit intermediate file in there. Not sure if this is a dammit bug or an eelpond bug - seems like a dammit bug?

- [ ] TransDecoder.Predict:lab4_trinity.fasta:
    * Cmd: `/home/diblions/eelpond/.snakemake/conda/9149e0c4/bin/TransDecoder.Predict -t /home/diblions/eelpond/lab4_out_exp1/annotation/lab4_trinity.fasta.dammit/lab4_trinity.fasta`
TaskFailed - taskid:TransDecoder.Predict:lab4_trinity.fasta
Command failed: '/home/diblions/eelpond/.snakemake/conda/9149e0c4/bin/TransDecoder.Predict -t /home/diblions/eelpond/lab4_out_exp1/annotation/lab4_trinity.fasta.dammit/lab4_trinity.fasta' returned 25

########################################
TaskFailed - taskid:TransDecoder.Predict:lab4_trinity.fasta
TransDecoder.Predict:lab4_trinity.fasta <stderr>:

#####################
Counts of kept entries according to attributes:
FRAMESCORE      39
FRAMESCORE|LONGORF      29
########################


-indexing [Gene.34::Transcript_9::g.34]  ]  
 Indexed Gene.34::Transcript_9::g.34::m.34 138
Error, no gene obj retrieved based on identifier Gene.2::Transcript_2::g.2::m.2 at /home/diblions/eelpond/.snakemake/conda/9149e0c4/opt/transdecoder/util/../PerlLib/Gene_obj_indexer.pm line 61, <$fh> line 2.
        Gene_obj_indexer::get_gene(Gene_obj_indexer=HASH(0x1d44180), "Gene.2::Transcript_2::g.2::m.2") called at /home/diblions/eelpond/.snakemake/conda/9149e0c4/opt/transdecoder/util/gene_list_to_gff.pl line 29
Error, cmd: /home/diblions/eelpond/.snakemake/conda/9149e0c4/opt/transdecoder/util/gene_list_to_gff.pl lab4_trinity.fasta.transdecoder_dir/longest_orfs.cds.scores.selected lab4_trinity.fasta.transdecoder_dir/longest_orfs.gff3.inx > lab4_trinity.fasta.transdecoder_dir/longest_orfs.cds.best_candidates.gff3 died with ret 6400 at /home/diblions/eelpond/.snakemake/conda/9149e0c4/bin/TransDecoder.Predict line 379.


Conda environment.yml file / Python requirements.txt file

Running the workflow from a fresh miniconda install resulted in an error because pandas was missing. The workflow also uses some Snakemake features that require a minimum Snakemake version.

The repo would benefit from a requirements.txt that lists the python packages required (pandas and snakemake in particular).
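
A minimal environment.yml covering the two problem packages might look like this (the version floors are illustrative, not tested pins):

    channels:
      - conda-forge
      - bioconda
      - defaults
    dependencies:
      - python>=3.6
      - snakemake>=5.1
      - pandas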

combine units prior to khmer

k-mer trimming and diginorm should happen on the full sample, not on each sample-unit pair of files. We now have a function to do this (currently in the salmon rule) that can be moved earlier in the workflow.
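
Since concatenated gzip streams form a valid gzip file, the combining step itself can be a plain cat (a sketch; filenames are illustrative):

    cat sample1_unit1_R1.fq.gz sample1_unit2_R1.fq.gz > sample1_R1.fq.gz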

Questions:

  1. There is some benefit to allowing users to fastqc each file separately to see quality. By default, should we:
    A. combine units at the very beginning of the workflow (get_fq) and only work at the sample level?
    B. combine units after trimmomatic, allowing pre-trimming fastqc at the sample-unit level, but nothing else.

  2. Is it worth it to enable pipelines on the sample-unit level, rather than just sample? We could provide an eelpond_params parameter for each program that we populate via a single flag in run_eelpond, or parameter in the main configfile. Otherwise, just move everything to the sample level.

cc: @ctb @ljcohen

update test data input assembly

Update the nema.fasta used in assemblyinput test data.

  • upload new trinity fasta and gene_trans_map to OSF (generated with extract read subset)
  • add rule/method to download an assembly and/or gtmap, similar to the get_data download
  • if not downloading, link instead of copy?
  • update how "full" is translated in run_eelpond. At the moment, there's no way to pass other targets in at the same time as full, which means we can't generate assemblyinput params in the full configfile
