
elvers's Introduction

elvers


                           ___
                        .-'   `'.
                       /         \
                      |           ;
                      |           |           ___.--,
             _.._     |O)  ~  (O) |    _.---'`__.-( (_.       
      __.--'`_.. '.__.\      '--. \_.-' ,.--'`     `""`
     ( ,.--'`   ',__ /./;     ;, '.__.'`    __
     _`) )  .---.__.' / |     |\   \__..--""  """--.,_
    `---' .'.''-._.-'`_./    /\ '.  \_.-~~~````~~~-.__`-.__.'
          | |  .' _.-' |    |  \  \  '.
           \ \/ .'     \    \   '. '-._)
            \/ /        \    \    `=.__`-~-.
            / /\         `)   )     / / `"".`\
      , _.-'.'\ \        /   /     (  (   /  /
       `--~`  )  )    .-'  .'       '.'. |  (
             (/`     (   (`           ) ) `-;
              `       '--;            (' 

elvers started as a snakemake update of the Eel Pond Protocol for de novo RNAseq analysis. It has evolved slightly to enable a number of workflows for (mostly) RNA data, which can all be run via the elvers workflow wrapper. elvers uses snakemake for workflow management and conda for software installation. The code can be found at https://github.com/dib-lab/elvers.

Getting Started

Linux is the recommended OS. Nearly everything also works on macOS, but some programs (fastqc, Trinity) are troublesome.

If you don't have conda yet, install miniconda (the commands below are for Linux x86_64, e.g. an Ubuntu 16.04 Jetstream image):

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Be sure to answer 'yes' to all yes/no questions. You'll need to restart your terminal for conda to be active.

Create a working environment and install elvers!

elvers needs a few programs installed in order to run properly. To handle this, we run elvers within a conda environment that contains all dependencies.

Get the elvers code

git clone https://github.com/dib-lab/elvers.git
cd elvers

When you first get elvers, you'll need to create this environment on your machine:

conda env create --file environment.yml -n elvers-env

Now, activate that environment:

conda activate elvers-env

To deactivate after you've finished running elvers, type conda deactivate. You'll need to reactivate this environment anytime you want to run elvers.

Now, install the elvers package:

pip install -e .

Now you can start running workflows on test data!

Default workflow: Eel Pond Protocol for de novo RNAseq analysis

The Eel Pond protocol (which inspired the elvers name) included line-by-line commands that the user could follow along with using a test dataset provided in the instructions. We have re-implemented the protocol here to enable automated de novo transcriptome assembly, annotation, and quick differential expression analysis on a set of short-read Illumina data using a single command. See more about this protocol here.

To test the default workflow:

elvers examples/nema.yaml default

This will download and run a small set of Nematostella vectensis test data (from Tulin et al., 2013).

Running Your Own Data

To run your own data, you'll need to create one or more files:

  • a yaml file containing basic configuration info

This yaml config file must specify either:

  • a tsv file containing your read sample info, or
  • a reference file input (a .fasta file, plus an optional gene_trans_map)

Generate these files by following instructions here: Understanding and Configuring Workflows.
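
As a rough illustration, a config might look something like the sketch below. The key names here are illustrative, not authoritative ('basename' appears in the real configs; the rest is assumption) -- see examples/nema.yaml for the actual template:

    # hypothetical sketch only -- consult examples/nema.yaml for real keys
    basename: my_experiment     # prefix used for output files
    samples: my_samples.tsv     # tsv file with read sample info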

Available Workflows

  • preprocess: Read Quality Trimming and Filtering (fastqc, trimmomatic)
  • kmer_trim: Kmer Trimming and/or Digital Normalization (khmer)
  • assemble: Transcriptome Assembly (trinity)
  • annotate: Annotate the transcriptome (dammit)
  • sourmash_compute: Build sourmash signatures for the reads and assembly (sourmash)
  • quantify: Quantify transcripts (salmon)
  • diffexp: Conduct differential expression (DESeq2)
  • plass_assemble: Assemble at the protein level (PLASS)
  • paladin_map: Map reads to a protein assembly (paladin)

End-to-end workflows:

  • default: preprocess, kmer_trim, assemble, annotate, quantify
  • protein assembly: preprocess, kmer_trim, plass_assemble, paladin_map

You can see the available workflows (and which programs they run) by using the --print_workflows flag:

elvers examples/nema.yaml --print_workflows

Each included tool can also be run independently, if appropriate input files are provided. This is not always intuitive, so please see our documentation for running each tool (described as "Advanced Usage") for details. To see all available tools, run:

elvers examples/nema.yaml --print_rules

Citation information

This is pre-publication code; a manuscript is in preparation. If you wish to use and cite elvers, please contact the authors for current citation information.

Additional Info

See the built-in help:

elvers -h

elvers's People

Contributors: bluegenes, charlesreid1, ctb, katrinleinweber, maligang

elvers's Issues

Flag for saving DAG

Would be useful to have a --save-dag flag that would dump the snakemake graphviz dot output into a .dot file, then run the right dot command to convert it to a png. dot is not pip-installable, but is likely conda-installable.

This relates to #10 - graphviz dot can be installed with conda (ref: https://anaconda.org/anaconda/graphviz) so it can be added to the environment.yml to ensure that users are able to run this flag.

That would require committing to conda rather than pip, but I think you will find plenty of support around the DIB lab for conda and friends. ;)
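
For reference, the pair of commands such a flag would wrap is roughly the following (a sketch; assumes graphviz's dot is on the PATH):

    snakemake --dag > workflow.dot
    dot -Tpng workflow.dot > workflow.png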

'basename' may not be properly quoted

A student used a space in 'basename' in the config yaml file, and it looks like that will break shell commands that aren't properly quoted. I think single quotes should work.

We can add in some tests that do this for basenames containing $ and :)
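
A minimal sketch of the fix inside a snakemake rule (the rule and file names here are hypothetical; the :q format flag is snakemake's built-in shell-quoting):

    rule rename_assembly:
        input: config["basename"] + ".fasta"
        output: config["basename"] + "_renamed.fasta"
        # {...:q} shell-quotes the interpolated value, so spaces or $ in basename are safe
        shell: "cp {input:q} {output:q}"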

make 'salmon' or 'quantify' an equal citizen

A common use case for transcriptomics doesn't involve de novo assembly, but rather just quantification or (perhaps) annotation + quantification, for semi-model organisms like dog and cat which have good reference genomes but not great transcript annotations.

While eelpond can probably handle both of these, the docs for e.g. salmon really focus on addressing the de novo transcriptome assembly case:

https://github.com/dib-lab/eelpond/blob/master/docs/salmon.md

but I think that eelpond could usefully serve the reference-based case as well.

check replicates for diffexp analysis

  • DESeq2 only works with replicates -- check sample replicate info before attempting the deseq2 workflow (see the sketch below).
  • Add edgeR rule for no-replicate analysis.
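
A sketch of the replicate check, assuming a samples.tsv with 'sample' and 'condition' columns (column names are illustrative, not the confirmed schema):

    import pandas as pd

    samples = pd.read_csv("samples.tsv", sep="\t")
    # count distinct samples per condition; DESeq2 needs at least 2 per group
    replicates = samples.groupby("condition")["sample"].nunique()
    if (replicates < 2).any():
        missing = list(replicates[replicates < 2].index)
        raise SystemExit(f"deseq2 workflow requires replicates; conditions lacking them: {missing}")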

one possible approach to testing: inspect command lines

We now have the basic assembly under CI (yay!), up through annotation. But I'm wary of adding too many more execution tests on Travis, because they take real time and memory and disk space.

eelpond is all about getting parameters from config file(s) to downstream programs.

So, what about inspecting the command line that is output by snakemake -n as a way to determine whether configuration parameters are making it "out" to downstream programs?
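
For example (a sketch; LEADING is an arbitrary trimmomatic parameter chosen for illustration):

    # -n = dry run, -p = print the shell commands that would be executed
    snakemake -n -p | grep 'LEADING'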

make preparing samples.tsv easier

I spent a frustrating time debugging tabs vs spaces vs...

a few thoughts:

  • allow yaml input
  • allow csv input
  • allow excel input (??)
  • provide a utility script for checking it (see the sketch below)
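
A sketch of such a checker (the column names are assumed for illustration, not the confirmed samples.tsv schema):

    import csv, sys

    REQUIRED = {"sample", "unit", "fq1"}   # assumed columns; fq2 optional for single-end

    with open(sys.argv[1], newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            sys.exit(f"missing required columns {sorted(missing)} -- is the file really tab-separated?")
        for lineno, row in enumerate(reader, start=2):
            if any(v and " " in v for v in row.values()):
                print(f"line {lineno}: field contains spaces -- mixed tabs/spaces?")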

fix force options in run_eelpond

--forcetargets doesn't always work as desired, particularly for steps with intermediate targets that are not passed as targets to snakemake (e.g. dammit).

submodule 'databases' fails when `-t 6` specified

See error output below.

Without -t it works fine.

To rerun, I just remove the 'databases' directory under eelpond/ and re-run eelpond.

Installing...
#### Run Tasks
- [ ] download:Pfam-A.hmm.gz:
    * Cmd: `curl -o Pfam-A.hmm.gz ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam28.0/Pfam-A.hmm.gz`
    * Python: function check_hash
- [ ] hmmpress:Pfam-A.hmm:
    * Cmd: `/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/hmmpress /home/titus/eelpond/databases/Pfam-A.hmm`
TaskFailed - taskid:hmmpress:Pfam-A.hmm
Command failed: '/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/hmmpress /home/titus/eelpond/databases/Pfam-A.hmm' returned 1

- [ ] download:Rfam.cm.gz:
    * Cmd: `curl -o Rfam.cm.gz ftp://ftp.ebi.ac.uk/pub/databases/Rfam/12.1/Rfam.cm.gz`
    * Python: function check_hash
- [ ] download:aa_seq_euk.fasta.gz:
    * Cmd: `curl -o aa_seq_euk.fasta.gz ftp://cegg.unige.ch/OrthoDB8/Eukaryotes/FASTA/aa_seq_euk.fasta.gz`
    * Python: function check_hash
- [ ] cmpress:Rfam.cm:
    * Cmd: `/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/cmpress /home/titus/eelpond/databases/Rfam.cm`
- [ ] lastdb:aa_seq_euk.fasta:
    * Cmd: `/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/lastdb -p -w3 /home/titus/eelpond/databases/aa_seq_euk.fasta /home/titus/eelpond/databases/aa_seq_euk.fasta`
TaskFailed - taskid:cmpress:Rfam.cm
Command failed: '/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/cmpress /home/titus/eelpond/databases/Rfam.cm' returned 1

TaskFailed - taskid:lastdb:aa_seq_euk.fasta
Command failed: '/home/titus/eelpond/.snakemake/conda/8b49a15e/bin/lastdb -p -w3 /home/titus/eelpond/databases/aa_seq_euk.fasta /home/titus/eelpond/databases/aa_seq_euk.fasta' returned 1

osf downloads do not always work on travis ci

this may be because of rate limits on osf, or network issues on travis build machines. see e.g. https://travis-ci.org/dib-lab/eelpond/builds/483913645, which has a bunch of the following errors:

Traceback (most recent call last):
  File "/home/travis/miniconda/envs/test-env/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py", line 453, in wrap_socket
    cnx.do_handshake()
  File "/home/travis/miniconda/envs/test-env/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1907, in do_handshake
    self._raise_ssl_error(self._ssl, result)
  File "/home/travis/miniconda/envs/test-env/lib/python3.6/site-packages/OpenSSL/SSL.py", line 1632, in _raise_ssl_error
    raise SysCallError(-1, "Unexpected EOF")
OpenSSL.SSL.SysCallError: (-1, 'Unexpected EOF')

make editing configfile easier

Provide an option to print program parameters to stdout instead of to a file, so it's easier to add them to an existing configfile?

'quantify' target is currently failing.

% ./run_eelpond nema-download.yaml quantify

--------
checking for required files:
--------

        Added default parameters from rule-specific params files.
        Writing full params to .ep_nema-download.yaml
--------
details!
        snakefile: /home/diblions/eelpond/Snakefile
        config: nema-download.yaml
        params: .ep_nema-download.yaml
        targets: ['quantify']
        report: '/home/diblions/eelpond/nema_out/logs/report.html'
--------
Building DAG of jobs...
MissingInputException in line 46 of /home/diblions/eelpond/rules/salmon/salmon.rule:
Missing input files for rule salmon_index:
/home/diblions/eelpond/nema_out/assembly/__assembly__.fasta

support the execution of a bunch of notebooks upon completion

it would be great to have both standard and customized execution of notebooks after a run!

Not sure how to make 'em conditional on the workflow run, but that can be a detail we figure out later.

I think Jupyter Notebooks and RMarkdown notebooks would be the primary things to support here. Would need to parameterize the notebooks e.g. via papermill.
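
Parameterized execution via papermill would look roughly like this (the notebook and parameter names are hypothetical; the CLI shape is papermill's real input/output/-p interface):

    papermill diffexp_report.ipynb nema_out/diffexp_report.ipynb -p quant_dir nema_out/quant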

resource usage!

on a Jetstream m1.medium, second run of nema-test (so, all software installed etc. etc.)

/usr/bin/time -v reported:

        Command being timed: "./run_eelpond nema-test full"
        User time (seconds): 1637.94
        System time (seconds): 706.61
        Percent of CPU this job got: 93%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 41:38.69
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3926492
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 545
        Minor (reclaiming a frame) page faults: 288161380
        Voluntary context switches: 958956
        Involuntary context switches: 161188
        Swaps: 0
        File system inputs: 179680
        File system outputs: 22846136
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

41 minutes, 4 GB of RAM, 120 MB of disk space used, plus 14 GB of databases (presumably these are the dammit databases?)

Looking into the software, this is a nearly clean miniconda install followed by the eelpond run, so:

5.6 GB in /opt/miniconda/pkgs - this is probably the downloaded pkgs
1.3 GB in /opt/miniconda/envs/eelpond/ - this is the installed software.

So I think we can probably recommend that you have 25 GB of free space after conda install in order to even begin to run eelpond (b/c of databases), + space for reads and assembly and so on.

remove large files from git commit history

The whole repo history is in the ~100 MB range, while the current master is only a few MB. This makes it unnecessarily slow to check out the git repo :)

Using the command here, it looks like we need to go remove any trace of testing_out/ from the history. Not quite sure how to do that but this github help article looks legit.
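
The standard filter-branch recipe for this (the one the GitHub help article describes), applied to testing_out/, is roughly the following -- run it on a fresh clone and coordinate with other contributors first, since it rewrites history:

    git filter-branch --force --index-filter \
      'git rm -r --cached --ignore-unmatch testing_out/' \
      --prune-empty --tag-name-filter cat -- --all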

how to specify an alternative location for databases?

right now, the databases for dammit are put in databases/ within the run directory, and (if you trash the eelpond repo, as Travis does) they need to be re-downloaded and re-installed.

is it possible to specify an alternative location for databases?

this would also be helpful for system-level installs of eelpond on e.g. HPCs.

maybe rename ep_utils/eelpond_environment.yaml?

A few thoughts on this file, which is hard to find because:

  • it doesn't have -env.yaml in the name, like the other files;
  • it's in the subdirectory ep_utils/;

Looking at repo2docker conventions, it should probably be environment.yml in the root directory.

Thoughts?

citation printing

  1. build a yaml of program: citation(s) (see the sketch below)
  2. enable a --citation option with workflows that will print all citations for that workflow to stdout or to a file.
  3. also make sure all program docs have a Citation section for tool citation(s)
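
A sketch of what that yaml could look like (the structure is hypothetical; the two citations shown are the real ones for these tools):

    trimmomatic:
      - "Bolger AM, Lohse M, Usadel B (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15)."
    salmon:
      - "Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14."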

how do you pass in custom param files?

in the excellent docs at e.g. doc/trimmomatic.md it says,

"Be sure the modified lines go into the config file you're using to run eelpond."

but I don't know how to do that :).

hunting down and testing the answer now...

Jetstream image?

Would be nice to have a Jetstream image with pre-loaded Miniconda, software, dammit db, etc.

`get_data` target assumes input data is gzipped.

Which is fine, but it means that if you are starting with un-gzipped data, the output of get_data is incorrect: it links the ungzipped files to filenames that end with .gz, and then commands like trimmomatic barf because the data isn't gzipped but is named .gz :)
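
One way get_data could guard against this is to sniff the gzip magic bytes instead of trusting the filename (a sketch):

    def is_gzipped(path):
        # gzip files begin with the two magic bytes 0x1f 0x8b
        with open(path, "rb") as fh:
            return fh.read(2) == b"\x1f\x8b"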

dammit step fails after rerunning assembly

If I remove the assembly and then run `run_eelpond trinity default`, the dammit step fails. Output below.

I can get it to work properly by removing the annotation directory. I suspect that there is a stale indexed file or another dammit intermediate file in there. Not sure if this is a dammit bug or an eelpond bug - seems like a dammit bug?

- [ ] TransDecoder.Predict:lab4_trinity.fasta:
    * Cmd: `/home/diblions/eelpond/.snakemake/conda/9149e0c4/bin/TransDecoder.Predict -t /home/diblions/eelpond/lab4_out_exp1/annotation/lab4_trinity.fasta.dammit/lab4_trinity.fasta`
TaskFailed - taskid:TransDecoder.Predict:lab4_trinity.fasta
Command failed: '/home/diblions/eelpond/.snakemake/conda/9149e0c4/bin/TransDecoder.Predict -t /home/diblions/eelpond/lab4_out_exp1/annotation/lab4_trinity.fasta.dammit/lab4_trinity.fasta' returned 25

########################################
TaskFailed - taskid:TransDecoder.Predict:lab4_trinity.fasta
TransDecoder.Predict:lab4_trinity.fasta <stderr>:

#####################
Counts of kept entries according to attributes:
FRAMESCORE      39
FRAMESCORE|LONGORF      29
########################


-indexing [Gene.34::Transcript_9::g.34]  ]  
 Indexed Gene.34::Transcript_9::g.34::m.34 138
Error, no gene obj retrieved based on identifier Gene.2::Transcript_2::g.2::m.2 at /home/diblions/eelpond/.snakemake/conda/9149e0c4/opt/transdecoder/util/../PerlLib/Gene_obj_indexer.pm line 61, <$fh> line 2.
        Gene_obj_indexer::get_gene(Gene_obj_indexer=HASH(0x1d44180), "Gene.2::Transcript_2::g.2::m.2") called at /home/diblions/eelpond/.snakemake/conda/9149e0c4/opt/transdecoder/util/gene_list_to_gff.pl line 29
Error, cmd: /home/diblions/eelpond/.snakemake/conda/9149e0c4/opt/transdecoder/util/gene_list_to_gff.pl lab4_trinity.fasta.transdecoder_dir/longest_orfs.cds.scores.selected lab4_trinity.fasta.transdecoder_dir/longest_orfs.gff3.inx > lab4_trinity.fasta.transdecoder_dir/longest_orfs.cds.best_candidates.gff3 died with ret 6400 at /home/diblions/eelpond/.snakemake/conda/9149e0c4/bin/TransDecoder.Predict line 379.


Conda environment.yml file / Python requirements.txt file

Running the workflow from a fresh miniconda install resulted in an error because pandas was missing. The workflow also uses some Snakemake features that require a minimum Snakemake version.

The repo would benefit from a requirements.txt that lists the python packages required (pandas and snakemake in particular).
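
A minimal environment.yml covering the two problem packages might look like this (the version floors are illustrative, not tested pins):

    channels:
      - conda-forge
      - bioconda
      - defaults
    dependencies:
      - python>=3.6
      - snakemake>=5.1
      - pandas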

combine units prior to khmer

k-mer trimming and diginorm should happen on the full sample, not on each sample-unit pair of files. We now have a function to do this (currently in the salmon rule) that can be moved earlier in the workflow.
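
Since concatenated gzip streams form a valid gzip file, the combining step itself can be a plain cat (a sketch; filenames are illustrative):

    cat sample1_unit1_R1.fq.gz sample1_unit2_R1.fq.gz > sample1_R1.fq.gz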

Questions:

  1. There is some benefit to allowing users to fastqc each file separately to see quality. By default, should we:
    A. combine units at the very beginning of the workflow (get_fq) and only work at the sample level?
    B. combine units after trimmomatic, allowing pre-trimming fastqc at the sample-unit level, but nothing else.

  2. Is it worth it to enable pipelines on the sample-unit level, rather than just sample? We could provide an eelpond_params parameter for each program that we populate via a single flag in run_eelpond, or parameter in the main configfile. Otherwise, just move everything to the sample level.

cc: @ctb @ljcohen

update test data input assembly

Update the nema.fasta used in assemblyinput test data.

  • upload new trinity fasta and gene_trans_map to OSF (generated with extract read subset)
  • add rule/method to download an assembly and/or gtmap, similar to the get_data download
  • if not downloading, link instead of copy?
  • update how "full" is translated in run_eelpond. At the moment, there's no way to pass other targets in at the same time as full, which means we can't generate assemblyinput params in the full configfile
