
dropseqpipe's Introduction


Description

This pipeline is based on snakemake and the dropseq tools provided by the McCarroll Lab. It takes you from the raw data of your single-cell RNA-seq experiment to the final count matrix, with QC plots along the way.

This is the tool we use in our lab to improve our wet-lab protocol, as well as an easy framework to reproduce and compare different experiments with different parameters.

It uses STAR to map the reads and is usable for any single-cell protocol with two reads, where the first read holds the cell and UMI barcodes and the second read holds the RNA. Here is a non-exhaustive list of compatible protocols/brands:

This package tries to be as user friendly as possible, in the hope that non-bioinformaticians can make use of it without too much hassle. It will still require some command-line execution; this is not going to be a fully interactive package.

Authors

Usage

Step 1: Install workflow

If you simply want to use this workflow, download and extract the latest release. If you intend to modify and further develop this workflow, fork this repository. Please consider providing any generally applicable modifications via a pull request.

In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, once available, its DOI.

Step 2: Configure workflow

Configure the workflow according to your needs by editing the files config.yaml and samples.tsv, following these instructions.

Step 3: Execute workflow

All you need to execute this workflow is to install Snakemake via the Conda package manager. Software needed by this workflow is automatically deployed into isolated environments by Snakemake.

Test your configuration by performing a dry-run via

snakemake --use-conda -n --directory $WORKING_DIR

Execute the workflow locally via

snakemake --use-conda --cores $N --directory $WORKING_DIR

using $N cores in $WORKING_DIR. Alternatively, it can be run in cluster or cloud environments (see the docs for details).

If you want to fix not only the software stack but also the underlying OS, use

snakemake --use-conda --use-singularity

in combination with any of the modes above.

Step 4: Investigate results

After successful execution, you can create a self-contained report with all results via:

snakemake --report report.html

Documentation

You can find the documentation here

Future implementations

I'm actively seeking help to implement the points listed below. Don't hesitate to contact me if you wish to contribute.

  • Create a sharing platform where quality plots/logs can be discussed and troubleshot.
  • Create a full HTML report for the whole pipeline
  • MultiQC module for drop-seq-tools
  • Implement an elegant "preview" mode where the pipeline would run on only a couple of million reads, giving you an approximate view before running all of the data. This would dramatically reduce the time needed to get an idea of which filters should be used.
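In the meantime, a rough way to approximate such a preview is to subsample the FASTQ files yourself before launching the pipeline. A minimal sketch (the helper and file names are hypothetical, not part of dropSeqPipe):

```python
import gzip
import itertools

def head_fastq(src, dst, n_reads):
    """Copy the first n_reads records (4 lines each) from one gzipped FASTQ
    to another, so the pipeline can be test-driven on a small subset."""
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in itertools.islice(fin, n_reads * 4):
            fout.write(line)
```

Taking the head of a FASTQ is only an approximation of random subsampling (reads are ordered by flow-cell position), but it is usually good enough for a first look at the filters.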

I hope it can help you out in your single cell experiments!

Feel free to comment and point out potential improvements via issues

TODO

  • Add a mixed-species reference for testing purposes
  • Finalize the parameters validation schema
  • Make the debug feature a bit "cleaner"; deal with automatic naming of the debug variables
  • Implement ddseq barcoding strategies

dropseqpipe's People

Contributors

cgirardot, duyck, hoohm, seb-mueller


dropseqpipe's Issues

Error on split_species rule for shipped test data

Not sure if this is due to my setup, but I regularly get an error on the split_species rule.

To make it reproducible, I've tried it on the shipped data in .test:

Running the following in the .test folder of a freshly cloned (including submodules) repository:

~/code/dropSeqPipe/.test  (master *)$ snakemake --use-conda --cores 3 --snakefile ../Snakefile split_species 

gives the following output:

...
Activating conda environment: /home/sm934/code/dropSeqPipe/.test/.snakemake/conda/f358d867
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/tmp
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/SPECIES_TWO/sample2_dge.summary.txt OUTPUT=summary/SPECIES_TWO/sample2_unfiltered_umi_expression_matrix.tsv INPUT=data/SPECIES_TWO/sample2_unfiltered.bam EDIT_DISTANCE=1 MIN_BC_READ_THRESHOLD=0 NUM_CORE_BARCODES=100    OUTPUT_READS_INSTEAD=false CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Aug 14 13:01:43 BST 2018] Executing as sm934@genesilencing56 on Linux 4.15.0-30-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/SPECIES_TWO/sample1_dge.summary.txt OUTPUT=summary/SPECIES_TWO/sample1_unfiltered_umi_expression_matrix.tsv INPUT=data/SPECIES_TWO/sample1_unfiltered.bam EDIT_DISTANCE=1 MIN_BC_READ_THRESHOLD=0 NUM_CORE_BARCODES=100    OUTPUT_READS_INSTEAD=false CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/SPECIES_ONE/sample2_dge.summary.txt OUTPUT=summary/SPECIES_ONE/sample2_unfiltered_umi_expression_matrix.tsv INPUT=data/SPECIES_ONE/sample2_unfiltered.bam EDIT_DISTANCE=1 MIN_BC_READ_THRESHOLD=0 NUM_CORE_BARCODES=100    OUTPUT_READS_INSTEAD=false CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Aug 14 13:01:43 BST 2018] Executing as sm934@genesilencing56 on Linux 4.15.0-30-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Tue Aug 14 13:01:43 BST 2018] Executing as sm934@genesilencing56 on Linux 4.15.0-30-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Looking for the top 100 cell barcodes
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Selected 0 core barcodes
ERROR   2018-08-14 13:01:43     DigitalExpression       Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=504889344
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Looking for the top 100 cell barcodes
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Selected 0 core barcodes
ERROR   2018-08-14 13:01:43     DigitalExpression       Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=504889344
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Looking for the top 100 cell barcodes
INFO    2018-08-14 13:01:43     BarcodeListRetrieval    Selected 0 core barcodes
ERROR   2018-08-14 13:01:43     DigitalExpression       Running digital expression without somehow selecting a set of barcodes to process no longer supported.
[Tue Aug 14 13:01:43 BST 2018] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=504889344
    [Tue Aug 14 13:01:43 2018]
    Error in rule extract_all_umi_expression_species:
        jobid: 8
    [Tue Aug 14 13:01:43 2018]
    [Tue Aug 14 13:01:43 2018]
        output: summary/SPECIES_TWO/sample1_unfiltered_umi_expression_matrix.tsv, summary/SPECIES_TWO/sample1_dge.summary.txt
    Error in rule extract_all_umi_expression_species:
    Error in rule extract_all_umi_expression_species:
        conda-env: /home/sm934/code/dropSeqPipe/.test/.snakemake/conda/f358d867
        jobid: 9
        jobid: 10

        output: summary/SPECIES_ONE/sample2_unfiltered_umi_expression_matrix.tsv, summary/SPECIES_ONE/sample2_dge.summary.txt
        output: summary/SPECIES_TWO/sample2_unfiltered_umi_expression_matrix.tsv, summary/SPECIES_TWO/sample2_dge.summary.txt
        conda-env: /home/sm934/code/dropSeqPipe/.test/.snakemake/conda/f358d867
        conda-env: /home/sm934/code/dropSeqPipe/.test/.snakemake/conda/f358d867
...

Does this work for you?
Note, the map and filter rules etc. work just fine.

I suspect this is due to the absence of the second species in the reference files (only chr19). However, the data set seems to be mixed-species.

I guess the reference has to be adapted to make the split_species rule testable. Thoughts?

Modern August 2018 conda - issues

As you probably know, conda has changed a lot recently.

conda activate single_cell
snakemake --use-conda --cores 16

I experienced this error: which conda reported "conda could not be found".

After entering the following within the conda env, the pipeline can find conda again. That is, conda does not seem to be installed by default inside a conda env, at least for me.

conda install conda

Specification for GTF file

Hi There, thanks for putting all of this together!

I am just putting a few comments together as I try to run this pipeline on a Rhesus macaque sample. I'm downloading the GTF from Ensembl:
ftp://ftp.ensembl.org/pub/release-92/gtf/macaca_mulatta

Initially this file didn't include a gene_name in the attribute column, which led to a NullPointerException in org.broadinstitute.dropseqrna.annotation.ReduceGTF.

I manually added a gene_name attribute, but now I'm getting a "Missing transcript_name" exception:

Problems:
Missing transcript_name
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:97)
at org.broadinstitute.dropseqrna.annotation.GTFParser.next(GTFParser.java:39)
at htsjdk.samtools.util.PeekableIterator.advance(PeekableIterator.java:71)
at htsjdk.samtools.util.PeekableIterator.next(PeekableIterator.java:57)
at org.broadinstitute.dropseqrna.utils.FilteredIterator.next(FilteredIterator.java:71)
at org.broadinstitute.dropseqrna.annotation.ReduceGTF.writeRecords(ReduceGTF.java:166)
at org.broadinstitute.dropseqrna.annotation.ReduceGTF.doWork(ReduceGTF.java:112)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

It would be helpful if there were a similar error message for a missing gene_name, and if the documentation in the "reference files" section indicated that the transcript_name and gene_name attributes are required and may not be included directly in the Ensembl download.
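For anyone hitting the same thing, here is the sort of stop-gap patch I ended up using; a hedged sketch (my own helper, not part of dropSeqPipe or the Drop-seq tools) that copies gene_id/transcript_id into the missing *_name attributes:

```python
import re

def patch_gtf_line(line):
    """Copy gene_id -> gene_name and transcript_id -> transcript_name
    when the *_name attributes are missing from a GTF attribute column."""
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line
    attrs = fields[8]
    for src, dst in (("gene_id", "gene_name"),
                     ("transcript_id", "transcript_name")):
        if dst not in attrs:
            m = re.search(src + r' "([^"]+)"', attrs)
            if m:
                attrs += ' {} "{}";'.format(dst, m.group(1))
    fields[8] = attrs
    return "\t".join(fields) + "\n"
```

Applied to every line of the GTF, this gives ReduceGTF the attributes it insists on without touching lines that already carry them.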

Anatomy of fasta/gtf files in mixed species experiment

Dear Patrick,

thanks for providing dropSeqPipe as a Snakemake workflow. It made it very easy for me to install the pipeline and get it running.

I have a question regarding what the fasta and gtf files should look like in the case of mixed-species experiments.
From what I see in the code, I guessed that the fasta should look like this:

>HUMAN_chr1
...
>MOUSE_chr1
....

Is this correct? How should the GTF look? Maybe you can give me the output of head Genome.reference.fasta Annotation.gtf from your example config?
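In case it helps, my working assumption (based on the fasta sketch above) is that both files get a species prefix on every sequence name, so the GTF seqname column matches the FASTA headers. A sketch of that renaming (the helpers and the HUMAN_/MOUSE_ prefixes are just illustrative):

```python
def prefix_fasta(lines, prefix):
    """Prepend a species tag to every FASTA header line (e.g. '>chr1' ->
    '>HUMAN_chr1'); sequence lines pass through unchanged."""
    return [">" + prefix + l[1:] if l.startswith(">") else l for l in lines]

def prefix_gtf(lines, prefix):
    """Prepend the same tag to the seqname (first) column of each GTF
    record so it matches the renamed FASTA headers."""
    return [l if l.startswith("#") else prefix + l for l in lines]
```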

Thanks a lot for your help and best regards,
Jens

high % mixed species?

Apologies for posting all these questions if I'm not supposed to. I get more useful answers/responses by posting on this GitHub than in the group.

I am attaching our mixed-species plot. As you can see, about 21% of cells are mixed species. I used the default 80% purity threshold. I know in the wiki plot the mixed-species fraction is about 15%. In one of the posts in the Drop-seq Google group, Evan mentioned that 6-9% species mixing is relatively high (corresponding to a 12-18% overall contamination rate). Should I be concerned about the quality of our experiment? Any suggestions? Thanks again for your help!
species_plot_genes.pdf

Update: I just realized the second sample only has ~10% human cells.
sample2_species_plot_genes.pdf
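For reference, the doubling Evan mentions follows from random pairing: in a roughly 50/50 mix only about half of all multiplets combine two different species, so the observed cross-species fraction understates the total. A back-of-the-envelope sketch (my own arithmetic, not from the pipeline):

```python
def implied_multiplet_rate(observed_mixed, frac_species_a=0.5):
    """Estimate the total multiplet rate from the observed cross-species
    fraction, assuming cells pair at random: P(cross-species pair) is
    2 * p * (1 - p), so total = observed / that probability."""
    p_cross = 2 * frac_species_a * (1 - frac_species_a)
    return observed_mixed / p_cross
```

With a 50/50 mix, an observed 6% mixed-species rate implies roughly 12% total multiplets, matching the 6-9% to 12-18% figures quoted above; with the ~10% human second sample the correction factor would be larger.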

Error in creating annotation.refFlat

Hi there.
Thanks a lot for building this wonderful package, which has greatly helped biologists with a limited coding background like me.

I am trying to run my first test sample, but I ran into this error:

Error in rule create_refFlat:
jobid: 51
output: /home/coffeelover/NGStools/Reference_Files/annotation.refFlat

RuleException:
CalledProcessError in line 51 of /home/coffeelover/sc-experiment/4-cell-lines/dropSeqPipe/rules/generate_meta.smk:
Command ' set -euo pipefail; /home/coffeelover/NGStools/Drop-seq_tools-1.13 -m 80g -p ConvertToRefFlat ANNOTATIONS_FILE=/home/coffeelover/NGStools/Reference_Files/annotation.gtf OUTPUT=/home/coffeelover/NGStools/Reference_Files/annotation.refFlat SEQUENCE_DICTIONARY=/home/coffeelover/NGStools/Reference_Files/genome.dict ' returned non-zero exit status 126.
File "/home/coffeelover/sc-experiment/4-cell-lines/dropSeqPipe/rules/generate_meta.smk", line 51, in __rule_create_refFlat
File "/home/coffeelover/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run

I downloaded the GTF from Ensembl and renamed it annotation.gtf as required. I am not sure what caused this error or how to troubleshoot it. Could you please help me with this issue?

Thanks a million,
Kai

keyError

Hi

Can you help me solve the following problem? I am using the latest version, 0.23a.

dropSeqPipe -f /hd2/ -c /home/dropSeqPipe/local.yaml -m pre-process

[Tue Sep 19 09:00:33 CDT 2017] picard.sam.SamToFastq done. Elapsed time: 16.92 minutes.
Runtime.totalMemory()=1126170624
[Tue Sep 19 09:00:33 2017] Finished job 1.
[Tue Sep 19 09:00:33 2017] 3 of 4 steps (75%) done
[Tue Sep 19 09:00:33 2017]
[Tue Sep 19 09:00:33 2017] localrule all:
input: wts3_tagged_unmapped.fastq.gz
jobid: 0
[Tue Sep 19 09:00:33 2017]
[Tue Sep 19 09:00:33 2017] Finished job 0.
[Tue Sep 19 09:00:33 2017] 4 of 4 steps (100%) done
Running Alignement
[Tue Sep 19 09:00:33 2017] KeyError in line 23 of /usr/lib/python3.4/site-packages/dropSeqPipe/Snakefiles/singleCell/star_align.snake:
'allowed_aligner_mismatch'
File "/usr/lib/python3.4/site-packages/dropSeqPipe/Snakefiles/singleCell/star_align.snake", line 23, in
Plotting STAR logs
Error in file(file, "rt") : cannot open the connection
Calls: star.logs -> read.table -> file
In addition: Warning message:
In file(file, "rt") : cannot open file 'NA': No such file or directory
Execution halted
Traceback (most recent call last):
File "/bin/dropSeqPipe", line 9, in
load_entry_point('dropSeqPipe==0.23a0', 'console_scripts', 'dropSeqPipe')()
File "/usr/lib/python3.4/site-packages/dropSeqPipe/main.py", line 150, in main
shell(star_summary)
File "/usr/lib/python3.4/site-packages/snakemake-3.10.1-py3.4.egg/snakemake/shell.py", line 80, in new
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'Rscript /usr/lib/python3.4/site-packages/dropSeqPipe/Rscripts/STAR_log_plot.R /hd2/' returned non-zero exit status 1

Cell barcode too short error: trimming ?

Hi,

this looks to me like it is trying to pick up a cell barcode of 12 bp, but the read is shorter than 12 bp (possibly due to quality trimming?). At least I see very short reads in the unmapped.bam file.

Does it make sense for dropSeqPipe to check the length of the sequence to be tagged, avoid the error with a try/catch block, and exclude these sequences?

We're talking about a NextSeq 2x75 bp run with an R1 of 18-20 bp and an R2 of 60-62 bp.
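To make the suggestion concrete, the pre-filter I have in mind would look something like this (a sketch; the 20 bp minimum assumes a 12 bp cell barcode plus 8 bp UMI on R1):

```python
def drop_short_barcode_reads(read_pairs, min_r1_len=20):
    """Keep only pairs whose R1 is long enough to contain the full
    cell + UMI barcode (e.g. 12 bp cell barcode + 8 bp UMI -> 20 bp)."""
    return [(r1, r2) for r1, r2 in read_pairs if len(r1) >= min_r1_len]
```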

Of course, I may have interpreted this error wrongly; I'm new to this.

cheers

+ java -Xmx4g -Djava.io.tmpdir=/lager2/rcug/2018/6E77/fastq2/tobias/tmp -jar /lager2/rcug/2018/6E77/fastq2/tobias/Drop-seq_tools-1.13/jar/dropseq.jar TagBamWithReadSequenceExtended SUMMARY=logs/PE_03_S4_CELL_barcode.txt BASE_RANGE=1-12 BASE_QUALITY=25 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 INPUT=data/PE_03_S4_unaligned.bam OUTPUT=data/PE_03_S4_BC_tagged_unmapped.bam
Picked up _JAVA_OPTIONS: -Dhttp.proxyHost=172.24.2.50 -Dhttp.proxyPort=8080 -Dhttps.proxyHost=172.24.2.50 -Dhttps.proxyPort=8080
[Thu May 31 17:40:42 CEST 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended INPUT=data/PE_03_S4_unaligned.bam OUTPUT=data/PE_03_S4_BC_tagged_unmapped.bam SUMMARY=logs/PE_03_S4_CELL_barcode.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=25 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC    TAG_BARCODED_READ=false HARD_CLIP_BASES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu May 31 17:40:42 CEST 2018] Executing as rcug@hpc-rc03 on Linux 4.4.0-109-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Thu May 31 17:40:42 CEST 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058354688
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 11
        at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.scoreBaseQuality(TagBamWithReadSequenceExtended.java:251)
        at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.processReadPair(TagBamWithReadSequenceExtended.java:220)
        at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.doWork(TagBamWithReadSequenceExtended.java:154)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
        at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
    Error in rule BC_tags:
        jobid: 94
        output: data/PE_03_S4_BC_tagged_unmapped.bam, logs/PE_03_S4_CELL_barcode.txt

Number of Genes/Transcripts wrong in species_plots

I have noticed unusually high numbers of genes in the *species_plot_genes.pdf files and am wondering if others see the same behaviour.
After some debugging I suspect this line to be the culprit:

colnames(a)=c("cellBC", "numGenes", "numTranscripts")

It reads in, for example, this dge file: summary/HUMAN/mysample_dge.summary.txt, which is structured like this:

...
 ## METRICS CLASS    org.broadinstitute.dropseqrna.barnyard.DigitalExpression$DESummary
 CELL_BARCODE    NUM_GENIC_READS    NUM_TRANSCRIPTS    NUM_GENES
 GTTAAGCTCAACTCTT    90448    71198    7749
 GTGCTTCTCGGGAGTA    86394    69074    7319
 ACGAGGAGTAGGGACT    81724    64794    7215
...

i.e. it has 4 columns, whereas only 3 column names are assigned, which shifts the column names as shown here:

 > head(a)
       CELL_BARCODE NUM_GENIC_READS NUM_TRANSCRIPTS NUM_GENES
 1 GTTAAGCTCAACTCTT           90448           71198      7749
 2 GTGCTTCTCGGGAGTA           86394           69074      7319
 3 ACGAGGAGTAGGGACT           81724           64794      7215
 4 GACTAACAGGCGCTCT           72773           57258      6978
 5 CGGGTCATCGGCGCTA           64090           51103      7039
 6 ACGATGTAGCTAGGCA           62087           48764      6677
 >   colnames(a)=c("cellBC", "numGenes", "numTranscripts")
 > head(a)
             cellBC numGenes numTranscripts   NA
 1 GTTAAGCTCAACTCTT    90448          71198 7749
 2 GTGCTTCTCGGGAGTA    86394          69074 7319
 3 ACGAGGAGTAGGGACT    81724          64794 7215
 4 GACTAACAGGCGCTCT    72773          57258 6978
 5 CGGGTCATCGGCGCTA    64090          51103 7039
 6 ACGATGTAGCTAGGCA    62087          48764 6677

Unless I'm mistaken, replacing the line in question with this should fix it:

 colnames(a)=c("cellBC", "numGenicReads", "numTranscripts", "numGenes")

If this bug is confirmed, I'm happy to do the corrections!

MinCellFraction parameter to estimate STAMP count from config.yaml disappeared

I've noticed that the automatic determination of STAMPs was taken out in v0.23,
i.e. the following line, which depended on the cell fraction parameter (MinCellFraction = 0.001 or so in config.yaml):

id = length(subset(diff(y)[seq(x_scale)],diff(y)[seq(x_scale)] > fraction))

This has always been a bit mysterious to me; nevertheless, it let you estimate the number of STAMPs (semi-)automatically, so I was wondering why it was taken out (at least this is what I think happened, please correct me if I'm wrong)?
Are there any plans to automate this step again?
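For reference, my reading of the removed R one-liner is that it counted how many barcodes still add more than the configured fraction to the cumulative read curve; re-expressed as a sketch (my interpretation, so take it with a grain of salt):

```python
def estimate_stamps(cumulative, fraction, x_scale=None):
    """Count barcodes whose marginal gain on the cumulative read-fraction
    curve exceeds `fraction` -- a rough mirror of the removed heuristic.
    `cumulative` is the cumulative fraction of reads over barcodes sorted
    by read count; `fraction` plays the role of MinCellFraction."""
    if x_scale is None:
        x_scale = len(cumulative) - 1
    diffs = [b - a for a, b in zip(cumulative, cumulative[1:])]
    return sum(1 for d in diffs[:x_scale] if d > fraction)
```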


Unrelatedly (not sure this deserves an issue by itself):
One other minor comment is that the location of the Snakefile should be discussed in the wiki.

I.e. the commands here assume the Snakefile is in the project directory (the root folder containing the experiment).

This has to be done manually for each project, so I'd include in the wiki something like:

The Snakefile which ships with dropSeqPipe needs to be copied or linked into each project directory, i.e.

ln -s /installpath/dropSeqPipe/Snakefile .

or copied:

cp /installpath/dropSeqPipe/Snakefile .

or specified as parameter:

snakemake --snakefile /installpath/dropSeqPipe/Snakefile --cores 8 qc

Error opening file: hg19_mm10_transgenes.rRNA.intervals

Hi,

I have another error:
Exception in thread "main" htsjdk.samtools.SAMException: Error opening file: /opt/dropSeqPipe/ref/hg19_mm10_transgenes.rRNA.intervals

This seems to be a new feature of Drop-seq tools 1.13. Can you please let me know how to fix this? Also, what does this file do?

Thanks!

Empty summary files

I ran the whole pipeline (meta qc filter map extract) on one sample with 4M reads. Most of the pipeline ran without any issues, and then an R script threw an error near the end of the analysis. When I started debugging the scripts I found that all of the summary files are empty (apart from their headers).

You can find the reports here. What's surprising is that the results from qc look reasonable. Also, I'm no expert on BAM files, but they all seem pretty reasonable to me (I added the first 100 lines of the intermediate BAM files, converted to plain text with samtools view, to the bamheads directory). You can find the full logs from running the pipeline in stderr.log and stdout.log. Perhaps I missed some option or specified a wrong threshold?

I used a standard pair of gtf/fasta files downloaded straight from GENCODE.

Generate Plot | Error in mmm

Hi,

In the generate-plot step I am getting the following error. What am I missing?
I am using the 0.23a version.

Plotting knee plots
Warning messages:
1: Removed 716167 rows containing missing values (geom_point).
2: Removed 563180 rows containing missing values (geom_point).
3: Removed 1336568 rows containing missing values (geom_point).
4: Removed 1230829 rows containing missing values (geom_point).
5: Removed 1180851 rows containing missing values (geom_point).
6: Removed 1473024 rows containing missing values (geom_point).
Plotting base stats
Loading required package: magrittr
Error in mmm < each : comparison of these types is not implemented
Calls: plotRNAMetrics ... Reduce -> f -> rbind_gtable -> compare_unit -> unit -> comp
Execution halted
Traceback (most recent call last):
File "/bin/dropSeqPipe", line 9, in
load_entry_point('dropSeqPipe==0.23a0', 'console_scripts', 'dropSeqPipe')()
File "/usr/lib/python3.4/site-packages/dropSeqPipe/main.py", line 182, in main
shell(base_summary)
File "/usr/lib/python3.4/site-packages/snakemake-3.10.1-py3.4.egg/snakemake/shell.py", line 80, in new
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'Rscript /usr/lib/python3.4/site-packages/dropSeqPipe/Rscripts/singleCell/rna_metrics.R /PROJECTS/' returned non-zero exit status 1

dropSeqPipe Installation Error

I was trying to install dropSeqPipe using the instructions provided at https://github.com/Hoohm/dropSeqPipe/wiki/Installation, and everything went smoothly until I got to the last step. When I tried to run the command conda env create --file environment.yaml, I got the following result:

bash-4.1$ conda env create --file environment.yaml
Solving environment: failed

ResolvePackageNotFound:
- snakemake=4.8.0
- aioeasywebdav
- snakemake=4.8.0
- google-cloud-storage
- snakemake=4.8.0
- python-irodsclient
- snakemake=4.8.0
- ratelimiter

When I tried to install the dependencies (for example snakemake 4.8.0), I received another error saying that it wasn't available, as seen below:

bash-4.1$ conda install snakemake=4.8.0
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

- snakemake=4.8.0

Current channels:

- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/free/linux-64
- https://repo.anaconda.com/pkgs/free/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
- https://repo.anaconda.com/pkgs/pro/linux-64
- https://repo.anaconda.com/pkgs/pro/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

`https://anaconda.org`

and use the search bar at the top of the page.

Am I missing a critical step? Thanks for your help in advance!

generate-plot error

Hi,

I am using the 0.23a version. At the generate-plot step it looks like I got all the required plots; however, I see an error message at the end. Just wondering if I am missing anything? How can I overcome this?

Plotting knee plots
There were 16 warnings (use warnings() to see them)
Plotting base stats
Loading required package: magrittr
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
geom_smooth() using method = 'loess'
Error in [<-.data.frame(*tmp*, i, 3, value = c(39858902L, 28250991L :
replacement has 2 rows, data has 1
Calls: plotBCDrop -> [<- -> [<-.data.frame
Execution halted
Traceback (most recent call last):
File "/bin/dropSeqPipe", line 9, in
load_entry_point('dropSeqPipe==0.23a0', 'console_scripts', 'dropSeqPipe')()
File "/usr/lib/python3.4/site-packages/dropSeqPipe/main.py", line 182, in main
shell(base_summary)
File "/usr/lib/python3.4/site-packages/snakemake-3.10.1-py3.4.egg/snakemake/shell.py", line 80, in new
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'Rscript /usr/lib/python3.4/site-packages/dropSeqPipe/Rscripts/singleCell/rna_metrics.R /hd2/' returned non-zero exit status 1

Thanks

advice on knee plot

Although this is not an issue with running the dropSeqPipe pipeline, since the wiki has an explanation of the knee plot I thought I'd post it here in the hope of getting advice.

This is the second run of Drop-seq that we've analyzed, and both times my knee plots never show a clear bend (see attached).
knee_plot.pdf
What is your advice for this? Thanks!

Create_star_index failing

Hey,
I tried to run snakemake --cores 12 qc filter map for a mixed-species experiment but get an error while creating the STAR index. Unfortunately the error message is not telling me much; maybe one of you has a little more experience with this kind of output!

Here is the snakemake.log:


Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       MergeBamAlignment
        1       STAR_align
        1       TagReadWithGeneExon
        1       bam_hist
        1       bead_errors_metrics
        1       create_dict
        1       create_refFlat
        1       create_star_index
        1       extract_reads_expression
        1       extract_umi_expression
        1       map
        1       merge_counts
        1       merge_umi
        1       multiqc_star
        1       plot_knee_plot
        1       plot_yield
        1       sort_sam
        1       violine_plots
        18

[Fri Dec 14 03:21:46 2018]
rule create_star_index:
    input: ~/scratch/dropSeqPipe/ref_genome/hg19_mm10/hg19_mm10_transgenes.fasta,~/scratch/dropSeqPipe/ref_genome/hg19_mm10/hg19_mm10_transgenes.gtf
    output: ~/scratch/dropSeqPipe/ref_genome/hg19_mm10/STAR_INDEX/SA_83/SA
    jobid: 35
    wildcards: star_index_prefix=~/scratch/dropSeqPipe/ref_genome/hg19_mm10/STAR_INDEX/SA, read_length=83
    threads: 8

Activating conda environment: ~/dropSeqPipe/.snakemake/conda/fe458c9e
[Fri Dec 14 04:05:08 2018]
Error in rule create_star_index:
    jobid: 35
    output: ~/dropSeqPipe/ref_genome/hg19_mm10/STAR_INDEX/SA_83/SA
    conda-env: ~/dropSeqPipe/.snakemake/conda/fe458c9e

RuleException:
CalledProcessError in line 107 of ~/dropSeqPipe/rules/generate_meta.smk:
Command 'source activate ~/dropSeqPipe/.snakemake/conda/fe458c9e; set -euo pipefail;  mkdir -p ~/scratch/dropSeqPipe/ref_genome/hg19_mm10/STAR_INDEX/SA_83; STAR           --runThreadN 8          --runMode genomeGenerate                --genomeDir ~/scratch/dropSeqPipe/ref_genome/hg19_mm10/STAR_INDEX/SA_83            --genomeFastaFiles ~/scratch/dropSeqPipe/ref_genome/hg19_mm10/hg19_mm10_transgenes.fasta           --sjdbGTFfile ~/scratch/dropSeqPipe/ref_genome/hg19_mm10/hg19_mm10_transgenes.gtf          --limitGenomeGenerateRAM 30000000000            --sjdbOverhang 82               --genomeChrBinNbits 18 ' returned non-zero exit status 137.
  File "~/dropSeqPipe/rules/generate_meta.smk", line 107, in __rule_create_star_index
  File "~/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

generate-meta | UnboundLocalError: local variable 'samples_yaml' referenced before assignment

Hi,

I got the following error in the generate-meta step with the mouse Ensembl GRCm38 genome and its GTF file.
The Log.out file shows that all steps completed (it says "done ..... finished successfully
DONE: Genome generation, EXITING").
It looks like I have all the necessary files generated (though I'm not sure whether they are complete).

I just wonder how to fix this error:

[Sun Oct 22 22:23:13 CDT 2017] org.broadinstitute.dropseqrna.annotation.CreateIntervalsFiles SEQUENCE_DICTIONARY=mm38.dict REDUCED_GTF=mm38_reduced.gtf OUTPUT=. PREFIX=mm38 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

[Sun Oct 22 22:23:18 2017] Finished job 1.
[Sun Oct 22 22:23:18 2017] 5 of 6 steps (83%) done
[Sun Oct 22 22:23:18 2017]
[Sun Oct 22 22:23:18 2017] localrule all:
input: mm38.rRNA.intervals, STAR_INDEX_NO_GTF/SA
jobid: 0
[Sun Oct 22 22:23:18 2017]
[Sun Oct 22 22:23:18 2017] Finished job 0.
[Sun Oct 22 22:23:18 2017] 6 of 6 steps (100%) done
Traceback (most recent call last):
File "/bin/dropSeqPipe", line 9, in
load_entry_point('dropSeqPipe==0.23a0', 'console_scripts', 'dropSeqPipe')()
File "/usr/lib/python3.4/site-packages/dropSeqPipe/main.py", line 96, in main
if(samples_yaml['GLOBAL']['data_type'] not in ['singleCell', 'bulk']):
UnboundLocalError: local variable 'samples_yaml' referenced before assignment

Multiqc error

The pipeline repeatedly failed at the MultiQC report step.

Fix: install the MultiQC development version (1.3):
pip install git+https://github.com/ewels/MultiQC.git --user

Unnecessary fix:
The double forward slash in the log paths does not seem to affect the output, but it looks odd:
[INFO ] multiqc : Searching '${HOME}/yourdirectory//logs'
Edit: main.py
Change: multiqc = 'multiqc -o {0} {0}/logs {0}/summary --force'.format(args.folder_path)
To: multiqc = 'multiqc -o {0} {0}logs {0}summary --force'.format(args.folder_path)
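A more robust alternative to hand-editing the slashes would be to build the paths with os.path.join instead of string formatting (a sketch, not the actual main.py code):

```python
import os

folder_path = "/home/user/project/"  # note the trailing slash

# String formatting duplicates the separator when the folder already ends in "/":
bad = '{0}/logs'.format(folder_path)       # '/home/user/project//logs'

# os.path.join only adds a separator when one is needed:
good = os.path.join(folder_path, 'logs')   # '/home/user/project/logs'
```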

Premature end of file error

Hi,

I got the following error in the pre-processing step (using the latest version, 0.24),
specifically at the MergeBamAlignment step:

[Tue Oct 24 11:01:49 CDT 2017] MergeBamAlignment UNMAPPED_BAM=Secondary_tagged_unmapped.bam ALIGNED_BAM=[Secondary_Aligned_sorted.sam] OUTPUT=/dev/stdout PAIRED_RUN=false INCLUDE_SECONDARY_ALIGNMENTS=false COMPRESSION_LEVEL=0 REFERENCE_SEQUENCE=/home/grcm38/mm38.fa CLIP_ADAPTERS=true IS_BISULFITE_SEQUENCE=false ALIGNED_READS_ONLY=false MAX_INSERTIONS_OR_DELETIONS=1 ATTRIBUTES_TO_REVERSE=[OQ, U2] ATTRIBUTES_TO_REVERSE_COMPLEMENT=[E2, SQ] READ1_TRIM=0 READ2_TRIM=0 ALIGNER_PROPER_PAIR_FLAGS=false SORT_ORDER=coordinate PRIMARY_ALIGNMENT_STRATEGY=BestMapq CLIP_OVERLAPPING_READS=true ADD_MATE_CIGAR=true UNMAP_CONTAMINANT_READS=false MIN_UNCLIPPED_BASES=32 MATCHING_DICTIONARY_TAGS=[M5, LN] UNMAPPED_READ_STRATEGY=DO_NOT_CHANGE ADD_PG_TAG_TO_READS=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

Linux 4.8.13-100.fc23.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_111-b16; Deflater: Intel; Inflater: Intel; Picard version: 2.12.1-SNAPSHOT

Here is the error:

INFO 2017-10-24 11:34:56 TagReadWithGeneExon Processed 62,000,000 records. Elapsed time: 00:33:07s. Time for last 1,000,000: 17s. Last read position: 11:50,385,810
[Tue Oct 24 11:40:37 CDT 2017] org.broadinstitute.dropseqrna.metrics.TagReadWithGeneExon done. Elapsed time: 38.80 minutes.
Runtime.totalMemory()=1540358144
Exception in thread "main" htsjdk.samtools.FileTruncatedException: Premature end of file
at htsjdk.samtools.util.BlockCompressedInputStream.readBlock(BlockCompressedInputStream.java:382)
at htsjdk.samtools.util.BlockCompressedInputStream.available(BlockCompressedInputStream.java:127)
at htsjdk.samtools.util.BlockCompressedInputStream.read(BlockCompressedInputStream.java:252)
at java.io.DataInputStream.read(DataInputStream.java:149)
at htsjdk.samtools.util.BinaryCodec.readBytesOrFewer(BinaryCodec.java:404)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:380)
at htsjdk.samtools.util.BinaryCodec.readBytes(BinaryCodec.java:366)
at htsjdk.samtools.BAMRecordCodec.decode(BAMRecordCodec.java:199)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.getNextRecord(BAMFileReader.java:661)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.advance(BAMFileReader.java:635)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:629)
at htsjdk.samtools.BAMFileReader$BAMFileIterator.next(BAMFileReader.java:599)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:544)
at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:518)
at org.broadinstitute.dropseqrna.metrics.TagReadWithGeneExon.doWork(TagReadWithGeneExon.java:94)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:29)
[Tue Oct 24 11:40:37 2017] Error in job stage3 while creating output file Secondary_gene_exon_tagged.bam.
[Tue Oct 24 11:40:37 2017] RuleException:
CalledProcessError in line 42 of /usr/lib/python3.4/site-packages/dropSeqPipe/Snakefiles/singleCell/post_align.snake:
Command 'java -Djava.io.tmpdir=/home/tmp -Xmx90000m -jar /home/bin/picard/picard.jar MergeBamAlignment REFERENCE_SEQUENCE=/home/bin/DropSeqMetaData/mm38.fa UNMAPPED_BAM=Secondary_tagged_unmapped.bam ALIGNED_BAM=Secondary_Aligned_sorted.sam INCLUDE_SECONDARY_ALIGNMENTS=false PAIRED_RUN=false OUTPUT=/dev/stdout COMPRESSION_LEVEL=0|
/home/bin/Drop-seq_tools-1.12/TagReadWithGeneExon OUTPUT=Secondary_gene_exon_tagged.bam INPUT=/dev/stdin ANNOTATIONS_FILE=/home/bin/DropSeqMetaData/mm38.refFlat TAG=GE CREATE_INDEX=true
' returned non-zero exit status 1
File "/usr/lib64/python3.4/concurrent/futures/thread.py", line 54, in run
[Tue Oct 24 11:40:37 2017] Removing output files of failed job stage3 since they might be corrupted:
Secondary_gene_exon_tagged.bam
[Tue Oct 24 11:40:37 2017] Will exit after finishing currently running jobs.
[Tue Oct 24 11:40:37 2017] Exiting because a job execution failed. Look above for error message
Traceback (most recent call last):
File "/bin/dropSeqPipe", line 9, in
load_entry_point('dropSeqPipe==0.23a0', 'console_scripts', 'dropSeqPipe')()
File "/usr/lib/python3.4/site-packages/dropSeqPipe/main.py", line 152, in main
shell(post_align)
File "/usr/lib/python3.4/site-packages/snakemake-3.10.1-py3.4.egg/snakemake/shell.py", line 80, in new
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /usr/lib/python3.4/site-packages/dropSeqPipe/Snakefiles/singleCell/post_align.snake --cores 16 -pT -d /PROJECTS/MOUSE --configfile /home/bin/DropSeqPipe24/dropSeqPipe/local.yaml ' returned non-zero exit status 1


How can I solve this?

thanks,

R figure plotting error

When I run the filter mode, I get this error:

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/path/to/project/projectID/temp
INFO    2018-09-11 16:39:14     TagBamWithReadSequenceExtended  Processed     3,000,000 records.  Elapsed time: 00:00:20s.  Time for last 1,000,000:    6s.  Last read position: */*
[Tue Sep 11 16:39:14 EEST 2018] org.broadinstitute.dropseqrna.readtrimming.PolyATrimmer INPUT=data/sample_name_tags_start_filtered_unmapped.bam OUTPUT=data/sample_name_trimmed_unmapped.bam OUTPUT_SUMMARY=logs/sample_name_polyA_trim.txt MISMATCHES=0 NUM_BASES=5    USE_NEW_TRIMMER=false TRIM_TAG=ZP ADAPTER=^XM^XCACGTACTCTGCGTTGCTACCACTG MAX_ADAPTER_ERROR_RATE=0.1 MIN_ADAPTER_MATCH=4 MIN_POLY_A_LENGTH=20 MIN_POLY_A_LENGTH_NO_ADAPTER_MATCH=6 DUBIOUS_ADAPTER_MATCH_LENGTH=6 MAX_POLY_A_ERROR_RATE=0.1 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Sep 11 16:39:14 EEST 2018] Executing as user@cloud_node on Linux 3.10.0-862.9.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
INFO    2018-09-11 16:39:16     TagBamWithReadSequenceExtended  Processed    17,000,000 records.  Elapsed time: 00:01:58s.  Time for last 1,000,000:    6s.  Last read position: */*
INFO    2018-09-11 16:39:18     TrimStartingSequence    Processed     1,000,000 records.  Elapsed time: 00:00:11s.  Time for last 1,000,000:   11s.  Last read position: */*
INFO    2018-09-11 16:39:21     TagBamWithReadSequenceExtended  Processed     4,000,000 records.  Elapsed time: 00:00:27s.  Time for last 1,000,000:    6s.  Last read position: */*
INFO    2018-09-11 16:39:23     TagBamWithReadSequenceExtended  Processed    18,000,000 records.  Elapsed time: 00:02:05s.  Time for last 1,000,000:    6s.  Last read position: */*
INFO    2018-09-11 16:39:26     PolyATrimmer    Processed     1,000,000 records.  Elapsed time: 00:00:11s.  Time for last 1,000,000:   11s.  Last read position: */*
INFO    2018-09-11 16:39:28     TrimStartingSequence    Processed     2,000,000 records.  Elapsed time: 00:00:22s.  Time for last 1,000,000:   10s.  Last read position: */*
INFO    2018-09-11 16:39:28     TagBamWithReadSequenceExtended  Processed     5,000,000 records.  Elapsed time: 00:00:35s.  Time for last 1,000,000:    7s.  Last read position: */*
INFO    2018-09-11 16:39:29     TagBamWithReadSequenceExtended  Processed    19,000,000 records.  Elapsed time: 00:02:11s.  Time for last 1,000,000:    6s.  Last read position: */*
Error in read.table(file = snakemake@input[[1]], header = T, stringsAsFactors = F,  :
  no lines available in input
Execution halted
    [Tue Sep 11 16:39:33 2018]
    Error in rule plot_barcode_start_trim:
        jobid: 22
        output: plots/sample_name_start_trim.pdf
        conda-env: /path/to/project/projectID/dropSeqPipe/.snakemake/conda/59565c8a

RuleException:
CalledProcessError in line 210 of /path/to/project/projectID/dropSeqPipe/rules/filter.smk:
Command 'source activate /path/to/project/projectID/dropSeqPipe/.snakemake/conda/59565c8a; set -euo pipefail;  Rscript /path/to/project/projectID/dropSeqPipe/scripts/.snakemake.m14wgptb.plot_start_trim.R ' returned non-zero exit status 1.
  File "/path/to/project/projectID/dropSeqPipe/rules/filter.smk", line 210, in __rule_plot_barcode_start_trim
  File "/path/to/project/projectID/miniconda3/lib/python3.6/concurrent/futures/thread.py", line 56, in run
[Tue Sep 11 16:39:34 2018]
Finished job 55.
21 of 73 steps (29%) done

It seems the script "plot_start_trim.R" fails because the input file it's trying to read has no lines. It's possible that my data are adapter-free; at least I didn't see any traces of adapters in MultiQC. Can I somehow skip this step, or do you have any other suggestions on how to deal with this?
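One way to make such plotting rules tolerant of adapter-free data would be to check the log file before plotting and emit a placeholder when there is nothing to show. A hypothetical guard, sketched in Python (the actual script is R, and the function name here is made up):

```python
import os

def has_plottable_rows(path):
    """True only if the file exists and holds a header plus at least one data row."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return False
    with open(path) as fh:
        return sum(1 for _ in fh) > 1
```

The plotting rule could then skip the plot (or write an empty PDF) whenever this check fails, instead of aborting the whole run.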

InputFunctionException in line 5 /rules/map.smk

Hi,

I cannot figure out how to fix this error:

InputFunctionException in line 5 of /rules/map.smk:
KeyError: 'the label [sample1_N706_S2_L001_MOUSE] is not in the [index]'
Wildcards:
sample=sample1_N706_S2_L001_MOUSE

I ran snakemake --cores 12 qc filter map and it ran to completion.
The error above then occurred when running: snakemake --cores 12 split_species extract_species

here is my config.yaml:
LOCAL:
    TMPDIR: ~/tmp
    DROPSEQ-wrapper: opt/Drop-seq_tools-1.13/drop-seq-tools-wrapper.sh
    MEMORY: 32g
META:
    species:
        - HUMAN
        - MOUSE
    species_ratio: 0.20
    reference_file: hg19_mm10_transgenes.fasta
    annotation_file: hg19_mm10_transgenes.gtf
    reference_folder: opt/dropSeqPipe/ref
FILTER:
    IlluminaClip: NexteraPE-PE.fa
    5PrimeSmartAdapter: GCCTGTCCGCGGAAGCAGTGGTATCAACGCAGAGTAC
    Cell_barcode:
        start: 1
        end: 12
        min_quality: 30
        num_below_quality: 0
    UMI:
        start: 13
        end: 20
        min_quality: 30
        num_below_quality: 0
EXTRACTION:
    bc_edit_distance: 0
    min_count_per_umi: 1
STAR_PARAMETERS:
    outFilterMismatchNmax: 10
    outFilterMismatchNoverLmax: 0.3
    outFilterMismatchNoverReadLmax: 1
    outFilterMatchNmin: 0
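For reference, the KeyError is a failed lookup of the wildcard in the samples table: the split_species/extract_species targets presumably append the species suffix (here _MOUSE) to the sample name, so the combined wildcard is no longer a key of the samples index. A minimal illustration with a hypothetical sample name:

```python
# A stand-in for the samples table index; the sample name is hypothetical.
samples_index = ["sample1_N706_S2_L001"]

wildcard = "sample1_N706_S2_L001_MOUSE"
assert wildcard not in samples_index   # this lookup is what raises the KeyError

# Stripping the trailing "_<SPECIES>" suffix recovers the sample name.
base = wildcard.rsplit("_", 1)[0]
assert base in samples_index
```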

questions on wiki plots

Hi I have several questions on the wiki plots:

  1. On the cell barcode and UMI quality trim plots, what does "tagged x reads" mean? What is x? At first I thought x was the total number of reads in the sample, but mine shows a different number.
    It would be nice to show a percentage as well.
  2. On the polyA trimming plot, is the x axis the length of the reads after the polyA is trimmed, or the length of the polyA itself? How should we use this plot, and what is the distribution supposed to tell us?
  3. Similarly, what are we supposed to see in the distribution of the SMART adapter?
  4. In the barnyard plot, what is the definition of "No Call"?

Thank you again for your help. I have to say this pipeline and the wiki make it much easier to run the programs.

Consider a minimal reproducible tutorial

Could you please consider preparing a minimal reproducible example to help new users make better use of the pipeline, in the spirit of "This package is trying to be as user friendly as possible"?

I noticed that some issues (#60 #61 #62 #63) probably come from new users. A minimal example could help them identify technical problems, e.g., environment settings and access to tools.

SureCell / ddSeq support

Hi,

I just got some data generated with SureCell libraries on a ddSEQ machine (i.e. the protocol by Illumina and Bio-Rad). I would like to test your pipeline for the analysis, but I'm not sure whether it can be used and, if so, how to fill in the config.yaml.
The barcodes are in read 1; however, they are not at a fixed position, and the cell barcode is split into three parts by spacer sequences:

(read structure diagram: single-cell-rna-algorithm-tech-note-1070-2016-015)

Below is a small example from the first read fastq file of one of my samples.

Is it possible to process this data with dropSeqPipe?

Cheers


@D00457:259:HKWJNBCX2:1:1105:1128:2079 1:N:0:CCTAAGAC
CTCGGCGTTAGCCATCGCATTGCGGATTGTACCTCTGAGCTGAATCGCCTACGTCCCCGGAGACCNNT
+
<DDD0<CFHHHIIIIIIIIIIIIIIIHIHIGHHHIHHHGHFHHHIHHHIIIIHIIIIIEHHHIII##<
@D00457:259:HKWJNBCX2:1:1105:1168:2089 1:N:0:CCTAAGAC
AATGGAGTAGCCATCGCATTGCACCTTCTACCTCTGAGCTGAAGAAATAACGCCTACGAAGACTTNNT
+
<<<D01<<D1ECH?F0=CEE?<1DG@<1CGEH@HHHHIIHGEGCGEHFHIHGHHHHHIEHHHHEF##<
@D00457:259:HKWJNBCX2:1:1105:1122:2104 1:N:0:CCTAAGAC
ACCCAATAGCCATCGCATTGCCCGTAATACCTCTGAGCTGAATAAGCTACGAAACTGTGGACTTTNNT
+
0<DDDIHHIIEEHHGHIIEHIFDGHHHIIIHIIIH?GHHIIH1<FH1FGHIGHIIHIFHIHE@FH##<
@D00457:259:HKWJNBCX2:1:1105:1102:2126 1:N:0:CCTAAGAC
TTCGTAGAGGTAGCCATCGCATTGCTGAGACTACCTCTGAGCTGAACTCAATACGCTTCGAGCGANNT
+
0<<DBDHHHFCFHEGHIHIHIIIIHHIHGEHIHHIHIHIHI?1<1GHHIHIIIIIGIIGHHGHIH##<
@D00457:259:HKWJNBCX2:1:1105:1158:2127 1:N:0:CCTAAGAC
ACATAGATAGCCATCGCATTGCTAATAGTACCTCTGAGCTGAAGCGAATACGTCCCCCCTGACTTNNT
+
@@B@0<CEGHIIHHI=GEEHCGHEHHEEHHIHFHCHEHCHIHIIHIHIIHHHHI0EHHIII?@1<##<
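For what it's worth, the reads above do follow a recognizable layout: a variable-length barcode block, then two fixed linker sequences separating the remaining barcode blocks. A rough Python sketch of extracting the three barcode parts; the linker sequences and the trailing ACG anchor below are simply read off the example reads, not taken from any official SureCell documentation:

```python
import re

# Linkers inferred from the example reads above; treat them as assumptions.
LINKER1 = "TAGCCATCGCATTGC"
LINKER2 = "TACCTCTGAGCTGAA"
PATTERN = re.compile("^(.{6,11})" + LINKER1 + "(.{6})" + LINKER2 + "(.{6})ACG")

def split_barcode(read):
    """Return the three cell-barcode blocks, or None if the linkers don't match."""
    m = PATTERN.match(read)
    return m.groups() if m else None

read1 = "CTCGGCGTTAGCCATCGCATTGCGGATTGTACCTCTGAGCTGAATCGCCTACGTCCCCGGAGACCNNT"
print(split_barcode(read1))  # ('CTCGGCGT', 'GGATTG', 'TCGCCT')
```

Something along these lines could be used to pre-process the reads into a fixed-position barcode before feeding them to the pipeline, though it is only a sketch of the idea.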

Pipeline stops when it start making R plots

Hi,

dropSeqPipe worked well for me a couple of months ago, but since I updated it, it keeps stopping when it starts running the R scripts. It looks like some conda environments are incompatible with some R libraries such as ggplot2, I think. Anyway, here is the error I get; can you help? Thank you very much.

localrule plot_BC_drop:
input: logs/Human_mouse_0.25M_CELL_barcode.txt, logs/Human_mouse_0.25M_UMI_barcode.txt, logs/Human_mouse_0.25M_reads_left.txt, logs/Human_mouse_0.25M_reads_left_trim.txt
output: plots/BC_drop.pdf
jobid: 22

Activating conda environment: /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/.snakemake/conda/122c4b70
Activating conda environment: /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/.snakemake/conda/122c4b70
Error in rule plot_yield:
jobid: 31
Error in rule plot_BC_drop:
output: plots/yield.pdf
jobid: 22
conda-env: /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/.snakemake/conda/122c4b70
output: plots/BC_drop.pdf

    conda-env: /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/.snakemake/conda/122c4b70

RuleException:
CalledProcessError in line 165 of /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/rules/map.smk:
Command 'source activate /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/.snakemake/conda/122c4b70; set -euo pipefail; Rscript /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/scripts/.snakemake.fx15wxlz.plot_yield.R ' returned non-zero exit status 1.
File "/home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/rules/map.smk", line 165, in __rule_plot_yield
File "/home/tommy/miniconda3/envs/dropSeqPipe/lib/python3.6/concurrent/futures/thread.py", line 55, in run
RuleException:
CalledProcessError in line 258 of /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/rules/filter.smk:
Command 'source activate /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/.snakemake/conda/122c4b70; set -euo pipefail; Rscript /home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/scripts/.snakemake.7qdvqpqr.plot_BC_drop.R ' returned non-zero exit status 1.
File "/home/tommy/Nadia5_mix_human_mouse/dropSeqPipe/rules/filter.smk", line 258, in __rule_plot_BC_drop
File "/home/tommy/miniconda3/envs/dropSeqPipe/lib/python3.6/concurrent/futures/thread.py", line 55, in run
Removing temporary output file data/Human_mouse_0.25M/Aligned.out.bam.
Finished job 57.

Double prefix in meta rule

Having followed the wiki, I ran into an error in the meta rule, i.e. running the following:

snakemake --snakefile ~/analysis/dropseq/software/dropSeqPipe/Snakefile meta

got me this error at some point:

rule create_intervals:
    input: /home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.reduced.gtf, /home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa.dict
    output: /home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa.rRNA.intervals
    jobid: 1
    wildcards: reference_prefix=/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa

Finished job 4.
3 of 6 steps (50%) done
Error in rule create_intervals:
    jobid: 1
    output: /home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa.rRNA.intervals

RuleException:
CalledProcessError in line 64 of /home/user/analysis/dropseq/software/dropSeqPipe/rules/generate_meta.smk:
Command ' set -euo pipefail;  ~/analysis/dropseq/software/Drop-seq_tools-1.13/drop-seq-tools-wrapper.sh -m 20g -p CreateIntervalsFiles		REDUCED_GTF=/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.reduced.gtf SEQUENCE_DICTIONARY=/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa.dict		O=/home/user/analysis/dropseq/data/mixed		PREFIX=/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa ' returned non-zero exit status 1.
  File "/home/user/analysis/dropseq/software/dropSeqPipe/rules/generate_meta.smk", line 64, in __rule_create_intervals
  File "/home/user/.conda/envs/dropSeqPipe/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Will exit after finishing currently running jobs.
Terminating processes on user request.
Cancelling snakemake on user request.

Running the failed command manually:

~/analysis/dropseq/software/Drop-seq_tools-1.13/drop-seq-tools-wrapper.sh -m 20g -p CreateIntervalsFiles  \
REDUCED_GTF=/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.reduced.gtf \
SEQUENCE_DICTIONARY=/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa.dict  \
O=/home/user/analysis/dropseq/data/mixed  \
PREFIX=/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa

got me this:

....
at org.broadinstitute.dropseqrna.annotation.CreateIntervalsFiles.write(CreateIntervalsFiles.java:215)
at org.broadinstitute.dropseqrna.annotation.CreateIntervalsFiles.doWork(CreateIntervalsFiles.java:160)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)
Caused by: java.io.FileNotFoundException: /home/user/analysis/dropseq/data/mixed/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa.genes.intervals (No such file or directory)
....

The path seems to be prepended twice. This might be a config error on my side, but digging a bit deeper and changing the manual command from:

PREFIX=/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa

to

PREFIX=hg19_mm10_transgenes.fa

seems to do the trick as a workaround.
The culprit seems to be in Drop-seq_tools-1.13/public/src/java/org/broadinstitute/dropseqrna/annotation/CreateIntervalsFiles.java at line 219:

private File makeIntervalFile(final String intervalType) {
    return new File(OUTPUT, PREFIX + "." + intervalType + ".intervals");
}

Here new File(OUTPUT, PREFIX + ...) resolves PREFIX relative to OUTPUT, so an absolute PREFIX gets the output directory prepended a second time. I haven't completely understood the code, but maybe the following line needs changing (though I might be completely wrong):

PREFIX={params.reference_prefix}
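If that diagnosis is right, the fix on the Snakemake side would be to pass only the file name rather than the full path as PREFIX, roughly like this (a sketch, not the actual rule code):

```python
import os

reference_prefix = "/home/user/analysis/dropseq/data/mixed/hg19_mm10_transgenes.fa"

# CreateIntervalsFiles resolves PREFIX relative to OUTPUT, so passing the
# basename avoids the directory being prepended twice.
prefix = os.path.basename(reference_prefix)
print(prefix)  # hg19_mm10_transgenes.fa
```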

Sorry for the lengthy report, hope it's not too confusing.
Happy to provide more details.

Confusions about the 5' smart adapter and Trimmomatic adapter files

Hi,
Thanks for developing such a convenient and user-friendly tool! I am analyzing some drop-seq data now and want to use dropSeqPipe for the upstream preprocessing. I'm not so familiar with drop-seq and have some questions about this pipeline:

  1. What does the 5' SMART adapter mean? I read the original Drop-seq paper by Macosko EZ and noted there are two "adapters" after the formation of STAMPs. One is the adapter initially attached to the bead, called the "primer handle" in his paper. The other is an adapter ending with GGG, which is added during reverse transcription. Which adapter is the 5' SMART adapter? Or are these two "adapters" the same, so that both can be removed by TrimStartingSequence?
  2. What is the Trimmomatic adapter file for? Is it aimed at trimming the adapters added during Illumina sequencing? Do I need to include the 5' SMART primer sequence in the Trimmomatic adapter files?
    I'd appreciate it very much if you could help me figure out these definitions!
    Thanks!

Yang
(Edit, 2h after asking this question:
I read more about the Drop-seq details and now understand that the two adapters in my first question are the same; after tagmentation there will be only one 5' SMART adapter in a read.)

R package problem, plotting error

Hey,

I keep running into the same error message. There seems to be a problem during plotting with R and the "reshape2" package. I already installed all the packages and dependencies and checked that I have the current versions, but I still keep getting the same error. Any suggestions?

Activating conda environment: /scratch/hofphi00/dropSeqPipe/.snakemake/conda/439f0232
Activating conda environment: /scratch/hofphi00/dropSeqPipe/.snakemake/conda/439f0232
Activating conda environment: /scratch/hofphi00/dropSeqPipe/.snakemake/conda/118bc3f0
Activating conda environment: /scratch/hofphi00/dropSeqPipe/.snakemake/conda/118bc3f0
Activating conda environment: /scratch/hofphi00/dropSeqPipe/.snakemake/conda/118bc3f0
Activating conda environment: /scratch/hofphi00/dropSeqPipe/.snakemake/conda/118bc3f0
[Tue Nov 27 01:26:19 2018]
Finished job 10.
1 of 7 steps (14%) done
[Tue Nov 27 01:26:20 2018]
Finished job 16.
2 of 7 steps (29%) done
Error: package or namespace load failed for ‘reshape2’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/home/hofphi00/R/x86_64-pc-linux-gnu-library/3.4/stringi/libs/stringi.so':
libicui18n.so.57: cannot open shared object file: No such file or directory
Error: package or namespace load failed for ‘reshape2’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/home/hofphi00/R/x86_64-pc-linux-gnu-library/3.4/stringi/libs/stringi.so':
libicui18n.so.57: cannot open shared object file: No such file or directory
Error: package or namespace load failed for ‘reshape2’ in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/home/hofphi00/R/x86_64-pc-linux-gnu-library/3.4/stringi/libs/stringi.so':
libicui18n.so.57: cannot open shared object file: No such file or directory
Execution halted
Execution halted
Execution halted
[Tue Nov 27 01:26:30 2018]
[Tue Nov 27 01:26:30 2018]
[Tue Nov 27 01:26:30 2018]
Error in rule plot_yield:
Error in rule plot_BC_drop:
Error in rule plot_rna_metrics:
jobid: 20
jobid: 17
jobid: 14
output: plots/yield.pdf
output: plots/BC_drop.pdf

TagBamWithReadSequenceExtended: ArrayIndexOutOfBoundsException: 10

Hi Patrick,

we finally have our first Drop-seq data sequenced and I tried to run it through dropSeqPipe.
Unfortunately, TagBamWithReadSequenceExtended terminates after a short while with:

(dropSeqPipe) jens@KI-V0205:/data/dropTest/dropSeqPipe$ drop-seq-tools-wrapper.sh -m 60g -t ./tmp/ -p TagBamWithReadSequenceExtended SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BASE_QUALITY=30 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam
+ java -Xmx60g -Djava.io.tmpdir=./tmp/ -jar /mnt/software/x86_64/packages/dropSeqTools/1.13/jar/dropseq.jar TagBamWithReadSequenceExtended SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BASE_QUALITY=30 BARCODED_READ=1 DISCARD_READ=false TAG_NAME=XC NUM_BASES_BELOW_QUALITY=1 INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam
[Tue Feb 27 11:17:01 CET 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended INPUT=data/test_unaligned.bam OUTPUT=data/test_BC_tagged_unmapped.bam SUMMARY=logs/test_CELL_barcode.txt BASE_RANGE=1-12 BARCODED_READ=1 DISCARD_READ=false BASE_QUALITY=30 NUM_BASES_BELOW_QUALITY=1 TAG_NAME=XC    TAG_BARCODED_READ=false HARD_CLIP_BASES=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Feb 27 11:17:01 CET 2018] Executing as jens@KI-V0205 on Linux 3.16.0-5-amd64 amd64; OpenJDK 64-Bit Server VM 1.8.0_121-b15; Picard version: 1.13(7bed8f4_1513008033)
[Tue Feb 27 11:17:05 CET 2018] org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=1015021568
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10
	at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.scoreBaseQuality(TagBamWithReadSequenceExtended.java:251)
	at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.processReadPair(TagBamWithReadSequenceExtended.java:220)
	at org.broadinstitute.dropseqrna.utils.TagBamWithReadSequenceExtended.doWork(TagBamWithReadSequenceExtended.java:154)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:205)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:94)
	at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:42)

I assume my fastq/BAM is malformed somewhere (e.g. an R1 read that is too short), although I can't imagine how. Is there a debug flag I can set to get more hints about what fails? Or could you quickly dig into TagBamWithReadSequenceExtended.java at line 251 and tell me what's happening there?
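If the too-short-R1 suspicion is right (BASE_RANGE=1-12 means the barcode alone needs 12 bases, and an ArrayIndexOutOfBoundsException: 10 hints at a 10-base read), a quick check would be to scan the R1 FASTQ for sequences shorter than the barcode range. A rough sketch with a hypothetical helper name:

```python
import gzip

def count_short_reads(fastq_gz, min_len=12):
    """Count FASTQ records whose sequence is shorter than min_len bases.

    min_len=12 mirrors BASE_RANGE=1-12; raise it if the UMI range must fit too.
    """
    short = 0
    with gzip.open(fastq_gz, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1 and len(line.rstrip("\n")) < min_len:
                short += 1
    return short
```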

Best and thanks!
Jens

Edit: I can share the data/test_unaligned.bam file, if needed!

Rerunning "extract" step

Hello,

I am having trouble rerunning the "extract" step successfully.
The description of the knee plot on the "Plots" page suggests rerunning the "extract" step after changing the expected_cells parameter in samples.csv if the clear bend of the curve lies above the expected_cells parameter. After increasing expected_cells, I tried the following command:
snakemake --cores 8 extract --use-conda

However, it doesn't rerun the extract step, and execution ends with a message saying "Nothing to be done". Below are the messages printed to the console. Am I doing something wrong in rerunning only the extract step?

Building DAG of jobs...
Nothing to be done.
Shutting down, this might take some time.
Complete log: /local/.....

Thank you.

still sjdbOverhang Error for very small datasets

When running dropSeqPipe on a rather small test dataset (~1000 reads) I got the following error:

EXITING because of fatal PARAMETERS error: sjdbOverhang <=0 while junctions are inserted on the fly with --sjdbFileChrStartEnd or/and --sjdbGTFfile
SOLUTION: specify sjdbOverhang>0, ideally readmateLength-1

Note that, as suggested in issue #4, I've regenerated the STAR index (generate-meta), which helps only up to a certain read depth.

After some research, it turned out the function get_mean_read_length in singleCell/star_align.snake returned the wrong value.

The reason seems twofold. Firstly, a minimum number of reads is hard-coded (n = 1000000), so any dataset with fewer reads (as in my case) will probably run into this issue.

Secondly, even if the minimum number of reads is met, since the mean is calculated, many trimmed reads might lead to a wrong estimate too.

Not sure what the best solution is: maybe setting the minimum to the number of reads (or 1 million, whichever is smaller), allowing this value to be set manually in the config file, or excluding the --sjdbOverhang parameter altogether?
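A sketch of a more defensive computation that divides by the number of reads actually seen, capped at a sampling limit (a hypothetical function, not the pipeline's code):

```python
def mean_read_length(fastq_lines, max_reads=1_000_000):
    """Mean sequence length over up to max_reads FASTQ records.

    Divides by the number of reads actually observed, so small files
    (fewer than max_reads records) still yield a sensible mean.
    """
    total = n = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the sequence is the 2nd line of each 4-line record
            total += len(line.rstrip("\n"))
            n += 1
            if n >= max_reads:
                break
    return total // n if n else 0

# A 2-record file yields the true mean instead of total_length / 1e6:
reads = ["@r1", "ACGTACGT", "+", "IIIIIIII",
         "@r2", "ACGTACGTACGT", "+", "IIIIIIIIIIII"]
print(mean_read_length(reads))  # 10
```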

A possible workaround for me was to use return(100) instead of return(int(total_length/(n/4))).

Not sure if this issue is relevant enough to be fixed, but I thought I'd share my workaround anyway in case someone else has a similar dataset.

Happy to send more details if needed.

Changing UMI-edit-distance has seemingly no effect

Hi Patrick,

To explore the impact of varying the edit (Hamming) distance, I ran the same data set with different UMI-edit-distance settings in config.yaml.
Not sure what to expect, really, but I was still surprised that none of the cell-barcode counts changed in the *.dge.summary.txt files.
Since the edit distance is passed on to the Drop-seq tools, and to exclude any problem with dropSeqPipe, I ran DigitalExpression from version 1.13 directly with varying edit distances as follows (using the Macosko dataset, but results were the same for other sets as well):

sample=mac_1000_SRR1748411
for EDIT in $(seq 0 3); do
   echo $EDIT
   ~/software/Drop-seq_tools-1.13/DigitalExpression \
     SUMMARY=summary/${sample}_dge.summary.txt.edit${EDIT} \
     OUTPUT=summary/${sample}_umi_expression_matrix.tsv.edit${EDIT} \
     INPUT=data/${sample}_final.bam \
     EDIT_DISTANCE=$EDIT \
     MIN_BC_READ_THRESHOLD=1 \
     NUM_CORE_BARCODES=87 \
     OUTPUT_READS_INSTEAD=false \
     CELL_BARCODE_TAG=XC \
     MOLECULAR_BARCODE_TAG=XM \
     GENE_EXON_TAG=GE \
     STRAND_TAG=GS \
     READ_MQ=10 \
     USE_STRAND_INFO=true \
     RARE_UMI_FILTER_THRESHOLD=0.0 \
     VERBOSITY=INFO \
     QUIET=false \
     VALIDATION_STRINGENCY=STRICT \
     COMPRESSION_LEVEL=5 \
     MAX_RECORDS_IN_RAM=500000 \
     CREATE_INDEX=false \
     CREATE_MD5_FILE=false \
     GA4GH_CLIENT_SECRETS=client_secrets.json
done

The results (showing only the first barcodes for brevity):

mac_1000_SRR1748411_dge.summary.txt.edit3

## htsjdk.samtools.metrics.StringHeader
# org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/mac_1000_SRR1748411_dge.summary.txt.edit3 OUTPUT_READS_INSTEAD=false OUTPUT=summary/mac_1000_SRR1748411_umi_expression_matrix.tsv.edit3 INPUT=data/mac_1000_SRR1748411_final.bam CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS EDIT_DISTANCE=3 READ_MQ=10 MIN_BC_READ_THRESHOLD=1 NUM_CORE_BARCODES=87 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json   
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Oct 10 17:31:43 BST 2018

## METRICS CLASS	org.broadinstitute.dropseqrna.barnyard.DigitalExpression$DESummary
CELL_BARCODE	NUM_GENIC_READS	NUM_TRANSCRIPTS	NUM_GENES
CGTTCTCTCCCC	418305	78517	11974
GTTTTGAGCGAT	339447	73489	16604
TTGCCGTGGAGT	262520	56767	10218
GCGACGACTGCC	274855	55699	10219
CAACGCATCTGA	288624	55612	10559
GCGTTGTCTTTC	254620	54360	9817
...

mac_1000_SRR1748411_dge.summary.txt.edit0

## htsjdk.samtools.metrics.StringHeader
# org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/mac_1000_SRR1748411_dge.summary.txt.edit0 OUTPUT_READS_INSTEAD=false OUTPUT=summary/mac_1000_SRR1748411_umi_expression_matrix.tsv.edit0 INPUT=data/mac_1000_SRR1748411_final.bam CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS EDIT_DISTANCE=0 READ_MQ=10 MIN_BC_READ_THRESHOLD=1 NUM_CORE_BARCODES=87 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json   
## htsjdk.samtools.metrics.StringHeader
# Started on: Wed Oct 10 15:23:00 BST 2018

## METRICS CLASS	org.broadinstitute.dropseqrna.barnyard.DigitalExpression$DESummary
CELL_BARCODE	NUM_GENIC_READS	NUM_TRANSCRIPTS	NUM_GENES
CGTTCTCTCCCC	418305	174290	11974
GTTTTGAGCGAT	339447	126206	16604
TTGCCGTGGAGT	262520	109009	10218
GCGACGACTGCC	274855	108420	10219
CAACGCATCTGA	288624	107455	10559
GACTACCAGAGT	295351	107401	10132
GCGTTGTCTTTC	254620	102031	9817

Judging from the excerpts above, edit distances of 0 and 3 gave the same counts per barcode (and in the complete lists, so did distances 1 and 2, which I could send).
I find that a rather unlikely result and was wondering if you have had similar experiences; maybe something is wrong with the Drop-seq tools, or maybe I'm missing something obvious?

Also, NUM_GENIC_READS and NUM_GENES don't change across edit distances, yet NUM_TRANSCRIPTS does, which is odd.
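For reference, the effect of the edit-distance parameter on UMI counts can be sketched like this (an illustrative greedy collapsing scheme, not necessarily the algorithm Drop-seq tools actually implements; the barcodes are made up):

```python
# Illustrative sketch: counting unique UMIs per gene with and without
# edit-distance collapsing, to show why NUM_TRANSCRIPTS is expected to
# shrink as EDIT_DISTANCE grows while NUM_GENIC_READS stays the same.

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))

def collapse_umis(umis, max_dist: int) -> int:
    """Greedily merge a UMI into an already-kept UMI if they differ by at
    most max_dist bases; return the number of UMIs that survive."""
    kept = []
    for umi in sorted(set(umis)):
        if not any(hamming(umi, k) <= max_dist for k in kept):
            kept.append(umi)
    return len(kept)

umis = ["AACGTT", "AACGTA", "AACGAA", "TTTTTT"]
print(collapse_umis(umis, 0))  # 4: no collapsing at edit distance 0
print(collapse_umis(umis, 1))  # 3: AACGTA merges into AACGAA
```

Under a scheme like this, identical NUM_TRANSCRIPTS at edit distance 0 and 3 would only be expected if no two UMIs for the same cell/gene were within 3 mismatches of each other, which is unlikely with 8 bp UMIs at these read depths.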

Note also that the knee plots (which I expected to change in the first place) are based on logs/{sample}_hist_out_cell.txt. I thought that file should also be affected by the edit distance, but the rule producing it does not take the edit distance as a parameter (only the rule generating dge.summary.txt does):

'logs/{sample}_hist_out_cell.txt'

A rule for FASTQ merging

The data I'm analyzing comes in a form of several fastq files, like so

sample1_S11_L001_R1_001.fastq.gz
sample1_S11_L001_R2_001.fastq.gz
sample1_S11_L002_R1_001.fastq.gz
sample1_S11_L002_R2_001.fastq.gz
sample1_S11_L003_R1_001.fastq.gz
sample1_S11_L003_R2_001.fastq.gz
sample1_S11_L004_R1_001.fastq.gz
sample1_S11_L004_R2_001.fastq.gz

The current version of dropSeqPipe only allows one fastq file per sample per read, so I'm merging all the files by hand into sample1_R{1,2}.fastq.gz before running the pipeline. Would it make sense to add a rule that merges several fastq files sharing the same root name (sample1 in the example above)? I'm fine merging them by hand, but a rule would allow the merging to run in parallel with, say, index generation.
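In the meantime, the merge step itself is simple, since gzip files can be concatenated byte-for-byte without decompressing. A minimal sketch, assuming bcl2fastq-style names like the ones above (the function name and glob pattern are my own, not part of dropSeqPipe):

```python
# Hypothetical lane-merging helper: concatenates all per-lane gzip parts
# for one sample/read into a single fastq.gz. No decompression is needed
# because a stream of gzip members is itself a valid gzip file.
import glob
import shutil

def merge_lanes(sample: str, read: str, out_path: str) -> None:
    # e.g. matches sample1_S11_L001_R1_001.fastq.gz ... L004
    parts = sorted(glob.glob(f"{sample}_S*_L0*_{read}_001.fastq.gz"))
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as fh:
                shutil.copyfileobj(fh, out)  # raw byte concatenation

merge_lanes("sample1", "R1", "sample1_R1.fastq.gz")
```

Wrapped in a Snakemake rule, this would let the merge be scheduled alongside index generation.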

sjdbOverhang Error

Hi,
I have attached a screenshot of the error logs.
Could you please help?

thanks
[error screenshot]

Pipeline Error

Got this error while generating the Expression Matrix:

[Sun Sep 10 01:02:04 2017] Finished job 1.
[Sun Sep 10 01:02:04 2017] 4 of 5 steps (80%) done
[Sun Sep 10 01:02:04 2017]
[Sun Sep 10 01:02:04 2017] localrule all:
input: logs/MLW12_hist_out_cell.txt
log: logs/Dropseq_post_align.log
jobid: 0
[Sun Sep 10 01:02:04 2017]
[Sun Sep 10 01:02:04 2017] Finished job 0.
[Sun Sep 10 01:02:04 2017] 5 of 5 steps (100%) done
Mode is generate-plots
Generating multiqc report
[INFO ] multiqc : This is MultiQC v1.2
[INFO ] multiqc : Template : default
[INFO ] multiqc : Searching '/SSD/MLW12/logs'
[INFO ] multiqc : Searching '/SSD/MLW12/summary'
Searching 62 files.. [####################################] 100%
[INFO ] star : Found 2 reports
[INFO ] fastqc : Found 2 reports
[INFO ] multiqc : Compressing plot data
[INFO ] multiqc : Report : MLW12/multiqc_report.html
[INFO ] multiqc : Data : MLW12/multiqc_data
[INFO ] multiqc : MultiQC complete
Extracting expression
[Sun Sep 10 01:02:43 2017] Provided cores: 20
[Sun Sep 10 01:02:43 2017] Rules claiming more threads will be scaled down.
[Sun Sep 10 01:02:43 2017] Job counts:
count jobs
1 all
1 extract_expression
1 extract_umi_per_gene
1 gunzip
4
[Sun Sep 10 01:02:43 2017]
[Sun Sep 10 01:02:43 2017] rule extract_umi_per_gene:
input: MLW12_final.bam
output: logs/MLW12_umi_per_gene.tsv
jobid: 1
wildcards: sample=MLW12
[Sun Sep 10 01:02:43 2017]
[Sun Sep 10 01:02:43 2017] /programs/Drop-seq_tools-1.12/GatherMolecularBarcodeDistributionByGene I=MLW12_final.bam O=logs/MLW12_umi_per_gene.tsv CELL_BC_FILE=summary/MLW12_barcodes.csv
[Sun Sep 10 01:02:43 2017] rule extract_expression:
input: MLW12_final.bam
output: summary/MLW12_expression_matrix.txt.gz
jobid: 3
wildcards: sample=MLW12
[Sun Sep 10 01:02:43 2017]
[Sun Sep 10 01:02:43 2017] /programs/Drop-seq_tools-1.12/DigitalExpression I=MLW12_final.bam O=summary/MLW12_expression_matrix.txt.gz SUMMARY=summary/MLW12_dge.summary.txt CELL_BC_FILE=summary/MLW12_barcodes.csv MIN_BC_READ_THRESHOLD=1
[Sun Sep 10 01:02:44 EDT 2017] org.broadinstitute.dropseqrna.barnyard.DigitalExpression SUMMARY=summary/MLW12_dge.summary.txt OUTPUT=summary/MLW12_expression_matrix.txt.gz INPUT=MLW12_final.bam MIN_BC_READ_THRESHOLD=1 CELL_BC_FILE=summary/MLW12_barcodes.csv OUTPUT_READS_INSTEAD=false CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS EDIT_DISTANCE=1 READ_MQ=10 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Sun Sep 10 01:02:44 EDT 2017] org.broadinstitute.dropseqrna.barnyard.GatherMolecularBarcodeDistributionByGene OUTPUT=logs/MLW12_umi_per_gene.tsv INPUT=MLW12_final.bam CELL_BC_FILE=summary/MLW12_barcodes.csv CELL_BARCODE_TAG=XC MOLECULAR_BARCODE_TAG=XM GENE_EXON_TAG=GE STRAND_TAG=GS EDIT_DISTANCE=1 READ_MQ=10 MIN_BC_READ_THRESHOLD=0 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Sun Sep 10 01:02:44 EDT 2017] Executing as [email protected] on Linux 3.10.0-229.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Picard version: 1.12(d3aeea7_1452606774) IntelDeflater
[Sun Sep 10 01:02:44 EDT 2017] Executing as [email protected] on Linux 3.10.0-229.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Picard version: 1.12(d3aeea7_1452606774) IntelDeflater
[Sun Sep 10 01:02:44 EDT 2017] org.broadinstitute.dropseqrna.barnyard.DigitalExpression done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2022178816
Exception in thread "main" [Sun Sep 10 01:02:44 EDT 2017] org.broadinstitute.dropseqrna.barnyard.GatherMolecularBarcodeDistributionByGene done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2022178816
Exception in thread "main" htsjdk.samtools.SAMException: Error opening file: MLW12_barcodes.csv
    at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:501)
    at picard.util.BasicInputParser.filesToInputStreams(BasicInputParser.java:172)
    at picard.util.BasicInputParser.<init>(BasicInputParser.java:78)
    at picard.util.BasicInputParser.<init>(BasicInputParser.java:91)
    at org.broadinstitute.dropseqrna.barnyard.ParseBarcodeFile.readCellBarcodeFile(ParseBarcodeFile.java:13)
    at org.broadinstitute.dropseqrna.barnyard.BarcodeListRetrieval.getCellBarcodes(BarcodeListRetrieval.java:47)
    at org.broadinstitute.dropseqrna.barnyard.DigitalExpression.doWork(DigitalExpression.java:74)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:29)
Caused by: java.io.FileNotFoundException: summary/MLW12_barcodes.csv (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:497)
    ... 9 more

Exception in thread "main" htsjdk.samtools.SAMException: Error opening file: MLW12_barcodes.csv
    at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:501)
    at picard.util.BasicInputParser.filesToInputStreams(BasicInputParser.java:172)
    at picard.util.BasicInputParser.<init>(BasicInputParser.java:78)
    at picard.util.BasicInputParser.<init>(BasicInputParser.java:91)
    at org.broadinstitute.dropseqrna.barnyard.ParseBarcodeFile.readCellBarcodeFile(ParseBarcodeFile.java:13)
    at org.broadinstitute.dropseqrna.barnyard.BarcodeListRetrieval.getCellBarcodes(BarcodeListRetrieval.java:47)
    at org.broadinstitute.dropseqrna.barnyard.GatherMolecularBarcodeDistributionByGene.doWork(GatherMolecularBarcodeDistributionByGene.java:55)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at org.broadinstitute.dropseqrna.cmdline.DropSeqMain.main(DropSeqMain.java:29)
Caused by: java.io.FileNotFoundException: summary/MLW12_barcodes.csv (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at htsjdk.samtools.util.IOUtil.openFileForReading(IOUtil.java:497)
    ... 9 more

[Sun Sep 10 01:02:44 2017] Error in job extract_expression while creating output file summary/MLW12_expression_matrix.txt.gz.
[Sun Sep 10 01:02:44 2017] Error in job extract_umi_per_gene while creating output file logs/MLW12_umi_per_gene.tsv.
[Sun Sep 10 01:02:44 2017] RuleException:
CalledProcessError in line 21 of /programs/dropSeqPipe/lib/python3.6/site-packages/dropSeqPipe/Snakefiles/singleCell/extract_expression_single.snake:
Command '/programs/Drop-seq_tools-1.12/DigitalExpression I=MLW12_final.bam O=summary/MLW12_expression_matrix.txt.gz SUMMARY=summary/MLW12_dge.summary.txt CELL_BC_FILE=summary/MLW12_barcodes.csv MIN_BC_READ_THRESHOLD=1' returned non-zero exit status 1.
File "/programs/dropSeqPipe/lib/python3.6/site-packages/dropSeqPipe/Snakefiles/singleCell/extract_expression_single.snake", line 21, in __rule_extract_expression
File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 55, in run
[Sun Sep 10 01:02:44 2017] RuleException:
CalledProcessError in line 34 of /programs/dropSeqPipe/lib/python3.6/site-packages/dropSeqPipe/Snakefiles/singleCell/extract_expression_single.snake:
Command '/programs/Drop-seq_tools-1.12/GatherMolecularBarcodeDistributionByGene I=MLW12_final.bam O=logs/MLW12_umi_per_gene.tsv CELL_BC_FILE=summary/MLW12_barcodes.csv' returned non-zero exit status 1.
File "/programs/dropSeqPipe/lib/python3.6/site-packages/dropSeqPipe/Snakefiles/singleCell/extract_expression_single.snake", line 34, in __rule_extract_umi_per_gene
File "/usr/local/lib/python3.6/concurrent/futures/thread.py", line 55, in run
[Sun Sep 10 01:02:44 2017] Removing output files of failed job extract_umi_per_gene since they might be corrupted:
logs/MLW12_umi_per_gene.tsv
[Sun Sep 10 01:02:44 2017] Will exit after finishing currently running jobs.
[Sun Sep 10 01:02:44 2017] Exiting because a job execution failed. Look above for error message
Traceback (most recent call last):
File "/programs/dropSeqPipe/bin/dropSeqPipe", line 11, in
load_entry_point('dropSeqPipe==0.23a0', 'console_scripts', 'dropSeqPipe')()
File "/programs/dropSeqPipe/lib/python3.6/site-packages/dropSeqPipe/main.py", line 223, in main
shell(extract_expression_single)
File "/usr/local/lib/python3.6/site-packages/snakemake/shell.py", line 88, in new
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'snakemake -s /programs/dropSeqPipe/lib/python3.6/site-packages/dropSeqPipe/Snakefiles/singleCell/extract_expression_single.snake --cores 20 -pT -d /SSD/MLW12 --configfile /SSD/local.yaml ' returned non-zero exit status 1.

Exception in thread "main" picard.PicardException: In paired mode, read name 1 does not match read name 2

Hi there,
thanks for this nice package and the good documentation. I was trying to run a first analysis with dropSeqPipe but encountered an error that caused the pipeline to abort.

The Snakemake error output is the following:

'Exception in thread "main" picard.PicardException: In paired mode, read name 1 (NB501971:102:HC355BGX5:1:11101:15471:1050) does not match read name 2 (NB501971:102:HC355BGX5:2:11101:19695:1043)
at picard.sam.FastqToSam.getBaseName(FastqToSam.java:446)
at picard.sam.FastqToSam.doPaired(FastqToSam.java:338)
at picard.sam.FastqToSam.makeItSo(FastqToSam.java:309)
at picard.sam.FastqToSam.doWork(FastqToSam.java:282)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:268)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)
[Sun Sep 30 01:42:05 2018]
Error in rule fastq_to_sam:
jobid: 38
output: data/rat_data_N705_unaligned.bam
conda-env: /scratch/hofphi00/dropSeqPipe/.snakemake/conda/a5697629'

I would really much appreciate your help on this. Please let me know if you need something else in order to understand my issue.
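(Picard's FastqToSam requires R1 and R2 to list the same read names in the same order, so this error usually means the two files are out of sync, e.g. lanes merged in different orders or one file trimmed independently. A quick sanity check along these lines, not part of dropSeqPipe, can confirm that; it assumes Casava 1.8+ headers where the read number comes after the first space.)

```python
# Illustrative check: compare the first `limit` read names of two paired
# fastq.gz files and report the first position where they disagree.
import gzip

def read_names(path: str, limit: int = 1000):
    names = []
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # FASTQ header lines
                # keep the name before the first space; assumes the read
                # number is in the comment, not a /1 //2 suffix on the name
                names.append(line.split()[0].lstrip("@"))
                if len(names) >= limit:
                    break
    return names

def first_mismatch(r1_path: str, r2_path: str, limit: int = 1000):
    for i, (a, b) in enumerate(zip(read_names(r1_path, limit),
                                   read_names(r2_path, limit))):
        if a != b:
            return i, a, b
    return None  # names agree (within the checked window)
```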

Hi, i have a question about using dropSeqPipe

Hi,

I'm following the wiki and have run into a problem at the generate-meta step.

Here are the error messages:

[Mon Oct 30 19:16:47 2017] IndexError in line 5 of /usr/lib/python3.6/site-packages/dropSeqPipe/Snakefiles/generate_meta.snake:
[Mon Oct 30 19:16:47 2017] list index out of range
[Mon Oct 30 19:16:47 2017] File "/usr/lib/python3.6/site-packages/dropSeqPipe/Snakefiles/generate_meta.snake", line 5, in
Traceback (most recent call last):
File "/usr/bin/dropSeqPipe", line 11, in
load_entry_point('dropSeqPipe==0.23a0', 'console_scripts', 'dropSeqPipe')()
File "/usr/lib/python3.6/site-packages/dropSeqPipe/main.py", line 81, in main
complementory_args))
File "/usr/lib/python3.6/site-packages/snakemake/shell.py", line 100, in new
raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' set -euo pipefail; snakemake -s /usr/lib/python3.6/site-packages/dropSeqPipe/Snakefiles/generate_meta.snake --cores 5 -pT -d ../../hg19_dropseqpipe --configfile ./config.yaml ' returned non-zero exit status 1.

and this is my config.yaml:

config.yaml

Samples:
    SRR5250848:
        fraction: 0.001
        expected_cells: 100
GENOMEREF: /storage/hd2/jina/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
REFFLAT: /storage/hd2/jina/Homo_sapiens/UCSC/hg19/Annotation/Genes/refFlat.txt
METAREF: /storage/hd2/jina/hg19_STAR/
RRNAINTERVALS: /storage/hd2/jina/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.rRNA.interval_list
GTF: /storage/hd2/jina/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
SPECIES:
    - HUMAN
CORES: 5
GLOBAL:
    5PrimeSmartAdapter: CACACTCTTTCCCTACACGACGC
    data_type: SingleCell
    allowed_aligner_mismatch: 10
    min_count_per_umi: 1
    Cell_barcode:
        start: 1
        end: 12
        min_quality: 10
        num_below_quality: 1
    UMI:
        start: 13
        end: 20
        min_quality: 10
        num_below_quality: 1

Can you tell me which part is wrong?

thanks,

'repair' rule: too many threads declared?

The threads: 28 declaration in the repair rule on the develop branch made the entire pipeline run sequentially on my machine.
The rule does not appear to actually use 28 threads (more like 2-6 on average).

Maybe adjust that.

violin plot execution fails

Hi,

When the violin plot job is about to be launched, I got:

Traceback (most recent call last):
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/__init__.py", line 541, in snakemake
    report=report)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/workflow.py", line 653, in execute
    success = scheduler.schedule()
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/scheduler.py", line 286, in schedule
    self.run(job)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/scheduler.py", line 302, in run
    error_callback=self._error)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/executors.py", line 638, in run
    jobscript = self.get_jobscript(job)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/executors.py", line 496, in get_jobscript
    cluster=self.cluster_wildcards(job))
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/executors.py", line 556, in cluster_wildcards
    return Wildcards(fromdict=self.cluster_params(job))
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/executors.py", line 551, in cluster_params
    cluster[key] = job.format_wildcards(value)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/jobs.py", line 709, in format_wildcards
    return format(string, **_variables)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/site-packages/snakemake/utils.py", line 326, in format
    return fmt.format(_pattern, *args, **variables)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/string.py", line 191, in format
    return self.vformat(format_string, args, kwargs)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/string.py", line 195, in vformat
    result, _ = self._vformat(format_string, args, kwargs, used_args, 2)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/string.py", line 235, in _vformat
    obj, arg_used = self.get_field(field_name, args, kwargs)
  File "/g/funcgen/gbcs/public/software/conda/envs/snakemake-5.2.1/lib/python3.5/string.py", line 306, in get_field
    obj = getattr(obj, i)
AttributeError: 'Wildcards' object has no attribute 'sample'

Sorry for not doing a PR, but the quick fix is to add the following in the cluster.yaml

violine_plots:
    time: "00:10:00"
    output: "logs/cluster/{rule}.out"
    error: "logs/cluster/{rule}.err"

do you mind adding this in the repo?

Thx

'sample_yaml' is not defined

Hi,

I am using the latest version of the Drop-seq pipeline (installed today).

However, I am getting the following error. I have the config.yaml file, and it is in the right directory.
So how can I overcome this? Could you help?

dropSeqPipe -f /hd2 -c /home/bin/DropSeqPipelineHoohm_v0.23a/dropSeqPipe/local.yaml -m pre-process --rerun

Traceback (most recent call last):
File "/bin/dropSeqPipe", line 9, in
load_entry_point('dropSeqPipe==0.23a0', 'console_scripts', 'dropSeqPipe')()
File "/home/bin/DropSeqPipelineHoohm_v0.23a/dropSeqPipe/dropSeqPipe/main.py", line 96, in main
if(sample_yaml['GLOBAL']['data_type'] not in ['singleCell', 'bulk']):
NameError: name 'sample_yaml' is not defined
