marwoes / wg-blimp

wg-blimp: an end-to-end analysis pipeline for whole genome bisulfite sequencing data

License: GNU Affero General Public License v3.0

Dockerfile 0.11% Python 35.90% Shell 4.25% R 59.74%

wg-blimp's Issues

FASTQ name conflict for sample names that are a substring of other sample names

Suppose there are multiple samples:

sample1
...
sample11

As wg-blimp currently matches sample names to FASTQ names with a naive find, all FASTQ files of sample11 will also be assigned to sample1. Using only sample names of equal length avoids this, but problems remain when sample names appear in subfolder names of the "raw" directory. For example, raw/sample1-11/sample*.fq.gz would also be assigned to sample1 because the folder name contains that sample name.

This issue can be fixed by replacing the current find functionality with an R script that first sorts sample names by length and flags file name candidates iteratively to prevent the substring issue.
Also, an R script can prevent the subfolder issue by simply matching sample names against the basename of the files. Providing an explicit sample-FASTQ table might also be an option.
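
The sorting and flagging idea described above can be sketched as follows. The issue proposes an R script; Python is used here purely for illustration, and the function name assign_fastqs is hypothetical, not part of wg-blimp.

```python
import os


def assign_fastqs(sample_names, fastq_paths):
    """Assign each FASTQ path to at most one sample.

    Longer sample names are matched first and every matched file is flagged,
    so files of 'sample11' are never claimed by 'sample1'. Only the basename
    is inspected, so folder names such as 'raw/sample1-11/' are ignored.
    """
    assignment = {name: [] for name in sample_names}
    claimed = set()
    for name in sorted(sample_names, key=len, reverse=True):
        for path in fastq_paths:
            if path in claimed:
                continue
            if name in os.path.basename(path):
                assignment[name].append(path)
                claimed.add(path)
    return assignment
```

Matching is still substring-based within the basename, so an explicit sample-FASTQ table remains the most robust option.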

Pipeline fails due to Picard installation

Sometimes there is an error when executing wg-blimp because the Picard build seems to pull in some incompatible dependencies. This may be resolved by using picard-slim instead of picard in the snakemake environment.

Environment installation broken.

Here we go again.

I was tweaking the pipeline and went to reinstall the environment from scratch. It looks like some of the packages conflict with glibc version 2.31 on Linux. The error message reports that my current Python version is 3.10; it is nice that Python was updated, but frustrating that the new version breaks things.

I first ran into the issue with pysam, which happened to be updated earlier today, so I thought rolling it back to an older version would fix it, using

conda install -c bioconda/label/cf201901 pysam

but that didn't work. So now I'm thinking the Python version is the cause.

I'll do some more digging and see what I can find patch-wise.
What version of Python do you have on your system, @MarWoes?

Support for different species

Hi,

currently only the hg19 and hg38 human genome builds appear to be supported by this pipeline.
It would be great if support for other species could be added!
I'm particularly interested in using Mmul_10.

Switch to snakemake environments

Including all dependencies globally into Bioconda's meta.yaml file has caused multiple issues so far. It's probably easier to start all rules in their own conda environment managed by Snakemake. This might also ease solving #11 .

Reference genome

Hi there,

I just want to make sure: should I use a bisulfite-converted human reference genome, the regular one, or none at all?

Thanks

Running wg-blimp in cluster environments

Hi,
Is it possible to run wg-blimp in cloud or cluster environments? You are using snakemake, which supports them, but since you are wrapping around it, it is not clear how to use these features.
Would you provide instructions on how to do that?

Thanks!

Issue with bwameth.py errno 32

Hello,

I've been trying to adopt this pipeline for some of our lab's WGBS data. I thought I had gotten past most of the bugs, but I hit a new one that has me stumped, so I wanted to open an issue.

My samples are in mice so I updated the config file and downloaded my own reference genome and annotation file from Ensembl as well as a CpG island annotation from UCSC as you suggested in some past issue threads. (Thank you for that by the way).

I'll upload my config file so you can see how I have everything set up.

I'm getting a new error message in the results folder, in a file named Sample.align.log, that seems to indicate a problem with the bwameth.py script: errno 32, broken pipe.

I'm not sure if this is trying to indicate that the script is having issues with aligning my reads or what.

Any help with this would be a big help.

Faulty hg19 gene location annotations

File gene-locations-hg19.csv.gz does not contain hg19 gene locations, but rather hg38 locations (likely due to faulty biomaRt query code). Transcription start sites are not affected.

This will be fixed when reworking arbitrary species support and annotation through GTF files instead of biomaRt-queried locations (see issue #5 ).

Error in rule prep_gemBS_files:

Hello! Thanks for building this tool! I ran into some problems while processing my WGBS data.

I tried to run wg-blimp from a config file, but it failed with an error.

This is the log file:

`Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /bin/bash
Provided cores: 32
Rules claiming more threads will be scaled down.
Job stats:
job                              count    min threads    max threads
-----------------------------  -------  -------------  -------------
all                                  1              1              1
bedgraph_to_methylation_ratio       12              1              1
benchmark_plot                       1              1              1
bsseq                                1              8              8
clean_gemBS_csv                      1              1              1
dmr_annotation                       1              1              1
dmr_combination                      1              1              1
dmr_coverage                        12              8              8
fastqc                              12              1              1
gemBS                                1              1              1
gemBS_csv                           12              1              1
index_bam                           12              1              1
mark_duplicates                     12              1              1
mbias                               12              1              1
methyl_dackel                       12              1              1
methylation_metrics                  1              1              1
methylseekr                          1              8              8
metilene                             1              1              1
metilene_input                       1              1              1
multiqc                              1              1              1
picard_metrics                      12              1              1
prep_fai                             1              1              1
prep_gemBS_files                     1              1              1
qualimap                            12              8              8
total                              134              1              8

Select jobs to execute...

[Fri Apr  7 10:15:41 2023]
rule prep_gemBS_files:
    output: /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv, /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
    jobid: 119
    reason: Missing output files: /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv, /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
    priority: 9
    resources: tmpdir=/var/folders/2v/v69c5xt93y377wv2pll3j24h0000gq/T


            touch /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
            sed -i '1i"Barcode","Dataset","File1", "File2"' /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
            cat << EOF > /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
    reference = /Users/xiaoyu/igv/genomes/seq/mm10.fa
    index_dir = /Volumes/PBLAB2/WGBS/Clean_Data
    base = $HOME
    sequence_dir = /Volumes/PBLAB2/WGBS/Clean_Data
    bam_dir = /Volumes/PBLAB2/WGBS/results/alignment
    bcf_dir = /Volumes/PBLAB2/WGBS/results/alignment
    extract_dir = /Volumes/PBLAB2/WGBS/results/alignment
    report_dir = /Volumes/PBLAB2/WGBS/results/logs
    threads = 8
    jobs = 4
    include IHEC_standard.conf
    EOF
            
[Fri Apr  7 10:15:41 2023]
Error in rule prep_gemBS_files:
    jobid: 119
    output: /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv, /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
    shell:
        
            touch /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
            sed -i '1i"Barcode","Dataset","File1", "File2"' /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
            cat << EOF > /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
    reference = /Users/xiaoyu/igv/genomes/seq/mm10.fa
    index_dir = /Volumes/PBLAB2/WGBS/Clean_Data
    base = $HOME
    sequence_dir = /Volumes/PBLAB2/WGBS/Clean_Data
    bam_dir = /Volumes/PBLAB2/WGBS/results/alignment
    bcf_dir = /Volumes/PBLAB2/WGBS/results/alignment
    extract_dir = /Volumes/PBLAB2/WGBS/results/alignment
    report_dir = /Volumes/PBLAB2/WGBS/results/logs
    threads = 8
    jobs = 4
    include IHEC_standard.conf
    EOF
            
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job prep_gemBS_files since they might be corrupted:
/Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-04-07T101535.945747.snakemake.log`

this is my yaml file:

high ctrl.txt

Could you please help me on this? Thank you very much for your time.

Error in rule gemBS

Dear sir,
when I run "wg-blimp run-snakemake --cores=8 fastq/ chr22.fasta blood1,blood2 sperm1,sperm2 results --dry-run",
the job finishes correctly.
However, when I run "wg-blimp run-snakemake --cores=8 fastq/ chr22.fasta blood1,blood2 sperm1,sperm2 results",
the following error occurs:

I also attach the log file.
2024-07-13T221532.200897.snakemake.log

Qualimap report inconsistent to mosdepth

Qualimap does not exclude reads of low mapping quality, while mosdepth does. As a result, coverage estimates are currently more optimistic when computed by Qualimap. Using only mosdepth would resolve the inconsistency and also improve workflow run times, but this should wait until MultiQC supports mosdepth output, see MultiQC/MultiQC#924.
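
A toy calculation shows why the two estimates diverge. The read lengths, mapping qualities, and the threshold of 10 below are made up for illustration and do not reflect the actual settings of either tool.

```python
def mean_coverage(reads, region_length, min_mapq=0):
    """Mean per-base coverage from (aligned_length, mapq) tuples.

    Reads with a mapping quality below min_mapq are ignored.
    """
    covered_bases = sum(length for length, mapq in reads if mapq >= min_mapq)
    return covered_bases / region_length


# Four hypothetical 100 bp reads in a 200 bp region, two of them poorly mapped.
reads = [(100, 60), (100, 0), (100, 3), (100, 42)]

unfiltered = mean_coverage(reads, region_length=200)             # counts every read
filtered = mean_coverage(reads, region_length=200, min_mapq=10)  # drops low-MAPQ reads
print(unfiltered, filtered)
```

Here the unfiltered estimate is twice the filtered one, which is the kind of gap that shows up between the two reports.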

IGV BAM file links not working in Shiny interface

In its current version, wg-blimp uses a not-so-stable workaround to enable byte-range HTTP requests for Shiny. In the most recent Shiny versions, this workaround no longer works. Until this is fixed, Shiny can be downgraded using

conda install shiny==1.2.0

For future versions it would be better to not use a workaround, but actual Shiny features to enable byte-range requests. This could be achieved by adding custom HTTP handlers, see rstudio/shiny#2395
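
For readers unfamiliar with byte-range requests: IGV fetches only slices of a BAM file by sending a Range header, and the server must answer with status 206 and the requested bytes. The sketch below illustrates that logic in Python; it is not Shiny code, the function name serve_byte_range is hypothetical, and it handles only a single "bytes=start-end" range.

```python
import re


def serve_byte_range(data, range_header):
    """Return (status, headers, body) for a single 'bytes=start-end' range.

    Headers that do not match this simple form (including suffix ranges such
    as 'bytes=-500') fall back to a full 200 response in this sketch.
    """
    match = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header.strip())
    if match is None:
        return 200, {"Accept-Ranges": "bytes"}, data
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else len(data) - 1
    end = min(end, len(data) - 1)  # clamp to the last available byte
    if start > end:
        # Requested range lies outside the file.
        return 416, {"Content-Range": f"bytes */{len(data)}"}, b""
    headers = {
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{len(data)}",
    }
    return 206, headers, data[start:end + 1]
```

A custom HTTP handler in Shiny would need to implement equivalent behaviour for the BAM and index files it serves.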

How to find tool default parameters

Dear @MarWoes, thank you for the efforts on wg-blimp.

Could you please let me know how to find the default parameters for tools like bwa-meth, bsseq, metilene, etc.?

Thank you!

Add error message when sample .csv contains no (or wrong) column names

Otherwise one only gets a cryptic message:

Error: comparison (1) is possible only for atomic and list types
Execution halted
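
A header check along these lines would give a readable message instead. This is a minimal sketch: the required column names (sample, forward, reverse) are an assumption made for illustration, and check_sample_sheet is a hypothetical helper, not an existing wg-blimp function.

```python
import csv
import io

# Assumed column names, for illustration only.
REQUIRED_COLUMNS = ("sample", "forward", "reverse")


def check_sample_sheet(handle, required=REQUIRED_COLUMNS):
    """Validate the header row of a sample sheet and return it."""
    header = next(csv.reader(handle), None)
    if not header:
        raise ValueError(
            "Sample sheet is empty; expected a header row with columns: "
            + ", ".join(required)
        )
    header = [column.strip() for column in header]
    missing = [column for column in required if column not in header]
    if missing:
        raise ValueError(
            "Sample sheet is missing column(s): " + ", ".join(missing)
            + ". Found: " + ", ".join(header)
        )
    return header


# Usage: a sheet whose first row is data rather than column names is rejected.
try:
    check_sample_sheet(io.StringIO("s1,a.fq.gz,b.fq.gz\n"))
except ValueError as err:
    print(err)
```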

run pipeline for one sample

Thanks for developing such a great tool for WGBS data analysis.

It seems mandatory to include two groups of samples to invoke the pipeline. But what if I have only one sample, how can I run the pipeline? Thanks.

Broken Snakefile (invalid name for input, output, ...): insert is reserved for internal use

The current version of snakemake-minimal is not compatible with the latest wg-blimp version because wg-blimp's Snakefile uses the name insert, which Snakemake now reserves for internal use. This should be resolvable by simply renaming all occurrences of insert to something else.

For the time being, this issue may be worked around by installing an older, compatible version. For existing installations, use:
conda install snakemake-minimal=5.8.1
For fresh installations, use:
conda create -n wg-blimp wg-blimp python=3.6.7 r-base=3.6.2 snakemake-minimal=5.8.1

How to provide --latency-wait flag? Is this the problem?

Hi,

I am trying to run wg-blimp on an HPC cluster managed by Slurm. Yesterday, I tried to submit this job and it immediately failed with:

Submitted job 13 with external jobid 'Submitted batch job 34588715'.
Waiting at most 3 seconds for missing files.
MissingOutputException in line 55 of /home/barton/bin/anaconda3/lib/python3.7/site-packages/snakemake_wrapper/Snakefile:
Job Missing files after 3 seconds:
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.first.txt
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.second.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 20 completed successfully, but some output files are missing. 20
Removing output files of failed job find_fqs since they might be corrupted:
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.first.txt, /resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.second.txt
Waiting at most 3 seconds for missing files.
MissingOutputException in line 55 of /home/barton/bin/anaconda3/lib/python3.7/site-packages/snakemake_wrapper/Snakefile:
Job Missing files after 3 seconds:
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.first.txt
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.second.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 20 completed successfully, but some output files are missing. 20

I ran the same command this morning and it went through, but it seemed to hang (no Slurm jobs in squeue, but the Python script had not been killed in the tmux tab). I eventually killed the job and tried to re-run with the same command. Now I am back to the same error as above.

I have tried to add the --latency-wait flag in multiple locations, but each time it resulted in an unknown flag error. Can you explain how to pass this flag? Is there another solution?

My job submission command:
-bash-4.2$ wg-blimp run-snakemake-from-config --cores 32 --nodes 2 --cluster "sbatch -p compute0 --nodes=1 --ntasks-per-node 32 --time 01:00:00" wg-blimp-config.yaml

My config file:
wg-blimp-config.yaml.txt

My csv file:
wg-blimp-csv.csv

Thank you!

Are there pipeline options for single-ended reads?

I have additional questions. Is it possible to analyze single-ended reads using wg-blimp?
Thanks in advance!!

Running wg-blimp in one control and one experiment data, without replication.

Thank you for developing a WGBS data analysis tool.

I want to use wg-blimp to analyze a WGBS data set, one control and one experiment.

All steps go smoothly. However, in

Rscript --vanilla /media/wooje/epi-T/Peggy_wg_blimp/.snakemake/scripts/tmpz6x06vto.bsseq.R
Activating conda environment: /home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/site-packages/snakemake_wrapper/conda/8c03d5578c6dd7b4f0accc99ba7b7c00

I received the following message

Error in rule bsseq:
    jobid: 4
    output: /media/wooje/epi-T/Peggy_wg_blimp/results/dmr/bsseq/bsseq.Rdata, /media/wooje/epi-T/Peggy_wg_blimp/results/dmr/bsseq/dmrs.csv, /media/wooje/epi-T/Peggy_wg_blimp/results/dmr/bsseq/top100.pdf
    log: /media/wooje/epi-T/Peggy_wg_blimp/results/logs/bsseq.log (check log file(s) for error message)
    conda-env: /home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/site-packages/snakemake_wrapper/conda/8c03d5578c6dd7b4f0accc99ba7b7c00

RuleException:
CalledProcessError in line 410 of /home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/site-packages/snakemake_wrapper/Snakefile:
Command 'source /home/wooje/anaconda3/envs/wg-blimp/bin/activate '/home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/site-packages/snakemake_wrapper/conda/8c03d5578c6dd7b4f0accc99ba7b7c00'; Rscript --vanilla /media/wooje/epi-T/Peggy_wg_blimp/.snakemake/scripts/tmpz6x06vto.bsseq.R' returned non-zero exit status 1.
  File "/home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /media/wooje/epi-T/Peggy_wg_blimp/.snakemake/log/2021-08-24T182610.612489.snakemake.log

In the log file, I found the following:

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: ‘Biobase’

The following object is masked from ‘package:MatrixGenerics’:

    rowMedians

The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians

[1] "Filtering out 0 rows containing NA"
Error in BSmooth.tstat(smoothedData[!invalidRows], group1 = group1Samples,  : 
  length(group1) + length(group2) >= 3 is not TRUE
Calls: callDmrs -> BSmooth.tstat -> stopifnot
Execution halted

Would you give me some advice on how to run the pipeline with one control and one experiment without replicate data?

Any comments will help us proceed with the analysis.

Thank you!!
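
The assertion in the log above (length(group1) + length(group2) >= 3) means bsseq's BSmooth.tstat needs at least three samples across both groups, so a one-versus-one comparison cannot pass it. A pre-flight check like the following could report this before any alignment work is done. It is a hypothetical sketch: check_group_sizes is not part of wg-blimp and it mirrors only the condition shown in the log.

```python
def check_group_sizes(group1, group2):
    """Fail early if the two groups hold fewer than 3 samples in total."""
    total = len(group1) + len(group2)
    if total < 3:
        raise ValueError(
            f"bsseq DMR calling needs at least 3 samples in total, got {total} "
            f"({len(group1)} in group 1, {len(group2)} in group 2)"
        )
    return total


# Usage: one control and one experiment sample are rejected up front.
try:
    check_group_sizes(["control"], ["experiment"])
except ValueError as err:
    print(err)
```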

SA Builder Error running the example

Dear @MarWoes,

Thank you for your work. I had some issues running the example. The first four tasks were running fine; however the log file of the fifth one says,

Index file 'path/chr22.BS.gem' Missing
gemBS_Reference file 'chr22.gemBS.ref' Missing
Contig_sizes file 'chr22.contg.sizes' Missing

It showed this error:
ERROR: SA Builder. Index total length (1) is below minimum threshold (8)
ValueError: Error while executing the Bisulphite gem-indexer
GEM Index /home/athos-ai/data/test_wg/fastq/chr22.BS.gem not found. Run 'gemBS index' or correct configuration file and rerun

Could you please help me on this? Thank you very much for your time.

Make create-config fastq selection less ambiguous

It would make more sense to choose between a .csv file and automatic .fastq inference when calling create-config. Currently, a .csv file can only be added after calling create-config.

Picard running out of disk space

When /tmp has size constraints, Picard may fail. Picard's TMP_DIR should be configurable through config files.
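
One way this could look is sketched below: the command is built from the config and TMP_DIR is appended only when a directory is set. The config key picard_tmp_dir and the helper mark_duplicates_command are hypothetical names chosen for illustration; the I=, O=, M= and TMP_DIR= arguments follow Picard's classic KEY=value syntax.

```python
import shlex


def mark_duplicates_command(bam_in, bam_out, metrics, config):
    """Build a Picard MarkDuplicates command line from a config dict.

    'picard_tmp_dir' is an assumed config key; if it is absent, Picard
    falls back to its default temporary directory.
    """
    cmd = [
        "picard", "MarkDuplicates",
        f"I={bam_in}",
        f"O={bam_out}",
        f"M={metrics}",
    ]
    tmp_dir = config.get("picard_tmp_dir")
    if tmp_dir:
        cmd.append(f"TMP_DIR={tmp_dir}")
    # Quote each part so paths with spaces survive the shell.
    return " ".join(shlex.quote(part) for part in cmd)


print(mark_duplicates_command("in.bam", "out.bam", "metrics.txt",
                              {"picard_tmp_dir": "/scratch/tmp"}))
```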