marwoes / wg-blimp
wg-blimp: an end-to-end analysis pipeline for whole genome bisulfite sequencing data
License: GNU Affero General Public License v3.0
Suppose there are multiple samples:
sample1
...
sample11
As wg-blimp currently matches sample names to FASTQ names using a naive find solution, all FASTQ files of sample11 will also be assigned to sample1. This issue can be avoided by using only sample names of equal length. However, problems can still occur when sample names appear in subfolders of the "raw" directory. For example, raw/sample1-11/sample*.fq.gz would also be assigned to sample1 because the sample name is contained in the folder name.
This issue can be fixed by replacing the current find functionality with an R script that first sorts sample names by length and flags file name candidates iteratively to prevent the substring issue.
An R script can also prevent the subfolder issue by simply matching sample names against the basename of the files. Providing an explicit sample-FASTQ table might also be an option.
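The proposed matching strategy can be sketched as follows. This is only an illustration, written in Python rather than the R script suggested above; the function name assign_fastqs and the example paths are hypothetical and not part of wg-blimp.

```python
from pathlib import Path


def assign_fastqs(sample_names, fastq_paths):
    """Assign each FASTQ path to at most one sample (illustrative sketch).

    Longer sample names are tried first and only the basename of each file
    is searched, so 'sample11_R1.fq.gz' goes to 'sample11', not 'sample1',
    and folder names such as 'sample1-11' are ignored.
    """
    assignments = {name: [] for name in sample_names}
    claimed = set()
    # Longest names first, so 'sample11' claims its files before 'sample1'.
    for name in sorted(sample_names, key=len, reverse=True):
        for path in fastq_paths:
            if path in claimed:
                continue  # already flagged for a longer sample name
            if name in Path(path).name:
                assignments[name].append(path)
                claimed.add(path)
    return assignments


files = [
    "raw/sample1-11/sample1_R1.fq.gz",
    "raw/sample1-11/sample11_R1.fq.gz",
]
result = assign_fastqs(["sample1", "sample11"], files)
```

Here each file ends up with exactly one sample, even though both files sit in a folder whose name contains "sample1".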
Sometimes there is an error when executing wg-blimp because the Picard build seems to pull in some incompatible dependencies. This may be resolved by using picard-slim instead of picard in the snakemake environment.
Here we go again.
I was tweaking the pipeline and went to reinstall the environment from scratch. It looks like some of the packages conflict with glibc version 2.31 on Linux. The error message reports that my current Python version is 3.10; it's nice that Python was updated, but frustrating that the new version breaks things.
I first hit the issue with pysam, which happened to be updated earlier today, so I thought rolling it back to an older version would fix this, using
conda install -c bioconda/label/cf201901 pysam
but that didn't work. So now I'm thinking the python version is the cause.
I'll do some more digging and see what I can find patch-wise.
What version of Python do you have on your system, @MarWoes?
Hi,
currently only the hg19 and hg38 human genome builds appear to be supported by this pipeline.
It would be great if support for other species could be added!
I'm particularly interested in using Mmul_10.
Including all dependencies globally in Bioconda's meta.yaml file has caused multiple issues so far. It's probably easier to start all rules in their own conda environment managed by Snakemake. This might also ease solving #11.
Hi there,
I just want to check: should I use a bisulfite-converted human reference genome, the regular one, or not include any?
Thanks
Hi,
Is it possible to run wg-blimp in cloud or cluster environments? You are using snakemake, which supports them, but since you are wrapping around it, it is not clear how to use these features.
Would you provide instructions on how to do that?
Thanks!
Hello,
I've been trying to adopt this pipeline for some of our lab's WGBS data. I thought I had gotten past most of the bugs, but I hit a new one that has me stumped, so I wanted to open an issue.
My samples are from mice, so I updated the config file and downloaded my own reference genome and annotation file from Ensembl, as well as a CpG island annotation from UCSC, as you suggested in some past issue threads. (Thank you for that, by the way.)
I'll upload my config file so you can see how I have everything set up.
I'm getting a new error message in the results folder, in a file named Sample.align.log, that seems to indicate an issue with the bwameth.py script: errno 32, broken pipe.
I'm not sure whether this indicates that the script is having trouble aligning my reads or something else.
Any help with this would be much appreciated.
File gene-locations-hg19.csv.gz does not contain hg19 gene locations, but rather hg38 locations (likely due to faulty biomaRt query code). Transcription start sites are not affected.
This will be fixed when reworking arbitrary species support and annotation through GTF files instead of biomaRt-queried locations (see issue #5).
Hello! Thanks for building this tool! I ran into some problems when trying to process my WGBS data.
I tried to run wg-blimp from a config file, but I got an error and the run did not complete successfully.
This is the log file:
`Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /bin/bash
Provided cores: 32
Rules claiming more threads will be scaled down.
Job stats:
job                              count    min threads    max threads
-----------------------------  -------  -------------  -------------
all                                  1              1              1
bedgraph_to_methylation_ratio       12              1              1
benchmark_plot                       1              1              1
bsseq                                1              8              8
clean_gemBS_csv                      1              1              1
dmr_annotation                       1              1              1
dmr_combination                      1              1              1
dmr_coverage                        12              8              8
fastqc                              12              1              1
gemBS                                1              1              1
gemBS_csv                           12              1              1
index_bam                           12              1              1
mark_duplicates                     12              1              1
mbias                               12              1              1
methyl_dackel                       12              1              1
methylation_metrics                  1              1              1
methylseekr                          1              8              8
metilene                             1              1              1
metilene_input                       1              1              1
multiqc                              1              1              1
picard_metrics                      12              1              1
prep_fai                             1              1              1
prep_gemBS_files                     1              1              1
qualimap                            12              8              8
total                              134              1              8
Select jobs to execute...
[Fri Apr 7 10:15:41 2023]
rule prep_gemBS_files:
output: /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv, /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
jobid: 119
reason: Missing output files: /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv, /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
priority: 9
resources: tmpdir=/var/folders/2v/v69c5xt93y377wv2pll3j24h0000gq/T
touch /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
sed -i '1i"Barcode","Dataset","File1", "File2"' /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
cat << EOF > /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
reference = /Users/xiaoyu/igv/genomes/seq/mm10.fa
index_dir = /Volumes/PBLAB2/WGBS/Clean_Data
base = $HOME
sequence_dir = /Volumes/PBLAB2/WGBS/Clean_Data
bam_dir = /Volumes/PBLAB2/WGBS/results/alignment
bcf_dir = /Volumes/PBLAB2/WGBS/results/alignment
extract_dir = /Volumes/PBLAB2/WGBS/results/alignment
report_dir = /Volumes/PBLAB2/WGBS/results/logs
threads = 8
jobs = 4
include IHEC_standard.conf
EOF
[Fri Apr 7 10:15:41 2023]
Error in rule prep_gemBS_files:
jobid: 119
output: /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv, /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
shell:
touch /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
sed -i '1i"Barcode","Dataset","File1", "File2"' /Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
cat << EOF > /Volumes/PBLAB2/WGBS/results/alignment/gemBS.conf
reference = /Users/xiaoyu/igv/genomes/seq/mm10.fa
index_dir = /Volumes/PBLAB2/WGBS/Clean_Data
base = $HOME
sequence_dir = /Volumes/PBLAB2/WGBS/Clean_Data
bam_dir = /Volumes/PBLAB2/WGBS/results/alignment
bcf_dir = /Volumes/PBLAB2/WGBS/results/alignment
extract_dir = /Volumes/PBLAB2/WGBS/results/alignment
report_dir = /Volumes/PBLAB2/WGBS/results/logs
threads = 8
jobs = 4
include IHEC_standard.conf
EOF
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job prep_gemBS_files since they might be corrupted:
/Volumes/PBLAB2/WGBS/results/alignment/gemBS.csv
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-04-07T101535.945747.snakemake.log`
This is my YAML file:
Could you please help me on this? Thank you very much for your time.
Dear sir,
When I run "wg-blimp run-snakemake --cores=8 fastq/ chr22.fasta blood1,blood2 sperm1,sperm2 results --dry-run",
the job finishes correctly.
However, when I run "wg-blimp run-snakemake --cores=8 fastq/ chr22.fasta blood1,blood2 sperm1,sperm2 results",
an error occurs.
I have also attached the log file.
2024-07-13T221532.200897.snakemake.log
Qualimap does not exclude reads of low mapping quality, while mosdepth does. As a result, coverage estimates are currently more optimistic when computed by Qualimap. Using only mosdepth would be preferable, as this would also improve workflow run-times. However, it would be best to wait until MultiQC supports mosdepth output, see MultiQC/MultiQC#924.
In its current version, wg-blimp uses a not-so-stable workaround to enable byte-range HTTP requests for Shiny. In the most recent Shiny versions, this workaround does not work anymore. A workaround is to downgrade Shiny using
conda install shiny==1.2.0
For future versions it would be better to not use a workaround, but actual Shiny features to enable byte-range requests. This could be achieved by adding custom HTTP handlers, see rstudio/shiny#2395
Dear @MarWoes, thank you for the efforts on wg-blimp.
Could you please let me know how to find the default parameters for tools like bwa-meth, bsseq, metilene, etc.?
Thank you!
Otherwise one only gets a cryptic message:
Error: comparison (1) is possible only for atomic and list types
Execution halted
Thanks for developing such a great tool for WGBS data analysis.
It seems mandatory to include two groups of samples to invoke the pipeline. But what if I have only one sample? How can I run the pipeline in that case? Thanks.
The current version of snakemake-minimal is not compatible with the latest wg-blimp version because wg-blimp's Snakefile contains a keyword (insert). This should be resolvable by simply renaming all occurrences of insert to something else.
For the time being, this issue can be worked around by installing an older, compatible version, for example by using:
conda install snakemake-minimal=5.8.1
for existing installations, or by using
conda create -n wg-blimp wg-blimp python=3.6.7 r-base=3.6.2 snakemake-minimal=5.8.1
for fresh installations.
Hi,
I am trying to run wg-blimp on an HPC cluster managed by Slurm. Yesterday, I tried to submit this job and it immediately errored out with:
Submitted job 13 with external jobid 'Submitted batch job 34588715'.
Waiting at most 3 seconds for missing files.
MissingOutputException in line 55 of /home/barton/bin/anaconda3/lib/python3.7/site-packages/snakemake_wrapper/Snakefile:
Job Missing files after 3 seconds:
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.first.txt
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.second.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 20 completed successfully, but some output files are missing. 20
Removing output files of failed job find_fqs since they might be corrupted:
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.first.txt, /resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.second.txt
Waiting at most 3 seconds for missing files.
MissingOutputException in line 55 of /home/barton/bin/anaconda3/lib/python3.7/site-packages/snakemake_wrapper/Snakefile:
Job Missing files after 3 seconds:
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.first.txt
/resource3/data/WGBS/Processed_wg-blimp/results-from-config/raw/Control_S1.second.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 20 completed successfully, but some output files are missing. 20
I ran the same command this morning and it went through, but then seemed to hang (no Slurm jobs in squeue, but the Python script had not been killed in the tmux tab). I eventually killed the job and tried to re-run with the same command. Now I am back to the same error as above.
I have tried adding the --latency-wait flag in multiple locations, but each time it resulted in an unknown flag error. Can you explain how to pass this flag? Is there another solution?
My job submission command:
-bash-4.2$ wg-blimp run-snakemake-from-config --cores 32 --nodes 2 --cluster "sbatch -p compute0 --nodes=1 --ntasks-per-node 32 --time 01:00:00" wg-blimp-config.yaml
My config file:
wg-blimp-config.yaml.txt
My csv file:
wg-blimp-csv.csv
Thank you!
I have an additional question. Is it possible to analyze single-end reads using wg-blimp?
Thanks in advance!!
Thank you for developing a WGBS data analysis tool.
I want to use wg-blimp to analyze a WGBS data set consisting of one control and one experimental sample.
All steps ran smoothly. However, during
Rscript --vanilla /media/wooje/epi-T/Peggy_wg_blimp/.snakemake/scripts/tmpz6x06vto.bsseq.R
Activating conda environment: /home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/site-packages/snakemake_wrapper/conda/8c03d5578c6dd7b4f0accc99ba7b7c00
I received the following message
Error in rule bsseq:
jobid: 4
output: /media/wooje/epi-T/Peggy_wg_blimp/results/dmr/bsseq/bsseq.Rdata, /media/wooje/epi-T/Peggy_wg_blimp/results/dmr/bsseq/dmrs.csv, /media/wooje/epi-T/Peggy_wg_blimp/results/dmr/bsseq/top100.pdf
log: /media/wooje/epi-T/Peggy_wg_blimp/results/logs/bsseq.log (check log file(s) for error message)
conda-env: /home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/site-packages/snakemake_wrapper/conda/8c03d5578c6dd7b4f0accc99ba7b7c00
RuleException:
CalledProcessError in line 410 of /home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/site-packages/snakemake_wrapper/Snakefile:
Command 'source /home/wooje/anaconda3/envs/wg-blimp/bin/activate '/home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/site-packages/snakemake_wrapper/conda/8c03d5578c6dd7b4f0accc99ba7b7c00'; Rscript --vanilla /media/wooje/epi-T/Peggy_wg_blimp/.snakemake/scripts/tmpz6x06vto.bsseq.R' returned non-zero exit status 1.
File "/home/wooje/anaconda3/envs/wg-blimp/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /media/wooje/epi-T/Peggy_wg_blimp/.snakemake/log/2021-08-24T182610.612489.snakemake.log
In the log file, I found the following:
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
The following object is masked from ‘package:MatrixGenerics’:
rowMedians
The following objects are masked from ‘package:matrixStats’:
anyMissing, rowMedians
[1] "Filtering out 0 rows containing NA"
Error in BSmooth.tstat(smoothedData[!invalidRows], group1 = group1Samples, :
length(group1) + length(group2) >= 3 is not TRUE
Calls: callDmrs -> BSmooth.tstat -> stopifnot
Execution halted
Could you give me some advice on how to run the pipeline with one control and one experimental sample, without replicates?
Any comments will help us proceed with the analysis.
Thank you!!
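The log above stops at BSmooth.tstat's check that length(group1) + length(group2) >= 3, which a one-versus-one comparison cannot satisfy. That condition could be tested before the workflow starts. The following is a hedged sketch only: check_bsseq_groups is a hypothetical helper, not part of wg-blimp, and the threshold of 3 is taken from the log message.

```python
def check_bsseq_groups(group1, group2, min_total=3):
    """Return an error message if the groups are too small for bsseq, else None.

    min_total mirrors the condition reported in the log:
    length(group1) + length(group2) >= 3.
    """
    total = len(group1) + len(group2)
    if total < min_total:
        return (
            f"bsseq needs at least {min_total} samples across both groups, "
            f"but only {total} were given"
        )
    return None


# One control and one experimental sample: the check reports a problem.
message = check_bsseq_groups(["control"], ["experiment"])
```

With such a check, the run could fail early with a readable message instead of stopping inside the bsseq rule.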
Dear @MarWoes,
Thank you for your work. I had some issues running the example. The first four tasks ran fine; however, the log file of the fifth one shows this error:
ERROR: SA Builder. Index total length (1) is below minimum threshold (8)
ValueError: Error while executing the Bisulphite gem-indexer
GEM Index /home/athos-ai/data/test_wg/fastq/chr22.BS.gem not found. Run 'gemBS index' or correct configuration file and rerun
Could you please help me on this? Thank you very much for your time.
It would make more sense to choose between a .csv file and automatic .fastq inference when calling create-config. Currently, a .csv file can only be added after calling create-config.
When /tmp has size constraints, Picard may fail. Picard's TMP_DIR should be configurable through config files.
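One way this could look is sketched below. It is an assumption-laden illustration, not wg-blimp's actual rule: the config key picard_tmp_dir and the function name are hypothetical, and the command uses Picard's legacy KEY=value argument style with MarkDuplicates as an example tool.

```python
import shlex


def picard_mark_duplicates_cmd(bam_in, bam_out, metrics, config):
    """Build a Picard MarkDuplicates command, adding TMP_DIR only if configured."""
    args = [
        "picard", "MarkDuplicates",
        f"I={bam_in}", f"O={bam_out}", f"M={metrics}",
    ]
    tmp_dir = config.get("picard_tmp_dir")  # hypothetical config key
    if tmp_dir:
        # Picard writes its temporary files here instead of the default /tmp.
        args.append(f"TMP_DIR={tmp_dir}")
    # Quote every argument so paths with spaces survive the shell.
    return " ".join(shlex.quote(arg) for arg in args)


cmd = picard_mark_duplicates_cmd(
    "in.bam", "out.bam", "metrics.txt", {"picard_tmp_dir": "/scratch/tmp"}
)
```

Leaving the key unset keeps the current behaviour, so existing config files would not need to change.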