googlingthecancergenome / sv-gen

Snakemake-based workflow for generating artificial genomes with structural variants

Home Page: https://research-software.nl/software/sv-gen

License: Apache License 2.0

Python 96.00% Shell 4.00%

bioinformatics structural-variants cancer-genomics wgs simulator workflow snakemake hpc-applications

sv-gen's Introduction

sv-gen

Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases. sv-gen is a Snakemake-based workflow that generates artificial short-read alignments based on a reference genome with or without SVs. The workflow is easy to use and deploy on any Linux-based machine. In particular, it supports automated software deployment, easy configuration and addition of new analysis tools, and it scales from a single computer to different HPC clusters with minimal effort.

Dependencies

Python 3
Conda - package/environment management system
Snakemake - workflow management system
Xenon CLI - command-line interface to compute and storage resources
jq - command-line JSON processor (optional)
YAtiML - library for YAML type inference and schema validation

The workflow (DAG) includes the following tools:

The software dependencies and versions can be found in the conda environment.yaml files (1, 2).

1. Clone this repo.

git clone https://github.com/GooglingTheCancerGenome/sv-gen.git
cd sv-gen

2. Install dependencies.

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
# create a new environment with dependencies & activate it
mamba env create -n wf -f environment.yaml
conda activate wf

3. Configure the workflow.

config files:
- analysis.yaml - analysis-specific settings
- environment.yaml - software dependencies and versions

4. Execute the workflow.

cd workflow
# 'dry' run only checks I/O files
snakemake -np

# run the workflow locally
snakemake --use-conda --cores

Submit jobs to Slurm/GridEngine-based cluster

SCH=slurm   # or gridengine
snakemake --use-conda --latency-wait 30 --jobs \
--cluster "xenon scheduler $SCH --location local:// submit --name smk.{rule} --inherit-env --max-run-time 5 --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log" &>smk.log&

Query job accounting information

SCH=slurm   # or gridengine
xenon --json scheduler $SCH --location local:// list --identifier [jobID] | jq ...

sv-gen's Issues

Enable sv-gen in the workflow catalog

See here (standards compliant).

Rename repo

For example, sv-gen-workflow -> sv-gen(erator)?

Memory allocation fails for samtools sort

Ubuntu Linux 19
Intel© Core™ i7-8550U CPU @ 1.80GHz x 4 (8 x threads with HTT)
15.1 GiB RAM

bwa mem -t 8 -R '@RG\tID:hmz\tLB:hmz\tSM:hmz' data/out/seqids.fasta data/out/INDEL/r150_i500/hmz_1.fq data/out/INDEL/r150_i500/hmz_2.fq | samtools sort -@ 8 -o data/out/INDEL/r150_i500/cov30/hmz.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 399960 sequences (59994000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 198577, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (492, 499, 505)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (466, 531)
[M::mem_pestat] mean and std.dev: (498.49, 9.93)
[M::mem_pestat] low and high boundaries for proper pairs: (453, 544)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 399960 reads in 35.909 CPU sec, 4.771 real sec
samtools sort: couldn't allocate memory for bam_mem...

This issue is related to samtools/samtools#831. A workaround is to run samtools sort with the -m argument, but its value should be set dynamically based on the number of threads or cores.

bwa mem -t 8 -R '@RG\tID:hmz\tLB:hmz\tSM:hmz' data/out/seqids.fasta data/out/INDEL/r150_i500/hmz_1.fq data/out/INDEL/r150_i500/hmz_2.fq | samtools sort -@ 8 -m 340M -o data/out/INDEL/r150_i500/cov30/hmz.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 399960 sequences (59994000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 198577, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (492, 499, 505)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (466, 531)
[M::mem_pestat] mean and std.dev: (498.49, 9.93)
[M::mem_pestat] low and high boundaries for proper pairs: (453, 544)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 399960 reads in 31.575 CPU sec, 4.161 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 8 -R @RG\tID:hmz\tLB:hmz\tSM:hmz data/out/seqids.fasta data/out/INDEL/r150_i500/hmz_1.fq data/out/INDEL/r150_i500/hmz_2.fq
[main] Real time: 4.750 sec; CPU: 32.164 sec
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
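
One way to derive the -m value from the thread count is sketched below. It is only an illustration: the function name, the `fraction` parameter and its default of 0.5 are assumptions, not part of sv-gen. It relies on the fact that samtools sort applies -m per thread.

```python
def samtools_mem_per_thread(total_mem_mb: int, threads: int, fraction: float = 0.5) -> str:
    """Return a value for 'samtools sort -m' (memory per thread).

    total_mem_mb: memory available on the machine, in MB
    threads:      number of sorting threads (passed to -@)
    fraction:     share of the memory the sort may use (assumed default)
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    per_thread_mb = int(total_mem_mb * fraction) // threads
    # Never return 0M, which samtools would not be able to work with.
    return f"{max(per_thread_mb, 1)}M"


# Example: 16000 MB of RAM, 8 threads, half of the memory reserved for sorting.
mem = samtools_mem_per_thread(16000, 8)
cmd = ["samtools", "sort", "-@", "8", "-m", mem, "-o", "hmz.bam"]
```

The total memory is passed in explicitly so that the same helper can be fed from a config key or from a cluster job's memory request.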

Add citation files

CITATION.cff
.zenodo.json

Update analysis.yaml

file_exts -> filext
sim_genomes -> simulation
- sv_type: DUP -> dup...INV-DEL -> invdel, INV-DUP -> invdup
sim_reads -> simulation

related to #25

Add support for multi-threading

Add threads key to analysis.yaml. The following tools make use of it:

samtools
bwa

Use GitHub Actions instead of Travis CI

Add symlinks to hmz-sv.vcf and htz-sv.vcf

Each data subfolder now contains the files hmz.bam, hmz-sv.bam and htz-sv.bam, with their respective .bai indices.
It would also help to have symlinks to the files data/hmz-sv.vcf and data/htz-sv.vcf, so that each subfolder contains all the files needed for the downstream analysis.

Fix bwa indexing & alignment

Add unit tests

see helper_functions.py

Use new docker images for CI testing

Use gtcg/xenon-gridengine:dev and gtcg/xenon-slurm:dev, and a separate script (install.sh) to install dependencies.

Choice of chromosome set for SV generation

The user should be able to select on which chromosomes (one, multiple or all of them) the SVs should be created.

Enable different SV types in the simulation via config

Reads generated from hmz.fasta and hmz-sv.fasta are not mapped

There is a problem with the mapping of the reads generated from hmz.fasta

bwa mem -R '@RG\tID:hmz\tLB:hmz\tSM:hmz' data/hmz.fasta data/r150_i300/hmz_1.fq data/r150_i300/hmz_2.fq

and hmz-sv.fasta

bwa mem -R '@RG\tID:hmz-sv\tLB:hmz-sv\tSM:hmz-sv' data/hmz-sv.fasta data/r150_i300/hmz-sv_1.fq data/r150_i300/hmz-sv_2.fq

These are the first lines of the output:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 66668 sequences (10000200 bp)...
[M::process] read 66668 sequences (10000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 66668 reads in 6.748 CPU sec, 6.756 real sec
[M::process] read 66668 sequences (10000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 66668 reads in 6.817 CPU sec, 7.068 real sec

The value of FR (Forward-Reverse) reads should be greater than 0.
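
A check like this could be automated by parsing the `mem_pestat` lines of the bwa log. The sketch below assumes the log format shown above; the function name is hypothetical and not part of sv-gen.

```python
import re

# Matches e.g. "... for (FF, FR, RF, RR): (0, 198577, 0, 0)"
PESTAT = re.compile(
    r"candidate unique pairs for \(FF, FR, RF, RR\): "
    r"\((\d+), (\d+), (\d+), (\d+)\)"
)


def fr_pair_counts(log_text: str) -> list:
    """Return the FR pair count of every mem_pestat batch found in the log."""
    return [int(match.group(2)) for match in PESTAT.finditer(log_text)]


log = (
    "[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)\n"
    "[M::mem_pestat] skip orientation FF as there are not enough pairs\n"
)
counts = fr_pair_counts(log)
if counts and max(counts) == 0:
    print("No FR pairs found: the reads are probably not mapped.")
```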

SURVIVOR VCF files are malformed

hmz-sv.vcf and htz-sv.vcf (and I suppose also hmz.vcf, but it is empty) are malformed.

In the header, hmz-sv.vcf does not include the sample name. It currently is as follows:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

and it should be, with tab separators:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HMZ-SV

Same for htz-sv.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HTZ-SV
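
A minimal sketch of the header fix, assuming only the `#CHROM` line needs the extra column. The function name is hypothetical, and the data lines would still need a matching genotype column to be valid VCF, which this sketch does not add.

```python
def add_sample_column(header_line: str, sample: str) -> str:
    """Append a sample name to a '#CHROM ...' VCF header line.

    The line is split on any whitespace and re-joined with tabs, so a
    space-separated header is normalised as well.
    """
    fields = header_line.split()
    if not fields or fields[0] != "#CHROM":
        raise ValueError("not a VCF column header line")
    if fields[-1] == "FORMAT":
        # Header stops at FORMAT, i.e. the sample column is missing.
        fields.append(sample)
    return "\t".join(fields) + "\n"


header = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT\n"
fixed = add_sample_column(header, "HMZ-SV")
```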

Convert SURVIVOR output in BED to BEDPE

In fact, SURVIVOR writes a TSV file. See an example below:

22      636172  22      636261  DEL
22      869827  22      870124  INS
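
A possible conversion of such rows to BEDPE is sketched below. It assumes that the five columns are chrom1, pos1, chrom2, pos2 and SV type, that positions are 1-based, and that each breakpoint becomes a 1-bp BEDPE interval; none of these conventions is confirmed here, so they should be checked against the SURVIVOR output before use.

```python
def tsv_to_bedpe(line: str) -> str:
    """Convert one SURVIVOR TSV row into a 7-column BEDPE row.

    Assumed input columns: chrom1, pos1, chrom2, pos2, svtype (1-based positions).
    Output columns: chrom1, start1, end1, chrom2, start2, end2, name.
    """
    chrom1, pos1, chrom2, pos2, svtype = line.split()
    p1, p2 = int(pos1), int(pos2)
    # BEDPE intervals are 0-based, half-open: [pos - 1, pos)
    fields = [chrom1, p1 - 1, p1, chrom2, p2 - 1, p2, svtype]
    return "\t".join(str(field) for field in fields)


row = tsv_to_bedpe("22      636172  22      636261  DEL")
```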

INV_del and INV_dup should be 0 by default in SURVIVOR config file

Currently, they are set to 2 by default.

Check if at least one svtype count is non-zero
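
Such a check could look like the following sketch. The dictionary keys are illustrative only and do not necessarily match the keys used in analysis.yaml.

```python
def has_nonzero_count(sv_counts: dict) -> bool:
    """Return True if at least one SV type has a count greater than zero."""
    return any(int(count) > 0 for count in sv_counts.values())


# Hypothetical counts per SV type, as they might be read from the config.
counts = {"dup": 0, "inv": 0, "tra": 0, "indel": 10}
if not has_nonzero_count(counts):
    raise ValueError("At least one SV type must have a non-zero count.")
```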

Use GitHub Container Registry instead of Docker Hub

Add file extensions to analysis.yaml

SURVIVOR output includes:

*.vcf
*.bed
*.bedpe (also see #14)

To be added:

sv-gen/snakemake/analysis.yaml

Lines 16 to 28 in 9ca02df

file_exts:
  fasta: .fasta
  fasta_idx:
    - .fasta.ann # BWA v0.6.x index files
    - .fasta.amb #
    - .fasta.bwt #
    - .fasta.pac #
    - .fasta.sa #
  fastq: .fq
  bam: .bam
  bam_idx: .bam.bai

# simulation parameters

Code review

Malformed VCF files

SURVIVOR simSV generates a malformed VCF file, in particular for translocations (TRA).
We first need to include a VCF->BEDPE conversion with this script, followed by a BEDPE->VCF conversion with this script.

Update conda and pip environments

Pin down yatiml (v0.5.0) as soon as it's released.

Do not run SURVIVOR simSV for translocations given one chromosome

Improve error handling to avoid the "We cannot simulate translocations without a second chromosome" error message.
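
One way to fail early is to count the sequences in the input FASTA before SURVIVOR simSV is called. The sketch below uses only the standard library; the function names are hypothetical. It guards only the configured translocation count and does not change how SURVIVOR itself behaves.

```python
def count_fasta_records(lines) -> int:
    """Count sequence records by counting the '>' header lines."""
    return sum(1 for line in lines if line.startswith(">"))


def check_translocations(n_records: int, tra_count: int) -> None:
    """Raise a clear error if translocations are requested for < 2 sequences."""
    if tra_count > 0 and n_records < 2:
        raise ValueError(
            f"{tra_count} translocation(s) requested, but the input FASTA "
            f"contains {n_records} sequence(s); at least two are required."
        )


fasta_lines = [">chr21\n", "ACGT\n", ">chr22\n", "TTGA\n"]
n_records = count_fasta_records(fasta_lines)
check_translocations(n_records, tra_count=2)  # passes: two sequences
```

In a workflow, `lines` would typically be the open FASTA file handle, which yields the file line by line.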

Add outdir to analysis.yaml

Currently, all output files are generated either relative to the workdir (with Snakefile) followed by the relative filepath or using absolute filepath (input.fasta).

For the sample input,
data/chr22_44-45Mb.GRCh37.fasta

the following (sub)dirs are created

data/r150_i500/cov10/
data/r150_i500/cov30/

Add output.basedir key to the analysis.yaml. Perhaps, consider output per sim_genomes.sv_type(s) with the following output directory structure

{basedir}/{svtype}/r{readlen}_i{insertlen}/cov{coverage}.
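
The proposed layout can be illustrated by expanding every parameter combination into a directory path. The function and parameter names below are placeholders and do not reflect the actual analysis.yaml keys.

```python
from itertools import product


def output_dirs(basedir, svtypes, readlens, insertlens, coverages):
    """Build {basedir}/{svtype}/r{readlen}_i{insertlen}/cov{coverage} paths."""
    return [
        f"{basedir}/{svtype}/r{readlen}_i{insertlen}/cov{coverage}"
        for svtype, readlen, insertlen, coverage in product(
            svtypes, readlens, insertlens, coverages
        )
    ]


dirs = output_dirs("data/out", ["INDEL"], [150], [500], [10, 30])
for path in dirs:
    print(path)
```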

Validate workflow config in YAML

Use yatiml to validate the analysis.yaml config.

Adding read mapping with bwa-mem

Map the generated reads to the reference sequence (GRCh37 or GRCh38) to generate the final BAM files.

Documenting ART parameter for insert size standard deviation

It is unclear whether the parameter should be expressed as a percentage or as an absolute number of bases. I cannot find a reference to it in either the original publication or the documentation.

TMP directory in samtools sort

The analysis.yaml file should allow a TMP directory to be specified, where samtools sort stores its temporary files. To achieve this, we could use the -T option of samtools sort.
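
A sketch of how the command could be assembled with an optional temporary directory is given below. The function name and the `tmpdir` parameter are assumptions; -T takes a prefix for the temporary files, so the prefix here is built from the directory and the output file name.

```python
import os


def samtools_sort_cmd(out_bam: str, tmpdir: str = None, threads: int = 1) -> list:
    """Build a 'samtools sort' argument list, optionally with -T PREFIX."""
    cmd = ["samtools", "sort", "-@", str(threads)]
    if tmpdir:
        # Temporary files are written as <prefix>.nnnn.bam by samtools.
        prefix = os.path.join(tmpdir, os.path.basename(out_bam) + ".tmp")
        cmd += ["-T", prefix]
    cmd += ["-o", out_bam]
    return cmd


print(" ".join(samtools_sort_cmd("data/out/hmz.bam", tmpdir="/scratch", threads=8)))
```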

Move error handling to validator

Use pyfaidx to check FASTA input.

Move SURVIVOR config

update analysis.yaml: input.config -> sim_genomes.config
move survivor.cfg to {output.basedir}/{svtype}

configSURVIVOR.py wrong format

configSURVIVOR does not generate the correct sv_template, probably because of wrong spacing/indents. Do not use it until it's fixed.

SURVIVOR requires minimum two chromosomes

SV simulation with SURVIVOR v1.0.7 does not work with a single chromosome. This is because to simulate (inter-chromosomal) translocations you need at least two chromosomes. Strangely, this happens also when you specify 0 for the number of translocations in the analysis.yaml file.
I suggest updating the test data to contain two chromosomes, using the FASTA file in the CNN test data. We should also establish whether the two-chromosome minimum was introduced in the latest (v1.0.7) version of SURVIVOR.

stdev as list

@arnikz Could you modify insert:stdev into a list so there is a 1:1 correspondence for its values with respect to the values in the list of insert:length?
In this way, each length value can have its own stdev value. For instance, in case you want to specify stdev as 10% of the length.
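
The requested 1:1 correspondence could be handled as in the sketch below, where the function names are hypothetical. The second helper derives the stdev values as a fraction of each length, matching the 10% example.

```python
def pair_insert_params(lengths, stdevs):
    """Pair each insert length with its own stdev (lists must match in size)."""
    if len(lengths) != len(stdevs):
        raise ValueError("insert lengths and stdevs must have the same number of values")
    return list(zip(lengths, stdevs))


def stdev_from_fraction(lengths, fraction=0.1):
    """Derive one stdev per insert length as a fraction of that length."""
    return [round(length * fraction) for length in lengths]


lengths = [300, 500]
pairs = pair_insert_params(lengths, stdev_from_fraction(lengths))
```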

FASTQ cleanup

The generated FASTQ files should be deleted when the workflow is completed. They take up a lot of space and are not needed.

The two SV files generated with SURVIVOR are the same

@arnikz I cannot explain why hmz-sv and htz-sv generated independently by SURVIVOR SV are the same file. Could it be that the random seed used by SURVIVOR is the same? However, simSV does not have a seed among its input parameters.

@arnikz Could you look into it?
