Giter Club home page Giter Club logo

sv-gen's Introduction

sv-gen

DOI CI Codacy Badge Codacy Badge

Structural variants (SVs) are an important class of genetic variation implicated in a wide array of genetic diseases. sv-gen is a Snakemake-based workflow to generate artificial short-read alignments based on a reference genome with(out) SVs. The workflow is easy to use and deploy on any Linux-based machine. In particular, the workflow supports automated software deployment, easy configuration and addition of new analysis tools as well as enables to scale from a single computer to different HPC clusters with minimal effort.

Dependencies

  • Python 3
  • Conda - package/environment management system
  • Snakemake - workflow management system
  • Xenon CLI - command-line interface to compute and storage resources
  • jq - command-line JSON processor (optional)
  • YAtiML - library for YAML type inference and schema validation

The workflow (DAG) includes the following tools:

The software dependencies and versions can be found in the conda environment.yaml files (1, 2).

1. Clone this repo.

git clone https://github.com/GooglingTheCancerGenome/sv-gen.git
cd sv-gen

2. Install dependencies.

# download Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# install Conda (respond by 'yes')
bash miniconda.sh
# update Conda
conda update -y conda
# install Mamba
conda install -n base -c conda-forge -y mamba
# create a new environment with dependencies & activate it
mamba env create -n wf -f environment.yaml
conda activate wf

3. Configure the workflow.

4. Execute the workflow.

cd workflow
# 'dry' run only checks I/O files
snakemake -np

# run the workflow locally
snakemake --use-conda --cores

Submit jobs to Slurm/GridEngine-based cluster

SCH=slurm   # or gridengine
snakemake --use-conda --latency-wait 30 --jobs \
--cluster "xenon scheduler $SCH --location local:// submit --name smk.{rule} --inherit-env --max-run-time 5 --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log" &>smk.log&

Query job accounting information

SCH=slurm   # or gridengine
xenon --json scheduler $SCH --location local:// list --identifier [jobID] | jq ...

sv-gen's People

Contributors

arnikz avatar lsantuari avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

wen-workflow

sv-gen's Issues

Rename repo

For example, sv-gen-workflow -> sv-gen(erator)?

Memory allocation fails for samtools sort

Ubuntu Linux 19
Intel© Core™ i7-8550U CPU @ 1.80GHz x 4 (8 x threads with HTT)
15.1 GiB RAM

bwa mem -t 8 -R '@RG\tID:hmz\tLB:hmz\tSM:hmz' data/out/seqids.fasta data/out/INDEL/r150_i500/hmz_1.fq data/out/INDEL/r150_i500/hmz_2.fq | samtools sort -@ 8 -o data/out/INDEL/r150_i500/cov30/hmz.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 399960 sequences (59994000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 198577, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (492, 499, 505)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (466, 531)
[M::mem_pestat] mean and std.dev: (498.49, 9.93)
[M::mem_pestat] low and high boundaries for proper pairs: (453, 544)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 399960 reads in 35.909 CPU sec, 4.771 real sec
samtools sort: couldn't allocate memory for bam_mem...

Issue related to samtools/samtools#831. Use samtools with -m arg but its value should be set dynamically given the number of threads or cores.

bwa mem -t 8 -R '@RG\tID:hmz\tLB:hmz\tSM:hmz' data/out/seqids.fasta data/out/INDEL/r150_i500/hmz_1.fq data/out/INDEL/r150_i500/hmz_2.fq | samtools sort -@ 8 -m 340M -o data/out/INDEL/r150_i500/cov30/hmz.bam
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 399960 sequences (59994000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 198577, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (492, 499, 505)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (466, 531)
[M::mem_pestat] mean and std.dev: (498.49, 9.93)
[M::mem_pestat] low and high boundaries for proper pairs: (453, 544)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 399960 reads in 31.575 CPU sec, 4.161 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 8 -R @RG\tID:hmz\tLB:hmz\tSM:hmz data/out/seqids.fasta data/out/INDEL/r150_i500/hmz_1.fq data/out/INDEL/r150_i500/hmz_2.fq
[main] Real time: 4.750 sec; CPU: 32.164 sec
[bam_sort_core] merging from 0 files and 8 in-memory blocks...

Update analysis.yaml

  • file_exts -> filext
  • sim_genomes -> simulation
    • sv_type: DUP -> dup...INV-DEL -> invdel, INV-DUP -> invdup
  • sim_reads -> simulation

related to #25

Add symlinks to hmz-sv.vcf and htz-sv.vcf

Each data subfolder now contains the files hmz.bam, hmz-sv.bam and htz-sv.bam, with the relative .bai indices.
It would help to have also symlinks to the files data/hmz-sv.vcf and data/htz-sv.vcf, so that each subfolder contains all the necessary files for the downstream analysis.

Reads generated from hmz.fasta and hmz-sv.fasta are not mapped

There is a problem with the mapping of the reads generated from hmz.fasta

bwa mem -R '@RG\tID:hmz\tLB:hmz\tSM:hmz' data/hmz.fasta data/r150_i300/hmz_1.fq data/r150_i300/hmz_2.fq

and hmz-sv.fasta

bwa mem -R '@RG\tID:hmz-sv\tLB:hmz-sv\tSM:hmz-sv' data/hmz-sv.fasta data/r150_i300/hmz-sv_1.fq data/r150_i300/hmz-sv_2.fq

This are the first lines of the output:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 66668 sequences (10000200 bp)...
[M::process] read 66668 sequences (10000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 66668 reads in 6.748 CPU sec, 6.756 real sec
[M::process] read 66668 sequences (10000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 0, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 66668 reads in 6.817 CPU sec, 7.068 real sec

The value of FR (Forward-Reverse) reads should be greater than 0.

SURVIVOR VCF files are malformed

hmz-sv.vcf and htz-sv.vcf (and I suppose also hmz.vcf, but it is empty) are malformed.

In the header, hmz-sv.vcf does not include the sample name. It currently is as follows:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT

and it should be, with tab separator:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HMZ-SV

Same for htz-sv.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HTZ-SV

Malformed VCF files

SURVIVOR simSV generates a malformed VCF file, in particular for translocations (TRA).
We need to include first a VCF->BEDPE conversion with this script and a BEDPE->VCF conversion with this script.

Add outdir to analysis.yaml

Currently, all output files are generated either relative to the workdir (with Snakefile) followed by the relative filepath or using absolute filepath (input.fasta).

For the sample input,
data/chr22_44-45Mb.GRCh37.fasta

the following (sub)dirs are created

data/r150_i500/cov10/
data/r150_i500/cov30/

Add output.basedir key to the analysis.yaml. Perhaps, consider output per sim_genomes.sv_type(s) with the following output directory structure

{basedir}/{svtype}/r{readlen}_i{insertlen}/cov{coverage}.

Move SURVIVOR config

  • update analysis.yaml: input.config -> sim_genomes.config
  • move survivor.cfg to {output.basedir}/{svtype}

configSURVIVOR.py wrong format

configSURVIVOR does not generate the correct sv_template, probably because of wrong spacing/indents. Do not use it until it's fixed.

SURVIVOR requires minimum two chromosomes

SV simulation with SURVIVOR v1.0.7 does not work with a single chromosome. This is because to simulate (inter-chromosomal) translocations you need at least two chromosomes. Strangely, this happens also when you specify 0 for the number of translocations in the analysis.yaml file.
I suggest to update the test data with two chromosomes using the FASTA file in the CNN test data. We should also establish if the >2 chromosomes requirement has been added in the latest (v1.0.7) version of SURVIVOR.

stdev as list

@arnikz Could you modify insert:stdev into a list so there is a 1:1 correspondence for its values with respect to the values in the list of insert:length?
In this way, each length value can have its own stdev value. For instance, in case you want to specify stdev as 10% of the length.

FASTQ cleanup

The generated FASTQ files should be deleted when the workflow is completed. They take up a lot of space and are not needed.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.