
bravo_data_prep's Introduction

BRAVO Data Pipeline

Processing data to power the BRowse All Variants Online (BRAVO) API

  1. Build, download, or install dependencies.
    1. Compile custom tools
    2. Install external tools
    3. Download external data
  2. Collect data to be processed into a convenient location.
  3. Modify the Nextflow configs to match paths on your system or cluster.
  4. Run the Nextflow workflows (see the sketch after this list).
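
A minimal sketch of step 4, assuming the workflow scripts and configs live under workflows/ (the script and config paths here are illustrative, not the repository's actual filenames):

# Run each workflow with a config edited for your system or cluster
nextflow run workflows/coverage/coverage.nf -c workflows/coverage/nextflow.config
nextflow run workflows/sequences/sequences.nf -c workflows/sequences/nextflow.config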

Input Data

Naming: The pipeline depends on the names of the input cram files having the sample ID as the first part of the filename. Specifically, the expectation is that the ID precedes the first . so that a call to getSimpleName() yields the ID.
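
For example, a hypothetical file NWD100001.recab.cram would yield the sample ID NWD100001 (the ID format is an assumption, not the pipeline's requirement). A quick shell check of your filenames:

# Print the portion of each cram filename before the first dot
for f in /path/to/crams/*.cram; do
  name=$(basename "$f")
  echo "${name%%.*}"
done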

Sequence Data

Source cram files. Original sequences from which the variant calls were made.

Variant calls

Source bcf files. Generated by running the TOPMed variant calling pipeline.

Data Preparation Tools

Compile Custom Tools

In the tools/ directory you will find tools and scripts to prepare your data for importing into the Mongo database and for use in the BRAVO browser.

cd tools/cpp_tools
cget install .

This builds executables in tools/cpp_tools/cget/bin.

External Tools

The required BamUtil, VEP, and Loftee tools are described in dependencies.md.

External Data

The required Gencode, Ensembl, dbSNP, and HUGO data are described in basis_data.md.

Nextflow Scripts

In the workflows/ directory are three Nextflow configs and scripts used to prepare the runtime data for the BRAVO API.

The steps of the pipeline are detailed in data_prep_steps.md.

The three Nextflow pipelines are:

  1. Prepare VCF Teddy
  2. Sequences
  3. Coverage

Downstream data for BRAVO API

The make_vignette_dir.sh script consolidates the results from the nextflow scripts into a data directory organized for the BRAVO API. It is designed for small data sets, and should be run after the three data pipelines complete.

There are two data sets that the BRAVO API needs to run:

  • Runtime data: flat files on disk read at runtime.
  • Basis data: files processed and loaded into MongoDB.

Downstream data subdirectory notes

data/
├── cache
├── coverage
│   ├── bin_1
│   ├── bin_25e-2
│   ├── bin_50e-2
│   ├── bin_75e-2
│   └── full
├── crams
│   ├── sequences
│   ├── variant_map.tsv.gz
│   └── variant_map.tsv.gz.tbi
└── reference
    ├── hs38DH.fa
    └── hs38DH.fa.fai
  • reference/ holds the reference FASTA files for the genome.
  • The API's SEQUENCE_DIR config value should point to the directory that contains the sequences directory.
    • The sequences directory name is hardcoded.
    • The variant_map.tsv.gz file name is hardcoded.
    • The variant_map.tsv.gz.tbi file name is hardcoded.
  • Under sequences/, the directory structure and filenames are prescribed.
    • All two-hex-character directories 00 to ff should exist as subdirectories.
    • Cram files must be named in the exact form sample_id.cram.
    • The subdirectory a cram belongs in is the first two characters of the MD5 hex digest of the sample_id (see the sketch after this list).
      • E.g. foobar123.cram would be in directory "ae"
        hashlib.md5("foobar123".encode()).hexdigest()[:2]
      • This directory structure is produced by the Nextflow pipeline.
  • The coverage directory contents are taken from the result/ dir of the coverage workflow.
  • variant_map.tsv.gz is an output of RandomHetHom3.
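
A minimal sketch of laying crams out into the prescribed structure, assuming the files are already named sample_id.cram (the staging/ and data/ paths are illustrative):

# Compute the two-character MD5 prefix of each sample ID and move the cram into place
for f in staging/*.cram; do
  id=$(basename "$f" .cram)
  sub=$(printf '%s' "$id" | md5sum | cut -c1-2)
  mkdir -p data/crams/sequences/"$sub"
  mv "$f" data/crams/sequences/"$sub"/
done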

bravo_data_prep's People

Contributors

abought, birndle, brettpthomas, bw2, coverbeck, dgmacarthur, dtaliun, grosscol, jdpleiness, konradjk, monkollek, pjvandehaar


bravo_data_prep's Issues

Allow for htslib on Gentoo Base System/macOS

Desired State

Run CMake on macOS or Gentoo Base System (the Compute Canada remote server) to build htslib.

Context

The previous version does not work on the Gentoo Base System (Compute Canada). The new version allows htslib to be built on the server.

OS info of server:

LSB Version:    n/a
Distributor ID: Gentoo
Description:    Gentoo Base System release 2.6
Release:        2.6
Codename:       n/a

Coverage: investigate and better handle tabix indexing concatenated bgzipped files.

Current State

During coverage aggregation, bgzipped results are concatenated together as the process runs.

# Aggregate depths from depth file chunks
mlr -N --tsv 'nest' --ivar ";" -f 3 \${PIPES[@]} |\
sort --numeric-sort --key=2 |\
bgzip >> ${result_file}

Tabix occasionally fails to produce a valid index for the concatenated summary data. An index gets written and contains the contig name, but it cannot be used to get data by region: tabix file.tsv.gz chr22:10000100-10000200 | wc -l returns 0.

The current workaround involves re-writing the entire bgzipped file.
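
A minimal sketch of that workaround, assuming a tab-delimited file with the chromosome in column 1 and the position in column 2 (the column numbers are assumptions about the summary format):

# Re-compress the data as a single bgzip stream, then index it
zcat file.tsv.gz | bgzip > file.rewritten.tsv.gz
mv file.rewritten.tsv.gz file.tsv.gz
tabix -s 1 -b 2 -e 2 file.tsv.gz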

Action items

  • Generate a small reproducible example of tabix not producing a usable index.
  • Sort out a solution that is more efficient than re-writing the entire bgzipped file.

Reference

Suspected to be related to:

Prepare for collaboration

The repository has no guidance or expectations spelled out for contributing to it. In anticipation of having code contributed from downstream forks, create a document spelling out expectations and instructions for contributing.

This is done when:

  • Create a contributing.md with information about the contribution workflow.
  • Create an issue template for new features.
  • Add instructions about linting, formatting, and pull requests to contributing.md.
  • Create a staging branch.

Describe symlinking strategy for nextflow workflows

Making symlinks to relevant data and references inside the workflow directories allows the configs to point to relative paths instead of absolute paths. This usage, its impact, and the expectations need to be documented in a readme for each workflow (see the sketch below).
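
A minimal sketch of the idea (the source paths and link names are illustrative assumptions, not the repository's conventions):

# Link shared reference and cram locations into a workflow directory,
# so its config can use relative paths like reference.fa and crams/
cd workflows/coverage
ln -s /net/shared/reference/hs38DH.fa reference.fa
ln -s /net/shared/crams crams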

Extract data processing custom tools to own repos

Rationale

  • Facilitate à la carte installation of the custom tools used in the Nextflow workflows.
  • Make it easier to install applications to a common location on a Slurm cluster.
  • Reduce the readme scope.
  • Separation of concerns and dependencies.

Actions

  • Split out the tools that need to be compiled (tools/cpp_tools) and the Python processing tools (tools/base_coverage and tools/py_tools) into their own repositories. Use a clone-and-remove-contents strategy to preserve commit history and attribution.

  • Include tool projects in data prep as submodules. Update installation instructions to shallow clone submodules.

Handling coverage files derived from vcf files

Desired State

Scripts are needed that handle generating and pruning coverage data from VCFs instead of from mpileup depth files.

Context

We are getting VCFs from CARTaGENE WGS and have to compute depth from the raw VCFs instead of running mpileup on the crams. As a result, only the mean depth is calculated and the median is not available.
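
A minimal sketch of pulling per-site mean depth out of a VCF with bcftools, assuming the INFO/DP and INFO/NS fields are populated (the field choices and the mean calculation are assumptions about the input VCFs):

# Emit chrom, pos, and mean depth (total DP divided by the number of samples)
bcftools query -f '%CHROM\t%POS\t%INFO/DP\t%INFO/NS\n' input.vcf.gz \
  | awk -v OFS='\t' '$4 > 0 {print $1, $2, $3 / $4}' \
  | bgzip > coverage.tsv.gz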

Add samples count step as a results output

Desired State

Emit a document, compatible with mongoimport, containing summary statistics from the data processing. Specifically, the number of samples in the variant calling, since this cannot be derived from the processed data.

Context

The UI reports the number of genomes from which the data is derived, but this is currently manually coded.
It would be useful for handling multiple data sets if that information were produced as part of data processing instead of having to be entered manually. statgen/bravo_vue#4
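
A minimal sketch of producing such a document from a bcf (the samples_count field name, database, collection, and file names are assumptions):

# Count the samples in the call set and write a one-line JSON document for mongoimport
n=$(bcftools query -l input.bcf | wc -l)
echo "{\"samples_count\": ${n}}" > samples_count.json
mongoimport --db bravo --collection metadata --file samples_count.json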

Extract allele frequency data from 1000G VCFs

Create a new workflow for allelic frequency information. The current AF data comes along with the VEP process due to the --af flag.

The allelic frequency information from the VEP output appears to be incomplete. E.g. 1-55063514-G-A should have AF data, but it does not appear to be present.

  1. Download VCFs for 1000G on GRCh38 into reference data storage: https://www.internationalgenome.org/data-portal/data-collection/grch38
  2. Extract the AF and *_AF fields. SNV IDs are chrom-pos-ref-alt (see the sketch after this list).
  3. Convert to Mongo's bson format for use with mongoimport.
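
A minimal sketch of step 2 with bcftools (the output layout and the assumption that the frequencies live in INFO/AF are mine, not the workflow's):

# Build chrom-pos-ref-alt IDs alongside the INFO/AF value for each record
bcftools query -f '%CHROM-%POS-%REF-%ALT\t%INFO/AF\n' 1kg.GRCh38.chr1.vcf.gz \
  | bgzip > af.chr1.tsv.gz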
