ukbb-tools's Introduction

ukbb-tools

This repository contains our complete set of tools for preprocessing, quality control, and preliminary analyses on UK Biobank data. The repo has one folder per set of methods, as listed in the Contents below. Each subdirectory has a README.md file that details how to use all files within that directory; read it before use.

Contents

  1. Preprocessing
  2. Phenotyping
  3. Filtering
  4. GWAS
  5. GBE
  6. PheWAS
  7. LD score regression (LDSC)
  8. UK Biobank Bulk Download
  9. Flip-check, flip-fix, and coordinate lift over w/ UCSC liftOver
  10. Biomarker Adjustment
  11. GREAT Enrichment
  12. SciDB Query for PheWAS
  13. Multiple Rare-variants and Phenotypes (MRP) - Rare-variant signal aggregator
  14. LD map and LD pruning
  15. Genetic Relationship Matrix calculation (GRM via GCTA)
  16. snpnet (Large-scale Cox Proportional Hazards)
  17. VEP Variant Annotation
  18. Meta-analysis with METAL
  19. Multiple Rare-variants and Phenotypes Mixed Model (MRPMM)

The ukbb-tools module on Sherlock

All this code has been ported to a module on Sherlock. Click for more details on how to load and use this module.

There is an updater script that pushes your current directory and makes it a version of the module. Use it with appropriate caution, as it takes the master branch. The only argument for the updater is a date, which is used as the version label.

Example Usage:

bash ukbb-tools.module.updater.sh 20200225

ukbb-tools's People

Contributors

cdeboever3, erflynn, guhanrv, jmjustesen, maguirre1, marivascruz, technopolymath, yk-tanigawa, ykwon0407

ukbb-tools's Issues

GWAS QC table

For the latest GWAS freeze, let's compute the following metrics and tabulate them:

  • Lambda GC across different frequency bins
  • LD score (heritability estimates)
  • LD score intercept
  • Number of hits
  • Number of independent hits (load in LD pruned set)
  • non-NA line count
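
As a rough illustration of the first item, genomic-control lambda is commonly computed as the median chi-square statistic divided by the median of a 1-df chi-square distribution (about 0.455). The sketch below assumes per-variant z-statistics and minor allele frequencies are already loaded in memory; the function names and the frequency bin edges are hypothetical and not taken from this repository.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def lambda_gc(z_scores):
    """Genomic-control lambda: median(z^2) / median of chi-square(1 df)."""
    return median(z * z for z in z_scores) / CHI2_1DF_MEDIAN


def lambda_gc_by_maf_bin(records, bin_edges=(0.0, 0.01, 0.05, 0.5)):
    """Compute lambda GC per MAF bin.

    records: iterable of (maf, z) pairs.
    bin_edges: hypothetical bin boundaries; each bin is (low, high].
    """
    bins = {}
    for maf, z in records:
        for low, high in zip(bin_edges, bin_edges[1:]):
            if low < maf <= high:
                bins.setdefault((low, high), []).append(z)
                break
    return {edges: lambda_gc(zs) for edges, zs in bins.items()}
```

A value close to 1 in every bin suggests little inflation; bins with no variants are simply absent from the returned dictionary.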

GWAS finishing effort - 21,572 missing variants

It turned out that the summary statistics generated in array-combined/gwas/current do not match the expected number of lines (1,080,969).

There are 880 files with 1,059,397 lines (meaning that 21,572 lines are missing in each file).

  • 21,033 variants are on both arrays
  • 539 variants are on one array

GWAS finishing effort - Simple line counts check

As a QC of the GWAS sum stats freeze, we perform line counts.

We identify the list of (pop, GBE_ID) pairs that satisfy the minimum N >= 100 criterion. We then ask whether we have the results in the array-combined/gwas/current directory.

For the files linked from array-combined/gwas/current directory, we apply wc -l to see if the sum stats are complete.
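
The same check can be sketched in Python. This is only an illustration: it assumes uncompressed text files (compressed files would need to be opened with a matching decompressor), takes the expected count of 1,080,969 from the issues in this thread, and uses hypothetical function names.

```python
EXPECTED_LINES = 1_080_969  # expected line count quoted in the issues


def count_lines(path):
    """Count newline characters in a file, which is what `wc -l` reports."""
    total = 0
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total


def missing_lines(observed, expected=EXPECTED_LINES):
    """Number of lines a file is short of the expected count."""
    return expected - observed


def incomplete_files(paths, expected=EXPECTED_LINES):
    """Map each incomplete file to the number of lines it is missing."""
    result = {}
    for path in paths:
        short_by = missing_lines(count_lines(path), expected)
        if short_by > 0:
            result[path] = short_by
    return result
```

For example, `missing_lines(1_059_397)` gives 21572 and `missing_lines(1_080_278)` gives 691, matching the counts reported in the related issues.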

Summary

missing sum stats

As of 2020/6/27, we have the following number of traits missing in the gwas/current dir

[Screenshot: 2020-07-04 13 22 55]

The corresponding analysis notebook.

For the others and related populations, the jobs were submitted.

incomplete sum stats

As of 2020/6/29, here is the summary of wc -l across populations.

[Screenshot: 2020-07-04 13 15 23]

The corresponding analysis notebook.

GWAS finishing effort - 691 missing variants

It turned out that the summary statistics generated in array-combined/gwas/current do not match the expected number of lines (1,080,969).

There were 407 files with 1,080,278 lines (meaning that 691 lines are missing in each file).

Update phenotype grouping

On GBE, we have been using a phenotype grouping based on the prefix of GBE_IDs.

To improve the interpretability of the phenotype groupings, we will update the phenotype grouping.

Missing lines in the array variant annotation file

The numbers of lines in the pvar and the variant annotation files do not match, indicating that the variant annotation is incomplete.

  • 804070 lines in the variant annotation file
  • 805427 lines in the pvar file

The paths to files are:

/oak/stanford/groups/mrivas/private_data/ukbb/variant_filtering/variant_filter_table.6302020.tsv.gz
/oak/stanford/groups/mrivas/ukbb24983/cal/pgen/ukb24983_cal_cALL_v2_hg19.pvar.zst

cf: we saw a similar issue in exome (#29) hinting that we may have some bugs in the annotation pipeline.

GWAS finishing effort - 402 missing variants

It turned out that the summary statistics generated in array-combined/gwas/current do not match the expected number of lines (1,080,969).

A majority of them have 1,080,566 lines indicating that there are 402 variants missing in the summary statistics.

Phenotyping error for 11 INI phenotypes (coding 339)

As we investigated in #20, the coding annotation and the phenotype files for 11 INI traits are wrong.

To quickly finalize the GWAS analysis, I manually extracted those fields and generated phe files using custom scripts.

Data storage

Need to decide where to store the data and update the documentation.

LDSC h2 for FinnGen

As we see in #27, we would love to compute LDSC h2 first.

We submitted jobs in /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/finngen_r3.

8_ldsc_h2.sh
Submitted batch job 3644060

Missing medication (MED) phenotypes

There is likely an error in MED.py causing certain medication codes to be skipped. See issue #4. Data can be found at $OAK/users/magu/repos/rivas-lab/in_old_not_tools.txt

Bug in master.20190509.phe file?

There seems to be an issue in the current master.phe file (master.20190509.phe).
Specifically, at least one individual (IID == 3000000) is not handled properly.
The IID is duplicated as 3000000 and 3e+06, and the phenotype info is split between the two rows (1875 items are on 3000000 and 58 items are on 3e+06).

$ cat /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.20190509.phe | cut -f1-2 | grep -n 'e+06'
308539:3e+06    3e+06
$ cat /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.20190509.phe | awk 'NR==308539' | cut -f3- | tr "\t" "\n" | grep -v -- "-9" | wc
     58      58    1038
$ cat /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.20190509.phe | cut -f1-2 | grep -n '3000000'
205606:3000000  3000000
$ cat /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.20190509.phe | awk 'NR==205606' | cut -f3- | tr "\t" "\n" | grep -v -- "-9" | wc
   1875    1875    4882
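
A minimal sketch of how such rows could be detected and normalized. It assumes a tab-delimited file with FID and IID in the first two columns, as the `cut -f1-2` output above suggests; the function names are hypothetical. The issue does not state the cause, so treating `3e+06` as a numeric ID that was written in scientific notation is an assumption.

```python
import re

# Matches IDs written in scientific notation, such as "3e+06".
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?[eE][+-]?\d+$")


def find_sci_notation_ids(lines, sep="\t"):
    """Return (line_number, FID, IID) for rows whose IID looks like 3e+06."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(sep)
        if len(fields) >= 2 and SCI_NOTATION.match(fields[1]):
            hits.append((line_no, fields[0], fields[1]))
    return hits


def normalize_id(value):
    """Convert '3e+06' to '3000000'; leave other strings untouched."""
    if SCI_NOTATION.match(value):
        return str(int(float(value)))
    return value
```

After normalization the two rows share the same IID, so their phenotype columns still need to be merged into a single record.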

LDSC munge for UKB sumstats

We convert the UKB sumstats into LDSC munge format.

This will enable us to:

  • Perform the GBE global meta-analysis (#25)
  • Compute the LDSC intercept as a GWAS QC metric (#21)

LDSC rg between FinnGen and UKB

To generate GBE_ID mapping between FinnGen and UKB, we apply LDSC rg between UKB and FinnGen.

We prepared FinnGen in LDSC munge format here.

We are also preparing UKB in LDSC munge format in issue #26.

We use WB sum stats for this rg analysis.

ICD info for UKB M-A sumstats

  • One quick way to do it is to just look at the phenotype info table and sum up Ns across 7 pops.
  • The correct way to handle this is to check the Metal log files and sum up Ns across populations that are actually used in M-A (in some cases, GWAS failed/skipped due to low N).

The first approach was used here: https://github.com/rivas-lab/ukbb-tools/blob/master/18_metal/202006_metal/4_icdinfo.ipynb

We have the results file here: https://github.com/rivas-lab/ukbb-tools/blob/master/18_metal/202006_metal/icdinfo.metal.20200717.txt
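
The difference between the two approaches can be sketched as follows. This is an illustration only: the population labels are the ones used elsewhere in this repository, while the function names and the shape of the inputs (a per-population N dictionary and a list of populations that Metal actually used) are hypothetical.

```python
POPULATIONS = (
    "white_british", "non_british_white", "african",
    "s_asian", "e_asian", "related", "others",
)


def total_n_quick(n_by_pop):
    """Quick approach: sum N over all 7 populations in the phenotype info table."""
    return sum(n_by_pop.get(pop, 0) for pop in POPULATIONS)


def total_n_in_meta_analysis(n_by_pop, pops_used):
    """Stricter approach: sum N only over populations that entered the M-A.

    pops_used would come from parsing the Metal log for the phenotype
    (the parsing itself is not shown here).
    """
    return sum(n_by_pop.get(pop, 0) for pop in pops_used)
```

The two totals differ whenever a population appears in the phenotype info table but its GWAS failed or was skipped due to low N.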

CNV covariates lost in GWAS dependency

Also in gwas.py, lines 97-98: the CNV burden test covariates aren't in the file referenced for that test.

This might be better handled in another part of the repo, but this is where the bug is now.

Missing lines in the exome variant annotation file

The numbers of lines in the pvar and the variant annotation files do not match, indicating that the variant annotation is incomplete.

  • 10316409 lines in the variant annotation file
  • 10448725 lines in the pvar file

The paths to files are:

/oak/stanford/groups/mrivas/ukbb24983/exome/pgen/spb/data/ukb_exm_spb-white_british_variant_annots.tsv.gz
/oak/stanford/groups/mrivas/ukbb24983/exome/pgen/ukb24983_exome.pvar.zst

GWAS finishing effort - wrong phenotypes for coding 319

There was an error in coding annotation activity.

[Screenshot: 2020-07-04 09 20 47]

[Screenshot: 2020-07-04 09 20 03]

This resulted in incomplete phenotype file generation and caused the errors in GWAS for the following traits:

  • INI21049,INI21051,INI21052,INI21053,INI21054,INI21055,INI21056,INI21058,INI21059,INI21060,INI21061

Specifically, logistic regression was performed instead of Gaussian linear regression.

ToDo: exome gVCF download

[ytanigaw@sh-102-07 /scratch/groups/mrivas/ukbb/24983/phenotypedata/download/2003422/28249/ukb28249.23161]$ bash /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/08_bulk_DL/ukbfetch_bulk_wrapper.sh /oak/stanford/groups/mrivas/private_data/ukbb/24983/phenotypedata/download/2003422/28249/ukb28249.23161.bulk /scratch/groups/mrivas/ukbb/24983/phenotypedata/download/2003422/28249/ukb28249.23161 /oak/stanford/groups/mrivas/private_data/ukbb/24983/phenotypedata/download/2003422/28249/ukb28249.key

GWAS freeze

GWAS freeze version 2020/8/15

Output dir:

/oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/freeze/20200815

duplicate flags in gwas.py

Lines 97-98 introduce a "duplicate --covar-variance-standardize flag" error in plink when running any pair (or all 3) of: CNV burden test, non-white-British population, and biomarker phenotype.
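
One way to avoid this kind of duplication is to decide once whether the flag is needed, rather than appending it separately for each condition. The sketch below is not the code in gwas.py; the function and argument names are hypothetical, and it assumes each of the three conditions would otherwise add the same flag.

```python
FLAG = "--covar-variance-standardize"


def extra_plink_flags(cnv_burden, non_white_british, biomarker):
    """Return the standardization flag at most once.

    Each argument is a boolean saying whether that condition applies
    to the current run.
    """
    flags = []
    if cnv_burden or non_white_british or biomarker:
        flags.append(FLAG)
    return flags
```

With this structure, `extra_plink_flags(True, True, True)` contains the flag exactly once.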

UKB Meta-analysis

We have a working version of meta-analysis summary statistics. When we finalize the summary statistics from each population, we should refresh the M-A as well.

Exome 200k GWAS QC

Exome 200k GWAS

Exome 200k GWAS is mostly finished.

[ytanigaw@sh02-02n07 /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/20201026_exome_gwas_parallel]$ bash 5_count_sumstats.sh
3360    white_british
3152    non_british_white
3031    others
2921    related
2897    s_asian
2641    african
2011    e_asian
1922    metal

For now, we applied metal to phenotypes with summary statistics from all 7 populations.

We should

  • run wc -l
  • check the log files in 3b_merge_job_list.20201102-183728.tsv
  • apply metal for the remaining summary statistics
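
To find the phenotypes that still need Metal, the per-population lists of finished summary statistics can be compared. A minimal sketch, assuming the phenotype IDs per population have already been collected into sets (for instance from file names); the function name and input shape are hypothetical.

```python
def split_by_population_coverage(ids_by_pop):
    """Split phenotype IDs into those present in every population and the rest.

    ids_by_pop: dict mapping population name -> iterable of phenotype IDs.
    Returns (complete, partial) as two sets.
    """
    id_sets = [set(ids) for ids in ids_by_pop.values()]
    if not id_sets:
        return set(), set()
    complete = set.intersection(*id_sets)
    partial = set.union(*id_sets) - complete
    return complete, partial
```

The `complete` set corresponds to phenotypes with summary statistics from all populations; the `partial` set lists the remaining ones.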

Clean-up files in /gwas dir

We have some intermediate files in the GWAS dir, like the ones from autosomal runs and chrX runs.

Let's clean them up.

GBE Global Meta-Analysis

Summary

While we have UKB M-A in #22, we will also perform a meta-analysis across cohorts.

This involves several tasks.

  1. Prepare summary statistics.
     • Convert to plink2 format, because most of the analysis pipelines were designed for that format.
     • Apply liftOver so that we have the summary statistics in hg19 coordinates.
     • Make sure that the same variants have the same IDs, because Metal uses the variant ID column in the input file.
     • Also prepare summary statistics in LDSC munge format (see the LDSC rg section below).
  2. Identify the phenotype mapping across cohorts.
     • We don't have GBE_IDs for other cohorts. We will generate the phenotype mapping to enable M-A.
     • To perform a semi-automated mapping assignment, we will apply LDSC rg.
  3. Perform the meta-analysis.
     • We apply Metal as in #22.
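
For the variant ID requirement in the first task, one possible convention is to rebuild every ID from chromosome, position, and alleles after liftOver. The `CHROM:POS:REF:ALT` scheme and the function name below are assumptions for illustration; the issue does not specify which ID format is used.

```python
def make_variant_id(chrom, pos, ref, alt):
    """Build a cohort-independent variant ID of the form CHROM:POS:REF:ALT.

    A leading 'chr' prefix is dropped and alleles are upper-cased so that
    the same variant gets the same ID regardless of the source file style.
    """
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}:{int(pos)}:{ref.upper()}:{alt.upper()}"
```

Note that this does not handle strand flips or swapped REF/ALT alleles, which would need a separate check against a reference before the IDs can be trusted to match.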
