rivas-lab / ukbb-tools

Tools for preprocessing, QC, and preliminary analyses from raw UK BioBank data

Python 1.63% Jupyter Notebook 94.73% Shell 3.06% R 0.56% HTML 0.02%

ukbiobank gwas gwas-summary-statistics grm ld-score-regression phewas phenotyping-algorithms global-biobank-engine

ukbb-tools's Introduction

ukbb-tools

This repository contains our complete set of tools for preprocessing, quality control, and preliminary analyses on UK Biobank data. There is a folder in the repo per set of methods as defined in the Table of Contents below. Each subdirectory has a README.md file that should be read before use. These files detail how to use all files within the directory.

The `ukbb-tools` module on Sherlock

All this code has been ported to a module on Sherlock.

An updater script pushes your current directory and makes it a version of the module. Use it with appropriate caution, as it takes the master branch. The only argument for the updater is a date, which is used as the version label.

Example Usage:

bash ukbb-tools.module.updater.sh 20200225

ukbb-tools's Issues

GWAS QC table

For the latest GWAS freeze, let's compute the following metrics and tabulate them:

Lambda GC across different frequency bins
LD score (heritability estimates)
LD score intercept
Number of hits
Number of independent hits (load in LD pruned set)
non-NA line count
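As a rough illustration of the first item, Lambda GC can be computed as the median observed chi-square statistic divided by the median of a chi-square distribution with 1 degree of freedom. The sketch below uses only the Python standard library; the function names and the MAF bin edges are hypothetical and are not taken from the repository.

```python
from statistics import NormalDist, median

_NORM = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = _NORM.inv_cdf(0.75) ** 2


def lambda_gc(p_values):
    """Genomic inflation factor from two-sided p-values."""
    # A two-sided p-value maps to chi-square = (z quantile at p / 2) squared.
    chi2 = [_NORM.inv_cdf(p / 2.0) ** 2 for p in p_values if 0.0 < p <= 1.0]
    return median(chi2) / CHI2_1DF_MEDIAN


def lambda_gc_by_maf_bin(records, bin_edges=(0.0, 0.01, 0.05, 0.5)):
    """Lambda GC per MAF bin.

    records: iterable of (maf, p_value) pairs.
    bin_edges: hypothetical cut points; each bin is (low, high].
    """
    bins = {}
    for maf, p in records:
        if not 0.0 < p <= 1.0:
            continue  # skip NA-like or invalid p-values
        for low, high in zip(bin_edges, bin_edges[1:]):
            if low < maf <= high:
                bins.setdefault((low, high), []).append(p)
                break
    return {bin_: lambda_gc(ps) for bin_, ps in bins.items()}
```

Uniformly distributed p-values should give a value close to 1; values well above 1 suggest inflation.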

GWAS finishing effort - 21,572 missing variants

It turned out that the summary statistics generated in array-combined/gwas/current do not match the expected number of lines (1,080,969).

There are 880 files with 1,059,397 lines (meaning that 21,572 lines are missing in each file).

21,033 variants are on both arrays
539 variants are on one array

GWAS finishing effort - Simple line counts check

As a QC of the GWAS sum stats freeze, we perform line counts.

We identify the list of (pop, GBE_ID) pairs that satisfy the minimum N >= 100 criterion. We then ask whether we have the results in the array-combined/gwas/current directory.

For the files linked from array-combined/gwas/current directory, we apply wc -l to see if the sum stats are complete.
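The same check can be sketched in Python. This is a minimal sketch, not the notebook used for the analysis: the function names are hypothetical, and it assumes the files are either plain text or gzip-compressed.

```python
import gzip

EXPECTED_LINES = 1_080_969  # expected line count quoted in the issues


def count_lines(path):
    """Count newline characters, like `wc -l`, for plain or gzip files."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded at once.
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))


def incomplete_files(paths, expected=EXPECTED_LINES):
    """Return {path: line_count} for files whose count differs from expected."""
    counts = {str(path): count_lines(path) for path in paths}
    return {path: n for path, n in counts.items() if n != expected}
```

Grouping the returned counts by value makes recurring gaps (for example, many files missing the same number of lines) easy to spot.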

Summary

missing sum stats

As of 2020/6/27, we have the following number of traits missing in the gwas/current dir

The corresponding analysis notebook.

For others and related, the jobs were submitted.

incomplete sum stats

As of 2020/6/29, here is the summary of wc -l across populations.

The corresponding analysis notebook.

GWAS finishing effort - 691 missing variants

It turned out that the summary statistics generated in array-combined/gwas/current do not match the expected number of lines (1,080,969).

There were 407 files with 1,080,278 lines (meaning that 691 lines are missing in each file).

Ensuring tab-delimiting across all .phe files

e.g. HC and cancer v2 phenotypes seem to be space-delimited
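One possible way to check and normalize the delimiter is sketched below. The helper names are hypothetical, and the conversion assumes that no field in a space-delimited file contains whitespace.

```python
def to_tab_delimited(line):
    """Return the line with fields separated by single tabs.

    Lines that already contain a tab are returned unchanged (minus the
    trailing newline), so fields containing spaces are not split.
    """
    body = line.rstrip("\n")
    if "\t" in body:
        return body
    return "\t".join(body.split())


def is_tab_delimited(lines):
    """True when all non-empty lines share one tab-separated field count above 1."""
    widths = {len(line.rstrip("\n").split("\t")) for line in lines if line.strip()}
    return len(widths) == 1 and widths.pop() > 1
```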

Update phenotype grouping

On GBE, we have been using a phenotype grouping based on the prefix of GBE_IDs.

To improve the interpretability of the phenotype groupings, we will update the phenotype grouping.

Missing lines in the array variant annotation file

The number of lines in the pvar and the variant annotation files do not match, indicating that the variant annotation is incomplete.

804070 lines in the variant annotation file
805427 lines in the pvar file

The paths to files are:

/oak/stanford/groups/mrivas/private_data/ukbb/variant_filtering/variant_filter_table.6302020.tsv.gz
/oak/stanford/groups/mrivas/ukbb24983/cal/pgen/ukb24983_cal_cALL_v2_hg19.pvar.zst

cf: we saw a similar issue in exome (#29) hinting that we may have some bugs in the annotation pipeline.
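The two counts above differ by 1,357 lines. To find out which variants are missing, the IDs in both files can be compared as sets. The sketch below works on already-decompressed lines; the annotation ID column index is an assumption, while the pvar ID is taken from the third column (#CHROM, POS, ID).

```python
def missing_variant_ids(pvar_lines, annotation_lines, pvar_id_col=2, annot_id_col=0):
    """Variant IDs present in the pvar but absent from the annotation table.

    annot_id_col is a hypothetical column index; adjust it to the real table.
    """
    def ids(lines, col):
        return {
            line.rstrip("\n").split("\t")[col]
            for line in lines
            if line.strip() and not line.startswith("#")
        }

    annotated = ids(annotation_lines, annot_id_col)
    return sorted(ids(pvar_lines, pvar_id_col) - annotated)
```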

GWAS finishing effort - re-run for 236 traits

In #21, there were 302 files that needed to be re-run.

After finishing #20, this number was reduced to 236.

We submitted those on scg4.

GWAS finishing effort - 402 missing variants

It turned out that the summary statistics generated in array-combined/gwas/current do not match the expected number of lines (1,080,969).

A majority of them have 1,080,566 lines indicating that there are 402 variants missing in the summary statistics.

Phenotyping error for 11 INI phenotypes (coding 339)

As we investigated in #20, the coding annotation and the phenotype files for 11 INI traits are wrong.

To quickly finalize the GWAS analysis, I manually extracted those fields and generated phe files using custom scripts.

re-define 16698 phenotypes using 24983 data (the most recent one)

https://github.com/rivas-lab/ukbb16698wiki/tree/master/phenotype_data

Data storage

Need to decide where to store the data and update the documentation.

LDSC h2 for FinnGen

As we see in #27, we would love to compute LDSC h2 first.

We submitted jobs in /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/finngen_r3.

8_ldsc_h2.sh
Submitted batch job 3644060

bug in phenotyping

make sure to check the treatment for -9, etc.

Multiple phenotype files in gwas.py

I would like for multiple phenotype files to be handled in gwas.py so that sbatch array jobs can be submitted.
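One way this could work, assuming the job is submitted with `sbatch --array=1-N` and the list of phenotype files is passed to the script, is to select the file by the 1-based `SLURM_ARRAY_TASK_ID` environment variable. The function below is a hypothetical helper, not part of the current gwas.py.

```python
import os


def phenotype_for_task(phe_files, task_id=None):
    """Pick the phenotype file for this array task (1-based index)."""
    if task_id is None:
        # Slurm sets this variable for each task of an array job.
        task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
    if not 1 <= task_id <= len(phe_files):
        raise IndexError(f"task id {task_id} is outside 1..{len(phe_files)}")
    return phe_files[task_id - 1]
```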

Missing medication (MED) phenotypes

There is likely an error in MED.py causing certain medication codes to be skipped. See issue #4. Data can be found at $OAK/users/magu/repos/rivas-lab/in_old_not_tools.txt

Bug in master.20190509.phe file?

It seems like there is an issue in the current master.phe file (master.20190509.phe).
Specifically, there is at least one individual (IID == 3000000) that is not properly handled.
The IID is duplicated into 3000000 and 3e+06 and the phenotype info is scattered around those (1875 items are on 3000000 and 58 items are on 3e+06).

$ cat /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.20190509.phe | cut -f1-2 | grep -n 'e+06'
308539:3e+06    3e+06
$ cat /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.20190509.phe | awk 'NR==308539' | cut -f3- | tr "\t" "\n" | grep -v -- "-9" | wc
     58      58    1038

$ cat /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.20190509.phe | cut -f1-2 | grep -n '3000000'
205606:3000000  3000000
$ cat /oak/stanford/groups/mrivas/ukbb24983/phenotypedata/master_phe/master.20190509.phe | awk 'NR==205606' | cut -f3- | tr "\t" "\n" | grep -v -- "-9" | wc
   1875    1875    4882
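A possible repair, sketched below, is to normalize IDs written in scientific notation back to integers and then merge rows that end up with the same ID. The merge rule (keep the first non-missing value, with -9 as the missing code) is an assumption and should be checked against how the master file is built.

```python
def normalize_iid(raw):
    """Convert IDs such as '3e+06' back to their integer string form."""
    try:
        return str(int(raw))
    except ValueError:
        value = float(raw)
        if not value.is_integer():
            raise ValueError(f"non-integer ID: {raw!r}")
        return str(int(value))


def merge_rows(rows, missing="-9"):
    """Merge data rows sharing a normalized (FID, IID); header rows must be removed first.

    rows: lists of strings with FID and IID in the first two columns.
    """
    merged = {}
    for row in rows:
        key = (normalize_iid(row[0]), normalize_iid(row[1]))
        if key not in merged:
            merged[key] = list(key) + list(row[2:])
            continue
        target = merged[key]
        for i, value in enumerate(row[2:], start=2):
            if target[i] == missing:
                target[i] = value  # fill gaps only; never overwrite data
    return list(merged.values())
```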

missing phenotypes in icdinfo

diff icdinfos

v1:

/oak/stanford/groups/mrivas/users/$USER/repos/rivas-lab/wiki/ukbb/icdinfo/icdinfo.txt

v2:

https://github.com/rivas-lab/ukbb-tools/blob/master/02_phenotyping/icdinfo.txt

LDSC munge for UKB sumstats

We convert the UKB sumstats into LDSC munge format.

This will enable us to perform the following:

GBE global meta-analysis #25
Compute LDSC intercept as a GWAS QC metric #21

LDSC rg between FinnGen and UKB

To generate GBE_ID mapping between FinnGen and UKB, we apply LDSC rg between UKB and FinnGen.

We prepared FinnGen in LDSC munge format here.

We are also preparing UKB in LDSC munge format in issue #26.

We use WB sum stats for this rg analysis.

ICD info for UKB M-A sumstats

One quick way to do it is to just look at the phenotype info table and sum up Ns across 7 pops.
The correct way to handle this is to check the Metal log files and sum up Ns across populations that are actually used in M-A (in some cases, GWAS failed/skipped due to low N).

The first approach was used here: https://github.com/rivas-lab/ukbb-tools/blob/master/18_metal/202006_metal/4_icdinfo.ipynb

We have the results file here: https://github.com/rivas-lab/ukbb-tools/blob/master/18_metal/202006_metal/icdinfo.metal.20200717.txt
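The difference between the two approaches can be illustrated as follows. This is a sketch only: it does not parse Metal log files, and the set of populations actually used is passed in by the caller.

```python
def total_n(n_by_pop, pops_used=None):
    """Sum per-population sample sizes.

    With pops_used=None this is the quick approach (all populations);
    otherwise only the populations that entered the meta-analysis count.
    """
    pops = n_by_pop.keys() if pops_used is None else pops_used
    return sum(n_by_pop[pop] for pop in pops)
```

For a trait where the GWAS was skipped in some populations, the quick sum overstates the N that actually entered the M-A.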

Revisit the population-specific PCs

As you can see in this notebook, the population-specific PCs computed for WB suffer from weird batch effects.

GWAS summary statistics should be stored in a compressed format

The output from the PLINK GWAS (such as ukb24983_v2.{GBE_code}.genotyped.glm.linear, ukb24983_v2.{GBE_code}.genotyped.glm.logistic.hybrid) should be compressed.
Specifically, we should apply bgzip and tabix for compression & indexing.
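A minimal sketch of the compression step is shown below. It assumes that `bgzip` and `tabix` are on the PATH, that the file is sorted by chromosome and position, and that the chromosome and position are in columns 1 and 2 with a header line starting with `#`. The function names are hypothetical.

```python
import subprocess


def compression_commands(sumstats_path):
    """Build the bgzip and tabix command lines for one summary statistics file."""
    compressed = f"{sumstats_path}.gz"
    return [
        # -f overwrites an existing .gz file.
        ["bgzip", "-f", sumstats_path],
        # Columns 1 and 2 are assumed to hold chromosome and position.
        ["tabix", "-s", "1", "-b", "2", "-e", "2", compressed],
    ]


def compress_and_index(sumstats_path, run=subprocess.run):
    """Run both commands in order, stopping at the first failure."""
    for command in compression_commands(sumstats_path):
        run(command, check=True)
```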

CNV covariates lost in GWAS dependency

Also in gwas.py lines 97-98: the CNV burden test covariates aren't in the file referenced for that test.

this might be better handled in another part of the repo, but it's where the bug is now

Flip fix in the GWAS summary statistics

Does the current pipeline fix the allele flip potentially caused by PLINK?
Or is it no longer a problem?

In the previous version of our pipeline, we applied the flip fix script[1] like this:

https://github.com/rivas-lab/ukbb24983wiki/blob/master/scripts/ukb-cal_gwas-v4.sh#L110
https://github.com/rivas-lab/ukbb24983wiki/blob/master/scripts/ukb-cal_gwas-v4.sh#L120

[1] https://github.com/rivas-lab/ukbb24983wiki/blob/master/scripts/flipfix_A1A2-v2.py
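For readers unfamiliar with the problem, the general idea of a flip fix can be sketched as follows. This is not the logic of the flipfix_A1A2-v2.py script, which is not reproduced here. It assumes each row has REF, ALT, A1 (the effect allele) and a numeric BETA, and that the desired effect allele is ALT.

```python
def flip_to_alt(row):
    """Return a copy of the row whose effect allele (A1) is ALT.

    row: dict with keys REF, ALT, A1 and BETA (float). Hypothetical layout.
    """
    if row["A1"] == row["ALT"]:
        return dict(row)
    if row["A1"] != row["REF"]:
        raise ValueError(f"A1 {row['A1']!r} matches neither REF nor ALT")
    flipped = dict(row)
    flipped["A1"] = row["ALT"]
    # Swapping the effect allele reverses the sign of the effect size.
    flipped["BETA"] = -row["BETA"]
    return flipped
```

For logistic regression output reported as odds ratios, the corresponding change would be taking the reciprocal rather than negating.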

Missing lines in the exome variant annotation file

The number of lines in the pvar and the variant annotation files do not match, indicating that the variant annotation is incomplete.

10316409 lines in the variant annotation file
10448725 lines in the pvar file

The paths to files are:

/oak/stanford/groups/mrivas/ukbb24983/exome/pgen/spb/data/ukb_exm_spb-white_british_variant_annots.tsv.gz
/oak/stanford/groups/mrivas/ukbb24983/exome/pgen/ukb24983_exome.pvar.zst

GWAS finishing effort - wrong phenotypes for coding 319

There was an error in coding annotation activity.

This resulted in incomplete phenotype file generation and caused the errors in GWAS for the following traits:

INI21049,INI21051,INI21052,INI21053,INI21054,INI21055,INI21056,INI21058,INI21059,INI21060,INI21061

Specifically, logistic regression was performed instead of Gaussian linear regression.

ToDo: exome gVCF download

[ytanigaw@sh-102-07 /scratch/groups/mrivas/ukbb/24983/phenotypedata/download/2003422/28249/ukb28249.23161]$ bash /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/08_bulk_DL/ukbfetch_bulk_wrapper.sh /oak/stanford/groups/mrivas/private_data/ukbb/24983/phenotypedata/download/2003422/28249/ukb28249.23161.bulk /scratch/groups/mrivas/ukbb/24983/phenotypedata/download/2003422/28249/ukb28249.23161 /oak/stanford/groups/mrivas/private_data/ukbb/24983/phenotypedata/download/2003422/28249/ukb28249.key

GWAS freeze

GWAS freeze version 2020/8/15

Output dir:

/oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/freeze/20200815

duplicate flags in gwas.py

Lines 97-98 introduce a "duplicate --covar-variance-standardize flag error" in plink when running any pair (or all 3) of the following: CNV burden test, non-white-British population, and biomarker phenotype.
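One way to avoid the duplicate is to collect the flags first and remove repeats before building the command line. The sketch below assumes that each of the three conditions appends the same value-less flag; the function and parameter names are hypothetical and do not mirror gwas.py.

```python
def dedupe_flags(flags):
    """Drop repeated flags, keeping the first occurrence and the original order.

    Only safe for flags that take no value.
    """
    return list(dict.fromkeys(flags))


def build_extra_flags(cnv_burden=False, non_wb_population=False, biomarker=False):
    """Collect the extra plink flags for a run (hypothetical helper)."""
    flags = []
    for enabled in (cnv_burden, non_wb_population, biomarker):
        if enabled:
            flags.append("--covar-variance-standardize")
    return dedupe_flags(flags)
```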

10_phe_adjustment should be moved to 02_phenotyping/extras and documented

Gene interval list files should be documented

Process for generating dependency files for phewas' --gene option (genes_hg19.txt) should be documented.

Downloading the imaging dataset

Bulk Download of the imaging datasets

Liver images - gradient echo - DICOM (UKB Field: 20203)
Pancreatic fat - DICOM (UKB Field: 20202)
Pancreas Images - gradient echo - DICOM (UKB Field: 20260)

https://github.com/rivas-lab/ukbb-tools/tree/master/08_bulk_DL/20200922_imaging

37751 ./ukb2005693.41413.20260/ukb2005693.41413.20260.bulk
44108 ./ukb2005693.41413.20202/ukb2005693.41413.20202.bulk
10108 ./ukb2005693.41413.20203/ukb2005693.41413.20203.bulk

UKB Meta-analysis

We have a working version of meta-analysis summary statistics. When we finalize the summary statistics from each population, we should refresh the M-A as well.

Exome 200k GWAS QC

Exome 200k GWAS

Exome 200k GWAS is mostly finished.

[ytanigaw@sh02-02n07 /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/20201026_exome_gwas_parallel]$ bash 5_count_sumstats.sh
3360    white_british
3152    non_british_white
3031    others
2921    related
2897    s_asian
2641    african
2011    e_asian
1922    metal

For now, we applied metal to phenotypes with summary statistics from all 7 populations.

We should

run wc -l
check the log files in 3b_merge_job_list.20201102-183728.tsv
apply metal for the remaining summary statistics
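To plan the remaining Metal runs, the phenotypes can be split into those with summary statistics from all 7 populations and those with only some. The sketch below uses the population names from the count table above; the input mapping and the function name are hypothetical.

```python
POPULATIONS = (
    "white_british", "non_british_white", "others", "related",
    "s_asian", "african", "e_asian",
)


def split_by_coverage(available, populations=POPULATIONS):
    """Split phenotypes by population coverage.

    available: {GBE_ID: set of populations with summary statistics}.
    Returns (complete, partial) as sorted lists of GBE_IDs.
    """
    required = set(populations)
    complete = sorted(gbe for gbe, pops in available.items() if required <= set(pops))
    partial = sorted(gbe for gbe, pops in available.items() if not required <= set(pops))
    return complete, partial
```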

GBE Global Meta-Analysis

Summary

While we have UKB M-A in #22, we will also perform a meta-analysis across cohorts.

This involves several tasks.

Prepare summary statistics.

Convert the format to plink2 format, because most of the analysis pipelines were designed for that format.
Apply liftOver so that we have the summary statistics in hg19 coordinates.
We should also make sure that the same variants have the same IDs because Metal uses the variant ID column in the input file.
We will also prepare summary statistics in LDSC munge format (see the LDSC rg section below).
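To keep variant IDs consistent across cohorts, one option is to rebuild the ID from the lifted-over coordinates and alleles. The `chrom:pos:ref:alt` scheme below is an assumption for illustration, not necessarily the ID format used in the UKB summary statistics.

```python
def variant_id(chrom, pos, ref, alt):
    """Build a canonical variant ID of the form chrom:pos:ref:alt."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # treat 'chr1' and '1' as the same chromosome
    return f"{chrom}:{int(pos)}:{ref.upper()}:{alt.upper()}"
```

Because Metal matches on the variant ID column, the same rule has to be applied to every input file, after allele orientation has been harmonized.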

Identify the phenotype mapping across cohorts

We don't have GBE_ID for other cohorts. We will generate the phenotype mapping to enable M-A.
To perform a semi-automated mapping assignment, we will apply LDSC rg.

Perform the meta-analysis

We apply Metal as in #22
