
S-PCGC

Heritability, genetic correlation and functional enrichment estimation for case-control studies.

S-PCGC is an adaptation of stratified LD score regression (S-LDSC) for case-control studies. Like S-LDSC, S-PCGC can estimate genetic heritability, genetic correlation and functional enrichment. However, S-PCGC is explicitly designed for case-control studies rather than quantitative phenotypes. This can make a large difference, especially in the presence of strong non-genetic risk factors. Parts of the S-PCGC code are adapted from S-LDSC with permission.

The main features of S-PCGC are:

  1. S-PCGC is designed for case-control studies. Such studies include many subtleties not accounted for by methods designed for quantitative phenotypes like S-LDSC.
  2. Seamless integration with S-LDSC format. S-PCGC accepts the same input and creates the same output files as S-LDSC.
  3. Computational efficiency. S-PCGC can analyze datasets with hundreds of thousands of individuals and dozens of functional annotations in a few hours.

S-PCGC is described in the following paper:
Estimating SNP-Based Heritability and Genetic Correlation in Case-Control Studies Directly and with Summary Statistics. The American Journal of Human Genetics, 103(1), 89-99. This paper, however, doesn't include a description of functional enrichment estimation. We will post a bioRxiv preprint with details soon.

Please note: We switched from Python 2.7 to Python 3 on April 25, 2019!



Installation

S-PCGC is designed for Python 3 and depends on the following freely available Python packages: numpy, scipy, scikit-learn, pandas, pandas-plink and tqdm.

We recommend running S-PCGC via the Anaconda Python distribution. In Anaconda, you can install each package with the command "conda install <package_name>". Alternatively, the packages can be installed with the command "pip install --user <package_name>".

Once all the prerequisite packages are installed, you can install S-PCGC on a git-enabled machine by typing:

git clone https://github.com/omerwe/S-PCGC

Alternatively, you can download S-PCGC as a zip file from the Github website. After downloading, we recommend checking that everything is ok by typing python test_sumstats.py (if you have pytest installed, you can simply go to the project directory and type pytest). It should run for a few minutes; if everything is ok, you shouldn't see an error.



Usage overview

S-PCGC performs a case-control analysis in four stages:

  1. Generate a sync file for your annotations. This is a very simple offline step that only needs to be run once. It only gathers some information about the annotations (e.g., the minimum value of each annotation across all SNPs).
  2. Estimate the cross-product of r^2 values across all pairs of functional annotations, via a reference panel such as 1000 genomes. This step is similar to LD-score computation.
  3. Generate summary statistics. These are specialized summary statistics explicitly designed for S-PCGC (unlike standard summary statistics analyzed by S-LDSC).
  4. Estimate functional enrichment. This is what we came here for.

S-PCGC fully supports S-LDSC input and output formats, which enables it to interact with the S-LDSC ecosystem. We therefore recommend that you familiarize yourself with S-LDSC before running S-PCGC. For example, you can input S-PCGC results into the S-LDSC scripts for partitioned heritability from continuous annotations.



A toy example

The following is a simple end-to-end S-PCGC analysis, which can run in ~2 minutes. We will estimate heritability, genetic correlation and enrichment for four simulated functional annotations in two simulated case-control studies with 250 shared controls and a disease population prevalence of 1%. To estimate the cross-product of r^2 values, we will use a (simulated) reference panel. All the input files are found in the directory example. To begin, please cd into the S-PCGC directory, and type the following commands (using the anaconda version of python if available):

mkdir temp_results

#create a sync file (a one-time offline operation)
python pcgc_sync.py \
--annot-chr example/model. \
--frqfile-chr example/model. \
--out temp_results/model

#Compute cross-r^2 between functional annotations
python pcgc_r2.py \
--bfile example/ref_panel \
--annot-chr example/model. \
--sync temp_results/model. \
--out temp_results/prodr2

#Compute summary statistics for study 1
python pcgc_sumstats_creator.py \
--bfile example/s1 \
--pheno example/s1.phe \
--covar example/s1.cov \
--frqfile-chr example/model. \
--annot-chr example/model. \
--sync temp_results/model. \
--prev 0.01 \
--out temp_results/s1

#Compute summary statistics for study 2
python pcgc_sumstats_creator.py \
--bfile example/s2 \
--pheno example/s2.phe \
--covar example/s2.cov \
--frqfile-chr example/model. \
--annot-chr example/model. \
--sync temp_results/model. \
--prev 0.01 \
--out temp_results/s2

#Run S-PCGC to estimate h^2, rg and functional enrichment
python pcgc_main.py \
--annot-chr example/model. \
--sync temp_results/model. \
--sumstats temp_results/s1.,temp_results/s2. \
--prodr2 temp_results/prodr2. \
--out temp_results/results

#view heritability estimates for the two studies
cat temp_results/results.s1.output | column -t
cat temp_results/results.s2.output | column -t

#view a table of genetic correlation estimates between the studies
cat temp_results/results.rg

#view the functional enrichment estimates of the two studies
cat temp_results/results.s1.results | column -t
cat temp_results/results.s2.results | column -t

Some quick comments about this example:

  1. Many of the flags are analogous to S-LDSC flags and have similar names.
  2. The .results files have the exact same format as S-LDSC output files.
  3. The .output results show both marginal and conditional heritability (please see details below).
  4. The flag --sumstats can accept any number of comma-separated files.
  5. In this example we ran pcgc_r2.py and pcgc_sumstats_creator.py on the entire genome. In real data analyses it may be easier to run these scripts on each chromosome separately, and then use the flags --prodr2-chr and --sumstats-chr when calling pcgc_main.py.
  6. S-PCGC supports many more options than shown here. For a full list and explanations, please type python <file_name> --help
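The per-chromosome pattern described in comment 5 can be sketched as a small Python driver that builds one pcgc_r2.py command per chromosome. The path prefixes below mirror the real-data example and are assumptions for illustration, not required names:

```python
# Hypothetical helper: build one pcgc_r2.py command (as an argument list)
# per chromosome, following the file-name patterns used in this README.
def pcgc_r2_commands(annot_prefix, sync_prefix, bfile_prefix, out_prefix,
                     chromosomes=range(1, 23)):
    commands = []
    for c in chromosomes:
        commands.append([
            "python", "pcgc_r2.py",
            "--annot", f"{annot_prefix}{c}.",   # per-chromosome annotations
            "--sync", sync_prefix,              # the single sync file prefix
            "--bfile", f"{bfile_prefix}{c}",    # per-chromosome plink files
            "--out", f"{out_prefix}{c}",
        ])
    return commands

cmds = pcgc_r2_commands("baselineLD_v2.2/baselineLD.",
                        "baselineLD_v2.2/baselineLD.",
                        "1000G/1000G.EUR.QC.",
                        "baselineLD_v2.2/baselineLD.goodSNPs.")
print(len(cmds))  # one command per autosome
```

Each list can then be passed to subprocess.run, or printed and submitted to a cluster scheduler.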

An example with real annotations

The following example uses simulated genotypes and real functional annotations from the Baseline-LD model. In this example, we will use a representative set of genotyped or well-imputed SNPs stored in the file example/good_snps.txt; generally, 'good SNPs' should be genotyped or well-imputed SNPs (e.g. INFO score > 0.9) that passed QC and are not in the MHC region (see details below). To run this example, you need a directory called 1000G that contains plink files of European individuals from the 1000 Genomes reference panel (one file per chromosome; see download instructions below). You can create a symbolic link to this directory from your working directory with the command ln -s <path_to_1000G> 1000G.

#download and uncompress the Baseline-LD (v2.2) annotations
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_baselineLD_v2.2_ldscores.tgz
tar -xzvf 1000G_Phase3_baselineLD_v2.2_ldscores.tgz

#Compute 1000G MAFs (please change ~/plink/plink to the local path of your plink executable)
for i in {1..22};
do
    ~/plink/plink \
    --bfile 1000G/1000G.EUR.QC.${i} \
    --freq \
    --out 1000G/1000G.EUR.QC.${i}
done

#run pcgc_sync.py to collect annotations details
python pcgc_sync.py \
--annot-chr baselineLD_v2.2/baselineLD. \
--frqfile-chr 1000G/1000G.EUR.QC. \
--out baselineLD_v2.2/baselineLD

#run pcgc_r2.py on each chromosome file, using the set of 'good SNPs'
for i in {1..22};
do
    python pcgc_r2.py \
    --annot baselineLD_v2.2/baselineLD.${i}. \
    --sync baselineLD_v2.2/baselineLD. \
    --bfile 1000G/1000G.EUR.QC.${i} \
    --extract example/good_snps.txt \
    --out baselineLD_v2.2/baselineLD.goodSNPs.${i} 
done

#Create summary statistics
mkdir s1_sumstats
for i in {1..22};
do
    python pcgc_sumstats_creator.py \
    --bfile example/s1_chr${i} \
    --extract example/good_snps.txt \
    --pheno example/s1.phe \
    --covar example/s1.cov \
    --annot baselineLD_v2.2/baselineLD.${i}. \
    --frqfile 1000G/1000G.EUR.QC.${i}. \
    --sync baselineLD_v2.2/baselineLD. \
    --prev 0.01 \
    --out s1_sumstats/s1_chr${i}
done

#estimate heritability and functional enrichment
python pcgc_main.py \
--annot-chr baselineLD_v2.2/baselineLD. \
--sync baselineLD_v2.2/baselineLD. \
--sumstats-chr s1_sumstats/s1_chr \
--prodr2-chr baselineLD_v2.2/baselineLD.goodSNPs. \
--out s1_sumstats/pcgc

#view heritability estimates
cat s1_sumstats/pcgc.s1_chr.output | column -t

#view functional enrichment estimates
cat s1_sumstats/pcgc.s1_chr.results | column -t



Additional information



Obtaining reference panel and annotation files

S-PCGC is fully compatible with the S-LDSC input format. It requires two pieces of information also used by S-LDSC:

  1. Functional annotation files. We recommend using the Baseline-LD 2.2 model (Gazal et al. 2017). You can download it by typing:
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_baselineLD_v2.2_ldscores.tgz
  2. A reference panel (required to compute the cross-product of r^2 values). We recommend using 1000 Genomes data from a relevant population. For example, you can download 1000G data from the 1000 Genomes FTP site and then convert it to plink format using the plink --vcf command.



Regressing genotypes on principal components

It is common to include principal components (PCs) as covariates to prevent possible confounding due to population structure. We recommend computing PCs via external software (e.g. FlashPCA2) and including them as additional covariates before computing summary statistics.

A complexity of case-control studies is that the PCs cannot be regressed from the phenotypes. We therefore recommend regressing the PCs from the genotypes instead. You can do this in pcgc_sumstats_creator.py via the following two flags:

  1. --covars-regress <covariate names>. This is a comma-separated list of names of covariates that will be regressed out of the genotypes (i.e., PCs).
  2. --pve <pve file>. This is the name of a file with a single column and no header that shows the fraction of variance explained by each PC mentioned in --covars-regress. Common PCA software like FlashPCA2 produce such a file.
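Conceptually, regressing PCs out of genotypes amounts to residualizing each SNP column on the PC scores. The sketch below illustrates this idea with simulated data; it is not the actual S-PCGC implementation:

```python
import numpy as np

# Illustration only: residualize genotypes on principal components via OLS.
rng = np.random.default_rng(0)
n, m, k = 100, 50, 4                               # individuals, SNPs, PCs
G = rng.integers(0, 3, size=(n, m)).astype(float)  # 0/1/2 genotype matrix
pcs = rng.standard_normal((n, k))                  # PC scores (covariates)

# Project each SNP column onto the PCs and keep only the residual part
beta, *_ = np.linalg.lstsq(pcs, G, rcond=None)
G_resid = G - pcs @ beta

# By construction the residualized genotypes are orthogonal to the PCs
print(np.abs(pcs.T @ G_resid).max())  # numerically close to 0
```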



Case-Control GWAS Simulator

To gain more intuition about case-control studies, you may wish to look at the GWAS simulator code included in this package. As a gentle interface, you can look at the last few lines of the file run_pcgc_simulations.py. These lines continuously simulate case-control studies with annotations and with no LD, compute PCGC and GCTA estimates, and report the average estimation bias relative to the true generative values. S-LDSC is not included in the comparison because it cannot be invoked in the absence of LD (simulations with LD are possible but are much slower and more complicated; please contact me if you are interested in these).



Important notes

  1. pcgc_r2.py must use exactly the same SNPs as pcgc_sumstats_creator.py. If you use --extract in one of them, you must use it in the other as well.

  2. The set of SNPs with summary statistics (as in good_snps.txt in the example above) should ideally be a set of genotyped or well-imputed SNPs (e.g. INFO score > 0.9) that passed QC. It is highly recommended to also exclude the MHC region (chromosome 6 28M-32M) from these SNPs. One reasonable choice of SNPs is the set of HapMap3 SNPs (which you can download here), as these SNPs are typically genotyped or well-imputed. However, please make sure to also exclude SNPs within the MHC region.
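As an illustration of the MHC exclusion described above (this filter is not part of S-PCGC itself), a candidate SNP table with CHR/BP columns can be filtered like this:

```python
import pandas as pd

# Toy SNP table; in practice this would come from a .bim file or SNP list
snps = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3"],
    "CHR": [6, 6, 7],
    "BP":  [29_000_000, 45_000_000, 29_000_000],
})

# Flag SNPs inside the MHC region (chromosome 6, 28Mb-32Mb) and drop them
in_mhc = (snps["CHR"] == 6) & snps["BP"].between(28_000_000, 32_000_000)
good = snps.loc[~in_mhc, "SNP"]
print(list(good))  # SNPs that survive the MHC filter
```

The surviving SNP ids can then be written to a file and passed to --extract.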

  3. Overlapping individuals (shared between the two studies) will not be automatically detected. Please make sure that overlapping individuals are clearly marked in the plink files by having exactly the same family id and individual id.
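A quick way to check how many individuals will be treated as overlapping is to intersect the FID/IID pairs of the two studies. This is a sanity check you can run yourself, not an S-PCGC feature, and the data below is made up:

```python
import pandas as pd

# FID/IID pairs as they would appear in the first two columns of each .fam file
fam1 = pd.DataFrame({"FID": ["f1", "f2", "f3"], "IID": ["i1", "i2", "i3"]})
fam2 = pd.DataFrame({"FID": ["f2", "f4"], "IID": ["i2", "i4"]})

# Individuals are considered shared only when BOTH ids match exactly
shared = pd.merge(fam1, fam2, on=["FID", "IID"])
print(len(shared))  # number of individuals treated as overlapping
```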

  4. To account for population stratification, it is not enough to simply include principal components as covariates, as is typically done. Instead, you need to invoke pcgc_sumstats_creator.py with the flags --covars-regress and --pve (please see the detailed explanation above).

  5. The code assumes that the first annotation is always a base annotation that simply assigns 1.0 to all SNPs. Please adhere to this convention!

  6. The .output files report two numbers: marginal and conditional heritability. The difference is that conditional heritability conditions on the covariates in the study, meaning that it ignores the variance introduced by the covariates. Most studies report conditional heritability, but we believe that the marginal heritability is more informative. For details, please see the PCGC-s paper.

  7. We highly recommend that you provide external estimates of SNP frequencies to pcgc_sumstats_creator.py via the flag --frqfile, based on a reference panel. This can prevent bias arising due to the fact that cases are over-enriched in case-control studies, which could bias the MAF estimates and the heritability estimates.

  8. S-PCGC can handle huge datasets by loading only small subsets of the data at each time, using the pandas-plink package. The script pcgc_sumstats_creator.py has a parameter called --chunk-size which controls the number of SNPs that can be loaded into memory at once. The default value is (50M divided by sample size), but you may wish to tweak this value based on your machine's capabilities.
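Assuming the "50M divided by sample size" rule is taken literally (the exact constant here is an assumption for illustration), the default chunk size works out as:

```python
# Hypothetical reconstruction of the default --chunk-size rule:
# roughly 50M genotype values held in memory at once.
def default_chunk_size(sample_size, budget=50_000_000):
    return max(1, budget // sample_size)

print(default_chunk_size(100_000))  # 500 SNPs per chunk for 100k individuals
print(default_chunk_size(10_000))   # 5000 SNPs per chunk for 10k individuals
```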



FAQ

Q: Can I create my own annotations?
A: Yes! Please read the section called "Creating an annot file" in the S-LDSC wiki for instructions.

Q: Can I run just a basic analysis without functional annotations?
A: Yes. Just omit the --annot and --annot-chr flags, and S-PCGC will perform a simple analysis without functional annotations.

Q: Is it safe to make the summary statistics files publicly available?
A: Absolutely! These summary statistics don't expose any individual-level data. You may see that some of the output files include individual ids, but this is only intended to help identify overlapping individuals. The only information included about these individuals is (a) the noise in the estimation of their kinship with themselves (which should be 1.0 in expectation), and (b) the noise multiplied by their phenotype value. These fields are required to obtain accurate heritability/rg estimates (please see the PCGC-s paper for details)

Q: Should I include imputed SNPs in the summary statistics?
A: Not necessarily. The set of SNPs with summary statistics should be a representative set of common SNPs, corresponding to the "regression SNPs" of S-LDSC. We recommend using a set of reliable common SNPs that are either genotyped or well-imputed, such as HapMap3 SNPs (which you can download here). We also recommend excluding SNPs within the MHC (chromosome 6 28M-32M) from all analyses, as done by S-LDSC and other tools.

Q: Can I use my case-control data for pcgc_r2.py instead of a reference panel?
A: Yes, but this has two caveats: (1) the analysis will be slower with larger data, and (2) for genetic correlation, it makes more sense to use an external reference panel instead of arbitrarily choosing one of the two studies. If you only care about heritability, there's no need for a reference panel. In this case, we suggest that you downsample your data to ~5000 individuals and only run pcgc_r2.py on this subset for greater speed.

Q: Can I use standard (publicly available) summary statistics instead of having to create my own summary statistics?
A: Unfortunately, no. To obtain unbiased estimates, S-PCGC must use specialized summary statistics.

Q: Can my data include related individuals?
A: No.

Q: Does S-PCGC recognize categorical covariates?
A: No (though we might add this in the future). Please transform all your categorical covariates into a series of binary dummy variables.
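One common way to build such dummy variables is pandas' get_dummies; the covariate names below are hypothetical:

```python
import pandas as pd

# A toy covariate table with one categorical column ("batch" is made up)
covars = pd.DataFrame({"IID": ["i1", "i2", "i3"],
                       "batch": ["A", "B", "A"]})

# Expand the categorical covariate into binary dummy columns; drop_first
# avoids collinearity with the intercept
dummies = pd.get_dummies(covars["batch"], prefix="batch",
                         drop_first=True, dtype=int)
covars = pd.concat([covars.drop(columns="batch"), dummies], axis=1)
print(list(covars.columns))
```

The resulting table can then be written out and passed via --covar.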

Q: Can pcgc_sumstats_creator.py use bgen files instead of plink files?
A: Currently no.

Q: Can my data include two studies with overlapping individuals?
A: Yes, as long as such individuals are clearly marked in the plink files by having exactly the same family id and individual id. Otherwise, you might get severely biased results.

Q: Can I estimate rg for each annotation?
A: Yes, by using the flag --rg-annot. This will create a separate .rg file for every pair of studies with the per-annotation rg estimates. However, please note that the results may be nonsensical for continuous annotations, because they can explain a negative amount of heritability.

Q: Can S-PCGC estimate heritability directly from raw genotypes, without using summary statistics?
A: No. In our experience using summary statistics is preferable, because it allows extremely fast performance at a negligible loss of accuracy. However, if you want an exact PCGC implementation, we recommend trying out LDAK. Note that the LDAK implementation is limited to less than 100,000 individuals and 20 annotations.

Q: Can I include LDAK weights?
A: Yes - you can just include them as an S-LDSC annotation. If you want to combine these weights with S-LDSC annotations, we recommend creating joint annotations as described in Gazal et al. 2019 Nature Genetics.

Q: Can S-PCGC fit an intercept like LDSC?
A: No. There's no need to estimate an intercept when the summary statistics are created with pcgc_sumstats_creator.py, because the intercept is already known.

Q: My plink files use different rsids than the ones in your annotation files. Can you match SNPs based on genetic position and alleles instead of rsid?
A: Unfortunately, no, because the annotation files are compatible with LDSC annotation files, which do not include allelic information (see explanation below). We recommend that you create new plink files and annotation files to enforce a consistent variant naming scheme (e.g. CHR.BP.A1.A2). Explanation: S-PCGC cannot match variants based on genetic position alone (without allelic info) because there might be several (multi-allelic) variants at the same chromosome+position, and S-PCGC would not be able to distinguish between them.
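The suggested CHR.BP.A1.A2 naming can be produced from a plink .bim table like so (a sketch; the column names follow the standard .bim layout):

```python
import pandas as pd

# Standard plink .bim columns: chromosome, SNP id, cM position, bp position,
# allele 1, allele 2
bim_cols = ["CHR", "SNP", "CM", "BP", "A1", "A2"]
bim = pd.DataFrame([[1, "rs123", 0, 10583, "A", "G"]], columns=bim_cols)

# Rebuild the SNP id as CHR.BP.A1.A2 so plink and annotation files agree
bim["SNP"] = (bim["CHR"].astype(str) + "." + bim["BP"].astype(str)
              + "." + bim["A1"] + "." + bim["A2"])
print(bim["SNP"].iloc[0])
```

The same renaming must be applied consistently to both the plink files and the annotation files.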

s-pcgc's People

Contributors: lrousselet, omerwe

s-pcgc's Issues

Unable to use pcgc_r2.py in combination with --extract

Dear Omer,

I'm trying to implement S-PCGC and am running the example with 1000G data. I'm able to run all steps of the toy example. I'm also able to run pcgc_sync.py.

I run into an error with pcgc_r2.py:

[INFO] reading annot file...
/hpc/hers_en/mbakker2/tools/S-PCGC/pcgc_utils.py:56: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_chr = pd.read_table(fname, delim_whitespace=True, index_col=index_col, header=header, usecols=usecols)
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:29: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_sync = pd.read_table(args.sync+'sync', index_col='Category')
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:42: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_extract = pd.read_table(args.extract, squeeze=True, header=None)
[INFO] Extracting 60 SNPs
[INFO] Computing r^2 products for 60 SNPs
/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py:63: FutureWarning: read_table is deprecated, use read_csv instead, passing sep='\t'.
df_fam = pd.read_table(args.bfile+'.fam', header=None)
[INFO] Loading SNP file...
After filtering, 60 SNPs remain
Traceback (most recent call last):
File "/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py", line 115, in
df_prod_r2 = compute_r2_prod(args)
File "/hpc/hers_en/mbakker2/tools/S-PCGC//pcgc_r2.py", line 73, in compute_r2_prod
df_annotations = df_annotations.iloc[geno_array.kept_snps]
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 1500, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 2221, in _getitem_axis
return self._get_list_axis(key, axis=axis)
File "/hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc/lib/python2.7/site-packages/pandas/core/indexing.py", line 2203, in _get_list_axis
raise IndexError("positional indexers are out-of-bounds")
IndexError: positional indexers are out-of-bounds

I used:

chr=22
python $PCGC/pcgc_r2.py \
    --annot $baseline/baselineLD.${chr}. \
    --sync $baseline/baselineLD. \
    --bfile $refDir/1000G_EUR_Phase3_plink/1000G.EUR.QC.${chr} \
    --extract $PCGC/example/good_snps.txt \
    --out $baseline/baselineLD.goodSNPs.${chr}

When I leave out --extract $PCGC/example/good_snps.txt, PCGC runs fine and I get the expected output, but for all SNPs rather than the 'good SNPs'.

I tried to subset good_snps.txt to only the chr22 SNPs, but the same error occurs.

I am using conda to create an environment to run PCGC. This is created using
conda create -n pcgc python=2.7 numpy pandas scikit-learn scipy tqdm bitarray
And conda list -n pcgc gives:

# packages in environment at /hpc/hers_en/mbakker2/tools/miniconda2/envs/pcgc:
#
# Name Version Build Channel
atomicwrites 1.3.0 py_0 conda-forge
attrs 19.1.0 py_0 conda-forge
backports_abc 0.5 py_1 conda-forge
bitarray 0.8.3 py27h14c3975_0
blas 1.0 mkl
bokeh 1.0.4 py27_1000 conda-forge
ca-certificates 2019.3.9 hecc5488_0 conda-forge
certifi 2019.3.9 py27_0 conda-forge
cffi 1.12.2 py27hf0e25f4_1 conda-forge
click 7.0 py_0 conda-forge
cloudpickle 0.8.0 py_0 conda-forge
cytoolz 0.9.0.1 py27h14c3975_1001 conda-forge
dask 1.1.4 py_0 conda-forge
dask-core 1.1.4 py_0 conda-forge
distributed 1.26.0 py27_1 conda-forge
freetype 2.9.1 h94bbf69_1005 conda-forge
funcsigs 1.0.2 py_3 conda-forge
futures 3.2.0 py27_1000 conda-forge
heapdict 1.0.0 py27_1000 conda-forge
intel-openmp 2019.1 144
jinja2 2.10 py_1 conda-forge
jpeg 9c h14c3975_1001 conda-forge
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 8.2.0 hdf63c60_1
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.36 h84994c4_1000 conda-forge
libstdcxx-ng 8.2.0 hdf63c60_1
libtiff 4.0.10 h648cc4a_1001 conda-forge
locket 0.2.0 py_2 conda-forge
markupsafe 1.1.1 py27h14c3975_0 conda-forge
mkl 2019.1 144
mkl_fft 1.0.10 py27ha843d7b_0
mkl_random 1.0.2 py27hd81dba3_0
more-itertools 4.3.0 py27_1000 conda-forge
msgpack-python 0.6.1 py27h6bb024c_0 conda-forge
ncurses 6.1 he6710b0_1
numpy 1.16.2 py27h7e9f1db_0
numpy-base 1.16.2 py27hde5b4d6_0
olefile 0.46 py_0 conda-forge
openssl 1.1.1b h14c3975_1 conda-forge
packaging 19.0 py_0 conda-forge
pandas 0.24.1 py27he6710b0_0
pandas-plink 1.2.30 py27h14c3975_0 conda-forge
partd 0.3.9 py_0 conda-forge
pathlib2 2.3.3 py27_1000 conda-forge
pillow 5.3.0 py27h00a061d_1000 conda-forge
pip 19.0.3 py27_0
pluggy 0.9.0 py_0 conda-forge
psutil 5.6.1 py27h14c3975_0 conda-forge
py 1.8.0 py_0 conda-forge
pycparser 2.19 py27_1 conda-forge
pyparsing 2.3.1 py_0 conda-forge
pytest 4.3.1 py27_0 conda-forge
pytest-runner 4.4 py_0 conda-forge
python 2.7.15 h9bab390_6
python-dateutil 2.8.0 py27_0
pytz 2018.9 py27_0
pyyaml 5.1 py27h14c3975_0 conda-forge
readline 7.0 h7b6447c_5
scandir 1.10.0 py27h14c3975_0 conda-forge
scikit-learn 0.20.3 py27hd81dba3_0
scipy 1.2.1 py27h7c811a0_0
setuptools 40.8.0 py27_0
singledispatch 3.4.0.3 py27_1000 conda-forge
six 1.12.0 py27_0
sortedcontainers 2.1.0 py_0 conda-forge
sqlite 3.27.2 h7b6447c_0
tblib 1.3.2 py_1 conda-forge
tk 8.6.8 hbc83047_0
toolz 0.9.0 py_1 conda-forge
tornado 5.1.1 py27h14c3975_1000 conda-forge
tqdm 4.31.1 py_0
wheel 0.33.1 py27_0
xz 5.2.4 h14c3975_1001 conda-forge
yaml 0.1.7 h14c3975_1001 conda-forge
zict 0.1.4 py_0 conda-forge
zlib 1.2.11 h7b6447c_3

Failed tests with anaconda3

I am not sure how to handle this error:

conda 4.3.30

conda env create PCGC
conda create --name PCGC
source activate PCGC
conda install numpy
conda install scipy
conda install scikit-learn
conda install pandas
conda install pandas-plink
conda install -c conda-forge pandas-plink
conda install tqdm

git clone https://github.com/omerwe/S-PCGC

python test_sumstats.py
temporary directory: /tmp/2qujmo39
Creating summary statistics for study s3...
Command python pcgc_sumstats_creator.py --bfile example/s3 --covar example/s3.cov --prev 0.01 --pheno example/s3.phe --frqfile /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model.1. --annot /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model.1. --out /tmp/2qujmo39/s3.1 --sync /projects/b1137/Diabetic_retinopathy/GeneticCorrelation/S-PCGC/S-PCGC/example/model. returned 1 with the following stdout:
stdout (unescaped and abridged for readability):

pcgc_sumstats_creator.py:388: DeprecationWarning: np.int is a deprecated alias for the builtin int. Deprecated in NumPy 1.20.
pcgc_sumstats_creator.py:50: DeprecationWarning: np.float is a deprecated alias for the builtin float. Deprecated in NumPy 1.20.
[INFO] reading frq file...
[INFO] Reading plink file from disk (this may take a while...)
* S-PCGC sumstats creator
* Version 2.0.0
* (C) 2018 Omer Weissbrod
Mapping files: 100%| 3/3 [00:00<00:00, 81.52it/s]
[INFO] 1500 SNPs remained after removing SNPs without MAFs
pcgc_sumstats_creator.py:373: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. Deprecated in NumPy 1.20.
Traceback (most recent call last):
  File "pcgc_sumstats_creator.py", line 592, in <module>
    sumstats_creator = PCGC_Sumstats(args)
  File "pcgc_sumstats_creator.py", line 75, in __init__
    self.bfile, df_pheno, df_maf, self.num_snps, self.sample_size = self.read_plink(args, df_pheno, df_maf)
  File "pcgc_sumstats_creator.py", line 486, in read_plink
    is_consistent_snp = (df_maf[allele1_col] == df_bim['a1']) & (df_maf[allele0_col] == df_bim['a0'])
  File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/ops/common.py", line 70, in new_method
    return method(self, other)
  File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/arraylike.py", line 40, in __eq__
    return self._cmp_method(other, operator.eq)
  File "/home/lpe159/.conda/envs/PCGC/lib/python3.8/site-packages/pandas/core/series.py", line 5617, in _cmp_method
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
Traceback (most recent call last):
File "test_sumstats.py", line 106, in
test_sumstats(temp_dir)
File "test_sumstats.py", line 58, in test_sumstats
run_command(ss_command)
File "test_sumstats.py", line 20, in run_command
raise IOError()
OSError

pcgc_sumstats_creator.py : Error after removing "because they have different alleles in plink and MAF files"

When SNPs with different alleles in the plink and MAF files are present, they are successfully removed via:

is_consistent_snp = (df_maf[allele1_col] == df_bim['a1']) & (df_maf[allele0_col] == df_bim['a0'])
is_consistent_snp = is_consistent_snp | (df_maf[allele1_col] == df_bim['a0']) & (df_maf[allele0_col] == df_bim['a1'])
if not np.all(is_consistent_snp):
     logging.info('%d SNPs will be removed because they have different alleles in plink and MAF files'%(np.sum(~is_consistent_snp)))
     df_maf = df_maf.loc[is_consistent_snp]
     df_bim = df_bim.loc[is_consistent_snp]

However, this produces an error later in the code due to a size mismatch.

0: Traceback (most recent call last):
0:   File "path/S-PCGC/pcgc_sumstats_creator.py", line 593, in <module>
0:     sumstats_creator.compute_all_sumstats(args.chunk_size)
0:   File "path/S-PCGC/pcgc_sumstats_creator.py", line 271, in compute_all_sumstats
0:     self.set_locus(snp1, snp2)
0:   File "path/S-PCGC/pcgc_sumstats_creator.py", line 318, in set_locus
0:     snp_maf = self.mafs[snp1+j]
0:   File "path/lib/python3.8/site-packages/pandas/core/series.py", line 879, in __getitem__
0:     return self._values[key]
0: IndexError: index 44269 is out of bounds for axis 0 with size 44269

If I force removal of additional SNPs, the index at which the error occurs decreases by a corresponding amount, which leads me to believe this is what causes the problem. E.g., if I force removal of 2 additional SNPs, the error becomes:

IndexError: index 44267 is out of bounds for axis 0 with size 44267

I am able to get around this by identifying the mismatched SNPs and removing them from my good_snps.txt for now.

Different shape/ length error

Hey There,

Thanks for your reading. I'm using this package and get an error:

[WARNING] 8634629 SNPs are found in the annotation files and in all the sumstats files
[INFO] reading M files...
100%|██████████| 22/22 [16:14<00:00, 44.29s/it]
/S-PCGC/pcgc_main.py:397: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself.
Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
gencov_arr = np.empty((len(pcgc_data_list), len(pcgc_data_list)), dtype=np.object)
Traceback (most recent call last):
File "/S-PCGC/pcgc_main.py", line 857, in
pcgc_obj = SPCGC(args)
File "/S-PCGC/pcgc_main.py", line 402, in init
cov_ij = self.create_cov_obj(args, oi, oj,
File "/pcgc_main.py", line 628, in create_cov_obj
self.compute_taus(args, oi, oj,
File "/S-PCGC/pcgc_main.py", line 753, in compute_taus
z1_anno = df_annotations_sumstats_noneg.values * sumstats1[:, np.newaxis] * np.sqrt(trace_ratios1)
ValueError: operands could not be broadcast together with shapes (8634629,97) (8636723,1)

And with the same data, same codes we get a different error message when performed by another person:

[WARNING] 8636723 SNPs are found in the annotation files and in all the sumstats files
[INFO] reading M files...
[INFO] reading annot files...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22/22 [21:54<00:00, 59.77s/it]
Traceback (most recent call last):
File "pcgc_main.py", line 857, in
pcgc_obj = SPCGC(args)
File "pcgc_main.py", line 394, in init
self.load_annotations_data(args, df_prodr2, index_intersect)
File "pcgc_main.py", line 488, in load_annotations_data
is_same = (df.index == index_intersect).all()
File "/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 123, in cmp_method
raise ValueError("Lengths must match to compare")
ValueError: Lengths must match to compare

Do you have any idea how this error comes and how to solve it? Thanks a lot and looking forward to your reply :))

How to perform an analysis using just summary statistics

I apologize for my obtuseness, but I can't figure out from the manual how to actually analyze my data.
I have, let's say, two sets of summary statistics. I do not have access to the subject-level genotypes or phenotypes, just the summary statistics (and they were produced with Regenie, too).

I can build an LD panel from 1000G (I am a bit confused about whether I need to select good SNPs, since the summary stats are already heavily filtered), but how do I turn summary statistics from Regenie into ones that would work for you?

Would it be correct to say that all I need is to preprocess them with the LDSC tools, specifically pass them through munge_sumstats.py, and then feed them into S-PCGC?

Sincerely
