genepi / nf-gwas Goto Github PK

View Code? Open in Web Editor NEW

56.0 7.0 19.0 66.72 MB

A nextflow pipeline to perform state-of-the-art genome-wide association studies.

Home Page: https://genepi.github.io/nf-gwas

License: MIT License

Dockerfile 2.27% Nextflow 74.21% Shell 0.43% Java 11.17% R 6.22% Groovy 5.70%

gwas regenie nextflow slurm singularity imputation

nf-gwas's Introduction

nf-gwas

nf-gwas is a Nextflow pipeline to run biobank-scale genome-wide association studies (GWAS) analysis. The pipeline automatically performs numerous pre- and post-processing steps, integrates regression modeling from the REGENIE package and currently supports single-variant, gene-based and interaction testing. All modules are structured in sub-workflows which allows to extend the pipeline to other methods and tools in future. nf-gwas includes an extensive reporting functionality that allows to inspect thousands of phenotypes and navigate interactive Manhattan plots directly in the web browser.

The pipeline is tested using the unit-style testing framework nf-test and includes a schema definition to run with Nextflow Tower.

Documentation

Documentation can be found here.

Citation

Please cite our paper if you use nf-gwas:

Schönherr S*, Schachtl-Riess JF*, Di Maio S*, Filosi M, Mark M, Lamina C, Fuchsberger C, Kronenberg F, Forer L. Performing highly parallelized and reproducible GWAS analysis on biobank-scale data. NAR Genom Bioinform. 2024 Feb 7;6(1):lqae015. doi: 10.1093/nargab/lqae015. PMID: 38327871; PMCID: PMC10849172.

Quick Start

Install Nextflow (>=22.10.4)
Run the pipeline on a test dataset

nextflow run genepi/nf-gwas -r v1.0.7 -profile test,<docker,singularity,slurm,slurm_with_scratch>

Run the pipeline on your data

nextflow run genepi/nf-gwas -c <nextflow.config> -r v1.0.7 -profile <docker,singularity,slurm,slurm_with_scratch>

Please click here for available test config files.

Development

git clone https://github.com/genepi/nf-gwas
cd nf-gwas
docker build -t genepi/nf-gwas . # don't ignore the dot
nextflow run main.nf -profile test,development

nf-test

nf-gwas makes use of nf-test, a unit-style test framework for Nextflow.

cd nf-gwas
curl -fsSL https://code.askimed.com/install/nf-test | bash
./nf-test test

License

nf-gwas is MIT Licensed and was developed at the Institute of Genetic Epidemiology, Medical University of Innsbruck, Austria.

Contact

nf-gwas's People

Contributors

Stargazers

Watchers

Forkers

johsr bioinformatics-lab vikash84 seppinho dariushghasemi drpatelh filosi jpcartailler rnaimehaom robsyme pbshetty12 hediatnani vsarsani evakoe genostack nvrivera digital-cloud-team-singapore imid

nf-gwas's Issues

Output are not sorted by chromosome and bp

The output summary statistic is sorted by lexical order and not by chromosme and position which will possibly raise error when directly input to pheweb

annotation step seems to take a long time

hi all,

i have an annotation file that is a .tsv.gz with 1 million rows (variants), and I see the annotation step after REGENIE steps 1 & 2 seems to take a long time (NF_GWAS:SINGLE_VARIANT_TESTS:ANNOTATION:ANNOTATE_RESULTS).

currently it seems like for 1 million rows it takes 44 minutes on 8 CPUs & 4 GB. Which seems a bit long relative to the actual steps 1 & 2 of REGENIE. I went to see the Java source code & it seems a bit inefficient doing row-by-row sequential processing with single thread. Do you think this step can be multi-threaded or sped up in any way?
Thanks

Parse regenie call

Parse complete regenie command and add it to the R-Report

jbang exceptions while running on cloud

Hi genepi team 👋

Thanks for the neat pipeline!

I was able to successfully run the pipeline locally using the -profile test,docker profiles however it seems that the jbang's native behavior of downloading JARs on the fly might not be well suited for the cloud environment.

Issue encountered

When I tried to run the pipeline on the cloud (both Azure/AWS Batch) setting via Nextflow CLI by adding the cloud-specific configs (azure.config) and invoking

$ nextflow -c ./azure.config run https://github.com/genepi/nf-gwas -profile test,docker,azb -r main -latest

I kept running across the following issue in the initial caching process

Error executing process > 'NF_GWAS:CACHE_JBANG_SCRIPTS'

Caused by:
  Process `NF_GWAS:CACHE_JBANG_SCRIPTS` terminated with an error exit status (1)

Command executed:

  jbang export portable -O=RegenieLogParser.jar RegenieLogParser.java
  jbang export portable -O=RegenieFilter.jar RegenieFilter.java
  jbang export portable -O=RegenieValidateInput.jar RegenieValidateInput.java

Command exit status:
  1

Command output:
  (empty)

Command error:
  Exception in thread "main" dev.jbang.cli.ExitException: Resource could not be copied from class path: jbang.properties
        at dev.jbang.source.resolvers.ClasspathResourceResolver.getClasspathResource(ClasspathResourceResolver.java:66)
        at dev.jbang.source.resolvers.ClasspathResourceResolver.resolve(ClasspathResourceResolver.java:31)
        at dev.jbang.source.resolvers.CombinedResourceResolver.lambda$resolve$0(CombinedResourceResolver.java:26)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
        at java.base/java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
        at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:127)
        at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:502)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:488)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:150)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:543)
        at dev.jbang.source.resolvers.CombinedResourceResolver.resolve(CombinedResourceResolver.java:28)
        at dev.jbang.source.ResourceRef.forResource(ResourceRef.java:136)
        at dev.jbang.Configuration.getBuiltin(Configuration.java:265)
        at dev.jbang.Configuration.defaults(Configuration.java:218)
        at dev.jbang.Configuration.getMerged(Configuration.java:251)
        at dev.jbang.Configuration.instance(Configuration.java:196)
        at dev.jbang.cli.JBang$ConfigurationResourceBundle.getKeys(JBang.java:212)
        at picocli.CommandLine$Model$Messages.keys(CommandLine.java:11250)
        at picocli.CommandLine$Model$Messages.<init>(CommandLine.java:11234)
        at picocli.CommandLine$Model$CommandSpec.setBundle(CommandLine.java:6346)
        at picocli.CommandLine$Model$CommandSpec.resourceBundle(CommandLine.java:6342)
        at picocli.CommandLine.setResourceBundle(CommandLine.java:3319)
        at dev.jbang.cli.JBang.getCommandLine(JBang.java:227)
        at dev.jbang.cli.JBang.getCommandLine(JBang.java:107)
        at dev.jbang.Main.main(Main.java:12)

I suspect that this might have to do with how jbang relies on the download of jar dependencies (in a local lib folder) which is not available to the tasks in other nodes (or other instances of the container) in a multi-node setting.

Suggestions

Allow me to share a couple of suggestions which might be worth considering.

Use a compiled shadow-jar/uber-jar (single jar with all deps baked in) so that there is no need for the lib folder to be available to downstream processes which rely upon these cached jar files.
Another alternative, perhaps with less effort, is to perhaps bake in the compiled jar files in the container itself since the tool is already available in the container, this way we can ensure that the dependencies (i.e. lib folder) as well as the Regenie* JARs are all available within the container instances across different nodes.

Collaboration

I'd be happy to test the pipeline on the cloud and to discuss any changes which might be necessary for making the pipeline optimal (hardware config etc) for the cloud setting.

Support genotyped array data in VCF format

Currently genotyped array data must be provided in plink (bim/bed/fam) format. Support for VCF would be simplify the usage with data from MIS.

Support --force-step1

Currently there is no way to run the pipeline if >1M are included in step 1.

WGS data

The pipeline looks like it is optimized for processing imputed vcf data from UMICH or TOPMed imputation server which generates a DS field. Is is possible to run the pipeline on GATK WGS sequencing data without the DS field. Or does that need to be calculated with the PL field and written out to plink format before running the pipeline?

Report qqplot by MAF

Provide qqplot in the report divided by maf

how do I create my own (possibly artificial) .bed/.bam/.fam files?

hi,

apologies for the somewhat clueless question as I am not a bioinformatician but i want to stress test this Nextflow pipeline & bring it into the cloud for some in-house projects

i see that you have given example files for testing in this path:
genotypes_prediction = "$projectDir/tests/input/pipeline/example.{bim,bed,fam}"

to my understanding, those files contain 500 positions/SNPs and it is a small file just for testing that the pipeline works instead of crashing/error-ing. I want to inflate this file to have more data, e.g. 500k SNPs or even 5 million, and run it on a large machine in the cloud, and check the runtimes, compute costs & things like that.

is there an easy way for me to do this? i do not need the data to be "real"

I tried looking around for public datasets but I think you have to sign some researcher agreements since these data tend to be very confidential/sensitive (people's genomes after all). basically i have not had success looking on the web

the other problem is these are binary files which seem to be produced by another program on upstream data so I can't easily write them myself (as opposed to the TSVs for example, which I could easily inflate with python or even linux commands)

thanks a lot

Set default value for annotation_min_log10p to 7.3

Now used differently compared to older pipeline versions. 7.3 is better for report module (top loci definition), is it also ok for report_index module?

question regarding paper: running on SLURM/AWS Batch

hi again,

I have a question about the exact configuration of your runs for the experiments reported in the paper, which were conducted either on AWS Batch or SLURM.

For these large-scale GWAS runs, do you provide the input files as a single file (e.g. a single giant phenotype.txt file with N columns for N phenotypes, a single giant .vcf.gz file containing all variants, 3 giant .bed/.fam/.bam files), prepare 1 single config file, then execute the Nextflow run?

Then, how do you specify the resources for AWS Batch? Since there is no single AWS machine that has 450 CPUs, surely it has to be split up on smaller machines (eg 32 CPUs per machine). Or the alternative is do you split the input files by phenotype so with 10 phenotypes you have 10 runs, and then you execute 10 separate Nextflow runs?

Basically I am just curious about how you specified the total resources & orchestrated the entire workload

sorry for the repeated questions - I loved reading your paper & the code is extremely well written hence my interest in this work!

Thanks a lot

Add check if phenotype / covariate columns are set

Error while running the pipeline on a test dataset

If I run the command

nextflow run genepi/nf-gwas -r v0.5.0 -profile test, docker

the following error is occuring-

I couldn't find this RegenieValidateInput.jar file in the github repo.
Could you please let me know what am I missing here?

Add closest genes for the significant loci to the Manhattan plot

Currently, genes of significant loci are not shown in the Manhattan plot. Either add genes for the genome-wide significance level or for the threshold chosen for the annotated file.

Add effect prediction step

E.g. using http://pcingola.github.io/SnpEff/

Validate phenotype file

Parse phenotype file and check if all requirements for regenie are fulfilled.

some Nextflow help to change "genotypes_prediction" param to 3 separate paths

Hi again,

I am having some trouble specifying the genotypes_prediction parameter on AWS. The problem is that it expects 3 files together as a pattern, e.g. "example.{bim,bed,fam}" but AWS S3 doesn't recognise this format of paths and my run is failing on AWS because of it.

To solve this problem, I'd like to create 3 separate input params: genotypes_prediction_bim, genotypes_prediction_bed and genotypes_prediction_fam. This way, I can just provide the absolute path to all 3 files without patterns and it will work on any platform.

I understand this involves changing the fromFilePairs channel into 3 separate fromPaths channels, but I'm having trouble on how I could structure the Nextflow code such that it will give the same channel output as fromFilePairs, which I believe is something like [filename, [path_bim, path_bed, path_fam]] I have never coded in Nextflow before, so apologies if this is too simple a problem

Any assistance would be greatly appreciated!
Thank you so much

Issues with empty files on cloud

Hi team,

Since we're already evaluating use of bundled JARs in the container, I went ahead and tested with the next steps (once we're past the jar issues).

The problems here seems to be the use of empty files, for example NO_COV_LOG here

nf-gwas/workflows/nf_gwas.nf

Line 46 in 6367cc8

if (params.covariates_filename != 'NO_COV_FILE' && !covariates_file.exists()){

This works fine on real filesystems, however as cloud also provides the option to run a pipeline using a blob storage as a backend (Azure Blob container or AWS S3) , this expectation fails.

I'm currently working on a separate internal branch to test this and once that's done I'll add here for a review.

UPDATE: I've merged that code within #46

Request to move the tabix-merge.jar file into docker containers

The ./bin folder is designed to hold small, self-contained executable scripts used by the Nextflow processes, ideally below 1MB. The nf-gwas workflow contains a 22MB jar file, which does not seem to be called directly in the pipeline.

Is this jar file still necessary for the workflow? If not, I would recommend that it be removed. If the jar file is required by one of the executables contained in the docker image, it would make sense that tabix-merge.jar is also included in the docker image.

Improve binary traits support

Currently only -firth 0.01 --approx is supported for binary traits

Gene level test option

Dear developers,
Regenie 3 allows for gene-level testing (https://rgcgithub.github.io/regenie/options/#gene-based-testing), but it requires to play with the step 2 configuration script. Is it already possible to make such an analysis with nextflow?

If not, what do you suggest? I also attach the example steps configurationsl, highlighting in red the differences between the two analysis parameters.

All the best,
Giacomo

Gene-level analysis

An error is thrown when using plink set files divided by chromosome on gene-level analysis.
In particular for:
genotype_association=<mydir>/chr*.{bim,bed,fam}
Error on REGENIE_STEP2_GENE_TESTS:
chr1.bed Not Found

When running with:
genotype_association=<mydir>/chr1.{bim,bed,fam}

Everything goes fine.

Problem in creating report when "TEST_FAIL"

Process Report return this error when the regenie test fails and LOG10P -> NA

  java.lang.NumberFormatException: For input string: "NA"
        at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
        at java.base/jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.base/java.lang.Double.parseDouble(Double.java:651)
        at genepi.io.table.reader.AbstractTableReader.getDouble(AbstractTableReader.java:38)
        at genepi.io.table.reader.AbstractTableReader.getDouble(AbstractTableReader.java:30)
        at genepi.genomic.utils.commands.GwasReportCommand.createHtmlReport(GwasReportCommand.java:316)
        at genepi.genomic.utils.commands.GwasReportCommand.call(GwasReportCommand.java:216)
        at genepi.genomic.utils.commands.GwasReportCommand.call(GwasReportCommand.java:31)
        at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
        at picocli.CommandLine.access$1500(CommandLine.java:148)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
        at picocli.CommandLine.execute(CommandLine.java:2170)
        at genepi.genomic.utils.App.main(App.java:38)

RMarkdown report settings step 1

For documentation, it might be useful to include the settings for pruning and qc of step 1 (e.g. prune_window_kbsize, qc_hwe, etc) into the report.

test run seems to be hanging on Ubuntu with docker

hi, i am on Ubuntu 22.04, trying to run the test workflow

I am using nextflow 23.10.1 from https://www.nextflow.io/docs/latest/getstarted.html#installation installed into my /home/<username>/ directory and added the path to $PATH so that I can just run nextflow from anywhere in my filesystem.

this is with Java 17.0.6 (OpenJDK Runtime Environment Corretto-17.0.6.10.1 (build 17.0.6+10-LTS)) installed via SDKMAN! as advised by nextflow documentation

now, i just git clone the repo, then I execute nextflow run genepi/nf-gwas -r v1.0.4 -profile test,docker
I had to make my cloned github directory writable to all users with chmod 777 -R <cloned_repo_dir> which I thought was strange but if I don't do this, nextflow will fail saying it can't create the workDir's within the folder, e.g. /home/minhtoo/repos/nf-gwas/work/e1/93c6238c11c0659e5b90902261c4e1

it prints the below log but seems to get stuck as the 0% never advances (it's been at least 20 minutes). i also have the .nextflow.log file contents in this link: CLICK ME, which do not raise any errors:

N E X T F L O W  ~  version 23.10.1
Pulling genepi/nf-gwas ...
 downloaded from https://github.com/genepi/nf-gwas.git
Launching `https://github.com/genepi/nf-gwas` [grave_bardeen] DSL2 - revision: 7927c67696 [v1.0.4]
Downloading plugin [email protected]
Core Nextflow options
  revision                    : v1.0.4
  runName                     : grave_bardeen
  containerEngine             : docker
  container                   : quay.io/genepi/nf-gwas:v1.0.4
  launchDir                   : /home/minhtoo/repos/nf-gwas
  workDir                     : /home/minhtoo/repos/nf-gwas/work
  projectDir                  : /home/minhtoo/.nextflow/assets/genepi/nf-gwas
  userName                    : minhtoo
  profile                     : test,docker
  configFiles                 : 

Input/output options
  outdir                      : default

All Tests
  project                     : test-gwas
  genotypes_association       : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/example.bgen
  genotypes_association_format: bgen
  phenotypes_filename         : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/phenotype.txt
  phenotypes_columns          : Y1,Y2
  genotypes_prediction        : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/example.{bim,bed,fam}

Single-variant tests only
  regenie_min_imputation_score: 0.00

General options
  validate_params             : true

R report
  annotation_min_log10p       : 2
  plot_ylimit                 : 0

Other options
  genotypes_build             : hg19
  rsids_filename              : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/rsids.tsv.gz

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
WARN: Option genotypes_build is deprecated. Please use association_build instead.
executor >  local (2)
[67/2c9d8b] process > NF_GWAS:SINGLE_VARIANT_TESTS:INPUT_VALIDATION:VALIDATE_PHENOTYPES           [  0%] 0 of 1
[e1/93c623] process > NF_GWAS:SINGLE_VARIANT_TESTS:QUALITY_CONTROL:QC_FILTER_GENOTYPED (1)        [  0%] 0 of 1
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP1:REGENIE_STEP1_RUN        -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP1:REGENIE_LOG_PARSER_STEP1 -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP2:REGENIE_STEP2_RUN        -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP2:REGENIE_LOG_PARSER_STEP2 -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:ANNOTATION:ANNOTATE_RESULTS                    -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:MERGE_RESULTS                                  -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:FILTER_RESULTS                                 -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REPORTING:REPORT                               -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REPORTING:REPORT_INDEX                         -

Add rsID to the annotated output

It would be great if the annotated output would also include the rsIDs of the SNPs.

Wrong gene start, end and distance in the annotated output

Currently, gene start and end (and consequently also distance) don't fit the start and end reported for the genes in other sources.

'Nextflow.exit(int,String)' doesn't support array

Hi,
I got the following error message

Missing process or function exit([1, Parameter genotypes_prediction is required.])`

with stacktrace:

Caused by: groovy.lang.MissingMethodException: No signature of method: static WorkflowMain.exit() is applicable for argument types: (Integer, String) v
alues: [1, Parameter genotypes_prediction is required.]
Possible solutions: wait(), wait(long), wait(long, int), getAt(java.lang.String), grep(), dump()
        at groovy.lang.MetaClassImpl.invokeStaticMissingMethod(MetaClassImpl.java:1570)
        at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1556)
        at org.codehaus.groovy.runtime.callsite.StaticMetaClassSite.callStatic(StaticMetaClassSite.java:62)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:55)
(...)

using NF 23.10.0 .

it looks like Nextflow.exit https://github.com/nextflow-io/nextflow/blob/master/modules/nextflow/src/main/groovy/nextflow/Nextflow.groovy#L196 doesn't accept array/vector

Gene-level analysis take long time

Gene-level analysis take long time to run on big imputed dataset.

Improve error message for required parameters

If a required parameter is missing then the message "Please specify all required parameters" appears.
It would help to mention which parameter is missing.

Issue with Gene level analysis on MERGE_RESULTS

If a setlist contains a key containing "CHROM" string the header line of the summary stat will contain multiple lines and the tabix fail creating the indexing.

problem with rsID file download

hi, I'm trying to use the next low pipeline version 0.6.3 and when the pipe tries to download the rsID file, I get this error.

Process NF_GWAS:DOWNLOAD_RSIDS is unable to find [UnixPath]: /shares/CIBIO-Storage/NB/scratch/sfari_new/WGS123_complete_run_cov-pheno/c4/06e662ebb16c20061e9616ad88159b/rsids-v154-null.index.gz (pattern: rsids-v154-null.index.gz)
Jun-14 10:34:12.791 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=NF_GWAS:DOWNLOAD_RSIDS; work-dir=/shares/CIBIO-Storage/NB/scratch/sfari_new/WGS123_complete_run_cov-pheno/c4/06e662ebb16c20061e9616ad88159b
error [nextflow.exception.MissingFileException]: Missing output file(s) rsids-v154-null.index.gz expected by process NF_GWAS:DOWNLOAD_RSIDS
Jun-14 10:34:12.820 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NF_GWAS:DOWNLOAD_RSIDS'

Caused by:
Missing output file(s) rsids-v154-null.index.gz expected by process NF_GWAS:DOWNLOAD_RSIDS

I'm attaching the config file for reference and the entire log.
.nextflow.log.7.txt
gwas_conf.conf.txt

Thanks a lot.