Giter Club home page Giter Club logo

nf-gwas's Introduction

nf-gwas

nf-gwas nf-test

nf-gwas is a Nextflow pipeline to run biobank-scale genome-wide association studies (GWAS) analysis. The pipeline automatically performs numerous pre- and post-processing steps, integrates regression modeling from the REGENIE package and currently supports single-variant, gene-based and interaction testing. All modules are structured in sub-workflows which allows to extend the pipeline to other methods and tools in future. nf-gwas includes an extensive reporting functionality that allows to inspect thousands of phenotypes and navigate interactive Manhattan plots directly in the web browser.

The pipeline is tested using the unit-style testing framework nf-test and includes a schema definition to run with Nextflow Tower.

image

Documentation

Documentation can be found here.

image

Citation

Please cite our paper if you use nf-gwas:

Schönherr S*, Schachtl-Riess JF*, Di Maio S*, Filosi M, Mark M, Lamina C, Fuchsberger C, Kronenberg F, Forer L. Performing highly parallelized and reproducible GWAS analysis on biobank-scale data. NAR Genom Bioinform. 2024 Feb 7;6(1):lqae015. doi: 10.1093/nargab/lqae015. PMID: 38327871; PMCID: PMC10849172.

Quick Start

  1. Install Nextflow (>=22.10.4)

  2. Run the pipeline on a test dataset

nextflow run genepi/nf-gwas -r v1.0.7 -profile test,<docker,singularity,slurm,slurm_with_scratch>
  1. Run the pipeline on your data
nextflow run genepi/nf-gwas -c <nextflow.config> -r v1.0.7 -profile <docker,singularity,slurm,slurm_with_scratch>

Please click here for available test config files.

Development

git clone https://github.com/genepi/nf-gwas
cd nf-gwas
docker build -t genepi/nf-gwas . # don't ignore the dot
nextflow run main.nf -profile test,development

nf-test

nf-gwas makes use of nf-test, a unit-style test framework for Nextflow.

cd nf-gwas
curl -fsSL https://code.askimed.com/install/nf-test | bash
./nf-test test

License

nf-gwas is MIT Licensed and was developed at the Institute of Genetic Epidemiology, Medical University of Innsbruck, Austria.

Contact

nf-gwas's People

Contributors

abhi18av avatar amstlerstephan avatar drpatelh avatar ejseqera avatar filosi avatar johsr avatar lukfor avatar robsyme avatar salvidm avatar seppinho avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

nf-gwas's Issues

annotation step seems to take a long time

hi all,

i have an annotation file that is a .tsv.gz with 1 million rows (variants), and I see the annotation step after REGENIE steps 1 & 2 seems to take a long time (NF_GWAS:SINGLE_VARIANT_TESTS:ANNOTATION:ANNOTATE_RESULTS).

currently it seems like for 1 million rows it takes 44 minutes on 8 CPUs & 4 GB. Which seems a bit long relative to the actual steps 1 & 2 of REGENIE. I went to see the Java source code & it seems a bit inefficient doing row-by-row sequential processing with single thread. Do you think this step can be multi-threaded or sped up in any way?
Thanks

jbang exceptions while running on cloud

Hi genepi team 👋

Thanks for the neat pipeline!

I was able to successfully run the pipeline locally using the -profile test,docker profiles however it seems that the jbang's native behavior of downloading JARs on the fly might not be well suited for the cloud environment.

Issue encountered

When I tried to run the pipeline on the cloud (both Azure/AWS Batch) setting via Nextflow CLI by adding the cloud-specific configs (azure.config) and invoking

$ nextflow -c ./azure.config run https://github.com/genepi/nf-gwas -profile test,docker,azb -r main -latest

I kept running across the following issue in the initial caching process

Error executing process > 'NF_GWAS:CACHE_JBANG_SCRIPTS'

Caused by:
  Process `NF_GWAS:CACHE_JBANG_SCRIPTS` terminated with an error exit status (1)

Command executed:

  jbang export portable -O=RegenieLogParser.jar RegenieLogParser.java
  jbang export portable -O=RegenieFilter.jar RegenieFilter.java
  jbang export portable -O=RegenieValidateInput.jar RegenieValidateInput.java

Command exit status:
  1

Command output:
  (empty)

Command error:
  Exception in thread "main" dev.jbang.cli.ExitException: Resource could not be copied from class path: jbang.properties
        at dev.jbang.source.resolvers.ClasspathResourceResolver.getClasspathResource(ClasspathResourceResolver.java:66)
        at dev.jbang.source.resolvers.ClasspathResourceResolver.resolve(ClasspathResourceResolver.java:31)
        at dev.jbang.source.resolvers.CombinedResourceResolver.lambda$resolve$0(CombinedResourceResolver.java:26)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
        at java.base/java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958)
        at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:127)
        at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:502)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:488)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:150)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:543)
        at dev.jbang.source.resolvers.CombinedResourceResolver.resolve(CombinedResourceResolver.java:28)
        at dev.jbang.source.ResourceRef.forResource(ResourceRef.java:136)
        at dev.jbang.Configuration.getBuiltin(Configuration.java:265)
        at dev.jbang.Configuration.defaults(Configuration.java:218)
        at dev.jbang.Configuration.getMerged(Configuration.java:251)
        at dev.jbang.Configuration.instance(Configuration.java:196)
        at dev.jbang.cli.JBang$ConfigurationResourceBundle.getKeys(JBang.java:212)
        at picocli.CommandLine$Model$Messages.keys(CommandLine.java:11250)
        at picocli.CommandLine$Model$Messages.<init>(CommandLine.java:11234)
        at picocli.CommandLine$Model$CommandSpec.setBundle(CommandLine.java:6346)
        at picocli.CommandLine$Model$CommandSpec.resourceBundle(CommandLine.java:6342)
        at picocli.CommandLine.setResourceBundle(CommandLine.java:3319)
        at dev.jbang.cli.JBang.getCommandLine(JBang.java:227)
        at dev.jbang.cli.JBang.getCommandLine(JBang.java:107)
        at dev.jbang.Main.main(Main.java:12)

I suspect that this might have to do with how jbang relies on the download of jar dependencies (in a local lib folder) which is not available to the tasks in other nodes (or other instances of the container) in a multi-node setting.

Suggestions

Allow me to share a couple of suggestions which might be worth considering.

  1. Use a compiled shadow-jar/uber-jar (single jar with all deps baked in) so that there is no need for the lib folder to be available to downstream processes which rely upon these cached jar files.

  2. Another alternative, perhaps with less effort, is to perhaps bake in the compiled jar files in the container itself since the tool is already available in the container, this way we can ensure that the dependencies (i.e. lib folder) as well as the Regenie* JARs are all available within the container instances across different nodes.

Collaboration

I'd be happy to test the pipeline on the cloud and to discuss any changes which might be necessary for making the pipeline optimal (hardware config etc) for the cloud setting.

WGS data

The pipeline looks like it is optimized for processing imputed vcf data from UMICH or TOPMed imputation server which generates a DS field. Is is possible to run the pipeline on GATK WGS sequencing data without the DS field. Or does that need to be calculated with the PL field and written out to plink format before running the pipeline?

how do I create my own (possibly artificial) .bed/.bam/.fam files?

hi,

apologies for the somewhat clueless question as I am not a bioinformatician but i want to stress test this Nextflow pipeline & bring it into the cloud for some in-house projects

i see that you have given example files for testing in this path:
genotypes_prediction = "$projectDir/tests/input/pipeline/example.{bim,bed,fam}"

to my understanding, those files contain 500 positions/SNPs and it is a small file just for testing that the pipeline works instead of crashing/error-ing. I want to inflate this file to have more data, e.g. 500k SNPs or even 5 million, and run it on a large machine in the cloud, and check the runtimes, compute costs & things like that.

is there an easy way for me to do this? i do not need the data to be "real"

I tried looking around for public datasets but I think you have to sign some researcher agreements since these data tend to be very confidential/sensitive (people's genomes after all). basically i have not had success looking on the web

the other problem is these are binary files which seem to be produced by another program on upstream data so I can't easily write them myself (as opposed to the TSVs for example, which I could easily inflate with python or even linux commands)

thanks a lot

question regarding paper: running on SLURM/AWS Batch

hi again,

I have a question about the exact configuration of your runs for the experiments reported in the paper, which were conducted either on AWS Batch or SLURM.

For these large-scale GWAS runs, do you provide the input files as a single file (e.g. a single giant phenotype.txt file with N columns for N phenotypes, a single giant .vcf.gz file containing all variants, 3 giant .bed/.fam/.bam files), prepare 1 single config file, then execute the Nextflow run?

Then, how do you specify the resources for AWS Batch? Since there is no single AWS machine that has 450 CPUs, surely it has to be split up on smaller machines (eg 32 CPUs per machine). Or the alternative is do you split the input files by phenotype so with 10 phenotypes you have 10 runs, and then you execute 10 separate Nextflow runs?

Basically I am just curious about how you specified the total resources & orchestrated the entire workload

sorry for the repeated questions - I loved reading your paper & the code is extremely well written hence my interest in this work!

Thanks a lot

Error while running the pipeline on a test dataset

If I run the command

nextflow run genepi/nf-gwas -r v0.5.0 -profile test, docker

the following error is occuring-

Screenshot from 2022-10-21 11-42-27

I couldn't find this RegenieValidateInput.jar file in the github repo.
Could you please let me know what am I missing here?

some Nextflow help to change "genotypes_prediction" param to 3 separate paths

Hi again,

I am having some trouble specifying the genotypes_prediction parameter on AWS. The problem is that it expects 3 files together as a pattern, e.g. "example.{bim,bed,fam}" but AWS S3 doesn't recognise this format of paths and my run is failing on AWS because of it.

To solve this problem, I'd like to create 3 separate input params: genotypes_prediction_bim, genotypes_prediction_bed and genotypes_prediction_fam. This way, I can just provide the absolute path to all 3 files without patterns and it will work on any platform.

I understand this involves changing the fromFilePairs channel into 3 separate fromPaths channels, but I'm having trouble on how I could structure the Nextflow code such that it will give the same channel output as fromFilePairs, which I believe is something like [filename, [path_bim, path_bed, path_fam]] I have never coded in Nextflow before, so apologies if this is too simple a problem

Any assistance would be greatly appreciated!
Thank you so much

Issues with empty files on cloud

Hi team,

Since we're already evaluating use of bundled JARs in the container, I went ahead and tested with the next steps (once we're past the jar issues).

The problems here seems to be the use of empty files, for example NO_COV_LOG here

if (params.covariates_filename != 'NO_COV_FILE' && !covariates_file.exists()){

This works fine on real filesystems, however as cloud also provides the option to run a pipeline using a blob storage as a backend (Azure Blob container or AWS S3) , this expectation fails.

I'm currently working on a separate internal branch to test this and once that's done I'll add here for a review.

UPDATE: I've merged that code within #46

Request to move the tabix-merge.jar file into docker containers

The ./bin folder is designed to hold small, self-contained executable scripts used by the Nextflow processes, ideally below 1MB. The nf-gwas workflow contains a 22MB jar file, which does not seem to be called directly in the pipeline.

Is this jar file still necessary for the workflow? If not, I would recommend that it be removed. If the jar file is required by one of the executables contained in the docker image, it would make sense that tabix-merge.jar is also included in the docker image.

Gene level test option

Dear developers,
Regenie 3 allows for gene-level testing (https://rgcgithub.github.io/regenie/options/#gene-based-testing), but it requires to play with the step 2 configuration script. Is it already possible to make such an analysis with nextflow?

If not, what do you suggest? I also attach the example steps configurationsl, highlighting in red the differences between the two analysis parameters.

All the best,
Giacomo

regenieExampleSettings

Gene-level analysis

An error is thrown when using plink set files divided by chromosome on gene-level analysis.
In particular for:
genotype_association=<mydir>/chr*.{bim,bed,fam}
Error on REGENIE_STEP2_GENE_TESTS:
chr1.bed Not Found

When running with:
genotype_association=<mydir>/chr1.{bim,bed,fam}

Everything goes fine.

Problem in creating report when "TEST_FAIL"

Process Report return this error when the regenie test fails and LOG10P -> NA

  java.lang.NumberFormatException: For input string: "NA"
        at java.base/jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054)
        at java.base/jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.base/java.lang.Double.parseDouble(Double.java:651)
        at genepi.io.table.reader.AbstractTableReader.getDouble(AbstractTableReader.java:38)
        at genepi.io.table.reader.AbstractTableReader.getDouble(AbstractTableReader.java:30)
        at genepi.genomic.utils.commands.GwasReportCommand.createHtmlReport(GwasReportCommand.java:316)
        at genepi.genomic.utils.commands.GwasReportCommand.call(GwasReportCommand.java:216)
        at genepi.genomic.utils.commands.GwasReportCommand.call(GwasReportCommand.java:31)
        at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
        at picocli.CommandLine.access$1500(CommandLine.java:148)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
        at picocli.CommandLine.execute(CommandLine.java:2170)
        at genepi.genomic.utils.App.main(App.java:38)

See also: genepi/genomic-utils#5

RMarkdown report settings step 1

For documentation, it might be useful to include the settings for pruning and qc of step 1 (e.g. prune_window_kbsize, qc_hwe, etc) into the report.

test run seems to be hanging on Ubuntu with docker

hi, i am on Ubuntu 22.04, trying to run the test workflow

I am using nextflow 23.10.1 from https://www.nextflow.io/docs/latest/getstarted.html#installation installed into my /home/<username>/ directory and added the path to $PATH so that I can just run nextflow from anywhere in my filesystem.

this is with Java 17.0.6 (OpenJDK Runtime Environment Corretto-17.0.6.10.1 (build 17.0.6+10-LTS)) installed via SDKMAN! as advised by nextflow documentation

now, i just git clone the repo, then I execute nextflow run genepi/nf-gwas -r v1.0.4 -profile test,docker
I had to make my cloned github directory writable to all users with chmod 777 -R <cloned_repo_dir> which I thought was strange but if I don't do this, nextflow will fail saying it can't create the workDir's within the folder, e.g. /home/minhtoo/repos/nf-gwas/work/e1/93c6238c11c0659e5b90902261c4e1

it prints the below log but seems to get stuck as the 0% never advances (it's been at least 20 minutes). i also have the .nextflow.log file contents in this link: CLICK ME, which do not raise any errors:

N E X T F L O W  ~  version 23.10.1
Pulling genepi/nf-gwas ...
 downloaded from https://github.com/genepi/nf-gwas.git
Launching `https://github.com/genepi/nf-gwas` [grave_bardeen] DSL2 - revision: 7927c67696 [v1.0.4]
Downloading plugin [email protected]
Core Nextflow options
  revision                    : v1.0.4
  runName                     : grave_bardeen
  containerEngine             : docker
  container                   : quay.io/genepi/nf-gwas:v1.0.4
  launchDir                   : /home/minhtoo/repos/nf-gwas
  workDir                     : /home/minhtoo/repos/nf-gwas/work
  projectDir                  : /home/minhtoo/.nextflow/assets/genepi/nf-gwas
  userName                    : minhtoo
  profile                     : test,docker
  configFiles                 : 

Input/output options
  outdir                      : default

All Tests
  project                     : test-gwas
  genotypes_association       : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/example.bgen
  genotypes_association_format: bgen
  phenotypes_filename         : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/phenotype.txt
  phenotypes_columns          : Y1,Y2
  genotypes_prediction        : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/example.{bim,bed,fam}

Single-variant tests only
  regenie_min_imputation_score: 0.00

General options
  validate_params             : true

R report
  annotation_min_log10p       : 2
  plot_ylimit                 : 0

Other options
  genotypes_build             : hg19
  rsids_filename              : /home/minhtoo/.nextflow/assets/genepi/nf-gwas/tests/input/pipeline/rsids.tsv.gz

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
WARN: Option genotypes_build is deprecated. Please use association_build instead.
executor >  local (2)
[67/2c9d8b] process > NF_GWAS:SINGLE_VARIANT_TESTS:INPUT_VALIDATION:VALIDATE_PHENOTYPES           [  0%] 0 of 1
[e1/93c623] process > NF_GWAS:SINGLE_VARIANT_TESTS:QUALITY_CONTROL:QC_FILTER_GENOTYPED (1)        [  0%] 0 of 1
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP1:REGENIE_STEP1_RUN        -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP1:REGENIE_LOG_PARSER_STEP1 -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP2:REGENIE_STEP2_RUN        -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REGENIE:REGENIE_STEP2:REGENIE_LOG_PARSER_STEP2 -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:ANNOTATION:ANNOTATE_RESULTS                    -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:MERGE_RESULTS                                  -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:FILTER_RESULTS                                 -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REPORTING:REPORT                               -
[-        ] process > NF_GWAS:SINGLE_VARIANT_TESTS:REPORTING:REPORT_INDEX                         -

'Nextflow.exit(int,String)' doesn't support array

Hi,
I got the following error message

Missing process or function exit([1, Parameter genotypes_prediction is required.])`

with stacktrace:

Caused by: groovy.lang.MissingMethodException: No signature of method: static WorkflowMain.exit() is applicable for argument types: (Integer, String) v
alues: [1, Parameter genotypes_prediction is required.]
Possible solutions: wait(), wait(long), wait(long, int), getAt(java.lang.String), grep(), dump()
        at groovy.lang.MetaClassImpl.invokeStaticMissingMethod(MetaClassImpl.java:1570)
        at groovy.lang.MetaClassImpl.invokeStaticMethod(MetaClassImpl.java:1556)
        at org.codehaus.groovy.runtime.callsite.StaticMetaClassSite.callStatic(StaticMetaClassSite.java:62)
        at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallStatic(CallSiteArray.java:55)
(...)

using NF 23.10.0 .

it looks like Nextflow.exit https://github.com/nextflow-io/nextflow/blob/master/modules/nextflow/src/main/groovy/nextflow/Nextflow.groovy#L196 doesn't accept array/vector

problem with rsID file download

hi, I'm trying to use the next low pipeline version 0.6.3 and when the pipe tries to download the rsID file, I get this error.

Process NF_GWAS:DOWNLOAD_RSIDS is unable to find [UnixPath]: /shares/CIBIO-Storage/NB/scratch/sfari_new/WGS123_complete_run_cov-pheno/c4/06e662ebb16c20061e9616ad88159b/rsids-v154-null.index.gz (pattern: rsids-v154-null.index.gz)
Jun-14 10:34:12.791 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
task: name=NF_GWAS:DOWNLOAD_RSIDS; work-dir=/shares/CIBIO-Storage/NB/scratch/sfari_new/WGS123_complete_run_cov-pheno/c4/06e662ebb16c20061e9616ad88159b
error [nextflow.exception.MissingFileException]: Missing output file(s) rsids-v154-null.index.gz expected by process NF_GWAS:DOWNLOAD_RSIDS
Jun-14 10:34:12.820 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NF_GWAS:DOWNLOAD_RSIDS'

Caused by:
Missing output file(s) rsids-v154-null.index.gz expected by process NF_GWAS:DOWNLOAD_RSIDS

I'm attaching the config file for reference and the entire log.
.nextflow.log.7.txt
gwas_conf.conf.txt

Thanks a lot.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.