GaPTools

GaPTools is a data validation tool for dbGaP. It is distributed as a Docker image on Docker Hub. See GaPTools.md for more information about the tool.

Pre-requisites:

Docker Installation:

You must have Docker installed and working to be able to run GaPTools. Docker is available on many different operating systems, including most modern Linux distributions, like CentOS, Debian, Ubuntu, etc. Follow the link below for more information about how to install Docker on your particular operating system.

To confirm that you can run Docker under your user account, run the command below and check for a response similar to the one shown (your version and build numbers may differ). The minimum Docker version supported by GaPTools is 17.04.0.

docker -v

Docker version 19.03.6, build 369ce74a3c
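
The comparison against the minimum version can also be scripted. The following is a minimal sketch, not part of GaPTools: the helper name `version_ge` is made up for illustration, and it assumes that the host's `sort` supports version ordering (`-V`) and that `docker -v` prints a line in the format shown above.

```shell
#!/usr/bin/env bash
# Minimum Docker version stated in this README.
MIN_DOCKER_VERSION="17.04.0"

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumption: `sort -V` (version sort) is available on this host.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query Docker when the binary is actually on PATH.
if command -v docker >/dev/null 2>&1; then
  # `docker -v` prints e.g. "Docker version 19.03.6, build 369ce74a3c".
  installed="$(docker -v | sed -E 's/^Docker version ([0-9][0-9.]*).*/\1/')"
  if version_ge "$installed" "$MIN_DOCKER_VERSION"; then
    echo "Docker $installed meets the minimum ($MIN_DOCKER_VERSION)"
  else
    echo "Docker $installed is older than $MIN_DOCKER_VERSION" >&2
  fi
else
  echo "docker not found on PATH" >&2
fi
```

If the output format of `docker -v` differs on your system, the `sed` expression will need adjusting.
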

Docker Compose Installation:

GaPTools uses docker-compose to run multiple containers under a single service. Follow the link below for more details on how to install docker-compose.

Access to data files

The Docker host running GaPTools needs access to the data files to be validated. The files can be on a local file system, on a network file share (NFS), or in a cloud storage bucket. Files on a network file share or in a cloud storage bucket must be mounted as a file system on the Docker host. The following tools are commonly used to mount cloud storage buckets as file systems on Linux servers:

  1. s3fs-fuse for Amazon Web Services (AWS)
  2. gcsfuse for Google Cloud Platform

Unused port 8080 on your docker host

GaPTools requires port 8080 to be available on the host system running Docker. Run the command below to check. If it produces no output, port 8080 is available on the Docker host.

netstat -an | grep "8080"
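
A plain `grep "8080"` also matches unrelated lines, such as a listener on port 18080 or an outgoing connection to a remote port 8080. The sketch below narrows the match to sockets in the LISTEN state on exactly the given port. The function name `port_is_listening` is hypothetical, and the pattern assumes the usual `netstat -an` layout in which the port follows a `:` or `.` and is followed by whitespace.

```shell
# port_is_listening PORT: reads `netstat -an` style output on stdin and
# succeeds only if some socket is in the LISTEN state on exactly PORT.
port_is_listening() {
  grep -E "[.:]$1[[:space:]]" | grep -q "LISTEN"
}

# Run the check only when netstat is installed on this host.
if command -v netstat >/dev/null 2>&1; then
  if netstat -an 2>/dev/null | port_is_listening 8080; then
    echo "port 8080 is already in use"
  else
    echo "port 8080 appears to be available"
  fi
fi
```
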

Setup

Once all pre-requisites are met, follow the instructions below to set up GaPTools. The setup can be validated using a sample study that is included as part of the GaPTools installation. The input files for the sample study are inside the input_files/1000_Genomes_Study/ directory of the cloned GaPTools GitHub repository.

For the sample study, we will have GaPTools generate all output files inside the output_files/1000_Genomes_Study/ directory.

git clone https://github.com/ncbi/gaptools

cd gaptools
mkdir -p output_files/1000_Genomes_Study

# Change file permissions to allow GaPTools to write output files on docker host
chmod -R o+w output_files

Execution

Once GaPTools is set up, run the script below from inside the directory where the GaPTools GitHub repository is cloned to execute it on the included sample study.

./dbgap-docker.bash -i ./input_files/1000_Genomes_Study/ -o ./output_files/1000_Genomes_Study -m ./input_files/1000_Genomes_Study/metadata.json up

GaPTools uses Apache Airflow behind the scenes as the workflow orchestrator to perform all the validation tasks. To view the validation results of the dbGaP validation tool, browse to the following URL:

http://<your_docker_host_ip>:8080

At the end of the workflow, the output files will be created under the specified output directory.

Usage

To use GaPTools for your own study, modify the above command and pass the following input parameters:

-i -- path to the input files for your study

-o -- path where output files should be generated

-m -- path to the manifest file for your study (metadata.json in the sample study)
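
As an illustration, the three parameters can be checked before the script is started. This is a sketch under assumptions: the `My_Study` paths are placeholders and `check_study_paths` is a hypothetical helper, not something shipped with GaPTools.

```shell
# check_study_paths INPUT_DIR OUTPUT_DIR MANIFEST
# Prints a message for each problem found and returns non-zero if any exist.
check_study_paths() {
  status=0
  [ -d "$1" ] || { echo "input directory not found: $1" >&2; status=1; }
  [ -d "$2" ] || { echo "output directory not found: $2" >&2; status=1; }
  [ -f "$3" ] || { echo "manifest file not found: $3" >&2; status=1; }
  return "$status"
}

# Hypothetical study layout; substitute your own paths.
STUDY_IN="./input_files/My_Study"
STUDY_OUT="./output_files/My_Study"
STUDY_MANIFEST="$STUDY_IN/metadata.json"

# Start GaPTools only when all three paths look sane.
if check_study_paths "$STUDY_IN" "$STUDY_OUT" "$STUDY_MANIFEST"; then
  ./dbgap-docker.bash -i "$STUDY_IN" -o "$STUDY_OUT" -m "$STUDY_MANIFEST" up
fi
```

The check only confirms that the paths exist. The output directory still needs the write permissions described in the Setup section.
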

Stop Docker Containers

Once your study is processed, run the below command to stop the GaPTools service.

./dbgap-docker.bash down

Contact

If you have any questions or to report any issues, please contact us at: [email protected]

gaptools's People

Contributors

akomaragiri, alanhoyle, alexandra-soboleva, dale-conklin, danieleweeks, davidshao, derydan, gaptools-service

gaptools's Issues

Need gnu getopt on Macs

It might be helpful to point out in your Readme that on a Mac, it is necessary to switch to GNU getopt instead of the default getopt in order for your dbgap-docker.bash script to run. This can be done by issuing these commands:

brew install gnu-getopt 
brew link --force gnu-getopt

With the default version of getopt that comes with Macs, the dbgap-docker.bash example command dies with this rather cryptic error message:

$ ./dbgap-docker.bash -i ./input_files/1000_Genomes_Study/ -o ./output_files/1000_Genomes_Study -m ./input_files/1000_Genomes_Study/metadata.json up
getopt: illegal option -- l
getopt: illegal option -- g
getopt: illegal option -- e
getopt: illegal option -- l
getopt: illegal option -- ,
getopt: illegal option -- u
getopt: illegal option -- :
getopt: illegal option -- ,
getopt: illegal option -- u
getopt: illegal option -- u
getopt: illegal option -- :
getopt: illegal option -- ,
getopt: illegal option -- a
Wrong number/type of arguments
Usage: dbgap-docker.bash [-h] [-i, --input [input directory]]
[-o, --output [output directory]] 
[-m, --manifest [manifest file]]
[up|down]
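
Which getopt is on the PATH can be checked before running the script. The enhanced (util-linux, often called GNU) getopt exits with status 4 when called with `--test`, while the BSD getopt shipped with macOS does not. The sketch below relies on that convention; the function name `getopt_kind` is made up for illustration.

```shell
# getopt_kind: prints "gnu" when the getopt on PATH is the enhanced
# implementation (exit status 4 on `--test`), otherwise prints "other".
getopt_kind() {
  rc=0
  getopt --test >/dev/null 2>&1 || rc=$?
  if [ "$rc" -eq 4 ]; then
    echo "gnu"
  else
    echo "other"
  fi
}

echo "getopt on PATH: $(getopt_kind)"
```

If this still reports "other" after the brew commands above, a possible alternative (an assumption about a typical Homebrew setup) is to put the Homebrew copy first on the PATH, for example with `export PATH="$(brew --prefix gnu-getopt)/bin:$PATH"`.
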

Can gaptools be run on only the subject phenotype files?

The gaptools documentation mentions only two required files: the subject consent and the subject sample map files.

So I was wondering if it could be run on input that contains those two plus the subject phenotype files without providing any genotype data? Or is it required that genotype data be present for a run of gaptools?

For example, if I run it on a made-up test data set with this json below, it doesn't do anything beyond generating a metrics-CHECK_METADATA_FILES.json file containing the following:

[
    {
        "dag_id": "GapTools",
        "run_id": "2023-02-08-13-57-32",
        "dag_task": "CHECK_METADATA_FILES",
        "metric_name": "dag_run_start_time",
        "metric_value": "2023-02-08 13:57:45",
        "metric_object": ""
    }
]

Input metadata JSON file

{
  "NAME": "Example 1",
  "FILES": [
    {
      "name": "ExampleConsent_DS.txt",
      "type": "subject_consent_file"
    },
    {
      "name": "ExampleConsent_DD.xlsx",
      "type": "subject_consent_data_dictionary_file"
    },
    {
      "name": "ExampleSSM_DS.txt",
      "type": "subject_sample_mapping_file"
    },
    {
      "name": "ExampleSSM_DD.xlsx",
      "type": "subject_sample_mapping_data_dictionary_file"
    },
    {
      "name": "DS_Example.txt",
      "type": "phenotype_ds"
    },
    {
      "name": "3b_SSM_DD_Example1.xlsx",
      "type": "phenotype_dd"
    }
  ]
}
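
One thing that can be ruled out independently of GaPTools is a mismatch between the file names listed in the metadata file and the files actually present in the input directory. Whether this is what the CHECK_METADATA_FILES task verifies is not stated here, so the following is only a rough sketch. Both function names and both paths are hypothetical, and the name extraction is a text-level match that assumes one `"name": "..."` pair per line, as in the file above. It is not a general JSON parser.

```shell
# list_metadata_names FILE: prints the value of every lowercase "name" key,
# one per line. Assumes one "name": "..." pair per line.
list_metadata_names() {
  sed -n -E 's/^[[:space:]]*"name"[[:space:]]*:[[:space:]]*"([^"]*)".*$/\1/p' "$1"
}

# missing_metadata_files METADATA_FILE INPUT_DIR: prints each listed name that
# is absent from INPUT_DIR and returns non-zero if at least one is missing.
missing_metadata_files() {
  list_metadata_names "$1" | (
    rc=0
    while IFS= read -r name; do
      if [ ! -f "$2/$name" ]; then
        echo "$name"
        rc=1
      fi
    done
    exit "$rc"
  )
}

# Hypothetical locations; substitute your own paths.
METADATA="./input_files/Example_1/metadata.json"
INPUT_DIR="./input_files/Example_1"

if [ -f "$METADATA" ]; then
  missing_metadata_files "$METADATA" "$INPUT_DIR" \
    || echo "the files listed above are named in the metadata but not found" >&2
fi
```
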

Errors in the `gaptools-flower-1` log

When running on my example files, where it fails to proceed beyond creating the metrics-CHECK_METADATA_FILES.json file, I see text like this in the gaptools-flower-1 log, which seems to indicate some sort of error (note that this is not the complete log, as it was very long):

[2023-02-09 04:14:06,108] {cli_action_loggers.py:105} WARNING - Failed to log action with (psycopg2.errors.UndefinedTable) relation "log" does not exist
LINE 1: INSERT INTO log (dttm, dag_id, task_id, event, execution_dat...
                    ^

[SQL: INSERT INTO log (dttm, dag_id, task_id, event, execution_date, owner, extra) VALUES (%(dttm)s, %(dag_id)s, %(task_id)s, %(event)s, %(execution_date)s, %(owner)s, %(extra)s) RETURNING log.id]
[parameters: {'dttm': datetime.datetime(2023, 2, 9, 4, 14, 6, 58699, tzinfo=Timezone('UTC')), 'dag_id': None, 'task_id': None, 'event': 'cli_flower', 'execution_date': None, 'owner': 'airflow', 'extra': '{"host_name": "ef46e7cc54f4", "full_command": "[\'/usr/local/airflow/.local/bin/airflow\', \'celery\', \'flower\']"}'}]
(Background on this error at: http://sqlalche.me/e/13/f405)

Warning: entry point could not be loaded. Contact its author for help.


Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/click_plugins/core.py", line 37, in decorator
    group.add_command(entry_point.load())
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2457, in load
    self.require(*args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2480, in require
    items = working_set.resolve(reqs, env, installer, extras=self.extras)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 788, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (celery 5.1.2 (/usr/local/airflow/.local/lib/python3.9/site-packages), Requirement.parse('celery<5.0.0,>=4.3.0; python_version >= "3.7"'))

[2023-02-09 04:14:20,097] {cli_action_loggers.py:105} WARNING - Failed to log action with (psycopg2.errors.UndefinedTable) relation "log" does not exist
LINE 1: INSERT INTO log (dttm, dag_id, task_id, event, execution_dat...
                    ^

[SQL: INSERT INTO log (dttm, dag_id, task_id, event, execution_date, owner, extra) VALUES (%(dttm)s, %(dag_id)s, %(task_id)s, %(event)s, %(execution_date)s, %(owner)s, %(extra)s) RETURNING log.id]
[parameters: {'dttm': datetime.datetime(2023, 2, 9, 4, 14, 20, 50820, tzinfo=Timezone('UTC')), 'dag_id': None, 'task_id': None, 'event': 'cli_flower', 'execution_date': None, 'owner': 'airflow', 'extra': '{"host_name": "ef46e7cc54f4", "full_command": "[\'/usr/local/airflow/.local/bin/airflow\', \'celery\', \'flower\']"}'}]
(Background on this error at: http://sqlalche.me/e/13/f405)

Warning: entry point could not be loaded. Contact its author for help.


Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/click_plugins/core.py", line 37, in decorator
    group.add_command(entry_point.load())
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2457, in load
    self.require(*args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2480, in require
    items = working_set.resolve(reqs, env, installer, extras=self.extras)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 788, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (celery 5.1.2 (/usr/local/airflow/.local/lib/python3.9/site-packages), Requirement.parse('celery<5.0.0,>=4.3.0; python_version >= "3.7"'))


Warning: entry point could not be loaded. Contact its author for help.


Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/click_plugins/core.py", line 37, in decorator
    group.add_command(entry_point.load())
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2457, in load
    self.require(*args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2480, in require
    items = working_set.resolve(reqs, env, installer, extras=self.extras)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 788, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (celery 5.1.2 (/usr/local/airflow/.local/lib/python3.9/site-packages), Requirement.parse('celery<5.0.0,>=4.3.0; python_version >= "3.7"'))


Warning: entry point could not be loaded. Contact its author for help.


Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/click_plugins/core.py", line 37, in decorator
    group.add_command(entry_point.load())
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2457, in load
    self.require(*args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2480, in require
    items = working_set.resolve(reqs, env, installer, extras=self.extras)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 788, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (celery 5.1.2 (/usr/local/airflow/.local/lib/python3.9/site-packages), Requirement.parse('celery<5.0.0,>=4.3.0; python_version >= "3.7"'))


Warning: entry point could not be loaded. Contact its author for help.


Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/click_plugins/core.py", line 37, in decorator
    group.add_command(entry_point.load())
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2457, in load
    self.require(*args, **kwargs)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 2480, in require
    items = working_set.resolve(reqs, env, installer, extras=self.extras)
  File "/usr/local/airflow/.local/lib/python3.9/site-packages/pkg_resources/__init__.py", line 788, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (celery 5.1.2 (/usr/local/airflow/.local/lib/python3.9/site-packages), Requirement.parse('celery<5.0.0,>=4.3.0; python_version >= "3.7"'))

How to find my docker host ip?

In the readme, it says:

To view the validation results of the dbGaP validation tool, browse to the following URL:

http://<your_docker_host_ip>:8080

However, being not very experienced with using Docker yet, I am uncertain how to find the docker host IP address.

I did figure out these IP addresses:

$ docker inspect -f '{{.Name}} - {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -aq)
/gaptools-worker-1 - 172.18.0.8
/gaptools-gaptools-ui-1 - 172.18.0.7
/gaptools-scheduler-1 - 172.18.0.6
/gaptools-airflow-web-1 - 172.18.0.5
/gaptools-flower-1 - 172.18.0.4
/gaptools-postgres-1 - 172.18.0.3
/gaptools-redis-1 - 172.18.0.2

But when I try to connect to

http://172.18.0.5:8080

it is non-responsive and times out.

Looks like the work process ran O.K., as I see and can read the e-mail *.html file that was generated in the

gaptools/output_files/1000_Genomes_Study/client_emails/studies

folder.
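
A note on the addresses above: the 172.18.0.x values are container addresses on Docker's internal network, which are generally not routable from outside the Docker host (and, with Docker Desktop on a Mac, not from the host itself). The README's URL refers to the address of the machine running Docker, so when the browser runs on that same machine, http://localhost:8080 is usually the address to try. The sketch below lists which running container publishes host port 8080. The function name `containers_publishing` is hypothetical, and the exact port mapping used by GaPTools is an assumption to be confirmed from the output.

```shell
# containers_publishing PORT: reads lines of "<name> <ports>" on stdin and
# prints the names of containers whose port list maps host port PORT.
containers_publishing() {
  awk -v port="$1" '$0 ~ (":" port "->") { print $1 }'
}

# Query the running containers only when docker is installed.
if command -v docker >/dev/null 2>&1; then
  docker ps --format '{{.Names}} {{.Ports}}' 2>/dev/null | containers_publishing 8080
fi
```

If a container is listed, port 8080 is published on the Docker host, and the host's own address (or localhost) is the one to use in the browser rather than the container's internal address.
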
