Open Targets Library - NLP Pipeline

NLP Analysis of MedLine/PubMed Running in Apache Beam

This pipeline is designed to run with Apache Beam using the Dataflow runner. It has not been tested with other Beam backends, but it should work there as well with minimal modifications. Please see the Apache Beam SDK documentation for more information.

Steps to reproduce a full run

Use Python 2 with pip and virtualenv.

  • Generate a mirror of the MEDLINE FTP site in a Google Cloud Storage bucket (any other storage provider supported by the Python Beam SDK should work), e.g. using rclone

    • Download the pre-built rclone binaries rather than platform-packaged ones, as they tend to be more up to date
    • Run rclone config to configure two remotes: one for the MEDLINE FTP server ftp.ncbi.nlm.nih.gov (medline-ftp) and one for your target GCP project (my-gcp-project-buckets). The MEDLINE remote must use the username anonymous and the password anonymous.
    • Generate a full mirror: rclone sync -v medline-ftp:pubmed/baseline my-gcp-project-buckets:my-medline-bucket/baseline
    • Update new files: rclone sync -v medline-ftp:pubmed/updatefiles my-gcp-project-buckets:my-medline-bucket/updatefiles
    • Note: you can use the --dry-run argument to test
  • Install tooling

    sudo apt-get install python-dev virtualenv build-essential git libxml2-dev libxslt-dev zlib1g-dev tmux
  • Download the pipeline

    git clone https://github.com/opentargets/library-beam
    cd library-beam
  • Create a virtual environment to manage dependencies in

    virtualenv venv --python=python2
    source venv/bin/activate
  • Install the pipeline into the virtual environment

    python setup.py install
    #note this needs between 3.75GB and 7.5GB RAM
    pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.0/en_core_web_lg-2.2.0.tar.gz
  • Run pipeline

    python -m main \
      --project open-targets-library \
      --job_name medline201911 \
      --runner DataflowRunner \
      --temp_location gs://medline_2019_11/temp \
      --setup_file ./setup.py \
      --worker_machine_type n1-highmem-32 \
      --input_baseline gs://medline_2019_11/baseline/pubmed19n*.xml.gz \
      --input_updates gs://medline_2019_11/updatefiles/pubmed19n*.xml.gz \
      --output_enriched gs://medline_2019_11/analyzed/pubmed19 \
      --output_splitted gs://medline_2019_11/splitted/pubmed19 \
      --max_num_workers 32 \
      --region europe-west1 \
      --zone europe-west1-d

    The run can be monitored via Google Dataflow. Note that the "wall time" displayed there is not the usual elapsed time: it is accumulated per thread and per worker.

    In total, a run takes approximately 4 hours.
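The flags in the run command follow a regular pattern once the bucket and the file prefix are fixed. As a minimal sketch, the helper below assembles the same argument list from a few values. `build_pipeline_args` and its parameter names are hypothetical and not part of the repository; the defaults simply mirror the example command above.

```python
def build_pipeline_args(project, bucket, file_prefix, job_name,
                        machine_type="n1-highmem-32", max_workers=32,
                        region="europe-west1", zone="europe-west1-d"):
    """Return the flags for `python -m main` (hypothetical helper).

    Assumes the bucket layout used in the example command:
    baseline/, updatefiles/, analyzed/, splitted/ and temp/.
    """
    base = "gs://{}".format(bucket)
    pattern = "{}n*.xml.gz".format(file_prefix)
    return [
        "--project", project,
        "--job_name", job_name,
        "--runner", "DataflowRunner",
        "--temp_location", "{}/temp".format(base),
        "--setup_file", "./setup.py",
        "--worker_machine_type", machine_type,
        "--input_baseline", "{}/baseline/{}".format(base, pattern),
        "--input_updates", "{}/updatefiles/{}".format(base, pattern),
        "--output_enriched", "{}/analyzed/{}".format(base, file_prefix),
        "--output_splitted", "{}/splitted/{}".format(base, file_prefix),
        "--max_num_workers", str(max_workers),
        "--region", region,
        "--zone", zone,
    ]


args = build_pipeline_args(
    project="open-targets-library",
    bucket="medline_2019_11",
    file_prefix="pubmed19",
    job_name="medline201911",
)
# Print the command for review before running it by hand.
print(" ".join(["python", "-m", "main"] + args))
```

Keeping the bucket and prefix in one place makes it harder to mix paths from two MEDLINE releases in the same run.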

Steps to load the JSON dumps into Elasticsearch

The directory gcp contains the infrastructure scripts to generate the Elasticsearch cluster.

  • Create a virtual environment to manage dependencies in
    virtualenv venv_elasticsearch --python=python2
    source venv_elasticsearch/bin/activate
    pip install -r venv_elasticsearch.txt
  • Run the job to load the JSON files into Elasticsearch

WARNING: the loading scripts currently take a long time, particularly the concept one (24h+). It is a good idea to run them inside screen, tmux or similar, so that the job keeps going after a disconnect and the session can be recovered.

python load2es.py publication bioentity taggedtext concept --es http://es:9200

Note: Elasticsearch must have the International Components for Unicode (ICU) support plugin installed, i.e. /usr/share/elasticsearch/bin/elasticsearch-plugin -s install analysis-icu
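Since a load can run for a day or more, it may be worth confirming the plugin is present before starting. The sketch below is one way to do that with the Elasticsearch `_cat/plugins` API; the helper names, the node names and the version string in the sample are hypothetical, and the check is not part of load2es.py.

```python
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch_plugins(es_url):
    """Fetch the plugin list from a live cluster as parsed JSON."""
    response = urlopen(es_url.rstrip("/") + "/_cat/plugins?format=json")
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()


def nodes_with_plugin(plugins, component="analysis-icu"):
    """Return the sorted names of nodes that report the given plugin."""
    return sorted({row["name"] for row in plugins
                   if row.get("component") == component})


# Illustrative response; against a real cluster use fetch_plugins("http://es:9200").
sample = [
    {"name": "es-node-1", "component": "analysis-icu", "version": "7.0.0"},
    {"name": "es-node-2", "component": "some-other-plugin", "version": "7.0.0"},
]
print(nodes_with_plugin(sample))
```

Nodes with no plugins at all do not appear in the `_cat/plugins` output, so compare the result against the full list of nodes in the cluster.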

  • Increase the Elasticsearch capacity for the adjacency matrix aggregation (used by the LINK tool)
    curl -XPUT 'http://myesnode1:9200/pubmed-18-concept/_settings' -H 'Content-Type: application/json' -d'
       {
          "index" : {
              "max_adjacency_matrix_filters" : 500
              }
       }'
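The setting above caps how many named filters a single adjacency matrix aggregation may contain. As a rough illustration of why the limit matters, the sketch below builds such an aggregation with one filter per concept and refuses to exceed the configured limit. The field name `concept_id`, the concept values and the helper itself are assumptions for illustration; the actual queries issued by the LINK tool are not shown in this README.

```python
import json


def build_adjacency_query(field, values, max_filters=500):
    """Build an adjacency_matrix aggregation with one term filter per value.

    max_filters should match index.max_adjacency_matrix_filters on the index.
    """
    if len(values) > max_filters:
        raise ValueError(
            "{} filters requested, index allows {}".format(len(values), max_filters)
        )
    # Each filter is named after the value it matches (hypothetical naming).
    filters = {value: {"term": {field: value}} for value in values}
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {"links": {"adjacency_matrix": {"filters": filters}}},
    }


# Hypothetical field and concept identifiers.
query = build_adjacency_query("concept_id", ["BRAF", "melanoma", "vemurafenib"])
print(json.dumps(query, indent=2, sort_keys=True))
```

Checking the filter count on the client side gives a clearer error than a rejected search request.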

Google Cloud Platform

When controlling this process from a Google Cloud virtual machine, make sure the machine has sufficient access scopes enabled for the services used here (Dataflow and Cloud Storage).
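One way to check the scopes from inside the machine is to ask the Compute Engine metadata server, as sketched below. The set of required scopes is an assumption made for illustration (this README does not list them), and the helper names are hypothetical; the broad `cloud-platform` scope is treated as covering everything.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

METADATA_SCOPES_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/scopes"
)
CLOUD_PLATFORM = "https://www.googleapis.com/auth/cloud-platform"

# Assumed requirement, not taken from the README: storage write and compute access.
ASSUMED_REQUIRED = {
    "https://www.googleapis.com/auth/devstorage.read_write",
    "https://www.googleapis.com/auth/compute",
}


def parse_scopes(text):
    """Turn the newline-separated metadata response into a set of scopes."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def missing_scopes(granted, required=ASSUMED_REQUIRED):
    """Return required scopes that are not granted."""
    if CLOUD_PLATFORM in granted:
        return set()
    return set(required) - set(granted)


def fetch_granted_scopes():
    """Query the metadata server; only works on a Google Cloud VM."""
    request = Request(METADATA_SCOPES_URL, headers={"Metadata-Flavor": "Google"})
    response = urlopen(request)
    try:
        return parse_scopes(response.read().decode("utf-8"))
    finally:
        response.close()


# Illustrative input; on a VM use fetch_granted_scopes() instead.
example = parse_scopes("https://www.googleapis.com/auth/devstorage.read_only\n")
print(sorted(missing_scopes(example)))
```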

Contributors

cmalangone, afaulconbridge, apierleoni, priyankaw, mkarmona, elipapa
