Open Targets Library - NLP Pipeline

NLP Analysis of MedLine/PubMed Running in Apache Beam

This pipeline is designed to run on Apache Beam using the Dataflow runner. It has not been tested with other Beam backends, but it should work on them with minimal modifications. See the Apache Beam SDK documentation for more information.

Steps to reproduce a full run

Use python2 with pip and virtualenv

Generate a mirror of the MEDLINE FTP site in a Google Cloud Storage bucket (any other storage provider supported by the Python Beam SDK should work), for example with rclone:
- Download the pre-built rclone binaries rather than platform-packaged ones, as they tend to be more up to date.
- Run rclone config to set up two remotes: one for the MEDLINE FTP server ftp.ncbi.nlm.nih.gov (named medline-ftp in the commands below) and one for your target GCP project (my-gcp-project-buckets). The MEDLINE remote must use username anonymous and password anonymous.
- Generate a full mirror: rclone sync -v medline-ftp:pubmed/baseline my-gcp-project-buckets:my-medline-bucket/baseline
- Update new files: rclone sync -v medline-ftp:pubmed/updatefiles my-gcp-project-buckets:my-medline-bucket/updatefiles
- Note: add the --dry-run argument to test a command without copying anything.
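The two sync commands above differ only in the subdirectory, so they can be generated from one helper. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, and the remote and bucket names are the placeholders used in the list above.

```python
import subprocess


def rclone_sync_command(subdir,
                        bucket="my-medline-bucket",
                        source_remote="medline-ftp",
                        target_remote="my-gcp-project-buckets",
                        dry_run=False):
    """Build the rclone argument list that mirrors one MEDLINE subdirectory.

    The remote and bucket names are placeholders; replace them with the
    names you chose in `rclone config`.
    """
    command = [
        "rclone", "sync", "-v",
        "{}:pubmed/{}".format(source_remote, subdir),
        "{}:{}/{}".format(target_remote, bucket, subdir),
    ]
    if dry_run:
        # --dry-run makes rclone report what it would do without copying.
        command.append("--dry-run")
    return command


def mirror_commands(dry_run=False):
    """Return the commands for the full baseline and the update files."""
    return [rclone_sync_command(subdir, dry_run=dry_run)
            for subdir in ("baseline", "updatefiles")]


def run_mirror(dry_run=True):
    """Run both syncs in order; raises CalledProcessError if rclone fails."""
    for command in mirror_commands(dry_run=dry_run):
        subprocess.check_call(command)
```

Calling run_mirror() with the default dry_run=True only previews the transfer; pass dry_run=False once the preview looks right.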

Install tooling

sudo apt-get install python-dev virtualenv build-essential git libxml2-dev libxslt-dev zlib1g-dev tmux

Download the pipeline

git clone https://github.com/opentargets/library-beam
cd library-beam

Create a virtual environment to manage dependencies in

virtualenv venv --python=python2
source venv/bin/activate

Install the pipeline into the virtual environment

python setup.py install
# Note: this needs between 3.75GB and 7.5GB of RAM
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.0/en_core_web_lg-2.2.0.tar.gz

Run pipeline

python -m main \
  --project open-targets-library \
  --job_name medline201911 \
  --runner DataflowRunner \
  --temp_location gs://medline_2019_11/temp \
  --setup_file ./setup.py \
  --worker_machine_type n1-highmem-32 \
  --input_baseline gs://medline_2019_11/baseline/pubmed19n*.xml.gz \
  --input_updates gs://medline_2019_11/updatefiles/pubmed19n*.xml.gz \
  --output_enriched gs://medline_2019_11/analyzed/pubmed19 \
  --output_splitted gs://medline_2019_11/splitted/pubmed19 \
  --max_num_workers 32 \
  --region europe-west1 \
  --zone europe-west1-d

The run can be monitored in the Google Dataflow console. Note that the "wall time" displayed there is not the usual elapsed time: it is counted per thread and per worker.

A full run takes approximately 4 hours in total.
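The bucket name and the two-digit year in the file patterns recur throughout the invocation above, so it can help to derive every path from them in one place. This is a minimal sketch with a hypothetical helper name; it assumes the bucket layout shown above (baseline, updatefiles, analyzed, splitted, temp) and that the digits in pubmed19n are the baseline year. It uses only the flags that appear in the command above.

```python
def pipeline_args(bucket, year_suffix, project, job_name,
                  region="europe-west1", zone="europe-west1-d",
                  max_num_workers=32,
                  worker_machine_type="n1-highmem-32"):
    """Assemble the arguments for `python -m main` from a few parameters.

    Assumption: all inputs and outputs live in one bucket, laid out as in
    the example invocation.
    """
    root = "gs://{}".format(bucket)
    pattern = "pubmed{}n*.xml.gz".format(year_suffix)
    prefix = "pubmed{}".format(year_suffix)
    return [
        "--project", project,
        "--job_name", job_name,
        "--runner", "DataflowRunner",
        "--temp_location", "{}/temp".format(root),
        "--setup_file", "./setup.py",
        "--worker_machine_type", worker_machine_type,
        "--input_baseline", "{}/baseline/{}".format(root, pattern),
        "--input_updates", "{}/updatefiles/{}".format(root, pattern),
        "--output_enriched", "{}/analyzed/{}".format(root, prefix),
        "--output_splitted", "{}/splitted/{}".format(root, prefix),
        "--max_num_workers", str(max_num_workers),
        "--region", region,
        "--zone", zone,
    ]


if __name__ == "__main__":
    args = pipeline_args("medline_2019_11", "19",
                         project="open-targets-library",
                         job_name="medline201911")
    # Print a command line that can be pasted into the shell.
    print("python -m main " + " ".join(args))
```

The glob patterns contain a `*`, so quote them if you paste the printed command into a shell that might expand them locally.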

Steps to load the JSON dumps into Elasticsearch

The directory gcp contains the infrastructure scripts to generate the Elasticsearch cluster.

Create a virtual environment to manage dependencies in

virtualenv venv_elasticsearch --python=python2
source venv_elasticsearch/bin/activate
pip install -r venv_elasticsearch.txt

Run the job to load the JSONs into Elasticsearch

WARNING: the loading scripts currently take a long time, particularly the concept one (24h+). It is a good idea to run them inside screen, tmux, or similar, so that the job keeps running after a disconnect and the session can be recovered.

python load2es.py publication bioentity taggedtext concept --es http://es:9200

Note: Elasticsearch must have the International Components for Unicode (ICU) support plugin installed, e.g.:

/usr/share/elasticsearch/bin/elasticsearch-plugin -s install analysis-icu
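Before starting a load that can run for a day, it may be worth confirming that the plugin is actually present on the cluster. The sketch below is one possible check using the Elasticsearch cat plugins API with only the component column requested. The helper names are hypothetical, and it assumes Python 3 (standard library only), so run it outside the python2 virtual environment.

```python
from urllib.request import urlopen


def parse_components(cat_plugins_text):
    """Parse the body of GET /_cat/plugins?h=component.

    With h=component the response has one plugin name per line, repeated
    once for every node that has the plugin installed.
    """
    return {line.strip()
            for line in cat_plugins_text.splitlines()
            if line.strip()}


def has_icu_plugin(cat_plugins_text):
    """True if analysis-icu is listed for at least one node."""
    return "analysis-icu" in parse_components(cat_plugins_text)


def fetch_components(es_url):
    """Fetch the plugin list from a running cluster, e.g. http://es:9200."""
    url = es_url.rstrip("/") + "/_cat/plugins?h=component"
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

For example, has_icu_plugin(fetch_components("http://es:9200")) checks the cluster used in the load command above. Note that this only shows the plugin is on at least one node; it does not verify every node.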

Increase the Elasticsearch limit for the adjacency matrix aggregation (used by the LINK tool)

curl -XPUT 'http://myesnode1:9200/pubmed-18-concept/_settings' -H 'Content-Type: application/json' -d'
   {
      "index" : {
          "max_adjacency_matrix_filters" : 500
          }
   }'
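The same settings update can be issued from Python when the load is scripted. This is a minimal sketch that mirrors the curl call above; the function name is hypothetical, the host and index name are the ones from the example, and it assumes Python 3 (standard library only) rather than the python2 environment used for the loader.

```python
import json
from urllib.request import Request, urlopen


def adjacency_settings_request(es_url, index, max_filters=500):
    """Build the PUT request that raises max_adjacency_matrix_filters."""
    body = json.dumps(
        {"index": {"max_adjacency_matrix_filters": max_filters}}
    ).encode("utf-8")
    return Request(
        url="{}/{}/_settings".format(es_url.rstrip("/"), index),
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def apply_adjacency_settings(es_url, index, max_filters=500):
    """Send the request and return the decoded JSON response."""
    request = adjacency_settings_request(es_url, index, max_filters)
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it possible to inspect the URL and body before touching the cluster, for example apply_adjacency_settings("http://myesnode1:9200", "pubmed-18-concept").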

Google Cloud Platform

When controlling this process from a Google Cloud machine, make sure the machine has sufficient access scopes enabled.
