
cogstack / cogstack-pipeline


Distributed, fault tolerant batch processing for Natural Language Applications and Search, using remote partitioning

Home Page: https://cogstack.atlassian.net/wiki/spaces/COGDOC/overview

License: Other

Shell 2.50% Java 90.35% Groovy 0.47% Dockerfile 1.08% Python 5.60%
batch-processing cogstack elasticsearch spring nlp tika tesseract ocr semantic-search alerting

cogstack-pipeline's Introduction

Archived

This project is archived and no longer maintained. CogStack-Nifi is the successor to this project and continues to be actively maintained.

Introduction

CogStack is a lightweight, distributed, fault-tolerant database processing architecture and ecosystem, intended to make NLP processing and preprocessing easier in resource-constrained environments. It comprises multiple components, of which CogStack Pipeline, the one covered in this documentation, has been designed to provide configurable data processing pipelines for working with EHR data. For the moment it mainly uses databases and files as the primary source of EHR data, with the possibility of adding custom data connectors soon. It makes use of the Java Spring Batch framework in order to provide a fully configurable data processing pipeline, with the goal of generating annotated JSON files that can be readily indexed into ElasticSearch, stored as files or pushed back to a database.

Documentation

For the most up-to-date documentation about using CogStack, building it and running it with example deployments, please refer to the official CogStack Confluence page.

Discussion

If you have any questions, why not reach out on the community Discourse forum?

Quick Start Guide

Introduction

This simple tutorial demonstrates how to get CogStack Pipeline running on a sample electronic health record (EHR) dataset stored initially in an external database. The CogStack ecosystem has been designed to handle both structured and unstructured EHR data efficiently. It shows its strength when working with unstructured data, especially as some input data can be provided as documents in PDF or image formats. For the moment, however, we only show how to run CogStack on a set of structured and free-text EHRs that have already been digitised. The part covering unstructured data in the form of PDF documents, images and other clinical notes, which need to be processed prior to analysis, is covered in the official CogStack Confluence page.

This tutorial is divided into 3 parts:

  1. Getting CogStack (link),
  2. A brief description of how the CogStack pipeline and its ecosystem work (link),
  3. Running CogStack pipeline 'out-of-the-box' using the dataset already preloaded into a sample database (link).

To skip the brief description and get hands-on with running the CogStack pipeline, please head directly to the Running CogStack part.

The main directory with resources used in this tutorial is available in the CogStack bundle under examples/. This tutorial is based on Example 2; however, there are more examples available to play with.

Getting CogStack

The most convenient way to get the CogStack bundle is to download it directly from the official GitHub repository, either by cloning the source using git:

git clone https://github.com/CogStack/CogStack-Pipeline.git

or by downloading the bundle from the repository's Releases page and decompressing it.

How CogStack works

Data processing workflow

The data processing workflow of the CogStack pipeline is based on the Java Spring Batch framework. Without dwelling too much on technical details, the general idea is that data is read from a predefined data source, passes through a number of processing operations, and the final result is stored in a predefined data sink. CogStack pipeline implements a variety of data readers, processors and writers, together with scalability mechanisms, which can be selected in the CogStack job configuration. Although the data can be read from different sources, the most frequently used data sink is ElasticSearch. For more details about the CogStack functionality, please refer to the CogStack Documentation.
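To make the reader-processor-writer idea concrete, here is a minimal sketch of a chunk-oriented Spring Batch step of the kind described above. It is illustrative only: the bean names, the Document item type and the chunk size are assumptions, not CogStack's actual job configuration.

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ExampleStepConfig {

    // A chunk-oriented step: items are read from the source, transformed, and written to the sink.
    @Bean
    public Step exampleStep(StepBuilderFactory steps,
                            ItemReader<Document> reader,                  // e.g. a JDBC reader over the source database
                            ItemProcessor<Document, Document> processor,  // e.g. Tika/OCR or NLP annotation
                            ItemWriter<Document> writer) {                // e.g. an ElasticSearch or file writer
        return steps.get("exampleStep")
                .<Document, Document>chunk(50)                            // process items in chunks of 50
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}

The remote partitioning mentioned in the project description builds on such steps by splitting the source table's primary-key/timestamp range across several workers, each of which runs the step over its own slice of the data.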

(Figure: CogStack data processing workflow)

In this tutorial we only focus on a simple and very common use case, where the CogStack pipeline reads and processes structured and free-text EHR data from a single PostgreSQL database. The result is then stored in ElasticSearch, where the data can be easily queried in the Kibana dashboard. However, the CogStack pipeline data processing engine also supports multiple data sources -- please see Example 3, which covers such a case.

A sample CogStack ecosystem

The CogStack ecosystem consists of multiple interconnected microservices running together. For ease of use and deployment we use Docker (more specifically, Docker Compose), and provide Compose files for configuring and running the microservices. The selection of running microservices depends mostly on the specification of the EHR data source(s) and the data extraction and processing requirements.

In this tutorial the CogStack ecosystem is composed of the following microservices:

  • samples-db -- PostgreSQL database loaded with a sample dataset under the name db_samples,
  • cogstack-pipeline -- CogStack data processing pipeline with worker(s),
  • cogstack-job-repo -- PostgreSQL database for storing information about CogStack jobs,
  • elasticsearch-1 -- ElasticSearch search engine (single node) for storing and querying the processed EHR data,
  • kibana -- Kibana data visualization tool for querying the data from ElasticSearch.

Since all the examples share a common configuration for the microservices used, the base Docker Compose file is provided in examples/docker-common/docker-compose.yml. The Docker Compose file with the microservice configuration overridden for this example can be found in examples/example2/docker/docker-compose.override.yml. Both configuration files are automatically used by Docker Compose when deploying CogStack, as will be shown later.

Sample datasets

The sample dataset used in this tutorial consists of two types of EHR data:

  • Synthetic -- structured, synthetic EHRs, generated using the synthea application,
  • Medical reports -- unstructured, medical health report documents obtained from MTsamples.

These datasets, although unrelated, are used together to compose a combined dataset.

Full description of these datasets can be found in the official CogStack Confluence page.

Running CogStack platform

Running CogStack pipeline for the first time

For ease of use, CogStack is deployed and run using Docker. However, before starting the CogStack ecosystem for the first time, one needs the database dump files for the sample data, either by creating them locally or by downloading them from Amazon S3. To download the database dumps, run the following in the main examples/ directory:

bash download_db_dumps.sh

Next, a setup script needs to be run locally to prepare the Docker images and configuration files for the CogStack data processing pipeline. The script is available under examples/example2/ and can be run as:

bash setup.sh

As a result, a temporary directory __deploy/ will be created containing all the necessary artifacts to deploy CogStack.

Docker-based deployment

Next, we can proceed to deploy the CogStack ecosystem using Docker Compose. It will configure and start the microservices based on the provided Compose files:

  • the common base configuration, copied from examples/docker-common/docker-compose.yml,
  • the example-specific configuration, copied from examples/example2/docker/docker-compose.override.yml. Moreover, the PostgreSQL database container comes with a pre-initialised database dump ready to be loaded directly into it.

In order to run CogStack, run the following in the examples/example2/__deploy/ directory:

docker-compose up

The console will print status logs of the currently running microservices. For the moment, however, they may not be very informative (sorry, we're working on that!).

Connecting to the microservices

CogStack ecosystem

The picture below sketches a general idea of how the microservices run and communicate within the sample CogStack ecosystem used in this tutorial.

(Figure: workflow of the microservices in the sample CogStack ecosystem)

Assuming that everything is working fine, we should be able to connect to the running microservices. Selected services (elasticsearch-1 and kibana) have their ports forwarded to the host, localhost.

Kibana and ElasticSearch

The Kibana dashboard used to query the EHRs can be accessed directly in a browser at http://localhost:5601/. The data can be queried using a number of ElasticSearch indices, e.g. sample_observations_view. Usually, each index will correspond to the database view in db_samples (the samples-db PostgreSQL database) from which the data was ingested. However, when entering the Kibana dashboard for the first time, an index pattern needs to be configured in the Kibana management panel -- for more information about its creation, please refer to the official Kibana documentation.

In addition, the ElasticSearch REST endpoint can be accessed at http://localhost:9200/. It can be used to perform manual queries or be used by other external services -- for example, one can list the available indices:

curl 'http://localhost:9200/_cat/indices'

or query one of the available indices -- sample_observations_view:

curl 'http://localhost:9200/sample_observations_view'

For more information about querying or modifying documents, please refer to the official ElasticSearch documentation.

As a side note, the name of the ElasticSearch node in the Docker Compose file has been set to elasticsearch-1. The -1 suffix emphasizes that for larger-scale deployments multiple ElasticSearch nodes can be used -- typically, a minimum of 3.

PostgreSQL sample database

Moreover, the PostgreSQL database with the input sample data is exposed directly at localhost:5555. The database name is db_samples, with user test and password test. To connect, one can run:

psql -U 'test' -W -d 'db_samples' -h localhost -p 5555

Publications

CogStack - Experiences Of Deploying Integrated Information Retrieval And Extraction Services In A Large National Health Service Foundation Trust Hospital, Richard Jackson, Asha Agrawal, Kenneth Lui, Amos Folarin, Honghan Wu, Tudor Groza, Angus Roberts, Genevieve Gorrell, Xingyi Song, Damian Lewsley, Doug Northwood, Clive Stringer, Robert Stewart, Richard Dobson. BMC medical informatics and decision making 18, no. 1 (2018): 47.


cogstack-pipeline's People

Contributors

afolarin, antsh3k, hkkenneth, honghan, jstuczyn, lrog, richjackson, tomolopolis, vladd-bit, yatharthranjan


cogstack-pipeline's Issues

fix: read from filesystem or object-store

Currently, loading documents from the FS is not working, and our typical deployment pattern uses blobs in the DB, so currently an ETL would be required.

This is a non-urgent fix to support pointers to a filesystem or an object-store as the document source, but will require some refactoring to support.

Ed.: Previous functionality for this was included in the Docman profile, but this is presently deprecated.
https://cogstack.atlassian.net/wiki/spaces/COGDOC/pages/16678946/CogStack+pipeline#CogStackpipeline-profilesSpringprofiles

Default for scheduler.rate does not follow the cron syntax

From @hkkenneth on March 21, 2017 23:59

@Scheduled(cron = "${scheduler.rate:30000}")

@Scheduled can take one of cron(), fixedDelay(), or fixedRate(), but the default value does not follow the cron syntax.

Note that the cron syntax in Spring (with precision down to seconds) is different from the normal one.
Spring: http://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/scheduling/annotation/Scheduled.html#fixedDelay--
Normal: http://www.nncron.ru/help/EN/working/cron-format.htm
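For illustration, here are two possible fixes, sketched under the assumption that scheduler.rate is meant to be "roughly every 30 seconds"; neither is necessarily the project's chosen solution, and the method names are made up:

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class SchedulerDefaults {

    // (a) keep cron scheduling, but make the default a valid 6-field Spring cron
    //     expression (seconds field first), here meaning "every 30 seconds":
    @Scheduled(cron = "${scheduler.rate:*/30 * * * * *}")
    public void runWithCronDefault() { }

    // (b) or, if a plain millisecond interval is intended, use fixedDelayString so
    //     that the default "30000" is read as 30000 ms rather than as a cron expression:
    @Scheduled(fixedDelayString = "${scheduler.rate:30000}")
    public void runWithFixedDelayDefault() { }
}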

Copied from original issue: KHP-Informatics/cogstack#16

Refactor Integration and acceptance tests

Use a docker-compose file to automatically run the integration and acceptance tests with just a simple Gradle task.
Also add testing using Docker containers to Travis for better CI.

[Feature] Support arbitrary parameter for SQL INSERT statement for jdbc_out

From @hkkenneth on March 9, 2017 11:28

Currently the parameter template is supported by BeanPropertyItemSqlParameterSourceProvider. Sometimes we want to write something other than the Document object's bean properties to the database.

Most information generated by the pipeline is stored in the Document's HashMap<String,Object> associativeArray. Being able to use SQL parameters from this map would be great.
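One way this could look (a sketch only, assuming the Document type exposes its associativeArray and primary key through getters -- the accessor names here are hypothetical) is a custom ItemSqlParameterSourceProvider that exposes the map entries as named parameters for the jdbc_out INSERT statement:

import org.springframework.batch.item.database.ItemSqlParameterSourceProvider;
import org.springframework.jdbc.core.namedparam.MapSqlParameterSource;
import org.springframework.jdbc.core.namedparam.SqlParameterSource;

public class AssociativeArrayParameterSourceProvider
        implements ItemSqlParameterSourceProvider<Document> {

    @Override
    public SqlParameterSource createSqlParameterSource(Document item) {
        // expose every entry of the HashMap<String,Object> as a :key parameter,
        // plus any bean property the INSERT statement may still need
        return new MapSqlParameterSource(item.getAssociativeArray())
                .addValue("primaryKeyFieldValue", item.getPrimaryKeyFieldValue());
    }
}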

Copied from original issue: KHP-Informatics/cogstack#14

TesseractOCRParser timeouts on longer documents

For some longer documents (more than 10 pages, especially if they are scanned), OCR will fail (and hence PDF and thumbnail generation too), as Tesseract parsing will time out before it manages to finish. This is because it uses the default timeout of 120 s.

It would be beneficial if it were possible to set this value using the config file.
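For reference, with Tika 1.x the Tesseract timeout can be raised programmatically through a TesseractOCRConfig placed in the ParseContext; exposing that value through the CogStack config file is what this issue asks for. The snippet below is a sketch under that assumption, with an illustrative 600-second timeout and file name:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrTimeoutExample {
    public static void main(String[] args) throws Exception {
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setTimeout(600);                         // seconds; the default is 120

        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, ocrConfig);  // picked up by the Tesseract OCR parser

        BodyContentHandler handler = new BodyContentHandler(-1);
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream stream = Files.newInputStream(Paths.get("scanned-document.pdf"))) {
            parser.parse(stream, handler, new Metadata(), context);
        }
        System.out.println(handler.toString());            // OCR'd text
    }
}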

ElasticsearchRest Client will fail silently if index contains invalid character

If elasticsearch.index.name contains an invalid character (in my case I had a trailing space), it will fail silently. CogStack will not catch an error. I can only suspect that performRequest() does not throw an IOException.
Every log will say that documents were successfully written.

With the native Java client, however, this does raise an error.
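As a defensive illustration only (not the project's actual writer, and the endpoint, type name and variable names are made up), the low-level REST client of the ES 5.x/6.x era returns a Response whose status can be checked explicitly. Note that the synchronous performRequest() does throw a ResponseException for 4xx/5xx responses, so silent failures are more likely when the asynchronous variant's failure listener is a no-op:

import java.io.IOException;
import java.util.Collections;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class IndexWithStatusCheck {
    static void indexDocument(RestClient client, String indexName,
                              String docId, String jsonBody) throws IOException {
        Response response = client.performRequest(
                "PUT",
                "/" + indexName.trim() + "/doc/" + docId,          // trim() guards against a trailing space
                Collections.<String, String>emptyMap(),
                new NStringEntity(jsonBody, ContentType.APPLICATION_JSON));

        int status = response.getStatusLine().getStatusCode();
        if (status < 200 || status >= 300) {
            // surface the failure instead of logging success
            throw new IllegalStateException("Indexing failed with HTTP " + status);
        }
    }
}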

Refactor the build process

CogStack-Pipeline/build.gradle seems to be doing a lot of extraneous things, like building containers.

  • cleanup container building code
  • Use docker-compose to build the containers
  • odd scripts
  • config file generation
  • Move old scripts to deprecated; they will be deleted in future builds
  • Add initial travis build
  • Configure Travis for CI. This will require admin access to the GitHub repo

Unable to index Docman Documents

Trying to index documents from a file system (Docman), but I get the following error:

11:25:52.602 pool-3-thread-1 INFO uk.ac.kcl.partitioners.CogstackJobPartitioner 168 This job SQL: SELECT MAX(primarykeyfieldvalue) AS max_id, MIN(primarykeyfieldvalue) AS min_id, MAX(updatetime) AS max_time_stamp, MIN(updatetime) AS min_time_stamp FROM ( SELECT top 100 percent primarykeyfieldvalue,updatetime FROM referrals WHERE CAST (updatetime as DATETIME) BETWEEN CAST ('2011-12-19 00:00:00.0' AS DATETIME ) AND CAST ('2011-12-20 00:00:00.0' AS DATETIME) )t1
11:25:52.679 pool-3-thread-1 INFO org.springframework.beans.factory.xml.XmlBeanDefinitionReader 317 Loading XML bean definitions from class path resource [org/springframework/jdbc/support/sql-error-codes.xml]
11:25:52.800 pool-3-thread-1 INFO org.springframework.jdbc.support.SQLErrorCodesFactory 127 SQLErrorCodes loaded: [DB2, Derby, H2, HSQL, Informix, MS-SQL, MySQL, Oracle, PostgreSQL, Sybase, Hana]
11:25:52.803 pool-3-thread-1 ERROR org.springframework.batch.core.step.AbstractStep 229 Encountered an error executing step docmanJob_testing2MasterStep in job docmanJob_testing2
org.springframework.jdbc.BadSqlGrammarException: StatementCallback; bad SQL grammar [SELECT MAX(primarykeyfieldvalue) AS max_id, MIN(primarykeyfieldvalue) AS min_id, MAX(updatetime) AS max_time_stamp, MIN(updatetime) AS min_time_stamp FROM ( SELECT top 100 percent primarykeyfieldvalue,updatetime FROM referrals WHERE CAST (updatetime as DATETIME) BETWEEN CAST ('2011-12-19 00:00:00.0' AS DATETIME ) AND CAST ('2011-12-20 00:00:00.0' AS DATETIME) )t1 ]; nested exception is com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax near 'SELECT top 100 percent primarykeyfieldvalue,updatetime FROM referrals WHERE CAS' at line 1


OCR Test failure

This might be a consequence of resolving issue #3, as the build would not finish successfully because it would fail at the test stage. The assertion assertTrue(parsedString.contains("Father or mother")) in testParseRequiringOCR inside PDFPreprocessorParserTest.java fails. Rather than being recognised as "Father or mother", the OCR'd document contains the string "Father er mother".

Add support for PDF Form Parsing

From @afolarin on February 9, 2017 17:9

The present options for Tika don't include parsing PDF forms to text.

It would be useful to add the Apache PDFBox (https://pdfbox.apache.org/) capability so it is available for CogStack, as we see that the clinical record in some hospitals might contain a significant number of these. A first attempt at implementation could just parse the whole form tree; this would at least provide the capability to run queries against it in Elasticsearch.

Further refinement might be made to allow limited configuration to specify which parts of the forms might be parsed, as we've seen very large documents result from PDFBox.

Copied from original issue: KHP-Informatics/cogstack#13

Towards modularity in CogStack for site-specific features

From @hkkenneth on March 25, 2017 13:58

As CogStack is being deployed at various NHS sites and projects, some site-specific features will be needed. For example, in a UCL project we need to post-process bio-yodie results. In the SLaM timeline project we need to generate thumbnails and PDFs. Neither of these is needed in other CogStack use cases.

Goal: new features can be implemented in separate projects, loaded dynamically as extra JAR files and configured using properties and profiles.

Copied from original issue: KHP-Informatics/cogstack#17

Add PDF Table Extraction using Tabula

Ed.: Looks like Camelot might be a better tooling set: https://blog.socialcops.com/technology/engineering/camelot-python-library-pdf-data/

Tabula is an open source project for table extraction from PDFs, and we should be able to automate it (see below). It could solve the problem of data locked in PDF archives, a common thing in unstructured EHR data and something we frequently get asked about with CogStack.

https://github.com/tabulapdf/tabula

see:
Incorporating Tabula into your own project

Tabula is open-source, so we'd love for you to incorporate pieces of Tabula into your own projects. The "guts" of Tabula -- that is, the logic and heuristics that reconstruct tables from PDFs -- is contained in the tabula-java repo. There's a JAR file that you can easily incorporate into JVM languages like Java, Scala or Clojure, and it includes a command-line tool for you to automate your extraction tasks. Visit that repo for more information on how to use tabula-java on the CLI and on how Tabula exports tabula-java scripts.
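To make the "incorporate the JAR" idea concrete, here is a sketch along the lines of the tabula-java documentation; the file name is illustrative and the exact signatures should be checked against the tabula-java version actually used:

import java.io.File;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.PageIterator;
import technology.tabula.RectangularTextContainer;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;

public class TabulaExample {
    public static void main(String[] args) throws Exception {
        try (PDDocument pdf = PDDocument.load(new File("report-with-tables.pdf"))) {
            ObjectExtractor extractor = new ObjectExtractor(pdf);
            SpreadsheetExtractionAlgorithm algorithm = new SpreadsheetExtractionAlgorithm();
            PageIterator pages = extractor.extract();
            while (pages.hasNext()) {
                Page page = pages.next();
                for (Table table : algorithm.extract(page)) {              // one Table per detected table region
                    for (List<RectangularTextContainer> row : table.getRows()) {
                        StringBuilder line = new StringBuilder();
                        for (RectangularTextContainer cell : row) {
                            line.append(cell.getText()).append('\t');      // tab-separate the cells
                        }
                        System.out.println(line.toString().trim());
                    }
                }
            }
        }
    }
}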

cc/ @Honghan

Etiquette note

Hello Cogstack (Amos, right?). Just a quick etiquette poke -- sorry! Provenance-wise, it's bad form to start a new repo without acknowledging existing GitHub projects. It's better to fork (e.g. here or here).

:)

Outdated Tika Dependencies

Currently Gradle will not successfully build CogStack, as it has dependencies on
org.apache.tika:tika-core:1.15-SNAPSHOT and org.apache.tika:tika-parsers:1.15-SNAPSHOT, which are no longer present in the Maven repository. Those would need to be updated to drop the "-SNAPSHOT" part.

PDF and Thumbnail generation will fail if Tika throws length warning

If Tika fails to process lengthy documents, returning the following information:

Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available). 

rather than processing the available text (up to 100000 characters), the pipeline fails.
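For reference, that message comes from Tika's write limit. Below is a sketch of how the partial text could be kept instead of failing the item; the file name is illustrative, and whether CogStack should lift the limit entirely or keep the truncated text is the design decision this issue raises:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.xml.sax.SAXException;

public class WriteLimitExample {
    public static void main(String[] args) throws Exception {
        // Option (a): lift the limit entirely with new BodyContentHandler(-1).
        // Option (b), shown here: keep the 100000 character limit, but fall back
        // to the text gathered so far when the limit is reached.
        WriteOutContentHandler inner = new WriteOutContentHandler(100000);
        BodyContentHandler handler = new BodyContentHandler(inner);
        AutoDetectParser parser = new AutoDetectParser();

        try (InputStream stream = Files.newInputStream(Paths.get("long-document.pdf"))) {
            parser.parse(stream, handler, new Metadata(), new ParseContext());
        } catch (SAXException e) {
            if (!inner.isWriteLimitReached(e)) {
                throw e;                        // a genuine parse failure
            }
            // limit reached: the first 100000 characters are still held by `inner`
        }
        String text = inner.toString();          // use the (possibly truncated) text downstream
        System.out.println(text.length());
    }
}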

Cogstack docker download issues

I am not able to pull the bioyodie image when I run the .yml file:

bioyodie:
  cpu_quota: 800000
  image: richjackson/bioyodie:D4.5
  ports:
    - 8080:8080
Neither can I pull cogstack:

cogstack:
  image: richjackson/cogstack:1.2.0

When I search for bioyodie on Docker Hub, I find nothing.
For cogstack, Docker Hub shows cogstack/cogstack, but we get an error message at runtime that we need to provide a Docker login/password for cogstack.

PDFGenerationItemProcessor for .docx and .png

Hi,

is there any particular reason PDFGenerationItemProcessor is ignoring .docx and .png documents? After adding appropriate cases in the switch for the MIME types of those file types and testing it against a Docker instance of LibreOffice, I had no issues and the files were converted successfully.

De-Identification

Issue to track the ElasticGazetteer implementation or other de-identification algorithms.

Last November I went to meet the Manchester "deid" people, and we helped them a bit with this de-identification tool: https://github.com/healtex/texscrubber. It is a GATE app wrapped in Spring with a GUI. You could possibly use the GATE app with CogStack and see how well it does.

Cheers,
Kenneth

Mechanism to prevent stale CogStack structured data in Elasticsearch

Thoughts on this one welcome. The main purpose of the CogStack pipeline was to bring fast querying of the unstructured data (the structured components are generally better queried in the authoritative database, especially if relational). However, great advantage is gained by collecting a judicious amount of structured data so it can be used to dynamically refine searches.

The problem: when using CogStack to process structured data from the data source into Elasticsearch, to pack alongside the unstructured data, we either take:

  1. the most current structured data existing at loading time (convenient, but inaccurate for historical context),
  2. the closest historical structured data where a change log of the data is available, perhaps through a stored procedure that makes the correct data from that point in time available (correct for historical context, provided that it is static), or
  3. either 1 or 2, then continuously update the data in the Elasticsearch JSON fields as the information changes in the authoritative data source. There is clearly overhead associated with option 3, but it is the only way I can see to avoid the index becoming stale.

As suggested by @HeglerTissot, we should also probably make a separate index that tracks key structured data for convenient querying (although it is noted that JOINs are not possible).

Log metrics on Binary Doc conversion

From @afolarin on October 4, 2016 17:46

We should aim to track which pathway a document goes through and what successes/failures occur as it gets processed into XHTML. It will give us some basic understanding of the relative importance of things like the OCR pipeline.

Copied from original issue: KHP-Informatics/cogstack#4

ElasticsearchRest Client not working with scheduler

While the first batch of jobs is going to run fine, the next one is going to throw the following exception:

java.lang.IllegalStateException: Request cannot be executed; I/O reactor status: STOPPED

From what I've managed to gather, it is probably due to the way Elasticsearch RestClient is using Apache's HttpAsyncClient. For some reason after the batch the client is closed? It might require further investigation.

The possible solution might include recreating RestClient for each batch (however, that would be rather inefficient) or somehow trying to determine if the client is still alive.

For now, the native Java client can be used (I have updated it to work with X-Pack). However, Elastic recommends migrating to the REST client in the future.
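A sketch of the "recreate the client" workaround mentioned above (the host name, port and ping request are illustrative, and this is not the project's implemented fix): lazily rebuild the low-level RestClient whenever the previous instance is no longer usable, e.g. because its I/O reactor has stopped.

import java.io.IOException;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class ReusableEsClient {

    private RestClient client;

    public synchronized RestClient get() {
        if (client == null || !isAlive(client)) {
            // rebuild the client if the previous one is closed or its reactor has stopped
            client = RestClient.builder(new HttpHost("elasticsearch-1", 9200, "http")).build();
        }
        return client;
    }

    private boolean isAlive(RestClient c) {
        try {
            c.performRequest("HEAD", "/");   // cheap ping; fails if the reactor has stopped
            return true;
        } catch (IOException | IllegalStateException e) {
            return false;
        }
    }
}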
