
Dug: digging up dark data

Dug applies semantic web and knowledge graph methods to improve the FAIR-ness of research data.

As an example, dbGaP is a rich source of metadata about biomedical knowledge derived from clinical research like the underutilized TOPMed data sets. A key obstacle to leveraging this knowledge is the lack of researcher tools to navigate from a set of concepts of interest towards relevant study variables. In a word, search.

While other approaches to searching this data exist, our focus is semantic search: For us, "relevant" is defined as having a basis in curated, peer reviewed ontologically represented biomedical knowledge. Given a search term, Dug returns results that are related based on connections in ontological biomedical knowledge graphs.

To achieve this, we annotate study metadata with terms from biomedical ontologies, contextualize them within a unifying upper ontology allowing study data to be federated with larger knowledge graphs, and create a full text search index based on those knowledge graphs.

Quickstart

NOTE: You must run make init once you've cloned the repo to enable the commit-msg git hook so that conventional commits will apply automatically.

To install Dug in your environment, run make install. Alternatively,

pip install -r requirements.txt
pip install -e .

To run the tests, run make test. Alternatively,

pytest .

(The makefile currently only executes unit tests, whereas running pytest in the root directory will execute unit tests, integration tests, and doctests)

To bring up the backend services, run:

docker-compose up

If you're running dug-specific commands (i.e., dug) outside the Docker container, you have to make sure the environment variables are set. Also, make sure all hostnames are correct for the environment you're running in. For example, to connect to Dug's backend services from outside the container (but in a shell environment), run:

source .env
export $(cut -d= -f1 .env)
export ELASTIC_API_HOST=localhost
export REDIS_HOST=localhost

(These values are already set up in the running docker container)

Then you can actually crawl the data:

dug crawl tests/integration/data/test_variables_v1.0.csv -p "TOPMedTag"

After crawling, you can search:

dug search -q "vein" -t "concepts"
dug search -q "vein" -t "variables" -k "concept=UBERON:0001638"

You can also query Dug's REST API:

query='{"index": "concepts_index", "query": "vein"}'

curl --data "$query" \
     --header "Content-Type: application/json" \
     --request POST \
     http://localhost:5551/search

Additional Notes

If you want to change or re-configure the dug service authentication credentials to be different from the defaults, run:

mv .env .env.bak
DATA_DIR=/path/to/dug-data-storage RANDOM=$RANDOM envsubst < .env.template > .env
docker-compose down
docker system prune -a  # NOTE: This will remove *all* images, layers, and volumes.
                        #       Be sure you're okay with this before running.
docker-compose up

The Dug Framework

Dug's ingest uses the Biolink upper ontology to annotate knowledge graphs and structure queries used to drive full text indexing and search. It uses Monarch Initiative APIs to perform named entity recognition on natural language prose to extract ontology identifiers. It also uses Translator normalization services to find preferred identifiers and Biolink types for each extracted identifier. The final step of ingest is to represent the annotated data in a Neo4J graph database.

Dug's integration phase uses Translator's Plater and Automat to generate a Reasoner Standard API compliant service and integrates that service into TranQL. This enables queries that span TOPMed, ROBOKOP, and other reasoners.

Dug's indexing & search phase queries the graph infrastructure and analyzes the resulting graphs. These are used to create documents associating natural language terms with annotations and with the annotated variables and studies.

Dug will then generate Translator knowledge sources for the annotated variables and present them for query via TranQL.

Knowledge Graphs

Dug's core data structure is the knowledge graph. Here's a query of a COPDGene knowledge graph created by Dug from harmonized TOPMed variables.

Figure 1: A Biolink knowledge graph of COPDGene metadata. It shows the relationship between the biological process "Sleep" and a meta variable. The highlighted node is a TOPMed meta variable, or harmonized variable. It is in turn associated with variables connected to two studies in the data set. By linking additional ontological terms to the biological process sleep, we will be able to provide progressively more helpful search results rooted in curated biomedical knowledge.

And one more example to illustrate the knowledge model we use to inform the search index. Figure 2: The TOPMed harmonized variable is highlighted, showing its relationships with the ontology term for Heart Failure and with a specific study variable. Several similar disease, harmonized variable, variable, and study relationships are also shown.

These graphs are used to create the documents we'll add to the search index to power full text search.

In phase 1, we use Neo4J to build queries. In subsequent phases, we integrate other semantic services using TranQL.

Figure 3: A TranQL knowledge graph query response. Integrating TOPMed harmonized variables as a Translator service called by TranQL allows us to query the federation of Translator ontological connections as a precursor to indexing. This includes chemical, phenotypic, disease, cell type, genetic, and other ontologies from sources like ROBOKOP, as well as clinical aggregate data from sources like ICEES. The query shown links cholesterol to "LDL in Blood", a harmonized TOPMed variable. That variable is, in turn, linked to source variables, and each of those is linked to its source study.

Elasticsearch Logic

Elasticsearch contains the indexed knowledge graphs against which we perform full-text queries. Our query currently supports conditionals (AND, OR, NOT) as well as exact matching on quoted terms. Because we don't specify an analyzer in our query or when we index documents, we default to the standard analyzer, which is recommended for full-text search. The standard analyzer performs grammar-based tokenization (e.g., splitting up an input string into tokens by several separators including whitespace, commas, hyphens), defined by the Unicode Standard Annex #29.
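The tokenization behavior can be approximated in a few lines of Python. This is a deliberately simplified sketch: the real standard analyzer implements the full UAX #29 word-boundary rules, while this just splits on non-word characters and lowercases.

```python
import re

def standard_analyze(text):
    """Rough approximation of Elasticsearch's standard analyzer:
    split on non-word separators and lowercase each token.
    (The real analyzer follows Unicode Standard Annex #29.)"""
    return [t.lower() for t in re.findall(r"\w+", text)]

# "blood, round" and "blood round" produce the same token stream,
# which is why both phrase queries match the same indexed document.
print(standard_analyze("blood, round"))   # ['blood', 'round']
print(standard_analyze("blood round"))    # ['blood', 'round']
print(standard_analyze("Full-Text Search"))  # ['full', 'text', 'search']
```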

Example Documents and Query Behavior

Two toy examples of indexed documents, blood_color and blood_shape, demonstrate the query behavior below.


Query

The query searches the fields in all indexed documents to return a matching subset of documents.

query = {
    'query_string': {
        'query': query,
        'fuzziness': fuzziness,
        'fields': ['name', 'description', 'instructions', 'nodes.name', 'nodes.synonyms'],
        'quote_field_suffix': ".exact"
    }
}

Tests

| Query | Behavior |
| --- | --- |
| blood | Returns both documents (blood_color and blood_shape). |
| blood AND magenta | Returns only blood_color. |
| magenta AND cerulean | Returns blood_color, which might be unexpected: 'magenta' and 'cerulean' appear in the same document in the searched fields, albeit in different fields, so the document is still returned. |
| blue AND square | No documents are returned. |
| blue and square | Returns both documents: because 'and' is not capitalized it is treated as just another term rather than an operator, so the search resolves to blue OR and OR square. |
| "round blood" | No documents are returned. |
| "blood, round" | Returns blood_shape. |
| "blood round" | Returns blood_shape: the standard analyzer tokenizes on grammar separators, including the comma here. |
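To make the table concrete, here is a toy re-implementation of the AND/OR semantics over two hypothetical documents. The field contents below are invented to be consistent with the behaviors above (they are not the actual indexed documents), and quoted-phrase queries are omitted for brevity.

```python
import re

def tokenize(text):
    # simplified stand-in for the standard analyzer:
    # lowercase and split on non-word characters
    return [t.lower() for t in re.findall(r"\w+", text)]

# Hypothetical documents, invented to match the behavior table above
DOCS = {
    "blood_color": {"name": "blood color",
                    "description": "magenta and crimson blood",
                    "nodes.synonyms": "cerulean"},
    "blood_shape": {"name": "blood shape",
                    "description": "blood, round and square"},
}

def doc_tokens(doc):
    # terms from every searched field are pooled per document, which is
    # why 'magenta AND cerulean' matches even across different fields
    tokens = set()
    for value in doc.values():
        tokens.update(tokenize(value))
    return tokens

def search(query):
    """Evaluate a flat query: terms joined by a capitalized AND must all
    match; otherwise every word (including a lowercase 'and') is an
    ordinary term and any single match suffices (implicit OR)."""
    parts = query.split()
    if "AND" in parts:
        required = [tokenize(p)[0] for p in parts if p != "AND"]
        matches = lambda toks: all(t in toks for t in required)
    else:
        terms = [tokenize(p)[0] for p in parts]
        matches = lambda toks: any(t in toks for t in terms)
    return sorted(name for name, doc in DOCS.items()
                  if matches(doc_tokens(doc)))

print(search("blood"))              # ['blood_color', 'blood_shape']
print(search("blood AND magenta"))  # ['blood_color']
print(search("blue AND square"))    # []
print(search("blue and square"))    # ['blood_color', 'blood_shape']
```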

Approach

The methodology, from start to finish, reads raw data, annotates it with ontological terms, normalizes those terms, inserts them into a queryable knowledge graph, queries that graph along pre-configured lines, turns the resulting knowledge graphs into documents, and indexes those documents in a full text search engine.

Link

Link ingests raw dbGaP study metadata and performs semantic annotation by

  • Parsing a TOPMed data dictionary XML file to extract variables.
  • Using the Monarch SciGraph named entity recognizer to identify ontology terms.
  • Using the Translator SRI identifier normalization service to
    • Select a preferred identifier for the entity
    • Determine the BioLink types applying to each entity
  • Writing each variable with its annotations as a JSON object to a file.
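The steps above can be sketched as a small pipeline. The record schema, stub services, and the identifier used here are illustrative only; the real implementation calls the SciGraph NER and SRI normalization HTTP APIs.

```python
import json

def link(variables, recognize, normalize):
    """Annotate parsed variables. `recognize` and `normalize` stand in
    for the SciGraph NER and SRI normalization services."""
    records = []
    for var in variables:
        annotations = []
        for ident in recognize(var["description"]):
            preferred, biolink_type = normalize(ident)
            annotations.append({"id": preferred, "type": biolink_type})
        records.append({**var, "annotations": annotations})
    return records

# Stub services for illustration only (real calls would hit HTTP APIs)
recognize = lambda text: ["HP:0001945"] if "fever" in text else []
normalize = lambda ident: (ident, "biolink:PhenotypicFeature")

out = link([{"name": "fever_flag", "description": "subject had fever"}],
           recognize, normalize)
print(json.dumps(out, indent=2))
```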

Load

  • Converts the annotation format written in the steps above to a KGX graph
  • Inserts that graph into a Neo4J database.
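The conversion step could be sketched as follows. The KGX-style node/edge schema, the Biolink category, and the identifier prefix here are illustrative assumptions, not Dug's exact output format.

```python
def to_kgx(records):
    """Convert annotated variable records into KGX-style node and edge
    lists (illustrative schema, not Dug's exact format)."""
    nodes, edges = [], []
    for rec in records:
        var_id = f"TOPMED.VAR:{rec['name']}"  # hypothetical prefix
        nodes.append({"id": var_id, "category": ["biolink:NamedThing"]})
        for ann in rec["annotations"]:
            nodes.append({"id": ann["id"], "category": [ann["type"]]})
            edges.append({"subject": var_id,
                          "predicate": "biolink:related_to",
                          "object": ann["id"]})
    return {"nodes": nodes, "edges": edges}

graph = to_kgx([{"name": "bmi_baseline",
                 "annotations": [{"id": "EFO:0004340",
                                  "type": "biolink:PhenotypicFeature"}]}])
print(len(graph["nodes"]), len(graph["edges"]))  # 2 1
```

Inserting such a graph into Neo4J is then a matter of iterating over the node and edge lists with MERGE statements.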

In phase 1, we query Neo4J to create knowledge graphs. In phase 2 we'll use Neo4J to create a Translator Knowledge Provider API. That API will be integrated using TranQL with other Translator reasoners like ROBOKOP. This will allow us to build more sophisticated graphs spanning federated ontological knowledge.

Crawl

  • Runs those graph queries and caches knowledge graph responses.

Index

  • Consumes knowledge graphs produced by the crawl.
  • Uses connections in the graph to create documents including both full text of variable descriptions and ontology terms.
  • Produces a queryable full text index of the variable set.

Search API

  • Presents an OpenAPI compliant REST interface
  • Provides a scalable microservice suitable as an Internet endpoint.

The Dug Data Development Kit (DDK)

Dug provides a tool chain for the ingest, annotation, knowledge graph representation, query, crawling, indexing, and search of datasets with metadata. The following sections provide an overview of the relevant components.

Ingest

Data formats for harmonized variables appear to be in flux, hence the multiple approaches. More on this soon.

| Command | Description | Example |
| --- | --- | --- |
| bin/dug link | Use NLP, etc., to add ontology identifiers and types. | bin/dug link {input} |
| bin/dug load | Create a knowledge graph database. | bin/dug load {input} |

There are three sets of example metadata files in the repo.

  • A COPDGene dbGaP metadata file is at data/dd.xml
  • A harmonized variable metadata CSV is at data/harmonized_variable_DD.csv
  • Files with names starting with: data/topmed_*

This last format seems to be the go-forward TOPMed harmonized variable form.

These can be run with

bin/dug link data/dd.xml
bin/dug load data/dd_tagged.json

or

bin/dug link data/harmonized_variable_DD.csv
bin/dug load data/harmonized_variable_DD_tagged.json

or

bin/dug link data/topmed_variables_v1.0.csv [--index x]

The first two formats will likely go away. The last format

  • Consists of two sets of files following that naming convention.
  • Combines the link and load phases into link.
  • Optionally allows the --index flag. This will run graph queries and index data in Elasticsearch.

Crawl & Index

| Command | Description | Example |
| --- | --- | --- |
| bin/dug crawl | Execute graph queries and accumulate knowledge graphs in response. | bin/dug crawl |
| bin/dug index | Analyze crawled knowledge graphs and create search engine indices. | bin/dug index |
| bin/dug query | Test the index by querying the search engine from Python. | bin/dug query {text} |

Search API

Exposing the Elasticsearch interface to the internet is strongly discouraged for security reasons. Instead, we have a REST API. We'll use this as a place to enforce a schema and validate requests so that the search engine's network endpoint is strictly internal.

| Command | Description | Example |
| --- | --- | --- |
| bin/dug api | Run the REST API. | bin/dug api [--debug] [--port={int}] |

To call the API endpoint using curl:

| Command | Description | Example |
| --- | --- | --- |
| bin/dug query_api | Call the REST API. | bin/dug query_api {query} |

Development

A docker-compose configuration is provided that runs four services:

  • Redis
  • Neo4J
  • Elasticsearch
  • The Dug search OpenAPI

This system can be started with the following command:

| Command | Description | Example |
| --- | --- | --- |
| bin/dug stack | Runs all services. | bin/dug stack |

Developers: Internally, bin/dug automatically creates an environment file at docker/.env. If you are running in development and are not using a public IP address and hostname, you'll want to create a separate .env file to allow programs to connect to the Docker containers as services. This matters if, for example, you want to run bin/test: the clients in that test need to know how to connect to each of the services they call. Copy the generated docker/.env to docker/.env.dev and change all hostnames to localhost. That should do it. Be sure to keep the generated passwords from docker/.env the same.

Testing

Dug's automated functional tests:

  • Delete the test index
  • Execute the link and load phases for the dbGaP data dictionary and harmonized variables.
  • Execute the crawl and index phases.
  • Execute a number of searches over the generated search index.

| Command | Description | Example |
| --- | --- | --- |
| bin/dug test | Run automated functional tests. | bin/dug test |

Once the test is complete, a command line search shows the contents of the index. Figure 4: A command line query using the Dug Search OpenAPI to query the Elasticsearch index for a term.

Data Formats

TOPMed phenotypic concept data is here.

Release

To release, commit the change and select feature.

Fail on Vulnerability Detection

During PRs, several vulnerability scanners are run. If vulnerabilities are detected, the PR checks will fail and a report will be sent to the GitHub Security Dashboard for viewing. Please ensure the vulnerability is mitigated before continuing the merge to protected branches.

dug's People

Contributors

alexwaldrop, andrew-rti, braswent, cbcunc, cmball1, cnbennett3, cschreep, dependabot[bot], frostyfan109, gaurav, gingin77, hoid, howardlander, jcheadle-rti, joshua-seals, mbacon-renci, muralikarthikk, pj-linebaugh, rencibuild, stevencox, vladimir2217, warrenstephens, yaphetkg, yskale


dug's Issues

Epic: HEAL Harmonization

Introduction

The HEAL program includes hundreds of projects generating highly diverse data sets.

RTI and RENCI, the HEAL Stewards, will

  • Use semantic knowledge graphs to link data from disparate studies to facilitate analysis.
  • Perform preliminary harmonization of data sets towards the HEAL Common Data Elements (CDEs)
  • Provide user friendly interfaces with a biological lens on the data.

HEAL Harmonization

To accomplish this:

  • Ingest HEAL CDEs
    • Create (or locate) machine readable versions of the HEAL CDEs @gaurav
    • Discuss w/NIH how to publish machine readable HEAL CDEs @gaurav
    • Annotate with controlled vocabulary and ontology identifiers @gaurav
      • Map provided NCI Metathesaurus ids to Human Phenotype Ontology (etc) identifiers @gaurav
      • Annotate HEAL CDEs with Monarch's BioLink API SciGraph Named Entity Recognition service (NER) @gaurav
      • Look into alternate NER tools for finding terms in HEAL CDEs (https://github.com/helxplatform/development/issues/804) @gaurav
      • Convert to Biolink and KGX compliant artifacts @gaurav
      • Apply the SRI Normalizer to use Translator preferred identifiers @YaphetKG @gaurav
      • Clean up SciGraph annotations and resend cleaned KGX files to Yaphet @gaurav
    • Optimize TranQL queries to take advantage of Redisgraph performance. @YaphetKG
    • Create new harmonization and translational TranQL queries to @YaphetKG
    • Link variables through phenotypes, chemicals, diseases to CDEs @YaphetKG
    • Index, recording harmonization connections to enable display
  • Ingest HEAL study data as it becomes available @waTeim
  • Update the HeLx/Dug UI to @mbwatson
    • Render
      • An "All" tab for generic search results including lexical matches.
      • A "Harmonized" tab for everything else with markers for CDE, PhenX, Biolink, and other groupings.
    • Allow deployment with or without
      • Authentication
      • App Workspaces
    • Present an information dense display minimizing paging and scrolling

Design


@alexwaldrop @jcheadle-rti

PM Tracking

@hhiles to work with @vgardner-renci and Kathy to report on the following HEAL epics and their related GitHub tickets; give updates directly to PMs or plug into Monday.com

  • Index Data Dictionaries
    • #169
    • #768
    • Collect remaining 500-600 HEAL studies
  • Develop Automated Curation and Search
  • Enable Cloud Based Semantic Search

Ingest network resources

Having to bundle the data in the docker image really isn't fun.

Acceptance criteria:

  • split load and parse into separate tasks
  • 2 loader classes, i.e. HttpLoader and FilesystemLoader
  • Should be able to dispatch them based on target, i.e. "http://...." goes to http loader, "file://..." goes to file loader, default to file loader if not valid URI
  • should be able to run dug crawl http://example.com/my-data-dict.tar.gz
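The scheme-based dispatch described above might look like the following sketch (the class and function names are hypothetical, matching the loader names proposed in the acceptance criteria):

```python
from urllib.parse import urlparse

class FilesystemLoader:
    def load(self, target):
        return f"reading {target} from disk"

class HttpLoader:
    def load(self, target):
        return f"fetching {target} over HTTP"

def get_loader(target):
    """Dispatch on the target's URI scheme; file:// URIs and anything
    that is not a recognized URI fall back to the filesystem loader."""
    scheme = urlparse(target).scheme
    if scheme in ("http", "https"):
        return HttpLoader()
    return FilesystemLoader()

print(type(get_loader("http://example.com/dict.tar.gz")).__name__)  # HttpLoader
print(type(get_loader("file:///data/dd.xml")).__name__)             # FilesystemLoader
print(type(get_loader("data/dd.xml")).__name__)                     # FilesystemLoader
```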

Error handling on empty results

When running a search, if there are no matching entries in the ES index, the code will raise a KeyError trying to access a nested dictionary of the result hits. This should be handled more gracefully.
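A minimal sketch of the graceful fallback, assuming the standard Elasticsearch response shape where hits live under response['hits']['hits']:

```python
def extract_hits(response):
    """Return search hits, tolerating an empty or error-shaped
    Elasticsearch response instead of raising KeyError."""
    return response.get("hits", {}).get("hits", [])

print(extract_hits({}))                                  # []
print(extract_hits({"hits": {"hits": [{"_id": "x"}]}}))  # [{'_id': 'x'}]
```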

Automatic backups

We need automatic backups of our ES index to easily revert or restore data

Incomplete concept results

Currently, dug searches frequently yield "incomplete" concepts which lack key identifying attributes such as name, description, and/or type fields.

This is handled in the UI by removing such results (note that those without a type are still shown). There is also an error in the console, but it is not explicitly revealed to the user that results are removed.


For example, a search for asthma purportedly yields 51 results, although in reality only 24 are shown to the user, with the other 27 being removed due to a lack of information.

  • A search for epicanthus yields 4 results on the first page, but only 3 are shown
  • A search for aggravated by returns "Aggravated by" which lacks a type concept

The dug annotator currently relies naively on the ontology api to retrieve information such as names/descriptions of concepts.

The annotator normalizes the node to its preferred identifier but neglects to account for the specific ontologies that the api supports, and thus will oftentimes send out requests using unsupported ontologies and return empty-handed.

This is especially apparent, for example, with chemical concepts, which are generally normalized to PUBCHEM.COMPOUND. This ontology is not supported by the annotation api and will always return no information about the concept. In turn, chemical concepts are very frequently removed from the UI due to a lack of information about them.

Another consideration is that sometimes the ontology api returns incomplete information. For example, https://api.monarchinitiative.org/api/bioentity/HP%3A0025285 returns a name and description, but lacks a type/category.

Tentative steps

  • Normalize node identifiers for annotation while accounting for supported ontologies in relevant apis
  • Consider the use of node-norm as a fallback mechanism when initial requests to the ontology api are lacking in important information (specifically, node type)
  • If descriptionless concepts are still somewhat frequent (e.g. if we discover that some nodes can't be normalized to supported ontologies), consider adapting the UI to display concepts without descriptions.

Change sync ARAGORN query to be truly synchronous

Currently, when ARAGORN is queried, it asynchronously makes calls to other services that return its data, because we don't store the underlying knowledge graph locally. We want to change that so we have the ability to call ARAGORN synchronously, using a locally stored knowledge graph. Instead of standing up a whole other ARAGORN server just to service the centralized workflow, we're creating two new endpoints and some ingresses that forward requests from a separate base URL to the correct endpoints on the core server.

  • Test ingresses that forward traffic from 2 different base URLs to a single main base URL with certain URI prefixes (see below)
  • Add these ingresses to the main aragorn deployment
  • Add service URL for plater to the config file for the deployment (check translator-devops repo for the deployment.yml file)
  • Change sync endpoint to query robokop (through plater) instead of strider

To-do

  • Figure out whether I need to run other workflows myself in the sync call or whether plater will take care of that for me

Ingest SPARC data

Given sparc KGX in JSON format, ingest it into our ES index

@YaphetKG please work with Howard to flesh out detailed steps

Harmonization phase2

Incorporate CDEs, PhenX, and LOINC harmonization into Dug

Acceptance Criteria:

  • Ingest variable cross-references from PhenX into the knowledge graph
  • Write queries that make associations between CDEs, PhenX, and LOINC

Improve directory crawling

Currently the only way to crawl a directory is by bash scripting, calling dug crawl on each individual file. This is not ideal for our use case, so we need to be able to specify either a single file or a directory to the dug CLI without having to script extra steps around it.

investigate new "expand" icons for dug interface

The icon should clearly indicate to users that an object can be interacted with. Users have commented that Dug search result tiles, as well as the collapsing drawer in the Workspace, are not intuitively things they would "open".

  • @connectbo and @mbwatson investigate new icon options
  • @hhiles update documentation to guide users through card expansion


Variable View

Dates TBD by 5/24

For MVP dev testing

  • @hhiles set up meeting to discuss prod version of variable view - invite jeff, ginnie, mac, matt w
  • @gingin77 create image for @waTeim to pull
  • @hhiles and kira will play in this space for 1-2 weeks
  • @hhiles meet with @gingin77 and share screen to show any bugs/problems that are found during kira/hannah testing

For Beta with BDC @gingin77

  • Define Dug Score for users
  • After zoom in, being able to select studies for further search - prioritize variables for "search within search" functionality
  • Some kind of popup/help text to explain functionality of the push pins

Then

  • @hhiles will work with kira to ID HEAL BDC users to beta test for 2-3 weeks - PREFERABLY JACKIE, Ben and Adrian would be helpful too
  • @hhiles will set up a workshop meeting for 2-3 weeks from their access start date; discuss issues and what is liked

Update Quickstart

The Quickstart in the README is a little out of date. File paths and search params need to be updated. Go through the guide and make sure everything lines up correctly.

identify user metrics for dug

Baseline user metrics need to be determined for each deployment of Dug. What we want to know:

  • how many people are using Dug at each instance
  • number of unique users
  • number of searches completed by users
  • average time spent using Dug
  • do users perform multiple searches in 1 session? or 1 search their entire session
  • ratio of users who "dig deeper" into concepts and KGs : users who perform 1-2 clicks then end session

BY END OF AUGUST

  • @frostyfan109 review GA4 offerings
  • @frostyfan109 review the GA4 tokens and analytics libraries; make updates as needed
  • @frostyfan109 identify what new events will need to be seen in GA4 (eg, shopping cart, clicking through CDEs, etc)
  • @hhiles will compile a complete list of deployments to be represented in GA4
  • make sure each required deployment of Dug is connected to MixPanel GA4 @frostyfan109
  • set up dashboards to capture above analytics @hhiles

Then,

  • generate initial reports, share with leadership @hhiles
  • tweak above questions and generate a second wave of reports after 1-2 months @hhiles

Simplification and Normalization of Standard Apps Deployment

Deployment of applications on Kubernetes is facilitated by Helm and requires some amount of maintenance to keep running smoothly; it would be useful to organize this work to help facilitate that.

  • Add a feature to help in private deployments
  • Merge previous work into current develop and master branches
  • Maintain/suggest a version numbering scheme
  • Collect current work and set numbers.
  • Convert heal deployments
  • Convert BDC deployments
  • Add metrics tracking data processing efficiency

updated internal documentation

Technical documentation for the Dug team needs updating.

  • write up "how to manually get metadata from HEAL into Dug" page @alexwaldrop
  • Roger README needs updating @YaphetKG
    • how to install dug in the cloud
    • helm charts as a component of dug (not necessarily a whole section; but make sure you reference it)

Tranql Support for SKIP and LIMIT as keywords instead of where conditions

Latest implementation of Tranql + redis uses skip and limit as part of the where clause.

To improve this and make the syntax more familiar, we need to make skip and limit part of the TranQL query keywords.

Current implementation:

select d2:disease<-gene<-[interacts_with]-chemical_substance->disorder_involving_pain:disease
  from "redis:test"
 where disorder_involving_pain='MONDO:0021668' 
   and limit=200 
   and skip=200

Proposed

select d2:disease<-gene<-[interacts_with]-chemical_substance->disorder_involving_pain:disease
  from "redis:test"
 where disorder_involving_pain='MONDO:0021668' 
 limit 200 
 skip 200

Error handling around reading tags file

Encountered this error while seeding a Dug deployment:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dug/.local/lib/python3.8/site-packages/dug/core.py", line 693, in <module>
    main()
  File "/home/dug/.local/lib/python3.8/site-packages/dug/core.py", line 672, in main
    crawler.crawl()
  File "/home/dug/.local/lib/python3.8/site-packages/dug/core.py", line 454, in crawl
    self.elements = self.parser.parse(self.crawl_file)
  File "/home/dug/.local/lib/python3.8/site-packages/dug/parsers.py", line 184, in parse
    tags = json.load(stream)
  File "/usr/local/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

The code is trying to load a JSON file:

class TOPMedTagParser:

    @staticmethod
    def parse(input_file):
        """..."""
        tags_input_file = input_file.replace(".csv", ".json").replace("_variables_", "_tags_")
        if not os.path.exists(tags_input_file):
            raise ValueError(f"Accompanying tags file: {tags_input_file} must exist.")

        # Read in human-created tags/concepts from json file before reading in elements
        with open(tags_input_file, "r") as stream:
            tags = json.load(stream)
        ...

First, let's determine whether this file should be loaded at all, and then add better error handling around bad files. Alternatively, if it should load, let's fix it.
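A sketch of the suggested error handling (a hypothetical helper, not Dug's actual code): wrap the JSON parse so a malformed tags file produces a descriptive error instead of a bare JSONDecodeError.

```python
import json
import os

def load_tags(path):
    """Load the accompanying tags file, raising a descriptive ValueError
    for a missing or malformed file (sketch, not Dug's actual code)."""
    if not os.path.exists(path):
        raise ValueError(f"Accompanying tags file {path} must exist.")
    with open(path) as stream:
        try:
            return json.load(stream)
        except json.JSONDecodeError as err:
            raise ValueError(
                f"Tags file {path} is not valid JSON: {err}") from err
```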

Search UI Autocomplete

We'd like a way to add search bar autocompletion suggestions (and possibly predictions?) in the search UI akin to Google.

Currently, a user is largely left on their own when navigating between dug searches in the UI. Ideally, for example, if someone searches "heart", the UI could suggest a number of direct completions, e.g. ['heart', 'heart disease', 'heart failure', ...]. This may also include context-aware predictions, e.g. searching asthma might yield suggestions ['asthma', 'dyspnea', 'hypoxemia', 'lung disease', ...].

Needs to be performant to scale while providing results that a user would actually find useful.

  1. Finish #226 and consider whether or not this functionality would be suitable for autocompletion on helx-ui as well.
  2. If not, explore elasticsearch implementation.


updated user-facing documentation

Dug's user-facing documentation is out of date.

  • README.md for /dug needs reviewing and updating
  • Dug README needs checking to see if other Dug-adjacent repos (eg kg-explorer) are referenced; update their documentation if they are
  • /support page on the prod Dug instances needs updating
  • Pull Dug documentation out of HeLx GitBook; make sure it's updated

Patchy results for gene/drug queries

The use of genes and drugs to get to relevant terms seems a little hit or miss

So in the realm of asthma
"Albuterol" brings back asthma
"ADRB2" does not

For hypertension
"ACE" brings back hypertension
"lisinopril" does not

CDE Harmonization

The front-end is receiving CDE studies with empty properties.

Screenshot from 2022-02-25 10-37-05

For reference, this screenshot shows how the cde studies compare to the dbGaP ones:

Screenshot from 2022-02-25 10-42-00

Bio-NER for Dug on HEAL

For discussion:

Most HEAL data is not extensively curated. Many are like those in NIDA-Share, providing narrative prose documents like the ones in the "Study Documents" box here.

Can we index these documents lexically (regular Elastic) and semantically (biomedical NER)?

We would like to do this in order to then align identifiers discovered via NER with HEAL CDE linkages.

The protocol PDF at the link above, for example, is brimming with biomedical terms we could use for indexing.

While there are plugins for indexing PDF in Elastic, we need access to the text to map ontology terms so I'm guessing the plugins won't do what we need. Is that correct, or do they provide hooks for us to use the text in arbitrary ways?

(We're also not ready to do this with a lot of text until RENCI has an in house bio-NER)

If the plugins don't provide hooks, what are our best options for a design to index documents like this lexically and semantically (i.e. NER)?

FYI @gaurav @satusky @cbizon


Tranql - Redis connection bug

Description:
We have had an incident with TranQL where connections to the Redis k8s service sometimes go stale and we get a "connection reset by peer" error when running a TranQL query.

From the logs

 File "/home/tranql/.local/lib/python3.7/site-packages/redis/connection.py", line 752, in read_response
    response = self._parser.read_response()
  File "/home/tranql/.local/lib/python3.7/site-packages/redis/connection.py", line 322, in read_response
    raw = self._buffer.readline()
  File "/home/tranql/.local/lib/python3.7/site-packages/redis/connection.py", line 254, in readline
    self._read_from_socket()
  File "/home/tranql/.local/lib/python3.7/site-packages/redis/connection.py", line 196, in _read_from_socket
    data = self._sock.recv(socket_read_size)
ConnectionResetError: [Errno 104] Connection reset by peer

Resolution:
As an immediate resolution, restarting the TranQL pod seemed to resolve the error.

Permanent solution:
Test TranQL with multiple queries, determine the cause of the issue, and implement a solid resolution.

Semantic Search Metrics

Produce indexing and search performance metrics addressing the following (moved from https://github.com/helxplatform/development/issues/693):

  • Statistics: Quantify ingested data elements by biolink type and identifier class
  • Latency: Quantify the elapsed time required for pipeline phases
  • Scale: Quantify resource consumption in terms of CPU, memory, and disk.
  • Relevance: Quantify search result relevance.

Give Dug users context for missing search tiles

Some searches hide results that Dug deems irrelevant to the user. See the screenshot for "HIV", where 13 results were found but only 4 cards are shown.

@mbwatson: update the search results text to read something like: n results for "concept" (1 page). x results were removed because they do not contain sufficiently useful information.


Question: What should a search for CHD return?

CHD returns only CAD, and there is nothing under "Knowledge Graph". I assume from this that (1) there are no study variables associated with CHD, (2) there are no study variables annotated with anything linked to CHD in the graph, and (3) CHD->CAD is happening via some kind of lexical match?

Should the UI show something like "Interpreting CHD as Congenital Heart Disease. No HEAL studies annotated with Congenital Heart Disease" to let the user know that we didn't misunderstand the query, but nothing matches it?

On (2), I'm somewhat surprised because in ROBOKOP, CHD is associated with something like 450 phenotypes. I would think at least one of them would be represented in this dataset, but maybe not...

Investigate and use Dag Factory for Roger workflow

  • Implement a way to generate workflows dynamically through a RESTful endpoint.
  • The user submits a minimal workflow declaration as YAML; an Airflow DAG is constructed on the fly and run. One approach is to use Dag Factory. Based on discussions with @stevencox, a reasonable approach might be to template this with a templating engine (Jinja) and a minimal data structure for the workflow definition.
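The templating approach above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the workflow declaration and DAG shape are made up, and the stdlib `string.Template` stands in for Jinja to keep the sketch dependency-free.

```python
import string
import textwrap

# Hypothetical minimal workflow declaration (in practice parsed from YAML).
workflow = {"dag_id": "roger_annotate", "schedule": "@daily", "task_id": "annotate"}

# A DAG-file template; Jinja would fill the same role with richer syntax
# (loops over tasks, conditionals, etc.).
DAG_TEMPLATE = string.Template(textwrap.dedent("""\
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(dag_id="$dag_id", schedule_interval="$schedule") as dag:
        $task_id = BashOperator(task_id="$task_id", bash_command="echo run")
    """))

# Render Python source for a DAG from the declaration.
dag_source = DAG_TEMPLATE.substitute(workflow)
```

The rendered source could then be written into Airflow's DAGs folder, where the scheduler picks it up.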

HEAL - Enable Ingest of NIDA metadata (NIDA Datashare Parser)

We want the ability to ingest NIDA Datashare metadata for the HEAL initiative. NIDA is a big player within the HEAL initiative and is the planned repository for around 300 studies, over half of the studies in the initiative. They have well-documented metadata, but only a handful of studies contain machine-readable data dictionaries (as .xlsx).

I wrote a script to transform their data dictionaries to JSON blobs that can be ingested by Dug. This issue tracks the progress from this metadata generation to making necessary code enhancements to allow for a (local) dug deployment with NIDA metadata, for demonstration purposes.
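The transformation step can be sketched roughly as below. This is not the actual script: the row fields and the output field names are assumptions standing in for the NIDA dictionary columns and Dug's real ingest schema, and the real script would read the rows from .xlsx (e.g. with openpyxl or pandas).

```python
import json

# Hypothetical data-dictionary rows as read from a NIDA .xlsx file.
rows = [
    {"variable_name": "age", "description": "Participant age in years"},
    {"variable_name": "opioid_use", "description": "Self-reported opioid use"},
]

def to_dug_elements(study_id, rows):
    """Shape dictionary rows into JSON blobs; the field names here are
    illustrative, not Dug's actual ingest schema."""
    return [
        {
            "element_id": f"{study_id}.{r['variable_name']}",
            "element_name": r["variable_name"],
            "element_desc": r["description"],
            "collection_id": study_id,
        }
        for r in rows
    ]

blobs = to_dug_elements("NIDA-CTN-0001", rows)
print(json.dumps(blobs, indent=2))
```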

Biolink type-based filtering

Offer a way for users to filter search results in the UI based on their biolink types.

A search on concepts_index using the /search endpoint will currently return concepts of unrestricted biolink types. If a user searches asthma, for example, but is specifically interested in phenotypes related to asthma, they have to look through a bunch of extraneous results to get to what they really want to see.

There should be some sort of filter menu to zero-in on results more easily in the UI.

  • There's also the consideration of adding the filtering functionality directly into Dug for more effective API pagination (i.e. if someone searches for asthma but only wants to see phenotypes, they may only get 3 results on the first page if filtering is done in the UI).
  • It's probably best to leave it to the UI, at least for now.
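If the filtering were done server-side, the query could be shaped roughly as below. This is a sketch of the Elasticsearch request body only; using "type" as the biolink-type field name is an assumption about the concepts index mapping, not Dug's confirmed schema.

```python
# Build an Elasticsearch query that restricts hits to given biolink types.
def filtered_search(query, biolink_types=None):
    body = {"query": {"bool": {"must": [{"match": {"name": query}}]}}}
    if biolink_types:
        # A terms filter restricts hits without affecting relevance scoring,
        # so pagination counts reflect only the types the user asked for.
        body["query"]["bool"]["filter"] = [{"terms": {"type": biolink_types}}]
    return body

body = filtered_search("asthma", ["biolink:PhenotypicFeature"])
```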

Tranql Auto-complete

For Tranql, when users type in the name of an entity (such as asthma), we currently use a translator service to resolve it to actual identifiers (e.g. MONDO:0004979).

This has some issues: the resulting curies may or may not exist in the graph, and the curies need to be normalized before being sent to Tranql for answers.

A better approach might be to use Redis search indexes via Tranql to resolve names to node ids that we know exist. This article (http://www.odbms.org/2020/04/introducing-redisgraph-2-0/) describes a problem that seems to fit this use case, where graph nodes are looked up using indexes. This could give us much better and more accurate results.

It would be worth investigating this and implementing a more reliable solution.
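The proposed flow can be sketched as "prefer identifiers known to exist in the graph, fall back to the external resolver otherwise". The lookup table and function names below are hypothetical; in the proposed design the local lookup would be a RedisGraph full-text index query rather than a dict.

```python
# Illustrative stand-in for a RedisGraph full-text index over node names.
KNOWN_NODES = {"asthma": "MONDO:0004979"}

def resolve(name, external_resolver):
    """Resolve an entity name to a curie, preferring ids present in the graph."""
    curie = KNOWN_NODES.get(name.lower())
    if curie:
        return curie  # guaranteed to exist in the graph, no normalization needed
    return external_resolver(name)
```

The key property is that locally resolved curies never need a round trip to the normalizer, since they come straight from the graph.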

Consider using NLTK to get rid of stop words in search inputs

This query

dug search -q "thomas tank" -t "concepts"

returns about what you'd expect: nothing.

But this query

dug search -q "thomas the tank" -t "concepts"

returns 35 KB worth of results. I haven't looked through all of them, but I'm guessing that none of them truly pertain to Thomas the Tank...

One possible way of addressing this would be to use NLTK to remove any stop words from the input. See this link (https://www.geeksforgeeks.org/removing-stop-words-nltk-python/) for examples of doing this in Python.

I really don't know how big a deal this is, but the behavior did seem a little strange to me.
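The stop-word filtering itself is simple. The sketch below uses a tiny illustrative stoplist to stay self-contained; NLTK's `nltk.corpus.stopwords.words("english")` provides the fuller list the linked article uses.

```python
# Tiny illustrative stoplist; substitute NLTK's English stopwords corpus
# in a real implementation.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "is"}

def strip_stop_words(query):
    """Drop stop words from a search query before it hits the index."""
    return " ".join(w for w in query.split() if w.lower() not in STOP_WORDS)

print(strip_stop_words("thomas the tank"))  # -> thomas tank
```

With this in place, "thomas the tank" and "thomas tank" would produce the same (empty) result set.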

Create status check for Dug

Dug relies on several other backend services (e.g. elastic, redis). We can't crawl or search without them, which means we can't start up Dug on kubernetes until those other services come online. Right now we check it directly with some scripted curl calls, but it would be an improvement if we could just ask Dug itself if it was in a runnable state.

Acceptance criteria:

  • add a CLI integration so that a user can call dug status and receive some meaningful output.

    • if everything is OK, exit with exit code 0. Otherwise exit with exit code 1.
    • CLI output should look something like:
    $ dug status
       - elasticsearch (elastic.svc:9200) ...ok
       - redis (redis.svc:6379) ...ok
       - ...
  • add a REST API endpoint, e.g. /health, /_health, etc.

    • Should return HTTP 200 if everything is OK, HTTP 500 if something is down
    • Response should have meaningful info, e.g.
      {
       "status": "down",
       "services": {
         "redis": {
           "status": "ok"
         },
         "elasticsearch": {
           "status": "down",
           "error": "unable to reach http://elasticsearch.svc, Address unreachable"
         }
       }
      }
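The check itself can be as simple as attempting a TCP connection to each backend. The sketch below is a minimal, hedged take on the acceptance criteria above; the service list and hostnames are examples, not Dug's actual configuration, and a real check might prefer each service's own ping/health API over a raw socket.

```python
import socket

# Example service map; hostnames/ports are illustrative.
SERVICES = {"elasticsearch": ("elastic.svc", 9200), "redis": ("redis.svc", 6379)}

def check(host, port, timeout=2.0):
    """Return an ok/down record for one backend service."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return {"status": "ok"}
    except OSError as e:
        return {"status": "down", "error": f"unable to reach {host}:{port}, {e}"}

def status(services=SERVICES):
    """Aggregate per-service checks into the response shape sketched above."""
    results = {name: check(host, port) for name, (host, port) in services.items()}
    overall = "ok" if all(r["status"] == "ok" for r in results.values()) else "down"
    return {"status": overall, "services": results}
```

The CLI can then exit 0 or 1 based on the overall status, and the REST endpoint can map "ok"/"down" to HTTP 200/500.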

Remove incorrect ERROR logs

Dug reports "error" messages when some of the API services we use, e.g. the normalization service, don't have a result.

These aren't errors, so we should record them as info or warning statements.

Annotation Performance Tweaks for Dug deployment

Calls to external services, specifically Monarch's NER service, take most of our execution time in the annotation step of Roger/Dug.
We have also noticed that the service starts failing under heavy load.
To mitigate this issue and improve throughput, we are investigating the following options:

  • Deploy SciGraph (Monarch's NER) in the NER namespace here in Sterling.
  • If simple replication of the above deployment doesn't buy us much, try putting an HTTP accelerator proxy (Varnish Cache) in front of it.
  • Add parameters in the search chart to change the URLs for the annotation and synonymizer services, allowing changes across environments. These new values would need to default to the current ones so that users outside the Sterling cluster can still deploy.
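A complementary, client-side option is to memoize NER calls, since many study variables share identical descriptions. A hedged sketch, where `call_ner` is a hypothetical stand-in for the Monarch NER request:

```python
from functools import lru_cache

# Counter so the sketch can demonstrate that the cache works.
calls = {"n": 0}

def call_ner(text):
    """Hypothetical stand-in for the HTTP call to the NER service."""
    calls["n"] += 1
    return [text.lower()]  # placeholder for returned annotations

@lru_cache(maxsize=4096)
def annotate(text):
    # Tuples are hashable, so results can be cached safely.
    return tuple(call_ner(text))

annotate("Heart Disease")
annotate("Heart Disease")
assert calls["n"] == 1  # second call served from cache
```

Caching shrinks load on the external service in proportion to how repetitive the input text is, independent of replication or a Varnish proxy.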

CLI arguments should be case-insensitive

Hit this error today:

dug.core.parsers.ParserNotFoundException: Cannot find parser of type ' "TOPMedTag"'
Supported parsers: dbgap, topmedtag

It would be nice if the CLI were case-insensitive with regard to the parser argument (or other args, for that matter).
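The lookup fix is small. The sketch below is illustrative, not Dug's actual parser registry; it also tolerates the stray quotes visible in the error above.

```python
# Hypothetical parser registry mirroring the names in the error message.
PARSERS = {"dbgap": "DbGaPParser", "topmedtag": "TOPMedTagParser"}

class ParserNotFoundException(Exception):
    pass

def get_parser(name):
    """Look up a parser case-insensitively, tolerating whitespace and quotes."""
    key = name.strip().strip('"').lower()
    try:
        return PARSERS[key]
    except KeyError:
        raise ParserNotFoundException(
            f"Cannot find parser of type {name!r}. "
            f"Supported parsers: {', '.join(PARSERS)}")

assert get_parser(' "TOPMedTag"') == "TOPMedTagParser"
```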

A user can create a json file of Dug search results to take outside of Dug for further analyses

This work satisfies the need for a Dug shopping cart. It is broken into 3 buckets for v1:

  • From the card view, a user can click studies and select 1 or more studies to send to the shopping cart.

  • After expanding the card view and selecting studies, a user can expand individual studies and select 1 or more variables to send to the shopping cart.

  • A user can view their shopping cart with a list of all selected items for export to JSON.
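The export step itself is straightforward. A minimal sketch, where the cart structure is an assumption about what the UI would collect, not Dug's actual schema:

```python
import json

# Hypothetical cart contents collected by the UI.
cart = {
    "studies": ["phs000007"],
    "variables": [{"study": "phs000007", "variable": "phv00054119"}],
}

def export_cart(cart, path):
    """Write the selected items to a JSON file the user can take elsewhere."""
    with open(path, "w") as f:
        json.dump(cart, f, indent=2)
    return path
```

The resulting file can then be loaded into any downstream analysis tool that reads JSON.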

Handle http status codes for synonyms

The Onto synonyms API client detects and handles errors based on whether the content fails to parse as JSON:

except json.decoder.JSONDecodeError as e:

It should probably detect errors based on the response's HTTP status code instead.

On the latest onto.renci.org instance, a 400 HTTP status code is returned with valid JSON, but that JSON is not an array. This causes an exception further up the call chain when adding synonyms to the actual data elements.
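A status-code-first version of the handling could look like the sketch below. `FakeResponse` is a stand-in for a `requests.Response` so the logic is testable without a network call; the function name is hypothetical.

```python
class FakeResponse:
    """Minimal stand-in for requests.Response."""
    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload

def get_synonyms(response):
    # Check the HTTP status before trusting the body, rather than relying
    # on a JSONDecodeError to signal failure.
    if response.status_code != 200:
        return []
    payload = response.json()
    # Guard against valid-but-unexpected JSON shapes (e.g. an error object
    # instead of an array), which is exactly the onto.renci.org case above.
    return payload if isinstance(payload, list) else []

assert get_synonyms(FakeResponse(400, {"error": "bad curie"})) == []
```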

How to handle "disease" as text in a search?

Searching for something like "heart disease" returns a lot of good hits. But it also returns things like "brain disease" and "polydactyly (disease)".

My suspicion is that the text "disease" is producing these matches. Is there a way to down-weight those results? Maybe "disease" should be removed from names? But it seems like you need it there for "heart disease" to match "heart disease"...

Rename project

This project will be renamed to helx-search to be published on the PyPI repo with a more descriptive name. We will also organize it as a namespace package, with "helx" as the parent namespace (other helx projects will subsequently be moved under this namespace).

Whereas currently users access it as

from dug import ...

The namespace will change to

from helx.search import ...

Likewise CLI tooling will change, from

dug crawl ...
dug search ...

to

helx crawl ...
helx search ...

Search class init does not respect arguments

The Search class accepts several parameters in its init method that it does not actually use. For example, it accepts a "host" param, but in the body of the method we find:

        self.host = os.environ.get('ELASTIC_API_HOST', 'localhost')

So the argument isn't used. There are also other values that are not passed in as params but are only configurable via env vars. They should be parameterized.

Acceptance criteria:

  • the Search class is configured entirely via init or other methods.
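A minimal sketch of an init that respects its arguments, falling back to the environment only when a value is not supplied explicitly. `ELASTIC_API_HOST` comes from the snippet above; `ELASTIC_API_PORT` is a hypothetical example of another value to parameterize.

```python
import os

class Search:
    def __init__(self, host=None, port=None):
        # Explicit arguments win; env vars are only a fallback.
        self.host = host or os.environ.get("ELASTIC_API_HOST", "localhost")
        self.port = int(port or os.environ.get("ELASTIC_API_PORT", 9200))

s = Search(host="elastic.svc")
assert s.host == "elastic.svc"  # the explicit argument wins
```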
