
Dug: digging up dark data

Dug applies semantic web and knowledge graph methods to improve the FAIR-ness of research data.

As an example, dbGaP is a rich source of metadata about biomedical knowledge derived from clinical research like the underutilized TOPMed data sets. A key obstacle to leveraging this knowledge is the lack of researcher tools to navigate from a set of concepts of interest towards relevant study variables. In a word, search.

While other approaches to searching this data exist, our focus is semantic search: For us, "relevant" is defined as having a basis in curated, peer reviewed ontologically represented biomedical knowledge. Given a search term, Dug returns results that are related based on connections in ontological biomedical knowledge graphs.

To achieve this, we annotate study metadata with terms from biomedical ontologies, contextualize them within a unifying upper ontology allowing study data to be federated with larger knowledge graphs, and create a full text search index based on those knowledge graphs.

Quickstart

NOTE: You must run make init once you've cloned the repo to enable the commit-msg git hook so that conventional commits will apply automatically.

To install Dug in your environment, run make install. Alternatively,

pip install -r requirements.txt
pip install -e .

To run the tests, run make test. Alternatively,

pytest .

(The makefile currently only executes unit tests, whereas running pytest in the root directory will execute unit tests, integration tests, and doctests)

To bring up the backend services, run:

docker-compose up

If you're running dug-specific commands (i.e., dug) outside the Docker container, you have to make sure the environment variables are set. Also, make sure all hostnames are correct for the environment you're running in. For example, to connect to Dug's backend services from outside the container (but in a shell environment), run:

source .env
export $(cut -d= -f1 .env)
export ELASTIC_API_HOST=localhost
export REDIS_HOST=localhost

(These values are already set up in the running docker container)

Then you can actually crawl the data:

dug crawl tests/integration/data/test_variables_v1.0.csv -p "TOPMedTag"

After crawling, you can search:

dug search -q "vein" -t "concepts"
dug search -q "vein" -t "variables" -k "concept=UBERON:0001638"

You can also query Dug's REST API:

query='{"index": "concepts_index", "query": "vein"}'

curl --data "$query" \
     --header "Content-Type: application/json" \
     --request POST \
     http://localhost:5551/search

Additional Notes

If you want to change or re-configure the dug service authentication credentials to be different from the defaults, run:

mv .env .env.bak
DATA_DIR=/path/to/dug-data-storage RANDOM=$RANDOM envsubst < .env.template > .env
docker-compose down
docker system prune -a  # NOTE: This will remove *all* images, layers, and volumes.
                        #       Be sure you're okay with this before running.
docker-compose up

The Dug Framework

Dug's ingest uses the Biolink upper ontology to annotate knowledge graphs and structure queries used to drive full text indexing and search. It uses Monarch Initiative APIs to perform named entity recognition on natural language prose to extract ontology identifiers. It also uses Translator normalization services to find preferred identifiers and Biolink types for each extracted identifier. The final step of ingest is to represent the annotated data in a Neo4J graph database.

Dug's integration phase uses Translator's Plater and Automat to generate a Reasoner Standard API compliant service and integrates that service into TranQL. This enables queries that span TOPMed, ROBOKOP, and other reasoners.

Dug's indexing & search phase queries the graph infrastructure and analyzes the resulting graphs. These are used to create documents associating natural language terms with annotations and with the annotated variables and studies.

Dug will then generate Translator knowledge sources for the annotated variables and present them for query via TranQL.

Knowledge Graphs

Dug's core data structure is the knowledge graph. Here's a query of a COPDGene knowledge graph created by Dug from harmonized TOPMed variables.

Figure 1: A Biolink knowledge graph of COPDGene metadata. It shows the relationship between the biological process "Sleep" and a meta variable. The highlighted node is a TOPMed meta variable, or harmonized variable. It is in turn associated with variables connected to two studies in the data set. By linking additional ontological terms to the biological process sleep, we will be able to provide progressively more helpful search results rooted in curated biomedical knowledge.

And one more example to illustrate the knowledge model we use to inform the search index. Figure 2: The TOPMed harmonized variable is highlighted, showing its relationships with the ontology term for Heart Failure and with a specific study variable. Several similar disease, harmonized variable, variable, and study relationships are also shown.

These graphs are used to create the documents we'll add to the search index to power full text search.

In phase 1, we use Neo4J to build queries. In subsequent phases, we integrate other semantic services using TranQL.

Figure 3: A TranQL knowledge graph query response. Integrating TOPMed harmonized variables as a Translator service called by TranQL allows us to query the federation of Translator ontological connections as a precursor to indexing. This includes chemical, phenotypic, disease, cell type, genetic, and other ontologies from sources like ROBOKOP, as well as clinical aggregate data from sources like ICEES. The query shown links cholesterol to "LDL in Blood", a harmonized TOPMed variable. That variable is, in turn, linked to source variables, and each of those is linked to its source study.

Elasticsearch Logic

Elasticsearch contains the indexed knowledge graphs against which we perform full-text queries. Our query currently supports conditionals (AND, OR, NOT) as well as exact matching on quoted terms. Because we don't specify an analyzer in our query or when we index documents, we default to the standard analyzer, which is recommended for full-text search. The standard analyzer performs grammar-based tokenization (e.g., splitting up an input string into tokens by several separators including whitespace, commas, hyphens), defined by the Unicode Standard Annex #29.
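The tokenization behavior can be approximated in a few lines of Python. This is a deliberately simplified sketch: the real standard analyzer implements the full UAX #29 word-boundary rules, while this just splits on non-word characters and lowercases.

```python
import re

def standard_analyze(text):
    """Rough approximation of Elasticsearch's standard analyzer:
    split on non-word separators and lowercase each token.
    (The real analyzer follows Unicode Standard Annex #29.)"""
    return [t.lower() for t in re.findall(r"\w+", text)]

# "blood, round" and "blood round" produce the same token stream,
# which is why both phrase queries match the same indexed document.
print(standard_analyze("blood, round"))   # ['blood', 'round']
print(standard_analyze("blood round"))    # ['blood', 'round']
print(standard_analyze("Full-Text Search"))  # ['full', 'text', 'search']
```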

Example Documents and Query Behavior

Two toy examples of indexed documents, blood_color and blood_shape, demonstrate the query behavior below.


Query

The query searches the fields in all indexed documents to return a matching subset of documents.

query = {
    'query_string': {
        'query': query,
        'fuzziness': fuzziness,
        'fields': ['name', 'description', 'instructions', 'nodes.name', 'nodes.synonyms'],
        'quote_field_suffix': ".exact"
    }
}

Tests

| Query | Behavior |
| --- | --- |
| blood | Returns both documents (blood_color and blood_shape). |
| blood AND magenta | Returns only blood_color. |
| magenta AND cerulean | Returns blood_color, which might be unexpected: 'magenta' and 'cerulean' appear in the same document in the searched fields, albeit in different fields, so the document is still returned. |
| blue AND square | No documents are returned. |
| blue and square | Returns both documents: because 'and' is not capitalized it is treated as just another term rather than an operator, so the search resolves to blue OR and OR square. |
| "round blood" | No documents are returned. |
| "blood, round" | Returns blood_shape. |
| "blood round" | Returns blood_shape: the standard analyzer tokenizes on grammar separators, including the comma here. |
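To make the table concrete, here is a toy re-implementation of the AND/OR semantics over two hypothetical documents. The field contents below are invented to be consistent with the behaviors above (they are not the actual indexed documents), and quoted-phrase queries are omitted for brevity.

```python
import re

def tokenize(text):
    # simplified stand-in for the standard analyzer:
    # lowercase and split on non-word characters
    return [t.lower() for t in re.findall(r"\w+", text)]

# Hypothetical documents, invented to match the behavior table above
DOCS = {
    "blood_color": {"name": "blood color",
                    "description": "magenta and crimson blood",
                    "nodes.synonyms": "cerulean"},
    "blood_shape": {"name": "blood shape",
                    "description": "blood, round and square"},
}

def doc_tokens(doc):
    # terms from every searched field are pooled per document, which is
    # why 'magenta AND cerulean' matches even across different fields
    tokens = set()
    for value in doc.values():
        tokens.update(tokenize(value))
    return tokens

def search(query):
    """Evaluate a flat query: terms joined by a capitalized AND must all
    match; otherwise every word (including a lowercase 'and') is an
    ordinary term and any single match suffices (implicit OR)."""
    parts = query.split()
    if "AND" in parts:
        required = [tokenize(p)[0] for p in parts if p != "AND"]
        matches = lambda toks: all(t in toks for t in required)
    else:
        terms = [tokenize(p)[0] for p in parts]
        matches = lambda toks: any(t in toks for t in terms)
    return sorted(name for name, doc in DOCS.items()
                  if matches(doc_tokens(doc)))

print(search("blood"))              # ['blood_color', 'blood_shape']
print(search("blood AND magenta"))  # ['blood_color']
print(search("blue AND square"))    # []
print(search("blue and square"))    # ['blood_color', 'blood_shape']
```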

Approach

The methodology, from start to finish, reads raw data, annotates it with ontological terms, normalizes those terms, inserts them into a queryable knowledge graph, queries that graph along pre-configured lines, turns the resulting knowledge graphs into documents, and indexes those documents in a full text search engine.

Link

Link ingests raw dbGaP study metadata and performs semantic annotation by

  • Parsing a TOPMed data dictionary XML file to extract variables.
  • Using the Monarch SciGraph named entity recognizer to identify ontology terms.
  • Using the Translator SRI identifier normalization service to
    • Select a preferred identifier for the entity
    • Determine the BioLink types applying to each entity
  • Writing each variable with its annotations as a JSON object to a file.
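The steps above can be sketched as a small pipeline. The record schema, stub services, and the identifier used here are illustrative only; the real implementation calls the SciGraph NER and SRI normalization HTTP APIs.

```python
import json

def link(variables, recognize, normalize):
    """Annotate parsed variables. `recognize` and `normalize` stand in
    for the SciGraph NER and SRI normalization services."""
    records = []
    for var in variables:
        annotations = []
        for ident in recognize(var["description"]):
            preferred, biolink_type = normalize(ident)
            annotations.append({"id": preferred, "type": biolink_type})
        records.append({**var, "annotations": annotations})
    return records

# Stub services for illustration only (real calls would hit HTTP APIs)
recognize = lambda text: ["HP:0001945"] if "fever" in text else []
normalize = lambda ident: (ident, "biolink:PhenotypicFeature")

out = link([{"name": "fever_flag", "description": "subject had fever"}],
           recognize, normalize)
print(json.dumps(out, indent=2))
```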

Load

  • Converts the annotation format written in the steps above to a KGX graph
  • Inserts that graph into a Neo4J database.
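The conversion step could be sketched as follows. The KGX-style node/edge schema, the Biolink category, and the identifier prefix here are illustrative assumptions, not Dug's exact output format.

```python
def to_kgx(records):
    """Convert annotated variable records into KGX-style node and edge
    lists (illustrative schema, not Dug's exact format)."""
    nodes, edges = [], []
    for rec in records:
        var_id = f"TOPMED.VAR:{rec['name']}"  # hypothetical prefix
        nodes.append({"id": var_id, "category": ["biolink:NamedThing"]})
        for ann in rec["annotations"]:
            nodes.append({"id": ann["id"], "category": [ann["type"]]})
            edges.append({"subject": var_id,
                          "predicate": "biolink:related_to",
                          "object": ann["id"]})
    return {"nodes": nodes, "edges": edges}

graph = to_kgx([{"name": "bmi_baseline",
                 "annotations": [{"id": "EFO:0004340",
                                  "type": "biolink:PhenotypicFeature"}]}])
print(len(graph["nodes"]), len(graph["edges"]))  # 2 1
```

Inserting such a graph into Neo4J is then a matter of iterating over the node and edge lists with MERGE statements.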

In phase 1, we query Neo4J to create knowledge graphs. In phase 2 we'll use Neo4J to create a Translator Knowledge Provider API. That API will be integrated using TranQL with other Translator reasoners like ROBOKOP. This will allow us to build more sophisticated graphs spanning federated ontological knowledge.

Crawl

  • Runs those graph queries and caches knowledge graph responses.

Index

  • Consumes knowledge graphs produced by the crawl.
  • Uses connections in the graph to create documents including both full text of variable descriptions and ontology terms.
  • Produces a queryable full text index of the variable set.

Search API

  • Presents an OpenAPI compliant REST interface
  • Provides a scalable microservice suitable as an Internet endpoint.

The Dug Data Development Kit (DDK)

Dug provides a tool chain for the ingest, annotation, knowledge graph representation, query, crawling, indexing, and search of datasets with metadata. The following sections provide an overview of the relevant components.

Ingest

Data formats for harmonized variables appear to be in flux, hence the multiple approaches. More on this soon.

| Command | Description | Example |
| --- | --- | --- |
| bin/dug link | Use NLP, etc., to add ontology identifiers and types. | bin/dug link {input} |
| bin/dug load | Create a knowledge graph database. | bin/dug load {input} |

There are three sets of example metadata files in the repo.

  • A COPDGene dbGaP metadata file is at data/dd.xml
  • A harmonized variable metadata CSV is at data/harmonized_variable_DD.csv
  • Files with names starting with: data/topmed_*

This last format seems to be the go-forward TOPMed harmonized variable form.

These can be run with

bin/dug link data/dd.xml
bin/dug load data/dd_tagged.json

or

bin/dug link data/harmonized_variable_DD.csv
bin/dug load data/harmonized_variable_DD_tagged.json

or

bin/dug link data/topmed_variables_v1.0.csv [--index x]

The first two formats will likely go away. The last format

  • Consists of two sets of files following that naming convention.
  • Combines the link and load phases into link.
  • Optionally allows the --index flag. This will run graph queries and index data in Elasticsearch.

Crawl & Index

| Command | Description | Example |
| --- | --- | --- |
| bin/dug crawl | Execute graph queries and accumulate knowledge graphs in response. | bin/dug crawl |
| bin/dug index | Analyze crawled knowledge graphs and create search engine indices. | bin/dug index |
| bin/dug query | Test the index by querying the search engine from Python. | bin/dug query {text} |

Search API

Exposing the Elasticsearch interface to the internet is strongly discouraged for security reasons. Instead, we have a REST API. We'll use this as a place to enforce a schema and validate requests so that the search engine's network endpoint is strictly internal.

| Command | Description | Example |
| --- | --- | --- |
| bin/dug api | Run the REST API. | bin/dug api [--debug] [--port={int}] |

To call the API endpoint using curl:

| Command | Description | Example |
| --- | --- | --- |
| bin/dug query_api | Call the REST API. | bin/dug query_api {query} |

Development

A docker-compose configuration is provided that runs four services:

  • Redis
  • Neo4J
  • Elasticsearch
  • The Dug search OpenAPI

This system can be started with the following command:

| Command | Description | Example |
| --- | --- | --- |
| bin/dug stack | Runs all services. | bin/dug stack |

Developers: Internally, bin/dug automatically creates an environment file at docker/.env. If you are running in development and are not using a public IP address and hostname, you'll want to create a separate .env file to allow programs to connect to the Docker containers as services. This matters if, for example, you want to run bin/test: the clients in that test need to know how to connect to each of the services they call. Copy the generated docker/.env to docker/.env.dev and change all hostnames to localhost. That should do it. Be sure to keep the generated passwords from docker/.env the same.

Testing

Dug's automated functional tests:

  • Delete the test index
  • Execute the link and load phases for the dbGaP data dictionary and harmonized variables.
  • Execute the crawl and index phases.
  • Execute a number of searches over the generated search index.

| Command | Description | Example |
| --- | --- | --- |
| bin/dug test | Run automated functional tests. | bin/dug test |

Once the test is complete, a command line search shows the contents of the index. Figure 4: A command line query using the Dug Search OpenAPI to query the Elasticsearch index for a term.

Data Formats

TOPMed phenotypic concept data is here.

Release

To release, commit the change and select feature.

Fail on Vulnerability Detection

During PRs, several vulnerability scanners are run. If vulnerabilities are detected, the PR checks will fail and a report will be sent to the GitHub Security Dashboard for viewing. Please ensure the vulnerability is mitigated before continuing the merge to protected branches.

dug's People

Contributors

alexwaldrop, andrew-rti, braswent, cbcunc, cmball1, cnbennett3, cschreep, dependabot[bot], frostyfan109, gaurav, gingin77, hoid, howardlander, jcheadle-rti, joshua-seals, mbacon-renci, muralikarthikk, pj-linebaugh, rencibuild, stevencox, vladimir2217, warrenstephens, yaphetkg, yskale


dug's Issues

Epic: HEAL Harmonization

Introduction

The HEAL program includes hundreds of projects generating highly diverse data sets.

RTI and RENCI, the HEAL Stewards, will

  • Use semantic knowledge graphs to link data from disparate studies to facilitate analysis.
  • Perform preliminary harmonization of data sets towards the HEAL Common Data Elements (CDEs)
  • Provide user friendly interfaces with a biological lens on the data.

HEAL Harmonization

To accomplish this:

  • Ingest HEAL CDEs
    • Create (or locate) machine readable versions of the HEAL CDEs @gaurav
    • Discuss w/NIH how to publish machine readable HEAL CDEs @gaurav
    • Annotate with controlled vocabulary and ontology identifiers @gaurav
      • Map provided NCI Metathesaurus ids to Human Phenotype Ontology (etc) identifiers @gaurav
      • Annotate HEAL CDEs with Monarch's BioLink API SciGraph Named Entity Recognition service (NER) @gaurav
      • Look into alternate NER tools for finding terms in HEAL CDEs (https://github.com/helxplatform/development/issues/804) @gaurav
      • Convert to Biolink and KGX compliant artifacts @gaurav
      • Apply the SRI Normalizer to use Translator preferred identifiers @YaphetKG @gaurav
      • Clean up SciGraph annotations and resend cleaned KGX files to Yaphet @gaurav
    • Optimize TranQL queries to take advantage of Redisgraph performance. @YaphetKG
    • Create new harmonization and translational TranQL queries to @YaphetKG
    • Link variables through phenotypes, chemicals, diseases to CDEs @YaphetKG
    • Index, recording harmonization connections to enable display
  • Ingest HEAL study data as it becomes available @waTeim
  • Update the HeLx/Dug UI to @mbwatson
    • Render
      • An "All" tab for generic search results including lexical matches.
      • A "Harmonized" tab for everything else with markers for CDE, PhenX, Biolink, and other groupings.
    • Allow deployment with or without
      • Authentication
      • App Workspaces
    • Present an information dense display minimizing paging and scrolling

Design


@alexwaldrop @jcheadle-rti

PM Tracking

@hhiles to work with @vgardner-renci and Kathy to report on the following HEAL epics and their related GitHub tickets; give updates directly to PMs or plug into Monday.com

  • Index Data Dictionaries
    • #169
    • #768
    • Collect remaining 500-600 HEAL studies
  • Develop Automated Curation and Search
  • Enable Cloud Based Semantic Search

Ingest network resources

Having to bundle the data in the docker image really isn't fun.

Acceptance criteria:

  • split load and parse into separate tasks
  • 2 loader classes, i.e. HttpLoader and FilesystemLoader
  • Should be able to dispatch them based on target, i.e. "http://...." goes to http loader, "file://..." goes to file loader, default to file loader if not valid URI
  • should be able to run dug crawl http://example.com/my-data-dict.tar.gz
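The scheme-based dispatch described above might look like the following sketch (the class and function names are hypothetical, matching the loader names proposed in the acceptance criteria):

```python
from urllib.parse import urlparse

class FilesystemLoader:
    def load(self, target):
        return f"reading {target} from disk"

class HttpLoader:
    def load(self, target):
        return f"fetching {target} over HTTP"

def get_loader(target):
    """Dispatch on the target's URI scheme; file:// URIs and anything
    that is not a recognized URI fall back to the filesystem loader."""
    scheme = urlparse(target).scheme
    if scheme in ("http", "https"):
        return HttpLoader()
    return FilesystemLoader()

print(type(get_loader("http://example.com/dict.tar.gz")).__name__)  # HttpLoader
print(type(get_loader("file:///data/dd.xml")).__name__)             # FilesystemLoader
print(type(get_loader("data/dd.xml")).__name__)                     # FilesystemLoader
```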

Error handling on empty results

When running a search, if there are no matching entries in the ES index, the code will raise a KeyError trying to access a nested dictionary of the result hits. This should be handled more gracefully.
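A minimal sketch of the graceful fallback, assuming the standard Elasticsearch response shape where hits live under response['hits']['hits']:

```python
def extract_hits(response):
    """Return search hits, tolerating an empty or error-shaped
    Elasticsearch response instead of raising KeyError."""
    return response.get("hits", {}).get("hits", [])

print(extract_hits({}))                                  # []
print(extract_hits({"hits": {"hits": [{"_id": "x"}]}}))  # [{'_id': 'x'}]
```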

Automatic backups

We need automatic backups of our ES index to easily revert or restore data

Incomplete concept results

Currently, dug searches frequently yield "incomplete" concepts which lack key identifying attributes such as name, description, and/or type fields.

This is handled in the UI by removing such results (note that those without a type are still shown). There is also an error in the console, but it is not explicitly revealed to the user that results are removed.


For example, a search for asthma purportedly yields 51 results, although in reality only 24 are shown to the user, with the other 27 being removed due to a lack of information.

  • A search for epicanthus yields 4 results on the first page, but only 3 are shown
  • A search for aggravated by returns "Aggravated by" which lacks a type concept

The dug annotator currently relies naively on the ontology api to retrieve information such as names/descriptions of concepts.

The annotator normalizes the node to its preferred identifier but neglects to account for the specific ontologies that the api supports, and thus will oftentimes send out requests using unsupported ontologies and return empty-handed.

This is especially apparent, for example, with chemical concepts, which are generally normalized to PUBCHEM.COMPOUND. This ontology is not supported by the annotation api and will always return no information about the concept. In turn, chemical concepts are very frequently removed from the UI due to a lack of information about them.

Another consideration is that sometimes the ontology api returns incomplete information. For example, https://api.monarchinitiative.org/api/bioentity/HP%3A0025285 returns a name and description, but lacks a type/category.

Tentative steps

  • Normalize node identifiers for annotation while accounting for supported ontologies in relevant apis
  • Consider the use of node-norm as a fallback mechanism when initial requests to the ontology api are lacking in important information (specifically, node type)
  • If descriptionless concepts are still somewhat frequent (e.g. if we discover that some nodes can't be normalized to supported ontologies), consider adapting the UI to display concepts without descriptions.

Change sync ARAGORN query to be truly synchronous

Currently, when ARAGORN is queried, it asynchronously makes calls to other services that return its data, because we don't store the underlying knowledge graph locally. We want to change that so we have the ability to call ARAGORN synchronously, using a locally stored knowledge graph. Instead of standing up a whole other ARAGORN server just to service the centralized workflow, we're creating two new endpoints and some ingresses that forward requests from a separate base URL to the correct endpoints on the core server.

  • Test ingresses that forward traffic from 2 different base URLs to a single main base URL with certain URI prefixes (see below)
  • Add these ingresses to the main aragorn deployment
  • Add service URL for plater to the config file for the deployment (check translator-devops repo for the deployment.yml file)
  • Change sync endpoint to query robokop (through plater) instead of strider

To-do

  • Figure out whether I need to run other workflows myself in the sync call or whether plater will take care of that for me

Ingest SPARC data

Given sparc KGX in JSON format, ingest it into our ES index

@YaphetKG please work with Howard to flesh out detailed steps

Harmonization phase2

Incorporate CDEs, PhenX, and LOINC harmonization into Dug

Acceptance Criteria:

  • Ingest variable cross-references from PhenX into the knowledge graph
  • Write queries that make associations between CDEs, PhenX, and LOINC

Improve directory crawling

Currently the only way to crawl a directory is by bash scripting, calling dug crawl on each individual file. This is not ideal for our use case, so we need to be able to specify either a single file or a directory to the dug CLI without having to script extra steps around it.

investigate new "expand" icons for dug interface

The icon should clearly indicate to users that an object can be interacted with. Users have commented that Dug search result tiles, as well as the collapsing drawer in the Workspace, are not intuitively things they would "open".

  • @connectbo and @mbwatson investigate new icon options
  • @hhiles update documentation to guide users through card expansion


Variable View

Dates TBD by 5/24

For MVP dev testing

  • @hhiles set up meeting to discuss prod version of variable view - invite jeff, ginnie, mac, matt w
  • @gingin77 create image for @waTeim to pull
  • @hhiles and kira will play in this space for 1-2 weeks
  • @hhiles meet with @gingin77 and share screen to show any bugs/problems that are found during kira/hannah testing

For Beta with BDC @gingin77

  • Define Dug Score for users
  • After zoom in, being able to select studies for further search - prioritize variables for "search within search" functionality
  • Some kind of popup/help text to explain functionality of the push pins

Then

  • @hhiles will work with kira to ID HEAL BDC users to beta test for 2-3 weeks - PREFERABLY JACKIE, Ben and Adrian would be helpful too
  • @hhiles will set up a workshop meeting for 2-3 weeks from their access start date; discuss issues and what is liked

Update Quickstart

The Quickstart in the README is a little out of date. File paths and search params need to be updated. Go through the guide and make sure everything lines up correctly.

identify user metrics for dug

Baseline user metrics need to be determined for each deployment of Dug. What we want to know:

  • how many people are using Dug at each instance
  • number of unique users
  • number of searches completed by users
  • average time spent using Dug
  • do users perform multiple searches in 1 session? or 1 search their entire session
  • ratio of users who "dig deeper" into concepts and KGs : users who perform 1-2 clicks then end session

BY END OF AUGUST

  • @frostyfan109 review GA4 offerings
  • @frostyfan109 review the GA4 tokens and analytics libraries; make updates as needed
  • @frostyfan109 identify what new events will need to be seen in GA4 (eg, shopping cart, clicking through CDEs, etc)
  • @hhiles will compile a complete list of deployments to be represented in GA4
  • make sure each required deployment of Dug is connected to MixPanel GA4 @frostyfan109
  • set up dashboards to capture above analytics @hhiles

Then,

  • generate initial reports, share with leadership @hhiles
  • tweak above questions and generate a second wave of reports after 1-2 months @hhiles

Simplification and Normalization of Standard Apps Deployment

Deployment of applications on Kubernetes is facilitated by Helm and requires some amount of maintenance to keep running smoothly; it would be useful to organize this work to help facilitate that.

  • Add a feature to help in private deployments
  • Merge previous work into current develop and master branches
  • Maintain/suggest a version numbering scheme
  • Collect current work and set numbers.
  • Convert heal deployments
  • Convert BDC deployments
  • Add metrics tracking data processing efficiency

updated internal documentation

Technical documentation for the Dug team needs updating.

  • write up "how to manually get metadata from HEAL into Dug" page @alexwaldrop
  • Roger README needs updating @YaphetKG
    • how to install dug in the cloud
    • helm charts as a component of dug (not necessarily a whole section; but make sure you reference it)

Tranql Support for SKIP and LIMIT as keywords instead of where conditions

Latest implementation of Tranql + redis uses skip and limit as part of the where clause.

To improve this and make the syntax more familiar, we need to make skip and limit part of the TranQL query keywords.

Current implementation:

select d2:disease<-gene<-[interacts_with]-chemical_substance->disorder_involving_pain:disease
  from "redis:test"
 where disorder_involving_pain='MONDO:0021668' 
   and limit=200 
   and skip=200

Proposed

select d2:disease<-gene<-[interacts_with]-chemical_substance->disorder_involving_pain:disease
  from "redis:test"
 where disorder_involving_pain='MONDO:0021668' 
 limit 200 
 skip 200

Error handling around reading tags file

Encountered this error while seeding a Dug deployment:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/dug/.local/lib/python3.8/site-packages/dug/core.py", line 693, in <module>
    main()
  File "/home/dug/.local/lib/python3.8/site-packages/dug/core.py", line 672, in main
    crawler.crawl()
  File "/home/dug/.local/lib/python3.8/site-packages/dug/core.py", line 454, in crawl
    self.elements = self.parser.parse(self.crawl_file)
  File "/home/dug/.local/lib/python3.8/site-packages/dug/parsers.py", line 184, in parse
    tags = json.load(stream)
  File "/usr/local/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

The code is trying to load a JSON file:

class TOPMedTagParser:

    @staticmethod
    def parse(input_file):
        """..."""
        tags_input_file = input_file.replace(".csv", ".json").replace("_variables_", "_tags_")
        if not os.path.exists(tags_input_file):
            raise ValueError(f"Accompanying tags file: {tags_input_file} must exist.")

        # Read in human-created tags/concepts from json file before reading in elements
        with open(tags_input_file, "r") as stream:
            tags = json.load(stream)
        ...

First, let's determine whether this file should be loaded at all, and then add better error handling around bad files. Alternatively, if it should load, let's fix it.
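A sketch of the suggested error handling (a hypothetical helper, not Dug's actual code): wrap the JSON parse so a malformed tags file produces a descriptive error instead of a bare JSONDecodeError.

```python
import json
import os

def load_tags(path):
    """Load the accompanying tags file, raising a descriptive ValueError
    for a missing or malformed file (sketch, not Dug's actual code)."""
    if not os.path.exists(path):
        raise ValueError(f"Accompanying tags file {path} must exist.")
    with open(path) as stream:
        try:
            return json.load(stream)
        except json.JSONDecodeError as err:
            raise ValueError(
                f"Tags file {path} is not valid JSON: {err}") from err
```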

Search UI Autocomplete

We'd like a way to add search bar autocompletion suggestions (and possibly predictions?) in the search UI akin to Google.

Currently, a user is largely left on their own when navigating between dug searches in the UI. Ideally, for example, if someone searches "heart", the UI could suggest a number of direct completions, e.g. ['heart', 'heart disease', 'heart failure', ...]. This may also include context-aware predictions, e.g. searching asthma might yield suggestions ['asthma', 'dyspnea', 'hypoxemia', 'lung disease', ...].

Needs to be performant to scale while providing results that a user would actually find useful.

  1. Finish #226 and consider whether or not this functionality would be suitable for autocompletion on helx-ui as well.
  2. If not, explore elasticsearch implementation.


updated user-facing documentation

Dug's user-facing documentation is out of date.

  • README.md for /dug needs reviewing and updating
  • Dug README needs checking to see if other Dug-adjacent repos (eg kg-explorer) are referenced; update their documentation if they are
  • /support page on the prod Dug instances needs updating
  • Pull Dug documentation out of HeLx GitBook; make sure it's updated

Patchy results for gene/drug queries

The use of genes and drugs to get to relevant terms seems a little hit or miss

So in the realm of asthma
"Albuterol" brings back asthma
"ADRB2" does not

For hypertension
"ACE" brings back hypertension
"lisinopril" does not

CDE Harmonization

The front-end is receiving CDE studies with empty properties.

Screenshot from 2022-02-25 10-37-05

For reference, this screenshot shows how the cde studies compare to the dbGaP ones:

Screenshot from 2022-02-25 10-42-00

Bio-NER for Dug on HEAL

For discussion:

Most HEAL data is not extensively curated. Many are like those in NIDA-Share, providing narrative prose documents like the ones in the "Study Documents" box here.

Can we index these documents lexically (regular Elastic) and semantically (biomedical NER)?

We would like to do this in order to then align identifiers discovered via NER with HEAL CDE linkages.

The protocol PDF at the link above, for example, is brimming with biomedical terms we could use for indexing.

While there are plugins for indexing PDF in Elastic, we need access to the text to map ontology terms so I'm guessing the plugins won't do what we need. Is that correct, or do they provide hooks for us to use the text in arbitrary ways?

(We're also not ready to do this with a lot of text until RENCI has an in house bio-NER)

If the plugins don't provide hooks, what are our best options for a design to index documents like this lexically and semantically (i.e. NER)?

FYI @gaurav @satusky @cbizon


Tranql - Redis connection bug

Description:
We have had an incident with TranQL where connections to the Redis k8s service sometimes go stale and we get a "connection reset by peer" error when running a TranQL query.

From the logs

 File "/home/tranql/.local/lib/python3.7/site-packages/redis/connection.py", line 752, in read_response
    response = self._parser.read_response()
  File "/home/tranql/.local/lib/python3.7/site-packages/redis/connection.py", line 322, in read_response
    raw = self._buffer.readline()
  File "/home/tranql/.local/lib/python3.7/site-packages/redis/connection.py", line 254, in readline
    self._read_from_socket()
  File "/home/tranql/.local/lib/python3.7/site-packages/redis/connection.py", line 196, in _read_from_socket
    data = self._sock.recv(socket_read_size)
ConnectionResetError: [Errno 104] Connection reset by peer

Resolution:
As an immediate resolution, restarting the TranQL pod seemed to resolve the error.

Permanent solution:
Test TranQL with multiple queries, determine the cause of the issue, and implement a solid resolution.

Semantic Search Metrics

Produce indexing and search performance metrics addressing the following (moved from https://github.com/helxplatform/development/issues/693):

  • Statistics: Quantify ingested data elements by biolink type and identifier class
  • Latency: Quantify the elapsed time required for pipeline phases
  • Scale: Quantify resource consumption in terms of CPU, memory, and disk.
  • Relevance: Quantify search result relevance.

Give Dug users context for missing search tiles

Some searches hide results that Dug deems irrelevant to the user. See the screenshot for "HIV", where 13 results were found but only 4 cards are shown.

@mbwatson: update the search results text to read something like: n results for "concept" (1 page). x results were removed because they do not contain sufficiently useful information.


Question: What should a search for CHD return?

CHD returns only CAD, and there is nothing under "Knowledge Graph". I assume from this that (1) there are no study variables associated with CHD, (2) there are no study variables annotated with anything linked to CHD in the graph, and (3) CHD->CAD is happening via some kind of lexical match?

Should the UI show something like "Interpreting CHD as Congenital Heart Disease. No HEAL studies annotated with Congenital Heart Disease" to let the user know that we didn't misunderstand the query, but nothing matches it?

On (2), I'm somewhat surprised because in ROBOKOP, CHD is associated with something like 450 phenotypes. I would think at least one of them would be represented in this dataset, but maybe not...

Investigate and use Dag Factory for Roger workflow

  • Implement a way to generate workflows dynamically through a RESTful endpoint.
  • The user submits a minimal workflow declaration as YAML; an Airflow DAG is constructed on the fly and run. One approach is to use Dag Factory. Based on discussions with @stevencox, a reasonable approach might be to template this with a templating engine (Jinja) and a minimal data structure for the workflow definition.
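The templating approach above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the workflow declaration and DAG shape are made up, and the stdlib `string.Template` stands in for Jinja to keep the sketch dependency-free.

```python
import string
import textwrap

# Hypothetical minimal workflow declaration (in practice parsed from YAML).
workflow = {"dag_id": "roger_annotate", "schedule": "@daily", "task_id": "annotate"}

# A DAG-file template; Jinja would fill the same role with richer syntax
# (loops over tasks, conditionals, etc.).
DAG_TEMPLATE = string.Template(textwrap.dedent("""\
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(dag_id="$dag_id", schedule_interval="$schedule") as dag:
        $task_id = BashOperator(task_id="$task_id", bash_command="echo run")
    """))

# Render Python source for a DAG from the declaration.
dag_source = DAG_TEMPLATE.substitute(workflow)
```

The rendered source could then be written into Airflow's DAGs folder, where the scheduler picks it up.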

HEAL - Enable Ingest of NIDA metadata (NIDA Datashare Parser)

We want the ability to ingest NIDA Datashare metadata for the HEAL initiative. NIDA is a big player within the HEAL initiative and is the planned repository for around 300 studies, over half of the studies in the initiative. They have well-documented metadata, but only a handful of studies contain machine-readable data dictionaries (as .xlsx).

I wrote a script to transform their data dictionaries to JSON blobs that can be ingested by Dug. This issue tracks the progress from this metadata generation to making necessary code enhancements to allow for a (local) dug deployment with NIDA metadata, for demonstration purposes.
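The transformation step can be sketched roughly as below. This is not the actual script: the row fields and the output field names are assumptions standing in for the NIDA dictionary columns and Dug's real ingest schema, and the real script would read the rows from .xlsx (e.g. with openpyxl or pandas).

```python
import json

# Hypothetical data-dictionary rows as read from a NIDA .xlsx file.
rows = [
    {"variable_name": "age", "description": "Participant age in years"},
    {"variable_name": "opioid_use", "description": "Self-reported opioid use"},
]

def to_dug_elements(study_id, rows):
    """Shape dictionary rows into JSON blobs; the field names here are
    illustrative, not Dug's actual ingest schema."""
    return [
        {
            "element_id": f"{study_id}.{r['variable_name']}",
            "element_name": r["variable_name"],
            "element_desc": r["description"],
            "collection_id": study_id,
        }
        for r in rows
    ]

blobs = to_dug_elements("NIDA-CTN-0001", rows)
print(json.dumps(blobs, indent=2))
```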

Biolink type-based filtering

Offer a way for users to filter search results in the UI based on their biolink types.

A search on concepts_index using the /search endpoint will currently return concepts of unrestricted biolink types. If a user searches asthma, for example, but is specifically interested in phenotypes related to asthma, they have to look through a bunch of extraneous results to get to what they really want to see.

There should be some sort of filter menu to zero-in on results more easily in the UI.

  • There's also the consideration of adding the filtering functionality directly into Dug for more effective API pagination (i.e. if someone searches for asthma but only wants to see phenotypes, they may only get 3 results on the first page if filtering is done in the UI).
  • It's probably best to leave it to the UI, at least for now.
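If the filtering were done server-side, the query could be shaped roughly as below. This is a sketch of the Elasticsearch request body only; using "type" as the biolink-type field name is an assumption about the concepts index mapping, not Dug's confirmed schema.

```python
# Build an Elasticsearch query that restricts hits to given biolink types.
def filtered_search(query, biolink_types=None):
    body = {"query": {"bool": {"must": [{"match": {"name": query}}]}}}
    if biolink_types:
        # A terms filter restricts hits without affecting relevance scoring,
        # so pagination counts reflect only the types the user asked for.
        body["query"]["bool"]["filter"] = [{"terms": {"type": biolink_types}}]
    return body

body = filtered_search("asthma", ["biolink:PhenotypicFeature"])
```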

Tranql Auto-complete

For Tranql, when users type in the name of an entity (such as asthma), we currently use a translator service to resolve it to actual identifiers (e.g. MONDO:0004979).

This has some issues: the resulting curies may or may not exist in the graph, and the curies need to be normalized before being sent to Tranql for answers.

A better approach might be to use Redis search indexes via Tranql to resolve names to node ids that we know exist. This article (http://www.odbms.org/2020/04/introducing-redisgraph-2-0/) describes a problem that seems to fit this use case, where graph nodes are looked up using indexes. This could give us much better and more accurate results.

It would be worth investigating this and implementing a more reliable solution.
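The proposed flow can be sketched as "prefer identifiers known to exist in the graph, fall back to the external resolver otherwise". The lookup table and function names below are hypothetical; in the proposed design the local lookup would be a RedisGraph full-text index query rather than a dict.

```python
# Illustrative stand-in for a RedisGraph full-text index over node names.
KNOWN_NODES = {"asthma": "MONDO:0004979"}

def resolve(name, external_resolver):
    """Resolve an entity name to a curie, preferring ids present in the graph."""
    curie = KNOWN_NODES.get(name.lower())
    if curie:
        return curie  # guaranteed to exist in the graph, no normalization needed
    return external_resolver(name)
```

The key property is that locally resolved curies never need a round trip to the normalizer, since they come straight from the graph.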

Consider using NLTK to get rid of stop words in search inputs

This query

dug search -q "thomas tank" -t "concepts"

returns about what you'd expect: nothing.

But this query

dug search -q "thomas the tank" -t "concepts"

returns 35 KB worth of results. I haven't looked through all of them, but I'm guessing that none of them truly pertain to Thomas the Tank...

One possible way of addressing this would be to use NLTK to remove any stop words from the input. See this link (https://www.geeksforgeeks.org/removing-stop-words-nltk-python/) for examples of doing this in Python.

I really don't know how big a deal this is, but the behavior did seem a little strange to me.
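The stop-word filtering itself is simple. The sketch below uses a tiny illustrative stoplist to stay self-contained; NLTK's `nltk.corpus.stopwords.words("english")` provides the fuller list the linked article uses.

```python
# Tiny illustrative stoplist; substitute NLTK's English stopwords corpus
# in a real implementation.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "is"}

def strip_stop_words(query):
    """Drop stop words from a search query before it hits the index."""
    return " ".join(w for w in query.split() if w.lower() not in STOP_WORDS)

print(strip_stop_words("thomas the tank"))  # -> thomas tank
```

With this in place, "thomas the tank" and "thomas tank" would produce the same (empty) result set.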

Create status check for Dug

Dug relies on several other backend services (e.g. elastic, redis). We can't crawl or search without them, which means we can't start up Dug on kubernetes until those other services come online. Right now we check it directly with some scripted curl calls, but it would be an improvement if we could just ask Dug itself if it was in a runnable state.

Acceptance criteria:

  • add a CLI integration so that a user can call dug status and receive some meaningful output.

    • if everything is OK, exit with exit code 0. Otherwise exit with exit code 1.
    • CLI output should look something like:
    $ dug status
       - elasticsearch (elastic.svc:9200) ...ok
       - redis (redis.svc:6379) ...ok
       - ...
  • add a REST API endpoint, e.g. /health, /_health, etc.

    • Should return HTTP 200 if everything is OK, HTTP 500 if something is down
    • Response should have meaningful info, e.g.
      {
       "status": "down",
       "services": {
         "redis": {
           "status": "ok"
         },
         "elasticsearch": {
           "status": "down",
           "error": "unable to reach http://elasticsearch.svc, Address unreachable"
         }
       }
      }
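The check itself can be as simple as attempting a TCP connection to each backend. The sketch below is a minimal, hedged take on the acceptance criteria above; the service list and hostnames are examples, not Dug's actual configuration, and a real check might prefer each service's own ping/health API over a raw socket.

```python
import socket

# Example service map; hostnames/ports are illustrative.
SERVICES = {"elasticsearch": ("elastic.svc", 9200), "redis": ("redis.svc", 6379)}

def check(host, port, timeout=2.0):
    """Return an ok/down record for one backend service."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return {"status": "ok"}
    except OSError as e:
        return {"status": "down", "error": f"unable to reach {host}:{port}, {e}"}

def status(services=SERVICES):
    """Aggregate per-service checks into the response shape sketched above."""
    results = {name: check(host, port) for name, (host, port) in services.items()}
    overall = "ok" if all(r["status"] == "ok" for r in results.values()) else "down"
    return {"status": overall, "services": results}
```

The CLI can then exit 0 or 1 based on the overall status, and the REST endpoint can map "ok"/"down" to HTTP 200/500.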

Remove incorrect ERROR logs

Dug reports "error" messages when some of the API services we use, e.g. the normalization service, don't have a result.

These aren't errors, so we should record them as info or warning statements.

Annotation Performance Tweaks for Dug deployment

Calls to external services, specifically Monarch's NER service, take most of our execution time in the annotation step of Roger/Dug.
We have also noticed that the service starts failing under heavy load.
To mitigate this issue and improve throughput, we are investigating the following options:

  • Deploy SciGraph (Monarch's NER) in the NER namespace here in Sterling.
  • If simple replication of the above deployment doesn't buy us much, try putting an HTTP accelerator proxy (Varnish Cache) in front of it.
  • Add parameters in the search chart to change the URLs for the annotation and synonymizer services, allowing changes across environments. These new values would need to default to the current ones so that users outside the Sterling cluster can still deploy.
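A complementary, client-side option is to memoize NER calls, since many study variables share identical descriptions. A hedged sketch, where `call_ner` is a hypothetical stand-in for the Monarch NER request:

```python
from functools import lru_cache

# Counter so the sketch can demonstrate that the cache works.
calls = {"n": 0}

def call_ner(text):
    """Hypothetical stand-in for the HTTP call to the NER service."""
    calls["n"] += 1
    return [text.lower()]  # placeholder for returned annotations

@lru_cache(maxsize=4096)
def annotate(text):
    # Tuples are hashable, so results can be cached safely.
    return tuple(call_ner(text))

annotate("Heart Disease")
annotate("Heart Disease")
assert calls["n"] == 1  # second call served from cache
```

Caching shrinks load on the external service in proportion to how repetitive the input text is, independent of replication or a Varnish proxy.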

CLI arguments should be case-insensitive

Hit this error today:

dug.core.parsers.ParserNotFoundException: Cannot find parser of type ' "TOPMedTag"'
Supported parsers: dbgap, topmedtag

It would be nice if the CLI were case-insensitive with regard to the parser argument (or other args, for that matter).
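The lookup fix is small. The sketch below is illustrative, not Dug's actual parser registry; it also tolerates the stray quotes visible in the error above.

```python
# Hypothetical parser registry mirroring the names in the error message.
PARSERS = {"dbgap": "DbGaPParser", "topmedtag": "TOPMedTagParser"}

class ParserNotFoundException(Exception):
    pass

def get_parser(name):
    """Look up a parser case-insensitively, tolerating whitespace and quotes."""
    key = name.strip().strip('"').lower()
    try:
        return PARSERS[key]
    except KeyError:
        raise ParserNotFoundException(
            f"Cannot find parser of type {name!r}. "
            f"Supported parsers: {', '.join(PARSERS)}")

assert get_parser(' "TOPMedTag"') == "TOPMedTagParser"
```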

A user can create a json file of Dug search results to take outside of Dug for further analyses

This work satisfies the need for a Dug shopping cart. It is broken into 3 buckets for v1:

  • From the card view, a user can click studies and select 1 or more studies to send to the shopping cart.

  • After expanding the card view and selecting studies, a user can expand individual studies and select 1 or more variables to send to the shopping cart.

  • A user can view their shopping cart with a list of all selected items for export to JSON.
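The export step itself is straightforward. A minimal sketch, where the cart structure is an assumption about what the UI would collect, not Dug's actual schema:

```python
import json

# Hypothetical cart contents collected by the UI.
cart = {
    "studies": ["phs000007"],
    "variables": [{"study": "phs000007", "variable": "phv00054119"}],
}

def export_cart(cart, path):
    """Write the selected items to a JSON file the user can take elsewhere."""
    with open(path, "w") as f:
        json.dump(cart, f, indent=2)
    return path
```

The resulting file can then be loaded into any downstream analysis tool that reads JSON.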

Handle http status codes for synonyms

The Onto synonyms API client detects and handles errors based on whether the content fails to parse as JSON:

except json.decoder.JSONDecodeError as e:

It should probably detect errors based on the response's HTTP status code instead.

On the latest onto.renci.org instance, a 400 HTTP status code is returned with valid JSON, but that JSON is not an array. This causes an exception further up the call chain when adding synonyms to the actual data elements.
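A status-code-first version of the handling could look like the sketch below. `FakeResponse` is a stand-in for a `requests.Response` so the logic is testable without a network call; the function name is hypothetical.

```python
class FakeResponse:
    """Minimal stand-in for requests.Response."""
    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload

def get_synonyms(response):
    # Check the HTTP status before trusting the body, rather than relying
    # on a JSONDecodeError to signal failure.
    if response.status_code != 200:
        return []
    payload = response.json()
    # Guard against valid-but-unexpected JSON shapes (e.g. an error object
    # instead of an array), which is exactly the onto.renci.org case above.
    return payload if isinstance(payload, list) else []

assert get_synonyms(FakeResponse(400, {"error": "bad curie"})) == []
```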

How to handle "disease" as text in a search?

Searching for something like "heart disease" returns a lot of good hits. But it also returns things like "brain disease" and "polydactyly (disease)".

My suspicion is that the text "disease" is producing these matches. Is there a way to down-weight those results? Maybe "disease" should be removed from names? But it seems like you need it there for "heart disease" to match "heart disease"...

Rename project

This project will be renamed to helx-search to be published on the PyPI repo with a more descriptive name. We will also organize it as a namespace package, with "helx" as the parent namespace (other helx projects will subsequently be moved under this namespace).

Whereas currently users access it as

from dug import ...

The namespace will change to

from helx.search import ...

Likewise CLI tooling will change, from

dug crawl ...
dug search ...

to

helx crawl ...
helx search ...

Search class init does not respect arguments

The Search class accepts several parameters in its init method that it does not actually use. For example, it accepts a "host" param, but in the body of the method we find:

        self.host = os.environ.get('ELASTIC_API_HOST', 'localhost')

So the argument isn't used. There are also other values that are not passed in as params but are only configurable via env vars. They should be parameterized.

Acceptance criteria:

  • the Search class is configured entirely via init or other methods.
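A minimal sketch of an init that respects its arguments, falling back to the environment only when a value is not supplied explicitly. `ELASTIC_API_HOST` comes from the snippet above; `ELASTIC_API_PORT` is a hypothetical example of another value to parameterize.

```python
import os

class Search:
    def __init__(self, host=None, port=None):
        # Explicit arguments win; env vars are only a fallback.
        self.host = host or os.environ.get("ELASTIC_API_HOST", "localhost")
        self.port = int(port or os.environ.get("ELASTIC_API_PORT", 9200))

s = Search(host="elastic.svc")
assert s.host == "elastic.svc"  # the explicit argument wins
```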
