
dataqualitydashboard's Introduction

DataQualityDashboard


DataQualityDashboard is part of HADES.

The goal of the Data Quality Dashboard (DQD) project is to design and develop an open-source tool to expose and evaluate observational data quality.

Introduction

This package will run a series of data quality checks against an OMOP CDM instance (currently supporting v5.2, v5.3, and v5.4). It systematically runs the checks, evaluates each check against a pre-specified threshold, and then communicates what was done in a transparent and easily understandable way.

Overview

The quality checks were organized according to the Kahn Framework¹, which uses a system of categories and contexts that represent strategies for assessing data quality. For an introduction to the Kahn framework please click here.

Using this framework, the Data Quality Dashboard takes a systematic approach to running data quality checks. Instead of writing thousands of individual checks, we use “data quality check types”. These “check types” are more general, parameterized data quality checks into which OMOP tables, fields, and concepts can be substituted to represent a singular data quality idea. For example, one check type might be written as

The number and percent of records with a value in the cdmFieldName field of the cdmTableName table less than plausibleValueLow.

This would be considered an atemporal plausibility verification check because we are looking for implausibly low values in some field based on internal knowledge. We can use this check type to substitute in values for cdmFieldName, cdmTableName, and plausibleValueLow to create a unique data quality check. If we apply it to PERSON.YEAR_OF_BIRTH here is how that might look:

The number and percent of records with a value in the year_of_birth field of the PERSON table less than 1850.

And, since it is parameterized, we can similarly apply it to DRUG_EXPOSURE.days_supply:

The number and percent of records with a value in the days_supply field of the DRUG_EXPOSURE table less than 0.
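
As a rough illustration of how such a parameterized check type can be turned into concrete SQL, here is a minimal sketch using SqlRender; the template text and parameter names below are simplified examples, not the actual check SQL shipped with the package.

# Simplified, illustrative check template (not the package's real SQL)
library(SqlRender)

checkTemplate <- "
SELECT COUNT(*) AS num_violated_rows
FROM @cdmDatabaseSchema.@cdmTableName
WHERE @cdmFieldName < @plausibleValueLow;"

# Substitute concrete values to turn the check type into a single check
sql <- render(checkTemplate,
              cdmDatabaseSchema = "cdm",
              cdmTableName = "person",
              cdmFieldName = "year_of_birth",
              plausibleValueLow = 1850)
cat(sql)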

Version 1 of the tool includes 24 different check types organized into Kahn contexts and categories. Additionally, each data quality check type is considered either a table-level, field-level, or concept-level check. Table-level checks are those evaluating a table at a high level without reference to individual fields, or those that span multiple event tables. These include checks making sure required tables are present or that at least some of the people in the PERSON table have records in the event tables. Field-level checks are those related to specific fields in a table. The majority of the check types in version 1 are field-level checks. These include checks evaluating primary key relationships and those investigating whether the concepts in a field conform to the specified domain. Concept-level checks are related to individual concepts. These include checks looking for gender-specific concepts in persons of the wrong gender and plausible values for measurement-unit pairs. For a detailed description and definition of each check type please click here.

After systematically applying the 24 check types to an OMOP CDM instance, approximately 4,000 individual data quality checks are resolved, run against the database, and evaluated against a pre-specified threshold. The R package then creates a JSON object that is read into an R Shiny application to view the results.
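
A minimal usage sketch is shown below; the connection settings, schema names, and the JSON file name are placeholders, so consult the package documentation for the current function signatures and defaults before relying on them.

library(DataQualityDashboard)

# Placeholder connection details; adjust for your environment
connectionDetails <- DatabaseConnector::createConnectionDetails(dbms = "postgresql",
                                                                server = "localhost/ohdsi",
                                                                user = "user",
                                                                password = "secret")

# Run the checks and write the JSON results file to the output folder
executeDqChecks(connectionDetails = connectionDetails,
                cdmDatabaseSchema = "cdm",
                resultsDatabaseSchema = "results",
                cdmSourceName = "My CDM",
                outputFolder = "output")

# Launch the R Shiny dashboard on the resulting JSON file (file name is illustrative)
viewDqDashboard(jsonPath = "output/results.json")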

Features

  • Utilizes configurable data check thresholds
  • Analyzes data in the OMOP Common Data Model format for all data checks
  • Produces a set of data check results with supplemental investigation assets.

Data Requirements

Prior to execution, the DataQualityDashboard package requires that the CDM_SOURCE table be properly populated. The following table is a guide to the expected contents of the CDM_SOURCE table.

| cdmFieldName | userGuidance | etlConventions |
| --- | --- | --- |
| cdm_source_name | The name of the CDM instance. | Descriptive name for the source data. |
| cdm_source_abbreviation | The abbreviation of the CDM instance. | The abbreviation should be consistent across releases from the same source. |
| cdm_holder | The holder of the CDM instance. | The institution that controls access to the data. If possible, include contact information for who to contact to request access to the data. |
| source_description | The description of the CDM instance. | Add notes, caveats, and special characteristics about the source data that would not be assumed from the general descriptive name. This description is intended to help analysts determine whether the data are suitable for the problem they are studying. |
| source_documentation_reference | Reference to where one can find documentation about the source data. | Can include URLs, file names, and source data experts' contact information (if they agree to it). |
| cdm_etl_reference | Reference to where one can find documentation about the ETL from the source into the OMOP CDM. | Assumes there is a document or files (such as Rabbit-in-a-Hat) describing the ETL. May be the location of the ETL source and documentation repository. |
| source_release_date | The release date of the source data. | When the source data was made available for ETL'ing. For sites doing incremental updates, the date the last increment was made available. This implies that for sites doing incremental updates the CDM_SOURCE table should be updated to reflect that changes were made to the CDM. |
| cdm_release_date | The release date of the CDM instance. | When the CDM instance was made available for general use. For sites doing incremental updates, this implies that the CDM_SOURCE table should be updated to reflect that changes were made to the CDM. |
| cdm_version | Identifies the CDM version. | Enter the numeric portion of the version, e.g. 5.4 |
| cdm_version_concept_id | The concept id representing the version of the CDM. | SELECT concept_id FROM concept WHERE domain_id = 'Metadata' AND vocabulary_id = 'CDM' AND concept_code LIKE '%[numeric portion of the version]%' |
| vocabulary_version | The vocabulary version used in the ETL. | Obtained by SELECT vocabulary_version FROM vocabulary WHERE vocabulary_id = 'None' |
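
As one way to populate the table from R (a sketch only; the field values, concept id, and schema names below are illustrative and should be checked against your own source), a single-row data frame can be written with DatabaseConnector:

library(DatabaseConnector)

# Example CDM_SOURCE row; every value below is a placeholder
cdmSource <- data.frame(
  cdm_source_name = "My Health System CDM",
  cdm_source_abbreviation = "MHS",
  cdm_holder = "My Health System (data-access@example.org)",
  source_description = "EHR data from the My Health System network.",
  source_documentation_reference = "https://example.org/source-docs",
  cdm_etl_reference = "https://example.org/etl-repo",
  source_release_date = as.Date("2023-06-01"),
  cdm_release_date = as.Date("2023-07-01"),
  cdm_version = "5.4",
  cdm_version_concept_id = 756265,  # verify against your vocabulary (Metadata domain, CDM vocabulary)
  vocabulary_version = "v5.0 22-JUN-23"
)

# connectionDetails as created with DatabaseConnector::createConnectionDetails()
connection <- connect(connectionDetails)
insertTable(connection = connection,
            tableName = "cdm.cdm_source",
            data = cdmSource,
            dropTableIfExists = FALSE,
            createTable = FALSE)
disconnect(connection)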

Technology

DataQualityDashboard is an R package.

System Requirements

Requires R (version 3.2.2 or higher). Requires DatabaseConnector (version 2.0.2 or higher).

A variety of database platforms are supported, as documented here.

Note that while data quality check threshold files are provided for OMOP CDM versions 5.2, 5.3, and 5.4, the package is currently only tested against versions 5.3 and 5.4.

Installation

  1. See the instructions here for configuring your R environment, including RTools and Java.

  2. In R, use the following commands to download and install DataQualityDashboard:

install.packages("remotes")
remotes::install_github("OHDSI/DataQualityDashboard")

User Documentation

Documentation can be found on the package website.

PDF versions of the documentation are also available:

Support

License

DataQualityDashboard is licensed under Apache License 2.0

Development

DataQualityDashboard is being developed in RStudio.

Development status

The latest DataQualityDashboard release (representing the code in the main branch) is ready for use.

Acknowledgements

  • This project is supported in part through the National Science Foundation grant IIS 1251151.

1 Kahn, M.G., et al., A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS (Wash DC), 2016. 4(1): p. 1244.

dataqualitydashboard's People

Contributors

ablack3, actualben, alondhe, andyjessen, anthonymolinaro, anthonysena, chandryou, clairblacketer, cukarthik, danielh2942, dimshitc, dleprovost, dmitryilyn, don-torok, dsjy09, eminty, fdefalco, hollorol, jdposada, jshoughtaling, katy-sadowski, ksdkalluri, maximmoinat, pbr6cornell, priagopal, tmswhite, tomwhite-medstar, vojtechhuser


dataqualitydashboard's Issues

Error when executing writeJsonResultsToTable

I got a connection error when trying to write the results in a JSON file.

> DataQualityDashboard::writeJsonResultsToTable(connectionDetails = connectionDetails, 
+                                       resultsDatabaseSchema = resultsDatabaseSchema, 
+                                       jsonFilePath = "/tmp/result.json")
Error in open.connection(con, "rb") : cannot open the connection

I tried to solve it by providing the file in advance but instead, I got a new error message:

Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
                                       
                     (right here) ------^

missing ER visits rule (threshold) - editing the KB

For Psoriasis, we define low and high values.

How would I define a low value of at least 0.001% of ER visits?

Also, the current file lists them as condition, and that seems wrong.

Do I just edit row 3 in the concept-level file?

Edits to the knowledge base - how does a PR for those work?

The DataQuality study produced a lot of reference benchmark data and I would be super keen to implement them in DQD.


Add clear documentation for each check

There is currently only one short sentence given for each check. We should add better documentation about what each one is doing. Let's start by adding it to the website and then move it into the DQD itself later.

Postgres and CDM_DATATYPE check

Hi @AnthonyMolinaro ,

Can you take a look at the CDM_DATATYPE SQL template? It looks like in Postgres, the "~" operator isn't supported:

ERROR: operator does not exist: integer ~ unknown Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.

Thanks,
Ajit

Excel to csv files

The goal is for the dashboard to read from four CSV files specifying the checks (table, field, and concept) and the descriptions of the checks. It is difficult to keep four files up to date, so it would be nice to have a mechanism that reads the Excel file and creates the CSVs from it.
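
A possible approach (a sketch; the workbook path and sheet names are assumptions) is to read each sheet of the Excel file with readxl and write it out as its own CSV:

library(readxl)

xlsxPath <- "inst/csv/DQ_Check_Definitions.xlsx"  # hypothetical workbook name
for (sheet in excel_sheets(xlsxPath)) {
  checks <- read_excel(xlsxPath, sheet = sheet)
  write.csv(checks, file.path("inst/csv", paste0(sheet, ".csv")),
            row.names = FALSE, na = "")
}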

Connecting to the DB server only once

Currently it seems a new connection is opened for every check. This seems inefficient, and for some DBMSs this is even impossible (Oracle will run out of its pool of random numbers). Instead of passing the connectionDetails object (internally), could a connection object be passed?
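
A sketch of the suggested pattern, assuming the rendered check queries are available as a list of SQL strings (checkSqlStatements is a placeholder name): open one connection up front, reuse it for every check, and close it once at the end.

library(DatabaseConnector)

runAllChecks <- function(connectionDetails, checkSqlStatements) {
  connection <- connect(connectionDetails)
  on.exit(disconnect(connection))          # close the single connection when done
  lapply(checkSqlStatements, function(sql) {
    querySql(connection, sql)              # every check reuses the same connection
  })
}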

Design proposal

Background

OHDSI has accumulated considerable expertise in library development, and it is now possible to outline several lessons learned so that we can stick to the best patterns during development of the DQ Dashboard.

Positive patterns:

  • Back-end represented by SQL templates with little or no dynamic SQL;
  • Encapsulation of middle-tier logic into a Java library so that it can be used by both R and Java web apps, with rJava used to call the Java library from R. E.g. SqlRender, FeatureExtraction
  • Standardization of specifications / DTOs. E.g. StandardizedAnalysisAPI

Problematic patterns:

  • Middle-tier logic written only in R. This makes it impossible to re-use the code and forces people to duplicate code when they want to implement the use case in Java or another language. E.g.: Arachne duplicates the logic of Achilles query execution and results assembly into JSONs.
  • UI coded in R Shiny. The same issue - you cannot simply embed such a UI into an existing web app; only an iframe would work. E.g. the PLE / PLP viewers - their integration into Atlas would require a separate R server and complicated logic, or re-writing the code. Proposed solution: apply a pattern similar to what is done at the middle tier - code the UI in HTML and JS, wrap it into a library, and use that library in R Shiny via htmlwidgets

High-level Design

Use-cases


Components


Technology stack:

  • SQL for analytics
  • Java for middle-tier logic
  • R with rJava to wrap the Java library into an R library
  • UI in HTML and JS, with the Vega library for visualizations and its JSON spec as the standard data exchange format
  • htmlwidgets to use the HTML/JS/CSS in R and R Shiny

Discussed with @alondhe

Identify patterns in DQ SQL

Examine the DQ checks and the SQL necessary to perform them in order to identify SQL patterns. These patterns will then become parameterized SQL templates.
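
For example, a pattern such as "count NULL values in a given field" could become one template that is rendered with concrete table and field names and then translated to the target dialect with SqlRender (a sketch; the template text is illustrative):

library(SqlRender)

pattern <- "SELECT COUNT(*) AS num_null_rows FROM @cdmDatabaseSchema.@cdmTableName WHERE @cdmFieldName IS NULL;"

sql <- render(pattern,
              cdmDatabaseSchema = "cdm",
              cdmTableName = "person",
              cdmFieldName = "gender_concept_id")
translate(sql, targetDialect = "postgresql")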

Add vignette

Add vignette covering how to execute the checks, but also how to interpret the results

Finalize DQ Checks - Field Level

  1. CDM_FIELD: Given the tables present (as found in table level checks), verify the fields conform to CDM spec.
  2. IS_NOT_NULLABLE: For each table, check that the fields in which IS_NOT_NULLABLE == TRUE, there are no null values in that field.
  3. CDM_DATATYPE: At a minimum, for each field that is supposed to be an integer, verify it is an integer
  4. IS_PRIMARY_KEY: Primary Key - verify those fields where IS_PRIMARY_KEY == Yes, the values in that field are unique
  5. IS_FOREIGN_KEY, FK_TABLE, FK_FIELD: Foreign Key - all values are valid based on the associated table
  6. IS_STANDARD_VALID_CONCEPT: all standard concept id fields are standard and valid
  7. FK_DOMAIN: all standard concept ids are part of specified domain
  8. FK_CLASS: Drug era standard concepts, ingredients only
  9. MEASURE_VALUE_COMPLETENESS: computing the number of null values and their proportion of total records (a sketch follows this list)
  10. STANDARD_CONCEPT_RECORD_COMPLETENESS: number of 0s / total number of records
  11. SOURCE_CONCEPT_RECORD_COMPLETENESS: number of 0s / total number of records
  12. SOURCE_VALUE_COMPLETENESS: number of source values with 0 standard concept / number of distinct source values
  13. PLAUSIBLE_VALUE_LOW: get number of records and the proportion to total number of eligible records that fall below this threshold
  14. PLAUSIBLE_VALUE_HIGH: get number of records and the proportion to total number of eligible records that exceed this threshold
  15. PLAUSIBLE_TEMPORAL_AFTER: get number of records and the proportion to total number of eligible records with datetimes that do not occur on or after their corresponding datetimes
  16. PLAUSIBLE_DURING_LIFE: get number of events that occur after death event (PLAUSIBLE_DURING_LIFE == Yes)
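
As an illustration of one of these patterns, here is a sketch of how check 9 (MEASURE_VALUE_COMPLETENESS) could be expressed as SQL run from R; the table and field names are examples, and connection is assumed to come from DatabaseConnector::connect().

sql <- "
SELECT SUM(CASE WHEN value_as_number IS NULL THEN 1 ELSE 0 END) AS num_null_rows,
       1.0 * SUM(CASE WHEN value_as_number IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_null_rows
FROM cdm.measurement;"

DatabaseConnector::querySql(connection, sql)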

notification about non-preferred units

KB-driven templated rules are great! I tried to have those in Achilles and I was always referred to the perfect re-design in phase 2 of Achilles. DQD just did it. I love it.

In the ThemisConcept study, we defined data-driven preferred units.

How can DQD implement this logic?

Perhaps a column that indicates a unit as non-preferred?


Add vocabulary tables to be evaluated

Currently, the DQ dashboard only looks at CDM tables. A future version should include the vocabulary tables as well. For example, a recent download of the vocabulary switched a column's blanks to NULLs; an issue the DQ dashboard would have caught.

left menu resize issues

latest change to add the data source name has introduced some resize issues with the left menu


Expose parameter values in HarnessV2.R?

@fdefalco, @clairblacketer:

Would it be possible to return parameter values? For example, a recent flatiron run shows that 100% of the drug_era rows violate check 513. The sql follows:

SELECT num_violated_rows, 1.0*num_violated_rows/denominator.num_rows AS pct_violated_rows
FROM (
SELECT COUNT(violated_rows.violating_field) AS num_violated_rows
FROM(
SELECT 'PERSON.gender_concept_id' AS violating_field, PERSON.* 
FROM cdm.PERSON LEFT JOIN cdm.CONCEPT ON PERSON.gender_concept_id = CONCEPT.CONCEPT_ID
WHERE CONCEPT.DOMAIN_ID != '@fkDomain' OR CONCEPT.CONCEPT_CLASS_ID != '@fkClass' 
) violated_rows
) violated_row_count,
( SELECT COUNT(*) AS num_rows FROM cdm.PERSON ) denominator;

The checkResults data frame has the QUERY_TEXT column, making it easy to get the query:

checkResults[which(checkResults$CHECK_ID==513),"QUERY_TEXT"]

However, it's hard to debug without the actual parameter values for @fkDomain and @fkClass.

Thanks!

Hard-coded call to jConnection fails on non-JDBC connections (like Eunomia)

There is a call to the JConnection property to turn autoCommit on here. However, not all connection objects have this property (specifically, SQLite connections such as those used by Eunomia don't).

Is this call really necessary when just querying the data? DatabaseConnector in general knows when to turn autoCommit on and off.

If really necessary, please use

if (inherits(connection, "DatabaseConnectorJdbcConnection")) {...}

to only apply the code to the appropriate objects.

Add UI to view the checks

We currently only offer the CSVs as the source of what the checks are before running the package. This is cumbersome for users to consume; perhaps we could add a UI that shows the check names and fully qualified check descriptions.

We could add a viewDqChecks() function that you provide all the same params as executeDqChecks, and it launches a web page with a dynamic table of checks to be run given the params. This web page should also be used as a section in the dashboard itself.

Thoughts @fdefalco and @clairblacketer ?

Achilles Heel test for era building

There are currently no tests for whether the era building went OK, even though it is easy to make mistakes there.

I'm currently using a simple test: I'm comparing the results of these two queries:

SELECT DISTINCT person_id, condition_concept_id FROM condition_occurrence;
SELECT DISTINCT person_id, condition_concept_id FROM condition_era;

and they should be identical. (It would be easy to test this through a join I guess). If that doesn't scale to bigger databases, at least we could count the number of people with a condition and compare those counts.
For drug_era we'd first need to map to ingredients, but otherwise it would be the same.
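
One way to express that comparison as a single query (a sketch; EXCEPT may need dialect translation, e.g. via SqlRender, and connection is assumed to come from DatabaseConnector::connect()): run it in both directions and expect zero rows each time.

sql <- "
SELECT DISTINCT person_id, condition_concept_id FROM cdm.condition_occurrence
EXCEPT
SELECT DISTINCT person_id, condition_concept_id FROM cdm.condition_era;"

# Zero rows means every occurrence pair is covered by an era;
# swap the two SELECTs to test the opposite direction.
DatabaseConnector::querySql(connection, sql)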

Field checks when table doesn't exist

When not-required tables don't exist in a CDM, the field checks are still run and throw errors about the missing tables. Should we make the field checks conditional on the presence of the table?

execution_time capture

The current check execution time capture includes units, which then vary, so the output in the checkResults data frame looks like this:

1.757 mins
2.324 secs
1.893 mins

It would be easier if we just captured the execution time in total seconds without the unit names, rounded to two decimals.
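
A possible fix (a sketch): record the elapsed time explicitly in seconds and round it, so the column is numeric and unit-free.

start <- Sys.time()
# ... run the check ...
executionTimeSecs <- round(as.numeric(difftime(Sys.time(), start, units = "secs")), 2)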

scientific notation

use scipen=999 to ensure values are not displayed using scientific notation
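
For example, setting the option once per session before formatting output:

options(scipen = 999)  # avoid scientific notation when numbers are printed or formatted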

Create Result JSON file

Early draft:

{
	"levels": {
		"table": {
			"issues": [
				{
					"is_required": [ "OBSERVATION_PERIOD" ]
				}
			],
			"results": [ { "cdm_table" : "PERSON" },
						{ "cdm_table" : "VISIT_OCCURRENCE",
						  "is_temporally_constant": [
							{ "date": "2000-01", "value": 5.6 }
						  ]
						}, 
						{ "cdm_table": "CONDITION_OCCURRENCE",
						  "is_temporally_constant": [
							{ "date": "2000-01", "value": 5.6 }
						  ]
						} 
			]
		},
		"field": {					
			"cdm_table": "PERSON",
			"issues": [
				{
					"exists": [],
					"is_required": [{ "cdm_field": "gender_concept_id"}],
					"cdm_datatype": [ { "cdm_datatype": "birth_datetime", "original": "date" } ],
					"pk": [],
					"fk": [],
					"fk_table": [],
					"fk_field": [],
					"fk_domain": [],
					"fk_class": [],
					"measure_value_comp": [],
					"measure_mapping_comp": [],
					"plausible_value_low": [],
					"plausible_value_high": [],
					"plausible_during_life": []
				}
			]				
		},
		"concept": {
			"issues": [
				{
					"units": [ 
						{ "concept_id": 3025315, "orig_unit_concept_id": 9999 }, 
						{ "concept_id": 3036277, "orig_unit_concept_id": 9999 }
					],
					"plausible_value_low": [ { } ],
					"plausible_value_high": [],
					"plausible_gender": [],
					"temporal": [],
					"valid_prev_low": [],
					"valid_prev_high": []					
				}
			]
		}		
	}
}

special characters in metadata

When pulling the CDM source metadata, special characters encoded in latin1 are brought into the JSON file. When calling viewDqDashboard, the jsonlite call to read the JSON fails due to these characters.

Need to re-encode to UTF-8?
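
A sketch of the suggested fix, assuming the metadata is held in a data frame column before the JSON is written (the object and column names here are illustrative):

cdmSourceData$source_description <- iconv(cdmSourceData$source_description,
                                          from = "latin1", to = "UTF-8")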

empiric threshold implementation from OHDSI DataQuality network study

I would like to add to DQD the results of the DataQuality study. It is published as a Medinfo poster; the abstract is below.
We used a method of 10th and 90th percentiles to derive benchmark data. See the yellow highlights.
E.g., what is a threshold for expected ER visits?

The reference thresholds are here
https://github.com/vojtechhuser/DataQuality/blob/master/inst/csv/empiric_reference.csv .

Abstract
Large healthcare datasets of Electronic Health Record data have become indispensable in clinical research. Data quality in such datasets has recently become a focus of many distributed research networks. Despite the fact that data quality is specific to a given research question, many existing data quality platforms prove that general data quality assessment at the dataset level (given a spectrum of research questions) is possible and highly requested by researchers. We present a comparison of 12 datasets and an extension of the Achilles Heel data quality software tool with new rules and data characterization measures.
Keywords: data quality, observational study
Introduction
Data quality is an important pre-requisite for research on Electronic Health Record (EHR) data. In recent years, several efforts and tools have emerged that perform data quality assessment (DQA).[1] Another important trend is that research is increasingly conducted using distributed research networks. Such networks often provide tools to their data partners that lower the barrier to join or participate within the network and help with data preparation or analysis execution.
The Achilles tool from the Observational Health Data Sciences and Informatics Consortium (OHDSI) is one such tool that performs data characterization and includes an Achilles Heel part that contains rules for checking data quality (DQ). The Achilles tool was first deployed in October 2014 (version 1.0) with several updates (versions 1.1 to 1.6) during a period from 2014 to 2018. Since 2016, the web-based user interface part of Achilles was incorporated into the OHDSI Atlas tool, bringing data source summary statistics and DQA results into the primary study design application used by OHDSI researchers.
In developing the Achilles tool, the OHDSI consortium actively encourages researchers to submit requests for new data quality checks or insightful data visualizations that would extend the tool’s utility. The Achilles software repository receives numerous inputs (in the form of GitHub issues) that identify such new DQ measures or checks. In addition to this ongoing feedback, the European EMIF research network conducted a formal survey of the tool that indicated the need for new features.
This study describes a set of extensions of the Achilles tool based on a comparison of data quality indicators of several healthcare datasets.
Methods
The study had two goals. The first goal was to compare data quality characteristics across datasets. Informed by this comparison, the second goal was to extend Achilles with new features and new data quality rules that would improve the assessment of data quality generated by the tool. This study includes a larger set of exported dataset metadata compared with a previous study done by our team, which only focused on Achilles Heel output messages.
The Achilles tool currently generates over 170 measures (e.g., number of measurement records by test and unit). However, many healthcare dataset administrators are not permitted to share such a comprehensive set of dataset indicators. To be able to conduct our comparison, we designed a smaller set of measures generated by Achilles pre-computations that includes only measures that were deemed acceptable by the dataset administrators.
To maintain a data aggregation privacy-preserving principle for our comparison, our study used a small-cell count threshold of 11+ patients per aggregated count. The Achilles tool allows suppressing pre-computations that result in small counts of patients (or small counts of providers or healthcare events). This filtering can be done when Achilles pre-computations are executed, but if it was not done during the Achilles pre-computation phase, our methodology enforced it again during site data extract generation. The R package for our study (called DataQuality) is open source and available on the GitHub platform at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/DataQuality. Actual input data for the study consisted of the following: (1) a subset of Achilles analyses converted to ratios (for example, the ratio of persons with at least one visit, by visit type); (2) Achilles derived measures (for example, the percentage of unmapped source data concepts by domain); and (3) an approximate size of the dataset (for example, <10k, 10-99k, 100k-1M, 1-5M, 5-9M and >10M; the exact size of the population is masked into a dataset size category). Sample input data (for a synthetic SynPUF OMOP dataset) is available on GitHub.
Each dataset was assigned a random meaningless identifier to facilitate the comparison. The reason for this dataset masking is that data quality comparisons can lead to withdrawal of a data partner from a research consortium (or an analysis project) if a particular partner's dataset is identified as having low quality data. Masking was done to avoid this outcome and to focus on advancing the methodologies for DQA. For the same reason, a list of individual datasets is not provided either. We plan to destroy individual site aggregated data used as input 6 months after the publication of the study results. To protect the sites, only masked and isolated combined comparisons are reported in this article. Non-aggregated, single dataset DQ data are never posted publicly.
Determination of goodness of fit or “data fitness” is highly dependent on the research question being asked. This phenomenon was described earlier and is sometimes referred to as the task-dependent nature of DQA.[2] A dataset that only contains inpatient events and data may not be appropriate for general research questions (e.g., a descriptive study of the course of a disease); however, it may be sufficient for a subset of research questions (e.g., inpatient-only research questions).
One can conclude that without knowing the specific research question context, any data quality assessment is impossible to pre-empt. This requirement for specifying research question context up-front makes development of general DQA tools almost impossible. However, existing DQA tools and efforts indicate that some general DQA rules indeed exist.
To partially overcome this problem (“data fitness for what?”), we assumed that the dataset being assessed represents the lifetime record of a general population and that the tool should perform DQA for a wide range of possible research questions (“general data fitness for a wide range of research questions”). Once a general DQA assessment is done, a researcher with a specific research analysis can simply ignore DQA messages that do not apply to his/her context (e.g., ignore messages about a lack of eye doctor visits and eye care data if data about vision care are not essential for his/her research question).
Results
A total of 12 datasets were compared in the study; however, due to the use of prior Achilles versions by some sites, comparisons of some newly implemented measures are made on data from datasets that implemented at least Achilles version 1.4 at the time of our study data extraction.
Version 1.6 of Achilles contains a total of 44 data quality rules (also called data quality checks). A total of 12 rules are model conformance rules that check adherence to the CDM specification. For example, a model conformance rule may require that provider specialty column contains only concepts that are indeed specialties. The remaining 22 rules are data quality rules that check for data completeness, data plausibility or other data quality problems. Such rules can be considered model-independent and should be portable to other data models, such as Sentinel model or PCORNet. The pooled dataset of all Achilles Heel messages from all datasets consisted of 546 messages. For a single dataset, we found a median of 51 of Heel messages, a median of 25 errors, a median of 22 warnings, and a median of 4 notifications. The poster will show evaluation of severity of each rule violation by computing median record counts for each rule. The second goal of our study was to add new functionality (either new DQ rules or new DQ measures to Achilles) based on availability of data about multiple CDM datasets. The results are divided into multiple sections according to the data domain of the new rule and will be included in the poster.
(1) Empirical rules: Comparing selected dataset parameters and computing 90th or 10th percentile and using them as benchmark thresholds.
(2) Data density rules: We considered data density on several levels: concepts per person as the number of distinct measurements per person (e.g., count of 2 measurements per person, such as cholesterol and hematocrit). This comparison aims at “data breadth”; records per person as the total number of all measurement records per person (e.g., count of 8 tests, such as 3 LDL cholesterol and 5 hematocrit measurements). This comparison aims at “data depth”; records per visit as a data density measure on a visit level. Because visits with no measurements can occur, the per visit count can be below 1. However, for a similar per visit count analyzing clinical notes (if in scope for the dataset), it may be reasonable to expect at least one note per visit.
(3) Minimum-data patients: For many research questions, at least one data point in a given clinical data domain (such as medications) is required for any meaningful analysis. For example, for analyzing event prevalence, using a proper denominator and determining the size of the relevant population can significantly affect the reported measure. We determined empiric thresholds for existing Achilles DQA measures that count the number of patients with at least one event in a clinical data domain (e.g., patients with at least one visit, patients with at least 1 diagnosis and 1 medication).
(4) Unmapped data: OMOP CDM allows storage of data that is not fully semantically mapped to standard concepts (for example, drug exposure data may include data rows that have a value of 0 (‘No matching concept’) in drug_concept_id while the yet-to-be-mapped local code is stored in drug_source_value). We introduced measures computing unmapped data and threshold rules for several domains, such as Conditions, Procedures or Drug Exposure.
Discussion and Conclusion
Our current method for picking an empiric threshold is using a fixed threshold (e.g., 10th percentile). Future methodology revision may alter this approach for each considered DQ measure. Another limitation is our primary focus on OMOP CDM sites. Our extensions to Achilles rule knowledge base, however, point to what data measures are required by each rule and whether a rule is terminology dependent. We compared data quality indicators across several datasets. We arrived at empirical values that could be used as thresholds for several DQA measures. The study resulted in several new data quality checks being added to Achilles.
Acknowledgement: We would like to thank Ritu Khare, Taha Abdul, Chris Knoll and Martijn Schuemie. VH work was supported by the Intramural Research Program of the National Institutes of Health (NIH)/ National Library of Medicine (NLM)/ Lister Hill National Center for Biomedical Communications (LHNCBC).
References
[1] H. Estiri and K. Stephens, DQe-v: A Database-Agnostic Framework for Exploring Variability in Electronic Health Record Data Across Time and Site Location, eGEMs (Generating Evidence & Methods to improve patient outcomes) 1 (2017).
[2] N.G. Weiskopf and C. Weng, Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research, J Am Med Inform Assoc 20 (2013), 144-151.

Value as concept id vs value as number Rule in Measurement

Some concept IDs in measurement require 'value as concept id' rather than 'value as number', such as 'presence' or 'dip stick' results (because a dip stick result is 1+, 2+, which is not numeric).

Should we add this rule to the DQ dashboard?

Update checks to exclude NULL child rows

Queries that check for "orphan rows" typically use an outer join. However, when the child column also has NULL rows, the count may return a non-zero value. For example:

with
child  as (select 1 id union select 2 union select NULL),
parent as (select 1 id union select 2 union select 3)
select count(*) as orphan_count1, count(p.id) as orphan_count2
from child c left join parent p on c.id = p.id
where p.id is null

orphan_count1 | orphan_count2
--------------+--------------
            1 |             0

orphan_count1 indicates a foreign key violation while orphan_count2 does not.

The fix is to either use the parent column in the count or add an additional predicate to the WHERE clause (and c.id is not null).

Events After Death

This is more of a question than an issue, but looking at Themis Issue 42, the proposal is to accept events within 60 days post death. Will this be incorporated into PLAUSIBLE_DURING_LIFE mentioned in this issue for the DQDashboard? I'm happy to make a PR but wanted to see if it was by design to flag all events after death.
By the way, the dashboard is awesome!!!

Design JSON files

We need to create a standard format for the JSON files:

  1. Reference files that establish the DQ checks
  2. Threshold file that sets the thresholds of success/warn/fail for a CDM
  3. Results file that highlights DQ check violations and contains value sets

Finalize DQ Checks - Concept Level

  1. PLAUSIBLE_VALUE_LOW: given measurements with the specified measurement_concept_id and unit_concept_id, get the number of records that are below the PLAUSIBLE_VALUE_LOW
  2. PLAUSIBLE_VALUE_HIGH: given measurements with the specified measurement_concept_id and unit_concept_id, get the number of records that are above the PLAUSIBLE_VALUE_HIGH
  3. PLAUSIBLE_GENDER: for rows with PLAUSIBLE_GENDER in ("Male", "Female"), quantify patients who have these measurements but whose gender_concept_id does not match.

DQ Check Numbering

We need to devise a numbering scheme for each DQ check so that we can identify each check and the result easily.

@clairblacketer has volunteered, anyone else want to help?

Not required foreign key check

For fields that are not required, if null then it should not be checked for foreign key compliance. A check shouldn't fail because it isn't required but is then evaluated later.

Finalize DQ Checks - Table Level

In this level, we are focusing on 2 patterns:

  1. Ensure that every person has at least 1 observation_period (a sketch follows this list)
  2. Tabulating number of records per person per month for tables we know to not change over time (IS_TEMPORALLY_CONSTANT != Yes).
    Example: for 2009-01, we see on average 10 conditions per person, then in 2009-02, we see average of 20 conditions per person. We want to highlight and understand that change: perhaps an EHR added a new practice, which is the reason for the increase.
  • Visualized as histogram
  • Perhaps leverage Kronos to identify temporal breaks
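
A sketch of pattern 1 as SQL run from R (the schema name is an example, and connection is assumed to come from DatabaseConnector::connect()): count persons with no observation_period record, which should be zero.

sql <- "
SELECT COUNT(*) AS persons_without_observation_period
FROM cdm.person p
LEFT JOIN cdm.observation_period op ON p.person_id = op.person_id
WHERE op.person_id IS NULL;"

DatabaseConnector::querySql(connection, sql)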

Add easy to copy violating records SQL

In the dashboard, we do provide the full query the DQD executed, which includes a subquery for seeing the violating records.

However, we should make it easier to grab the violating records query.
