
dataqualitydashboard's Introduction

DataQualityDashboard


DataQualityDashboard is part of HADES.

The goal of the Data Quality Dashboard (DQD) project is to design and develop an open-source tool to expose and evaluate observational data quality.

Introduction

This package will run a series of data quality checks against an OMOP CDM instance (currently supporting v5.2, v5.3, and v5.4). It systematically runs the checks, evaluates each check against a pre-specified threshold, and then communicates what was done in a transparent and easily understandable way.

Overview

The quality checks were organized according to the Kahn Framework¹, which uses a system of categories and contexts that represent strategies for assessing data quality. For an introduction to the Kahn framework please click here.

Using this framework, the Data Quality Dashboard takes a systematic approach to running data quality checks. Instead of writing thousands of individual checks, we use “data quality check types”. These “check types” are more general, parameterized data quality checks into which OMOP tables, fields, and concepts can be substituted to represent a singular data quality idea. For example, one check type might be written as

The number and percent of records with a value in the cdmFieldName field of the cdmTableName table less than plausibleValueLow.

This would be considered an atemporal plausibility verification check because we are looking for implausibly low values in some field based on internal knowledge. We can use this check type to substitute in values for cdmFieldName, cdmTableName, and plausibleValueLow to create a unique data quality check. If we apply it to PERSON.YEAR_OF_BIRTH here is how that might look:

The number and percent of records with a value in the year_of_birth field of the PERSON table less than 1850.

And, since it is parameterized, we can similarly apply it to DRUG_EXPOSURE.days_supply:

The number and percent of records with a value in the days_supply field of the DRUG_EXPOSURE table less than 0.
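
As a rough illustration of how such a parameterized check type can be turned into concrete SQL, here is a minimal sketch using SqlRender; the template text and parameter names below are simplified examples, not the actual check SQL shipped with the package.

# Simplified, illustrative check template (not the package's real SQL)
library(SqlRender)

checkTemplate <- "
SELECT COUNT(*) AS num_violated_rows
FROM @cdmDatabaseSchema.@cdmTableName
WHERE @cdmFieldName < @plausibleValueLow;"

# Substitute concrete values to turn the check type into a single check
sql <- render(checkTemplate,
              cdmDatabaseSchema = "cdm",
              cdmTableName = "person",
              cdmFieldName = "year_of_birth",
              plausibleValueLow = 1850)
cat(sql)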

Version 1 of the tool includes 24 different check types organized into Kahn contexts and categories. Additionally, each data quality check type is considered either a table-level, field-level, or concept-level check. Table-level checks are those evaluating a table at a high level without reference to individual fields, or those that span multiple event tables. These include checks making sure required tables are present or that at least some of the people in the PERSON table have records in the event tables. Field-level checks are those related to specific fields in a table. The majority of the check types in version 1 are field-level checks. These include checks evaluating primary key relationships and those investigating whether the concepts in a field conform to the specified domain. Concept-level checks are related to individual concepts. These include checks looking for gender-specific concepts in persons of the wrong gender and plausible values for measurement-unit pairs. For a detailed description and definition of each check type please click here.

After systematically applying the 24 check types to an OMOP CDM instance, approximately 4,000 individual data quality checks are resolved, run against the database, and evaluated against a pre-specified threshold. The R package then creates a JSON object that is read into an R Shiny application to view the results.
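
A minimal usage sketch is shown below; the connection settings, schema names, and the JSON file name are placeholders, so consult the package documentation for the current function signatures and defaults before relying on them.

library(DataQualityDashboard)

# Placeholder connection details; adjust for your environment
connectionDetails <- DatabaseConnector::createConnectionDetails(dbms = "postgresql",
                                                                server = "localhost/ohdsi",
                                                                user = "user",
                                                                password = "secret")

# Run the checks and write the JSON results file to the output folder
executeDqChecks(connectionDetails = connectionDetails,
                cdmDatabaseSchema = "cdm",
                resultsDatabaseSchema = "results",
                cdmSourceName = "My CDM",
                outputFolder = "output")

# Launch the R Shiny dashboard on the resulting JSON file (file name is illustrative)
viewDqDashboard(jsonPath = "output/results.json")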

Features

  • Utilizes configurable data check thresholds
  • Analyzes data in the OMOP Common Data Model format for all data checks
  • Produces a set of data check results with supplemental investigation assets.

Data Requirements

Prior to execution, the DataQualityDashboard package requires that the CDM_SOURCE table be properly populated. The following table is a guide to the expected contents of the CDM_SOURCE table.

| cdmFieldName | userGuidance | etlConventions |
| --- | --- | --- |
| cdm_source_name | The name of the CDM instance. | Descriptive name for the source data. |
| cdm_source_abbreviation | The abbreviation of the CDM instance. | The abbreviation should be consistent across releases from the same source. |
| cdm_holder | The holder of the CDM instance. | The institution that controls access to the data. If possible, include contact information for who to contact to request access to the data. |
| source_description | The description of the CDM instance. | Add notes, caveats, and special characteristics about the source data that would not be assumed from the general descriptive name. This description is intended to help analysts determine whether the data are suitable for the problem they are studying. |
| source_documentation_reference | Reference to where one can find documentation about the source data. | Can include URLs, file names, and source data experts' contact information (if they agree to it). |
| cdm_etl_reference | Reference to where one can find documentation about the ETL from the source into the OMOP CDM. | Assumes there is a document or files (such as Rabbit-in-a-Hat) describing the ETL. May be the location of the ETL source and documentation repository. |
| source_release_date | The release date of the source data. | When the source data was made available for ETL'ing. For sites doing incremental updates, the date the last increment was made available. This implies that for sites doing incremental updates the CDM_SOURCE table should be updated to reflect that changes were made to the CDM. |
| cdm_release_date | The release date of the CDM instance. | When the CDM instance was made available for general use. For sites doing incremental updates, this implies that the CDM_SOURCE table should be updated to reflect that changes were made to the CDM. |
| cdm_version | Identifies the CDM version. | Enter the numeric portion of the version, e.g. 5.4 |
| cdm_version_concept_id | The concept id representing the version of the CDM. | SELECT concept_id FROM concept WHERE domain_id = 'Metadata' AND vocabulary_id = 'CDM' AND concept_code LIKE '%[numeric portion of the version]%' |
| vocabulary_version | The vocabulary version used in the ETL. | Obtained by SELECT vocabulary_version FROM vocabulary WHERE vocabulary_id = 'None' |
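
As one way to populate the table from R (a sketch only; the field values, concept id, and schema names below are illustrative and should be checked against your own source), a single-row data frame can be written with DatabaseConnector:

library(DatabaseConnector)

# Example CDM_SOURCE row; every value below is a placeholder
cdmSource <- data.frame(
  cdm_source_name = "My Health System CDM",
  cdm_source_abbreviation = "MHS",
  cdm_holder = "My Health System (data-access@example.org)",
  source_description = "EHR data from the My Health System network.",
  source_documentation_reference = "https://example.org/source-docs",
  cdm_etl_reference = "https://example.org/etl-repo",
  source_release_date = as.Date("2023-06-01"),
  cdm_release_date = as.Date("2023-07-01"),
  cdm_version = "5.4",
  cdm_version_concept_id = 756265,  # verify against your vocabulary (Metadata domain, CDM vocabulary)
  vocabulary_version = "v5.0 22-JUN-23"
)

# connectionDetails as created with DatabaseConnector::createConnectionDetails()
connection <- connect(connectionDetails)
insertTable(connection = connection,
            tableName = "cdm.cdm_source",
            data = cdmSource,
            dropTableIfExists = FALSE,
            createTable = FALSE)
disconnect(connection)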

Technology

DataQualityDashboard is an R package.

System Requirements

Requires R (version 3.2.2 or higher). Requires DatabaseConnector (version 2.0.2 or higher).

A variety of database platforms are supported, as documented here.

Note that while data quality check threshold files are provided for OMOP CDM versions 5.2, 5.3, and 5.4, the package is currently only tested against versions 5.3 and 5.4.

Installation

  1. See the instructions here for configuring your R environment, including RTools and Java.

  2. In R, use the following commands to download and install DataQualityDashboard:

install.packages("remotes")
remotes::install_github("OHDSI/DataQualityDashboard")

User Documentation

Documentation can be found on the package website.

PDF versions of the documentation are also available:

Support

License

DataQualityDashboard is licensed under Apache License 2.0

Development

DataQualityDashboard is being developed in RStudio.

Development status

The latest DataQualityDashboard release (representing the code in the main branch) is ready for use.

Acknowledgements

  • This project is supported in part through the National Science Foundation grant IIS 1251151.

1 Kahn, M.G., et al., A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. EGEMS (Wash DC), 2016. 4(1): p. 1244.

dataqualitydashboard's People

Contributors

ablack3, actualben, alondhe, andyjessen, anthonymolinaro, anthonysena, chandryou, clairblacketer, cukarthik, danielh2942, dimshitc, dleprovost, dmitryilyn, don-torok, dsjy09, eminty, fdefalco, hollorol, jdposada, jshoughtaling, katy-sadowski, ksdkalluri, maximmoinat, pbr6cornell, priagopal, tmswhite, tomwhite-medstar, vojtechhuser


dataqualitydashboard's Issues

Error when executing writeJsonResultsToTable

I got a connection error when trying to write the results in a JSON file.

> DataQualityDashboard::writeJsonResultsToTable(connectionDetails = connectionDetails, 
+                                       resultsDatabaseSchema = resultsDatabaseSchema, 
+                                       jsonFilePath = "/tmp/result.json")
Error in open.connection(con, "rb") : cannot open the connection

I tried to solve it by providing the file in advance but instead, I got a new error message:

Error in parse_con(txt, bigint_as_char) : parse error: premature EOF
                                       
                     (right here) ------^

missing ER visits rule (threshold) - editing the KB

For Psoriasis, we define low and high values.

How would I define a low value of at least 0.001% of ER visits?

Also, the current file lists them as condition, and that seems wrong.

Do I just edit row 3 in the concept-level file?

Edits to the knowledge base - how does a PR for those work?

The DataQuality study produced a lot of reference benchmark data and I would be super keen to implement them in DQD.


Add clear documentation for each check

There is currently only one short sentence given for each check. We should add better documentation about what each one is doing. Let's start by adding it to the website and then move it into the DQD itself later.

Postgres and CDM_DATATYPE check

Hi @AnthonyMolinaro ,

Can you take a look at the CDM_DATATYPE SQL template? It looks like in Postgres, the "~" operator isn't supported:

ERROR: operator does not exist: integer ~ unknown Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.

Thanks,
Ajit

Excel to csv files

The goal is for the dashboard to read from four CSV files specifying the checks (table, field, and concept) and the descriptions of the checks. It is difficult to keep four files up to date, so it would be nice to have a mechanism that reads the Excel file and creates the CSVs from it.
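
A possible approach (a sketch; the workbook path and sheet names are assumptions) is to read each sheet of the Excel file with readxl and write it out as its own CSV:

library(readxl)

xlsxPath <- "inst/csv/DQ_Check_Definitions.xlsx"  # hypothetical workbook name
for (sheet in excel_sheets(xlsxPath)) {
  checks <- read_excel(xlsxPath, sheet = sheet)
  write.csv(checks, file.path("inst/csv", paste0(sheet, ".csv")),
            row.names = FALSE, na = "")
}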

Connecting to the DB server only once

Currently it seems a new connection is opened for every check. This seems inefficient, and for some DBMSs this is even impossible (Oracle will run out of its pool of random numbers). Instead of passing the connectionDetails object (internally), could a connection object be passed?
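
A sketch of the suggested pattern, assuming the rendered check queries are available as a list of SQL strings (checkSqlStatements is a placeholder name): open one connection up front, reuse it for every check, and close it once at the end.

library(DatabaseConnector)

runAllChecks <- function(connectionDetails, checkSqlStatements) {
  connection <- connect(connectionDetails)
  on.exit(disconnect(connection))          # close the single connection when done
  lapply(checkSqlStatements, function(sql) {
    querySql(connection, sql)              # every check reuses the same connection
  })
}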

Design proposal

Background

OHDSI has accumulated considerable expertise in library development, and it is now possible to outline several lessons learned so that we can stick to the best patterns during development of the DQ Dashboard.

Positive patterns:

  • Back-end represented by SQL templates with little or no dynamic SQL;
  • Encapsulation of middle-tier logic into a Java library so that it can be used by both R and Java web apps, with rJava used to call the Java library from R. E.g. SqlRender, FeatureExtraction
  • Standardization of specifications / DTOs. E.g. StandardizedAnalysisAPI

Problematic patterns:

  • Middle-tier logic written only in R. This makes it impossible to re-use the code and forces people to duplicate code when they want to implement the use case in Java or another language. E.g.: Arachne duplicates the logic of Achilles query execution and results assembly into JSONs.
  • UI coded in R Shiny. The same issue - you cannot simply embed such a UI into an existing web app; only an iframe would work. E.g. the PLE / PLP viewers - their integration into Atlas would require a separate R server and complicated logic, or re-writing the code. Proposed solution: apply a pattern similar to what is done at the middle tier - code the UI in HTML and JS, wrap it into a library, and use that library in R Shiny via htmlwidgets

High-level Design

Use-cases


Components


Technology stack:

  • SQL for analytics
  • Java for middle-tier logic
  • R with rJava to wrap the Java library into an R library
  • UI in HTML and JS, with the Vega library for visualizations and its JSON spec as the standard data exchange format
  • htmlwidgets to use the HTML/JS/CSS in R and R Shiny

Discussed with @alondhe

Identify patterns in DQ SQL

Examine the DQ checks and the SQL necessary to perform them in order to identify SQL patterns. These patterns will then become parameterized SQL templates.
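
For example, a pattern such as "count NULL values in a given field" could become one template that is rendered with concrete table and field names and then translated to the target dialect with SqlRender (a sketch; the template text is illustrative):

library(SqlRender)

pattern <- "SELECT COUNT(*) AS num_null_rows FROM @cdmDatabaseSchema.@cdmTableName WHERE @cdmFieldName IS NULL;"

sql <- render(pattern,
              cdmDatabaseSchema = "cdm",
              cdmTableName = "person",
              cdmFieldName = "gender_concept_id")
translate(sql, targetDialect = "postgresql")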

Add vignette

Add vignette covering how to execute the checks, but also how to interpret the results

Finalize DQ Checks - Field Level

  1. CDM_FIELD: Given the tables present (as found in table level checks), verify the fields conform to CDM spec.
  2. IS_NOT_NULLABLE: For each table, check that the fields in which IS_NOT_NULLABLE == TRUE, there are no null values in that field.
  3. CDM_DATATYPE: At a minimum, for each field that is supposed to be an integer, verify it is an integer
  4. IS_PRIMARY_KEY: Primary Key - verify those fields where IS_PRIMARY_KEY == Yes, the values in that field are unique
  5. IS_FOREIGN_KEY, FK_TABLE, FK_FIELD: Foreign Key - all values are valid based on the associated table
  6. IS_STANDARD_VALID_CONCEPT: all standard concept id fields are standard and valid
  7. FK_DOMAIN: all standard concept ids are part of specified domain
  8. FK_CLASS: Drug era standard concepts, ingredients only
  9. MEASURE_VALUE_COMPLETENESS: computing the number of null values and their proportion of total records (a sketch follows this list)
  10. STANDARD_CONCEPT_RECORD_COMPLETENESS: number of 0s / total number of records
  11. SOURCE_CONCEPT_RECORD_COMPLETENESS: number of 0s / total number of records
  12. SOURCE_VALUE_COMPLETENESS: number of source values with 0 standard concept / number of distinct source values
  13. PLAUSIBLE_VALUE_LOW: get number of records and the proportion to total number of eligible records that fall below this threshold
  14. PLAUSIBLE_VALUE_HIGH: get number of records and the proportion to total number of eligible records that exceed this threshold
  15. PLAUSIBLE_TEMPORAL_AFTER: get number of records and the proportion to total number of eligible records with datetimes that do not occur on or after their corresponding datetimes
  16. PLAUSIBLE_DURING_LIFE: get number of events that occur after death event (PLAUSIBLE_DURING_LIFE == Yes)
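
As an illustration of one of these patterns, here is a sketch of how check 9 (MEASURE_VALUE_COMPLETENESS) could be expressed as SQL run from R; the table and field names are examples, and connection is assumed to come from DatabaseConnector::connect().

sql <- "
SELECT SUM(CASE WHEN value_as_number IS NULL THEN 1 ELSE 0 END) AS num_null_rows,
       1.0 * SUM(CASE WHEN value_as_number IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS pct_null_rows
FROM cdm.measurement;"

DatabaseConnector::querySql(connection, sql)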

notification about non-preferred units

KB-driven templated rules are great! I tried to have those in Achilles and I was always referred to the perfect re-design in phase 2 of Achilles. DQD just did it. I love it.

In the ThemisConcept study, we defined data-driven preferred units.

How can DQD implement this logic?

Perhaps a column that indicates a unit as non-preferred?


Add vocabulary tables to be evaluated

Currently, the DQ dashboard only looks at CDM tables. A future version should include the vocabulary tables as well. For example, a recent download of the vocabulary switched a column's blanks to NULLs; an issue the DQ dashboard would have caught.

left menu resize issues

latest change to add the data source name has introduced some resize issues with the left menu


Expose parameter values in HarnessV2.R?

@fdefalco, @clairblacketer:

Would it be possible to return parameter values? For example, a recent flatiron run shows that 100% of the drug_era rows violate check 513. The sql follows:

SELECT num_violated_rows, 1.0*num_violated_rows/denominator.num_rows AS pct_violated_rows
FROM (
SELECT COUNT(violated_rows.violating_field) AS num_violated_rows
FROM(
SELECT 'PERSON.gender_concept_id' AS violating_field, PERSON.* 
FROM cdm.PERSON LEFT JOIN cdm.CONCEPT ON PERSON.gender_concept_id = CONCEPT.CONCEPT_ID
WHERE CONCEPT.DOMAIN_ID != '@fkDomain' OR CONCEPT.CONCEPT_CLASS_ID != '@fkClass' 
) violated_rows
) violated_row_count,
( SELECT COUNT(*) AS num_rows FROM cdm.PERSON ) denominator;

The checkResults data frame has the QUERY_TEXT column, making it easy to get the query:

checkResults[which(checkResults$CHECK_ID==513),"QUERY_TEXT"]

However, it's hard to debug without the actual parameter values for @fkDomain and @fkClass.

Thanks!

Hard-coded call to jConnection fails on non-JDBC connections (like Eunomia)

There is a call to the JConnection property to turn autoCommit on here. However, not all connection objects have this property (specifically, SQLite connections such as those used by Eunomia don't).

Is this call really necessary when just querying the data? DatabaseConnector in general knows when to turn autoCommit on and off.

If really necessary, please use

if (inherits(connection, "DatabaseConnectorJdbcConnection")) {...}

to only apply the code to the appropriate objects.

Add UI to view the checks

We currently only offer the CSVs as the source of what the checks are before running the package. This is cumbersome for users to consume; perhaps we could add a UI that shows the check names and fully qualified check descriptions.

We could add a viewDqChecks() function that you provide all the same params as executeDqChecks, and it launches a web page with a dynamic table of checks to be run given the params. This web page should also be used as a section in the dashboard itself.

Thoughts @fdefalco and @clairblacketer ?

Achilles Heel test for era building

There are currently no tests for whether the era building went OK, even though it is easy to make mistakes there.

I'm currently using a simple test: I'm comparing the results of these two queries:

SELECT DISTINCT person_id, condition_concept_id FROM condition_occurrence;
SELECT DISTINCT person_id, condition_concept_id FROM condition_era;

and they should be identical. (It would be easy to test this through a join I guess). If that doesn't scale to bigger databases, at least we could count the number of people with a condition and compare those counts.
For drug_era we'd first need to map to ingredients, but otherwise it would be the same.
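
One way to express that comparison as a single query (a sketch; EXCEPT may need dialect translation, e.g. via SqlRender, and connection is assumed to come from DatabaseConnector::connect()): run it in both directions and expect zero rows each time.

sql <- "
SELECT DISTINCT person_id, condition_concept_id FROM cdm.condition_occurrence
EXCEPT
SELECT DISTINCT person_id, condition_concept_id FROM cdm.condition_era;"

# Zero rows means every occurrence pair is covered by an era;
# swap the two SELECTs to test the opposite direction.
DatabaseConnector::querySql(connection, sql)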

Field checks when table doesn't exist

When not-required tables don't exist in a CDM, the field checks are still run and throw errors about the missing tables. Should we make the field checks conditional on the presence of the table?

execution_time capture

The current check execution time capture includes units, which then vary, so the output in the checkResults data frame looks like this:

1.757 mins
2.324 secs
1.893 mins

It would be easier if we just captured the execution time in total seconds without the unit names, rounded to two decimals.
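
A possible fix (a sketch): record the elapsed time explicitly in seconds and round it, so the column is numeric and unit-free.

start <- Sys.time()
# ... run the check ...
executionTimeSecs <- round(as.numeric(difftime(Sys.time(), start, units = "secs")), 2)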

scientific notation

use scipen=999 to ensure values are not displayed using scientific notation
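
For example, setting the option once per session before formatting output:

options(scipen = 999)  # avoid scientific notation when numbers are printed or formatted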

Create Result JSON file

Early draft:

{
	"levels": {
		"table": {
			"issues": [
				{
					"is_required": [ "OBSERVATION_PERIOD" ]
				}
			],
			"results": [ { "cdm_table" : "PERSON" },
						{ "cdm_table" : "VISIT_OCCURRENCE",
						  "is_temporally_constant": [
							{ "date": "2000-01", "value": 5.6 }
						  ]
						}, 
						{ "cdm_table": "CONDITION_OCCURRENCE",
						  "is_temporally_constant": [
							{ "date": "2000-01", "value": 5.6 }
						  ]
						} 
			]
		},
		"field": {					
			"cdm_table": "PERSON",
			"issues": [
				{
					"exists": [],
					"is_required": [{ "cdm_field": "gender_concept_id"}],
					"cdm_datatype": [ { "cdm_datatype": "birth_datetime", "original": "date" } ],
					"pk": [],
					"fk": [],
					"fk_table": [],
					"fk_field": [],
					"fk_domain": [],
					"fk_class": [],
					"measure_value_comp": [],
					"measure_mapping_comp": [],
					"plausible_value_low": [],
					"plausible_value_high": [],
					"plausible_during_life": []
				}
			]				
		},
		"concept": {
			"issues": [
				{
					"units": [ 
						{ "concept_id": 3025315, "orig_unit_concept_id": 9999 }, 
						{ "concept_id": 3036277, "orig_unit_concept_id": 9999 }
					],
					"plausible_value_low": [ { } ],
					"plausible_value_high": [],
					"plausible_gender": [],
					"temporal": [],
					"valid_prev_low": [],
					"valid_prev_high": []					
				}
			]
		}		
	}
}

special characters in metadata

When pulling the CDM source metadata, special characters encoded in latin1 are brought into the JSON file. When calling viewDqDashboard, the jsonlite call to read the JSON fails due to these characters.

Need to re-encode to UTF-8?
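
A sketch of the suggested fix, assuming the metadata is held in a data frame column before the JSON is written (the object and column names here are illustrative):

cdmSourceData$source_description <- iconv(cdmSourceData$source_description,
                                          from = "latin1", to = "UTF-8")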

empiric threshold implementation from OHDSI DataQuality network study

I would like to add to DQD the results of the DataQuality study. It is published as a Medinfo poster; the abstract is below.
We used a method of 10th and 90th percentiles to derive benchmark data. See the yellow highlights.
E.g., what is a threshold for expected ER visits?

The reference thresholds are here
https://github.com/vojtechhuser/DataQuality/blob/master/inst/csv/empiric_reference.csv .

Abstract
Large healthcare datasets of Electronic Health Record data have become indispensable in clinical research. Data quality in such datasets has recently become a focus of many distributed research networks. Despite the fact that data quality is specific to a given research question, many existing data quality platforms prove that general data quality assessment at the dataset level (given a spectrum of research questions) is possible and highly requested by researchers. We present a comparison of 12 datasets and an extension of the Achilles Heel data quality software tool with new rules and data characterization measures.
Keywords: data quality, observational study
Introduction
Data quality is an important pre-requisite for research on Electronic Health Record (EHR) data. In recent years, several efforts and tools have emerged that perform data quality assessment (DQA).[1] Another important trend is that research is increasingly conducted using distributed research networks. Such networks often provide tools to their data partners that lower the barrier to join or participate within the network and help with data preparation or analysis execution.
The Achilles tool from the Observational Health Data Sciences and Informatics Consortium (OHDSI) is one such tool that performs data characterization and includes an Achilles Heel part that contains rules for checking data quality (DQ). The Achilles tool was first deployed in October 2014 (version 1.0) with several updates (versions 1.1 to 1.6) during a period from 2014 to 2018. Since 2016, the web-based user interface part of Achilles was incorporated into the OHDSI Atlas tool, bringing data source summary statistics and DQA results into the primary study design application used by OHDSI researchers.
In developing the Achilles tool, the OHDSI consortium actively encourages researchers to submit requests for new data quality checks or insightful data visualizations that would extend the tool’s utility. The Achilles software repository receives numerous inputs (in the form of GitHub issues) that identify such new DQ measures or checks. In addition to this ongoing feedback, the European EMIF research network conducted a formal survey of the tool that indicated the need for new features.
This study describes a set of extensions of the Achilles tool based on a comparison of data quality indicators of several healthcare datasets.
Methods
The study had two goals. The first goal was to compare data quality characteristics across datasets. Informed by this comparison, the second goal was to extend Achilles with new features and new data quality rules that would improve the assessment of data quality generated by the tool. This study includes a larger set of exported dataset metadata compared with a previous study done by our team, which only focused on Achilles Heel output messages.
The Achilles tool currently generates over 170 measures (e.g., number of measurement records by test and unit). However, many healthcare dataset administrators are not permitted to share such a comprehensive set of dataset indicators. To be able to conduct our comparison, we designed a smaller set of measures generated by Achilles pre-computations that includes only measures that were deemed acceptable by the dataset administrators.
To maintain a data aggregation privacy-preserving principle for our comparison, our study used a small-cell count threshold of 11+ patients per aggregated count. The Achilles tool allows suppressing pre-computations that result in small counts of patients (or small counts of providers or healthcare events). This filtering can be done when Achilles pre-computations are executed, but if it was not done during the Achilles pre-computation phase, our methodology enforced it again during site data extract generation. The R package for our study (called DataQuality) is open source and available on the GitHub platform at https://github.com/OHDSI/StudyProtocolSandbox/tree/master/DataQuality. Actual input data for the study consisted of the following: (1) a subset of Achilles analyses converted to ratios (for example, the ratio of persons with at least one visit, by visit type); (2) Achilles derived measures (for example, the percentage of unmapped source data concepts by domain); and (3) an approximate size of the dataset (for example, <10k, 10-99k, 100k-1M, 1-5M, 5-9M and >10M; the exact size of the population is masked into a dataset size category). Sample input data (for a synthetic SynPUF OMOP dataset) is available on GitHub.
Each dataset was assigned a random meaningless identifier to facilitate the comparison. The reason for this dataset masking is that data quality comparisons can lead to withdrawal of a data partner from a research consortium (or an analysis project) if a particular partner's dataset is identified as having low quality data. Masking was done to avoid this outcome and to focus on advancing the methodologies for DQA. For the same reason, a list of individual datasets is not provided either. We plan to destroy individual site aggregated data used as input 6 months after the publication of the study results. To protect the sites, only masked and isolated combined comparisons are reported in this article. Non-aggregated, single dataset DQ data are never posted publicly.
Determination of goodness of fit or “data fitness” is highly dependent on the research question being asked. This phenomenon was described earlier and is sometimes referred to as the task-dependent nature of DQA.[2] A dataset that only contains inpatient events and data may not be appropriate for general research questions (e.g., a descriptive study of the course of a disease); however, it may be sufficient for a subset of research questions (e.g., inpatient-only research questions).
One can conclude that without knowing the specific research question context, any data quality assessment is impossible to pre-empt. This requirement for specifying research question context up-front makes development of general DQA tools almost impossible. However, existing DQA tools and efforts indicate that some general DQA rules indeed exist.
To partially overcome this problem (“data fitness for what?”), we assumed that the dataset being assessed represents the lifetime record of a general population and that the tool should perform DQA for a wide range of possible research questions (“general data fitness for a wide range of research questions”). Once a general DQA assessment is done, a researcher with a specific research analysis can simply ignore DQA messages that do not apply to his/her context (e.g., ignore messages about a lack of eye doctor visits and eye care data if data about vision care are not essential for his/her research question).
Results
A total of 12 datasets were compared in the study; however, due to the use of prior Achilles versions by some sites, comparisons of some newly implemented measures are made on data from datasets that implemented at least Achilles version 1.4 at the time of our study data extraction.
Version 1.6 of Achilles contains a total of 44 data quality rules (also called data quality checks). A total of 12 rules are model conformance rules that check adherence to the CDM specification. For example, a model conformance rule may require that provider specialty column contains only concepts that are indeed specialties. The remaining 22 rules are data quality rules that check for data completeness, data plausibility or other data quality problems. Such rules can be considered model-independent and should be portable to other data models, such as Sentinel model or PCORNet. The pooled dataset of all Achilles Heel messages from all datasets consisted of 546 messages. For a single dataset, we found a median of 51 of Heel messages, a median of 25 errors, a median of 22 warnings, and a median of 4 notifications. The poster will show evaluation of severity of each rule violation by computing median record counts for each rule. The second goal of our study was to add new functionality (either new DQ rules or new DQ measures to Achilles) based on availability of data about multiple CDM datasets. The results are divided into multiple sections according to the data domain of the new rule and will be included in the poster.
(1) Empirical rules: Comparing selected dataset parameters and computing 90th or 10th percentile and using them as benchmark thresholds.
(2) Data density rules: We considered data density on several levels: concepts per person as the number of distinct measurements per person (e.g., count of 2 measurements per person, such as cholesterol and hematocrit). This comparison aims at “data breadth”; records per person as the total number of all measurement records per person (e.g., count of 8 tests, such as 3 LDL cholesterol and 5 hematocrit measurements). This comparison aims at “data depth”; records per visit as a data density measure on a visit level. Because visits with no measurements can occur, the per visit count can be below 1. However, for a similar per visit count analyzing clinical notes (if in scope for the dataset), it may be reasonable to expect at least one note per visit.
(3) Minimum-data patients: For many research questions, at least one data point in a given clinical data domain (such as medications) is required for any meaningful analysis. For example, for analyzing event prevalence, using a proper denominator and determining the size of the relevant population can significantly affect the reported measure. We determined empiric thresholds for existing Achilles DQA measures that count the number of patients with at least one event in a clinical data domain (e.g., patients with at least one visit, patients with at least 1 diagnosis and 1 medication).
(4) Unmapped data: OMOP CDM allows storage of data that is not fully semantically mapped to standard concepts (for example, drug exposure data may include data rows that have a value of 0 (‘No matching concept’) in drug_concept_id while the yet-to-be-mapped local code is stored in drug_source_value). We introduced measures computing unmapped data and threshold rules for several domains, such as Conditions, Procedures or Drug Exposure.
Discussion and Conclusion
Our current method for picking an empiric threshold is using a fixed threshold (e.g., 10th percentile). Future methodology revision may alter this approach for each considered DQ measure. Another limitation is our primary focus on OMOP CDM sites. Our extensions to Achilles rule knowledge base, however, point to what data measures are required by each rule and whether a rule is terminology dependent. We compared data quality indicators across several datasets. We arrived at empirical values that could be used as thresholds for several DQA measures. The study resulted in several new data quality checks being added to Achilles.
Acknowledgement: We would like to thank Ritu Khare, Taha Abdul, Chris Knoll and Martijn Schuemie. VH work was supported by the Intramural Research Program of the National Institutes of Health (NIH)/ National Library of Medicine (NLM)/ Lister Hill National Center for Biomedical Communications (LHNCBC).
References
[1] H. Estiri and K. Stephens, DQe-v: A Database-Agnostic Framework for Exploring Variability in Electronic Health Record Data Across Time and Site Location, eGEMs (Generating Evidence & Methods to improve patient outcomes) 1 (2017).
[2] N.G. Weiskopf and C. Weng, Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research, J Am Med Inform Assoc 20 (2013), 144-151.

Value as concept id vs value as number Rule in Measurement

Some concept IDs in measurement require 'value as concept id' rather than 'value as number', such as 'presence' or 'dip stick' results (because a dip stick result is 1+, 2+, which is not numeric).

Should we add this rule to the DQ dashboard?

Update checks to exclude NULL child rows

Queries that check for "orphan rows" typically use an outer join. However, when the child column also has NULL rows, the count may return a non-zero value. For example:

with
child  as (select 1 id union select 2 union select NULL),
parent as (select 1 id union select 2 union select 3)
select count(*) as orphan_count1, count(p.id) as orphan_count2
from child c left join parent p on c.id = p.id
where p.id is null

orphan_count1 | orphan_count2
--------------+--------------
            1 |             0

orphan_count1 indicates a foreign key violation while orphan_count2 does not.

The fix is to either use the parent column in the count or add an additional predicate to the WHERE clause (and c.id is not null).

Events After Death

This is more of a question than an issue, but looking at Themis Issue 42, the proposal is to accept events within 60 days post death. Will this be incorporated into PLAUSIBLE_DURING_LIFE mentioned in this issue for the DQDashboard? I'm happy to make a PR but wanted to see if it was by design to flag all events after death.
By the way, the dashboard is awesome!!!

Design JSON files

We need to create a standard format for the JSON files:

  1. Reference files that establish the DQ checks
  2. Threshold file that sets the thresholds of success/warn/fail for a CDM
  3. Results file that highlights DQ check violations and contains value sets

Finalize DQ Checks - Concept Level

  1. PLAUSIBLE_VALUE_LOW: given measurements with the specified measurement_concept_id and unit_concept_id, get the number of records that are below the PLAUSIBLE_VALUE_LOW
  2. PLAUSIBLE_VALUE_HIGH: given measurements with the specified measurement_concept_id and unit_concept_id, get the number of records that are above the PLAUSIBLE_VALUE_HIGH
  3. PLAUSIBLE_GENDER: for rows with PLAUSIBLE_GENDER in ("Male", "Female"), quantify patients who have these measurements but whose gender_concept_id does not match.

DQ Check Numbering

We need to devise a numbering scheme for each DQ check so that we can identify each check and the result easily.

@clairblacketer has volunteered, anyone else want to help?

Not required foreign key check

For fields that are not required, if null then it should not be checked for foreign key compliance. A check shouldn't fail because it isn't required but is then evaluated later.

Finalize DQ Checks - Table Level

In this level, we are focusing on 2 patterns:

  1. Ensure that every person has at least 1 observation_period (a sketch follows this list)
  2. Tabulating number of records per person per month for tables we know to not change over time (IS_TEMPORALLY_CONSTANT != Yes).
    Example: for 2009-01, we see on average 10 conditions per person, then in 2009-02, we see average of 20 conditions per person. We want to highlight and understand that change: perhaps an EHR added a new practice, which is the reason for the increase.
  • Visualized as histogram
  • Perhaps leverage Kronos to identify temporal breaks
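
A sketch of pattern 1 as SQL run from R (the schema name is an example, and connection is assumed to come from DatabaseConnector::connect()): count persons with no observation_period record, which should be zero.

sql <- "
SELECT COUNT(*) AS persons_without_observation_period
FROM cdm.person p
LEFT JOIN cdm.observation_period op ON p.person_id = op.person_id
WHERE op.person_id IS NULL;"

DatabaseConnector::querySql(connection, sql)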

Add easy to copy violating records SQL

In the dashboard, we do provide the full query the DQD executed, which includes a subquery for seeing the violating records.

However, we should make it easier to grab the violating records query.
