Comments (7)
Also just pointing out that this function by default returns its results in three different ways (two side effects and one return value):
- Results are returned from the function as an R dataframe
- Results are also written to a file
- Results are also loaded from R into a database table
I might suggest making the return value invisible as recommended here.
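A minimal sketch of that pattern (the function and column names here are illustrative, not the actual DQD API): wrapping the return value in `invisible()` suppresses auto-printing at the console while keeping the value available for assignment.

```r
executeChecksSketch <- function() {
  result <- data.frame(checkName = "cdmTable", passed = TRUE)
  # ... side effects (writing the results file) would happen here ...
  # invisible() keeps the data frame as the return value but stops
  # it from printing when the call is not assigned
  invisible(result)
}

executeChecksSketch()          # prints nothing at the console
res <- executeChecksSketch()   # but the value can still be captured
nrow(res)                      # 1
```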
Concur with making return value invisible. I'll file a separate issue for that.
I also agree that we shouldn't be writing to the database by default; the default value for writeToTable should be FALSE, and/or we should force users to specify the database table name (the same way we require a value for outputFolder). However, I think we will need to wait until the next major release to make this change, as it could be breaking for users who expect the current default behavior. In the meantime I can update the docs to make it clear that the default mode writes results to the database.
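A hypothetical sketch of what those safer defaults could look like (the parameter names are assumptions for illustration, not the released API): database writes become opt-in, and a table name is required whenever they are enabled.

```r
runDqChecksSketch <- function(outputFolder,
                              writeToTable = FALSE,   # opt-in, not default
                              writeTableName = NULL) {
  # force users to name the target table when they ask for a DB write,
  # mirroring how outputFolder is already a required argument
  if (writeToTable && is.null(writeTableName)) {
    stop("writeTableName must be specified when writeToTable = TRUE")
  }
  list(outputFolder = outputFolder, writeToTable = writeToTable)
}

runDqChecksSketch("~/dqd_output")   # file output only; no database write
```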
from dataqualitydashboard.
This problem occurs because the cdm abbreviation is taken from the cdm_source table. When that table contains two abbreviations, resultFilename will be a length-2 character vector (here), which causes the write to fail here.
My suggestion would be to ensure that the cdm abbreviation is a length-1 character vector and that the output file name is a length-1 character vector.
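A sketch of that guard (the helper name is hypothetical, not DQD's internal function): fail fast with a clear message if the abbreviation pulled from cdm_source is anything other than a single string, so the file name is always a length-1 character vector.

```r
buildResultFilename <- function(cdmSourceAbbreviation, outputFolder) {
  # enforce a single abbreviation before building the file name;
  # a length-2 vector here would otherwise produce two file paths
  if (!is.character(cdmSourceAbbreviation) || length(cdmSourceAbbreviation) != 1) {
    stop("cdmSourceAbbreviation must be a length-1 character vector; got length ",
         length(cdmSourceAbbreviation))
  }
  file.path(outputFolder, paste0(cdmSourceAbbreviation, "-results.json"))
}

buildResultFilename("synthea", "~/Desktop/example_dqd_output")
```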
Also note that this function creates a table in the main schema literally named "main.dqdashboard_results", which is probably not what was intended.
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
# demonstrate that this works as is
result <- DataQualityDashboard::executeDqChecks(
connectionDetails = connectionDetails,
cdmDatabaseSchema = "main",
resultsDatabaseSchema = "main",
cdmSourceName = "GiBleed_Eunomia",
checkLevels = "TABLE",
checkNames = c("cdmTable", "cdmField"),
outputFolder = "~/Desktop/example_dqd_output"
)
#> Warning in DataQualityDashboard::executeDqChecks(connectionDetails =
#> connectionDetails, : Missing check names to calculate the 'Not Applicable'
#> status: measureValueCompleteness
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#>
#> Rows: 22 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): checkLevel, checkName, checkDescription, kahnContext, kahnCategory,...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> CDM Tables skipped: CONCEPT, VOCABULARY, CONCEPT_ANCESTOR, CONCEPT_RELATIONSHIP, CONCEPT_CLASS, CONCEPT_SYNONYM, RELATIONSHIP, DOMAIN
#> Connecting using SQLite driver
#> Processing check description: cdmTable
#> Writing results to file: ~/Desktop/example_dqd_output/synthea-20230728161501.json
#> Execution Complete
#> Connecting using SQLite driver
#> Writing results to table main.dqdashboard_results
#> | | | 0% | |=================================== | 50% | |======================================================================| 100%
#> Executing SQL took 0.00429 secs
#> Warning: Column 'warning' is of type 'logical', but this is not supported by
#> many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'error' is of type 'logical', but this is not supported by many
#> DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'threshold_value' is of type 'logical', but this is not
#> supported by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'notes_value' is of type 'logical', but this is not supported
#> by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Inserting data took 0.0174 secs
#> Finished writing table
conn <- DatabaseConnector::connect(connectionDetails)
#> Connecting using SQLite driver
library(DatabaseConnector)
getTableNames(conn)
#> [1] "care_site" "cdm_source"
#> [3] "cohort" "cohort_attribute"
#> [5] "concept" "concept_ancestor"
#> [7] "concept_class" "concept_relationship"
#> [9] "concept_synonym" "condition_era"
#> [11] "condition_occurrence" "cost"
#> [13] "death" "device_exposure"
#> [15] "domain" "dose_era"
#> [17] "drug_era" "drug_exposure"
#> [19] "drug_strength" "fact_relationship"
#> [21] "location" "measurement"
#> [23] "metadata" "note"
#> [25] "note_nlp" "observation"
#> [27] "observation_period" "payer_plan_period"
#> [29] "person" "procedure_occurrence"
#> [31] "provider" "relationship"
#> [33] "source_to_concept_map" "specimen"
#> [35] "visit_detail" "visit_occurrence"
#> [37] "vocabulary" "dqdashboard_results"
#> [39] "main.dqdashboard_results"
tibble::tibble(querySql(conn, "select * from main.'main.dqdashboard_results'"))
#> # A tibble: 23 × 26
#> NUM_VIOLATED_ROWS PCT_VIOLATED_ROWS NUM_DENOMINATOR_ROWS EXECUTION_TIME
#> <dbl> <dbl> <dbl> <chr>
#> 1 0 0 1 0.004840 secs
#> 2 0 0 1 0.004684 secs
#> 3 0 0 1 0.005634 secs
#> 4 0 0 1 0.005763 secs
#> 5 0 0 1 0.005449 secs
#> 6 0 0 1 0.003824 secs
#> 7 0 0 1 0.002964 secs
#> 8 0 0 1 0.004513 secs
#> 9 0 0 1 0.003135 secs
#> 10 0 0 1 0.003447 secs
#> # ℹ 13 more rows
#> # ℹ 22 more variables: QUERY_TEXT <chr>, CHECK_NAME <chr>, CHECK_LEVEL <chr>,
#> # CHECK_DESCRIPTION <chr>, CDM_TABLE_NAME <chr>, CDM_FIELD_NAME <chr>,
#> # CONCEPT_ID <chr>, UNIT_CONCEPT_ID <chr>, SQL_FILE <chr>, CATEGORY <chr>,
#> # SUBCATEGORY <chr>, CONTEXT <chr>, WARNING <dbl>, ERROR <dbl>,
#> # CHECKID <chr>, FAILED <dbl>, PASSED <dbl>, IS_ERROR <dbl>,
#> # NOT_APPLICABLE <dbl>, NOT_APPLICABLE_REASON <chr>, THRESHOLD_VALUE <dbl>, …
Created on 2023-07-28 with reprex v2.0.2
Thanks for the detailed report, @ablack3 !
The CDM_SOURCE error is a known issue (#410) and is actually next in my queue to fix. In my response to the previous issue, I took a stance where DQD would enforce CDM_SOURCE only containing 1 row. However, the more I think about it, given there's not a hard and fast convention on this, I think I'd be open to allowing it, with a warning. We have 2 options in that case:
- LIMIT 1 in our query against CDM_SOURCE and warn users that the DQD metadata will only contain data from the first row in the table
- Concatenate the data within each CDM_SOURCE column, to collapse the data into a single row
I prefer the 1st option, because if someone has more than a few rows in CDM_SOURCE their metadata is going to look very messy. I also do think that, in its current form, the CDM_SOURCE table is really only designed to hold a single row, and anyone storing multiple rows in there is sort of "hacking" the CDM. So it'll be difficult to anticipate all of the ways those folks are populating the table, at least until there's an official convention for this.
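The 1st option could be sketched in plain R like this (assuming cdm_source has already been read into a data frame; the function name is illustrative): keep only the first row and warn when more exist.

```r
getCdmSourceRow <- function(cdmSource) {
  # equivalent of LIMIT 1, plus a warning so users know the DQD
  # metadata reflects only one of their CDM_SOURCE rows
  if (nrow(cdmSource) > 1) {
    warning("CDM_SOURCE contains ", nrow(cdmSource),
            " rows; DQD metadata will reflect only the first row.")
  }
  cdmSource[1, , drop = FALSE]
}
```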
There is also a 3rd option, which is allowing users to input the metadata parameters as an alternative to pulling from CDM_SOURCE. I believe this was previously discussed and decided against, though, so I'd probably want to run it by the DQ working group before pursuing that.
What do you think?
PS - FYI, there's a second issue beyond the output file naming: the metadata passed into the results file & displayed in the Shiny app UI (DataQualityDashboard/R/executeDqChecks.R, lines 308 to 315 in e18c308).
Thanks @katy-sadowski! I think either option would work for me. In general the issue is about being clear and strict with our interfaces and expected structure of data types. Is there a CDM convention that the cdm_source table should only have one row? I wasn't aware of it but could have easily missed it. I wouldn't throw a warning unless there is something explicitly wrong with the CDM. Maybe it should be a DQ conformance check. Let me know if you'd like any help.
Is there a CDM convention that the cdm_source table should only have one row?
Not as far as I'm aware - I think this is still an open question (that's why I've changed my mind here not to block DQD execution if a database does have multiple rows in this table). The CDM documentation doesn't explicitly state this one way or the other, but the language describing each column seems to imply it's designed to hold a single row, to document metadata at the whole CDM instance level.
I wouldn't throw a warning unless there is something explicitly wrong with the CDM.
The warning would just be to alert users that the metadata stored/displayed in DQD will only represent one of their CDM_SOURCE rows (assuming we go for the first solution). I also don't think we want to add a DQ check until there's an official CDM convention on this.
@clairblacketer do you have any input on this? Is there a convention for this (in which case CDM docs should probably be updated), or should we file a CDM GitHub issue to decide on one? Thanks 😄