
Comments (7)

katy-sadowski commented on June 2, 2024

Also just pointing out that this function by default returns its results in three different ways (two side effects and one return value):

  1. Results are returned from the function as an R dataframe
  2. Results are also written to a file
  3. Results are also loaded from R into a database table

I might suggest making the return value invisible as recommended here.

Concur with making return value invisible. I'll file a separate issue for that.

I also agree that we shouldn't be writing to the database by default; the default value for writeToTable should be FALSE, and/or we should force users to specify the database table name (the same way we require a value for outputFolder). However, I think we will need to wait until the next major release to make this change, as it could be breaking for some users who expect the current default behavior. In the meantime, I can update the docs to make it clear that the default mode writes to the database.
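A minimal sketch of what that interface could look like, assuming the names above (`runChecks` and `writeResultsToTable` are hypothetical stand-ins for DQD's existing internals, not real functions in the package):

```r
# Hypothetical stubs standing in for DQD internals (for illustration only)
runChecks <- function(...) data.frame(checkName = "cdmTable", failed = 0)
writeResultsToTable <- function(...) message("writing results to table")

# Sketch of the proposed defaults: no database write unless requested,
# and an invisible return so the data frame doesn't auto-print.
executeDqChecks <- function(..., writeToTable = FALSE) {
  checkResults <- runChecks(...)
  if (writeToTable) {
    # only touch the database when the user opts in
    writeResultsToTable(...)
  }
  # return the results without printing them at the console
  invisible(checkResults)
}

# Nothing prints, but the result is still assignable:
result <- executeDqChecks()
```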

from dataqualitydashboard.

ablack3 commented on June 2, 2024

This problem occurs because the cdm abbreviation is taken from the cdm_source table. When there are two abbreviations, resultFilename becomes a length-2 character vector (here), which causes the write to fail here.

My suggestion would be simply to ensure that the cdm abbreviation is a length-1 character vector and that the output file name is a length-1 character vector.
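To illustrate with made-up abbreviation values: when cdm_source yields two abbreviations, R's vector recycling silently produces two file names, and a length-1 assertion (the `assertLength1` helper below is hypothetical) would surface the problem at its source rather than at write time:

```r
# Illustrative values; a cdm_source table with two rows yields two abbreviations
cdmSourceAbbreviation <- c("SYN", "SYNTHEA")

# paste0() recycles over the length-2 vector, producing two candidate file names
resultFilename <- paste0(cdmSourceAbbreviation, "-20230728.json")
length(resultFilename)  # 2, which later makes the file write fail

# A simple guard (hypothetical helper) fails fast with a clear message instead
assertLength1 <- function(x, name) {
  if (length(x) != 1) {
    stop(name, " must have length 1, but has length ", length(x), call. = FALSE)
  }
  invisible(x)
}
```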


ablack3 commented on June 2, 2024

Also note that this function creates a table in the main schema called "main.dqdashboard_results" which is probably not what is intended.

connectionDetails <- Eunomia::getEunomiaConnectionDetails()

# demonstrate that this works as is
result <- DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  resultsDatabaseSchema = "main",
  cdmSourceName = "GiBleed_Eunomia",
  checkLevels = "TABLE",
  checkNames = c("cdmTable", "cdmField"),
  outputFolder = "~/Desktop/example_dqd_output"
)
#> Warning in DataQualityDashboard::executeDqChecks(connectionDetails =
#> connectionDetails, : Missing check names to calculate the 'Not Applicable'
#> status: measureValueCompleteness
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> 
#> Rows: 22 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): checkLevel, checkName, checkDescription, kahnContext, kahnCategory,...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> CDM Tables skipped: CONCEPT, VOCABULARY, CONCEPT_ANCESTOR, CONCEPT_RELATIONSHIP, CONCEPT_CLASS, CONCEPT_SYNONYM, RELATIONSHIP, DOMAIN
#> Connecting using SQLite driver
#> Processing check description: cdmTable
#> Writing results to file: ~/Desktop/example_dqd_output/synthea-20230728161501.json
#> Execution Complete
#> Connecting using SQLite driver
#> Writing results to table main.dqdashboard_results
#>   |======================================================================| 100%
#> Executing SQL took 0.00429 secs
#> Warning: Column 'warning' is of type 'logical', but this is not supported by
#> many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'error' is of type 'logical', but this is not supported by many
#> DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'threshold_value' is of type 'logical', but this is not
#> supported by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'notes_value' is of type 'logical', but this is not supported
#> by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Inserting data took 0.0174 secs
#> Finished writing table

conn <- DatabaseConnector::connect(connectionDetails)
#> Connecting using SQLite driver
library(DatabaseConnector)
getTableNames(conn)
#>  [1] "care_site"                "cdm_source"              
#>  [3] "cohort"                   "cohort_attribute"        
#>  [5] "concept"                  "concept_ancestor"        
#>  [7] "concept_class"            "concept_relationship"    
#>  [9] "concept_synonym"          "condition_era"           
#> [11] "condition_occurrence"     "cost"                    
#> [13] "death"                    "device_exposure"         
#> [15] "domain"                   "dose_era"                
#> [17] "drug_era"                 "drug_exposure"           
#> [19] "drug_strength"            "fact_relationship"       
#> [21] "location"                 "measurement"             
#> [23] "metadata"                 "note"                    
#> [25] "note_nlp"                 "observation"             
#> [27] "observation_period"       "payer_plan_period"       
#> [29] "person"                   "procedure_occurrence"    
#> [31] "provider"                 "relationship"            
#> [33] "source_to_concept_map"    "specimen"                
#> [35] "visit_detail"             "visit_occurrence"        
#> [37] "vocabulary"               "dqdashboard_results"     
#> [39] "main.dqdashboard_results"

tibble::tibble(querySql(conn, "select * from main.'main.dqdashboard_results'"))
#> # A tibble: 23 × 26
#>    NUM_VIOLATED_ROWS PCT_VIOLATED_ROWS NUM_DENOMINATOR_ROWS EXECUTION_TIME
#>                <dbl>             <dbl>                <dbl> <chr>         
#>  1                 0                 0                    1 0.004840 secs 
#>  2                 0                 0                    1 0.004684 secs 
#>  3                 0                 0                    1 0.005634 secs 
#>  4                 0                 0                    1 0.005763 secs 
#>  5                 0                 0                    1 0.005449 secs 
#>  6                 0                 0                    1 0.003824 secs 
#>  7                 0                 0                    1 0.002964 secs 
#>  8                 0                 0                    1 0.004513 secs 
#>  9                 0                 0                    1 0.003135 secs 
#> 10                 0                 0                    1 0.003447 secs 
#> # ℹ 13 more rows
#> # ℹ 22 more variables: QUERY_TEXT <chr>, CHECK_NAME <chr>, CHECK_LEVEL <chr>,
#> #   CHECK_DESCRIPTION <chr>, CDM_TABLE_NAME <chr>, CDM_FIELD_NAME <chr>,
#> #   CONCEPT_ID <chr>, UNIT_CONCEPT_ID <chr>, SQL_FILE <chr>, CATEGORY <chr>,
#> #   SUBCATEGORY <chr>, CONTEXT <chr>, WARNING <dbl>, ERROR <dbl>,
#> #   CHECKID <chr>, FAILED <dbl>, PASSED <dbl>, IS_ERROR <dbl>,
#> #   NOT_APPLICABLE <dbl>, NOT_APPLICABLE_REASON <chr>, THRESHOLD_VALUE <dbl>, …

Created on 2023-07-28 with reprex v2.0.2


ablack3 commented on June 2, 2024

Also just pointing out that this function by default returns its results in three different ways (two side effects and one return value):

  1. Results are returned from the function as an R dataframe
  2. Results are also written to a file
  3. Results are also loaded from R into a database table

I might suggest making the return value invisible as recommended here.


katy-sadowski commented on June 2, 2024

Thanks for the detailed report, @ablack3!

The CDM_SOURCE error is a known issue (#410) and is next in my queue to fix. In my response to the previous issue, I took the stance that DQD should enforce CDM_SOURCE containing only one row. However, the more I think about it, given there's no hard-and-fast convention on this, I'd be open to allowing it, with a warning. That leaves us with two options:

  • LIMIT 1 in our query against CDM_SOURCE and warn users that the DQD metadata will only contain data from the first row in the table
  • Concatenate the data within each CDM_SOURCE column, to collapse the data into a single row

I prefer the 1st option, because if someone has more than a few rows in CDM_SOURCE, their metadata is going to look very messy. I also think that, in its current form, CDM_SOURCE is really designed to hold only a single row, and anyone storing multiple rows there is sort of "hacking" the CDM. So it'll be difficult to anticipate all of the ways those folks are populating the table, at least until there's an official convention for this.
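A sketch of the 1st option, assuming the cdm_source query result arrives as a data frame (the two-row data frame and the warning text below are illustrative, not DQD's actual code):

```r
# Stand-in for the result of DQD's query against the cdm_source table;
# a real instance with multiple rows would look like this two-row data frame.
cdmSource <- data.frame(
  cdm_source_name = c("GiBleed_Eunomia", "Synthea"),
  cdm_source_abbreviation = c("SYN", "SYNTHEA")
)

if (nrow(cdmSource) > 1) {
  warning("cdm_source contains ", nrow(cdmSource),
          " rows; DQD metadata will reflect only the first row.")
  cdmSource <- cdmSource[1, , drop = FALSE]  # the R-side equivalent of LIMIT 1
}
```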

There is also a 3rd option, which is allowing users to input the metadata parameters as an alternative to pulling from CDM_SOURCE. I believe this was previously discussed and decided against, though, so I'd probably want to run it by the DQ working group before pursuing that.

What do you think?

PS - FYI there's a second issue beyond the output file naming, which is the metadata passed into the results file & displayed in the Shiny app UI:

allResults <- list(
  startTimestamp = startTime,
  endTimestamp = endTime,
  executionTime = sprintf("%.0f %s", delta, attr(delta, "units")),
  CheckResults = checkResults,
  Metadata = metadata,
  Overview = overview
)


ablack3 commented on June 2, 2024

Thanks @katy-sadowski! I think either option would work for me. In general, the issue is about being clear and strict with our interfaces and the expected structure of our data types. Is there a CDM convention that the cdm_source table should only have one row? I wasn't aware of one, but could have easily missed it. I wouldn't throw a warning unless there is something explicitly wrong with the CDM. Maybe it should be a DQ conformance check. Let me know if you'd like any help.


katy-sadowski commented on June 2, 2024

Is there a CDM convention that the cdm_source table should only have one row?

Not as far as I'm aware - I think this is still an open question (which is why I've changed my mind about blocking DQD execution if a database does have multiple rows in this table). The CDM documentation doesn't explicitly state this one way or the other, but the language describing each column seems to imply the table is designed to hold a single row, documenting metadata at the level of the whole CDM instance.

I wouldn't throw a warning unless there is something explicitly wrong with the CDM.

The warning would just be to alert users that the metadata stored/displayed in DQD will only represent one of their CDM_SOURCE rows (assuming we go for the first solution). I also don't think we want to add a DQ check until there's an official CDM convention on this.

@clairblacketer do you have any input on this? Is there a convention for this (in which case CDM docs should probably be updated), or should we file a CDM GitHub issue to decide on one? Thanks 😄

