
Comments (7)

katy-sadowski commented on June 2, 2024

Also just pointing out that this function by default returns its results in three different ways (two side effects and one return value):

  1. Results are returned from the function as an R dataframe
  2. Results are also written to a file
  3. Results are also loaded from R into a database table

I might suggest making the return value invisible as recommended here.

Concur with making return value invisible. I'll file a separate issue for that.

I also agree that we shouldn't be writing to the database by default; the default value for writeToTable should be FALSE, and/or we should force users to specify the database table name (the same way we require a value for outputFolder). However, I think we will need to wait until the next major release to make this change, as it could be breaking for some users who expect the current default behavior. In the meantime, I can update the docs to make it clear that the default mode writes to the database.
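A minimal sketch of what that interface could look like, assuming the names above (`runChecks` and `writeResultsToTable` are hypothetical stand-ins for DQD's existing internals, not real functions in the package):

```r
# Hypothetical stubs standing in for DQD internals (for illustration only)
runChecks <- function(...) data.frame(checkName = "cdmTable", failed = 0)
writeResultsToTable <- function(...) message("writing results to table")

# Sketch of the proposed defaults: no database write unless requested,
# and an invisible return so the data frame doesn't auto-print.
executeDqChecks <- function(..., writeToTable = FALSE) {
  checkResults <- runChecks(...)
  if (writeToTable) {
    # only touch the database when the user opts in
    writeResultsToTable(...)
  }
  # return the results without printing them at the console
  invisible(checkResults)
}

# Nothing prints, but the result is still assignable:
result <- executeDqChecks()
```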

from dataqualitydashboard.

ablack3 commented on June 2, 2024

This problem occurs because the cdm abbreviation is taken from the cdm_source table. When there are two abbreviations, resultFilename becomes a length-2 character vector (here), which causes the write to fail here.

My suggestion would be simply to ensure that the cdm abbreviation is a length-1 character vector and that the output file name is a length-1 character vector.
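To illustrate with made-up abbreviation values: when cdm_source yields two abbreviations, R's vector recycling silently produces two file names, and a length-1 assertion (the `assertLength1` helper below is hypothetical) would surface the problem at its source rather than at write time:

```r
# Illustrative values; a cdm_source table with two rows yields two abbreviations
cdmSourceAbbreviation <- c("SYN", "SYNTHEA")

# paste0() recycles over the length-2 vector, producing two candidate file names
resultFilename <- paste0(cdmSourceAbbreviation, "-20230728.json")
length(resultFilename)  # 2, which later makes the file write fail

# A simple guard (hypothetical helper) fails fast with a clear message instead
assertLength1 <- function(x, name) {
  if (length(x) != 1) {
    stop(name, " must have length 1, but has length ", length(x), call. = FALSE)
  }
  invisible(x)
}
```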


ablack3 commented on June 2, 2024

Also note that this function creates a table in the main schema called "main.dqdashboard_results" which is probably not what is intended.

connectionDetails <- Eunomia::getEunomiaConnectionDetails()

# demonstrate that this works as is
result <- DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "main",
  resultsDatabaseSchema = "main",
  cdmSourceName = "GiBleed_Eunomia",
  checkLevels = "TABLE",
  checkNames = c("cdmTable", "cdmField"),
  outputFolder = "~/Desktop/example_dqd_output"
)
#> Warning in DataQualityDashboard::executeDqChecks(connectionDetails =
#> connectionDetails, : Missing check names to calculate the 'Not Applicable'
#> status: measureValueCompleteness
#> Connecting using SQLite driver
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger. (This message will not be shown again this R session)
#> 
#> Rows: 22 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (8): checkLevel, checkName, checkDescription, kahnContext, kahnCategory,...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> CDM Tables skipped: CONCEPT, VOCABULARY, CONCEPT_ANCESTOR, CONCEPT_RELATIONSHIP, CONCEPT_CLASS, CONCEPT_SYNONYM, RELATIONSHIP, DOMAIN
#> Connecting using SQLite driver
#> Processing check description: cdmTable
#> Writing results to file: ~/Desktop/example_dqd_output/synthea-20230728161501.json
#> Execution Complete
#> Connecting using SQLite driver
#> Writing results to table main.dqdashboard_results
#>   |======================================================================| 100%
#> Executing SQL took 0.00429 secs
#> Warning: Column 'warning' is of type 'logical', but this is not supported by
#> many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'error' is of type 'logical', but this is not supported by many
#> DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'threshold_value' is of type 'logical', but this is not
#> supported by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Warning: Column 'notes_value' is of type 'logical', but this is not supported
#> by many DBMSs. Converting to numeric (1 = TRUE, 0 = FALSE)
#> Inserting data took 0.0174 secs
#> Finished writing table

conn <- DatabaseConnector::connect(connectionDetails)
#> Connecting using SQLite driver
library(DatabaseConnector)
getTableNames(conn)
#>  [1] "care_site"                "cdm_source"              
#>  [3] "cohort"                   "cohort_attribute"        
#>  [5] "concept"                  "concept_ancestor"        
#>  [7] "concept_class"            "concept_relationship"    
#>  [9] "concept_synonym"          "condition_era"           
#> [11] "condition_occurrence"     "cost"                    
#> [13] "death"                    "device_exposure"         
#> [15] "domain"                   "dose_era"                
#> [17] "drug_era"                 "drug_exposure"           
#> [19] "drug_strength"            "fact_relationship"       
#> [21] "location"                 "measurement"             
#> [23] "metadata"                 "note"                    
#> [25] "note_nlp"                 "observation"             
#> [27] "observation_period"       "payer_plan_period"       
#> [29] "person"                   "procedure_occurrence"    
#> [31] "provider"                 "relationship"            
#> [33] "source_to_concept_map"    "specimen"                
#> [35] "visit_detail"             "visit_occurrence"        
#> [37] "vocabulary"               "dqdashboard_results"     
#> [39] "main.dqdashboard_results"

tibble::tibble(querySql(conn, "select * from main.'main.dqdashboard_results'"))
#> # A tibble: 23 × 26
#>    NUM_VIOLATED_ROWS PCT_VIOLATED_ROWS NUM_DENOMINATOR_ROWS EXECUTION_TIME
#>                <dbl>             <dbl>                <dbl> <chr>         
#>  1                 0                 0                    1 0.004840 secs 
#>  2                 0                 0                    1 0.004684 secs 
#>  3                 0                 0                    1 0.005634 secs 
#>  4                 0                 0                    1 0.005763 secs 
#>  5                 0                 0                    1 0.005449 secs 
#>  6                 0                 0                    1 0.003824 secs 
#>  7                 0                 0                    1 0.002964 secs 
#>  8                 0                 0                    1 0.004513 secs 
#>  9                 0                 0                    1 0.003135 secs 
#> 10                 0                 0                    1 0.003447 secs 
#> # ℹ 13 more rows
#> # ℹ 22 more variables: QUERY_TEXT <chr>, CHECK_NAME <chr>, CHECK_LEVEL <chr>,
#> #   CHECK_DESCRIPTION <chr>, CDM_TABLE_NAME <chr>, CDM_FIELD_NAME <chr>,
#> #   CONCEPT_ID <chr>, UNIT_CONCEPT_ID <chr>, SQL_FILE <chr>, CATEGORY <chr>,
#> #   SUBCATEGORY <chr>, CONTEXT <chr>, WARNING <dbl>, ERROR <dbl>,
#> #   CHECKID <chr>, FAILED <dbl>, PASSED <dbl>, IS_ERROR <dbl>,
#> #   NOT_APPLICABLE <dbl>, NOT_APPLICABLE_REASON <chr>, THRESHOLD_VALUE <dbl>, …

Created on 2023-07-28 with reprex v2.0.2


ablack3 commented on June 2, 2024

Also just pointing out that this function by default returns its results in three different ways (two side effects and one return value):

  1. Results are returned from the function as an R dataframe
  2. Results are also written to a file
  3. Results are also loaded from R into a database table

I might suggest making the return value invisible as recommended here.


katy-sadowski commented on June 2, 2024

Thanks for the detailed report, @ablack3!

The CDM_SOURCE error is a known issue (#410) and is next in my queue to fix. In my response to the previous issue, I took the stance that DQD should enforce CDM_SOURCE containing only one row. However, the more I think about it, given there's no hard-and-fast convention on this, I'd be open to allowing it, with a warning. That leaves us with two options:

  • LIMIT 1 in our query against CDM_SOURCE and warn users that the DQD metadata will only contain data from the first row in the table
  • Concatenate the data within each CDM_SOURCE column, to collapse the data into a single row

I prefer the 1st option, because if someone has more than a few rows in CDM_SOURCE, their metadata is going to look very messy. I also think that, in its current form, CDM_SOURCE is really designed to hold only a single row, and anyone storing multiple rows there is sort of "hacking" the CDM. So it'll be difficult to anticipate all of the ways those folks are populating the table, at least until there's an official convention for this.
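A sketch of the 1st option, assuming the cdm_source query result arrives as a data frame (the two-row data frame and the warning text below are illustrative, not DQD's actual code):

```r
# Stand-in for the result of DQD's query against the cdm_source table;
# a real instance with multiple rows would look like this two-row data frame.
cdmSource <- data.frame(
  cdm_source_name = c("GiBleed_Eunomia", "Synthea"),
  cdm_source_abbreviation = c("SYN", "SYNTHEA")
)

if (nrow(cdmSource) > 1) {
  warning("cdm_source contains ", nrow(cdmSource),
          " rows; DQD metadata will reflect only the first row.")
  cdmSource <- cdmSource[1, , drop = FALSE]  # the R-side equivalent of LIMIT 1
}
```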

There is also a 3rd option, which is allowing users to input the metadata parameters as an alternative to pulling from CDM_SOURCE. I believe this was previously discussed and decided against, though, so I'd probably want to run it by the DQ working group before pursuing that.

What do you think?

PS - FYI there's a second issue beyond the output file naming, which is the metadata passed into the results file & displayed in the Shiny app UI:

allResults <- list(
  startTimestamp = startTime,
  endTimestamp = endTime,
  executionTime = sprintf("%.0f %s", delta, attr(delta, "units")),
  CheckResults = checkResults,
  Metadata = metadata,
  Overview = overview
)


ablack3 commented on June 2, 2024

Thanks @katy-sadowski! I think either option would work for me. In general, the issue is about being clear and strict with our interfaces and the expected structure of our data types. Is there a CDM convention that the cdm_source table should only have one row? I wasn't aware of one, but could have easily missed it. I wouldn't throw a warning unless there is something explicitly wrong with the CDM. Maybe it should be a DQ conformance check. Let me know if you'd like any help.


katy-sadowski commented on June 2, 2024

Is there a CDM convention that the cdm_source table should only have one row?

Not as far as I'm aware - I think this is still an open question (which is why I've changed my mind about blocking DQD execution if a database does have multiple rows in this table). The CDM documentation doesn't explicitly state this one way or the other, but the language describing each column seems to imply the table is designed to hold a single row, documenting metadata at the level of the whole CDM instance.

I wouldn't throw a warning unless there is something explicitly wrong with the CDM.

The warning would just be to alert users that the metadata stored/displayed in DQD will only represent one of their CDM_SOURCE rows (assuming we go for the first solution). I also don't think we want to add a DQ check until there's an official CDM convention on this.

@clairblacketer do you have any input on this? Is there a convention for this (in which case CDM docs should probably be updated), or should we file a CDM GitHub issue to decide on one? Thanks 😄

