CDMv5 introduced the ideas of “domains”. Domains dictate which table(s) a concept should live in and these domain assignments can be surprising. For instance, HCPCS code G0248 has been assigned the domain of “Observation” meaning that this HCPCS code does not generate a procedure_occurrence, but instead is stored in the observation table. As another example, the ICD-9 code V53.2 (fitting of hearing aid) maps to the domain of “Procedure”, meaning the ICD-9 code generates a procedure_occurrence, not a condition_occurrence.
It would be nice to lessen the cognitive burden on our ETLers and provide them some automated guidance on how their source data might best be mapped to the CDM. For example, when working on the SynPUF ETL, we completely forgot that some ICD-9 codes generate procedure_occurrences. We had to go back and draw a bunch of arrows and come up with a lot of new logic long after we thought we had finished the spec and our minds had moved on to other matters.
If we matched the values for each column in White Rabbit’s scan report against the values in concept.concept_code, we could see if the column consistently matches codes for a particular vocabulary. This gives us very helpful information that we can use to automate some of the work done in Rabbit in a Hat.
For instance, in the SynPUF data, we could take White Rabbit’s values from the dx1 column in the inpatient file and try to match them against concept.concept_code. We’d find that those values all match ICD-9 concepts. From that we reasonably infer that inpatient.dx1 is an ICD-9 column. And since the CDMv5 vocabulary tells us that ICD-9 codes are not only associated with the Condition domain, but Observation, Measurement, and Procedure as well, we could then have RiaH automatically draw arrows between the inpatient table and condition_occurrence, but inpatient and measurement, observation, and procedure_occurrence as well.
We can do this for every vocabulary we find in each source table, drawing arrows between source tables and their potential target tables. Not only can we make the table-level mappings, but we can link the *_concept_id field and the *_source_value fields to the source column.
Combine this feature with #35 and RaiH begins to be a tool that guides an ETLer through the mapping process, rather than requiring an ETLer to intimately know the source data, CDM, and the domains assigned to each source vocabulary. Instead RiaH informs an ETLer about the relationships the ETLer must consider. Plus it will help avoid situations like the one we encountered where we completely ignored a set of relationships because we didn’t realize what domains ICD-9 mapped into.