outcomesinsights / generalized_data_model
Outcomes Insights' Data Model for Clinical Research
License: MIT License
I think we need to give users the option of doing the ETL and including a "source value" and "source vocab" in the clinical codes table. It should not be there by default, but it will make the table easier for people to use if they want it. I realize it is one join away by using the vocab table, but I believe it will help make the model understandable at first. And our first order of business is to convince people that this is a better way forward.
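A minimal sketch of the "one join away" option, using an in-memory SQLite database. The table layout, the column names (clinical_code_concept_id, vocabulary_id, concept_code), the view name, and the sample row are all assumptions made for illustration, not the model's actual schema. It shows how an optional view could expose source_value and source_vocab without adding columns to the base table.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative assumptions only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concepts (
    id INTEGER PRIMARY KEY,
    vocabulary_id TEXT,
    concept_code TEXT
);
CREATE TABLE clinical_codes (
    id INTEGER PRIMARY KEY,
    clinical_code_concept_id INTEGER REFERENCES concepts(id)
);
INSERT INTO concepts VALUES (1, 'ICD9CM', '250.00');
INSERT INTO clinical_codes VALUES (10, 1);

-- Optional convenience view: the base table stays unchanged.
CREATE VIEW clinical_codes_with_source AS
SELECT cc.*,
       c.concept_code  AS source_value,
       c.vocabulary_id AS source_vocab
FROM clinical_codes AS cc
LEFT JOIN concepts AS c ON c.id = cc.clinical_code_concept_id;
""")

rows = conn.execute(
    "SELECT id, source_value, source_vocab FROM clinical_codes_with_source"
).fetchall()
print(rows)
```

A view keeps the default table narrow while still giving new users the familiar source columns when they ask for them.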
Need to remove one of the provider_id columns from the exposures table.
We may want to use this in the future: http://stackoverflow.com/questions/929684/is-there-common-street-addresses-database-design-for-all-addresses-of-the-world
costs.total_paid is currently just the sum of all the paid fields. Suggestion: change the field to total_paid_drugs and store only the total paid for drugs.
Change the order of the tables in the Readme doc so that the flow makes more sense. Possibly providers first, then claims, then claims_providers, then lines, etc.
This will have admit, discharge, length of stay, admit source, and discharge source.
Need to consider whether this applies to outpatient facility claims, for example observation stays.
This would require us to put a collections_id in the admission_details table and would be more consistent with how measurements and drug exposure are handled.
Currently the column type_concept_id exists in the contexts, clinical_codes, and admission_details tables. The column name does not describe what goes in it. For the most part this column describes the provenance of the record, so we should consider renaming it to prov_type_concept_id or something similar.
We don't have a way of assigning a type to a collection record. This is useful for visit-based data like CPRD or other EHR data. It might contain "visit" or "claim" or consultation type (COT with 61 possible types) from CPRD.
file_type currently lives in the clinical_codes table but should be moved to the claims table.
Route (oral, intravenous, rectal, etc.) is commonly used and could be added to the drug table. We should check OMOP v5.1 to see what they are doing.
Joining through the contexts table is an elegant way to have a single table. But I think it might be cleaner to have cost tables at the two levels where costs are generally represented. It also makes it clearer that costs can be duplicated (for example, at the claim and line levels).
The costs details table would be like the drug and measurement tables and would provide cost details for individual clinical codes records (generally procedures or prescriptions).
The collections costs table would be for summarized (claims) costs and visit costs.
We should rename lines to something like "related_details" and claims to something like "grouped_records". It is possible, but not required, to group related details.
Clinical details and clinical conditions can be related to each other using related details. So, systolic and diastolic blood pressure could be connected together. Diagnoses and procedures can be connected together. And sets of clinical codes (diagnoses, procedures, etc.) and clinical details (labs and other observations) can all be linked together.
Potentially rename claims and lines tables to something more generic to other types of data. Possibly encounters (lines) and grouped_encounters (claims).
Days supplied is most commonly used in research, but the raw data might not include this information explicitly. CPRD includes ndd which is the numeric daily dose, which is the number of items to be used each day. Hence, if quantity is 60 and ndd is 2 then the days supplied is 30. CPRD does not report days supplied.
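The derivation above can be sketched as a small helper. The function name and the choice to return None when ndd is missing or not positive are assumptions for illustration; the source only states that days supplied equals quantity divided by ndd.

```python
from typing import Optional


def days_supplied(quantity: Optional[float], ndd: Optional[float]) -> Optional[float]:
    """Derive days supplied as quantity divided by numeric daily dose (ndd)."""
    # Assumption: when either value is missing, or ndd is zero or negative,
    # days supplied cannot be derived and is left empty.
    if quantity is None or ndd is None or ndd <= 0:
        return None
    return quantity / ndd


# Example from the text: quantity 60 with ndd 2 gives 30 days.
print(days_supplied(60, 2))
```

How to treat a missing or zero ndd is an ETL decision; returning None here simply leaves the field empty rather than guessing.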
This would be "qualifier" from CPRD
Currently the details table only has a claim_id. Need to add line_id to details.
This allows us to clarify that the items are linked together and connected as a single "record". Generally this is because they are all related in a domain-specific way. For example, they are all cancer diagnosis variables, claim diagnosis variables, laboratory values collected at the same time, etc.
Need to make the comment shorter to work with loading data into Impala.
Need to fix apostrophe characters in Readme file.
This clarifies that the concept id is geared toward provenance (e.g., admitting, discharge, primary, problem list, symptom list, etc.)
These are generally going to require a mapping to RxNorm to identify. Part D data only uses NDC codes and there are no text fields. CPRD has drug substance (ingredient, like Furosemide) and product name (which is the fully qualified name, like Furosemide 40mg tablets). These are also RxNorm ideas, so it may be better to use these in place of brand and generic. We may need to look at some other drug data to finalize this.
We are assuming 1 provider per line so it makes sense for the provider_id to live in the lines table.
According to https://www.cms.gov/Regulations-and-Guidance/Guidance/Transmittals/downloads/R136CP.pdf: "As new "WW" codes are established for oral anticancer drugs they will be communicated in a Recurring Update Notification."
This is an issue for the cancer drug code webpage at http://ndc_map.cohortjigsaw.com/ because these appear not to be included. However they seem to be created based on NDC codes and may only be in DME.
In SEER-Medicare the DME file has NDC codes in it. If we want to add those NDC codes here, we need to have the ability to link them to claims and lines.
Drugs are represented as different clinical codes (HCPCS, NDC, etc.) and could possibly live inside the clinical codes table. Then we would use a modifier table to represent other data about the code, such as quantity, days supplied, etc.
Should we add census tract information to the addresses table, or create an auxiliary table that holds other information about the addresses (census tract, health service area, urban/rural, etc.)?
We currently have specialty_concept_id, which may be where we add a facility type. Or is that for something else, so that we need another column? SEER-Medicare has a variable fac_type with the following values:
1 = Hospital
2 = Skilled nursing facility (SNF)
3 = Home health agency (HHA)
4 = Religious Nonmedical (Hospital) (eff. 8/1/00); prior to 8/00 referenced Christian Science (CS)
5 = Religious Nonmedical (Extended Care) (eff. 8/1/00); prior to 8/00 referenced CS (discontinued effective 10/1/05)
6 = Intermediate care
7 = Clinic or hospital-based renal dialysis facility
8 = Special facility or ASC surgery
9 = Reserved
Rather than wide tables, it may be easier to use tall tables with concept ids in place of the specific cost types. This gives flexibility to users to define their own concept ids for costs without having to change the data model.
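A rough sketch of the wide-to-tall idea, assuming a wide cost record keyed by context_id. Apart from total_paid, the column names (paid_copay, paid_deductible), the output field names, and every concept id below are hypothetical placeholders, not values from any vocabulary.

```python
# Hypothetical mapping from wide cost columns to cost-type concept ids.
# All ids are placeholders; users could register their own.
COST_TYPE_CONCEPT_IDS = {
    "total_paid": 9000001,
    "paid_copay": 9000002,
    "paid_deductible": 9000003,
}


def costs_to_tall(wide_row, key="context_id"):
    """Turn one wide cost record into one tall row per populated cost type."""
    tall_rows = []
    for column, concept_id in COST_TYPE_CONCEPT_IDS.items():
        amount = wide_row.get(column)
        if amount is None:
            continue  # no row is emitted for cost types absent from the record
        tall_rows.append(
            {key: wide_row[key], "cost_type_concept_id": concept_id, "value": amount}
        )
    return tall_rows


wide = {"context_id": 42, "total_paid": 125.0, "paid_copay": 25.0, "paid_deductible": None}
tall = costs_to_tall(wide)
for row in tall:
    print(row)
```

Adding a new cost type then means adding a concept id to the mapping rather than a column to the table.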
Should we use concept ids for results? Should they be used only for string results (i.e., string is the source value that goes with concept id)?
This will allow locations to be used in the study period section of the jigsaw user interface. Consider supporting this in the information_periods table
We decided to move provider_id from clinical_codes to lines so we should do the same for details.
admission source concept id
discharge location concept id
None of the other tables include a source value. Do we need the source values for this table?
When processing raw data, ignore lines that are duplicates or errors. For example, in SEER-Medicare the variable proindcd uses "M" to mark duplicate line items, which we may want to drop.
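One way this filter might look during ETL, assuming each raw line is a dict keyed by the source variable names. The function name, the line_id field, and the sample rows are made up for illustration; only proindcd and its "M" value come from the note above.

```python
def drop_flagged_lines(lines, flag_column="proindcd", drop_values=("M",)):
    """Return only the raw lines whose flag value is not in drop_values."""
    # Lines that lack the flag column entirely are kept (assumption).
    return [line for line in lines if line.get(flag_column) not in drop_values]


raw_lines = [
    {"line_id": 1, "proindcd": "A"},
    {"line_id": 2, "proindcd": "M"},  # duplicate line item, dropped
    {"line_id": 3},                   # no flag present, kept
]
kept = drop_flagged_lines(raw_lines)
print([line["line_id"] for line in kept])
```

Other error indicators could be added to drop_values once we decide which ones to exclude.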
These should be indicated with an appropriate row in the clinical codes table and perhaps a details table to store the data.
For text records, we don't expect to get raw text. We will probably get specific terms mined from text data (e.g., "diabetes"). So, this will either need to be mapped to an existing vocab or put in as raw text.
For QOL data, we might get scale scores or answers to individual questions. Each instrument could be considered its own vocabulary with names for questions and scores. And the values could be in the clinical codes table.
I copied the embedded HTML but it doesn't show. Is it possible to do this?
I'd like to add another column to the readme that indicates if a column should never be null. If someone could determine which columns are always required, I'd appreciate it.
CPRD has a consultation (visit) duration. This could be handled using a datetime variable so that the start and end times represent the duration. It could also be handled with explicit time variables or a duration variable.
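A small sketch of the datetime option, assuming the source duration is given in minutes (the unit is an assumption, as are the function names). It shows that start and end datetimes carry the same information as an explicit duration variable.

```python
from datetime import datetime, timedelta


def consultation_end(start: datetime, duration_minutes: float) -> datetime:
    """Compute the end datetime from a start datetime and a duration in minutes."""
    return start + timedelta(minutes=duration_minutes)


def consultation_minutes(start: datetime, end: datetime) -> float:
    """Recover the duration in minutes from the start and end datetimes."""
    return (end - start).total_seconds() / 60


start = datetime(2015, 3, 2, 9, 30)
end = consultation_end(start, 12)
print(end.isoformat(), consultation_minutes(start, end))
```

Storing start and end datetimes avoids an extra column, at the cost of needing a real or imputed start time for every visit.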
The use case is that costing studies in the UK can use this variable to estimate the cost of a visit.
Does admission_details only contain inpatient admissions and emergency department encounters?
What about outpatient?
This would link to a specific procedure in the clinical codes table. There may be many records per procedure (usually up to 4 or 8).
Drop the position column from the lines table. Since there could be multiple codes per line, it makes sense to keep position in the clinical_codes table only.
It could just be "seq" or "sequence number".
There are instances where patients are grouped, most likely by a family id as is the case with CPRD. There are a couple of ways to handle this:
This works for CPRD and can be obtained from RxNorm. Both variables are entered as text fields.
We can support a master (billing) address at the collections level if we have multiple different sites at the contexts level. This generally applies to a DME type of file where a site (e.g., CVS pharmacy) can have multiple locations but has one billing address.
Currently there is some ambiguity about what gets stored where in the costs table. Need to update it to make it clearer.
Jen suggests adding npi as a variable to the tables and changing identifier and identifier_type to other_identifier and other_identifier_type, so we can hold both the npi and one other identifier in the same record. Another suggestion is to add primary_identifier, primary_identifier_type, other_identifier, and other_identifier_type to keep it less US-centric.