The repository for code and documentation produced by the N3C Phenotype and Data Acquisition workstream

Python 68.38% R 26.23% Batchfile 5.39%

phenotype_data_acquisition's Issues

Discuss the distinction in OMOP between Measurement v Observation in small group

Is the separation solely based on whether a fact has a quantitative or categorical value vs qualitative? Can the same concept_ids exist in both tables?

Duplicate summary rows

The CREATE TABLE step creates duplicate rows in the ACT and possibly the PCORnet scripts. It seems to create a row for each fact that satisfies the inclusion criteria. We probably need to add a subquery to grab only one fact.
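
A minimal sketch of the deduplication idea, run against an in-memory SQLite database so it is self-contained. The table name `qualifying_facts` and its columns are hypothetical stand-ins, not the actual ACT or PCORnet schema, and the sketch assumes the intent is one summary row per patient.

```python
import sqlite3

def build_summary(conn):
    """Return one row per patient, however many qualifying facts exist."""
    return conn.execute(
        """
        SELECT patient_num, MIN(start_date) AS first_qualifying_date
        FROM qualifying_facts          -- hypothetical table of qualifying facts
        GROUP BY patient_num           -- collapses duplicates to one row each
        ORDER BY patient_num
        """
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qualifying_facts (patient_num INTEGER, concept_cd TEXT, start_date TEXT)"
)
conn.executemany(
    "INSERT INTO qualifying_facts VALUES (?, ?, ?)",
    [
        (1, "LOINC:94306-8", "2020-04-02"),  # patient 1 has two qualifying facts
        (1, "ICD10:U07.2", "2020-04-05"),
        (2, "ICD10:U07.2", "2020-04-10"),
    ],
)
rows = build_summary(conn)
```

Grouping by patient is one option; a `SELECT DISTINCT` on the patient identifier would also work if no per-fact columns are needed in the summary.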

Additional LOINCs from PCORnet's list

I compared the N3C phenotype to the PCORnet phenotype. The following LOINCs were included in the PCORnet list but not N3C. Consider for inclusion.

94306-8 | SARS coronavirus 2 RNA panel - Unspecified specimen by NAA with probe detection
94503-0 | SARS coronavirus 2 IgG and IgM panel - Serum or Plasma Qualitative by Rapid immunoassay
94504-8 | SARS coronavirus 2 IgG and IgM panel - Serum or Plasma by Immunoassay
94531-1 | SARS coronavirus 2 RNA panel - Respiratory specimen by NAA with probe detection

Translate all code to SQL Server

Translate extract and phenotype code to SQL Server syntax.

Two different weak dxs

Ensure phenotype code searches for two DIFFERENT weak dxs, not just two weak dxs.
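
The check can be sketched as counting distinct codes rather than rows. The code names `WEAK_A` and `WEAK_B` are placeholders, not real diagnosis codes, and the function name is made up for illustration.

```python
# Placeholder weak-dx code set; the real list lives in the phenotype documentation.
WEAK_CODES = {"WEAK_A", "WEAK_B", "WEAK_C"}

def has_two_distinct_weak_dx(dx_codes, weak_codes=WEAK_CODES):
    """True only if at least two DIFFERENT weak dx codes are present."""
    distinct = {code for code in dx_codes if code in weak_codes}
    return len(distinct) >= 2
```

With this rule, a patient coded twice with the same weak dx does not qualify, while a patient with two different weak dxs does.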

Same encounter/Same day

Change phenotype to require two weak dx in the same encounter and/or the same day (instead of just encounter)
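
A rough sketch of the "same encounter and/or same day" rule for a single patient's diagnosis rows. The row layout `(encounter_id, dx_date, code)` and the placeholder codes are assumptions for illustration only.

```python
from collections import defaultdict

# Placeholder weak-dx codes, not real diagnosis codes.
WEAK_CODES = {"WEAK_A", "WEAK_B"}

def two_weak_dx_same_encounter_or_day(dx_rows, weak_codes=WEAK_CODES):
    """dx_rows: iterable of (encounter_id, dx_date, code) for ONE patient."""
    by_encounter = defaultdict(set)
    by_day = defaultdict(set)
    for encounter_id, dx_date, code in dx_rows:
        if code in weak_codes:
            by_encounter[encounter_id].add(code)
            by_day[dx_date].add(code)
    # Qualify if any single encounter OR any single day holds two distinct weak dxs.
    groups = list(by_encounter.values()) + list(by_day.values())
    return any(len(codes) >= 2 for codes in groups)
```

Two weak dxs recorded in different encounters on the same day qualify under this rule, which the encounter-only version would miss.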

Post ACT_COVID ontology concept dimension table

When V3 of the ACT_COVID ontology is released, post a concept_dimension table subset (name_char, code, concept_path) for use with the ACT path-based code at sites that do not want to implement the entire ontology. This would also document the codes used.

Two ACT code changes

Need to remove variable declarations from the top of the script, as we discovered they may not work unless you are specifically using SQL Developer as your client-side tool.
Need to add comments showing where multi-fact models should swap in their own table names in the non-path-based queries.

The path-based query did not retrieve any patients at UNC, I suspect because we don't have the COVID ontology installed. The non-path-based query worked perfectly.

Zip file structure

Need to change the Python extract script to structure output like so:

DATA_COUNTS and MANIFEST at the root level of the directory
All other tables in a subdirectory called "datafiles"
naming convention of parent directory per documentation.
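
A small sketch of the layout described above, using only the standard library. The `.csv` extension and the function names are assumptions; the parent directory naming convention is left to the documentation and is not reproduced here.

```python
import zipfile
from pathlib import Path

# Files that stay at the root of the archive (extension assumed to be .csv).
ROOT_FILES = {"DATA_COUNTS.csv", "MANIFEST.csv"}

def arcname_for(filename):
    """Return the path a file should have inside the zip."""
    if filename in ROOT_FILES:
        return filename
    return f"datafiles/{filename}"

def build_zip(source_dir, zip_path):
    """Zip every CSV in source_dir using the root/datafiles layout."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(source_dir).glob("*.csv")):
            zf.write(path, arcname=arcname_for(path.name))
```

For example, `arcname_for("PERSON.csv")` gives `datafiles/PERSON.csv`, while `MANIFEST.csv` stays at the root.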

Split out lab categories

This does not impact the inclusion criteria, but rather the categories for downstream analysis. There is likely a need to split out our "lab confirmed negative" and "lab confirmed positive" categories into two--one for PCR tests, and one for antibody tests. Having a positive or negative means different things for those two types of tests.
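
One way the split might look, sketched with two LOINCs taken from the PCORnet comparison list in this tracker. The mapping, the category labels, and the function name are illustrative assumptions, not the agreed N3C categories.

```python
# Illustrative mapping only; a real one would cover the full phenotype LOINC list.
TEST_TYPE_BY_LOINC = {
    "94306-8": "pcr",       # SARS coronavirus 2 RNA panel, NAA with probe detection
    "94504-8": "antibody",  # SARS coronavirus 2 IgG and IgM panel, immunoassay
}

def lab_category(loinc, result):
    """Return a test-type-specific category, or None if it cannot be assigned."""
    test_type = TEST_TYPE_BY_LOINC.get(loinc)
    if test_type is None or result not in ("positive", "negative"):
        return None
    return f"lab confirmed {result} ({test_type})"
```

Keeping the test type in the category label lets downstream analyses treat a positive antibody result differently from a positive PCR result.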

ICD-10 code addition

The WHO also created an emergency ICD-10 code for COVID-19 diagnoses without lab confirmation: U07.2. https://www.who.int/classifications/icd/covid19/en/
Should it be added to the phenotype of suspected positive cases as a strong positive code?

Fix Date Format

In documentation, OMOPExporter and Python Exporter

SqlRender does not support CONVERT translation

An issue was reported against the SqlRender package about the CONVERT function: it is not in the list of supported translations.

As a result, the code cannot be executed on some dialects.
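
One dialect-neutral workaround, offered as an assumption rather than the project's chosen fix, is to select the raw date column and format it on the client side. SQL Server's CONVERT style 120 produces `yyyy-mm-dd hh:mi:ss` (24-hour), which the sketch below reproduces; the function name is hypothetical.

```python
from datetime import date, datetime

# Equivalent of SQL Server CONVERT(..., 120): yyyy-mm-dd hh:mi:ss (24h)
STYLE_120 = "%Y-%m-%d %H:%M:%S"

def format_style_120(value):
    """Format a date or datetime the way CONVERT style 120 would; keep nulls null."""
    if value is None:
        return None
    if isinstance(value, date) and not isinstance(value, datetime):
        # Plain dates get a midnight time component.
        value = datetime(value.year, value.month, value.day)
    return value.strftime(STYLE_120)
```

This moves the formatting out of SQL entirely, so no translation rule for CONVERT is needed.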

OMOP Exporter README

Please create a README (or Wiki page) describing how to use the OMOP Exporter.

B97.21

Suggestion to downgrade B97.21 ICD-10 code to a weak positive on or after 4/1/2020, when the "real" covid ICD-10 code was released for use. This would be the same rule that we applied to B97.29.
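
The date rule could be sketched as below. Treating these codes as "strong" before the cutover is an assumption inferred from the word "downgrade"; the function and constant names are made up for illustration.

```python
from datetime import date

CUTOVER = date(2020, 4, 1)                   # date given in the suggestion above
DATE_SENSITIVE_CODES = {"B97.21", "B97.29"}  # codes the rule would apply to

def dx_strength(code, dx_date):
    """Return 'strong' or 'weak' for date-sensitive codes, None for any other code."""
    if code not in DATE_SENSITIVE_CODES:
        return None
    # Assumed: strong before the cutover, weak on or after it.
    return "weak" if dx_date >= CUTOVER else "strong"
```

A B97.21 diagnosis dated 3/31/2020 stays strong under this rule, and the same code dated 4/1/2020 becomes weak.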

OMOP Vocab Update Needed

@cukarthik and I need to coordinate with the OMOP Vocabulary team to add the following new source_concepts:
94547-7 SARS-CoV-2 IgG + IgM, serum or plasma
94558-4 SARS-CoV-2 Ag, respiratory specimen
94559-2 SARS-CoV-2 ORF1ab region, respiratory specimen
94562-6 SARS-CoV-2 IgA, serum or plasma
94563-4 SARS-CoV-2 IgG, serum or plasma
94564-2 SARS-CoV-2 IgM, serum or plasma
94565-9 SARS-CoV-2 RNA, nasopharyngeal specimen
94639-2 SARS-CoV-2 ORF1ab region, unspecified specimen
94640-0 SARS-CoV-2 S gene, respiratory specimen
94641-8 SARS-CoV-2 S gene, unspecified specimen
94642-6 SARS-CoV-2 S gene, respiratory specimen
94643-4 SARS-CoV-2 S gene, unspecified specimen
94644-2 SARS-CoV-2 ORF1ab region, respiratory specimen
94645-9 SARS-CoV-2 RdRp gene, unspecified specimen
94646-7 SARS-CoV-2 RdRp gene, respiratory specimen
94647-5 SARS-related CoV, unspecified specimen
94660-8 SARS-CoV-2 RNA, serum/plasma
94661-6 SARS-CoV-2 antibody interpretation

Current phenotype will be updated as these are included in OMOP vocabularies.

COVID-19 disease progression & category

The current 4-tier proposal may need an additional layer to capture the multi-faceted nature of COVID-19. Lagged coding in the EHR may also affect how we collect data for downstream research, so we need to think a bit outside the box for this brand-new phenotype.

Based on observations so far, the disease displays a spectrum of severity. For example, the Australian guidelines for the clinical care of people with COVID-19 (v2) (https://app.magicapp.org/app#/guideline/4179) classify severity as mild, moderate, severe, or critical.

Many mild patients may not have symptoms. Those patients are unlikely to be recorded or coded in the EHR, but may be rediscovered by antibody tests and recorded later in the EHR or in different systems. Although we do not anticipate different ICD-10 codes being assigned across the severity spectrum at this early stage, I believe introducing this phenotype spectrum alongside the proposed tier concept will prepare us to handle both knowns and unknowns better down the line.

OMOP Vocab 30AP2020 - still missing some LOINC codes

Raised a ticket with the OMOP Vocab team to look into this.

OHDSI/Vocabulary-v5.0#303

Possible additional codes

We could incorporate the latest NLM SNOMED value sets (including COVID-19 and signs/symptoms; see attached).
On the ICD-10 side, U072/U073 are also possible candidates, along with additional suspected codes such as J00-J06, J09-J11, R092, and R918, and the new CMS HCPCS codes G2023 and G2024.
Not sure whether we need to consider other severe COVID-19 complications, e.g. ARDS, sepsis (unique pathway), and ventilator-related conditions.

2.16.840.1.113762.1.4.1114.7.xlsx
2.16.840.1.113762.1.4.1181.51.xlsx

GBQ errors with R package

From [email protected]:

BigQuery SqlRender issues for N3C R package execution:

1. generate_cohort.sql
   SqlRender turns the statement "create table #Codesets()" into "create table ttav56kacodesets()". The rendered statement is invalid because the dataset name is missing in front of the table name. The correct output would be "create table [dataset name].ttav56kacodesets()". The dataset name in this project could be @resultsDatabaseSchema.
2. source_extract_scripts.sql
   a. CONVERT() is not a BigQuery function and has not been mapped in the SqlRender replacement patterns. The extract script has CONVERT(VARCHAR(20),OBSERVATION_PERIOD_START_DATE, 120); after SqlRender, it is convert(STRING,observation_period_start_date, 120).
      This is the expression we adjusted for BigQuery: format_datetime('%Y-%m-%d %H:%M:%S',datetime(observation_period_start_date))
   b. The date format in the WHERE clause is not transformed to a valid BigQuery date format. The extract script has visit_start_date >= '1/1/20'; BigQuery accepts visit_start_date >= '2018-01-01'.
   c. CONVERT() combined with date add/subtract. For the last query, the extract script has CONVERT(VARCHAR(20), GETDATE() -2, 120) as UPDATE_DATE. The first problem is the CONVERT part; the second is the date add/subtract part. BigQuery accepts the following format: format_datetime('%Y-%m-%d %H:%M:%S',datetime( date_add(CURRENT_DATE(), interval -2 day))) as UPDATE_DATE
3. May or may not be an issue: the output file combines all columns into one Excel column. For example, PERSON should have 15 columns from source_extract_scripts.sql, but the output CSV file has only one column, which contains all 15 columns separated by "|".
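
On the last point, if the pipe delimiter is intentional, the single-column appearance in Excel is only a display effect; a reader that is told about the delimiter recovers the columns. A minimal sketch with made-up sample data (three columns rather than the full 15, and a hypothetical function name):

```python
import csv
import io

def read_pipe_delimited(text):
    """Parse pipe-delimited text into a list of rows (each a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="|"))

# Made-up sample, not real extract output.
sample = "PERSON_ID|GENDER_CONCEPT_ID|YEAR_OF_BIRTH\n1|8507|1970\n"
rows = read_pipe_delimited(sample)
```

Each row comes back with one list element per column, so the file content itself is intact.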

result coded values

For LOINC encoded lab test, there is also a need to standardize the results

Positive vs Detected
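
A possible shape for that standardization, sketched as a lookup table. Only "Positive" and "Detected" come from this issue; the negative synonyms and the target labels are assumptions added for illustration.

```python
# Synonym table; entries beyond "positive"/"detected" are illustrative assumptions.
RESULT_SYNONYMS = {
    "positive": "POSITIVE",
    "detected": "POSITIVE",
    "negative": "NEGATIVE",
    "not detected": "NEGATIVE",
}

def normalize_result(raw):
    """Map a raw lab result string to a standard label, or None if unrecognized."""
    if raw is None:
        return None
    return RESULT_SYNONYMS.get(raw.strip().lower())
```

Unrecognized values return None rather than being guessed, so they can be reviewed and added to the table deliberately.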

Add negative flu test as selection criterion

Suggestion from Ken Gersing:

Are we looking for influenza tests with negative results as part of the phenotype? We are being told that early on, before the COVID LOINC codes existed, many people were tested for influenza, and many of those who tested negative are probably COVID patients.

Add assumptions to OMOP documentation

Kristin noted a couple of assumptions for OMOP on the Colorado call: Updating vocab regularly and using ERA tables. Can you add these and any others to the OMOP documentation please?

Multi-fact model

For the ACT phenotype code--do we need to accommodate sites that use the multi-fact model? I know we can't know their table names, so no point in writing out code for it exactly, but maybe add comments indicating where sites would swap in their specific table names in the code?

Add code to drop tables

Drop table at top of phenotype scripts

R script uses NA instead of nulls

Can we prevent the R script from inserting its "NA" convention where nulls should be? We want the nulls to stay null in the final CSVs. Thanks!

OMOP documentation update

Please update OMOP Wiki page to reflect that we're now using the R exporter instead of raw SQL.

Reformat phenotype documentation

Separate into two sections: inclusion criteria (without categories) and categories. This should clear up confusion around whether sites need to precalculate the categories in their phenotypes (they shouldn't).

Fix file/directory name case in exporter scripts, fix directory structure

@marshallclark if it's not doing it already, can you make the Python script UPPER the output_file names so that they're always consistent case?

I will do the same for the R script. The R script also needs a directory structure fix that I will take care of.
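
For the Python side, the case fix might be as small as the sketch below. Upper-casing the stem while lower-casing the extension is an assumption; the issue only asks for consistent case, and the function name is hypothetical.

```python
from pathlib import Path

def upper_output_name(filename):
    """Upper-case the table part of an output file name; keep a lower-case extension."""
    p = Path(filename)
    return p.stem.upper() + p.suffix.lower()
```

With this, `person.CSV` and `Person.csv` both become `PERSON.csv`.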

Discuss changing ACT extraction

After looking at how the mapping is being approached, I want to change the ACT extraction to export multiple fact tables using the ontology; otherwise there is no way the Harmonization team will be able to back into the OMOP mapping. There would be at least four OBSERVATION_FACT tables (diagnoses, procedures, labs, and meds), and demographics would change as well so that the Hispanic element is included in the patient table.

Feedback from UTH

"To be included as a "suspected positive"/"probable positive" (over 50% patients should have COVID), the use of CPT code 86318 (infectious agent antibody) is going to potentially greatly skew these numbers. This code is used generically for many rapid infectious agent antibody tests. CPT 86328 is created specifically for COVID-19 rapid antibody testing. Again, they might think there could be miscoding here, but I would worry that this code is used extensively for other infectious agents and could affect the numbers. "

Gather Date Obfuscation procedures per site

Epidemiological implications of date obfuscation may be minimal to analyses. Issue persists as to whether this information should be gathered per site.

This is only a test for integration

Delete me.

Update LOINC codes for v 1.5

Phenotype 1.5 released, please update phenotype scripts.

PCORnet
ACT
OMOP
TriNetX

Community Issue: Want to help us test?

If you are an ACT site using SQL Server, we would love to make sure the SQL Server versions of our ACT phenotype and extract code run with no errors. If you are able to run one or both of the ACT phenotype scripts (path based and hardcoded) and the ACT extract script, drop us a comment with the results.

You'll have our gratitude!

Unclear on the definition of a 2 year look back

I may have missed it, but is the destination additive? Are we trying to send less data with each submission? Should we use some absolute date 2 years before COVID onset?

Fix extract mistake

I realized a silly thing I did in the PCORnet extract scripts (which I have since fixed)--please check to see if my mistake trickled down to the other models! In the DATA_COUNTS table, I neglected to join to the N3C_COHORT table each time (and add the 1/1/2018 date range where it made sense), so I was just counting the number of rows in the base table rather than the N3C extract.

Please add to your code (you can use mine as an example) if you don't have this already. Thanks!
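
The difference between the two counts can be seen in a small in-memory SQLite sketch. The tables below are simplified stand-ins with made-up rows, not the real PCORnet schema or the actual DATA_COUNTS query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE n3c_cohort (patid TEXT);
    CREATE TABLE encounter (encounterid TEXT, patid TEXT, admit_date TEXT);
    INSERT INTO n3c_cohort VALUES ('P1');
    INSERT INTO encounter VALUES
        ('E1', 'P1', '2019-06-01'),   -- cohort patient, inside the date range
        ('E2', 'P1', '2017-03-15'),   -- cohort patient, before 1/1/2018
        ('E3', 'P2', '2020-02-01');   -- not in the cohort
    """
)

# The mistake: counts every row in the base table.
base_count = conn.execute("SELECT COUNT(*) FROM encounter").fetchone()[0]

# The fix: join to the cohort and apply the 1/1/2018 date range.
extract_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM encounter e
    JOIN n3c_cohort c ON c.patid = e.patid
    WHERE e.admit_date >= '2018-01-01'
    """
).fetchone()[0]
```

Here the base table has three rows but only one belongs in the extract, which is the number DATA_COUNTS should report.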

Add MANIFEST parameters to R exporter

Should move MANIFEST fields to R exporter execution so that sites do not forget to fill it out.

Wording Consistency

The first section names the four tiers as "lab-confirmed positive," "lab-confirmed negative," "probable positive," and "possible positive"; the next section, "Phenotype rules," lists them as "lab-confirmed positive cases," "lab-confirmed negative cases," "suspected positive cases," and "possible positive cases". One of the two could be changed for consistency.

TriNetX MANIFEST table

TriNetX to determine how to handle the MANIFEST table; this will need to be filled out by TriNetX when they run the script (which means they need to know the right information from the site), or a process is needed to have the site create their own MANIFEST table before submitting data.

Clean up stray OMOP files

Clean up any unneeded files from the OMOPExporter folder.
Within \OMOPExporter\inst\sql\sql_server, make sure that the "generate_cohort.sql" and "source_extract_scripts.sql" files are running the logic as you expect and contain all the queries. The individual SQL files should then be removed.

Phenotype picks up lots of children

Thought for future versions--we have reports from UNC and UTH that the phenotype picks up a disproportionate number of children. This may not be an issue, as we know there will be many false positives in the mix, but worth thinking about whether we need to add some criteria to narrow down the number of kids that qualify.

ACT acquisition script

For sites that have the ACT i2b2 phenotype, wouldn't it make sense to start with a query like:

SELECT * FROM observation_fact WHERE concept_cd IN
     (SELECT concept_cd FROM concept_dimension WHERE concept_path LIKE
      '\\ACT\\UMLS_C0031437\\SNOMED_3947185011\\%');

This would catch updates to the phenotype and automagically include synthetic types as well as local mapping. If we wanted to use the local mappings, it would also be necessary to add something of the form:

SELECT * from concept_dimension where concept_path like 
'\\ACT\\UMLS_C0031437\\SNOMED_3947185011\\%';
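
To illustrate why the path-based form picks up local mappings, here is a self-contained sketch against in-memory SQLite. The child path segments and concept codes are invented for the example, paths are stored with single backslashes, and real i2b2 tables have many more columns.

```python
import sqlite3

# Path prefix from the query above, with a trailing backslash.
COVID_PATH = r"\ACT\UMLS_C0031437\SNOMED_3947185011" + "\\"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept_dimension (concept_path TEXT, concept_cd TEXT)")
conn.execute("CREATE TABLE observation_fact (patient_num INTEGER, concept_cd TEXT)")
conn.executemany(
    "INSERT INTO concept_dimension VALUES (?, ?)",
    [
        (COVID_PATH + "EXAMPLE_A" + "\\", "CODE:A"),       # invented ontology code
        (COVID_PATH + "EXAMPLE_LOCAL" + "\\", "LOCAL:1"),  # invented local mapping
        ("\\ACT\\OTHER\\", "CODE:Z"),                      # outside the COVID path
    ],
)
conn.executemany(
    "INSERT INTO observation_fact VALUES (?, ?)",
    [(1, "CODE:A"), (2, "LOCAL:1"), (3, "CODE:Z")],
)

rows = conn.execute(
    """
    SELECT patient_num
    FROM observation_fact
    WHERE concept_cd IN
        (SELECT concept_cd FROM concept_dimension WHERE concept_path LIKE ?)
    ORDER BY patient_num
    """,
    (COVID_PATH + "%",),
).fetchall()
```

Both the ontology code and the locally mapped code are returned because they sit under the same path, with no code list hardcoded in the query.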

Suppress row number in R output

Emily to fix R script to ensure row numbers are not dumped out with data extract.

Parameterize Python script

See title

Check document sample data for proper date format

Antibody tests

On a different note: will antibody tests be classified in the lab-confirmed positive/negative tiers? Some past patient cases were reported as "lab-confirmed negative" but were confirmed by imaging (i.e. CT/X-ray) as COVID-19 pneumonia. Do we treat those as "probable positive"? Is there any value in including imaging diagnoses in the phenotype? A similar puzzle is the existence of L/H phenotypes during ICU treatment. There are many unknowns; we need to design this unique phenotype definition so that those unknowns can be uncovered in downstream analysis.

Originally posted by @johnl8888 in #5 (comment)

Figure out better system to version phenotype code

See title

SITE_NAME is missing in the manifest table

The manifest table only contains the SITE_ABBREV and it is missing SITE_NAME.
We use this field to set the site's full name in our data quality dashboard to generate the output files via the OMOP's DQDashboard. Could you add SITE_NAME field in the manifest table?

Create cohort tables

End phenotype scripts for OMOP and TriNetX with CREATE TABLE N3C_COHORT.
