<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Thanks, <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Generate random test data that mimic a large OpenMRS HIV dataset about fhir-data-pipes HOT 20 CLOSED

google commented on June 7, 2024

Generate random test data that mimic a large OpenMRS HIV dataset

from fhir-data-pipes.

Comments (20)

ibacher commented on June 7, 2024 1

@kimaina Thanks for the update on this! Just to ensure this is properly documented, I'm creating a little simulator for this data over in this repo: https://github.com/ibacher/hiv-test-data.

Is it possible to run the numbers over the same 5-ish year period? From the previous data set, it looked like June 2020 - November 2020 was a bit of an outlier period (I'm guessing it's related to the current pandemic).

from fhir-data-pipes.

kimaina commented on June 7, 2024

Thanks, @jecihjoy for documenting this!

from fhir-data-pipes.

kimaina commented on June 7, 2024

Please find attached monthly summary statistics for key metrics in the past 5 years (2015-2020). To improve granularity, I have provided monthly instead of yearly summaries so that when we generate the synthetic dataset it can reflect this distribution and pattern.

encounters_monthly_aggregates.xlsx

Here is the SQL query used to generate (FYI this is just a draft and not final):

SELECT 
	date_format(encounter_datetime,'%m-%Y') as month_year,
	MONTH(encounter_datetime) as month,
	YEAR(encounter_datetime) as year,    
    COUNT(DISTINCT `encounter_id`) AS distinct_encounters,
	COUNT(DISTINCT `visit_id`) AS distinct_visits,
	COUNT(DISTINCT if(prev_clinical_datetime_hiv is null,`person_id`,null)) AS newly_enrolled,
    COUNT(DISTINCT if(encounter_datetime >= arv_start_date,`person_id`,null)) as patients_on_arvs,
    COUNT(DISTINCT if((encounter_datetime < arv_start_date  or arv_start_date is null),`person_id`,null)) as patients_not_on_arvs,
	COUNT(DISTINCT if(vl_1 < 1000, `encounter_id` ,null)) as vl_suppressed_encounters,
    COUNT( DISTINCT if(vl_1 >= 1000, `encounter_id`,null)) as vl_failure_encouunters,
    COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_datetime, INTERVAL 30 DAY)), endDate) > 90,  `person_id`, null)) as LTFU
    
FROM
    etl.dates 
JOIN etl.flat_hiv_summary_v15b  on date(encounter_datetime) <= date(endDate)
WHERE is_clinical_encounter = 1  and (next_clinical_datetime_hiv is null or date(next_clinical_datetime_hiv) > endDate) and
encounter_datetime >= timestamp(date("2015-01-01"))  #and encounter_datetime<= date('2020-03-01')
group by MONTH(encounter_datetime),  YEAR(encounter_datetime)
order by encounter_datetime desc

from fhir-data-pipes.

ibacher commented on June 7, 2024

@kimaina Thanks!

For generating VL values, from what I've understood, AMPATH pretty much consistently has actual VL values reported in the resulting Obs, but this isn't the case for all implementations. See this comment by Mike Seaton and just my general overview of the landscape I reviewed. So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions?

from fhir-data-pipes.

kimaina commented on June 7, 2024

So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions?

I think we should simulate both quantitative and qualitative VL results since the majority of implementers have both. From my experience working on this, we normally have 2 different concepts representing either the quantitative value or the VL qualifier (LDL or <300).

from fhir-data-pipes.

jonathandick commented on June 7, 2024

I'm surprised but I supposed it's possible that people are recording this categorically. Does anyone have an example of a ciel concept that people are *using* to record this?

…

On Wed, Nov 11, 2020 at 9:35 AM Allan Kimaina ***@***.***> wrote: So the question is: do we want to simulate data where we always have a numeric VL result or should we include some number of these non-numeric results? Opinions? I think we should simulate both quantitative and qualitative VL results since the majority of implementers have both. From my experience working on this, we normally have 2 different concepts representing either the quantitative value or the VL qualifier (LDL or <300). — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#59 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABCFHVNIOXXSSV2EWSDNSLSPLDM7ANCNFSM4TQ6GYPQ> .

-- Jonathan Dick, MD Chief Medical Information Officer, AMPATH Assistant Professor of Clinical Medicine, IU School of Medicine Research Scientist, Regenstrief Institute Inc.

from fhir-data-pipes.

ibacher commented on June 7, 2024

@jonathandick CIEL:1305 (qualitative VL) is the base concept for qualitative VLs and at least CIEL:1302 (Not detected) and CIEL:1306 (Beyond detectable limit) are used to record suppressed VLs in KenyaEMR and UgandaEMR respectively. No idea how PIH are representing the "< 300" or "< 500" part. Theoretically, OpenMRS has a ComplexNumeric datatype to handle those, but AFAIK it's never actually been used.

from fhir-data-pipes.

ibacher commented on June 7, 2024

@kimaina Could you get us an estimate of the # encounters and visits / year (or month) for suppressed versus non-suppressed patients? JJ had indicated that VL non-suppressed patients will be seen more regularly and I'd like a way to estimate that, if possible. Thanks!

from fhir-data-pipes.

kimaina commented on June 7, 2024

@kimaina Could you get us an estimate of the # encounters and visits / year (or month) for suppressed versus non-suppressed patients? JJ had indicated that VL non-suppressed patients will be seen more regularly and I'd like a way to estimate that, if possible. Thanks!

Hey @ibacher,I have provided these 4 estimates in the above excel --> distinct_encounters, distinct_visits, vl_suppressed_encounters, and vl_failure_encounters. I'll rename the above columns to be more understandable, in the next version

from fhir-data-pipes.

kimaina commented on June 7, 2024

@ibacher do you prefer monthly aggregates or yearly aggregates?

from fhir-data-pipes.

ibacher commented on June 7, 2024

@kimaina Ah! I see... then do I have a way of estimating the percent of patients suppressed vs patients unsuppressed?

do you prefer monthly aggregates or yearly aggregates?

Monthly is generally more useful, I just wasn't sure if a single patient was likely to have multiple visits per month or not.

from fhir-data-pipes.

kimaina commented on June 7, 2024

I'll also add-in

the number of unique patients per months
of these patients how many are suppressed
and how many are not

@kimaina Ah! I see... then do I have a way of estimating the percent of patients suppressed vs patients unsuppressed?

from fhir-data-pipes.

kimaina commented on June 7, 2024

Monthly is generally more useful, I just wasn't sure if a single patient was likely to have multiple visits per month or not.

We can also do yearly + monthly aggregates

from fhir-data-pipes.

kimaina commented on June 7, 2024

I have updated the query to include a few more metrics as well as some of @jonathandick and @ibacher suggestions e.g :

How likely is it that a patient will remain active in care this month?
How likely is it that a patient will get a viral load test this month if the patient is eligible for a test?
How likely is it that the test will be < 1000?

This is still a draft and I still need to add @ibacher suggestions on

% of patients LTFU who return to care and, ideally, the meantime to return to care.

SELECT 
	date_format(endDate,'%m-%Y') as month_year,
    # Visits 
    SUM(hmsd.visit_this_month) AS `visit_this_month`,
    SUM(hmsd.scheduled_this_month) AS `scheduled_this_month`,
    SUM(hmsd.unscheduled_this_month) AS `unscheduled_this_month`,
    SUM(hmsd.enrolled_this_month) AS `enrolled_this_month`,
    
    # LTFU and active in care
    COUNT( DISTINCT `person_id` ) as distinct_patients,
    SUM(hmsd.active_in_care_this_month) AS `active_in_care_this_month`,
    AVG(hmsd.days_since_rtc_date) AS `avg_days_since_rtc_date`,
    COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90,  `person_id`, null)) as LTFU_since_2015,
    COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90  AND TIMESTAMPDIFF(DAY, encounter_date,`endDate`) <365,  `person_id`, null)) as LTFU_past_year,
    COUNT( DISTINCT if( timestampdiff(day, if(rtc_date, rtc_date, DATE_ADD(encounter_date, INTERVAL 30 DAY)), endDate) > 90  AND TIMESTAMPDIFF(DAY, encounter_date,`endDate`) <365/2,  `person_id`, null)) as LTFU_past_6months,
    COUNT( DISTINCT if( days_since_rtc_date > 90,  `person_id`, null)) as LTFU_this_month,
    COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, death_date,`endDate`) < 30,`person_id`, null)) as deaths_this_month,
    COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, death_date,`endDate`) < 365,`person_id`, null)) as deaths_past_year,
    COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, transfer_out_date,`endDate`) < 30,`person_id`, null)) as transfer_out_this_month,
    COUNT( DISTINCT if(TIMESTAMPDIFF(DAY, transfer_out_date,`endDate`) < 365,`person_id`, null)) as transfer_out_past_year,
    
	# ART Status
	SUM(hmsd.art_revisit_this_month) AS `art_revisit_this_month`,
        SUM(hmsd.is_pre_art_this_month) AS `pre_art_this_month`,
	SUM(hmsd.on_art_this_month) AS `on_art_this_month`,
         SUM(hmsd.started_art_this_month) AS `started_art_this_month`,
	SUM(had_med_change_this_month) AS `had_med_change_this_month`,
     
     # ART Line
     SUM(hmsd.on_original_first_line_this_month) AS `on_original_first_line_this_month`,
     SUM(hmsd.on_alt_first_line_this_month) AS `on_alt_first_line_this_month`,
     SUM(hmsd.on_second_line_or_higher_this_month) AS `on_second_line_or_higher_this_month`,

	# Viral Load
    COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `encounter_id`, null)) as vl_suppressed_encounters_this_month,
    COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `encounter_id`, null)) as vl_unsuppressed_encounters_this_month,
    COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `person_id`, null)) as vl_suppressed_patients_this_month,
    COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 30, `person_id`, null)) as vl_unsuppressed_patients_this_month,
    COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `person_id`, null)) as vl_suppressed_patients_past_year,
    COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) <365, `person_id`, null)) as vl_unsuppressed_patients_past_year,
    COUNT( DISTINCT if(vl_1 < 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `encounter_id`, null)) as vl_suppressed_encounters_past_year,
    COUNT( DISTINCT if(vl_1 >= 1000 AND TIMESTAMPDIFF(DAY, vl_1_date,`endDate`) < 365, `encounter_id`, null)) as vl_unsuppressed_encounters_past_year,
     
     
	SUM(hmsd.due_for_vl_this_month) AS `due_for_vl_this_month`,
	SUM(hmsd.qualifies_for_follow_up_vl) AS `qualifies_for_follow_up_vl`,
        SUM(hmsd.got_follow_up_vl_this_month) AS `got_follow_up_vl_this_month`,
        SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_suppressed_this_month`,
        SUM(hmsd.follow_up_vl_unsuppressed_this_month) AS `follow_up_vl_unsuppressed_this_month`,
        SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_unsuppressed`,
        SUM(hmsd.follow_up_vl_suppressed_this_month) AS `follow_up_vl_suppressed`,
        AVG(hmsd.num_days_to_follow_vl) AS `avg_num_days_to_follow_vl`
    
    
    

FROM
    etl.hiv_monthly_report_dataset_frozen `hmsd`
WHERE
    (endDate >= '2015-01-01')
GROUP BY endDate
order by endDate desc

here is a sample xlxs
2020_monthly_aggregates.xlsx

from fhir-data-pipes.

kimaina commented on June 7, 2024

hey, @ibacher I have completed generating the necessary data elements for generating the required metrics for the simulation. Quick question, do you prefer proportions or percentages?

https://github.com/kimaina/vl-indicators/blob/main/2020_monthly_aggregates_v2.csv

from fhir-data-pipes.

ibacher commented on June 7, 2024

I’ve been converting everything into percentages, apart from raw counts, but whatever is easiest!

from fhir-data-pipes.

kimaina commented on June 7, 2024

you mean proportions, right! I'm happy to provide proportions and counts

from fhir-data-pipes.

kimaina commented on June 7, 2024

I have completed creating the query with the recommendations, I'll be running it during off-peak to generate the dataset then get back to you

from fhir-data-pipes.

kimaina commented on June 7, 2024

Hi @ibacher

Please see the PR: https://github.com/ibacher/hiv-test-data/pulls

You will find:

proportions.csv containing key proportions
script.Rmd is the R script used to generate portions
2020_monthly_aggregates_v2.csv raw dataset containing over 55 indicators

from fhir-data-pipes.

ibacher commented on June 7, 2024

@kimaina Thanks for the PRs and all the data!

from fhir-data-pipes.

Generate random test data that mimic a large OpenMRS HIV dataset about fhir-data-pipes HOT 20 CLOSED

Comments (20)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent