detect_pilot_test_1y's Issues

Merge MedStar data with APS data

Goal: We want to measure the agreement between the results of DETECT screenings and the results of APS investigations.

Problem 1: Currently, the results of the DETECT screenings are in a dataset we received from MedStar Mobile Healthcare and the results of APS investigations are in a separate dataset we received from APS. We need to merge the two separate datasets into a single dataset that can be used for analysis.

Problem 2: There is no common identifier variable in both datasets that we can use to match records in the MedStar data with records in the APS data. Therefore, we will have to match based on name and date of birth, which we have in both datasets.

Problem 3: Although we have name and date of birth (dob) in both datasets, we can't match records across datasets in a deterministic way (i.e., IF first name = John in MedStar AND first name = John in APS THEN match, ELSE no match) because there are typos in the data. For example, "John" and "Jon" may clearly be the same person (i.e., same last name, dob, and address), but the names would not match exactly.

Solution: Therefore, we will need to link records across the datasets probabilistically. R has at least two packages that are designed for probabilistic record linking:

  1. RecordLinkage
  2. fastLink

Steps in the record linking process:

  • Prepare data for linking. First, standardize the string variables that will be used for matching. For example, convert all string values to lower case and remove extra spaces. Second, break the name, dob, and address variables into separate variables containing their component parts. For example, convert "name" to "name_first" and "name_last" and "dob" to "dob_month", "dob_day", and "dob_year". We did this step in separate files for each of the datasets: data_aps_02_variable_management.Rmd and data_medstar_epcr_02_variable_management.Rmd. A minimal sketch of this standardization step appears after this list.

-[ ] Next step...
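For reference, here is a minimal sketch of the standardization step, assuming a data frame named medstar with hypothetical columns name (full name as a single string) and dob (stored as a Date); the real work lives in the variable management files listed above:

```r
library(dplyr)
library(stringr)

medstar_clean <- medstar %>%
  mutate(
    # Lower-case and collapse extra whitespace in the string matching variables
    name = str_squish(str_to_lower(name)),
    # Break name into component parts (assumes "first last" ordering)
    name_first = word(name, 1),
    name_last  = word(name, -1),
    # Break dob into component parts
    dob_month = lubridate::month(dob),
    dob_day   = lubridate::day(dob),
    dob_year  = lubridate::year(dob)
  )
```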

Old stuff....

I copied "data_medstar_aps_merged_01.Rmd" from the 5-week analysis project to the 1-year analysis project. Before moving on to trying to get fastLink to work or writing your own matching algorithm, see if you can get this file to work using the new RecordLinkage big data classes.

https://cran.r-project.org/web/packages/RecordLinkage/vignettes/BigData.pdf
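Based on that vignette, a minimal sketch of the big-data linkage workflow might look like the following. It assumes the standardized data frames are named medstar_clean and aps_clean and share the columns name_first, name_last, dob_year, dob_month, and dob_day (the object names, column names, and threshold are assumptions, not the project's actual values):

```r
library(RecordLinkage)

# Build a big-data linkage object: block on exact dob, fuzzy-compare names
rpairs <- RLBigDataLinkage(
  dataset1 = medstar_clean,
  dataset2 = aps_clean,
  blockfld = c("dob_year", "dob_month", "dob_day"),
  strcmp   = c("name_first", "name_last")
)

rpairs <- epiWeights(rpairs)                          # compute EpiLink match weights
result <- epiClassify(rpairs, threshold.upper = 0.7)  # classify pairs as links / non-links
links  <- getPairs(result, filter.link = "link", single.rows = TRUE)  # review the linked pairs
```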

  • Remove the TOC stuff from the top of data_medstar_aps_merged_01_merge.Rmd
  • Check the really low weight matches too. I'm not sure how RecordLinkage handles missing data. Maybe start with a random sample just to quickly get an idea.
  • Save RecordLinkage objects to secure drive
  • Move drop investigation stage to data_aps_02_variable_management.Rmd, if you keep it
  • If we just reduce our search space to unique combinations, the entire section "Prepare APS data for record matching" may be unnecessary.
  • Move all the data management stuff in the "reduce search space" section to the appropriate variable management file.

After you finish matching, consider breaking this code up into 3 separate files:

  • Cleaning and merging
  • Filtering merge
  • Data checking merge

Incident Complaint v Symptoms Table

Hi @mbcann01 ,

I am having some trouble with this code. I just pushed what I have so far to the develop-medical-conditions branch. Starting at line 172, I am not sure how to create the table: we want cell frequencies, but the rows are incident complaints and the columns are the totals from each individual symptom column.
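One possible way to build a table like that, assuming each symptom is stored as its own 0/1 indicator column with a "symptom_" prefix and the complaint column is named incident_complaint (both names are assumptions for illustration):

```r
library(dplyr)

# One row per incident complaint; each symptom indicator column summed within complaint
complaint_symptom_table <- medstar %>%
  group_by(incident_complaint) %>%
  summarise(across(starts_with("symptom_"), ~ sum(.x, na.rm = TRUE)), .groups = "drop")
```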

Get Jared access to the data server

We need to get you direct access to the data server. This way we are always both working from the same data source and we can use common file paths in our code (i.e., one file path that works on both of our computers).

  • Brad needs to talk to Chris Harvey about adding Jared as an authorized user

  • Jared needs to learn how to access the data from his laptop once he's been authorized to do so.

Pre-clean the APS data: One row per case number

Right now, in detect_pilot_test_1y_refine_matches.Rmd, the APS data has a row for each reporter, as opposed to each case. This is causing some issues with merging the data (#27). Specifically, it affects our ability to identify the most proximal APS investigation to each DETECT screening. And we can't just arbitrarily pick one row from each case because we found differing investigation outcomes in some cases.

So, we want to retain the information related to multiple reporters, but we want to do it in a wide format. A sketch of this reshaping appears after the task list below.

  • Resolve within case discrepancies between rows
  • Widen reporter information by creating dummy variables
  • Create min and max investigation date variables
  • Create a number of reporters variable
  • Check to see what effect, if any, this has on the RecordLinkage results.
    • Changed ems dummy variable from "ems" to "reporter_ems". See how that affects the DDD code.
  • Push wide APS data frame through downstream code
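A minimal sketch of the reshaping, assuming the APS data frame is named aps with columns case_num, reporter, and investigation_date (assumed names for illustration):

```r
library(dplyr)
library(tidyr)

# One dummy column per reporter type (e.g., reporter_ems), one row per case
reporter_wide <- aps %>%
  distinct(case_num, reporter) %>%
  mutate(flag = 1L) %>%
  pivot_wider(
    names_from   = reporter,
    names_prefix = "reporter_",
    values_from  = flag,
    values_fill  = 0L
  )

# Min/max investigation dates and number of reporters, then join the dummies back on
aps_by_case <- aps %>%
  group_by(case_num) %>%
  summarise(
    inv_date_min = min(investigation_date, na.rm = TRUE),
    inv_date_max = max(investigation_date, na.rm = TRUE),
    n_reporters  = n_distinct(reporter),
    .groups = "drop"
  ) %>%
  left_join(reporter_wide, by = "case_num")
```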

Addressing in the bug-31-aps-one-row-per-case branch.

Merge the APS and MedStar datasets into a single dataset for analysis

Overview

  • MedStar provided us with data on all of the initial DETECT screenings their medics completed during the study period.
  • APS provided us with data on all of the investigations they completed during the study period.
  • We need to link DETECT screenings in the MedStar data with investigation outcomes in the APS data.

Complications

  • There is no unique person identifier in the datasets that we can use to link rows. Therefore, we will have to probabilistically link rows based on name, dob, and address.
  • Each person may have more than one row in the MedStar data.
  • Each person may have more than one row in the APS data. Additionally, each investigation may have more than one row in the APS data.
  • There are typos and misspellings within our matching variables (name, dob, and address).

Software

Here is a list of software packages we have tried already with mixed results.

  • R RecordLinkage package. This package has worked well for us in the past (see files in the DETECT 5-week pilot repo). However, with the larger 1-year dataset, we have repeatedly run into really slow run times and memory errors.
  • R fastLink package. We also experimented with this package. It seems to avoid the slow run times and memory errors that we had with RecordLinkage; however, we had trouble getting the output we needed from this package. See an issue we posted on fastLink's GitHub repo for details. A minimal usage sketch appears after this list.
  • Python Dedupe package. Patrick and Sydney from Meadow's have been experimenting with this package. However, we are not as familiar with Python as we are with R.
    • My understanding is that we are having trouble using the results of this package to create unique identifiers in the datasets.
    • Additionally, I believe there may be a point-and-click element to the training process. For example, I think the package may randomly select potential matches for the user to evaluate while the model is training. If that is the case, I'm concerned about the reproducibility of our results.
  • SAS Link King macro. Finally, somebody suggested this package. If it's the best choice, then so be it, but I would personally prefer not to use SAS if we don't have to. Additionally, it looks as though this macro is no longer under active development.
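For reference, a minimal sketch of the fastLink workflow we experimented with, assuming the standardized data frames are named medstar_clean and aps_clean and share the matching columns shown (names are assumptions):

```r
library(fastLink)

fl_out <- fastLink(
  dfA = medstar_clean,
  dfB = aps_clean,
  varnames         = c("name_first", "name_last", "dob_year", "dob_month", "dob_day"),
  stringdist.match = c("name_first", "name_last"),  # fuzzy comparison on names
  partial.match    = c("name_last")                 # allow partial agreement on last name
)

# Pull the rows of each data frame that fastLink declared matches
matched <- getMatches(dfA = medstar_clean, dfB = aps_clean, fl.out = fl_out)
```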

Tasks

Depending on how involved each of these tasks is, and on Morri's workflow, it may make sense to break some of these off into their own separate issues.

  • Reduce APS data to one row per investigation if it is possible to do so without the loss of information.
  • Create a unique person identifier in the APS data.
  • Create a unique person identifier in the MedStar data.
  • Reduce both data frames to one row per unique person.
  • Probabilistically link rows from both data frames.
  • Manually review probabilistic matches.
  • Create a unique match number that can be used to join the MedStar data and the APS data.
  • Add unique match numbers back to the full APS and MedStar data frames.
  • Join the full MedStar and APS data frames together by match id.
  • Filter matches by date.

Notes

  • Use topical branches for files that are in development.

  • Clear the environment at the bottom of every file

  • Start to incorporate pathfinder

  • Put all functions in R scripts with roxygen headers. At the end of the analysis add to bfuncs.

  • Use the built-in TOC for notebooks and explore different themes as described here: https://minimaxir.com/2017/06/r-notebooks/ (see the example header after this list)

  • Try versioning, bibliography, etc. in RStudio?

  • Learn how to save directly to Google Docs?

  • Try creating Word output using officer

  • Try making shaded diagram boxes where it makes sense
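An example R Notebook YAML header with a built-in floating TOC and a non-default theme, per the post linked above (the title and theme name are just placeholders):

```yaml
---
title: "Notebook title"
output:
  html_notebook:
    toc: true          # built-in table of contents
    toc_float: true    # keep the TOC visible while scrolling
    theme: flatly      # example theme; other Bootswatch theme names also work
---
```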

Add APS data to REDCap

Overview

If Morri decides to go this route, we will add the APS data to a REDCap database. We would do this to eliminate the need to share files through shared folders and reduce the risk of accidentally pushing data to GitHub.

Tasks

  • Sign up for a REDCap account. There are actually two choices here. UTHealth has a REDCap instance and the School of Public Health also has a REDCap instance. For either, you need to request access. I'm not aware of any big differences between the two instances. I think the UTHealth instance might be running a slightly newer version of REDCap. All else being equal, we may want to use that one.
  • Create a REDCap project that can store the data.
  • Add Brad to the project.
  • Import the raw data into REDCap.
  • Request an API token for the project.
  • Test importing data into R/Python via the API (see the sketch after this list).
  • Document the process in an SOP/Continuity Guide.
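A minimal sketch of the API import test in R, using the REDCapR package (the URI is a placeholder for the instance's real API URL, and the environment variable holding the token is an assumption):

```r
library(REDCapR)

aps_redcap <- redcap_read(
  redcap_uri = "https://redcap.example.edu/api/",     # placeholder; use the instance's API URL
  token      = Sys.getenv("DETECT_APS_REDCAP_TOKEN")  # keep the token out of the script
)$data
```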

Fix fmr_add_unique_id

Currently, fmr_add_unique_id assumes that the fastLink object is correct. In reality, however, we determined that we get more accurate results when we make some manual adjustments to the fastLink results. We need to add the ability to account for those manual adjustments.

Revise data for ITS analysis

  • Create data_aps_02_variable_management.Rmd
  • Renumber data_aps_02_process_for_its.Rmd
  • Clean character variables
  • Subset city for ITS
  • Add dummy for city
  • Add allegation outcome variables
  • Make available to Livingston

Add MedStar data to REDCap

Overview

If Morri decides to go this route, we will add the MedStar data to a REDCap database. We would do this to eliminate the need to share files through shared folders and reduce the risk of accidentally pushing data to GitHub.

Tasks

  • Sign up for a REDCap account. There are actually two choices here. UTHealth has a REDCap instance and the School of Public Health also has a REDCap instance. For either, you need to request access. I'm not aware of any big differences between the two instances. I think the UTHealth instance might be running a slightly newer version of REDCap. All else being equal, we may want to use that one.
  • Create a REDCap project that can store the data.
  • Add Brad to the project.
  • Import the raw data into REDCap.
  • Request an API token for the project.
  • Test importing data into R/Python via the API.
  • Document the process in an SOP/Continuity Guide.

Create codebooks

Overview

After merging and wrangling the APS and MedStar data (#33), we need to create a codebook (or codebooks) for the data using the codebookr package.
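A minimal sketch of the codebookr workflow, assuming the merged analysis data frame is named detect_1y and that we want a Word codebook (the object name, column name, titles, and file name are assumptions):

```r
library(codebookr)
library(dplyr)

# Optionally attach column-level metadata before building the codebook
detect_1y <- detect_1y %>%
  cb_add_col_attributes(name_first, description = "Patient first name (standardized)")

detect_codebook <- codebook(
  df          = detect_1y,
  title       = "DETECT 1-Year Pilot Data",
  description = "Merged MedStar DETECT screenings and APS investigation outcomes"
)

print(detect_codebook, "detect_1y_codebook.docx")  # write the codebook to a Word document
```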

Merge branches

The branch history is getting messy and hard to read. Merge and delete branches.
