pccc's People

Contributors

ck2136, dewittpe, magic-lantern, tdbennett

pccc's Issues

`pccc-overview` vignette

Hi @magic-lantern and @jamesfeinstein,

Something isn't clear about this section of the overview vignette:

* All codes in all categories employ "starts with substring" matching logic. Because of this, if a code to be evaluated starts with a code listed in one of the CCC categories, a match will be found. This means that if a bad ICD code is provided (such as ICD-9-CM code 0492,25042) PCCC would indicate a match for the Neuromuscular CCC.

How are those codes "bad"? 0492 is used in an example above that section in a way that makes it seem acceptable.
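For reference, the "starts with substring" rule quoted above can be sketched in a few lines of standalone C++. This is only an illustration of the rule, not the package's actual implementation; the function names and the one-element code list are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true when `code` begins with `prefix` (the "starts with" rule).
bool starts_with(const std::string& code, const std::string& prefix) {
    return code.size() >= prefix.size() &&
           code.compare(0, prefix.size(), prefix) == 0;
}

// Returns true when `code` starts with any entry of `category_codes`.
// `category_codes` is a stand-in for one CCC category's code list.
bool matches_category(const std::string& code,
                      const std::vector<std::string>& category_codes) {
    for (const std::string& prefix : category_codes) {
        if (starts_with(code, prefix)) return true;
    }
    return false;
}
```

With a toy list containing only 0492, any input that merely begins with 0492 is reported as a match, even when the full string is not a real ICD code.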

I'll push up some text edits to a few files shortly.

CCC Version 2 Code Resolutions

Document from Chris Feudtner attached with his decisions on code discrepancies. He highlighted codes that ARE CCCs and wrote his comments in CAPS. We will need to discuss strategy for implementing code for some of the substring issues, for example, ICD-9-CM 359*.
CCC V2 Issues_dai.docx

Discussion/take home points

  • R is free, open source, take advantage of package structure and documentation
  • leveraging C++ back-end (proprietary software likely has, but this is explicit)
  • benefits of collaborative code development/version control workflow
  • @tdbennett to look for references, perhaps from ROpenSci post-doc looking at barriers

Data Set documentation throws a WARNING on R CMD check

* checking for code/documentation mismatches ... WARNING
Data codoc mismatches from documentation object 'pccc_icd10_dataset':
Variables in data frame 'pccc_icd10_dataset'
  Code: dx1 dx10 dx2 dx3 dx4 dx5 dx6 dx7 dx8 dx9 g1 g10 g2 g3 g4 g5 g6
        g7 g8 g9 id pc1 pc10 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9
  Docs: dx1:dx10 g1:g10 id pc1:pc10

Data codoc mismatches from documentation object 'pccc_icd9_dataset':
Variables in data frame 'pccc_icd9_dataset'
  Code: dx1 dx10 dx2 dx3 dx4 dx5 dx6 dx7 dx8 dx9 g1 g10 g2 g3 g4 g5 g6
        g7 g8 g9 id pc1 pc10 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9
  Docs: dx1:dx10 g1:g10 id pc1:pc10

Error in ICD9 codes for metabolic

I think I've noticed some errors in the ICD9 procedure codes for the metabolic CCC.

> pccc::get_codes(9)["metabolic", ]$pc
 [1] "064"  "0652" "0681" "073"  "0764" "0765" "0768" "0769" "6241" "645"  "6551" "6553"
[13] "6561" "6563" "6841" "6849" "6851" "6859" "6861" "6869" "6871" "6879" "8606"

Shouldn't '064' and '073' be '0064' and '0073', respectively? Otherwise I match 64.0/0640 = circumcision or 73.0 = procedures during delivery.

Thanks

Use RcppParallel to improve speed.

Set up the C++ code to use RcppParallel to work on a subject.

Currently, the C++ is designed for one subject. A Map call in the ccc function sends each subject's data to the ccc_rcpp call. Redesign so that each subject's data is sent to its own core.

package installation bug

pccc will not build from source or install from github on Windows or OSX currently. I believe this is related to how we are requiring Rcpp and dplyr. I'm exploring this and will update.

Update R scripts to use dplyr 0.7

With the release of dplyr version 0.7, the underscored functions such as dplyr::select_ have been deprecated. Read vignette("programming", package = "dplyr") for details.

The ccc.data.frame function, see the file: R/ccc.R, needs to be updated.

KID validation sample

@jamesfeinstein to ask Chris Feudtner/Dingwei Dai if they restricted the KID sample prior to conducting the validation analysis of the 2014 SAS and Stata code

Expect "Big" Data

Expect hundreds of thousands, if not millions, of individuals in the data sets that need to be searched. Design the ccc function to work on the rows of the input data. Avoid the tidyr::gather method that was used in the initial design.

PROTECT bug in C code

Got this from CRAN maintainers:

There is a PROTECT bug in your package pccc (version 1.0.2). The bug manifests itself as "heap-use-after-free" with ASAN (see Additional issues in CRAN checks). One can provoke the problem also with gctorture (e.g. R_GCTORTURE=100) and using this small example which will segfault:

library(pccc); get_codes(10)

The problem is in get_codes.cpp, in calls like

 Rcpp::List dx_fixed = Rcpp::List::create(
      Rcpp::wrap(cds.get_dx_fixed_neuromusc()),
      Rcpp::wrap(cds.get_dx_fixed_cvd()),
      Rcpp::wrap(cds.get_dx_fixed_respiratory()),
      Rcpp::CharacterVector::create(),
      Rcpp::CharacterVector::create(), ...

Rcpp::wrap returns an (unprotected) newly allocated SEXP. All uses of Rcpp::wrap inside a call to Rcpp::List (many instances in get_codes.cpp) need to be protected via Rcpp::Shield(), otherwise these get destroyed by allocations when evaluating the other arguments - as in the attached patch.

Make editing of ICD Codes that make up each CCC easier

Right now the ICD codes for each CCC are hard coded into the file pccc.cpp

For ease of maintenance, pull the codes out into some external format - such as a CSV or other data structure that allows for users to inspect codes and modify if desired.

Would need to provide documentation on how to modify.

If there are issues or changes with the set of ICD codes that go with a particular CCC, this might make it easier for users to make a pull request as they wouldn't need to be familiar with the source code.
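As a rough sketch of what loading an external code list might look like, the standalone C++ below parses CSV text into a lookup table. The three-column layout (category, code_type, code), the header row, and all names are assumptions made for illustration; the package does not define such a file format.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical layout: a header row, then one "category,code_type,code"
// record per line. Keys of the table are "category/code_type".
using CodeTable = std::map<std::string, std::vector<std::string>>;

CodeTable parse_code_csv(const std::string& csv_text) {
    CodeTable table;
    std::istringstream lines(csv_text);
    std::string line;
    bool is_header = true;
    while (std::getline(lines, line)) {
        if (is_header) { is_header = false; continue; }  // skip header row
        if (!line.empty() && line.back() == '\r') line.pop_back();  // CRLF files
        if (line.empty()) continue;
        std::istringstream fields(line);
        std::string category, code_type, code;
        // Records with fewer than three fields are silently skipped.
        if (std::getline(fields, category, ',') &&
            std::getline(fields, code_type, ',') &&
            std::getline(fields, code, ',')) {
            table[category + "/" + code_type].push_back(code);
        }
    }
    return table;
}
```

A file in this shape could be inspected or edited in a spreadsheet, and a pull request would then touch only data rows rather than source code.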

ICD code pre-processing

@dewittpe to explore icd R package capabilities vs. Stata native icd clean functions.

SAS and Stata code in 2014 paper differ in this: Stata script includes pre-processing that removes leading and trailing blanks and leading 0's.
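The Stata-style cleaning described above (remove leading and trailing blanks, then leading 0's) could be sketched as follows. This is a minimal standalone C++ illustration with a hypothetical function name; it is not taken from the Stata script or from the icd package, and whether pccc should strip leading zeros at all is the open question of this issue.

```cpp
#include <cassert>
#include <string>

// Removes leading and trailing blanks (spaces and tabs), then leading '0's.
std::string clean_icd_code(const std::string& raw) {
    const std::string blanks = " \t";
    const std::size_t first = raw.find_first_not_of(blanks);
    if (first == std::string::npos) return "";  // empty or all blanks
    const std::size_t last = raw.find_last_not_of(blanks);
    std::string code = raw.substr(first, last - first + 1);
    const std::size_t nonzero = code.find_first_not_of('0');
    if (nonzero == std::string::npos) return "";  // code was only zeros
    return code.substr(nonzero);
}
```

Note that under this rule an input such as " 0492 " becomes "492", so the cleaned code no longer starts with the same characters as the raw one. That interacts with any "starts with" matching applied afterwards.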

Failed to install PCCC to Linux R

Hi. Thank you very much for developing a wonderful package. While I was able to use it in R on macOS, I failed to install it in R on Red Hat Linux (redhat-linux-gnu). Here is the error message:

In file included from ccc.cpp:6:
pccc.h:43: error: a brace-enclosed initializer is not allowed here before ‘{’ token
pccc.h:43: error: ISO C++ forbids initialization of member ‘empty’
pccc.h:43: error: making ‘empty’ static
pccc.h:43: error: invalid in-class initialization of static data member of non-integral type ‘const std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >’
make: *** [ccc.o] Error 1
ERROR: compilation failed for package ‘pccc’

* removing ‘/****/R/x86_64-redhat-linux-gnu-library/3.4/pccc’
Warning in install.packages :
  installation of package ‘pccc’ had non-zero exit status

If this is not a bug, would you mind assisting me in installing the package? Thank you.

ccc function used the dx code twice

The ccc function uses the dx codes twice and ignores the px code inputs.

When fixing this also allow for NULL values to be set for either the dx or px codes.

ICD 10 DX codes for Malignancy

Documentation for malignancy ICD10 codes includes "C00-C96" (https://github.com/CUD2V/pccc/blob/master/inst/pccc_references/Categories_of_CCCv2_and_Corresponding_ICD.docx)

In the src code for the package, however, only "C" is defined.

dx_malignancy = {"C","D00","D01","D02","D03","D04","D05","D06","D07","D08","D09","D37","D38",

Do we need to explicitly define C00, C01, C02, C03, C04, ..., C96? I think so for two reasons,

  1. Note that the D01-D09 codes are explicitly defined.
  2. Here is an example of an errant mapping.
library(pccc)
packageVersion("pccc")
# [1] ‘1.0.5’

# id2 has a made up code "CB" which should not match anything, but returns true
# for malignancy
eg_data <- data.frame(id = c("id1", "id2", "id3"),
                      dx1 = c("NOTACODE", "NOTACODE", "notacode"),
                      dx2 = c("C00", "E75", "NOTACODE"),
                      dx3 = c("A", "CB", "C"))

ccc(eg_data, dx_cols = dplyr::starts_with("dx"), icdv = 10)
#   neuromusc cvd respiratory renal gi hemato_immu metabolic congeni_genetic malignancy neonatal tech_dep transplant ccc_flag
# 1         0   0           0     0  0           0         0               0          1        0        0          0        1
# 2         1   0           0     0  0           0         0               0          1        0        0          0        1
# 3         0   0           0     0  0           0         0               0          1        0        0          0        1
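One way to replace the bare "C" prefix with explicit codes is to generate the prefixes for a documented range. The standalone C++ sketch below does that; the helper names are hypothetical, and the generated range is purely numeric, so it may include numbers that are not valid ICD-10 categories and would need checking against the reference document.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds explicit three-character prefixes, e.g. ('C', 0, 96) -> C00 .. C96.
// Assumes 0 <= first <= last <= 99.
std::vector<std::string> expand_range(char letter, int first, int last) {
    std::vector<std::string> prefixes;
    for (int n = first; n <= last; ++n) {
        std::string prefix(1, letter);
        if (n < 10) prefix += '0';  // zero-pad single digits
        prefix += std::to_string(n);
        prefixes.push_back(prefix);
    }
    return prefixes;
}

// Returns true when `code` starts with any of `prefixes`.
bool has_prefix_in(const std::string& code,
                   const std::vector<std::string>& prefixes) {
    for (const std::string& p : prefixes) {
        if (code.size() >= p.size() && code.compare(0, p.size(), p) == 0) {
            return true;
        }
    }
    return false;
}
```

With the expanded list, "C00" still matches, while the made-up "CB" and the bare "C" from the example above do not.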

Invalid ICD10CM codes

Found some codes that appear to be invalid ICD10CM codes. Some appear to be typos, but others I'm not sure about.

Reached out to JF to get advice.

| icd10cm | category | code_type | notes |
|---|---|---|---|
| D08 | malignancy | dx | invalid |
| D85 | hemato_immu | dx | ?E85 (Amyloidosis) |
| D87 | hemato_immu | dx | invalid? |
| D88 | hemato_immu | dx | invalid |
| G8290 | neuromusc | dx | invalid |
| P2521 | neonatal | dx | ? P52.21 (Intraventricular nontraumatic hemorrhage) |
| P2522 | neonatal | dx | invalid |
| Z446 | tech_dep | dx | ?Z46.6 (Encounter for fitting and adjustment of urinary device) |
| Z446 | renal | dx | ?Z46.6 |
| Z45441 | tech_dep | dx | invalid |
| Z45442 | tech_dep | dx | invalid |

@dewittpe - tagging you so you are aware I've filed this issue.

Discrepancies and potential problems with current ICD code lists

This issue was originally opened as magic-lantern#1 - moving to new primary repo. Due to the number of potential issues with the ICD matching, I've copied the entire original post here.

Code fixes will be implemented by @magic-lantern; actual decisions on how each item will be resolved will be made primarily by @jamesfeinstein.

ICD 9 duplicates (are any of these mistakes and should actually be a different code?)

  • neuromusc_ccc dx 343 listed twice
  • tech_dep_ccc dx T84498A, T86890, T86891, and T86899 are all listed twice
  • transplant_ccc dx T86890, T86891, T86899 are all listed twice
  • neuromusc_ccc dx G253 is listed twice

ICD9 code discrepancies

  • neuromusc_ccc dx - includes 359 as well as 3590, 3591, 3592, 3593. Due to substring matching any code starting with 359 will match.
  • Respiratory_CCC dx - includes 5163 as well as 51631. Due to substring matching any code starting with 5163 will match.
  • Metabolic CCC dx - 2359 metabolic should actually be 2539.
  • Metabolic CCC pc - includes 624 as well as 6241. Due to substring matching, any code starting with 624 will match.
  • The Stata version has several different codes not present in either the R version or the SAS version:
    • neuromusc_ccc dx: 3311, 3318
    • cardiovascular_ccc dx: 416
    • respiratory_ccc dx: 4160

There are several differences between SAS/R and Stata for ICD10 codes. Here are the mismatches that I found:

  • Stata has 9782 as a neuromuscular ccc; SAS/R don’t have that code at all
  • SAS/R have 9620 as a respiratory dx - Stata doesn’t have it anywhere.
  • SAS/R have G3289 as a neuromuscular dx - Stata doesn't have that code at all.
  • SAS/R have G4753 as a respiratory dx, Stata has G4735 (3/5 transposed)
  • Stata has J9620 as a respiratory dx, SAS/R don’t have it at all.
  • Stata has Q219, Q258, and Q259 as Cardiovascular dx, SAS/R don’t.
  • Stata has Q851 as a NEUROMUSCULAR dx, SAS/R don’t.
  • SAS/R have T82121A as a Cardiovascular and Tech_dep CCC, Stata doesn’t have it at all.

There are also some ICD10 codes that could be problematic - this is looking at the SAS file:

  • hemato_immu_ccc dx: has D84 and metabolic_ccc has D841. Since D84 comes before D841 in the SAS file, D841 will never be categorized as a metabolic CCC.
  • hemato_immu_ccc dx: has both D86 and D869. Due to substring matching any code that starts D86 will be included.
  • metabolic_ccc dx has E75 and neuromusc_ccc has E750, E751, E752, and E754. Due to ordering in the file, neuromuscular CCCs are flagged before the more general category E75 would be flagged as metabolic.
  • neuromusc_ccc dx: G318 and G3189 are both listed. Due to substring matching any code that starts G318 will be included.
  • neuromusc_ccc dx: G80 and G803 are both listed. Due to substring matching any code that starts G80 will be included.
  • I43 is listed as both a cvd_ccc dx and a respiratory_ccc dx. Since CVD comes first, I43 will never be classified as a respiratory CCC.
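Overlaps like the ones listed above can be found mechanically by checking every code against every other code for a proper-prefix relationship. The standalone C++ below is a minimal sketch of such a check; the `CodeEntry` struct and function name are hypothetical, and the check is quadratic, which should be acceptable for code lists of this size.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One code together with the CCC category it is listed under.
struct CodeEntry {
    std::string category;
    std::string code;
};

// Returns (shorter, longer) pairs where the shorter code is a proper
// prefix of the longer one, within or across categories.
std::vector<std::pair<CodeEntry, CodeEntry>>
find_prefix_overlaps(const std::vector<CodeEntry>& entries) {
    std::vector<std::pair<CodeEntry, CodeEntry>> overlaps;
    for (const CodeEntry& a : entries) {
        for (const CodeEntry& b : entries) {
            if (a.code.size() < b.code.size() &&
                b.code.compare(0, a.code.size(), a.code) == 0) {
                overlaps.emplace_back(a, b);
            }
        }
    }
    return overlaps;
}
```

Run over entries such as D84 (hemato_immu) and D841 (metabolic), this reports the pair so that a reviewer can decide whether the longer code is redundant or is being shadowed by a different category. Exact duplicates, such as I43 appearing in two categories, are not proper prefixes and would need a separate check.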

General logic question: For the procedure codes, I noticed that only respiratory_ccc is using “in:” and all others are using “in”. Doing substring matching for one group and not for the rest is quite a bit different from the ICD9 code logic. Is that what is wanted?
