rsmatch's Introduction

Hi! I'm Sean Kent 👋

I love to write code at the intersection of statistics, machine learning, and new research. Here, I'll highlight a few of my favorite code projects.

SVM-based algorithms for Multiple Instance Learning

mildsvm: Weakly supervised, multiple instance data arises in many applications, such as drug discovery, object detection, and tumor prediction on whole slide images. The mildsvm package provides an easy way to learn from this data by training Support Vector Machine (SVM)-based classifiers. It also contains helpful functions for building and printing multiple instance data frames. mildsvm includes an implementation of MI-SMM from our research paper, Kent and Yu (2022), "Non-convex SVM for cancer diagnosis based on morphologic features of tumor microenvironment". The package can be installed via install.packages("mildsvm") in R.

Causal Matching for Longitudinal Data

rsmatch is an R package designed to perform risk set matching. Risk set matching is useful for causal inference in longitudinal studies where subjects are treated at varying time points. The main idea is that a treated subject can be matched with anyone who hasn't yet been treated, including those who never get treatment, but each subject can only be used in one pair. This leads to a mixed-integer programming problem, which we implement based on Li, Propert, and Rosenbaum (2001), "Balanced Risk Set Matching". This package can be installed via install_github("skent259/rsmatch").

Simulate the Game of Craps

I don't gamble often, but for me the most entertaining way to lose money in a casino is playing craps. As a side project, I developed a simulator in Python (skent259/CrapsSim) to test various betting strategies. With it, I analyzed the best craps strategies for players on a budget and published the results on my blog.


The rest of my repositories are a mixture of machine learning implementations, visualizations from other contexts, talks that I've given, and more. Feel free to check out those projects below!

rsmatch's People

Contributors

pauknemj, skent259

rsmatch's Issues

Release rsmatch 0.2.0

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Add preemptive link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • usethis::use_news_md()
  • Finish blog post
  • Tweet

Create a vignette

Would be good to include a short vignette that goes through an example of the risk set matching on a smallish, open-source data set (needs to be longitudinal). Some plots of balance (e.g. love plot, exact matching fraction) could be created to show that it works.

Error in if (a == b) { : the condition has length > 1

Dear Sir/Madam,

I am trying to apply your tool to my data. I have some issues.

'diff_admit_hours' is the time for admission to the hospital, not integer but numeric. 'startdrug_hours' is the time to take the drugs, some have the same values for the same ID. May I know if the 'trt_time' should be set as NA for those who are not treated? I have tried both ways and there are errors like below:

error 1: trt_time is not NA for different treatments.

drug2 <- drug %>% select(all_of(select_cols))
drug_test <- drug2[1:100, ] %>% select(all_of(select_cols))
pairs <- coxpsmatch(
  n_pairs = 5,
  data = drug_test,
  covariates = covars,
  id = "subject_id", time = "diff_admit_hours", trt_time = "startdrug_hours"
)
#> Error in if (a == b) { : the condition has length > 1

error 2: when I set trt_time as NA for those who are not treated.
Error in `$<-.data.frame`(`*tmp*`, "p", value = c(-1.23439407919853e-10, :
The replacement data has 80 rows, but the data has 492064 rows.

Any suggestions?

`brsmatch()` fails silently when there are too many pairs

library(rsmatch)
data(oasis)

pairs <- brsmatch(
  n_pairs = 14,
  oasis,
  id = "subject_id", time = "visit", trt_time = "time_of_ad",
  balance = FALSE
)
pairs
#>    subject_id pair_id type
#> 1   OAS2_0002      NA <NA>
#> 2   OAS2_0007      NA <NA>
#> 3   OAS2_0009      NA <NA>
#> 4   OAS2_0010      NA <NA>
#> 5   OAS2_0014      NA <NA>
#> 6   OAS2_0016      NA <NA>
#> 7   OAS2_0021      NA <NA>
#> 8   OAS2_0023      NA <NA>
#> 9   OAS2_0026      NA <NA>
#> 10  OAS2_0028      NA <NA>
#> 11  OAS2_0032      NA <NA>
#> 12  OAS2_0037      NA <NA>
#> 13  OAS2_0039      NA <NA>
#> 14  OAS2_0040      NA <NA>
#> 15  OAS2_0043      NA <NA>
#> 16  OAS2_0046      NA <NA>
#> 17  OAS2_0050      NA <NA>
#> 18  OAS2_0058      NA <NA>
#> 19  OAS2_0060      NA <NA>
#> 20  OAS2_0063      NA <NA>
#> 21  OAS2_0075      NA <NA>
#> 22  OAS2_0079      NA <NA>
#> 23  OAS2_0080      NA <NA>
#> 24  OAS2_0081      NA <NA>
#> 25  OAS2_0089      NA <NA>
#> 26  OAS2_0098      NA <NA>
#> 27  OAS2_0099      NA <NA>
#> 28  OAS2_0102      NA <NA>
#> 29  OAS2_0104      NA <NA>
#> 30  OAS2_0108      NA <NA>
#> 31  OAS2_0111      NA <NA>
#> 32  OAS2_0112      NA <NA>
#> 33  OAS2_0113      NA <NA>
#> 34  OAS2_0114      NA <NA>
#> 35  OAS2_0116      NA <NA>
#> 36  OAS2_0124      NA <NA>
#> 37  OAS2_0134      NA <NA>
#> 38  OAS2_0137      NA <NA>
#> 39  OAS2_0139      NA <NA>
#> 40  OAS2_0140      NA <NA>
#> 41  OAS2_0150      NA <NA>
#> 42  OAS2_0159      NA <NA>
#> 43  OAS2_0160      NA <NA>
#> 44  OAS2_0162      NA <NA>
#> 45  OAS2_0172      NA <NA>
#> 46  OAS2_0175      NA <NA>
#> 47  OAS2_0179      NA <NA>
#> 48  OAS2_0181      NA <NA>
#> 49  OAS2_0182      NA <NA>
#> 50  OAS2_0184      NA <NA>
#> 51  OAS2_0185      NA <NA>

Created on 2021-05-22 by the reprex package (v0.3.0)

Question about `coxpsmatch()` and exact matching

@pauknemj what would happen in your function when using exact matching with a limited number of pairs?

I see that this section of code grabs the top n_pairs:

if(length(matches$trt.id) > n_pairs) {
    matches <- matches[1:n_pairs,, drop = FALSE]
  }

But will the numbering grab a biased number from a given exact match group? For example, if you had two equal-sized exact match groups and asked for half of the maximum pairs, would it give you pretty much everyone from the first group?
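
The concern can be illustrated with a minimal sketch. It assumes a hypothetical `matches` table whose rows happen to be ordered by exact-match group; the `group` column and the random-sampling alternative are illustrative assumptions, not the package's internals or current behavior.

```r
# Hypothetical matches table (not the package's internal object):
# four candidate pairs from exact-match group "A", then four from group "B".
matches <- data.frame(
  trt.id = 1:8,
  group = rep(c("A", "B"), each = 4)
)
n_pairs <- 4

# Same truncation as in the snippet above: keep the first n_pairs rows.
truncated <- matches[1:n_pairs, , drop = FALSE]
print(table(truncated$group))

# One possible alternative (an assumption, not current package behavior):
# sample rows at random so row order does not decide which group is kept.
set.seed(1)
sampled <- matches[sample(nrow(matches), n_pairs), , drop = FALSE]
print(table(sampled$group))
```

If the rows really are ordered by group, the truncated table contains only group "A" pairs, which is the bias asked about here. Whether `coxpsmatch()` orders its matches this way is the open question.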

NA rows in `df` will cause uninformative error for `brsmatch()`

Simple test case. The problem is that the data frame built in .compute_distances() ends up with a different number of rows, since model.matrix() automatically drops rows containing NAs.

library(rsmatch)

df <- data.frame(
  id = rep(1:3, each = 3),
  time = rep(1:3, 3),
  trt_time = rep(c(2, 3, NA), each = 3),
  X1 = c(2, 2, 2, 3, 3, 3, 9, 9, 9),
  X2 = rep(c("a", "a", "b"), each = 3),
  X3 = c(9, 4, 5, 6, NA, NA, 3, 4, 8),
  X4 = c(8, 9, 4, 5, 6, 7, 2, 3, 4)
)

brsmatch(n_pairs = 1, data = df)
#> Error in data.frame(trt_id = i, all_id = df_at_trt[[id]], trt_time = trt_time_i, : arguments imply differing number of rows: 1, 3, 2

Created on 2024-02-03 with reprex v2.0.2
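
The row-dropping behavior behind this error can be seen in isolation with base R. This is a minimal sketch that assumes the default `na.action` option (`na.omit`) and reuses two covariate columns from the reprex above. The `complete.cases()` pre-check is a hypothetical way to surface the problem early, not something the package is known to do.

```r
# Two covariates from the reprex; X3 has NAs in rows 5 and 6.
df <- data.frame(
  X1 = c(2, 2, 2, 3, 3, 3, 9, 9, 9),
  X3 = c(9, 4, 5, 6, NA, NA, 3, 4, 8)
)

# With the default na.action (na.omit), rows containing NA are dropped,
# so the design matrix has fewer rows than the input data frame.
mm <- model.matrix(~ X1 + X3, data = df)
n_in <- nrow(df)
n_out <- nrow(mm)

# Hypothetical pre-check: report incomplete rows before matching.
incomplete <- which(!complete.cases(df))
if (length(incomplete) > 0) {
  message("Rows with missing covariates: ", paste(incomplete, collapse = ", "))
}
```

A check like this could back a clearer error message than the "differing number of rows" failure shown in the reprex.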

`brsmatch()` fails when there are no un-treated individuals

brsmatch() returns an error when the treatment time column contains no NA values. When one unit's treatment time is changed to NA, no error is returned.

library(rsmatch)
df1 <- data.frame(
  hhidpn = rep(1:5, each = 7),
  wave = rep(1:7, 5),
  treatment_time = rep(c(2,3,3,4,NA), each = 7),
  X1 = c(2,2,4,5,5,5,4,
         9,9,10,10,10,7,7,
         2,3,4,5,6,6,7,
         4,5,6,6,6,5,1,
         3,5,6,6,7,5,6),
  X2 = rep(c("a","a","b","c","d"), each = 7),
  X3 = c(9,4,5,6,7,2,3,
         4,8,5,7,8,5,8,
         7,4,5,6,7,7,8,
         4,5,6,7,8,9,7,
         5,6,7,5,6,5,5),
  X4 = c(8,9,4,5,6,7,2,
         3,4,6,4,2,5,7,
         3,3,4,6,2,4,5,
         3,5,6,3,4,3,3,
         3,2,3,3,5,6,3)
)
brsmatch(n_pairs = 2, df = df1, id = "hhidpn", time = "wave",
         trt_time = "treatment_time", optimizer = "glpk")
#>   hhidpn pair_id type
#> 1      1       1  trt
#> 2      2       1  all
#> 3      3      NA <NA>
#> 4      4       2  trt
#> 5      5       2  all

#df2 is the same dataframe as df1 except the treatment_time contains no NA
df2 <- data.frame(
  hhidpn = rep(1:5, each = 7),
  wave = rep(1:7, 5),
  treatment_time = rep(c(2,3,3,4,7), each = 7),
  X1 = c(2,2,4,5,5,5,4,
         9,9,10,10,10,7,7,
         2,3,4,5,6,6,7,
         4,5,6,6,6,5,1,
         3,5,6,6,7,5,6),
  X2 = rep(c("a","a","b","c","d"), each = 7),
  X3 = c(9,4,5,6,7,2,3,
         4,8,5,7,8,5,8,
         7,4,5,6,7,7,8,
         4,5,6,7,8,9,7,
         5,6,7,5,6,5,5),
  X4 = c(8,9,4,5,6,7,2,
         3,4,6,4,2,5,7,
         3,3,4,6,2,4,5,
         3,5,6,3,4,3,3,
         3,2,3,3,5,6,3)
)
brsmatch(n_pairs = 2, df = df2, id = "hhidpn", time = "wave",
         trt_time = "treatment_time", optimizer = "glpk")
#> Error in data.frame(trt_id = i, all_id = df_at_trt[[id]][valid_match], : arguments imply differing number of rows: 1, 0

Created on 2021-02-09 by the reprex package (v1.0.0)

brsmatch() fails when `id` refers to a character vector

It seems reasonable that id should be able to be a character vector. This same error will occur when the 'trt_time' refers to a character vector, and I think that a warning should be thrown there.

library(rsmatch)
  df <- data.frame(
    hhidpn = rep(1:3, each = 3),
    wave = rep(1:3, 3),
    treatment_time = rep(c(2,3,NA), each = 3),
    X1 = c(2,2,2,3,3,3,9,9,9),
    X2 = rep(c("a","a","b"), each = 3),
    X3 = c(9,4,5,6,7,2,3,4,8),
    X4 = c(8,9,4,5,6,7,2,3,4)
  )
  
  df$hhidpn <- as.character(df$hhidpn)
  
  pairs <- brsmatch(n_pairs = 1, df = df, id = "hhidpn", time = "wave", trt_time = "treatment_time",
                    optimizer = "glpk", options = "between period treatment")
#> Error in t(B_p) - t(B_e): non-numeric argument to binary operator

Created on 2021-02-09 by the reprex package (v0.3.0)
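
Until character ids are supported, one possible workaround is to map them to integer codes before matching and map them back afterwards. The sketch below uses only base R; `id_levels` and `id_num` are hypothetical names, and it is an assumption that passing the integer column as `id` avoids the error above.

```r
# Character ids, as in the reprex above.
df <- data.frame(
  hhidpn = as.character(rep(1:3, each = 3)),
  wave = rep(1:3, 3),
  stringsAsFactors = FALSE
)

# Build a lookup of unique ids and replace each id with its integer position.
id_levels <- unique(df$hhidpn)
df$id_num <- match(df$hhidpn, id_levels)

# After matching on id_num, the original ids can be recovered by indexing.
recovered <- id_levels[df$id_num]
```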

Difference and format between time & trt_time

Dear Sir or Madam,

I would like to know more about 'time' and 'trt_time' in the 'coxpsmatch' function. Take the code in the following figure as an example: in the oasis data, visit is used as 'time' and time_of_ad as 'trt_time'. 'time' is the same as the row number within each subject group, i.e., a subject who has n visits has 'time' ranging from 1 to n. What about 'trt_time'? Should it be an integer for the treatment? Is it larger than or equal to 'time'? Is there a relationship between 'time' and 'trt_time'?

image

NA row in brsmatch() result

df1 and df2 both contain NA in treatment time. The results returned from brsmatch() contain the matched pairs and a row of NA.

library(rsmatch)
library(reprex)
#> Warning: package 'reprex' was built under R version 4.0.3
df1 <- data.frame(
  hhidpn = rep(1:3, each = 3),
  wave = rep(1:3, 3),
  treatment_time = rep(c(2,3,NA), each = 3),
  X1 = c(2,2,2,3,3,3,9,9,9),
  X2 = rep(c("a","a","b"), each = 3),
  X3 = c(9,4,5,6,7,2,3,4,8),
  X4 = c(8,9,4,5,6,7,2,3,4)
)
brsmatch(n_pairs = 1, df = df1, id = "hhidpn", time = "wave",
         trt_time = "treatment_time", optimizer = "glpk")
#>   hhidpn pair_id type
#> 1      1       1  trt
#> 2      2       1  all
#> 3      3      NA <NA>

df2 <- data.frame(
  hhidpn = rep(1:5, each = 3),
  wave = rep(1:3, 5),
  treatment_time = rep(c(2,3,2,3,NA), each = 3),
  X1 = c(2,2,2,3,3,3,9,9,9,10,10,10,7,7,7),
  X2 = rep(c("a","a","b"), each = 5),
  X3 = c(9,4,5,6,7,2,3,4,8,5,7,8,5,8,7),
  X4 = c(8,9,4,5,6,7,2,3,4,6,4,2,5,7,3)
)
brsmatch(n_pairs = 2, df = df2, id = "hhidpn", time = "wave",
         trt_time = "treatment_time", optimizer = "glpk")
#>   hhidpn pair_id type
#> 1      1       1  trt
#> 2      2       1  all
#> 3      3      NA <NA>
#> 4      4       2  trt
#> 5      5       2  all

Created on 2021-02-09 by the reprex package (v1.0.0)

Release rsmatch 0.2.1

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
