
Comments (17)

corvidfox commented on August 15, 2024

@mbcann01 So I've been looking at fastLink's source code. If I'm understanding things correctly, it's effective for the project purpose because it enables multithreading/multiprocessing of the data in about as memory-optimized a fashion as R is capable of, using atomic vectors and matrices and "chunking" the data efficiently. It also frequently calls for garbage collection after each step, and keeps its variables contained to both minimize the memory necessary for each step and maximize the memory released after each step completes.

Addressing the issue posted to fastLink

I can't see any issue with your potential solution, aside from a possible memory concern, and I doubt there are any actual issues. The check seems to be built in for convenience, on the assumption that you only intend to use it to dedupe a single data set, rather than wanting the confusion matrix. For convenience, this is the code snippet you highlighted:

if (identical(dfA, dfB)) {
  cat("dfA and dfB are identical, assuming deduplication of a single data set.\nSetting return.all to FALSE.\n\n")
  dedupe.matches <- FALSE
  return.all <- FALSE
  dedupe.df <- TRUE
}

The "problem variable" that cuts the number of posterior probabilities when return.all = FALSE is threshold.match, which can be set manually when calling the function, with values from 0 to 1 and a default of 0.85. I don't see how threshold.match = 0.0 wouldn't return all values, since return.all = TRUE itself only sets threshold.match = 0.001.
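The effect of threshold.match is easy to illustrate outside of R. Here is a minimal Python sketch (hypothetical posterior values, not fastLink's internals): filtering candidate pairs by posterior ≥ threshold keeps everything at 0.0, nearly everything at the 0.001 that return.all = TRUE sets, and only near-certain pairs at the 0.85 default.

```python
def filter_matches(posteriors, threshold_match=0.85):
    """Keep only record pairs whose posterior match probability
    meets the threshold -- a sketch of what threshold.match does."""
    return [p for p in posteriors if p >= threshold_match]

# Hypothetical posterior probabilities for five candidate pairs.
posteriors = [0.999, 0.91, 0.40, 0.02, 0.0005]

print(len(filter_matches(posteriors, 0.85)))   # 2: near-certain pairs only
print(len(filter_matches(posteriors, 0.001)))  # 4: return.all=TRUE equivalent
print(len(filter_matches(posteriors, 0.0)))    # 5: keeps every pair
```

Note that even the 0.001 floor from return.all = TRUE can drop extremely low posteriors, which is exactly why threshold.match = 0.0 is the only setting guaranteed to return everything.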

The loss of return.all=TRUE has 2 main effects, which might be "snipped" from the code for our purpose:

  1. The loss of return.all=TRUE means we don't trigger class(out) <- c("fastLink", "confusionTable")
  2. The addition of dedupe.df=TRUE means we do trigger class(out) <- c(class(out), "fastLink.dedupe")

For convenience, here is the original fastlink_out code you posted in the issue:

fastlink_out <- fastLink::fastLink(
  dfA = df_unique_combo,
  dfB = df_unique_combo,
  varnames = c("nm_first", "nm_last", "birth_mnth", "birth_year", "add_num", "add_street"),
  stringdist.match = c("nm_first", "nm_last", "add_street"),
  numeric.match = c("birth_mnth", "birth_year", "add_num"),
  dedupe.matches = FALSE,
  return.all = TRUE
)

This would cause fastlink_out to inherit from classes ("fastLink", "fastLink.dedupe"), since dedupe.df = TRUE from the "identical catch" and threshold.match = 0.001 from the original return.all = TRUE that was overridden.

What about:

fastlink_out <- fastLink::fastLink(
  dfA = df_unique_combo,
  dfB = df_unique_combo,
  varnames = c("nm_first", "nm_last", "birth_mnth", "birth_year", "add_num", "add_street"),
  stringdist.match = c("nm_first", "nm_last", "add_street"),
  numeric.match = c("birth_mnth", "birth_year", "add_num"),
  dedupe.matches = FALSE,
  return.all = FALSE,
  threshold.match = 0.0
)

class(fastlink_out) <- c(class(fastlink_out), "confusionTable")

This might give us a similar result to if that catch did not exist, without altering the source code of fastLink.

My own issues trying to run fastLink from the example data

That being said, when I attempted to test this theory in R using the sample data you provided in the posted issue, I got an odd error that I don't think I have the theoretical understanding to troubleshoot.

For convenience, the sample data you'd posted in the issue:

library(tibble)
library(dplyr)

df <- tibble(
  incident   = c(1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008),
  nm_first   = c("john", "john", "jane", "jon", "jane", "joy", "michael", "amy"),
  nm_last    = c(rep("smith", 7), "jones"),
  sex        = c("m", "m", "f", "m", "f", "f", "m", "f"),
  birth_mnth = c(9, 9, 2, 9, 3, 8, 9, 1),
  birth_year = c(1936, 1936, 1937, 1936, 1937, 1941, 1936, 1947),
  add_num    = c(101, 101, 14, 101, 14, 101, 101, 1405),
  add_street = c("main", "main", "elm", "main", "elm", "main", "main", "texas")
) %>% 
  mutate(row = row_number()) %>% 
  select(row, everything()) %>% 
  print()

df_unique_combo <- df %>% 
  select(-row) %>% 
  mutate(group = paste(nm_first, nm_last, birth_year, birth_mnth, add_num, add_street, sep = "_")) %>%
  group_by(group) %>% 
  filter(row_number() == 1) %>% 
  ungroup()
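The group-and-keep-first pattern above is language-agnostic. As a sanity check on the logic, here is a stdlib Python sketch of the same idea (hypothetical records, not the project's data):

```python
def dedupe_keep_first(rows, key_fields):
    """Keep the first row for each unique combination of key fields,
    mirroring group_by(group) %>% filter(row_number() == 1)."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [
    {"nm_first": "john", "nm_last": "smith", "add_num": 101},
    {"nm_first": "john", "nm_last": "smith", "add_num": 101},  # exact duplicate
    {"nm_first": "jane", "nm_last": "smith", "add_num": 14},
]
print(len(dedupe_keep_first(rows, ["nm_first", "nm_last", "add_num"])))  # 2
```

In R itself, `dplyr::distinct(..., .keep_all = TRUE)` over the key columns would be an equivalent, more compact way to express the same step.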

When running the following code:

fl_test_out <- fastLink::fastLink(
  dfA = df_unique_combo, 
  dfB = df_unique_combo,
  varnames = c("nm_first", "nm_last", "birth_mnth", "birth_year", "add_num", "add_street"),
  stringdist.match = c("nm_first", "nm_last", "add_street"),
  numeric.match = c("birth_mnth", "birth_year", "add_num"),
  verbose = TRUE # for troubleshooting
)

I received this error:
[screenshot of the error message omitted]

In troubleshooting, I was able to find that the function gammaCK2par did not seem to recognize identical values of nm_first with the default cut.a = 0.94 (but did for a cut.a of 0.92 or less), while for nm_last it did not recognize identical values at all unless cut.a = 0, which seemed untenable. This may come down to me not fully understanding the theory of Jaro-Winkler, but I don't see why identical values weren't matched without reducing the cut.a value.
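Since the confusion centers on Jaro-Winkler scoring, here is a self-contained Python re-implementation (an independent sketch, not fastLink's C++ code) to sanity-check the intuition: identical strings always score exactly 1.0 and so should pass any cut.a below 1, while near-identical strings like "jon" vs. "john" score about 0.933, just under a 0.94 cutoff.

```python
def jaro(s1, s2):
    """Jaro similarity: average of match ratios and transposition penalty."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    match_dist = max(len1, len2) // 2 - 1
    s1_matches = [False] * len1
    s2_matches = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - match_dist), min(len2, i + match_dist + 1)
        for j in range(lo, hi):
            if not s2_matches[j] and s2[j] == c:
                s1_matches[i] = s2_matches[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if s1_matches[i]:
            while not s2_matches[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boost the Jaro score for a shared prefix (max 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a == b and prefix < 4:
            prefix += 1
        else:
            break
    return j + prefix * p * (1 - j)

print(jaro_winkler("smith", "smith"))            # 1.0
print(round(jaro_winkler("jon", "john"), 3))     # 0.933
```

So identical values failing at cut.a = 0.94 really does look like a bug (or a type-coercion problem upstream of the comparison) rather than expected Jaro-Winkler behavior.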

The big error that stopped everything came from gammaNUMCK2par, which seems to be attempting to access memory outside the matrix boundaries in how it tells foreach to repeatedly execute a function. The matrix in question should, in theory, organize the values of the variable being processed into columns representing each unique value in the variable. I'm honestly not sure how to fix that at this point.
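For intuition about what that step is computing (not how fastLink implements it), here is a conceptual Python sketch of numeric agreement over unique values: flag pairs whose absolute difference falls within the cutoff. fastLink chunks and parallelizes this over the unique-value matrix, which is presumably where the small-data indexing bug lives.

```python
def numeric_agreement_pairs(values_a, values_b, cut=1.0):
    """Conceptual sketch of a numeric comparison step: over the unique
    values of each input, flag pairs whose absolute difference is within
    the cutoff. (Only the underlying comparison, not fastLink's code.)"""
    uniq_a = sorted(set(values_a))
    uniq_b = sorted(set(values_b))
    return [(a, b) for a in uniq_a for b in uniq_b if abs(a - b) <= cut]

# Birth years from the sample data; (1936, 1936) agrees exactly and
# (1937, 1936) agrees within a cutoff of 1.
print(numeric_agreement_pairs([1936, 1937, 1941], [1936, 1947], cut=1))
```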

from detect_pilot_test_1y.

corvidfox commented on August 15, 2024

This week I was able to do the initial clean of the APS data set, including a preliminary codebook. We are pending feedback from APS regarding clarification on some observations. APS data included a unique subject ID, which did not appear to have any false matches.

Initial exploration of merging APS and MedStar data sets (through fastLink pairing) has started. It seems I'm finding some failed matches for APS Person ID as I explore the variable combinations for the merge. I'll continue to assess so I can decide if I should do a fastLink match within the APS data to make a unique subject ID similar to how I made one in the MedStar Data.

Files are in pull request #47


corvidfox commented on August 15, 2024

This week I made some progress in how I'm manually reviewing the matches between the MedStar & APS Data sets.

Should have feedback from APS this week, which should (hopefully) help resolve the remaining issues in the APS data set.


corvidfox commented on August 15, 2024

As of Pull Request #47, the MedStar/APS Merge Map and some revisions to Source Subject IDs in both data sets are complete. The unique ID linking both data sets has been added to the original data sets.

  • Probabilistically link rows from both data frames.
  • Manually review probabilistic matches.
  • Create a unique match number that can be used to join the MedStar data and the APS data.
  • Add unique match numbers back to the full APS and MedStar data frames.


corvidfox commented on August 15, 2024

As of Pull Request #49, the Intake-Response pairs have been identified. Initial merges have been created.

  • Linked APS Intakes to MedStar Responses based on matching Subject ID, 72 hour Response-Intake time frame
  • Created single aggregate data set that facilitates the review of each subject "timeline", cataloging all APS Intakes and all MedStar responses in the source data sets
  • Created a version of the MedStar data set that incorporated data from the matched APS intakes, facilitating review of each response and all paired data


corvidfox commented on August 15, 2024

Do we want to do any linkage of the MedStar EPCR and MedStar Compliance Data? There does seem to be a linking identifier in both data sets (Response Numbers) which could make reconciliation relatively less complicated than the overall APS/MedStar merge. For that reason, it's something I personally think could happen either before or after any big APS/MedStar merge.
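Since the two data sets share a linking identifier, the reconciliation is essentially a keyed join rather than probabilistic matching. A stdlib Python sketch of the idea (the field name `response_num` is hypothetical, standing in for the shared Response Number):

```python
def link_by_response_number(epcr_rows, compliance_rows):
    """Sketch of linking two data sets on a shared identifier by
    indexing one side and attaching its rows to the other."""
    compliance_by_id = {}
    for row in compliance_rows:
        compliance_by_id.setdefault(row["response_num"], []).append(row)
    return [
        {**e, "compliance": compliance_by_id.get(e["response_num"], [])}
        for e in epcr_rows
    ]

epcr = [{"response_num": "R1"}, {"response_num": "R2"}]
compliance = [{"response_num": "R1", "note": "reviewed"}]
linked = link_by_response_number(epcr, compliance)
print(len(linked[0]["compliance"]), len(linked[1]["compliance"]))  # 1 0
```

In R, the equivalent would be a `dplyr::left_join()` on the Response Number column.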


corvidfox commented on August 15, 2024

Once the data is uploaded and in an accessible format, I'd like to do some exploratory coding to check to see if I can help with the memory and time management of things.

From what I could see of the code, it looked like it was written by someone who was R-native. DataFrames/Tibbles aren't a part of base Python, unlike R, and there's some finesse to memory management and optimization. However, Python does have a lot of analytical strengths that have made it a leader in Machine Learning (though Julia is becoming a bigger competitor). There may be something I can do with one of the machine learning packages.

Dedupe does have some point-and-click, but it produces a JSON file with the results of the training that can be used to recreate the same results, so it has reproducibility built in 🎉. The idea is that you might train on a smaller data set, then apply the results of that training to a larger one. It's based on human judgement, which is both a strength and a limitation: the labeling would need to be standardized, and the threshold for the number of "matches" and "rejections" required for a successful training is a methodological consideration. I can look into this more and let you know exactly how it works, if it looks like the best option. However, it DOES NOT automatically condense anything; it just adds additional adjacent rows and identifiers. So it was growing almost exponentially when the pandas DataFrame wasn't already memory-optimized, and you would still have to manually sort through the potential duplicates.

There are a LOT of different Python options. Once I can actually see the size of the various elements in the data sets, and see if any cleaning/recoding helps optimize memory, I could give a better idea of what approaches are potentially viable or not.


mbcann01 commented on August 15, 2024

@corvidfox , I guess it doesn't hurt to go ahead and join the compliance data to the EPCR data. I'm not sure if we will end up using it, but it doesn't sound like a heavy lift.


mbcann01 commented on August 15, 2024

@corvidfox Thank you for all of the info on Python Dedupe! I look forward to seeing what you figure out!


corvidfox commented on August 15, 2024

@mbcann01 I managed to get a lot of my "roadblocks" fixed. Part of my issue was some sort of .dll permission issue with rlang of all things - but like I said, I got it fixed.

There are two major issues I've continued to experience with small datasets:

  1. fastLink simultaneously telling me I had no variation in the entries for a variable, and also no identical entries for a variable (since those are mutually exclusive, that's definitely an error)
  2. gammaNUMCK2par attempting to index a matrix outside of its range.

This seems to be isolated only to small datasets - which is why it flagged for your sample data of 7 rows, but not for their sample data of 510 rows. My guess is that it's an edge-case situation.

BUT! That's a non-issue with a large data set, which ours are (of course, hence our issues).

I had an idea that I thought was a bit dumb, but it seems to work: when passing two identical dataframes, you can simply append a "junk row" of missing values to make the "second" dataframe "non-identical." This completely circumvents the "identical dataframe" check that tries to "help."

Using the sample data that fastLink posted (with data types modified by me):

library(fastLink)
library(dplyr)

data(samplematch)
dfA <- rbind(dfA, dfA[sample(1:nrow(dfA), 10, replace = FALSE), ])
dfA <- dfA %>% mutate(across(where(is.factor), as.character))
dfA$housenum <- as.numeric(dfA$housenum)

I was able to run:

fl_out <- fastLink(
  dfA = dfA,
  dfB = rbind(dfA, NA),
  varnames = c("firstname", "lastname", "housenum",
               "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "lastname", "streetname", "city"),
  numeric.match = c("birthyear", "housenum"),
  dedupe.matches = FALSE,
  return.all = TRUE,
  threshold.match = 0
  # verbose = TRUE # for troubleshooting and times
)

And execute your beautifully written alternative to getMatches to see all posterior probabilities:

fmr_fastlink_stack_matches(fl_out, dfA)

I repeated this on the ePCR data from MedStar, as well as APS, and then "linked them together" to test the time and memory.

In the "roughest" of these tests, my machine was able to do this in less than 3 minutes, and only took about 13GB of RAM and 20% CPU utilization. My machine isn't particularly fancy. I have an Intel Core i7-11370H (quad core with two-way hyper-threading, 3.3GHz), with 32 GB RAM. So, that gives me hope that fastLink can be a viable solution.

My next steps would be to get the individual data sets cleaned, de-duplicated as much as possible, and organized in preparation for a merge. 🎉


corvidfox commented on August 15, 2024

This week I've made large progress towards one of the major goals of this issue:

  • Create a unique person identifier in the MedStar data.

I have:

  • Cleaned the MedStar ePCR data
  • Benchmarked fastLink on the MedStar ePCR data to explore constraints, and found that the number of variables used for matching is a significant limiting factor
  • Scrubbed as much PHI as possible from the MedStar ePCR cleaning document (still some roadblocks before we can release that publicly to ensure privacy compliance)
  • Made a first attempt at generating a unique Subject ID for the MedStar ePCR data

Roadblocks I plan to focus on next:

  • Removing PHI from MedStar ePCR cleaning so we can put the document on GitHub for reproducibility and transparency
  • Troubleshooting and polishing the Unique Subject ID creation. My base first attempt has over 10 people who are clearly not the same person linked together, so it seems its threshold is too low


corvidfox commented on August 15, 2024

This week I've made large progress towards one of the major goals of this issue:

  • Create a unique person identifier in the MedStar data.

I have:

  • Further cleaned the MedStar ePCR data, and removed all PHI from the cleaning file
  • Iterated through possible variable combinations for fastLink matching to develop a useful match product, with a range of posterior probabilities that would need to be manually reviewed.

Roadblocks I plan to focus on next:

  • Finalizing a Unique Subject ID in this data set


corvidfox commented on August 15, 2024

This week I was able to:

  • Create a unique person identifier in the MedStar data

I'll look over it again with fresher eyes next week to ensure it really is done, and fix some format issues.

So far it doesn't look like it's a good idea to consolidate the data down to a single row, as some groups appear to be clearly the same person who has either been listed at more than one address, or goes by at least one other name. That would likely result in a large number of mismatches between APS and MedStar data. That is a consideration for later in the process.


corvidfox commented on August 15, 2024

Accidentally closed due to attaching issue to pull request for partial completion. Reopened due to ongoing task.


corvidfox commented on August 15, 2024

This week I was able to finalize the Unique IDs in the MedStar data, and linked the observations that appeared in both the ePCR and Compliance data. Since Compliance does not have many identifiers, only Response Number produced any credible connections.

I also got a preliminary codebook for the MedStar data. I'll be able to polish that up more next week, and then that's 1 of 2 data sets that could be individually used for some sort of analysis.


corvidfox commented on August 15, 2024

As of pull request #50, there are 3 merges created. Codebooks for all merges have also been created.


mbcann01 commented on August 15, 2024

Hi @corvidfox ,
I haven't had a chance to look at the actual code books yet. I'm viewing this on my phone, but it sounds great! Thank you!

