Ran into this today, where we want to see if anyone w/ the same ID had specified diffe

create get_not_dupes() function about janitor HOT 5 OPEN

sfirke commented on June 6, 2024

create get_not_dupes() function

from janitor.

Comments (5)

sfirke commented on June 6, 2024

To point out the obvious, this is different from dplyr's distinct(), which would return one record from each duplicated set; I want none of the records from a duplicated set.
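A minimal base-R sketch of that difference, using made-up data (the `records` data frame and its columns are assumptions for illustration, and `!duplicated()` stands in here for what distinct() does on a single key):

```r
# Toy data: id 1 appears twice, ids 2 and 3 appear once.
records <- data.frame(id = c(1, 1, 2, 3),
                      answer = c("yes", "no", "yes", "no"))

# De-duplication keeps the first row of each duplicated set.
first_of_each <- records[!duplicated(records$id), ]

# Flagging from both directions marks every member of a duplicated set,
# so negating the flag drops the whole set.
in_dup_set <- duplicated(records$id) | duplicated(records$id, fromLast = TRUE)
never_duplicated <- records[!in_dup_set, ]
```

Here `first_of_each` still contains a row for id 1, while `never_duplicated` contains only ids 2 and 3.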

sfirke commented on June 6, 2024

A month later, this does not seem useful enough to make a permanent function for.

sfirke commented on June 6, 2024

Now I think I have a need for this again. I have two ID columns and want to understand if they are 1-to-1. Does ID A ever appear without the same ID B? It would be hierarchical, as an ID shouldn't appear with multiple values of say location but of course the same location will have multiple values of ID.

Maybe I need a function check_one_to_one that takes multiple variables and checks whether there is any violation of 1:1.

Check first to see if someone else has coded this?
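One way such a check could look, as a rough base-R sketch rather than anything that exists in janitor. Both function names and the string-based column arguments are assumptions; `get_key_violations()` covers the hierarchical (many-to-one) case, and `check_one_to_one()` applies it in both directions:

```r
# Hypothetical helper: rows of `df` whose `key` value appears with
# more than one distinct `value`.
get_key_violations <- function(df, key, value) {
  pairs <- unique(df[, c(key, value)])            # distinct key/value pairs
  bad_keys <- unique(pairs[[key]][duplicated(pairs[[key]])])
  df[df[[key]] %in% bad_keys, , drop = FALSE]
}

# Hypothetical check: TRUE only if neither column ever maps to
# more than one value of the other.
check_one_to_one <- function(df, a, b) {
  nrow(get_key_violations(df, a, b)) == 0 &&
    nrow(get_key_violations(df, b, a)) == 0
}

visits <- data.frame(id = c(1, 1, 2, 3),
                     location = c("A", "A", "A", "B"))
id_problems <- get_key_violations(visits, "id", "location")
location_problems <- get_key_violations(visits, "location", "id")
```

In this toy data every id has a single location, so `id_problems` is empty, but location "A" covers two ids, so the columns are hierarchical rather than 1:1.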

rgknight commented on June 6, 2024

Great work on janitor! Awesome to see Ed types building real tools.

This is actually two different issues.

Question 1: Are there duplicates?

Your introduction to get_dupes states:

This is for hunting down and examining duplicate records during data cleaning

Hunting down is different from examining. Hunting should be a separate (and faster, re #67) function.

I'd recommend an is_id or has_dupes function instead of check_one_to_one. It's the same idea: are these combinations unique?

The workflow is: Are there duplicates (has_dupes)? If yes, what do I do about them (get_dupes)?

Stata has an isid implementation that I used for this purpose, back in my Stata days. Helpfile here.

You'd be looking for a more pipe-able, NSE version of

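is_id <- function(x){
  # Count the rows (or elements) that repeat an earlier one
  numdups <- sum(duplicated(x))
  if (numdups > 0){
    # Report the count along with the name of the object that was passed in
    stop(sprintf("There are %i duplicates in %s", numdups, deparse(substitute(x))))
  }
}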
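A rough sketch of what a more pipe-friendly pair might look like. This is not janitor's API: `has_dupes()` and `assert_id()` are hypothetical names, and the columns are passed as strings (standard evaluation) rather than through NSE, to keep the example in base R:

```r
# Hypothetical helper: TRUE if any combination of `cols` repeats in `df`.
has_dupes <- function(df, cols = names(df)) {
  any(duplicated(df[, cols, drop = FALSE]))
}

# Hypothetical assertion: error on duplicates, otherwise return the data
# invisibly so the call can sit in the middle of a pipeline.
assert_id <- function(df, cols = names(df)) {
  if (has_dupes(df, cols)) {
    stop("Columns ", paste(cols, collapse = ", "),
         " do not uniquely identify rows")
  }
  invisible(df)
}

roster <- data.frame(id = c(1, 2, 2), site = c("A", "A", "B"))
id_repeats <- has_dupes(roster, "id")               # id alone repeats
pair_repeats <- has_dupes(roster, c("id", "site"))  # the id/site pair does not
```

Keeping the cheap yes/no check separate from the function that returns the duplicated rows matches the workflow described above: ask whether there are duplicates first, and only then pull them out for examination.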

Question 2: Can I get the elements of a data frame that are never duplicated?

In the handling duplicates workflow, I will sometimes separate the elements that are ever duplicated from the elements that are never duplicated, use a bunch of business logic to manipulate the ever duplicated elements, then recombine them. I think a get_nondups function could be worthwhile.

Here's what I use:

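library(dplyr)

# Split df into rows whose selected columns are unique vs. duplicated
sep_dups <- function(df, ...){
  # select_(.dots = ...) does not run and select_() is deprecated; select() takes ... directly
  target <- df %>% select(...)
  # Flag every member of a duplicated set, not just the later copies
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)

  list(unique = df[!dup_index, ],
       duplicates = df[dup_index, ])
}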
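To illustrate the separate, fix, recombine workflow end to end, here is a small base-R sketch. `split_dups()` is a hypothetical string-based stand-in for the splitter above, and the "keep the largest amount" rule and the `orders` data are placeholders for whatever business logic actually applies:

```r
# Hypothetical base-R splitter: same idea as above, columns given as strings.
split_dups <- function(df, cols) {
  target <- df[, cols, drop = FALSE]
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  list(unique = df[!dup_index, , drop = FALSE],
       duplicates = df[dup_index, , drop = FALSE])
}

orders <- data.frame(id = c(1, 1, 2, 3), amount = c(10, 25, 5, 8))
parts <- split_dups(orders, "id")

# Placeholder business rule (an assumption): keep the largest amount per id.
dups <- parts$duplicates
dups <- dups[order(dups$id, -dups$amount), ]
resolved <- dups[!duplicated(dups$id), ]

# Recombine the untouched rows with the resolved ones.
cleaned <- rbind(parts$unique, resolved)
cleaned <- cleaned[order(cleaned$id), ]
```

After this, `cleaned` has one row per id, and the rows that were never duplicated pass through untouched.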

jzadra commented on June 6, 2024

Just a random thought: a sankey diagram would visually indicate what we are discussing here - maybe some of the code that goes into organizing that data from a plotting package could be used as reference?
