Comments (5)

sfirke commented on June 6, 2024

To point out the obvious, this is different from dplyr's distinct(), which returns one record from each duplicated set; I want none of those records.
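As a rough illustration of the difference, here is a base R sketch on made-up data. The data frame and variable names are hypothetical, and the first subset only approximates what distinct() would keep.

```r
# Toy data (hypothetical): id 2 appears twice.
df <- data.frame(id = c(1, 2, 2, 3), value = c("a", "b", "c", "d"))

# Roughly what distinct() on id gives: one row per id (the first one seen).
first_per_id <- df[!duplicated(df$id), ]

# What is wanted here: drop every row whose id occurs more than once.
in_dup_set <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)
never_duplicated <- df[!in_dup_set, ]
```

Scanning with fromLast = TRUE as well is what catches the first occurrence of each duplicated id, which a single duplicated() call leaves unflagged.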

from janitor.

sfirke commented on June 6, 2024

A month later, this does not seem useful enough to make a permanent function for.


sfirke commented on June 6, 2024

Now I think I have a need for this again. I have two ID columns and want to understand whether they are 1-to-1: does ID A ever appear without the same ID B? The check would also need to cover hierarchical cases: an ID shouldn't appear with multiple values of, say, location, but of course the same location will have multiple values of ID.

Maybe I need a function check_one_to_one that takes multiple variables and checks whether there is any violation of 1:1.

Check first to see if someone else has coded this?
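A minimal base R sketch of what such a check might look like. The signature (column names passed as strings), the second helper maps_to_one, and the toy data are all assumptions for illustration, not an existing janitor API.

```r
# Hypothetical helper: TRUE when columns `a` and `b` of `df` are strictly 1:1.
check_one_to_one <- function(df, a, b) {
  pairs <- unique(df[, c(a, b)])  # distinct (a, b) combinations
  # 1:1 holds when no value of a or of b appears in more than one pair.
  !anyDuplicated(pairs[[a]]) && !anyDuplicated(pairs[[b]])
}

# Hypothetical helper for the hierarchical case: each child has exactly one
# parent, while a parent may have many children.
maps_to_one <- function(df, child, parent) {
  pairs <- unique(df[, c(child, parent)])
  !anyDuplicated(pairs[[child]])
}

# Toy data (hypothetical).
ids <- data.frame(
  id_a     = c(1, 1, 2, 3),
  id_b     = c("x", "x", "y", "z"),
  location = c("north", "north", "north", "south")
)

ab_is_one_to_one  <- check_one_to_one(ids, "id_a", "id_b")
loc_is_one_to_one <- check_one_to_one(ids, "id_a", "location")
id_has_single_loc <- maps_to_one(ids, "id_a", "location")
```

Reducing to distinct pairs first means repeated rows with the same combination do not count as violations; only a value paired with two different partners does.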


rgknight commented on June 6, 2024

Great work on janitor! Awesome to see Ed types building real tools.

These are actually two different issues:

Question 1: Are there duplicates?

Your introduction to get_dups states:

"This is for hunting down and examining duplicate records during data cleaning"

Hunting down is different from examining. Hunting should be a different (and faster, re #67) function.

I'd recommend an is_id or has_dupes function instead of check_one_to_one. It's the same idea: are these combinations unique?

The workflow is: Are there duplicates (has_dups)? If yes, what do I do about them (get_dups)?

Stata has an isid implementation that I used for this purpose, back in my Stata days. Helpfile here.

You'd be looking for a more pipe-able, NSE version of

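is_id <- function(x){
  # Count rows (or elements) that repeat an earlier one
  numdups <- sum(duplicated(x))
  if (numdups > 0){
    # Report the count along with the name of the object that was passed in
    stop(sprintf("There are %i duplicates in %s", numdups, deparse(substitute(x))))
  }
}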
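One way the pipe-able part could be sketched in base R. The name assert_is_id and the cols argument are hypothetical, and this version takes column names as strings rather than using NSE.

```r
# Hypothetical pipe-friendly variant: stops on duplicates, otherwise returns
# the data frame unchanged so it can sit in the middle of a pipeline.
assert_is_id <- function(df, cols = names(df)) {
  numdups <- sum(duplicated(df[, cols, drop = FALSE]))
  if (numdups > 0) {
    stop(sprintf("There are %i duplicates in columns: %s",
                 numdups, paste(cols, collapse = ", ")))
  }
  invisible(df)
}

# Toy data (hypothetical): id is unique, grp is not.
d <- data.frame(id = c(1, 2, 3), grp = c("a", "a", "b"))

passed <- assert_is_id(d, "id")                          # returns d
failed <- try(assert_is_id(d, "grp"), silent = TRUE)     # raises an error
```

Returning the input invisibly is the design choice that makes the check chainable: a passing check is a no-op and a failing one halts the pipeline.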

Question 2: Can I get the elements of a data frame that are never duplicated?

In the duplicate-handling workflow, I will sometimes separate the elements that are ever duplicated from the elements that are never duplicated, apply a bunch of business logic to the ever-duplicated elements, then recombine the two sets. I think a get_nondups function could be worthwhile.

Here's what I use:

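sep_dups <- function(df, ...){
  # Columns that define a duplicate; select() replaces the deprecated select_()
  target <- dplyr::select(df, ...)
  # Flag every row in a duplicated set, including the first occurrence
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)

  list(unique = df[!dup_index, ],
       duplicates = df[dup_index, ])
}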
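The separate, fix, recombine workflow described above might look roughly like this. It uses a base R stand-in (sep_dups_base, a hypothetical name that takes column names as strings) so it runs without dplyr, and the "keep the largest amount" rule is only a placeholder for real business logic.

```r
# Base R stand-in for the splitter above (hypothetical name and signature).
sep_dups_base <- function(df, cols) {
  target <- df[, cols, drop = FALSE]
  dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
  list(unique = df[!dup_index, ], duplicates = df[dup_index, ])
}

# Toy data (hypothetical): id 2 appears twice with different amounts.
orders <- data.frame(id = c(1, 2, 2, 3), amount = c(10, 20, 25, 30))
parts <- sep_dups_base(orders, "id")

# Placeholder business logic: keep the largest amount per duplicated id.
dups <- parts$duplicates
resolved <- dups[order(dups$id, -dups$amount), ]
resolved <- resolved[!duplicated(resolved$id), ]

# Recombine the untouched rows with the resolved ones.
cleaned <- rbind(parts$unique, resolved)
cleaned <- cleaned[order(cleaned$id), ]
```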


jzadra commented on June 6, 2024

Just a random thought: a Sankey diagram would visually show the relationships we are discussing here. Maybe some of the code a plotting package uses to organize data for such a diagram could serve as a reference?
