Comments (5)
To point out the obvious, this is different from `distinct()` from dplyr, as that would return one record of a duplicated set; I want no records.
from janitor.
A month later, this does not seem useful enough to make a permanent function for.
from janitor.
Now I think I have a need for this again. I have two ID columns and want to understand if they are 1-to-1. Does ID A ever appear without the same ID B? It would be hierarchical, as an ID shouldn't appear with multiple values of, say, `location`, but of course the same `location` will have multiple values of `ID`.
Maybe I need a function `check_one_to_one` that takes multiple variables and checks whether there is any violation of 1:1.
Check first to see if someone else has coded this?
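A minimal base-R sketch of what such a check could look like. The name `check_one_to_one` comes from the comment above, but the signature (a data frame plus two column names as strings) and the example data are assumptions for illustration, not an existing janitor function.

```r
# Hypothetical helper: TRUE when each value of column `a` pairs with exactly
# one value of column `b`, and each value of `b` pairs with exactly one `a`.
check_one_to_one <- function(df, a, b) {
  # Reduce to the distinct (a, b) combinations that actually occur.
  pairs <- unique(df[, c(a, b), drop = FALSE])
  # In a 1:1 mapping, neither column repeats among the distinct pairs.
  !anyDuplicated(pairs[[a]]) && !anyDuplicated(pairs[[b]])
}

# Illustrative data only.
ok  <- data.frame(id_a = c(1, 1, 2, 3), id_b = c("x", "x", "y", "z"))
bad <- data.frame(id_a = c(1, 1, 2),    id_b = c("x", "y", "z"))

check_one_to_one(ok, "id_a", "id_b")
check_one_to_one(bad, "id_a", "id_b")
```

For the hierarchical case (an ID with one `location`, a `location` with many IDs), checking only `!anyDuplicated(pairs[[a]])` would be enough.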
from janitor.
Great work on janitor! Awesome to see Ed types building real tools.
These are actually two different issues.
Question 1: Are there duplicates?
Your introduction to `get_dups` states: "This is for hunting down and examining duplicate records during data cleaning."
Hunting down is different from examining. Hunting should be a different (and faster, re #67) function.
I'd recommend an `is_id` or `has_dupes` function instead of `check_one_to_one`. It's the same idea: are these combinations unique?
The workflow is: Are there duplicates (`has_dups`)? If yes, what do I do about them (`get_dups`)?
Stata has an `isid` implementation that I used for this purpose, back in my Stata days. Helpfile here.
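You'd be looking for a more pipe-able, NSE version of:

    is_id <- function(x){
      # Count rows (or elements) that repeat an earlier one
      numdups <- sum(duplicated(x))
      if (numdups > 0){
        # Report the count and the name of the object that was passed in
        stop(sprintf("There are %i duplicates in %s", numdups, deparse(substitute(x))))
      }
    }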
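One way the pipe-able, NSE version might look, sketched in base R only. The signature (bare column names in `...`, input returned invisibly) and the example data frame `visits` are assumptions for illustration; nothing here is an existing janitor API.

```r
# Hypothetical pipe-able variant of is_id(); name and signature are assumptions.
is_id <- function(df, ...) {
  # Capture bare column names without evaluating them (base-R NSE).
  cols <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  # With no columns given, treat the whole row as the key.
  target <- if (length(cols) > 0) df[, cols, drop = FALSE] else df
  numdups <- sum(duplicated(target))
  if (numdups > 0) {
    stop(sprintf("There are %i duplicates in the key (%s)", numdups,
                 paste(names(target), collapse = ", ")))
  }
  # Return the input invisibly so the check can sit in the middle of a pipe.
  invisible(df)
}

# Illustrative data only.
visits <- data.frame(id = c(1, 2, 2), site = c("a", "b", "c"))
is_id(visits, id, site)   # id + site is unique, so this passes silently
```

Calling `is_id(visits, id)` on the same data would stop with an error, because `id` alone repeats.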
Question 2: Can I get the elements of a data frame that are never duplicated?
In the handling-duplicates workflow, I will sometimes separate the elements that are ever duplicated from the elements that are never duplicated, use a bunch of business logic to manipulate the ever-duplicated elements, then recombine them. I think a `get_nondups` function could be worthwhile.
Here's what I use:
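    sep_dups <- function(df, ...){
      # Columns that define a duplicate, passed unquoted (select_() is deprecated)
      target <- dplyr::select(df, ...)
      # Flag every member of a duplicated set, not just the repeats
      dup_index <- duplicated(target) | duplicated(target, fromLast = TRUE)
      list(unique = df[!dup_index, ],
           duplicates = df[dup_index, ])
    }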
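For the narrower `get_nondups` idea, a base-R sketch might look like the following. The function name comes from the comment above; the `cols` argument (a character vector of key columns) and the `orders` data are assumptions for illustration.

```r
# Hypothetical get_nondups(): keep only rows whose key never repeats.
get_nondups <- function(df, cols = names(df)) {
  key <- df[, cols, drop = FALSE]
  # Scanning from both ends flags every member of a duplicated set.
  dup_index <- duplicated(key) | duplicated(key, fromLast = TRUE)
  df[!dup_index, , drop = FALSE]
}

# Illustrative data only.
orders <- data.frame(id = c(1, 2, 2, 3), amount = c(10, 20, 25, 30))
get_nondups(orders, "id")   # keeps the rows with id 1 and 3
```

Unlike `unique()` or `distinct()`, which keep one copy of each duplicated set, this drops the whole set.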
from janitor.
Just a random thought: a sankey diagram would visually indicate what we are discussing here - maybe some of the code that goes into organizing that data from a plotting package could be used as reference?
from janitor.