tidygenomics's Introduction

tidygenomics

Tidy Verbs for Dealing with Genomic Data Frames

Description

Handle genomic data within data frames just as you would with GRanges. This package provides methods to deal with genomic intervals the "tidy way", which makes it simpler to integrate them into the general data munging process. The API is inspired by the popular bedtools and the genome_join() method from the fuzzyjoin package.

Installation

install.packages("tidygenomics")

Or, to get the latest development version:

devtools::install_github("const-ae/tidygenomics")

Documentation

genome_intersect

Joins 2 data frames based on their genomic overlap. Unlike the genome_join function it updates the boundaries to reflect the overlap of the regions.

x1 <- data.frame(id = 1:4, 
                chromosome = c("chr1", "chr1", "chr2", "chr2"),
                start = c(100, 200, 300, 400),
                end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr2", "chr2", "chr1"),
                 start = c(140, 210, 400, 300),
                 end = c(160, 240, 415, 320))

genome_intersect(x1, x2, by=c("chromosome", "start", "end"), mode="both")
id.x chromosome id.y start end
1 chr1 1 140 150
4 chr2 3 400 415
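
The boundary update in the output above can be reproduced by hand. The following is a minimal base R sketch, not the package's implementation: `intersect_interval` is a hypothetical helper, and it assumes closed intervals on the same chromosome.

```r
# Hypothetical helper (not part of tidygenomics): overlap of two closed
# intervals that are assumed to lie on the same chromosome.
intersect_interval <- function(start_x, end_x, start_y, end_y) {
  start <- max(start_x, start_y)  # the overlap begins at the later start
  end <- min(end_x, end_y)        # and stops at the earlier end
  if (start > end) {
    return(NULL)                  # the intervals do not overlap
  }
  c(start = start, end = end)
}

# x1 id 1 (chr1, 100-150) against x2 id 1 (chr1, 140-160)
overlap_chr1 <- intersect_interval(100, 150, 140, 160)
# x1 id 4 (chr2, 400-450) against x2 id 3 (chr2, 400-415)
overlap_chr2 <- intersect_interval(400, 450, 400, 415)
# x1 id 2 (chr1, 200-250) against x2 id 1 (chr1, 140-160): no overlap
no_overlap <- intersect_interval(200, 250, 140, 160)

print(overlap_chr1)
print(overlap_chr2)
```

The two non-NULL results correspond to the two rows returned by genome_intersect in the example.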

genome_subtract

Subtracts one data frame from the other. This can be used to split the x data frame into smaller areas.

x1 <- data.frame(id = 1:4,
                chromosome = c("chr1", "chr1", "chr2", "chr1"),
                start = c(100, 200, 300, 400),
                end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                chromosome = c("chr1", "chr2", "chr1", "chr1"),
                start = c(120, 210, 300, 400),
                end = c(125, 240, 320, 415))

genome_subtract(x1, x2, by=c("chromosome", "start", "end"))
id chromosome start end
1 chr1 100 119
1 chr1 126 150
2 chr1 200 250
3 chr2 300 350
4 chr1 416 450
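
To see where the split boundaries in the output come from, here is a small base R sketch for a single pair of intervals. `subtract_interval` is a hypothetical helper, and the closed-interval arithmetic (end at start - 1, resume at end + 1) is inferred from the example output rather than taken from the package source.

```r
# Hypothetical helper (not part of tidygenomics): remove closed interval y
# from closed interval x, both assumed to be on the same chromosome.
subtract_interval <- function(start_x, end_x, start_y, end_y) {
  # Without an overlap, x is returned unchanged
  if (start_y > end_x || end_y < start_x) {
    return(data.frame(start = start_x, end = end_x))
  }
  starts <- numeric(0)
  ends <- numeric(0)
  # Piece of x to the left of y
  if (start_y > start_x) {
    starts <- c(starts, start_x)
    ends <- c(ends, start_y - 1)
  }
  # Piece of x to the right of y
  if (end_y < end_x) {
    starts <- c(starts, end_y + 1)
    ends <- c(ends, end_x)
  }
  data.frame(start = starts, end = ends)
}

# x1 id 1 (chr1, 100-150) minus x2 id 1 (chr1, 120-125): split in two
split_middle <- subtract_interval(100, 150, 120, 125)
# x1 id 4 (chr1, 400-450) minus x2 id 4 (chr1, 400-415): trimmed on the left
trim_left <- subtract_interval(400, 450, 400, 415)

print(split_middle)
print(trim_left)
```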

genome_join_closest

Joins 2 data frames based on their genomic location. If no exact overlap is found, the next closest interval is used.

x1 <- data_frame(id = 1:4, 
                 chr = c("chr1", "chr1", "chr2", "chr3"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

x2 <- data_frame(id = 1:4,
                 chr = c("chr1", "chr1", "chr1", "chr2"),
                 start = c(220, 210, 300, 400),
                 end = c(225, 240, 320, 415))
genome_join_closest(x1, x2, by=c("chr", "start", "end"), distance_column_name="distance", mode="left")
id.x chr.x start.x end.x id.y chr.y start.y end.y distance
1 chr1 100 150 2 chr1 210 240 59
2 chr1 200 250 1 chr1 220 225 0
2 chr1 200 250 2 chr1 210 240 0
3 chr2 300 350 4 chr2 400 415 49
4 chr3 400 450 NA NA NA NA NA
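
The values in the distance column are consistent with counting the bases that lie strictly between two intervals. The sketch below reproduces them in base R; `interval_distance` is a hypothetical helper, and the formula is an assumption inferred from the example output, not from the package source.

```r
# Hypothetical helper (not part of tidygenomics): number of positions lying
# strictly between two closed intervals, or 0 if they overlap or touch.
interval_distance <- function(start_x, end_x, start_y, end_y) {
  gap <- max(start_x, start_y) - min(end_x, end_y) - 1
  max(0, gap)
}

# x1 id 1 (100-150) and x2 id 2 (210-240)
distance_row1 <- interval_distance(100, 150, 210, 240)
# x1 id 2 (200-250) and x2 id 1 (220-225): overlapping
distance_row2 <- interval_distance(200, 250, 220, 225)
# x1 id 3 (300-350) and x2 id 4 (400-415)
distance_row3 <- interval_distance(300, 350, 400, 415)

print(c(distance_row1, distance_row2, distance_row3))
```

These three calls give 59, 0 and 49, matching the distance column for the corresponding rows.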

genome_cluster

Adds a new column, cluster_id, that assigns intervals to the same cluster if they overlap or lie within max_distance of each other.

x1 <- data.frame(id = 1:4, bla=letters[1:4],
                chromosome = c("chr1", "chr1", "chr2", "chr1"),
                start = c(100, 120, 300, 260),
                end = c(150, 250, 350, 450))
genome_cluster(x1, by=c("chromosome", "start", "end"))
id bla chromosome start end cluster_id
1 a chr1 100 150 0
2 b chr1 120 250 0
3 c chr2 300 350 2
4 d chr1 260 450 1
genome_cluster(x1, by=c("chromosome", "start", "end"), max_distance=10)
id bla chromosome start end cluster_id
1 a chr1 100 150 0
2 b chr1 120 250 0
3 c chr2 300 350 1
4 d chr1 260 450 0
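
One way to think about the clustering is as a sweep over the intervals sorted by chromosome and start position. The base R sketch below follows that idea. It is not the package's implementation: `cluster_intervals` is a hypothetical helper, and measuring the gap as start - previous end - 1 is an assumption.

```r
# Hypothetical helper (not part of tidygenomics): sweep-based clustering.
cluster_intervals <- function(df, max_distance = 0) {
  chr <- as.character(df$chromosome)
  ord <- order(chr, df$start)        # visit intervals by chromosome, then start
  cluster_id <- integer(nrow(df))
  current <- -1L                     # ids start at 0, as in the output above
  prev_chr <- NA_character_
  reach <- -Inf                      # furthest end seen in the current cluster
  for (i in ord) {
    starts_new <- is.na(prev_chr) || chr[i] != prev_chr ||
      df$start[i] - reach - 1 > max_distance
    if (starts_new) {
      current <- current + 1L
      reach <- df$end[i]
    } else {
      reach <- max(reach, df$end[i])
    }
    prev_chr <- chr[i]
    cluster_id[i] <- current
  }
  cluster_id
}

x1 <- data.frame(id = 1:4, bla = letters[1:4],
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 120, 300, 260),
                 end = c(150, 250, 350, 450))

ids_default <- cluster_intervals(x1)
ids_within_10 <- cluster_intervals(x1, max_distance = 10)

print(ids_default)
print(ids_within_10)
```

For this input the sketch assigns the same ids as the two genome_cluster calls above: with max_distance = 10, interval d (260-450) joins the cluster of a and b because only 9 positions separate it from position 250.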

genome_complement

Calculates the complement of a genomic region.

x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

genome_complement(x1, by=c("chromosome", "start", "end"))
chromosome start end
chr1 1 99
chr1 151 199
chr1 251 399
chr2 1 299
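
The output shows that the complement covers the gaps from position 1 up to the last interval on each chromosome. A base R sketch of that behaviour is given below. `complement_intervals` is a hypothetical helper; the choice to start at position 1 and to stop after the last interval is inferred from the example, since the chromosome lengths are not known here.

```r
# Hypothetical helper (not part of tidygenomics): gaps before and between
# the closed intervals of each chromosome.
complement_intervals <- function(df) {
  pieces <- lapply(split(df, as.character(df$chromosome)), function(chr_df) {
    chr_df <- chr_df[order(chr_df$start), ]
    next_free <- 1                   # first position not yet covered
    starts <- numeric(0)
    ends <- numeric(0)
    for (i in seq_len(nrow(chr_df))) {
      if (chr_df$start[i] > next_free) {
        starts <- c(starts, next_free)
        ends <- c(ends, chr_df$start[i] - 1)
      }
      next_free <- max(next_free, chr_df$end[i] + 1)
    }
    data.frame(chromosome = rep(as.character(chr_df$chromosome[1]),
                                length(starts)),
               start = starts, end = ends)
  })
  result <- do.call(rbind, pieces)
  rownames(result) <- NULL
  result
}

x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr1"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

gaps <- complement_intervals(x1)
print(gaps)
```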

genome_join

Classical join function based on the overlap of the intervals. Implemented and maintained in the fuzzyjoin package and documented here only for completeness.

x1 <- data_frame(id = 1:4, 
                 chr = c("chr1", "chr1", "chr2", "chr3"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

x2 <- data_frame(id = 1:4,
                 chr = c("chr1", "chr1", "chr1", "chr2"),
                 start = c(220, 210, 300, 400),
                 end = c(225, 240, 320, 415))
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="inner")
id.x chr.x start.x end.x id.y chr.y start.y end.y
2 chr1 200 250 1 chr1 220 225
2 chr1 200 250 2 chr1 210 240
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="left")
id.x chr.x start.x end.x id.y chr.y start.y end.y
1 chr1 100 150 NA NA NA NA
2 chr1 200 250 1 chr1 220 225
2 chr1 200 250 2 chr1 210 240
3 chr2 300 350 NA NA NA NA
4 chr3 400 450 NA NA NA NA
fuzzyjoin::genome_join(x1, x2, by=c("chr", "start", "end"), mode="anti")
id chr start end
1 chr1 100 150
3 chr2 300 350
4 chr3 400 450

Inspiration

If you have any additional questions or encounter issues, please raise them on the GitHub page.

tidygenomics's People

Contributors

const-ae, jennybc, liutiming

tidygenomics's Issues

Feature Request: Join nearby

Hi,
I'm working on this now, but it may be a while. The idea is to join ranges that are close to each other, up to a threshold. This threshold could be a distance (e.g. join any ranges within 1MB of each other), or it could be a ratio of the total length of the two ranges plus the gap.

Currently, my understanding is that tidygenomics only has join_closest which joins two ranges that are closest together regardless of what that distance is.

Here's an illustration of what I'm proposing.

[Figure: joinnearby]

The idea of the ratio option is based on an addon script from PennCNV. If I figure this out, maybe I'll try a pull request. Never done that before, but willing to try.

genome_cluster may have a bug.

Thank you for your great package.

genome_cluster does not work well when the ranges have different numbers of digits.
For example,

x2 <- data.frame(id = 1:3, bla=letters[1:3],
chromosome = c("chr1", "chr1", "chr1"),
start = c(1696, 2846, 945),
end = c(1700, 2850, 946))
genome_cluster(x2, by=c("chromosome", "start", "end"))

does not work.
The cluster_id of "a" and "c" is "0", and that of "b" is "1".
(It should be 0, 1, and 2, right?)

I guess genome_cluster cannot distinguish ranges with different numbers of digits like 1700 and 945.
Do you have any ideas?

Thanks,
Kentaro

NAs introduced error when using genome_subtract

I've been using tidygenomics for a while without issue, but today ran into a weird problem. I'm trying to use genome_subtract on two dataframes, and am getting an error message. I'm including row 1 of my data and the first row of the dataframe I want to subtract.

Thanks in advance for your assistance.

test1 <- structure(list(V1 = "chr1:151832901-152370289", V2 = "numsnp=113", 
               V3 = "length=537,389", V4 = "state5,cn=3", V5 = "CCCC.A_1_TR27GD1", 
               V6 = "startsnp=S-3OXBS", V7 = "endsnp=S-4LDZY", Chr = 1L, 
               Position = "151832901-152370289", StartPosition = 151832901, 
               EndPosition = 152370289, NumSNP = 113, Length = 537389L, 
               State = "5", CN = "3", Barcode = "CCCC.A_1_TR27GD1", StartSNP = "S-3OXBS", 
               EndSNP = "S-4LDZY", TN = "Tum", pop = "Black"), Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "Chr", "Position", "StartPosition", "EndPosition", "NumSNP", "Length", "State", "CN", "Barcode", "StartSNP", "EndSNP", "TN", "pop"), row.names = 2L, class = "data.frame")

test2 <- structure(list(Chr = 1L, Cutoff = 1.25e+10, StartPosition = 1.2497e+10, 
                        EndPosition = 1.2503e+10), .Names = c("Chr", "Cutoff", "StartPosition", 
                                                              "EndPosition"), row.names = c(NA, -1L), class = c("tbl_df", "tbl", 
                                                                                                                "data.frame"))

library(tidygenomics)

genome_subtract(test1,test2)

This results in this error message

Joining, by = c("Chr", "StartPosition", "EndPosition")
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") : 
  solving row 1: range cannot be determined from the supplied arguments (too many NAs)
In addition: Warning messages:
1: In .normargSEW0(start, "start") :
  NAs introduced by coercion to integer range
2: In .normargSEW0(end, "end") :
  NAs introduced by coercion to integer range

Traceback

> traceback()
10: .Call(.NAME, ..., PACKAGE = PACKAGE)
9: .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges")
8: solveUserSEW0(start = start, end = end, width = width)
7: IRanges::IRanges(yd[[1]], yd[[2]])
6: .f(.x[[1L]], .y[[1L]], ...)
5: .Call(map2_impl, environment(), ".x", ".y", ".f", "list")
4: map2(.x, .y, .f, ...)
3: purrr::map2_df(joined$x_data, joined$y_data, find_subtractions)
2: f(d1, d2)
1: genome_subtract(test1, test2)

My sessionInfo

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidygenomics_0.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18        IRanges_2.8.2       tidyr_0.7.2         crayon_1.3.4       
 [5] dplyr_0.7.6         assertthat_0.2.0    R6_2.2.2            stats4_3.4.3       
 [9] magrittr_1.5        pillar_1.3.0        rlang_0.2.2         rstudioapi_0.7     
[13] bindrcpp_0.2.2      S4Vectors_0.12.2    tools_3.4.3         glue_1.2.0         
[17] purrr_0.2.4         parallel_3.4.3      yaml_2.1.15         compiler_3.4.3     
[21] BiocGenerics_0.20.0 pkgconfig_2.0.1     tidyselect_0.2.3    bindr_0.1.1        
[25] tibble_1.4.2   
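
One possible reading of the warnings in this report: the traceback ends in IRanges::IRanges(), and the warnings mention coercion to integer range. R integers are 32-bit, and the positions in test2 (about 1.25e+10) are larger than the largest 32-bit integer. The base R sketch below checks only that arithmetic fact. It does not call tidygenomics or IRanges, and whether this is the actual cause of the error is an assumption.

```r
# Positions copied from test2 in the report above
start_position <- 1.2497e+10
end_position <- 1.2503e+10

# Largest value an R integer can hold (2147483647)
max_int <- .Machine$integer.max

# Both positions are outside the integer range
exceeds_int <- c(start = start_position, end = end_position) > max_int

# Coercing an out-of-range double gives NA; the warning that R would emit
# here is suppressed so that the script output stays clean
coerced <- suppressWarnings(as.integer(c(start_position, end_position)))

print(max_int)
print(exceeds_int)
print(coerced)
```

If the positions in test2 were meant to be on the same scale as those in test1 (about 1.5e+08), they would fit into the integer range without problems.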

genome_subtract (X-Y) also simplifies X?

I had chromosome 1 which looked like this, (X):
image

And what I wanted to subtract (Y):
image

I had hoped that it would just subtract Y from X (i.e., I am just trying to remove centromeric data), but here is the output after running:
genome_subtract(X, Y, by=c("chromosome","start","end"))

image

I assume this is by design, but I think it would be a good option to allow only those reads/segments/ranges that overlap ranges in Y to be affected, and not to have ranges in X affect other ranges in X.

Thoughts? Did I do something wrong or misinterpret?

Thanks,
Gaius

Wishlists and potential pull requests

I am really enjoying this package, and I think a few features could be added to it. What are your thoughts?

  • check data.frames: avoid a few common pitfalls in bioinformatics like chr# vs #, inconsistent headers etc.

Currently, the error message is a bit cryptic when chr numbering does not match:

x1 <- data.frame(id = 1:4, 
                chromosome = c("chr1", "chr1", "chr2", "chr2"),
                start = c(100, 200, 300, 400),
                end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                 chromosome = c("1", "2", "2", "1"),
                 start = c(140, 210, 400, 300),
                 end = c(160, 240, 415, 320))

tidygenomics::genome_intersect(x1, x2, by=c("chromosome", "start", "end"), mode="both")
> tidygenomics::genome_intersect(x1, x2, by=c("chromosome", "start", "end"), mode="both")
Error: arrange() failed at implicit mutate() step.
Could not create a temporary column for `..1`.
`..1` is `x`.
Run `rlang::last_error()` to see where the error occurred.
  • strict conversion (in column headers) when converting between data.frame and GRanges
  • other functionalities like merge
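
As a rough illustration of the first wish (checking data frames before joining), a pre-flight check could look like the base R sketch below. `check_chromosome_names` is a hypothetical helper, not an existing tidygenomics function, and the wording of the error message is only a suggestion.

```r
# Hypothetical helper (not part of tidygenomics): fail early with a readable
# message when two data frames share no chromosome labels.
check_chromosome_names <- function(x, y, column = "chromosome") {
  labels_x <- unique(as.character(x[[column]]))
  labels_y <- unique(as.character(y[[column]]))
  shared <- intersect(labels_x, labels_y)
  if (length(shared) == 0) {
    stop("No shared values in column '", column, "': x uses ",
         paste(head(labels_x, 3), collapse = ", "), " but y uses ",
         paste(head(labels_y, 3), collapse = ", "),
         ". Check for a 'chr' prefix mismatch.", call. = FALSE)
  }
  invisible(shared)
}

x1 <- data.frame(id = 1:4,
                 chromosome = c("chr1", "chr1", "chr2", "chr2"),
                 start = c(100, 200, 300, 400),
                 end = c(150, 250, 350, 450))

x2 <- data.frame(id = 1:4,
                 chromosome = c("1", "2", "2", "1"),
                 start = c(140, 210, 400, 300),
                 end = c(160, 240, 415, 320))

# Capture the message instead of stopping the script
mismatch_message <- tryCatch({
  check_chromosome_names(x1, x2)
  "ok"
}, error = function(e) conditionMessage(e))

# Comparing x1 with itself passes and returns the shared labels
shared_labels <- check_chromosome_names(x1, x1)

print(mismatch_message)
print(shared_labels)
```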
