bioconductor / iranges

Foundation of integer range manipulation in Bioconductor

Home Page: https://bioconductor.org/packages/IRanges

R 75.68% C 24.32%

iranges's Introduction

IRanges is an R/Bioconductor package that provides the foundation of integer range manipulation in Bioconductor.

See https://bioconductor.org/packages/IRanges for more information including how to install the release version of the package (please refrain from installing directly from GitHub).

iranges's Issues

Safely getting the column names of a CompressedSplitDFrameList

Consider:

df <- DataFrame(X=1:26, Y=letters)
out <- split(df, rep(1:4, length.out=nrow(df)))
colnames(out)
## CharacterList of length 4
## [[1]] X Y
## [[2]] X Y
## [[3]] X Y
## [[4]] X Y

I just want a character vector of column names, so I might think of doing colnames(out)[[1]], but this fails for zero-length:

colnames(out[0])[[1]]
## NULL

The only safe way I can think of is:

colnames(unlist(out))
## [1] "X" "Y"

This is already somewhat unintuitive, but becomes a 3-liner if I want to change the column names:

df <- unlist(out)
colnames(df) <- c("x", "y")
out2 <- relist(df, out) # is metadata in 'out' retained in 'out2'?

So it would be nice to have a dedicated getter/setter for CSDFLs - following the columnMetadata example:

columnNames(out)
## [1] "X" "Y"
columnNames(out) <- c("x", "y")

Improving the CompressedList constructors

I frequently use CompressedLists but I can't figure out how to create them without making a list first. This is a common pattern:

stuff <- runif(100)
f <- sample(10, length(stuff), replace=TRUE)
library(IRanges)
out <- NumericList(split(stuff, f))

This is unfortunate because it spends time creating a large list with lots of little vectors before unlisting everything again to create the CompressedList. It seems like we could easily circumvent the middleman, possibly with an interface like:

CompressedList(stuff, by=f)

This would handle the reordering to create the internal IRanges and the unlistData. It would also handle the type dispatch so that I don't have to explicitly call NumericList for numeric values, etc. Finally, if by=NULL, it would do the same as as(stuff, "CompressedList"), which allows for a slightly less verbose way to convert vectors into CompressedLists:

library(IRanges)
df <- DataFrame(x=runif(100), y=runif(100))
f <- sample(letters, 100, replace=TRUE)
out <- split(df, f)
out[as(which.max(out[,'x']), "CompressedList")]
# replace with out[CompressedList(which.max(out[,'x']))]
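The reordering step that the proposed CompressedList(stuff, by=f) would perform can be sketched in base R. This is only an illustration under stated assumptions: compress_by() is a hypothetical helper, not an IRanges function, and it returns a plain list holding the flat data and the partition ends rather than a real CompressedList.

```r
# Hypothetical helper (not part of IRanges): build the two pieces of a
# compressed list, the flat data and the partition ends, without ever
# creating an intermediate list of small vectors.
compress_by <- function(values, by) {
    by <- as.factor(by)
    o <- order(by)                                      # groups equal 'by' values, ties keep input order
    ends <- cumsum(tabulate(by, nbins = nlevels(by)))   # last position of each group
    names(ends) <- levels(by)
    list(unlistData = values[o], ends = ends)
}

stuff <- c(0.3, 0.1, 0.9, 0.5, 0.7)
f <- c(2, 1, 2, 3, 1)
cl <- compress_by(stuff, f)
cl$unlistData   # 0.1 0.7 0.3 0.9 0.5
cl$ends         # group "1" ends at 2, "2" at 4, "3" at 5
```

The point of the sketch is that a single order() plus a cumulative count is enough; nothing here needs one small vector per group.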

setdiff that doesn't reduce

I see that setdiff is implemented as gaps(union(gaps(x), y)), so I'm not sure how feasible this is.
Seems like it would require a totally separate implementation, but it would be useful in my application and seems reasonable from an outside perspective.

Setup....

> a <- IRanges(start = c(1, 6), end = c(5, 10))
> a
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         5         5
  [2]         6        10         5
> b <- IRanges(start = 8, end = 12)
> b
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         8        12         5

Currently...

> setdiff(a, b)
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         7         7

Aspirationally....

> setdiff(a, b)
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1         5         5
  [2]         6         7         2
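For what it's worth, the aspirational behaviour can be sketched in base R on plain start/end vectors. This is a minimal illustration, not a proposal for the actual implementation: setdiff_noreduce() is a hypothetical name, it loops in R rather than C, and it ignores zero-width ranges.

```r
# Hypothetical sketch: subtract the ranges in 'b' from each range in 'a'
# separately, so adjacent ranges of 'a' are never merged.
setdiff_noreduce <- function(a_start, a_end, b_start, b_end) {
    ord <- order(b_start)
    b_start <- b_start[ord]
    b_end <- b_end[ord]
    out_start <- numeric(0)
    out_end <- numeric(0)
    for (i in seq_along(a_start)) {
        cur <- a_start[i]                      # leftmost position not yet handled
        for (j in seq_along(b_start)) {
            if (b_end[j] < cur || b_start[j] > a_end[i]) next   # no overlap with the remainder
            if (b_start[j] > cur) {            # keep the piece to the left of b[j]
                out_start <- c(out_start, cur)
                out_end <- c(out_end, b_start[j] - 1)
            }
            cur <- max(cur, b_end[j] + 1)
            if (cur > a_end[i]) break
        }
        if (cur <= a_end[i]) {                 # keep whatever is left on the right
            out_start <- c(out_start, cur)
            out_end <- c(out_end, a_end[i])
        }
    }
    data.frame(start = out_start, end = out_end)
}

res <- setdiff_noreduce(c(1, 6), c(5, 10), 8, 12)
res   # two rows: 1-5 and 6-7
```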

Best way of coercing a CompressedSplitDFrameList into a 1:1 mapping?

I have a use case where I have a CompressedSplitDFrameList (CSDFL) with variable numbers of rows, and I'm trying to think of the most concise, intuitive yet general way to convert this into a DataFrame with one row per original element per list entry of the CSDFL.

To illustrate, let's consider this DataFrame below:

library(IRanges)
df <- DataFrame(X=runif(25), Y=sample(LETTERS, 25, replace=TRUE), Z=rnorm(25))
sdfl <- split(df, factor(df$Y, LETTERS))

I want to convert sdfl into a DataFrame with one row per letter. For a given letter, if there were multiple rows in the corresponding DataFrame, we pick the one with the largest X; if empty, we set it to an all-NA row. Currently I have the following, which was not intuitive to derive:

best <- which.max(sdfl[,"X"])
n <- lengths(sdfl)
offset <- cumsum(n) - n
collapsed <- unlist(sdfl)[offset + best,]
rownames(collapsed) <- names(sdfl)

I'm wondering whether there is any generic we could co-opt or make that would allow me to do this as a one-liner, or at least something more obvious. I don't want to have to show the above code in the OSCA book, but I don't have any package in which I can hide it as a standalone function.

The closest I can get is:

collapsed <- unlist(sdfl[as(which.max(sdfl[,"X"]), "IntegerList")])

... but this is quite slow. Some kind of optimization is broken when the subset contains NA. However, it seems like it doesn't have to be, as the following works fine:

S4Vectors:::.fast_subset_List_by_NL(sdfl, as(which.max(sdfl[,"X"]), "IntegerList"))

I'm guessing this was designed to protect against NA subsetting of GenomicRanges. Perhaps we could have another generic that flags whether NAs are a problem for subsetting the current x, and use that to direct the flow onto the fast path in the case of a CSDFL?

Session information

R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocFileCache_1.13.0        dbplyr_1.4.3               
 [3] SingleCellExperiment_1.11.2 SummarizedExperiment_1.19.4
 [5] DelayedArray_0.15.1         matrixStats_0.56.0         
 [7] Biobase_2.49.0              GenomicRanges_1.41.1       
 [9] GenomeInfoDb_1.25.0         IRanges_2.23.5             
[11] S4Vectors_0.27.7            BiocGenerics_0.35.2        
[13] BiocStyle_2.17.0            rebook_0.99.0              

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0       xfun_0.14              purrr_0.3.4           
 [4] lattice_0.20-41        vctrs_0.3.0            htmltools_0.4.0       
 [7] yaml_2.2.1             blob_1.2.1             XML_3.99-0.3          
[10] rlang_0.4.6            pillar_1.4.4           glue_1.4.1            
[13] DBI_1.1.0              rappdirs_0.3.1         CodeDepends_0.6.5     
[16] bit64_0.9-7            GenomeInfoDbData_1.2.3 lifecycle_0.2.0       
[19] stringr_1.4.0          zlibbioc_1.35.0        memoise_1.1.0         
[22] codetools_0.2-16       evaluate_0.14          knitr_1.28            
[25] callr_3.4.3            ps_1.3.3               curl_4.3              
[28] highr_0.8              Rcpp_1.0.4.6           BiocManager_1.30.10   
[31] graph_1.67.0           XVector_0.29.1         bit_1.1-15.2          
[34] digest_0.6.25          stringi_1.4.6          bookdown_0.19         
[37] processx_3.4.2         dplyr_0.8.5            grid_4.0.0            
[40] tools_4.0.0            bitops_1.0-6           magrittr_1.5          
[43] RSQLite_2.2.0          RCurl_1.98-1.2         tibble_3.0.1          
[46] crayon_1.3.4           pkgconfig_2.0.3        ellipsis_0.3.1        
[49] Matrix_1.2-18          httr_1.4.1             assertthat_0.2.1      
[52] rmarkdown_2.1          R6_2.4.1               compiler_4.0.0

quantile,AtomicList-method misbehaves for length(p)==1

library(IRanges)
out <- NumericList(A=1:5, B=2:6)
quantile(out)
##      A B
## 0%   1 2
## 25%  2 3
## 50%  3 4
## 75%  4 5
## 100% 5 6

Okay, so far so good. Quantiles on the rows, list elements in the columns. But then:

quantile(out, p=0.5)
## A.50% B.50% 
##     3     4

I would have expected:

##     A B
## 50% 3 4

which preserves the expectations of the output structure, and also gets the names right.

Some poking around indicates that the offender is the overly liberal sapply at:

IRanges/R/CompressedAtomicList-class.R

Line 400 in 8633595

else sapply(x, .Generic, na.rm = na.rm)

and

IRanges/R/AtomicList-utils.R

Line 69 in 8633595

sapply(x, .Generic, na.rm = na.rm)
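The simplification behaviour can be reproduced in base R with an ordinary list standing in for the NumericList. The cbind-based alternative below is only one possible way to keep the matrix layout, not necessarily the fix that belongs in the package.

```r
x <- list(A = 1:5, B = 2:6)

# sapply() collapses length-1 results into a named vector, losing the layout.
flat <- sapply(x, quantile, probs = 0.5)
names(flat)      # "A.50%" "B.50%"

# Binding the per-element results as columns keeps one row per probability.
mat <- do.call(cbind, lapply(x, quantile, probs = 0.5))
dim(mat)         # 1 2
dimnames(mat)    # rows "50%", columns "A" "B"
```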

no function found corresponding to methods exports from ‘IRanges’ for: ‘parallel_slot_names’

Change to default minoverlap in subsetByOverlaps() feels unintuitive

The change of the default value of minoverlap to be 1 for subsetByOverlaps() feels really unintuitive to me (78ed68a). E.g., these used to have the same default behaviour (which felt intuitive) but no longer do:

suppressPackageStartupMessages(library(IRanges))
x <- IRanges(1, 4)
ranges <- IRanges(5, 8)
subsetByOverlaps(x, ranges)
#> IRanges object with 1 range and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         1         4         4
x[overlapsAny(x, ranges)]
#> IRanges object with 0 ranges and 0 metadata columns:
#>        start       end     width
#>    <integer> <integer> <integer>

Might there be a way to get the 'correct' behaviour for zero-width ranges in subsetByOverlaps() without changing the default? I'd expect more people to be confused by the new behaviour than by dropped zero-width ranges.

Cheers,
Pete

"[["/setListElement removes rownames on a CompressedSplitDataFrameList

Replacing an element in a CompressedSplitDataFrameList removes the rownames of every element, not just the replaced one:

> csdfl <- SplitDataFrameList(DataFrame(one = c(1,2,3,4), row.names = seq_len(4)),
                              DataFrame(one = c(11,12,13,14), row.names = c("a","b","c","d")))
> class(csdfl)
[1] "CompressedSplitDataFrameList"
attr(,"package")
[1] "IRanges"
> dimnames(csdfl)
[[1]]
CharacterList of length 2
[[1]] 1 2 3 4
[[2]] a b c d

[[2]]
CharacterList of length 2
[[1]] one
[[2]] one

> csdfl[[1]] <- DataFrame(one = c(4,3,2,1), row.names = rev(seq_len(4)))
> dimnames(csdfl)
[[1]]
CharacterList of length 2
[[1]] character(0)
[[2]] character(0)

[[2]]
CharacterList of length 2
[[1]] one
[[2]] one

Is that expected, given that a CompressedList is something of a special case?
The SimpleSplitDataFrameList behaves a bit differently.

> sdfl <- SplitDataFrameList(DataFrame(one = c(1,2,3,4), row.names = seq_len(4)),
                              DataFrame(one = c(11,12,13,14), row.names = c("a","b","c","d")),
                              compress = FALSE)
> class(sdfl)
[1] "SimpleSplitDataFrameList"
attr(,"package")
[1] "IRanges"
> dimnames(sdfl)
[[1]]
CharacterList of length 2
[[1]] 1 2 3 4
[[2]] a b c d

[[2]]
CharacterList of length 2
[[1]] one
[[2]] one

> sdfl[[1]] <- DataFrame(one = c(4,3,2,1), row.names = rev(seq_len(4)))
> dimnames(sdfl)
[[1]]
CharacterList of length 2
[[1]] character(0)
[[2]] a b c d

[[2]]
CharacterList of length 2
[[1]] one
[[2]] one

Here at least the other rownames do survive.

Any advice or suggestions? Is it best practice to not rely on the rownames in a DataFrameList?
Thanks

which.min() and which.max() are broken on CompressedNumericList objects

library(IRanges)
x <- NumericList(a=c(0.5, 0.1, 0.9), b=c(2.5, 2.9, 2.1))
which.min(x)
# a b 
# 1 1 
which.max(x)
# a b 
# 1 1

The culprit seems to be this line:

IRanges/src/CompressedAtomicList_utils.c

Line 147 in 52b13bf

PARTITIONED_AGG(int, ACCESSOR, INTSXP, INTEGER, \

where int is passed to the PARTITIONED_AGG() macro, with the consequence that values in x are extracted as integer values. Apparently it's been broken since which.min() and which.max() got optimized in 2016 (commit c320278).
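The symptom is consistent with the values being truncated toward zero before the comparison. A base R illustration, assuming that extracting the doubles as C ints has the same effect as as.integer():

```r
x <- list(a = c(0.5, 0.1, 0.9), b = c(2.5, 2.9, 2.1))

# Mimic reading the doubles as ints: every value in a group collapses to
# the same integer, so the first position always wins.
truncated <- lapply(x, as.integer)
bad_min <- sapply(truncated, which.min)   # a = 1, b = 1
bad_max <- sapply(truncated, which.max)   # a = 1, b = 1

# What the result should be on the original doubles.
good_min <- sapply(x, which.min)          # a = 2, b = 3
good_max <- sapply(x, which.max)          # a = 3, b = 2
```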

sessionInfo()

R Under development (unstable) (2020-11-18 r79449)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.r79449/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.r79449/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] IRanges_2.23.7      S4Vectors_0.29.3    BiocGenerics_0.37.0

loaded via a namespace (and not attached):
[1] compiler_4.1.0

Support 'compress=NA' in AtomicList constructors

It would do compress=TRUE if the input is not too big, otherwise it would fall back to compress=FALSE. For example:

RleList(Rle(8, 2e9), Rle(5, 200), compress=NA)

would do RleList(Rle(8, 2e9), Rle(5, 200), compress=TRUE) i.e. return a CompressedRleList object.

But:

RleList(Rle(8, 2e9), Rle(5, 2e9), compress=NA)

would do RleList(Rle(8, 2e9), Rle(5, 2e9), compress=FALSE) i.e. return a SimpleRleList object.

Then compress=NA should probably be made the new default (right now it's compress=TRUE, which is problematic because it fails when the input is too big).
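The decision rule could look something like the sketch below. choose_compress() is a hypothetical helper, and the threshold (total length must fit in a 32-bit integer) is an assumption about what "too big" means here.

```r
# Hypothetical helper: TRUE if the concatenated data would still fit in a
# single vector whose positions are addressable with 32-bit integers.
choose_compress <- function(element_lengths) {
    total <- sum(as.numeric(element_lengths))   # sum as doubles to avoid integer overflow
    total <= .Machine$integer.max
}

small_ok <- choose_compress(c(2e9, 200))   # TRUE: 2000000200 still fits
too_big  <- choose_compress(c(2e9, 2e9))   # FALSE: 4e9 exceeds 2^31 - 1
```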

See Bioconductor/VariantAnnotation#71 (comment) for some background.

Conversion of IRanges to data.frame strips metadata

Converting an IRanges object with metadata to a data.frame drops all metadata, see:

suppressPackageStartupMessages(library(IRanges))
ir1 <- IRanges(c(1, 10, 20), width=5)
mcols(ir1) <- DataFrame(score=runif(3))
as.data.frame(ir1)
#>   start end width
#> 1     1   5     5
#> 2    10  14     5
#> 3    20  24     5

For me round-trips from data.frame (or actually tibble) to IRanges and back are a quite common use case. Currently, I'm fixing things by using this custom function:

as_tibble.IRanges <- function(x){
    as_tibble(bind_cols(as.data.frame(x), as.data.frame(x@elementMetadata)))
}

However, this does not feel right and I would rather see this as default behavior. I'm happy to help or test.

I originally reported this behavior in tidyomics/plyranges#55

I'm using IRanges 2.16.0 on R version 3.5.2, my sessionInfo():

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: NixOS 18.09.2323.fef0aaaaab4 (Jellyfish)

Matrix products: default
BLAS/LAPACK: /nix/store/x2bijqr1l5hivs9h990ms6xlwvmp3asc-openblas-0.3.5/lib/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] IRanges_2.16.0      S4Vectors_0.20.1    BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] compiler_3.5.2

Make the IRanges() constructor compatible with ALTREP

(Moving this from the devteam-bioc Slack to GitHub)

Jiefei wrote on Slack:

Hi Herve, I would like to make some changes to the IRanges package, but I might be missing something, so I want to hear your opinion. To make the IRanges package compatible with ALTREP, an IRanges object should be able to use the user's input directly as its start and width slots. So, for the C-level constructor:

SEXP solve_user_SEW0(SEXP start, SEXP end, SEXP width)
{
    SEXP ans, ans_start, ans_width;
    int ans_length, i;

    ans_length = LENGTH(start);
    PROTECT(ans_start = NEW_INTEGER(ans_length));
    PROTECT(ans_width = NEW_INTEGER(ans_length));
    for (i = 0; i < ans_length; i++) {
        if (solve_user_SEW0_row(INTEGER(start)[i],
                    INTEGER(end)[i],
                    INTEGER(width)[i],
                    INTEGER(ans_start) + i,
                    INTEGER(ans_width) + i) != 0)
        {
            UNPROTECT(2);
            error("solving row %d: %s", i + 1, errmsg_buf);
        }
    }
    PROTECT(ans = _new_IRanges("IRanges", ans_start, ans_width, R_NilValue));
    UNPROTECT(3);
    return ans;
}

I plan to change it to

SEXP solve_user_SEW0(SEXP start, SEXP end, SEXP width)
{
    SEXP ans;
    int duplicate_num = 0;
    int ans_length = LENGTH(start);
    for (int i = 0; i < ans_length; i++) {
        if (solve_user_SEW0_row(i, 
            &start,&end, &width,
            &duplicate_num
        ) != 0)
        {
            error("solving row %d: %s", i + 1, errmsg_buf);
        }
    }
    PROTECT(ans = _new_IRanges("IRanges", start, width, R_NilValue));
    UNPROTECT(1+ duplicate_num);
    return ans;
}

The function solve_user_SEW0_row will duplicate start and width if it needs to change the values of these two variables. A similar pattern is also found in the function solve_user_SEW. I would like to know if there is any concern that prevents us from using the user's input as the IRanges object's slots. Do you think these changes could be a solution for the ALTREP problem?

Subsetting a SimpleRleList object by a GRanges object 'x' should work, even when 'sum(width(x))' >= 2^31

This is a spin-off from issue #43:

library(BSgenome.Hsapiens.UCSC.hg38)
genome <- BSgenome.Hsapiens.UCSC.hg38

x <- GRanges(c("chr1:1-1000", "chr1:11-1010", "chr1:1000001-1500000", "chrM:1-16569"), seqinfo=seqinfo(genome))
x_cvg <- coverage(x)  # SimpleRleList object

bins <- tileGenome(seqinfo(genome), tilewidth=1e6, cut.last.tile=TRUE)
sum(width(bins))  # >= 2^31
x_cvg[bins]
# Error in .numeric2end(x, NG) : 
#   when 'x' is an integer vector, it cannot contain NAs or negative values
# In addition: Warning message:
# In bindROWS(x, list(...), ignore.mcols = ignore.mcols) :
#   integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

sessionInfo():

R Under development (unstable) (2021-10-25 r81105)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.2.r81105/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.2.r81105/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.63.3                  
 [3] rtracklayer_1.55.2                Biostrings_2.63.0                
 [5] XVector_0.35.0                    GenomicRanges_1.47.5             
 [7] GenomeInfoDb_1.31.1               IRanges_2.29.1                   
 [9] S4Vectors_0.33.5                  BiocGenerics_0.41.2              

loaded via a namespace (and not attached):
 [1] zlibbioc_1.41.0             GenomicAlignments_1.31.2   
 [3] BiocParallel_1.29.4         lattice_0.20-45            
 [5] rjson_0.2.20                tools_4.2.0                
 [7] grid_4.2.0                  SummarizedExperiment_1.25.2
 [9] parallel_4.2.0              Biobase_2.55.0             
[11] matrixStats_0.61.0          yaml_2.2.1                 
[13] crayon_1.4.2                BiocIO_1.5.0               
[15] Matrix_1.3-4                GenomeInfoDbData_1.2.7     
[17] restfulr_0.0.13             bitops_1.0-7               
[19] RCurl_1.98-1.5              DelayedArray_0.21.2        
[21] compiler_4.2.0              MatrixGenerics_1.7.0       
[23] Rsamtools_2.11.0            XML_3.99-0.8

function restrict() doesn't return expected result on continuous IRanges

Hi Michael @lawremi, Hervé @hpages,
I ran the restrict() function on a continuous IRanges object:

range1 <- IRanges(start=c(1, 4, 8), end=c(3, 7, 10))
restrict(range1, start=4)

RESULT:

IRanges object with 3 ranges and 0 metadata columns:
    start end width
[1]     4   3     0
[2]     4   7     4
[3]     8  10     3

As you can see, the first range of the result is not what I expected: it has start 4, end 3 and width 0. Did I set a parameter incorrectly? I also found that the same thing happens whenever the start or end argument equals one of the values in start(range1) or end(range1). Can you help me out?

Thx.
Yuliang
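A note on the output above: as far as I recall from the restrict() man page, a range that ends exactly at start - 1 is always kept and returned with width zero, while ranges ending earlier are dropped unless keep.all.ranges=TRUE. The base R sketch below mimics that rule for the lower bound only. restrict_sketch() is a hypothetical helper, and the rule it encodes is my reading of the documentation, not the package code.

```r
# Hypothetical sketch of the lower-bound rule: ranges ending at 'lower - 1'
# or later are kept, and their start is clipped to 'lower'.
restrict_sketch <- function(start, end, lower) {
    keep <- end >= lower - 1
    new_start <- pmax(start[keep], lower)
    data.frame(start = new_start,
               end   = end[keep],
               width = end[keep] - new_start + 1)
}

res <- restrict_sketch(c(1, 4, 8), c(3, 7, 10), lower = 4)
res   # first row: start 4, end 3, width 0
```

Under that reading, the zero-width first range is the documented outcome for a range that touches the restriction boundary; dropping empty ranges afterwards (for example by filtering on width > 0) would give the result the question expects.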

Constructing a CompressedList given the PartitioningByEnd object?

If I have the unlistData and a PartitioningByEnd object, how do I construct the associated CompressedList? I come across this situation on occasion where I already know that the unlistData is in the correct order and I know the runs associated with each List element; it seems like a waste to have to create a rep(seq_along(runs), runs) factor to use splitAsList().

Making the CompressedList more widely usable

I've been playing around with the CompressedList subclasses for representing some complex data types and I've really come to like it. I've been thinking of ways to make it more generally usable by both end-users and other developers, and I've got a few wish-list elements:

Access to unlistData and partitioning. End-users would then be able to execute arbitrary unary operations on the underlying data while preserving the partitioning, like:

A <- DataFrame(X=LETTERS, Y=runif(26))
comp.list <- split(A,A$X)

# Attempt fails, for obvious reasons.
comp.list$Y <- log(comp.list$Y)

# Assuming we had a unlistData() method:
unlistData(comp.list)$Y <- log(unlistData(comp.list)$Y)

unlistData<- could even be unlist<-, if one were willing to introduce that concept. I don't mind if partitioning is getter-only; this would still be very useful for downstream functions that need to be list-aware yet don't want to create an intermediate list for efficiency purposes.

Non-virtual CompressedList class. I don't understand the motivation for making CompressedList virtual. From a representation perspective, a general concrete class would be useful if we could store any vector-like entity in unlistData. In fact, I ran into the case where I wanted to store a CompressedCharacterList as unlistData, effectively making a CompressedCompressedCharacterListList! I don't expect to be able to call many methods on this thing - other than the proposed unlistData and partitioning, and maybe unlist - I just want to use it for storage without needing to write an explicit subclass. A general CompressedList class would serve this purpose, and is better than the alternative of falling back to a SimpleList (which takes a noticeable time to generate).

A more careful unlist. If we do allow a general CompressedList class, the unlist method should probably take heed of recursive=TRUE and apply unlist on the unlistData slot.

I'm happy to chip in with a PR if these sound like good ideas.

IRanges::IntegerList & min / max

Hi Michael @lawremi, Hervé @hpages,

A (recent?) change to IRanges IntegerList returns some unexpected results.
In base R, min(integer()) returns Inf.

When taking the min or max of an empty element of an IRanges::IntegerList, you get R's
maximum integer value for min, and its negation for max (see below).
Should this match what base R returns?

-M

library(IRanges)
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind,
#>     colMeans, colnames, colSums, dirname, do.call, duplicated,
#>     eval, evalq, Filter, Find, get, grep, grepl, intersect,
#>     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
#>     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
#>     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which, which.max,
#>     which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
il <- IntegerList(A = 1:3, B = integer())
max(il)
#>           A           B 
#>           3 -2147483647
min(il)
#>          A          B 
#>          1 2147483647

Created on 2018-08-08 by the reprex package (v0.2.0).

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                                      
#>  version  R version 3.5.1 Patched (2018-07-30 r75013)
#>  system   x86_64, linux-gnu                          
#>  ui       X11                                        
#>  language (EN)                                       
#>  collate  en_US.UTF-8                                
#>  tz       America/New_York                           
#>  date     2018-08-08
#> Packages -----------------------------------------------------------------
#>  package      * version date       source        
#>  backports      1.1.2   2017-12-13 CRAN (R 3.5.0)
#>  base         * 3.5.0   2018-05-01 local         
#>  BiocGenerics * 0.27.1  2018-06-17 Bioconductor  
#>  compiler       3.5.0   2018-05-01 local         
#>  datasets     * 3.5.0   2018-05-01 local         
#>  devtools       1.13.6  2018-06-27 cran (@1.13.6)
#>  digest         0.6.15  2018-01-28 CRAN (R 3.5.0)
#>  evaluate       0.11    2018-07-17 CRAN (R 3.5.1)
#>  graphics     * 3.5.0   2018-05-01 local         
#>  grDevices    * 3.5.0   2018-05-01 local         
#>  htmltools      0.3.6   2017-04-28 CRAN (R 3.5.0)
#>  IRanges      * 2.15.16 2018-07-18 Bioconductor  
#>  knitr          1.20    2018-02-20 CRAN (R 3.5.0)
#>  magrittr       1.5     2014-11-22 CRAN (R 3.5.0)
#>  memoise        1.1.0   2017-04-21 CRAN (R 3.5.0)
#>  methods      * 3.5.0   2018-05-01 local         
#>  parallel     * 3.5.0   2018-05-01 local         
#>  Rcpp           0.12.18 2018-07-23 CRAN (R 3.5.1)
#>  rmarkdown      1.10    2018-06-11 CRAN (R 3.5.0)
#>  rprojroot      1.3-2   2018-01-03 CRAN (R 3.5.0)
#>  S4Vectors    * 0.19.19 2018-07-18 Bioconductor  
#>  stats        * 3.5.0   2018-05-01 local         
#>  stats4       * 3.5.0   2018-05-01 local         
#>  stringi        1.2.4   2018-07-20 CRAN (R 3.5.1)
#>  stringr        1.3.1   2018-05-10 CRAN (R 3.5.0)
#>  tools          3.5.0   2018-05-01 local         
#>  utils        * 3.5.0   2018-05-01 local         
#>  withr          2.1.2   2018-03-15 CRAN (R 3.5.0)
#>  yaml           2.2.0   2018-07-25 CRAN (R 3.5.1)

Accidental issue

Apologies, I accidentally opened this issue.

Please delete if possible as there is no issue.

Consistency between base Math Ops and S4Vectors for IntegerLists

Hi @hpages,

Is there a reason that the following return two different results? Does doing min/max on IntegerList with integer(0) elements always equal .Machine$integer.max?

> min(IRanges::IntegerList( integer(0)))
[1] 2147483647
> min(integer(0))
[1] Inf
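For reference, the base R side of the comparison can be checked directly. Note that base R returns a double here, since Inf has no integer representation, which is presumably why an integer-typed result needs some other sentinel:

```r
base_min <- suppressWarnings(min(integer(0)))   # base R warns and returns Inf
is.infinite(base_min)    # TRUE
typeof(base_min)         # "double", not "integer"
.Machine$integer.max     # 2147483647, the value reported for the empty element
```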

Some methods expect no more than 2^31 elements

Hi Michael, Herve,
I'm trying to fetch the [human] genome-wide coverage of a CompressedRleList using a combination of coverage() from the GenomicRanges package and [ from IRanges (well probably [ is from S4Vectors but extended in IRanges?). I realized this is not possible, as it seems that some functions in IRanges expect the number of elements to fit in a 32-bit integer.

I'm most likely abusing the original use-case which these functions were designed for. Still, I wanted to ask whether this limitation with 32-bit integers is something known and intentionally meant to be there.

Here's a minimal example (get the average coverage of 1M bp bins):

library(GenomicRanges)

# get the average coverage of 1M bp bins
bins <- tileGenome(seqinfo(BSgenome.Hsapiens.UCSC.hg38::Hsapiens), tilewidth=1e6, cut.last.tile=TRUE)
bins <- keepStandardChromosomes(bins, pruning.mode="coarse")
x <- GRanges("chr1:1")   # a phony signal track
seqinfo(x) <- seqinfo(bins)

bincov <- coverage(x)[bins]    # <-- this will raise an error
bin.avg.cov <- lapply(bincov, function(x) mean(decode(x)))

This seems to work fine with organisms with shorter genomes (< 2^31 bp, like S. cerevisiae).

The methods in question that expect 32-bit integers are bindROWS() for the CompressedList class (which produces an integer overflow in a cumsum() call) and PartitioningByEnd(), which internally calls .numeric2end() and implicitly converts to integer type.
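The cumsum() overflow is easy to reproduce in base R, independently of IRanges. The widths below are made up purely to cross the 2^31 boundary:

```r
widths <- c(.Machine$integer.max, 1L)

# Integer arithmetic: the running total overflows and becomes NA (with a warning).
overflowed <- suppressWarnings(cumsum(widths))
is.na(overflowed[2])                  # TRUE

# Double arithmetic keeps the exact total, as the warning message suggests.
ends <- cumsum(as.numeric(widths))
ends[2] == 2^31                       # TRUE
```

The NA produced in the integer case would then explain the follow-up complaint about NAs in the integer vector handed to the partitioning code.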

I'm currently using Bioconductor 3.14 through Bioconductor's official Docker image:

R> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  utils     datasets  grDevices methods   base     

other attached packages:
[1] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0  IRanges_2.28.0       S4Vectors_0.32.0     BiocGenerics_0.40.0 

loaded via a namespace (and not attached):
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.4 rstudioapi_0.13                   XVector_0.34.0                   
 [4] GenomicAlignments_1.30.0          zlibbioc_1.40.0                   BiocParallel_1.28.0              
 [7] lattice_0.20-45                   BSgenome_1.62.0                   rjson_0.2.20                     
[10] tools_4.1.1                       grid_4.1.1                        SummarizedExperiment_1.24.0      
[13] parallel_4.1.1                    Biobase_2.54.0                    matrixStats_0.61.0               
[16] yaml_2.2.1                        crayon_1.4.1                      BiocIO_1.4.0                     
[19] Matrix_1.3-4                      GenomeInfoDbData_1.2.7            rtracklayer_1.54.0               
[22] restfulr_0.0.13                   bitops_1.0-7                      RCurl_1.98-1.5                   
[25] DelayedArray_0.20.0               compiler_4.1.1                    MatrixGenerics_1.6.0             
[28] Biostrings_2.62.0                 Rsamtools_2.10.0                  XML_3.99-0.8

Many thanks for your feedback.

R 4.3.3 Unable to install IRanges

Hi, @lawremi

How can I solve this, please?

Add rescale() for IRanges objects

library(IRanges)

.zoom0 <- function(x, z=1)
{
    stopifnot(is(x, "Ranges"), is.numeric(z))
    if (length(z) > length(x) && length(z) != 1L)
        stop("'z' is longer than 'x'")
    if (anyNA(z) || min(z) < 0)
        stop("'z' contains NAs and/or negative values")
    new_start <- as.integer(start(x) * z)
    new_width <- as.integer(width(x) * z)
    BiocGenerics:::replaceSlots(x, start=new_start, width=new_width)
}

.normarg_scale <- function(scale)
{
    if (is(scale, "IRanges"))
        return(scale)
    if (is.numeric(scale))
        return(IRanges(1L, scale))
    as(scale, "IRanges")
}

## 'oldscale' and 'newscale': integer vectors or IRanges
## objects recycled to the length of 'x'.
rescale <- function(x, newscale=1L, oldscale=1L)
{
    if (!is(x, "IRanges"))
        x <- as(x, "IRanges")
    newscale <- .normarg_scale(newscale)
    oldscale <- .normarg_scale(oldscale)
    ans <- shift(x, -start(oldscale))
    ans <- .zoom0(ans, width(newscale) / width(oldscale))
    shift(ans, start(newscale))
}

Then:

> rescale("5-10", newscale=100, oldscale=10)
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        41       100        60

> rescale("505-510", newscale=100, oldscale="501-510")
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]        41       100        60

> rescale("505-510", newscale="101-200", oldscale="501-510")
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]       141       200        60
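
The arithmetic behind these calls can be traced by hand with plain integers. The following is only a sketch: the helper name rescale_by_hand is hypothetical, and it works on start/width values in base R rather than on IRanges objects, mirroring the shift, zoom, shift steps of the rescale() prototype above.

```r
# Hypothetical helper: same shift -> zoom -> shift steps as the rescale()
# prototype, but on plain start/width integers so it runs without IRanges.
rescale_by_hand <- function(start, width, old_start, old_width,
                            new_start, new_width)
{
    z <- new_width / old_width     # zoom factor between the two scales
    s <- start - old_start         # shift(x, -start(oldscale))
    s <- as.integer(s * z)         # .zoom0() applied to the start
    w <- as.integer(width * z)     # .zoom0() applied to the width
    s <- s + new_start             # shift(ans, start(newscale))
    c(start = s, end = s + w - 1L, width = w)
}

# "5-10" on a 1-10 scale, mapped to a 1-100 scale
res1 <- rescale_by_hand(5L, 6L, 1L, 10L, 1L, 100L)
# "505-510" on a 501-510 scale, mapped to a 101-200 scale
res2 <- rescale_by_hand(505L, 6L, 501L, 10L, 101L, 200L)
print(res1)
print(res2)
```

Both results agree with the start, end and width columns printed above (41/100/60 and 141/200/60).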

There is a rescale() generic in ggbio. It should probably be moved to BiocGenerics, with the above rescale() for IRanges turned into a method.

There is also a rescale() S3 generic in the scales package (with arguments x, to, from, ...) but there is nothing we can do about this (the conflicted package will help the end user deal with the name clash).

startsWith and endsWith are not exported

Is this an oversight, or intentional?

library(IRanges)
X <- CharacterList(split(LETTERS, 1:26))
startsWith(X, "A")
## Error in startsWith("A", X) : non-character object(s)
IRanges:::startsWith(X, "A") ## works

Bug in IRanges:::recycleListElements()

Hi Michael @lawremi ,

Can you please take a look at this?

library(IRanges)
x <- IntegerList(11:13, NULL)
y <- IntegerList(NULL, NULL)
x + y
# Error in rep(seq_len(NROW(x)), ...) : invalid 'times' argument

The error comes from IRanges:::recycleListElements():

IRanges:::recycleListElements(x, c(0, 0))
# Error in rep(seq_len(NROW(x)), ...) : invalid 'times' argument

Thanks!

sessionInfo():

R version 4.0.0 Patched (2020-04-27 r78316)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.0.r78316/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.0.r78316/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] IRanges_2.23.5      S4Vectors_0.27.6    BiocGenerics_0.35.2

loaded via a namespace (and not attached):
[1] compiler_4.0.0

Missing the "CompressedList,missing" method for Ops

For example:

library(IRanges)
x <- IntegerList(11:12, integer(0), 3:-2, compress=TRUE)
-x
## Error in -x : invalid argument to unary operator

Seems like a simple omission in CompressedAtomicList-class.R:

setMethod("Ops",
          signature(e1 = "CompressedAtomicList", e2 = "missing"),
          function(e1, e2)
          {
              relist(callGeneric(e1@unlistData), e1)
          })

After which:

-x
## IntegerList of length 3
## [[1]] -11 -12
## [[2]] integer(0)
## [[3]] -3 -2 -1 0 1 2
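
The proposed method follows an "unlist, operate, relist" pattern. A base-R analogue on an ordinary list is sketched below; it is an illustration of the pattern under that assumption, not the IRanges implementation, and it uses utils::relist() in place of the S4 relist() method.

```r
# Base-R analogue of the proposed method: apply the unary operator to the
# flattened values, then restore the original list shape.
x_list <- list(11:12, integer(0), 3:-2)
flesh <- unlist(x_list)           # all values in one integer vector
neg <- relist(-flesh, x_list)     # utils::relist() reuses the skeleton's shape
print(neg)
```

The empty second element survives the round trip because the skeleton, not the flattened data, determines the element lengths.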

IRanges constructor with FactorLists

Hi @hpages,

A test in my package plyranges is failing on the dev branch because of what I think are changes to IRanges. I'm not sure how to proceed, so any ideas would be really helpful. Basically, I have some helpers for constructing Ranges from DFrames, and one test case includes a column that is a FactorList. I've made a reprex of the problem below using the current dev version of IRanges:

suppressPackageStartupMessages(library(IRanges))

start <- 1:3 
width <- 2:4

grps <- FactorList("a", c("b", "c"), "d")

grps 
#> FactorList of length 3
#> [[1]] a
#> [[2]] b c
#> [[3]] d

# iranges constructor adding mcols
ir <- IRanges(start, width, grps = grps)
ir
#> IRanges object with 3 ranges and 1 metadata column:
#>           start       end     width |         grps
#>       <integer> <integer> <integer> | <FactorList>
#>   [1]         1         2         2 |             
#>   [2]         2         3         2 |             
#>   [3]         3         4         2 |
mcols(ir)$grps
#> FactorList of length 3
#> Error in RangeNSBS(x, start = start, end = end, width = width): the specified range is out-of-bounds

# iranges constructor without mcols
ir <- IRanges(start, width)
mcols(ir)[["grps"]] <- grps
ir
#> IRanges object with 3 ranges and 1 metadata column:
#>           start       end     width |         grps
#>       <integer> <integer> <integer> | <FactorList>
#>   [1]         1         2         2 |             
#>   [2]         2         3         2 |             
#>   [3]         3         4         2 |
mcols(ir)$grps
#> FactorList of length 3
#> Error in RangeNSBS(x, start = start, end = end, width = width): the specified range is out-of-bounds

Created on 2021-05-03 by the reprex package (v2.0.0)

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.5 (2021-03-31)
#>  os       macOS Big Sur 10.16         
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2021-05-03                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date       lib
#>  backports      1.2.1   2020-12-09 [1]
#>  BiocGenerics * 0.37.4  2021-05-03 [1]
#>  cli            2.5.0   2021-04-26 [1]
#>  crayon         1.4.1   2021-02-08 [1]
#>  digest         0.6.27  2020-10-24 [1]
#>  ellipsis       0.3.2   2021-04-29 [1]
#>  evaluate       0.14    2019-05-28 [1]
#>  fansi          0.4.2   2021-01-15 [1]
#>  fs             1.5.0   2020-07-31 [1]
#>  glue           1.4.2   2020-08-27 [1]
#>  highr          0.9     2021-04-16 [1]
#>  htmltools      0.5.1.1 2021-01-22 [1]
#>  IRanges      * 2.25.10 2021-05-03 [1]
#>  knitr          1.33    2021-04-24 [1]
#>  lifecycle      1.0.0   2021-02-15 [1]
#>  magrittr       2.0.1   2020-11-17 [1]
#>  pillar         1.6.0   2021-04-13 [1]
#>  pkgconfig      2.0.3   2019-09-22 [1]
#>  purrr          0.3.4   2020-04-17 [1]
#>  reprex         2.0.0   2021-04-02 [1]
#>  rlang          0.4.11  2021-04-30 [1]
#>  rmarkdown      2.7     2021-02-19 [1]
#>  S4Vectors    * 0.29.18 2021-05-03 [1]
#>  sessioninfo    1.1.1   2018-11-05 [1]
#>  stringi        1.5.3   2020-09-09 [1]
#>  stringr        1.4.0   2019-02-10 [1]
#>  styler         1.4.1   2021-03-30 [1]
#>  tibble         3.1.1   2021-04-18 [1]
#>  utf8           1.2.1   2021-03-12 [1]
#>  vctrs          0.3.8   2021-04-29 [1]
#>  withr          2.4.2   2021-04-18 [1]
#>  xfun           0.22    2021-03-11 [1]
#>  yaml           2.2.1   2020-02-01 [1]
#>  source                                    
#>  CRAN (R 4.0.2)                            
#>  Github (Bioconductor/BiocGenerics@cced297)
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.1)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  Github (Bioconductor/IRanges@a5258ca)     
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  Github (Bioconductor/S4Vectors@7593108)   
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#>  CRAN (R 4.0.2)                            
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

Reverse IRangesList

Is there a vectorized alternative for endoapply(IRangesList, rev)?

x <- IRanges(start = 1:10, width = 3)
group <- c(1,1,2,2,2,3,3,4,4,4)
split_x <- split(x, group)
rev_split_x <- endoapply(split_x, rev)
# works

reverse does not work in this case:

reverse(split_x)
> Error in (function (classes, fdef, mtable)  : 
> unable to find an inherited method for function ‘reverse’ for signature ‘"CompressedIRangesList"’

order(x, decreasing=TRUE) is broken on CompressedIntegerList objects

library(IRanges)
x1 <- IntegerList(15:11, 21:28, compress=TRUE)

order(x1)
# IntegerList of length 2
# [[1]] 5 4 3 2 1
# [[2]] 1 2 3 4 5 6 7 8

order(x1, decreasing=TRUE)  # BROKEN!
# IntegerList of length 2
# [[1]] 13 12 11 10 9
# [[2]] 3 2 1 -4 -3 -2 -1 0

This in turn breaks sort( , decreasing=TRUE):

sort(x1, decreasing=TRUE)
# Error in seq_len(x_NROW)[i] : 
#   only 0's may be mixed with negative subscripts

Everything looks fine with a SimpleIntegerList object:

x2 <- IntegerList(15:11, 21:28, compress=FALSE)

order(x2)
# IntegerList of length 2
# [[1]] 5 4 3 2 1
# [[2]] 1 2 3 4 5 6 7 8

order(x2, decreasing=TRUE)
# IntegerList of length 2
# [[1]] 1 2 3 4 5
# [[2]] 8 7 6 5 4 3 2 1

sort(x2, decreasing=TRUE)
# IntegerList of length 2
# [[1]] 15 14 13 12 11
# [[2]] 28 27 26 25 24 23 22 21
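
For reference, the per-element result that order() and sort() with decreasing=TRUE are expected to give can be reproduced on an ordinary list in base R. This is only a sketch of the expected values; it does not touch the CompressedIntegerList code path.

```r
# Plain-list reference: apply order()/sort() to each element separately.
x <- list(15:11, 21:28)
ord <- lapply(x, order, decreasing = TRUE)
srt <- lapply(x, sort, decreasing = TRUE)
print(ord)
print(srt)
```

These values match the SimpleIntegerList output above, which suggests the problem is specific to the compressed representation.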

sessionInfo():

> sessionInfo()
R version 3.6.0 Patched (2019-05-02 r76454)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-3.6.r76454/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.6.r76454/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] IRanges_2.19.16     S4Vectors_0.23.23   BiocGenerics_0.31.6

loaded via a namespace (and not attached):
[1] compiler_3.6.0

Wrong min/max on empty entries of CompressedIntegerLists

Consider the following:

library(IRanges)
a <- list(integer(0), 1L, 1:2, 1:3)
b <- IntegerList(a, compress=TRUE)
max(b)
# [1] -2147483647           1           2           3
min(b)
# [1] 2147483647          1          1          1

I can smell the C here. Presumably these big numbers are meant to be NA_integer_s.
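
A sketch of the behaviour suggested here, written in base R on the plain list: the helper name safe_summary is hypothetical. Note that base R's own max(integer(0)) returns -Inf with a warning, so returning NA_integer_ for empty elements is the proposal made in this issue rather than base R semantics.

```r
a <- list(integer(0), 1L, 1:2, 1:3)

# Hypothetical helper: per-element summary that yields NA_integer_ for empty
# elements instead of a sentinel value.
safe_summary <- function(lst, FUN)
    vapply(lst, function(v) if (length(v) == 0L) NA_integer_ else FUN(v),
           integer(1))

max_a <- safe_summary(a, max)
min_a <- safe_summary(a, min)
print(max_a)
print(min_a)

# The values reported above are the extremes of R's 32-bit integer range.
print(.Machine$integer.max)
```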

Session info

R version 3.5.3 RC (2019-03-04 r76215)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Users/luna/Software/R/R-3-5-branch/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-3-5-branch/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IRanges_2.16.0      S4Vectors_0.20.1    BiocGenerics_0.28.0

loaded via a namespace (and not attached):
[1] compiler_3.5.3 tools_3.5.3

Missing overlaps with type='start' or type='end' in findOverlaps

Hi folks 😃
I recently encountered odd behaviour in the findOverlaps function.
I have a single query for which I want to find overlapping genes, stratified by the type of overlap: I want to know whether the query lies within a gene or overlaps its N- or C-terminus. Therefore, I used the type parameter of findOverlaps.
I first checked whether there are any overlaps and found two genes, with my region overlapping the start of one and the end of the other. However, when I run it with type='start' or type='end', I do not get the expected overlaps.
I tried to understand the underlying code, but it calls a C function that I unfortunately cannot follow.

This should be a reproducible example with a subset of the gene coordinates (using IRanges_2.32.0 and R 4.2.1):

library("IRanges")
query <- IRanges(start=638752, end=639346)
subject <- IRanges(start=c(635246, 636459, 637282, 638191, 639306, 
                           640686, 641814, 643721, 643992, 645102, 
                           645499), 
                   end=c(636343, 637097, 638187, 639150, 640670, 641798, 
                         643592, 643978, 644858, 645485, 647505))

for (type.overlap in c('within', 'start', 'end', 'any', 'equal')){
        message(type.overlap)
        overlaps <- findOverlaps(query=query, subject=subject, 
                                 type = type.overlap, select='all')
        print(overlaps)
}

This is the output:

within
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 11
start
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 11
end
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 11
any
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           4
  [2]         1           5
  -------
  queryLength: 1 / subjectLength: 11
equal
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 11

Just to clarify, the subjects 4 and 5 should be returned as overlaps with the query. This is the query:

query
IRanges object with 1 range and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]    638752    639346       595

and these are the expected overlaps:

subject[c(4,5),]
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]    638191    639150       960
  [2]    639306    640670      1365

I hope I am not missing something obvious, but it seems like the function would not return the expected results.
Your help will be much appreciated!
Cheers,
Jakob
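
As far as the findOverlaps() documentation describes it, type='start' and type='end' require the query and subject to have matching start or end coordinates, and type='within' requires the query to lie wholly inside the subject. Under that reading, zero hits for those types would be expected with these coordinates. The base-R sketch below mimics those rules on plain integer vectors (it assumes the default maxgap and does not call IRanges) and then shows one possible way to answer the original question, namely which end of each overlapping subject the query covers. The variable names are hypothetical.

```r
query_start <- 638752L
query_end   <- 639346L
subj_start <- c(635246L, 636459L, 637282L, 638191L, 639306L, 640686L,
                641814L, 643721L, 643992L, 645102L, 645499L)
subj_end   <- c(636343L, 637097L, 638187L, 639150L, 640670L, 641798L,
                643592L, 643978L, 644858L, 645485L, 647505L)

# type = "any": the closed intervals share at least one position
any_hit <- which(subj_start <= query_end & subj_end >= query_start)

# type = "start" / "end" as documented: the coordinates must match
start_hit <- which(subj_start == query_start)
end_hit   <- which(subj_end == query_end)

# What the question asks for: among the overlapping subjects, which ones
# have their first or last position covered by the query?
covers_subject_start <- any_hit[subj_start[any_hit] >= query_start &
                                subj_start[any_hit] <= query_end]
covers_subject_end   <- any_hit[subj_end[any_hit] >= query_start &
                                subj_end[any_hit] <= query_end]

print(any_hit)
print(covers_subject_start)
print(covers_subject_end)
```

With these coordinates the query covers the start of subject 5 and the end of subject 4, while no subject shares its exact start or end with the query.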

version bump required

Since the beginning of January, the version number of the IRanges package has not been increased. As a result, the latest changes are not propagated to the SPB, which leads to a build error during the submission of the package Structstrings.

Would it be possible to bump the version number, so that an update of the IRanges package is triggered on the SPB?

Thanks for any help in advance.

Fix for slidingIRanges function

See the Bioconductor support thread at: https://support.bioconductor.org/p/9140415
I have checked the fix proposed by story.benjamin and it corrects the behaviour of the function, so I would like to propose it here.

Corrected function below (it changes seq(1L, len - width, by = shift) to seq(1L, len - width + 1, by = shift)):

slidingIRanges <- function (len, width, shift = 1L)  {
  start <- seq(1L, len - width + 1 , by = shift)
  end <- seq(width, len, by = shift)
  IRanges(start, end)
}
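
The off-by-one can be seen without IRanges by comparing the two start sequences against the end sequence. This is a minimal sketch; the helper name window_bounds and its fixed argument are hypothetical and exist only to switch between the old and the corrected upper bound.

```r
# Hypothetical helper: compute window starts/ends with the old or the
# corrected upper bound for the start sequence.
window_bounds <- function(len, width, shift = 1L, fixed = TRUE)
{
    last_start <- if (fixed) len - width + 1L else len - width
    list(start = seq(1L, last_start, by = shift),
         end   = seq(width, len, by = shift))
}

old <- window_bounds(10L, 4L, fixed = FALSE)
new <- window_bounds(10L, 4L, fixed = TRUE)

print(lengths(old))               # starts and ends differ in length
print(lengths(new))               # one start per end
print(new$end - new$start + 1L)   # every window has the requested width
```

With the old bound there is one fewer start than end, so the last window is missing; with the corrected bound each start pairs with an end.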

unname does not work for SplitDFLists

library(IRanges)
df <- DataFrame(X=1:10)
out <- split(df, factor(1:10))
unname(out)
## Error in `dimnames<-`(`*tmp*`, value = NULL) : 
##   replacement value must be a list

This is probably because it has non-NULL dimnames(), so unname gets confused. An easy solution would be to turn unname into a generic and then just remove the names. If force=TRUE, you could also strip the underlying dimnames in the DFrames.

`IRanges()` in devel doesn't work if `start` is a matrix

Hi @hpages, is this an intentional change?

In devel

suppressPackageStartupMessages(library(IRanges))
start <- end <- matrix(c(3L, 2L, 6L, 5L, 9L, 8L, 12L, 11L), nrow = 2)
# Fails
IRanges(start, end)
#> Error in validObject(.Object): invalid class "IRanges" object: invalid object for slot "start" in class "IRanges": got class "matrix", should be or extend class "integer"
# Works
IRanges(as.integer(start), end)
#> IRanges object with 8 ranges and 0 metadata columns:
#>           start       end     width
#>       <integer> <integer> <integer>
#>   [1]         3         3         1
#>   [2]         2         2         1
#>   [3]         6         6         1
#>   [4]         5         5         1
#>   [5]         9         9         1
#>   [6]         8         8         1
#>   [7]        12        12         1
#>   [8]        11        11         1

Created on 2021-04-22 by the reprex package (v2.0.0)

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                                             
#>  version  R Under development (unstable) (2021-03-29 r80130)
#>  os       macOS Catalina 10.15.7                            
#>  system   x86_64, darwin17.0                                
#>  ui       X11                                               
#>  language (EN)                                              
#>  collate  en_AU.UTF-8                                       
#>  ctype    en_AU.UTF-8                                       
#>  tz       Australia/Melbourne                               
#>  date     2021-04-22                                        
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date       lib source        
#>  backports      1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
#>  BiocGenerics * 0.37.2  2021-04-16 [1] Bioconductor  
#>  cli            2.4.0   2021-04-05 [1] CRAN (R 4.1.0)
#>  crayon         1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
#>  digest         0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
#>  ellipsis       0.3.1   2020-05-15 [1] CRAN (R 4.1.0)
#>  evaluate       0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi          0.4.2   2021-01-15 [1] CRAN (R 4.1.0)
#>  fs             1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
#>  glue           1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools      0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
#>  IRanges      * 2.25.9  2021-04-16 [1] Bioconductor  
#>  knitr          1.32    2021-04-14 [1] CRAN (R 4.1.0)
#>  lifecycle      1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
#>  magrittr       2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  pillar         1.6.0   2021-04-13 [1] CRAN (R 4.1.0)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  reprex         2.0.0   2021-04-02 [1] CRAN (R 4.1.0)
#>  rlang          0.4.10  2020-12-30 [1] CRAN (R 4.1.0)
#>  rmarkdown      2.7     2021-02-19 [1] CRAN (R 4.1.0)
#>  S4Vectors    * 0.29.15 2021-04-07 [1] Bioconductor  
#>  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi        1.5.3   2020-09-09 [1] CRAN (R 4.1.0)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler         1.4.1   2021-03-30 [1] CRAN (R 4.1.0)
#>  tibble         3.1.1   2021-04-18 [1] CRAN (R 4.1.0)
#>  utf8           1.2.1   2021-03-12 [1] CRAN (R 4.1.0)
#>  vctrs          0.3.7   2021-03-29 [1] CRAN (R 4.1.0)
#>  withr          2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun           0.22    2021-03-11 [1] CRAN (R 4.1.0)
#>  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

The above code works in release.
I think this is the root cause of some failing tests for DelayedMatrixStats in devel (debugging in PeteHaitch/DelayedMatrixStats#81).

Recent change to unsplit() method for List objects breaks many packages

@lawremi @sanchit-saini Commit 85fc802 breaks 42 software packages with a "shape of 'skeleton' is not compatible with 'NROW(flesh)'" error on all platforms today: https://bioconductor.org/checkResults/3.13/bioc-LATEST/

List of affected packages:

AnnotationHub
ATACseqQC
biscuiteer
BSgenome
BUSpaRse
chimeraviz
ChIPanalyser
ChIPpeakAnno
CNVfilteR
CNVRanger
CopyNumberPlots
DEScan2
GA4GHclient
GenomeInfoDb
GenomicFeatures
GenomicFiles
GenomicScores
genotypeeval
ggbio
icetea
karyoploteR
MutationalPatterns
myvariant
ORFik
proActiv
profileplyr
PureCN
RCAS
recount
regioneR
ribosomeProfilingQC
scmeth
SCOPE
seqsetvis
SGSeq
SomaticSignatures
StructuralVariantAnnotation
TCGAutils
TitanCNA
trackViewer
VariantAnnotation
VariantFiltering

Simple reproducible example:

library(IRanges)
DF <- DataFrame(chrom=letters[1:3], genome=NA_character_)
f <- factor(DF$genome, exclude=NULL)
unsplit(split(DF, f), f)
# Error in relist(flesh, PartitioningByEnd(skeleton)) : 
#   shape of 'skeleton' is not compatible with 'NROW(flesh)'

Please be more careful when you make changes to core functionalities this close to a release. We (core team) had to disable the automatic failure notifications this morning to avoid spamming people with many false positives. We'll re-enable them when this is fixed so please fix ASAP. Thanks!

Unfortunately indecipherable C error on install

Hi all,

I'm an R coder and have no experience with C. When I try to install IRanges, I receive this error (no matter the version):

In file included from CompressedAtomicList_utils.c:5:
IRanges.h:100:15: error: unknown type name ‘IntPairAE’
  100 | const IntPairAE *intpair_ae
      | ^~~~~~~~~
IRanges.h:105:15: error: unknown type name ‘IntPairAEAE’
  105 | const IntPairAEAE *intpair_aeae
      | ^~~~~~~~~~~
IRanges.h:233:1: error: unknown type name ‘Ints_holder’; did you mean ‘IRanges_holder’?
  233 | Ints_holder _get_elt_from_CompressedIntsList_holder(
      | ^~~~~~~~~~~
      | IRanges_holder

Does anyone have any insight here?

Cannot install IRanges with BiocManager::install("IRanges")

I am having trouble with the latest version of IRanges (2.24.0). I could not create IRanges objects, so I uninstalled the package. But now, whenever I try to install it again, I get the following error:

Error in reconcilePropertiesAndPrototype(name, slots, prototype, superClasses, :
no definition was found for superclass “DataTable” in the specification of class “RangedData”
Error: unable to load R code in package ‘IRanges’
Execution halted

I do not know what this error means, could you help me?

Thanks,
Raphael

propagate mcols() through setOps where possible

Just an idea -- where possible, propagate mcols().

For example,

mcols(c(a,b)) = rbind(mcols(a), mcols(b))

(IRanges already does this)

mcols(intersect(a,b))[i, ] = cbind(mcols(a)[j, ], mcols(b)[k, ])

where j is the row in a that gave rise to row i in the intersect and k is the same for b.

mcols(setdiff(a, b))[i, ] = mcols(a)[j, ]

where j is the row in a that gave rise to row i in the setdiff.

I don't think this idea applies to union, since a single output row doesn't necessarily map back to a single row in each input IRanges.

I think similar logic could apply to findOverlaps methods.

`grepl` isn't working correctly for `CharacterList` with `ignore.case` and `fixed` enabled

Hi Bioconductor team,

I noticed an issue with grepl matching against a CharacterList object:

library(IRanges)
pattern <- "a"
x <- CharacterList(list(c("A", "a"), c("B", "b")))
print(x)
## CharacterList of length 2
## [[1]] A a
## [[2]] B b

This works as expected:

grepl(pattern = pattern, x = x, ignore.case = FALSE, fixed = FALSE)
## LogicalList of length 2
## [[1]] FALSE TRUE
## [[2]] FALSE FALSE

This works as expected:

grepl(pattern = pattern, x = x, ignore.case = FALSE, fixed = TRUE)
## LogicalList of length 2
## [[1]] FALSE TRUE
## [[2]] FALSE FALSE

This works as expected:

grepl(pattern = pattern, x = x, ignore.case = TRUE, fixed = FALSE)
## LogicalList of length 2
## [[1]] TRUE TRUE
## [[2]] FALSE FALSE

This doesn't work as expected: it warns and returns incorrect values. It should return [[1]] TRUE TRUE.

grepl(pattern = pattern, x = x, ignore.case = TRUE, fixed = TRUE)
## Warning in grepl(pattern, unlist(x, use.names = FALSE), ignore.case, perl,  :
##   argument 'ignore.case = TRUE' will be ignored
## Calls: grepl -> grepl -> relist -> grepl -> grepl
## LogicalList of length 2
## [[1]] FALSE TRUE
## [[2]] FALSE FALSE
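
The warning text is the one base R's grepl() emits, since base R ignores ignore.case when fixed = TRUE. A possible workaround, sketched here on an ordinary list in base R, is to fold case on both the pattern and the input and then match literally. This is a suggestion under that assumption, not a change to the CharacterList method.

```r
pattern <- "a"
x <- list(c("A", "a"), c("B", "b"))

# Base R itself drops ignore.case when fixed = TRUE (with a warning).
base_res <- suppressWarnings(
    grepl(pattern, x[[1]], ignore.case = TRUE, fixed = TRUE)
)

# Workaround: lower-case both sides, then do a literal match.
folded <- lapply(x, function(v)
    grepl(tolower(pattern), tolower(v), fixed = TRUE))

print(base_res)
print(folded)
```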

Best,
Mike

mergeByOverlaps, findOverlapPairs etc don't work with only one range

In the section findOverlaps-methods, the reference manual says:

If subject is omitted, query is queried against itself.

However, these functions fail unless both query and subject are provided:

require('IRanges')
#> Loading required package: IRanges
ran = IRanges(start=c(1, 2, 5, 6, 10), width=1)
findOverlapPairs(query=ran)
#> Error in findOverlaps(query, subject, ...): argument "subject" is missing, with no default

Created on 2021-03-01 by the reprex package (v1.0.0)

I suspect this is the case for the whole family of findOverlaps-methods in the reference manual.

This is using:

IRanges_2.20.2
R 3.6.1

Run Length Encoding vignette section is out of place

This section starts by saying "Up until this point we have used R atomic vectors to represent atomic sequences" but in fact the previous sections have been using Rle.

IRanges/vignettes/IRangesOverview.Rnw

Lines 207 to 211 in f5c2890

 \subsubsection{Run Length Encoding}

 Up until this point we have used \R{} atomic vectors to represent
 atomic sequences, but there are times when these object become too
 large to manage in memory. When there are lots of consecutive repeats in

and the code:

> xRle <- Rle(xVector)
> as.vector(object.size(xRle) / object.size(xVector))

returns 1 because xVector was created from the beginning as an Rle.
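
The effect can be illustrated with base R's rle() standing in for Rle (an assumption made so the sketch runs without S4Vectors). The size ratio is only informative when the denominator is a plain atomic vector; comparing an encoded object with itself trivially gives 1, which is what the vignette code ends up doing.

```r
# A plain integer vector with long runs, and its run-length encoding.
x_vec <- rep(c(1L, 2L, 3L), each = 10000L)
x_rle <- rle(x_vec)

# Meaningful comparison: encoded size relative to the plain vector.
ratio_plain <- as.vector(object.size(x_rle) / object.size(x_vec))

# Comparing an already encoded object with itself gives no information.
ratio_same <- as.vector(object.size(x_rle) / object.size(x_rle))

print(ratio_plain)
print(ratio_same)
```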
