nsheff / lola Goto Github PK

Locus Overlap Analysis: Enrichment of Genomic Ranges

R 86.49% HTML 8.36% CSS 5.16%

lola's Introduction

LOLA: Genomic Locus Overlap Enrichment Analysis

LOLA is an R package providing functions for testing overlap of sets of genomic regions with public and custom databases. You can think of it as testing your bed file (genome regions of interest) against a database of other bed files (regions from various previous studies) to look for enrichment of overlaps. This enables you to draw connections between newly generated data, and the growing public databases, leading to new hypotheses and annotation sharing.

This README provides a package overview, motivation, and installation instructions. For detailed documentation of functions and additional examples, please see the R documentation.

Installing LOLA

The release version of LOLA can be installed directly from Bioconductor:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("LOLA")

To install the development version directly from github, make sure you have GenomicRanges (bioconductor package) installed, then install LOLA with devtools:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")

devtools::install_github("nsheff/LOLA")

Or, clone the repo and install from there:

install.packages("path/to/LOLA", repos=NULL)

Running LOLA

For examples and workflows for LOLA, please check out the R vignettes.

LOLA Databases

Downloads of core databases and instructions for building your own databases are at databio.org/regiondb.

lola's People

Contributors

Stargazers

Watchers

Forkers

fhalbritter gitter-badger xtmgah aiminy ktrns demuellae j-lawson jf2016gh chenxifreidrich gyd1990 ddpinto 0x6b7966 joseverdezoto xuyingjie-z ben-laufer shaoyoucheng wook2014

lola's Issues

Combining LOLA Core and Extended

Hi, is it possible to combine LOLA Core and Extended in a single run? If so, what would be the best way to do so?

Thank you,
Rajat

OddsRatio or logOddsRatio ?

Hi,

I have a question. fisher.test returns the odds ratio ( estimate ). I could not find in the code that the odds ratio is been converted to logOddsRatio. Does the runLOLA outputs odds ratio or logOddsRatio ?

runLOLA results table for PEP databases

As of now, the runLOLA() results table displays columns defined in the index.txt file (e.g tissue, antibody). If the PEP sample annotation sheet doesn't contain those fields, the overlaps and fisher scores can be calculated but the function will crash when writing the results table.

Perhaps we could make this aspect of runLOLA() more flexible so that it accepts any fields the user provides in the PEP sample_annotation.csv file.

LOLA/R/calcLocEnrichment.R

Lines 180 to 185 in 6d5111a

 orderedCols = c("userSet", "dbSet", "collection", "pValueLog", "oddsRatio", 

 "support", "rnkPV", "rnkOR", "rnkSup", "maxRnk", "meanRnk", "b", "c", "d", 

 "description", "cellType", "tissue", "antibody", "treatment", "dataSource", "filename") 

 unorderedCols = setdiff(colnames(scoreTable), orderedCols) 

 setcolorder(scoreTable, c(orderedCols, unorderedCols))

LOLA runLOLA-regionDB to use for tissue-specific TFBS

I would like to test for enrichment of skeletal muscle TFBS in my differentially methylated regions (DMRs).

I have a Granges object of my DMRs.
I have a Granges object of all of my tested CpGs (is this correct for the userUniverse?)
And for the regionDB I have come up with two options:

Load Cache of TFBS from Lola, but is there a way to select for skeletal muscle? (my Granges are annotated to hg38) In any case, haven't been able to figure out how to load the TFBS in the Cache, just the "extdata" example from the vignette, what are the options for "extdata" in -- dbPath = system.file("extdata", "hg19", package="LOLA")
Create my own- I have 4 bed files that I have downloaded from Unibind, is this the best way to achieve what I am trying to do? (I have done this but am not sure about the results, see screenshot below, also the FDR function from qvalue gives an error when runLOLA)
Also-is there a way to filter out all of the TFBS annotated to the X and Y chromosomes from the 4 bed files at once?

Please advise if I am using this package correctly and if the results look normal.

Also I am concerned that using this method doesn't account for the multiple probe and gene bias issues in the EPIC array (see https://www.biorxiv.org/content/10.1101/2020.08.24.265702v1.abstract). But that is a whole other thing.

Lastly, what is the code to analyze for enrichment of all encode TFBS? I can't seem to figure out the input.
Thanks in advance!

Merging database region-sets

Is there a simple programmatic way to merge two regionsets from regionDB into a new comboned set within regionDB?

This workaround I can think of seems complicated:

extract the regions and merge them, save as new bed file
create custom database with new bed file
import the new database and merge with exisitng

For example, if I want enrichment for regions that are DNAse Weak OR UCSC CpG islands, would have to
extract the two sets from redionDM (or bed files) and create a new bed file, etc.

Seems like there should be a simpler way?

PEP tag system

Instead of collections, loading a LOLA PEP should allow a sample attribute value filter.

it's more flexible because it can filter out files from multiple collections.

Histone marks collections

Are you accepting feature requests?

If yes, I think it'd be great to have collections of histone marks from ENCODE and Roadmap Epigenomics, similar to encode_tfbs, cistrome_cistrome or cistrome_epigenome.

Thanks!

Command line interface

If feasible, please provide a basic command line interface for LOLA so that the tool can be integrated into computational pipelines w/o the need to write wrapper scripts.

cistrome_mm10

Database downloads include cistrome files for hg19 and hg38 but not for mm10. The database downloads page on databio site describes a LOLACore for mm10 including Cistrome, but when I downloaded the region core database, cistrome is missing from mm10. Is this a future endeavor? Do you plan on including cistrome as part of LOLAcore for mm10 soon?

In the meantime I have contacted Cistrome directly to get access to the files for mm10 and plan to create a custom database to run with LOLA. Hope to hear about future plans with LOLACore for mm10!

testing for enrichment of significant individual cpgs

Would it be possible to use LOLA to test for enrichment of significant individual CpGs identified in DNA methylation study? we can make a .bed file for each sig CpG, would that be ok to use as input for LOLA? Thanks!

Can't load LOLAExt with loadRegionDB()

Hello,

Thanks for LOLA! It looks like a great resource.

I get this error when trying to load the latest LOLAExt cached database:

> regionDB2 <- loadRegionDB("LOLAExt/hg19")
Reading collection annotations:
Error in readCollectionAnnotation(dbLocation, collections) :
  No regions found. Are you sure '../../data/DNase-seq/12LOLA/LOLAExt/hg19' is a region database?
In addition: Warning messages:
1: In readCollectionAnnotation(dbLocation, collections) :
  jaspar_motifs has no collection annotation or region files. Skipping...
2: In readCollectionAnnotation(dbLocation, collections) :
  roadmap_epigenomics has no collection annotation or region files. Skipping...

LOLACore loads fine.

Thanks,
Enrico

readBed is not consistent with import.bed

Hi Nathan,

I installed LOLA 1.4.0 in R 3.3.2 and Bioconductor 3.4 by biocLite("LOLA"). It seems that readBed() and loadRegionDB() would create GRanges without adding 1 to the left sites in the bed files, leading to not so right 1-based GRanges which would latter give out not so right results.

library("LOLA")
bed <- system.file("extdata", "hg19", "ucsc_example", "regions", "cpgIslandExt.bed", package="LOLA")
cat(readLines(bed, n = 5), sep = "\n")
## chr1    28735   29810
## chr1    135124  135563
## chr1    327790  328229
## chr1    437151  438164
## chr1    449273  450544
readBed(bed)
##  GRanges object with 28691 ranges and 0 metadata columns:
##        seqnames               ranges strand
##           <Rle>            <IRanges>  <Rle>
##      1     chr1     [ 28735,  29810]      *
##      2     chr1     [135124, 135563]      *
##      3     chr1     [327790, 328229]      *
##      4     chr1     [437151, 438164]      *
##      5     chr1     [449273, 450544]      *
##    ...      ...                  ...    ...
##  28687     chrY [27610115, 27611088]      *
##  28688     chrY [28555535, 28555932]      *
##  28689     chrY [28773315, 28773544]      *
##  28690     chrY [59213794, 59214183]      *
##  28691     chrY [59349266, 59349574]      *
##  -------
##  seqinfo: 69 sequences from an unspecified genome; no seqlengths
library("rtracklayer")
import(bed)
##  GRanges object with 28691 ranges and 0 metadata columns:
##          seqnames               ranges strand
##             <Rle>            <IRanges>  <Rle>
##      [1]     chr1     [ 28736,  29810]      *
##      [2]     chr1     [135125, 135563]      *
##      [3]     chr1     [327791, 328229]      *
##      [4]     chr1     [437152, 438164]      *
##      [5]     chr1     [449274, 450544]      *
##      ...      ...                  ...    ...
##  [28687]     chrY [27610116, 27611088]      *
##  [28688]     chrY [28555536, 28555932]      *
##  [28689]     chrY [28773316, 28773544]      *
##  [28690]     chrY [59213795, 59214183]      *
##  [28691]     chrY [59349267, 59349574]      *
##  -------
##  seqinfo: 69 sequences from an unspecified genome; no seqlengths

I find that runLOLA() invokes countOverlaps(). However, countOverlaps() is based on 1-based closed interval. So readBed() would in fact add one more base on the left side of all the regions. Even it would have small impact on the result, it maybe better to be fixed. :)

Inclusion of mm10 JASAPR prediction track into LOLA JASPAR

Would it be possible to integrate the recently published JSAPR binding prediction for mm10 into Lola Jaspar?
See http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/mm10/ for the data ,http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/ for the method details, and also http://jaspar.genereg.net/genome-tracks/ for a general overview of the new track feature of JASPAR.

Since it's already a region dataset (bed-file) a conversion into the LOLA db format may be straightforward.

Can't load an example regionDB.

No problem with the path.

regionDB = loadRegionDB(dbLocation=dbPath)
Reading collection annotations:
ucsc_example: found collection annotation:C:/Program Files/R/R-3.6.0/library/LOLA/extdata/hg19/ucsc_example/collection.txt
Reading region annotations...
C:/Program Files/R/R-3.6.0/library/LOLA/extdata/hg19//ucsc_example/regions
In 'ucsc_example', found index file:C:/Program Files/R/R-3.6.0/library/LOLA/extdata/hg19/ucsc_example/index.txt

Collection: ucsc_example. Creating size file...
Error in system(paste("wc -l ", filename), intern = TRUE) :
'wc' not found

In addition: Warning message:
In readRegionSetAnnotation(dbLocation, collections) :
You don't have simpleCache installed, so you won't be able to cache the
regionDB after reading it in. Install simpleCache to speed up later database
loading.

Each of the 2 combined objects has sequence levels not in the other:

Hi there,
I'm really appreciate your brilliant tool for enrichment analysis.
But when I try to create my own reference dataset, I got the warnings:

Warning messages:

1: In dir.create(cacheDir, recursive = TRUE) :
  'D:\APP\R-4.0.3\library\LOLA\mydata\danrer11\my_dataset' already exists
2: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrUn_KN149687v1, chrUn_KN149729v1, chrUn_KN149738v1, chrUn_KN149739v1, chrUn_KN149743v1, chrUn_KN149745v1, chrUn_KN149779v1, chrUn_KN149794v1, chrUn_KN149813v1, chrUn_KN149822v1, chrUn_KN149857v1, chrUn_KN149915v1, chrUn_KN149926v1, chrUn_KN149950v1, chrUn_KN149955v1, chrUn_KN149967v1, chrUn_KN149997v1, chrUn_KN150000v1, chrUn_KN150002v1, chrUn_KN150037v1, chrUn_KN150039v1, chrUn_KN150070v1, chrUn_KN150086v1, chrUn_KN150100v1, chrUn_KN150110v1, chrUn_KN150112v1, chrUn_KN150152v1, chrUn_KN150165v1, chrUn_KN150185v1, chrUn_KN150218v1, chrUn_KN150254v1, chrUn_KN150255v2, chrUn_KN150286v1, chrUn_KN150289v1, chrUn_KN150330v1, chrUn_KN150352v2, chrUn_KN150354v1, chrUn_KN150362v1, chrUn_KN150363v1, chrUn_KN150380v1, chrUn_KN150383v1, chrUn_KN150421v1, chrUn_KN150430v1, chrUn_KN150481v1, chrUn_KN150485v1, chrUn_KN150504v1, chrUn_KN150523v1, chrUn_KN150544v1, chrUn_KN150550v1, chrUn_KN150562v1, chrUn_KN150575v1, c [... truncated]
3: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrUn_KN149687v1, chrUn_KN149738v1, chrUn_KN149745v1, chrUn_KN149747v1, chrUn_KN149811v1, chrUn_KN149841v1, chrUn_KN149857v1, chrUn_KN149861v1, chrUn_KN149874v1, chrUn_KN149915v1, chrUn_KN149916v1, chrUn_KN149926v1, chrUn_KN149950v1, chrUn_KN150008v1, chrUn_KN150037v1, chrUn_KN150039v1, chrUn_KN150065v1, chrUn_KN150086v1, chrUn_KN150100v1, chrUn_KN150136v1, chrUn_KN150141v1, chrUn_KN150152v1, chrUn_KN150213v1, chrUn_KN150218v1, chrUn_KN150228v1, chrUn_KN150324v1, chrUn_KN150352v2, chrUn_KN150354v1, chrUn_KN150380v1, chrUn_KN150411v1, chrUn_KN150421v1, chrUn_KN150430v1, chrUn_KN150439v1, chrUn_KN150481v1, chrUn_KN150485v1, chrUn_KN150504v1, chrUn_KN150544v1, chrUn_KN150623v1, chrUn_KZ115963v1, chrUn_KZ115976v1, chrUn_KZ115993v1, chrUn_KZ116046v1, chrUn_KZ116047v1, chrUn_KZ116053v1, chrUn_KN148869v2, chrUn_KN149722v1, chrUn_KN149730v1, chrUn_KN149755v1, chrUn_KN149762v1, chrUn_KN149765v1, chrUn_KN149768v1, c [... truncated]
4: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrUn_KN149687v1, chrUn_KN149729v1, chrUn_KN149741v1, chrUn_KN149745v1, chrUn_KN149747v1, chrUn_KN149760v1, chrUn_KN149803v1, chrUn_KN149810v1, chrUn_KN149811v1, chrUn_KN149841v1, chrUn_KN149855v1, chrUn_KN149857v1, chrUn_KN149861v1, chrUn_KN149874v1, chrUn_KN149915v1, chrUn_KN149916v1, chrUn_KN149926v1, chrUn_KN149950v1, chrUn_KN149997v1, chrUn_KN150008v1, chrUn_KN150037v1, chrUn_KN150039v1, chrUn_KN150065v1, chrUn_KN150079v1, chrUn_KN150086v1, chrUn_KN150112v1, chrUn_KN150185v1, chrUn_KN150213v1, chrUn_KN150228v1, chrUn_KN150342v1, chrUn_KN150351v1, chrUn_KN150352v2, chrUn_KN150354v1, chrUn_KN150366v1, chrUn_KN150375v1, chrUn_KN150399v1, chrUn_KN150411v1, chrUn_KN150481v1, chrUn_KN150485v1, chrUn_KN150504v1, chrUn_KN150544v1, chrUn_KN150575v1, chrUn_KN150602v1, chrUn_KN150614v1, chrUn_KN150623v1, chrUn_KN150647v1, chrUn_KZ115963v1, chrUn_KZ115982v1, chrUn_KZ116008v1, chrUn_KZ116024v1, chrUn_KZ116046v1, c [... truncated]
5: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrUn_KN147632v2, chrUn_KN147636v1, chrUn_KN147652v2, chrUn_KN149683v1, chrUn_KN149685v1, chrUn_KN149687v1, chrUn_KN149688v2, chrUn_KN149689v2, chrUn_KN149691v1, chrUn_KN149694v1, chrUn_KN149698v1, chrUn_KN149702v1, chrUn_KN149705v1, chrUn_KN149716v1, chrUn_KN149717v1, chrUn_KN149719v1, chrUn_KN149723v1, chrUn_KN149725v1, chrUn_KN149741v1, chrUn_KN149743v1, chrUn_KN149745v1, chrUn_KN149747v1, chrUn_KN149749v1, chrUn_KN149760v1, chrUn_KN149778v1, chrUn_KN149794v1, chrUn_KN149795v1, chrUn_KN149803v1, chrUn_KN149805v1, chrUn_KN149810v1, chrUn_KN149811v1, chrUn_KN149812v1, chrUn_KN149813v1, chrUn_KN149841v1, chrUn_KN149852v1, chrUn_KN149856v1, chrUn_KN149861v1, chrUn_KN149884v1, chrUn_KN149893v1, chrUn_KN149904v1, chrUn_KN149912v1, chrUn_KN149915v1, chrUn_KN149939v1, chrUn_KN149946v1, chrUn_KN149949v1, chrUn_KN149950v1, chrUn_KN149967v1, chrUn_KN149968v1, chrUn_KN149973v1, chrUn_KN149998v1, chrUn_KN150000v1, c [... truncated]

I have eight reference datasets under the regions/ directory, all download from UCSC in bed format. It seems like these bed files doesn't have the same chrmosome numbers? But the eight files should be in different. I have no idea for these warnings.
Thank you so much for your answer!

Best
Ariel

PEP file read caching

We should make sure the PEP loading function has the same caching capability the current loadRegionDB function has:

https://github.com/joseverdezoto/LOLA/blob/127efbb6b273a526d64dd10a3a71460012288653/R/loadRegionDB.R#L288-L299

Different results between LOLAweb interactive server and local R package

I got different results from LOLAweb interactive server and LOLA installed from bioconductor with the same input files( region set, universe, and core database). What's the difference between web-based LOLA and LOLA R package?

The readBed function does not transform from 0-based to 1-based coordinates

Hello, I think I found a bug in the readBed function. I used it to read a .bed file with differentially methylated CpG sites. LOLA complained that the ranges are not disjoined. This was strange as each CpG range is 2 nucleotides long and they can't possibly be overlapping. I inspected the GRanges object and saw that the ranges are 3 nucleotides long. There were a few instances in which there were neighbouring CpGs and their ranges overlapped, e.g. 34567-34569 and 34569-34571, matching the ranges in the .bed file. As GRanges coordinates are 1-based, the coordinates should have been transformed to 34568-34569 and 34570-34571. I found a workaround for my problem by creating and using a non-standard .bed file with 1-based coordinates but I think the readBed function should be made to take into account the difference between coordinate systems and transform coordinates correctly.

Database download from "latest databases" doesnt work - 403 Forbidden Error

Hi,
When trying to download the latest databases from here as instructed here the following error and page is shown:
403 Forbidden
nginx/1.13.12

Thanks for your help!

Fisher test - two sided hypothesis

Hi,
Many thanks for this extremely useful package.
I have two questions about Fisher's test and output data frame.

Firstly, I would like to ask if is it possible to implement two sided alternative in runLOLA() function (as shown below)?
```
if (direction == "depletion") {
fisherAlternative = "less"
} 
else if (direction == "two.sided") {
    fisherAlternative = "two.sided"
    cat("Running fisher test with alternative=two.sided")
}
else {
fisherAlternative = "greater"
}
```
This topic was already mentioned here https://github.com/nsheff/LOLA/issues/20, however this feature is still not available.
Another question is about the presence of raw p-value within the output data frame, this kind of information may be useful for example to use custom correction for multiple comparisons problem.

Thanks in advance.

clarify checkUniverseAppropriateness

The checkUniverseAppropriateness is a bit confusing, with no real recommendations. I need to clean it up a little bit to make it more universal and useful, and formalize some of the concepts of an appropriate universe.

In addition, I need to fix the current example, which might (gasp!) use an inappropriate universe.

Re-order result columns

Suggestion: move description column further to the left, since this is sometimes the only information about the comparison.

Negative c entry in table. Bug with userSetsLength; this should not happen.

I created a custom LOLA database using the instructions in the README.md file.
The database loads fine and is cached without problems.

However, when I run runLOLA() I get this warning message:

2: In runLOLA(userSets, userUniverse, lolaCustom, cores = detectCores()) :
  Negative c entry in table. Bug with userSetsLength; this should not happen.

and indeed quite a few of my c values are negative.
The data.table returned only has the following columns: userSet, dbSet, support, b,c and d.

How can I fix this?

The same userSets and userUniverse work perfectly fine with LOLACore and LOLAExt.

This is how I create my userUniverse:

sheffield_dnase <- unlist(lolaCore$regionGRL[which(lolaCore$regionAnno$collection == "sheffield_dnase")])
userUniverse <- sort(disjoin(c(sheffield_dnase, unlist(userSets))))

This is my userSetsLength:


          first second           third fourth           fifth sixth           seventh
              518                 1               483                 3               280                 4               245
eighth           ninth tenth           eleventh
               51              1450                60               812

Thanks for your help!
Enrico

P4 tool not grouping according to the paper!!

Hi Nathan,

I would like to highlight that in your paper named "Nonstationary Evolution and Compositional Heterogeneity in Beetle
Mitochondrial Phylogenomics" you had plotted p4 results saying it groups according to the AT%. When I tried to run p4 with the dataset which you had included in the supplementary, It didn't group and the result obtained was different when I plotted for the same. It would be really helpful if you can give me a clarification. Maybe I missed something unlike you while p4 run. Could you please give a look and respond.

Make refreshSizes an option in readRegionSetAnnotation

Currently, refreshing the size file causes an error if you don't have write access to the directory containing the collection.

Thanks

reading zipped bed files

Christoph made a suggestion that LOLA could understand a zipped archive of bed files, in addition to the current custom database parsing.

Add in missing cellType/antibody entries for encodeTFBSmm10

Hi there

I noticed that a handful of the entries for encodeTFBSmm10 have NA in the cellType and antibody annotations. Fortunately, this data seems to be available here, and I've enclosed a code snippet that adds the missing entries to lola.db$regionAnno to make things easier for you. (Right now I'm using the below code as a workaround for myself.)

Thanks for making and maintaining a very useful tool!

lola.db <- loadRegionDB('LOLA/LOLACore/mm10') #change to wherever LOLACore is downloaded

# Fix missing cell/antibody entries in LOLA

encode.meta <- read.table('https://raw.githubusercontent.com/theaidenlab/juicebox/master/src/juicebox/encode/encode.mm9.txt',
                          sep='\t',
                          header=TRUE)

encode.meta$path <- sub('.*\\/', '', encode.meta$path)
encode.meta$path <- sub('.gz$', '', encode.meta$path)

rownames(encode.meta) <- encode.meta$path

lola.missing <- which(lola.db$regionAnno$collection=='encodeTFBSmm10'&is.na(lola.db$regionAnno$cellType))

missing.files <- lola.db$regionAnno$filename[lola.missing]


lola.db$regionAnno[lola.missing, c('cellType', 'antibody')] <- 
    encode.meta[missing.files, c('cell', 'antibody')]

gettingStarted.Rmd

gettingStarted.Rmd

regionDB = loadRegionDB(dbPath)
Found collections: ucsc_example
In 'ucsc_example', found index file:/home/fhalbritter/R/x86_64-unknown-linux-gnu-library/3.1/LOLA/extdata/hg19/ucsc_example/index.txt
Found collections: ucsc_example
In 'ucsc_example', no collection file.
ucsc_example
::Loading cache:: /home/fhalbritter/R/x86_64-unknown-linux-gnu-library/3.1/LOLA/extdata/hg19/ucsc_example/ucsc_example.RData

Error if GRangesList has names

If the GRangesList that is given to runLOLA as part of the regionSetDB has names for each region set, an error results. This can be fixed by setting the names to NULL to trigger the code chunk referenced below. Suggested fix: make the conditional code chunk no longer conditional.

LOLA/R/calcLocEnrichment.R

Line 60 in 8262ff5

if (any(is.null(names(testSetsGRL)))) {

0-based bed files in database are not properly translated to 1-based regions for LOLA

Dear Nathan,

I started using LOLA a few days ago and am coming across an issue for which I hope you can help. When I read my database with loadRegionDB containing bed files, the conversion of 0-based bed files to 1-based regions in LOLA doesn't quite work. Hence I am getting a few overlaps between my userSets and the database that aren't correct.

For example, my bed file contains the following lines:
scaffold_19 23641645 23642149
scaffold_13 24500587 24501311
scaffold_0 23378037 23378626

After loading the database, this looks like:
seqnames ranges strand

1 scaffold_19 [23641645, 23642149] *
2 scaffold_13 [24500587, 24501311] *
3 scaffold_0 [23378037, 23378626] *

... when actually the left coordinate should be +1.

Reading my userSet and userUniverse with makeGRangesFromDataFrame(df=userSet, starts.in.df.are.0based=TRUE) works.

I am loading the latest LOLA version that I just downloaded from github, since I saw you were working on this issue 20 days ago. Can you comment? Am I doing anything wrong?

Thank you in advance!

two-sided fisher test

I was thinking it would be nice to have the option to choose which type of fisher's test to use for the enrichment, unless there is some reason why you would only check for positive enrichment?

For my work, it would be informative to know not only which regions are enriched but also which are depleted.

It is a simple change in the code to allow an option for the fisher test. For example, instead of hard-coding alternative="greater" you could include a variable fisher=c("greater","less", "two.sided")

Let me know what you think. Thanks.

question: how to test two region sets for differential enrichment with buildRestrictedUniverse?

Hi there,
Thanks for the useful package, it makes enrichment calculations a breeze.

I am currently trying to find whether two region sets are differentially enriched in certain features. Could you upload an example on how to compare enrichment for testing two region sets for differential enrichment against a reference database?

From the publication:

Furthermore, the buildRestrictedUniverse() function automatically builds a universe based on query sets and can be used to test two region sets for differential enrichment against a reference database.

The only example I find in documentation is from ?redefineUserSets:

dbPath = system.file("extdata", "hg19", package="LOLA")
regionDB = loadRegionDB(dbLocation=dbPath)
data("sample_universe", package="LOLA")
data("sample_input", package="LOLA")

getRegionSet(regionDB, collections="ucsc_example", filenames="vistaEnhancers.bed")
getRegionSet(dbPath, collections="ucsc_example", filenames="vistaEnhancers.bed")
getRegionFile(dbPath, collections="ucsc_example", filenames="vistaEnhancers.bed")


res = runLOLA(userSets, userUniverse, regionDB, cores=1)
locResult = res[2,]
extractEnrichmentOverlaps(locResult, userSets, regionDB)
writeCombinedEnrichment(locResult, "temp_outfolder")

userSetsRedefined =	redefineUserSets(userSets, userUniverse)
resRedefined = runLOLA(userSetsRedefined, userUniverse, regionDB, cores=1)

g = plotTopLOLAEnrichments(resRedefined)

However, in this example, despite userSets containing setA and setB, I cant't find any statistics showing whether two sets are differentially enriched in certain features. What is suggested way to approach this? Thanks!

Error when trying to load a regionDB

Hi,

I would very much like to use LOLA, but I can't get the LOLACore db to load into R. I've tried both the cache and the ordinary db, but they both stop loading with error "Error in .combine_DFrame_rows(x, objects, strict.colnames = FALSE) : the objects to combine must be DFrame objects or derivatives". The same ting happens if I try to load the example db in your gettingStared-vignette:

library("LOLA")
dbPath = system.file("extdata", "hg19", package="LOLA")
regionDB = loadRegionDB(dbPath)
Reading collection annotations:
ucsc_example: found collection annotation:C:/Users/HH/Documents/R/win-library/4.1/LOLA/extdata/hg19/ucsc_example/collection.txt
Reading region annotations...
::Loading cache:: C:/Users/HH/Documents/R/win-library/4.1/LOLA/extdata/hg19/ucsc_example//ucsc_example_files.RData
ucsc_example
::Creating cache:: C:/Users/HH/Documents/R/win-library/4.1/LOLA/extdata/hg19/ucsc_example/ucsc_example.RData
Reading 5 files...
1: C:/Users/HH/Documents/R/win-library/4.1/LOLA/extdata/hg19/ucsc_example/regions/cpgIslandExt.bed
1 ERR:C:/Users/HH/Documents/R/win-library/4.1/LOLA/extdata/hg19/ucsc_example/regions/cpgIslandExt.bed
Error in .combine_DFrame_rows(x, objects, strict.colnames = FALSE) :
the objects to combine must be DFrame objects or derivatives
In addition: Warning message:
In dir.create(cacheDir, recursive = TRUE) :
'C:\Users\HH\Documents\R\win-library\4.1\LOLA\extdata\hg19\ucsc_example' already exists

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Norwegian Bokmål_Norway.1252 LC_CTYPE=Norwegian Bokmål_Norway.1252 LC_MONETARY=Norwegian Bokmål_Norway.1252 LC_NUMERIC=C
[5] LC_TIME=Norwegian Bokmål_Norway.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] LOLA_1.19.1

loaded via a namespace (and not attached):
[1] xfun_0.30 remotes_2.4.2 reshape2_1.4.4 purrr_1.0.1 vctrs_0.5.2 miniUI_0.1.1.1 usethis_2.1.6
[8] htmltools_0.5.2 stats4_4.1.2 yaml_2.2.1 rlang_1.0.6 pkgbuild_1.3.1 urlchecker_1.0.1 later_1.2.0
[15] glue_1.6.2 withr_2.5.0 BiocGenerics_0.40.0 sessioninfo_1.2.2 plyr_1.8.6 GenomeInfoDbData_1.2.7 lifecycle_1.0.3
[22] zlibbioc_1.36.0 stringr_1.5.0 devtools_2.4.4 htmlwidgets_1.5.4 memoise_2.0.1 evaluate_0.16 knitr_1.40
[29] callr_3.7.2 IRanges_2.28.0 fastmap_1.1.0 httpuv_1.6.1 ps_1.7.1 GenomeInfoDb_1.30.1 curl_4.3.2
[36] simpleCache_0.4.2 Rcpp_1.0.7 xtable_1.8-4 promises_1.2.0.1 cachem_1.0.6 writexl_1.4.2 S4Vectors_0.32.3
[43] pkgload_1.3.0 XVector_0.30.0 mime_0.10 fs_1.5.2 digest_0.6.27 stringi_1.5.3 processx_3.7.0
[50] shiny_1.7.2 GenomicRanges_1.42.0 cli_3.4.1 tools_4.1.2 bitops_1.0-7 magrittr_2.0.1 RCurl_1.98-1.3
[57] profvis_0.3.7 crayon_1.5.2 ellipsis_0.3.2 data.table_1.14.0 prettyunits_1.1.1 rmarkdown_2.16 rstudioapi_0.14
[64] R6_2.5.1 compiler_4.1.2

Do you have any suggestions as to how this problem can be solved?.
For the record, I am using Windows.

Kind regards,
Hanne

Error in setnames(scoreTable, c("Var1", "Var2", "value"), c("userSet", : Items of 'old' not found in column names: Var1,Var2

Hi,
Does anyone know why this error appears when I run the example code in LOLA website:

library("LOLA")
dbPath = system.file("extdata", "hg19", package="LOLA")
regionDB = loadRegionDB(dbPath)
data("sample_input", package="LOLA") # load userSets
data("sample_universe", package="LOLA") # load userUniverse
locResults = runLOLA(userSets, userUniverse, regionDB, cores=1)

Error in setnames(scoreTable, c("Var1", "Var2", "value"), c("userSet", :
Items of 'old' not found in column names: Var1,Var2

Thanks !

simpleCache dependency

LOLA should suggest, but not depend on simpleCache. If it exists, it may be used to speed up database loading, but it's also possible to just load the database manually each time, if you have a small DB, and so simpleCache would not be required.

No error on bad regionDB

loadRegionDB() silently fails to load if you give an incorrectly formatted folder as a regionDB. Instead, it show give a meaningful error.

Column naming

I think you should re-think the names of some of the output columns:

why are the four cells of the contingency table called support, b, c, and d? why do they not appear together in the table order? --> consider renaming all columns with meaningful variables names or prefix them, e.g. by cont to make clear they're conceptually linked
you're using a mix of camelCase and hyphenated variables. ouch. userSet, dbSet, etc. vs. cell-type, data-source, etc.
for abbreviations that save only one symbol, I sometimes wonder whether it's really necessary to abbreviate them at all, e.g. rnkSup (rankSup, or even rankSupport (?)), maxRnk (maxRank), etc.

Error by using extractEnrichmentOverlap function

Hi,
I wanted to get my overlap regions by using the extractEnrichmentOverlaps function, but it allways gives me the Error in .subset(x, j) : invalid subscript type 'S4'
Obviously I have a wrong datatype as input, but I can´t find out what datatype I need instead of S4. Also my colleagues used the exact same data for runLOLA and that worked. Had anyone the same problem or could help me?
Best regards
Thomas

hg38 database

Hi,
I think I read somewhere that hg38 is available but when I download your package and run dbPath = system.file("extdata", "hg38", package="LOLA") gives me an empty variable when I checked the extdata directory its also not there. Can you add it to your package?

readBed Input Format

The current input format for readBed requires that the strand be +, -, or *. The standard bed format uses . instead of * when the strand is unknown (https://genome.ucsc.edu/FAQ/FAQformat.html#format1). It would be great if readBed could be modified so that unknown strands can be . or *. Thank you!

R package error

When I replaced the background data set in the R package
I run LOLA and found an error
Negative b entry in table. This means either: 1) Your user sets contain items outside your universe; or 2) your universe has a region that overlaps multiple user set regions, interfering with the universe set overlap calculation.
But I used the background data set of your example, but no error was reported.

	orderedCols = c("userSet", "dbSet", "collection", "pValueLog", "oddsRatio",
	"support", "rnkPV", "rnkOR", "rnkSup", "maxRnk", "meanRnk", "b", "c", "d",
	"description", "cellType", "tissue", "antibody", "treatment", "dataSource", "filename")
	unorderedCols = setdiff(colnames(scoreTable), orderedCols)

	setcolorder(scoreTable, c(orderedCols, unorderedCols))