jorainer / ensembldb

This is the ensembldb development repository.

Home Page: https://jorainer.github.io/ensembldb

R 96.92% Perl 2.56% HTML 0.27% TeX 0.26%

bioconductor bioconductor-packages ensembl annotation

ensembldb's Introduction

`ensembldb`: build and use Ensembl-based annotation packages

For more information please refer to the package's main vignette as well as the online documentation.

ensembldb's People

plijnzaad mtmorgan inambioinfo timyers kozo2 shicheng-guo procha2 lgatto mikelove tim-yu

ensembldb's Issues

missing column(s): 'entrezid'. error

I'm trying to create a database for a newer Ensembl release (85) with:

library(ensembldb)
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("Homo sapiens", "release-85"))
Gtf <- ah["AH51014"]
DbFile <- ensDbFromAH(Gtf)
edb <- EnsDb(DbFile)

and I get the error below:

Fetching data ...  |========================================================================| 100%
OK
  -------------
Proceeding to create the database.
Processing metadata...OK
Processing genes...
 Attribute availability:
  o gene_id... OK
  o gene_name... OK
  o entrezid... Nope
  o gene_biotype... OK
OK
Processing transcripts...
 Attribute availability:
  o transcript_id... OK
  o gene_id... OK
  o transcript_biotype... OK
OK
Processing exons...OK
Processing chromosomes...Fetch seqlengths from ensembl, dataset hsapiens_gene_ensembl version 85...OK
OK
Generating index...OK
  -------------
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK
Warning messages:
1: Failed to parse headers:
220- Welcome to the Ensembl anonymous FTP site
220- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
220-
220-   Users of the material provided here in advance of submission to
220-   public databases should read the Data Release Policy
220-   and Guidelines and Conditions on use of Data in the respective data
220-   directory.
220-   
220-   Please report any unusual problems you may have with this server
220-   via e-mail to webmaster@ensembl.org
220-
220-   All connections and transfers are logged
220-
331 Anonymous login ok, send your complete email address as your password
230 Anonymous access granted, restrictions apply
257 "/" is the current directory
250- Welcome to the Ensembl anonymous FTP site
250- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
250-
250-   Users of the material provided here in advance of submission to
250-   public databases should read the Data Release Policy
250-   and Guidelines and Conditions on use of Data in the respective dat [... truncated] 
2: In ensDbFromGRanges(gff, outfile = outfile, path = path, organism = orgFromAH,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
> DbFile
[1] "./Homo_sapiens.GRCh38.85.sqlite"

Ensure `setFeatureInGRangesFilter` is always called before `addFilterColumns`

Implement .processFilterParam function

This function should take the filter input parameter and:

validate it
translate it (if it's an expression) to an AnnotationFilter/AnnotationFilterList.

ensDbFromGtf fails to fetch seqlengths for plant gtf

dbF <- ensDbFromGtf("Arabidopsis_thaliana.TAIR10.34.gtf.gz")

This fails to extract the sequence lengths and complains that it cannot retrieve them from Ensembl. Does it try ensemblgenomes?

Support for columns TXNAME and SYMBOL in select?

Are TXNAME and SYMBOL supported for select?
Are they supported for genes etc?

Implement ensDbQuery,AnnotationFilterList

Implement the ensDbQuery method for AnnotationFilterList objects.

Fix timeout on malbec2 build machine

ensembldb version 1.99.12 runTests.R seems to cause a timeout on malbec2.
What's also strange is the time that some examples take:

EnsDb-AnnotationDbi	4.392	0.348	79.168

supportedFilters lists also fields

Re-implement the supportedFilters method to return also the fields.

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'select' for signature '"EnsDb"'

Error reported by Wim Trypsteen.

R-code that produced the error:

library("ensembldb")
library("EnsDb.Hsapiens.v79")

ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")
edb <- EnsDb.Hsapiens.v79
select(edb, keys=ensemblGenes, 
       columns=c("ENTREZID", "SYMBOL"), 
       keytype="GENEID")

select(edb, keys = c("BCL2", "BCL2L11"), keytype = "gene_name", columns = c("gene_id", "gene_name"))

Implement a `SymbolFilter`

Based on Vince's suggestion; this should symlink to GenenameFilter.

Add the AnnotationFilter::GRangesFilter

The new GRangesFilter has conditions any, start, end, within and equal. These are also passed to the filter using the type (not condition) parameter.

any: This seems to correspond to the overlapping.

> findOverlaps(IRanges(2, 5), IRanges(1, 4), type = "any")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1

start: ranges have the same start:

> findOverlaps(IRanges(3, 5), IRanges(3, 9), type = "start")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1

end: ranges have the same end:

> findOverlaps(IRanges(3, 5), IRanges(1, 5), type = "end")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1

within:

> findOverlaps(IRanges(2, 5), IRanges(1, 6), type = "within")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1
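
The overlap types listed above can be summarized in a small base R sketch. It treats ranges as closed integer intervals and only mirrors the default behaviour of findOverlaps (no maxgap or minoverlap handling). `range_matches` is a hypothetical helper written for illustration, not part of ensembldb, AnnotationFilter or IRanges.

```r
## Hypothetical helper (not a package function): does the query range
## [q_start, q_end] match the subject range [s_start, s_end] for the
## given overlap type? Ranges are closed integer intervals.
range_matches <- function(q_start, q_end, s_start, s_end,
                          type = c("any", "start", "end", "within", "equal")) {
    type <- match.arg(type)
    switch(type,
           any    = q_start <= s_end && s_start <= q_end,
           start  = q_start == s_start,
           end    = q_end == s_end,
           within = q_start >= s_start && q_end <= s_end,
           equal  = q_start == s_start && q_end == s_end)
}

## The four examples from above, each evaluates to TRUE:
range_matches(2, 5, 1, 4, type = "any")
range_matches(3, 5, 3, 9, type = "start")
range_matches(3, 5, 1, 5, type = "end")
range_matches(2, 5, 1, 6, type = "within")
```

Written this way, "within" is the only type that is asymmetric between query and subject, which matters when deciding which of the two the filter's range represents.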

Problem importing Pbase

I am trying to import the Proteins class, but as soon as I import anything from https://github.com/ComputationalProteomicsUnit/Pbase I get the following error:

* installing *source* package ‘ensembldb’ ...
** R
** inst
** preparing package for lazy loading
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
  cyclic namespace dependency detected when loading ‘ensembldb’, already loading ‘biovizBase’, ‘Gviz’, ‘Pbase’, ‘ensembldb’
ERROR: lazy loading failed for package ‘ensembldb’
* removing ‘/Users/jo/R/R-3.3.1-Bioc3.4-devel/lib/R/library/ensembldb’
* restoring previous ‘/Users/jo/R/R-3.3.1-Bioc3.4-devel/lib/R/library/ensembldb’

Long build times of ensembldb with newer RSQLite packages

Build times differ considerably between RSQLite version 1.0.0 and the release candidate.

Check unit tests.
Check examples.
Check vignette.

Fix documentation

Fix the documentation after importing filters from AnnotationFilters.

Add support for `SYMBOL`

Allow SYMBOL to be queried by the select method.

Use of EnsDb v2.0 databases with ensembldb 1.x

Check if we get errors using an EnsDb version 2.0 (i.e. including protein annotations) with an ensembldb package that does not provide protein functionality (i.e. < 1.99).

Parameter to specify whether filter columns should be returned

As of now only columns specified with the columns argument are returned by the methods. It might however be useful to return the columns queried by the provided filters too.
Add a returnFilterColumns setting that controls whether filter columns are returned too.

TIMEOUT error on Windows build machines

The Windows Bioconductor build machines sometimes (or even regularly) report a TIMEOUT for ensembldb. The process seems to stop during creation of the vignette(s).

* checking for file ‘ensembldb/DESCRIPTION’ ... OK
* preparing ‘ensembldb’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ...

Interestingly, this can not be reproduced on my Windows build machine.
To track the problem:

Disable proteins vignette
Disable the MySQL vignette
Disable all chunks that require internet connection on the main vignette.

Remove integer checking code once RMySQL returns correct data types

Remove the code transforming numeric to integer in .getWhat once r-dbi/RMySQL#166 is fixed.

Amend filter framework and objects

Organism.dplyr implements filters similar to those in ensembldb, and does so in a very elegant way. It might thus be good to reuse some of these ideas, specifically to reduce the number of implemented methods and objects and use lightweight functions instead.

See also issue Bioconductor/Organism.dplyr#3 in Organism.dplyr

ensembldb query backend

Based on a brainstorming chat with UniProt:

Do you use the UniProt RESTful API to get the data?
I think you use the Ensembl perl API?

From the conversation: there are differences between the Ensembl -> UniProt and the UniProt -> Ensembl mappings. When the two agree, the features will match, but some features exist in one and not in the other.

To be discussed at EuroBioc in Basel.

Add protein data

What about adding protein annotations to ensembldb? @lgatto, might that be of interest to you too?

EnsDb validation method to check protein content

All transcripts with a CDS are assigned to a protein_id.
Length of the protein sequence is the length of the CDS / 3.
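
The two checks above could be sketched in base R as follows. The table and its column names (`cds_width`, `protein_length`) are assumptions made for illustration and do not reflect the actual EnsDb layout. Whether the stop codon counts toward the CDS length depends on how the CDS is defined, so the exact length rule may need adjusting.

```r
## Toy transcript table; column names are hypothetical.
tx <- data.frame(
    tx_id = c("tx1", "tx2", "tx3"),
    cds_width = c(300L, 150L, NA),      # NA: transcript without a CDS
    protein_id = c("p1", NA, NA),
    protein_length = c(100L, NA, NA),
    stringsAsFactors = FALSE)

has_cds <- !is.na(tx$cds_width)

## Check 1: transcripts with a CDS but without a protein_id.
missing_protein <- tx$tx_id[has_cds & is.na(tx$protein_id)]

## Check 2: protein length differs from CDS length / 3.
with_protein <- has_cds & !is.na(tx$protein_id)
length_mismatch <- tx$tx_id[with_protein &
                            tx$protein_length != tx$cds_width / 3]

missing_protein   # "tx2"
length_mismatch   # character(0)
```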

Add filters from AnnotationFilters

Now, the new AnnotationFilters package will provide most of the filters required by ensembldb. Thus we have to change the classes etc in ensembldb for it to work.

Convert within tx variant information to genomic coordinates

Functionality to map variant information within tx to genomic coordinates and vice versa. Example code below:

  fa <- getGenomeFaFile(edb)

  ## Convert variant coordinates to genomic coordinates
  tx <- "ENST00000070846"
  ## Get the cds
  txCds <- cdsBy(edb, by="tx", filter=TxidFilter(tx))

  ## ENST00000070846:c.1643delG
  varPos <- 1643
  exWidths <- width(txCds[[tx]])
  ## Define the exon ends in the tx.
  exEnds <- cumsum(exWidths)
  ## Get the first exon whose end is at or after the variant position
  ## (<= 0, so that the last base of an exon stays in that exon).
  exDiffs <- varPos - exEnds
  exVar <- min(which(exDiffs <= 0))
  ## Now we would like to know the position within that exon:
  posInExon <- exWidths[exVar] + exDiffs[exVar]
  ## Next the genomic coordinate:
  ## Note: here we have to consider the strand!
  ## fw: exon_start + (pos in exon -1)
  ## rv: exon_end - (pos in exon -1)
  if(as.character(strand(txCds[[tx]][1])) == "-"){
      chromPos <- end(txCds[[tx]][exVar]) - (posInExon - 1)
  }else{
      chromPos <- start(txCds[[tx]][exVar]) + (posInExon -1)
  }

  ## Validation.
  ## OK, now we get the sequence for that exon.
  ## Check if the estimated position is a G.
  exSeq <- getSeq(fa, txCds[[tx]][exVar])
  substring(exSeq, first=posInExon-2, last=posInExon+2)
  ## Hm, hard to tell... it's two Gs there!
  substring(exSeq, first=posInExon, last=posInExon) == "G"
  ## Get the full CDS
  cdsSeq <- unlist(getSeq(fa, txCds[[tx]]))
  substring(cdsSeq, first=varPos - 2, last=varPos + 2)
  ## The same.
  getSeq(fa, GRanges(seqnames=seqlevels(txCds[[tx]]),
                     IRanges(chromPos, chromPos), strand="-")) == "G"


  ## Next one is c.1881delC:
  varPos <- 1881
  exDiffs <- varPos - exEnds
  exVar <- min(which(exDiffs <= 0))
  posInExon <- exWidths[exVar] + exDiffs[exVar]
  exSeq <- getSeq(fa, txCds[[tx]][exVar])
  substring(exSeq, first=posInExon - 2, last=posInExon + 2)
  ## Hm, again, we're right, but there are other 2 Cs there!
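
The coordinate arithmetic in the example above can be isolated into a small base R function that works on plain start/end vectors, which makes the strand handling and the exon-boundary case easy to test. `cds_to_genome` is a hypothetical helper, not an ensembldb function. It assumes that the exons are supplied in transcription order (for the reverse strand, the exon with the highest coordinates comes first).

```r
## Hypothetical helper: map a 1-based position within the CDS to a
## genomic coordinate. ex_start/ex_end are the genomic coordinates of
## the CDS exons, ordered 5' to 3' along the transcript.
cds_to_genome <- function(pos, ex_start, ex_end, strand = "+") {
    ex_width <- ex_end - ex_start + 1
    ex_cum <- cumsum(ex_width)
    if (pos < 1 || pos > ex_cum[length(ex_cum)])
        stop("position is outside of the CDS")
    ## First exon whose cumulative end reaches the position.
    idx <- which(ex_cum >= pos)[1]
    ## 1-based position within that exon.
    pos_in_exon <- pos - (ex_cum[idx] - ex_width[idx])
    if (strand == "-") {
        ## Reverse strand: count from the exon end downwards.
        ex_end[idx] - (pos_in_exon - 1)
    } else {
        ## Forward strand: count from the exon start upwards.
        ex_start[idx] + (pos_in_exon - 1)
    }
}

## Forward strand, exons 100-109 and 200-219:
cds_to_genome(10, c(100, 200), c(109, 219))        # 109 (last base of exon 1)
cds_to_genome(11, c(100, 200), c(109, 219))        # 200 (first base of exon 2)
## Reverse strand, same exons in transcription order:
cds_to_genome(1, c(200, 100), c(219, 109), "-")    # 219
cds_to_genome(21, c(200, 100), c(219, 109), "-")   # 109
```

The reverse direction (genomic coordinate to position within the transcript) would follow the same logic with the lookup inverted.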

Add supportedFilters method

Add a supportedFilters,EnsDb method to list all filters supported by ensembldb.

Implement a query optimizer

Based on provided filters, define the order in which the query will be executed.

Don't concatenate EntrezID identifiers in the database

Put Entrezgene IDs into a separate database table. Concatenate them by default for the genes etc calls, but don't for e.g. select or mapIds.

This change in the database layout requires different queries depending on the database layout version!
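
As a rough illustration of the two result shapes, here is a base R sketch on toy data frames. The table names, column names and the ";" separator are assumptions for this example and not necessarily what the database uses.

```r
## Toy tables standing in for the gene table and a separate
## Entrezgene table (hypothetical layout).
gene <- data.frame(gene_id = c("g1", "g2", "g3"),
                   gene_name = c("A", "B", "C"),
                   stringsAsFactors = FALSE)
entrezgene <- data.frame(gene_id = c("g1", "g1", "g2"),
                         entrezid = c(10L, 11L, 20L),
                         stringsAsFactors = FALSE)

## genes()-like result: one row per gene, IDs concatenated.
collapsed <- aggregate(entrezid ~ gene_id, data = entrezgene,
                       FUN = function(z) paste(z, collapse = ";"))
genes_tbl <- merge(gene, collapsed, by = "gene_id", all.x = TRUE)

## select()/mapIds()-like result: one row per gene/ID pair.
long_tbl <- merge(gene, entrezgene, by = "gene_id", all.x = TRUE)

genes_tbl$entrezid   # "10;11" "20" NA
nrow(long_tbl)       # 4
```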

Implement a `getGenomeTwoBitFile`

Get a TwoBit matching the genome release for the EnsDb object.

Return protein annotations by calls to select

Implement all required functionality in order to provide protein annotations also to calls of select:

Add protein columns to the list of allowed columns for select.
Add support for protein columns in mapIds.
Support the use of protein annotation filters (i.e. map columns to filters).

Test use case for protein annotations in Pbase

Check whether and how we can use protein annotations provided by ensembldb in the Pbase package.

Evaluate whether protein annotation confidence can be extracted from Ensembl

Check if additional information about protein annotations could be retrieved from Ensembl, e.g. protein ENSP00000338157 being annotated to two Uniprot IDs, Q05516 and A0A024R3C6, with only the former being a validated protein.

Add support for a MySQL backend

The idea would be to enable switching between the default SQLite backend and a MySQL backend.

Add a function that can take a SQLite database and insert that data into a MySQL database (checking if it's already there).
switchBackend method on an EnsDb that changes between backends; if no MySQL database is available, it tries to create one; requires host and credentials to be provided!

Note: this only makes sense if there is a performance increase with the MySQL backend.

Connect to MySQL databases without EnsDb package

Provide a general interface function to connect to, and possibly also list, EnsDb databases available on a dedicated MySQL server.

Script to create new `EnsDb` databases

Implement a script to automatically create EnsDbs for a new Ensembl release (including downloading and installing the database.)

Implement a `Filter` method

The idea is to implement a Filter,ANY,EnsDb method. This method should add the filter to a filterQuery in the EnsDb object. These filters are cached and applied once a method like genes is called.
This would enable calls like genes(Filter(GenenameFilter("ZBTB16"), edb)). Eventually it could also work with %>%.

Fix build error on Windows machines

  ERROR in test_getSeqlengthsFromMysqlFolder: Error in names(x) <- value : 
    'names' attribute [6] must be the same length as the vector [1]
  
  Test files with failing tests
  
     test_buildEdb.R 
       test_getEnsemblMysqlUrl 
       test_getSeqlengthsFromMysqlFolder

Problem with species without full genome assembly

Ran into a problem creating an EnsDb for Anas platyrhynchos: the Perl API crashes, apparently because no sequence of type chromosome is available.

Add gene description field to the gene table

Based on the comment from Julien Roux (https://support.bioconductor.org/p/96543/). Might be nice to add the gene description field.

Along the same lines: also add something like transcript evidence?

Add additional annotation columns to _protein_ and _uniprot_ tables.

Add additional columns to the uniprot table:

uniprot_db the database name from which the ID was derived (Swissprot, TREMBL etc).
uniprot_mapping_type the method by which the Uniprot ID was mapped to the Ensembl protein ID.
Adapt perl script to fetch that data.
Add and check code in ensembldb: all columns listed by listColumns?
Add a UniprotdbFilter and a UniprotmappingtypeFilter.

Implement intronsByTranscript method

Implement the intronsByTranscript method.

Compare versions for useMySQL function

The useMySQL method should check the versions of the SQLite and MySQL databases and, if needed, update the MySQL database.

Ensure result ordering for `select`

If a single filter or if keys are provided, the ordering of the result has to match the ordering of the input.
For multiple filters this would not work.
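
For the single-filter or keys case, one possible approach is to reorder the result by the position of each row's key in the input. The base R sketch below uses a toy result table. The data and column names are made up for illustration, and this is not necessarily how select should implement it.

```r
## Keys in the order supplied by the user.
keys <- c("g3", "g1", "g2")

## Toy query result, in database order; g1 has two transcripts.
res <- data.frame(gene_id = c("g1", "g2", "g3", "g1"),
                  tx_id = c("t1", "t2", "t3", "t4"),
                  stringsAsFactors = FALSE)

## Order rows by the position of their key in the input; rows sharing
## a key keep their original relative order.
res_ordered <- res[order(match(res$gene_id, keys)), , drop = FALSE]

res_ordered$gene_id   # "g3" "g1" "g1" "g2"
res_ordered$tx_id     # "t3" "t1" "t4" "t2"
```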

Fetching protein annotations for non-coding transcripts

With the protein annotations, the problem arises that no mappings exist for non-coding transcripts. Thus, fetching protein annotation data for all transcripts of a gene will return only the protein coding transcripts!
Changing from the default inner join to outer join in the SQL queries would fix this problem. The only question is whether I should use that globally for all queries, or just locally for specific joins to protein annotation tables.
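
The difference between the two join types can be seen in a small base R sketch, where merge stands in for the SQL join. The tables and column names are toy data for illustration only.

```r
## Toy transcript and protein tables; tx2 is non-coding and therefore
## has no entry in the protein table.
tx <- data.frame(tx_id = c("tx1", "tx2", "tx3"),
                 tx_biotype = c("protein_coding", "lincRNA",
                                "protein_coding"),
                 stringsAsFactors = FALSE)
protein <- data.frame(tx_id = c("tx1", "tx3"),
                      protein_id = c("p1", "p3"),
                      stringsAsFactors = FALSE)

## Inner join: the non-coding transcript is dropped.
inner <- merge(tx, protein, by = "tx_id")

## Left outer join: all transcripts are kept, with NA as protein_id
## for the non-coding one.
left <- merge(tx, protein, by = "tx_id", all.x = TRUE)

inner$tx_id       # "tx1" "tx3"
left$protein_id   # "p1" NA "p3"
```

Restricting the outer join to the joins that involve protein tables would leave the results of all other queries unchanged.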

Fix build error on malbec2

On malbec2 I get the error:

* checking tests ...
  Running ‘runTests.R’
 ERROR
Running the tests in ‘tests/runTests.R’ failed.
Last 13 lines of output:
  1 Test Suite : 
  ensembldb RUnit Tests - 142 test functions, 1 error, 0 failures
  ERROR in test_getEnsemblMysqlUrl: Error in ensembldb:::.getEnsemblMysqlUrl(type = "ensemblgenomes", organism = "fusarium_oxysporum",  : 
    No database matching 'fusarium_oxysporum_core_21' found in Ensembl's ftp server
  
  Test files with failing tests
  
     test_buildEdb.R 
       test_getEnsemblMysqlUrl 
  
  
  Error in BiocGenerics:::testPackage("ensembldb") : 
    unit tests failed for package ensembldb
  In addition: There were 30 warnings (use warnings() to see them)
  Execution halted

Best way to represent the protein data

Extracting the protein annotations in the form of a data.frame or DataFrame is straightforward; the question, however, is what type of object could best represent the protein annotation.

The object should be something similar to a GRanges, possibly the Proteins class from the Pbase (https://github.com/ComputationalProteomicsUnit/Pbase) package?

I've got:

(Ensembl) protein ID with sequence.
1:n mapping of protein ID to Uniprot ID.
n:m mapping between protein ID and protein domain ID, which provides in addition the position of the protein domain within the protein sequence.

@lgatto any suggestions/preferences here?

Fix/adapt vignette

Update the vignette after importing all filters from AnnotationFilters.

Add a section to the vignette to fetch EnsDb from AnnotationHub

Add a section that describes how to retrieve an EnsDb from AnnotationHub.
