ensembldb's Introduction

ensembldb: build and use Ensembl-based annotation packages

For more information, please refer to the package's main vignette as well as the online documentation.

ensembldb's People

Contributors

hpages, jorainer, jwokaty, lgatto, mikelove, mtmorgan, nturaga, plijnzaad, tim-yu, vobencha

ensembldb's Issues

missing column(s): 'entrezid'. error

I'm trying to create a database for a newer Ensembl release (85) with:

library(ensembldb)
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, c("Homo sapiens", "release-85"))
Gtf <- ah["AH51014"]
DbFile <- ensDbFromAH(Gtf)
edb <- EnsDb(DbFile)

and I get the error below:

Fetching data ...  |========================================================================| 100%
OK
  -------------
Proceeding to create the database.
Processing metadata...OK
Processing genes...
 Attribute availability:
  o gene_id... OK
  o gene_name... OK
  o entrezid... Nope
  o gene_biotype... OK
OK
Processing transcripts...
 Attribute availability:
  o transcript_id... OK
  o gene_id... OK
  o transcript_biotype... OK
OK
Processing exons...OK
Processing chromosomes...Fetch seqlengths from ensembl, dataset hsapiens_gene_ensembl version 85...OK
OK
Generating index...OK
  -------------
Verifying validity of the information in the database:
Checking transcripts...OK
Checking exons...OK
Warning messages:
1: Failed to parse headers:
220- Welcome to the Ensembl anonymous FTP site
220- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
220-
220-   Users of the material provided here in advance of submission to
220-   public databases should read the Data Release Policy
220-   and Guidelines and Conditions on use of Data in the respective data
220-   directory.
220-   
220-   Please report any unusual problems you may have with this server
220-   via e-mail to webmaster@ensembl.org
220-
220-   All connections and transfers are logged
220-
331 Anonymous login ok, send your complete email address as your password
230 Anonymous access granted, restrictions apply
257 "/" is the current directory
250- Welcome to the Ensembl anonymous FTP site
250- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
250-
250-   Users of the material provided here in advance of submission to
250-   public databases should read the Data Release Policy
250-   and Guidelines and Conditions on use of Data in the respective dat [... truncated] 
2: In ensDbFromGRanges(gff, outfile = outfile, path = path, organism = orgFromAH,  :
   I'm missing column(s): 'entrezid'. The corresponding database column(s) will be empty!
> DbFile
[1] "./Homo_sapiens.GRCh38.85.sqlite"

Implement .processFilterParam function

This function should take the filter input parameter and:

  • validate it
  • translate it (if it's an expression) to an AnnotationFilter/AnnotationFilterList.
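
The two steps above could be sketched roughly as follows. This is a base R illustration only: `make_filter`, `process_filter_param` and the `toy_filter` class are hypothetical stand-ins and not the AnnotationFilter classes or the eventual `.processFilterParam` implementation. It assumes that an "expression" is supplied as a one-sided formula such as `~ gene_id == "BCL2"`.

```r
## Hypothetical stand-in for a filter object (NOT an AnnotationFilter).
make_filter <- function(field, condition, value) {
  structure(list(field = field, condition = condition, value = value),
            class = "toy_filter")
}

## Validate the filter input and always return a list of filter objects.
process_filter_param <- function(x) {
  ## No filter supplied: return an empty list.
  if (is.null(x) || length(x) == 0L) return(list())
  ## A single filter object: wrap it into a list.
  if (inherits(x, "toy_filter")) return(list(x))
  ## A one-sided formula: translate it into a filter object.
  if (inherits(x, "formula")) {
    expr <- x[[length(x)]]  # right-hand side of the formula
    if (!is.call(expr) || length(expr) != 3L)
      stop("expected an expression of the form <field> <condition> <value>")
    op <- as.character(expr[[1L]])
    if (!op %in% c("==", "!=", "<", ">", "<=", ">="))
      stop("unsupported condition: ", op)
    if (!is.name(expr[[2L]]))
      stop("the left-hand side of the condition has to be a field name")
    value <- eval(expr[[3L]], environment(x))
    return(list(make_filter(as.character(expr[[2L]]), op, value)))
  }
  ## A list: every element has to be a filter object.
  if (is.list(x)) {
    ok <- vapply(x, inherits, logical(1), what = "toy_filter")
    if (!all(ok)) stop("all list elements have to be filter objects")
    return(x)
  }
  stop("unsupported filter input of class ", class(x)[1L])
}

flt <- process_filter_param(~ gene_id == "BCL2")
```

Returning a list in every case means that downstream code only has to handle one input type.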

Fix timeout on malbec2 build machine

In ensembldb version 1.99.12, runTests.R seems to cause a timeout on malbec2.
What is also strange is how long some examples take:

EnsDb-AnnotationDbi	4.392	0.348	79.168	

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'select' for signature '"EnsDb"'

Error reported by Wim Trypsteen.

R-code that produced the error:

library("ensembldb")
library("EnsDb.Hsapiens.v79")

ensemblGenes <- c("ENSG00000108958", "ENSG00000123009", "ENSG00000124399")
edb <- EnsDb.Hsapiens.v79
select(edb, keys=ensemblGenes, 
       columns=c("ENTREZID", "SYMBOL"), 
       keytype="GENEID")

select(edb, keys = c("BCL2", "BCL2L11"), keytype = "gene_name", columns = c("gene_id", "gene_name"))

Add the AnnotationFilter::GRangesFilter

The new GRangesFilter has conditions any, start, end, within and equal. These are also passed to the filter using the type (not condition) parameter.

  • any: This seems to correspond to any overlap between the ranges.
> findOverlaps(IRanges(2, 5), IRanges(1, 4), type = "any")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1
  • start: ranges have the same start:
> findOverlaps(IRanges(3, 5), IRanges(3, 9), type = "start")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1
  • end: ranges have the same end:
> findOverlaps(IRanges(3, 5), IRanges(1, 5), type = "end")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1
  • within:
> findOverlaps(IRanges(2, 5), IRanges(1, 6), type = "within")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1
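
To make the semantics of the types listed above explicit, here is a small base R sketch on closed integer ranges. It is a toy re-implementation and not the IRanges code; `range_matches` is a hypothetical helper, and it assumes that no gap tolerance is allowed (exact matching of start and end positions).

```r
## Toy check whether a query range [q_start, q_end] matches a subject range
## [s_start, s_end] for a given overlap type. NOT the IRanges implementation.
range_matches <- function(q_start, q_end, s_start, s_end,
                          type = c("any", "start", "end", "within", "equal")) {
  type <- match.arg(type)
  switch(type,
         ## the ranges share at least one position
         any    = q_start <= s_end & s_start <= q_end,
         ## the ranges have the same start
         start  = q_start == s_start,
         ## the ranges have the same end
         end    = q_end == s_end,
         ## the query lies completely within the subject
         within = q_start >= s_start & q_end <= s_end,
         ## start and end are identical
         equal  = q_start == s_start & q_end == s_end)
}

## The same ranges as in the findOverlaps calls above:
range_matches(2, 5, 1, 4, type = "any")
range_matches(3, 5, 3, 9, type = "start")
range_matches(3, 5, 1, 5, type = "end")
range_matches(2, 5, 1, 6, type = "within")
```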

Problem importing Pbase

I am trying to import the Proteins class, but as soon as I import anything from https://github.com/ComputationalProteomicsUnit/Pbase I get the following error:

* installing *source* package ‘ensembldb’ ...
** R
** inst
** preparing package for lazy loading
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : 
  cyclic namespace dependency detected when loading ‘ensembldb’, already loading ‘biovizBase’, ‘Gviz’, ‘Pbase’, ‘ensembldb’
ERROR: lazy loading failed for package ‘ensembldb’
* removing ‘/Users/jo/R/R-3.3.1-Bioc3.4-devel/lib/R/library/ensembldb’
* restoring previous ‘/Users/jo/R/R-3.3.1-Bioc3.4-devel/lib/R/library/ensembldb’

Fix documentation

Fix the documentation after importing filters from AnnotationFilters.

Parameter to specify whether filter columns should be returned

As of now, the methods return only the columns specified with the columns argument. It might, however, be useful to also return the columns queried by the provided filters.
Add a returnFilterColumns setting that controls whether filter columns are returned too.
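
A minimal sketch of the intended behaviour, in base R. The helper name `result_columns` and its arguments are hypothetical; only the `returnFilterColumns` setting itself is taken from the description above.

```r
## Determine which columns a query should return.
## columns: columns requested by the user.
## filter_columns: columns queried by the supplied filters.
## return_filter_columns: the proposed returnFilterColumns setting.
result_columns <- function(columns, filter_columns,
                           return_filter_columns = TRUE) {
  if (return_filter_columns)
    unique(c(columns, filter_columns))
  else
    columns
}

result_columns(c("gene_id", "tx_id"), "gene_name")
result_columns(c("gene_id", "tx_id"), "gene_name",
               return_filter_columns = FALSE)
```

Using `unique` keeps the requested column order and avoids duplicates when a filter column was also requested explicitly.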

TIMEOUT error on Windows build machines

The Windows Bioconductor build machines sometimes, or even regularly, report a TIMEOUT for ensembldb. The process seems to stop during creation of the vignette(s).

* checking for file ‘ensembldb/DESCRIPTION’ ... OK
* preparing ‘ensembldb’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ...

Interestingly, this cannot be reproduced on my Windows build machine.
To track the problem:

  1. Disable proteins vignette
  2. Disable the MySQL vignette
  3. Disable all chunks in the main vignette that require an internet connection.

Amend filter framework and objects

Organism.dplyr implements filters similar to those in ensembldb in a very elegant way. Thus it might be good to re-use some of these ideas; specifically, to reduce the number of implemented methods and objects and use some lightweight functions instead.

See also issue Bioconductor/Organism.dplyr#3 in Organism.dplyr

ensembldb query backend

Based on a brainstorming chat with UniProt:

  • Do you use the UniProt RESTful API to get the data?
  • I think you use the Ensembl Perl API?

From the conversation: the Ensembl -> UniProt and UniProt -> Ensembl mappings differ. Where they agree, the features will match, but some features exist in one resource and not in the other.

To be discussed at EuroBioc in Basel.

Add protein data

What about adding protein annotations to ensembldb? @lgatto, might that be of interest to you too?

Add filters from AnnotationFilters

Now, the new AnnotationFilters package will provide most of the filters required by ensembldb. Thus we have to change the classes etc in ensembldb for it to work.

Convert within tx variant information to genomic coordinates

Functionality to map variant information within tx to genomic coordinates and vice versa. Example code below:

  fa <- getGenomeFaFile(edb)

  ## Convert variant coordinates to genomic coordinates
  tx <- "ENST00000070846"
  ## Get the cds
  txCds <- cdsBy(edb, by="tx", filter=TxidFilter(tx))

  ## ENST00000070846:c.1643delG
  varPos <- 1643
  exWidths <- width(txCds[[tx]])
  ## Define the exon ends in the tx.
  exEnds <- cumsum(exWidths)
  ## Get the first exon whose cumulative end is at or beyond the variant position.
  exDiffs <- varPos - exEnds
  exVar <- min(which(exDiffs <= 0))
  ## Now we would like to know the position within that exon:
  posInExon <- exWidths[exVar] + exDiffs[exVar]
  ## Next the genomic coordinate:
  ## Note: here we have to consider the strand!
  ## fw: exon_start + (pos in exon -1)
  ## rv: exon_end - (pos in exon -1)
  if(as.character(strand(txCds[[tx]][1])) == "-"){
      chromPos <- end(txCds[[tx]][exVar]) - (posInExon - 1)
  }else{
      chromPos <- start(txCds[[tx]][exVar]) + (posInExon -1)
  }

  ## Validation.
  ## OK, now we get the sequence for that exon.
  ## Check if the estimated position is a G.
  exSeq <- getSeq(fa, txCds[[tx]][exVar])
  substring(exSeq, first=posInExon-2, last=posInExon+2)
  ## Hm, hard to tell... it's two Gs there!
  substring(exSeq, first=posInExon, last=posInExon) == "G"
  ## Get the full CDS
  cdsSeq <- unlist(getSeq(fa, txCds[[tx]]))
  substring(cdsSeq, first=varPos - 2, last=varPos + 2)
  ## The same.
  getSeq(fa, GRanges(seqnames=seqlevels(txCds[[tx]]),
                     IRanges(chromPos, chromPos), strand="-")) == "G"


  ## Next one is c.1881DelC:
  varPos <- 1881
  exDiffs <- varPos - exEnds
  exVar <- min(which(exDiffs <= 0))
  posInExon <- exWidths[exVar] + exDiffs[exVar]
  exSeq <- getSeq(fa, txCds[[tx]][exVar])
  substring(exSeq, first=posInExon - 2, last=posInExon + 2)
  ## Hm, again, we're right, but there are 2 other Cs there!
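
The coordinate arithmetic used above can be isolated into a small base R function that works on plain exon coordinates and needs neither an EnsDb nor a genome file. This is only a sketch: `cds_to_genome` is a hypothetical helper, the exon coordinates in the usage lines are made up, and the exons are assumed to be supplied in transcription order (that is, by decreasing genomic position on the minus strand).

```r
## Map a position within the CDS to a genomic coordinate.
## pos: 1-based position within the CDS.
## exon_start, exon_end: genomic start/end of the CDS exons, in
##   transcription order.
## strand: "+" or "-".
cds_to_genome <- function(pos, exon_start, exon_end, strand = "+") {
  widths <- exon_end - exon_start + 1
  ## End position of each exon within the CDS.
  ends_in_cds <- cumsum(widths)
  if (pos < 1 || pos > ends_in_cds[length(ends_in_cds)])
    stop("position is outside of the CDS")
  ## First exon whose CDS end is at or beyond the position.
  i <- which(pos <= ends_in_cds)[1]
  ## Position within that exon (1-based).
  pos_in_exon <- pos - (ends_in_cds[i] - widths[i])
  ## fw: exon_start + (pos in exon - 1); rv: exon_end - (pos in exon - 1)
  if (strand == "-")
    exon_end[i] - (pos_in_exon - 1)
  else
    exon_start[i] + (pos_in_exon - 1)
}

## Made-up two-exon CDS on the forward strand:
cds_to_genome(11, exon_start = c(101, 201), exon_end = c(110, 205))
## The same exons on the reverse strand, in transcription order:
cds_to_genome(6, exon_start = c(201, 101), exon_end = c(205, 110),
              strand = "-")
```

Using `pos <= ends_in_cds` ensures that a position falling exactly on the last base of an exon is assigned to that exon and not to the next one.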

Return protein annotations by calls to select

Implement all required functionality in order to provide protein annotations also to calls of select:

  • Add protein columns to the list of allowed columns for select.
  • Add support for protein columns in mapIds.
  • Support the use of protein annotation filters (i.e. map columns to filters).

Add support for a MySQL backend

The idea would be to enable switching between the default SQLite backend and a MySQL backend.

  • Add a function that can take a SQLite database and insert that data into a MySQL database (checking if it's already there).
  • switchBackend method on an EnsDb that changes between backends; if no MySQL database is available, it tries to create one; requires host and credentials to be provided!

Note: this only makes sense if there is a performance increase with the MySQL backend.

Implement a `Filter` method

The idea is to implement a Filter,ANY,EnsDb method. This method should add the filter to a filterQuery in the EnsDb object. These filters are cached and applied once a method like genes is called.
This would enable calls like genes(Filter(GenenameFilter("ZBTB16"), edb)). Eventually it could also work with %>%.
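
The caching idea could look roughly like this in base R. Everything here is a toy: `new_toy_db`, `add_filter` and `toy_genes` are hypothetical stand-ins for an EnsDb object, the proposed Filter method and the genes method, and the "database" is just a data.frame.

```r
## Toy database object that carries a queue of cached filters.
new_toy_db <- function(genes) {
  structure(list(genes = genes, filters = list()), class = "toy_db")
}

## Add a filter to the queue; nothing is evaluated at this point.
add_filter <- function(db, field, value) {
  db$filters <- c(db$filters, list(list(field = field, value = value)))
  db
}

## Apply all cached filters only when the data is actually requested.
toy_genes <- function(db) {
  res <- db$genes
  for (f in db$filters)
    res <- res[res[[f$field]] %in% f$value, , drop = FALSE]
  res
}

db <- new_toy_db(data.frame(gene_name = c("ZBTB16", "BCL2", "BCL2L11"),
                            seq_name = c("11", "18", "2"),
                            stringsAsFactors = FALSE))
res <- toy_genes(add_filter(db, "gene_name", "ZBTB16"))
```

Because `add_filter` takes the object as its first argument and returns the modified object, calls can be chained and would also work with a pipe operator.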

Fix build error on Windows machines

  ERROR in test_getSeqlengthsFromMysqlFolder: Error in names(x) <- value : 
    'names' attribute [6] must be the same length as the vector [1]
  
  Test files with failing tests
  
     test_buildEdb.R 
       test_getEnsemblMysqlUrl 
       test_getSeqlengthsFromMysqlFolder 
  

Add additional annotation columns to _protein_ and _uniprot_ tables.

Add additional columns to the uniprot table:

  • uniprot_db the database name from which the ID was derived (Swissprot, TREMBL etc).

  • uniprot_mapping_type the method by which the Uniprot ID was mapped to the Ensembl protein ID.

  • Adapt perl script to fetch that data.

  • Add and check code in ensembldb: all columns listed by listColumns?

  • Add a UniprotdbFilter and a UniprotmappingtypeFilter.

Ensure result ordering for `select`

If a single filter or keys are provided, the ordering of the result has to match the ordering of the input.
For multiple filters this would not work.
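
For the single-filter or keys case, the reordering could be done with `match`, as in the base R sketch below. `order_by_keys` is a hypothetical helper and the IDs are made up.

```r
## Reorder a result data.frame so that its rows follow the order of the
## input keys. Rows whose key is not among the input keys are dropped.
order_by_keys <- function(res, keys, keycol) {
  res <- res[res[[keycol]] %in% keys, , drop = FALSE]
  ## order() keeps tied rows in their original order, so multiple rows
  ## per key retain their relative ordering.
  res[order(match(res[[keycol]], keys)), , drop = FALSE]
}

res <- data.frame(gene_id = c("b", "a", "a", "c"),
                  tx_id = c("t1", "t2", "t3", "t4"),
                  stringsAsFactors = FALSE)
ordered <- order_by_keys(res, keys = c("c", "a"), keycol = "gene_id")
```

With several filters there is no single input vector to match against, which is why this approach does not carry over to that case.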

Fetching protein annotations for non-coding transcripts

With the protein annotations, the problem arises that no mappings exist for non-coding transcripts. Thus, fetching protein annotation data for all transcripts of a gene will return only the protein-coding transcripts!
Changing from the default inner join to an outer join in the SQL queries would fix this problem. The only question is whether I should use that globally for all queries, or just locally for specific joins to protein annotation tables.
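
The difference between the two join types can be illustrated with base R's `merge` on two made-up tables (the IDs and table layout below are hypothetical and do not reflect the actual database schema):

```r
## Made-up transcript table; tx2 is a non-coding transcript.
tx <- data.frame(tx_id = c("tx1", "tx2", "tx3"),
                 tx_biotype = c("protein_coding", "lincRNA",
                                "protein_coding"),
                 stringsAsFactors = FALSE)
## Made-up protein table; there is no entry for tx2.
protein <- data.frame(tx_id = c("tx1", "tx3"),
                      protein_id = c("p1", "p3"),
                      stringsAsFactors = FALSE)

## Inner join: the non-coding transcript is dropped.
inner <- merge(tx, protein, by = "tx_id")
## Left outer join: all transcripts are kept; protein_id is NA for tx2.
left <- merge(tx, protein, by = "tx_id", all.x = TRUE)
```

A left outer join restricted to the joins with the protein tables would keep the behaviour of all other queries unchanged.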

Fix build error on malbec2

On malbec2 I get the error:

* checking tests ...
  Running ‘runTests.R’
 ERROR
Running the tests in ‘tests/runTests.R’ failed.
Last 13 lines of output:
  1 Test Suite : 
  ensembldb RUnit Tests - 142 test functions, 1 error, 0 failures
  ERROR in test_getEnsemblMysqlUrl: Error in ensembldb:::.getEnsemblMysqlUrl(type = "ensemblgenomes", organism = "fusarium_oxysporum",  : 
    No database matching 'fusarium_oxysporum_core_21' found in Ensembl's ftp server
  
  Test files with failing tests
  
     test_buildEdb.R 
       test_getEnsemblMysqlUrl 
  
  
  Error in BiocGenerics:::testPackage("ensembldb") : 
    unit tests failed for package ensembldb
  In addition: There were 30 warnings (use warnings() to see them)
  Execution halted

Best way to represent the protein data

Extracting the protein annotations in the form of a data.frame or DataFrame is straightforward; the question, however, is what type of object could best represent the protein annotation.

The object should be something similar to a GRanges, possibly the Proteins class from the Pbase (https://github.com/ComputationalProteomicsUnit/Pbase) package?

I've got:

  • (Ensembl) protein ID with sequence.
  • 1:n mapping of protein ID to Uniprot ID.
  • n:m mapping between protein ID and protein domain ID, which provides in addition the position of the protein domain within the protein sequence.

@lgatto any suggestions/preferences here?

Fix/adapt vignette

Update the vignette after importing all filters from AnnotationFilters.
