neurogenomics / orthogene


🧬 orthogene 🧬 Interspecies gene mapping

Home Page: https://doi.org/doi:10.18129/B9.bioc.orthogene

evolutionary-biology genomics bioinformatics ontologies genes comparative-genomics animal-models translational-research biomedicine r

orthogene's Introduction

orthogene: Interspecies gene mapping


License: GPL-3

Authors: Brian Schilder

README updated: Dec-21-2023

Intro

orthogene is an R package for easy mapping of orthologous genes across hundreds of species. It pulls up-to-date gene ortholog mappings across 700+ organisms. It also provides various utility functions to aggregate/expand common objects (e.g. data.frames, gene expression matrices, lists) using 1:1, many:1, 1:many or many:many gene mappings, both within- and between-species.

In brief, orthogene lets you easily:

Citation

If you use orthogene, please cite:

Brian M. Schilder, Nathan G. Skene (2022). orthogene: Interspecies gene mapping. R package version 1.4.0, https://doi.org/doi:10.18129/B9.bioc.orthogene

Installation

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# orthogene is only available on Bioconductor>=3.14
if(BiocManager::version()<"3.14") BiocManager::install(update = TRUE, ask = FALSE)

BiocManager::install("orthogene")

Docker

orthogene can also be installed via a Docker or Singularity container with Rstudio pre-installed. Further instructions provided here.

Methods

library(orthogene)

data("exp_mouse")
# Setting to "homologene" for the purposes of quick demonstration.
# We generally recommend using method="gprofiler" (default).
method <- "homologene"  

For most functions, orthogene lets users choose between different methods, each with complementary strengths and weaknesses: "gprofiler", "homologene", and "babelgene".

In general, we recommend you use "gprofiler" when possible, as it tends to be more comprehensive.

While "babelgene" covers fewer species, it queries a wide variety of orthology databases and can return a column “support_n” that tells you how many databases support each ortholog gene mapping. This can be helpful when you need a semi-quantitative measure of mapping quality.

It’s also worth noting that for smaller gene sets, the speed difference between these methods becomes negligible.

| | gprofiler | homologene | babelgene |
|---|---|---|---|
| Reference organisms | 700+ | 20+ | 19 (but cannot convert between pairs of non-human species) |
| Gene mappings | More comprehensive | Less comprehensive | More comprehensive |
| Updates | Frequent | Less frequent | Less frequent |
| Orthology databases | Ensembl, HomoloGene, WormBase | HomoloGene | HGNC Comparison of Orthology Predictions (HCOP), which includes predictions from eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN |
| Data location | Remote | Local | Local |
| Internet connection | Required | Not required | Not required |
| Speed | Slower | Faster | Medium |
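
As a rough illustration of how a “support_n” column can act as a semi-quantitative quality filter, here is a minimal base-R sketch. The mapping table, gene names, and the cutoff of 3 are all hypothetical values chosen for illustration; they are not output from babelgene or orthogene.

```r
# Toy ortholog map with a support_n column (all values are made up).
gene_map <- data.frame(
  input_gene    = c("g1", "g2", "g3", "g4"),
  ortholog_gene = c("G1", "G2", "G3", "G4"),
  support_n     = c(12L, 2L, 7L, 1L),
  stringsAsFactors = FALSE
)

# Assumed cutoff: keep mappings supported by at least 3 databases.
min_support <- 3L
well_supported <- gene_map[gene_map$support_n >= min_support, ]
print(well_supported)
```

A stricter cutoff trades coverage for confidence, so the right value depends on how many mappings you can afford to lose.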

Quick example

Convert orthologs

convert_orthologs is very flexible with what users can supply as gene_df, and can take a data.frame/data.table/tibble, (sparse) matrix, or list/vector containing genes.

Genes, transcripts, proteins, SNPs, or genomic ranges will be recognised in most formats (HGNC, Ensembl, RefSeq, UniProt, etc.) and can even be a mixture of different formats.

All genes will be mapped to gene symbols, unless specified otherwise with the ... arguments (see ?orthogene::convert_orthologs or here for details).

Note on non-1:1 orthologs

A key feature of convert_orthologs is that it handles genes with many-to-many mappings across species. These can arise through evolutionary divergence, and the function of such genes tends to be less conserved and less translatable. Users can address this using different strategies via non121_strategy=.

gene_df <- orthogene::convert_orthologs(gene_df = exp_mouse,
                                        gene_input = "rownames", 
                                        gene_output = "rownames", 
                                        input_species = "mouse",
                                        output_species = "human",
                                        non121_strategy = "drop_both_species",
                                        method = method) 
## Preparing gene_df.

## sparseMatrix format detected.

## Extracting genes from rownames.

## 15,259 genes extracted.

## Converting mouse ==> human orthologs using: homologene

## Retrieving all organisms available in homologene.

## Mapping species name: mouse

## Common name mapping found for mouse

## 1 organism identified from search: 10090

## Retrieving all organisms available in homologene.

## Mapping species name: human

## Common name mapping found for human

## 1 organism identified from search: 9606

## Checking for genes without orthologs in human.

## Extracting genes from input_gene.

## 13,416 genes extracted.

## Extracting genes from ortholog_gene.

## 13,416 genes extracted.

## Checking for genes without 1:1 orthologs.

## Dropping 46 genes that have multiple input_gene per ortholog_gene (many:1).

## Dropping 56 genes that have multiple ortholog_gene per input_gene (1:many).

## Filtering gene_df with gene_map

## Setting ortholog_gene to rownames.

## 
## =========== REPORT SUMMARY ===========

## Total genes dropped after convert_orthologs :
##    2,016 / 15,259 (13%)

## Total genes remaining after convert_orthologs :
##    13,243 / 15,259 (87%)
knitr::kable(as.matrix(head(gene_df)))
| | astrocytes_ependymal | endothelial-mural | interneurons | microglia | oligodendrocytes | pyramidal CA1 | pyramidal SS |
|---|---|---|---|---|---|---|---|
| TSPAN12 | 0.3303571 | 0.5872340 | 0.6413793 | 0.1428571 | 0.1207317 | 0.2864750 | 0.1453634 |
| TSHZ1 | 0.4285714 | 0.4468085 | 1.1551724 | 0.4387755 | 0.3621951 | 0.0692226 | 0.8320802 |
| ADAMTS15 | 0.0089286 | 0.0978723 | 0.2206897 | 0.0000000 | 0.0231707 | 0.0117146 | 0.0375940 |
| CLDN12 | 0.2232143 | 0.1148936 | 0.5517241 | 0.0510204 | 0.2609756 | 0.4376997 | 0.6842105 |
| RXFP1 | 0.0000000 | 0.0127660 | 0.2551724 | 0.0000000 | 0.0158537 | 0.0511182 | 0.0751880 |
| SEMA3C | 0.1964286 | 0.9957447 | 8.6379310 | 0.2040816 | 0.1853659 | 0.1608094 | 0.2280702 |
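
The many:1 and 1:many filtering reported in the log above can be illustrated with a small base-R sketch. This is only a conceptual picture of what dropping non-1:1 mappings in both species means, using a made-up mapping table; it is not orthogene's internal implementation.

```r
# Toy mapping table (hypothetical gene names, not real orthologs).
gene_map <- data.frame(
  input_gene    = c("a1", "a2", "a3", "a4", "a4", "a5"),
  ortholog_gene = c("A1", "A2", "A2", "A4", "A5", "A6"),
  stringsAsFactors = FALSE
)

# many:1 -> an ortholog_gene reached from more than one input_gene.
many_to_one <- gene_map$ortholog_gene %in%
  gene_map$ortholog_gene[duplicated(gene_map$ortholog_gene)]

# 1:many -> an input_gene that maps to more than one ortholog_gene.
one_to_many <- gene_map$input_gene %in%
  gene_map$input_gene[duplicated(gene_map$input_gene)]

# Keep only rows that are unambiguous in both directions.
one_to_one <- gene_map[!(many_to_one | one_to_many), ]
print(one_to_one)
```

In this toy table only a1 and a5 survive, because a2/a3 share an ortholog and a4 has two orthologs.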

convert_orthologs is just one of the many useful functions in orthogene. Please see the documentation website for the full vignette.

Additional resources

Session Info

utils::sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] orthogene_1.8.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4              babelgene_22.9           
##  [3] xfun_0.41                 ggplot2_3.4.4            
##  [5] htmlwidgets_1.6.4         rstatix_0.7.2            
##  [7] lattice_0.22-5            vctrs_0.6.5              
##  [9] tools_4.3.1               generics_0.1.3           
## [11] yulab.utils_0.1.1         parallel_4.3.1           
## [13] tibble_3.2.1              fansi_1.0.6              
## [15] pkgconfig_2.0.3           Matrix_1.6-4             
## [17] ggplotify_0.1.2           data.table_1.14.10       
## [19] homologene_1.4.68.19.3.27 RColorBrewer_1.1-3       
## [21] desc_1.4.3                lifecycle_1.0.4          
## [23] compiler_4.3.1            treeio_1.26.0            
## [25] dlstats_0.1.7             munsell_0.5.0            
## [27] carData_3.0-5             ggtree_3.10.0            
## [29] gprofiler2_0.2.2          ggfun_0.1.3              
## [31] htmltools_0.5.7           yaml_2.3.8               
## [33] lazyeval_0.2.2            plotly_4.10.3            
## [35] pillar_1.9.0              car_3.1-2                
## [37] ggpubr_0.6.0              tidyr_1.3.0              
## [39] cachem_1.0.8              grr_0.9.5                
## [41] abind_1.4-5               nlme_3.1-164             
## [43] tidyselect_1.2.0          aplot_0.2.2              
## [45] digest_0.6.33             dplyr_1.1.4              
## [47] purrr_1.0.2               rprojroot_2.0.4          
## [49] fastmap_1.1.1             grid_4.3.1               
## [51] here_1.0.1                colorspace_2.1-0         
## [53] cli_3.6.2                 magrittr_2.0.3           
## [55] patchwork_1.1.3           utf8_1.2.4               
## [57] broom_1.0.5               ape_5.7-1                
## [59] withr_2.5.2               scales_1.3.0             
## [61] backports_1.4.1           httr_1.4.7               
## [63] rmarkdown_2.25            rvcheck_0.2.1            
## [65] ggsignif_0.6.4            memoise_2.0.1.9000       
## [67] evaluate_0.23             knitr_1.45               
## [69] rworkflows_1.0.1          viridisLite_0.4.2        
## [71] gridGraphics_0.5-1        rlang_1.1.2              
## [73] Rcpp_1.0.11               glue_1.6.2               
## [75] tidytree_0.4.6            BiocManager_1.30.22      
## [77] renv_1.0.3                rstudioapi_0.15.0        
## [79] jsonlite_1.8.8            R6_2.5.1                 
## [81] badger_0.2.3              fs_1.6.3

Related projects

Tools

Databases

  • HomoloGene: NCBI database that the R package homologene pulls from.

  • gProfiler: Web server for functional enrichment analysis and conversions of gene lists.

  • OrtholoGene: Compiled list of gene orthology resources.

Contact

UK Dementia Research Institute
Department of Brain Sciences
Faculty of Medicine
Imperial College London
GitHub
DockerHub


orthogene's People

Contributors

al-murphy, bschilder, jwokaty, nturaga

orthogene's Issues

How to find all supported species?

Thanks for providing this useful package.

For a specific project, we are looking at data from the rhinoceros and a few other ancient animals. How do I get a list of all supported species in orthogene?
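
One possible way to list the available species is sketched below. It assumes that orthogene::map_species() returns the full species table for the chosen method when species = NULL; please check ?orthogene::map_species to confirm this behaviour in your installed version. The wrapper name list_supported_species is hypothetical.

```r
# Hypothetical wrapper: returns the species table, or NULL when orthogene
# is not installed. Assumes map_species(species = NULL) lists all species.
list_supported_species <- function(method = "homologene") {
  if (!requireNamespace("orthogene", quietly = TRUE)) {
    message("orthogene is not installed; returning NULL.")
    return(NULL)
  }
  orthogene::map_species(species = NULL, method = method, verbose = FALSE)
}

species_tbl <- list_supported_species()
if (!is.null(species_tbl)) print(head(species_tbl))
```

Since each method covers a different set of organisms, it may be worth repeating the call with method = "gprofiler" and method = "babelgene".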

Scale silhouettes

plot_orthotree generates silhouettes that take up too much of the plot.

  • Limit max size
  • Would also be nice to make them slightly closer to realistic sizes (though not too much, bc otherwise you couldn't even see orgs like flies or yeast).
  • Figure out formula for scaling factor (distance of silhouettes from labels, %s)

[Screenshot 2022-05-28 at 19 12 50]

Flexible `gene_input` argument

Opted to turn gene_col into a more flexible argument that encompasses all input options: gene_input

Lets users set which aspect of gene_df they want to get gene names from (e.g. "rownames", "colnames", some column). If no accepted option is supplied, user receives this error message:
[Screenshot of the error message, 2021-07-30 at 13 39 50]

babelgene tests

Some of the unit tests are failing with the latest version of babelgene.

    Running the tests in ‘tests/testthat.R’ failed.
    Last 13 lines of output:
      [ FAIL 4 | WARN 0 | SKIP 0 | PASS 141 ]
      
      ══ Failed tests ════════════════════════════════════════════════════════════════
      ── Failure (test-map_orthologs_babelgene.R:12:5): map_orthologs_babelgene works ──
      nrow(gene_map_b1) is not more than 13100. Difference: -34
      ── Failure (test-map_orthologs_babelgene.R:29:5): map_orthologs_babelgene works ──
      nrow(gene_map_b3) is not more than 15900. Difference: -27
      ── Failure (test-map_orthologs_babelgene.R:42:5): map_orthologs_babelgene works ──
      nrow(gene_map1) is not more than 29700. Difference: -49
      ── Failure (test-map_orthologs_babelgene.R:60:5): map_orthologs_babelgene works ──
      nrow(gene_map2) is not more than 29700. Difference: -49
      
      [ FAIL 4 | WARN 0 | SKIP 0 | PASS 141 ]
      Error: Test failures
      Execution halted

You may also consider adding a little more margin for variation in the future. These numbers tend to be less stable than one may expect.
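
One way to add such a margin is to compare against a fraction of the expected count rather than a hard threshold. The sketch below uses base R only; the helper name meets_threshold and the 1% tolerance are assumptions, and the observed counts are derived from the differences in the failure log above.

```r
# Hypothetical helper: pass when the observed row count is within
# `tolerance` (a fraction) below the expected count, or above it.
meets_threshold <- function(observed, expected, tolerance = 0.01) {
  observed >= expected * (1 - tolerance)
}

# Observed counts reconstructed from the log (expected + difference).
checks <- c(
  gene_map_b1 = meets_threshold(13100 - 34, 13100),
  gene_map_b3 = meets_threshold(15900 - 27, 15900),
  gene_map1   = meets_threshold(29700 - 49, 29700)
)
print(checks)
```

With a 1% margin all three checks pass, while a large regression would still be caught.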

convert_orthologs returning Error in names(object) <- nm : attempt to set an attribute on NULL

1. Bug description

When running convert_orthologs, I get an error saying that I am attempting to set an attribute on NULL.

Console output

Preparing gene_df.
data.frame format detected.
Extracting genes from NOG.
4,174 genes extracted.
Converting rat ==> human orthologs using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: rat
Common name mapping found for rat
1 organism identified from search: rnorvegicus
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: hsapiens
Checking for genes without orthologs in human.
Extracting genes from input_gene.
4,285 genes extracted.
Extracting genes from ortholog_gene.
4,285 genes extracted.
Dropping 269 NAs of all kinds from ortholog_gene.
Checking for genes without 1:1 orthologs.
Dropping 45 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 11 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Adding input_gene_standard col to gene_df.
Error in names(object) <- nm : attempt to set an attribute on NULL

Expected behaviour

A new column in my gene_df with the human orthologs matching the rat genes in $NOG.

2. Reproducible example

Code

orthologs=orthogene::convert_orthologs(gene_df = allrats_sigs,
                                       gene_input = "NOG",
                                       gene_output = "columns",
                                       standardise_genes = "TRUE",
                                       input_species = "rat",
                                       output_species = "human")

Data

allrats_sigs$NOG
[1] "ENSRNOG00000000047" "ENSRNOG00000000047" "ENSRNOG00000000047" "ENSRNOG00000000047"
[5] "ENSRNOG00000000073" "ENSRNOG00000000073" "ENSRNOG00000000075" "ENSRNOG00000000075"...etc
(If possible, upload a small sample of your data so that we can reproduce the bug on our end. If that's not possible, please at least include a screenshot of your data and other relevant details.)

3. Session info

(Add output of the R function utils::sessionInfo() below. This helps us assess version/OS conflicts which could be causing bugs.)

# Paste utils::sessionInfo() output 
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Rn.eg.db_3.18.0     orthogene_1.8.0         ggpubr_0.6.0            ggplot2_3.4.2          
 [5] VennDiagram_1.7.3       futile.logger_1.4.3     ensembldb_2.26.0        AnnotationFilter_1.26.0
 [9] GenomicFeatures_1.54.4  AnnotationDbi_1.64.1    Biobase_2.62.0          GenomicRanges_1.54.1   
[13] GenomeInfoDb_1.38.8     IRanges_2.36.0          S4Vectors_0.40.2        BiocGenerics_0.48.1    
[17] edgeR_4.0.16            limma_3.58.1           

loaded via a namespace (and not attached):
  [1] jsonlite_1.8.7              rstudioapi_0.15.0           magrittr_2.0.3             
  [4] fs_1.6.3                    BiocIO_1.12.0               zlibbioc_1.48.2            
  [7] vctrs_0.6.3                 memoise_2.0.1               Rsamtools_2.18.0           
 [10] RCurl_1.98-1.12             ggtree_3.10.1               rstatix_0.7.2              
 [13] htmltools_0.5.5             S4Arrays_1.2.1              progress_1.2.2             
 [16] lambda.r_1.2.4              curl_5.0.1                  broom_1.0.5                
 [19] gridGraphics_0.5-1          SparseArray_1.2.4           htmlwidgets_1.6.2          
 [22] plotly_4.10.2               futile.options_1.0.1        cachem_1.0.8               
 [25] GenomicAlignments_1.38.2    lifecycle_1.0.3             pkgconfig_2.0.3            
 [28] Matrix_1.6-0                R6_2.5.1                    fastmap_1.1.1              
 [31] GenomeInfoDbData_1.2.11     MatrixGenerics_1.14.0       aplot_0.2.2                
 [34] digest_0.6.33               colorspace_2.1-0            patchwork_1.2.0            
 [37] grr_0.9.5                   DESeq2_1.42.1               RSQLite_2.3.6              
 [40] filelock_1.0.2              fansi_1.0.4                 httr_1.4.6                 
 [43] abind_1.4-5                 compiler_4.3.1              remotes_2.4.2.1            
 [46] bit64_4.0.5                 withr_2.5.0                 backports_1.4.1            
 [49] BiocParallel_1.36.0         carData_3.0-5               DBI_1.2.2                  
 [52] homologene_1.4.68.19.3.27   highr_0.10                  ggsignif_0.6.4             
 [55] biomaRt_2.58.2              rappdirs_0.3.3              DelayedArray_0.28.0        
 [58] rjson_0.2.21                tools_4.3.1                 ape_5.7-1                  
 [61] glue_1.6.2                  restfulr_0.0.15             nlme_3.1-162               
 [64] ggvenn_0.1.10               generics_0.1.3              gtable_0.3.3               
 [67] tidyr_1.3.0                 data.table_1.14.8           hms_1.1.3                  
 [70] xml2_1.3.5                  car_3.1-2                   utf8_1.2.3                 
 [73] XVector_0.42.0              pillar_1.9.0                stringr_1.5.0              
 [76] yulab.utils_0.1.4           babelgene_22.9              dplyr_1.1.2                
 [79] treeio_1.26.0               BiocFileCache_2.10.2        lattice_0.21-8             
 [82] rtracklayer_1.62.0          bit_4.0.5                   tidyselect_1.2.1           
 [85] locfit_1.5-9.9              Biostrings_2.70.3           knitr_1.43                 
 [88] ProtGenerics_1.34.0         SummarizedExperiment_1.32.0 xfun_0.39                  
 [91] statmod_1.5.0               matrixStats_1.2.0           stringi_1.7.12             
 [94] ggfun_0.1.4                 lazyeval_0.2.2              yaml_2.3.7                 
 [97] evaluate_0.21               codetools_0.2-19            tibble_3.2.1               
[100] BiocManager_1.30.22         ggplotify_0.1.2             cli_3.6.1                  
[103] munsell_0.5.0               Rcpp_1.0.11                 gprofiler2_0.2.3           
[106] dbplyr_2.5.0                png_0.1-8                   XML_3.99-0.14              
[109] parallel_4.3.1              blob_1.2.4                  prettyunits_1.1.1          
[112] bitops_1.0-7                viridisLite_0.4.2           tidytree_0.4.6             
[115] scales_1.2.1                purrr_1.0.1                 crayon_1.5.2               
[118] rlang_1.1.1                 KEGGREST_1.42.0             formatR_1.14      

GHA: `Package suggested but not available: ‘rworkflows’`

GHA can't seem to find the rworkflows R package, even though it's definitely available on CRAN and passing all tests on all platforms...

I even tried clearing the cache with \nocache but that didn't seem to help at all.

https://github.com/neurogenomics/orthogene/actions/runs/4595804284/jobs/8116487829#step:2:1455

❯ checking package dependencies ... ERROR
  Package suggested but not available: ‘rworkflows’
  
  The suggested packages are required for a complete check.
  Checking can be attempted without them by setting the environment
  variable _R_CHECK_FORCE_SUGGESTS_ to a false value.
  
  See section ‘The DESCRIPTION file’ in the ‘Writing R Extensions’
  manual.

Furthermore, I already have the Config/rcmdcheck/_R_CHECK_FORCE_SUGGESTS_: false set in the DESCRIPTION so I don't know why it's suddenly not able to register this.

Enable synonym mapping without "gprofiler"

Functions like map_genes and aggregate_genes rely on "gprofiler" as the only method that can do gene synonym matching.

However, this requires internet access, and the gprofiler database sometimes breaks down randomly.
For example, today during checks:

── Error (test-run_benchmark.R:3:5): run_benchmark works ───────────────────────
Error in `gprofiler2::gconvert(query = ranges, organism = species, ...)`: Bad request, response code 400
Backtrace:

Would be good to have non-gprofiler alternatives for all functions, including these.

GHA: Docker: Error: both username and password must be set to login both username and password must be set to login

GHA made it all the way to the end, and then failed during the Docker steps:
https://github.com/neurogenomics/orthogene/actions/runs/4595804284/jobs/8116487699#step:4:9973

Run docker/build-push-action@v1
  with:
    password: ***
    repository: neurogenomicslab/orthogene
    tag_with_ref: true
    tag_with_sha: false
    tags: 1.5.2,latest
    build_args: PKG=orthogene
    path: .
    always_pull: false
    add_git_labels: false
    push: true
  env:
    GITHUB_PAT: ***
    GITHUB_TOKEN: ***
    RGL_USE_NULL: TRUE
    R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
    RSPM: https://packagemanager.rstudio.com/cran/__linux__/focal/release
    TZ: UTC
    NOT_CRAN: true
    packageName: orthogene
    packageNameOrig: orthogene
    packageVersion: 1.5.2
    deployment_status: success
/usr/bin/docker run --name dockergithubactionsv1_2ab6d0 --label 6c0442 --workdir /github/workspace --rm -e "GITHUB_PAT" -e "GITHUB_TOKEN" -e "RGL_USE_NULL" -e "R_REMOTES_NO_ERRORS_FROM_WARNINGS" -e "RSPM" -e "TZ" -e "NOT_CRAN" -e "packageName" -e "packageNameOrig" -e "packageVersion" -e "deployment_status" -e "INPUT_USERNAME" -e "INPUT_PASSWORD" -e "INPUT_REPOSITORY" -e "INPUT_TAG_WITH_REF" -e "INPUT_TAG_WITH_SHA" -e "INPUT_TAGS" -e "INPUT_BUILD_ARGS" -e "INPUT_REGISTRY" -e "INPUT_PATH" -e "INPUT_DOCKERFILE" -e "INPUT_TARGET" -e "INPUT_ALWAYS_PULL" -e "INPUT_CACHE_FROMS" -e "INPUT_LABELS" -e "INPUT_ADD_GIT_LABELS" -e "INPUT_PUSH" -e "HOME" -e "GITHUB_JOB" -e "GITHUB_REF" -e "GITHUB_SHA" -e "GITHUB_REPOSITORY" -e "GITHUB_REPOSITORY_OWNER" -e "GITHUB_REPOSITORY_OWNER_ID" -e "GITHUB_RUN_ID" -e "GITHUB_RUN_NUMBER" -e "GITHUB_RETENTION_DAYS" -e "GITHUB_RUN_ATTEMPT" -e "GITHUB_REPOSITORY_ID" -e "GITHUB_ACTOR_ID" -e "GITHUB_ACTOR" -e "GITHUB_TRIGGERING_ACTOR" -e "GITHUB_WORKFLOW" -e "GITHUB_HEAD_REF" -e "GITHUB_BASE_REF" -e "GITHUB_EVENT_NAME" -e "GITHUB_SERVER_URL" -e "GITHUB_API_URL" -e "GITHUB_GRAPHQL_URL" -e "GITHUB_REF_NAME" -e "GITHUB_REF_PROTECTED" -e "GITHUB_REF_TYPE" -e "GITHUB_WORKFLOW_REF" -e "GITHUB_WORKFLOW_SHA" -e "GITHUB_WORKSPACE" -e "GITHUB_ACTION" -e "GITHUB_EVENT_PATH" -e "GITHUB_ACTION_REPOSITORY" -e "GITHUB_ACTION_REF" -e "GITHUB_PATH" -e "GITHUB_ENV" -e "GITHUB_STEP_SUMMARY" -e "GITHUB_STATE" -e "GITHUB_OUTPUT" -e "GITHUB_ACTION_PATH" -e "RUNNER_OS" -e "RUNNER_ARCH" -e "RUNNER_NAME" -e "RUNNER_TOOL_CACHE" -e "RUNNER_TEMP" -e "RUNNER_WORKSPACE" -e "ACTIONS_RUNTIME_URL" -e "ACTIONS_RUNTIME_TOKEN" -e "ACTIONS_CACHE_URL" -e GITHUB_ACTIONS=true -e CI=true --network github_network_652c4c9e59c546f390843aefe1417e5f -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" 
-v "/home/runner/work/orthogene/orthogene":"/github/workspace" docker/github-actions:v1  "build-push"
Error: both username and password must be set to login
both username and password must be set to login
Usage:
  github-actions build-push [flags]

Flags:
  -h, --help   help for build-push

When in reality, both my Docker username and org name are set:

DOCKER_USERNAME: bschilder

How to use older annotations with convert_orthologs method?

Thanks for providing this useful package. I am trying to use it for human-to-canine (dog) ortholog conversion. By default, the current database uses the newest canine genome on Ensembl (2020 release). We have several analyses performed on the older version, CanFam3.1.

my_genes = c("ENSG00000187608", "ENSG00000162572", "ENSG00000205090")

method <- "homologene"

ortho = convert_orthologs(my_genes, gene_input = "Gene_ID", gene_output = "dict",
                  input_species = "human", output_species = "dog", non121_strategy="drop_both_species",
                  as_sparse = T)

## ortho
##     ENSG00000187608      ENSG00000162572      ENSG00000205090 
##             "ISG15" "ENSCAFG00845023300" "ENSCAFG00845025522" 

How can we use the previous version of the dog genome with convert_orthologs method?

orthogene fails rcmdcheck when ape isn't installed

ape is only in Suggests, but it is called here:

orthogene/R/prepare_tree.R

Lines 129 to 130 in bc242c5

tr <- ape::drop.tip(phy = tr,
tip = tr$tip.label[unmapped])

This is ultimately called in one of the examples for plot_orthotree():

#' orthotree <- plot_orthotree(species = c("human","monkey","mouse"))

If ape is not installed rcmdcheck then fails. We've encountered this when trying to run a reverse dependencies check for rphylopic: https://github.com/palaeoverse/rphylopic/actions/runs/8524983850/job/23352044794

This example should be skipped if ape is not installed. You should also probably have some sort of graceful failure if ape is not installed and the user runs a similar command.
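
A minimal sketch of the kind of guard suggested here, using the standard requireNamespace() idiom. The helper name drop_unmapped_tips is hypothetical and this is not orthogene's actual code; the roxygen2 @examplesIf tag is shown as one possible way to skip the example.

```r
# Hypothetical helper: fail with a clear message when the suggested
# package 'ape' is missing, instead of erroring deep inside the call.
drop_unmapped_tips <- function(tr, unmapped) {
  if (!requireNamespace("ape", quietly = TRUE)) {
    stop("Package 'ape' is required to prune the tree. ",
         "Install it with install.packages('ape').", call. = FALSE)
  }
  ape::drop.tip(phy = tr, tip = tr$tip.label[unmapped])
}

# In roxygen2 documentation, the example can be made conditional too:
#' @examplesIf requireNamespace("ape", quietly = TRUE)
#' orthotree <- plot_orthotree(species = c("human","monkey","mouse"))
```

With a guard like this, a check run without ape should skip the example, and an interactive user gets an actionable error message.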

Benchmark ortholog mapping methods

Strategies

Benchmark the number of orthologs that can be correctly mapped to humans using the following strategies:

  1. Simply making the genes uppercase.
  2. convert_orthologues(method="homologene")
  3. convert_orthologues(method="gorth")

Speed

Also benchmark these for speed.

Species

Repeat tests for various common model organisms:

  • chimp (P troglodytes)
  • baboon (P anubis)
  • macaque (M mulatta)
  • marmoset (C jacchus)
  • mouse (M musculus)
  • rat (R norvegicus)
  • gerbil (M unguiculatus)
  • dog (C lupus familiaris)
  • cat (F catus)
  • cow (B taurus)
  • chicken (G gallus)
  • zebrafish (D rerio)
  • fly (D melanogaster)
  • worm (C elegans)
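
The first strategy (simply uppercasing gene symbols) can be scored with a few lines of base R. This is a toy sketch: the gene vectors are small illustrative examples, "Gm12345" is a made-up symbol, and a real benchmark would compare against a trusted ortholog map rather than a hand-written reference list.

```r
# Toy input: mouse-style symbols ("Gm12345" is a made-up placeholder).
mouse_genes <- c("Tspan12", "Tshz1", "Cldn12", "Gm12345")
# Toy reference set of human symbols.
human_genes <- c("TSPAN12", "TSHZ1", "CLDN12", "SEMA3C")

# Strategy 1: uppercase the input and look for exact matches.
naive_map <- toupper(mouse_genes)
matched <- naive_map %in% human_genes
prop_mapped <- mean(matched)
print(prop_mapped)

# Timing the same step gives the speed half of the benchmark.
timing <- system.time(toupper(rep(mouse_genes, 1e4)))
print(timing[["elapsed"]])
```

The same two measurements (proportion mapped and elapsed time) could then be collected for each method and species in the list above.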

Figure out why "Canis lupus familiaris" 1:1 orthologs are reduced

@SarahMarzi noted that 1:1 dog:human orthologs dropped recently.
82.8% --> 16.8%
Trying to get to the bottom of it.

Examples

I believe both of these examples used method="babelgene"

Original

[Image: file15584db51b7b ggtree_edit]

Now

[Image: Screenshot 2023-03-31 at 13 10 22]

Possible explanations

  • Dog db got updated with a ton more orthologs (seems unlikely it would be affected this much tho)
  • Something is awry with the 1:1 ortholog calculation. Possibly due to inclusion of synonymous within-human genes table.
  • Something is awry with the reporting functions.
  • There's a bug in how the babelgene data was constructed

Replace usage of dictionaries in `format_gene_df`

format_gene_df uses dictionaries (named lists) to create new col names. But this is problematic when there are many:many orthologs. Switch over to using some sort of dataframe/filtering strategy instead.

Also, provide a warning to users when gene_output is "dict" or "dict_rev" AND non121_strategy is not "dbs".
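
To see why dictionary-style lookups are a problem with non-1:1 mappings, consider this base-R sketch. It uses a named character vector as the dictionary and made-up gene names; it illustrates the general pitfall and is not the actual format_gene_df code.

```r
# Toy map where geneA has two orthologs (hypothetical names).
gene_map <- data.frame(
  input_gene    = c("geneA", "geneA", "geneB"),
  ortholog_gene = c("ORTH1", "ORTH2", "ORTH3"),
  stringsAsFactors = FALSE
)
gene_df <- data.frame(
  input_gene = c("geneA", "geneB"),
  value      = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# Dictionary lookup: indexing by name returns only the first match,
# so the geneA -> ORTH2 mapping is silently lost.
dict <- stats::setNames(gene_map$ortholog_gene, gene_map$input_gene)
dict_result <- unname(dict[gene_df$input_gene])
print(dict_result)

# Table join: every mapping is kept, one row per input/ortholog pair.
joined <- merge(gene_df, gene_map, by = "input_gene")
print(joined)
```

The join returns three rows instead of two, which makes the expansion explicit and leaves the choice of how to resolve it to a later filtering step.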

Add new strategies for handling non-1:1 orthologs

Potential strategies:

  • DONE ✅ : Drop any gene that has >1 entry in input or output species.
  • DONE ✅ : Drop any gene that has >1 entry in input species.
  • DONE ✅ : Drop any gene that has >1 entry in output species.
  • decided against this: Select first/last gene of duplicates (arbitrarily).
  • DONE ✅ : Select most "popular" ortholog mappings.

More specific strategies for (gene expression) matrices:

  • Aggregate duplicate genes using the sum/mean/median/max/min
    now a separate issue here #8
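
For a sense of what aggregating duplicate genes in an expression matrix could look like, here is a base-R sketch using rowsum(). The matrix and gene names are made up, and this is only one possible approach, not necessarily how orthogene implements it.

```r
# Toy expression matrix in which gene "G1" appears twice after mapping.
expr <- matrix(
  c(1, 2,
    3, 4,
    5, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("G1", "G1", "G2"), c("s1", "s2"))
)

# Sum rows that share a gene name.
expr_sum <- rowsum(expr, group = rownames(expr))

# Mean: divide each summed row by the number of rows that fed into it.
n_per_gene <- as.vector(table(rownames(expr))[rownames(expr_sum)])
expr_mean <- expr_sum / n_per_gene

print(expr_sum)
print(expr_mean)
```

Summing suits count data, while the mean (or median) may be a better fit for normalised expression values.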

convert_orthologs() output is list, with "Warning: Coercing LHS to a list"

1. Bug description

I have been using this package successfully for a while. When I went to re-run my code, convert_orthologs() now returns a list instead of a new data frame with the "input_gene" and "ortholog_gene" columns. When I try changing the "gene_output" argument to something else, like "rownames", I get a different error.

Console output

When using gene_output = "columns":

Preparing gene_df.
data.frame format detected.
Extracting genes from human_gene_name.
23,294 genes extracted.
Converting human ==> mouse orthologs using: homologene
Retrieving all organisms available in homologene.
Mapping species name: human
Common name mapping found for human
1 organism identified from search: 9606
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Checking for genes without orthologs in mouse.
Extracting genes from input_gene.
15,864 genes extracted.
Extracting genes from ortholog_gene.
15,864 genes extracted.
Checking for genes without 1:1 orthologs.
Dropping 392 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 79 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Adding input_gene col to gene_df.
Warning: Coercing LHS to a list
Adding ortholog_gene col to gene_df.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
    / 23,294 (%)
Total genes remaining after convert_orthologs :
    / 23,294 (%)

Expected behaviour

I expected for a data frame to be returned with columns "input_gene" and "ortholog_gene."

2. Reproducible example

Code

library(orthogene)
library(reprex)

human_genes <- data.frame(read.csv("Human_gene_list_reprex.csv"))

head(human_genes)
#>   Gene_symbol
#> 1        A1BG
#> 2    A1BG-AS1
#> 3        A1CF
#> 4         A2M
#> 5       A2ML1
#> 6       A2MP1

class(human_genes)
#> [1] "data.frame"
class(human_genes$Gene_symbol)
#> [1] "character"

# Two lines down, change 2nd number to total number of rows
mouse_genes <- convert_orthologs(
  human_genes,
  gene_input = "Gene_symbol",
  gene_output = "columns",
  standardise_genes = FALSE,
  input_species = "human",
  output_species = "mouse",
  method = "homologene",
  drop_nonorths = TRUE,
  non121_strategy = "drop_both_species",
  mthreshold = Inf,
  as_sparse = FALSE,
  sort_rows = FALSE,
  verbose = TRUE
)
#> Preparing gene_df.
#> data.frame format detected.
#> Extracting genes from Gene_symbol.
#> 23,295 genes extracted.
#> Converting human ==> mouse orthologs using: homologene
#> Retrieving all organisms available in homologene.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: 9606
#> Retrieving all organisms available in homologene.
#> Mapping species name: mouse
#> Common name mapping found for mouse
#> 1 organism identified from search: 10090
#> Checking for genes without orthologs in mouse.
#> Extracting genes from input_gene.
#> 15,963 genes extracted.
#> Extracting genes from ortholog_gene.
#> 15,963 genes extracted.
#> Checking for genes without 1:1 orthologs.
#> Dropping 434 genes that have multiple input_gene per ortholog_gene (many:1).
#> Dropping 86 genes that have multiple ortholog_gene per input_gene (1:many).
#> Filtering gene_df with gene_map
#> Loading required namespace: DelayedArray
#> Adding input_gene col to gene_df.
#> Warning in gene_df2$input_gene <- input_dict[genes2]: Coercing LHS to a list
#> Adding ortholog_gene col to gene_df.
#> 
#> =========== REPORT SUMMARY ===========
#> Total genes dropped after convert_orthologs :
#>     / 23,295 (%)
#> Total genes remaining after convert_orthologs :
#>     / 23,295 (%)

class(mouse_genes)
#> [1] "list"

head(mouse_genes)
#> [[1]]
#> [1] "A1BG"
#> 
#> [[2]]
#> [1] "A1CF"
#> 
#> [[3]]
#> [1] "A2M"
#> 
#> [[4]]
#> [1] "A3GALT2"
#> 
#> [[5]]
#> [1] "A4GALT"
#> 
#> [[6]]
#> [1] "A4GNT"

Data

I attached a CSV with our gene list: [Human_gene_list_reprex.csv](https://github.com/neurogenomics/orthogene/files/12004790/Human_gene_list_reprex.csv)

3. Session info

# Paste utils::sessionInfo() output 
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reprex_2.0.2    orthogene_1.6.0 xml2_1.3.4      lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
 [7] dplyr_1.1.2     purrr_1.0.1     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2  
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0          viridisLite_0.4.2         fastmap_1.1.1            
 [4] lazyeval_0.2.2            homologene_1.4.68.19.3.27 digest_0.6.31            
 [7] timechange_0.2.0          lifecycle_1.0.3           tidytree_0.4.2           
[10] magrittr_2.0.3            compiler_4.3.0            rlang_1.1.1              
[13] tools_4.3.0               utf8_1.2.3                yaml_2.3.7               
[16] data.table_1.14.8         knitr_1.43                ggsignif_0.6.4           
[19] S4Arrays_1.0.4            htmlwidgets_1.6.2         bit_4.0.5                
[22] DelayedArray_0.26.6       aplot_0.1.10              abind_1.4-5              
[25] babelgene_22.9            withr_2.5.0               BiocGenerics_0.46.0      
[28] grid_4.3.0                stats4_4.3.0              fansi_1.0.4              
[31] ggpubr_0.6.0              colorspace_2.1-0          scales_1.2.1             
[34] cli_3.6.1                 rmarkdown_2.22            crayon_1.5.2             
[37] treeio_1.24.1             generics_0.1.3            rstudioapi_0.14          
[40] ggtree_3.8.0              httr_1.4.6                gprofiler2_0.2.2         
[43] tzdb_0.4.0                ape_5.7-1                 parallel_4.3.0           
[46] ggplotify_0.1.1           BiocManager_1.30.21       matrixStats_1.0.0        
[49] vctrs_0.6.2               yulab.utils_0.0.6         Matrix_1.5-4             
[52] jsonlite_1.8.5            carData_3.0-5             car_3.1-2                
[55] IRanges_2.34.1            S4Vectors_0.38.1          gridGraphics_0.5-1       
[58] hms_1.1.3                 patchwork_1.1.2           bit64_4.0.5              
[61] rstatix_0.7.2             plotly_4.10.2             grr_0.9.5                
[64] glue_1.6.2                stringi_1.7.12            gtable_0.3.3             
[67] munsell_0.5.0             pillar_1.9.0              htmltools_0.5.5          
[70] R6_2.5.1                  vroom_1.6.3               evaluate_0.21            
[73] lattice_0.21-8            backports_1.4.1           broom_1.0.4              
[76] ggfun_0.1.1               Rcpp_1.0.10               nlme_3.1-162             
[79] xfun_0.39                 fs_1.6.2                  MatrixGenerics_1.12.2    
[82] pkgconfig_2.0.3          

`rphylopic` failing

rphylopic is failing to run any of its examples, perhaps due to a change in the phylopic API.

library(rphylopic)
name_search(text = "Homo sapiens")
Error: Not Found (HTTP 404)

This causes orthogene to fail:
https://master.bioconductor.org/checkResults/3.17/bioc-LATEST/orthogene/nebbiolo1-checksrc.html

It may be the case that the development version of rphylopic (1.0.0) has since resolved this, but the latest CRAN release is rather behind (0.3.0). I will ping the maintainer to ask them to update the CRAN version, which is what Bioconductor packages are required to use.

Allow `convert_orthologs` to work when `input_species==output_species`

It would be nice if the convert_orthologs function could also standardise data within the same species.

When running benchmarks with human==>human, method="homologene" seems to be able to handle this fine (returns 18713 genes), but method="gprofiler" returns all "N/A" in the ortholog_output column.

Identifier mapping to Ensembl identifiers

I tried to figure out how I can change the output to Ensembl identifiers instead of gene symbols.
I tried adding the argument numeric_ns="ENSG", but that didn't help.
Do you have a hint on how I can achieve that?
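
In gprofiler2, which backs method="gprofiler", numeric_ns only tells g:Profiler how to interpret purely numeric input IDs. The output namespace is set by the target argument of gprofiler2::gconvert. The sketch below calls gconvert directly as a possible workaround. It assumes gprofiler2 is installed and the g:Profiler service is reachable. The helper name symbols_to_ensembl is made up for illustration, and whether convert_orthologs exposes an equivalent option is not established in this thread.

```r
# Hypothetical helper (not part of orthogene): convert gene symbols to
# Ensembl gene IDs with gprofiler2. Needs network access when called.
symbols_to_ensembl <- function(symbols, organism = "hsapiens") {
  if (!requireNamespace("gprofiler2", quietly = TRUE)) {
    stop("The gprofiler2 package is required for this sketch.")
  }
  gprofiler2::gconvert(
    query = symbols,
    organism = organism,
    target = "ENSG",   # output namespace: Ensembl gene IDs
    mthreshold = Inf,  # keep all matches per input gene
    filter_na = TRUE   # drop inputs with no match
  )
}

# Example call (commented out because it queries the g:Profiler web service):
# symbols_to_ensembl(c("A1BG", "A2M"))
```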

Issue with create_background - found in EWCE

1. Bug description

create_background misbehaves when all species are the same as the output species and the background is not NULL. In this case the background input is currently ignored instead of simply being returned:

species_list <- c(species1,species2)
gene_var <- if(as_output_species) "ortholog_gene" else "input_gene"
if(all(species_list==output_species)){
    #### If all species are the same, just use all_genes ####
    gene_map <- all_genes(species = output_species,
                          method = method,
                          verbose = verbose)
    bg <- gene_map$Gene.Symbol
    return(bg)
}

This should be changed to:

species_list <- c(species1,species2)
gene_var <- if(as_output_species) "ortholog_gene" else "input_gene"
if(all(species_list==output_species)){
    if(is.null(bg)){
        #### If all species are the same, just use all_genes ####
        gene_map <- all_genes(species = output_species,
                              method = method,
                              verbose = verbose)
        bg <- gene_map$Gene.Symbol
    } else {
        bg <- unique(bg)
    }
    return(bg)
}

This affects EWCE in prepare_genesize_control_network and bootstrap_enrichment_test: instead of using the user-supplied background list, orthogene creates a new one.

I've added code in EWCE to avoid this for now, but it would be better to fix it in orthogene.
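
The proposed same-species branch can be exercised on its own with a small offline sketch. Here all_genes_stub is a hypothetical stand-in for orthogene's all_genes, and same_species_background is an illustrative wrapper, not a function in the package.

```r
# Hypothetical stand-in for orthogene's all_genes(), so the sketch runs offline.
all_genes_stub <- function(species) {
  data.frame(Gene.Symbol = c("GENE1", "GENE2", "GENE3"),
             stringsAsFactors = FALSE)
}

# Same-species branch only: keep a user-supplied background (de-duplicated),
# and fall back to all genes when no background is given.
same_species_background <- function(bg = NULL, output_species = "human") {
  if (is.null(bg)) {
    bg <- all_genes_stub(species = output_species)$Gene.Symbol
  } else {
    bg <- unique(bg)
  }
  bg
}

# A user-supplied background is returned as given, minus duplicates.
user_bg <- same_species_background(bg = c("GENE2", "GENE2", "GENE9"))

# With no background, the stub's full gene list is used.
default_bg <- same_species_background()
```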

The result was 0 on R package orthogene

mouse-gene.txt
AF293-gene.txt

1. Bug description

No matter what my input was, convert_orthologs returned 0 orthologs.
For example, I extracted all the genes of M. musculus and mapped them to their human orthologs. The output was 0 genes, and the same happened with other species.

With the test data "exp_mouse" I did get results, but when I extracted the first column of genes, saved it as a .CSV file, re-read the .CSV and reran the conversion, the result was still 0 orthologs.

Console output

[mouse-gene.txt](https://github.com/neurogenomics/orthogene/files/13587565/mouse-gene.txt)

Expected behaviour

Could you tell me what I am doing wrong and how to use it correctly?

2. Reproducible example

Code


setwd
library(orthogene)
data <- read.table("mouse-gene.txt")
gene_df <- orthogene::convert_orthologs(
  gene_df = data,
  gene_input = "rownames",
  gene_output = "rownames",
  input_species = "mouse",
  output_species = "human",
  non121_strategy = "1",
  method = "gprofiler"
)

data2 <- read.table("AF293-gene.txt")
gene_df <- orthogene::convert_orthologs(
  gene_df = data2,
  gene_input = "rownames",
  gene_output = "rownames",
  input_species = "Aspergillus fumigatus Af293",
  output_species = "human",
  non121_strategy = "1",
  method = "gprofiler"
)
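
One possible cause, offered as a guess because the attached files are not shown here: if mouse-gene.txt is a single headerless column of gene names, read.table puts the genes in a column named V1 and gives the rows the default names "1", "2", ..., so gene_input = "rownames" would hand row numbers, not genes, to the mapper. The sketch below uses a temporary file with illustrative gene names to show this in base R.

```r
# Write a small headerless one-column file (illustrative gene names).
tmp <- tempfile(fileext = ".txt")
writeLines(c("Actb", "Gapdh", "Snca"), tmp)

dat <- read.table(tmp, stringsAsFactors = FALSE)

# The genes are in column V1; the row names are just row numbers.
default_rownames <- rownames(dat)
gene_column <- colnames(dat)

# Option 1 (assumption about usage): point gene_input at the column,
# i.e. gene_input = "V1" instead of gene_input = "rownames".
# Option 2: move the genes into the row names before converting.
fixed <- dat
rownames(fixed) <- fixed$V1

unlink(tmp)
```

After option 2, `rownames(fixed)` holds the gene names, which is what gene_input = "rownames" would need.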

Data

3. Session info

for mouse
image

for AF293
image

`plot_orthotree`: Randomly dropping species when `mc.cores>1`

plot_orthotree seems to randomly drop some species (that are indeed available) when mc.cores>1, which parallelises certain steps with parallel::mclapply.

I'm not sure why this happens, but it does raise some concerns about parallel::mclapply in general.
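
The cause is not identified here. One general way to make silently dropped items visible is to have each worker return its error instead of failing, then flag any result that is NULL or an error. The sketch below shows that pattern with a made-up worker and species list. It is not plot_orthotree's code.

```r
library(parallel)

# Illustrative species list and worker; "mouse" fails on purpose.
species <- c("human", "mouse", "zebrafish")
worker <- function(s) {
  if (s == "mouse") stop("simulated failure for ", s)
  toupper(s)
}

# Return the error object instead of letting the item vanish.
safe_worker <- function(s) tryCatch(worker(s), error = function(e) e)

# Forking is unavailable on Windows, where mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(stats::setNames(species, species), safe_worker,
                mc.cores = cores)

# Flag items that came back empty or as errors.
is_bad <- vapply(res, function(x) is.null(x) || inherits(x, "error"),
                 logical(1))
failed <- names(res)[is_bad]
failed
```

Reporting `failed` to the user, or retrying those items serially, would at least turn a silent drop into a visible one.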
