almeidasilvaf / syntenet Goto Github PK

An R package to infer and analyze synteny networks from protein sequences

Home Page: https://almeidasilvaf.github.io/syntenet/

R 72.21% C++ 24.49% TeX 3.30%

comparative-genomics evolutionary-genomics network-science phylogenomics rstats synteny synteny-network

syntenet's Introduction

syntenet

The goal of syntenet is to infer synteny networks from whole-genome protein sequence data and analyze them. Anchor pairs from synteny analyses are treated as an undirected unweighted graph (i.e., a synteny network), and users can perform:

Synteny detection using a native implementation of the MCScanX algorithm, a C++ program that has been modified and ported to R with Rcpp. This way, users do not need to install MCScanX beforehand, because syntenet has its own implementation of the same algorithm.
Synteny network inference by treating anchor pairs as edges of a graph;
Network clustering using the Infomap algorithm;
Phylogenomic profiling, which consists in identifying which species contain which clusters. This analysis can reveal highly conserved synteny clusters and taxon-specific ones (e.g., family- and order-specific clusters);
Microsynteny-based phylogeny reconstruction with maximum likelihood, which can be achieved by inferring a phylogeny from a binary matrix of phylogenomic profiles with IQTREE.

Installation instructions

Get the latest stable R release from CRAN. Then install syntenet from Bioconductor using the following code:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("syntenet")

And the development version from GitHub with:

BiocManager::install("almeidasilvaf/syntenet")

Citation

Below is the citation output from using citation('syntenet') in R. Please run this yourself to check for any updates on how to cite syntenet.

print(citation('syntenet'), bibtex = TRUE)
#> 
#> To cite syntenet in publications, use:
#> 
#>   Almeida-Silva, F., Zhao, T., Ullrich, K.K., Schranz, M.E. and Van de
#>   Peer, Y. syntenet: an R/Bioconductor package for the inference and
#>   analysis of synteny networks. Bioinformatics, 39(1), p.btac806.
#>   (2023). https://doi.org/10.1093/bioinformatics/btac806
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {syntenet: an R/Bioconductor package for the inference and analysis of synteny networks},
#>     author = {Fabricio Almeida-Silva and Tao Zhao and Kristian K. Ullrich and M. Eric Schranz and Yves {Van de Peer}},
#>     journal = {Bioinformatics},
#>     year = {2023},
#>     volume = {39},
#>     number = {1},
#>     pages = {btac806},
#>     url = {https://academic.oup.com/bioinformatics/article/39/1/btac806/6947985},
#>     doi = {10.1093/bioinformatics/btac806},
#>   }

Please note that syntenet was only made possible thanks to many other R and bioinformatics software authors, which are cited either in the vignettes and/or the paper(s) describing this package.

Code of Conduct

Please note that the syntenet project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Development tools

Continuous code testing is possible thanks to GitHub actions through usethis, remotes, and rcmdcheck customized to use Bioconductor’s docker containers and BiocCheck.
Code coverage assessment is possible thanks to codecov and covr.
The documentation website is automatically updated thanks to pkgdown.
The code is styled automatically thanks to styler.
The documentation is formatted thanks to devtools and roxygen2.

For more details, check the dev directory.

This package was developed using biocthis.

syntenet's People

Contributors

Stargazers

Watchers

Forkers

kullrich animesh caoyu819 lizhizhong1992 manasealoo iaindhay

syntenet's Issues

check_input issue with number of genes

Hi Fabricio,
Thanks for syntenet that i used with success previously. But i'm currently trying to use it with another dataset and I cannot pass the "check_input" step which tells me that I do not have the same number of genes in the annot and seq objects. But I don't understand why, since it seems that I have the same number of genes in both objects. I also checked that the genes had the same identifiers in both objects. However, the order of genes may differ. I'm using the 1.0.4 version of syntenet.

# Create annot and seq objects
annot_files <- dir(here("data", "gff_whole_haplotype"), full.names = TRUE)
annot <- lapply(annot_files, function(x) {
  granges <- rtracklayer::import(x, feature.type = "mRNA")
  ## Keep only useful fields
  granges <- granges[, c("source", "type","ID")]
  return(granges)
})

annot <- lapply(annot, function(x) {
  x$gene_id <- x$ID
  return(x)
})
names(annot)<- gsub(".gff3", "", basename(annot_files))
annot <- GenomicRanges::GRangesList(annot)


seq <- fasta2AAStringSetlist(here("data", "faa_whole_haplotype"))
names(seq) <- gsub(".faa", "", names(seq))

GRangesList object of length 3:
$g1
GRanges object with 49762 ranges and 4 metadata columns:
          seqnames        ranges strand |   source     type             ID        gene_id
             <Rle>     <IRanges>  <Rle> | <factor> <factor>    <character>    <character>
      [1]    chr11   60034-62654      + |  Helixer     mRNA chr11_000001.1 chr11_000001.1
      [2]    chr11   70990-73157      + |  Helixer     mRNA chr11_000002.1 chr11_000002.1
      [3]    chr11 101227-102983      + |  Helixer     mRNA chr11_000003.1 chr11_000003.1
      [4]    chr11 110674-113051      + |  Helixer     mRNA chr11_000004.1 chr11_000004.1
      [5]    chr11 115581-116644      + |  Helixer     mRNA chr11_000005.1 chr11_000005.1
      ...      ...           ...    ... .      ...      ...            ...            ...
  [49758]    chr08 197167-197997      - |  Helixer     mRNA chr08_002377.1 chr08_002377.1
  [49759]    chr08 185223-185616      - |  Helixer     mRNA chr08_002378.1 chr08_002378.1
  [49760]    chr08 184933-185202      - |  Helixer     mRNA chr08_002379.1 chr08_002379.1
  [49761]    chr08   75215-78676      - |  Helixer     mRNA chr08_002380.1 chr08_002380.1
  [49762]    chr08    8161-13604      - |  Helixer     mRNA chr08_002381.1 chr08_002381.1
  -------
  seqinfo: 27 sequences from an unspecified genome; no seqlengths

$g2
GRanges object with 51889 ranges and 4 metadata columns:
          seqnames        ranges strand |   source     type             ID        gene_id
             <Rle>     <IRanges>  <Rle> | <factor> <factor>    <character>    <character>
      [1]     chr1 168478-173744      + |  Helixer     mRNA  chr1_000001.1  chr1_000001.1
      [2]     chr1 265069-265444      + |  Helixer     mRNA  chr1_000002.1  chr1_000002.1
      [3]     chr1 265600-268030      + |  Helixer     mRNA  chr1_000003.1  chr1_000003.1
      [4]     chr1 502438-508271      + |  Helixer     mRNA  chr1_000004.1  chr1_000004.1
      [5]     chr1 509859-511858      + |  Helixer     mRNA  chr1_000005.1  chr1_000005.1
      ...      ...           ...    ... .      ...      ...            ...            ...
  [51885]    chr17   97801-99401      - |  Helixer     mRNA chr17_002946.1 chr17_002946.1
  [51886]    chr17   87252-88373      - |  Helixer     mRNA chr17_002947.1 chr17_002947.1
  [51887]    chr17   81823-84434      - |  Helixer     mRNA chr17_002948.1 chr17_002948.1
  [51888]    chr17   76884-81270      - |  Helixer     mRNA chr17_002949.1 chr17_002949.1
  [51889]    chr17    9506-28460      - |  Helixer     mRNA chr17_002950.1 chr17_002950.1
  -------
  seqinfo: 27 sequences from an unspecified genome; no seqlengths

$g3
GRanges object with 45344 ranges and 4 metadata columns:
          seqnames        ranges strand |   source     type             ID        gene_id
             <Rle>     <IRanges>  <Rle> | <factor> <factor>    <character>    <character>
      [1]     chr1   28284-32837      + |  Helixer     mRNA  chr1_000001.1  chr1_000001.1
      [2]     chr1   32839-33270      + |  Helixer     mRNA  chr1_000002.1  chr1_000002.1
      [3]     chr1   41122-41853      + |  Helixer     mRNA  chr1_000003.1  chr1_000003.1
      [4]     chr1 201883-205824      + |  Helixer     mRNA  chr1_000004.1  chr1_000004.1
      [5]     chr1 285846-287931      + |  Helixer     mRNA  chr1_000005.1  chr1_000005.1
      ...      ...           ...    ... .      ...      ...            ...            ...
  [45340]    chr17   62188-63727      - |  Helixer     mRNA chr17_002687.1 chr17_002687.1
  [45341]    chr17   51453-52580      - |  Helixer     mRNA chr17_002688.1 chr17_002688.1
  [45342]    chr17   46003-48489      - |  Helixer     mRNA chr17_002689.1 chr17_002689.1
  [45343]    chr17   41160-44467      - |  Helixer     mRNA chr17_002690.1 chr17_002690.1
  [45344]    chr17    7668-14019      - |  Helixer     mRNA chr17_002691.1 chr17_002691.1
  -------
  seqinfo: 27 sequences from an unspecified genome; no seqlengths

$g1
AAStringSet object of length 49762:
        width seq                                                                              names               
    [1]  1580 MNEPRICHTLMTPNGNLHPKPNERKFPTLMVSQYLHSKN...IDYLGKLIWNYCILDEGHVIKNAKSKVTLSVKQLKAQH chr11_003021.1
    [2]    55 MIYYGPAVLSSPPPSHSLGHWFNPSDEIYGCSADRGHCQSASSGLELSVENALKM                          chr11_003020.1
    [3]   321 MLVITVDYLKKRVISLNNLVEIYCVHHPKHLLDYAALMS...GAVGVMKTGDFKRIHRTTRRISLTCRLVSKVHKNLLRF chr11_003019.1
    [4]   552 MSGMMLMLLLCLSVGTMSVVHCGDPTIFFTWNVTYGTLS...YVGVENIKRSLRNEYNMPDTALVCGLVKGMPKPPPYST chr11_000001.1
    [5]   552 MASVMYALLLCLSAGALSMVQGEDPYLFFTWNVTYGTIS...YVSVLSPARSLRDEYNLPETAQLCGIVKDLPKPPPYSS chr11_000002.1
    ...   ... ...
[49758]   227 MEAQEVEAQAATGSGQQQLKLYSYWMSSCSFRVRIALNL...RFQLDMTQFPLLARLHEAYNKIPAFLDALPEKQPDAPS chr08_001162.1
[49759]    76 MVLTKMKEIAESYLGQMVKNIVLTVPAYFNDSQQQATKDAGAISGLNVMRIINEPMAAAIAYGLDKKASRKGSRTS     chr08_001161.1
[49760]   742 MDNKQPLLLSQGSGGHVNYYVNLRKTSLSVTFAKWILKI...VCGPPTLQSSVAKEIRSHNLRRQFDHPIFHFNSHSFDL chr08_001158.1
[49761]   581 MSMYGRDPWGGPLEINTADSATDDDRSRNLQDLDRAALS...ALVALSGEKSVGIDKNTIFSAVPPMFPTPPPPPAPWKP chr08_001159.1
[49762]   411 MGGCTCLLCHYSALFVPSVLFINSERHEEQDALFEALHK...NVQQYIGSWVGGAKGKQDKTDSDQSLSQRIPIDIVISN chr08_001160.1

$g2
AAStringSet object of length 51889:
        width seq                                                                              names               
    [1]   221 MALSEATSVVPLNPETPSTNHAHSDQGSLPSSPDPDPTP...PVRPEAEDCSYYLKTGSCKFGSNCKFNHPVSRKTNQEI chr1_000001.1
    [2]    85 MELMKAKGGGGRLRHQQGRRLWEGSRGKKVMGGADEKGL...WDSPVVGCKEMLGLIGECVLEEYEEAQSSDDESFAIRR chr1_000002.1
    [3]   477 MVMGSSSRQSSGSNSRRVRAFKRWMKQEGIECSDALDLR...KLYHSLVLRISERKILRKLQIYAANAKEASWTNKLKRK chr1_000003.1
    [4]   757 MMNEENGSSSSSSSSLVSDDGGDIENLDLLDGRRRTCSQ...VETMSLAKAAILFRQLGLRHLCLVPKSQGGHKFPWNLG chr1_002382.1
    [5]    82 MIVHHLLPSHPSNHPSKPVACASASTTFEVDQAPTPATP...FQVGHEFDLHKLEPRYPLIISDIDIAQDRGWAAHSNVS chr1_002381.1
    ...   ... ...
[51885]   275 ERRLALEEVKFLQKQRERKSGIPALTNNNAATQTLPGST...VSRQNDTGTEAAGQRQAATDQFMLERFRKRERHRVMRR chr17_001458.1
[51886]   255 MTMASQSQSLQFFHSHIIIRRSTLPPPSLPFHFQLQLQL...VIEGMDVLRKLESQETSRLDVPRLPCRIVNCGELLLDD chr17_001454.1
[51887]   125 MSVLHIQYSQWPNQGVPEDTFSVREIEKRVMYQVAPAIP...QRDGMVQKPEQYRLCYDALVDELEDHISDGSPVEYLVK chr17_001457.1
[51888]    39 MSGPGTLWGSAESGNPTSRLGCVSRLSLLMVGLYFKDLA                                          chr17_001456.1
[51889]   838 MEVDTRQYLRDDQTSNCRSFGSINHPSSQSRKISIGVVV...TQINDAQRKIQVTRRMARGKMLQLKHELALCLKDGILS chr17_001455.1

$g3
AAStringSet object of length 45344:
        width seq                                                                              names               
    [1]    98 MLTSRTIPDSVHTVSRASGGHESSRRVKSELLVQVDGVN...EAFRRRLEKRLYIPLPNFQSRKALIRINMKPVEMWPLT chr1_000001.1
    [2]    45 MSKDDISKDPVAMCDFEEALAKVQRSVSQADIERHEKWIHEFGSA                                    chr1_000002.1
    [3]   224 MSGPSDRRFDLNLGEETATPSPDNIWRPSFISPTGPLTV...MPPPSGVLSSTEAPDNHPPVLSLSGALPTAETSPKQPL chr1_000003.1
    [4]   221 MALSEATSVVPPNPETPSTNHAHSDQGSLPSSPDPDPTP...PVRPEAEDCSYYLKTGSCKFGSNCKFNHPVSRKTNQEI chr1_000004.1
    [5]   480 MKQEGIECSDALDLRNDGATAGGVGISVRAVGDLKEGDV...KLYHSLVLRISERKILRKLQIYAANAKEASWTNKLKRK chr1_000005.1
    ...   ... ...
[45340]   125 MSVLHIQYSQWPNQGVPEDTFSVREIEKRVMYQVAPAIP...QRDGMVQKPEQYRLCYDALVDELEDHISDGSPVEYLVK chr17_001338.1
[45341]    43 MSGPGTLWGSAESGNPTSRLGCVSRLSLLMIRGWALLQRPCLI                                      chr17_001337.1
[45342]   808 MEVDTRQYLRDDQTSNCRSFGSINHPSSQSRKISIGVVV...TQINDAQRKIQVTRRMARGKMLQLKHELALCLKDGILS chr17_001336.1
[45343]   322 MACLHLPDLVTHNIIQILPTKAAVRMSFLSKQWEGVWCS...HDESYKKAEIYKGKNKVLNGSPILRQKIQPIQRGCIQI chr17_001334.1
[45344]   954 MPSVWFSISGAFSLCCTVITEFSAALQRVGDFDHASYTP...FVGQPSPVTPRPSEQLAGTEGLPSFHSPKQGRKYRKYR chr17_001335.1

Erreur dans check_ngenes(seq, annotation) : 
  Number of sequences in greater than the number of genes for:
1. g1
2. g2
3. g3

Thank you for your help,
Best,
Aurelie

Parallelization of infer_syntenet

Is it possible to parallelize the infer_syntenet function? It seems like one could split up the pairwise calculations of collinear blocks. I'm running a dataset with ~150 taxa, guessing it will take several days to complete as a single threaded process.

Error in plot_profiles 'cluster_species' must match column names of profile matrix.

Hello,

I performed the full pipeline up to plotting the profile but when I try to plot the profile,
I get the following error:
$plot_profiles(
profiles, species_annotation,
cluster_species = species_order,
cluster_columns = TRUE

Error in plot_profiles(profiles, species_annotation, cluster_species = species_order, :
'cluster_species' must match column names of profile matrix.

After checking the matrix, it seems that it is missing 2 species but these were initially included (9 in the matrix but 11 originally):
$head(profiles$profile_matrix)

-> ar av bc dc hm pl ps rc rr
-> 8164 2 0 0 0 0 0 0 0 0
-> 8163 2 0 0 0 0 0 0 0 0
-> 8162 2 0 0 0 0 0 0 0 0
-> 8161 2 0 0 0 0 0 0 0 0
-> 8159 2 0 0 0 0 0 0 0 0
-> 8158 2 0 0 0 0 0 0 0 0

What might be the issue?
Many thanks
Alex

Cant seem to pass check_input

I have been trying to get syntenet working but i cant seem to get past the check_input - check_list_name stage. My understanding is that it is comparing the header/name of the proteins in the fasta/AAString object with the "gene_id" in the GRange object (from column 9 of the gff file). But i cant seem to get it to process even when they are identical.
The files i am working with are user generated (i.e not for a database) and we have the fasta header as the protein ID and after failing to get it to pass we have made every flag in the gff column 9 to be the same protein ID. I have tried to use the "gene_field" option in check_input and set it to any other column in the GRange object but it doesn't seem to help. I have tried to remove the ".1" form all the names in both fasts and gff files and doesn't change. Any help appreciated.

Im using the current version on bioconductor - 3.17

> aastringsetlist
$processed_NC_004578_cds_protein_sequences
AAStringSet object of length 27:
     width seq                                                                                  names               
 [1]   493 MNREEFLRLAADGYNRIPLARETLADFDTPLSIYLKLADQP...VQAGGGIVADSVPALEWEETLNKRRAMFRAVALASQTAEG WP_007244435.1
 [2]   640 MTKTSRCWPFAACLLSLACGTATAGPYSTMVVFGDSLADAG...ATFGVSQKLTQDLTLRGNYNWRKNDDVTQQGVNVALSMSF WP_011103197.1
 [3]    80 MNEILDQLRKEFATPCPSLSAVRERYFSHLSNDRNLLRKINAGRIALKVSRTGGTRQGHPFVYLHDLANYLSDIVTNRAA     WP_046463953.1
 [4]    82 MLDQPPGNSQHDAYLALAQRIQDAIASDKAQIEHQVLLIREPGESAAHWDHIVDQISEAEGIVVTRSPENGTAHVSWYIDSL   WP_005768037.1
 [5]   203 METQEISQEEMAERMGVTPGAVGHWLNGKREPKIEVINRLL...KLVEDAGRYFLKPLNPAYPTLPVTEDCKLIGVIRQMTMRL WP_005768035.1
 ...   ... ...
[23]   336 MASTASYEHVLHRYPSVQEWLELLGNLGRAPATLEAYGRGL...RDPKTTLIYVHLSGADLTARLAHSVGSLDARLFAELFKSE WP_011103210.1
[24]   713 MSYVPFDVDHYERQEELSDLERTILSNRRYRSDWAYLQSSV...DERAIVEGDLAKLDGLIRKLDDVPTLDGRTPSQIEAKKNR WP_011103211.1
[25]   218 MITPSRYPGIYIAPLSNEPTAAHTFKEQAEEALDHISAAPS...EELRAVGLDKYRYSLTKKPSENSIRAEHGLPLRMKYRAHQ WP_011103212.1
[26]   241 MPDPAQFSDGRWKKLPTQLSSITLARFDQDICTNNHGISQR...DPNLGEFHTHSKALADTIENISSADGLPLIGVQVFASKIH WP_226992679.1
[27]   199 MLLMIDNYDSFTYNVVQYLGELGADVKVIRNDELTIAQIEA...LRHKTLNVEGVQFHPESILTEQGHELFANFLKQSGGHRQG WP_005767895.1

$processed_NC_005773_cds_protein_sequences
AAStringSet object of length 21:
     width seq                                                                                  names               
 [1]   493 MNREEFLRLAAEGYNRIPLARETLADFDTPLSIYLKLADQP...VQAGGGIVADSVPVLEWEETLNKRRAMFRAVALASQPAEG WP_011167608.1
 [2]   640 MTKTSRRWPFAACLLSLACGTAAAAPYSTMVVFGDSLADAG...ATLGVSQKLTQDLTLRGNYNWRKNDDVTQQGVNVALSMSF WP_011167609.1
 [3]    80 MNEILDQLRKEFATPCPSLSAVRERYFSHLSNDRNLLRKINAGRIDLKVSRTGGSRQGHPFVYLHDLAKYLSAIVTNRAA     WP_004643929.1
 [4]    82 MLKQFPDTSQHDAYLALAQRIQDAITGDKAQIEHQVLLIREPGESVAHWERIMDQISEAEGISVTRNPENGTARVSWYIDSL   WP_004642420.1
 [5]   203 METQQISQEEMAERMGVTPGAVGHWLNGKREPKIEVINRFL...KLVEDAGRYFLKPLNPAYPTLAVTEECKLIGVIRQMTMRL WP_004663805.1
 ...   ... ...
[17]   391 MALTDLKIRQAKPGKTSSKLTDSGGLYLEVTTGGSKLWRYR...QLAHVEQKKSKAAYNHASYLPARKALMQWWDNYIFGSESD WP_011167620.1
[18]   100 MSVVKLFDTATSPVDVFDEVVNSGIAGVYGGRLAVREVASQ...SSGLTHNHSRTELAYMFDDRKFSVDQALDAVEKTYDITFP WP_041924422.1
[19]    66 MSDHDIDILIKLPEVCRQAGFGKSTIYELIAAGTFPAPTKLGRFSRWSQKEVQDWIELQKLARFAA                   WP_011167621.1
[20]   210 MELKRDPMLEQVVLPLLGNGDMIMSWQGVRADESINRRYLP...GIQYDLMIATDATACSSAYGLCDSGADGFNDTNVQLGEAA WP_081002664.1
[21]   199 MLLMIDNYDSFTYNVVQYLGELGADVKVIRNDELTIEQIEA...LRHKTLNVEGVQFHPESILTEQGHELFANFLKQSGGHRQG WP_002551940.1

> grangeslist
GRangesList object of length 2:
$processed_NC_004578_cds_features
GRanges object with 27 ranges and 8 metadata columns:
          seqnames      ranges strand |   source     type     score     phase             ID           Name
             <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>    <character>    <character>
   [1] NC_004578.1      1-1482      + |       NA     gene        NA         0 WP_007244435.1 WP_007244435.1
   [2] NC_004578.1   1704-3626      - |       NA     gene        NA         0 WP_011103197.1 WP_011103197.1
   [3] NC_004578.1   3733-3975      - |       NA     gene        NA         0 WP_046463953.1 WP_046463953.1
   [4] NC_004578.1   4097-4345      - |       NA     gene        NA         0 WP_005768037.1 WP_005768037.1
   [5] NC_004578.1   4498-5109      - |       NA     gene        NA         0 WP_005768035.1 WP_005768035.1
   ...         ...         ...    ... .      ...      ...       ...       ...            ...            ...
  [23] NC_004578.1 20673-21683      + |       NA     gene        NA         0 WP_011103210.1 WP_011103210.1
  [24] NC_004578.1 21707-23848      + |       NA     gene        NA         0 WP_011103211.1 WP_011103211.1
  [25] NC_004578.1 23950-24606      - |       NA     gene        NA         0 WP_011103212.1 WP_011103212.1
  [26] NC_004578.1 25258-25983      - |       NA     gene        NA         0 WP_226992679.1 WP_226992679.1
  [27] NC_004578.1 27745-28344      + |       NA     gene        NA         0 WP_005767895.1 WP_005767895.1
                names        gene_id
          <character>    <character>
   [1] WP_007244435.1 WP_007244435.1
   [2] WP_011103197.1 WP_011103197.1
   [3] WP_046463953.1 WP_046463953.1
   [4] WP_005768037.1 WP_005768037.1
   [5] WP_005768035.1 WP_005768035.1
   ...            ...            ...
  [23] WP_011103210.1 WP_011103210.1
  [24] WP_011103211.1 WP_011103211.1
  [25] WP_011103212.1 WP_011103212.1
  [26] WP_226992679.1 WP_226992679.1
  [27] WP_005767895.1 WP_005767895.1
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$processed_NC_005773_cds_features
GRanges object with 21 ranges and 8 metadata columns:
          seqnames      ranges strand |   source     type     score     phase             ID           Name
             <Rle>   <IRanges>  <Rle> | <factor> <factor> <numeric> <integer>    <character>    <character>
   [1] NC_005773.3      1-1482      + |       NA     gene        NA         0 WP_011167608.1 WP_011167608.1
   [2] NC_005773.3   1823-3745      - |       NA     gene        NA         0 WP_011167609.1 WP_011167609.1
   [3] NC_005773.3   3851-4093      - |       NA     gene        NA         0 WP_004643929.1 WP_004643929.1
   [4] NC_005773.3   4215-4463      - |       NA     gene        NA         0 WP_004642420.1 WP_004642420.1
   [5] NC_005773.3   4618-5229      - |       NA     gene        NA         0 WP_004663805.1 WP_004663805.1
   ...         ...         ...    ... .      ...      ...       ...       ...            ...            ...
  [17] NC_005773.3 16070-17245      + |       NA     gene        NA         0 WP_011167620.1 WP_011167620.1
  [18] NC_005773.3 17245-17547      + |       NA     gene        NA         0 WP_041924422.1 WP_041924422.1
  [19] NC_005773.3 17609-17809      - |       NA     gene        NA         0 WP_011167621.1 WP_011167621.1
  [20] NC_005773.3 17806-18438      - |       NA     gene        NA         0 WP_081002664.1 WP_081002664.1
  [21] NC_005773.3 21636-22235      + |       NA     gene        NA         0 WP_002551940.1 WP_002551940.1
                names        gene_id
          <character>    <character>
   [1] WP_011167608.1 WP_011167608.1
   [2] WP_011167609.1 WP_011167609.1
   [3] WP_004643929.1 WP_004643929.1
   [4] WP_004642420.1 WP_004642420.1
   [5] WP_004663805.1 WP_004663805.1
   ...            ...            ...
  [17] WP_011167620.1 WP_011167620.1
  [18] WP_041924422.1 WP_041924422.1
  [19] WP_011167621.1 WP_011167621.1
  [20] WP_081002664.1 WP_081002664.1
  [21] WP_002551940.1 WP_002551940.1
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

> check_input(aastringsetlist, grangeslist)
Error in check_list_names(seq, annotation) : 
  Names of list elements in 'seq' and 'annotation' must match.

unique species id in `process_input()`

Is there any way to customize the behavior of the renaming aspect of the process_input() where it adds a unique species identifier to sequence names - e.g. allowing it to use a genome accession as the unique species id? e.g. XX_000000

Error in check_ngenes(seq, annotation)

Hello Fabricio,

I am running syntenet for 4 species for a project. I filtered the fasta and gff3 files to match each other using custom python scripts to match the gff3 and fasta files and checked it with custom codes.

$ python count.py
Number of genes in athm.fasta: 27622
Number of genes in osam.fasta: 35489
Number of genes in crem.fasta: 19228
Number of genes in ppam.fasta: 22827

$ python entries.py
Number of 'gene_id' entries in athm.gff3: 27622
Number of 'gene_id' entries in osam.gff3: 35489
Number of 'gene_id' entries in crem.gff3: 19228
Number of 'gene_id' entries in ppam.gff3: 22827

however, the fuction check_input() is giving off error.

library(syntenet)
seqs=fasta2AAStringSetlist("fasta")
anns=gff2GRangesList("annotation")
check_input(seqs, anns)
Error in check_ngenes(seq, annotation) :
Number of sequences in greater than the number of genes for:

crem
osam
ppam

can you please assist this error?

-Sujeevan

.collinearity files with 0 collinear genes

Thanks for your replies to my previous posts @almeidasilvaf! Your detailed responses are much appreciated. I've got another puzzle for you though.

I found something strange in the .collinearity files produced by the infer_syntenet function. I won't post all my code (unless you'd like to see it), but here's the general idea. First Diamond searches.

# run Diamond
if(diamond_is_installed()) {
  blast_list <- run_diamond(seq = pdata$seq, 
                            verbose=TRUE, 
                            outdir="./diamond_results_Otocephala",
                            top_hits=20,
                            threads=60)
}

We can see plenty of hits between taxa "Cgar" and "Cgib" in the BLAST table.

head diamond_results_Otocephala/diamond/results/Cgar-GCF_024256425.1_Cgib-GCF_023724105.1.tsv
Cgar_ttn        Cgib_ttn        77.6    31617   4963    51      1       29570   1       31566   0.0     48384
Cgar_ttn        Cgib_ttn        58.0    29891   8199    100     1       29571   1       25869   0.0     33396
Cgar_ttn        Cgib_hmcn1      25.3    4198    2712    113     2237    6296    510     4432    8.65e-297       1019
Cgar_ttn        Cgib_hmcn2      25.6    3137    2107    82      3115    6159    385     3385    3.79e-243       840
Cgar_ttn        Cgib_LOC128013242       26.1    2994    1936    60      7588    10433   1213    4076    6.03e-242       838
Cgar_ttn        Cgib_LOC127945916       22.7    4162    2690    114     1725    5600    1966    5887    5.20e-203       710
Cgar_ttn        Cgib_obscna     26.4    1904    1361    22      7567    9455    446     2324    1.86e-177       625
Cgar_ttn        Cgib_LOC127964823       22.2    2942    1987    97      3164    5964    8       2789    7.03e-132       472
Cgar_ttn        Cgib_obsl1a     25.0    1546    1119    21      7571    9090    653     2184    2.43e-124       445
Cgar_ttn        Cgib_LOC128020137       21.6    2211    1497    52      7588    9650    444     2566    1.01e-104       383

wc -l diamond_results_Otocephala/diamond/results/Cgar-GCF_024256425.1_Cgib-GCF_023724105.1.tsv
251063 diamond_results_Otocephala/diamond/results/Cgar-GCF_024256425.1_Cgib-GCF_023724105.1.tsv

Then network analysis.

# run network analysis
net_v2<- infer_syntenet(blast_list, 
                      pdata$annotation,
                      outdir="./network_results_Otocephala_v2",
                      verbose=TRUE)

I didn't notice any error messages when running infer_syntenet, and the net object appears to be fine. I'm able to proceed with subsequent analysis steps (clustering, phylogenetic profiling, and phylogenetic inference).

However, there are several random pairwise comparisons that have 0 collinear genes, according to the .collinearity files in the outdir. It doesn't seem specific to any particular genome.

ls -lh -S network_results_Otocephala_v2/interspecies_synteny | tail
-rw-rw-r-- 1 krablab krablab   83K May 30 12:24 Asap-GCF_018492685.1_Dtra-GCA_007224835.1.collinearity
-rw-rw-r-- 1 krablab krablab   81K May 30 12:21 Aalo-GCF_017589495.1_Dtra-GCA_007224835.1.collinearity
-rw-rw-r-- 1 krablab krablab   80K May 30 12:25 Byar-GCA_005784505.1_Dtra-GCA_007224835.1.collinearity
-rw-rw-r-- 1 krablab krablab   77K May 30 12:32 Char-GCF_900700415.2_Dtra-GCA_007224835.1.collinearity
-rw-rw-r-- 1 krablab krablab   345 May 30 12:29 Cgar-GCF_024256425.1_Cgib-GCF_023724105.1.collinearity
-rw-rw-r-- 1 krablab krablab   344 May 30 12:22 Amel-GCA_012411365.1_Amex-GCF_023375975.1.collinearity
-rw-rw-r-- 1 krablab krablab   344 May 30 12:38 Mamb-GCF_018812025.1_Mang-GCF_027580225.1.collinearity
-rw-rw-r-- 1 krablab krablab   344 May 30 12:38 Mamb-GCF_018812025.1_Masi-GCF_019703515.2.collinearity
-rw-rw-r-- 1 krablab krablab   344 May 30 12:39 Mang-GCF_027580225.1_Masi-GCF_019703515.2.collinearity
-rw-rw-r-- 1 krablab krablab   344 May 30 12:42 Sans-GCF_001515605.1_Saso-GCA_024362625.1.collinearity

 cat ./network_results_Otocephala_v2/interspecies_synteny/Cgar-GCF_024256425.1_Cgib-GCF_023724105.1.collinearity
############### Parameters ###############
# MATCH_SCORE: 50
# GAP_PENALTY: -1
# MATCH_SIZE: 5
# OVERLAP_WINDOW: 5
# E_VALUE: 1e-05
# MAX GAPS: 25
# IS_PAIRWISE: 1
# IN_SYNTENY: 2
############### Statistics ###############
# Number of collinear genes: 0, Percentage: 0.00
# Number of all genes: 112611
##########################################

And indeed, the net_v2 object seems to be missing any Cgar-Cgib matches. I have not checked the other pairwise comparisons with 0 collinear genes, but I expect it's the same result.

> temp <- paste0(net_v2$Anchor1, "__", net_v2$Anchor2)
> head(temp)
[1] "Aalo_prkacbb__Aalo_prkacba"       "Aalo_LOC125293354__Aalo_adgrl2a"
[3] "Aalo_LOC125303560__Aalo_ankrd13c" "Aalo_LOC125303578__Aalo_srsf11"
[5] "Aalo_mylk4a__Aalo_LOC125301129"   "Aalo_LOC125296248__Aalo_sirt5"
> Cgar <- grep("Cgar", temp)
> Cgar_Cgib <- grep("Cgib", Cgar)
> head(Cgar_Cgib)
integer(0)

However, when we look at the phylogenetic profile heatmap, there seem to be many synteny clusters shared in these pairs of species that supposedly have 0 collinear genes.

For example, compare Clarias gariepinus (Cgar) and Carassius gibelio (Cgib). According to their .collinearity file, they have 0 collinear genes. I highlighted the two species in the plot below. Also, I should note that Carassius gibelio is a polyploid, hence the duplicated synteny clusters.

# plot phylogenomic profiles after a bit of filtering
p <- plot_profiles(profile_matrix = profiles_v2_filt,
                    species_annotation = species_annotation,
                    cluster_species = cluster_species,
                    dist_function = labdsv::dsvdis,
                    dist_params = list(index = "ruzicka"))

Any idea what might be going on? Happy to provide more data/code.

Import data

hello,

Thanks for developing such useful tools. I could not find somewhere the proper function to import the data (proteomes and annotations)
to syntenet or the correct format.

I have modified the data as described here https://github.com/zhaotao1987/SynNet-Pipeline/wiki/Genome-Preparation. Is this the correct format?

Which function should I use? The documentation describes an example using already existing data so no parsing..

Many thanks
Alex

Warning in cluster_network()

Dear developers,

I encountered this warning message in the cluster_network() run:

Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  restarting interrupted promise evaluation

Is this a serious problem? Can I go on with the results?

Thank you,

Tao

[BUG] Possible bug with find_GS_clusters

I have been using the find_GS_clusters function to count the number of synteny clusters for several clades. However, I have noticed that when I specify min_percentage = 100, no clusters are returned.

On line 166 of 05_phylogenomic_profiling.R, you perform filtering as follows.

fgroup_count <- group_count[group_count$perc > min_percentage, ]

Should this line actually be >= instead of >? If min_percentage = 100, I believe this will return 0 matches.

Importing blast database

Hi @almeidasilvaf,

Thanks for writing this really useful tool, I'm very excited to use it.
I have a minor query, which I think is mostly due to my lack of R knowledge.

I have an all v all blast run that I ran myself on command line with default settings and saved in tabular format as such:

SynAndr_BRAKERSADG00000005258	SynMyop_BRAKERSMFG00000005851	95.1	305	2	2	1	302	1	295	2.1e-90	336.7
SynAndr_BRAKERSADG00000005258	SynVesp_BRAKERSVEG00000005697	95.4	302	11	1	4	302	7	308	2.3e-84	316.6
SynAndr_BRAKERSADG00000005258	SesApif_BRAKERSAFG00000004954	85.2	317	32	3	1	302	1	317	1.1e-75	287.7
SynAndr_BRAKERSADG00000005258	SesBemb_BRAKERBSSG00005006065	87.3	308	33	2	1	302	1	308	1.8e-73	280.4
SynAndr_BRAKERSADG00000005258	ZeuPyri_BRAKERZPYG00000006574	83.4	307	42	3	5	302	21	327	7.4e-67	258.5

I was able to load my seq and gff data no problem - but when trying to import my personal blast output, in order to infer the network, I got a bit stuck.

I tried allvall <- read.table('all_v_all_diamondout.tsv') followed by network <- infer_syntenet(allvall, pdata$annotation). But I got the error:

Error in valid_blast(blast_list) : 
  BLAST list must be a list of data frames.

I realise this is because my dataframe is the full file, whereas in your tutorial it is a list of dataframes where each one corresponds to a given species.

Could you advise best how I could transform my dataframe into a list of dataframes as in the tutorial?
I could imagine most people would run the diamond search themselves (especially on large datasets) and the tutorial might benefit by having a bit more information on transforming the standard tabular output into the list of dataframes required for syntenet.

Thanks in advance for any help you can give, and apologies if this is a trivial issue simply due to my lack of R skills.

Best wishes,
Peter

Error in profiles$profile_matrix : $ operator is invalid for atomic vectors

Hello,

When I try to infer the binarized matrix from the phylogenomic profiles I get the following error.
I am using v. 1.0.3.

What might be the issue?
Thanks

bt_mat <- binarize_and_transpose(profiles$profile_matrix)

Error in profiles$profile_matrix : 
  $ operator is invalid for atomic vectors

[BUG] check_input does not work as intended

I believe there is a slight problem in the way that the check_input function performs check 3 using the check_gene_names function. Currently, the check_gene_names function searches all features of the GFF file for matches, not just "gene" features.

    gene_names <- lapply(annotation, function(x) {
        
        ranges_cols <- GenomicRanges::mcols(x)
        if(!gene_field %in% names(ranges_cols)) {
            stop("Could not find column '", gene_field, "' in GRanges.")
        }
        nam <- unique(ranges_cols[[gene_field]])
        
        return(nam)
    })

Then it compares the sequence names (from the FASTA file) with this exhaustive list.

    gene_names <- lapply(annotation, function(x) {
        
        ranges_cols <- GenomicRanges::mcols(x)
        if(!gene_field %in% names(ranges_cols)) {
            stop("Could not find column '", gene_field, "' in GRanges.")
        }
        nam <- unique(ranges_cols[[gene_field]])
        
        return(nam)
    })

However, the process_input function only examines "gene" GFF features.

annot_df <- annotation[[x]][annotation[[x]]$type == "gene"]

Thus, your data can return TRUE when running the check_input function, yet end up with seq and annotation name mismatches.

In the case below, the "CDS" features in my GFF file contain Name fields that match my FASTA IDs. But the "gene" features in my GFF do not contain names that match the FASTA IDs. This is unfortunately the standard format for GFFs and protein FASTAs downloaded from NCBI RefSeq.

> wd = "/mnt/krab2/MacGuigan_data/cypriniform_synNet_syntenet"
> gff_dir = "ncbi_gffs_longestIsoform_test"
> pep_dir = "ncbi_peps_longestIsoform_test"
>
> setwd(wd)
>
> # read in data and convert to proper format
> proteomes <- fasta2AAStringSetlist(pep_dir)
> annotation <- gff2GRangesList(gff_dir)
>
> # species ID table
> id_table <- create_species_id_table(names(proteomes))
>
> # check that the input data are in the correct format, returns TRUE if format is good
> check_input(proteomes, annotation, gene_field="Name")
[1] TRUE
>
> # prepare and reformat the input data
> pdata <- process_input(proteomes, annotation, gene_field="Name")
> pdata$annotation[[1]]
GRanges object with 27863 ranges and 1 metadata column:
                    seqnames        ranges strand |             gene
                       <Rle>     <IRanges>  <Rle> |      <character>
      [1]    Aal_NC_063192.1   53075-55592      * | Aal_LOC125292942
      [2]    Aal_NC_063192.1   76062-78643      * |     Aal_gpr37l1b
      [3]    Aal_NC_063192.1 101322-219960      * | Aal_LOC125293150
      [4]    Aal_NC_063192.1 232994-485103      * |         Aal_mib2
      [5]    Aal_NC_063192.1 346693-347430      * | Aal_LOC125292917
      ...                ...           ...    ... .              ...
  [27859] Aal_NW_025963120.1     3825-3898      * |    Aal_trnat-agu
  [27860] Aal_NW_025963120.1     4300-4373      * |    Aal_trnat-agu
  [27861] Aal_NW_025963120.1     5140-5213      * |    Aal_trnat-agu
  [27862] Aal_NW_025963120.1     5519-5592      * |    Aal_trnat-agu
  [27863] Aal_NW_025963120.1     5956-6029      * |    Aal_trnat-agu
  -------
  seqinfo: 172 sequences from an unspecified genome; no seqlengths

> pdata$seq[[1]]
AAStringSet object of length 23478:
        width seq                                           names
    [1]  1172 MMLSALFLLLMLWNCEGVRLP...LFVFSQEMVYFSDLKYECRDS Aal_XP_048083212
    [2]   309 MAAQMFDYAPGRTSSQHPSEA...MLLFSLMVSAVLLNGWLWMRR Aal_XP_048083213
    [3]   482 MDITRGSLEDISRPASSDRTR...VTSASDTPVDTTMLTEEEEGR Aal_XP_048083214
    [4]   852 MAAPLTDQEKRKQISIRGIVG...KIPXPSEPREAIEEAARALKL Aal_XP_048083217
    [5]   729 MDEMEEELKCPVCGSFYREPI...TGLPIPDFYTPGEPDPNGSVC Aal_XP_048083218
    ...   ... ...
[23474]    98 MTPVHFSFSMAFILGLMGLAF...LVATARTHGTDRLQNLNLLQC Aal_YP_001293640
[23475]   460 MLKVLIPTLMLFPTIWLTPKK...LHLIPVVLLVLKPEFVWGWFY Aal_YP_001293641
[23476]   611 MQTTLILSSSLTLIFALLAYP...IKTYLMIFMITTGLAALLASA Aal_YP_001293642
[23477]   173 MVYFVCLWLFGFVLGMVGVAS...LTLFVVLELTRGLSRGTLRAV Aal_YP_001293643
[23478]   380 MASLRKTHPLMKIANDALVDL...SLFLILTPLAGWAENKVLEWN Aal_YP_001293644

This is problematic because the seq and annotation name mismatches only matter for synteny network inference, not the DIAMOND search, which can lead to a lot of wasted compute time (as was the case for me).

check_input error

Hi
I was wondering if you could help me out since my data didn't pass the check_input, I'm confused since both files were obtained using Prokka 1.14.6, names for protein and headers must match. Here's the headers for the .gff file and the .fasta

##gff-version 3
##sequence-region Vibrio_cholerae_strain_1Mo 1 4024350
Vibrio_cholerae_strain_1Mo prokka gene 123 1163 . - . ID=KLDFOOAE_00001_gene;locus_tag=KLDFOOAE_00001
Vibrio_cholerae_strain_1Mo prokka mRNA 123 1163 . - . ID=KLDFOOAE_00001_mRNA;locus_tag=KLDFOOAE_00001
Vibrio_cholerae_strain_1Mo Prodigal:002006 CDS 123 1163 . - 0 ID=KLDFOOAE_00001;Parent=KLDFOOAE_00001_gene,KLDFOOAE_00001_mRNA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:RefSeq:WP_001909624.1;locus_tag=KLDFOOAE_00001;product=tape measure protein [Vibrio cholerae]
Vibrio_cholerae_strain_1Mo prokka gene 1298 1657 . - . ID=KLDFOOAE_00002_gene;locus_tag=KLDFOOAE_00002
Vibrio_cholerae_strain_1Mo prokka mRNA 1298 1657 . - . ID=KLDFOOAE_00002_mRNA;locus_tag=KLDFOOAE_00002
Vibrio_cholerae_strain_1Mo Prodigal:002006 CDS 1298 1657 . - 0 ID=KLDFOOAE_00002;Parent=KLDFOOAE_00002_gene,KLDFOOAE_00002_mRNA;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:RefSeq:WP_000290080.1;locus_tag=KLDFOOAE_00002;product=MULTISPECIES: phage tail assembly protein [Vibrio]

>KLDFOOAE_00001 tape measure protein [Vibrio cholerae]
MANNLKTDIVLNLQGDLAQKARSYSKEMTTLATRSKAAFSMISSSAIAASRGIDTFGNRL
LFITGAAAVGFERTFVKTAAEFERYQTMLNKLQGSPEAGAKAMAWIEEFTQNTPYAIDEV
TQSFVRLKAFGIDPMDGTMQSIADQAAMIGGTAETVEGIATALGQAWTKGKLQSEEALQL
LERGVPVWDYLIKSSKELGMNNGRGFTKEELDDMSSKGKLGRDAIRALIKQMGKESAGAA
KEQMNTWNGMISNMGDHWKLFQKDVMGSGAFTVLKDQLGEFLGMLDEMKKTGEYDEFVDK
VGRDLVEAFKSAAAAAREIKEVGEELWPVIREIGSMACSGIVNLVT
>KLDFOOAE_00002 MULTISPECIES: phage tail assembly protein [Vibrio]
MAVMTFNLEDGFKVGDAQCHEVGLKELTPKDVFDAQLASEKIGILNGRPHAYTSDVQMGM
ELLCRQVEFIGNVQGPFSVKEILKLSSRDFATLQQKARELDDIMFSDDALEGLEARGRD

I'm still confused in terms of what headers should match. I'm bewildered since both files were obtained from the same application Prokka 1.14.6 and coding must be preserved through files output. Nevertheless I reedit the .fasta file so it carries only the code, but no success. This is how fasta file would look after edition

>KLDFOOAE_00001
MANNLKTDIVLNLQGDLAQKARSYSKEMTTLATRSKAAFSMISSSAIAASRGIDTFGNRL
LFITGAAAVGFERTFVKTAAEFERYQTMLNKLQGSPEAGAKAMAWIEEFTQNTPYAIDEV
TQSFVRLKAFGIDPMDGTMQSIADQAAMIGGTAETVEGIATALGQAWTKGKLQSEEALQL
LERGVPVWDYLIKSSKELGMNNGRGFTKEELDDMSSKGKLGRDAIRALIKQMGKESAGAA
KEQMNTWNGMISNMGDHWKLFQKDVMGSGAFTVLKDQLGEFLGMLDEMKKTGEYDEFVDK
VGRDLVEAFKSAAAAAREIKEVGEELWPVIREIGSMACSGIVNLVT
>KLDFOOAE_00002
MAVMTFNLEDGFKVGDAQCHEVGLKELTPKDVFDAQLASEKIGILNGRPHAYTSDVQMGM
ELLCRQVEFIGNVQGPFSVKEILKLSSRDFATLQQKARELDDIMFSDDALEGLEARGRD

I was wondering how should I edit my files so they match. Thanks ahead

Syntenet error in infer_syntenet() Error in blast_list[[1]] : subscript out of bounds

I'm trying to run the package Syntenet for analyzing synteny of Tomato genes using a modified ITAG4.0 protein sequences.

when i tried infer_synt(), i got the following error.

synt_tom <- infer_syntenet(tomblast,proc_data$annotation)

Error in blast_list[[1]] : subscript out of bounds

I'm not familiar with the data structures, Please kindly explain this error and please suggest a fix.

Thank you