Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Hi, <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url=

Importing blast database about syntenet HOT 2 CLOSED

PeterMulhair commented on June 15, 2024

Importing blast database

from syntenet.

Comments (2)

almeidasilvaf commented on June 15, 2024

Hi, @PeterMulhair

Thanks a lot for your feedback.

I agree that expanding a bit on this section will be helpful. I will write an FAQ section with details on how to export the sequences from process_input(), run DIAMOND, and import the output back to the R session. I will try to do it asap, but here's a solution that works for your case:

It seems that you have all comparisons in a single file, right? Could you confirm that this file contains all possible pairwise comparisons ($n^2$ comparisons, where n = number of species)? For example, for species A, B, and C, you should have the comparisons AA, AB, AC, BA, BB, BC, CA, CB, and CC.

As you have species IDs in front of gene names, you can create a column (here, named comparison) that contains information on which pairwise comparison each rows refers to. In the code below, I removed underscores and everything after them (regex: "_.*") for columns 1 and 2, and then I concatenate the 2 strings together, separated by _.

# Look at the data
allvall
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305  2  2  1
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11  1  4
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32  3  1
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33  2  1
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42  3  5
#>    V8 V9 V10     V11   V12
#> 1 302  1 295 2.1e-90 336.7
#> 2 302  7 308 2.3e-84 316.6
#> 3 302  1 317 1.1e-75 287.7
#> 4 302  1 308 1.8e-73 280.4
#> 5 302 21 327 7.4e-67 258.5

# Add a column containing species IDs of genes in query and db
allvall$comparison <- paste(
    gsub("_.*", "", allvall[, 1]),
    gsub("_.*", "", allvall[, 2]),
    sep = "_"
)
allvall # look at the data
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305  2  2  1
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11  1  4
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32  3  1
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33  2  1
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42  3  5
#>    V8 V9 V10     V11   V12      comparison
#> 1 302  1 295 2.1e-90 336.7 SynAndr_SynMyop
#> 2 302  7 308 2.3e-84 316.6 SynAndr_SynVesp
#> 3 302  1 317 1.1e-75 287.7 SynAndr_SesApif
#> 4 302  1 308 1.8e-73 280.4 SynAndr_SesBemb
#> 5 302 21 327 7.4e-67 258.5 SynAndr_ZeuPyri

Now, you can use the base R function split() to create a list of data frames from a data frame based on the unique observations in the column comparison:

# Create a list of data frames based on unique values in `comparison`
allvall_list <- split(allvall, allvall$comparison)
allvall_list # look at the data
#> $SynAndr_SesApif
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32  3  1
#>    V8 V9 V10     V11   V12      comparison
#> 3 302  1 317 1.1e-75 287.7 SynAndr_SesApif
#> 
#> $SynAndr_SesBemb
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33  2  1
#>    V8 V9 V10     V11   V12      comparison
#> 4 302  1 308 1.8e-73 280.4 SynAndr_SesBemb
#> 
#> $SynAndr_SynMyop
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305  2  2  1
#>    V8 V9 V10     V11   V12      comparison
#> 1 302  1 295 2.1e-90 336.7 SynAndr_SynMyop
#> 
#> $SynAndr_SynVesp
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11  1  4
#>    V8 V9 V10     V11   V12      comparison
#> 2 302  7 308 2.3e-84 316.6 SynAndr_SynVesp
#> 
#> $SynAndr_ZeuPyri
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42  3  5
#>    V8 V9 V10     V11   V12      comparison
#> 5 302 21 327 7.4e-67 258.5 SynAndr_ZeuPyri

Finally, as infer_syntenet() expects DIAMOND/BLAST tables to have 12 columns, we keep only the first 12 columns (removing the comparison column`).

# Loop through each data frame in the list to remove the `comparison` column
allvall_list <- lapply(allvall_list, function(x) return(x[, 1:12]))

However, this might not be all. Species names in allvall_list must match species names in the seq and annotation lists returned by process_input(). For instance, if names(annotation) is "SpeciesA" and "SpeciesB", the names in allvall_list should be "SpeciesA_SpeciesA", "SpeciesA_SpeciesB", "SpeciesB_SpeciesA", "SpeciesB_SpeciesB", not the abbreviated species IDs. Then, you may have to replace species abbreviations with species names in names(allvall_list) to make it match names(annotation). In the example I mentioned, to replace abbreviations "spA" and "spB" with names "SpeciesA" and "SpeciesB", you'd do:

# Replace "spA" and "spB" in list names with "speciesA" and "speciesB"
names(allvall_list) <- stringr::str_replace_all(
    names(allvall_list),
    c(
        "spA" = "speciesA",
        "spB" = "speciesB"
    )
)

I also noticed that your species abbreviations are longer that the ones created by syntenet in process_input(). IDs created by syntenet have 3-5 characters. Did you add the IDs yourself? If so, I'd recommend letting syntenet do it. You probably will have errors later if you use your own IDs.

Best,
Fabricio

from syntenet.

PeterMulhair commented on June 15, 2024

Hi @almeidasilvaf

Thanks so much for getting back so quickly on this, this is extremely helpful!

Few points first:
(1) yes my blast output was a single file of all v all sequence searches, which will include self hits
(2) the species IDs I had in front of my gene IDs were ones I created myself, I can easily edit these so they match syntenet's format. However, I should note that when I do this (i.e. make species names all 5 characters long) I get some duplicate names emerging which is an issue. Something to think about perhaps in the base code of syntenet that this limit of 5 characters might cause issues with duplicate names (again especially with large datasets of lots of species).

The base R code you provided to add species names as headers worked perfect, and I could succesfully complete the network inference step with infer_syntenet()

I appreciate you providing this code very much, especially as it would've taken me quite a while to crack this.

Thanks again for all your help, I'll close this issue now as my problem has been solved.

Best,
Peter

from syntenet.

Importing blast database about syntenet HOT 2 CLOSED

Comments (2)

Related Issues (15)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent