Giter Club home page Giter Club logo

Comments (2)

almeidasilvaf avatar almeidasilvaf commented on June 15, 2024

Hi, @PeterMulhair

Thanks a lot for your feedback.

I agree that expanding a bit on this section will be helpful. I will write an FAQ section with details on how to export the sequences from process_input(), run DIAMOND, and import the output back to the R session. I will try to do it asap, but here's a solution that works for your case:

It seems that you have all comparisons in a single file, right? Could you confirm that this file contains all possible pairwise comparisons ($n^2$ comparisons, where n = number of species)? For example, for species A, B, and C, you should have the comparisons AA, AB, AC, BA, BB, BC, CA, CB, and CC.

As you have species IDs in front of gene names, you can create a column (here, named comparison) that contains information on which pairwise comparison each rows refers to. In the code below, I removed underscores and everything after them (regex: "_.*") for columns 1 and 2, and then I concatenate the 2 strings together, separated by _.

# Look at the data
allvall
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305  2  2  1
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11  1  4
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32  3  1
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33  2  1
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42  3  5
#>    V8 V9 V10     V11   V12
#> 1 302  1 295 2.1e-90 336.7
#> 2 302  7 308 2.3e-84 316.6
#> 3 302  1 317 1.1e-75 287.7
#> 4 302  1 308 1.8e-73 280.4
#> 5 302 21 327 7.4e-67 258.5

# Add a column containing species IDs of genes in query and db
allvall$comparison <- paste(
    gsub("_.*", "", allvall[, 1]),
    gsub("_.*", "", allvall[, 2]),
    sep = "_"
)
allvall # look at the data
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305  2  2  1
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11  1  4
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32  3  1
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33  2  1
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42  3  5
#>    V8 V9 V10     V11   V12      comparison
#> 1 302  1 295 2.1e-90 336.7 SynAndr_SynMyop
#> 2 302  7 308 2.3e-84 316.6 SynAndr_SynVesp
#> 3 302  1 317 1.1e-75 287.7 SynAndr_SesApif
#> 4 302  1 308 1.8e-73 280.4 SynAndr_SesBemb
#> 5 302 21 327 7.4e-67 258.5 SynAndr_ZeuPyri

Now, you can use the base R function split() to create a list of data frames from a data frame based on the unique observations in the column comparison:

# Create a list of data frames based on unique values in `comparison`
allvall_list <- split(allvall, allvall$comparison)
allvall_list # look at the data
#> $SynAndr_SesApif
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32  3  1
#>    V8 V9 V10     V11   V12      comparison
#> 3 302  1 317 1.1e-75 287.7 SynAndr_SesApif
#> 
#> $SynAndr_SesBemb
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33  2  1
#>    V8 V9 V10     V11   V12      comparison
#> 4 302  1 308 1.8e-73 280.4 SynAndr_SesBemb
#> 
#> $SynAndr_SynMyop
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305  2  2  1
#>    V8 V9 V10     V11   V12      comparison
#> 1 302  1 295 2.1e-90 336.7 SynAndr_SynMyop
#> 
#> $SynAndr_SynVesp
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11  1  4
#>    V8 V9 V10     V11   V12      comparison
#> 2 302  7 308 2.3e-84 316.6 SynAndr_SynVesp
#> 
#> $SynAndr_ZeuPyri
#>                              V1                            V2   V3  V4 V5 V6 V7
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42  3  5
#>    V8 V9 V10     V11   V12      comparison
#> 5 302 21 327 7.4e-67 258.5 SynAndr_ZeuPyri

Finally, as infer_syntenet() expects DIAMOND/BLAST tables to have 12 columns, we keep only the first 12 columns (removing the comparison column`).

# Loop through each data frame in the list to remove the `comparison` column
allvall_list <- lapply(allvall_list, function(x) return(x[, 1:12]))

However, this might not be all. Species names in allvall_list must match species names in the seq and annotation lists returned by process_input(). For instance, if names(annotation) is "SpeciesA" and "SpeciesB", the names in allvall_list should be "SpeciesA_SpeciesA", "SpeciesA_SpeciesB", "SpeciesB_SpeciesA", "SpeciesB_SpeciesB", not the abbreviated species IDs. Then, you may have to replace species abbreviations with species names in names(allvall_list) to make it match names(annotation). In the example I mentioned, to replace abbreviations "spA" and "spB" with names "SpeciesA" and "SpeciesB", you'd do:

# Replace "spA" and "spB" in list names with "speciesA" and "speciesB"
names(allvall_list) <- stringr::str_replace_all(
    names(allvall_list),
    c(
        "spA" = "speciesA",
        "spB" = "speciesB"
    )
)

I also noticed that your species abbreviations are longer that the ones created by syntenet in process_input(). IDs created by syntenet have 3-5 characters. Did you add the IDs yourself? If so, I'd recommend letting syntenet do it. You probably will have errors later if you use your own IDs.

Best,
Fabricio

from syntenet.

PeterMulhair avatar PeterMulhair commented on June 15, 2024

Hi @almeidasilvaf

Thanks so much for getting back so quickly on this, this is extremely helpful!

Few points first:
(1) yes my blast output was a single file of all v all sequence searches, which will include self hits
(2) the species IDs I had in front of my gene IDs were ones I created myself, I can easily edit these so they match syntenet's format. However, I should note that when I do this (i.e. make species names all 5 characters long) I get some duplicate names emerging which is an issue. Something to think about perhaps in the base code of syntenet that this limit of 5 characters might cause issues with duplicate names (again especially with large datasets of lots of species).

The base R code you provided to add species names as headers worked perfect, and I could succesfully complete the network inference step with infer_syntenet()

I appreciate you providing this code very much, especially as it would've taken me quite a while to crack this.

Thanks again for all your help, I'll close this issue now as my problem has been solved.

Best,
Peter

from syntenet.

Related Issues (15)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.