Comments (2)
Hi, @PeterMulhair
Thanks a lot for your feedback.
I agree that expanding a bit on this section will be helpful. I will write an FAQ section with details on how to export the sequences from process_input(), run DIAMOND, and import the output back into the R session. I will try to do it asap, but here's a solution that works for your case:
It seems that you have all comparisons in a single file, right? Could you confirm that this file contains all possible pairwise comparisons?
As you have species IDs in front of gene names, you can create a column (here, named comparison) that contains information on which pairwise comparison each row refers to. In the code below, I remove underscores and everything after them (regex: "_.*") from columns 1 and 2, and then concatenate the two strings, separated by _.
# Look at the data
allvall
#> V1 V2 V3 V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305 2 2 1
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11 1 4
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32 3 1
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33 2 1
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42 3 5
#> V8 V9 V10 V11 V12
#> 1 302 1 295 2.1e-90 336.7
#> 2 302 7 308 2.3e-84 316.6
#> 3 302 1 317 1.1e-75 287.7
#> 4 302 1 308 1.8e-73 280.4
#> 5 302 21 327 7.4e-67 258.5
# Add a column containing species IDs of genes in query and db
allvall$comparison <- paste(
    gsub("_.*", "", allvall[, 1]),
    gsub("_.*", "", allvall[, 2]),
    sep = "_"
)
allvall # look at the data
#> V1 V2 V3 V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305 2 2 1
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11 1 4
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32 3 1
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33 2 1
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42 3 5
#> V8 V9 V10 V11 V12 comparison
#> 1 302 1 295 2.1e-90 336.7 SynAndr_SynMyop
#> 2 302 7 308 2.3e-84 316.6 SynAndr_SynVesp
#> 3 302 1 317 1.1e-75 287.7 SynAndr_SesApif
#> 4 302 1 308 1.8e-73 280.4 SynAndr_SesBemb
#> 5 302 21 327 7.4e-67 258.5 SynAndr_ZeuPyri
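Before splitting, it can help to confirm that every pairwise comparison (including each species against itself) is actually present in the table. Below is a minimal sketch on a toy table; the IDs "spA" and "spB" and the object names species, expected, and missing_comparisons are made up for illustration, and the check assumes species IDs contain no underscores.

```r
# Toy all-vs-all table with the same ID layout as above (species prefix,
# underscore, gene ID). Only columns 1 and 2 matter for this check.
allvall <- data.frame(
    V1 = c("spA_g1", "spA_g1", "spB_g1"),
    V2 = c("spA_g2", "spB_g1", "spB_g2")
)
allvall$comparison <- paste(
    gsub("_.*", "", allvall[, 1]),
    gsub("_.*", "", allvall[, 2]),
    sep = "_"
)

# All species prefixes seen as query or db
species <- sort(unique(unlist(strsplit(allvall$comparison, "_", fixed = TRUE))))

# Every ordered pair of species, including self-comparisons
pairs <- expand.grid(query = species, db = species, stringsAsFactors = FALSE)
expected <- paste(pairs$query, pairs$db, sep = "_")

# Comparisons that are expected but absent from the table
missing_comparisons <- setdiff(expected, unique(allvall$comparison))
missing_comparisons
#> [1] "spB_spA"
```

An empty result would mean that all comparisons are there; anything listed points to a comparison that has no rows in the file.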
Now, you can use the base R function split() to create a list of data frames from a data frame based on the unique values in the column comparison:
# Create a list of data frames based on unique values in `comparison`
allvall_list <- split(allvall, allvall$comparison)
allvall_list # look at the data
#> $SynAndr_SesApif
#> V1 V2 V3 V4 V5 V6 V7
#> 3 SynAndr_BRAKERSADG00000005258 SesApif_BRAKERSAFG00000004954 85.2 317 32 3 1
#> V8 V9 V10 V11 V12 comparison
#> 3 302 1 317 1.1e-75 287.7 SynAndr_SesApif
#>
#> $SynAndr_SesBemb
#> V1 V2 V3 V4 V5 V6 V7
#> 4 SynAndr_BRAKERSADG00000005258 SesBemb_BRAKERBSSG00005006065 87.3 308 33 2 1
#> V8 V9 V10 V11 V12 comparison
#> 4 302 1 308 1.8e-73 280.4 SynAndr_SesBemb
#>
#> $SynAndr_SynMyop
#> V1 V2 V3 V4 V5 V6 V7
#> 1 SynAndr_BRAKERSADG00000005258 SynMyop_BRAKERSMFG00000005851 95.1 305 2 2 1
#> V8 V9 V10 V11 V12 comparison
#> 1 302 1 295 2.1e-90 336.7 SynAndr_SynMyop
#>
#> $SynAndr_SynVesp
#> V1 V2 V3 V4 V5 V6 V7
#> 2 SynAndr_BRAKERSADG00000005258 SynVesp_BRAKERSVEG00000005697 95.4 302 11 1 4
#> V8 V9 V10 V11 V12 comparison
#> 2 302 7 308 2.3e-84 316.6 SynAndr_SynVesp
#>
#> $SynAndr_ZeuPyri
#> V1 V2 V3 V4 V5 V6 V7
#> 5 SynAndr_BRAKERSADG00000005258 ZeuPyri_BRAKERZPYG00000006574 83.4 307 42 3 5
#> V8 V9 V10 V11 V12 comparison
#> 5 302 21 327 7.4e-67 258.5 SynAndr_ZeuPyri
Finally, as infer_syntenet() expects DIAMOND/BLAST tables to have 12 columns, we keep only the first 12 columns (removing the comparison column).
# Loop through each data frame in the list to remove the `comparison` column
allvall_list <- lapply(allvall_list, function(x) return(x[, 1:12]))
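If you need to repeat this for several files, the three steps above (label, split, drop the extra column) could be wrapped in a small function. This is only a sketch: split_by_comparison() is a hypothetical helper, not part of syntenet, and it assumes gene IDs look like "<speciesID>_<geneID>" with no underscore inside the species ID.

```r
# Hypothetical helper (not part of syntenet): split a single all-vs-all
# table into a list of 12-column tables, one per species pair.
split_by_comparison <- function(blast_table) {
    stopifnot(ncol(blast_table) >= 12)

    # Species ID = everything before the first underscore
    comparison <- paste(
        gsub("_.*", "", blast_table[, 1]),
        gsub("_.*", "", blast_table[, 2]),
        sep = "_"
    )

    # One data frame per comparison, keeping only the 12 BLAST columns
    lapply(split(blast_table, comparison), function(x) x[, 1:12])
}

# Toy input: 2 ID columns plus 10 placeholder numeric columns
toy <- data.frame(
    V1 = c("spA_g1", "spA_g1", "spB_g1"),
    V2 = c("spA_g2", "spB_g1", "spA_g1"),
    matrix(0, nrow = 3, ncol = 10)
)

res <- split_by_comparison(toy)
names(res)
vapply(res, ncol, integer(1))
```

Because the grouping vector is passed to split() directly, the comparison column never has to be added to the table and removed again.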
However, this might not be all. Species names in allvall_list must match species names in the seq and annotation lists returned by process_input(). For instance, if names(annotation) is "SpeciesA" and "SpeciesB", the names in allvall_list should be "SpeciesA_SpeciesA", "SpeciesA_SpeciesB", "SpeciesB_SpeciesA", and "SpeciesB_SpeciesB", not the abbreviated species IDs. So you may have to replace the species abbreviations in names(allvall_list) with full species names so that they match names(annotation). In the example I mentioned, to replace abbreviations "spA" and "spB" with names "SpeciesA" and "SpeciesB", you'd do:
# Replace "spA" and "spB" in list names with "SpeciesA" and "SpeciesB"
names(allvall_list) <- stringr::str_replace_all(
    names(allvall_list),
    c(
        "spA" = "SpeciesA",
        "spB" = "SpeciesB"
    )
)
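Pattern-based replacement can match more than intended if one abbreviation happens to be contained in another ID or name. A more defensive variant, sketched below in base R, splits each list name into its two halves and translates each half through an exact lookup. The lookup id_to_name and the toy list_names are hypothetical, and the approach assumes abbreviations contain no underscores.

```r
# Hypothetical lookup: abbreviation -> full species name
id_to_name <- c(spA = "SpeciesA", spB = "SpeciesB")

# Toy list names in "query_db" form; in practice, names(allvall_list)
list_names <- c("spA_spA", "spA_spB", "spB_spA", "spB_spB")

# Split each name into its two halves and translate each half exactly
halves <- strsplit(list_names, "_", fixed = TRUE)
new_names <- vapply(halves, function(h) {
    stopifnot(length(h) == 2, all(h %in% names(id_to_name)))
    paste(id_to_name[h], collapse = "_")
}, character(1))

# Names expected from the species names; in practice, names(annotation)
species <- c("SpeciesA", "SpeciesB")
pairs <- expand.grid(query = species, db = species, stringsAsFactors = FALSE)
expected <- paste(pairs$query, pairs$db, sep = "_")

# TRUE if the renamed list covers exactly the expected comparisons
setequal(new_names, expected)
```

The stopifnot() call makes the renaming fail early on any ID that is missing from the lookup, instead of silently leaving it untranslated.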
I also noticed that your species abbreviations are longer than the ones created by syntenet in process_input(). IDs created by syntenet have 3-5 characters. Did you add the IDs yourself? If so, I'd recommend letting syntenet do it. You will probably run into errors later if you use your own IDs.
Best,
Fabricio
Thanks so much for getting back so quickly on this, this is extremely helpful!
A few points first:
(1) Yes, my BLAST output was a single file of all-vs-all sequence searches, which includes self hits.
(2) The species IDs in front of my gene IDs were ones I created myself, and I can easily edit these so they match syntenet's format. However, I should note that when I do this (i.e., make all species IDs 5 characters long), some duplicate names emerge, which is an issue. It might be worth considering in the base code of syntenet that this 5-character limit can cause duplicate names (again, especially with large datasets with lots of species).
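A quick way to spot such collisions up front is sketched below. It assumes, purely for illustration, that short IDs are made by truncating each name to its first 5 characters; this is not necessarily how syntenet derives its IDs, and the species names are placeholders.

```r
# Placeholder species names; replace with your own
species_names <- c("Speciesalpha", "Speciesbeta", "Otherspecies")

# Assumption for illustration: short ID = first 5 characters of the name
ids <- substr(species_names, 1, 5)

# IDs that occur more than once after truncation
dup_ids <- unique(ids[duplicated(ids)])

# Which species names collapse onto each duplicated ID
collisions <- split(species_names, ids)[dup_ids]
collisions
```

If dup_ids is empty, the truncated IDs are unique; otherwise collisions lists the names that would need a different abbreviation.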
The base R code you provided to add species names as headers worked perfectly, and I could successfully complete the network inference step with infer_syntenet().
I appreciate you providing this code very much, especially as it would've taken me quite a while to crack this.
Thanks again for all your help, I'll close this issue now as my problem has been solved.
Best,
Peter