legana / ampir
antimicrobial peptide prediction in R
I tried the following code:
library(ampir)
test_data <- data.frame(name = c("withstop", "nostop"),
                        seq = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET*",
                                "DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET"),
                        stringsAsFactors = FALSE)
predict_amps(test_data)
This produces the following output
Could not run prediction for 1 proteins because they were either too short or contained invalid amino acids
name | seq | prob_AMP |
---|---|---|
withstop | DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET* | NA |
nostop | DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET | 0.6618123 |
I believe the correct behaviour here would be to silently remove the terminal stop codon (*) so that prediction works properly.
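Until something like this is handled inside the package, the stop symbol can be stripped before prediction. The following is a minimal sketch in base R; `strip_terminal_stop` is a hypothetical helper name and is not part of ampir.

```r
# Hypothetical helper (not part of ampir): remove a single trailing "*"
# from each sequence, leaving sequences without one untouched.
strip_terminal_stop <- function(seqs) {
  sub("\\*$", "", seqs)
}

test_data <- data.frame(
  name = c("withstop", "nostop"),
  seq  = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET*",
           "DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET"),
  stringsAsFactors = FALSE
)

# After cleaning, both rows hold the same sequence
test_data$seq <- strip_terminal_stop(test_data$seq)
```

The cleaned `test_data` can then be passed to `predict_amps()` as before. Only a trailing `*` is removed; an internal `*` is left in place so that it is still reported as invalid.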
Hi Legana,
I have read your project code and the related design process (AMP_pub), and I admire your detailed work. However, I have some questions about the accuracy of the AMP predictions.
The test data I used was downloaded from the APD3 database (minimum length 10, duplicate sequences removed). The figure below is the density plot of the predictions, generated with ggplot2.
The figure shows that a large share of these sequences is marked as non-AMP.
AMP test data: amp_rmdup_min10.zip
For comparison, the figure below is the density plot of the results predicted by amPEP.py, also generated with ggplot2.
Article: amPEPpy 1.0: A portable and accurate antimicrobial peptide prediction tool
I don't know whether the discrepancy comes from the difference between a support vector machine and a random forest, or from the way the amino acid features are calculated.
I look forward to your reply. Thanks.
At present it isn't possible to access the feature vectors calculated with "calculate_features" in any of the user-facing functions.
I suggest a couple of possible options;
The following code
library(ampir)
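# stringsAsFactors = TRUE was the data.frame() default before R 4.0;
# it is set explicitly here so the error reproduces on any R version
test_data <- data.frame(name = c("seq1", "seq2"),
                        seq = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET",
                                "QRVRNPQSCRWNMGVCIPFLCRVGMRQIGTCFGPRVPCCRR"),
                        stringsAsFactors = TRUE)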
predict_amps(test_data)
produces a difficult-to-interpret error:
Error in nchar(faa_df[, 2]) : 'nchar()' requires a character vector
ampir requires the second column of the input to be a character vector and fails if it is a factor. I see a couple of potential options for a fix
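One possible workaround on the user side is to convert factor columns to character before calling `predict_amps()`. This is only a sketch in base R; `factors_to_character` is a hypothetical helper and is not part of ampir.

```r
# Hypothetical helper (not part of ampir): convert every factor column
# of a data frame to character, leaving other columns unchanged.
factors_to_character <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) as.character(col) else col
  })
  df
}

# Input with factor columns, as data.frame() created by default before R 4.0
test_data <- data.frame(
  name = c("seq1", "seq2"),
  seq  = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET",
           "QRVRNPQSCRWNMGVCIPFLCRVGMRQIGTCFGPRVPCCRR"),
  stringsAsFactors = TRUE
)

fixed <- factors_to_character(test_data)

# nchar() now receives a character vector and no longer errors
seq_lengths <- nchar(fixed[, 2])
```

The same conversion could equally be done inside the package, or the package could stop early with a clearer message when the sequence column is not a character vector.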
If a tibble is provided to predict_amps, it runs without error, but the result has NA in every prob_AMP field.
ampir should handle data in either tibble or data.frame format, as long as it contains the correct columns.
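A likely contributing factor is that single-column indexing such as `x[, 2]` returns a one-column tibble rather than a plain vector, so code written for base data frames can behave differently. A minimal sketch of one way to normalise the input, assuming a hypothetical helper `as_plain_df` that is not part of ampir:

```r
# Hypothetical helper (not part of ampir): accept any data-frame-like
# input (including a tibble) and return a plain base data.frame.
as_plain_df <- function(x) {
  stopifnot(is.data.frame(x), ncol(x) >= 2)
  as.data.frame(x)
}

# Stand-in for a tibble: a data frame carrying the tibble class vector.
# (Built this way so the example does not require the tibble package.)
tbl_like <- structure(
  data.frame(name = c("seq1", "seq2"),
             seq  = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET",
                      "QRVRNPQSCRWNMGVCIPFLCRVGMRQIGTCFGPRVPCCRR"),
             stringsAsFactors = FALSE),
  class = c("tbl_df", "tbl", "data.frame")
)

plain <- as_plain_df(tbl_like)

# Column extraction on the plain data.frame yields a character vector
second_col <- plain[, 2]
```

Calling `as.data.frame()` on a real tibble before `predict_amps()` should have the same effect on the user side.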
Hi @Legana, here is a reprex for the issue I encounter.
library(ampir)
# download fasta file of spider toxins
download.file("http://arachnoserver.org/fasta/all.pep.fa", "all.pep.fa",
method = "wget")
# load raw data downloaded from ArachnoServer
spider_protein <- read_faa("all.pep.fa")
# launch prediction
prediction <- predict_amps(spider_protein, model = "precursor")
# filter for predicted AMPs (prob_AMP >= 0.8)
my_predicted_amps <- spider_protein[prediction$prob_AMP >= 0.8,]
Viewing the my_predicted_amps data frame should show at least the first row filled with NAs. As a result, the final FASTA file contains entries like:
>NA
NA
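The NA rows are consistent with how R handles NA in a logical row index: if `prob_AMP` is NA for a protein, the comparison is also NA, and indexing with NA returns a row filled with NAs. The sketch below illustrates this with toy values (the column names and probabilities are made up for illustration and are not real ampir output) and shows that wrapping the condition in `which()` avoids it.

```r
# Toy stand-ins; values are illustrative only, not real ampir output
proteins <- data.frame(
  seq_name = c("a", "b", "c"),
  seq_aa   = c("AAA", "CCC", "DDD"),
  stringsAsFactors = FALSE
)
prob_AMP <- c(NA, 0.95, 0.10)  # first protein could not be predicted

# An NA in the logical index produces a row filled with NAs
with_na_rows <- proteins[prob_AMP >= 0.8, ]

# which() keeps only the positions where the condition is TRUE
clean <- proteins[which(prob_AMP >= 0.8), ]
```

An equivalent filter is `proteins[!is.na(prob_AMP) & prob_AMP >= 0.8, ]`.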
Here is my sessionInfo():
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] ampir_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6
[3] compiler_4.0.3 gower_0.2.2
[5] plyr_1.8.6 class_7.3-17
[7] Peptides_2.4.2 iterators_1.0.12
[9] tools_4.0.3 rpart_4.1-15
[11] ipred_0.9-9 lubridate_1.7.9
[13] lifecycle_0.2.0 tibble_3.0.3
[15] gtable_0.3.0 nlme_3.1-149
[17] lattice_0.20-41 pkgconfig_2.0.3
[19] rlang_0.4.7 Matrix_1.2-18
[21] foreach_1.5.0 rstudioapi_0.11
[23] parallel_4.0.3 yaml_2.2.1
[25] prodlim_2019.11.13 xfun_0.17
[27] stringr_1.4.0 withr_2.2.0
[29] dplyr_1.0.2 pROC_1.16.2
[31] generics_0.0.2 vctrs_0.3.4
[33] recipes_0.1.13 stats4_4.0.3
[35] nnet_7.3-14 grid_4.0.3
[37] caret_6.0-86 tidyselect_1.1.0
[39] data.table_1.13.0 glue_1.4.2
[41] R6_2.4.1 survival_3.2-3
[43] lava_1.6.7 kernlab_0.9-29
[45] reshape2_1.4.4 ggplot2_3.3.2
[47] purrr_0.3.4 magrittr_1.5
[49] ModelMetrics_1.2.2.2 splines_4.0.3
[51] MASS_7.3-53 scales_1.1.1
[53] codetools_0.2-16 ellipsis_0.3.1
[55] timeDate_3043.102 colorspace_1.4-1
[57] tinytex_0.25 stringi_1.5.3
[59] munsell_0.5.0 crayon_1.3.4
Here is my rstudioapi::versionInfo():
$citation
To cite RStudio in publications use:
RStudio Team (2020). RStudio: Integrated Development
Environment for R. RStudio, PBC, Boston, MA URL
http://www.rstudio.com/.
A BibTeX entry for LaTeX users is
@Manual{,
title = {RStudio: Integrated Development Environment for R},
author = {{RStudio Team}},
organization = {RStudio, PBC},
address = {Boston, MA},
year = {2020},
url = {http://www.rstudio.com/},
}
$mode
[1] "desktop"
$version
[1] ‘1.4.869’
$release_name
[1] "Wax Begonia"
Regards.