legana / ampir

antimicrobial peptide prediction in R

R 85.98% C++ 13.44% Shell 0.58%

ampir's Issues

ampir fails when input sequences contain stop codons

I tried the following code:

library(ampir)


test_data <- data.frame(name = c("withstop","nostop"),
                        seq=c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET*",
                          "DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET"),
                        stringsAsFactors = FALSE)


predict_amps(test_data)

This produces the following output:

Could not run prediction for 1 proteins because they were either too short or contained invalid amino acids

name	seq	prob_AMP
withstop	DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET*	NA
nostop	DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET	0.6618123

I believe the correct behaviour here would be to silently remove the terminal stop codon so that prediction works properly.
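
Until such a fix exists, a possible workaround is to strip the trailing `*` before calling predict_amps. The sketch below is only an illustration: `strip_terminal_stop` is a hypothetical helper, not an ampir function, and it assumes the stop codon appears as a single `*` at the very end of the sequence.

```r
# Hypothetical helper (not part of ampir): drop a single trailing "*"
# from each sequence so a terminal stop codon does not invalidate it.
strip_terminal_stop <- function(seqs) {
  sub("\\*$", "", seqs)
}

test_data <- data.frame(
  name = c("withstop", "nostop"),
  seq = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET*",
          "DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET"),
  stringsAsFactors = FALSE
)

# Only a "*" at the end of a sequence is removed; internal "*" stay as they are.
test_data$seq <- strip_terminal_stop(test_data$seq)

# predict_amps(test_data)  # requires library(ampir)
```

Internal stop codons are deliberately left in place, since a sequence containing them is probably not a valid protein and should still be reported.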

Question about accuracy of prediction

Hi Legana,
I have read your project code and the related design process (AMP_pub), and I admire your detailed work. However, I have some questions about the accuracy of AMP prediction.
The test data I used was downloaded from the APD3 database (minimum length 10, duplicate sequences removed). The figure below is the density plot of the results, generated with ggplot2.

The figure shows that a large proportion of the data is marked as non-AMP.
AMP test data: amp_rmdup_min10.zip

For comparison, the figure below is the density plot, also generated with ggplot2, of the results predicted by amPEP.py.
Article: amPEPpy 1.0: A portable and accurate antimicrobial peptide prediction tool

I don't know whether this is due to the difference between a support vector machine and a random forest, or to the difference in how the amino acid features are calculated.
I look forward to your reply. Thanks.

Feature Request: Option to return feature vectors

At present it isn't possible to access the feature vectors calculated with "calculate_features" in any of the user-facing functions.

I suggest a couple of possible options:

Expose 'calculate_features'
Provide an option to 'predict_amps' to add all the feature columns into the table that is returned to the user.

Improve behaviour when input sequences are provided as a factor

The following code

library(ampir)


test_data <- data.frame(name = c("seq1","seq2"),
                        seq=c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET",
                              "QRVRNPQSCRWNMGVCIPFLCRVGMRQIGTCFGPRVPCCRR"))


predict_amps(test_data)

produces an error that is difficult to interpret:

Error in nchar(faa_df[, 2]) : 'nchar()' requires a character vector

ampir requires the second column of the input to be a character vector and fails if it is a factor. I see a couple of potential options for a fix:

Check if a factor is provided and stop with a more helpful error message
Automatically convert factor to character vector
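
Both options can be combined in a small pre-check. The following is a minimal sketch, not ampir's actual code: `ensure_character_seq` and its `seq_col` argument are hypothetical names, and the sequence column is assumed to be the second one. Note that since R 4.0.0 `data.frame()` defaults to `stringsAsFactors = FALSE`, so the example sets it to `TRUE` explicitly to reproduce the factor input.

```r
# Hypothetical helper (not part of ampir): make sure the sequence column
# (assumed to be the second column) is a character vector.
ensure_character_seq <- function(df, seq_col = 2) {
  # Option 2: convert a factor automatically.
  if (is.factor(df[[seq_col]])) {
    df[[seq_col]] <- as.character(df[[seq_col]])
  }
  # Option 1: stop with a clearer message for any other type.
  if (!is.character(df[[seq_col]])) {
    stop("The sequence column must be a character vector, not ",
         class(df[[seq_col]])[1], call. = FALSE)
  }
  df
}

test_data <- data.frame(
  name = c("seq1", "seq2"),
  seq = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET",
          "QRVRNPQSCRWNMGVCIPFLCRVGMRQIGTCFGPRVPCCRR"),
  stringsAsFactors = TRUE  # reproduces the default of R versions before 4.0.0
)

fixed <- ensure_character_seq(test_data)
nchar(fixed$seq)  # no longer fails, because the column is now character
```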

predict_amps silently fails if user provides a tibble instead of a data.frame

If a tibble is provided to predict_amps, it runs without error, but every prob_AMP value in the result is NA.

ampir should handle data in either tibble or data.frame format as long as it contains the correct columns.
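
One likely contributor is that single-column subsetting such as `df[, 2]` returns a vector for a data.frame but a one-column tibble for a tibble. A possible workaround, sketched here under that assumption, is to coerce the input to a plain data.frame first. `as_plain_df` is a hypothetical helper, not part of ampir, and the tibble demonstration only runs if the tibble package is installed.

```r
# Hypothetical helper (not part of ampir): coerce tibble-like input to a
# plain data.frame so that df[, 2] returns a vector.
as_plain_df <- function(x) {
  as.data.frame(x)
}

seqs <- data.frame(
  name = "seq1",
  seq = "DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET",
  stringsAsFactors = FALSE
)

if (requireNamespace("tibble", quietly = TRUE)) {
  tbl <- tibble::as_tibble(seqs)
  print(is.character(tbl[, 2]))               # FALSE: still a one-column tibble
  print(is.character(as_plain_df(tbl)[, 2]))  # TRUE: a character vector
}

plain <- as_plain_df(seqs)
# predict_amps(plain)  # requires library(ampir)
```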

NA's output in predicted AMPs

Hi @Legana, here is a reprex for the issue I encountered.

library(ampir)

# download fasta file of spider toxins
download.file("http://arachnoserver.org/fasta/all.pep.fa", "all.pep.fa", 
              method = "wget")

# load raw data downloaded from ArachnoServer
spider_protein <- read_faa("all.pep.fa")

# launch prediction
prediction <- predict_amps(spider_protein, model = "precursor")

# filter for predicted AMPs (probability >= 0.8)
my_predicted_amps <- spider_protein[prediction$prob_AMP >= 0.8,]

Viewing the my_predicted_amps data frame should show at least the first row filled with NAs. As a result, the final FASTA file contains entries like:

>NA
NA
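
The NA rows are consistent with how R handles logical indexing: when the condition is NA (here, when prob_AMP is NA because a sequence could not be predicted), the selected row is filled with NAs. The sketch below reproduces this with a toy data frame whose values are invented for illustration, and shows that wrapping the condition in `which()` drops those rows.

```r
# Toy stand-in for the predict_amps() output; the values are made up.
prediction <- data.frame(
  name = c("tox1", "tox2", "tox3"),
  prob_AMP = c(NA, 0.93, 0.41),
  stringsAsFactors = FALSE
)

# Logical indexing keeps one all-NA row for every NA in the condition.
with_na <- prediction[prediction$prob_AMP >= 0.8, ]

# which() returns only the positions where the condition is TRUE.
without_na <- prediction[which(prediction$prob_AMP >= 0.8), ]

nrow(with_na)     # 2: one all-NA row plus "tox2"
nrow(without_na)  # 1: only "tox2"
```

Applied to the reprex, this would mean filtering with `which(prediction$prob_AMP >= 0.8)` instead of the bare comparison.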

Here is my sessionInfo():

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] ampir_1.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           pillar_1.4.6        
 [3] compiler_4.0.3       gower_0.2.2         
 [5] plyr_1.8.6           class_7.3-17        
 [7] Peptides_2.4.2       iterators_1.0.12    
 [9] tools_4.0.3          rpart_4.1-15        
[11] ipred_0.9-9          lubridate_1.7.9     
[13] lifecycle_0.2.0      tibble_3.0.3        
[15] gtable_0.3.0         nlme_3.1-149        
[17] lattice_0.20-41      pkgconfig_2.0.3     
[19] rlang_0.4.7          Matrix_1.2-18       
[21] foreach_1.5.0        rstudioapi_0.11     
[23] parallel_4.0.3       yaml_2.2.1          
[25] prodlim_2019.11.13   xfun_0.17           
[27] stringr_1.4.0        withr_2.2.0         
[29] dplyr_1.0.2          pROC_1.16.2         
[31] generics_0.0.2       vctrs_0.3.4         
[33] recipes_0.1.13       stats4_4.0.3        
[35] nnet_7.3-14          grid_4.0.3          
[37] caret_6.0-86         tidyselect_1.1.0    
[39] data.table_1.13.0    glue_1.4.2          
[41] R6_2.4.1             survival_3.2-3      
[43] lava_1.6.7           kernlab_0.9-29      
[45] reshape2_1.4.4       ggplot2_3.3.2       
[47] purrr_0.3.4          magrittr_1.5        
[49] ModelMetrics_1.2.2.2 splines_4.0.3       
[51] MASS_7.3-53          scales_1.1.1        
[53] codetools_0.2-16     ellipsis_0.3.1      
[55] timeDate_3043.102    colorspace_1.4-1    
[57] tinytex_0.25         stringi_1.5.3       
[59] munsell_0.5.0        crayon_1.3.4

Here is my rstudioapi::versionInfo():

$citation

To cite RStudio in publications use:

  RStudio Team (2020). RStudio: Integrated Development
  Environment for R. RStudio, PBC, Boston, MA URL
  http://www.rstudio.com/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {RStudio: Integrated Development Environment for R},
    author = {{RStudio Team}},
    organization = {RStudio, PBC},
    address = {Boston, MA},
    year = {2020},
    url = {http://www.rstudio.com/},
  }


$mode
[1] "desktop"

$version
[1] ‘1.4.869’

$release_name
[1] "Wax Begonia"

Regards.
