ampir's People

Contributors

iracooke, legana

ampir's Issues

ampir fails when input sequences contain stop codons

I tried the following code:

library(ampir)


test_data <- data.frame(name = c("withstop","nostop"),
                        seq=c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET*",
                          "DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET"),
                        stringsAsFactors = FALSE)


predict_amps(test_data)

This produces the following output

Could not run prediction for 1 proteins because they were either too short or contained invalid amino acids

name      seq                                            prob_AMP
withstop  DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET*  NA
nostop    DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET   0.6618123

I believe the correct behaviour here would be to silently remove the terminal stop codon so that prediction works properly.
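As a possible workaround until such behaviour exists, the trailing `*` can be stripped before calling `predict_amps`. The sketch below is only an illustration: `strip_terminal_stop` is a hypothetical helper, not an ampir function, and it assumes the stop codon is written as a single `*` at the very end of the sequence.

```r
# Hypothetical helper (not part of ampir): remove one trailing "*"
# from each sequence. A "*" anywhere else in the sequence is left alone.
strip_terminal_stop <- function(seqs) {
  sub("\\*$", "", seqs)
}

test_data <- data.frame(
  name = c("withstop", "nostop"),
  seq  = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET*",
           "DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET"),
  stringsAsFactors = FALSE
)

# Clean the sequence column in place; test_data can then be passed
# to predict_amps() as in the report above.
test_data$seq <- strip_terminal_stop(test_data$seq)
```

Internal stop codons are deliberately left untouched here, since they may indicate a genuinely invalid sequence rather than a formatting detail.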

Question about accuracy of prediction

Hi Legana,
I have read your project code and the related design process (AMP_pub), and I admire your detailed work. However, I have some questions about the accuracy of AMP prediction.
The test data I used was downloaded from the APD3 database (minimum length 10, duplicate sequences removed). The figure below is the density plot of the results, generated with ggplot2.
image
The figure shows that a large amount of the data is labelled as non-AMP.
AMP test data: amp_rmdup_min10.zip

For comparison, the figure below is the density plot, also generated with ggplot2, of the results predicted by amPEP.py.
Article: amPEPpy 1.0: A portable and accurate antimicrobial peptide prediction tool
image

I don't know whether this is due to the difference between a support vector machine and a random forest, or to the difference in how the amino acid features are calculated.
I look forward to your reply. Thanks.

Feature Request: Option to return feature vectors

At present it isn't possible to access the feature vectors calculated with "calculate_features" through any of the user-facing functions.

I suggest two possible options:

  1. Expose 'calculate_features'.
  2. Add an option to 'predict_amps' that includes all the feature columns in the table returned to the user.

Improve behaviour when input sequences are provided as a factor

The following code

library(ampir)


test_data <- data.frame(name = c("seq1","seq2"),
                        seq=c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET",
                              "QRVRNPQSCRWNMGVCIPFLCRVGMRQIGTCFGPRVPCCRR"))


predict_amps(test_data)

produces a difficult-to-interpret error:

Error in nchar(faa_df[, 2]) : 'nchar()' requires a character vector

ampir requires the second column of the input to be a character vector, and fails if it is a factor. I see two potential options for a fix:

  1. Check if a factor is provided and stop with a more helpful error message
  2. Automatically convert factor to character vector
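The two options could be combined in a single input check, sketched below. `coerce_seq_column` is a hypothetical name, not an existing ampir function, and the sketch assumes sequences are always in the second column. `stringsAsFactors = TRUE` is set explicitly because `data.frame()` only creates factors by default in R versions before 4.0.0.

```r
# Hypothetical input check (not part of ampir). Assumes names are in
# column 1 and sequences in column 2 of the input data frame.
coerce_seq_column <- function(faa_df) {
  if (!is.data.frame(faa_df) || ncol(faa_df) < 2) {
    stop("Input must be a data frame with names in column 1 and sequences in column 2")
  }
  if (is.factor(faa_df[[2]])) {
    # Option 2: convert a factor to a character vector automatically
    faa_df[[2]] <- as.character(faa_df[[2]])
  } else if (!is.character(faa_df[[2]])) {
    # Option 1: stop with a more helpful message for any other type
    stop("The sequence column must be a character vector, not ",
         class(faa_df[[2]])[1])
  }
  faa_df
}

test_data <- data.frame(
  name = c("seq1", "seq2"),
  seq  = c("DKLIGSCVWGAVNYTSDCNGECKRRGYKGGHCGSFANVNCWCET",
           "QRVRNPQSCRWNMGVCIPFLCRVGMRQIGTCFGPRVPCCRR"),
  stringsAsFactors = TRUE  # reproduce the factor input from the report
)

fixed <- coerce_seq_column(test_data)
seq_lengths <- nchar(fixed$seq)  # works now that the column is character
```

Until something like this is built in, passing `stringsAsFactors = FALSE` to `data.frame()` avoids the error from the caller's side.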

NA's output in predicted AMPs

Hi @Legana, here is a reprex for the issue I encounter.

library(ampir)

# download fasta file of spider toxins
download.file("http://arachnoserver.org/fasta/all.pep.fa", "all.pep.fa", 
              method = "wget")

# load raw data downloaded from ArachnoServer
spider_protein <- read_faa("all.pep.fa")

# launch prediction
prediction <- predict_amps(spider_protein, model = "precursor")

# filter for sequences with a predicted AMP probability of at least 0.8
my_predicted_amps <- spider_protein[prediction$prob_AMP >= 0.8,]

Viewing the my_predicted_amps data frame should show at least the first row as NAs. As a result, the final FASTA file contains entries like:

>NA
NA
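A likely explanation is R's indexing behaviour rather than the prediction itself: when a logical row index contains `NA`, the result includes a row filled with `NA` for each such position. If `prob_AMP` is `NA` for some sequences (presumably those that could not be scored), the comparison `>= 0.8` is also `NA` for them. The sketch below uses a toy data frame with made-up column names and probabilities, not the real ArachnoServer data, to show the effect and two ways to avoid it.

```r
# Toy stand-ins for spider_protein and prediction$prob_AMP
# (names, sequences and probabilities are invented for illustration).
proteins <- data.frame(
  name = c("a", "b", "c"),
  seq  = c("AAA", "CCC", "DDD"),
  stringsAsFactors = FALSE
)
prob_AMP <- c(NA, 0.95, 0.10)

# Same pattern as in the reprex: the NA in the logical index
# yields a row of NAs in the result.
naive <- proteins[prob_AMP >= 0.8, ]

# Fix 1: explicitly exclude missing probabilities
filtered <- proteins[!is.na(prob_AMP) & prob_AMP >= 0.8, ]

# Fix 2: which() returns only the positions that are TRUE and drops NAs
filtered_which <- proteins[which(prob_AMP >= 0.8), ]
```

Either fix keeps only the rows with a real probability above the threshold, so no `>NA` entries reach the output FASTA file.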

Here is my sessionInfo():

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] ampir_1.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5           pillar_1.4.6        
 [3] compiler_4.0.3       gower_0.2.2         
 [5] plyr_1.8.6           class_7.3-17        
 [7] Peptides_2.4.2       iterators_1.0.12    
 [9] tools_4.0.3          rpart_4.1-15        
[11] ipred_0.9-9          lubridate_1.7.9     
[13] lifecycle_0.2.0      tibble_3.0.3        
[15] gtable_0.3.0         nlme_3.1-149        
[17] lattice_0.20-41      pkgconfig_2.0.3     
[19] rlang_0.4.7          Matrix_1.2-18       
[21] foreach_1.5.0        rstudioapi_0.11     
[23] parallel_4.0.3       yaml_2.2.1          
[25] prodlim_2019.11.13   xfun_0.17           
[27] stringr_1.4.0        withr_2.2.0         
[29] dplyr_1.0.2          pROC_1.16.2         
[31] generics_0.0.2       vctrs_0.3.4         
[33] recipes_0.1.13       stats4_4.0.3        
[35] nnet_7.3-14          grid_4.0.3          
[37] caret_6.0-86         tidyselect_1.1.0    
[39] data.table_1.13.0    glue_1.4.2          
[41] R6_2.4.1             survival_3.2-3      
[43] lava_1.6.7           kernlab_0.9-29      
[45] reshape2_1.4.4       ggplot2_3.3.2       
[47] purrr_0.3.4          magrittr_1.5        
[49] ModelMetrics_1.2.2.2 splines_4.0.3       
[51] MASS_7.3-53          scales_1.1.1        
[53] codetools_0.2-16     ellipsis_0.3.1      
[55] timeDate_3043.102    colorspace_1.4-1    
[57] tinytex_0.25         stringi_1.5.3       
[59] munsell_0.5.0        crayon_1.3.4 

Here is my rstudioapi::versionInfo():

$citation

To cite RStudio in publications use:

  RStudio Team (2020). RStudio: Integrated Development
  Environment for R. RStudio, PBC, Boston, MA URL
  http://www.rstudio.com/.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {RStudio: Integrated Development Environment for R},
    author = {{RStudio Team}},
    organization = {RStudio, PBC},
    address = {Boston, MA},
    year = {2020},
    url = {http://www.rstudio.com/},
  }


$mode
[1] "desktop"

$version
[1] ‘1.4.869’

$release_name
[1] "Wax Begonia"

Regards.
