
Extended documentation can be found on the website: https://majkamichal.github.io/naivebayes/

Naïve Bayes


1. Overview

The naivebayes package provides an efficient implementation of the widely used Naïve Bayes classifier. It upholds three core principles: efficiency, user-friendliness, and reliance solely on Base R. The latter keeps the package stable and reliable without introducing external dependencies¹, while efficiency is preserved by leveraging the optimized routines of Base R, many of which are implemented in high-performance languages such as C/C++ or FORTRAN.

The naive_bayes() function detects the class of each feature in a dataset (in the R sense, e.g. factor, character, logical or numeric) and, depending on user specifications, assumes a possibly different distribution for each of them. It currently supports the following class conditional distributions:

  • categorical distribution for discrete features (with Bernoulli distribution as a special case for binary outcomes)
  • Poisson distribution for non-negative integer features
  • Gaussian distribution for continuous features
  • non-parametrically estimated densities via Kernel Density Estimation for continuous features

In addition, specialized functions are available which implement:

  • Bernoulli Naive Bayes via bernoulli_naive_bayes()
  • Multinomial Naive Bayes via multinomial_naive_bayes()
  • Poisson Naive Bayes via poisson_naive_bayes()
  • Gaussian Naive Bayes via gaussian_naive_bayes()
  • Non-Parametric Naive Bayes via nonparametric_naive_bayes()

These specialized functions are carefully optimized for efficiency, relying on linear algebra operations that excel when handling dense matrices. They can also exploit sparse matrices for enhanced performance and work in the presence of missing data. The package additionally includes various helper functions to improve the user experience. Moreover, the general naive_bayes() function can be used through the excellent caret package, providing additional versatility.
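
As a brief, illustrative sketch (not taken from the package documentation), one of the specialized classifiers can be fitted directly on a sparse count matrix. The data, object names and the laplace value below are made up for illustration, and the Matrix package is assumed to be installed:

library(naivebayes)
library(Matrix)

set.seed(1)
n_docs  <- 200
n_terms <- 50

# Sparse matrix of simulated word counts (most entries are zero)
X <- Matrix::rsparsematrix(n_docs, n_terms, density = 0.1,
                           rand.x = function(n) rpois(n, lambda = 2) + 1)
colnames(X) <- paste0("term", seq_len(n_terms))
y <- factor(sample(c("spam", "ham"), n_docs, replace = TRUE))

# Specialized classifier with Laplace smoothing; x/y interface as in naive_bayes()
mnb <- multinomial_naive_bayes(x = X, y = y, laplace = 1)
predict(mnb, newdata = X[1:5, ], type = "prob")

The remaining specialized functions (bernoulli_naive_bayes(), poisson_naive_bayes(), gaussian_naive_bayes(), nonparametric_naive_bayes()) share the same x/y interface.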

2. Installation

The naivebayes package can be installed from CRAN by executing the following line in the console:

install.packages("naivebayes")

# Or install the development version from GitHub:
devtools::install_github("majkamichal/naivebayes")

3. Usage

The naivebayes package provides a user-friendly implementation of the Naïve Bayes algorithm via a formula interface and the classical combination of a matrix/data.frame containing the features with a vector of class labels. All functions recognize missing values, give an informative warning and, more importantly, know how to handle them. In the following, the basic usage of the main function naive_bayes() is demonstrated. Examples with the specialized Naïve Bayes classifiers can be found in the extended documentation: https://majkamichal.github.io/naivebayes/

3.1 Example data

library(naivebayes)
#> naivebayes 1.0.0 loaded
#> For more information please visit:
#> https://majkamichal.github.io/naivebayes/

# Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat  = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE,FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5,15)))
train <- data[1:95, ]
test <- data[96:100, -1]

3.2 Formula interface

nb <- naive_bayes(class ~ ., train)
summary(nb)
#> 
#> ================================= Naive Bayes ================================== 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - Gaussian: 2
#> - Prior probabilities: 
#>     - classA: 0.4842
#>     - classB: 0.5158
#> 
#> --------------------------------------------------------------------------------

# Classification
predict(nb, test, type = "class")
#> [1] classA classB classA classA classA
#> Levels: classA classB
nb %class% test
#> [1] classA classB classA classA classA
#> Levels: classA classB

# Posterior probabilities
predict(nb, test, type = "prob")
#>         classA    classB
#> [1,] 0.7174638 0.2825362
#> [2,] 0.2599418 0.7400582
#> [3,] 0.6341795 0.3658205
#> [4,] 0.5365311 0.4634689
#> [5,] 0.7186026 0.2813974
nb %prob% test
#>         classA    classB
#> [1,] 0.7174638 0.2825362
#> [2,] 0.2599418 0.7400582
#> [3,] 0.6341795 0.3658205
#> [4,] 0.5365311 0.4634689
#> [5,] 0.7186026 0.2813974

# Helper functions
tables(nb, 1)
#> -------------------------------------------------------------------------------- 
#> :: bern (Bernoulli) 
#> -------------------------------------------------------------------------------- 
#>     
#> bern    classA    classB
#>    A 0.5000000 0.5510204
#>    B 0.5000000 0.4489796
#> 
#> --------------------------------------------------------------------------------
get_cond_dist(nb)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"    "Gaussian"    "Gaussian"

# Note: all "numeric" (integer, double) variables are modelled
#       with Gaussian distribution by default.
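
As mentioned above, all functions recognize missing values. A minimal sketch of that behaviour using the simulated data (assumed behaviour based on the description in Section 3: the NA triggers an informative warning and is skipped when the affected conditional distribution is estimated):

train_na <- train
train_na$norm[1] <- NA
nb_na <- naive_bayes(class ~ ., train_na)  # expected to fit and warn about the NA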

3.3 Matrix/data.frame and class vector

X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
nb2 %prob% test
#>         classA    classB
#> [1,] 0.7174638 0.2825362
#> [2,] 0.2599418 0.7400582
#> [3,] 0.6341795 0.3658205
#> [4,] 0.5365311 0.4634689
#> [5,] 0.7186026 0.2813974

3.4 Non-parametric estimation for continuous features

Kernel density estimation can be used to estimate the class conditional densities of continuous features. It has to be explicitly requested via the parameter usekernel = TRUE, otherwise a Gaussian distribution is assumed. The estimation is performed with the built-in R function density(). By default, a Gaussian smoothing kernel and Silverman’s rule of thumb as the bandwidth selector are used:

nb_kde <- naive_bayes(class ~ ., train, usekernel = TRUE)
summary(nb_kde)
#> 
#> ================================= Naive Bayes ================================== 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train, usekernel = TRUE) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - KDE: 2
#> - Prior probabilities: 
#>     - classA: 0.4842
#>     - classB: 0.5158
#> 
#> --------------------------------------------------------------------------------
get_cond_dist(nb_kde)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"         "KDE"         "KDE"
nb_kde %prob% test
#>         classA    classB
#> [1,] 0.6498111 0.3501889
#> [2,] 0.2279460 0.7720540
#> [3,] 0.5915046 0.4084954
#> [4,] 0.5876798 0.4123202
#> [5,] 0.7017584 0.2982416

# Class conditional densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

# Marginal densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "marginal")

3.4.1 Changing kernel

In general, there are 7 different smoothing kernels available:

  • gaussian
  • epanechnikov
  • rectangular
  • triangular
  • biweight
  • cosine
  • optcosine

and they can be specified in naive_bayes() via the additional parameter kernel. The Gaussian kernel is the default smoothing kernel. Please see density() and bw.nrd() for further details.

# Change Gaussian kernel to biweight kernel
nb_kde_biweight <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               kernel = "biweight")
nb_kde_biweight %prob% test
#>         classA    classB
#> [1,] 0.6564159 0.3435841
#> [2,] 0.2350606 0.7649394
#> [3,] 0.5917223 0.4082777
#> [4,] 0.5680244 0.4319756
#> [5,] 0.6981813 0.3018187
plot(nb_kde_biweight, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

3.4.2 Changing bandwidth selector

The density() function offers 5 different bandwidth selectors, which can be specified via the bw parameter:

  • nrd0 (Silverman’s rule-of-thumb)
  • nrd (variation of the rule-of-thumb)
  • ucv (unbiased cross-validation)
  • bcv (biased cross-validation)
  • SJ (Sheather & Jones method)

# Change the bandwidth selector to the Sheather-Jones method
nb_kde_SJ <- naive_bayes(class ~ ., train, usekernel = TRUE,
                         bw = "SJ")
nb_kde_SJ %prob% test
#>         classA    classB
#> [1,] 0.6127232 0.3872768
#> [2,] 0.1827263 0.8172737
#> [3,] 0.5784831 0.4215169
#> [4,] 0.7031048 0.2968952
#> [5,] 0.6699132 0.3300868
plot(nb_kde_SJ, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

3.4.3 Adjusting bandwidth

The parameter adjust allows rescaling of the estimated bandwidth and thus introduces more flexibility into the estimation process. The default value of 1 means no rescaling; for values below 1 the density becomes “wigglier” and for values above 1 it tends to be “smoother”:

nb_kde_adjust <- naive_bayes(class ~ ., train, usekernel = TRUE,
                             adjust = 0.5)
nb_kde_adjust %prob% test
#>         classA    classB
#> [1,] 0.5790672 0.4209328
#> [2,] 0.2075614 0.7924386
#> [3,] 0.5742479 0.4257521
#> [4,] 0.6940782 0.3059218
#> [5,] 0.7787019 0.2212981
plot(nb_kde_adjust, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

3.5 Model non-negative integers with Poisson distribution

Class conditional distributions of non-negative integer predictors can be modelled with the Poisson distribution. This can be achieved by setting usepoisson = TRUE in the naive_bayes() function and by making sure that the variables representing counts in the dataset are of class integer.

is.integer(train$count)
#> [1] TRUE
nb_pois <- naive_bayes(class ~ ., train, usepoisson = TRUE)
summary(nb_pois)
#> 
#> ================================= Naive Bayes ================================== 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train, usepoisson = TRUE) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - Poisson: 1
#>     - Gaussian: 1
#> - Prior probabilities: 
#>     - classA: 0.4842
#>     - classB: 0.5158
#> 
#> --------------------------------------------------------------------------------
get_cond_dist(nb_pois)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"    "Gaussian"     "Poisson"

nb_pois %prob% test
#>         classA    classB
#> [1,] 0.6708181 0.3291819
#> [2,] 0.2792804 0.7207196
#> [3,] 0.6214784 0.3785216
#> [4,] 0.5806921 0.4193079
#> [5,] 0.7074807 0.2925193

# Class conditional distributions
plot(nb_pois, "count", prob = "conditional")

# Marginal distributions
plot(nb_pois, "count", prob = "marginal")

Footnotes

  1. Specialized Naïve Bayes functions within the package may optionally utilize sparse matrices if the Matrix package is installed. However, the Matrix package is not a dependency, and users are not required to install or use it.


naivebayes's Issues

log(p) = -Inf

In predict.naive_bayes, around lines 46 and 50, p might be equal to 0. A line such as p[p == 0] <- threshold should be added to avoid -Inf for log(p).
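
For illustration, a minimal sketch of the thresholding idea suggested above (not the package's actual internals; threshold is a placeholder value):

p <- c(0.7, 0.3, 0)            # conditional probabilities, one of them zero
log(p)                         # yields -Inf for the zero entry
threshold <- .Machine$double.xmin
p[p == 0] <- threshold
log(p)                         # finite (very negative) values instead of -Inf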

Numerical underflow in predict.naive_bayes

Naive Bayes is vulnerable to numerical underflow in the prediction step if the dimensionality of the predictors is much larger than the number of observations. For example, consider the following:

library(naivebayes)
n <- 100; k <- 2000
X <- matrix(rnorm(n * k), nrow = n)
b <- rnorm(k)
eta <- drop(X %*% b)
y <- rbinom(n, 1, plogis(eta))
tr_idx <- 1:floor(.8 * n)
Xtrn <- X[tr_idx, ]
ytrn <- y[tr_idx]
Xtst <- X[-tr_idx, ]
ytst <- y[-tr_idx]
fit <- naive_bayes(Xtrn, ytrn, usekernel = TRUE)
preds <- predict(fit, Xtst, type = "prob")
head(preds)  # they will mostly all be NaN

I believe this is due to the implementation of the log-sum-exp operation. If these lines are replaced with the equivalent, more stable functions from package matrixStats such as logSumExp and/or rowLogSumExps, the underflow issue will probably go away.
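
A minimal base-R sketch of the stable log-sum-exp trick the issue refers to (illustrative only, not the package's internal code):

row_log_sum_exp <- function(logp) {
  # logp: matrix of per-class log joint probabilities (rows = observations)
  m <- apply(logp, 1, max)           # subtract the row-wise maximum ...
  m + log(rowSums(exp(logp - m)))    # ... so exp() cannot underflow to zero
}

# Posterior probabilities without underflow:
# posterior <- exp(logp - row_log_sum_exp(logp))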

Extracting feature importance

Excellent package! The most accessible approach to NB classification that I've found.

I'm wondering if there is a way to extract feature weight/importance from the model? I didn't see any relevant accessors nor any obvious slots in the naive_bayes object.

Error when feeding just 1 predictor into Naivebayes model

>     i<-2
>     nbmodel<-naive_bayes(data=trainset, y=trainset$label,x=trainset[2:i],usekernel= TRUE)
>     nbmodel_predict<-predict(nbmodel,as.vector(x_test))
Warning message:
In t(log_sum) + log(prior) :
  Recycling array of length 1 in array-vector arithmetic is deprecated.
  Use c() or as.vector() instead.

I suppose the package does not expect to handle a dataset with just one feature? Or am I misunderstanding some fundamental concept here?

How to use additional density()-parameters for naive_bayes() tuning

I was wondering, if further additional parameters of the stats::density() function can be used when executing naive_bayes().

Actually, I am applying the naive_bayes() classifier to a mixed-variables data set where most of the numeric data is non-negative. For this reason a log-normal prior distribution, or a KDE which ensures no probabilities are estimated for values < 0, seems to be a good choice for my case.
The stats::density() function, which you used for the KDE in naive_bayes(), has the argument 'to', which could ensure probabilities for values < 0 are zero.

Is it possible to make use of this argument when executing naive_bayes() with usekernel = TRUE?

Many thanks in advance for any reply and best regards
André

plot crashes when missing data present in trainingset

This works as expected

library(naivebayes)
m <- naive_bayes(Species ~ Sepal.Width, data=iris)
plot(m)

This crashes

iris$Sepal.Width[1] <- NA
m <- naive_bayes(Species ~ Sepal.Width, data=iris)
plot(m)

Error in seq.default(r[1], r[2], length.out = 512) : 
  'from' must be a finite number

Great package btw. I love how the naive_bayes interface is modeled after base R!

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] naivebayes_0.9.3

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2    yaml_2.2.0    
