
Extended documentation can be found on the website: https://majkamichal.github.io/naivebayes/

Naïve Bayes


1. Overview

The naivebayes package provides an efficient implementation of the widely used Naïve Bayes classifier. It upholds three core principles: efficiency, user-friendliness, and reliance solely on Base R. The latter keeps the package stable and reliable without introducing external dependencies¹, while efficiency is preserved by leveraging the optimized routines of Base R, many of which are implemented in high-performance languages such as C/C++ or FORTRAN.

The naive_bayes() function detects the class of each feature in a dataset (in the R sense, e.g. factor, character, logical or numeric) and, depending on user specifications, assumes a possibly different distribution for each of them. It currently supports the following class conditional distributions:

  • categorical distribution for discrete features (with Bernoulli distribution as a special case for binary outcomes)
  • Poisson distribution for non-negative integer features
  • Gaussian distribution for continuous features
  • non-parametrically estimated densities via Kernel Density Estimation for continuous features

In addition, specialized functions are available which implement:

  • Bernoulli Naive Bayes via bernoulli_naive_bayes()
  • Multinomial Naive Bayes via multinomial_naive_bayes()
  • Poisson Naive Bayes via poisson_naive_bayes()
  • Gaussian Naive Bayes via gaussian_naive_bayes()
  • Non-Parametric Naive Bayes via nonparametric_naive_bayes()

These specialized functions are carefully optimized for efficiency, relying on linear algebra operations that excel when handling dense matrices. They can also exploit sparse matrices for enhanced performance and work in the presence of missing data. The package additionally includes various helper functions to improve the user experience. Moreover, the general naive_bayes() function can be used through the excellent caret package, providing additional versatility.
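
As a brief, illustrative sketch (not taken from the package documentation), one of the specialized classifiers can be fitted directly on a sparse count matrix. The data, object names and the laplace value below are made up for illustration, and the Matrix package is assumed to be installed:

library(naivebayes)
library(Matrix)

set.seed(1)
n_docs  <- 200
n_terms <- 50

# Sparse matrix of simulated word counts (most entries are zero)
X <- Matrix::rsparsematrix(n_docs, n_terms, density = 0.1,
                           rand.x = function(n) rpois(n, lambda = 2) + 1)
colnames(X) <- paste0("term", seq_len(n_terms))
y <- factor(sample(c("spam", "ham"), n_docs, replace = TRUE))

# Specialized classifier with Laplace smoothing; x/y interface as in naive_bayes()
mnb <- multinomial_naive_bayes(x = X, y = y, laplace = 1)
predict(mnb, newdata = X[1:5, ], type = "prob")

The remaining specialized functions (bernoulli_naive_bayes(), poisson_naive_bayes(), gaussian_naive_bayes(), nonparametric_naive_bayes()) share the same x/y interface.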

2. Installation

The naivebayes package can be installed from CRAN by executing the following line in the console:

install.packages("naivebayes")

# Or install the development version from GitHub:
devtools::install_github("majkamichal/naivebayes")

3. Usage

The naivebayes package provides a user-friendly implementation of the Naïve Bayes algorithm via a formula interface and the classical combination of a matrix/data.frame containing the features with a vector of class labels. All functions recognize missing values, give an informative warning and, more importantly, know how to handle them. In the following, the basic usage of the main function naive_bayes() is demonstrated. Examples with the specialized Naïve Bayes classifiers can be found in the extended documentation: https://majkamichal.github.io/naivebayes/

3.1 Example data

library(naivebayes)
#> naivebayes 1.0.0 loaded
#> For more information please visit:
#> https://majkamichal.github.io/naivebayes/

# Simulate example data
n <- 100
set.seed(1)
data <- data.frame(class = sample(c("classA", "classB"), n, TRUE),
                   bern = sample(LETTERS[1:2], n, TRUE),
                   cat  = sample(letters[1:3], n, TRUE),
                   logical = sample(c(TRUE,FALSE), n, TRUE),
                   norm = rnorm(n),
                   count = rpois(n, lambda = c(5,15)))
train <- data[1:95, ]
test <- data[96:100, -1]

3.2 Formula interface

nb <- naive_bayes(class ~ ., train)
summary(nb)
#> 
#> ================================= Naive Bayes ================================== 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - Gaussian: 2
#> - Prior probabilities: 
#>     - classA: 0.4842
#>     - classB: 0.5158
#> 
#> --------------------------------------------------------------------------------

# Classification
predict(nb, test, type = "class")
#> [1] classA classB classA classA classA
#> Levels: classA classB
nb %class% test
#> [1] classA classB classA classA classA
#> Levels: classA classB

# Posterior probabilities
predict(nb, test, type = "prob")
#>         classA    classB
#> [1,] 0.7174638 0.2825362
#> [2,] 0.2599418 0.7400582
#> [3,] 0.6341795 0.3658205
#> [4,] 0.5365311 0.4634689
#> [5,] 0.7186026 0.2813974
nb %prob% test
#>         classA    classB
#> [1,] 0.7174638 0.2825362
#> [2,] 0.2599418 0.7400582
#> [3,] 0.6341795 0.3658205
#> [4,] 0.5365311 0.4634689
#> [5,] 0.7186026 0.2813974

# Helper functions
tables(nb, 1)
#> -------------------------------------------------------------------------------- 
#> :: bern (Bernoulli) 
#> -------------------------------------------------------------------------------- 
#>     
#> bern    classA    classB
#>    A 0.5000000 0.5510204
#>    B 0.5000000 0.4489796
#> 
#> --------------------------------------------------------------------------------
get_cond_dist(nb)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"    "Gaussian"    "Gaussian"

# Note: all "numeric" (integer, double) variables are modelled
#       with Gaussian distribution by default.
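
As mentioned above, all functions recognize missing values. A minimal sketch of that behaviour using the simulated data (assumed behaviour based on the description in Section 3: the NA triggers an informative warning and is skipped when the affected conditional distribution is estimated):

train_na <- train
train_na$norm[1] <- NA
nb_na <- naive_bayes(class ~ ., train_na)  # expected to fit and warn about the NA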

3.3 Matrix/data.frame and class vector

X <- train[-1]
class <- train$class
nb2 <- naive_bayes(x = X, y = class)
nb2 %prob% test
#>         classA    classB
#> [1,] 0.7174638 0.2825362
#> [2,] 0.2599418 0.7400582
#> [3,] 0.6341795 0.3658205
#> [4,] 0.5365311 0.4634689
#> [5,] 0.7186026 0.2813974

3.4 Non-parametric estimation for continuous features

Kernel density estimation can be used to estimate the class conditional densities of continuous features. It has to be explicitly requested via the parameter usekernel = TRUE, otherwise a Gaussian distribution is assumed. The estimation is performed with the built-in R function density(). By default, a Gaussian smoothing kernel and Silverman’s rule of thumb as the bandwidth selector are used:

nb_kde <- naive_bayes(class ~ ., train, usekernel = TRUE)
summary(nb_kde)
#> 
#> ================================= Naive Bayes ================================== 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train, usekernel = TRUE) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - KDE: 2
#> - Prior probabilities: 
#>     - classA: 0.4842
#>     - classB: 0.5158
#> 
#> --------------------------------------------------------------------------------
get_cond_dist(nb_kde)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"         "KDE"         "KDE"
nb_kde %prob% test
#>         classA    classB
#> [1,] 0.6498111 0.3501889
#> [2,] 0.2279460 0.7720540
#> [3,] 0.5915046 0.4084954
#> [4,] 0.5876798 0.4123202
#> [5,] 0.7017584 0.2982416

# Class conditional densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

# Marginal densities
plot(nb_kde, "norm", arg.num = list(legend.cex = 0.9), prob = "marginal")

3.4.1 Changing kernel

In general, there are 7 different smoothing kernels available:

  • gaussian
  • epanechnikov
  • rectangular
  • triangular
  • biweight
  • cosine
  • optcosine

and they can be specified in naive_bayes() via the additional parameter kernel. The Gaussian kernel is the default smoothing kernel. Please see density() and bw.nrd() for further details.

# Change Gaussian kernel to biweight kernel
nb_kde_biweight <- naive_bayes(class ~ ., train, usekernel = TRUE,
                               kernel = "biweight")
nb_kde_biweight %prob% test
#>         classA    classB
#> [1,] 0.6564159 0.3435841
#> [2,] 0.2350606 0.7649394
#> [3,] 0.5917223 0.4082777
#> [4,] 0.5680244 0.4319756
#> [5,] 0.6981813 0.3018187
plot(nb_kde_biweight, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

3.4.2 Changing bandwidth selector

The density() function offers 5 different bandwidth selectors, which can be specified via the bw parameter:

  • nrd0 (Silverman’s rule-of-thumb)
  • nrd (variation of the rule-of-thumb)
  • ucv (unbiased cross-validation)
  • bcv (biased cross-validation)
  • SJ (Sheather & Jones method)

# Change the bandwidth selector to the Sheather-Jones method
nb_kde_SJ <- naive_bayes(class ~ ., train, usekernel = TRUE,
                         bw = "SJ")
nb_kde_SJ %prob% test
#>         classA    classB
#> [1,] 0.6127232 0.3872768
#> [2,] 0.1827263 0.8172737
#> [3,] 0.5784831 0.4215169
#> [4,] 0.7031048 0.2968952
#> [5,] 0.6699132 0.3300868
plot(nb_kde_SJ, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

3.4.3 Adjusting bandwidth

The parameter adjust allows rescaling of the estimated bandwidth and thus introduces more flexibility into the estimation process. The default value of 1 means no rescaling; for values below 1 the density becomes “wigglier” and for values above 1 it tends to be “smoother”:

nb_kde_adjust <- naive_bayes(class ~ ., train, usekernel = TRUE,
                             adjust = 0.5)
nb_kde_adjust %prob% test
#>         classA    classB
#> [1,] 0.5790672 0.4209328
#> [2,] 0.2075614 0.7924386
#> [3,] 0.5742479 0.4257521
#> [4,] 0.6940782 0.3059218
#> [5,] 0.7787019 0.2212981
plot(nb_kde_adjust, "norm", arg.num = list(legend.cex = 0.9), prob = "conditional")

3.5 Model non-negative integers with Poisson distribution

Class conditional distributions of non-negative integer predictors can be modelled with the Poisson distribution. This can be achieved by setting usepoisson = TRUE in the naive_bayes() function and by making sure that the variables representing counts in the dataset are of class integer.

is.integer(train$count)
#> [1] TRUE
nb_pois <- naive_bayes(class ~ ., train, usepoisson = TRUE)
summary(nb_pois)
#> 
#> ================================= Naive Bayes ================================== 
#>  
#> - Call: naive_bayes.formula(formula = class ~ ., data = train, usepoisson = TRUE) 
#> - Laplace: 0 
#> - Classes: 2 
#> - Samples: 95 
#> - Features: 5 
#> - Conditional distributions: 
#>     - Bernoulli: 2
#>     - Categorical: 1
#>     - Poisson: 1
#>     - Gaussian: 1
#> - Prior probabilities: 
#>     - classA: 0.4842
#>     - classB: 0.5158
#> 
#> --------------------------------------------------------------------------------
get_cond_dist(nb_pois)
#>          bern           cat       logical          norm         count 
#>   "Bernoulli" "Categorical"   "Bernoulli"    "Gaussian"     "Poisson"

nb_pois %prob% test
#>         classA    classB
#> [1,] 0.6708181 0.3291819
#> [2,] 0.2792804 0.7207196
#> [3,] 0.6214784 0.3785216
#> [4,] 0.5806921 0.4193079
#> [5,] 0.7074807 0.2925193

# Class conditional distributions
plot(nb_pois, "count", prob = "conditional")

# Marginal distributions
plot(nb_pois, "count", prob = "marginal")

Footnotes

  1. Specialized Naïve Bayes functions within the package may optionally utilize sparse matrices if the Matrix package is installed. However, the Matrix package is not a dependency, and users are not required to install or use it.


naivebayes's Issues

log(p) = -Inf

In predict.naive_bayes, around lines 46 and 50, p might be equal to 0. A line such as p[p == 0] <- threshold should be added to avoid -Inf for log(p).
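
For illustration, a minimal sketch of the thresholding idea suggested above (not the package's actual internals; threshold is a placeholder value):

p <- c(0.7, 0.3, 0)            # conditional probabilities, one of them zero
log(p)                         # yields -Inf for the zero entry
threshold <- .Machine$double.xmin
p[p == 0] <- threshold
log(p)                         # finite (very negative) values instead of -Inf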

Numerical underflow in predict.naive_bayes

Naive Bayes is vulnerable to numerical underflow in the prediction step if the dimensionality of the predictors is much larger than the number of observations. For example, consider the following:

library(naivebayes)
n <- 100; k <- 2000
X <- matrix(rnorm(n * k), nrow = n)
b <- rnorm(k)
eta <- drop(X %*% b)
y <- rbinom(n, 1, plogis(eta))
tr_idx <- 1:floor(.8 * n)
Xtrn <- X[tr_idx, ]
ytrn <- y[tr_idx]
Xtst <- X[-tr_idx, ]
ytst <- y[-tr_idx]
fit <- naive_bayes(Xtrn, ytrn, usekernel = TRUE)
preds <- predict(fit, Xtst, type = "prob")
head(preds)  # they will mostly all be NaN

I believe this is due to the implementation of the log-sum-exp operation. If these lines are replaced with the equivalent, more stable functions from package matrixStats such as logSumExp and/or rowLogSumExps, the underflow issue will probably go away.
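
A minimal base-R sketch of the stable log-sum-exp trick the issue refers to (illustrative only, not the package's internal code):

row_log_sum_exp <- function(logp) {
  # logp: matrix of per-class log joint probabilities (rows = observations)
  m <- apply(logp, 1, max)           # subtract the row-wise maximum ...
  m + log(rowSums(exp(logp - m)))    # ... so exp() cannot underflow to zero
}

# Posterior probabilities without underflow:
# posterior <- exp(logp - row_log_sum_exp(logp))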

Extracting feature importance

Excellent package! The most accessible approach to NB classification that I've found.

I'm wondering if there is a way to extract feature weight/importance from the model? I didn't see any relevant accessors nor any obvious slots in the naive_bayes object.

Error when feeding just 1 predictor into Naivebayes model

>     i<-2
>     nbmodel<-naive_bayes(data=trainset, y=trainset$label,x=trainset[2:i],usekernel= TRUE)
>     nbmodel_predict<-predict(nbmodel,as.vector(x_test))
Warning message:
In t(log_sum) + log(prior) :
  Recycling array of length 1 in array-vector arithmetic is deprecated.
  Use c() or as.vector() instead.

I suppose the package does not expect to handle a dataset with just one feature? Or am I misunderstanding some fundamental concept here?

How to use additional density()-parameters for naive_bayes() tuning

I was wondering, if further additional parameters of the stats::density() function can be used when executing naive_bayes().

Actually, I am applying the naive_bayes() classifier to a mixed-variables data set where most of the numeric data is non-negative. For this reason a log-normal prior distribution, or a KDE which ensures no probabilities are estimated for values < 0, seems to be a good choice for my case.
The stats::density() function, which you used for the KDE in naive_bayes(), has the argument 'to', which could ensure probabilities for values < 0 are zero.

Is it possible to make use of this argument when executing naive_bayes() with usekernel = TRUE?

Many thanks in advance for any reply and best regards
André

plot crashes when missing data present in trainingset

This works as expected

library(naivebayes)
m <- naive_bayes(Species ~ Sepal.Width, data=iris)
plot(m)

This crashes

iris$Sepal.Width[1] <- NA
m <- naive_bayes(Species ~ Sepal.Width, data=iris)
plot(m)

Error in seq.default(r[1], r[2], length.out = 512) : 
  'from' must be a finite number

Great package btw. I love how the naive_bayes interface is modeled after base R!

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=nl_NL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] naivebayes_0.9.3

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2    yaml_2.2.0    
