outforest's Introduction

{outForest}

Overview

{outForest} is a multivariate anomaly detection method. Each numeric variable is regressed onto all other variables using a random forest. If the scaled absolute difference between observed value and out-of-bag prediction is larger than a prespecified threshold, then a value is considered an outlier. After identification of outliers, they can be replaced, e.g., by predictive mean matching from the non-outliers.
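
The scoring step for a single variable can be sketched as follows. This is a minimal illustration and not the package's internal code: it assumes the {ranger} package is installed, plants one artificial outlier in `iris`, and scales the residual by the out-of-bag RMSE, which is consistent with the `rmse` and `score` columns shown in the usage output. The names `dat`, `score`, and `flagged` are made up for this example.

```r
library(ranger)

# Plant one obviously wrong value in a copy of iris (illustrative data only)
dat <- iris
dat$Sepal.Width[5] <- -3.7

# Regress one numeric variable on all other columns
fit <- ranger(Sepal.Width ~ ., data = dat, seed = 1)

# For a fitted ranger regression, `predictions` holds the out-of-bag
# predictions and `prediction.error` the out-of-bag mean squared error
oob_pred <- fit$predictions
rmse <- sqrt(fit$prediction.error)

# Scaled difference between observed value and out-of-bag prediction
score <- (dat$Sepal.Width - oob_pred) / rmse

# Flag values whose absolute score exceeds the threshold
threshold <- 3
flagged <- which(abs(score) > threshold)
flagged
```

{outForest} repeats this kind of regression for every numeric variable and additionally handles replacement of the flagged values.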

The method can be viewed as a multivariate extension of a basic univariate outlier detection method, in which a value is considered an outlier if it deviates from the mean by more than, say, three standard deviations. In the multivariate case, a value is compared with its conditional mean given the other variables rather than with the overall mean. {outForest} estimates this conditional mean by a random forest.
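
For comparison, the univariate rule mentioned above needs only base R. The sketch below uses simulated data with one planted extreme value; the vector `x` and the cutoff of three standard deviations are illustrative assumptions.

```r
set.seed(1)

# 100 standard normal values plus one planted extreme value at position 101
x <- c(rnorm(100), 10)

# Distance from the overall mean in units of the standard deviation
z <- (x - mean(x)) / sd(x)

# Values more than three standard deviations from the mean are flagged
flagged <- which(abs(z) > 3)
flagged
```

Replacing the overall mean with a prediction from the other variables turns this rule into the multivariate approach described above.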

Once the method is trained on a reference data set, it can be applied to new data.

Installation

# From CRAN
install.packages("outForest")

# Development version
devtools::install_github("mayer79/outForest")

Usage

We first generate a data set with about 2% outlier values in each numeric column. Then, we try to identify them.

library(outForest)
set.seed(3)

# Generate data with outliers in numeric columns
head(irisWithOutliers <- generateOutliers(iris, p = 0.02))

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          5.1    3.500000          1.4         0.2  setosa
#          4.9    3.000000          1.4         0.2  setosa
#          4.7    3.200000          1.3         0.2  setosa
#          4.6    3.100000          1.5         0.2  setosa
#          5.0   -3.744405          1.4         0.2  setosa
#          5.4    3.900000          1.7         0.4  setosa
 
# Find outliers by random forest regressions and replace them by predictive mean matching
(out <- outForest(irisWithOutliers, allow_predictions = TRUE))

# Plot the number of outliers per numeric variable
plot(out)

# Information on outliers
head(outliers(out))

# row          col  observed predicted      rmse     score threshold replacement
#   5  Sepal.Width -3.744405  3.298493 0.7810172 -9.017596         3         2.8
#  20 Sepal.Length 10.164017  5.141093 0.6750468  7.440852         3         5.4
# 138  Petal.Width  4.721186  2.113464 0.3712539  7.024092         3         2.1
#  68  Petal.Width -1.188913  1.305339 0.3712539 -6.718452         3         1.2
# 137  Sepal.Width  8.054524  2.861445 0.7810172  6.649122         3         2.9
#  15 Petal.Length  6.885277  1.875646 0.7767877  6.449163         3         1.3

# Resulting data set with replaced outliers
head(Data(out))

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          5.1         3.5          1.4         0.2  setosa
#          4.9         3.0          1.4         0.2  setosa
#          4.7         3.2          1.3         0.2  setosa
#          4.6         3.1          1.5         0.2  setosa
#          5.0         2.8          1.4         0.2  setosa
#          5.4         3.9          1.7         0.4  setosa

# Out-of-sample application
iris1 <- iris[1, ]
iris1$Sepal.Length <- -1
pred <- predict(out, newdata = iris1)

# Did we find the outlier?
outliers(pred)

# row          col observed predicted      rmse    score threshold replacement
#   1 Sepal.Length       -1  4.960069 0.6750468 -8.82912         3         6.4

# Fixed data
Data(pred)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#          6.4         3.5          1.4         0.2  setosa

outforest's People

Contributors

mayer79

outforest's Issues

Use of random seeds for reproducibility

Hi, first of all, thanks for developing this package; it is a very interesting tool for outlier detection.

I am writing because I am experiencing some problems regarding the reproducibility of my results. I am not sure if I am using the package correctly or not.

Concretely, I'm using the outForest function as follows:

outliers = outForest(x, replace = "NA", seed = 12345)

Here, x is my data frame of individuals and their variables. I replace the outliers by "NA" and set the integer 12345 as the random seed for reproducibility.

Since I'm using the "seed" parameter, I expect to obtain the same results every time I call the function with those parameters. I am using RStudio and, when I click on "Source" to execute the whole R script, I obtain the same results across executions. Nonetheless, if I execute the R script line by line, the results differ from what I obtained by clicking the "Source" button.

So, I am not very sure if I have to do something else in addition to setting the "seed" parameter when I use the outForest function. Would it be necessary to use set.seed(12345)? Or could this be a bug?

Thanks in advance.
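
One way to narrow this down is to call the function twice on the same data within a single session and compare the results. The sketch below only uses calls that appear elsewhere on this page (`generateOutliers()`, `outForest()` with `replace = "NA"` and `seed`, and `outliers()`); the `iris`-based data is a stand-in for the reporter's `x`. It is a diagnostic sketch and makes no claim about what the comparison returns.

```r
library(outForest)

# Stand-in data; the reporter's data frame `x` is not available here
set.seed(3)
x <- generateOutliers(iris, p = 0.02)

# Two runs with identical input and identical seed
run1 <- outForest(x, replace = "NA", seed = 12345)
run2 <- outForest(x, replace = "NA", seed = 12345)

# TRUE means the flagged outliers agree exactly between the two runs
same <- identical(outliers(run1), outliers(run2))
same
```

If the two runs agree here but results still differ between "Source" and line-by-line execution, it may be worth checking whether `x` itself is built identically in both cases, for example whether an earlier random step in the script runs without a fixed seed.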

Namespace issue: Error in ranger(formula = reformulate(covariables, response = vv), data = data_imp, : could not find function "ranger"

> outForest(atmos_cph_num)

Outlier identification by random forests

  Variables to check:		UVIEF, UVIEFerr, UVDEF, UVDEFerr, UVDVF, UVDVFerr, UVDDF, UVDDFerr, ozone
  Variables used to check:	UVIEF, UVIEFerr, UVDEF, UVDEFerr, UVDVF, UVDVFerr, UVDDF, UVDDFerr, ozone

  Checking: UVIEF  Error in ranger(formula = reformulate(covariables, response = vv), data = data_imp,  : 
  could not find function "ranger"

atmos_cph_num is just some arbitrary dataset. The error here is that the ranger() call at https://github.com/mayer79/outForest/blob/master/R/outForest.R#L139 is missing its namespace prefix.
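
The general mechanism can be reproduced in base R without {outForest} or {ranger}. The example below is an analogy and not the package code: it uses `tools::file_ext()`, because {tools} ships with R but is normally not attached, so in a fresh session the unqualified call is expected to fail while the qualified call works.

```r
# Unqualified call: R searches only attached packages, so this is expected
# to fail in a fresh session; the error message is captured as a string
unqualified <- tryCatch(
  file_ext("data.csv"),
  error = function(e) conditionMessage(e)
)

# Qualified call: the `pkg::` prefix works whether or not the package is attached
qualified <- tools::file_ext("data.csv")

unqualified
qualified
```

Inside a package, the same problem is avoided either by writing the qualified form, here `ranger::ranger(...)`, or by importing the function in the package's NAMESPACE file.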
