isofor's Introduction

Isolation Forest

An Isolation Forest is an ensemble of completely random decision trees. At each split, a random feature and a random split point are chosen. Anomalies are isolated quickly because they end up in a partition far away from the rest of the data. In decision tree terms, this corresponds to a record with a short "path length": the number of nodes the record passes through before terminating in a leaf node. Records with short average path lengths across the entire ensemble are considered anomalies.
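
The idea of a path length can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not the package's implementation: `path_length` is a hypothetical helper that follows a single record through one random tree built on the fly, and it counts splits rather than nodes.

```r
# Hypothetical helper (not part of isofor): number of random splits needed
# to isolate row `target` of the numeric matrix `X` in one random tree.
path_length <- function(X, target, max_depth = 50) {
  depth <- 0
  rows <- seq_len(nrow(X))                      # records still sharing the target's partition
  while (length(rows) > 1 && depth < max_depth) {
    j <- sample(ncol(X), 1)                     # random feature
    rng <- range(X[rows, j])
    if (rng[1] == rng[2]) break                 # feature is constant here; stop (simplification)
    s <- runif(1, rng[1], rng[2])               # random split point
    go_left <- X[target, j] < s
    rows <- rows[(X[rows, j] < s) == go_left]   # keep only the target's side of the split
    depth <- depth + 1
  }
  depth
}

set.seed(1)
# 100 points around (0, 0) plus one far-away point at (4, 4) in row 101
X <- rbind(matrix(rnorm(200, 0, 0.5), ncol = 2), c(4, 4))
avg <- function(i) mean(replicate(200, path_length(X, i)))
avg_outlier <- avg(101)
avg_inlier  <- avg(1)
```

Averaged over many random trees, the far-away point should need noticeably fewer splits than a point in the middle of the cloud.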

An analogy

Describing the location of a country home takes far fewer directions than describing the location of a brownstone in Brooklyn. The country home might be described as "the only house on the south shore of Lake Wobegon", while directions to the brownstone must be qualified with much more detail: "Go north on 5th Street for 12 blocks, take a left on Van Buren, etc."

[Figures: "Isolated" and "Dense"]

The country house in this example is a literal outlier. It is off by itself away from most other homes. Similarly, records that can be described succinctly are also outliers.

Example

Here we create two random normal vectors and add some outliers. The majority of the data points are centered around (0, 0) with a standard deviation of 0.5. Fifty outliers (5% of N) are then introduced, centered around (-1.5, 1.5) with a standard deviation of 1. The larger spread encourages some co-mingling of outliers with the bulk of the data.

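N = 1e3
# 1,000 inliers around (0, 0) plus 50 outliers around (-1.5, 1.5)
x = c(rnorm(N, 0, 0.5), rnorm(N*0.05, -1.5, 1))
y = c(rnorm(N, 0, 0.5), rnorm(N*0.05,  1.5, 1))
# the plotting symbol doubles as the label: 2 = inlier, 3 = outlier
ol = c(rep(0, N), rep(1, (0.05*N))) + 2
data = data.frame(x, y)
plot(data, pch=ol)
title("Dummy data with outliers")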

The code below builds an Isolation Forest by passing in the dummy data, the number of trees requested (100), and the number of records to subsample for each tree (32). Records whose anomaly score exceeds the 95th percentile are flagged as the most anomalous. Coloring those records red and plotting the results shows how effective the Isolation Forest is.

library(isofor)
mod = iForest(X = data, 100, 32)
p = predict(mod, data)
col = ifelse(p > quantile(p, 0.95), "red", "blue")
plot(x, y, col=col, pch=ol)
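
For intuition about what an anomaly score measures, the sketch below applies the normalization from the original Isolation Forest paper (Liu, Ting and Zhou, 2008), which turns an average path length into a score between 0 and 1. Whether `predict` in this package computes exactly this quantity is an assumption; `c_n` and `anomaly_score` are illustrative names, not package functions.

```r
# Expected path length of an unsuccessful binary-search-tree lookup over n
# records, used in the paper to normalize path lengths.
c_n <- function(n) {
  if (n <= 1) return(0)
  H <- log(n - 1) + 0.5772156649     # harmonic number approximation
  2 * H - 2 * (n - 1) / n
}

# Score near 1: isolated very quickly (likely anomaly).
# Score near 0.5: average path length (unremarkable record).
anomaly_score <- function(avg_path, n) 2^(-avg_path / c_n(n))

n <- 32                              # subsample size used in the example above
short_path <- anomaly_score(2, n)    # record isolated after about 2 splits
long_path  <- anomaly_score(8, n)    # record that needs about 8 splits
typical    <- anomaly_score(c_n(n), n)
```

A record whose average path length equals `c_n(n)` scores exactly 0.5, and shorter paths push the score toward 1.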

Knowing there are two populations, the k-means algorithm seems like a good fit for identifying the two clusters. However, we can see that it picks cluster centers that do a poor job of separating the outliers from the rest of the data.

km = kmeans(data, 2)
plot(x, y, col=km$cluster+1, pch=ol)

Comparison of Results

We can compare how accurately each method identifies outliers by looking at its confusion matrix.

table(iForest=p  > quantile(p, 0.95), Actual=ol == 3)

##        Actual
## iForest FALSE TRUE
##   FALSE   987   10
##   TRUE     13   40

table(KMeans=km$cluster == 1, Actual=ol == 3)

##        Actual
## KMeans  FALSE TRUE
##   FALSE   282   49
##   TRUE    718    1
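
The counts in the two tables can be turned into precision, recall and accuracy. The sketch below copies the numbers from the output shown above; `metrics` is a helper introduced here for illustration. Because k-means cluster labels are arbitrary, the cluster that holds 49 of the 50 true outliers is treated as its "outlier" prediction, which is the reading most favorable to k-means.

```r
# Precision, recall and accuracy from confusion-matrix counts.
metrics <- function(tp, fp, fn, tn) {
  c(precision = tp / (tp + fp),
    recall    = tp / (tp + fn),
    accuracy  = (tp + tn) / (tp + fp + fn + tn))
}

# Isolation Forest: 40 outliers flagged, 13 false alarms, 10 outliers missed.
iforest <- metrics(tp = 40, fp = 13, fn = 10, tn = 987)

# k-means, counting the cluster that is not labelled 1 as the outlier cluster:
# 49 outliers captured, together with 282 ordinary records.
kmeans_best <- metrics(tp = 49, fp = 282, fn = 1, tn = 718)

comparison <- round(rbind(iforest, kmeans_best), 3)
comparison
```

On these counts the Isolation Forest has a precision of about 0.75 against roughly 0.15 for k-means, even though k-means has the higher recall (0.98 versus 0.80).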

ROC Curve

r = pROC::roc(ol == 3, p)
plot(r)

## 
## Call:
## roc.default(response = ol == 3, predictor = p)
## 
## Data: p in 1000 controls (ol == 3 FALSE) < 50 cases (ol == 3 TRUE).
## Area under the curve: 0.9715

title("ROC Curve for Isolation Forest")

isofor's People

Contributors

gravesee, idroz, lucasdowiak, pedroaraujo9

isofor's Issues

'there is no package called isofor'

Hey,
I really like the project and would love to play with it. However, when I try running Rscript test.R I get the error there is no package called 'isofor'. I am very new to R so I don't know what I'm doing wrong. I'd really appreciate it if you could point me in the right direction.

Predict sparse node membership matrix

Add a prediction option that outputs a sparse matrix in which each column corresponds to one terminal node of one tree.

  • Possible output dimension n_records X max_node_size * num_trees
  • Should output a column for every possible terminal node
  • Investigate Rcpp sparse matrix support
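
One possible shape for that output, sketched with the Matrix package rather than Rcpp. Everything here is hypothetical: `leaf_id` stands in for a matrix of terminal-node ids (one row per record, one column per tree) that the package would have to provide, and `max_nodes` is an assumed upper bound on node ids per tree.

```r
library(Matrix)

# Hypothetical input: leaf_id[i, t] is the terminal node reached by record i in tree t.
leaf_id <- matrix(c(1, 2, 2,     # tree 1
                    3, 1, 3),    # tree 2
                  nrow = 3, ncol = 2)
max_nodes <- 3                   # assumed maximum node id within a tree

n  <- nrow(leaf_id)
nt <- ncol(leaf_id)

# Give each tree its own block of max_nodes columns.
cols <- as.vector(leaf_id) + rep((seq_len(nt) - 1) * max_nodes, each = n)

membership <- sparseMatrix(i = rep(seq_len(n), times = nt),
                           j = cols,
                           x = 1,
                           dims = c(n, max_nodes * nt))
```

Each row of `membership` then has exactly one non-zero entry per tree, and the matrix has the n_records X max_node_size * num_trees dimensions proposed above.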

Importance of variables

Sorry if I missed something, but is it possible to get variable importance from this package, as with a random forest?

Unable to install isofor

Hi,
I am trying to install the package isofor using the following command. I am using Windows 10 Pro.
Here is the code I used to try to install from github.
library(devtools)
install_github("Zelazny7/isofor")

This is the error I am getting:

install_github("Zelazny7/isofor")
Downloading GitHub repo Zelazny7/isofor@master
Installing 1 packages: Rcpp
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/Rcpp_1.0.1.zip'
Content type 'application/zip' length 4509148 bytes (4.3 MB)
downloaded 4.3 MB

package ‘Rcpp’ successfully unpacked and MD5 sums checked
Error: (converted from warning) cannot remove prior installation of package ‘Rcpp’

Please advise.
Thanks

Not able to install the package isofor from github.

Hi, while trying to install the package, I am getting the error below:

Warning in install.packages :
unable to access index for repository https://github.com/Zelazny7/isofor.git/src/contrib:
cannot open URL 'https://github.com/Zelazny7/isofor.git/src/contrib/PACKAGES'
Warning in install.packages :
package ‘isofor’ is not available (for R version 3.5.0)
Warning in install.packages :
unable to access index for repository https://github.com/Zelazny7/isofor.git/bin/windows/contrib/3.5:
cannot open URL 'https://github.com/Zelazny7/isofor.git/bin/windows/contrib/3.5/PACKAGES'

Trouble installing on macOS 10.14 Mojave

Hello,

I am having trouble installing the package under macOS 10.14 Mojave.

The error message is quite vague
Error: Could not find tools necessary to compile a package

llvm 7.0 was installed via brew

$ /usr/local/opt/llvm/bin/clang --version
clang version 7.0.0 (tags/RELEASE_700/final)
Target: x86_64-apple-darwin18.0.0
Thread model: posix
InstalledDir: /usr/local/opt/llvm/bin

~/.R/Makevars is also properly set up

$ cat ~/.R/Makevars 
CXX = /usr/local/opt/llvm/bin/clang
CXXFLAGS = -I/usr/local/opt/llvm/include -fopenmp
LDFLAGS = -L/usr/local/opt/llvm/lib -fopenmp=libiomp5

But when I run this in RStudio, I get the following message

> library(devtools)
> install_github('zelazny7/isofor')
Downloading GitHub repo zelazny7/isofor@master
Error: Could not find tools necessary to compile a package

R version 3.5.1

> version
               _                           
platform       x86_64-apple-darwin15.6.0   
arch           x86_64                      
os             darwin15.6.0                
system         x86_64, darwin15.6.0        
status                                     
major          3                           
minor          5.1                         
year           2018                        
month          07                          
day            02                          
svn rev        74947                       
language       R                           
version.string R version 3.5.1 (2018-07-02)
nickname       Feather Spray  

Plus, everything works fine on macOS 10.13, so I suspect it has something to do with the OS?

Thanks!

Recurse index

Hi, Thanks for your code.
I believe recurse should take idx[which(f)] instead of just which(f). Otherwise, split_on_var will use the wrong rows and partition elements that might already have been partitioned. Try it.
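
A small base R illustration of the difference, assuming (as described in this report) that f is a logical vector with one entry per element of idx. The values are made up for the example.

```r
idx <- c(3, 7, 9, 12)                 # original row numbers that reached this node
f   <- c(TRUE, FALSE, TRUE, FALSE)    # split result, one entry per element of idx

positions <- which(f)        # positions within idx: 1 3
rows      <- idx[which(f)]   # original row numbers: 3 9
```

Passing `positions` down the tree would refer to rows 1 and 3 of the full data, which never reached this node, while `rows` keeps pointing at the records that are actually being partitioned.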

This is with the original library:
image
It immediately doesn't make sense because how can you further partition JAPAN, SINGAPORE into LONDON?

After editing the following, rebuild and reload:

recurse <- function(idx, e, l, ni=0, env) {
   ....
  ## recurse
  recurse(idx[which(f)] , e + 1, l, nL, env)
  recurse(idx[which(!f)], e + 1, l, nR, env)
}

Then run

get_split_factor<-function(x,v){
  l = which(levels(x) %in% unique(x))
  i = l[which(intToBits(v) == 1)]
  list(i=i,filter = levels(x)[i])
}

get_df_from_tree<-function(tree){
  df<-as.data.frame(tree)
  df$NodeNumber<-as.numeric(rownames(df))
  df$NodeName <- sapply(1:nrow(df), function(x) {
    i = df$SplitAtt[x]
    if (i!=0) { mod$vNames[i] }
    else {  paste0(df$Size[x])  }
  } )
  return(df)
}

set.seed(101)
mydata<-data.frame(Height=rnorm(10,1.5,0.4), Weight=c(rep(50,5), rep(45,5)), Region=c(rep(c("JAPAN","SINGAPORE","LONDON"),3), "SINGAPORE"))
mydata$Region<-as.factor(mydata$Region)
set.seed(101)
mod<-iForest(mydata,1,10)
df<-get_df_from_tree(mod$forest[[1]])
adjm <- t( sapply(1:nrow(df), function(x) {
  v <- rep(0, nrow(df))
  if (df$Left[x]!=0) {
    v[df$Left[x]] = 1
    v[df$Right[x]] = 1
  }
  return(v) }
) )

g<- igraph::graph_from_adjacency_matrix(adjm)
plot(g, layout=igraph::layout_as_tree(g),vertex.size=4, vertex.label=df$NodeName, edge.arrow.mode="-", edge.label=
       na.omit( unlist(sapply(1:nrow(df), function(x) {
         if(sum(adjm[x,])>0) {
           if(df$AttType[x] == 2) return(c ( paste0(get_split_factor(mydata[,df$SplitAtt[x]], df$SplitValue[x])$filter,collapse=","),"" ) )
           return(c(paste0("<",round(df$SplitValue[x],2)) ,""))
          }
         return(NA) } ) ) ) )

image

Installation in R

Hi,

I am trying to install your R package on Windows and am getting the error below:

Error: running command '"C:/PROGRA~1/R/R-34~1.1/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD config CC' had status 1

Any help?

Having a hard time installing the tool

Hi,

I'm trying to install but I get the following message: "Error: Could not find build tools necessary to build isofor". Before that, the downloader asked me if I wanted to install other required tools. Is this normal? Is there a manual way, please?

Thanks a lot!

Error in env$X[idx, ] : incorrect number of dimensions

I'm getting an error with iForest that I don't understand:

# test dataset
test = data.frame(VALUE = rlnorm(200))
iForest(test, 100, 50)
## Error in env$X[idx, ] : incorrect number of dimensions
iForest(as.matrix(test), 100, 50)
## Error in env$X[idx, ] : incorrect number of dimensions
iForest(as_tibble(test), 100, 50)
## Isolation Forest with 100 Trees and Max Depth of 6

Why does iForest work with a tibble, but not with a regular data.frame or matrix?
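
One plausible explanation, assuming the package subsets rows with `X[idx, ]` and no `drop = FALSE`: in base R, subsetting the rows of a one-column data.frame or matrix collapses the result to a plain vector, and a vector cannot be indexed with two subscripts afterwards. Tibbles never drop dimensions, which would explain why that call succeeds. The sketch below reproduces the behavior without the package.

```r
test <- data.frame(VALUE = rlnorm(200))
idx  <- 1:5

sub_default <- test[idx, ]                # collapses to a numeric vector
sub_kept    <- test[idx, , drop = FALSE]  # stays a 5 x 1 data.frame

is.null(dim(sub_default))                 # TRUE: no dimensions left
dim(sub_kept)                             # 5 1

# Indexing the collapsed vector like a table fails; capture the message.
res <- tryCatch(sub_default[1:2, ], error = function(e) conditionMessage(e))
```

If this is the cause, converting a single-column input to a tibble, or adding a second column, should work around it until the subsetting keeps its dimensions.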

is this package still being maintained?

As the title says---is this package still being maintained, and are there any plans to publish to CRAN? I am considering using this package in my workflow but will seek out an alternative if no release is planned. Thanks!

Isofor and r2pmml

When trying to create a PMML file from a trained model with r2pmml, I get the following error.
Is this solvable?

Jan

dec 10, 2017 7:25:03 PM org.jpmml.rexp.Main run
INFO: Parsing RDS..
dec 10, 2017 7:25:03 PM org.jpmml.rexp.Main run
INFO: Parsed RDS in 14 ms.
dec 10, 2017 7:25:03 PM org.jpmml.rexp.Main run
INFO: Initializing default Converter
dec 10, 2017 7:25:03 PM org.jpmml.rexp.Main run
INFO: Initialized org.jpmml.rexp.IForestConverter
dec 10, 2017 7:25:03 PM org.jpmml.rexp.Main run
INFO: Converting..
dec 10, 2017 7:25:03 PM org.jpmml.rexp.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: xcols
at org.jpmml.rexp.RVector.getValue(RVector.java:104)
at org.jpmml.rexp.RVector.getValue(RVector.java:80)
at org.jpmml.rexp.IForestConverter.encodeSchema(IForestConverter.java:59)
at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:74)
at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:70)
at org.jpmml.rexp.Main.run(Main.java:149)
at org.jpmml.rexp.Main.main(Main.java:97)

Exception in thread "main" java.lang.IllegalArgumentException: xcols
at org.jpmml.rexp.RVector.getValue(RVector.java:104)
at org.jpmml.rexp.RVector.getValue(RVector.java:80)
at org.jpmml.rexp.IForestConverter.encodeSchema(IForestConverter.java:59)
at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:74)
at org.jpmml.rexp.ModelConverter.encodePMML(ModelConverter.java:70)
at org.jpmml.rexp.Main.run(Main.java:149)
at org.jpmml.rexp.Main.main(Main.java:97)
Error in .convert(tempfile, file, converter, converter_classpath, verbose) :
1
