
benchmark_rf-lr_openml's Introduction


Code supporting the paper: Random forest versus logistic regression: a large-scale benchmark experiment

Instructions for launching the benchmark

1. Installation

  1. Create an OpenML account, and generate an API key
  2. Get the GitHub code here
  3. Optional: set up Docker as presented below for a reproducible environment. Note that the files are already included in the Docker image.

2. Set up main.R

  1. Open Benchmark_RF-LR_OpenML.Rproj

  2. Open main.R

  3. Enter your OpenML API key at the beginning of the file so that you will be able to download data from OpenML

  4. Enter the number of cores you want to use for the benchmark experiment

     # Enter below nCores and myapikey  
     nCores = ??? # number of CPUs you want to use  
     myapikey = "??????????????????????????????????" # OpenML API key  
     saveOMLConfig(apikey = myapikey, arff.reader = "RWeka", overwrite=TRUE)  
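
If you want to check that the key has been registered, the OpenML package lets you inspect the current configuration (a quick sanity check, not part of the original script):

     # Optional sanity check: the printed configuration should show your API key
     # and the arff.reader set above.
     library(OpenML)
     getOMLConfig()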
    

3. Use main.R

Make sure you have all the required packages if you do not use our Docker image.
For each of the subsections (e.g. 1.1), you may run all the code.
For convenience, the results are already included in the GitHub repository in the Data folder. Thus, if you only want to generate the graphics and simulations, you can skip parts 1 and 2.
Graphics will be saved in Data/Pictures/

  1. Benchmark Study

    1. Get the Data from OpenML
      Note that a fixed list of OpenML tasks is used here, so that we work with a fixed set of datasets (October 2016). We first remove all the datasets that do not fit our criteria (binary classification problem, no NAs, no high dimension, no simulated datasets, no duplicates). We then use the file "Data/OpenML/df.infos.RData" to remove the datasets that failed to load. If you want to recompute this file, you can set the option force=TRUE; the computation should then last several hours.

    2. Launch the benchmark
      You can recompute the benchmark here using batchtools. The function setBatchtoolsExperiment() will clear the current batchtools folder and prepare R for a new benchmark computation. You can then use the batchtools function submitJobs to compute the results for the datasets, and getStatus() to monitor the computation. For 278 datasets, it took around 8 hours with 7 i7 cores and 8 GB of RAM. A minimal sketch of this workflow is given after this list.

  2. Visualization of the results

    1. Convert the results
      Results are converted to a dataframe.
    2. Overall visualization
      A barplot of ranks is plotted, as well as boxplots of the performance and of the performance difference for the acc, auc and brier measures.
    3. Inclusion Criteria visualization
      We visualize boxplots of the performance difference for subgroups defined by the values of meta-features such as p and n.
  3. Analysis of the results

    1. Overall results
      We present here the mean, standard deviation and bootstrap confidence interval of the results, as well as the power of the test.
    2. Meta-Learning
      Partial dependence plots of the model trained to predict the performance difference between RF and LR based on the values of the meta-features.
  4. Simulations

    1. Subset simulation
      Computation of the performance of LR and RF for many sub-datasets of the OpenML dataset with id 310. Sub-datasets are randomly generated by drawing subsets of p0 < p features or n0 < n observations. We then visualize how the difference between RF and LR depends on increasing values of p0 and n0. A toy sketch of this simulation is given after this list.
    2. Partial dependence plot simulation
      Computation of simple examples of partial dependence.
    3. Computation of the difference in partial dependence
      Computation of the difference in partial dependence between RF and LR for all 278 datasets considered in this study. This computation may be very time-consuming.
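
Below is a minimal sketch of the workflow behind subsections 1.1 and 1.2, not the exact code of the repository: the inclusion thresholds and the use of the repository helper setBatchtoolsExperiment() are illustrative assumptions, while listOMLTasks, getOMLDataSet, submitJobs, getStatus and waitForJobs are functions of the OpenML and batchtools packages.

     # Sketch only: retrieve OpenML tasks, apply (illustrative) inclusion criteria,
     # then register and launch the benchmark jobs with batchtools.
     library(OpenML)
     library(batchtools)

     # 1.1 -- candidate tasks and inclusion criteria (thresholds are illustrative)
     tasks = listOMLTasks(task.type = "Supervised Classification")
     tasks = subset(tasks, number.of.classes == 2 &
                           number.of.missing.values == 0 &
                           number.of.features < 1000)
     dataset = getOMLDataSet(data.id = tasks$data.id[1])  # download one dataset as a check

     # 1.2 -- launch the benchmark; setBatchtoolsExperiment() is the repository helper
     setBatchtoolsExperiment()   # clears the batchtools folder and registers the jobs
     submitJobs()                # add a resources list here if your batchtools backend needs one
     getStatus()                 # monitor progress
     waitForJobs()               # block until all jobs are finished

Once the jobs are done, the results can be reduced and converted to a data frame as described in subsection 2.1.

The subset simulation of subsection 4.1 can be sketched as follows, using the mlr package and a toy dataset (Sonar from the mlbench package) instead of OpenML dataset 310; p0, n0 and the resampling scheme are illustrative:

     # Sketch only: performance difference between RF and LR on a random
     # sub-dataset with p0 features and n0 observations (toy data).
     library(mlr)
     data(Sonar, package = "mlbench")

     p0 = 10; n0 = 100                                       # illustrative subset sizes
     features = sample(setdiff(names(Sonar), "Class"), p0)   # draw p0 features at random
     rows = sample(nrow(Sonar), n0)                          # draw n0 observations at random
     sub = Sonar[rows, c(features, "Class")]

     task = makeClassifTask(data = sub, target = "Class")
     lrn.rf = makeLearner("classif.randomForest", predict.type = "prob")  # needs the randomForest package
     lrn.lr = makeLearner("classif.logreg", predict.type = "prob")
     rdesc = makeResampleDesc("CV", iters = 5)

     res.rf = resample(lrn.rf, task, rdesc, measures = auc)
     res.lr = resample(lrn.lr, task, rdesc, measures = auc)
     res.rf$aggr - res.lr$aggr   # difference in cross-validated AUC between RF and LR

Repeating this for increasing values of p0 and n0 gives the dependency visualized in subsection 4.1.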

Instructions for Setting up Docker


1. Install Docker

More information can be found on the Docker Website

2. Set up Docker

Change the default docker parameters

You might have to change the default parameters of your Docker machine, such as the number of CPUs and the amount of RAM allocated to a container. These parameters can be set in the graphical interface or via the command line, for example:

# Remove the default docker machine parameters  
> docker-machine rm default   
# Create a new default docker machine, for example with 16 GB RAM, 8 CPUs and a 20 GB hard drive.  
> docker-machine create -d virtualbox --virtualbox-memory=16000 --virtualbox-cpu-count=8 --virtualbox-disk-size=20000 default  

3. Get the Docker image associated with the benchmark

The Docker image can be found on DockerHub. You can pull the image (around 5.53 GB) to your system via the command line:

> docker pull shadoko/benchmark_rflr:version6

Note that the Dockerfile is provided on DockerHub so that the image can be rebuilt.

4. Generate an RStudio instance that you can connect to

> docker run --rm -p 8787:8787 shadoko/benchmark_rflr:version6

The option "-v /Users/myname/myFolder:/home/rstudio/Docker-Benchmark/" can be used to mount a local folder into the container.

--rm indicates that the container will be deleted when stopped
-p 8787:8787 indicates that the RStudio instance will be available on port 8787
-v /myComputerPath/:/myContainerPath/ links a folder on your computer to a folder in the container VM, so that you can for example open your R project
shadoko/benchmark_rflr:version6 refers to the Docker image you want to create a container from

The GitHub code is already included in the Docker image. You can also link your Docker container to a folder containing the GitHub project.
Note: on Windows the syntax is different, and the Public user folder is recommended to avoid permission issues, e.g. /c/Users/Public/MyFolder:/home/rstudio/Project

5. Connect to your RStudio instance

  1. Check the IP of your computer, which should look like 192.168.0.12
  2. In your browser, enter http://myContainerIP:8787 and sign in with id=rstudio and password=rstudio. Alternatively, use http://0.0.0.0:8787

6. After use, close Docker

In the command line, use Ctrl+C to close the container.
Check that no containers are running with the command:
> docker ps

benchmark_rf-lr_openml's People

Contributors

raphaelcouronne, philipppro


benchmark_rf-lr_openml's Issues

which file to modify for new learners?

Hello, I would be extremely grateful if you could kindly spare some time for the following questions:

  1. Is it necessary to record the computation time for the benchmark? In the code, computation time is set to false by default. Does it have any significance for the benchmark?
  2. Can we use the same experiment for benchmarking new learners? I believe we only have to use makelearners().

I look forward to your kind response. Have a great day ahead.

why is the integrity of the data not checked?

Hello sir.

In your previous link, https://github.com/RaphaelCouronne/Benchmark_RF-LR_OpenML/blob/v0.9/Benchmark/benchmark_getDataOpenML.R,
you checked the integrity of the datasets with the following code:

# check integrity of datasets

if (!identical(clas$data.id,df.infos$data.id)) {
  print("  Difference between df.infos and clas", quote = FALSE)
  
  # reorganise the data.id
  notcomputed = subset(clas, select = c("data.id", "task.id", 
                                        "number.of.instances","number.of.features"))[which(!clas$data.id %in% df.infos$data.id),]
  df.infos.new = data.frame(matrix(data = NA, nrow = length(df.infos$data.id) + length(notcomputed$data.id), ncol = 15))
  names(df.infos.new) = c("index", "data.id", "task.id","n","p", "began", "done", 
                          "loaded","converted", "target_type", "dimension", 
                          "rf_time", "lr_time", "rf_NA", "lr_NA")
  
  df.infos.new[c(1:length(df.infos$data.id)),] = df.infos
  df.infos.new[c((length(df.infos$data.id)+1):length(df.infos.new$data.id)),c(2,3,4,5)] = notcomputed
  df.infos.new = df.infos.new[order(df.infos.new$data.id),]
  df.infos.new = df.infos.new[order(df.infos.new$n*df.infos.new$p),]
  df.infos.new$index = c(1:length(df.infos.new$index))
  df.infos = df.infos.new
}

But in this file you have removed that code. Can you please tell me why?
I have been trying to recompute the time for the new datasets belonging to "Supervised classification". However, it gives me an error and does not split the classes into 3 divisions. Any suggestion would be really helpful.
