Code supporting the paper: Random forest versus logistic regression: a large-scale benchmark experiment
- Create an OpenML account, and generate an API key
- Get the GitHub code here
- Optional: set up Docker as presented below, for a reproducible environment. Note that all files are already included in the Docker image.
- Open Benchmark_RF-LR_OpenML.Rproj
- Open main.R
- Enter your OpenML API key at the beginning of the file so that you will be able to download data from OpenML
- Enter the number of cores you want to use for the benchmark experiment
```r
# Enter below nCores and myapikey
nCores = ???                                     # number of CPUs you want to use
myapikey = "??????????????????????????????????"  # OpenML API key
saveOMLConfig(apikey = myapikey, arff.reader = "RWeka", overwrite = TRUE)
```
Make sure you have all the required packages if you do not use our Docker image.
For each of the subsections (e.g. 1.1), you may run all the code.
For convenience, the results are already present in the GitHub repository, in the Data folder. Thus, if you only want to generate the graphics and simulations, you can skip parts 1. and 2.
Graphics will be saved in Data/Pictures/
Benchmark Study
- Get the Data from OpenML
Note that a fixed list of OpenML tasks is used here, so that we work with a fixed set of datasets (October 2016). We first remove all the datasets that do not fit our criteria (binary classification problem, no NAs, no high dimensionality, no simulated datasets, no duplicates). We then use the file "Data/OpenML/df.infos.RData" to remove the datasets that failed to load. If you want to recompute this file, you can set the option force=TRUE; computations should then last several hours.
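As an illustration, a single OpenML task can be fetched with the OpenML R package once the API key is configured. This is only a sketch: the task id below is an arbitrary example, not one of the benchmark tasks.

```r
# Illustrative sketch, assuming the OpenML package is installed and configured.
# task.id = 3 is an arbitrary example, not a task from the benchmark list.
library(OpenML)
task <- getOMLTask(task.id = 3)            # download one task (data + target definition)
dataset <- task$input$data.set$data        # the underlying data.frame
target <- task$input$data.set$target.features
```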
- Launch the benchmark
Here you can recompute the benchmark using batchtools. The function setBatchtoolsExperiment() will clear the current batchtools folder and prepare R for a new benchmark computation. You can then use the batchtools function submitJobs() to compute the results for the datasets. getStatus() helps monitor the computation. For 278 datasets, it took around 8 hours with 7 i7 cores and 8 GB RAM.
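The workflow described above can be sketched as follows; setBatchtoolsExperiment() is defined in this repository's code, while submitJobs() and getStatus() are standard batchtools functions:

```r
# Sketch of the benchmark workflow; setBatchtoolsExperiment() comes from this repository.
library(batchtools)
setBatchtoolsExperiment()   # clear the batchtools folder and register the jobs
submitJobs()                # launch the computation on the configured cores
getStatus()                 # monitor progress while the jobs run
```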
Visualization of the results
- Convert the results
Results are converted to a dataframe.
- Overall visualization
Barplot of the ranks is plotted, as well as boxplots of the performance and of the performance difference for the acc, auc and brier measures.
- Inclusion Criteria visualization
We visualize the boxplots of the performance difference for different subgroups defined by the values of meta-features such as p and n.
Analysis of the results
- Overall results
We present here the mean, standard deviation and bootstrap confidence interval of the results, as well as the power of the test.
- Meta-Learning
Partial dependence plot of a model trained to predict the performance difference between RF and LR from the values of the meta-features.
Simulations
- Subset simulation
Computation of the performance of LR and RF for many sub-datasets of the OpenML dataset with id 310. Sub-datasets are randomly generated using subsets of p0 < p features or n0 < n observations. We then visualize how the difference between RF and LR depends on increasing values of p0 and n0.
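One subset draw could look roughly like the following sketch, assuming the OpenML package; the values of p0 and n0 and the target-column position are illustrative only:

```r
# Rough sketch of one random sub-dataset draw; p0 and n0 are illustrative values.
library(OpenML)
d <- getOMLDataSet(data.id = 310)$data   # the dataset with id 310 used in this simulation
n0 <- 100                                # number of observations to keep (n0 < n)
p0 <- 5                                  # number of features to keep (p0 < p)
rows <- sample(nrow(d), n0)
cols <- sample(ncol(d) - 1, p0)          # assumes the target is the last column
sub <- d[rows, c(cols, ncol(d))]         # random sub-dataset with n0 rows, p0 features
```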
- Partial dependence plot simulation
Computation of simple examples of partial dependence.
- Computation of the difference in partial dependence
Computation of the difference in partial dependence between RF and LR for all 278 datasets considered in this study. This computation may be very time-consuming.
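For intuition, a toy partial-dependence plot for a random forest can be produced with randomForest::partialPlot; the binary iris subproblem and the chosen variable below are illustrative only, not data from the benchmark:

```r
# Toy partial-dependence example on a binary classification problem;
# the iris-based data and the plotted variable are illustrative only.
library(randomForest)
bin <- droplevels(iris[iris$Species != "setosa", ])   # two-class subproblem
rf <- randomForest(Species ~ ., data = bin, ntree = 100)
partialPlot(rf, bin, x.var = "Petal.Length")          # partial dependence on one feature
```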
More information can be found on the Docker website.
You might have to change the default parameters of your docker machine, such as the number of CPUs and the amount of RAM allowed for a container. These parameters can be set in the graphical interface, or via the command line, for example:
# Remove the default docker machine parameters
> docker-machine rm default
# Create new default parameters for the docker machine, for example with 16 GB RAM, 8 CPUs and a 20 GB hard drive.
> docker-machine create -d virtualbox --virtualbox-memory=16000 --virtualbox-cpu-count=8 --virtualbox-disk-size=20000 default
The Docker image can be found on DockerHub. You can pull the image (around 5.53 GB) to your system via the command line:
> docker pull shadoko/benchmark_rflr:version6
Note that the Dockerfile is provided on DockerHub so that the image can be rebuilt.
You can then run a container from the image:
> docker run --rm -p 8787:8787 shadoko/benchmark_rflr:version6
The option "-v /Users/myname/myFolder:/home/rstudio/Docker-Benchmark/" can be used to link a volume with the docker container.
--rm indicates that the container will be deleted when stopped
-p 8787:8787 indicates that the Rstudio instance will be available on port 8787
-v /myComputerPath/:/myContainerPath/ links a volume from your computer to your container VM, so that you can, for example, open your R project
shadoko/benchmark_rflr:version6 refers to the docker image you want to create a container from
The GitHub code is already included in the Docker image. You can also link your docker container with a folder containing the GitHub project.
Note: on Windows the syntax is different, and using the Public user folder is recommended to avoid permission issues: /c/Users/Public/MyFolder:/home/rstudio/Project
- Check the IP of your machine, which should look like 192.168.0.12
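If you use docker-machine, the VM's IP address can also be queried directly from the command line (assuming the machine is named default, as in the commands above):

```shell
# Print the IP of the "default" docker machine
docker-machine ip default
```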
- In your browser, enter http://myContainerIP:8787 and sign in with username rstudio and password rstudio. Alternatively, use http://0.0.0.0:8787
Close the container: in the command line, use Ctrl+C.
Check that no containers are running with the command:
> docker ps