Code supporting the paper: Random forest versus logistic regression: a large-scale benchmark experiment
- Create an OpenML account, and generate an API key
- Get the GitHub code here
- Optional: set up Docker as presented below, for a reproducible environment. Note that all files are already included in the Docker image.
- Open Benchmark_RF-LR_OpenML.Rproj
- Open main.R
- Enter your OpenML API key at the beginning of the file so that you will be able to download data from OpenML
- Enter the number of cores you want to use for the benchmark experiment
```r
# Enter below nCores and myapikey
nCores = ???                                     # number of CPUs you want to use
myapikey = "??????????????????????????????????"  # OpenML API key
saveOMLConfig(apikey = myapikey, arff.reader = "RWeka", overwrite = TRUE)
```
Make sure you have all the required packages if you do not use our Docker image.
For each of the subsections (e.g. 1.1), you may run all the code.
For convenience, the results are already present in the GitHub repository, in the Data folder. Thus, if you only want to generate the graphics and simulations, you can skip parts 1. and 2.
Graphics will be saved in Data/Pictures/
Benchmark Study
- Get the Data from OpenML
Note that a fixed list of OpenML tasks is used here, so that we work with a fixed set of datasets (October 2016). We first remove all the datasets that do not fit our criteria (binary classification problem, no NAs, no high dimensionality, no simulated datasets, no duplicates). We then use the file "Data/OpenML/df.infos.RData" to remove the datasets that failed to load. If you want to recompute this file, you can set the option force=TRUE; computations should then last several hours.
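As an illustration, a single OpenML task can be fetched with the OpenML R package once the API key is configured. This is only a sketch: the task id below is an arbitrary example, not one of the benchmark tasks.

```r
# Illustrative sketch, assuming the OpenML package is installed and configured.
# task.id = 3 is an arbitrary example, not a task from the benchmark list.
library(OpenML)
task <- getOMLTask(task.id = 3)            # download one task (data + target definition)
dataset <- task$input$data.set$data        # the underlying data.frame
target <- task$input$data.set$target.features
```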
- Launch the benchmark
Here you can recompute the benchmark using batchtools. The function setBatchtoolsExperiment() will clear the current batchtools folder and prepare R for a new benchmark computation. You can then use the batchtools function submitJobs() to compute the results for the datasets. getStatus() helps monitor the computation. For 278 datasets, it took around 8 hours with 7 i7 cores and 8 GB RAM.
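The workflow described above can be sketched as follows; setBatchtoolsExperiment() is defined in this repository's code, while submitJobs() and getStatus() are standard batchtools functions:

```r
# Sketch of the benchmark workflow; setBatchtoolsExperiment() comes from this repository.
library(batchtools)
setBatchtoolsExperiment()   # clear the batchtools folder and register the jobs
submitJobs()                # launch the computation on the configured cores
getStatus()                 # monitor progress while the jobs run
```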
Visualization of the results
- Convert the results
Results are converted to a dataframe.
- Overall visualization
Barplot of the ranks is plotted, as well as boxplots of the performance and of the performance difference for the acc, auc and brier measures.
- Inclusion Criteria visualization
We visualize the boxplots of the performance difference for different subgroups defined by the values of meta-features such as p and n.
Analysis of the results
- Overall results
We present here the mean, standard deviation and bootstrap confidence interval of the results, as well as the power of the test.
- Meta-Learning
Partial dependence plot of a model trained to predict the performance difference between RF and LR from the values of the meta-features.
Simulations
- Subset simulation
Computation of the performance of LR and RF for many sub-datasets of the OpenML dataset with id 310. Sub-datasets are randomly generated using subsets of p0 < p features or n0 < n observations. We then visualize how the difference between RF and LR depends on increasing values of p0 and n0.
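One subset draw could look roughly like the following sketch, assuming the OpenML package; the values of p0 and n0 and the target-column position are illustrative only:

```r
# Rough sketch of one random sub-dataset draw; p0 and n0 are illustrative values.
library(OpenML)
d <- getOMLDataSet(data.id = 310)$data   # the dataset with id 310 used in this simulation
n0 <- 100                                # number of observations to keep (n0 < n)
p0 <- 5                                  # number of features to keep (p0 < p)
rows <- sample(nrow(d), n0)
cols <- sample(ncol(d) - 1, p0)          # assumes the target is the last column
sub <- d[rows, c(cols, ncol(d))]         # random sub-dataset with n0 rows, p0 features
```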
- Partial dependence plot simulation
Computation of simple examples of partial dependence.
- Computation of the difference in partial dependence
Computation of the difference in partial dependence between RF and LR for all 278 datasets considered in this study. This computation may be very time-consuming.
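For intuition, a toy partial-dependence plot for a random forest can be produced with randomForest::partialPlot; the binary iris subproblem and the chosen variable below are illustrative only, not data from the benchmark:

```r
# Toy partial-dependence example on a binary classification problem;
# the iris-based data and the plotted variable are illustrative only.
library(randomForest)
bin <- droplevels(iris[iris$Species != "setosa", ])   # two-class subproblem
rf <- randomForest(Species ~ ., data = bin, ntree = 100)
partialPlot(rf, bin, x.var = "Petal.Length")          # partial dependence on one feature
```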
More information can be found on the Docker website.
You might have to change the default parameters of your docker machine, such as the number of CPUs and the amount of RAM allowed for a container. These parameters can be set in the graphical interface, or via the command line, for example:
# Remove the default docker machine parameters
> docker-machine rm default
# Create new default parameters for the docker machine, for example with 16 GB RAM, 8 CPUs and a 20 GB hard drive.
> docker-machine create -d virtualbox --virtualbox-memory=16000 --virtualbox-cpu-count=8 --virtualbox-disk-size=20000 default
The Docker image can be found on DockerHub. You can pull the image (around 5.53 GB) to your system via the command line:
> docker pull shadoko/benchmark_rflr:version6
Note that the Dockerfile is provided on DockerHub so that the image can be rebuilt.
You can then run a container from the image:
> docker run --rm -p 8787:8787 shadoko/benchmark_rflr:version6
The option "-v /Users/myname/myFolder:/home/rstudio/Docker-Benchmark/" can be used to link a volume with the docker container.
--rm indicates that the container will be deleted when stopped
-p 8787:8787 indicates that the Rstudio instance will be available on port 8787
-v /myComputerPath/:/myContainerPath/ links a volume from your computer to your container VM, so that you can, for example, open your R project
shadoko/benchmark_rflr:version6 refers to the docker image you want to create a container from
The GitHub code is already included in the Docker image. You can also link your docker container with a folder containing the GitHub project.
Note: on Windows the syntax is different, and using the Public user folder is recommended to avoid permission issues: /c/Users/Public/MyFolder:/home/rstudio/Project
- Check the IP of your machine, which should look like 192.168.0.12
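If you use docker-machine, the VM's IP address can also be queried directly from the command line (assuming the machine is named default, as in the commands above):

```shell
# Print the IP of the "default" docker machine
docker-machine ip default
```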
- In your browser, enter http://myContainerIP:8787 and sign in with username rstudio and password rstudio. Alternatively, use http://0.0.0.0:8787
Close the container: in the command line, use Ctrl+C.
Check that no containers are running with the command:
> docker ps