mlr3batchmark

A connector between mlr3 and batchtools. It allows you to run large-scale benchmark experiments on scheduled high-performance computing clusters.

The package comes with two core functions for switching between mlr3 and batchtools to perform a benchmark:

  • After creating a design object (as required for mlr3’s benchmark() function), instead of benchmark() call batchmark() which populates an ExperimentRegistry for the computational jobs of the benchmark. You are now in the world of batchtools where you can selectively submit jobs with different resources, monitor the progress or resubmit as needed.
  • After the computations are finished, collect the results with reduceResultsBatchmark() to return to mlr3. The resulting object is a regular BenchmarkResult.

Example

library("mlr3")
library("batchtools")
library("mlr3batchmark")
tasks = tsks(c("iris", "sonar"))
learners = lrns(c("classif.featureless", "classif.rpart"))
resamplings = rsmp("cv", folds = 3)

design = benchmark_grid(
  tasks = tasks,
  learners = learners,
  resamplings = resamplings
)

reg = makeExperimentRegistry(NA)
## No readable configuration file found

## Created registry in '/tmp/Rtmp8DlMZQ/registry704553adf7a88' using cluster functions 'Interactive'
ids = batchmark(design, reg = reg)
## Adding algorithm 'run_learner'

## Adding problem 'b39ef23a66b1f1ee'

## Exporting new objects: '5ec484de3f93431b' ...

## Exporting new objects: '7c35d835f3dfae37' ...

## Exporting new objects: '70dd22724e5c724d' ...

## Adding 6 experiments ('b39ef23a66b1f1ee'[1] x 'run_learner'[2] x repls[3]) ...

## Adding problem '76c4fc7a533d41b7'

## Exporting new objects: 'b209de197d6cbe75' ...

## Adding 6 experiments ('76c4fc7a533d41b7'[1] x 'run_learner'[2] x repls[3]) ...
submitJobs()
## Submitting 12 jobs in 12 chunks using cluster functions 'Interactive' ...
getStatus()
## Status for 12 jobs at 2023-11-13 19:32:20:
##   Submitted    : 12 (100.0%)
##   -- Queued    :  0 (  0.0%)
##   -- Started   : 12 (100.0%)
##   ---- Running :  0 (  0.0%)
##   ---- Done    : 12 (100.0%)
##   ---- Error   :  0 (  0.0%)
##   ---- Expired :  0 (  0.0%)
reduceResultsBatchmark()
## <BenchmarkResult> of 12 rows with 4 resampling runs
##  nr task_id          learner_id resampling_id iters warnings errors
##   1    iris classif.featureless            cv     3        0      0
##   2    iris       classif.rpart            cv     3        0      0
##   3   sonar classif.featureless            cv     3        0      0
##   4   sonar       classif.rpart            cv     3        0      0
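
The selective submission mentioned in the list above could look roughly like the following sketch. It assumes that unwrap(getJobTable()) exposes a learner_id column (the job parameters stored by batchmark() may differ between package versions) and it uses a temporary registry rather than a real cluster configuration.

```r
library("mlr3")
library("batchtools")
library("mlr3batchmark")

# Same design as in the example above
design = benchmark_grid(
  tasks = tsks(c("iris", "sonar")),
  learners = lrns(c("classif.featureless", "classif.rpart")),
  resamplings = rsmp("cv", folds = 3)
)

# Temporary registry, removed at the end of the R session
reg = makeExperimentRegistry(file.dir = NA)
batchmark(design, reg = reg)

# One row per job, with the job parameters unnested into columns
# (assumption: a learner_id column is present)
job_table = unwrap(getJobTable(reg = reg))

# Submit only the rpart jobs for now
rpart_ids = job_table$job.id[job_table$learner_id == "classif.rpart"]
submitJobs(ids = rpart_ids, reg = reg)
waitForJobs(reg = reg)

# Collect the finished jobs only; the remaining jobs can be submitted later
bmr = reduceResultsBatchmark(ids = findDone(reg = reg), reg = reg)
```

On a real cluster, a resources list could be passed to submitJobs() per subset of jobs; which resource names are honored depends on the cluster functions and template in use.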

Resources

  • The Large-Scale Benchmarking chapter of the mlr3 book

mlr3batchmark's People

Contributors

be-marc, github-actions[bot], jemus42, mllg, pat-s, sebffischer, tdhock

mlr3batchmark's Issues

Function reduceResultsBatchmark saves backend when store_backends=FALSE

Hi again,

I am experimenting with the reduceResultsBatchmark function.

The function consumes a lot of RAM on my local machine, even after setting store_models=FALSE and store_backends=FALSE. I have looked at the source code, and it seems to store the task in the result object regardless of the store_backends=FALSE argument:

 rdata = mlr3::ResultData$new(
   data.table(
     task = list(task),
     learner = list(learner),
     resampling = list(resampling),
     iteration = tab$repl,
     prediction = map(results, "prediction"),
     learner_state = map(results, "learner_state"),
     uhash = tab$job.name
   ),
   store_backends = store_backends
 )

It just sets the store_backends attribute to store_backends, but it still saves the task inside the object.

Is that intentional? I would expect the task not to keep its backend when the store_backends argument is set to FALSE.

Should we really rely on learner hashes?

When we create complicated graph learners, it can always happen that one of their components does not implement its hash properly.
I stumbled upon this problem, and it leads to bad bugs, because the experiments that batchtools executes are then not those defined in the design.
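
One way to guard against this, sketched below, is to check that all learners in a design have distinct hashes before calling batchmark(). It is only a sanity check under the assumption that distinct learners should produce distinct hashes; it detects collisions but does not repair a faulty hash implementation. The three learners are illustrative and not taken from the issue.

```r
library("mlr3")

# Three learners that differ in class or hyperparameters
learners = c(
  lrns(c("classif.featureless", "classif.rpart")),
  list(lrn("classif.rpart", maxdepth = 2))
)

# Collect the hash of every learner
hashes = vapply(learners, function(l) l$hash, character(1))

# Stop early if two learners share a hash
duplicated_ids = vapply(learners[duplicated(hashes)], function(l) l$id, character(1))
if (length(duplicated_ids) > 0) {
  stop("Learners with colliding hashes: ", paste(duplicated_ids, collapse = ", "))
}
```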

Release mlr3batchmark 0.1.1

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)

Learner has not been trained yet

I have trained models using mlr3batchmark package:

  # Exp dir
  date_ = strftime(Sys.time(), format = "%Y%m%d")
  dirname_ = glue("experiments_pread_live_{date_}")
  if (dir.exists(dirname_)) system(paste0("rm -r ", dirname_))

  # Create registry
  packages = c("data.table", "gausscov", "paradox", "mlr3", "mlr3pipelines",
               "mlr3tuning", "mlr3misc", "future", "future.apply",
               "mlr3extralearners", "stats")
  reg = makeExperimentRegistry(file.dir = dirname_, seed = 1, packages = packages)
  
  # Populate registry with problems and algorithms to form the jobs
  batchmark(designs, store_models = TRUE, reg = reg)
  
  # Save registry
  saveRegistry(reg = reg)

  # get nondone jobs
  ids = findNotDone(reg = reg)
  
  # set up cluster (for local it is parallel)
  cf = makeClusterFunctionsSocket(ncpus = 4L)
  reg$cluster.functions = cf
  saveRegistry(reg = reg)
  
  # define resources and submit jobs
  resources = list(ncpus = 2, memory = 8000)
  submitJobs(ids = ids$job.id, resources = resources, reg = reg)

When I load models after the training:

  results = reduceResultsBatchmark(reg = reg, store_backends = TRUE)
  
  # predictions for new dataset
  task_new = task$clone()
  task_new$filter(rows = data_ids)
  results$learners$learner[[1]]$predict(task = task_new)

I get an error:

Error: Cannot predict, Learner 'filter_target_branch.nop_filter_target.filter_target_id.filter_target_unbranch.subsample.dropnacol.dropna.removeconstants_1.fixfactors.winsorizesimple.removeconstants_2.dropcorr.scale_branch.uniformization.scale.scale_unbranch.dropna_v2.nop_union_pca.pca.ica.feature_union_pca.disr.jmim.jmi.mim.mrmr.njmim.cmim.carscore.information_gain.relief.gausscov_f1st.feature_union_filters.removeconstants_3.regr.ranger.tuned' has not been trained yet

The error suggests that the model was not saved, even though I set store_models = TRUE.
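
A possible explanation, offered as an assumption rather than a confirmed diagnosis: according to the mlr3 documentation, the learners returned by the $learners field of a BenchmarkResult are reset, so their models are not available there. The sketch below uses a plain benchmark() call on the iris task instead of the registry from this issue, and reads the fitted learner from the resample result instead.

```r
library("mlr3")

# Small stand-in benchmark; the issue uses batchmark() and a registry instead
design = benchmark_grid(tsk("iris"), lrn("classif.rpart"), rsmp("holdout"))
bmr = benchmark(design, store_models = TRUE)

# The learners listed in bmr$learners carry no fitted model
untrained = bmr$learners$learner[[1]]
is.null(untrained$model)

# The fitted learners are available per resampling iteration
trained = bmr$resample_result(1)$learners[[1]]
pred = trained$predict(tsk("iris"))
```

The same access pattern should apply to the object returned by reduceResultsBatchmark(), since it is a regular BenchmarkResult, provided the models were stored when the jobs ran.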

Run mlr3batchmark on PBS Pro cluster

I am new to mlr3batchmark and HPC computing, so could you help me out? I have read the chapter on HPC computing in the mlr3 book and replicated all the steps on my local machine, but I don't know how to run everything on the cluster.

I am not sure what my cluster functions should look like. According to the cluster documentation, available at https://wiki.srce.hr/pages/viewpage.action?pageId=121966084 (it is in Croatian, but easy to translate with Google), the cluster uses PBS Pro, which is not included in the list of cluster functions.

It seems very similar to the cluster functions created by makeClusterFunctionsTORQUE, but there are some differences too.
For example, from the error I get:

Error: Listing of jobs failed (exit code 127);
cmd: 'qselect -u $USER -s EHRT'
output:
command not found
Execution halted

it seems this cluster doesn't have the qselect command, but uses qstat instead.

Also, I don't see a select option in the resources, which defines the number of nodes we want to use.

Do you have experience with, or a template for, PBS Pro cluster functions? It seems to me that I should write my own cluster functions.

Error in benchmark grid: A Resampling is instantiated for a task with a different number of observations

I have upgraded all mlr3 packages, including mlr3batchmark.

Now I am getting an error I didn't get before.

The error is:

### [bt]: Generating problem instance for problem '8a7d4bb8e5156165' ...
### [bt]: Applying algorithm 'run_learner' on problem '8a7d4bb8e5156165' for job 3 (seed = 4) ...
INFO  [09:41:37.721] [mlr3] Applying learner 'subsample.dropnacol.dropna.removeconstants_1.fixfactors.winsorizesimple.removeconstants_2.dropcorr.scale_branch.uniformization.scale.scale_unbranch.dropna_v2.nop_union_pca.pca.ica.featureunion.filter_branch.jmi.relief.gausscov_f1st.filter_unbranch.regr.lightgbm.tuned' on task 'taskRetWeek' (iter 1/1)
INFO  [09:41:40.017] [bbotk] Starting to optimize 9 parameter(s) with '<OptimizerHyperband>' and '<TerminatorNone>'
INFO  [09:41:40.128] [bbotk] Evaluating 1 configuration(s)
Error in benchmark_grid(self$task, self$learner, resampling, param_values = list(xss)) : 
  A Resampling is instantiated for a task with a different number of observations
40: (function (e) 
    traceback(2L))()
39: stop("A Resampling is instantiated for a task with a different number of observations")
38: benchmark_grid(self$task, self$learner, resampling, param_values = list(xss))
37: .__ObjectiveTuning__.eval_many(self = self, private = private, 
        super = super, xss = xss, resampling = resampling)
36: private$.eval_many(xss, resampling = list(<environment>))
35: eval(expr, p)
34: eval(expr, p)
33: eval.parent(expr, n = 1L)
32: invoke(private$.eval_many, xss, .args = self$constants$values)
31: .__Objective__eval_many(self = self, private = private, super = super, 
        xss = xss)
30: self$objective$eval_many(xss_trafoed)
29: .__OptimInstance__eval_batch(self = self, private = private, 
        super = super, xdt = xdt)
28: inst$eval_batch(xdt)
27: private$.optimize(inst)
26: doTryCatch(return(expr), name, parentenv, handler)
25: tryCatchOne(expr, names, parentenv, handlers[[1L]])
24: tryCatchList(expr, classes, parentenv, handlers)
23: tryCatch({
        private$.optimize(inst)
    }, terminated_error = function(cond) {
    })
22: optimize_default(inst, self, private)
21: .__Optimizer__optimize(self = self, private = private, super = super, 
        inst = inst)
20: private$.optimizer$optimize(inst)
19: .__TunerFromOptimizer__optimize(self = self, private = private, 
        super = super, inst = inst)
18: self$tuner$optimize(instance)
17: .__AutoTuner__.train(self = self, private = private, super = super, 
        task = task)
16: get_private(learner)$.train(task)
15: .f(learner = <environment>, task = <environment>)
14: eval(expr, p)
13: eval(expr, p)
12: eval.parent(expr, n = 1L)
11: invoke(.f, .args = .args, .opts = .opts, .seed = .seed, .timeout = .timeout)
10: encapsulate(learner$encapsulate["train"], .f = train_wrapper, 
        .args = list(learner = learner, task = task), .pkgs = learner$packages, 
        .seed = NA_integer_, .timeout = learner$timeout["train"])
9: learner_train(learner, task, sets[["train"]], sets[["test"]], 
       mode = mode)
8: workhorse(iteration = job$repl, task = data, learner = learner, 
       resampling = resampling, store_models = store_models, lgr_threshold = lgr::get_logger("mlr3")$threshold)
7: job$algorithm$fun(job = job, data = job$problem$data, instance = instance, 
       ...)
6: (function (...) 
   job$algorithm$fun(job = job, data = job$problem$data, instance = instance, 
       ...))(learner_hash = "b18020b11dad6832", learner_id = "subsample.dropnacol.dropna.removeconstants_1.fixfactors.winsorizesimple.removeconstants_2.dropcorr.scale_branch.uniformization.scale.scale_unbranch.dropna_v2.nop_union_pca.pca.ica.featureunion.filter_branch.jmi.relief.gausscov_f1st.filter_unbranch.regr.lightgbm.tuned", 
       store_models = FALSE)
5: do.call(wrapper, job$algo.pars, envir = .GlobalEnv)
4: with_preserve_seed({
       set_seed(list(seed = seed, rng_kind = rng_kind))
       code
   })
3: with_seed(job$seed, do.call(wrapper, job$algo.pars, envir = .GlobalEnv))
2: execJob.Experiment(job)
1: execJob(job)

It's hard for me to reproduce, since there are lots of steps I am doing before getting to execJob(job), which produces the error. Additionally, I had to use only parts of the mlr3batchmark package because it didn't work as is before.

I first prepare data, learners and resamplings with:

# create registry
print("Create registry")
packages = c("data.table", "gausscov", "paradox", "mlr3", "mlr3pipelines",
             "mlr3tuning", "mlr3misc", "future", "future.apply",
             "mlr3extralearners", "stats")
reg = makeExperimentRegistry(file.dir = dirname_, seed = 1, packages = packages)

# populate registry with problems and algorithms to form the jobs
print("Batchmark")
batchmark(designs, reg = reg)

# save registry
print("Save registry")
saveRegistry(reg = reg)

and then call the script:

options(warn = -1)
library(data.table)
library(gausscov)
library(paradox)
library(mlr3)
library(mlr3pipelines)
library(mlr3viz)
library(mlr3tuning)
library(mlr3misc)
library(future)
library(future.apply)
library(mlr3extralearners)
library(batchtools)
library(mlr3batchmark)
library(checkmate)
library(stringi)
library(R6)
library(brew)



# UTILS -------------------------------------------------------------------
# utils functions
dir = function(reg, what) {
  fs::path(fs::path_expand(reg$file.dir), what)
}
getResultFiles = function(reg, ids) {
  fs::path(dir(reg, "results"), sprintf("%i.rds", if (is.atomic(ids)) ids else ids$job.id))
}
waitForFile = function(fn, timeout = 0, must.work = TRUE) {
  if (timeout == 0 || fs::file_exists(fn))
    return(TRUE)
  "!DEBUG [waitForFile]: `fn` not found via 'file.exists()'"
  timeout = timeout + Sys.time()
  path = fs::path_dir(fn)
  repeat {
    Sys.sleep(0.5)
    if (basename(fn) %chin% list.files(path, all.files = TRUE))
      return(TRUE)
    if (Sys.time() > timeout) {
      if (must.work)
        stopf("Timeout while waiting for file '%s'",
              fn)
      return(FALSE)
    }
  }
}
writeRDS = function (object, file, compress = "gzip") {
  batchtools:::file_remove(file)
  saveRDS(object, file = file, version = 2L, compress = compress)
  waitForFile(file, 300)
  invisible(TRUE)
}
UpdateBuffer = R6Class(
  "UpdateBuffer",
  cloneable = FALSE,
  public = list(
    updates = NULL,
    next.update = NA_real_,
    initialize = function(ids) {
      self$updates = data.table(
        job.id = ids,
        started = NA_real_,
        done = NA_real_,
        error = NA_character_,
        mem.used = NA_real_,
        written = FALSE,
        key = "job.id"
      )
      self$next.update = Sys.time() + runif(1L, 60, 300)
    },

    add = function(i, x) {
      set(self$updates, i, names(x), x)
    },

    save = function(jc) {
      i = self$updates[!is.na(started) & (!written), which = TRUE]
      if (length(i) > 0L) {
        first.id = self$updates$job.id[i[1L]]
        writeRDS(
          self$updates[i,!"written"],
          file = fs::path(
            jc$file.dir,
            "updates",
            sprintf("%s-%i.rds", jc$job.hash, first.id)
          ),
          compress = jc$compress
        )
        set(self$updates, i, "written", TRUE)
      }
    },

    flush = function(jc) {
      now = Sys.time()
      if (now > self$next.update) {
        self$save(jc)
        self$next.update = now + runif(1L, 60, 300)
      }
    }

  )
)


# RUN JOB -----------------------------------------------------------------
# load registry
if (interactive()) {
  reg = loadRegistry("experiments_test")
} else {
  reg = loadRegistry("experiments")
}

# extract integer
i = as.integer(Sys.getenv('PBS_ARRAY_INDEX'))
# i = 3L

# extract not done ids
ids_not_done = findNotDone(reg=reg)
ids_done = findDone(reg=reg)
(nrow(ids_not_done) + nrow(ids_done)) == 8866

# create job collection
# if (nrow(ids_done) == 0) {
#
# }
resources = list(ncpus = 4) # this shouldnt be important
jc = makeJobCollection(ids = NULL,
                       resources = resources,
                       reg = reg)

# resources = list(ncpus = 4) # this shouldnt be important
# jc = makeJobCollection(ids = ids_not_done,
#                        resources = resources,
#                        reg = reg)


# start buffer
buf = UpdateBuffer$new(jc$jobs$job.id)
update = list(started = batchtools:::ustamp(), done = NA_integer_, error = NA_character_, mem.used = NA_real_)

# get job
cat("Get Job \n")
# job = batchtools:::Job$new(
#   file.dir = jc$file.dir,
#   reader = batchtools:::RDSReader$new(FALSE),
#   id = jc$jobs[i]$job.id,
#   job.pars = jc$jobs[i]$job.pars[[1L]],
#   seed = 1 + jc$jobs[i]$job.id,
#   resources = jc$resources
# )
job = batchtools:::getJob(jc, i)
id = job$id

# execute job
cat("Execute Job")
gc(reset = TRUE)
update$started = batchtools:::ustamp()
result = execJob(job)

# save job
writeRDS(result, file = getResultFiles(jc, id), compress = jc$compress)

# memory usage
memory_used = tryCatch({
  memory.mult = c(if (.Machine$sizeof.pointer == 4L) 28L else 56L, 8L)
  gc_info = gc(verbose = FALSE)
  sum(gc_info[, 1L] * memory.mult) / 1000000L
}, error = function(e) {
  # Return a default value; an assignment inside the handler would not reach the outer scope
  1000
})

# updates
update$done = batchtools:::ustamp()
update$mem.used = memory_used
buf$add(i, update)
buf$flush(jc)
buf$save(jc)

I can send example data from experiments_test if needed.

I am not sure, but it seems to me that the number of elements in the exports folder used to be the same as the number of designs. Now, this is not the case.
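
For reference, a minimal sketch of one situation in which benchmark_grid() raises this kind of error: a resampling is instantiated on one task and then combined with a task that has a different number of rows. The tasks and learner are illustrative and unrelated to the pipeline in this issue; whether the same mechanism is at work inside the tuner here is an assumption, not something the traceback confirms.

```r
library("mlr3")

# Two versions of the same task with different numbers of rows
task_full = tsk("iris")
task_small = tsk("iris")$filter(1:100)

# Resampling instantiated on the full task (150 rows)
resampling = rsmp("holdout")$instantiate(task_full)

# Combining it with the smaller task is rejected by benchmark_grid();
# the error message is captured instead of stopping the script
msg = tryCatch(
  benchmark_grid(task_small, lrn("classif.rpart"), resampling),
  error = function(e) conditionMessage(e)
)
print(msg)
```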
