Comments (5)

DavisVaughan commented on September 3, 2024
library(furrr)
library(rsample)
library(listenv)
library(tictoc)

# ..............................................................................

# 16 bootstraps
n <- 16

# The ones at the end take much longer than
# the ones at the beginning
straps <- bootstraps(iris, times = n)
delays <- seq_len(n)

# Each bootstrap takes a varying amount of time
process_strap <- function(split, delay) {
  Sys.sleep(delay)
  nrow(analysis(split))
}

# ..............................................................................

# Sends bootstraps 1-8 to the 8 cores immediately, without blocking the console.
# Then, once the request to send the 9th bootstrap to a worker is made,
# it waits for the next core to become available, and once one is open, it
# sends it along. This continues for all bootstraps.

# This has better performance, and the time on each core will probably
# be something like:
# Core 1) 1 sec + 9  sec 
# Core 2) 2 sec + 10 sec 
# Core 3) 3 sec + 11 sec
# Core 4) 4 sec + 12 sec
# Core 5) 5 sec + 13 sec
# Core 6) 6 sec + 14 sec
# Core 7) 7 sec + 15 sec
# Core 8) 8 sec + 16 sec

# So the total time is the longest spent on any core, or around 8 + 16 = 24 sec

plan(multiprocess, workers = 8)

vals <- listenv()

# dynamic allocation
tic()
for(i in seq_len(nrow(straps))) {
  
  strap <- straps[i,]
  delay <- delays[i]
  
  vals[[i]] %<-% process_strap(strap$splits[[1]], delay)
  
}
vals <- as.list(vals)
toc()
#> 25.502 sec elapsed

# ..............................................................................

# This evenly divides your 16 splits into 8 groups of 2 (in order)
# Then it sends all 8 groups off to the 8 cores all at once
# So core 1 has bootstraps [1,2] (which take 1 and 2 seconds respectively)
# So core 8 has bootstraps [15,16] (which take 15 and 16 seconds respectively)

# So here you are limited by the 8th core
# and you get ~15 + 16 = 31 seconds

# reset
plan(multiprocess, workers = 8)

# batch allocation
tic()
vals2 <- future_map2(straps$splits, delays, process_strap)
toc()
#> 31.168 sec elapsed

# ..............................................................................

# The future_options() option: `scheduling`
# This controls the question: "How many futures (chunks) to create per worker?"
# - Defaults to 1   = 1 future per worker (8 chunks, as above)
# - You can use Inf = 1 future per element of `.x` (16 chunks, each with 1 element;
#    this is most similar to the first example)

# reset
plan(multiprocess, workers = 8)

# basically dynamic allocation
tic()
vals2 <- future_map2(straps$splits, delays, process_strap, 
                     .options = future_options(scheduling = Inf))
toc()
#> 25.267 sec elapsed

Created on 2019-01-08 by the reprex package (v0.2.1.9000)

So, as you can hopefully see, when you have lots of things to do and each takes the exact same amount of time, it's fine to use the default scheduling = 1. But when each thing takes a varying amount of time, and you risk having one core working much harder than the others, it might be more useful to set scheduling = Inf. This problem becomes more prevalent in your example too, as you have many more than 16 bootstraps.

sheffe commented on September 3, 2024

Thank you @DavisVaughan! That is exactly what I needed -- pretty certain that this solution will speed up half of my current projects. I'd also never seen the tictoc package before -- very slick.

(This was a way-past-expectations thorough response -- really appreciate it. If someone from my team tries to buy you a beer at rstudio::conf next week, you'll know why...)

DavisVaughan commented on September 3, 2024

Hi @sheffe, thanks for the nice words about furrr. I have an answer for you that is hopefully satisfying.

I assume what's happening is that work is divided across cores at the outset, and there's enough variance in the runtime of my function calls that some cores can finish their work faster than others.

This is correct! At least with the default, furrr will evenly divide your work into 1 chunk per worker you have available, and then ship each chunk out to the workers.
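
As a rough illustration of that chunking (this uses parallel::splitIndices just to show the idea; it is not necessarily furrr's internal code), 16 elements across 8 workers end up as 8 contiguous chunks of 2:

library(parallel)

# 8 contiguous chunks of 2 elements each, in order
chunks <- splitIndices(16, 8)
lengths(chunks)  # every chunk has 2 elements
chunks[[8]]      # elements 15 and 16 land on the same worker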

Is there an option for dynamic task allocation, within furrr itself or future?

Sort of! You can alter the scheduling option. See below for the full explanation.

Update) Edited because I remembered what the scheduling option does.

sheffe commented on September 3, 2024

@DavisVaughan a followup thought here -- I implemented your solution last night and it's going beautifully, with some other lessons learned. Maybe worth a docs PR if I can consolidate this well -- LMK.

I'm seeing a pretty strong performance relationship between future_options(scheduling = N) and the expected runtime variance of the function being iterated.

  • If scheduling = N goes too high, more overhead is burned dividing and allocating jobs, and that may not be repaid by better load balancing. (For the task that prompted my question -- hundreds of thousands of highly optimized function calls that can't directly be vectorized -- testing scheduling = Inf was a pretty severe performance regression: nearly all of the time went to dividing up the work and managing the returned futures, and the 64 cores sat largely idle relative to that overhead. For web scraping, parallel file reads on very different file sizes, etc. -- long-running functions especially -- that trade-off probably inverts.)
  • scheduling = 200 brought it closer to scheduling = 1
  • For now, the 25-50 range is where I'm seeing some nice improvement.

Before this issue, with scheduling = 1 I'd been watching some cores finish up to ~25% earlier than others, so I'm guessing the 25-50 range is the sweet spot for minimizing work-allocation overhead while still dividing the work into chunks small enough to bring that "core disuse from early finishing" gap from a max of ~25% down to a max of ~2-4%.

I think there's a way to test that hypothesis directly, and perhaps a vignette could demonstrate how to figure it out for a specific problem. A rough sketch of what I mean is below.
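
For example (a sketch only, reusing the straps, delays, and process_strap objects from the reprex above; the candidate values are arbitrary), you could time the same call at a few scheduling settings and see where the load-balancing gain stops paying for the chunking overhead:

library(furrr)
library(tictoc)

plan(multiprocess, workers = 8)

# Time the identical mapping under different chunking strategies
for (n_chunks in c(1, 2, Inf)) {
  cat("scheduling =", n_chunks, "\n")
  tic()
  future_map2(straps$splits, delays, process_strap,
              .options = future_options(scheduling = n_chunks))
  toc()
}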

This issue probably doesn't need reopening. If I have any luck with turning that profiling into reliable heuristics, I'll report back. I'm still completely incredulous that this worked so well with such a small change to the code. Thanks again!

DavisVaughan commented on September 3, 2024

I think this works best when you have, say, 50 models to run on 10 cores, where each model takes a varying amount of time to finish but each one takes at least a few minutes to run. That way the overhead sits in the task itself rather than in the allocation of data.

That's good rationale for why the default is 1; more intricate tuning requires some domain knowledge of your exact problem.

I'm happy it seems to be helping though!

This is good to know though, and needs to be elaborated on further. I'll open another issue for that.
