Comments (10)

Monnoroch commented on June 14, 2024

Amazing, thank you! I just started designing a proper rule and this is a great help!

EdSchouten commented on June 14, 2024

One of the things I'm always careful about when I add Prometheus metrics is making sure not to expose any redundant information. Not only does redundant information blow up the size of /metrics, it also makes it hard to grasp how the values correlate.

Every operation known by bb-scheduler always goes through the following state transitions:

nonexistent -> queued -> executing -> completed -> deleted

We have metrics in place at every state transition:

  • nonexistent -> queued: buildbarn_builder_in_memory_build_queue_operations_queued_total
  • queued -> executing: buildbarn_builder_in_memory_build_queue_operations_queued_duration_seconds_count
  • executing -> completed: buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count
  • completed -> deleted: buildbarn_builder_in_memory_build_queue_operations_completed_duration_seconds_count

What do you do if you want to measure the total number of operations in certain states? You simply subtract counters of the adjacent state transitions. For example, if I want to measure the number of operations that are either queued or executing (which is likely what you want to base your autoscaling on, as that represents the total amount of work you could be doing):

buildbarn_builder_in_memory_build_queue_operations_queued_total 
-
sum(buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count) without (result, grpc_code)

We need the sum() aggregation because buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count has two additional labels (result and grpc_code) that buildbarn_builder_in_memory_build_queue_operations_queued_total does not have.
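
For reference, here is a minimal sketch of wrapping that expression in a Prometheus recording rule, so that tools which can only consume a named metric can use it. The group and rule names below are just examples, not an established Buildbarn convention:

groups:
  - name: buildbarn-autoscaling
    rules:
      # Example name: operations that are currently either queued or executing.
      - record: buildbarn:operations_queued_or_executing
        expr: |
          buildbarn_builder_in_memory_build_queue_operations_queued_total
          -
          sum(buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count)
            without (result, grpc_code)

The recorded series behaves like any other metric, so it can be graphed or handed to an autoscaler directly.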

Monnoroch commented on June 14, 2024

Thanks for the quick and detailed response!

Unfortunately this doesn't play well with the k8s HPA + custom metrics exporter, which can't execute arbitrary expressions and requires a named metric. I could work around that issue, though, and record the provided expression.
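
As an illustration of that workaround, here is a rough sketch of an autoscaling/v2 HorizontalPodAutoscaler consuming such a recorded metric. It assumes the expression above has been recorded under the example name buildbarn:operations_queued_or_executing and that an adapter such as prometheus-adapter exposes it through the external metrics API (possibly under a slightly different name); the Deployment name and target value are placeholders:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bb-worker                # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bb-worker              # placeholder worker Deployment
  minReplicas: 1
  maxReplicas: 32
  metrics:
    - type: External
      external:
        metric:
          name: buildbarn:operations_queued_or_executing   # the recorded expression
        target:
          type: AverageValue
          averageValue: "8"      # example: aim for roughly 8 operations per worker

With a target of type AverageValue the HPA divides the metric by the current number of replicas, so the deployment grows roughly in proportion to the amount of outstanding work.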

EdSchouten commented on June 14, 2024

By the way, here's a slightly more complex expression that I'd like to invite people to give a try:

max_over_time(
    quantile_over_time(
        0.95,
        (
            buildbarn_builder_in_memory_build_queue_operations_queued_total
            -
            sum(buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count) without (result, grpc_code)
        )[4h:]
    )[1h:]
)

In short, it takes the 95th percentile of the amount of work that came in over the last four hours. We then do some post-processing by taking max_over_time over the last hour, to remove any jitter that would otherwise cause us to unnecessarily kill/spawn/kill/spawn/... workers.

(Note that you may want to tune the constants in the expression above to suit your needs. Also be sure to place it in a recording rule, because it's a bit heavy.)
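
A minimal sketch of such a recording rule, with an example rule name and an example evaluation interval, both of which you would tune:

groups:
  - name: buildbarn-autoscaling-heavy
    interval: 1m   # example: no need to evaluate a heavy subquery more often than this
    rules:
      - record: buildbarn:operations_in_flight:p95_4h_max_1h   # example name
        expr: |
          max_over_time(
              quantile_over_time(
                  0.95,
                  (
                      buildbarn_builder_in_memory_build_queue_operations_queued_total
                      -
                      sum(buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count) without (result, grpc_code)
                  )[4h:]
              )[1h:]
          )

An autoscaler can then target the recorded name directly instead of evaluating the heavy expression on every query.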

Monnoroch commented on June 14, 2024

Just want to add that the proposed query is not equivalent to my original idea, because it also counts currently executing actions, while I was thinking about scaling based only on the actions that are waiting in the queue.

EdSchouten commented on June 14, 2024

You mean only scaling based on the size of the queue? But that means that for a sufficiently sized cluster (where the queue generally remains empty), such an autoscaler would scale the cluster down to a size of zero...?

mickael-carl commented on June 14, 2024

That's not really an issue if you set the minReplicas of your autoscaler to e.g. 2, right?

Monnoroch commented on June 14, 2024

Sorry, maybe I wasn't clear. My point was that this:

> One of the things I'm always careful about when I add Prometheus metrics is making sure not to expose any redundant information.

is not strictly correct, as this information is not redundant because it can't be obtained any other way.

Depending on the type of autoscaler, you may indeed want to use "all actions in progress + in the queue" vs. "all actions in the queue". Right now I've settled for the former, as per the advice above in this thread, but I'd like to play with the latter as well. As mentioned, there are challenges, but I think it better matches the goal when your node pool is in the cloud and auto-scaled, nodes come in one by one, and you pay for every one that gets spawned. So I'd like my HPA to spawn replicas one by one when the queue is not empty for some significant period of time.

Upd 1.

The "ideal" algorithm for me would be:

  1. Spawn an empty node pool; no replicas are active (okay, maybe one, just to make builds warm up faster)
  2. Once the queue has been non-empty for like 2 minutes, spawn a second replica. It will not be able to start because of the lack of hardware, so the cloud will spawn a new node in the pool for me
  3. If the queue is still not empty after another 2 minutes, spawn a third one
  4. Et cetera
  5. If the queue has been empty for like 10 minutes, kill one replica. The node will become free and the cloud will kill it for me.
  6. If the queue is still empty after another 10 minutes, kill another replica
  7. Et cetera

This way I think I can design the scaling process to minimize infrastructure expenses at night, yet still provide maximal build speeds during the day when there are many builds happening.
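
For what it's worth, something close to this schedule can be approximated with the behavior field of an autoscaling/v2 HorizontalPodAutoscaler. A rough sketch of the stanza that would sit under spec: of an HPA like the one sketched earlier in the thread, with all windows and periods being example values:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 120   # only scale up once the metric has called for it for ~2 minutes
    policies:
      - type: Pods
        value: 1
        periodSeconds: 120            # add at most one replica per 2 minutes
  scaleDown:
    stabilizationWindowSeconds: 600   # only scale down once the metric has been low for ~10 minutes
    policies:
      - type: Pods
        value: 1
        periodSeconds: 600            # remove at most one replica per 10 minutes

Driven by a queue-length metric, this adds one worker at a time while the queue stays non-empty and removes one at a time once it has been empty for a while.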

EdSchouten commented on June 14, 2024

> Sorry, maybe I wasn't clear. My point was that this:

> > One of the things I'm always careful about when I add Prometheus metrics is making sure not to expose any redundant information.

> is not strictly correct, as this information is not redundant because it can't be obtained any other way.

> Depending on the type of autoscaler, you may indeed want to use "all actions in progress + in the queue" vs. "all actions in the queue". Right now I've settled for the former, as per the advice above in this thread, but I'd like to play with the latter as well.

Just because a concrete recording rule hasn't been provided to you doesn't mean it's not possible. What you are looking for (i.e., the number of actions currently in the queue) can be obtained by using this recording rule:

buildbarn_builder_in_memory_build_queue_operations_queued_total
- 
buildbarn_builder_in_memory_build_queue_operations_queued_duration_seconds_count
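
Spelled out as an actual rule file, a minimal sketch (again, the rule name is only an example) would be:

groups:
  - name: buildbarn-queue-length
    rules:
      # Example name: the number of operations currently sitting in the queue.
      - record: buildbarn:operations_queued
        expr: |
          buildbarn_builder_in_memory_build_queue_operations_queued_total
          -
          buildbarn_builder_in_memory_build_queue_operations_queued_duration_seconds_count

That recorded name could then replace the queued-or-executing metric in the HPA sketch earlier in the thread.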

> As mentioned, there are challenges, but I think it better matches the goal when your node pool is in the cloud and auto-scaled, nodes come in one by one, and you pay for every one that gets spawned. So I'd like my HPA to spawn replicas one by one when the queue is not empty for some significant period of time.

> Upd 1.

> The "ideal" algorithm for me would be:

>   1. Spawn an empty node pool; no replicas are active (okay, maybe one, just to make builds warm up faster)
>   2. Once the queue has been non-empty for like 2 minutes, spawn a second replica. It will not be able to start because of the lack of hardware, so the cloud will spawn a new node in the pool for me
>   3. If the queue is still not empty after another 2 minutes, spawn a third one
>   4. Et cetera
>   5. If the queue has been empty for like 10 minutes, kill one replica. The node will become free and the cloud will kill it for me.
>   6. If the queue is still empty after another 10 minutes, kill another replica
>   7. Et cetera

> This way I think I can design the scaling process to minimize infrastructure expenses at night, yet still provide maximal build speeds during the day when there are many builds happening.

The problem with such an algorithm is that the speed at which you scale up/down depends on the absolute size of your cluster. If you have a cluster with a hundred replicas and no load, it's going to take 100 * 10 = 1000 minutes = 16 hours, 40 minutes to scale all the way down. The recording rule that I presented, which scales the cluster based on a load quantile, works well regardless of the absolute scale of the cluster. It is guaranteed to scale up/down to the desired capacity within a certain amount of time.

Another problem with such an approach is that the startup speed of a node is also not factored in. If all you need is one more server, but it takes 10 minutes to start one up, you will have asked for five new servers (one every 2 minutes) by the time the first one comes online.

Monnoroch commented on June 14, 2024

> What you are looking for can be obtained by using this recording rule:

Thanks! But yeah, I wasn't able to find this one myself.

> The problem with such an algorithm is that the speed at which you scale up/down depends on the absolute size of your cluster.

That's why it can be improved based on the length of the queue. Plus, some manual tuning of the parameters will certainly help. But I definitely see your point: there are challenges to that approach compared to classic scaling based on load.
