Comments (10)

Monnoroch commented on June 14, 2024

Amazing, thank you! I just started designing a proper rule and this is a great help!

EdSchouten commented on June 14, 2024

One of the things I'm always careful about when I add Prometheus metrics is making sure not to expose any redundant information. Not only does redundant information blow up the size of /metrics, it also makes it hard to grasp how the values correlate.

Every operation known by bb-scheduler always goes through the following state transitions:

nonexistent -> queued -> executing -> completed -> deleted

We have metrics in place at every state transition:

  • nonexistent -> queued: buildbarn_builder_in_memory_build_queue_operations_queued_total
  • queued -> executing: buildbarn_builder_in_memory_build_queue_operations_queued_duration_seconds_count
  • executing -> completed: buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count
  • completed -> deleted: buildbarn_builder_in_memory_build_queue_operations_completed_duration_seconds_count

What do you do if you want to measure the total number of operations in certain states? You simply subtract counters of the adjacent state transitions. For example, if I want to measure the number of operations that are either queued or executing (which is likely what you want to base your autoscaling on, as that represents the total amount of work you could be doing):

buildbarn_builder_in_memory_build_queue_operations_queued_total 
-
sum(buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count) without (result, grpc_code)

We need the sum() aggregation because buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count has two additional labels (result and grpc_code) that buildbarn_builder_in_memory_build_queue_operations_queued_total does not have.
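
For reference, here is a minimal sketch of wrapping that expression in a Prometheus recording rule, so that tools which can only consume a named metric can use it. The group and rule names below are just examples, not an established Buildbarn convention:

groups:
  - name: buildbarn-autoscaling
    rules:
      # Example name: operations that are currently either queued or executing.
      - record: buildbarn:operations_queued_or_executing
        expr: |
          buildbarn_builder_in_memory_build_queue_operations_queued_total
          -
          sum(buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count)
            without (result, grpc_code)

The recorded series behaves like any other metric, so it can be graphed or handed to an autoscaler directly.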

Monnoroch commented on June 14, 2024

Thanks for the quick and detailed response!

Unfortunately this doesn't play well with the k8s HPA + custom metrics exporter, which can't execute arbitrary expressions and requires a named metric. I could work around that issue, though, and record the provided expression.
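
As an illustration of that workaround, here is a rough sketch of an autoscaling/v2 HorizontalPodAutoscaler consuming such a recorded metric. It assumes the expression above has been recorded under the example name buildbarn:operations_queued_or_executing and that an adapter such as prometheus-adapter exposes it through the external metrics API (possibly under a slightly different name); the Deployment name and target value are placeholders:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: bb-worker                # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: bb-worker              # placeholder worker Deployment
  minReplicas: 1
  maxReplicas: 32
  metrics:
    - type: External
      external:
        metric:
          name: buildbarn:operations_queued_or_executing   # the recorded expression
        target:
          type: AverageValue
          averageValue: "8"      # example: aim for roughly 8 operations per worker

With a target of type AverageValue the HPA divides the metric by the current number of replicas, so the deployment grows roughly in proportion to the amount of outstanding work.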

EdSchouten commented on June 14, 2024

By the way, here's a slightly more complex expression that I'd like to invite people to give a try:

max_over_time(
    quantile_over_time(
        0.95,
        (
            buildbarn_builder_in_memory_build_queue_operations_queued_total
            -
            sum(buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count) without (result, grpc_code)
        )[4h:]
    )[1h:]
)

In short, it takes the 95th percentile of the amount of work that came in over the last four hours. We then do some post-processing by taking max_over_time over the last hour, to remove any jitter that would otherwise cause us to unnecessarily kill/spawn/kill/spawn/... workers.

(Note that you may want to tune the constants in the expression above to suit your needs. Also be sure to place it in a recording rule, because it's a bit heavy.)
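
A minimal sketch of such a recording rule, with an example rule name and an example evaluation interval, both of which you would tune:

groups:
  - name: buildbarn-autoscaling-heavy
    interval: 1m   # example: no need to evaluate a heavy subquery more often than this
    rules:
      - record: buildbarn:operations_in_flight:p95_4h_max_1h   # example name
        expr: |
          max_over_time(
              quantile_over_time(
                  0.95,
                  (
                      buildbarn_builder_in_memory_build_queue_operations_queued_total
                      -
                      sum(buildbarn_builder_in_memory_build_queue_operations_executing_duration_seconds_count) without (result, grpc_code)
                  )[4h:]
              )[1h:]
          )

An autoscaler can then target the recorded name directly instead of evaluating the heavy expression on every query.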

Monnoroch commented on June 14, 2024

Just want to add that the proposed query is not equivalent to my original idea, because it also counts currently executing actions, while I was thinking about scaling based only on the actions that are waiting in the queue.

EdSchouten commented on June 14, 2024

You mean only scaling based on the size of the queue? But that means that for a sufficiently sized cluster (where the queue generally remains empty), such an autoscaler would scale the cluster down to a size of zero...?

mickael-carl commented on June 14, 2024

That's not really an issue if you set the minReplicas of your autoscaler to e.g. 2, right?

Monnoroch commented on June 14, 2024

Sorry, maybe I wasn't clear. My point was that this:

> One of the things I'm always careful about when I add Prometheus metrics is making sure not to expose any redundant information.

is not strictly correct, as this information is not redundant because it can't be obtained any other way.

Depending on the type of autoscaler, you may indeed want to use "all actions in progress + in the queue" vs. "all actions in the queue". Right now I've settled for the former, as per the advice above in this thread, but I'd like to play with the latter as well. As mentioned, there are challenges, but I think it better matches the goal when your node pool is in the cloud and auto-scaled, nodes come in one by one, and you pay for every one that gets spawned. So I'd like my HPA to spawn replicas one by one when the queue is not empty for some significant period of time.

Upd 1.

The "ideal" algorithm for me would be:

  1. Spawn an empty node pool; no replicas are active (okay, maybe one, just to make builds warm up faster)
  2. Once the queue has been non-empty for like 2 minutes, spawn a second replica. It will not be able to start because of the lack of hardware, so the cloud will spawn a new node in the pool for me
  3. If the queue is still not empty after another 2 minutes, spawn a third one
  4. Et cetera
  5. If the queue has been empty for like 10 minutes, kill one replica. The node will become free and the cloud will kill it for me.
  6. If the queue is still empty after another 10 minutes, kill another replica
  7. Et cetera

This way I think I can design the scaling process to minimize infrastructure expenses at night, yet still provide maximal build speeds during the day when there are many builds happening.
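
For what it's worth, something close to this schedule can be approximated with the behavior field of an autoscaling/v2 HorizontalPodAutoscaler. A rough sketch of the stanza that would sit under spec: of an HPA like the one sketched earlier in the thread, with all windows and periods being example values:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 120   # only scale up once the metric has called for it for ~2 minutes
    policies:
      - type: Pods
        value: 1
        periodSeconds: 120            # add at most one replica per 2 minutes
  scaleDown:
    stabilizationWindowSeconds: 600   # only scale down once the metric has been low for ~10 minutes
    policies:
      - type: Pods
        value: 1
        periodSeconds: 600            # remove at most one replica per 10 minutes

Driven by a queue-length metric, this adds one worker at a time while the queue stays non-empty and removes one at a time once it has been empty for a while.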

EdSchouten commented on June 14, 2024

> Sorry, maybe I wasn't clear. My point was that this:

> > One of the things I'm always careful about when I add Prometheus metrics is making sure not to expose any redundant information.

> is not strictly correct, as this information is not redundant because it can't be obtained any other way.

> Depending on the type of autoscaler, you may indeed want to use "all actions in progress + in the queue" vs. "all actions in the queue". Right now I've settled for the former, as per the advice above in this thread, but I'd like to play with the latter as well.

Just because a concrete recording rule hasn't been provided to you doesn't mean it's not possible. What you are looking for (i.e., the number of actions currently in the queue) can be obtained by using this recording rule:

buildbarn_builder_in_memory_build_queue_operations_queued_total
- 
buildbarn_builder_in_memory_build_queue_operations_queued_duration_seconds_count
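
Spelled out as an actual rule file, a minimal sketch (again, the rule name is only an example) would be:

groups:
  - name: buildbarn-queue-length
    rules:
      # Example name: the number of operations currently sitting in the queue.
      - record: buildbarn:operations_queued
        expr: |
          buildbarn_builder_in_memory_build_queue_operations_queued_total
          -
          buildbarn_builder_in_memory_build_queue_operations_queued_duration_seconds_count

That recorded name could then replace the queued-or-executing metric in the HPA sketch earlier in the thread.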

> As mentioned, there are challenges, but I think it better matches the goal when your node pool is in the cloud and auto-scaled, nodes come in one by one, and you pay for every one that gets spawned. So I'd like my HPA to spawn replicas one by one when the queue is not empty for some significant period of time.

> Upd 1.

> The "ideal" algorithm for me would be:

>   1. Spawn an empty node pool; no replicas are active (okay, maybe one, just to make builds warm up faster)
>   2. Once the queue has been non-empty for like 2 minutes, spawn a second replica. It will not be able to start because of the lack of hardware, so the cloud will spawn a new node in the pool for me
>   3. If the queue is still not empty after another 2 minutes, spawn a third one
>   4. Et cetera
>   5. If the queue has been empty for like 10 minutes, kill one replica. The node will become free and the cloud will kill it for me.
>   6. If the queue is still empty after another 10 minutes, kill another replica
>   7. Et cetera

> This way I think I can design the scaling process to minimize infrastructure expenses at night, yet still provide maximal build speeds during the day when there are many builds happening.

The problem with such an algorithm is that the speed at which you scale up/down depends on the absolute size of your cluster. If you have a cluster with a hundred replicas and no load, it's going to take 100 * 10 = 1000 minutes = 16 hours, 40 minutes to scale all the way down. The recording rule that I presented, which scales the cluster based on a load quantile, works well regardless of the absolute scale of the cluster. It is guaranteed to scale up/down to the desired capacity within a certain amount of time.

Another problem with such an approach is that the startup speed of a node is also not factored in. If all you need is one more server, but it takes 10 minutes to start one up, you will have asked for five new servers (one every 2 minutes) by the time the first one comes online.

Monnoroch commented on June 14, 2024

> What you are looking for can be obtained by using this recording rule:

Thanks! But yeah, I wasn't able to find this one myself.

> The problem with such an algorithm is that the speed at which you scale up/down depends on the absolute size of your cluster.

That's why it can be improved based on the length of the queue. Plus, some manual tuning of the parameters will certainly help. But I definitely see your point: there are challenges to that approach compared to classic scaling based on load.
