vmware / declarative-cluster-management

Declarative cluster management using constraint programming, where constraints are described using SQL.

License: Other

Java 96.58% FreeMarker 0.25% Awk 0.04% Shell 0.51% R 1.35% Python 1.28%

declarative-cluster-management's Introduction

Overview

Modern cluster management systems like Kubernetes routinely grapple with hard combinatorial optimization problems: load balancing, placement, scheduling, and configuration. Implementing application-specific algorithms to solve these problems is notoriously hard to do, making it challenging to evolve the system over time and add new features.

DCM is a tool to overcome this challenge. It enables programmers to build schedulers and cluster managers using a high-level declarative language (SQL).

Specifically, developers represent cluster state in an SQL database, and write the constraints and policies that should apply to that state using SQL. From the SQL specification, the DCM compiler synthesizes a program that, at runtime, can be invoked to compute policy-compliant cluster management decisions given the latest cluster state. Under the covers, the generated program efficiently encodes the cluster state as an optimization problem that can be solved using off-the-shelf solvers, freeing developers from having to design ad-hoc heuristics.

The high-level architecture is shown in the diagram below.

Download

The DCM project's groupId is com.vmware.dcm and its artifactId is dcm. We make DCM's artifacts available through Maven Central.

To use DCM from a Maven-based project, use the following dependency:

<dependency>
    <groupId>com.vmware.dcm</groupId>
    <artifactId>dcm</artifactId>
    <version>0.15.0</version>
</dependency>

To use within a Gradle-based project:

implementation 'com.vmware.dcm:dcm:0.15.0'

Pre-requisites for use

We test regularly on JDK 11 and 16.
We test regularly on OSX and Ubuntu 20.04.
We currently support two solver backends.
- Google OR-tools CP-SAT (version 9.1.9490). This is available by default when using the maven dependency.
- MiniZinc (version 2.3.2). This backend is currently being deprecated. If you still want to use it in your project, or if you want to run all tests in this repository, you will have to install MiniZinc out-of-band.
  
  To do so, download MiniZinc from https://www.minizinc.org/software.html ... and make sure you are able to invoke the minizinc binary from your commandline.

Quick start

Here is a complete program that you can run to get a feel for DCM.

import com.vmware.dcm.Model;
import org.jooq.DSLContext;
import org.jooq.impl.DSL;
import org.junit.jupiter.api.Test;

import java.util.List;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

public class QuickStartTest {

    @Test
    public void quickStart() {
        // Create an in-memory database and get a JOOQ connection to it
        final DSLContext conn = DSL.using("jdbc:h2:mem:");

        // A table representing some machines
        conn.execute("create table machines(id integer)");

        // A table representing tasks, that need to be assigned to machines by DCM.
        // To do so, create a variable column (prefixed by controllable__).
        conn.execute("create table tasks(task_id integer, controllable__worker_id integer, " +
                "foreign key (controllable__worker_id) references machines(id))");

        // Add four machines
        conn.execute("insert into machines values(1)");
        conn.execute("insert into machines values(3)");
        conn.execute("insert into machines values(5)");
        conn.execute("insert into machines values(8)");

        // Add two tasks
        conn.execute("insert into tasks values(1, null)");
        conn.execute("insert into tasks values(2, null)");

        // Time to specify a constraint! Just for fun, let's assign tasks to machines such that
        // the machine IDs sum up to 6.
        final String constraint = "create constraint example_constraint as " +
                "select * from tasks check sum(controllable__worker_id) = 6";

        // Create a DCM model using the database connection and the above constraint
        final Model model = Model.build(conn, List.of(constraint));

        // Solve and return the tasks table. The controllable__worker_id column will either be [1, 5] or [5, 1]
        final List<Integer> column = model.solve("TASKS")
                .map(e -> e.get("CONTROLLABLE__WORKER_ID", Integer.class));
        assertEquals(2, column.size());
        assertTrue(column.contains(1));
        assertTrue(column.contains(5));
    }
}

Documentation

The Model class serves as DCM's public API. It exposes two methods: Model.build() and model.solve().

Check out the tutorial to learn how to use DCM by building a simple VM load balancer
Check out our research papers for the back story behind DCM
The Model API Javadocs

Contributing

We welcome all feedback and contributions! ❤️

Please use Github issues for user questions and bug reports.

Check out the contributing guide if you'd like to send us a pull request.

Information for developers

The entire build, including unit tests, can be triggered from the root folder with the following command (make sure to set up both solvers first):

$: ./gradlew build

To avoid documentation drift, code snippets in a documentation file (like the README or tutorial) are embedded directly from source files that are continuously tested. To refresh these documentation files:

$: npx embedme <file>

The Kubernetes scheduler also comes with integration tests that run against a real Kubernetes cluster. It goes without saying that you should not point to a production cluster as these tests repeatedly delete all running pods and deployments. To run these integration-tests, make sure you have a valid KUBECONFIG environment variable that points to a Kubernetes cluster.

We recommend setting up a local multi-node cluster and a corresponding KUBECONFIG using kind. Once you've installed kind, run the following to create a test cluster:

$: kind create cluster --config k8s-scheduler/src/test/resources/kind-test-cluster-configuration.yaml --name dcm-it

The above step will create a configuration file in your home folder (~/.kube/kind-config-dcm-it). Make sure you set the KUBECONFIG environment variable to point to that path.

You can then execute the following command to run integration-tests against the created local cluster:

$: KUBECONFIG=~/.kube/kind-config-dcm-it ./gradlew :k8s-scheduler:integrationTest

To run a specific integration test class (example: SchedulerIT from the k8s-scheduler module):

$: KUBECONFIG=~/.kube/kind-config-dcm-it ./gradlew :k8s-scheduler:integrationTest --tests SchedulerIT

Learn more

To learn more about DCM, we suggest going through the following references:

Talks:

Hydra 2021 (~75 minutes)
OSDI 2020 (20 minutes)

Research papers:

Building Scalable and Flexible Cluster Managers Using Declarative Programming
Lalith Suresh, Joao Loff, Faria Kalim, Sangeetha Abdu Jyothi, Nina Narodytska, Leonid Ryzhyk, Sahan Gamage, Brian Oki, Pranshu Jain, Michael Gasch. To appear, 14th USENIX Symposium on Operating Systems Design and Implementation, (OSDI 2020).
Automating Cluster Management with Weave
Lalith Suresh, Joao Loff, Faria Kalim, Nina Narodytska, Leonid Ryzhyk, Sahan Gamage, Brian Oki, Zeeshan Lokhandwala, Mukesh Hira, Mooly Sagiv. arXiv preprint arXiv:1909.03130 (2019).
Synthesizing Cluster Management Code for Distributed Systems
Lalith Suresh, João Loff, Nina Narodytska, Leonid Ryzhyk, Mooly Sagiv, and Brian Oki. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS 2019). ACM, New York, NY, USA, 45-50. DOI: https://doi.org/10.1145/3317550.3321444

declarative-cluster-management's Issues

Case sensitivity in using DSLContext.meta() and checking for names

There are different assumptions by libraries we use to analyze SQL programs and interact with databases around case insensitivity. Model.build() should canonicalize all programs to upper or lower case to be on the safe side.
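A minimal sketch of what such canonicalization could look like, assuming upper case is chosen as the canonical form. The class and method names here are hypothetical and are not part of DCM's API.

```java
import java.util.Locale;

// Hypothetical helper, not part of DCM: canonicalizes SQL identifiers so that
// lookups do not depend on the casing used by a particular library or database.
final class IdentifierCanonicalizer {

    // Locale.ROOT keeps the result independent of the JVM's default locale
    // (for example, the Turkish locale maps 'i' to a dotted capital I).
    static String canonicalize(final String identifier) {
        return identifier.toUpperCase(Locale.ROOT);
    }

    // Two identifiers are treated as the same name if their canonical forms match.
    static boolean sameIdentifier(final String a, final String b) {
        return canonicalize(a).equals(canonicalize(b));
    }

    public static void main(final String[] args) {
        System.out.println(sameIdentifier("controllable__worker_id", "CONTROLLABLE__WORKER_ID"));
    }
}
```

This sketch ignores quoted identifiers, which some databases treat as case sensitive.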

capacityConstraint throws an exception on an empty domain, should it?

Currently, model.solve() with the or-tools backend throws a runtime exception if a capacityConstraint is evaluated on an empty domain set. Is this necessary? For now, library users can check the domain by querying the table or view that the capacityConstraint applies to, but they then also have to keep the model's data cache and the database query execution in sync, until there is an API to query the model's cached data (is there one?).

Also, the k8s-scheduler implementation assumes model.solve() does not throw exceptions (otherwise the scheduler's execution loop is terminated). Maybe it's better to make sure these operations never throw exceptions, or throw only checked exceptions?
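Until that is decided, a caller can shield its own loop from runtime exceptions. The sketch below is one way to do so; SafeSolve and trySolve are hypothetical names, and the Supplier stands in for a call such as model.solve(...).

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical wrapper, not part of DCM: keeps a scheduling loop alive when a
// single solver invocation fails with a runtime exception.
final class SafeSolve {

    // Runs one solver call; a runtime failure becomes an empty result instead
    // of propagating up and terminating the caller's loop.
    static <T> Optional<T> trySolve(final Supplier<T> solveCall) {
        try {
            return Optional.ofNullable(solveCall.get());
        } catch (final RuntimeException e) {
            System.err.println("solve failed, skipping this round: " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(final String[] args) {
        // Stand-in for a solver call that fails on an empty domain.
        final Supplier<List<Integer>> failing = () -> {
            throw new IllegalStateException("empty domain");
        };
        System.out.println(trySolve(failing).isPresent());
        System.out.println(trySolve(() -> List.of(1, 5)).isPresent());
    }
}
```

Swallowing every RuntimeException is a blunt choice made for brevity; a real scheduler would likely distinguish recoverable failures from bugs.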

Remove controllable__ syntax

Either dynamically declare variable columns during model initialization or support annotations in the schema.

Update README instructions to reflect or-tools solver

Aid solution explainability in terms of soft constraints

Add support to return the computed objective values for all soft constraints.

Add IR for comprehension syntax to represent unary operators

Align objective function syntax with hard constraint syntax

Similar to the check syntax, avoid encoding the fact that a view is an objective function in the view name. Instead, use "maximize" or "minimize" annotations

Add an or-tools CP/SAT backend

Being tracked here: #9

Index usage in ortools backend is sensitive to TableRowGenerator ordering in IR

Make sure string literals in generated code preserve the casing from the DB

Migrate to official ortools Maven dependency

Infer variable domains from constraints

Improve documentation

Improve Model.java Javadocs
Do not export Javadocs for classes that are not part of the public API
Add a docs/ folder
Separate top-level README into multiple parts, such as factoring out the tutorial into the docs/ folder or the examples/ folder.
Add API documentation

Use coveralls to track code coverage

Reduce presolve times

Pod placement times especially at low batching sizes (like one pod) are dominated by presolve costs.

For example, the overall time to place a single pod in a 1000 node cluster as per JMH benchmarks is roughly 57ms (from pod arrival via the informer API, up to a binding decision being made):

Benchmark                      (numThreads)  (solverToUse)  Mode  Cnt           Score           Error  Units
EndToEnd.testSinglePodPlacement             1        ORTOOLS  avgt    3  5785447107.000 ± 737202520.836  ns/op
EndToEnd.testSinglePodPlacement             2        ORTOOLS  avgt    3  5785887140.333 ± 324973887.983  ns/op

Profiling these invocations via async-profiler, we find that presolve times are dominating. From a single example solver invocation:

Parameters: max_time_in_seconds: 1 log_search_progress: true num_search_workers: 2 cp_model_probing_level: 0
Optimization model '':
#Variables: 9028 (3 in objective)
 - 1 in [-2147483648,2147483647]
 - 10 in [0,1]
 - 3 in [0,2147483647]
 - 1 in [2,1001]
 - 9013 constants in {0,1,2,3,4,5,6,7,8,9,10,11,12 ... 993,994,995,996,997,998,999,1000,1001,5999,7900}
#kBoolAnd: 4 (#enforced: 4) (#literals: 4)
#kBoolOr: 4 (#enforced: 4) (#literals: 4)
#kCumulative: 6
#kInterval: 1001
#kLinear1: 18 (#enforced: 12)
*** starting model presolve at 0.00s
- 8019 affine relations were detected.
- 8019 variable equivalence relations were detected.
- rule 'bool_and: non-reified.' was applied 2 times.
- rule 'bool_or: always true' was applied 2 times.
- rule 'bool_or: only one literal' was applied 5 times.
- rule 'bool_or: removed enforcement literal' was applied 2 times.
- rule 'cumulative: no intervals' was applied 4 times.
- rule 'cumulative: removed intervals with no demands' was applied 6 times.
- rule 'enforcement literal not used' was applied 1 time.
- rule 'false enforcement literal' was applied 4 times.
- rule 'interval: unused, converted to linear' was applied 996 times.
- rule 'linear: empty' was applied 999 times.
- rule 'linear: fixed or dup variables' was applied 999 times.
- rule 'linear: infeasible' was applied 3 times.
- rule 'linear: size one' was applied 9 times.
- rule 'objective: variable not used elsewhere' was applied 2 times.
- rule 'presolve: iteration' was applied 1 time.
- rule 'true enforcement literal' was applied 9 times.
Optimization model '':
#Variables: 9 (1 in objective)
 - 1 in [0,2147483647]
 - 1 in [1,1000]
 - 1 in [2,1001]
 - 6 constants in {1,2,3,4,5,109}
#kCumulative: 2
#kInterval: 5
*** starting Search at 0.03s with 2 workers and strategies: [ auto, lp_br, helper, rnd_lns_auto, var_lns_auto, cst_lns_auto, rins/rens_lns_auto ]
#Bound   0.03s best:inf   next:[1,2.14748365e+09] auto
#1       0.03s best:2     next:[1,1]      auto num_bool:3
#2       0.03s best:1     next:[1,0]      auto num_bool:4
#Done    0.03s  auto
CpSolverResponse:
status: OPTIMAL
objective: 1
best_bound: 1
booleans: 4
conflicts: 0
branches: 3
propagations: 3
integer_propagations: 11
walltime: 0.0454228
usertime: 0.0454229
deterministic_time: 4.8e-07
primal_integral: 0
19:56:24.110 [computation-thread-0] INFO  org.dcm.Model - Solver has run successfully in 60546538ns. Processing records.

Of the roughly 45ms spent within the solver, about 30ms is spent in the presolve phase.

Enforce Preconditions/Assertions throughout the code to print why they fail

Infer the minimal set of tables to fetch based on constraints

We currently either fetch all tables in the schema, or a subset specified by the user. A single pass over every constraint should tell us the minimal set of tables/views to fetch on every iteration.
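As a rough illustration of such a pass, the sketch below collects the names that follow FROM or JOIN in each constraint string. It is only an approximation under stated assumptions: a real implementation would walk the parsed query rather than use a regular expression, and ReferencedTables is a hypothetical name.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not DCM code: a single pass over constraint strings
// that collects the tables/views they reference.
final class ReferencedTables {

    // Matches the identifier that directly follows FROM or JOIN.
    private static final Pattern TABLE_REF = Pattern.compile(
            "\\b(?:from|join)\\s+([A-Za-z_][A-Za-z0-9_]*)", Pattern.CASE_INSENSITIVE);

    static Set<String> referencedTables(final List<String> constraints) {
        final Set<String> tables = new TreeSet<>();
        for (final String constraint : constraints) {
            final Matcher matcher = TABLE_REF.matcher(constraint);
            while (matcher.find()) {
                // Upper-case the name so differently cased references collapse into one entry
                tables.add(matcher.group(1).toUpperCase(Locale.ROOT));
            }
        }
        return tables;
    }

    public static void main(final String[] args) {
        final String constraint = "create constraint example_constraint as "
                + "select * from tasks check sum(controllable__worker_id) = 6";
        System.out.println(referencedTables(List.of(constraint)));
    }
}
```

The regular expression does not handle comma-separated FROM lists, subqueries, or views that themselves depend on other tables.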

Add usage documentation and examples

This is a tracking issue for adding usage documentation to DCM.

Update README.md with fully runnable examples of how to use DCM.
Add an examples folder with fully runnable examples.

Simplify installation

Once Google OR-Tools becomes available as a Maven package, we can save users the trouble of manually installing the solver. Depends on google/or-tools#202.
Make MiniZinc optional for users (but not for developers, if they want to run all the tests).

Bump up to JDK 15

Text blocks are no longer in preview as of JDK 15. This will help clean up a significant amount of clutter in the codebase from constructing long SQL strings.
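To illustrate the kind of cleanup this enables, here is the table definition from the quick start written both ways. This is a sketch that requires JDK 15 or later; the class name is hypothetical.

```java
// Hypothetical example class: the same SQL string built by concatenation and
// as a JDK 15 text block.
final class TextBlockExample {

    // Concatenation, as used in the quick start
    static final String CONCATENATED =
            "create table tasks(task_id integer, controllable__worker_id integer, " +
            "foreign key (controllable__worker_id) references machines(id))";

    // Text block: a trailing backslash joins the two source lines without
    // inserting a newline, so both constants hold the same string.
    static final String TEXT_BLOCK = """
            create table tasks(task_id integer, controllable__worker_id integer, \
            foreign key (controllable__worker_id) references machines(id))""";

    public static void main(final String[] args) {
        System.out.println(CONCATENATED.equals(TEXT_BLOCK));
    }
}
```

Dropping the trailing backslash would keep the line break in the string, which is usually harmless in SQL and reads more naturally for long queries.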

This is being worked on in the jdk15 branch.

Don't search for an UNSAT core unless solver status is INFEASIBLE

@askiad

k8s-scheduler: Pod.Name probably should not be used as primary key

Since different namespaces can have different Pods by the same name, it looks better to define pods_info table's primary key as Pod.UUID and have foreign keys based on UUID. I see UUID is missing from pods_info, if there is a reason behind that, defining a compound key of Pod.Name and Pod.Namespace looks more appropriate.

Construct IRTable entries for transitive closure of FK relationships

Avoid overloading with TypeToken pattern for min/max aggregate functions

minV() and maxV() variants should be rewritten to minVType.

Support check syntax

Use consistent syntax for hard and soft constraints

With #82 being fixed, there is little reason to have a different query structure for hard and soft constraints. We can instead structure both constraints as a view that produces a set of records, followed by a CHECK or MAXIMIZE clause, followed by an expression. This would allow a user to easily switch between hard/soft constraints if they want.

Alias management when lowering to IR

Aliases must be scoped to within a query, not globally as it is done now.

Infer opportunities to use scalar products for sum() aggregates

Early tests suggest significant improvement in performance when dealing with large problem sizes. Need to infer this automatically from the compiler and generate code accordingly.

Infer variable domain bounds when creating variables

Relax the requirement that an objective function has to be a scalar expression

If it's not a scalar expression (e.g., preference for certain assignments to hold true), we can simply encode it as a sum on behalf of the user.

ortools: using sharper types in backend code generator

Use enums to represent operators and supported functions

There is accumulating cruft from using Strings instead.
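A sketch of the direction this could take. The enum below and its members are illustrative assumptions, not the operator set DCM actually supports.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical enum, not DCM code: each operator is a typed constant, so the
// compiler can check exhaustive handling instead of comparing raw Strings.
enum ComparisonOperator {
    EQUAL("="),
    NOT_EQUAL("!="),
    LESS_THAN("<"),
    GREATER_THAN(">");

    private final String symbol;

    ComparisonOperator(final String symbol) {
        this.symbol = symbol;
    }

    String symbol() {
        return symbol;
    }

    // Parses the textual form once, at the boundary; unknown symbols yield an
    // empty Optional rather than flowing through the backend as a String.
    static Optional<ComparisonOperator> fromSymbol(final String symbol) {
        return Arrays.stream(values())
                .filter(op -> op.symbol.equals(symbol))
                .findFirst();
    }
}
```

With this shape, ComparisonOperator.fromSymbol("!=") resolves to NOT_EQUAL, and an unsupported symbol is rejected where the query is parsed rather than deep inside code generation.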

Run integration tests on Circle CI

This will require migrating to the machine executor.

Currently tracked by https://github.com/vmware/declarative-cluster-management/tree/circle-ci-machine-exec

Add performance benchmarks that can run per build

API improvements to support ddlog

While H2 has been convenient thus far, it might help to separate the JOOQ/JDBC-specific API boundary from Model.updateData() and place it behind a more abstract interface that can be used to collect the input data required by the solver. This would allow us to interface with relational engines like ddlog, which JOOQ cannot interface with.

Support toggling constraints on/off

Another option is to use a builder pattern to create models, with which one could easily instantiate multiple models.
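One possible shape for such a builder is sketched below. ConstraintSetBuilder is a hypothetical class, not an existing DCM API; it only assembles the list of constraint strings, which could then be handed to Model.build(conn, constraints) as in the quick start.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical builder, not part of DCM: registers named constraints and lets
// the caller switch individual ones off before building a model.
final class ConstraintSetBuilder {
    // LinkedHashMap preserves the order in which constraints were added
    private final Map<String, String> constraintsByName = new LinkedHashMap<>();
    private final Set<String> disabled = new HashSet<>();

    ConstraintSetBuilder add(final String name, final String sql) {
        constraintsByName.put(name, sql);
        return this;
    }

    ConstraintSetBuilder disable(final String name) {
        disabled.add(name);
        return this;
    }

    ConstraintSetBuilder enable(final String name) {
        disabled.remove(name);
        return this;
    }

    // Returns only the SQL of constraints that are currently enabled
    List<String> build() {
        final List<String> enabled = new ArrayList<>();
        for (final Map.Entry<String, String> entry : constraintsByName.entrySet()) {
            if (!disabled.contains(entry.getKey())) {
                enabled.add(entry.getValue());
            }
        }
        return enabled;
    }
}
```

Because build() can be called repeatedly with different toggles, the same builder could produce several constraint lists, one per model instance.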

API improvements

Following #73, some API improvements that are in the works:

ModelException should exclusively correspond to issues with the model. Ideally, it'll only be thrown during model creation.
SolverException should only be thrown when invoking the underlying solver, or the solver API. This corresponds to bugs in the input data + solver-interaction. For now, SolverException.reason() will convey why the model failed.
Unsat cores from or-tools. Use a natural API (maybe in the form of tables?) to convey why a model failed.

@reith do add any further requirements from your end here.

Refactor IR internal naming

Much of the IR uses naming around monoid comprehensions, even though our IR has since evolved to only deal with list comprehensions.

Use only half-reified constraints where appropriate

Currently, the or-tools CP-SAT backend fully reifies all constraints when constructing expressions. This is overly conservative. We can reduce the number of the intermediate variables and constraints by only using half-reified constraints when appropriate (for example, logical constraints that have to be true).

Improve intermediate view type inference

For now, it assumes all columns computed are of type IntVar as a default. This however does not work if an intermediate view is consumed in a subsequent Group By.

Check for supported subset of SQL syntax

Implement variable re-use

CapacityConstraint scale and limits

capacityConstraint demands and capacities are Integers and, if not normalized, can overflow quickly. In fact, the k8s-scheduler implementation will probably fail to schedule Pods requesting more than 4Gi of memory. It seems ortools can work with Longs; in that case, would changing them have a negative impact on solving performance?
Currently, I scale down values in my controller before storing them in the database, but maybe the implementation can be enhanced. Also, there is a scale factor which further reduces the overflow limit by 1000. If capacityConstraint is expected to work only for Integers, it would be good to detect overflows and throw exceptions in the generated code's iteration. It would be better still to fail model creation if the database schema suggests that overflow is possible.

Also, there are two possible division-by-zero cases: here, when a node has no capacity, and here, when some capacity is zero on all nodes. I was surprised by the second case because, in my scenario, the capacityConstraint was called for a different task that I wasn't solving the model for. The policy that was called was for scheduling Pods, but I didn't have any Pods in the database or model, so all demands were zero there and the function could probably have returned sooner.
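The overflow concern can be checked with plain arithmetic. The sketch below assumes memory is stored in bytes and uses the scale factor of 1000 mentioned above; CapacityOverflowCheck is a hypothetical helper, not DCM code.

```java
// Hypothetical helper, not DCM code: detects whether a demand or capacity,
// once multiplied by a scale factor, still fits in a 32-bit Integer.
final class CapacityOverflowCheck {

    static boolean fitsInInt(final long value, final long scaleFactor) {
        try {
            // multiplyExact and toIntExact throw ArithmeticException on overflow
            // instead of silently wrapping around.
            Math.toIntExact(Math.multiplyExact(value, scaleFactor));
            return true;
        } catch (final ArithmeticException e) {
            return false;
        }
    }

    public static void main(final String[] args) {
        final long fourGiBInBytes = 4L * 1024 * 1024 * 1024; // 4294967296 > Integer.MAX_VALUE
        System.out.println(fitsInInt(fourGiBInBytes, 1));
        // With a scale factor of 1000, the largest safe value is Integer.MAX_VALUE / 1000
        System.out.println(fitsInInt(2_000_000L, 1000));
        System.out.println(fitsInInt(3_000_000L, 1000));
    }
}
```

A check like this could run when input data is loaded, so that an oversized value fails with a clear message rather than producing a wrapped, negative capacity.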

Rewrite affinity/anti-affinity views

Users provide labels that describe which nodes pods are affine and anti-affine to. These labels help shortlist nodes that are whitelisted and blacklisted for the pod. Currently, we combine this information to create a final whitelist of nodes for the pod. This can be problematic if the user only provides us with a blacklist; then, we build a view that gives us the set of rows = {all nodes - blacklisted nodes}. This list is large and can be expensive to pull out. Instead, we can try to make use of the fact that the blacklist is usually small and rewrite our constraint to ignore the blacklisted nodes while placing the pod.

Migrate to the Apache Calcite parser

The presto parser API is not extensible, forcing us to shoehorn DCM's check/maximize syntax on top of views.

Calcite's parser is extensible, allowing us to add our own DDL for constraints, and restrict the subset of SQL we'd like to support more cleanly.

Expression short-circuiting and null friendliness for operations in Ops

An operation of the form CHECK x = 10 OR var1 = col1, currently gets compiled down to a series of ortools operations that constructs the full expression without returning early. If x=10 but var1 is null, we still get an NPE when passing an argument to Ops.

The above operation is equivalent to WHERE x != 10 CHECK var1 = col1, which generates an if-expression that only encodes the constraints for rows that pass the x != 10 predicate. Doing so makes this safe for use cases where var1 might be null if x=10 but has a value otherwise.
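The difference between the two forms can be reproduced in plain Java, independent of ortools. In this sketch, eager() mimics building the full OR expression, while guarded() mimics the WHERE-based rewrite; both method names are hypothetical.

```java
// Hypothetical illustration, not DCM code: eager evaluation of both operands
// versus filtering rows before the check is encoded.
final class NullSafeCheck {

    // Mimics CHECK x = 10 OR var1 = col1 with both sides always evaluated:
    // comparing a null Integer with an int unboxes it and throws an NPE.
    static boolean eager(final int x, final Integer var1, final int col1) {
        final boolean left = x == 10;
        final boolean right = var1 == col1;
        return left || right;
    }

    // Mimics WHERE x != 10 CHECK var1 = col1: rows with x = 10 are filtered
    // out first, so var1 is never touched for them.
    static boolean guarded(final int x, final Integer var1, final int col1) {
        if (x == 10) {
            return true; // row is excluded by the WHERE clause, nothing to check
        }
        return var1 == col1;
    }

    public static void main(final String[] args) {
        System.out.println(guarded(10, null, 3));
        try {
            eager(10, null, 3);
        } catch (final NullPointerException e) {
            System.out.println("eager evaluation hit a null operand");
        }
    }
}
```

Note that guarded() still fails if var1 is null on a row where x is not 10; what should happen in that case is exactly the open question about null semantics.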

It's worth considering what the behavior should be in the presence of nulls and whether we can potentially rewrite the IR.

Reassociation pass on variables

Apply a reassociation pass on expressions to evaluate as many constants as possible before forming IntVars.
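As a toy illustration of the idea, consider a sum whose terms are either integer constants or named variables. Reassociating the sum lets all constants be added up front, leaving a single constant next to the variables. The representation below (Integer for constants, String for variable names) is an assumption made for this sketch and is unrelated to DCM's IR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not DCM code: folds all constants of a sum into one
// term so that fewer solver variables and expressions need to be created.
final class Reassociate {

    // A term is either an Integer constant or a String naming a variable.
    static List<Object> foldConstants(final List<Object> terms) {
        int constantSum = 0;
        final List<Object> folded = new ArrayList<>();
        for (final Object term : terms) {
            if (term instanceof Integer) {
                constantSum = Math.addExact(constantSum, (Integer) term);
            } else {
                folded.add(term);
            }
        }
        // Keep a single constant term unless it is zero and variables remain
        if (constantSum != 0 || folded.isEmpty()) {
            folded.add(constantSum);
        }
        return folded;
    }

    public static void main(final String[] args) {
        // (1 + x) + (2 + y) + 3 becomes x + y + 6
        System.out.println(foldConstants(List.of(1, "x", 2, "y", 3)));
    }
}
```

This relies on addition being associative and commutative; the same trick does not apply unchanged to operators such as subtraction or division.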

Add travis CI

Improve examples/

Add examples for:

Incremental placement vs global re-shuffling
use case with A/B testing

Access tuples by index and not field name

Following #85, we have more instances in the generated code where we used field names to reference cells from Jooq Records. This is proving fragile given the three-way interaction between:

DCM canonicalizing table/field names (upper case always)
JOOQ using a schema file for code generation but then connecting to a...
database at runtime with its own rules for case sensitivity

The fix on the DCM side is to always use field indices instead of field names to refer to values within JOOQ Records, without any loss of readability.