wasmcloud / wadm

wasmCloud Application Deployment Manager (wadm): Declarative application deployments for wasmCloud applications.
Home Page: https://wasmcloud.com
License: Apache License 2.0
Looking for comments and thoughts on this one; it's a half-RFC.
When creating more complex wadm manifests, I noticed that I was often copying the same scaler config around for multiple resources. For example, if I wanted to have 3 actors all run on a host with a custom label, I'd have to add this for each actor:
traits:
  - type: spreadscaler
    properties:
      replicas: 1
      spread:
        - name: custom
          requirements:
            app: custom
It would be nice to be able to define this as a named custom configuration and then re-use that config by name later. Essentially, this would just let me define multiple components with the same traits in fewer lines. This could look like the following:
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-backend
  annotations:
    version: v0.0.1
    description: "My Backend"
spec:
  traits:
    - type: spreadscaler
      name: customspreadscaler
      properties:
        replicas: 1
        spread:
          - name: custom
            requirements:
              app: custom
  components:
    - name: myactor1
      type: actor
      properties:
        image: ghcr.io/myactor:0.1.0
      traits:
        - name: customspreadscaler
    - name: myactor2
      type: actor
      properties:
        image: ghcr.io/myactor2:0.1.0
      traits:
        - name: customspreadscaler
This is just a thought, and I'm happy to be challenged here. My worry is that jumping around to multiple places in a manifest to understand the spread configuration could make things too confusing.
Create a NATS API that exposes (at least) the following functionality:
This issue also includes implementing the underlying persistence mechanism for deployments. At implementation time, we'll need to decide if we're going to use global/distributed ETS or use JetStream.
To be complete, there should be a module in the library code that:
See for context: #40 (comment)
The definition of done is that the basic scaffolding is in place and working so we can make a decision whether or not to move forward with it. This scaffolding includes:
At a minimum, the following guides should exist:
There are going to be many times when a capability provider is "shared" between different applications (think something like the httpserver provider). Right now, if one is already running, every single reconcile will trigger a command that fails because some other application is running the provider.
To solve this, I think the provider scaler should NOT match on annotations and should only check whether a provider with the correct link name is running on hosts that match its spread requirements. It should also run a reconcile any time it sees a provider stopped event. This is correct behavior because all we are checking for is that a host has that capability; it doesn't matter what is running it. If for some reason that provider is deleted (by a user, or by another application that has been deleted), then the scaler should make sure it is running again for the application. A sketch of this check follows.
Please note that this isn't perfect. In the future I'd like to avoid the, albeit short, downtime of a provider stopping and then starting again, but that will require some sort of shared state between all of the different scalers.
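A minimal sketch of that membership check with stand-in types; it only verifies that some provider with the expected public key and link name is present on each host matching the spread requirements, ignoring annotations and whichever application started it:

use std::collections::HashMap;

pub struct HostInfo {
    pub labels: HashMap<String, String>,
    /// (provider public key, link name) pairs currently running on the host
    pub providers: Vec<(String, String)>,
}

/// True if the host matches the spread requirements and already runs a
/// provider with the expected key and link name (no annotation check)
fn host_satisfied(
    host: &HostInfo,
    requirements: &HashMap<String, String>,
    provider_key: &str,
    link_name: &str,
) -> bool {
    requirements
        .iter()
        .all(|(k, v)| host.labels.get(k) == Some(v))
        && host
            .providers
            .iter()
            .any(|(key, link)| key == provider_key && link == link_name)
}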
Capability providers support an optional JSON configuration, which we should be able to support via a config block in wadm. You can see the configuration variable in the interface: https://docs.rs/wasmcloud-interface-lattice-control/0.18.0/wasmcloud_interface_lattice_control/struct.StartProviderCommand.html.
This is most useful for defining configuration that a provider will use for every link, like a different default HTTP server port with the wasmCloud httpserver provider. This config block would be taken and sent along with the start provider command. It would be nicer to specify nested config as YAML or JSON instead of just an opaque string, but the provider itself may or may not support a specific format.
This would look something like this:
- name: httpserver
  type: capability
  properties:
    image: wasmcloud.azurecr.io/httpserver:0.17.0
    contract: wasmcloud:httpserver
    config:
      address: 0.0.0.0:8085
  traits:
    - type: spreadscaler
      properties:
        replicas: 1
It would be important here to note the difference between provider configuration and the linkdef trait, since they can be used for different things.
This only needs to be running the integration and unit tests. No building and deploying of artifacts required
We need a loop that watches the KV store for manifest changes. This should be an easy thing to add to the struct that stores manifests. It should take any changes and automatically trigger an update from the appropriate Scaler; a sketch of such a loop follows.
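A minimal sketch using the async-nats KV watch API; the bucket name and the handoff to scalers are assumptions for illustration:

use futures::StreamExt;

async fn watch_manifests(client: async_nats::Client) -> anyhow::Result<()> {
    let js = async_nats::jetstream::new(client);
    // Hypothetical bucket name for stored manifests
    let store = js.get_key_value("wadm_manifests").await?;
    // watch_all yields an entry for every put/delete on any key in the bucket
    let mut entries = store.watch_all().await?;
    while let Some(entry) = entries.next().await {
        let entry = entry?;
        // Here the changed manifest would be handed to the appropriate
        // Scaler so it can recompute commands for the affected app
        println!("manifest {} changed (revision {})", entry.key, entry.revision);
    }
    Ok(())
}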
Every flag and config file should be documented both in the CLI and in wadm documentation. The best option is that we add a new section to the wasmcloud docs site for wadm. This task does not include deployment documentation (it is a separate task)
Received manifests should be stored in a KV bucket inside of NATS:
ActorsStarted will lead to far fewer reads and writes, as we won't be updating the store for each and every actor that is started.
Can anybody help me?
I tried to compile and run wadm locally. I've only just started with Elixir, and I don't know how to solve this problem:
-> % mix do compile
=NOTICE REPORT==== 24-Aug-2022::18:49:31.264459 ===
TLS client: In state certify at ssl_handshake.erl:2100 generated CLIENT ALERT: Fatal - Handshake Failure
- {bad_cert,hostname_check_failed}
===> Failed to update package pc from repo hexpm
===> Errors loading plugin pc. Run rebar3 with DEBUG=1 set to see errors.
=NOTICE REPORT==== 24-Aug-2022::18:49:31.327068 ===
TLS client: In state certify at ssl_handshake.erl:2100 generated CLIENT ALERT: Fatal - Handshake Failure
- {bad_cert,hostname_check_failed}
===> Failed to update package pc from repo hexpm
===> Errors loading plugin pc. Run rebar3 with DEBUG=1 set to see errors.
===> Unable to run pre hooks for 'compile', command 'compile' in namespace 'pc' not found.
** (Mix) Could not compile dependency :snappyer, "/Users/duyong/.mix/rebar3 bare compile --paths /Users/duyong/Desktop/workspace/wadm/wadm/_build/dev/lib/*/ebin" command failed. Errors may have been logged above. You can recompile this dependency with "mix deps.compile snappyer", update it with "mix deps.update snappyer" or clean it with "mix deps.clean snappyer"
Right now we basically consume the raw NATS API everywhere we interact with wadm. It would be nice if we added a client library to the wadm crate so that we could have an experience like client.put_manifest(Manifest{}).await. This could become more helpful down the line too as we add more features.
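As a rough illustration, such a client could be a thin wrapper around a NATS request; the API subject and the YAML serialization below are assumptions, not a settled interface:

use async_nats::Client;
use serde::Serialize;

pub struct WadmClient {
    nats: Client,
    lattice: String,
}

impl WadmClient {
    pub fn new(nats: Client, lattice: impl Into<String>) -> Self {
        Self { nats, lattice: lattice.into() }
    }

    /// Serializes a manifest and sends it over the wadm API, returning the
    /// raw response payload for the caller to interpret
    pub async fn put_manifest(&self, manifest: &impl Serialize) -> anyhow::Result<Vec<u8>> {
        // Hypothetical subject layout; the real topic may differ
        let subject = format!("wadm.api.{}.model.put", self.lattice);
        let body = serde_yaml::to_string(manifest)?;
        let response = self.nats.request(subject, body.into()).await?;
        Ok(response.payload.to_vec())
    }
}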
In #119 we made any provider satisfy the requirements for a spread (see #106 for context). However, an Actual Solution™ would be to have some sort of generated list of all shared providers so that an undeploy doesn't delete one that another manifest is relying on. I don't quite have an idea of how we should implement this, so any ideas or PRs are welcome!
A scaler will need to take a config of some kind (likely a handle to some sort of state) and expose a couple of methods. Below is a general idea of what it could look like, but part of this spike is to have an integration test that can pass in an Event or a wadm manifest and output commands with a basic, not fully functional, spreadscaler.
use std::collections::HashSet;

/// A Scaler is used to manage responding to events
pub trait Scaler {
    /// Any type that has the necessary data to configure the scaler
    type Config: Send + Sync;

    /// Handles the event, returning any needed changes in response
    fn handle_event(&self, event: ScopedEvent) -> Result<HashSet<Command>>;

    /// Handles a new or updated manifest with its given config. This configuration should be
    /// stored by implementors in some form so that handle_event can produce the right commands
    fn handle_manifest(&self, config: Self::Config) -> Result<HashSet<Command>>;

    /// Removes a config from the scaler and emits the expected compensatory commands
    fn remove_manifest(&self, config: &Self::Config) -> Result<HashSet<Command>>;
}
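To make the shape concrete, here is a minimal, self-contained sketch of an implementor. SpreadConfig, ScopedEvent, and Command are stand-in types for illustration (Result is assumed to be anyhow::Result); one practical detail it surfaces is that, because the trait methods take &self, stored config needs interior mutability:

use std::collections::HashSet;
use std::sync::RwLock;

#[derive(Clone)]
pub struct SpreadConfig {
    pub replicas: usize,
}

pub enum ScopedEvent {
    HostStopped { host_id: String },
}

#[derive(Debug, Hash, PartialEq, Eq)]
pub enum Command {
    StartActor { host_id: String, count: usize },
}

#[derive(Default)]
pub struct SimpleSpreadScaler {
    // &self methods mean stored config needs interior mutability
    config: RwLock<Option<SpreadConfig>>,
}

impl Scaler for SimpleSpreadScaler {
    type Config = SpreadConfig;

    fn handle_event(&self, event: ScopedEvent) -> anyhow::Result<HashSet<Command>> {
        let guard = self.config.read().unwrap();
        let Some(cfg) = guard.as_ref() else {
            return Ok(HashSet::new());
        };
        match event {
            // If a host stops, the replicas it ran need a new home
            ScopedEvent::HostStopped { .. } => Ok(HashSet::from([Command::StartActor {
                host_id: "another-eligible-host".into(),
                count: cfg.replicas,
            }])),
        }
    }

    fn handle_manifest(&self, config: Self::Config) -> anyhow::Result<HashSet<Command>> {
        // Store the config so handle_event can produce the right commands later;
        // a real implementation would also diff desired vs. observed state here
        *self.config.write().unwrap() = Some(config);
        Ok(HashSet::new())
    }

    fn remove_manifest(&self, _config: &Self::Config) -> anyhow::Result<HashSet<Command>> {
        self.config.write().unwrap().take();
        // Compensatory commands (e.g. stopping actors) would be emitted here
        Ok(HashSet::new())
    }
}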
Being able to reference environment variables in the App Manifest for config and secrets would be really wonderful. Areas of the manifest where this may be relevant include link defs and config.
An example of using Environment Variables with AWS Secrets Manager can be found here:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/secrets-envvar.html
I wonder if HashiCorp Vault and others allow similar use.
@brooksmtownsend mentioned the potential of storing values in JetStream also.
We should update the wasmcloud-otp Helm chart to use wadm by default as another container in the pod
This issue simply serves as a list of things that we've found while writing wadm 0.4 that we wish would change in the OTP host, ranging from adding an additional field on an event to changing the structure of commands, etc.
Most of this is just a stream of thoughts, so these shouldn't be taken as must-dos, but after the release of wadm 0.4 we'll consider this list and what we can do to improve the efficiency of wadm and the host.
After talking with @brooksmtownsend we think that a manifest should always deploy by default when it is put, but this means we need a request body to indicate when we don't want it to deploy immediately. So we'll need to modify the put endpoint to expect a different body
Previously, you could specify a NATS JWT and seed in order to authenticate with a NATS server over the command line entirely, not needing files. In wadm v0.4.0-alpha.1, the JWT is a pathbuf, so you will have to provide a file path in order to use this authentication method, in which case you could just use a credsfile.
I would like to be able to supply a JWT and seed on the CLI to keep this simple, e.g.
WADM_NATS_JWT=eyJWTthings WADM_NATS_NKEY=SUASDASDASD wadm
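For illustration, a hedged sketch of accepting those values as plain strings and authenticating with async-nats plus the nkeys crate for nonce signing (env var names mirror the example above; the server URL is a placeholder):

use std::sync::Arc;

async fn connect() -> anyhow::Result<async_nats::Client> {
    let jwt = std::env::var("WADM_NATS_JWT")?;
    let seed = std::env::var("WADM_NATS_NKEY")?;
    let key_pair = Arc::new(nkeys::KeyPair::from_seed(&seed)?);

    let client = async_nats::ConnectOptions::with_jwt(jwt, move |nonce| {
        let key_pair = key_pair.clone();
        // The server sends a nonce; signing it with the nkey proves identity
        async move { key_pair.sign(&nonce).map_err(async_nats::AuthError::new) }
    })
    .connect("nats://127.0.0.1:4222")
    .await?;
    Ok(client)
}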
While implementing #75, I realized that we aren't using annotations to mark actors that are managed by wadm. In addition to being managed by wadm, we'll also want to note the specific Spread that the annotation is using so that we can manage conflicting actor spreads.
Until this is complete, the ActorSpreadScaler as implemented in #75 will not be able to properly handle multiple spreads that match to the same host and it's recommended to keep spread requirements as specific as possible.
The observed model should keep track of passed/failed health checks and be able to reflect that in the observed state
Using received events from the stream, the KV bucket should have a generated state for every observed lattice.
This should be very well tested with integration and/or e2e tests
These e2e tests should be run against a lattice with multiple hosts running to exercise the full functionality of the scalers. Testing a multitenant cluster (i.e. one with multiple lattice prefixes) is out of scope for this task
Heyho,
super excited to see wadm taking shape, so I tried to follow the install instructions on my M2. Unfortunately, there seems to be an issue with cargo's semver detection for v0.4.0-alpha.1.
Trying to run the command from the readme:
$ cargo install wadm --bin wadm --features cli --git https://github.com/wasmcloud/wadm --version v0.4.0-alpha.1 --force
error: the `--version` provided, `v0.4.0-alpha.1`, is not a valid semver version: cannot parse 'v0.4.0-alpha.1' as a semver
$ cargo version
cargo 1.68.2 (6feb7c9cf 2023-03-26)
For now I was able to work around it by leaving out the --version flag.
$ cargo install wadm --bin wadm --features cli --git https://github.com/wasmcloud/wadm --force
[...]
Installed package `wadm v0.4.0-alpha.1 (https://github.com/wasmcloud/wadm#a08761c6)` (executable `wadm`)
$ wadm --version
wadm 0.4.0-alpha.1
Thank you and keep up the great work!
All of our examples and getting started docs should use wadm by default and hide the current instructions behind an "advanced" tab or header
When I submit invalid YAML to wadm, I get timeout errors from the client, but in the wadm logs we have:
14:01:31.917 [error] Gnat.Server encountered an error while handling a request: %Protocol.UndefinedError{description: "", protocol: String.Chars, value: %YamlElixir.ParsingError{column: 9, line: 104, message: "Invalid sequence", type: :not_a_sequence}}
14:02:27.411 [error] Gnat.Server encountered an error while handling a request: %Protocol.UndefinedError{description: "", protocol: String.Chars, value: %YamlElixir.ParsingError{column: 9, line: 104, message: "Invalid sequence", type: :not_a_sequence}}
14:03:00.468 [error] Gnat.Server encountered an error while handling a request: %Protocol.UndefinedError{description: "", protocol: String.Chars, value: %YamlElixir.ParsingError{column: 19, line: 105, message: "Block mapping value not allowed here", type: :block_mapping_value_not_allowed}}
These happen quickly and should result in a YAML parsing error rather than a timeout. This could just be a matter of modifying the API server flow for model put (https://github.com/wasmCloud/wadm/blob/main/wadm/lib/wadm/api/api_server.ex#L121) so that requests that fail validation return an error instead of panicking.
As of the time of this issue writing, the observed model exists only in pure functional format. This issue is to track the integration of that functional model with actual NATS subscriptions to the lattice event stream to produce observed lattice state.
We need to make sure all scalers try to reconcile if a node "ages out" from the store. I think there are two different ways to do this:
1. A NodeReaped event that gets emitted to wadm.evt.{lattice-id}
2. A ForceReconcile event that gets emitted to wadm.evt.{lattice-id}
I'm more in favor of option 2 because it gives us more flexibility. It would also allow a user to force a reconciliation pass if they ever wanted to (by emitting the correct event to the right topic)
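As a sketch of option 2, a client (or wadm itself) could publish a CloudEvents-style envelope to the lattice's event topic, mirroring the envelope the host events already use; the event type and source strings here are hypothetical:

use serde_json::json;

async fn force_reconcile(client: &async_nats::Client, lattice_id: &str) -> anyhow::Result<()> {
    let subject = format!("wadm.evt.{lattice_id}");
    let event = json!({
        "specversion": "1.0",
        "type": "com.wadm.force_reconcile", // hypothetical event type
        "source": "operator-cli",           // hypothetical source
        "id": "00000000-0000-0000-0000-000000000001", // normally a fresh UUID
        "data": {}
    });
    client.publish(subject, serde_json::to_vec(&event)?.into()).await?;
    Ok(())
}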
As part of Stage 1, we are implementing a dummy work loop to handle commands. This task involves taking a list of commands and turning them into actual compensatory actions against the lattice control topics
https://mermaid.live
https://mermaid-js.github.io/mermaid/#/entityRelationshipDiagram
Looking through the code that has been written up, this is my mental picture:
erDiagram
    WASMCLOUD_HOST ||--|| State : has
    State }|..o{ Actor : has
    State }|..o{ Provider : has
    State }|..o{ Link_Definition : has
    LATTICE_OBSERVER }|--|| Lattice : has
    Lattice ||--|| State : contains
    WADM ||--o| Deployment_State : has
    Deployment_State }|..o{ Actor : has
    Deployment_State }|..o{ Provider : has
    Deployment_State }|..o{ Link_Definition : has
graph TD
    A[Updated OAM Spec] -->|Apply| B(HOST_CORE)
    B -->|Spin Up| C(Actor/Provider/Link)
    C -->|Reconcile| D{Diff}
    D -->|Yes/Apply/Backoff| B
    D -->|No| E[Done]
    F[WASHBOARD] -->|manual entry| B
How is the contention between the UI and the OAM spec going to be handled: who wins? Last in?
Incomplete thoughts: since Horde is involved, is WADM a singleton? Haven't processed that. Redis will maintain....
As part of #48 we are implementing the basics of a spreadscaler. This work is to make sure everything is logged, instrumented, feature complete, and tested. It should implement all of the same functionality currently supported in wadm 0.3
At the moment, the LinkScaler only ensures that a link exists between an actor and a provider but does not assert that the values specified are correct. See this TODO for where this code would go: https://github.com/wasmCloud/wadm/blob/main/src/scaler/spreadscaler/link.rs#L119
Essentially, we already have this information in the lattice data stream for linkdefs, so we should check whether the linkdef exists with the wrong values so that we can issue a delete and then a put command. A sketch of that check follows.
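A minimal sketch of such a check, with a stand-in LinkDefinition type carrying just the fields needed for the comparison:

use std::collections::HashMap;

pub struct LinkDefinition {
    pub actor_id: String,
    pub provider_id: String,
    pub link_name: String,
    pub values: HashMap<String, String>,
}

/// True if an existing linkdef matches the desired one on identity but
/// differs on values, meaning we should issue a delete and then a put
fn needs_replacement(existing: &LinkDefinition, desired: &LinkDefinition) -> bool {
    existing.actor_id == desired.actor_id
        && existing.provider_id == desired.provider_id
        && existing.link_name == desired.link_name
        && existing.values != desired.values
}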
Hi, I found that wadm publishes the JSON data of the OAM model to NATS. Who consumes this? It seems there is currently no introduction to it in the documentation. 😁
Hi, when I try to deploy simlpe2.yaml, I get error logs like the following:
Failed to perform reconciliation pass: Weighted target 'eastcoast' has insufficient candidate hosts.,
Weighted target 'haslights' has insufficient candidate hosts.
OAM JSON request:
{
  "apiVersion": "core.oam.dev/v1beta1",
  "kind": "Application",
  "metadata": {
    "name": "my-example-app",
    "annotations": {
      "version": "v0.0.1",
      "description": "This is my app revision 1"
    }
  },
  "spec": {
    "components": [
      {
        "name": "userinfo",
        "type": "actor",
        "properties": {
          "image": "wasmcloud.azurecr.io/fake:1"
        },
        "traits": [
          {
            "type": "spreadscaler",
            "properties": {
              "replicas": 4,
              "spread": [
                {
                  "name": "eastcoast",
                  "requirements": {
                    "zone": "us-east-1"
                  },
                  "weight": 80
                },
                {
                  "name": "westcoast",
                  "requirements": {
                    "zone": "us-west-1"
                  },
                  "weight": 20
                }
              ]
            }
          },
          {
            "type": "linkdef",
            "properties": {
              "target": "webcap",
              "values": {
                "port": 8080
              }
            }
          }
        ]
      },
      {
        "name": "webcap",
        "type": "capability",
        "properties": {
          "contract": "wasmcloud:httpserver",
          "image": "wasmcloud.azurecr.io/httpserver:0.13.1",
          "link_name": "default"
        }
      },
      {
        "name": "ledblinky",
        "type": "capability",
        "properties": {
          "image": "wasmcloud.azurecr.io/ledblinky:0.0.1",
          "contract": "wasmcloud:blinkenlights"
        },
        "traits": [
          {
            "type": "spreadscaler",
            "properties": {
              "replicas": 1,
              "spread": [
                {
                  "name": "haslights",
                  "requirements": {
                    "ledenabled": true
                  }
                }
              ]
            }
          }
        ]
      }
    ]
  }
}
What causes this error? T_T
The linkdef handler should be a Scaler implementation. This will be the second Scaler, so there will be some additional work required:
Create a supervisor hierarchy of OTP processes that manages the reconciliation process in a consistent way that is compatible with the functional reconciliation model and BEAM clustering.
wash up should now support pulling down wadm and starting it by default (just like it does for the host and NATS). As part of this, there should be an optional flag to not start it.
I tried to define a linkdef scaler with the following spec:
- type: linkdef
  properties:
    target: httpclient
It did not get put. But when adding example values, it works:
- type: linkdef
  properties:
    target: httpclient
    values:
      foo: bar
We should make these values optional, as sketched below.
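Assuming the manifest is deserialized with serde, the fix could be as small as defaulting the values map when it is omitted; the struct and field names here are illustrative:

use std::collections::HashMap;
use serde::Deserialize;

#[derive(Deserialize)]
pub struct LinkdefProperty {
    pub target: String,
    /// Defaults to an empty map when `values` is omitted from the manifest
    #[serde(default)]
    pub values: HashMap<String, String>,
}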
Right now we have the API object to request that a manifest be undeployed without deleting the managed actors and providers, but we haven't actually hooked up that logic. We'll need to add that to the ManifestUndeployed notification so that a scaler can just be deleted rather than spinning things down.
When using WADM with a private registry, the host can't be found to deploy providers and actors.
The same code works fine when using WASMCLOUD_VERSION=v0.59.0.
I have not tested without a private registry, so that may not be the issue, but I believe @jordan-rash could not recreate this with 0.60 without a private registry, so it is likely the cause.
Errors:
17:14:58.444 [error] Failed to perform reconciliation pass: Weighted target 'oauth2_pkce' has insufficient candidate hosts.,
Weighted target 'apigw_router' has insufficient candidate hosts.,
Weighted target 'jammin_messaging_provider_spread' has insufficient candidate hosts.,
Weighted target 'httpclient_spread' has insufficient candidate hosts.,
Weighted target 'httpserver_spread' has insufficient candidate hosts.,
Weighted target 'vault_spread' has insufficient candidate hosts.,
Weighted target 'redis_spread' has insufficient candidate hosts.
17:15:14.659 [error] GenServer {Wadm.HordeRegistry, "st_default"} terminating
** (FunctionClauseError) no function clause matching in List.foldl/3
(elixir 1.13.3) lib/list.ex:248: List.foldl(%{}, %LatticeObserver.Observed.Lattice{actors: %{}, claims: %{}, hosts: %{"NA6UITC5DLLNAKTDFYQJRYO3FPAU5RPX64FXKDP46Y7FB74SUTMGT5DX" => %LatticeObserver.Observed.Host{first_seen: ~U[2023-02-22 22:15:14.657009Z], friendly_name: "weathered-fire-6128", id: "NA6UITC5DLLNAKTDFYQJRYO3FPAU5RPX64FXKDP46Y7FB74SUTMGT5DX", labels: %{"app" => "oauth2", "hostcore.arch" => "x86_64", "hostcore.os" => "linux", "hostcore.osfamily" => "unix"}, last_seen: ~U[2023-02-22 22:15:14.657009Z], status: :healthy}}, id: "default", instance_tracking: %{}, invocation_log: %{}, linkdefs: [], parameters: %LatticeObserver.Observed.Lattice.Parameters{host_status_decay_rate_seconds: 35}, providers: %{}, refmap: %{}}, #Function<4.118258543/2 in LatticeObserver.Observed.EventProcessor.record_heartbeat/4>)
(lattice_observer 0.1.0) lib/lattice_observer/observed/event_processor.ex:198: LatticeObserver.Observed.EventProcessor.record_heartbeat/4
(wadm 0.2.0) lib/wadm/lattice_state_monitor.ex:36: Wadm.LatticeStateMonitor.handle_info/2
(stdlib 3.17.2) gen_server.erl:695: :gen_server.try_dispatch/4
(stdlib 3.17.2) gen_server.erl:771: :gen_server.handle_msg/6
(stdlib 3.17.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {:cloud_event, %Cloudevents.Format.V_1_0.Event{data: %{"actors" => %{}, "friendly_name" => "weathered-fire-6128", "labels" => %{"app" => "oauth2", "hostcore.arch" => "x86_64", "hostcore.os" => "linux", "hostcore.osfamily" => "unix"}, "providers" => [], "uptime_human" => "32 seconds", "uptime_seconds" => 32, "version" => "0.60.0"}, datacontenttype: "application/json", dataschema: nil, extensions: %{}, id: "fbdaf2c2-9ae0-42cd-b678-f46c09904ad7", source: "NA6UITC5DLLNAKTDFYQJRYO3FPAU5RPX64FXKDP46Y7FB74SUTMGT5DX", specversion: "1.0", subject: nil, time: "2023-02-22T22:15:14.657009Z", type: "com.wasmcloud.lattice.host_heartbeat"}}
State: %LatticeObserver.Observed.Lattice{actors: %{}, claims: %{}, hosts: %{}, id: "default", instance_tracking: %{}, invocation_log: %{}, linkdefs: [], parameters: %LatticeObserver.Observed.Lattice.Parameters{host_status_decay_rate_seconds: 35}, providers: %{}, refmap: %{}}
Currently, we only check for the existence of actors and providers: for actors, a public key; for providers, a public key/contract ID/link name triple.
We do not take into account version information, which will be problematic when upgrading actors and providers to newer versions or using wadm as part of an inner dev loop. The Actor and Provider scalers should also look at the versions of the running assets and upgrade older versions if necessary as part of reconciling; a sketch of that comparison follows.
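A minimal sketch of the comparison, assuming the versions in play are semver strings (where they come from, such as claims data, is left open here):

use semver::Version;

/// True if the running version is older than the manifest's desired version,
/// meaning the scaler should issue commands to upgrade it
fn needs_upgrade(running: &str, desired: &str) -> anyhow::Result<bool> {
    Ok(Version::parse(running)? < Version::parse(desired)?)
}

fn main() -> anyhow::Result<()> {
    assert!(needs_upgrade("0.1.0", "0.2.0")?);
    assert!(!needs_upgrade("0.2.0", "0.2.0")?);
    Ok(())
}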
In #79 we added some things around a mirror stream for combining streams. In NATS 2.10, we will be able to have multiple filter subjects and won't need it any more. So once we update, we should remove the need for a mirror stream
We started Wadm a long time ago, but have mostly let it sit since then as we've worked on polishing the host. However, the time has come to get this all working and productionized. This document is a proposal for rewriting and releasing wadm as a fully supported and featured part of the wasmCloud ecosystem, complete with waterslide!
This section covers the goals and non-goals of this scope of work. Please note that some of the items in non-goals are possible future work for wadm, but are not in scope for this work
wash up
For this work, I propose we rewrite Wadm in Rust. This decision was made for the following reasons, in order of importance. As part of making this decision, two other languages (Elixir and Go) were considered; the reasons for rejecting them are described in the last two sections.
Schedulers and application lifecycle management are topics that many people in the cloud native space have deep knowledge of. If we are going to be writing something that does those things for wasmCloud, then we need as many eyes on it as possible. Based on current metrics of wasmCloud repos, we have very few contributors to our Elixir code and a lot more to our Rust repos. Other projects in, or consumed by, the wasm ecosystem are in Rust and also have higher numbers of contributors. Go would have also been an excellent choice here, but the other reasons listed here precluded it. We also have multiple contributors in the wasmCloud community right now who already know Rust.
The tl;dr is that we need contributors to be successful and the current language does not attract enough people.
One problem we've run into consistently in our Elixir projects is dynamic typing. Although this can be mitigated somewhat by tools like dialyzer, it requires programmer and maintainer discipline and still doesn't catch everything. Having a static type for each kind of event that drives a system like Wadm is critical for ensuring correct behavior and avoiding panics.
In addition to the need for static typing is the preference for generics. In my experience writing large applications for Kubernetes in both Rust and Go, a generic type system makes interacting with the API much easier. There is less generated code and less need to roll your own type system, as happens in many large Go projects. Go has added generics, but its system is nowhere near as strong as other statically typed languages such as Rust.
To support custom scalers, we will likely be supporting at least an actor based extension point and possibly a bare wasm module. Either way, most of the wasm libraries/runtimes out in the wild are written in Rust or have first-class bindings for Rust. Also, many of our wasmCloud libraries are written in Rust, which will allow for reuse.
This is the lowest priority reason why I am suggesting Rust, but it is still an important one. The current implementation requires bundling an Erlang VM along with the compiled Elixir. That means someone who runs Wadm as it currently stands will likely need to tune a VM. It is also larger, which leads to more disk space requirements and longer download times.
Rust (and Go even more so) has great support for static binaries, and both run lighter than a VM without much additional tuning (if any).
As with any tool choice, there are tradeoffs that occur. Below is a list of disadvantages I think will be most likely to cause friction
One of the biggest questions here is why not continue with Elixir. By far the biggest thing we are giving up is the code around the lattice observer. However, writing this in Rust gives us the advantage of creating something that we could eventually make bindings for in any language (this also helps enable the reusability described below), although that isn't a goal here.
With that said, the previous sections cover in depth the advantage of using Rust over Elixir in this case
In my comparisons, I was looking for languages that would fit the requirements above. Due to the overlap of languages used for wasm as well as languages familiar to those in the Cloud Native space, that whittled things down to Go and Rust. Go excels at many of these requirements. It is much more popular than Rust and Elixir (probably combined) and has great support for statically compiling binaries. Also, things like NATS are native to Go.
It came down to a few main concerns of why Rust would be better:
To be clear, there are other smaller reasons, but those could be considered nitpicky.
One of the items I thought about most when drafting this was whether or not we should implement wadm as a true state machine. Given the simplicity of what it is trying to do, I propose we focus on implementing an event-driven filtering approach. Essentially, a state machine approach is going to be overkill for this stage of the project and the near future.
Loosely, I am calling these "Scalers" (name subject to change). Every scaler takes a list of events (which may or may not be filtered) and returns a list of actions to take.
This does not rule out iterating into a state machine style in the future (if you are curious, you can see Krator for an example of how this could be done in code), nor does it mean a scaler implementation can't use a state machine. It only means that for this first approach, we'll filter events into actions.
I have purposefully not gone into high levels of detail about what this looks like in code, as it will probably be best just to try it and see how it looks as we begin to implement it. What we currently have in wadm is probably a good way of going about this (i.e. Scalers output commands).
One important point is that these "Scalers" should be commutative (i.e. if a+b=c then b+a=c; the order of operations doesn't matter). That means that when a manifest is sent to wadm, it can run through the list of supported Scalers in any order and it will return the same output, as the small illustration below shows.
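Since each scaler returns a set of commands and wadm would take the union of those sets, the commutativity requirement falls out naturally from set union being order-independent. A tiny illustration with stand-in string commands:

use std::collections::HashSet;

fn main() {
    // Pretend these are the command sets produced by two different scalers
    let a: HashSet<&str> = HashSet::from(["start 2 actor replicas", "put linkdef"]);
    let b: HashSet<&str> = HashSet::from(["start httpserver provider"]);

    // Union in either order yields the same overall set of commands
    let ab: HashSet<_> = a.union(&b).collect();
    let ba: HashSet<_> = b.union(&a).collect();
    assert_eq!(ab, ba); // a+b == b+a
}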
For this first production version, we will only be supporting a NATS API. This is because pretty much all wasmCloud tooling already uses NATS, mostly transparently to an everyday user. We can take advantage of that same tooling to keep things simple this time around. If we were to add an HTTP API right now, we'd have to figure out authn/z and how we want to handle issuing tokens. So to keep it simple, we'll focus only on NATS to start.
One very important note here is that we definitely do want an HTTP API in the future. We know that many people will want to integrate with or extend WADM and an HTTP API is the easiest way to do that. But not for this first go around (well, second, but you get my point)
This is fairly self explanatory, but we want to store everything in one place now that NATS has KV support so we don't need any additional databases. Only the manifest data is stored in NATS. Lattice state is built up by querying all hosts on startup and then responding to events
A key requirement is that wadm can be run in high availability scenarios, which at its most basic means that multiple copies can be running.
I propose that this be done with leader election. Only one wadm process will ever be performing actions. All processes can gather the state of the lattice for purposes of fast failover, but only one performs actions. This is the simplest way to gain basic HA support; one possible mechanism is sketched below.
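As a purely illustrative sketch (not necessarily how wadm will do it), leader election can be built on a NATS KV bucket: whoever creates the leader key first wins, and the bucket's max_age acts as a lease so a dead leader eventually ages out. The bucket and key names here are hypothetical:

use std::time::Duration;

async fn try_become_leader(
    js: &async_nats::jetstream::Context,
    instance_id: &str,
) -> anyhow::Result<bool> {
    let store = js
        .create_key_value(async_nats::jetstream::kv::Config {
            bucket: "wadm_leader".into(),
            // Lease: the key expires if the leader stops refreshing it
            max_age: Duration::from_secs(10),
            ..Default::default()
        })
        .await?;
    // create() only succeeds if the key doesn't already exist,
    // i.e. no other instance currently holds the leadership lease
    Ok(store
        .create("leader", instance_id.to_owned().into())
        .await
        .is_ok())
}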
This is purely here as a design note and is not required for completing the work, but based on experiences with tools like Kubernetes and Nomad, extending with a custom scheduler is a common ask for large deployments. In code, adding a scaler will be as simple as implementing a Scaler trait.
For most people, however, I propose that custom "Scalers" be added via a wadm manifest. The application provider must have an actor that implements a new wasmcloud:wadm-scaler interface, but it can be as arbitrarily complex as desired. This manifest will have 2 special requirements:
wasmcloud.dev/scaler: <scaler-name>
Once again, this is not going to be implemented here, and will likely be another, smaller, proposal than this one
One key point to stress here is that wadm is meant to be the canonical scheduler for wasmCloud. This means that it is the general purpose scheduler that most people use when running wasmCloud, but no one is forced to use it. You can choose not to use it at all, or to write your own entirely custom scheduler.
To that end, I propose we publish the key functionality as a Rust crate. Much of the functionality could be used in many other applications besides a scheduler, and it can also be used to build your own scheduler if so desired. Basically, we want to avoid some of the problems that occurred in Kubernetes, where everything must go through the built-in scheduler.
Whew, we made it to the actual work! As part of thinking through these ideas, I started a branch that has implemented some of the basic building blocks like streaming events and leader election. When we actually begin work, it will be against a new wadm_0.4 branch in the main repo until we have completed the work. Please note that this is a general roadmap; I didn't want to try and give minute details here. Below is the basic overview of the needed work.
All of this work can be worked on in parallel. This is a bit shorter because we are about 40-50% there with the branch I started work on
This is a bit more difficult, as these things must be worked on roughly in order. This work is more spike-like, as it is spiking out the design of the Scaler trait and implementing the spreadscaler type (at least the number-of-replicas functionality). Scalers will need to handle manifests and state changes given by events (such as a host stopping). We want to start with implementing so we can see what kind of info is needed.
This is the "tying a bow on it" stage of work.
wash up should start wadm by default, with an optional flag to not use it.
It would be wonderful to include or update the diagrams found here: #15 such that they reflect the new design patterns within WADM.
Once the below are in place, we should cut a 0.4.0-alpha.1 tag for all of these so we can start on the stage 4 work:
Publish the wadm crate to crates.io