RisingWave Kubernetes Operator

Description

The RisingWave Kubernetes Operator manages and deploys RisingWave, a distributed stream processing platform written in Rust, on Kubernetes. RisingWave's distributed architecture lets it process large streams of data in real time and scale with the workload.

The Kubernetes operator acts as a bridge between the RisingWave platform and the Kubernetes cluster, streamlining the deployment and management process. It leverages the native capabilities of Kubernetes to automate tasks such as scaling, monitoring, and fault tolerance, making it easier to operate RisingWave in a Kubernetes environment.

Compatibility

RisingWave Operator has been tested and should work with the following Kubernetes distributions:

If you are using other Kubernetes distributions or encounter problems, please feel free to create an issue.

Here is the compatibility matrix:

RisingWave Operator   RisingWave   Kubernetes
main                  v0.19.0+     v1.21+
v0.5.0+               v0.19.0+     v1.21+
v0.4.1                v0.18.0+     v1.21+
v0.3.6                v0.18.0+     v1.21+

Installation

The operator's webhook server is secured with certificates issued by cert-manager, so install cert-manager first. Please refer to the cert-manager installation guide for more information.

Install RisingWave Operator

Install the latest version of RisingWave Operator:

kubectl apply --server-side -f https://github.com/risingwavelabs/risingwave-operator/releases/latest/download/risingwave-operator.yaml

(Optional) Install RisingWave Operator with a specific version:

# Replace ${VERSION} with the version you want to install, e.g., v0.4.0
kubectl apply --server-side -f https://github.com/risingwavelabs/risingwave-operator/releases/download/${VERSION}/risingwave-operator.yaml

(Optional) Install the main branch of RisingWave Operator (not recommended for production environments):

kubectl apply --server-side -f https://raw.githubusercontent.com/risingwavelabs/risingwave-operator/main/config/risingwave-operator.yaml

Note: errors might occur if cert-manager is not fully initialized. Don't panic! Simply wait for another minute and retry the command above.

Error from server (InternalError): Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": dial tcp 10.105.102.32:443: connect: connection refused

Check the installation status:

# Check the CRDs
$ kubectl get crds | grep risingwavelabs.com
risingwaves.risingwave.risingwavelabs.com              2023-05-23T06:04:00Z
risingwavescaleviews.risingwave.risingwavelabs.com     2023-05-23T06:04:01Z

# Check the controller Pod status
$ kubectl -n risingwave-operator-system get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
risingwave-operator-controller-manager-b5d5f585d-6npn5   2/2     Running   0          60s

Helm chart

You can also install the operator with the Helm chart.

Usage

RisingWave Kubernetes Operator extends Kubernetes with CRDs (Custom Resource Definitions) to manage RisingWave. All you need to do is create a RisingWave resource in your Kubernetes cluster, and the RisingWave Kubernetes Operator will take care of the rest.

The RisingWave resource is a custom resource that defines a RisingWave cluster. You can find more examples in the docs/manifests/risingwave directory. For more details of the APIs, please refer to the API reference.

NOTE: since the project is still under rapid development, compatibility between different versions of the RisingWave operator might break. We maintain a stable set of manifests in the docs/manifests/stable directory that are guaranteed to be compatible with the latest released version. Please use them if you want to deploy RisingWave in a production environment.

Create a RisingWave cluster

Follow the steps below to create a RisingWave cluster in your Kubernetes cluster:

# Download the manifest YAML file.
curl https://raw.githubusercontent.com/risingwavelabs/risingwave-operator/main/docs/manifests/stable/persistent/minio/risingwave.yaml -o risingwave.yaml

# Apply it to the Kubernetes cluster.
kubectl apply -f risingwave.yaml

Note: the RisingWave cluster is created in the default namespace unless you specify otherwise. To create it in another namespace, modify the metadata.namespace field in the manifest YAML file or use the --namespace option.

The RisingWave cluster will be created in a few minutes. You can check the status of the RisingWave cluster by running the following command:

kubectl get risingwave
NAME         META STORE   STATE STORE   VERSION   RUNNING   AGE
risingwave   Etcd         MinIO         v1.6.0    True      2m20s

Note: the META STORE column indicates the storage backend for the RisingWave metadata. The STATE STORE column indicates the storage backend for the state store. The VERSION column indicates the version of the RisingWave cluster. The RUNNING column indicates whether the RisingWave cluster is running.

You can check the Pods of the RisingWave cluster by running the following command:

kubectl get pods -l risingwave/name
NAME                                    READY   STATUS    RESTARTS      AGE
risingwave-compactor-5cfcb469c5-gnkrp   1/1     Running   2 (1m ago)    2m35s
risingwave-compute-0                    1/1     Running   2 (1m ago)    2m35s
risingwave-frontend-86c948f4bb-69cld    1/1     Running   2 (1m ago)    2m35s
risingwave-meta-0                       1/1     Running   1 (1m ago)    2m35s

Connect to the RisingWave cluster

The RisingWave cluster is now ready to use. However, it is not accessible from outside the Kubernetes cluster by default. To connect to the RisingWave cluster, you need to forward the ports of the RisingWave cluster to your local machine:

kubectl port-forward svc/risingwave-frontend 4567:service

Keep the port forwarding command running in the terminal and open another terminal window. You can now connect to the RisingWave cluster using the psql command line tool. The default username is root and the default database name is dev:

psql -h localhost -p 4567 -d dev -U root

Now try to create a table in the database:

dev=> CREATE TABLE t1 (v1 int);
CREATE_TABLE

Then create a materialized view based on the table:

dev=> CREATE MATERIALIZED VIEW mv1 AS SELECT sum(v1) AS sum_v1 FROM t1;
CREATE_MATERIALIZED_VIEW

Insert some data into the table:

dev=> INSERT INTO t1 VALUES (1), (2), (3);
INSERT 0 3

dev=> FLUSH;
FLUSH

Now you can query the materialized view:

dev=> SELECT * FROM mv1;
sum_v1
--------
      6
(1 row)

Delete the RisingWave cluster

To delete the RisingWave cluster, simply delete the RisingWave resource:

kubectl delete risingwave risingwave

The Pods will be deleted in a few minutes.

Note: the data in the RisingWave cluster will not be lost after the RisingWave cluster is deleted in this example since the etcd and MinIO services are still running. If you would like to terminate them all and purge the data, you can run the following commands:

kubectl delete -f risingwave.yaml   # Delete all resources defined in the risingwave.yaml that you used above.
kubectl delete pvc -l app=etcd      # Delete the PVCs of etcd.
kubectl delete pvc -l app=minio     # Delete the PVCs of MinIO.

Customize the RisingWave cluster

You can customize the RisingWave cluster by modifying the manifest YAML file. For more details, please refer to the API reference in the docs/general/api.md file.

For customizing the state store backends of RisingWave cluster, please refer to the docs/general/state-stores.md file.

Contribution Guidelines

We welcome contributions from the community! If you would like to contribute to this project, please follow the guidelines outlined in the CONTRIBUTING.md file.

License

This project is licensed under the Apache License 2.0. You can find the full text of the license in the LICENSE file.

risingwave-operator's People

Contributors

alissa-tung, arkbriar, bugenzhao, cajan93, daviderli614, dependabot[bot], flpha0830, fuyufjh, globecen, gogomoe, hezhizhen, huangjw806, jiayouxujin, jlerche, kexiangwang, kezhenxu94, larrystamford, lmatz, lukeraphael, matanper, mikechesterwang, nebulazhang, neverchanje, risingwave-ci, sixletters, stab123, sunt-ing, wjf3121, xuhui-lu, yufansong

risingwave-operator's Issues

fix: compute node config is not valid

Compute node panicked while starting:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: internal error: parse error unknown field `chunk_size`, there are no fields for key `batch` at line 10 column 1
disabled backtrace', src/common/src/config.rs:51:53
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This is caused by the latest update of RisingWave: risingwavelabs/risingwave#2686

Need to fix the current config in ./template/compute-config.yaml

docs: testing RisingWave on Kubernetes

As mentioned in the title, we need documents about running tests, especially load tests, on the RisingWave clusters deployed by the operator. Currently, there are no public documents or scripts for running load tests against RisingWave, so we probably need to wait until there is one.

bug: cannot delete risingwave because calling the webhook failed

When creating or deleting a RisingWave, the server returns this error:
Internal error occurred: failed calling webhook "mrisingwave.kb.io": Post "https://127.0.0.1:57355/mutate-risingwave-singularity-data-com-v1alpha1-risingwave?timeout=10s": dial tcp 127.0.0.1:57355: connect: connection refused

I have hit this on both AWS EKS and a kind cluster. Maybe it is a webhook networking bug?

feat: support risingwave update

We need to support updating a RisingWave, which includes:

  1. Change image to upgrade
  2. Change replicas to scale

Not Supported in the short term:

  1. Change resource request and limit(VPA)
  2. Change running params

test: Add integration tests for the RisingWaveController

Add integration tests for the RisingWaveController, e.g.,

  • Leverage the ActionHook to validate the inputs and outputs of each action
  • Tests for controllers under different states, simulated by the fake client, possible test cases:
    • Initializing, with a new RisingWave object
    • Initializing, with partial sub-resources
    • Running, and stable
    • Running, and spec changed
    • Running, and some sub-resources deleted, related to #64
    • Upgrading, and outdated sub-resources
    • Upgrading, and latest sub-resources
    • Deleted object
    • No object
    • ...

feat: support network policy

NetworkPolicy lets us define the ingress and egress policies for the Pods. We can integrate the NetworkPolicy in the RisingWave CRD to provide controls on the network policies. A typical use case would be configuring the security group of the RisingWave instances.

feat: monitoring logs

It's essential to provide monitoring tools for logs because it's hard for normal users to interact with Kubernetes directly. Here're some open-source projects that might help:

perf: reduce deployment time

For now, deploying a full set of RisingWave components in a new cluster takes too long. We may be able to speed up component startup by reducing the RisingWave image size.

fix: panic on missing compactorNode spec

If we don't provide a spec for compactorNode when deploying RisingWave, the operator panics at pkg/controllers/risingwave/risingwave_controller.go:262, because *rw.Spec.CompactorNode.Replicas dereferences a nil pointer.
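One way to avoid this class of panic is to read optional pointer fields through a nil-safe helper. The following is a minimal sketch, not the operator's actual fix: the ComponentSpec and Spec types and the replicasOrDefault helper are hypothetical, simplified stand-ins for the real API types.

```go
package main

import "fmt"

// ComponentSpec is a hypothetical, simplified stand-in for a component spec
// whose replica count is optional (a nil pointer when omitted in the YAML).
type ComponentSpec struct {
	Replicas *int32
}

// Spec is a hypothetical stand-in for the RisingWave spec.
type Spec struct {
	CompactorNode *ComponentSpec
}

// replicasOrDefault returns the replica count of a component, falling back
// to def when either the component or its Replicas field is unset.
func replicasOrDefault(c *ComponentSpec, def int32) int32 {
	if c == nil || c.Replicas == nil {
		return def
	}
	return *c.Replicas
}

func main() {
	three := int32(3)

	// compactorNode omitted entirely: no nil dereference, default is used.
	empty := Spec{}
	fmt.Println(replicasOrDefault(empty.CompactorNode, 1))

	// compactorNode present but replicas omitted.
	partial := Spec{CompactorNode: &ComponentSpec{}}
	fmt.Println(replicasOrDefault(partial.CompactorNode, 1))

	// Explicit replica count.
	full := Spec{CompactorNode: &ComponentSpec{Replicas: &three}}
	fmt.Println(replicasOrDefault(full.CompactorNode, 1))
}
```

An alternative is to fill in the default in the mutating webhook so that the controller never sees a nil field. The helper above stays safe even when the webhook is bypassed.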

test: Add tests for the ctrlkit

Add tests for the ctrlkit package, e.g.,

  • Tests for the OptimizeWorkflow method and make sure every rule works
  • Tests for the Group implementations to ensure their semantics
  • Tests for the joinResultAndErr method to make sure the result and err are as expected

feat: support etcd authentication

The RisingWave CR provides a field specifying the secret that contains the etcd credentials, but it doesn't work yet. AFAIK, RisingWave itself doesn't support authenticating with etcd at the moment. This is the tracking issue for the operator side, and I'm going to create an issue in the risingwave repo as well.

test: Add tests for the implementation of the RisingWaveControllerManagerImpl

Add tests for the implementation of RisingWaveControllerManagerImpl, a.k.a. the risingWaveControllerManagerImpl, e.g.,

  • Tests for the syncObject method with different inputs, validating the lazy sync mechanism
  • Tests for the isObjectSynced method with different inputs, validating the lazy sync mechanism
  • Tests for each action implementation to ensure that they don't panic and work as expected

Tips:

  • The lazy sync mechanism is mainly based on the risingwave/generation label. If the label value is smaller than the generation of the RisingWave object, then the object should be synced/updated. Otherwise, we do nothing. A special case is that when the label value is nosync, the object won't be synced. Creation always happens when the object isn't found.
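The decision rule described in the tip above can be sketched as a small pure function, which also makes it easy to test with table-driven cases. This is an illustration only: the decideSync function and the action type are hypothetical names, and treating a missing or malformed label as outdated is an assumption that the tip does not state.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	generationLabel = "risingwave/generation" // label named in the tip above
	noSyncValue     = "nosync"                // special value that disables syncing
)

// action is a hypothetical enum describing what the controller should do.
type action int

const (
	skip action = iota
	createObject
	updateObject
)

func (a action) String() string {
	return [...]string{"skip", "create", "update"}[a]
}

// decideSync applies the lazy sync rule:
//   - object not found                   -> create
//   - label value is "nosync"            -> skip
//   - label value < RisingWave generation -> update
//   - otherwise                           -> skip
func decideSync(found bool, labels map[string]string, generation int64) action {
	if !found {
		return createObject
	}
	value := labels[generationLabel]
	if value == noSyncValue {
		return skip
	}
	observed, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		// Assumption: a missing or malformed label counts as outdated.
		return updateObject
	}
	if observed < generation {
		return updateObject
	}
	return skip
}

func main() {
	fmt.Println(decideSync(false, nil, 3))
	fmt.Println(decideSync(true, map[string]string{generationLabel: "2"}, 3))
	fmt.Println(decideSync(true, map[string]string{generationLabel: "3"}, 3))
	fmt.Println(decideSync(true, map[string]string{generationLabel: noSyncValue}, 3))
}
```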

feat: design a CR for operating and managing the stream sources inside the RisingWave

Currently, we have the RisingWave CR for managing the RisingWave instances in Kubernetes, and we can manipulate the RisingWave resources with kubectl. I'm thinking of abstracting other resources as CRs, such as the stream sources.

The stream sources should have the connection endpoint, the credentials, and other configurations supported by either the stream provider or the RisingWave. Then we can have a controller that syncs the stream sources with the RisingWave instance and reports the status.

test: Add tests for the RisingWaveObjectFactory

Add tests for the RisingWaveObjectFactory, e.g.,

  • Tests for building new ConfigMap for RisingWave config
  • Tests for building new Service/Deployment for meta/frontend/compactor
  • Tests for building new Service/StatefulSet for compute

Possible checks:

  • ownership checks
  • labels checks
  • spec checks
    • especially the command and args checks for compatibility
  • panic checks

Tips:

  • Simulating the webhook manually

bug: validate the etcd endpoint in webhook

Now it's legal to have a YAML like this, specifying the meta storage type as etcd but not giving the endpoint:

spec:
  metaNode:
    storage: 
      type: ETCD
      etcdEndpoint: 

It's a bad YAML and should be rejected by webhook.
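The check the validating webhook would need can be sketched as a plain function. This is a minimal sketch under assumptions: the MetaStorage struct and validateMetaStorage function are hypothetical stand-ins that mirror only the YAML fields shown above, and the "Memory" type value used in the example is illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// MetaStorage is a hypothetical stand-in mirroring spec.metaNode.storage
// from the YAML above.
type MetaStorage struct {
	Type         string
	EtcdEndpoint string
}

// validateMetaStorage rejects a storage spec that selects etcd without
// giving an endpoint. Whitespace-only endpoints are treated as empty.
func validateMetaStorage(s MetaStorage) error {
	if s.Type == "ETCD" && strings.TrimSpace(s.EtcdEndpoint) == "" {
		return errors.New("spec.metaNode.storage.etcdEndpoint must be set when type is ETCD")
	}
	return nil
}

func main() {
	// The bad YAML from above: type ETCD with an empty endpoint.
	fmt.Println(validateMetaStorage(MetaStorage{Type: "ETCD"}))

	// A valid etcd configuration (the endpoint value is illustrative).
	fmt.Println(validateMetaStorage(MetaStorage{Type: "ETCD", EtcdEndpoint: "etcd:2379"}))

	// Other storage types do not need an endpoint ("Memory" is illustrative).
	fmt.Println(validateMetaStorage(MetaStorage{Type: "Memory"}))
}
```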

Bug: Issue running psql commands on risingwave cluster

There is an issue running the example psql queries from the RisingWave README against a RisingWave cluster deployed by the operator. Running the example queries results in an RPC error. The error does not reproduce when running the same queries against the Docker image: docker run -it --pull=always -p 4566:4566 -p 5691:5691 ghcr.io/singularity-data/risingwave:latest playground

Expected Behavior

No errors.

Current Behavior

RPC error which causes the compute node to crash.

Steps to Reproduce

  1. Set up the operator as per the risingwave operator readme but use this yaml for the risingwave object. This is similar to the default settings but adds a NodePort service to access the frontend node.
  2. Enter psql shell
PHOST=`kubectl get node -o=jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}'`
PPORT=`kubectl get service -n test test-risingwave-amd64-frontend-nodeport -o=jsonpath='{.spec.ports[0].nodePort}'`
psql -h "$PHOST" -p "$PPORT" -d dev -U root
  3. Run psql queries
/* create a table */
create table t1(v1 int not null);

/* create a materialized view based on the previous table */
create materialized view mv1 as select sum(v1) as sum_v1 from t1;

/* insert some data into the source table */
insert into t1 values (1), (2), (3);

/* (optional) ensure the materialized view has been updated */
flush;

/* the materialized view should reflect the changes in source table */
select * from mv1;

test: Add tests for the webhooks

Add tests for the webhooks, e.g.,

  • Tests for the mutation webhook, mainly for the defaulter
  • Tests for the validation webhook, e.g.,
    • Tests of unexpected fields, e.g., nil ptr fields, invalid etcd endpoints, unexpected enum values
    • Tests of unexpected updates, such as updating immutable fields

feat: exposing metrics about the operator

As the project grows, we need to expose metrics about the webhooks and controllers to build the observability & maintenance stacks.

Metrics we'd like to have:

  • Webhooks:
    • webhook_request_count, labels:
      • type, the value should be mutating or validating
      • group, the target resource group of the webhook, e.g., risingwave.risingwavelabs.com
      • version, the target API version, e.g., v1alpha1
      • kind, the target API kind, e.g., risingwave, risingwavepodtemplate
      • namespace, the namespace of the object, e.g., default
      • name, the name of the object
      • verb, the verb (action) on the object which triggers the webhook, the value should be one of "create", "update", and "delete".
    • webhook_request_pass_count, with the same labels as webhook_request_count
    • webhook_request_reject_count, with the same labels as webhook_request_count
    • webhook_request_panic_count, with the same labels as webhook_request_count
  • Controllers:
    • controller_reconcile_count, labels:
      • group, version, kind mentioned above
      • namespace, name
    • controller_reconcile_requeue_count, with the same labels as controller_reconcile_count
    • controller_reconcile_error_count, with the same labels as controller_reconcile_count
    • controller_reconcile_requeue_after (could be a histogram), with additional labels:
      • after, the duration before the next requeue in milliseconds
    • controller_reconcile_duration, the time elapsed during the reconciliation, with the same labels as controller_reconcile_count
    • controller_reconcile_panic_count, with the same labels as controller_reconcile_count

The collectors of these metrics can all be implemented by using a proxy pattern.

feat: support horizontal and vertical auto-scaling

The Kubernetes community has been working on auto-scalers for a long time. The HorizontalPodAutoscaler is GA, and the community provides a CRD and a controller for the VerticalPodAutoscaler. These auto-scalers target workload resources such as Deployment and StatefulSet. But we want to scale the RisingWave instance, which contains four components and multiple groups of workload resources. Therefore, there are two options for implementing auto-scaling:

  • Leverage the pod auto-scalers provided by Kubernetes and ignore the replicas defined in the RisingWave spec when the auto-scalers are enabled. We need to define the behavior when an auto-scaler is enabled/disabled and ensure there are no ambiguities.
  • Define new CRs ourselves and re-implement the trigger and the scaling process by operating the RisingWave resources directly. This gives us more control over the process, e.g., we can have a customized policy for triggering the scaling, and there will be no ambiguity, but we need to redo a lot of work that the community has already done.

Either option makes sense to me. Since our capacity is limited, we have to decide which one to pursue.

test: Testing against more object storage types

After merging #61, the E2E test was modified to test against an in-memory RisingWave cluster, which is not comprehensive enough. We should test against all the storage types, including:

  • memory/etcd based meta
  • memory/MinIO/S3 based compute and compactor

We should ensure that it works in any combination.

release: add release tag for operators

Operators have many crucial definitions, such as specs and API interfaces, that should stay consistent within a specific version so that other components can rely on them with confidence.

feat: bring back the monitoring features

After several refactors #61 and #109, the functionalities for monitoring are broken, such as:

  • Auto provisioning of the ServiceMonitors provided by the Prometheus Operator
  • Guidance to start monitoring the RisingWaves in Kubernetes
  • Documents about testing against the RisingWaves in Kubernetes

Tracking: Recovery mechanism

Currently, the operator won't react to problems like hanging nodes or accidental deletion of sub-resources such as Deployments, Services, and StatefulSets. We want to make sure that we can recover from these issues.

  • Leverage the liveness probe so that we can restart the unhealthy node and recover from it.
  • Implement the reactions to the absence of sub-resources, especially when the cluster's running.
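Reacting to absent sub-resources starts with detecting them. The sketch below compares the sub-resources a RisingWave is expected to own with those actually observed and reports what is missing. It is an assumption-laden illustration: the missingSubResources function and the "Kind/name" key format are hypothetical, and a real controller would list objects through the Kubernetes client rather than take string slices.

```go
package main

import (
	"fmt"
	"sort"
)

// missingSubResources returns, in sorted order, every expected key that is
// not present in observed. Keys use a hypothetical "Kind/name" format.
func missingSubResources(expected, observed []string) []string {
	seen := make(map[string]struct{}, len(observed))
	for _, key := range observed {
		seen[key] = struct{}{}
	}
	var missing []string
	for _, key := range expected {
		if _, ok := seen[key]; !ok {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	expected := []string{
		"Deployment/risingwave-frontend",
		"Service/risingwave-frontend",
		"StatefulSet/risingwave-compute",
	}
	// Simulate an accidental deletion of the frontend Deployment.
	observed := []string{
		"Service/risingwave-frontend",
		"StatefulSet/risingwave-compute",
	}
	for _, key := range missingSubResources(expected, observed) {
		fmt.Println("missing, should be recreated:", key)
	}
}
```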

feat: monitoring and tracing the workflow

Currently, the RisingWave Operator runs a workflow for each reconciliation. The workflow is organized in pure Go with helper functions provided by the ctrlkit. It's easy to understand and develop, but it's hard to monitor and trace for now. We need mechanisms for this. Here are the topics I think would be helpful if achieved:

  • Monitoring
    • Expose metrics of the workflow status, such as ID, target, result, error, and elapsed time
    • Expose metrics of the workflow action status, such as result, error, and elapsed time
    • Dashboard (panels) for monitoring the running status of the operator workflows, e.g. event rates, avg/p50/p99 time, common errors
  • Tracing
    • Design and implement the structures and functions for helping trace the workflow, capturing metrics and states (optional)
    • Provide an option for the tracing and provide a way to get the trace result
    • Dashboard for trace analysis, e.g., the time costs in a tree, a graph where each node(action) is colored by its result
    • Provide an option for capturing the states of the future workflows and a way to replay the workflows with states captured. This is useful when debugging the online service.

bug: unstructured.Unstructured error when create risingwave

When creating a RisingWave through the operator, this error occurs:

sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:245: expected type *v1.StatefulSet, but watch event object had type *unstructured.Unstructured

This bug appeared after PR #21 was merged.

It may be a controller-runtime issue.
