
lifecycle-manager's Introduction


Lifecycle Manager

Kyma is an opinionated set of Kubernetes-based modular building blocks that includes the necessary capabilities to develop and run enterprise-grade cloud-native applications. Kyma's Lifecycle Manager is a tool that manages the lifecycle of these modules in your cluster.

Modularization

Lifecycle Manager was introduced along with the concept of Kyma modularization. With Kyma's modular approach, you can install just the modules you need, giving you more flexibility and reducing the footprint of your Kyma cluster. Lifecycle Manager manages clusters using the Kyma custom resource (CR). The CR defines the desired state of modules in a cluster. With the CR you can enable and disable modules. Lifecycle Manager installs or uninstalls modules and updates their statuses. For more details, read about the modularization concept in Kyma.

Basic Concepts

See the list of basic concepts relating to Lifecycle Manager to understand its workflow better.

  • Kyma custom resource (CR) - represents Kyma installation in a cluster. It contains the list of modules and their state.
  • ModuleTemplate CR - contains modules' metadata with links to their images and manifests. ModuleTemplate CR represents a module in a particular version. Based on this resource Lifecycle Manager enables or disables modules in your cluster.
  • Manifest CR - represents resources that make up a module and are to be installed by Lifecycle Manager. The Manifest CR is a rendered module enabled on a particular cluster.
  • Module CR, such as Keda CR - allows you to configure the behavior of a module. This is a per-module CR.

For the workflow details, read the Architecture document.

Quick Start

Follow this quick start guide to set up the environment and use Lifecycle Manager to add modules.

Prerequisites

To use Lifecycle Manager in a local setup, install the following:

  • k3d (requires a running Docker daemon)
  • kubectl
  • Kyma CLI

Steps

  1. To set up the environment, provision a local k3d cluster and install Kyma. Run:
k3d registry create kyma-registry --port 5001
k3d cluster create kyma --kubeconfig-switch-context -p 80:80@loadbalancer -p 443:443@loadbalancer --registry-use kyma-registry
kubectl create ns kyma-system
kyma alpha deploy
  2. Apply a ModuleTemplate CR. Run the following kubectl command:
kubectl apply -f {MODULE_TEMPLATE.yaml}

TIP: You can use any deployment-ready ModuleTemplates, such as cluster-ip or keda.

  3. Enable a module. Run:
kyma alpha add module {MODULE_NAME}

TIP: Check the modular Kyma interactive tutorial to play with enabling and disabling Kyma modules in both terminal and Busola.

Read More

Go to the Table of Contents in the /docs directory to find the complete list of documents on Lifecycle Manager. Read those to learn more about Lifecycle Manager and its functionalities.


lifecycle-manager's Issues

POC Module Registry / Listing: One-Way Sync of ModuleTemplateList into SKR

To enable introspection of all available modules, we want to offer a convenient way to know which modules can be installed in an SKR. For this, we want to enable a write-only synchronization that copies any active ModuleTemplate CR in the KCP into the runtime cluster. That way Busola, the CLI, and other independent tools can use it for introspection and for building features such as value help or discovery.

Trade-Offs:

  • A newly introduced, changed, or deleted ModuleTemplate can potentially trigger expensive control loops, so we have to consider this feature performance-relevant.
  • There is currently no way to determine whether the ModuleTemplate has been maliciously changed in the cluster between reconciliations

AC

  • All ModuleTemplates in Cluster are synced into SKR
  • Any change of ModuleTemplates in the SKR is ignored
  • Any ModuleTemplate change in Control-Plane is reconciled into SKR
  • Module-Template Synchronization is done in a separate Control-Loop to avoid cluttering the Kyma-Operator Control-Loop
  • A deleted Module Template needs to be removed from all SKRs
  • We should include a flag to disable this synchronization in case it turns out to be too demanding for big control-planes (see trade-off above)
  • In local mode, we should also create that list, but not sync it into an SKR

Developer Notes:

  • The control loop needs to synchronize based on the Kyma CR (which it watches), list all ModuleTemplates, and reconcile them (see the sketch after this list).
  • The control loop needs a change handler for new or changed ModuleTemplates and writes them into the SKR on change.
  • There should be a reasonable, but not overly aggressive, time interval for refreshing the ModuleTemplates of the remote clusters
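
A minimal sketch of such a control loop, assuming the operator.kyma-project.io/v1beta2 API group and a "lifecycle-manager" field owner (not the actual implementation): it lists all ModuleTemplates in the KCP and server-side-applies them into the SKR, so SKR-side edits are overwritten on the next reconciliation.

// Sketch: one-way ModuleTemplate sync from KCP into an SKR.
package sync

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type ModuleTemplateSyncReconciler struct {
	KCP client.Client // control-plane cluster
	SKR client.Client // runtime cluster
}

func (r *ModuleTemplateSyncReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	templates := &unstructured.UnstructuredList{}
	templates.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "operator.kyma-project.io", Version: "v1beta2", Kind: "ModuleTemplateList",
	})
	if err := r.KCP.List(ctx, templates); err != nil {
		return ctrl.Result{}, err
	}
	for i := range templates.Items {
		tpl := templates.Items[i].DeepCopy()
		// Drop cluster-specific metadata before writing the object into the SKR.
		tpl.SetResourceVersion("")
		tpl.SetUID("")
		tpl.SetManagedFields(nil)
		// Server-side apply keeps the KCP as the single source of truth: SKR-side
		// edits are overwritten on the next reconciliation, as required by the ACs.
		if err := r.SKR.Patch(ctx, tpl, client.Apply,
			client.FieldOwner("lifecycle-manager"), client.ForceOwnership); err != nil {
			return ctrl.Result{}, err
		}
	}
	// Refresh on a fixed interval in addition to event-driven reconciliation.
	return ctrl.Result{RequeueAfter: 10 * time.Minute}, nil
}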

Create a new Operator/Control-Loop responsible for establishing lesser privileged Kubeconfigs for SKR Interaction

As a control-plane operator, I want to make sure that all operators interacting with runtimes managed through the Control Plane have only limited access. I want to do this to:

  • Ensure that all operators can only work with the resources they declare
  • Ensure that no operator can achieve privilege escalation
  • Ensure that no malicious attacker can achieve privilege escalation by gaining control of a privileged operator

As such, I want to establish a new way of reading and distributing kubeconfigs.

AC:

  • Every Operator managed by Kyma-Operator should receive new Kubeconfigs
  • Every new Kubeconfig should be rotated out regularly
  • Every Kubeconfig should be limited to a set of Privileges based on preset RBAC Information
  • The RBAC information required for every module depends on its business needs, so it needs to be configurable (e.g. one controller might require read-only access to Pods, while another might even touch CRDs)
  • The controller should expect a "master" kubeconfig, or privileged access in the Control Plane, and should run in a different namespace, under a different name, and with different roles than all other control loops
  • The master kubeconfigs should ideally never be persisted in the cluster; if they have to be, they should be stored in a safe namespace
  • The generated kubeconfigs should bind a user via a RoleBinding to the lesser-privileged roles
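
One way the last AC could be realized (a rough sketch; the ServiceAccount name, namespace, and token lifetime are assumptions): request a short-lived token for a ServiceAccount that is bound via a RoleBinding to the lesser-privileged Role, then wrap the token in a kubeconfig that is handed to the module operator.

// Sketch: mint a restricted, short-lived kubeconfig for an SKR.
package access

import (
	"context"
	"time"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

func restrictedKubeconfig(ctx context.Context, cs kubernetes.Interface, apiServer string, caData []byte) ([]byte, error) {
	expiry := int64((24 * time.Hour).Seconds())
	// The ServiceAccount "module-operator" is assumed to exist and to be bound
	// via a RoleBinding to the lesser-privileged Role required by the module.
	token, err := cs.CoreV1().ServiceAccounts("kyma-system").CreateToken(ctx, "module-operator",
		&authenticationv1.TokenRequest{Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &expiry}},
		metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}
	cfg := clientcmdapi.NewConfig()
	cfg.Clusters["skr"] = &clientcmdapi.Cluster{Server: apiServer, CertificateAuthorityData: caData}
	cfg.AuthInfos["module-operator"] = &clientcmdapi.AuthInfo{Token: token.Status.Token}
	cfg.Contexts["default"] = &clientcmdapi.Context{Cluster: "skr", AuthInfo: "module-operator"}
	cfg.CurrentContext = "default"
	// Rotation is achieved by regenerating this kubeconfig before the token expires.
	return clientcmd.Write(*cfg)
}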

Load Test for demo scenario for >10k Clusters

Implement a demo load test scenario and verify that the load test setup works as expected.

This task has the following parts:

1. Load test script

  • Written as k6 script, a demo can be found here: https://github.com/ruanxin/performance-test/blob/main/load_generator.js
    • Load test the remote watching: watch resources on X remote clusters and ensure the reported customer changes are properly tracked and considered by the Kyma Operator.
    • TBC (possibly already done by PB): Implement a test case to measure the maximum number of watchable resources before K8s stops delivering all change events => sync with @ebensom about limitations SREs were already facing in their setup

2. Target Cluster under testing

  • Provision 2 Gardener clusters, one as the control plane, one as the SKR cluster
  • Deploy prometheus-operator into the control plane cluster
  • Create a Grafana dashboard with some demo panels to monitor predefined metrics
  • Create a Helm chart defining a dummy resource that simulates the behavior of resources defined in Kyma components
  • Deploy kyma-operator and manifest-operator into the control plane cluster
  • Deploy the skr-operator into the SKR cluster, which manages the reconciliation of target resources
  • Configure a remote OCI registry to store component descriptor files
  • Run the load test against the control plane cluster, targeting 10,000 Kyma CRs with 200,000 modules
    • ⛔ 10% of the modules should be KCP modules — no KCP modules are planned for Phase 1

3. Conclusion

  • Present measurements/results to the team => point out identified thresholds

4. Findings and follow-up issues

Notes

  • Rewrite manifest-operator to overwrite the deployed Helm chart resource name with the Kyma CR name as a suffix
    • operator/internal/pkg/prepare/deploy.go:106
    • make sure the overwritten name is a unique identifier
  • The load test should not run until the listener and watcher are integrated

Lifecycle Manager delivery

Description
The Lifecycle Manager is an essential component in the new reconciler architecture. It follows the K8s operator pattern, reacting to data changes (defined in CRs).

The Lifecycle Manager is responsible for the lifecycle management of Kyma installations (that is, the components configured for a Kyma cluster) and for determining the Kyma installation status (aka cluster status).

Acceptance criteria

  • #77
  • #78
  • #49
  • Documentation of the architectural design
  • Operational requirements are covered
    • #79
    • Exposing analysis and statistic relevant data via Prometheus endpoint
    • Logging of critical events
    • Auditlog is integrated
    • Operational test was executed and troubleshooting guides were created
  • Threat modelling findings implemented (see #144)
  • Microdelivery process established

Issues
kyma-project/kyma#13759

Appendix

  • CRD = Custom Resource Definition
  • CR = Custom Resource
  • BYOC = Bring your own cluster
  • SKR = SAP Kyma Runtime

Reference implementation of a CI/CD pipeline with Gherkin support

Description
Each component in the new reconciler architecture requires CI/CD pipelines. Most of the components have quite similar requirements for their pipelines. Therefore, a reference pipeline is required which implements the basic CI/CD features (build, validate, test etc.). This reference implementation can be reused by other reconciler components to reduce implementation efforts.

AC:

  • Align with the Pandas team to set up Gherkin testing as the main test framework
  • Integrate code checks into CI/CD pipelines (linting etc.)

POC: Manifest Operator

Description
The POC of the Manifest Operator has to verify the technical feasibility and close potential knowledge gaps.

AC:

The following points have to be verified:

  • Verify that X Manifest resources can be handled by the Manifest Operator, processed accordingly, and scale horizontally (covered by the load test)
  • State handling of the Manifest CR is verified
    • Create a reference implementation for component reconciliation (based on the Helm kubeclient)
    • CRUD operations and error handling implemented
  • Minimal Manifest CRD definition defined (a sketch follows after this list)
  • Reconciler loop takes care of synchronizing SKR module CRs
    • Merge of remotely retrieved module CRs in case of updates
  • Test framework established (will be done with Gherkin)
  • High-level security review (no threat modelling)
  • Implement shared manifest library (SML)
    • Provide interface definition (API spec)
  • Load dynamically referenced kubeconfig from the CR
  • Reference implementation for caching of rendered manifests
  • Overrides and profiles are considered by the manifest rendering (deals with overrides for CRs and for Helm charts)
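
The "minimal Manifest CRD" item above could look roughly like the following Go API type. This is purely illustrative; the field names and API group are assumptions, not the final contract.

// Sketch: a minimal Manifest CRD as a Go API type.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ManifestSpec describes the rendered resources of a single module for one cluster.
type ManifestSpec struct {
	// Reference to the Secret holding the SKR kubeconfig (see "Load dynamically
	// referenced kubeconfig from the CR" above).
	KubeconfigSecretRef string `json:"kubeconfigSecretRef,omitempty"`
	// OCI reference to the chart / manifest layer to install.
	InstallRef string `json:"installRef"`
	// Optional overrides merged into the rendered chart values.
	Overrides map[string]string `json:"overrides,omitempty"`
}

// ManifestStatus tracks the installation state of the module resources.
type ManifestStatus struct {
	State      string             `json:"state,omitempty"` // e.g. Processing, Ready, Error, Deleting
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

type Manifest struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ManifestSpec   `json:"spec,omitempty"`
	Status ManifestStatus `json:"status,omitempty"`
}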

Reason
Validate technical feasibility of Manifest operator and close knowledge gaps.

POC: Template Operator - Use Kubebuilder Declarative Pattern to improve the Manifest Library for easy Reconciliation

Research the kubebuilder declarative approach to hide reconciliation complexity from a template-operator user.
With this declarative approach, the reconciliation complexity and status (or state) handling should ideally happen implicitly as part of the offered framework. For the installation of target resources, it could use the manifest library as an option.
https://github.com/kubernetes-sigs/kubebuilder-declarative-pattern

AC:

  • template-operator is set up inside lifecycle-manager/samples/template-operator/operator with a custom resource (e.g. Sample)
  • README.md is updated to contain installation information
  • declarative library and types are implemented in the module-manager library
  • Sample controller is implemented using the reconciliation and status handling logic from the declarative library
  • Sample CR API status subresource refers to the Status object type from the declarative library
  • Options to interact with the declarative library's reconciliation: add labels to Sample CR, transform manifest resources, resolve chart information, verify installed information - to be implemented and included in the Sample controller

Pull requests:
kyma-project/module-manager#67
#123
kyma-project/module-manager#79
kyma-project/module-manager#80
#136

POC: Kyma-Operator

Description
The POC of the Kyma Operator has to verify the technical feasibility and close potential knowledge gaps.

AC:

The following points have to be verified:

  • Verify that X Kyma resources can be handled by the Kyma Operator, processed accordingly, and scale horizontally
  • Dynamic watching of module templates (module CR changes trigger reconciliation of the Kyma CR)
  • State handling is verified
    • Watching module CRs
    • Reference implementation of state aggregation available
    • CRUD operations and error handling implemented
  • Minimal Kyma CRD definition defined
  • Reconciler loop takes care of synchronizing SKR Kyma CRs
    • Merge remotely retrieved SKR Kyma CR with the KCP Kyma CR
  • Release channel parsing (propagation of module versions)
  • Can work with ModuleTemplates containing CNUDIE / OCM module descriptors for resolving and parsing OCI image references for bundling (includes support for overwrites provided from outside)
  • Test framework established: will be done using Gherkin
  • High-level security review (no threat modelling)

Reason
Validate technical feasibility of Kyma operator and close knowledge gaps.

Threat modelling for new reconciler architecture

The new reconciler architecture has to pass a threat modelling workshop as part of the SAP product standards. The threat modelling workshop has to be planned and executed together with security experts and findings have to be solved before the Go-Live happens.

AC:

  • Plan and execute threat modelling workshop with security experts
  • Track findings in dedicated issue tickets and mark go-live relevant issues accordingly

Implementing Monitoring and Observability

Description

To ensure operational readiness of the new reconciler product, a comprehensive monitoring and observability solution is essential. We want to support operational aspects like observability and tracing by design and incorporate them early on.

AC

Developer Notes

  • Metrics, including but not limited to (see the sketch after these notes):
    • Successful processing duration, 95th percentile, for each module reconciliation
      • The duration of one module state change from Error or Processing to Ready.
      • Design this metric as a histogram
    • Operator memory and CPU usage (in % and peak %)
    • Operator worker-queue-related metrics (from kubebuilder)
    • Operator reconciliation error and success counts (from kubebuilder)
  • The dashboard data file should be persistent and configured in a Grafana dashboard ConfigMap; current control-plane repo reference
  • Contact Huskies to clarify who takes responsibility and how to do the integration of Jaeger and Kiali.
    • Jaeger is not in a production-ready state
    • Jaeger will be replaced with OpenTelemetry
    • Operator Listener/Watcher is not a typical use case for tracing
    • Conclusion: Don't integrate tracing tool
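
For illustration, the module state-change histogram from the notes above could be registered with the controller-runtime metrics registry, so it is exposed on the same Prometheus endpoint as the kubebuilder worker-queue and reconcile metrics. The metric name, label, and buckets are assumptions.

// Sketch: module state-change duration as a Prometheus histogram.
package observability

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var moduleStateChangeDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "lifecycle_mgr_module_state_change_duration_seconds",
		Help:    "Duration of a module state change from Error/Processing to Ready.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s .. ~34min
	},
	[]string{"module"},
)

func init() {
	// controller-runtime already serves worker-queue and reconcile error/success
	// metrics from this registry; we only add the module-specific histogram.
	ctrlmetrics.Registry.MustRegister(moduleStateChangeDuration)
}

// ObserveStateChange is called when a module transitions to Ready.
func ObserveStateChange(module string, since time.Time) {
	moduleStateChangeDuration.WithLabelValues(module).Observe(time.Since(since).Seconds())
}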

Related PR

kyma-project/module-manager#113

Template Operator: Developer Makefile Tooling: spin up a minimalistic Control Plane Development Environment

Spin up a minimalistic control-plane development environment with Kyma Operator and Manifest Operator in a new control plane, with a connection to a given SKR in the runtime.

  • Create a new Makefile Command in the template-operator
  • Make sure that the Command is using Kyma CLI in the latest Release Version as Binary
  • Spin up a second Cluster that is used as Control Plane
  • Install Kyma-Operator and Manifest-Operator with hardwired release versions into the Control Plane (kyma deploy)
  • Create a default Kyma CR with the module name from the template generated by the template-operator, and a Secret with a kubeconfig referencing the SKR cluster, derived from the current $KUBECONFIG variable in the script
  • Install the ModuleTemplate from the generation command in #106 into the Control Plane

Define Kyma CRD

Description
CRDs are the data schema of CR resources and build the contract between the Kyma Operator and other KCP components. The requirements of the Kyma Operator (especially with regard to the Kyma CRD) are defined and aligned with other KCP components (e.g. KEB).

AC:

  • Final CRD is defined and documented (in code as comments)
  • CRD is aligned with KEB team
  • Support debug-flag propagation (required to enable extended log-messages in Manifest Operator for particular Kyma components)

POC Kyma operator II - refactor module template management

AC

  • Define a new flag for SKR/KCP differentiation
  • Updates to module CRs for KCP installations are not reapplied
  • Apply updates from the module template to the manifest for SKR installation
  • The spec update to the manifest will be overwritten by the latest module template (not implemented in this issue)
  • If a module CR / Manifest CR in KCP is deleted, it gets recreated from the module template
  • CRDs are passed to the manifest based on a new layer on the descriptor with name/type "crds", and it is an OCIRef (meaning a layer of an image somewhere on an OCI registry)
    • crds is defined under the field descriptor.resources.type
    • ❓ the type of crds should be yaml
  • CR of the module template gets propagated to the manifest for remote installation

Developer Notes

  • Introduce a new flag in operator/api/v1alpha1/moduletemplate_types.go to differentiate SKR/KCP installation (see the sketch after this list)
  • Use this new flag to determine whether an update for the module CR needs to be applied, in operator/pkg/watch/template_change.go:24
  • Write tests to verify that module CR recreation works
  • Verify with Manifest Operator team members that the generated CR can be consumed correctly
  • Samples in config get updated according to the new design
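
For illustration, the new flag could be shaped like this hypothetical excerpt from operator/api/v1alpha1/moduletemplate_types.go; the field name Target and its values are assumptions, not the final API.

// Sketch: SKR/KCP differentiation flag on the ModuleTemplate spec.
package v1alpha1

type ModuleTemplateSpec struct {
	// ... existing fields such as Channel and Data ...

	// Target differentiates between SKR and KCP installation. Module CR updates
	// for "kcp" installations are not reapplied; "skr" installations receive
	// module template updates through the Manifest CR.
	// +kubebuilder:validation:Enum=skr;kcp
	// +kubebuilder:default:=skr
	Target string `json:"target,omitempty"`
}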

Create CI pipeline to run mandatory SAP security scanners

Description
The reconciler code base has to be scanned by the SAP security scanners. CI/CD-pipelines have to be created to run the mandatory scanner tools.

Mandatory scanners are:

  • Protecode
  • Checkmarx
  • Whitesource

The code bases in the following repositories have to be scanned by these tools:

AC:

  • Scanner tools are updated to support scanning the Kyma Operator repository and Manifest Operator repository (has to be aligned with @aberezovski)
  • CI/CD pipelines are added which run the SAP security scanners (Whitesource, Protecode, Checkmarx) regularly for the Kyma Operator and Manifest Operator code bases. These pipelines are triggered at least once a day (Mo-Fr).

Reason
Be compliant with SAP product standards.

MVP: Infrastructure Operators

Description

The Infrastructure Operators are essential components in the new reconciler architecture. They follow the K8s operator pattern, reacting to data changes (defined in CRs).

The Infrastructure Operators will be responsible for the lifecycle management of managed Kubernetes clusters (SKR clusters) and their underlying network setup.

Acceptance criteria

  • Create dedicated repositories for infrastructure operators' code base in the kyma-project organisation
  • CRDs are the data schema of CR resources and build the contract between the Provisioner Operator and other KCP components. The requirements of the infrastructure operators (especially in regards to the Cluster CRDs or how kubeconfig data are stored) are defined and aligned with other KCP components (e.g. KEB).
  • The operators have to react on Cluster CRs and, depending on the resource change,
    • create/update/delete network configurations used by the Gardener cluster
    • create/update/delete Kubernetes clusters by interacting with Gardener
  • Store/update the kubeconfig of a created Kubernetes cluster in a dedicated resource (e.g. a Secret resource)
  • Operational requirements are covered
    • Exposing analysis and statistic-relevant data via Prometheus endpoint
    • Logging of critical events
    • Auditlog is integrated
    • Operational test was executed and troubleshooting guides were created
  • Documentation of the architectural design
  • Security threat-modelling passed
  • Microdelivery process established
  • Load test executed and thresholds / limitations are identified

Issues
kyma-project/kyma#13759

Appendix

  • CRD = Custom Resource Definition
  • CR = Custom Resource
  • BYOC = Bring your own cluster
  • SKR = SAP Kyma Runtime

POC: Watching and triggering actions for cluster resources

Due to the change in the reconciler concept and scenario, the description of this issue has been edited to fit the new concept. Edited 30.06.2022 @jeremyharisch
(Switching from the reconciler mothership to a cloud-native approach: instead of a per-cluster watcher tower, a Kyma Watcher based on a Kubernetes operator communicates with the management plane.)

Context

The main context to be set is that the user is able to set/edit configurations inside the SKR cluster and make them available on the management plane.

To achieve this, every cluster contains a component called Watcher that reports detected changes to the management control plane. The central Kyma Operator (established here) on the management control plane relays the changes to the corresponding configuration CRs and thus triggers a reconciliation.

(Diagram: watcher workflow network architecture)

Key Requirements (for MVP):

Stability

  • Every cluster that is subject to change detection must be reconcilable at any given time, and access to the cluster status must be ensured, given administrative access to the cluster
  • Change updates of an individual cluster must have no measurable impact on the availability of the Kubernetes API server, both from the SKR cluster and from the management plane
  • A failure in change detection of a given cluster must have no influence on the change detection process for any other cluster with change detection
  • A change of the change detection rules required for cluster operations must correlate to a revision of the Kyma runtime version, allowing only one ruleset per Kyma version
  • A change detection event can be sent to any number of interested parties without impacting other change notifications
  • Any given change discovered by the change detection can be discarded based on the type of change to avoid information congestion
  • Every cluster can be ignored for change detection during reconciliation to avoid false positives on changes caused by the reconciliation process, or at least it has to be verified whether the change is applicable
  • The change detection may only pick up Kyma-related changes to relevant resources and should ignore changes caused by independent processes running inside the cluster

Reliability

  • A cluster should be able to notify about detected changes via at least 2 different ways of communication to ensure resilient reconciliations
  • The change detection can run in high-availability, failure-tolerant environments without consistency issues by using independent containers for its orchestration
  • The change detection should guarantee to inform about detected changes on an at-least-once basis, removing duplicates on a best-effort basis
  • Should change detection fail due to the operator becoming unresponsive, a centrally orchestrated service should be able to set up a new Watcher operator
  • Should change detection fail without a chance of recovery, the change detection should fall back to notifying about future changes and should be able to continue operations
  • A configuration change picked up by the change detection should result in all detected changes being delivered under the old revision of the configuration, ensuring reliable operations during Kyma upgrades

Security

  • Changes in a cluster are not allowed to be shared with other clusters
  • Changes can only be delivered to registered services trusted ahead of provisioning, on the basis of certificates and mutual trust
  • Changes cannot be accessed by any component inside the cluster except cluster administrators with a specific RBAC rule
  • A change history should never be stored centrally outside of the given processing time and API server responses, and can never be persisted
  • Change delivery can only occur with third-party services trusted ahead of provisioning, on the basis of certificates and mutual trust
  • The Kyma Operator is considered an external service and has to fulfill the same guarantees of trust as any other component that requires to be notified of changes

Maintainability

  • Configuration changes that cause behaviour differences in change detection must be handled at runtime and must not cause maintenance incidents
  • Change Detection can be paused for individual clusters or groups based on common attributes defined through the configuration
  • Change Detection can be partially disabled for individual change groups through a Kyma Component

Performance

  • Change detection should be flawlessly managed for over 10,000 clusters within a change detection system architecture. The resources required for change detection are defined through the individual components
  • A change detection agent running inside a cluster should not consume more than 1% of a given node's performance at any given time
  • A change detection agent should run at most once on any given node in a cluster, either through affinity or daemonset
  • A change detection agent should not be responsible for maintaining persisted state over a change or of notifying other components of the change
  • A change detected in a cluster must not contain more than 1 MB of data for notification of other parties

Operability

  • A change detected in a cluster must be fully traceable once processed and has to be uniquely identified for traceability based on the kubernetes resource and change time
  • The Kyma watch operator must be horizontally scalable individually while respecting the requirements to performance and reliability
  • The Kyma watch operator must be able to show the current configuration state they are running in at any given time
  • The Kyma watcher operator must track metrics for detected changes and react to these metrics accordingly with adjustments to their behaviour in performance and reliability

AC for POC

  • Implement a simple watch mechanism reacting to particular resource changes (marked by defined labels and APIGroups); see the sketch after these ACs
    • Changes are reported to the Listener component, including information about the resource name and namespace
    • Map newly created/adapted resources to corresponding components
    • Dynamic watching of configured K8s resources (e.g. ConfigMaps, Kyma CR)
    • DESCOPED for POC: Dynamic approach on how to watch on configured labels; can change during runtime
    • DESCOPED for POC: What would happen if the watchable resources increase with time. Rate limiting or comparison based on filtered set timestamps?
  • Implement listener modules (part of Kyma Operator and Manifest Operator) which receive detected changes reported by the watcher component
    • Establish endpoint to retrieve changes reported from watcher (Kyma Operator)
    • Establish endpoint to retrieve changes reported from watcher (Manifest Operator)
    • An event handler module retrieves the change per module operator
    • The event handler triggers the reconciliation of an updated module
    • OPTIONAL (depends on security requirements): Find a solution to store new events in case the kyma-operator shuts down, or use at-least-once delivery
    • Implement networking solution to distribute incoming SKR-events to corresponding listeners
  • The POC was reviewed by the security expert and approved
    • Optional: Provide alternative communication approach to pass security check
    • Investigate/define infrastructure-related parts required to ensure secure communication between watcher and listener (reverse proxy setup etc.)
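
A rough sketch of such a watch mechanism (not the POC code): dynamic informers on configured GroupVersionResources, filtered by a label, reporting the name and namespace of changed objects to a listener endpoint. The label key, the GVR list, and the listener URL are placeholders.

// Sketch: label-filtered dynamic watch that reports changes to a listener.
package watcher

import (
	"bytes"
	"context"
	"encoding/json"
	"net/http"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

type changeEvent struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Kind      string `json:"kind"`
}

func report(listenerURL string, ev changeEvent) {
	body, _ := json.Marshal(ev)
	// A real implementation would need retries/backoff and mutual TLS.
	_, _ = http.Post(listenerURL, "application/json", bytes.NewReader(body))
}

func Watch(ctx context.Context, client dynamic.Interface, listenerURL string, gvrs []schema.GroupVersionResource) {
	// Only pick up Kyma-managed resources, identified by a label.
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client, 10*time.Minute, metav1.NamespaceAll,
		func(opts *metav1.ListOptions) { opts.LabelSelector = "operator.kyma-project.io/managed-by=kyma" },
	)
	handler := cache.ResourceEventHandlerFuncs{
		// Add/Delete handlers omitted for brevity.
		UpdateFunc: func(_, newObj interface{}) {
			u := newObj.(*unstructured.Unstructured)
			report(listenerURL, changeEvent{Name: u.GetName(), Namespace: u.GetNamespace(), Kind: u.GetKind()})
		},
	}
	for _, gvr := range gvrs {
		factory.ForResource(gvr).Informer().AddEventHandler(handler)
	}
	factory.Start(ctx.Done())
	<-ctx.Done()
}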

POC Kyma operator II - refactor Kyma CR management

AC

  • Kyma CR is synchronised from the SKR as the single source of truth
    • Deleted CRs on the SKR get recreated from the KCP CR (handled in #90)
    • If the KCP CR is deleted, the Kyma installation gets uninstalled
    • Ignore updates on the Kyma CR spec from KCP to SKR
    • #93: Remove logic for configuration overrides in the Kyma CR and passing the label selector to the Manifest Operator
    • #95
  • Status aggregation works by combining all Manifest CR states with all module CR states and writing them into the Kyma status (alignment blocked by changes necessary in #90 to create multiple paths for SKR/KCP installation first)
    • CRDs are verified to contain a status.state field

Developer Notes

  • Refactor kyma_synchronization_context.go
    • Refactor SynchronizeRemoteKyma to only keep the logic that syncs from SKR to KCP
  • Write tests to verify if status aggregation works

Target (SKR) objects name/namespace configuration overrides.

This is needed for the Load Testing purposes, but it looks like a general feature that the Users would like to have.

For the Load Testing we have:
10 000 Kymas with 20 components each, every component handled by a Manifest CR.
This results in 200 000 Manifest objects.
Every Manifest CR causes an installation of some Helm chart in the target cluster.
Let's assume this Helm chart just contains a custom CR, a single k8s object.
200 000 Manifests then mean 200 000 custom CRs installed in the target cluster(s).
How do we control names/namespaces of these objects?

Note: The number of target objects deployed in the single target cluster of course depends on how many target clusters we have. For 4 target clusters we have 200 000 / 4 => 50 000 objects. But this doesn't matter for the discussed feature.

I am putting some proposals below. Feel free to extend it.

  1. A single namespace for all CRs

    In that scenario every object is installed in the same namespace in the target cluster. Then the names of these objects must be different to avoid collisions. This brings two questions:

    • How and where do we generate names for these objects? Of course Helm chart could just "randomize" the name, but it's not the reality of how Helm charts are written. We should expect a fixed name in the chart, with an ability to override it. Can we define such an override in the Kyma object?
    • How do we configure custom state checking function for every object (different names)? We somehow need to know the name upfront.
  2. Different namespaces

    • 2.1. Different target namespace per Kyma object

      This is the setup we used previously. For every Kyma object in the control plane, we've created a separate namespace in the target (SKR) cluster. All the custom CRs related to a single Kyma object and installed in the target cluster ended up in the same namespace. So for 10 000 Kymas we've had 10 000 namespaces. It would be nice to be able to configure this namespace's name somehow - we've just used the name of the Kyma object by customizing the Manifest Operator code.
      Because we have different namespaces, names of components (20 of them) must be unique inside the namespace, but can be otherwise fixed i.e. component 1 can always be named like "component1" in all namespaces - no collisions.

    • 2.2. Different target namespace for every single CR

      Extreme case, but this setup most closely corresponds to the "real" SKR. After all, in the real installation, most of the Kyma components are installed in their separate namespaces. For the 200 000 Manifests and related 200 000 custom CRs we need 200 000 namespaces in the target cluster(s). As a consequence all the components may even have the same fixed name, like "load-test-component", and there are no collisions because every one is deployed in a different namespace.

The question is:

  • Which of the above can we set up at this moment with just the code we have?
  • Which of the above do we consider a valid and useful scenario that could be easily implemented even during the POC?

Prepare a template operator

A template operator is a reusable operator which can be leveraged by component teams to create their own operators, interact with the kyma-operator, and thereby be a part of the modularisation framework.

AC:

  • Prepare a template operator skeleton project
  • Implement a template CR managed by this operator
  • Implement basic state handling for the template CR
  • Showcase operator best practices
  • Write documentation on how to reuse this operator and integrate it with the kyma-operator

Implement Kyma Operator

The Kyma Operator is responsible for reacting to changes applied to the Kyma CR (see #77) and triggering the required CR updates.

AC:

  • The operator is watching these Kyma CRs and, depending on the resource change, creating/updating/deleting related component CRs and reacting to modules by creating manifest CRs
  • Kyma status tracking
    • Aggregating conditions based on manifest status to Kyma status
    • Dynamic enqueueing of Kyma instances based on failure and processing state
    • Detect and react to component status change for aggregation in Kyma conditions (list of conditions indicating whether Kyma installation is finished)
  • Module template handling
    • Watching module templates based on channel, profile and module name
    • Parsing module templates based on OCM component descriptor API
    • Validating and defaulting webhooks to correctly validate OCM descriptor on input
    • Translation of OCM descriptor model into manifest API
  • Allow overwrites
    • Allow module based overwrites for channels in Kyma
    • Allow channel based overwrites in Kyma
  • SKR interaction
    • Accepting listener request for reconciliation of Kyma instance (tracked in kyma-project/runtime-watcher#2)
    • Verify and fetch Kubeconfig for SKR based on secrets (tracked in #12)

Template Operator: Developer Makefile Tooling for Building ModuleTemplates

To support developers writing their own modules, we have to make sure that they can integrate seamlessly into the existing Kyma module infrastructure. That's why we want to introduce generation tooling in the Template Operator (a Makefile command) that creates a ModuleTemplate from the operator, with a Helm chart based on the kustomize values of the kubebuilder project and a default values YAML used for the ModuleTemplate. In the end, it should just be another command that outputs the finished, bundled module.

AC:

  • Create a new Makefile Command in the template-operator
  • Make sure that the Command is using Kyma CLI in the latest Release Version as Binary
  • Trigger Module Building by converting the kustomize output into the different layers and chart of the CLI
  • Make sure that output module template is saved into a directory ignored by .gitignore
  • Make sure that module template can be used for installation into a control-plane cluster
  • Make sure that the ModuleTemplate generated by the Make command above is pushed to a predefined registry hardwired in the Makefile

Threat Modelling of Manifest Operator

The Manifest Operator has to pass a threat modelling workshop as part of the SAP product standards. The threat modelling workshop has to be planned and executed together with security experts and findings have to be solved before the Go-Live happens.

The scope for this threat modelling workshop is primarily the Manifest operator and its interaction with remote Kubernetes clusters.

AC:

  • Plan and execute threat modelling workshop with security experts
  • Track findings in dedicated issue tickets and mark go-live relevant issues accordingly

Release process and delivery pipelines for new reconciler ecosystem

Description

The new reconciler product consists of multiple different services and resources (e.g. Kyma Operator, Manifest Operator, Kyma CRDs etc.). The creation of releases and the lifecycle management of the different components of this product can be challenging (especially with regard to configuration options and updates). A concept is required for how all resources have to be deployed and properly configured in Kubernetes to get the product working (potential solutions could be the Helm package manager, Kyma CLI, etc.).

Additionally, a lightweight end-to-end test is required to verify that the new reconciler product was successfully installed and is ready to work.

Finally, SREs have to be ramped up and enabled to implement a delivery pipeline in their CI/CD system (Spinnaker) to automate the rollout and update of the new reconciler product to the KCP DEV/ STAGE / PROD landscapes.

AC:

  • Define a concept for how the lifecycle of the reconciler will be managed / how the release process works
    • The concept is aligned with the team and SREs
  • Necessary tooling for the lifecycle management is configured / implemented (e.g. CLI command etc.)
  • SREs are ramped up and tooling is passed over so that SREs can configure their CI/CD system to deliver the new reconciler product to KCP landscapes
  • Simple E2E test case is implemented to verify the correct deployment of the reconciler product
  • End-user documentation (main audience is SRE and developer) is available -> see kyma-project/community#700

Reason
Establish a solution for lifecycle management of the new reconciler ecosystem.

Remove reconciler-control-plane-image-bump after reconciler deprecated

Description
After the reconciler is deprecated, remove the reconciler-control-plane-image-bump job and related code in the test-infra repo.
related code:

prow/scripts/resources/autobumper

AC:

  • Pipeline reconciler-control-plane-image-bump was removed

Reason
Clean up pipelines of the old reconciler architecture which act on the control-plane repo.

Remote reconciliation strategy

The purpose of this issue is to define how we want to architect and implement the remote reconciliation.

The remote reconciliation assumes:

  • Component operators are running in the Control Plane
  • Component operator CRs are created in the Control Plane
  • Running instances of component operators are reconciling Kyma components in remote clusters (customer clusters)

The challenges found initially:

  • Scaling of the solution. Consider one hundred clusters are to be upgraded starting Tuesday morning. How long does it take overall, how long does it take for a single cluster on average and what's the latency in CR status updates?
  • Do we want to control the ordering of upgrades between clusters? See below for explanation.
  1. Scaling of the solution
    The operators can't scale. There's always only one instance of an operator. How do we reconcile 100 clusters in parallel using just a single golang process?

    • goroutines won't work - do you have a Node performant enough?
    • horizontal scaling won't work - you can't increase replicas for an operator
    • what about status update latency? We should update the status on CRs to something like "Pending" ASAP, for every CR, regardless of the time it takes to actually upgrade the related clusters. It means the actual workload should somehow be scheduled to the "background" or to some external entity, and the operator should process CRs without any substantial delay.

    Possible solutions:

    • just measure what we can achieve right now, if it's not enough, try to find a solution :)
  2. Ordering of upgrades problem - latency, resources.

    Assume we have just three component operators: A, B, C.
    Let's also assume all component operators perform the reconciliation step in a similar time.
    Assume we have 10 clusters to upgrade.
    We can easily show that the order of processing between component operators does change the observed latency and resource allocation.

    First model (worst-case):

    Steps in time | comp. A is processing | comp. B is processing | comp. C is processing | clusters ready after the step
    1             | cluster 1             | cluster 1             | cluster 10            | -
    2             | cluster 2             | cluster 2             | cluster 9             | -
    3             | cluster 3             | cluster 3             | cluster 8             | -
    4             | cluster 4             | cluster 4             | cluster 7             | -
    5             | cluster 5             | cluster 10            | cluster 6             | -
    6             | cluster 6             | cluster 9             | cluster 5             | -
    7             | cluster 7             | cluster 8             | cluster 4             | 4
    8             | cluster 8             | cluster 7             | cluster 3             | 4, 8, 7, 3
    9             | cluster 9             | cluster 6             | cluster 2             | 4, 8, 7, 3, 9, 6, 2
    10            | cluster 10            | cluster 5             | cluster 1             | 4, 8, 7, 3, 9, 6, 2, 10, 5, 1

    In this model, we are actively reconciling up to 6 clusters at any time, and the average latency for a cluster to be ready is 8.8 (in the artificial "Steps in time" units)

    Second model (best-case):

    Steps in time | comp. A is processing | comp. B is processing | comp. C is processing | clusters ready after the step
    1             | cluster 1             | cluster 1             | cluster 1             | 1
    2             | cluster 2             | cluster 2             | cluster 2             | 1, 2
    3             | cluster 3             | cluster 3             | cluster 3             | 1, 2, 3
    4             | cluster 4             | cluster 4             | cluster 4             | 1, 2, 3, 4
    5             | cluster 5             | cluster 5             | cluster 5             | 1, 2, 3, 4, 5
    6             | cluster 6             | cluster 6             | cluster 6             | 1, 2, 3, 4, 5, 6
    7             | cluster 7             | cluster 7             | cluster 7             | 1, 2, 3, 4, 5, 6, 7
    8             | cluster 8             | cluster 8             | cluster 8             | 1, 2, 3, 4, 5, 6, 7, 8
    9             | cluster 9             | cluster 9             | cluster 9             | 1, 2, 3, 4, 5, 6, 7, 8, 9
    10            | cluster 10            | cluster 10            | cluster 10            | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

    In this model, we are actively reconciling only 1 cluster at a time, and the average latency for a cluster to be ready is 5.5 (in the artificial "Steps in time" units).

    Notice that the amount of work to be done is the same in both models, just the ordering makes the difference.
    Our current proposed strategy is: every component operator is independent and processes clusters in its own order/pace.
    This unfortunately leads more towards the first execution model, which is not optimal.

    Another thing is resource consumption. If we actively reconcile 6 different clusters at a time, we need 6 different k8s golang clients, 6 HTTPS streams, and so on.
    Finding a way to share such resources between operators (a common external worker, for example), along with processing clusters in some synchronized order, would reduce resource consumption along with the latency.

    As always, the benefits from the possible improvements here may not justify the effort - it should be measured.

POC Kyma operator II - module deletion

AC:

  • A module removed from the Kyma CR (in KCP and SKR) should delete the related modules; this includes modules for all targets (remote, control plane)
  • A Kyma CR in KCP marked for deletion should delete the Kyma in the SKR and all modules
  • The Kyma CR in the SKR should be protected by a finalizer (see the sketch after the developer notes)
  • The Kyma should be recreated immediately after the finalizer is removed
  • Kyma creation is based on the spec.sync condition in the KCP Kyma CR

Developer Notes

  • A module removed from the Kyma CR means:
    • the related entry in spec.modules gets deleted
  • After a module is deleted, the related entry in status.conditions should also be deleted
  • The Kyma CR in the SKR is not managed by a controller reference
  • Confirm that the controller reference deletion works as expected
  • Refactor the status.conditions creation logic
    • Initiate all existing module conditions when the Kyma CR is created
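
A sketch of the finalizer protection mentioned in the ACs, using the controller-runtime controllerutil helpers; the finalizer name and the surrounding deletion logic are assumptions.

// Sketch: finalizer handling for the Kyma CR during deletion.
package deletion

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const kymaFinalizer = "operator.kyma-project.io/kyma"

func handleDeletion(ctx context.Context, c client.Client, kyma client.Object, modulesGone bool) (ctrl.Result, error) {
	if kyma.GetDeletionTimestamp().IsZero() {
		// Normal reconciliation: make sure the finalizer is present.
		if !controllerutil.ContainsFinalizer(kyma, kymaFinalizer) {
			controllerutil.AddFinalizer(kyma, kymaFinalizer)
			return ctrl.Result{}, c.Update(ctx, kyma)
		}
		return ctrl.Result{}, nil
	}
	// Kyma CR is marked for deletion: only remove the finalizer (and thereby let
	// the object go away) once all related module / Manifest CRs are deleted.
	if !modulesGone {
		return ctrl.Result{Requeue: true}, nil
	}
	controllerutil.RemoveFinalizer(kyma, kymaFinalizer)
	return ctrl.Result{}, c.Update(ctx, kyma)
}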

Configuration Management - User and central configuration

After the sample application from the Watching POC, we want to deduce some base CRs necessary for all components and then decide whether we want additional configuration overrides. Right now, the overrides can be left out for simplicity, but it is still open where we want to allow a configuration override from the SKR.

AC:

  • Conceptualise base CRs necessary for all Components
  • Assess additional Configuration Overrides

Setup pipelines for continuous integration of Kyma Operator

Description
Continuous integration is mandatory and best practice to ensure a working and verified code base and to detect breaking changes. A CI/CD pipeline has to be created which validates the Kyma Operator sources whenever a PR is opened to the Kyma Operator repository (especially to the main branch).

The pipeline has to execute the following actions / code checks:

  • Compile sources
  • Run linter
  • Execute unit tests

AC:

  • CI/CD pipeline is created which is checking the codebase of each PR opened to the main Kyma Operator repository
  • The pipeline executes the following actions:
    1. Compile the sources
    2. Run code linter
    3. Execute unit tests
  • A failing pipeline leads to a Slack notification to the existing Jellyfish notification-channel.

Reason
Set up continuous integration in the Kyma Operator development process.

Performance Test for new reconciler architecture

For the long term, a regular performance check of the reconciliation product is required. The load test should be executed periodically / per release as part of the CI/CD.

The goal is to gain information about how the general performance of the product evolves. The statistics should indicate whether the general product performance has increased or decreased between releases.

AC:

Concept

  • Define Performance Test Concept
  • Align concept with team

Implementation

  • Implement the load test scenario
    • #18
    • Bootstrap test infrastructure (e.g. create a new cluster or use hibernation in Gardener to stop/start landscapes)
    • Run load test scenarios
    • Shut down test infrastructure
      • Delete generated test resources
      • Tear down test infrastructure (remove generated resources, stop test infrastructure)
  • Show results in a Grafana dashboard and make the data available in the long term (enables us to compare how the performance profile evolves over time)
  • Setup CI/CD pipeline which executes the load test regularly (at least once per week and per new release)

POC Implement listener for kyma-operator

Once the Kyma CR in the SKR has a generation change, the kyma-watcher in the SKR should call the registered kyma-operator endpoint with the .spec.modules payload from the SKR. This payload should then update the skr-watcher-modules ConfigMap in the KCP.

The skr-watcher-modules ConfigMap should have the following structure:
(same as kcp-watcher-modules)

kind: ConfigMap
<...>
data:
  <module-name>-<channel>: |
    {
      "KymaCRList": [{
        "KymaCR": "abc",
        "KymaNamespace": "default"
      }, {
        "KymaCR": "def",
        "KymaNamespace": "default"
      }]
    }

This resource should then be leveraged by the kcp-watcher control loop to configure the watcher on the SKR.

ACs:

  • Watcher-Repo:
    • Adapt the contract of the listener pkg: kyma-project/runtime-watcher#52
      • Include a list of modules in contract
      • Enhance listener mapping logic to include list of modules in the created k8s generic event
    • Adapt SKR Watcher to include the modules in the contract which is sent to the KCP
  • Kyma-Operator Repo:
    • Use / leverage functions of the KCP Watcher to update the ConfigMap (link) OR reimplement them -> it needs to be checked which is the most suitable solution
    • Requeue the Kyma CR to the reconciliation loop after updating the ConfigMap (see the sketch below)
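
Sketch only: a hypothetical listener endpoint on the KCP side that accepts the payload described above, stores it in the skr-watcher-modules ConfigMap, and requeues the corresponding Kyma CR by emitting a GenericEvent consumed by the controller through a channel source. Payload field names, the ConfigMap namespace, and the Kyma GroupVersionKind are assumptions.

// Sketch: KCP listener endpoint updating the ConfigMap and requeuing the Kyma CR.
package listener

import (
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

type payload struct {
	KymaCR        string   `json:"KymaCR"`
	KymaNamespace string   `json:"KymaNamespace"`
	Modules       []string `json:"Modules"`
}

type Handler struct {
	KCP    client.Client
	Events chan event.GenericEvent // wired into the Kyma controller as a channel source
}

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	var p payload
	if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	ctx := r.Context()

	// Persist the reported modules in the skr-watcher-modules ConfigMap.
	cm := &corev1.ConfigMap{}
	key := types.NamespacedName{Namespace: "kcp-system", Name: "skr-watcher-modules"}
	if err := h.KCP.Get(ctx, key, cm); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	raw, _ := json.Marshal(p)
	cm.Data[p.KymaCR] = string(raw) // keying simplified; the issue proposes <module-name>-<channel>
	if err := h.KCP.Update(ctx, cm); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Requeue the Kyma CR into the reconciliation loop.
	kyma := &unstructured.Unstructured{}
	kyma.SetGroupVersionKind(schema.GroupVersionKind{Group: "operator.kyma-project.io", Version: "v1beta2", Kind: "Kyma"})
	kyma.SetName(p.KymaCR)
	kyma.SetNamespace(p.KymaNamespace)
	h.Events <- event.GenericEvent{Object: kyma}
	w.WriteHeader(http.StatusOK)
}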

POC Kyma operator II

Description

Ensure the technical feasibility of the Kyma operator when applying the new reconciler architecture.


AC:

Support debugging and settings propagation for reconciliations: add an option to enable debug mode during runtime per module CR

Description

The Kyma operator has to support debugging possibilities to enable analysis by on-call engineers and SREs during runtime.

It has to be possible to enable the debug mode for a whole Kyma instance or for particular Kyma modules. It's recommended to make the configuration quite simple, e.g. by offering a debug field in the CR which accepts a boolean value.

The debug: true field should enable the debug mode. If the field is false, the debug mode is disabled.
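
Illustrative only (the final field name and wiring are not decided in this issue): a debug flag on the module CR spec and the corresponding log-level selection, assuming the module operator uses a zap-based logger.

// Sketch: per-module debug flag and log-level mapping.
package v1alpha1

import "go.uber.org/zap/zapcore"

type SampleSpec struct {
	// Debug enables debug-level logging in the module operator for this module.
	// The Kyma-wide flag on the Kyma CR takes precedence over this value.
	Debug bool `json:"debug,omitempty"`
}

// logLevelFor maps the resolved debug flag to a log level.
func logLevelFor(debug bool) zapcore.Level {
	if debug {
		return zapcore.DebugLevel
	}
	return zapcore.InfoLevel // default when debug mode is disabled
}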

AC:

  • Labels are properly filled with the necessary metadata to enable traceability (see comment below)
  • The logger instance of the module operator (respectively the logger used by the operator responsible for the particular Kyma module) has to use the debug log level for any printed log message. This is only possible if the module operator supports parsing the passed labels.
  • If debug mode is disabled, the default log level (probably info) has to be used by the logger instance
  • The debug mode can be configured Kyma-wide by editing the Kyma CR. This will enable / disable the debug mode for all Kyma modules listed in the Kyma CR.
  • The debug mode can be configured per Kyma module by editing the particular module CR. If the debug mode of the Kyma CR changes, the value of the Kyma CR has precedence and overwrites the module-specific value.

Refactor Kyma status.conditions

The purpose of this ticket is to refactor the current Kyma conditions to comply with the standardized conditions proposed by the KEP.

AC

  • The Kyma condition should comply with the standardized conditions proposal.
  • condition.Reason should not be used as a key for tracking the component name.
  • TemplateInfo should persist under Kyma status.

Developer Notes

  • Replace KymaCondition with metav1.Condition (see the sketch below)
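
A sketch of that direction using the standard apimachinery helpers; the condition type and reason strings below are placeholders, not the agreed values.

// Sketch: KEP-style conditions with metav1.Condition and meta.SetStatusCondition.
package status

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setModulesReady(conditions *[]metav1.Condition, generation int64, ready bool) {
	cond := metav1.Condition{
		Type:               "Modules",            // what is being reported
		Status:             metav1.ConditionTrue, // or ConditionFalse
		ObservedGeneration: generation,
		Reason:             "Ready", // machine-readable reason, not the component name
		Message:            "all modules are in state Ready",
	}
	if !ready {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "Processing"
		cond.Message = "one or more modules are not yet Ready"
	}
	// SetStatusCondition updates LastTransitionTime only when the status changes,
	// as required by the KEP on standardized conditions.
	meta.SetStatusCondition(conditions, cond)
}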

Detect broken compatibility when using Alpha-/Beta-dependencies

Description

Sometimes it is required to use an Alpha- or Beta API (e.g. when dealing with dynamic types in K8s). Like we did here.

Such libraries can lose their support in newer K8s versions (this was, for example, the reason for this incident).

To detect such outdated or no-longer-supported libs, it has to be ensured that for each usage of alpha/beta libs, an integration test is defined which verifies the correct behaviour of this code with the same Kubernetes version that runs on production, or will run next on production.

AC:

  • Identify code sections which are using alpha- / beta-libraries (especially relevant for libs coming from K8s community)
  • For each identified section, implement an integration test which is verifying the correct behaviour of the code when using the latest K8s cluster version

Introduce jitter on re-scheduling

If we want to process hundreds of thousands of resources smoothly, we should ensure these objects are distributed "smoothly" in time. For now, we are re-scheduling objects with a fixed time value.
It means that if we have a pile of resources - and a related processing overload peak - we are scheduling objects so that the same peak happens again a couple of seconds later.
A not-perfect but very simple way to mitigate the problem is to add a random "jitter" when re-scheduling, to spread these objects out in time.
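
A minimal sketch of such randomized re-scheduling in a controller-runtime reconciler; the base interval and the ±10% jitter fraction are arbitrary example values.

// Sketch: requeue with jitter instead of a fixed interval.
package queue

import (
	"math/rand"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func requeueWithJitter(base time.Duration) ctrl.Result {
	// Jitter in [-10%, +10%] of the base interval spreads requeues out over time.
	jitter := time.Duration((rand.Float64()*2 - 1) * 0.1 * float64(base))
	return ctrl.Result{RequeueAfter: base + jitter}
}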

Change Kyma Synchronization from SKR and KCP to SuperSet of Modules

The resulting set of all modules should consist of all modules from SKR and KCP combined.

Trade-off: Removal of a module requires removal from both KCP and SKR. Adding a module requires a change in only one place.

AC:

  • Make sure that KCP changes are not overwritten by spec changes from the SKR
  • Make sure that SKR changes are not overwritten by spec changes from the KCP
  • Make sure that SKR + KCP modules (without duplicates) = the total set of modules to be installed (see the sketch after the developer notes)

Developer Notes:

  • It might be necessary to introduce a refactoring of the conditions so that the deletion of removed modules is handled properly
  • The merging of the two CRs from SKR and KCP requires us to continuously maintain a "virtual" set of modules that Kyma is actually installing, thus needing a small refactor of the sync logic
  • The sync needs to be changed so that the KCP CR can still be used to recreate the SKR CR, but is not overwritten from the SKR CR if it was changed by a customer. Instead, it should always use the "virtual" module set fetched from both clusters.
  • In case the virtual module set is not available (e.g. the remote cluster is not reachable), we can no longer retrieve the new desired state, thus producing a new short-circuit error: if the remote cluster cannot be reached, error out the reconciliation and wait for the requeue immediately
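
A minimal sketch of the superset computation (the Module type here is a stand-in for the real Kyma spec.modules entry type): the desired set is the union of the modules declared in the KCP and SKR Kyma CRs, without duplicates.

// Sketch: union of KCP and SKR module lists, deduplicated by name.
package sync

type Module struct {
	Name    string
	Channel string
}

func moduleSuperset(kcpModules, skrModules []Module) []Module {
	seen := map[string]bool{}
	merged := make([]Module, 0, len(kcpModules)+len(skrModules))
	// Copy into a fresh slice so neither input's backing array is mutated.
	for _, m := range append(append([]Module{}, kcpModules...), skrModules...) {
		if seen[m.Name] {
			continue
		}
		seen[m.Name] = true
		merged = append(merged, m)
	}
	return merged
}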

Operational Readiness of new reconciler ecosystem

Description

We have to ensure that the new reconciler system fulfills the operational readiness criteria. This includes all topics required to ensure the product can be successfully operated over time: observability, operational tooling, setup of rollout pipelines, education of the team to deal with incidents, etc.

AC:
