
lifecycle-manager's Introduction


Lifecycle Manager

Kyma is an opinionated set of Kubernetes-based modular building blocks that includes the necessary capabilities to develop and run enterprise-grade cloud-native applications. Kyma's Lifecycle Manager is a tool that manages the lifecycle of these modules in your cluster.

Modularization

Lifecycle Manager was introduced along with the concept of Kyma modularization. With Kyma's modular approach, you can install just the modules you need, giving you more flexibility and reducing the footprint of your Kyma cluster. Lifecycle Manager manages clusters using the Kyma custom resource (CR). The CR defines the desired state of modules in a cluster. With the CR you can enable and disable modules. Lifecycle Manager installs or uninstalls modules and updates their statuses. For more details, read about the modularization concept in Kyma.

Basic Concepts

See the list of basic concepts relating to Lifecycle Manager to understand its workflow better.

  • Kyma custom resource (CR) - represents Kyma installation in a cluster. It contains the list of modules and their state.
  • ModuleTemplate CR - contains modules' metadata with links to their images and manifests. ModuleTemplate CR represents a module in a particular version. Based on this resource Lifecycle Manager enables or disables modules in your cluster.
  • Manifest CR - represents resources that make up a module and are to be installed by Lifecycle Manager. The Manifest CR is a rendered module enabled on a particular cluster.
  • Module CR, such as Keda CR - allows you to configure the behavior of a module. This is a per-module CR.

For the workflow details, read the Architecture document.

Quick Start

Follow this quick start guide to set up the environment and use Lifecycle Manager to add modules.

Prerequisites

To use Lifecycle Manager in a local setup, install the following:

  • k3d (requires a running Docker daemon)
  • kubectl
  • Kyma CLI

Steps

  1. To set up the environment, provision a local k3d cluster and install Kyma. Run:
k3d registry create kyma-registry --port 5001
k3d cluster create kyma --kubeconfig-switch-context -p 80:80@loadbalancer -p 443:443@loadbalancer --registry-use kyma-registry
kubectl create ns kyma-system
kyma alpha deploy
  2. Apply a ModuleTemplate CR. Run the following kubectl command:
kubectl apply -f {MODULE_TEMPLATE.yaml}

TIP: You can use any deployment-ready ModuleTemplates, such as cluster-ip or keda.

  3. Enable a module. Run:
kyma alpha add module {MODULE_NAME}

TIP: Check the modular Kyma interactive tutorial to play with enabling and disabling Kyma modules in both terminal and Busola.

Read More

Go to the Table of Contents in the /docs directory to find the complete list of documents on Lifecycle Manager. Read those to learn more about Lifecycle Manager and its functionalities.


lifecycle-manager's Issues

POC Module Registry / Listing: One-Way Sync of ModuleTemplateList into SKR

To enable introspection of all available modules, we want to offer a convenient way to know which modules can be installed in an SKR. For this, we want to enable a write-only synchronization that copies any active ModuleTemplate CR in the KCP into the runtime cluster. That way Busola, the CLI, and other independent tools can use it for introspection and for building features such as value help or discovery.

Trade-Offs:

  • A newly introduced, changed, or deleted ModuleTemplate can potentially trigger expensive control loops, so we have to consider this feature performance-relevant.
  • There is currently no way to determine whether the ModuleTemplate has been maliciously changed in the cluster between reconciliations

AC

  • All ModuleTemplates in Cluster are synced into SKR
  • Any change of ModuleTemplates in the SKR is ignored
  • Any ModuleTemplate change in Control-Plane is reconciled into SKR
  • Module-Template Synchronization is done in a separate Control-Loop to avoid cluttering the Kyma-Operator Control-Loop
  • A deleted Module Template needs to be removed from all SKRs
  • We should include a flag to disable this synchronization in case it turns out to be too demanding for big control-planes (see trade-off above)
  • In local mode, we should also create that list, but not sync it into an SKR

Developer Notes:

  • The control loop needs to synchronize based on the Kyma CR (which it watches), list all ModuleTemplates, and reconcile them (see the sketch after this list).
  • The control loop needs a change handler for new or changed ModuleTemplates and writes them into the SKR on change.
  • There should be a reasonable, but not overly aggressive, time interval for refreshing the ModuleTemplates of the remote clusters
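
A minimal sketch of such a control loop, assuming the operator.kyma-project.io/v1beta2 API group and a "lifecycle-manager" field owner (not the actual implementation): it lists all ModuleTemplates in the KCP and server-side-applies them into the SKR, so SKR-side edits are overwritten on the next reconciliation.

// Sketch: one-way ModuleTemplate sync from KCP into an SKR.
package sync

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type ModuleTemplateSyncReconciler struct {
	KCP client.Client // control-plane cluster
	SKR client.Client // runtime cluster
}

func (r *ModuleTemplateSyncReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	templates := &unstructured.UnstructuredList{}
	templates.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "operator.kyma-project.io", Version: "v1beta2", Kind: "ModuleTemplateList",
	})
	if err := r.KCP.List(ctx, templates); err != nil {
		return ctrl.Result{}, err
	}
	for i := range templates.Items {
		tpl := templates.Items[i].DeepCopy()
		// Drop cluster-specific metadata before writing the object into the SKR.
		tpl.SetResourceVersion("")
		tpl.SetUID("")
		tpl.SetManagedFields(nil)
		// Server-side apply keeps the KCP as the single source of truth: SKR-side
		// edits are overwritten on the next reconciliation, as required by the ACs.
		if err := r.SKR.Patch(ctx, tpl, client.Apply,
			client.FieldOwner("lifecycle-manager"), client.ForceOwnership); err != nil {
			return ctrl.Result{}, err
		}
	}
	// Refresh on a fixed interval in addition to event-driven reconciliation.
	return ctrl.Result{RequeueAfter: 10 * time.Minute}, nil
}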

Create a new Operator/Control-Loop responsible for establishing lesser privileged Kubeconfigs for SKR Interaction

As a control-plane operator, I want to make sure that all operators interacting with runtimes managed through the Control Plane have only limited access. I want to do this to:

  • Ensure that all operators can only work with the resources they declare
  • Ensure that no operator can achieve privilege escalation
  • Ensure that no malicious attacker can achieve privilege escalation by gaining control of a privileged operator

As such, I want to establish a new way of reading and distributing kubeconfigs.

AC:

  • Every Operator managed by Kyma-Operator should receive new Kubeconfigs
  • Every new Kubeconfig should be rotated out regularly
  • Every Kubeconfig should be limited to a set of Privileges based on preset RBAC Information
  • The RBAC information required for every module depends on its business needs, so it needs to be configurable (e.g. one controller might require read-only access to Pods, while another might even touch CRDs)
  • The controller should expect a "master" kubeconfig, or privileged access in the Control Plane, and should run in a different namespace, under a different name, and with different roles than all other control loops
  • The master kubeconfigs should ideally never be persisted in the cluster; if they have to be, they should be stored in a safe namespace
  • The generated kubeconfigs should bind a user via a RoleBinding to the lesser-privileged roles
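
One way the last AC could be realized (a rough sketch; the ServiceAccount name, namespace, and token lifetime are assumptions): request a short-lived token for a ServiceAccount that is bound via a RoleBinding to the lesser-privileged Role, then wrap the token in a kubeconfig that is handed to the module operator.

// Sketch: mint a restricted, short-lived kubeconfig for an SKR.
package access

import (
	"context"
	"time"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

func restrictedKubeconfig(ctx context.Context, cs kubernetes.Interface, apiServer string, caData []byte) ([]byte, error) {
	expiry := int64((24 * time.Hour).Seconds())
	// The ServiceAccount "module-operator" is assumed to exist and to be bound
	// via a RoleBinding to the lesser-privileged Role required by the module.
	token, err := cs.CoreV1().ServiceAccounts("kyma-system").CreateToken(ctx, "module-operator",
		&authenticationv1.TokenRequest{Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &expiry}},
		metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}
	cfg := clientcmdapi.NewConfig()
	cfg.Clusters["skr"] = &clientcmdapi.Cluster{Server: apiServer, CertificateAuthorityData: caData}
	cfg.AuthInfos["module-operator"] = &clientcmdapi.AuthInfo{Token: token.Status.Token}
	cfg.Contexts["default"] = &clientcmdapi.Context{Cluster: "skr", AuthInfo: "module-operator"}
	cfg.CurrentContext = "default"
	// Rotation is achieved by regenerating this kubeconfig before the token expires.
	return clientcmd.Write(*cfg)
}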

Load Test for demo scenario for >10k Clusters

Implement a demo load test scenario and verify that the load test setup works as expected.

This task has the following parts:

1. Load test script

  • Written as k6 script, a demo can be found here: https://github.com/ruanxin/performance-test/blob/main/load_generator.js
    • Load test the remote watching: watch resources on X remote clusters and ensure the reported customer changes are properly tracked and considered by the Kyma Operator.
    • TBC (possibly already done by PB): Implement a test case to measure the maximum number of watchable resources before K8s stops delivering all change events => sync with @ebensom about limitations SREs were already facing in their setup

2. Target Cluster under testing

  • Provision 2 Gardener clusters, one as the control plane, one as the SKR cluster
  • Deploy prometheus-operator into the control plane cluster
  • Create a Grafana dashboard with some demo panels to monitor predefined metrics
  • Create a Helm chart defining a dummy resource that simulates the behavior of resources defined in Kyma components
  • Deploy kyma-operator and manifest-operator into the control plane cluster
  • Deploy the skr-operator into the SKR cluster, which manages the reconciliation of target resources
  • Configure a remote OCI registry to store component descriptor files
  • Run the load test against the control plane cluster, targeting 10,000 Kyma CRs with 200,000 modules
    • ⛔ 10% of the modules should be KCP modules — no KCP modules are planned for Phase 1

3. Conclusion

  • Present measurements/results to the team => point out identified thresholds

4. Findings and follow-up issues

Notes

  • Rewrite manifest-operator to overwrite the deployed Helm chart resource name with the Kyma CR name as a suffix
    • operator/internal/pkg/prepare/deploy.go:106
    • make sure the overwritten name is a unique identifier
  • The load test should not run until the listener and watcher are integrated

Lifecycle Manager delivery

Description
The Lifecycle Manager is an essential component in the new reconciler architecture. It follows the K8s operator pattern, reacting to data changes (defined in CRs).

The Lifecycle Manager is responsible for the lifecycle management of Kyma installations (that is, the components configured for a Kyma cluster) and for determining the Kyma installation status (aka cluster status).

Acceptance criteria

  • #77
  • #78
  • #49
  • Documentation of the architectural design
  • Operational requirements are covered
    • #79
    • Exposing analysis and statistic relevant data via Prometheus endpoint
    • Logging of critical events
    • Auditlog is integrated
    • Operational test was executed and troubleshooting guides were created
  • Threat modelling findings implemented (see #144)
  • Microdelivery process established

Issues
kyma-project/kyma#13759

Appendix

  • CRD = Custom Resource Definition
  • CR = Custom Resource
  • BYOC = Bring your own cluster
  • SKR = SAP Kyma Runtime

Reference implementation of a CI/CD pipeline with Gherkin support

Description
Each component in the new reconciler architecture requires CI/CD pipelines. Most of the components have quite similar requirements for their pipelines. Therefore, a reference pipeline is required which implements the basic CI/CD features (build, validate, test etc.). This reference implementation can be reused by other reconciler components to reduce implementation efforts.

AC:

  • Align with the Pandas team to set up Gherkin testing as the main test framework
  • Integrate code checks into CI/CD pipelines (linting etc.)

POC: Manifest Operator

Description
The POC of the Manifest Operator has to verify the technical feasibility and close potential knowledge gaps.

AC:

The following points have to be verified:

  • Verify that X Manifest resources can be handled by the Manifest Operator, processed accordingly, and scale horizontally (covered by the load test)
  • State handling of the Manifest CR is verified
    • Create a reference implementation for component reconciliation (based on the Helm kubeclient)
    • CRUD operations and error handling implemented
  • Minimal Manifest CRD definition defined (a sketch follows after this list)
  • Reconciler loop takes care of synchronizing SKR module CRs
    • Merge of remotely retrieved module CRs in case of updates
  • Test framework established (will be done with Gherkin)
  • High-level security review (no threat modelling)
  • Implement shared manifest library (SML)
    • Provide interface definition (API spec)
  • Load dynamically referenced kubeconfig from the CR
  • Reference implementation for caching of rendered manifests
  • Overrides and profiles are considered by the manifest rendering (deals with overrides for CRs and for Helm charts)
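
The "minimal Manifest CRD" item above could look roughly like the following Go API type. This is purely illustrative; the field names and API group are assumptions, not the final contract.

// Sketch: a minimal Manifest CRD as a Go API type.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ManifestSpec describes the rendered resources of a single module for one cluster.
type ManifestSpec struct {
	// Reference to the Secret holding the SKR kubeconfig (see "Load dynamically
	// referenced kubeconfig from the CR" above).
	KubeconfigSecretRef string `json:"kubeconfigSecretRef,omitempty"`
	// OCI reference to the chart / manifest layer to install.
	InstallRef string `json:"installRef"`
	// Optional overrides merged into the rendered chart values.
	Overrides map[string]string `json:"overrides,omitempty"`
}

// ManifestStatus tracks the installation state of the module resources.
type ManifestStatus struct {
	State      string             `json:"state,omitempty"` // e.g. Processing, Ready, Error, Deleting
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

type Manifest struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ManifestSpec   `json:"spec,omitempty"`
	Status ManifestStatus `json:"status,omitempty"`
}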

Reason
Validate technical feasibility of Manifest operator and close knowledge gaps.

POC: Template Operator - Use Kubebuilder Declarative Pattern to improve the Manifest Library for easy Reconciliation

Research the kubebuilder declarative approach to hide reconciliation complexity from a template-operator user.
With this declarative approach, the reconciliation complexity and status (or state) handling should ideally happen implicitly as part of the offered framework. For the installation of target resources, it could use the manifest library as an option.
https://github.com/kubernetes-sigs/kubebuilder-declarative-pattern

AC:

  • template-operator is set up inside lifecycle-manager/samples/template-operator/operator with a custom resource (e.g. Sample)
  • README.md is updated to contain installation information
  • declarative library and types are implemented in the module-manager library
  • Sample controller is implemented using the reconciliation and status handling logic from the declarative library
  • Sample CR API status subresource refers to the Status object type from the declarative library
  • Options to interact with the declarative library's reconciliation: add labels to Sample CR, transform manifest resources, resolve chart information, verify installed information - to be implemented and included in the Sample controller

Pull requests:
kyma-project/module-manager#67
#123
kyma-project/module-manager#79
kyma-project/module-manager#80
#136

POC: Kyma-Operator

Description
The POC of the Kyma Operator has to verify the technical feasibility and close potential knowledge gaps.

AC:

The following points have to be verified:

  • Verify that X Kyma resources can be handled by the Kyma Operator, processed accordingly, and scale horizontally
  • Dynamic watching of module templates (module CR changes trigger reconciliation of the Kyma CR)
  • State handling is verified
    • Watching module CRs
    • Reference implementation of state aggregation available
    • CRUD operations and error handling implemented
  • Minimal Kyma CRD definition defined
  • Reconciler loop takes care of synchronizing SKR Kyma CRs
    • Merge remotely retrieved SKR Kyma CR with the KCP Kyma CR
  • Release channel parsing (propagation of module versions)
  • Can work with ModuleTemplates containing CNUDIE / OCM module descriptors for resolving and parsing OCI image references for bundling (includes support for overwrites provided from outside)
  • Test framework established: will be done using Gherkin
  • High-level security review (no threat modelling)

Reason
Validate technical feasibility of Kyma operator and close knowledge gaps.

Threat modelling for new reconciler architecture

The new reconciler architecture has to pass a threat modelling workshop as part of the SAP product standards. The threat modelling workshop has to be planned and executed together with security experts and findings have to be solved before the Go-Live happens.

AC:

  • Plan and execute threat modelling workshop with security experts
  • Track findings in dedicated issue tickets and mark go-live relevant issues accordingly

Implementing Monitoring and Observability

Description

To ensure operational readiness of the new reconciler product, a comprehensive monitoring and observability solution is essential. We want to support operational aspects like observability and tracing by design and incorporate them early on.

AC

Developer Notes

  • Metrics, including but not limited to (see the sketch after these notes):
    • Successful processing duration, 95th percentile, for each module reconciliation
      • The duration of one module state change from Error or Processing to Ready.
      • Design this metric as a histogram
    • Operator memory and CPU usage (in % and peak %)
    • Operator worker-queue-related metrics (from kubebuilder)
    • Operator reconciliation error and success counts (from kubebuilder)
  • The dashboard data file should be persistent and configured in a Grafana dashboard ConfigMap; current control-plane repo reference
  • Contact Huskies to clarify who takes responsibility and how to do the integration of Jaeger and Kiali.
    • Jaeger is not in a production-ready state
    • Jaeger will be replaced with OpenTelemetry
    • Operator Listener/Watcher is not a typical use case for tracing
    • Conclusion: Don't integrate tracing tool
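
For illustration, the module state-change histogram from the notes above could be registered with the controller-runtime metrics registry, so it is exposed on the same Prometheus endpoint as the kubebuilder worker-queue and reconcile metrics. The metric name, label, and buckets are assumptions.

// Sketch: module state-change duration as a Prometheus histogram.
package observability

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var moduleStateChangeDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "lifecycle_mgr_module_state_change_duration_seconds",
		Help:    "Duration of a module state change from Error/Processing to Ready.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s .. ~34min
	},
	[]string{"module"},
)

func init() {
	// controller-runtime already serves worker-queue and reconcile error/success
	// metrics from this registry; we only add the module-specific histogram.
	ctrlmetrics.Registry.MustRegister(moduleStateChangeDuration)
}

// ObserveStateChange is called when a module transitions to Ready.
func ObserveStateChange(module string, since time.Time) {
	moduleStateChangeDuration.WithLabelValues(module).Observe(time.Since(since).Seconds())
}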

Related PR

kyma-project/module-manager#113

Template Operator: Developer Makefile Tooling: spin up a minimalistic Control Plane Development Environment

Spin up a minimalistic control-plane development environment with Kyma Operator and Manifest Operator in a new control plane, with a connection to a given SKR in the runtime.

  • Create a new Makefile Command in the template-operator
  • Make sure that the Command is using Kyma CLI in the latest Release Version as Binary
  • Spin up a second Cluster that is used as Control Plane
  • Install Kyma-Operator and Manifest-Operator with hardwired release versions into the Control Plane (kyma deploy)
  • Create a default Kyma CR with the module name from the template generated by the template-operator, and a Secret with a kubeconfig referencing the SKR cluster, derived from the current $KUBECONFIG variable in the script
  • Install the ModuleTemplate from the generation command in #106 into the Control Plane

Define Kyma CRD

Description
CRDs are the data schema of CR resources and build the contract between the Kyma Operator and other KCP components. The requirements of the Kyma Operator (especially with regard to the Kyma CRD) are defined and aligned with other KCP components (e.g. KEB).

AC:

  • Final CRD is defined and documented (in code as comments)
  • CRD is aligned with KEB team
  • Support debug-flag propagation (required to enable extended log-messages in Manifest Operator for particular Kyma components)

POC Kyma operator II - refactor module template management

AC

  • Define a new flag for SKR/KCP differentiation
  • Updates to module CRs for KCP installations are not reapplied
  • Apply updates from the module template to the manifest for SKR installation
  • The spec update to the manifest will be overwritten by the latest module template (not implemented in this issue)
  • If a module CR / Manifest CR in KCP is deleted, it gets recreated from the module template
  • CRDs are passed to the manifest based on a new layer on the descriptor with name/type "crds", and it is an OCIRef (meaning a layer of an image somewhere on an OCI registry)
    • crds is defined under the field descriptor.resources.type
    • ❓ the type of crds should be yaml
  • CR of the module template gets propagated to the manifest for remote installation

Developer Notes

  • Introduce a new flag in operator/api/v1alpha1/moduletemplate_types.go to differentiate SKR/KCP installation (see the sketch after this list)
  • Use this new flag to determine whether an update for the module CR needs to be applied, in operator/pkg/watch/template_change.go:24
  • Write tests to verify that module CR recreation works
  • Verify with Manifest Operator team members that the generated CR can be consumed correctly
  • Samples in config get updated according to the new design
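
For illustration, the new flag could be shaped like this hypothetical excerpt from operator/api/v1alpha1/moduletemplate_types.go; the field name Target and its values are assumptions, not the final API.

// Sketch: SKR/KCP differentiation flag on the ModuleTemplate spec.
package v1alpha1

type ModuleTemplateSpec struct {
	// ... existing fields such as Channel and Data ...

	// Target differentiates between SKR and KCP installation. Module CR updates
	// for "kcp" installations are not reapplied; "skr" installations receive
	// module template updates through the Manifest CR.
	// +kubebuilder:validation:Enum=skr;kcp
	// +kubebuilder:default:=skr
	Target string `json:"target,omitempty"`
}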

Create CI pipeline to run mandatory SAP security scanners

Description
The reconciler code base has to be scanned by the SAP security scanners. CI/CD-pipelines have to be created to run the mandatory scanner tools.

Mandatory scanners are:

  • Protecode
  • Checkmarx
  • Whitesource

The code bases in the following repositories have to be scanned by these tools:

AC:

  • Scanner tools are updated to support scanning the Kyma Operator repository and Manifest Operator repository (has to be aligned with @aberezovski)
  • CI/CD pipelines are added which run the SAP security scanners (Whitesource, Protecode, Checkmarx) regularly for the Kyma Operator and Manifest Operator code bases. These pipelines are triggered at least once a day (Mo-Fr).

Reason
Be compliant with SAP product standards.

MVP: Infrastructure Operators

Description

The Infrastructure Operators are essential components in the new reconciler architecture. They follow the K8s operator pattern, reacting to data changes (defined in CRs).

The Infrastructure Operators will be responsible for the lifecycle management of managed Kubernetes clusters (SKR clusters) and their underlying network setup.

Acceptance criteria

  • Create dedicated repositories for infrastructure operators' code base in the kyma-project organisation
  • CRDs are the data schema of CR resources and build the contract between the Provisioner Operator and other KCP components. The requirements of the infrastructure operators (especially in regards to the Cluster CRDs or how kubeconfig data are stored) are defined and aligned with other KCP components (e.g. KEB).
  • The operators have to react on Cluster CRs and, depending on the resource change,
    • create/update/delete network configurations used by the Gardener cluster
    • create/update/delete Kubernetes clusters by interacting with Gardener
  • Store/update the kubeconfig of a created Kubernetes cluster in a dedicated resource (e.g. a Secret resource)
  • Operational requirements are covered
    • Exposing analysis and statistic-relevant data via Prometheus endpoint
    • Logging of critical events
    • Auditlog is integrated
    • Operational test was executed and troubleshooting guides were created
  • Documentation of the architectural design
  • Security threat-modelling passed
  • Microdelivery process established
  • Load test executed and thresholds / limitations are identified

Issues
kyma-project/kyma#13759

Appendix

  • CRD = Custom Resource Definition
  • CR = Custom Resource
  • BYOC = Bring your own cluster
  • SKR = SAP Kyma Runtime

POC: Watching and triggering actions for cluster resources

Due to the change in the reconciler concept and scenario, the description of this issue has been edited to fit the new concept. Edited 30.06.2022 @jeremyharisch
(Switching from the reconciler mothership to a cloud-native approach: instead of a per-cluster watcher tower, a Kyma Watcher based on a Kubernetes operator communicates with the management plane.)

Context

The main context to be set is that the user is able to set/edit configurations inside the SKR cluster and make them available on the management plane.

To achieve this, every cluster contains a component called Watcher that reports detected changes to the management control plane. The central Kyma Operator (established here) on the management control plane relays the changes to the corresponding configuration CRs and thus triggers a reconciliation.

(Diagram: watcher workflow network architecture)

Key Requirements (for MVP):

Stability

  • Every cluster that is subject to change detection must be reconcilable at any given time, and access to the cluster status must be ensured, given administrative access to the cluster
  • Change updates of an individual cluster must have no measurable impact on the availability of the Kubernetes API server, both from the SKR cluster and from the management plane
  • A failure in change detection of a given cluster must have no influence on the change detection process for any other cluster with change detection
  • A change of the change detection rules required for cluster operations must correlate to a revision of the Kyma runtime version, allowing only one ruleset per Kyma version
  • A change detection event can be sent to any number of interested parties without impacting other change notifications
  • Any given change discovered by the change detection can be discarded based on the type of change to avoid information congestion
  • Every cluster can be ignored for change detection during reconciliation to avoid false positives on changes caused by the reconciliation process, or at least it has to be verified whether the change is applicable
  • The change detection may only pick up Kyma-related changes to relevant resources and should ignore changes caused by independent processes running inside the cluster

Reliability

  • A cluster should be able to notify about detected changes via at least 2 different ways of communication to ensure resilient reconciliations
  • The change detection can run in high-availability, failure-tolerant environments without consistency issues by using independent containers for its orchestration
  • The change detection should guarantee to inform about detected changes on an at-least-once basis, removing duplicates on a best-effort basis
  • Should change detection fail due to the operator becoming unresponsive, a centrally orchestrated service should be able to set up a new Watcher operator
  • Should change detection fail without a chance of recovery, the change detection should fall back to notifying about future changes and should be able to continue operations
  • A configuration change picked up by the change detection should result in all detected changes being delivered under the old revision of the configuration, ensuring reliable operations during Kyma upgrades

Security

  • Changes in a cluster are not allowed to be shared with other clusters
  • Changes can only be delivered to registered services trusted ahead of provisioning, on the basis of certificates and mutual trust
  • Changes cannot be accessed by any component inside the cluster except cluster administrators with a specific RBAC rule
  • A change history should never be stored centrally outside of the given processing time and API server responses, and can never be persisted
  • Change delivery can only occur with third-party services trusted ahead of provisioning, on the basis of certificates and mutual trust
  • The Kyma Operator is considered an external service and has to fulfill the same guarantees of trust as any other component that requires to be notified of changes

Maintainability

  • Configuration changes that cause behaviour differences in change detection must be handled at runtime and must not cause maintenance incidents
  • Change Detection can be paused for individual clusters or groups based on common attributes defined through the configuration
  • Change Detection can be partially disabled for individual change groups through a Kyma Component

Performance

  • Change detection should be flawlessly managed for over 10,000 clusters within a change detection system architecture. The resources required for change detection are defined through the individual components
  • A change detection agent running inside a cluster should not consume more than 1% of a given node's performance at any given time
  • A change detection agent should run at most once on any given node in a cluster, either through affinity or daemonset
  • A change detection agent should not be responsible for maintaining persisted state over a change or of notifying other components of the change
  • A change detected in a cluster must not contain more than 1 MB of data for notification of other parties

Operability

  • A change detected in a cluster must be fully traceable once processed and has to be uniquely identified for traceability based on the kubernetes resource and change time
  • The Kyma watch operator must be horizontally scalable individually while respecting the requirements to performance and reliability
  • The Kyma watch operator must be able to show the current configuration state they are running in at any given time
  • The Kyma watcher operator must track metrics for detected changes and react to these metrics accordingly with adjustments to their behaviour in performance and reliability

AC for POC

  • Implement a simple watch mechanism reacting to particular resource changes (marked by defined labels and APIGroups); see the sketch after these ACs
    • Changes are reported to the Listener component, including information about the resource name and namespace
    • Map newly created/adapted resources to corresponding components
    • Dynamic watching of configured K8s resources (e.g. ConfigMaps, Kyma CR)
    • DESCOPED for POC: Dynamic approach on how to watch on configured labels; can change during runtime
    • DESCOPED for POC: What would happen if the watchable resources increase with time. Rate limiting or comparison based on filtered set timestamps?
  • Implement listener modules (part of Kyma Operator and Manifest Operator) which receive detected changes reported by the watcher component
    • Establish endpoint to retrieve changes reported from watcher (Kyma Operator)
    • Establish endpoint to retrieve changes reported from watcher (Manifest Operator)
    • An event handler module retrieves the change per module operator
    • The event handler triggers the reconciliation of an updated module
    • OPTIONAL (depends on security requirements): Find a solution to store new events in case the kyma-operator shuts down, or use at-least-once delivery
    • Implement networking solution to distribute incoming SKR-events to corresponding listeners
  • The POC was reviewed by the security expert and approved
    • Optional: Provide alternative communication approach to pass security check
    • Investigate/define infrastructure-related parts required to ensure secure communication between watcher and listener (reverse proxy setup etc.)
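
A rough sketch of such a watch mechanism (not the POC code): dynamic informers on configured GroupVersionResources, filtered by a label, reporting the name and namespace of changed objects to a listener endpoint. The label key, the GVR list, and the listener URL are placeholders.

// Sketch: label-filtered dynamic watch that reports changes to a listener.
package watcher

import (
	"bytes"
	"context"
	"encoding/json"
	"net/http"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

type changeEvent struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	Kind      string `json:"kind"`
}

func report(listenerURL string, ev changeEvent) {
	body, _ := json.Marshal(ev)
	// A real implementation would need retries/backoff and mutual TLS.
	_, _ = http.Post(listenerURL, "application/json", bytes.NewReader(body))
}

func Watch(ctx context.Context, client dynamic.Interface, listenerURL string, gvrs []schema.GroupVersionResource) {
	// Only pick up Kyma-managed resources, identified by a label.
	factory := dynamicinformer.NewFilteredDynamicSharedInformerFactory(
		client, 10*time.Minute, metav1.NamespaceAll,
		func(opts *metav1.ListOptions) { opts.LabelSelector = "operator.kyma-project.io/managed-by=kyma" },
	)
	handler := cache.ResourceEventHandlerFuncs{
		// Add/Delete handlers omitted for brevity.
		UpdateFunc: func(_, newObj interface{}) {
			u := newObj.(*unstructured.Unstructured)
			report(listenerURL, changeEvent{Name: u.GetName(), Namespace: u.GetNamespace(), Kind: u.GetKind()})
		},
	}
	for _, gvr := range gvrs {
		factory.ForResource(gvr).Informer().AddEventHandler(handler)
	}
	factory.Start(ctx.Done())
	<-ctx.Done()
}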

POC Kyma operator II - refactor Kyma CR management

AC

  • Kyma CR is synchronised from the SKR as the single source of truth
    • Deleted CRs on the SKR get recreated from the KCP CR (handled in #90)
    • If the KCP CR is deleted, the Kyma installation gets uninstalled
    • Ignore updates on the Kyma CR spec from KCP to SKR
    • #93: Remove logic for configuration overrides in the Kyma CR and passing the label selector to the Manifest Operator
    • #95
  • Status aggregation works by combining all Manifest CR states with all module CR states and writing them into the Kyma status (alignment blocked by changes necessary in #90 to create multiple paths for SKR/KCP installation first)
    • CRDs are verified to contain a status.state field

Developer Notes

  • Refactor kyma_synchronization_context.go
    • Refactor SynchronizeRemoteKyma to only keep the logic that syncs from SKR to KCP
  • Write tests to verify if status aggregation works

Target (SKR) objects name/namespace configuration overrides.

This is needed for the Load Testing purposes, but it looks like a general feature that the Users would like to have.

For the Load Testing we have:
10 000 Kymas with 20 components each, every component handled by a Manifest CR.
This results in 200 000 Manifest objects.
Every Manifest CR causes an installation of some Helm chart in the target cluster.
Let's assume this Helm chart just contains a custom CR, a single k8s object.
200 000 Manifests then mean 200 000 custom CRs installed in the target cluster(s).
How do we control names/namespaces of these objects?

Note: The number of target objects deployed in the single target cluster of course depends on how many target clusters we have. For 4 target clusters we have 200 000 / 4 => 50 000 objects. But this doesn't matter for the discussed feature.

I am putting some proposals below. Feel free to extend it.

  1. A single namespace for all CRs

    In that scenario every object is installed in the same namespace in the target cluster. Then the names of these objects must be different to avoid collisions. This brings two questions:

    • How and where do we generate names for these objects? Of course Helm chart could just "randomize" the name, but it's not the reality of how Helm charts are written. We should expect a fixed name in the chart, with an ability to override it. Can we define such an override in the Kyma object?
    • How do we configure custom state checking function for every object (different names)? We somehow need to know the name upfront.
  2. Different namespaces

    • 2.1. Different target namespace per Kyma object

      This is the setup we used previously. For every Kyma object in the control plane, we've created a separate namespace in the target (SKR) cluster. All the custom CRs related to a single Kyma object and installed in the target cluster ended up in the same namespace. So for 10 000 Kymas we've had 10 000 namespaces. It would be nice to be able to configure this namespace's name somehow - we've just used the name of the Kyma object by customizing the Manifest Operator code.
      Because we have different namespaces, names of components (20 of them) must be unique inside the namespace, but can be otherwise fixed i.e. component 1 can always be named like "component1" in all namespaces - no collisions.

    • 2.2. Different target namespace for every single CR

      Extreme case, but this setup most closely corresponds to the "real" SKR. After all, in the real installation, most of the Kyma components are installed in their separate namespaces. For the 200 000 Manifests and related 200 000 custom CRs we need 200 000 namespaces in the target cluster(s). As a consequence all the components may even have the same fixed name, like "load-test-component", and there are no collisions because every one is deployed in a different namespace.

The question is:

  • Which of the above can we set up at this moment with just the code we have?
  • Which of the above do we consider a valid and useful scenario that could be easily implemented even during the POC?

Prepare a template operator

A template operator is a reusable operator which can be leveraged by component teams to create their own operators, interact with the kyma-operator, and thereby be a part of the modularisation framework.

AC:

  • Prepare a template operator skeleton project
  • Implement a template CR managed by this operator
  • Implement basic state handling for the template CR
  • Showcase operator best practices
  • Write documentation on how to reuse this operator and integrate it with the kyma-operator

Implement Kyma Operator

The Kyma Operator is responsible for reacting to changes applied to the Kyma CR (see #77) and triggering the required CR updates.

AC:

  • The operator is watching these Kyma CRs and, depending on the resource change, creating/updating/deleting related component CRs and reacting to modules by creating manifest CRs
  • Kyma status tracking
    • Aggregating conditions based on manifest status to Kyma status
    • Dynamic enqueueing of Kyma instances based on failure and processing state
    • Detect and react to component status change for aggregation in Kyma conditions (list of conditions indicating whether Kyma installation is finished)
  • Module template handling
    • Watching module templates based on channel, profile and module name
    • Parsing module templates based on OCM component descriptor API
    • Validating and defaulting webhooks to correctly validate OCM descriptor on input
    • Translation of OCM descriptor model into manifest API
  • Allow overwrites
    • Allow module based overwrites for channels in Kyma
    • Allow channel based overwrites in Kyma
  • SKR interaction
    • Accepting listener request for reconciliation of Kyma instance (tracked in kyma-project/runtime-watcher#2)
    • Verify and fetch Kubeconfig for SKR based on secrets (tracked in #12)

Template Operator: Developer Makefile Tooling for Building ModuleTemplates

To support developers writing their own modules, we have to make sure that they can integrate seamlessly into the existing Kyma module infrastructure. That's why we want to introduce generation tooling in the Template Operator (a Makefile command) that creates a ModuleTemplate from the operator, with a Helm chart based on the kustomize values of the kubebuilder project and a default values YAML used for the ModuleTemplate. In the end, it should just be another command that outputs the finished, bundled module.

AC:

  • Create a new Makefile Command in the template-operator
  • Make sure that the Command is using Kyma CLI in the latest Release Version as Binary
  • Trigger Module Building by converting the kustomize output into the different layers and chart of the CLI
  • Make sure that output module template is saved into a directory ignored by .gitignore
  • Make sure that module template can be used for installation into a control-plane cluster
  • Make sure that the ModuleTemplate generated by the Make command above is pushed to a predefined registry hardwired in the Makefile

Threat Modelling of Manifest Operator

The Manifest Operator has to pass a threat modelling workshop as part of the SAP product standards. The threat modelling workshop has to be planned and executed together with security experts and findings have to be solved before the Go-Live happens.

The scope for this threat modelling workshop is primarily the Manifest operator and its interaction with remote Kubernetes clusters.

AC:

  • Plan and execute threat modelling workshop with security experts
  • Track findings in dedicated issue tickets and mark go-live relevant issues accordingly

Release process and delivery pipelines for new reconciler ecosystem

Description

The new reconciler product consists of multiple different services and resources (e.g. Kyma Operator, Manifest Operator, Kyma CRDs etc.). The creation of releases and the lifecycle management of the different components of this product can be challenging (especially with regard to configuration options and updates). A concept is required for how all resources have to be deployed and properly configured in Kubernetes to get the product working (potential solutions could be the Helm package manager, Kyma CLI, etc.).

Additionally, a lightweight end-to-end test is required to verify that the new reconciler product was successfully installed and is ready to work.

Finally, SREs have to be ramped up and enabled to implement a delivery pipeline in their CI/CD system (Spinnaker) to automate the rollout and update of the new reconciler product to the KCP DEV/ STAGE / PROD landscapes.

AC:

  • Define a concept for how the lifecycle of the reconciler will be managed / how the release process works
    • The concept is aligned with the team and SREs
  • Necessary tooling for the lifecycle management is configured / implemented (e.g. CLI command etc.)
  • SREs are ramped up and tooling is passed over so that SREs can configure their CI/CD system to deliver the new reconciler product to KCP landscapes
  • Simple E2E test case is implemented to verify the correct deployment of the reconciler product
  • End-user documentation (main audience is SRE and developer) is available -> see kyma-project/community#700

Reason
Establish a solution for lifecycle management of the new reconciler ecosystem.

Remove reconciler-control-plane-image-bump after reconciler deprecated

Description
After the reconciler is deprecated, remove the reconciler-control-plane-image-bump job and related code in the test-infra repo.
related code:

prow/scripts/resources/autobumper

AC:

  • Pipeline reconciler-control-plane-image-bump was removed

Reason
Clean up pipelines of the old reconciler architecture which act on the control-plane repo.

Remote reconciliation strategy

The purpose of this issue is to define how we want to architect and implement the remote reconciliation.

The remote reconciliation assumes:

  • Component operators are running in the Control Plane
  • Component operator CRs are created in the Control Plane
  • Running instances of component operators are reconciling Kyma components in remote clusters (customer clusters)

The challenges found initially:

  • Scaling of the solution. Consider one hundred clusters are to be upgraded starting Tuesday morning. How long does it take overall, how long does it take for a single cluster on average and what's the latency in CR status updates?
  • Do we want to control the ordering of upgrades between clusters? See below for explanation.
  1. Scaling of the solution
    The operators can't scale. There's always only one instance of an operator. How do we reconcile 100 clusters in parallel using just a single golang process?

    • goroutines won't work - do you have a Node performant enough?
    • horizontal scaling won't work - you can't increase replicas for an operator
    • what about status update latency? We should update the status on CRs to something like "Pending" ASAP, for every CR, regardless of the time it takes to actually upgrade the related clusters. It means the actual workload should somehow be scheduled to the "background" or to some external entity, and the operator should process CRs without any substantial delay.

    Possible solutions:

    • just measure what we can achieve right now, if it's not enough, try to find a solution :)
  2. Ordering of upgrades problem - latency, resources.

    Assume we have just three component operators: A, B, C.
    Let's also assume all component operators perform the reconciliation step in a similar time.
    Assume we have 10 clusters to upgrade.
    We can easily show that the order of processing between component operators does change the observed latency and resource allocation.

    First model (worst-case):

    Steps in time | comp. A is processing | comp. B is processing | comp. C is processing | clusters ready after the step
    1             | cluster 1             | cluster 1             | cluster 10            | -
    2             | cluster 2             | cluster 2             | cluster 9             | -
    3             | cluster 3             | cluster 3             | cluster 8             | -
    4             | cluster 4             | cluster 4             | cluster 7             | -
    5             | cluster 5             | cluster 10            | cluster 6             | -
    6             | cluster 6             | cluster 9             | cluster 5             | -
    7             | cluster 7             | cluster 8             | cluster 4             | 4
    8             | cluster 8             | cluster 7             | cluster 3             | 4, 8, 7, 3
    9             | cluster 9             | cluster 6             | cluster 2             | 4, 8, 7, 3, 9, 6, 2
    10            | cluster 10            | cluster 5             | cluster 1             | 4, 8, 7, 3, 9, 6, 2, 10, 5, 1

    In this model, we are actively reconciling up to 6 clusters at any time, and the average latency for a cluster to be ready is 8.8 (in the artificial "Steps in time" units)

    Second model (best-case):

    Steps in time | comp. A is processing | comp. B is processing | comp. C is processing | clusters ready after the step
    1             | cluster 1             | cluster 1             | cluster 1             | 1
    2             | cluster 2             | cluster 2             | cluster 2             | 1, 2
    3             | cluster 3             | cluster 3             | cluster 3             | 1, 2, 3
    4             | cluster 4             | cluster 4             | cluster 4             | 1, 2, 3, 4
    5             | cluster 5             | cluster 5             | cluster 5             | 1, 2, 3, 4, 5
    6             | cluster 6             | cluster 6             | cluster 6             | 1, 2, 3, 4, 5, 6
    7             | cluster 7             | cluster 7             | cluster 7             | 1, 2, 3, 4, 5, 6, 7
    8             | cluster 8             | cluster 8             | cluster 8             | 1, 2, 3, 4, 5, 6, 7, 8
    9             | cluster 9             | cluster 9             | cluster 9             | 1, 2, 3, 4, 5, 6, 7, 8, 9
    10            | cluster 10            | cluster 10            | cluster 10            | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

    In this model, we are actively reconciling only 1 cluster at a time, and the average latency for a cluster to be ready is 5.5 (in the artificial "Steps in time" units).

    Notice that the amount of work to be done is the same in both models, just the ordering makes the difference.
    Our current proposed strategy is: every component operator is independent and processes clusters in its own order/pace.
    This unfortunately leads more towards the first execution model, which is not optimal.

    Another thing is resource consumption. If we actively reconcile 6 different clusters at a time, we need 6 different k8s golang clients, 6 HTTPS streams, and so on.
    Finding a way to share such resources between operators (a common external worker, for example), along with processing clusters in some synchronized order, would reduce resource consumption along with the latency.

    As always, the benefits from the possible improvements here may not justify the effort - it should be measured.

POC Kyma operator II - module deletion

AC:

  • A module removed from the Kyma CR (in KCP and SKR) should delete the related modules; this includes modules for all targets (remote, control plane)
  • A Kyma CR in KCP marked for deletion should delete the Kyma in the SKR and all modules
  • The Kyma CR in the SKR should be protected by a finalizer (see the sketch after the developer notes)
  • The Kyma should be recreated immediately after the finalizer is removed
  • Kyma creation is based on the spec.sync condition in the KCP Kyma CR

Developer Notes

  • A module removed from the Kyma CR means:
    • the related entry in spec.modules gets deleted
  • After a module is deleted, the related entry in status.conditions should also be deleted
  • The Kyma CR in the SKR is not managed by a controller reference
  • Confirm that the controller reference deletion works as expected
  • Refactor the status.conditions creation logic
    • Initiate all existing module conditions when the Kyma CR is created
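
A sketch of the finalizer protection mentioned in the ACs, using the controller-runtime controllerutil helpers; the finalizer name and the surrounding deletion logic are assumptions.

// Sketch: finalizer handling for the Kyma CR during deletion.
package deletion

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const kymaFinalizer = "operator.kyma-project.io/kyma"

func handleDeletion(ctx context.Context, c client.Client, kyma client.Object, modulesGone bool) (ctrl.Result, error) {
	if kyma.GetDeletionTimestamp().IsZero() {
		// Normal reconciliation: make sure the finalizer is present.
		if !controllerutil.ContainsFinalizer(kyma, kymaFinalizer) {
			controllerutil.AddFinalizer(kyma, kymaFinalizer)
			return ctrl.Result{}, c.Update(ctx, kyma)
		}
		return ctrl.Result{}, nil
	}
	// Kyma CR is marked for deletion: only remove the finalizer (and thereby let
	// the object go away) once all related module / Manifest CRs are deleted.
	if !modulesGone {
		return ctrl.Result{Requeue: true}, nil
	}
	controllerutil.RemoveFinalizer(kyma, kymaFinalizer)
	return ctrl.Result{}, c.Update(ctx, kyma)
}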

Configuration Management - User and central configuration

After the sample application from the Watching POC, we want to deduce some base CRs necessary for all components and then decide whether we want additional configuration overrides. Right now, the overrides can be left out for simplicity, but it is still open where we want to allow a configuration override from the SKR.

AC:

  • Conceptualise base CRs necessary for all Components
  • Assess additional Configuration Overrides

Setup pipelines for continuous integration of Kyma Operator

Description
Continuous integration is mandatory and best practice to ensure a working and verified code base and to detect breaking changes. A CI/CD pipeline has to be created which validates the Kyma Operator sources whenever a PR is opened to the Kyma Operator repository (especially to the main branch).

The pipeline has to execute the following actions / code checks:

  • Compile sources
  • Run linter
  • Execute unit tests

AC:

  • CI/CD pipeline is created which is checking the codebase of each PR opened to the main Kyma Operator repository
  • The pipeline executes the following actions:
    1. Compile the sources
    2. Run code linter
    3. Execute unit tests
  • A failing pipeline leads to a Slack notification to the existing Jellyfish notification-channel.

Reason
Set up continuous integration in the Kyma Operator development process.

Performance Test for new reconciler architecture

For the long term, a regular performance check of the reconciliation product is required. The load test should be executed periodically / per release as part of the CI/CD.

The goal is to gain information about how the general performance of the product evolves. The statistics should indicate whether the general product performance has increased or decreased between releases.

AC:

Concept

  • Define Performance Test Concept
  • Align concept with team

Implementation

  • Implement the load test scenario
    • #18
    • Bootstrap test infrastructure (e.g. create a new cluster or use hibernation in Gardener to stop/start landscapes)
    • Run load test scenarios
    • Shut down test infrastructure
      • Delete generated test resources
      • Tear down test infrastructure (remove generated resources, stop test infrastructure)
  • Show results in a Grafana dashboard and make the data available in the long term (enables us to compare how the performance profile evolves over time)
  • Setup CI/CD pipeline which executes the load test regularly (at least once per week and per new release)

POC Implement listener for kyma-operator

Once the Kyma CR in the SKR has a generation change, the kyma-watcher in the SKR should call the registered kyma-operator endpoint with the .spec.modules payload from the SKR. This payload should then update the skr-watcher-modules ConfigMap in the KCP.

The skr-watcher-modules ConfigMap should have the following structure:
(same as kcp-watcher-modules)

kind: ConfigMap
<...>
data:
  <module-name>-<channel>: |
    {
      "KymaCRList": [{
        "KymaCR": "abc",
        "KymaNamespace": "default"
      }, {
        "KymaCR": "def",
        "KymaNamespace": "default"
      }]
    }

This resource should then be leveraged by the kcp-watcher control loop to configure the watcher on the SKR.

ACs:

  • Watcher-Repo:
    • Adapt the contract of the listener pkg: kyma-project/runtime-watcher#52
      • Include a list of modules in contract
      • Enhance listener mapping logic to include list of modules in the created k8s generic event
    • Adapt SKR Watcher to include the modules in the contract which is sent to the KCP
  • Kyma-Operator Repo:
    • Use / leverage functions of the KCP Watcher to update the ConfigMap (link) OR reimplement them -> it needs to be checked which is the most suitable solution
    • Requeue the Kyma CR to the reconciliation loop after updating the ConfigMap (see the sketch below)
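
Sketch only: a hypothetical listener endpoint on the KCP side that accepts the payload described above, stores it in the skr-watcher-modules ConfigMap, and requeues the corresponding Kyma CR by emitting a GenericEvent consumed by the controller through a channel source. Payload field names, the ConfigMap namespace, and the Kyma GroupVersionKind are assumptions.

// Sketch: KCP listener endpoint updating the ConfigMap and requeuing the Kyma CR.
package listener

import (
	"encoding/json"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

type payload struct {
	KymaCR        string   `json:"KymaCR"`
	KymaNamespace string   `json:"KymaNamespace"`
	Modules       []string `json:"Modules"`
}

type Handler struct {
	KCP    client.Client
	Events chan event.GenericEvent // wired into the Kyma controller as a channel source
}

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	var p payload
	if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	ctx := r.Context()

	// Persist the reported modules in the skr-watcher-modules ConfigMap.
	cm := &corev1.ConfigMap{}
	key := types.NamespacedName{Namespace: "kcp-system", Name: "skr-watcher-modules"}
	if err := h.KCP.Get(ctx, key, cm); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	raw, _ := json.Marshal(p)
	cm.Data[p.KymaCR] = string(raw) // keying simplified; the issue proposes <module-name>-<channel>
	if err := h.KCP.Update(ctx, cm); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Requeue the Kyma CR into the reconciliation loop.
	kyma := &unstructured.Unstructured{}
	kyma.SetGroupVersionKind(schema.GroupVersionKind{Group: "operator.kyma-project.io", Version: "v1beta2", Kind: "Kyma"})
	kyma.SetName(p.KymaCR)
	kyma.SetNamespace(p.KymaNamespace)
	h.Events <- event.GenericEvent{Object: kyma}
	w.WriteHeader(http.StatusOK)
}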

POC Kyma operator II

Description

Ensure the technical feasibility of the Kyma operator when applying the new reconciler architecture.


AC:

Support debugging and settings propagation for reconciliations: add an option to enable debug mode during runtime per module CR

Description

The Kyma operator has to support debugging possibilities to enable analysis by on-call engineers and SREs during runtime.

It has to be possible to enable the debug mode for a whole Kyma instance or for particular Kyma modules. It's recommended to make the configuration quite simple, e.g. by offering a debug field in the CR which accepts a boolean value.

The debug: true field should enable the debug mode. If the field is false, the debug mode is disabled.
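
Illustrative only (the final field name and wiring are not decided in this issue): a debug flag on the module CR spec and the corresponding log-level selection, assuming the module operator uses a zap-based logger.

// Sketch: per-module debug flag and log-level mapping.
package v1alpha1

import "go.uber.org/zap/zapcore"

type SampleSpec struct {
	// Debug enables debug-level logging in the module operator for this module.
	// The Kyma-wide flag on the Kyma CR takes precedence over this value.
	Debug bool `json:"debug,omitempty"`
}

// logLevelFor maps the resolved debug flag to a log level.
func logLevelFor(debug bool) zapcore.Level {
	if debug {
		return zapcore.DebugLevel
	}
	return zapcore.InfoLevel // default when debug mode is disabled
}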

AC:

  • Labels are properly filled with the necessary metadata to enable traceability (see comment below)
  • The logger instance of the module operator (respectively the logger used by the operator responsible for the particular Kyma module) has to use the debug log level for any printed log message. This is only possible if the module operator supports parsing the passed labels.
  • If debug mode is disabled, the default log level (probably info) has to be used by the logger instance
  • The debug mode can be configured Kyma-wide by editing the Kyma CR. This will enable / disable the debug mode for all Kyma modules listed in the Kyma CR.
  • The debug mode can be configured per Kyma module by editing the particular module CR. If the debug mode of the Kyma CR changes, the value of the Kyma CR has precedence and overwrites the module-specific value.

Refactor Kyma status.conditions

The purpose of this ticket is to refactor the current Kyma conditions to comply with the standardized conditions proposed by the KEP.

AC

  • The Kyma condition should comply with the standardized conditions proposal.
  • condition.Reason should not be used as a key for tracking the component name.
  • TemplateInfo should persist under Kyma status.

Developer Notes

  • Replace KymaCondition with metav1.Condition (see the sketch below)
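
A sketch of that direction using the standard apimachinery helpers; the condition type and reason strings below are placeholders, not the agreed values.

// Sketch: KEP-style conditions with metav1.Condition and meta.SetStatusCondition.
package status

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func setModulesReady(conditions *[]metav1.Condition, generation int64, ready bool) {
	cond := metav1.Condition{
		Type:               "Modules",            // what is being reported
		Status:             metav1.ConditionTrue, // or ConditionFalse
		ObservedGeneration: generation,
		Reason:             "Ready", // machine-readable reason, not the component name
		Message:            "all modules are in state Ready",
	}
	if !ready {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "Processing"
		cond.Message = "one or more modules are not yet Ready"
	}
	// SetStatusCondition updates LastTransitionTime only when the status changes,
	// as required by the KEP on standardized conditions.
	meta.SetStatusCondition(conditions, cond)
}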

Detect broken compatibility when using Alpha-/Beta-dependencies

Description

Sometimes it is required to use an Alpha- or Beta API (e.g. when dealing with dynamic types in K8s). Like we did here.

Such libraries can lose their support in newer K8s versions (this was, for example, the reason for this incident).

To detect such outdated or no-longer-supported libs, it has to be ensured that for each usage of alpha/beta libs, an integration test is defined which verifies the correct behaviour of this code with the same Kubernetes version that runs on production, or will run next on production.

AC:

  • Identify code sections which are using alpha- / beta-libraries (especially relevant for libs coming from K8s community)
  • For each identified section, implement an integration test which is verifying the correct behaviour of the code when using the latest K8s cluster version

Introduce jitter on re-scheduling

If we want to process hundreds of thousands of resources smoothly, we should ensure these objects are distributed "smoothly" in time. For now, we are re-scheduling objects with a fixed time value.
It means that if we have a pile of resources - and a related processing overload peak - we are scheduling objects so that the same peak happens again a couple of seconds later.
A not-perfect but very simple way to mitigate the problem is to add a random "jitter" when re-scheduling, to spread these objects out in time.
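
A minimal sketch of such randomized re-scheduling in a controller-runtime reconciler; the base interval and the ±10% jitter fraction are arbitrary example values.

// Sketch: requeue with jitter instead of a fixed interval.
package queue

import (
	"math/rand"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func requeueWithJitter(base time.Duration) ctrl.Result {
	// Jitter in [-10%, +10%] of the base interval spreads requeues out over time.
	jitter := time.Duration((rand.Float64()*2 - 1) * 0.1 * float64(base))
	return ctrl.Result{RequeueAfter: base + jitter}
}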

Change Kyma Synchronization from SKR and KCP to SuperSet of Modules

The resulting set of all modules should consist of all modules from SKR and KCP combined.

Trade-off: Removal of a module requires removal from both KCP and SKR. Adding a module requires a change in only one place.

AC:

  • Make sure that KCP changes are not overwritten by spec changes from the SKR
  • Make sure that SKR changes are not overwritten by spec changes from the KCP
  • Make sure that SKR + KCP modules (without duplicates) = the total set of modules to be installed (see the sketch after the developer notes)

Developer Notes:

  • It might be necessary to introduce a refactoring of the conditions so that the deletion of removed modules is handled properly
  • The merging of the two CRs from SKR and KCP requires us to continuously maintain a "virtual" set of modules that Kyma is actually installing, thus needing a small refactor of the sync logic
  • The sync needs to be changed so that the KCP CR can still be used to recreate the SKR CR, but is not overwritten from the SKR CR if it was changed by a customer. Instead, it should always use the "virtual" module set fetched from both clusters.
  • In case the virtual module set is not available (e.g. the remote cluster is not reachable), we can no longer retrieve the new desired state, thus producing a new short-circuit error: if the remote cluster cannot be reached, error out the reconciliation and wait for the requeue immediately
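
A minimal sketch of the superset computation (the Module type here is a stand-in for the real Kyma spec.modules entry type): the desired set is the union of the modules declared in the KCP and SKR Kyma CRs, without duplicates.

// Sketch: union of KCP and SKR module lists, deduplicated by name.
package sync

type Module struct {
	Name    string
	Channel string
}

func moduleSuperset(kcpModules, skrModules []Module) []Module {
	seen := map[string]bool{}
	merged := make([]Module, 0, len(kcpModules)+len(skrModules))
	// Copy into a fresh slice so neither input's backing array is mutated.
	for _, m := range append(append([]Module{}, kcpModules...), skrModules...) {
		if seen[m.Name] {
			continue
		}
		seen[m.Name] = true
		merged = append(merged, m)
	}
	return merged
}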

Operational Readiness of new reconciler ecosystem

Description

We have to ensure that the new reconciler system fulfills the operational readiness criteria. This includes all topics required to ensure the product can be successfully operated over time: observability, operational tooling, setup of rollout pipelines, education of the team to deal with incidents, etc.

AC:
