kyma-incubator / reconciler
Kyma reconciler
License: Apache License 2.0
Description
The metrics exporter is responsible for exposing to external monitoring systems (e.g. Prometheus used by SRE) the list of clusters which are currently in a failure, error, or transition state.
This data will be evaluated by the SRE monitoring system and used to reduce false-positive alerts during reconciliation runs: SRE will disable alerting for a particular time range when a cluster enters a transition state.
The output of the metrics exporter is a set of Prometheus metrics containing all clusters which are in a transition or error/failure state, including the date when the transition state started.
ACs:
Reasons
Reduce false-positive alerts on the SRE side for clusters which are currently being reconciled.
Attachments
Description
To increase transparency and make debugging simpler, each log message related to a component-reconciliation process has to include a correlation ID provided by the reconciler-controller. The correlation ID allows mapping reconciler-controller calls to component-reconciler processes.
The log messages should also be stored temporarily on the component-reconciler side (e.g. as a file) so they are available to the reconciler-controller for debugging purposes.
Additionally, the component-reconciler has to send the latest log messages related to a particular reconciliation process to the reconciler-controller as part of the status updates (heartbeat messages) as soon as the process fails or reaches the error state.
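A rough sketch of a correlation-ID-aware logger that also retains the latest messages for heartbeat updates; all names and the retention mechanism are illustrative, not the actual reconciler API:

```go
package main

import "fmt"

// ReconLogger tags every message with the correlation ID received from the
// reconciler-controller and keeps the latest messages so they can be attached
// to a failed heartbeat/status update.
type ReconLogger struct {
	CorrelationID string
	keep          int      // how many recent messages to retain
	recent        []string // latest messages for status updates
}

func NewReconLogger(correlationID string, keep int) *ReconLogger {
	return &ReconLogger{CorrelationID: correlationID, keep: keep}
}

// Logf prefixes the message with the correlation ID, prints it, and records it.
func (l *ReconLogger) Logf(format string, args ...interface{}) string {
	msg := fmt.Sprintf("[correlation-id=%s] %s", l.CorrelationID, fmt.Sprintf(format, args...))
	l.recent = append(l.recent, msg)
	if len(l.recent) > l.keep {
		l.recent = l.recent[len(l.recent)-l.keep:] // drop oldest entries
	}
	fmt.Println(msg)
	return msg
}

// Recent returns the latest messages, e.g. to attach to an error status update.
func (l *ReconLogger) Recent() []string { return l.recent }

func main() {
	log := NewReconLogger("6a18a54c", 2)
	log.Logf("reconciling component %s", "istio")
	log.Logf("component %s failed", "istio")
}
```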
ACs:
Reasons
Improve traceability of reconciliation runs and improve debugging possibilities.
Attachments
Description
The cluster inventory manages all clusters handled by the reconciliation service. The inventory has an interface to KEB to receive information about new clusters or clusters which require an upgrade.
The inventory will store all cluster data in a database and support querying for cluster entities. Each entity has, besides its name and some metadata (like its configuration, component list, etc.), a particular state. The cluster state will be updated several times during its lifetime (each reconciliation causes several cluster data updates).
The inventory will support efficient access to clusters which require any kind of reconciliation (e.g. used by the scheduler) and allow efficient queries for clusters which are in a failed/error or transition state (e.g. required by the metrics exporter).
ACs:
Retrieve kubeconfig files/strings for particular clusters from either Gardener or the underlying KCP cluster.
Reasons
Centralised managing of cluster data.
Attachments
Description
To establish a reliable delivery process, a full end-to-end test is required for the interaction between:
The mothership- and component-reconcilers used run as dedicated services and communicate with each other via REST. The deployment of the mothership- and component-reconcilers has to be based on a Kubernetes deployment (comparable to the deployment used for the productive setup in KCP).
AC:
Reasons
Ensure delivery of fully working reconciliation services, also covering edge-case scenarios.
Attachments
Description
The goal is to use the Golang Kubernetes client API for any interaction with Kubernetes clusters. This has several advantages over shelling out to kubectl.
ACs:
Implement the kubernetesClient interface (see https://github.com/kyma-incubator/reconciler/blob/main/pkg/compreconciler/kubernetes.go#L13) using the Golang Kubernetes client API.
Remove the kubectl-based kubernetesClient (see https://github.com/kyma-incubator/reconciler/blob/main/pkg/compreconciler/kubectl.go) and use the native implementation instead.
Reasons
Consolidation of current Kubernetes interactions and security risk mitigation.
Attachments
Description
The reconciler stores copies of cluster data in its cluster inventory, but the data-leading system is KEB. To avoid discrepancies between the copy the reconciler is using and the single point of truth on the KEB side, a regular synchronisation of cluster inventory data between the reconciler and KEB has to happen (at least once a night).
AC
Reasons
Identify and avoid gaps in cluster data between reconciler and KEB.
Attachments
Description
The reconciliation controller has to be highly scalable, and each reconciliation worker should run in a dedicated goroutine. Creating and dispatching/reusing goroutines has to be handled in an efficient way. The pool size of worker routines has to be configurable.
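A minimal sketch of such a configurable worker pool using plain goroutines and channels (function names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// RunPool runs all jobs on a bounded pool: poolSize goroutines are created
// once and reused for every queued job, so goroutine creation is not paid
// per reconciliation.
func RunPool(poolSize int, jobs []func() string) []string {
	in := make(chan func() string)
	out := make(chan string, len(jobs)) // buffered so workers never block on send
	var wg sync.WaitGroup
	for i := 0; i < poolSize; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range in { // each worker drains the shared job channel
				out <- job()
			}
		}()
	}
	for _, j := range jobs {
		in <- j
	}
	close(in) // no more work: lets the workers terminate
	wg.Wait()
	close(out)
	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	jobs := []func() string{
		func() string { return "cluster-a reconciled" },
		func() string { return "cluster-b reconciled" },
	}
	fmt.Println(RunPool(2, jobs))
}
```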
ACs:
Reasons
Enable scaling of reconciliation workers.
Attachments
Description
An integration test suite is required to verify the correct behaviour of the component reconcilers (considering edge cases).
AC:
Scope of the integration test covers:
Reasons
Verify correct behaviour of component reconcilers in expected edge cases.
Attachments
Description
Provide a basic implementation of Istio configuration using istioctl + configuration template.
Reasons
Since Istio is a special case for component installations right now, we'll attempt to reduce risk and dev time by implementing custom logic to install Istio via istioctl.
Links
kyma-project/kyma#11635
Description
The kubeClient implementation should also support a blocking deletion call for Kubernetes entities. The Delete function should wait until all deleted resources have been fully removed by Kubernetes. See this pull-based example of how to wait until a resource reaches a particular state.
The Delete function of the Client interface should be changed to define whether the client blocks or continues immediately once the resources of a manifest are deleted.
type Client interface {
    Delete(manifest string, blocking bool) error
    ...
}
AC:
Reasons
Currently, the delete call returns immediately, which can cause side effects when the same resource is re-created while the old resource is still terminating.
Attachments
Description
The reconciler is the new approach for installing Kyma on a cluster and will replace the parallel-install module of Hydroform. The Kyma CLI has to be adjusted to replace the used parallel-install API with the Kyma reconciler API.
AC:
Remove the parallel-install module from the Kyma CLI and integrate the reconciler API instead.
Reasons
A consolidated and consistent approach for how Kyma gets installed on clusters.
Attachments
Description
The reconciler requires an operational concept which is aligned with SREs. The concept has to cover:
AC:
Reasons
Ensure the reconciler is addressing operational constraints from SREs properly.
Attachments
Description
The Kyma CLI requires several smaller changes in the reconciler API to get its requirements implemented:
Status updates with status failed and error should include the error as message.
AC:
The error message is part of status updates with error or failed status of the component reconciler.
Reasons
Proper integration of the reconciler API into the Kyma CLI.
Attachments
Description
The progress tracker treats pods which are in "Terminating" state as ready.
Expected result
Only pods in "Running" state should be treated as ready.
Actual result
Pods in "Terminating" state are also treated as ready.
Steps to reproduce
Create a pod, delete it and run progress tracker during termination phase.
Troubleshooting
Description
The reconciler team will offer a framework for external teams to easily implement a component specific reconciler.
ACs:
Reasons
Base framework for the component reconciler.
Attachments
Description
Depending on the state of a cluster, different reconciliation strategies have to be used:
The strategy manager is responsible for defining which logic a reconciliation worker has to execute. It evaluates the cluster state and returns a closure object which wraps the logic the worker has to run.
ACs:
Reasons
Encapsulation of different reconciliation logic blocks.
Attachments
Description
The database layer has to support encryption of sensitive data columns in the database. Encryption has to use a secure algorithm like AES-256 (e.g. https://www.melvinvivas.com/how-to-encrypt-and-decrypt-data-using-aes/).
Key rotation has to be considered, and code should be provided which allows the rotation of a key in a reliable and idempotent way (e.g. by storing the AES key hash as a prefix of the binary entry or similar).
ACs:
Reasons
Sensitive data has to be encrypted to increase security.
Attachments
Description
The reconciler has to support remote administration via the CLI. The communication between the CLI and the mothership-reconciler has to be handled via a REST API.
The REST API has to be specified using the OpenAPI specification (e.g. Swagger, see #116), support secure and trusted communication (HTTPS), and be integrated into the SAP SSO solution (ORY?). Any user action triggered by a client has to be recorded in an audit log.
AC:
Reasons
Establish a standardised tooling to control and administrate the mothership reconciler which fulfils security requirements and is integrated with the SAP SSO system.
Attachments
Description
Code changes committed to the reconciler repository have to be picked up by a CI system (Prow) and trigger a build and a unit-test/e2e-test run.
Reasons
Establish a CI-driven development approach.
Attachments
Description
The kubeconfig is currently not passed to the reconciler from KEB and is missing in the contract.
The contract has to be adjusted, and the kubeconfig value has to be considered when creating new cluster entities in the reconciler.
See the inventory.createCluster function for further details.
Expected result
Kubeconfig is another attribute of a ClusterEntity and stored in the DB.
Actual result
Kubeconfig is missing in models and in DB.
Steps to reproduce
Troubleshooting
Description
The scaffolding script pkg/reconciler/instances/reconcilerctl.sh creates package names with hyphens if the reconciler name also includes a hyphen. Such package names are not allowed in Golang, and the script has to remove them.
Expected result
Valid package names are generated also if the component reconciler name includes a hyphen.
Actual result
Package names with hyphens are generated, which leads to invalid Go code.
Steps to reproduce
Troubleshooting
Description
Currently, the CLI always starts a component reconciler as a standalone microservice which has to be called via its REST API. This makes testing unnecessarily complicated for reconciler teams if they only want to trigger their component reconciler.
To address this disadvantage, the CLI command ./bin/reconciler reconciler test ... has to support the option to start a particular component reconciler embedded (without a surrounding webserver) and to call it directly.
AC:
Reasons
Simplify development and testing of component reconcilers.
Attachments
Description
When deleting resources based on a manifest, namespaces defined in the manifest should be excluded by default and processed in a second step:
After the deletion of all manifest resources is finished, the namespace deletion can happen, but only if no further Kyma resources exist in the namespace. If a namespace is not empty, the deletion is not allowed.
AC:
Reasons
Don't delete namespaces which still include resources.
Attachments
Description
The chart provider is responsible for rendering Helm charts to YAML. The output of the chart provider will be a list of rendered Kyma component objects. Each object includes the rendered Kubernetes YAML and offers functions to verify the installation status of the component (e.g. checking the state of the K8s deployments, pods, etc.).
ACs:
Reasons
Rendering of Helm components is a mandatory feature of the Kyma reconciler.
Attachments
Description
With each reconciliation run, another correlation ID is added to the log messages. The correlation ID should be set just once.
Also, the correlation ID is added as JSON, but the rest of the log message is plain text.
Log messages should be either JSON or plain text.
Expected result
Just one correlation ID is added to log messages. The log message is either JSON or plain text.
Actual result
2021-08-09T18:17:20.735+0200 DEBUG status/status.go:102 Status 'success' successfully communicated: stopping update loop {"correlation-id": "6a18a54c-98da-4998-b934-e50577279097", "correlation-id": "53ecbf04-a427-4e08-bdd7-a828d8aaa148", "correlation-id": "91db97a3-d318-4c6a-bd3d-406ec82fe989", "correlation-id": "f3d0a53d-574f-4150-af87-919a5c1b1208"}
Steps to reproduce
Start the mothership-reconciler and the "helm" component-reconciler. Each triggering of the component-reconciler by the ms-reconciler adds another correlation ID to the log message.
Troubleshooting
Description
The component reconciler REST API is not described as an OpenAPI specification (Swagger spec). This has to be changed, and the exposed REST API and related models should be generated based on the specification.
The pattern of supporting different API versions in parallel should not be removed: the URL should still include an indicator of the used API version (e.g. https://host/v1/...).
Please consider #109 before implementing this feature, as it also impacts the REST API implementation and both tickets have to be aligned.
AC:
Reasons
Establishing an OpenAPI specification makes API consumption easier for clients, adds code-generator capabilities for REST API models and middleware code, and can simplify discussions about API changes.
Attachments
Description
The status tracker has the responsibility to record the progress/state of the ongoing cluster reconciliation (e.g. the state of deployments/pods which were updated [tbc]).
Each reconciliation worker has a communication channel to the status tracker and sends progress updates. The status tracker will track each status change / installation result per cluster and store these data in a change log. The purpose is to give full transparency about the status of modified cluster resources.
In case of a restart of the reconciliation service, the status tracker information can be used to identify non-finished reconciliation processes, and the scheduler can re-schedule them.
After a reconciliation process is finished, the status tracker will update the cluster state in the cluster inventory.
ACs:
Reasons
Track status changes of cluster resources during a reconciliation run.
Attachments
Description
Currently, the namespace is passed to the Helm chart rendering, but it is not guaranteed that the provided namespace will be properly used in all resources declared by the chart.
To enforce the usage of the correct namespace, the deployment logic has to set the namespace provided by KEB. It also has to ensure that this namespace is created before the component gets deployed.
Expected result
The namespace provided by KEB always has to take precedence over the namespaces defined by a component. If the namespace doesn't exist, it has to be created before the component gets deployed.
A unit test has to be implemented which sets a different namespace than the one defined by the component.
Actual result
Namespace given by KEB is not enforced before deploying a Kyma component.
Steps to reproduce
Troubleshooting
Description
We should be able to start multiple mothership reconcilers without the risk of potential race conditions. Currently, these parts are identified as potential conflict points:
Potential solutions:
The detection of race conditions will be handled by using standard DB features (isolation level + primary keys):
Before setting a cluster to reconciling, the ms-reconciler has to create an operation for the cluster (handled in a dedicated DB table; the cluster-ID will be a unique value/PK). Only if no entry for this cluster exists will the DB create a new operation entry, and the ms-reconciler is allowed to continue. Otherwise, the DB will report the violation of the unique-key constraint, and the ms-reconciler will know that another ms-reconciler has already picked this cluster.
The ms-reconcilers pick up operations which are in status new and were not yet assigned to a component reconciler, OR which are in status orphaned (the ms-reconcilers will have a mechanism which updates running operations to status orphaned if they haven't received an update for longer than X minutes).
An operation is claimed by updating its status from new (respectively orphaned) to in_progress. Before triggering the component reconciler, the ms-reconciler has to ensure that the update was successful (affected rows == 1) by using an update query which considers the operation + previous status (e.g. UPDATE operation SET status='reconciling' WHERE operation-id=$1 AND status=$2).
AC:
Running operations without status updates for longer than X minutes are set to orphaned.
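The optimistic locking described above can be illustrated with an in-memory stand-in for the two DB guarantees (unique key on insert, affected-rows check on conditional update); the real implementation would issue the SQL statements quoted above, and all names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// OperationRegistry emulates, in memory, the DB behaviour the design relies
// on: a unique key on the cluster-ID when creating an operation, and a
// conditional UPDATE whose affected-row count tells a mothership-reconciler
// whether it won the race.
type OperationRegistry struct {
	mu     sync.Mutex
	status map[string]string // cluster/operation ID -> status
}

func NewOperationRegistry() *OperationRegistry {
	return &OperationRegistry{status: map[string]string{}}
}

// Create corresponds to the INSERT: it fails if an operation for the cluster
// already exists (a unique-key violation in the real DB).
func (r *OperationRegistry) Create(clusterID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.status[clusterID]; exists {
		return false // another ms-reconciler already picked this cluster
	}
	r.status[clusterID] = "new"
	return true
}

// Claim corresponds to
//   UPDATE operation SET status='in_progress' WHERE operation-id=$1 AND status=$2
// and succeeds only if exactly one row was affected.
func (r *OperationRegistry) Claim(id, fromStatus string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.status[id] != fromStatus {
		return false // someone else changed the status first
	}
	r.status[id] = "in_progress"
	return true
}

func main() {
	r := NewOperationRegistry()
	fmt.Println(r.Create("cluster-1"), r.Create("cluster-1")) // second create loses the race
}
```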
Reasons
Become scalable for mothership-reconcilers without having risks of race conditions.
Attachments
Description
The scheduler is responsible for reacting to clusters which require a transition (e.g. pending for an upgrade, installation, or a regular reconciliation). The scheduler queries the cluster inventory to retrieve such clusters.
It requests the reconciliation logic (strategy) for each cluster from the strategy factory and passes both (the cluster data + the strategy) to the worker pool.
ACs:
Reasons
Identify clusters which require a reconciliation, pass them to the worker pool, and track all changes applied to a cluster and their results.
Attachments
Description
Boolean flags in charts are not properly handled. Example from the tracing chart:
{{- if .Values.virtualservice.enabled }}
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: {{ template "jaeger-operator.fullname" . }}
  labels:
{{ include "jaeger-operator.labels" . | indent 4 }}
spec:
  hosts:
    - jaeger.{{ .Values.global.ingress.domainName }}
...
Such a resource should not be rendered if you set virtualservice.enabled to false.
Expected result
Virtual service is not rendered.
Actual result
Error:
Default-reconciliation of 'tracing' with version 'main' failed: Failed to get manifest for component 'tracing' in Kyma version 'main': Failed to render HELM template for component 'tracing': template: tracing/templates/kyma-additions/virtualservice.yaml:10:21: executing "tracing/templates/kyma-additions/virtualservice.yaml" at <.Values.global.ingress.domainName>: nil pointer evaluating interface {}.domainName
Steps to reproduce
Run reconciler with such command:
./bin/reconciler-darwin local --components tracing --value tracing.virtualservice.enabled=false
Links
This PR demonstrates the issue: #139
Description
The reconciler has to be executed continuously, and end-to-end tests have to be executed.
The test should include:
Reasons
Continuous testing as a best-practice development approach.
Attachments
Description
This can be achieved by replacing the ticker-channel with a channel which is used by the operation registry to send status-update to the worker. The mothership-reconciler should track the cluster status on base of these conditions:
RECONCILING status: cluster state becomes RECONCILING
ERROR status: cluster state becomes ERROR
SUCCESS status: cluster state becomes READY
(Also documented in the Wiki)
Additionally, the naming of the shared processing ID between mothership-reconciler and component-reconciler should be aligned. In the mothership-reconciler the variable is usually called operationID, but on the component-reconciler side the name correlationID is used. This should be aligned to one name to make the code base more consistent.
ACs:
Reasons
Reduce the risk of lost data caused by time-based updates, and make the code base easier to read by using consistent naming.
Attachments
Description
It turned out that the call of the manifest renderer for Istio does not return the fully rendered manifest (Job resources were missing).
Expected result
Manifest includes all resources of a Helm chart.
Actual result
The manifest does not include all resources. E.g. the Helm chart for Istio also defines Job resources which are missing in the manifest result.
Steps to reproduce
Render the manifest resources for the Helm Istio chart and compare the result with the YAML charts inside the Istio component. Access to the rendered manifests is possible by setting a breakpoint in the runner.install() function where the manifests are rendered.
Troubleshooting
Description
The reconciliation executes different logic depending on the cluster state (e.g. install, upgrade, or reconcile the Kyma cluster). Each logic block has to be implemented as a strategy which can be passed between Go entities. The strategy has to be added to the strategy manager.
ACs:
Reasons
Encapsulation of different reconciliation logic.
Attachments
Description
Running the local reconciliation via CLI (bin/reconciler local) and pressing CTRL+C leads to an interrupt event (the execution context gets cancelled), but the process doesn't shut down properly.
Only pressing CTRL+C a second time triggers the hard shutdown which finally stops the execution.
Expected result
Pressing CTRL+C the first time should trigger a graceful shutdown and the process should stop running (at least a clean shutdown should be visible).
Actual result
Pressing CTRL+C once has no impact on the running process.
Steps to reproduce
Start the local reconciliation via CLI (bin/reconciler local) and press CTRL+C: the shutdown happens only after triggering a hard shutdown (i.e. after pressing CTRL+C a second time).
Troubleshooting
Description
The current REST API exposed by the mothership reconciler is not described in an OpenAPI specification (Swagger spec). This has to be changed, and the exposed REST API and related models should be generated based on the specification.
The pattern of supporting different API versions in parallel should not be removed: the URL should still include an indicator of the used API version (e.g. https://host/v1/...).
AC:
Reasons
Establishing an OpenAPI specification makes API consumption easier for clients, adds code-generator capabilities for REST API models and middleware code, and can simplify discussions about API changes.
Attachments
Description
In order to mark the "Eventing" component as ready, there needs to be a mechanism to give this information back to the client of the mothership who triggered the reconciliation for Eventing.
One way to do that:
Watch the EventingBackend CR and check for the overall readiness status. This field is reported back to the client of the statusURL.
Reasons
Attachments
Description
Currently, the workspace factory checks only for directory names to decide whether a workspace exists. This is not sufficient if the clone of the Git repository was interrupted and couldn't finish.
The clone process should create a marker file after the clone has successfully completed. If the file exists, it can be assumed the workspace is ready to use; otherwise, the next goroutine should delete the workspace and clone it again.
See https://github.com/kyma-incubator/reconciler/blob/main/pkg/workspace/factory.go#L72 - check for the marker file instead of the directory itself.
AC:
Expected result
Interrupted Git clones don't cause incomplete workspaces. Such workspaces are automatically renewed when detected.
Actual result
Steps to reproduce
Troubleshooting
Description
For the rollout of the reconciler (which is a prerequisite before Kyma 2 can go live), a cutover plan is required.
The plan has to cover:
AC:
Reasons
Provide transparency to all teams and make the rollout properly manageable and traceable.
Attachments
Description
Kyma has a few mandatory resources which have to be installed before any other component can be installed. These are:
To ensure both resources are available, the mothership-reconciler has to take care that
After the reconciliation of CRDs and Istio has finished, all other components listed in the component list of the cluster will be reconciled.
AC:
Reasons
Ensure resources which are common pre-requisites for Kyma components are made available before the reconciliation of these components starts.
Attachments
Description
The reconciler requires a security concept to ensure authentication, authorisation, and auditing/accounting are established.
The goal is to fulfil SAP security requirements, which cover:
[User-ID is provided as HTTP header] - see #291
ACs:
Reasons
Fulfil SAP security requirements.
Attachments
Description
Checkmarx is currently not part of the regular CI pipeline checks. This step has to be added to the CI pipeline of the reconciler.
AC:
Reasons
Finding of the security threat modelling workshop and required to fulfil security requirements.
Attachments
Description
Component reconcilers can define dependencies to other components. Such dependencies can vary between Kyma versions.
Currently the mothership reconciler (MSR) is notified about missing dependencies by the component reconcilers (CR). Such CRs will be triggered again by the MSR as soon as other components were successfully reconciled.
The MSR and CRs have to be enriched to handle dependencies between CRs more efficiently.
AC:
If a CR reconciliation fails (error status), the depending CR won't be called for this reconciliation run.
Reasons
Establish simple dependency management to mothership reconciler to detect failure-cases and reduce amount of failing requests to component reconcilers.
Attachments
Description
The scheduler sets the cluster state to reconciling even when the component-reconciler is not reachable. This is basically acceptable, but it should retry connecting to the component-reconciler or set the cluster state to failed.
Expected result
Cluster state is either not changed or set to failed when the component-reconciler cannot be reached.
Actual result
The cluster state is set to reconciling, but the scheduler doesn't retry to connect to the component-reconciler. The inventory also does not return this cluster when querying for "clusters to reconcile" because it considers only clusters which are in error or failed state or which are too old.
Steps to reproduce
Start ms-reconciler and register a new cluster without starting the required component-reconcilers.
Troubleshooting
Description
Currently, the Hydroform API parallel-install is used for rendering Helm templates. As only the Helm templating functionality is required by the reconciler, the overhead of calling the parallel-install API is no longer justified.
The goal is to cherry-pick the Helm templating code and migrate it into the reconciler. Afterwards, the dependency on Hydroform has to be removed.
AC:
The Helm templating code is migrated from the parallel-install API of Hydroform.
The dependency on the parallel-install API is removed from the reconciler go.mod file.
Reasons
Reduce execution overhead and avoid the indirect dependencies coming with the Hydroform parallel-install API.
Attachments
Description
The following risks related to the data layer used on the reconciler side have to be mitigated:
AC:
Reasons
Increase security on database layer.
Attachments
Description
Currently, the update of the cluster state happens at a predefined interval (e.g. every 10 seconds). This has the benefit of a linearly scaling database load depending on the number of parallel running reconciliations, but also the disadvantage that in case of an outage, the mothership-reconciler won't update the cluster status within the given interval window.
It's possible to reduce the load on the database by establishing an event-based cluster-status update approach: the operation registry informs the workers in the mothership-reconciler when it's worth updating the cluster state. The operation registry can make intelligent decisions (e.g. by changing the cluster state only if there is a high likelihood that the state won't change again soon) and reduce the number of applied status changes.
Cluster states will be updated by these rules:
The ERROR cluster state is set immediately when >= 1 component reconciler reports an ERROR status.
The READY cluster state is set immediately when all component reconcilers report a SUCCESS status.
Transitions from RECONCILING to RECONCILE_FAILED are only triggered if a component-reconciler has reported a FAILED status >= 2 times.
Transitions from RECONCILE_FAILED to RECONCILING are triggered if all failing component-reconcilers reported a SUCCESS status and other component-reconcilers are still running.
AC:
Establish an event-based status-update mechanism (e.g. a channel between worker and operation-registry).
Reasons
Switch to event-based cluster-state updates by letting the operation-registry decide when it's time for a cluster-status change.
Attachments
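The update rules above can be sketched as a pure transition function; the status values come from the rules, while the function name and the plain-slice representation are illustrative (the real operation registry tracks these per operation):

```go
package main

import "fmt"

// nextClusterState applies the four update rules to the current cluster state,
// the component-reconciler statuses, and the highest FAILED-report count seen
// for any component in this run.
func nextClusterState(current string, statuses []string, maxFailedCount int) string {
	allSuccess := true
	anyError := false
	for _, s := range statuses {
		if s == "ERROR" {
			anyError = true
		}
		if s != "SUCCESS" {
			allSuccess = false
		}
	}
	switch {
	case anyError:
		return "ERROR" // rule 1: set immediately
	case allSuccess:
		return "READY" // rule 2: set immediately
	case current == "RECONCILING" && maxFailedCount >= 2:
		return "RECONCILE_FAILED" // rule 3: only after repeated FAILED reports
	case current == "RECONCILE_FAILED" && maxFailedCount == 0:
		return "RECONCILING" // rule 4: failing reconcilers recovered, others still running
	}
	return current // no state change worth applying yet
}

func main() {
	fmt.Println(nextClusterState("RECONCILING", []string{"SUCCESS", "ERROR"}, 0))
}
```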
Description
The creation of the HTTP server instance for component-reconcilers happens inside the component reconciler instance (see here).
In the mothership, the HTTP server and routing are created as part of the CLI command. To be consistent, the same pattern should be used for the component-reconcilers: configuring and starting the HTTP server should happen as part of the related CLI command.
AC:
Reasons
Establish standardised approach for creating HTTP interfaces of the mothership- and component-reconcilers.
Attachments
Description
Security scanners have to be enabled for any implemented code in Kyma. The reconciliation code base has to be scanned by SAP security scanners regularly.
Reasons
Required by company policy.
Attachments