
enhancement-proposals's Introduction

Keptn Roadmap and Keptn Enhancement Proposals (KEP)

This repository contains two elements that are important to drive the development of the Keptn project:

  1. Keptn Roadmap: Defining the direction of Keptn.
  2. Keptn Enhancement Proposals (one could also call these Requests for Comments): A new proposal can be made by creating a pull request. Already accepted proposals are listed in the text/ directory.

Keptn Roadmap

Please find here the Keptn Roadmap, which defines the building blocks for 2021. Based on these building blocks, Keptn Enhancement Proposals are derived.

Keptn Enhancement Proposals

All enhancement proposals that have been accepted into this repo can be found in the text/ directory in the following format:

XXXX-my-proposal.md, where XXXX is the KEP ID (basically the ID of the Pull Request).

Each proposal can have one of the following statuses:

  • proposed (PR created)
  • approved (PR reviewed and merged)
  • rejected (PR is actively rejected by the reviewers)
  • withdrawn (PR has been closed by the proposer)

What changes require a KEP?

A KEP is required when it is intended to introduce new behaviour, change desired behaviour, or otherwise modify requirements of Keptn.

In practice, this means that KEPs should be used for such changes as:

  • Behavioural changes of Keptn or any core services
  • Changes that affect the interaction of multiple services
  • Breaking changes

On the other hand, they do not necessarily need to be used for such changes as:

  • Bug fixes
  • Rephrasing, grammatical fixes, typos, etc.
  • Refactoring
  • Automated tests
  • Simple workflow changes (such as adding a timeout, retry logic, ...)

Note: The above lists are intended only as examples and are not meant to be exhaustive. If you don't know whether a change requires a KEP, please feel free to contact us!

Writing a new proposal

There are two options available for creating a KEP:

  • Forking the repo and creating a Pull Request
  • Creating an issue based on the KEP-template that contains the same content - we will then take care of the Pull Request

Preferred: Fork the repo and create a PR

  1. Fork the keptn/enhancement-proposals repo
  2. Copy 0000-template.md into the text/ directory and rename it accordingly (e.g., to XXXX-my-proposal-title.md) - Please note that XXXX (ID of the KEP) needs to be replaced with the Pull Request ID later.
  3. Fill in the template. Put care into the details: It is important to present convincing motivation, demonstrate an understanding of the design's impact, and honestly assess the drawbacks and potential alternatives (feel free to adapt the template to your likings, if you feel that it is necessary or if it helps to improve readability).

Create an issue based on the KEP-template

  1. Create a new issue (based on the KEP-template)
  2. Please note the ID of the issue and replace XXXX in the template with the issue ID.
  3. Fill in the template. Put care into the details: It is important to present convincing motivation, demonstrate an understanding of the design's impact, and honestly assess the drawbacks and potential alternatives (feel free to adapt the template to your likings, if you feel that it is necessary or if it helps to improve readability).
  4. Please note: For this KEP to be accepted, we will eventually create a Pull Request and merge it.

Submitting a new proposal

  • A KEP is proposed by posting it as a Pull Request (PR). Once the PR is created, update the KEP file name to use the PR ID as the KEP ID.
  • A KEP is approved after a number of official reviewers github-approve the PR (this number might change over time). The KEP is then merged.
  • If a KEP is rejected or withdrawn, the PR is closed. Note that these KEP submissions are still recorded, as Github retains both the discussion and the proposal, even if the branch is later deleted.
  • If a KEP discussion becomes long, and the KEP then goes through a major revision, the next version of the KEP can be posted as a new PR, which references the old PR. The old PR is then closed. This makes KEP review easier to follow and participate in.

Final words

This process borrows from Open Telemetry Enhancement Proposals.

The hope and expectation is that this process will evolve over time, it is by no means fixed. If you have any suggestions, questions or concerns, please get in touch with us.


enhancement-proposals's Issues

KEP-84 - Reliable Usage Analytics collection

Keptn Usage Analytics Engine

Summarizes the proposal for having a new analytics engine for gathering usage data

Motivation

It would be useful to know the basics about Keptn usage; for that, we would need to extend the current usage analytics collection and its implementation. This includes installation numbers, service usage statistics, feature flags, etc.

Internal details

Currently, we rely on update checks that poll a pre-defined URL to retrieve the data for the CLI and Keptn Bridge. When collecting usage analytics is enabled, Keptn and the CLI check for updates. This is currently opt-in for Keptn Bridge and opt-out for the Keptn CLI. These URLs are served by AWS CloudFront; we then retrieve the access logs and parse them with scripts. This engine is to be reworked (see below). The requests are logged in the format shown here: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html

The following data is collected at the moment:

  • IP address;
  • Version of Keptn deployment;

The following data CAN be collected according to the reference page:

  • IP address;
  • Version of Keptn deployment;
  • Number of projects, stages, and services;
  • Number of root events per service and day;
  • List of installed Keptn-services including version thereof.

Explanation

  1. The current analytics engine is not reliable. The lack of a way to uniquely identify an instance is the biggest problem. Right now we rely on CloudFront logs, so we cannot say for sure how many instances are used behind a single network.
    • Proposal: Introduce a uniqueId as part of Keptn core. It should be persisted in the database and retained between restarts. The recipient side and processing scripts also need to be reworked to support the unique ID, likely passed as an HTTP URL parameter (see the sketch after this list).
    • Maybe it requires another endpoint, like Bitly Enterprise
  2. The number of installations is not enough for decision making; we need to collect more data. The current analytics opt-in permission in the Web UI actually allows for more data:
    • List of submitted data: https://keptn.sh/docs/0.16.x/bridge/load_information/
    • Proposal: Implement the submitted data. It requires a fresh new analytics engine. There are many open source projects like Matomo, but maybe a simple database provider + GrimoireLab is a better solution for flexibility purposes. Sentry could be used as a receiver/visualization tool and has open source sponsorship.
  3. Since core patches are needed, it will take several months to propagate the new analytics implementation. This has to be taken into account.
  4. The Linux Foundation has strict usage analytics rules that put many limits on what can be submitted, e.g., usage analytics should be opt-in, and there should be no personal or company identifiable data in public layouts. Also note that all analytics engines should comply with GDPR and its Californian equivalent.
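To make item 1 more concrete, here is a minimal Go sketch (not part of the proposal itself; the file-based persistence, the placeholder URL, and the instanceId parameter name are purely illustrative assumptions) of how an instance ID could be generated once, persisted, and attached to the update-check request as an HTTP URL parameter:

package main

import (
    "fmt"
    "net/url"
    "os"
    "strings"

    "github.com/google/uuid"
)

// loadOrCreateInstanceID returns a stable, anonymous instance identifier.
// In Keptn core this would be persisted in the database; a file is used
// here only to keep the sketch short.
func loadOrCreateInstanceID(path string) (string, error) {
    if data, err := os.ReadFile(path); err == nil {
        return strings.TrimSpace(string(data)), nil
    }
    id := uuid.NewString()
    if err := os.WriteFile(path, []byte(id), 0o600); err != nil {
        return "", err
    }
    return id, nil
}

// buildUpdateCheckURL appends the instance ID and Keptn version as query
// parameters, so the recipient side can distinguish instances that sit
// behind the same network.
func buildUpdateCheckURL(base, instanceID, version string) (string, error) {
    u, err := url.Parse(base)
    if err != nil {
        return "", err
    }
    q := u.Query()
    q.Set("instanceId", instanceID)
    q.Set("version", version)
    u.RawQuery = q.Encode()
    return u.String(), nil
}

func main() {
    id, _ := loadOrCreateInstanceID("/tmp/keptn-instance-id")
    // placeholder URL; the real pre-defined update-check URL would be used here
    u, _ := buildUpdateCheckURL("https://example.com/version.json", id, "0.16.0")
    fmt.Println(u)
}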

Related problems

  • Keptn CLI likely submits the usage analytics by default (opt-out), even for Cloud Automation

References:

Trade-offs and mitigations

  • What are some (known) drawbacks?
  • What are some ways that they might be mitigated?

Note that mitigations do not need to be complete solutions and they do not need to be accomplished directly through your proposal. A suggested mitigation may even warrant its own KEP.

Breaking changes

  • Keptn CLI will stop submitting data by default on update checks
  • TBD: Extending the list of the collected data, if needed, will require re-agreeing to the usage analytics submission.

Prior art and alternatives

Open questions

  • Technical choice for the analytics engine
  • What data do we want to collect? Comments with requirements are welcome

Future possibilities

Decision making on what's next for Keptn, community stats

Introduce OpenTelemetry pre-instrumentation to Keptn

Introduce OpenTelemetry pre-instrumentation to Keptn

Add OpenTelemetry pre-instrumentation to Keptn’s helm-service, ensuring insights into, e.g., response times and errors.

Motivation

Similar to Keptn, OpenTelemetry is an open-source CNCF project formed from the merger of the OpenCensus and OpenTracing projects. It provides a collection of tools, APIs, and SDKs for capturing metrics, distributed traces, and logs from applications.
OpenTelemetry is expected to become the industry standard for:

  • pre-instrumenting libraries and frameworks to add out-of-the-box and vendor-neutral observability

  • enriching local monitoring data with custom instrumentation

While OpenTelemetry does not provide backend or analytics capabilities, it provides several integration points for observability platforms to ingest the collected data.

  • Which use-cases does this KEP enable?
    This KEP proposes to team up with the trending CNCF project OpenTelemetry by enriching Keptn’s helm-service with OpenTelemetry pre-instrumentation. The helm-service is used as the first candidate for pre-instrumentation since Keptn delivery use-cases occasionally fail due to missing insights into this particular service. Such pre-instrumentation shall ensure observability insights into, e.g., response times and errors.

  • Which value would this KEP create?
    This KEP would add currently missing observability insights into the helm-service used for deploying services to a Kubernetes cluster and releasing them to user traffic.
    Besides, it would extend the OpenTelemetry instrumentation for metrics already in place for the Keptn statistics-service.

  • How can I, as a Keptn user, make use of the instrumentation?
    The value of OpenTelemetry pre-instrumentation is the following:
    • As a developer, adopting OpenTelemetry instrumentation allows you to gain more insights into services or to add domain-specific knowledge
    • OpenTelemetry data can be sent to an observability backend of choice for analysis to understand how apps are performing
    • As an SRE, OpenTelemetry instrumentation can be ingested into an observability backend of choice and used to ensure SLIs/SLOs are met for critical services

Explanation

package main

import (
    "context"
    "net"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func main() {
    tracer := otel.Tracer("tcp-server")
    // Start TCP server
    listener, _ := net.Listen("tcp", ":1234")
    conn, _ := listener.Accept()
    defer conn.Close()
    conn.Write([]byte("Hello from TCP server"))
    // Create a custom span that represents the accepted TCP connection
    _, connSpan := tracer.Start(context.Background(), "Incoming TCP connection")
    defer connSpan.End()
    // Set the client IP address as a span attribute (RemoteAddr is the client side of the connection)
    connSpan.SetAttributes(attribute.String("client.address", conn.RemoteAddr().String()))
}

The example above adds relevant custom metadata, i.e., the client IP address, to a low-level TCP connection. It shows how span attributes can be used to provide specific metadata related to, e.g., the helm-service. Such domain-specific metadata can be useful to quickly identify issues related to deploying services and to mitigate them.

Internal details

From a technical perspective, how do you propose accomplishing the proposal?

In particular, please explain:

  • How would the change impact and interact with existing functionality?
  • Likely error modes and how to handle them
  • Corner cases and how to handle them

While you do not need to prescribe a particular implementation - a KEP should be about behavior, not the implementation - it may be useful to provide at least one suggestion as to how the proposal could be implemented. This helps reassure reviewers that implementation is at least possible, and often helps them think more deeply about trade-offs, alternatives, etc.

Trade-offs and mitigations

Breaking changes

Could this proposal cause any breaking changes (e.g., in the spec or within any workflow)?

Prior art and alternatives

  • What are some prior and/or alternative approaches? E.g., is there a corresponding feature in OpenTracing or OpenCensus?
  • What are some ideas that you have rejected?

Open questions

Does OpenTelemetry pre-instrumentation for Keptn require trace context support to be available?

Future possibilities

Of course, OpenTelemetry pre-instrumentation of other Keptn services could follow.

Allow external tools to provide SLI results via events vs just via SLI-provider

Allow external tools to provide SLI Results via Events vs just via SLI Provider

SLI results should not only be pulled in via an SLI provider; it must also be possible for tools taking part in the delivery process to push SLI results to Keptn. For example, JMeter should be able to provide SLI results via the Test Finished event after test execution is done, as this is the time when these values are actually available. This also eliminates writing an SLI provider to fetch data that is already available.

Motivation

When executing JMeter tests, JMeter has a lot of valuable SLI data (response time, failure rate, ...) that should be made available to the Lighthouse service to be included in the SLO validation. Letting a Keptn service, or even an external tool that sends e.g. a Deployment Finished event, provide SLI results eliminates the need to implement custom SLI providers for each potential external data source. Instead, we give those tools the chance to provide SLI results at the time when they have the data available.
This eliminates the need to write specific SLI providers for tools that are already integrated with Keptn, such as the JMeter or Neoload service. This option also allows tools that integrate with Keptn, e.g. GitLab or Jenkins, to send key metrics via, e.g., the Keptn Start Evaluation event. Examples here would be code coverage metrics, build times, ...

While we will still need support for multiple SLI providers, I think this approach will greatly reduce the need for them, as most use cases I am aware of right now are data sources such as testing tools. The primary SLI provider would then be - just as it is now - the monitoring tool SLI provider.

Explanation

I propose to allow any tool to send SLIResult data as part of a cloud event. It can be the same format as our internal IndicatorValues []*SLIResult that is part of the InternalGetSLIDoneEventData.
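For illustration, a rough Go sketch of how such a payload could be typed (field names follow the JSON examples below; the struct names are hypothetical and not taken from keptn/go-utils):

// SLIResult mirrors the format already used internally for IndicatorValues.
type SLIResult struct {
    Metric  string  `json:"metric"`
    Value   float64 `json:"value"`
    Success bool    `json:"success"`
    Message string  `json:"message,omitempty"`
}

// TestsFinishedEventData sketches how the tests-finished payload could carry
// optional SLI results provided directly by the testing tool.
type TestsFinishedEventData struct {
    Project         string       `json:"project"`
    Stage           string       `json:"stage"`
    Service         string       `json:"service"`
    TestStrategy    string       `json:"teststrategy"`
    Start           string       `json:"start"`
    End             string       `json:"end"`
    Result          string       `json:"result"`
    IndicatorValues []*SLIResult `json:"indicatorValues,omitempty"`
}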

Here are a couple of scenarios
#1: Testing Tool sends metrics as part of Test Finished
JMeter can send SLIResults as part of Test Finished, which are then automatically available to the Lighthouse service, e.g., jmeter_rt_test1, jmeter_rt_test2, ...
This allows our users to use these SLIs in their SLOs without having to specify them in a custom SLI. It brings immediate value for quality gates without having to configure an SLI provider, and it allows us to reuse metrics that are already available from the test execution.

#2: Pipeline sends SLIs as part of the Start Evaluation Event
Jenkins, for instance, can send SLIResults when it sends the start-evaluation event, e.g., jenkins_codecoverage, jenkins_buildtime, ...
This allows a user to include code coverage or build time as part of their SLOs. This really opens the door to more data sources without having to implement a new SLI data source for every data source.

Internal details

Internally, I would simply allow tools to add SLIResult data in the Test Finished and Start Evaluation events. Here are some example cloud events to show how this could look for the JMeter or the Neoload service:

{
  "contenttype": "application/json",
  "data": {
    "deploymentstrategy": "direct",
    "end": "2020-03-24T08:56:18Z",
    "labels": null,
    "project": "simpleproject",
    "result": "pass",
    "service": "simplenode",
    "stage": "staging",
    "start": "2020-03-24T08:43:21Z",
    "teststrategy": "performance"
    "indicatorValues": [
      {
        "metric": "jmeter_response_time",
        "success": true,
        "value": 535.624022959996
      },
      {
        "metric": "jmeter_failurerate_step1",
        "success": true,
        "value": 0.0
      },
      {
        "metric": "jmeter_failurerate_step2",
        "success": true,
        "value": 3.45
      }
    ]
  },
  "id": "3ebc02e1-bcab-40dc-a878-5868085b8016",
  "source": "jmeter-service",
  "specversion": "0.2",
  "time": "2020-03-24T08:56:18.246Z",
  "type": "sh.keptn.events.tests-finished",
  "shkeptncontext": "664d2ad3-e925-4049-a7ff-39b8e584e5b5"
}

Here is an example for a start evaluation that is for instance triggered by a pipeline tool

{
  "contenttype": "application/json",
  "data": {
    "labels": {
      "jenkinsjob" : "http://myjenkins/myproject/job/3
     "gitcommit" : "abcedefadf"
    },
    "project": "simpleproject",
    "service": "simplenode",
    "stage": "staging",
    "start": "2019-11-21T11:00:00.000Z",
    "end": "2019-11-21T11:05:00.000Z",
    "teststrategy": "performance"
    "indicatorValues": [
      {
        "metric": "jenkins_buildtime",
        "success": true,
        "value": 12400.00
      },
      {
        "metric": "jenkins_codecoverage",
        "success": true,
        "value": 33.45
      },
      {
        "metric": "jenkins_selenium_score",
        "success": true,
        "value": 75.0
      }
    ]
  },
  "id": "3ebc02e1-bcab-40dc-a878-5868085b8016",
  "source": "jenkins",
  "specversion": "0.2",
  "time": "2020-03-24T08:56:18.246Z",
  "type": "sh.keptn.event.start-evaluation",
  "shkeptncontext": "664d2ad3-e925-4049-a7ff-39b8e584e5b5",
}

The lighthouse service, when handling the start-evaluation or the test-finished event, can then take these SLIs and "merge" them with the SLIs it retrieves through the configured SLI provider. With that, the list of overall available SLIs grows and can be used in the SLOs.
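A minimal sketch of that merge step (hypothetical helper, reusing the SLIResult type sketched above; the choice to let the configured SLI provider win on a metric-name clash is an assumption, not part of the proposal):

// mergeSLIResults combines SLI results pushed via events with the results
// retrieved from the configured SLI provider. On a metric-name clash the
// provider value wins (an assumption made for this sketch).
func mergeSLIResults(fromEvents, fromProvider []*SLIResult) []*SLIResult {
    merged := map[string]*SLIResult{}
    for _, r := range fromEvents {
        merged[r.Metric] = r
    }
    for _, r := range fromProvider {
        merged[r.Metric] = r
    }
    out := make([]*SLIResult, 0, len(merged))
    for _, r := range merged {
        out = append(out, r)
    }
    return out
}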

Trade-offs and mitigations

If a user references SLIs in the SLOs that are not available or were not passed by any external tool, the lighthouse service needs to highlight this - but I think it is already doing this anyway in case an SLI value is not available.

Breaking changes

I don't think there are any breaking changes, as we are just extending the source of SLIs. Everything else we have right now will work as it does today.

Prior art and alternatives

The alternative would be to write additional SLI providers to pull in data from other tools that have already been used in the delivery process. While this is technically feasible, it is not always easy, e.g., it would be very hard for us to develop a JMeter SLI provider because that SLI provider might no longer have access to the JMeter test result file that was generated earlier. We have no guarantee that this file still exists in that JMeter service container.

Open questions

Future possibilities

Gives external tools easier access to the SLO validation by providing their own metrics as they call start-evaluation. This is a HUGE enabler of future integrations with the Quality Gate capability of keptn

Role-based Access Control (RBAC)

Role-based Access Control (RBAC)

Success Criteria: Role-based and SSO-based access control need to be applied for accessing Keptn UIs and APIs.

Motivation

Target audience pain points
As any user entering the Bridge, I see everything from my whole organization. I see things I do not care about and, even worse, I might also see things I should not be allowed to see (e.g., QG results including security vulnerability metrics for a certain release).

Evidence
Keptn users requested role-based access to the Bridge several times.

There is no discussion that role-based access and visibility of company data is a must-have for any enterprise-ready solution.

What

What is it? What is it not?

  • As anyone looking into data in the Bridge, I should only authenticate once and have proper permissions within the Bridge based on my SSO roles.

  • As an automation engineer, I should only edit the automation I'm responsible for.

  • As a member of one department, I might not be allowed to see (and not allowed to edit) quality gates of another department.

Use Cases

Keptn has three roles: read / write / admin

  • A read role allows you to:

    • Read all resources without enabling you to modify a resource
  • A write role allows you to:

    • Modify all resources. E.g. approve a promotion in the bridge
    • The "write" role for a given stage should only permit any modifications within the same stage. E.g. if you have the "write" role in dev, but not in hardening it should not allow you to press the "promote" button in dev, since then modifications are done in hardening.
  • An admin role allows you to:

    • Modify all resources, change the Keptn shipyard, create projects, ...

Roles in Keptn must be settable on project- / service- / (stage+service)- level

  • Having a role on project-level enables you to perform an action in each stage for each service

    • E.g. when having the write role on a project you can approve any approval step for any service in any stage
    • Example use cases: define admins per project / give read permissions to business stakeholders
  • Having a role on service-level enables you to perform an action only for the given service (in each stage)

    • E.g. when having the write role for service X you can approve any approval step for service X in any stage
    • Example use case: All members of a team can promote a service to the next stage
  • Having a role on (stage+service)-level enables you to perform an action only for the given stage/service tuple

    • E.g. when having the write role for service X in stage Y you can only approve approval steps for service X in stage Y
    • Example use case: Only a handful of people can approve the promotion of a service to production
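A minimal Go sketch of how these scopes could be modelled and checked (hypothetical types, only meant to illustrate the project-/service-/(stage+service)-level semantics described above):

type Role string

const (
    RoleRead  Role = "read"
    RoleWrite Role = "write"
    RoleAdmin Role = "admin"
)

var rank = map[Role]int{RoleRead: 1, RoleWrite: 2, RoleAdmin: 3}

// Binding assigns a role on project-, service-, or (stage+service)-level.
// An empty Service or Stage field means "any service" / "any stage".
type Binding struct {
    Role    Role
    Project string
    Service string
    Stage   string
}

// allows reports whether this binding permits an action that requires the
// given role for the given project/service/stage tuple.
func (b Binding) allows(required Role, project, service, stage string) bool {
    if rank[b.Role] < rank[required] {
        return false
    }
    if b.Project != project {
        return false
    }
    if b.Service != "" && b.Service != service {
        return false
    }
    if b.Stage != "" && b.Stage != stage {
        return false
    }
    return true
}

With such a model, the "promote in dev writes to hardening" rule from above translates to checking the write role against the target stage (hardening), not the stage where the button is clicked.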

A role has one or more API tokens

The API tokens can be rotated

  • That's why we need multiple API tokens per role: we want to prevent access from being lost while an API token is rotated

Suggested Services

Suggested Services

Suggest services in the bridge that a user may wish to use with Keptn.

Motivation

  • Easier onboarding

Explanation

A user creates a project via the bridge. They upload their shipyard. The bridge parses the tasks from the shipyard and then (new functionality) asks the user which tooling they wish to use to action those tasks.

Imagine the following tasks are parsed:

stages:
  - name: dev
    tasks:
      - deploy
      - test
      - evaluation
      - release
  - name: prod
    tasks:
      - deploy
      - release
      - evaluation

The user would be asked: "Which tool would you like to action the deploy task in dev?"

The user is then directed to the artifacthub installation instructions (or perhaps we can auto-install the service?)

We could offer a "don't see your tool? Integrate anything" link, which points to docs on the keptn/integrations repo.

Internal details

The task suggestions could be pulled from some field in artifacthub.
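As a rough sketch of the parsing step described above (the simplified shipyard structure matches the example; gopkg.in/yaml.v3 is assumed, and the prompt wording mirrors the question asked to the user):

package main

import (
    "fmt"

    "gopkg.in/yaml.v3"
)

// simplified shipyard structure, matching the example above
type shipyard struct {
    Stages []struct {
        Name  string   `yaml:"name"`
        Tasks []string `yaml:"tasks"`
    } `yaml:"stages"`
}

func main() {
    doc := `
stages:
  - name: dev
    tasks: [deploy, test, evaluation, release]
  - name: prod
    tasks: [deploy, release, evaluation]
`
    var s shipyard
    if err := yaml.Unmarshal([]byte(doc), &s); err != nil {
        panic(err)
    }
    // one suggestion prompt per stage/task combination
    for _, stage := range s.Stages {
        for _, task := range stage.Tasks {
            fmt.Printf("Which tool would you like to action the %s task in %s?\n", task, stage.Name)
        }
    }
}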

Benefits

Such an integration would bring the following benefits:

  1. Much simpler onboarding process
  2. A clear understanding for the user that tools action tasks and that they are in control of which tools action which tasks.
  3. As Keptn developers, we can code in stats gathering to see which integrations, tools, and use cases are popular and thus direct more effort to maintaining those.

Service screen revamped

Service screen revamped

Success Criteria: The Keptn Bridge provides a holistic view of the services managed by Keptn.

Motivation

  • Pain: Currently, the service screen is reduced to displaying the evaluation result of a service in a particular stage and a link to the executed sequence.
  • Target: Help the user to find the required information related to specific service deployment.
  • Driver: Provide more service-centered information driven by the main use-cases of Keptn.

User stories

  • As a user, I would like to see the deployments of a service that are currently running in the environment (i.e., the different stages)
  • The main use-cases, which are continuous delivery, quality gate, and auto-remediation, are present and drive the usage of this screen.
  • As a user, I would like to see what happened as part of executed remediations on that service.
  • As a user, I would like to see my SLO.yaml / remediation.yaml configured for a particular deployment.

Use-cases

  • Multiple dev stages that exist in parallel and allow a developer to run tests as part of the development process.
  • Multiple production stages - each for a different customer.

Mockups

⚠️ Disclaimer: These mockups are subject to change.

This section explains how the changes in the Bridge will help the user to answer his/her service-related questions.

  • Question 1: What is the status of my service in the entire environment, i.e., in the different stages?
  • new Feature: A filter component on the environment screen allows selecting the service I'm interested in. For example, I filter on carts and get the following view:

image

  • Question 2: Now I'm interested in the deployment 0.9.11 in production-A. How can I get more information? I click on carts:0.9.11 (see yellow highlighting)
  • This opens the revamped service screen and pre-selects the deployment 0.9.11 that is running in production-A:

image

This screen shows:

  • All deployments of this service that are currently running; see list of 0.12.0, 0.11.1, 0.10.3, 0.9.11 in the left panel.
  • Details about the selected deployment, which is carts:0.9.11. The details include:
    • Meta-data like: Git commit and labels
    • The stages the deployment went through or is currently deployed in.
      • If the bubble is filled, the service is currently running in this stage. Besides, the icon for the deployment URL is displayed (square with arrow)
      • If the bubble just has a border, a newer version (or another artifact) is currently running in this stage. But you can still take a look at the evaluation result.
    • The evaluation result for this deployment in this stage; see heatmap and SLI breakdown.

Work in progress / Open questions

This section contains open questions that need to be refined:

  • How to cover the remediation use-case? Show list of running/executed remediation like:
    image

  • How to deal with the quality-gates-only use-case?

Gimlet - OneChart integration for Keptn

Gimlet OneChart integration for Keptn

Success Criteria: Keptn - by using the OneChart integration - can deploy a service without providing a Helm Chart.

Motivation

  • Target: A OneChart integration provides a default Helm Chart template (aka. OneChart) that can be used to deploy a service.

  • Pain: Currently, a Keptn user has to provide a Helm Chart for onboarding a service to Keptn. This Helm Chart is then deployed by Keptn.

    • Even though the chart from the example repo can get re-used, in many cases the Keptn user has to adapt it.
    • The default example Helm Chart is opinionated in the way that it contains just one Service and Deployment manifest for implementing the built-in rollback functionality of Keptn.
  • Driver: Help Keptn user to get started with Keptn and to quickly deploy the first Keptn managed service.

User Stories

  • As a user, I deploy the onechart-service on the Keptn control-plane (as shown below):
helm install onechart-service [repo] -n keptn 
  • As a user, I do not have to worry about the Helm Chart, which is used for deploying my service. In the end, I can simply do:
keptn create project <my-project> --shipyard=<myshipyard.yaml>
keptn create service my-service --project=<my-project>
keptn trigger delivery --service=<my-service> --project=<my-project> --image=<image> --tag=<tag>

Notes:

  • Note 1: In Keptn <=0.8.x, you can either use the keptn onboarding or keptn add-resource commands to upload a Helm Chart for a service --> these commands become obsolete when using OneChart.

  • Note 2: After implementing keptn/keptn#3341, you can use:

keptn trigger delivery --service=<my-service> --project=<my-project> --values="shipyardController.image.tag=0.8.2-dev20210222"

Open Questions

  • How can we handle rollbacks with OneChart?
  • Are there different types/classifications of OneCharts, e.g., one for webapp, database, etc.?
  • Versioning of OneChart: How can we make sure to apply the correct version of the OneChart (in case the OneChart is not stored in the git repo maintained by Keptn).
  • Should the service be called: onechart-service ?

Implementation Details

  • The onechart-service subscribes to the sh.keptn.event.service.create.started event and is responsible for creating the Helm Chart based on the OneChart.

    • Advantage: The integration kicks in when the service is going to be created.
    • Drawback: It may take the onechart-service some time to finish its job. Hence, when automation is in place that (1) creates a service and (2) immediately triggers a delivery, the Chart may not be ready yet.
  • To get started, the keptn-service-template can be used: https://github.com/keptn-sandbox/keptn-service-template-go

References

Deploy Keptn w/ Demo Project

Deploy Keptn w/ Demo Project

It's much easier for new users if Keptn comes installed with a default project as a starting point.

Motivation

  • New user onboarding will be easier

Extensibility in Keptn

Extensibility in Keptn

Problem statement

Currently, there are limitations when extending Keptn with custom services (Keptn-services):

  • Keptn-services are constantly running while listening for CloudEvents coming in over NATS or polling events from the Keptn API.
  • Keptn-services usually filter for a static list of events that trigger the included functionality. This is not configurable.
  • Support for new functionality in Keptn needs to be added to each Keptn-service individually. E.g., the new secret functionality needs to be included in all of the services running on the Keptn execution plane.

Success Criteria

Current challenges for integrating custom tooling should be mitigated and already provided solutions (e.g., provided by the job-executor-service) should become a Keptn-core capability. The goal should be to make the integration of external tooling easier.

Workshop result

In a workshop with Keptn users, the group derived a list of topics/issues that are relevant to improve the extensibility of Keptn. Based on that, concrete Action Items and Feature Requests were derived.

image

Action items

SDK for building integrations

In the workshop, the participants derived the code that a developer of an Integration has to write:

KeptnClient c;
try {
   var e = c.fetchEvent("deployment")
   if (e != null) {
      doSomething();
      c.sendResult("...");
   }
} catch (e) {
   c.reportError(e);
}

Conclusion:

  • Fully abstract away the event handling. The developer should never deal with Keptn Events, but rather a stable interface.

Prototype of an SDK:
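A possible shape for such an SDK, sketched in Go (hypothetical interfaces, not the actual Keptn SDK API), hiding the fetchEvent/sendResult/reportError flow from above behind a stable task-handler interface:

package sdk

import "context"

// TaskHandler is implemented by an integration. It never sees raw Keptn
// CloudEvents, only the task payload and a way to return a result.
type TaskHandler interface {
    Execute(ctx context.Context, data map[string]interface{}) (map[string]interface{}, error)
}

// Client abstracts fetching events, sending results, and reporting errors,
// mirroring the fetchEvent/sendResult/reportError calls from the workshop sketch.
type Client interface {
    FetchEvent(ctx context.Context, task string) (map[string]interface{}, error)
    SendResult(ctx context.Context, task string, result map[string]interface{}) error
    ReportError(ctx context.Context, task string, err error) error
}

// RunTask wires the two together so that an integration only has to provide
// a handler and never touches event plumbing directly.
func RunTask(ctx context.Context, c Client, task string, h TaskHandler) error {
    data, err := c.FetchEvent(ctx, task)
    if err != nil {
        return c.ReportError(ctx, task, err)
    }
    if data == nil {
        return nil // no open task of this type
    }
    result, err := h.Execute(ctx, data)
    if err != nil {
        return c.ReportError(ctx, task, err)
    }
    return c.SendResult(ctx, task, result)
}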

Documentation

Status quo:

What is missing?

  • As a developer, I would like to get an overview of the possible options to integrate with Keptn (what are the advantages/disadvantages thereof?):
    • keptn-service-template-go (status quo) > custom-service integration
    • Keptn SDK (should replace the keptn-service-template-go) > custom-service integration
    • job-executor-service > image-based integration
    • webhook-service > webhook-based integration (future)
  • As a developer, I would like to follow a step-by-step guide to create my first Keptn-service.
  • As a developer, I would like to get details for integrating when not using the Keptn distributor.
    • Polling for .triggered events
    • Send a .started containing the id of the .triggered event
    • Send a .finished containing the id of the .triggered event

Features requests

No indication that a triggered event is already consumed

Situation: I have a service (like a Lambda function) that checks for .triggered events every 10 seconds.

Current issue: In Keptn, a .triggered event remains on the /events/triggered endpoint as long as it has not received the corresponding .started event. This can cause the problem that the same service will receive the .triggered event a second time.

Proposed solution: Mark a triggered event as "consumed-by" when a service successfully pulled it.

Fine-grained event filters

As an operator of a Keptn-service, I would like to filter events based on their payload (and not only on Project/Stage/Service level).

JSON Path could be a proper way of implementing this:

events:
  - name: "sh.keptn.event.test.triggered"
    jsonpath:
      property: "$.data.test.teststrategy"
      match: "health"
  - name: "sh.keptn.event.test.triggered"
    jsonpath:
      property: "$.data.test.teststrategy"
      match: "load"

No dependency on K8s for integrations

Additional notes

  • Keptn-services on the Execution Plane have a different release cycle than the Control Plane: Keptn-services (Integrations) will have a different release cycle than Keptn. Establish a mechanism that ensures that Keptn-services are still compatible with the current Keptn release.

  • Make the REST API used by the SDK very stable​

  • Store the configuration of an event subscription on the Keptn control-plane: Subscriptions should be managed at a central place and not distributed across the different Keptn-services (Integrations). A Keptn-service then asks the Control Plane for subscription updates and applies them automatically. (This approach is considered in: keptn/keptn#4439 & keptn/keptn#4437)

Snapshot releases

Snapshot releases

Problem statement

Problem 1: Currently, Keptn is mainly designed for the deployment and promotion of - more or less independent - microservices without dependencies. This becomes a challenge when multiple services are handled and should be tested together. Using the current approach, each service has to be promoted individually; therefore, each stage might run with version combinations that have not been tested together.

Problem 2: Keptn assumes that each sequence runs in the scope of a service. As a result, it is not possible to define tasks that run once per project/stage, which leads to the problem that every task taken by a sequence has to be designed in a way that it does not interfere with other ones. If there were such more globally scoped tasks, it would be possible to, e.g., create infrastructure per project but also run global tests, which might be evaluated on a per-service basis.

Success Criteria

In the future, it should be possible to deploy larger-scale applications/platforms using Keptn. Therefore, Keptn should be able to deal with a set of services, but also to run tasks on project/stage scope.

As a result, the following success criteria are defined:

  • A set of services can be defined in Keptn
  • Services can be added and removed to/from the set
  • The current set of services in a stage can be promoted in one step to the next stage
  • Tasks can be run on the scope of the set instead of individual services

Explanation

General overview

PlatformDeployment (1)

(proposed) Entity model

image

  • Snapshot - A set of Keptn services
  • Service - A Keptn service
  • Service Sequence Instance (Service-SI) - Execution of a sequence (e.g., delivery) for one service
  • Snapshot Sequence Instance (Snapshot-SI) - Execution of a sequence (e.g., delivery) for an entire snapshot
  • service version - value from CI build
  • snapshot version - defined by Keptn

Use Cases

Overview:

  1. As a platform operator, I want to define a set of services to test and deploy them together in further stages
  2. As a platform operator, I want to deploy all services together to ensure that they are operating together properly
  3. As a developer, I want to ensure that my service is tested to ensure that it is running properly on the platform
  4. As a platform operator, I want to run tests involving all of the services to ensure that the platform runs properly
  5. As a developer/platform operator, I want to know which service causes problems related to the tests to be able to find out where problems are
  6. As a platform operator, I’d like to promote the whole group to the next stage to ensure that the preceding tests are still valid

As a platform operator, I want to define a set of services to test and deploy them together in further stages

It should be possible to add a service to a group and treat the group as one unit in future sequences. This grouping could be done at the project level; therefore, a promotion strategy (grouped or single-service) would have to be added to the Shipyard file. If this is set, all services in the project are part of the group.

A snapshot represents a configuration of a group for a specific point in time. When operating in a grouped mode, it would be possible to run sequences on a snapshot scope.

As a platform operator, I want to deploy all services together to ensure that they are operating together properly

It should be possible to deploy all services (snapshot) in a stage. Obviously, this doesn’t apply to the first stage, where the services are added to the platform. The deployment could be done on a project level (deploy everything with one artifact) or on a per-service level.

  • Project-Level Deployment: In an operator-managed deployment, it might be intended that all services are deployed by applying one custom resource to the target environment while having the possibility to evaluate every service's quality criteria on its own. In this case, the operator would have to pass the artifact for the project deployment as well as the configuration needed for the services to Keptn.
  • Service-Level Deployment: This is the current way of dealing with deployments. But instead of deploying every service on its own, Keptn would iterate over all services in the project and deploy them in the previously specified version.

As a developer, I want to ensure that my service is tested to ensure that it is running properly on the platform

After the deployment, tests scoped to the service will be executed.

As a platform operator, I want to run tests involving all of the services to ensure that the platform runs properly

There might be tests, which are scoped to the snapshot (end-to-end tests). These tests should be executed once and might span across multiple services.

As a developer/platform operator, I want to know which service causes problems related to the tests to be able to find out where problems are

In the end, the results of the tests described above should be reported on a per-service basis. Therefore, it should make no difference whether the tests are triggered on a per-service or on a per-snapshot basis.

As a platform operator, I would like to promote the whole group (snapshot) to the next stage to ensure that the preceding tests are still valid

Last but not least, the platform operator should be able to promote the snapshot - either automatically or manually - if all of the tests ran successfully. Furthermore, the operator should be able to promote a specific set of previously successfully deployed and tested versions to the next stage.

Implementation

Prerequisites

  • Version is mandatory in Keptn-events
    • Should be a human-readable name
    • Provided by the CI
    • Could be the Git-commit (from a technical point of view)

Overview

Untitled-2021-07-22-1555

Possible Shipyard

Option A

apiVersion: "spec.keptn.sh/0.2.2"
kind: "Shipyard"
metadata:
  name: "shipyard-sockshop"
spec:
  promotionStrategy: [snapshot | service ]  < default is `service`
  stages:
    - name: "development"
      sequences:
        - name: "delivery"
          scope: service
          tasks:
            - name: "deployment"
            - name: "test"
            - name: "evaluation"
            - name: "release"

    - name: "hardening"
      sequences:
      
        - name: "snapshot-delivery"
          tasks:
            - name: "monaco"
            - ref: "delivery"
            - name: "test"
              properties:
                teststrategy: "e2e-platform"
            - name: "evaluation"                         < uses SLO from stage-level

        - name: "delivery"
          scope: service
          tasks:
            - name: "deployment"
            - name: "test"
            - name: "evaluation"
            - name: "release"

Untitled (1)

title This is a title

DEVELOPER->dev.delivery:sh.keptn.event.dev.delivery.triggered (service A, v1) 
note left of snapshot: {\n"snapshotVersion": 1,\n"stages": [dev],\n"services": [\n{"service A": "shkeptncontext"}\n]\n}
dev.delivery->dev.delivery: .delivery.finished

DEVELOPER->dev.delivery:sh.keptn.event.dev.delivery.triggered (service B, v1) 
note left of snapshot: {\n"snapshotVersion": 2,\n"stages": [dev],\n"services": [\n{"service A": "shkeptncontext"},\n{"service B": "shkeptncontext"}\n]\n}
dev.delivery->dev.delivery: .delivery.finished 


DEVELOPER->dev.delivery:sh.keptn.event.dev.delivery.triggered (service C, v1)
note left of snapshot: {\n"snapshotVersion": 3,\n"stages": [dev],\n"services": [\n{"service A": "shkeptncontext"},\n{"service B": "shkeptncontext"},\n{"service C": "shkeptncontext"}\n]\n}
dev.delivery->dev.delivery: .delivery.finished 

DEVELOPER->dev.delivery:sh.keptn.event.dev.delivery.triggered (service A, v2) 
note left of snapshot: {\n"snapshotVersion": 4,\n"stages": [dev],\n"services": [\n{"service A": "shkeptncontext"},\n{"service B": "shkeptncontext"},\n{"service C": "shkeptncontext"}\n]\n}
dev.delivery->dev.delivery: .delivery.finished

DEVELOPER->hardening.snapshot-delivery:sh.keptn.event.hardening.snapshot-delivery.triggered [snapshotVersion:3]
note left of snapshot: {\n"snapshotVersion": 3,\n"stages": [dev, hardening],\n"services": [\n{"service A": "shkeptncontext"},\n{"service B": "shkeptncontext"},\n{"service C": "shkeptncontext"}\n]\n}

hardening.snapshot-delivery->hardening.snapshot-delivery: .monaco.triggered

hardening.snapshot-delivery->hardening.delivery: hardening.delivery.triggered parallel for: [service A] & [service B] & [service C]

hardening.delivery->hardening.delivery: deployment.triggered
hardening.delivery->hardening.delivery: test.triggered
hardening.delivery->hardening.delivery: evaluation.triggered
hardening.delivery->hardening.delivery: release.triggered

hardening.snapshot-delivery<-hardening.delivery: hardening.delivery.finished for: [service A] & [service B] & [service C]


hardening.snapshot-delivery->hardening.snapshot-delivery: .test.triggered

hardening.snapshot-delivery->hardening.snapshot-delivery: .evaluation.triggered

hardening.snapshot-delivery->hardening.snapshot-delivery:sh.keptn.event.hardening.snapshot-delivery.finished

Option B

apiVersion: "spec.keptn.sh/0.2.2"
kind: "Shipyard"
metadata:
  name: "shipyard-sockshop"
spec:
  promotionStrategy: [snapshot | service ]  < default is `service`
  stages:
    - name: "development"
      sequences:
        - name: "delivery"
          scope: service
          tasks:
            - name: "deployment"
            - name: "test"
            - name: "evaluation"
            - name: "release"

    - name: "hardening"
      sequences:
      
        - name: "delivery"             < to trigger: sh.keptn.event.hardening.delivery.triggered {snapshotVersion: 3} (see below)
          tasks:
            - name: "monaco"
              scope: snapshot
            - name: "deployment"       < triggers a deplyoment for each service (context) as provided in the snapshotVersion
            - name: "test"             < triggers a test -||- 
            - name: "evaluation"       < triggers a evaluation -||- 
            - name: "test"
              scope: snapshot
              properties:
                teststrategy: "e2e-platform"
            - name: "evaluation"                   < uses SLO from stage-level
              scope: snapshot

Untitled (2)

https://sequencediagram.org/

title This is a title

DEVELOPER->dev.delivery:sh.keptn.event.dev.delivery.triggered (service A, v1) 
note left of snapshot: {\n"snapshotVersion": 1,\n"stages": [dev],\n"services": [\n{"service A": "shkeptncontext"}\n]\n}
dev.delivery->dev.delivery: .delivery.finished

DEVELOPER->dev.delivery:sh.keptn.event.dev.delivery.triggered (service B, v1) 
note left of snapshot: {\n"snapshotVersion": 2,\n"stages": [dev],\n"services": [\n{"service A": "shkeptncontext"},\n{"service B": "shkeptncontext"}\n]\n}
dev.delivery->dev.delivery: .delivery.finished 

DEVELOPER->dev.delivery:sh.keptn.event.dev.delivery.triggered (service C, v1)
note left of snapshot: {\n"snapshotVersion": 3,\n"stages": [dev],\n"services": [\n{"service A": "shkeptncontext"},\n{"service B": "shkeptncontext"},\n{"service C": "shkeptncontext"}\n]\n}
dev.delivery->dev.delivery: .delivery.finished 

DEVELOPER->dev.delivery:sh.keptn.event.dev.delivery.triggered (service A, v2) 
note left of snapshot: {\n"snapshotVersion": 4,\n"stages": [dev],\n"services": [\n{"service A": "shkeptncontext"},\n{"service B": "shkeptncontext"},\n{"service C": "shkeptncontext"}\n]\n}
dev.delivery->dev.delivery: .delivery.finished

DEVELOPER->hardening.delivery: sh.keptn.event.hardening.delivery.triggered [snapshotVersion:3]
note left of snapshot: {\n"snapshotVersion": 3,\n"stages": [dev, hardening],\n"services": [\n{"service A": "shkeptncontext"},\n{"service B": "shkeptncontext"},\n{"service C": "shkeptncontext"}\n]\n}

hardening.delivery->hardening.delivery: .monaco.triggered

hardening.delivery->hardening.delivery: .deployment.triggered [serviceA: shkeptncontext]

hardening.delivery->hardening.delivery: .deployment.triggered [serviceB: shkeptncontext]

hardening.delivery->hardening.delivery: .deployment.triggered [serviceC: shkeptncontext]

hardening.delivery->hardening.delivery: .test.triggered [serviceA: shkeptncontext]

hardening.delivery->hardening.delivery: .test.triggered [serviceB: shkeptncontext]

hardening.delivery->hardening.delivery: .test.triggered [serviceC: shkeptncontext]

hardening.delivery->hardening.delivery: .evaluation.triggered [serviceA: shkeptncontext]

hardening.delivery->hardening.delivery: .evaluation.triggered [serviceB: shkeptncontext]

hardening.delivery->hardening.delivery: .evaluation.triggered [serviceC: shkeptncontext]

hardening.delivery->hardening.delivery: .test.triggered

hardening.delivery->hardening.delivery: .evaluation.triggered

hardening.delivery->hardening.delivery: sh.keptn.event.hardening.delivery.finished
{
  "snapshotVersion": 3,
  "stages": [dev], 
  "services": [
    {"serviceA": "shkeptncontext"},
    {"serviceB": "shkeptncontext"},
    {"serviceC": "shkeptncontext"}
  ]
}

Grouping of Services

As described above, the Shipyard file should contain the information whether the promotion of the services should happen on a per-group (snapshot) or on a per-service level.

  • ❔ What if a user wants to change the promotionStrategy in the future?
  • ❔ Might it be an option that snapshot scoped sequences are ignored then?
  • ❔ What if services are removed or added?

If a user decides to promote on a per-snapshot level, the ability to promote everything to the next stage is only “open” when all services are successfully deployed and tests/evaluations are in a success/warning state.

Enabling this feature

  • To enable this feature, set promotionStrategy in the Shipyard to snapshot.

Group/Project Context

If the project should be promoted on a per-snapshot level, every deployment triggered event creates a new version of the snapshot.

An example:

  • A project consists of service A, service B, and service C, each in version 1.0
    • the snapshot version will be 3 (for each deployment, a snapshot was created. Hence, we are already at 3)
  • Service A should be upgraded and the version is set to 1.1, Service B and C remain at version 1
    • the snapshot version will be 4
  • Service C is updated to version 1.1; therefore, service A is in version 1.1, service B in version 1.0, and service C in version 1.1
    • the snapshot version will be 5

Storing the Information

As each version change will result in a new snapshot version, the version should be incremented when a new version of a service arrives at the Keptn API. Furthermore, the information about the currently deployed service context ids (in the current stage) should be stored in the database (“Snapshot”). Therefore a JSON object containing this information might look as follows (example):

{
  "snapshotVersion": 1,
  "stages": [development], 
  "services": [
    {"serviceA": "shkeptncontext"},
    {"serviceB": "shkeptncontext"},
    {"serviceC": "shkeptncontext"}
  ]
}
  • ❔ Would it be more useful to store the KeptnContext (Service) instead of the Version?
    • Yes, since it references the version

To keep track, it might be useful to make the version number unique across all stages. Therefore it can be ensured that snapshot version 1 in a dev stage is the same one as in a hardening stage.
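A sketch of how such a snapshot record and the version increment could look (hypothetical Go types; the services are represented here as a map from service name to the shkeptncontext of its last delivery, equivalent to the JSON example above):

// Snapshot stores which service contexts make up a snapshot version.
type Snapshot struct {
    SnapshotVersion int               `json:"snapshotVersion"`
    Stages          []string          `json:"stages"`
    Services        map[string]string `json:"services"` // service name -> shkeptncontext
}

// next creates a new snapshot version whenever a new version of a service
// (i.e., a new keptn context) arrives at the Keptn API.
func (s Snapshot) next(service, keptnContext string) Snapshot {
    services := make(map[string]string, len(s.Services)+1)
    for name, ctx := range s.Services {
        services[name] = ctx
    }
    services[service] = keptnContext
    return Snapshot{
        SnapshotVersion: s.SnapshotVersion + 1,
        Stages:          s.Stages,
        Services:        services,
    }
}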

Maintaining the Group Context

Normally, each sequence in each stage can be triggered by a Keptn event. To ensure that this functionality does not break the snapshots (e.g., by sending a new version of a service directly to the next stage), deployment events from “the outside” might only be allowed to trigger the first stage of the Shipyard when the promotionStrategy is set to snapshot level.

Sequences

As described above, it might be necessary to have sequences that are scoped to a project/stage (snapshot) or a service level. While snapshot-scoped sequences are a new concept in Keptn, service-scoped sequences change their behavior when used with a snapshot-based promotion strategy.

Proposal: Whenever no scope is specified in a sequence, it is assumed that it is scoped to a service

service-scoped sequences

When running a service-scoped sequence, the specified tasks run for every service in the snapshot instance scope. First, the current project version is detected. Afterwards, Keptn iterates over all services in the current project version and triggers the events for them. Subsequently, the state of the current task is watched (shipyard controller?). After all services have finished the execution of this task, an additional event ()

❔ Where do we find the current platform version in a stage?

Wireframe

  • In the Bridge, I can click on a > button that shows a list of available snapshots. (can be represented as time-line)
  • I select a snapshot (e.g., snapshotVersion: 3) and press deploy

Bridge: Obfuscate sensitive data in events

Bridge: Obfuscate sensitive data in events

Add the ability to define sensitive data fields and obfuscate them in the Bridge UI for incoming events.

Motivation

Allow sensitive or secret data to remain secret.

This string will be visible in plain text inside Keptn's Bridge because the Bridge prints out the raw event.

  • Which use-cases does this KEP enable?
  • Which value would this KEP create?

Explanation

Imagine an incoming event payload contains sensitive data:

{
...
  "data": {
    "my-sensitive-string": "dynatrace-adam-gardner"
  }
}

Internal details

Ability to define, via API and UI, the keys of sensitive data fields. In the above example, it would be my-sensitive-string; the bridge would then print the event as:

{
...
  "data": {
    "my-sensitive-string": "*****"
  }
}
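A minimal sketch of the masking step (Go is used here only to illustrate the logic; the Bridge itself is written in TypeScript, and the function and key-set are hypothetical):

// obfuscate replaces the values of all configured sensitive keys in the event
// payload with a mask before the event is shown in the Bridge.
func obfuscate(data map[string]interface{}, sensitiveKeys map[string]bool) {
    for key, value := range data {
        if sensitiveKeys[key] {
            data[key] = "*****"
            continue
        }
        // recurse into nested objects so nested sensitive keys are masked too
        if nested, ok := value.(map[string]interface{}); ok {
            obfuscate(nested, sensitiveKeys)
        }
    }
}

Called with sensitiveKeys containing my-sensitive-string, this turns the payload above into the masked variant shown.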

Trade-offs and mitigations

N/A

Breaking changes

N/A

Prior art and alternatives

Alternatives:

  • Use of sealed secrets
  • Defining secrets in Keptn's Bridge, with all Keptn services coded to use them
    Note: this second option means a higher cognitive load on Keptn users and developers. It's not as user-friendly.

Open questions

N/A

Future possibilities

N/A

Troubleshooting support for Keptn-service (aka. Integration)

Troubleshooting support for Keptn-service (aka. Integration)

Success Criteria: Support for troubleshooting errors in Keptn-services (aka. Integrations)

Motivation

  • Pain: A Keptn user does not get any information about errors that happened in a Keptn-service (aka. Integration)
  • Target: Keptn provides basic support to list errors that occurred in a Keptn-service.
  • Driver: Improve the experience for working with Keptn-services.

Use Case (1):

(1) As a user, I would like to see all errors that occurred in a Keptn-service in the Keptn Bridge.

Type of Errors: In a Keptn-service, two types of errors can occur: (1) task-related and (2) non-task-related ones.

  • Task-related: When executing a specific task, an error occurs. This already results in a xyz.finished event with status errored.
  • Non-task-related: For example, the Keptn-service cannot start, or an API endpoint can't be reached. This functionality is not related to the execution of a task.

--> Regardless of the context in which the error happened, I want to get informed about the error in the Keptn Bridge.

User flow in Bridge:

  • Open the Uniform of a project
  • All Keptn-services (Integrations) that are connected to the Control Plane and subscribed to this project are listed.
  • An indicator in the column Status tells me that a service has problems: see errored in red;
  • When clicking on the service, the last 10 errors that happened in this Keptn-service are listed.
    • An error is displayed by a red icon, the timestamp, and the error message.
    • By default, just the first line of the error message is shown. When clicking on it, I get the entire message.
  • A show older errors button loads the next 10 errors.

image

Open Policy Agent for Sequences

Open Policy Agent for Sequences

Integrate OPA for expanded sequence control.

Motivation

  • Sequences are only allowed to run between certain times
  • Sequences can only be run by certain users
  • Sequences can only be run if they contain certain labels.
  • etc
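A minimal sketch of what such a check could look like using OPA's Go API (the embedded policy and the input fields are hypothetical examples for the label- and user-based rules above; time-window rules could be expressed the same way):

package gatekeeper

import (
    "context"

    "github.com/open-policy-agent/opa/rego"
)

// hypothetical policy: a sequence may only be triggered if it carries an
// "approved" label and is started by an allowed user
const policy = `
package keptn.sequences

default allow = false

allow {
    input.labels.approved == "true"
    input.user == "release-manager"
}
`

// sequenceAllowed evaluates the policy against a (hypothetical) trigger context,
// e.g. {"user": "...", "labels": {...}} taken from the triggered event.
func sequenceAllowed(ctx context.Context, input map[string]interface{}) (bool, error) {
    query, err := rego.New(
        rego.Query("data.keptn.sequences.allow"),
        rego.Module("sequences.rego", policy),
    ).PrepareForEval(ctx)
    if err != nil {
        return false, err
    }
    results, err := query.Eval(ctx, rego.EvalInput(input))
    if err != nil {
        return false, err
    }
    return results.Allowed(), nil
}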

Basic Uniform support in Keptn and Visualization in Bridge

Basic Uniform support in Keptn and Visualization in Bridge

Mockup: keptn/keptn#1280

Success Criteria:

Note: The term Keptn-service is used as a synonym to a Keptn integration that is maintained in the https://github.com/keptn-contrib organization. A Keptn-service has a subscription to an event and performs a task based on the event type and payload.

Motivation

  • Pain: Currently, it is difficult for a Keptn user to maintain custom Keptn-services since there is no UI support that shows the services and where they are running. Besides, the event subscriptions are not displayed, leaving doubt about the execution of a sequence. The only way to check this is by listing the services that are running (on the control- and execution plane) and investigating their configured subscriptions.

  • Target: Keptn supports the first implementation of the Uniform approach by visualizing deployed custom Keptn-services (on the control plane, or on an execution plane) and their subscriptions.​

  • Driver: To foster the usage of custom integrations in Keptn, there must be UI/UX improvements in handling those integrations.

User Stories:

Preparation steps:

  1. I deploy the custom Keptn-service on a control- or execution plane (in the namespace keptn-uniform) using the provided Helm Chart:
helm install jenkins-service https://github.com/keptn-contrib/jenkins/releases/download/0.8.0/jenkins-service-0.8.0.tgz -n keptn-uniform --create-namespace --set="distributor.projectFilter=sockshop,distributor.subscription=sh.keptn.event.deployment.triggered"
  2. I open the Bridge and go to the Uniform screen of my project.

(1) As a user, I want to see the jenkins-service as part of the Uniform for this project:

image

The information displayed for this service shows:

  • Name of the service: jenkins-service
  • Release version: 0.8.1
  • Name of K8s cluster where the service has been deployed on: gke_research_us-central1-c_prod-customer-A
  • Name of K8s namespace where the service has been deployed in: keptn-uniform
  • Location: ❓
  • Status: (1) Is my service currently available in the sense that it can process an event, and (2) did a former event processing cause an error? Split healthy into: available and errored
  • List of active subscriptions: (short name of sh.keptn.event.deployment.triggered) -> deployment.triggered
  • Implemented event spec version: This is not displayed but a requirement for user story (2).
  • List of supported event types: This is not displayed but a requirement for user story (3), (5).

Details:

  • When deploying a custom Keptn-service, it automatically registers itself at the control plane and provides the above meta-information (a sketch of such a registration payload follows below).
  • Then, a custom Keptn-service updates its status on:
    • (1) a regular basis - (TBD: To not overload the Keptn API this update can happen 2 times a day)
    • (2) after an event execution
    • (3) after an update of the subscriptions
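
For illustration, the registration and status payload of the jenkins-service above might carry metadata along these lines; the field names are assumptions for this sketch, not an existing Keptn API:

# Hypothetical registration/status payload of a Keptn-service (illustrative only)
name: jenkins-service
version: 0.8.1
cluster: gke_research_us-central1-c_prod-customer-A
namespace: keptn-uniform
status: available                 # or: errored
eventSpecVersion: "0.2.1"
subscriptions:
  - event: sh.keptn.event.deployment.triggered
    filter:
      project: sockshop
supportedEventTypes:
  - sh.keptn.event.deployment.triggered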

(2) As a user, I want to get notified when the deployed Keptn-service does not work with the current Keptn installation.

Next to the displayed meta-information about the Keptn-service, it should provide the info about the implemented event spec, i.e., the current Keptn event spec version: 0.2.1.

The next mockup shows three cases:

  • The implemented event spec is supported by Keptn ✔️
  • No information about the event spec version available ❌
  • The implemented event spec is not supported by this Keptn ❌

image

(3) As a user, I want to add multiple subscriptions to my custom Keptn-service by using the Bridge and then applying the change via helm upgrade on the cluster.

A subscription has:

  • name: a name of the subscription for display purposes
  • event: the event the subscription works on (e.g.: sh.keptn.event.deployment.triggered, sh.keptn.event.release.triggered)
  • filters: on the project, stage, and service level (see the sketch below).
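
Expressed as a file, a single subscription could look roughly like this (an illustrative shape only, not a defined Keptn format):

# Hypothetical subscription definition (illustrative only)
name: deploy-to-dev
event: sh.keptn.event.deployment.triggered
filter:
  project: sockshop
  stage: dev
  service: carts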

image

User flow:

  • The Add subscription button opens an empty form to set the name and to specify the event and filters.
  • The drop-down list for the event is reduced to the event types supported by the Keptn-service. Consequently, this is additional meta-information that must be provided by the service.
  • When editing the new subscription, the below text field is updated. This text field contains the helm upgrade command to update that service.

Questions:

  • How can a list of subscriptions be maintained? This can't be implemented via the environment variables of the distributor but rather by a separate ConfigMap (a purely illustrative shape is sketched below) ❓
    • By default, a distributor supports just one subscription via its environment variables.
  • What does the helm upgrade command look like? ❓
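
Purely to illustrate the ConfigMap idea raised above (the resource name and file layout are assumptions, not an existing Keptn convention), multiple subscriptions could be mounted into the distributor like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: jenkins-service-subscriptions   # hypothetical name
  namespace: keptn-uniform
data:
  subscriptions.yaml: |
    - event: sh.keptn.event.deployment.triggered
      filter:
        project: sockshop
    - event: sh.keptn.event.release.triggered
      filter:
        project: sockshop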

(4) As a user, I want to add parameters to my subscription to enrich it with context information.

  • A subscription supports a list of parameters; a parameter is a key:value pair.

Questions:

  • The parameters won't be added by the shipyard-controller to the .triggered event because this is custom information and Keptn core should not expose information of other subscriptions. Hence, can the distributor of a Keptn-service take care of adding the parameters to the event payload before forwarding it to the receiving service?

image

❓ What does the helm upgrade command look like?

(5) As a user, I can validate whether my shipyard (for this project) is covered by the Uniform. If a missing subscription is found, I can add it with two clicks and one helm upgrade.

Details / User flow:

  • Since the Keptn core knows the registered Keptn-services and the shipyard for this project, an API endpoint can derive and return the coverage of tasks (events) by subscriptions (a sketch of such a response follows after this list).
  • (1) The Bridge displays the shipyard and highlights tasks that don't have a subscription.
  • The Bridge displays a drop-down containing all services that support this particular task (event). The Keptn core has to offer an endpoint to get all registered services by their supported event type.
  • (2) - (3) By clicking the Add subscription button, an additional subscription - with the event type and stage pre-defined - is created.
  • (4) Finally, the user has to apply the changes (by a helm upgrade) to make the subscription work.
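
A rough sketch of what such a coverage response could look like; the endpoint and all field names here are assumptions made for illustration:

# Hypothetical response of a shipyard-coverage endpoint (illustrative only)
project: sockshop
stage: production
tasks:
  - task: deployment
    covered: true
    subscribedServices:
      - jenkins-service
  - task: test
    covered: false
    subscribedServices: []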

image

Status Dashboard

As a Keptn admin, I want to see an overview of how Keptn is used. I need a dashboard that shows high-level statistics.

Motivation

  • How many times was each sequence triggered?
  • Which sequences are failing?
  • What is my most popular project (by sequence usage)?
  • How many times was each sequence triggered (split by a sequence label)? This is useful to cross-charge teams for Keptn usage.

Explanation

This dashboard shows how Keptn has been used over the last day, week, month, etc. It shows which sequences are being triggered, and I can see sequences split by any provided label.

Internal details

Create a dashboard which is driven by the cloud event stats that are already available.

KEP 0015: Add default sli and slo files with Keptn create and keptn onboard service

Automatically add default SLI.yaml and SLO.yaml when onboarding or creating a new service in Keptn

Motivation

When starting with Keptn, SLIs & SLOs are a bit like "magic". By default, there is no visible SLI; therefore, it is hard to understand what is happening behind the scenes, and it is also hard to get started with a custom SLI because there is no good template to start from.
The motivation is therefore to make it easier for users to understand and extend SLIs & SLOs by automatically adding a default SLI & SLO when onboarding a new service.

Explanation

Whenever you create or onboard a service to Keptn, Keptn will automatically add a default set of SLIs and SLOs for your service. See this as "SLO-driven development", where Keptn wants you to think about good SLOs from the start in order to bring your service into the right quality state.
If you want to modify your SLIs, e.g., adding new indicators, simply edit the SLI.yaml in your Git repo.
If you want to modify your SLOs, e.g., changing conditions or adding new SLIs, simply edit the SLO.yaml in your Git repo.

Internal details

I think the Keptn API that is responsible for creating or onboarding a new service should automatically add a default SLI.yaml and SLO.yaml to all stages (a sketch of such a default pair is shown below).
Technically, this should be implemented by sending a message to the configured SLI providers to return a list of default SLIs. In the future, we could even extend this with additional concepts, e.g., send the technology type of the service (Java, PHP, Node, ...) and, depending on the tech type, the SLI provider can return different defaults. Another option would be to keep the templates in a public Keptn Git repo: every time a new service is onboarded/created, Keptn could reach out to that Git repo and fetch the best matching template for the technology type of the service.
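
As a sketch of what such a default pair could look like, loosely following the existing Keptn SLI/SLO file format (the concrete indicator queries would come from the configured SLI provider and are only placeholders here):

# SLI.yaml (sketch)
spec_version: "1.0"
indicators:
  response_time_p95: "<provider-specific query>"
  error_rate: "<provider-specific query>"

# SLO.yaml (sketch)
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  include_result_with_score: "pass"
  aggregate_function: "avg"
objectives:
  - sli: response_time_p95
    pass:
      - criteria:
          - "<=+10%"
    warning:
      - criteria:
          - "<=800"
    weight: 1
  - sli: error_rate
    pass:
      - criteria:
          - "<=+5%"
total_score:
  pass: "90%"
  warning: "75%"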

The change wouldn't impact current functionality. It just extends the default behavior and makes understanding SLIs and SLOs easier, as they become visible to the end user!
The output of the Keptn CLI should make clear that these resources have been added. If an upstream Git repo was specified, a link can even be provided as part of the output to point the user to it.
I also think we should make the SLI and SLO more prominent in the Bridge when navigating through services and stages. We could provide a link or even an "Edit SLI/SLO" option in the Bridge!

Trade-offs and mitigations

I don't think there are trade-offs, as this simply enhances the understanding of SLIs & SLOs - where they are defined and what the format looks like.

Breaking changes

No breaking changes

Prior art and alternatives

Alternatives are more documentation!

Open questions

Shall we provide a capability like this for existing projects that have been created in the past, e.g., via a keptn update service command that would then add the SLI/SLO to the repo?

Future possibilities

As mentioned above - we could think of technology specific template SLIs, SLOs that we could pull from a public repo.

Reference / Pinned Evaluations

This KEP proposes the ability to mark an evaluation as the "reference" evaluation against which all future evaluations for that project + service + stage would be measured.

Motivation

I run lots of evaluations; some pass and others fail due to factors outside my control. But eventually I get everything "just so", and I know that evaluation 123 from 18th April at 10:34 was "perfect", so I want all future evaluations to be compared against that specific evaluation - rather than against any average or previous results.

Explanation

This KEP would allow us to say, in effect, compare all future evaluations against the results of this specific evaluation, until I tell you (Keptn) otherwise.

Make installer more robust (platform AKS)

We ran into the issue that "keptn install" failed on AKS because we did not perform "az login" beforehand. We then had corrupt configurations for kubectl and the Keptn CLI, resulting in errors on each retry of keptn install. Our solution was to establish a new connection manually via:
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
Note that the errors occurred on the client side before reaching the AKS instance.

Maybe a "keptn reset" command or a sanity check before the actual installation is performed would already help.

Creating/Deleting Secrets for Integrations using Bridge

Motivation

  • Pain: Creating a secret for a Keptn-service requires the Keptn CLI or API. This can be a problem, especially in restricted environments where downloading the CLI from GitHub is blocked.
  • Target: UI-based approach to create/delete a secret for a Keptn-service.
  • Driver: Improve the experience in working with Keptn-services; reduce context-switches between UI & CLI.

Use Case (1) - (2):

⚠️ This feature is supported only for Keptn-services (integrations) that run on the control plane. Consequently, secrets for external Keptn-services are not managed.

(1) As a user, I would like to create a secret for a Keptn-service (aka Integration) using the Keptn Bridge.

User flow in Bridge:

  • Open the Uniform of a project and select the Secrets tab.
  • Add a secret using the provided form:
    • Specify the Name and at least one key:value pair
    • Add multiple key:value pairs by clicking on the + symbol
    • By clicking the Add Secret button, the secret is created and listed above.
    • After successfully creating the secret, the form is empty.
      Details:
    • By default, the value field hides the entered value.
    • If the value should be revealed, use the "eye" icon.

image

  • Switch the tab and go to Services.
  • This view shows all Keptn-services (Integrations) that are connected to the Control Plane and subscribed to this project.
  • Select the Keptn-service that should get access to the created secret
  • In the Secrets section, give the service permissions to access the secret

image

(2) As a user, I would like to delete a secret of a Keptn-service (aka Integration) using the Keptn Bridge.

User flow in Bridge:

  • Select the Secrets tab and click the "X" to delete the particular secret.
  • If the secret is in use by a Keptn-service, show a notification to inform the user and provide an option to Cancel and to Delete Secret:

image

Independent Sequences

Sequences within a project should be independent of one another. They should be able to run concurrently or "out of order".

In other words, a queued sequence (for the future) should not block other "instant run" sequences.

Motivation

Given a shipyard like this:

apiVersion: "spec.keptn.sh/0.2.2"
kind: "Shipyard"
metadata:
  name: "release-validation-shipyard"
spec:
  stages:
    - name: "validaterelease"
      sequences:
        - name: "runvalidation"
          tasks:
            - name: "evaluation"
              triggeredAfter: "60m"
              properties:
                timeframe: "2m"

If I request an execution of runvalidation, I cannot process any other sequences in this project for the next 60 minutes. In effect, everything is queued behind that sequence, and until that one sequence runs, my project is "locked".

"Keptn Lighthouse" - generalizing Keptn SLI/SLO evaluations, and OpenSLO adoption

Discussed in keptn/keptn#8985

Originally posted by oleg-nenashev October 11, 2022
There is an idea to generalize how Keptn handles SLIs and SLOs. We have a Keptn Lighthouse service that IMHO could provide standard interfaces for SLIs/SLOs (CloudEvents as now, OTel, OpenSLO) and allow users to evaluate SLOs in a generalized way while retrieving data through Keptn SLI providers, which we currently have for Prometheus/Datadog/Dynatrace/SumoLogic.

I think such a solution could be used not only inside Keptn but also by many others in the cloud-native ecosystem. It could also drive adoption for Keptn and the Keptn Lifecycle Controller, which could/would natively integrate with this new subproject.

image

Overview slides: https://docs.google.com/presentation/d/1Y_jSgWN6KM578IAJu-eBWtz7kj0c-JTWxS_XRs15uD4/edit?usp=drivesdk

Suggested plan:

  • Start a new repository in sandbox for docs/demos/tasks and interested parties
  • TBD - start a working group
  • Prototype Keptn Lighthouse based on the current Keptn codebase. It would be a Helm chart with two services and without a dedicated control plane. There might be an API service too (TBD).

Any feedback would be welcome!

Separate Dynatrace integration and Dynatrace monitoring of the cluster itself

Hi guys,

I believe we can make the Dynatrace integration with Keptn easier and we can separate concerns:

Use-case persona: Quality Gates IT operator of middleware software that does not run on Kubernetes: "I want to have the Keptn integration in my Dynatrace environment (tagging, pushing of events, notifications), but I don't want the OneAgent installed in my cluster or single instance with Quality Gates features, due to sizing or because I'm not interested in monitoring Keptn or the CI/CD pipeline."

Currently, achieving this requires installing two services and running the configure command:

  • dynatrace-service
  • dynatrace-sli-service
  • keptn configure monitoring dynatrace

This means the command "keptn configure monitoring dynatrace" should be split into the integration with Dynatrace (tagging, alerting, events, MZ, ...) and the monitoring of the cluster itself. If we monitor the cluster, we can also provide the possibility to add cluster and workload monitoring. Installing and running the AG in a pod and configuring it via a cloud event should be possible.

Thanks and best
Sergio

Shipyard Enhancement: Include Services

Include service definitions in shipyard file

Motivation

  • Config as code and self-contained
  • Implementation of this means Keptn CLI could implement a keptn init command more easily.
  • Project onboarding is easier: keptn create project creates project, stages and also services.

Explanation

The Keptn shipyard file already contains a partial definition of a project: all stages of the project, and all sequences and tasks within those stages. However, it doesn't include the service definitions. This enhancement would see a shipyard file also include the service definitions.

Internal details

For example:

apiVersion: "spec.keptn.sh/*.*.*"
kind: "Shipyard"
metadata:
  name: "shipyard"
spec:
  services:
    - name: "microserviceA"
    - name: "microserviceB"
    - ....
  stages:
    - name: "demo"
      sequences:
        - name: "hello"
          tasks:
            - name: "hello-world"

Related: #67

Sequence Timeline Visualisation (Gantt Charts)

Gantt Charts for Sequences

Generate a visual timeline (aka Gantt chart) for sequence runs to see where time was spent.

Motivation

  • Easily understand time spent or bottlenecks during sequence execution

Explanation

This graphic shows the timeline of the sequence execution. It shows you how long each step of the sequence took and where your bottlenecks for efficient sequence execution are.

Internal details

Inspect a keptnContext and generate step timings. Display them.

Scheduled Sequences

Introduce the ability for Keptn to execute (trigger) sequences on a schedule

Motivation

  • Regular or cron-like tasks that need to execute at regular time intervals
  • The ability to regularly execute Keptn sequences

Explanation

Introduce a CRON-like mechanism so that sequences can be scheduled to run at regular intervals.

Internal details

Perhaps Keptn wraps around existing mechanisms such as the Kubernetes CronJob (a rough sketch follows below).

The Definition of Done should include a UI creation option and the ability to CRUD these schedules via GitOps and the API.
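
As a rough illustration of the "wrap an existing mechanism" idea, a Kubernetes CronJob could post a triggered event to the Keptn API on a schedule. The resource names, secret handling, endpoint, and exact event body below are assumptions made for this sketch, not a proposed implementation:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: trigger-delivery-nightly          # hypothetical name
spec:
  schedule: "0 2 * * *"                   # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:7.85.0
              env:
                - name: KEPTN_API_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: keptn-api-token     # hypothetical secret
                      key: token
              args:
                - "-X"
                - "POST"
                - "https://keptn.example.com/api/v1/event"   # placeholder endpoint
                - "-H"
                - "x-token: $(KEPTN_API_TOKEN)"
                - "-H"
                - "Content-Type: application/json"
                - "-d"
                - '{"type": "sh.keptn.event.dev.delivery.triggered", "source": "scheduled-trigger", "specversion": "1.0", "data": {"project": "sockshop", "stage": "dev", "service": "carts"}}'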

Trade-offs and mitigations

Breaking changes

Prior art and alternatives

Open questions

Future possibilities

keptn apply

Introduce a keptn apply command which mirrors kubectl apply for a GitOps approach

Motivation

  • Extend Keptn GitOps approach

Explanation

keptn apply project.yaml, where project.yaml contains the project name + shipyard definition.

For example:

---
name: "my-first-project"
spec:
  shipyard:
    stages:
    - name: "dev"
      sequences:
        - name: "delivery"
          tasks:
            - name: "task-here"
  services:
    - name: "service1"
    - name: "service2"

Create Shipyard Visually in Bridge

This KEP covers enhancements to the bridge for the ability to create a shipyard visually in the UI

Motivation

  • Experienced users understand the shipyard YAML file and can create and apply via command line
  • GitOps users will (when KEP67 is complete) be able to apply Keptn configuration via Git
  • This still leaves a hole for the novice user who feels more comfortable in a UI. Let's face it: that's the majority of first-time users for any tool.

Explanation


Internal details

Add a section to the Bridge with the ability to create stages, sequences, and tasks via a UI. The Bridge can then correctly format and apply the produced YAML file.

Allow more complex JMeter Loadtests that depend on more files, enhance documentation and logging of JMeter service

Allow complex JMeter Loadtests with more files, enhance documentation and logging of JMeter service

It would be good to allow executing more complex JMeter tests where the JMX file depends on more configuration files, like a list of test users or a list of products, which makes the load test more dynamic and robust.

Also, enhancing the documentation with all the variables that are passed (like the DeploymentURI, which is mapped to SERVER_URL) would be nice.

Internal details

The files are uploaded but not added to the JMeter container. Also, the logs of the JMeter service showed that the load test was executed successfully:

LogLevel":"DEBUG","message":"Successfully executed JMeter test. 

But it was actually failing. Inside the container, there is the JMeter.log where the execution is written. Creating the users.txt file manually in the container solved the issue.
