Introduction

Sovereign Cloud Stack – Standards and Certification

SCS unifies the best of cloud computing in a certified standard. With a decentralized and federated cloud stack, SCS puts users in control of their data and fosters trust in clouds, backed by a global open-source community.

SCS compatible clouds

This is a list of clouds that we test on a nightly basis against our scs-compatible certification level.

| Name | Description | Operator | IaaS Compliance Check | HealthMon |
|---|---|---|---|---|
| gx-scs | Dev environment provided for SCS & GAIA-X context | plusserver GmbH | GitHub Workflow Status | HM |
| pluscloud open (prod1–prod4) | Public cloud for customers (4 regions) | plusserver GmbH | GitHub Workflow Status (one per region) | HM1–HM4 |
| Wavestack | Public cloud for customers | noris network AG/Wavecon GmbH | GitHub Workflow Status | HM |
| REGIO.cloud | Public cloud for customers | OSISM GmbH | GitHub Workflow Status | broken |
| CNDS | Public cloud for customers | artcodix UG | GitHub Workflow Status | HM |
| aov.cloud | Community cloud for customers | aov IT.Services GmbH | (soon) | HM |
| PoC WG-Cloud OSBA | Cloud PoC for FITKO | Cloud&Heat Technologies GmbH | GitHub Workflow Status | HM |

SCS standards overview

Standards are organized as certification levels according to SCS-0003-v1. We currently maintain one certification level, scs-compatible, which is described here: Tests/scs-compatible-iaas.yaml.

More certification levels will follow as the project progresses.

Repo Structure

This repository is organized according to SCS-0002-v1.

Standards

Official SCS standards and Decision Records, see Standards/scs-0001-v1-sovereign-cloud-standards.md.

Tests

Testsuite and tools for SCS standards, see Tests/README.md.

Drafts

Old Design-Docs folder with existing Architectural Decision Records (ADRs). This directory is currently in the process of being consolidated and cleaned up. See cleanup step-1 and open questions.

People

Contributors

adamvsmith, anjastrunk, artificial-intelligence, batistein, berendt, cah-hbaum, cah-link, chess-knight, dependabot[bot], fkr, frosty-geek, garloff, horazont, itrich, jklippel, josephinesei, joshmue, joshuai96, juanptm, linwalth, markus-hentsch, martinmo, master-caster, matfechner, matofeder, maxwolfs, mbuechse, reqa, stunivention, tonifinger


Issues

Core vs Threads vs Oversubscribed vCPUs

Do we need to reflect this in the flavor name?
(1) Dedicated Cores
(2) Dedicated Hyperthreads
(3) Oversubscribed vCPU (limited to a factor X)

Letters: v = vCPU (oversubscribed), s = dedicated thread (SMT/HT), d = dedicated core
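The letter mapping above can be sketched as a small lookup helper (a minimal illustration; the function name is made up and not part of any existing tooling):

```python
# Sketch: map the proposed CPU-type suffix letters to their meaning.
# The letters follow the proposal above; the helper name is illustrative.
CPU_TYPES = {
    "v": "vCPU (oversubscribed)",
    "s": "dedicated thread (SMT/HT)",
    "d": "dedicated core",
}

def describe_cpu_suffix(letter: str) -> str:
    """Return a human-readable description for a CPU-type suffix letter."""
    try:
        return CPU_TYPES[letter]
    except KeyError:
        raise ValueError(f"unknown CPU-type letter: {letter!r}")
```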

migrate PlusDemonstrator into Contributor-Docs

Hi,

PlusDemonstrator/GettingStarted.MD is mostly outdated nowadays and holds irrelevant information.

I suggest that we create a YAML file listing all cloud resources provided by partners (see #155) and migrate the relevant information from PlusDemonstrator into a more generic documentation listing the API endpoints/clouds-public.yaml etc.

License decision helper text

As a company that wants to contribute code to the OSS community, I face the difficulty of choosing the proper license.
We should add some docs to the scs documentation that assists with this process. This is one of the action items that came out of the second edition of the Lean SCS Operator Coffee.

tools/flavor-name-check.py: -ib without prior CPU spec gets misinterpreted

```
garloff@TuxKurt(wave-scs):/casa/src/SCS/Docs/Design-Docs [0]$ ./tools/flavor-name-check.py -v SCS-4C:16:50-ib
Flavor: SCS-4C:16:50-ib
 CPU:RAM: #vCPUs: 4, CPU type: Dedicated Core, ?Insec SMT: False, ##GiB RAM: 16.0, ?no ECC: False, ?RAM Over: False
 Disk: #:NrDisks: 1, #.GB Disk: 50, .Disk type:
 No Hypervisor
 No NestedVirtualization
 CPUBrand: .CPU Vendor: Intel, #.CPU Gen: Pre-Skylake, Performance: Std Perf
 No GPU
 No Infiniband

ERROR: Could not parse: b
Traceback (most recent call last):
  File "/casa/src/SovereignCloudStack/Docs/Design-Docs/./tools/flavor-name-check.py", line 527, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/casa/src/SovereignCloudStack/Docs/Design-Docs/./tools/flavor-name-check.py", line 512, in main
    ret = parsename(name)
  File "/casa/src/SovereignCloudStack/Docs/Design-Docs/./tools/flavor-name-check.py", line 457, in parsename
    raise NameError("Error 60: Could not parse %s (extras?)" % n)
NameError: Error 60: Could not parse b (extras?)
```

SCS-4C:16:50-ib is a legal name, but our parser is not clever enough for it. It interprets the -i of -ib as an indicator for an Intel CPU and expects a number, or nothing and/or h to follow. We will need to special-case this or improve the parser logic ...
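One way around the ambiguity is to strip the Infiniband suffix before attempting to parse a CPU-brand suffix. The sketch below is a hypothetical illustration, not the actual flavor-name-check.py logic; the brand regex is deliberately simplified:

```python
import re

# Hypothetical sketch: resolve the "-ib" vs "-i<gen>" ambiguity by checking
# the Infiniband suffix first, before trying to parse a CPU-brand suffix.
IB_RE = re.compile(r"-ib$")                    # Infiniband suffix
BRAND_RE = re.compile(r"-([izar])(\d*)(h*)$")  # e.g. -i3 (letters illustrative)

def split_suffixes(name: str):
    """Return (base_name, has_infiniband, brand_match)."""
    has_ib = False
    m = IB_RE.search(name)
    if m:
        has_ib = True
        name = name[:m.start()]
    brand = BRAND_RE.search(name)
    if brand:
        name = name[:brand.start()]
    return name, has_ib, brand

base, ib, brand = split_suffixes("SCS-4C:16:50-ib")
```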

Getting Metrics from Custom Sources

There are use cases where an SCS operator will want to scrape custom services that are not naturally integrated into the existing SCS infrastructure, but whose metrics are still helpful to gather in the centralized Prometheus provided by Kolla.

To enable such scraping, the Prometheus configuration must be editable so that these custom machines can be added to it via a custom group / custom .j2 syntax. Documenting this would be helpful.
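One mechanism that avoids repeated edits to the main job list is Prometheus file-based service discovery (file_sd_configs), where a job reads its targets from a JSON file that can be regenerated at will. A minimal sketch, assuming hypothetical target addresses, paths, and label names (not Kolla defaults):

```python
import json

# Sketch: generate a target list for Prometheus file-based service
# discovery. Addresses and label names below are assumptions.
custom_targets = [
    {"targets": ["10.0.0.5:9100", "10.0.0.6:9100"],
     "labels": {"group": "custom", "service": "node"}},
]

def write_file_sd(path: str, targets: list) -> None:
    """Write targets in the JSON format expected by file_sd_configs."""
    with open(path, "w") as fh:
        json.dump(targets, fh, indent=2)

# A matching scrape_config would then reference the file, e.g.:
#   - job_name: custom
#     file_sd_configs:
#       - files: ["/etc/prometheus/file_sd/custom.json"]
```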

DNS recommendations: Write spec

We have several considerations on DNS that we should capture in a design doc:

  • We expect SCS IaaS providers to run their own local DNS service. In the simplest case, this would be just a resolver with a local address that forwards to upstream resolution. This resolver should be highly available (by advertising two or three servers or by being some HA setup) and be the default DNS server in new subnets. In more sophisticated setups, it would also serve the local names (neutron DNS integration) and serve local zones from designate.
  • We have the domain sovereignit.cloud to resolve SCS clouds' API/horizon/... services.
  • We have the domain sovereignit.tech that can be used by users on these clouds.

This should all be documented and policies be written.

Create overview of standards

As inspired by @jklippel, we should provide an overview of our standards and their current state (release version, draft, etc.)

We discussed today during SIG Standardization that we'll add a table to the root README.MD and link to the various documents.

Container Image Infrastructure

It should be ensured that the container images which are used in SCS, ...

  • are up to date
  • do not contain software with known vulnerabilities
  • meet high quality standards

(Potential) Image sources include, but are most likely not limited to:

Dockerhub Official (Base) Images

Examples include: alpine, debian, ubuntu, mysql

Upstream-managed images on DockerHub/Quay.io/...

Examples include: prom/prometheus

  • mileage will vary from one project to another

Red Hat Certified Base Images

  • I personally do not have a lot of experience with them
  • they seem very nicely curated and maintained
  • RHEL/OpenShift centric
  • To be determined: Relation to subscription model, OKD and S2I

SCS solution?

Being a "cloud distribution", SCS may have its own set of maintained images driven by the given goals.

  • Implementing our own base images with a patch process etc. (got some ideas there)
  • Security scanning (e.g. hosting a Harbor/Quay installation)

Flavor specs: Add rationale for encoding info into names

While we encourage the use of short names in the SCS flavor naming, we do have the capability to encode a lot of detail into the flavor names. This is useful, as we may have (large) clouds where customers have the ability to choose between very fine-grained flavor types for selecting capabilities, and we need to do this by choosing the flavor names. (The mechanism of using extra_specs and matching it with images is broken -- requiring custom images just to be able to select the right VM flavor is a concept broken by design IMVHO.)
We should clarify this in the docs.

Monitor additional devices in bare metal environments

As a Cloud Operator, I want to be able to get metrics from external devices in my data center. This includes PDUs, UPSs and sensors (temperature, humidity, leakage, ...).

That means the monitoring architecture needs support for scraping such devices via SNMP or Modbus exporters (in case the monitoring is realized by Prometheus).
Different devices and sensors need specific exporter configurations which usually differ from data center to data center.

The monitoring somehow has to be aware of such devices (e.g. by being given the IP address and type of each device) so that it can apply the correct device-specific exporter configuration and eventually scrape the targets.
Alternatively, the monitoring architecture must be extensible to support additional targets, for instance by allowing additional targets to be injected into its configuration files.

Prefix "GX"

The GX prefix of course suggests that this is a Gaia-X standard.
We should not suggest it is when it is not ...

  • Even though GX is not infringing on a trademark, the Gaia-X AISBL would probably still dislike it
  • Obviously we don't want to misrepresent anything

So we have two options:

  • Try to push this via a PDR/ADR through the Provider working group
  • Use the "SCS" prefix instead

Maintaining deployment of core services on multiple layers

The deployment of some services should be possible on multiple "layers" of SCS infrastructure. E. g. deploying Prometheus on "top level" payload K8s clusters, but also on "low level" infrastructure hosts.

In order to minimize (potentially redundant) maintenance efforts, it may be considered to use the same type of deployment scripts/manifests on all levels.

Some Options:

  1. Reuse K8s manifests at "low level" infrastructure hosts by using some slim K8s distribution there. In order to have as few moving parts as possible, features like dynamic persistent volume provisioning should be disabled. Maybe it is even viable to opt out of Pod networking and kube-dns. Such a "low level" cluster is already required for the use of Gardener, AFAIK.
  2. Reuse K8s manifests at "low level" infrastructure hosts by using podman play kube. AFAIK, this feature is only suitable for pretty simple use cases.

Flavor Naming Scheme: Use of colon character as delimiter

Summary

The current flavor naming standard uses colon characters (i.e. :) as delimiters. This is an unfortunate choice, as it is incompatible with requirements stipulated by downstream applications such as Kubernetes.

Details

Kubernetes uses labels and annotations for a variety of use cases. Most importantly in the context of this discussion will be their use for taints and tolerations and the standard well-known labels, annotations and taints.

The requirements for valid label values are:

  • must be 63 characters or less (can be empty),
  • unless empty, must begin and end with an alphanumeric character ([a-z0-9A-Z]),
  • could contain dashes (-), underscores (_), dots (.), and alphanumerics between.

That is, it must adhere to the regular expression (([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?.

The usage of colon characters is not allowed and their usage in flavor names causes issues when, for example, the kubelet wants to set the well-known node.kubernetes.io/instance-type label:

Jan 01 08:09:10 foo-bar-box hyperkube[12345]: I0103 08:09:10.23456 14821 kubelet_node_status.go:426] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="SCS-4V:16:50"
[...]
Jan 01 08:09:10 foo-bar-box hyperkube[12345]: E0103 08:09:10.23456 14821 kubelet_node_status.go:94] "Unable to register node with API server" err="Node \"foo-bar-node\" is invalid: metadata.labels: Invalid value: \"SCS-4V:16:50\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')" node="foo-bar-node"

Given the significance of Kubernetes in the SCS project, it would be advisable to reconsider the naming scheme and adopt one that is compatible with the data model defined there.
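The failure above can be reproduced with the label-value regex quoted by the kubelet; the same check also shows that a colon-free variant passes. The dot replacement below is only an illustration of one possible remedy, not a proposed standard:

```python
import re

# The label-value regex quoted in the kubelet error message above.
LABEL_RE = re.compile(r"^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$")

def is_valid_label_value(value: str) -> bool:
    """Check a string against the Kubernetes label-value rules."""
    return len(value) <= 63 and LABEL_RE.match(value) is not None

# The current flavor name fails validation; replacing the colon with a
# dot (one possible, illustrative choice of delimiter) makes it pass.
assert not is_valid_label_value("SCS-4V:16:50")
assert is_valid_label_value("SCS-4V:16:50".replace(":", "."))
```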

Thank you!

Nested Virtualization

We currently don't have a way to express in the flavor name whether or not the flavors support nested virtualization.
Should we expect nested virt to be default-on and have a modifier to show that it's off?

OpenvSwitch meshing

See SovereignCloudStack/docs#98 for my definition of Prio 1-4.

  • Prio 1: As a Cloud Operator, I want to know if the OpenvSwitch(**) agent fails to complete its first synchronization iteration without error, because that iteration is vital for correct functioning; the lack of a successful first iteration causes the alert below and customer-visible impact.

    The potential action is to investigate why the first iteration did not complete, resolve the issue and restart the OpenvSwitch agent. If that is not feasible, workload needs to be migrated off the node.

  • Prio 3 or higher*: as a Cloud Operator, I want to know if an OpenStack Network on a compute node with the OpenvSwitch agent(**) is lacking any of the following tunnels (VXLAN or VLAN) configured:

    • to a DHCP node with a DHCP instance for any subnet in the network; the corresponding flow rule must allow broadcast traffic.
    • to an L3 node with an L3 router instance with a port in any subnet in the network; the corresponding flow rule must allow broadcast traffic.
    • to another compute node with a compute instance with a port in any subnet in the network; the corresponding flow rule must allow broadcast traffic.

    because lack of those rules can cause loss of DHCP leases, loss of communication between some instances in the same network and loss of internet connectivity for instances.

(*) See footnote in SovereignCloudStack/docs#98.
(**) I am not sure what SCS intends to use; I am also not sure whether the same applies, in a different flavour, to OVN for instance.

L3 router replication status

Prio 1 = customer impact, need immediate action
Prio 2 = loss of internal redundancy, need quick action to avoid subsequent Prio 1 on another failure
Prio 3 = certain Prio 1 situation is upcoming, need fast action to avoid
Prio 4 = potential Prio 1 or certain Prio 2 situation is upcoming, need fast action to avoid

In my mind, 1-2 are paging, while 3-4 are causing daytime alerts.

  • Prio 3 or higher*: As a Cloud Operator, I want to know if multiple replicas of a HA L3 router think they are in master state, because that indicates a potentially customer-visible network issue (ARP fight or L2 loss between two nodes).
  • Prio 3 or higher*: As a Cloud Operator, I want to know if an HA L3 router has no replica in master state, as that renders the router dysfunctional and has customer-visible impact: traffic will not reach the instances, because the upstream router cannot find the MAC address to send it to.
  • Prio 4 or higher*: As a Cloud Operator, I want to know if the number of HA L3 router replicas is below the configured number for a longer time.

(*): both of these can happen temporarily in a healthy system due to monitoring races but also due to propagation delays of state information, so it is a bit tricky how to alert correctly; hence just Prio 3.

Flavor naming feedback

Description

As requested on the OpenStack mailing list [1], filing this issue to gather the feedback that was voiced during the discussion.

  1. Concern was raised that the naming convention suggested in [2] is not readable at a glance, and that in order to understand it, cloud users will have to read and remember the spec, which is not convenient/acceptable from a UX perspective.
    As alternatives, the following examples were mentioned for the same sample flavor:
- nvt4: NVIDIA T4 GPU
- a8: 8 AMD vCPUs (we also have i4, for example, for Intel)
- ram24: 24 GB of RAM
- disk50: 50 GB of local system disk
- perf2: level 2 of IOPS / IO bandwidth
  • nvt4-a8-ram24-disk50-perf2
  • 8vCPU-24576RAM-50SSD-pGPU:T4-10kIOPS-EPYC4
  2. The opinion was raised that flavor naming should not be standardized/regulated at all, for the following reasons:
  • before creating an instance, people should reference the flavor specs in the first place, not the flavor names
  • at the Sydney summit, the operators present agreed on the impossibility of reaching consensus on a standard naming for flavors, because many of them used highly tuned flavors. So this is a space we might want to avoid regulating.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-June/023152.html
[2] https://github.com/SovereignCloudStack/Operational-Docs/blob/main/flavor-naming-draft.MD

Network interface speed

See SovereignCloudStack/docs#98 for my definition of Prio 1-4.

  • Prio 3: As a Cloud Operator, I want to know when a network interface drops below the specified maximum speed on that link, as that indicates bad cabling, a bad driver, or another (hardware) issue and may cause customer impact (reduced performance).
  • Prio 2: As a Cloud Operator, I want to know when the network interfaces of a compute node are saturated, as that causes degradation of performance.

Flavor-Specs: Specify modifier for support of nested virtualization

When using virtualization inside VMs, it makes a big performance difference whether we need to fully emulate things (qemu) or whether we can use hardware virtualization (qemu-kvm). The latter requires nested virtualization to be allowed on the host. This should be visible from the long flavor names.
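On a KVM host, whether nested virtualization is enabled can be checked by reading the kernel module parameter; a minimal sketch (kvm_intel shown, kvm_amd is analogous; the helper name is made up):

```python
from pathlib import Path

# Sketch: check whether nested virtualization is enabled on a KVM host by
# reading the kernel module parameter (reads 'Y' or '1' when enabled).
def nested_virt_enabled(
    param_file: str = "/sys/module/kvm_intel/parameters/nested",
) -> bool:
    """Return True if the nested parameter file reads 'Y' or '1'."""
    p = Path(param_file)
    if not p.exists():
        return False
    return p.read_text().strip() in ("Y", "1")
```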

Simplifications

Some more thoughts:

  • Do we want to drop the 'N' and 'M' prefixes? They are redundant, as the information CPU:Mem is already encoded in the numbers.
  • Do we want to attach the modifiers 'H', 'G', 'L' after the numbers? Reason being that we'd allow pattern matching SCS-4:16* this way ...
  • We wanted to discuss having shortcuts anyhow, like SCS-4:16 (which the scheduler can then schedule to all SCS-4:16 flavors ...)
  • The local disk qualifier might also need a number (size of disk in GB), no? SCS-4:16L150-A-v2 for a 4 vCPU, 16 GiB RAM, 150 GB disk(s), AMD 2nd gen (Zen v2 aka Rome, I guess).

Monitoring Userstories

As an SCS operator,
I would like to monitor my OpenStack infrastructure, the OpenStack API, and my storage environment:
Monitoring

Disk sizes missing

In standard OpenStack, flavors come with a disk size. This disk is for the root filesystem.

The disk size (if any disk is provided) should probably be part of the flavor.
"SCS-4V:16:100" would then be a flavor with 4 vCPUs (oversubscribed), 16 GiB RAM, and a 100 GB disk.
"...:50L" would be a local disk (lower latency than networked storage) of 50GB
"...:100S" would be a local SSD/NVMe type disk (>10k IOPs) of 100GB.

Standard sizes 10, 20, 50, 100, 200 ?
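The proposed disk component could be parsed along these lines (a hypothetical sketch, not the official parser; it assumes integer RAM sizes and a single uppercase CPU-type letter):

```python
import re

# Sketch: parse the proposed disk component "<size>[L|S]" from a flavor
# name like "SCS-4V:16:100S". Illustrative only.
DISK_TYPES = {"": "network/default", "L": "local disk", "S": "local SSD/NVMe"}
DISK_RE = re.compile(r"^SCS-\d+[A-Z]:\d+(?::(\d+)([LS]?))?$")

def parse_disk(name: str):
    """Return (size_gb, disk_type), or None if no disk part is present."""
    m = DISK_RE.match(name)
    if not m or m.group(1) is None:
        return None
    return int(m.group(1)), DISK_TYPES[m.group(2)]
```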

Clarify how to get cloud resources by our partners

We have discussed during the Community Hackathon 2022 that we urgently need to clarify how contributors can get cloud resources from our partners (PlusServer, Wavecon, etc.)

We have agreed on having a documented process for:

  • long-term projects
  • short-lived projects that can be destroyed after x days

Long-term projects

  • Account name within OpenStack should contain the GitHub handle
  • There must be a nominated maintainer of the project, e.g. the team/SIG coordinator
  • There must be a nominated contact on the partner side, i.e. a community member at PlusServer, Wavecon that can create those projects

Short-lived projects

  • We need to develop a standard and some magic so that our community members can request a short-lived project within their environment

Want to define standard list of flavors always available

Want to have a base list of flavors that are available across SCS clouds.

  • 1:4 as standard vCPU:RAM (in GB) ratio?
  • 1:3.75 has the advantage of better matching hardware (to leave some memory for the hypervisor)

Discussion:

  • 3.75 is better for providers
  • 4 is standard on hyperscalers

Standardize 1:4


1:4
2:8
4:16
8:32
16:64

Also standardize 1:2 (1 to 16 vCPUs) and 1:8 (1 to 8 vCPUs) for x86-64.

1:3.75 (with rounding up)

1:4
2:8
4:15
8:30
16:60
32:120
64:240
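Both candidate tables can be generated programmatically for comparison; rounding up with ceil reproduces the 1:3.75 list above:

```python
import math

# Sketch: generate the candidate vCPU:RAM lists above for comparison.
def ram_for(vcpus: int, ratio: float) -> int:
    """RAM in GiB for a vCPU count, rounding up as proposed for 1:3.75."""
    return math.ceil(vcpus * ratio)

vcpu_steps = [1, 2, 4, 8, 16, 32, 64]
table_4 = [(v, ram_for(v, 4)) for v in vcpu_steps]       # 1:4 ratio
table_375 = [(v, ram_for(v, 3.75)) for v in vcpu_steps]  # 1:3.75, rounded up
```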

Enable GitHub Discussions for project

Issue description

Currently there is no effective way to discuss drafts and suggestions that have already been merged. Raising an issue for open discussion of some point in the documents might demotivate potential contributors, since it may not really be an issue but rather a suggestion in the form of an open discussion. At the same time, issues tend to be used for design flaws or missing things.

Suggested solution

Enable the GitHub Discussions functionality and start a new thread for each new draft that lands. This will allow potential contributors to feel more welcome to participate and enables efficient asynchronous discussion of each document that anybody can join.

Flavor specs: Add extra_specs standardization

While we have the capability to encode lots of detail in the flavor names (also see the discussion at #73), we will often have cloud providers without such a huge variety in hardware and thus flavors. There is then no need to use all the optional suffixes; instead, the provider would be well advised to stick to the generic, shorter (and memorable) flavor names.
Nevertheless, there should be a way to communicate more details than are generally available in the flavor API (which is ram, vcpus, disk). There is a field extra_specs that could be used for this. This will need to be standardized.
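To make the idea concrete, a flavor's extra_specs could carry such details as key/value pairs. All key names and values below are invented for illustration; the actual keys would be whatever the standard ends up defining:

```python
# Hypothetical sketch of standardized extra_specs for a generic flavor.
# Every key name below is invented for illustration only.
extra_specs = {
    "scs:cpu-type": "shared",     # e.g. shared / dedicated-thread / dedicated-core
    "scs:disk0-type": "network",  # e.g. network / local / ssd
}

def cpu_is_dedicated(specs: dict) -> bool:
    """Illustrative helper reading a hypothetical extra_specs key."""
    return specs.get("scs:cpu-type", "shared").startswith("dedicated")
```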

nbde.md - is this for hypervisor encryption?

Hi *,
as NBDE is mentioned in the Release Notes: Is this for hypervisor encryption? How would that be integrated into the installation process? Would that be an addition to the operations docs?
