Introduction

Sovereign Cloud Stack – Standards and Certification

SCS unifies the best of cloud computing in a certified standard. With a decentralized and federated cloud stack, SCS puts users in control of their data and fosters trust in clouds, backed by a global open-source community.

SCS compatible clouds

This is a list of clouds that we test on a nightly basis against our scs-compatible certification level.

| Name | Description | Operator | IaaS Compliance Check | HealthMon |
|---|---|---|---|---|
| gx-scs | Dev environment provided for SCS & GAIA-X context | plusserver GmbH | GitHub Workflow Status | HM |
| pluscloud open (prod1–prod4) | Public cloud for customers (4 regions) | plusserver GmbH | GitHub Workflow Status (one per region) | HM1–HM4 |
| Wavestack | Public cloud for customers | noris network AG/Wavecon GmbH | GitHub Workflow Status | HM |
| REGIO.cloud | Public cloud for customers | OSISM GmbH | GitHub Workflow Status | broken |
| CNDS | Public cloud for customers | artcodix UG | GitHub Workflow Status | HM |
| aov.cloud | Community cloud for customers | aov IT.Services GmbH | (soon) | HM |
| PoC WG-Cloud OSBA | Cloud PoC for FITKO | Cloud&Heat Technologies GmbH | GitHub Workflow Status | HM |

SCS standards overview

Standards are organized as certification levels according to SCS-0003-v1. We currently maintain one certification level, scs-compatible, which is described here: Tests/scs-compatible-iaas.yaml.

More certification levels will follow as the project progresses.

Repo Structure

This repository is organized according to SCS-0002-v1.

Standards

Official SCS standards and Decision Records, see Standards/scs-0001-v1-sovereign-cloud-standards.md.

Tests

Testsuite and tools for SCS standards, see Tests/README.md.

Drafts

Old Design-Docs folder with existing Architectural Decision Records (ADRs). This directory is currently in the process of being consolidated and cleaned up. See cleanup step-1 and open questions.

People

Contributors

adamvsmith, anjastrunk, artificial-intelligence, batistein, berendt, cah-hbaum, cah-link, chess-knight, dependabot[bot], fkr, frosty-geek, garloff, horazont, itrich, jklippel, josephinesei, joshmue, joshuai96, juanptm, linwalth, markus-hentsch, martinmo, master-caster, matfechner, matofeder, maxwolfs, mbuechse, reqa, stunivention, tonifinger


Issues

Core vs Threads vs Oversubscribed vCPUs

Do we need to reflect this in the flavor name?
(1) Dedicated Cores
(2) Dedicated Hyperthreads
(3) Oversubscribed vCPU (limited to a factor X)

Letters: v = vCPU (oversubscribed), s = dedicated thread (SMT/HT), d = dedicated core
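The letter mapping above can be sketched as a small lookup helper (a minimal illustration; the function name is made up and not part of any existing tooling):

```python
# Sketch: map the proposed CPU-type suffix letters to their meaning.
# The letters follow the proposal above; the helper name is illustrative.
CPU_TYPES = {
    "v": "vCPU (oversubscribed)",
    "s": "dedicated thread (SMT/HT)",
    "d": "dedicated core",
}

def describe_cpu_suffix(letter: str) -> str:
    """Return a human-readable description for a CPU-type suffix letter."""
    try:
        return CPU_TYPES[letter]
    except KeyError:
        raise ValueError(f"unknown CPU-type letter: {letter!r}")
```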

migrate PlusDemonstrator into Contributor-Docs

Hi,

PlusDemonstrator/GettingStarted.MD is mostly outdated nowadays and holds irrelevant information.

I suggest that we create a YAML file listing all cloud resources provided by partners (see #155) and migrate the relevant information from PlusDemonstrator into a more generic documentation listing the API endpoints/clouds-public.yaml etc.

License decision helper text

As a company that wants to contribute code to the OSS community, I face the difficulty of choosing the proper license.
We should add some docs to the scs documentation that assists with this process. This is one of the action items that came out of the second edition of the Lean SCS Operator Coffee.

tools/flavor-name-check.py: -ib without prior CPU spec gets misinterpreted

```
garloff@TuxKurt(wave-scs):/casa/src/SCS/Docs/Design-Docs [0]$ ./tools/flavor-name-check.py -v SCS-4C:16:50-ib
Flavor: SCS-4C:16:50-ib
 CPU:RAM: #vCPUs: 4, CPU type: Dedicated Core, ?Insec SMT: False, ##GiB RAM: 16.0, ?no ECC: False, ?RAM Over: False
 Disk: #:NrDisks: 1, #.GB Disk: 50, .Disk type:
 No Hypervisor
 No NestedVirtualization
 CPUBrand: .CPU Vendor: Intel, #.CPU Gen: Pre-Skylake, Performance: Std Perf
 No GPU
 No Infiniband

ERROR: Could not parse: b
Traceback (most recent call last):
  File "/casa/src/SovereignCloudStack/Docs/Design-Docs/./tools/flavor-name-check.py", line 527, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/casa/src/SovereignCloudStack/Docs/Design-Docs/./tools/flavor-name-check.py", line 512, in main
    ret = parsename(name)
  File "/casa/src/SovereignCloudStack/Docs/Design-Docs/./tools/flavor-name-check.py", line 457, in parsename
    raise NameError("Error 60: Could not parse %s (extras?)" % n)
NameError: Error 60: Could not parse b (extras?)
```

SCS-4C:16:50-ib is a legal name, but our parser is not clever enough for it. It interprets the -i of -ib as an indicator for an Intel CPU and expects a number, or nothing and/or h to follow. We will need to special-case this or improve the parser logic ...
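One way around the ambiguity is to strip the Infiniband suffix before attempting to parse a CPU-brand suffix. The sketch below is a hypothetical illustration, not the actual flavor-name-check.py logic; the brand regex is deliberately simplified:

```python
import re

# Hypothetical sketch: resolve the "-ib" vs "-i<gen>" ambiguity by checking
# the Infiniband suffix first, before trying to parse a CPU-brand suffix.
IB_RE = re.compile(r"-ib$")                    # Infiniband suffix
BRAND_RE = re.compile(r"-([izar])(\d*)(h*)$")  # e.g. -i3 (letters illustrative)

def split_suffixes(name: str):
    """Return (base_name, has_infiniband, brand_match)."""
    has_ib = False
    m = IB_RE.search(name)
    if m:
        has_ib = True
        name = name[:m.start()]
    brand = BRAND_RE.search(name)
    if brand:
        name = name[:brand.start()]
    return name, has_ib, brand

base, ib, brand = split_suffixes("SCS-4C:16:50-ib")
```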

Getting Metrics from Custom Sources

There are use cases where an SCS operator will want to scrape custom services that are not naturally integrated into the existing SCS infrastructure, but whose metrics are still helpful to gather in the centralized Prometheus provided by Kolla.

To enable such scraping, the Prometheus configuration must be editable so that these custom machines can be added to it via a custom group / custom .j2 syntax. Documenting this would be helpful.
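One mechanism that avoids repeated edits to the main job list is Prometheus file-based service discovery (file_sd_configs), where a job reads its targets from a JSON file that can be regenerated at will. A minimal sketch, assuming hypothetical target addresses, paths, and label names (not Kolla defaults):

```python
import json

# Sketch: generate a target list for Prometheus file-based service
# discovery. Addresses and label names below are assumptions.
custom_targets = [
    {"targets": ["10.0.0.5:9100", "10.0.0.6:9100"],
     "labels": {"group": "custom", "service": "node"}},
]

def write_file_sd(path: str, targets: list) -> None:
    """Write targets in the JSON format expected by file_sd_configs."""
    with open(path, "w") as fh:
        json.dump(targets, fh, indent=2)

# A matching scrape_config would then reference the file, e.g.:
#   - job_name: custom
#     file_sd_configs:
#       - files: ["/etc/prometheus/file_sd/custom.json"]
```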

DNS recommendations: Write spec

We have several considerations on DNS that we should capture in a design doc:

  • We expect SCS IaaS providers to run their own local DNS service. In the simplest case, this would be just a resolver with a local address that forwards to upstream resolution. This resolver should be highly available (by advertising two or three servers or by being some HA setup) and be the default DNS server in new subnets. In more sophisticated setups, it would also serve the local names (neutron DNS integration) and serve local zones from designate.
  • We have the domain sovereignit.cloud to resolve SCS clouds' API/horizon/... services.
  • We have the domain sovereignit.tech that can be used by users on these clouds.

This should all be documented and policies be written.

Create overview of standards

As inspired by @jklippel, we should provide an overview of our standards and their current state (release version, draft, etc.)

We discussed today during SIG Standardization that we'll add a table to the root README.MD and link to the various documents.

Container Image Infrastructure

It should be ensured that the container images which are used in SCS, ...

  • are up to date
  • do not contain software with known vulnerabilities
  • meet high quality standards

(Potential) Image sources include, but are most likely not limited to:

Dockerhub Official (Base) Images

Examples include: alpine, debian, ubuntu, mysql

Upstream-managed images on DockerHub/Quay.io/...

Examples include: prom/prometheus

  • mileage will vary from one project to another

Red Hat Certified Base Images

  • I personally do not have a lot of experience with them
  • they seem very nicely curated and maintained
  • RHEL/OpenShift centric
  • To be determined: Relation to subscription model, OKD and S2I

SCS solution?

Being a "cloud distribution", SCS may have its own set of maintained images driven by the given goals.

  • Implementing our own base images with a patch process etc. (got some ideas there)
  • Security scanning (e.g. hosting a Harbor/Quay installation)

Flavor specs: Add rationale for encoding info into names

While we encourage the use of short names in the SCS flavor naming, we do have the capability to encode a lot of detail into the flavor names. This is useful, as we may have (large) clouds where customers have the ability to choose between very fine-grained flavor types for selecting capabilities, and we need to do this by choosing the flavor names. (The mechanism of using extra_specs and matching it with images is broken -- requiring custom images just to be able to select the right VM flavor is a concept broken by design IMVHO.)
We should clarify this in the docs.

Monitor additional devices in bare metal environments

As a Cloud Operator, I want to be able to get metrics from external devices in my data center. This includes PDUs, UPSs and sensors (temperature, humidity, leakage, ...).

That means the monitoring architecture needs support for scraping such devices via SNMP or Modbus exporters (in case the monitoring is realized by Prometheus).
Different devices and sensors need specific exporter configurations which usually differ from data center to data center.

The monitoring somehow has to be aware of such devices (e.g. by being given the IP address and type of each device) so that it can apply the correct device-specific exporter configuration and eventually scrape the targets.
Alternatively, the monitoring architecture must be extensible to support additional targets, for instance by allowing additional targets to be injected into its configuration files.

Prefix "GX"

The GX prefix of course suggests that this is a Gaia-X standard.
We should not suggest it is when it is not ...

  • Even though GX is not infringing on a trademark, the Gaia-X AISBL would probably still dislike it
  • Obviously we don't want to misrepresent anything

So we have two options:

  • Try to push this via a PDR/ADR through the Provider working group
  • Use the "SCS" prefix instead

Maintaining deployment of core services on multiple layers

The deployment of some services should be possible on multiple "layers" of SCS infrastructure. E. g. deploying Prometheus on "top level" payload K8s clusters, but also on "low level" infrastructure hosts.

In order to minimize (potentially redundant) maintenance efforts, it may be considered to use the same type of deployment scripts/manifests on all levels.

Some Options:

  1. Reuse K8s manifests at "low level" infrastructure hosts by using some slim K8s distribution there. In order to have as few moving parts as possible, features like dynamic persistent volume provisioning should be disabled. Maybe it is even viable to opt out of Pod networking and kube-dns. Such a "low level" cluster is already required for the use of Gardener, AFAIK.
  2. Reuse K8s manifests at "low level" infrastructure hosts by using podman play kube. AFAIK, this feature is only suitable for pretty simple use cases.

Flavor Naming Scheme: Use of colon character as delimiter

Summary

The current flavor naming standard uses colon characters (i.e. :) as delimiters. This is an unfortunate choice, as it is incompatible with requirements stipulated by downstream applications such as Kubernetes.

Details

Kubernetes uses labels and annotations for a variety of use cases. Most importantly in the context of this discussion will be their use for taints and tolerations and the standard well-known labels, annotations and taints.

The requirements for valid label values are:

  • must be 63 characters or less (can be empty),
  • unless empty, must begin and end with an alphanumeric character ([a-z0-9A-Z]),
  • could contain dashes (-), underscores (_), dots (.), and alphanumerics between.

That is, it must adhere to the regular expression (([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?.

The usage of colon characters is not allowed and their usage in flavor names causes issues when, for example, the kubelet wants to set the well-known node.kubernetes.io/instance-type label:

Jan 01 08:09:10 foo-bar-box hyperkube[12345]: I0103 08:09:10.23456 14821 kubelet_node_status.go:426] "Adding node label from cloud provider" labelKey="node.kubernetes.io/instance-type" labelValue="SCS-4V:16:50"
[...]
Jan 01 08:09:10 foo-bar-box hyperkube[12345]: E0103 08:09:10.23456 14821 kubelet_node_status.go:94] "Unable to register node with API server" err="Node \"foo-bar-node\" is invalid: metadata.labels: Invalid value: \"SCS-4V:16:50\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')" node="foo-bar-node"

Given the significance of Kubernetes in the SCS project, it would be advisable to reconsider the naming scheme and adopt one that is compatible with the data model defined there.
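The failure above can be reproduced with the label-value regex quoted by the kubelet; the same check also shows that a colon-free variant passes. The dot replacement below is only an illustration of one possible remedy, not a proposed standard:

```python
import re

# The label-value regex quoted in the kubelet error message above.
LABEL_RE = re.compile(r"^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$")

def is_valid_label_value(value: str) -> bool:
    """Check a string against the Kubernetes label-value rules."""
    return len(value) <= 63 and LABEL_RE.match(value) is not None

# The current flavor name fails validation; replacing the colon with a
# dot (one possible, illustrative choice of delimiter) makes it pass.
assert not is_valid_label_value("SCS-4V:16:50")
assert is_valid_label_value("SCS-4V:16:50".replace(":", "."))
```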

Thank you!

Nested Virtualization

We currently don't have a way to express in the flavor name whether or not the flavors support nested virtualization.
Should we expect nested virt to be default-on and have a modifier to show that it's off?

OpenvSwitch meshing

See SovereignCloudStack/docs#98 for my definition of Prio 1-4.

  • Prio 1: As a Cloud Operator, I want to know if the OpenvSwitch(**) agent fails to complete its first synchronization iteration without error, because that iteration is vital for correct functioning; the lack of a successful first iteration causes the alert below and customer-visible impact.

    The potential action is to investigate why the first iteration did not complete, resolve the issue and restart the OpenvSwitch agent. If that is not feasible, workload needs to be migrated off the node.

  • Prio 3 or higher*: as a Cloud Operator, I want to know if an OpenStack Network on a compute node with the OpenvSwitch agent(**) is lacking any of the following tunnels (VXLAN or VLAN) configured:

    • to a DHCP node with a DHCP instance for any subnet in the network; the corresponding flow rule must allow broadcast traffic.
    • to an L3 node with an L3 router instance with a port in any subnet in the network; the corresponding flow rule must allow broadcast traffic.
    • to another compute node with a compute instance with a port in any subnet in the network; the corresponding flow rule must allow broadcast traffic.

    because lack of those rules can cause loss of DHCP leases, loss of communication between some instances in the same network and loss of internet connectivity for instances.

(*) See footnote in SovereignCloudStack/docs#98.
(**) I am not sure what SCS intends to use; I am also not sure whether the same applies, in a different flavour, to OVN for instance.

L3 router replication status

Prio 1 = customer impact, need immediate action
Prio 2 = loss of internal redundancy, need quick action to avoid subsequent Prio 1 on another failure
Prio 3 = certain Prio 1 situation is upcoming, need fast action to avoid
Prio 4 = potential Prio 1 or certain Prio 2 situation is upcoming, need fast action to avoid

In my mind, 1-2 are paging, while 3-4 are causing daytime alerts.

  • Prio 3 or higher*: As a Cloud Operator, I want to know if multiple replicas of a HA L3 router think they are in master state, because that indicates a potentially customer-visible network issue (ARP fight or L2 loss between two nodes).
  • Prio 3 or higher*: As a Cloud Operator, I want to know if an HA L3 router has no replica in master state, as that renders the router dysfunctional and has customer-visible impact: traffic will not reach the instances, because the upstream router cannot find the MAC address to send it to.
  • Prio 4 or higher*: As a Cloud Operator, I want to know if the number of HA L3 router replicas is below the configured number for a longer time.

(*): both of these can happen temporarily in a healthy system due to monitoring races but also due to propagation delays of state information, so it is a bit tricky how to alert correctly; hence just Prio 3.

Flavor naming feedback

Description

As requested on the OpenStack mailing list [1], filing this issue to gather the feedback that was voiced during the discussion.

  1. Concern was raised that the naming convention suggested in [2] is not readable at a glance, and that in order to understand it, cloud users will have to read and remember the spec, which is not convenient/acceptable from a UX perspective.
    As alternatives, the following examples were mentioned for the same sample flavor:
- nvt4: NVIDIA T4 GPU
- a8: 8 AMD vCPUs (we also have i4, for example, for Intel)
- ram24: 24 GB of RAM
- disk50: 50 GB of local system disk
- perf2: level 2 of IOPS / IO bandwidth
  • nvt4-a8-ram24-disk50-perf2
  • 8vCPU-24576RAM-50SSD-pGPU:T4-10kIOPS-EPYC4
  2. The opinion was raised that flavor naming should not be standardized/regulated at all, for the following reasons:
  • before creating an instance, people should reference the flavor specs in the first place, not the flavor names
  • at the Sydney summit, the operators present agreed on the impossibility of reaching consensus on a standard naming for flavors, because many of them used highly tuned flavors. So this is a space we might want to avoid regulating.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-June/023152.html
[2] https://github.com/SovereignCloudStack/Operational-Docs/blob/main/flavor-naming-draft.MD

Network interface speed

See SovereignCloudStack/docs#98 for my definition of Prio 1-4.

  • Prio 3: As a Cloud Operator, I want to know when a network interface drops below the specified maximum speed on that link, as that indicates bad cabling, a bad driver, or another (hardware) issue and may cause customer impact (reduced performance).
  • Prio 2: As a Cloud Operator, I want to know when the network interfaces of a compute node are saturated, as that causes degradation of performance.

Flavor-Specs: Specify modifier for support of nested virtualization

When using virtualization inside VMs, it makes a big performance difference whether we need to fully emulate things (qemu) or whether we can use hardware virtualization (qemu-kvm). The latter requires nested virtualization to be allowed on the host. This should be visible from the long flavor names.
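On a KVM host, whether nested virtualization is enabled can be checked by reading the kernel module parameter; a minimal sketch (kvm_intel shown, kvm_amd is analogous; the helper name is made up):

```python
from pathlib import Path

# Sketch: check whether nested virtualization is enabled on a KVM host by
# reading the kernel module parameter (reads 'Y' or '1' when enabled).
def nested_virt_enabled(
    param_file: str = "/sys/module/kvm_intel/parameters/nested",
) -> bool:
    """Return True if the nested parameter file reads 'Y' or '1'."""
    p = Path(param_file)
    if not p.exists():
        return False
    return p.read_text().strip() in ("Y", "1")
```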

Simplifications

Some more thoughts:

  • Do we want to drop the 'N' and 'M' prefixes? They are redundant, as the information CPU:Mem is already encoded in the numbers.
  • Do we want to attach the modifiers 'H', 'G', 'L' after the numbers? Reason being that we'd allow pattern matching SCS-4:16* this way ...
  • We wanted to discuss having shortcuts anyhow, like SCS-4:16 (which the scheduler can then schedule to all SCS-4:16 flavors ...)
  • The local disk qualifier might also need a number (size of disk in GB), no? SCS-4:16L150-A-v2 for a 4 vCPU, 16 GiB RAM, 150 GB disk(s), AMD 2nd gen (Zen v2 aka Rome, I guess).

Monitoring Userstories

As an SCS operator,
I would like to monitor my OpenStack infrastructure, the OpenStack API, and my storage environment:
Monitoring

Disk sizes missing

In standard OpenStack, flavors come with a disk size. This disk is for the root filesystem.

The disk size (if any disk is provided) should probably be part of the flavor.
"SCS-4V:16:100" would then be a flavor with 4 vCPUs (oversubscribed), 16 GiB RAM, and a 100 GB disk.
"...:50L" would be a local disk (lower latency than networked storage) of 50GB
"...:100S" would be a local SSD/NVMe type disk (>10k IOPs) of 100GB.

Standard sizes 10, 20, 50, 100, 200 ?
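The proposed disk component could be parsed along these lines (a hypothetical sketch, not the official parser; it assumes integer RAM sizes and a single uppercase CPU-type letter):

```python
import re

# Sketch: parse the proposed disk component "<size>[L|S]" from a flavor
# name like "SCS-4V:16:100S". Illustrative only.
DISK_TYPES = {"": "network/default", "L": "local disk", "S": "local SSD/NVMe"}
DISK_RE = re.compile(r"^SCS-\d+[A-Z]:\d+(?::(\d+)([LS]?))?$")

def parse_disk(name: str):
    """Return (size_gb, disk_type), or None if no disk part is present."""
    m = DISK_RE.match(name)
    if not m or m.group(1) is None:
        return None
    return int(m.group(1)), DISK_TYPES[m.group(2)]
```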

Clarify how to get cloud resources by our partners

We have discussed during the Community Hackathon 2022 that we urgently need to clarify how contributors can get cloud resources from our partners (PlusServer, Wavecon, etc.)

We have agreed on having a documented process for:

  • long-term projects
  • short-lived projects that can be destroyed after x days

Long-term projects

  • Account name within OpenStack should contain the GitHub handle
  • There must be a nominated maintainer of the project, e.g. the team/SIG coordinator
  • There must be a nominated contact on the partner side, i.e. a community member at PlusServer, Wavecon that can create those projects

Short-lived projects

  • We need to develop a standard and some magic so that our community members can request a short-lived project within their environment

Want to define standard list of flavors always available

Want to have a base list of flavors that are available across SCS clouds.

  • 1:4 as standard vCPU:RAM (in GB) ratio?
  • 1:3.75 has the advantage of better matching hardware (to leave some memory for the hypervisor)

Discussion:

  • 3.75 is better for providers
  • 4 is standard on hyperscalers

Standardize 1:4


1:4
2:8
4:16
8:32
16:64

Also standardize 1:2 (1 to 16 vCPUs) and 1:8 (1 to 8 vCPUs) for x86-64.

1:3.75 (with rounding up)

1:4
2:8
4:15
8:30
16:60
32:120
64:240
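Both candidate tables can be generated programmatically for comparison; rounding up with ceil reproduces the 1:3.75 list above:

```python
import math

# Sketch: generate the candidate vCPU:RAM lists above for comparison.
def ram_for(vcpus: int, ratio: float) -> int:
    """RAM in GiB for a vCPU count, rounding up as proposed for 1:3.75."""
    return math.ceil(vcpus * ratio)

vcpu_steps = [1, 2, 4, 8, 16, 32, 64]
table_4 = [(v, ram_for(v, 4)) for v in vcpu_steps]       # 1:4 ratio
table_375 = [(v, ram_for(v, 3.75)) for v in vcpu_steps]  # 1:3.75, rounded up
```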

Enable GitHub Discussions for project

Issue description

Currently there is no effective way to discuss drafts and suggestions that have already been merged. Raising an issue for open discussion of some point in the documents might demotivate potential contributors, since it may not really be an issue but rather a suggestion in the form of an open discussion. At the same time, issues tend to be used for design flaws or missing things.

Suggested solution

Enable the GitHub Discussions functionality and start a new thread for each new draft that lands. This will allow potential contributors to feel more welcome to participate and enables efficient asynchronous discussion of each document that anybody can join.

Flavor specs: Add extra_specs standardization

While we have the capability to encode lots of detail in the flavor names (also see the discussion at #73), we will often have cloud providers without such a huge variety in hardware and thus flavors. There is then no need to use all the optional suffixes; instead, the provider would be well advised to stick to the generic, shorter (and memorable) flavor names.
Nevertheless, there should be a way to communicate more details than are generally available in the flavor API (which is ram, vcpus, disk). There is a field extra_specs that could be used for this. This will need to be standardized.
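To make the idea concrete, a flavor's extra_specs could carry such details as key/value pairs. All key names and values below are invented for illustration; the actual keys would be whatever the standard ends up defining:

```python
# Hypothetical sketch of standardized extra_specs for a generic flavor.
# Every key name below is invented for illustration only.
extra_specs = {
    "scs:cpu-type": "shared",     # e.g. shared / dedicated-thread / dedicated-core
    "scs:disk0-type": "network",  # e.g. network / local / ssd
}

def cpu_is_dedicated(specs: dict) -> bool:
    """Illustrative helper reading a hypothetical extra_specs key."""
    return specs.get("scs:cpu-type", "shared").startswith("dedicated")
```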

nbde.md - is this for hypervisor encryption?

Hi *,
as NBDE is mentioned in the Release Notes: Is this for hypervisor encryption? How would that be integrated into the installation process? Would that be an addition to the operations docs?
