(What's below is copied and slightly edited from an earlier e-mail thread)
As discussed during the meeting earlier today, we'll need a story w.r.t. storage for the demo platform and, obviously, also for production deployments later on.
In general, when no kind of NAS which can be dynamically provisioned is available, and we're 'stuck' with local storage (on all servers comprising the cluster, or maybe even only on some of them), I think
there are 2.5 possible approaches:
- Deploy a NAS with a dynamic volume provisioner ourselves:
  - either host-based,
  - or hyperconverged
- Use local storage as it is: local-only
In the current architecture, our stateful services are all clustered, and don't need shared (NAS-style) storage for their Pods to be rescheduled onto other nodes and continue operating. As such, a local storage solution should suit our needs, without the operational headache and overhead of a shared solution.
Next up: how to provision these volumes.
One possibility is to simply create empty directories under some folder, and deploy the local-storage provisioner. I'm not overly fond of this solution, because it doesn't allow for capacity isolation: all those directories share the capacity of the underlying filesystem, so a volume bound to an ElasticSearch Pod which is solely used for log ingestion could fill up the disk and disrupt production services (other Pods) running on the same node. Less than desirable.
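For reference, a rough sketch of what that would entail (the storageClassMap format is the one used by the external-storage local-volume provisioner; names and paths are placeholders):

```yaml
# Hypothetical config: the provisioner turns every directory it finds
# under hostDir into a PV. Plain subdirectories all live on the same
# filesystem, so statfs reports the full parent capacity for each of
# them, hence no capacity isolation between the resulting volumes.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-storage:
      hostDir: /mnt/local-volumes
      mountDir: /mnt/local-volumes
```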
In 'real' on-prem deployments, we could aim for an architecture based on one-disk-per-volume, or a 'physical' partition per volume, prepared by some Ansible job (where we should really use by-UUID rules in fstab and associated mount-points such that losing fstab isn't lethal ;-)).
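As a sketch, such a by-UUID mount could be handled with the stock Ansible mount module (the `fs_uuid` variable is a placeholder, e.g. gathered via blkid in an earlier task):

```yaml
# Hypothetical task: mount a pre-formatted disk/partition by UUID, so a
# device being renamed across reboots can't break the mount.
# 'state: mounted' both writes the fstab entry and mounts the filesystem.
- name: Mount volume by UUID
  mount:
    src: "UUID={{ fs_uuid }}"
    path: "/mnt/local-volumes/{{ fs_uuid }}"
    fstype: ext4          # FS choice still TBD, see below
    opts: defaults,noatime
    state: mounted
```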
Alternatively, for both 'physical' as well as VM deployments (i.e. the demo platform), we could bundle 'similar' (SSD vs HDD, etc.) disks into LVM VGs, then create LVs according to our needs, and use them as in the scenario above.
@ballot-scality mentioned using thin provisioning in this case; I'm however not sure that's the right approach, since it'd require constant monitoring of the platform to ensure the thin pool doesn't run out of space...
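For illustration, here's a rough sketch of how the VG/LV part could look with stock Ansible modules (plain, fixed-size LVs rather than thin ones, per the concern above; all names and sizes are placeholders):

```yaml
# Hypothetical tasks: bundle 'similar' disks into a VG, then carve out
# fixed-size (non-thin) LVs so every volume gets hard capacity isolation.
- name: Create a volume group from the SSDs
  lvg:
    vg: my-vg
    pvs: /dev/vdb,/dev/vdc

- name: Create one logical volume per desired size
  lvol:
    vg: my-vg
    lv: "lv{{ item.0 }}"
    size: "{{ item.1 }}"
  with_indexed_items:
    - 50g
    - 5g
    - 10g
```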
In the actual-volumes cases, I think we can't really use the local-volume provisioner though: this provisioner uses statfs to figure out the 'real' size of a volume mounted under the path it monitors, and creates a PV accordingly. It is, however, difficult (if possible at all) to create a disk/LV and FS of exactly the desired size we pre-define in our Charts. As a result, a user would need to override the Chart values which define the desired PVC size request according to the specific cluster deployment, which is undesirable.
Instead, we should (at least, IMHO) pre-provision disks/LVs/FSs of the size we need (given the defaults in Chart values), then have an Ansible task POST the relevant PVs after the K8s cluster has been deployed, i.e. good old static provisioning.
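Mechanically, that POST could be a task along these lines, assuming Ansible's k8s module (2.6+) and a kubeconfig on the controller are available; the template name and the provisioned_volumes list are placeholders:

```yaml
# Hypothetical task: statically register each pre-provisioned volume with
# the API server once the cluster is up (good old static provisioning).
- name: Create pre-provisioned local PersistentVolumes
  k8s:
    state: present
    definition: "{{ lookup('template', 'local-pv.yaml.j2') | from_yaml }}"
  loop: "{{ provisioned_volumes }}"   # one entry per (node, volume) pair
```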
There's some initial work to enhance the story around dynamic provisioning of local storage volumes, see kubernetes-retired/external-storage#651 and the pointers from the people who provided some input. These features, however, will only land in K8s 1.11 at the earliest (though we may want to contribute to the efforts; I have some concerns about the current design which I'll raise in the PR later), so this won't be of any use in the foreseeable future.
To summarize, my current proposal would be to:
- Let a user list the disks they want to dedicate to the K8s cluster in the Ansible configuration, as some kind of dict (per node):
```yaml
my-vg:
  drives: ['/dev/vdb', '/dev/vdc']
  storageClassName: local-ssd
  provisionedVolumeSizes:
    - 50Gi
    - 5Gi
    - 10Gi
```
(up to @Zempashi, @alxf and @ballot-scality to tell me how this is properly done in Ansible ;-))
- Create LVM PVs, VGs and LVs accordingly
- Create some FS on the LVs (TBD which)
- Create `/mnt/my-vg/$UUID` for every volume to be provisioned, add it to fstab, and mount it
- Deploy K8s
- POST a StorageClass for every storageClassName defined in the configuration, also setting the right scheduling options (see the sketch after this list)
- POST PVs for every volume provisioned, including the correct node affinity rules etc., of the defined size (i.e. not using the size as reported by the FS, which may be slightly smaller)
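To make those last two steps concrete, here's a hedged sketch of the objects we'd POST (names, size and node are placeholders; the delayed binding mode is what I mean by 'the right scheduling options', so the scheduler takes the volume's node into account before binding):

```yaml
# Hypothetical StorageClass: no dynamic provisioner, volumes are pre-created.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# Hypothetical pre-provisioned PV: capacity is the size we provisioned
# (not whatever the FS reports), pinned to its node via node affinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-node1-0          # placeholder name
spec:
  capacity:
    storage: 50Gi                  # the defined size, not the statfs one
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/my-vg/some-uuid     # placeholder mount-point
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1            # placeholder node name
```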
By default, we'd pre-create PVs for the PVCs of all stateful services we're aware of.
We should check with TS people whether they feel comfortable using LVM2 for this purpose. I see, however, no reasonable way to achieve this otherwise.