
cloud-native-setup's Introduction

Cloud Native Setup

Automation around setting up the cloud-native content (Kubernetes) on Clear Linux.

Folder Structure

  • clr-k8s-examples: scripts to deploy a Kubernetes cluster.
  • metrics: tools to aid in measuring the scaling capabilities of Kubernetes clusters.


cloud-native-setup's Issues

Add test case to create nginx-ingress with MetalLB-provided IP

This allows the HTTP gateway to be directly exposed on well-known ports to the rest of the network.

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 80
      protocol: TCP
    - name: https
      port: 443
      targetPort: 443
      protocol: TCP
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
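Once MetalLB hands out an address, a quick way to confirm the exposed IP and ports (a minimal check, using the service name and namespace from the manifest above):

kubectl get svc -n ingress-nginx ingress-nginx
# EXTERNAL-IP should show an address from the MetalLB pool rather than <pending>;
# ports 80/443 should then answer from anywhere on the network
curl -I http://<EXTERNAL-IP>/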

---

kata-deploy causing cri-o to fail after node is up

kata-deploy adds multiple entries for the kata-runtime in /etc/crio/crio.conf, after the node is up.
This causes CRI-O to fail, and the node state shows as NotReady.
To run any further pods, I had to manually remove the duplicate entries from the CRI-O config file, stop the kata-deploy DaemonSet, and restart CRI-O.
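A rough sketch of that manual recovery, as described above (the DaemonSet name and namespace are assumed to match the standard kata-deploy manifest):

# Stop kata-deploy so it does not rewrite the config again (name/namespace assumed)
kubectl -n kube-system delete ds kata-deploy

# Edit /etc/crio/crio.conf by hand to drop the duplicated kata runtime entries,
# then restart CRI-O and wait for the node to report Ready again
sudo vi /etc/crio/crio.conf
sudo systemctl restart crio
kubectl get nodes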

PVC attach is failing on latest Clearlinux

  Type     Reason       Age                 From             Message
  ----     ------       ----                ----             -------
  Warning  FailedMount  74s (x34 over 75m)  kubelet, clr-03  Unable to mount volumes for pod "grafana-55dcb5484d-dq5lp_monitoring(faa0fcc4-d402-4ac1-a5fa-f8b4edcbbe68)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"grafana-55dcb5484d-dq5lp". list of unmounted volumes=[grafana-storage]. list of unattached volumes=[grafana-storage grafana-datasources grafana-dashboards grafana-dashboard-k8s-cluster-rsrc-use grafana-dashboard-k8s-node-rsrc-use grafana-dashboard-k8s-resources-cluster grafana-dashboard-k8s-resources-namespace grafana-dashboard-k8s-resources-pod grafana-dashboard-nodes grafana-dashboard-pods grafana-dashboard-statefulset grafana-token-6bv4n]
clear@clr-01 ~/clr-k8s-examples $ kubectl describe pvc -n monitoring    grafana-storage-pvc
Name:          grafana-storage-pvc
Namespace:     monitoring
StorageClass:  rook-ceph-block
Status:        Bound
Volume:        pvc-2f14f1c5-cf4f-4567-bfdb-8f1d68f5e1fc
Labels:        app=grafana
Annotations:   control-plane.alpha.kubernetes.io/leader:
                 {"holderIdentity":"888b1f2f-a35f-11e9-9ac2-b6e407c75ca1","leaseDurationSeconds":15,"acquireTime":"2019-07-10T22:10:32Z","renewTime":"2019-...
               kubectl.kubernetes.io/last-applied-configuration:
                 {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"labels":{"app":"grafana"},"name":"grafana-storage-pvc","na...
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: ceph.rook.io/block
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      1Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    grafana-55dcb5484d-dq5lp
Events:        <none>
clear@clr-01 ~/clr-k8s-examples $ kubectl get pvc --all-namespaces -o wide
NAMESPACE     NAME                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE   VOLUMEMODE
kube-system   elasticsearch-logging-elasticsearch-logging-0   Bound    pvc-cd198db3-0476-4f0c-9355-140e326757b8   1Gi        RWO            rook-ceph-block   85m   Filesystem
kube-system   kubernetes-dashboard-pvc                        Bound    pvc-ea813018-181e-47b7-a765-40cb6d28f1c9   1Gi        RWO            rook-ceph-block   21m   Filesystem
monitoring    alertmanager-main-db-alertmanager-main-0        Bound    pvc-8fbf3579-8d32-42a9-b31e-0355380a6b79   1Gi        RWO            rook-ceph-block   84m   Filesystem
monitoring    grafana-storage-pvc                             Bound    pvc-2f14f1c5-cf4f-4567-bfdb-8f1d68f5e1fc   1Gi        RWO            rook-ceph-block   85m   Filesystem
monitoring    prometheus-k8s-db-prometheus-k8s-0              Bound    pvc-c8a4acf3-1fd4-4011-8c39-645930e5e709   1Gi        RWO            rook-ceph-block   84m   Filesystem
clear@clr-03 ~ $ lsmod | grep rbd
rbd                    86016  0
libceph               311296  1 rbd

Networking is broken with latest clearlinux

Due to a systemd change to the rp_filter defaults, all non-local traffic to the pods is blocked.

The canal setup has not changed, so it must be either the kernel or systemd.

This can be worked around by setting FELIX_IGNORELOOSERPF to true.

diff --git a/clr-k8s-examples/0-canal/canal.yaml b/clr-k8s-examples/0-canal/canal.yaml
index 6666a2c..17b3cd4 100644
--- a/clr-k8s-examples/0-canal/canal.yaml
+++ b/clr-k8s-examples/0-canal/canal.yaml
@@ -159,6 +159,8 @@ spec:
               value: "info"
             - name: FELIX_HEALTHENABLED
               value: "true"
+            - name: FELIX_IGNORELOOSERPF
+              value: "true"
           securityContext:
             privileged: true
           resources:
diff --git a/clr-k8s-examples/setup_system.sh b/clr-k8s-examples/setup_system.sh
index f1614bf..4f00498 100755
--- a/clr-k8s-examples/setup_system.sh
+++ b/clr-k8s-examples/setup_system.sh
@@ -22,8 +22,6 @@ fi
 sudo mkdir -p /etc/sysctl.d/
 cat <<EOT | sudo bash -c "cat > /etc/sysctl.d/60-k8s.conf"
 net.ipv4.ip_forward=1
-net.ipv4.conf.default.rp_filter=1
-net.ipv4.conf.all.rp_filter=1
 EOT
 sudo systemctl restart systemd-sysctl
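To confirm the resulting sysctl state on a node (a quick check, not part of the patch itself; Calico requires these to be 0 or 1):

sysctl net.ipv4.conf.all.rp_filter
sysctl net.ipv4.conf.default.rp_filter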

/cc @krsna1729

canal fails to start due to quay.io not being present in the set of valid registries

CRI-O has started enforcing registry origin checks, so every registry used by the stack needs to be on the whitelist. We should update /etc/containers/registries.conf, or replace it with a version listing all valid registries.

/etc/containers/registries.conf

# This is a system-wide configuration file used to
# keep track of registries for various container backends.
# It adheres to TOML format and does not support recursive
# lists of registries.

# The default location for this configuration file is /etc/containers/registries.conf.

# The only valid categories are: 'registries.search', 'registries.insecure',
# and 'registries.block'.

[registries.search]
registries = ['docker.io', 'quay.io']

# If you need to access insecure registries, add the registry's fully-qualified name.
# An insecure registry is one that does not have a valid SSL certificate or only does HTTP.
[registries.insecure]
registries = ['docker.io', 'quay.io']


# If you need to block pull access from a registry, uncomment the section below
# and add the registries fully-qualified name.
#
# Docker only
[registries.block]
registries = []

Vagrant setup fails, qemu fails to load OVMF BIOS file.

I'm (still) having trouble bringing up the Vagrantfile on an F29 machine. QEMU reports a failure loading the OVMF BIOS (although, as we have seen previously, that may be a misleading error and the real problem may lie elsewhere).
I will note that I can use Vagrant with libvirt to load an example machine, and I can use it to load @ganeshmaharaj's Ubuntu-based setup that is derived from this codebase.

Here is the error output:

An error occurred while executing multiple actions in parallel.
Any errors that occurred are shown below.

An error occurred while executing the action on the 'clr-01'
machine. Please handle this error then try again:

There was an error talking to Libvirt. The error message is shown
below:

Call to virDomainCreateWithFlags failed: internal error: process exited while connecting to monitor: qemu: could not load PC BIOS '/home/gwhaley/.vagrant.d/boxes/AntonioMeireles-VAGRANTSLASH-ClearLinux/28390/libvirt/OVMF.fd'

An error occurred while executing the action on the 'clr-02'
machine. Please handle this error then try again:

There was an error talking to Libvirt. The error message is shown
below:

Call to virDomainCreateWithFlags failed: internal error: qemu unexpectedly closed the monitor: qemu: could not load PC BIOS '/home/gwhaley/.vagrant.d/boxes/AntonioMeireles-VAGRANTSLASH-ClearLinux/28390/libvirt/OVMF.fd'

An error occurred while executing the action on the 'clr-03'
machine. Please handle this error then try again:

There was an error talking to Libvirt. The error message is shown
below:

Call to virDomainCreateWithFlags failed: internal error: qemu unexpectedly closed the monitor: qemu: could not load PC BIOS '/home/gwhaley/.vagrant.d/boxes/AntonioMeireles-VAGRANTSLASH-ClearLinux/28390/libvirt/OVMF.fd'

But if we go and look, the file does exist:

[gwhaley@fido libvirt]$ pwd
/home/gwhaley/.vagrant.d/boxes/AntonioMeireles-VAGRANTSLASH-ClearLinux/28390/libvirt
[gwhaley@fido libvirt]$ ls -la
total 2153112
drwxrwxr-x. 2 gwhaley gwhaley       4096 Mar 20 10:52 .
drwxrwxr-x. 3 gwhaley gwhaley       4096 Mar 20 10:51 ..
-rw-r--r--. 1 gwhaley gwhaley 2200567808 Mar 20 10:51 box.img
-rw-r--r--. 1 gwhaley gwhaley        201 Mar 20 10:51 info.json
-rw-r--r--. 1 gwhaley gwhaley         58 Mar 20 10:51 metadata.json
-rw-rw-r--. 1 gwhaley gwhaley    4194304 Mar 20 10:52 OVMF.fd
-rw-r--r--. 1 gwhaley gwhaley       3750 Mar 20 10:51 Vagrantfile

Here are some component version infos:

[gwhaley@fido libvirt]$ qemu-system-x86_64 --version
QEMU emulator version 3.0.0 (qemu-3.0.0-3.fc29)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
[gwhaley@fido libvirt]$ libvirtd --version
libvirtd (libvirt) 4.7.0
[gwhaley@fido libvirt]$ vagrant --version
Vagrant 2.2.4

And the base level machine I'm on:

[gwhaley@fido libvirt]$ cat /etc/os-release 
NAME=Fedora
VERSION="29 (Workstation Edition)"
ID=fedora
VERSION_ID=29
VERSION_CODENAME=""
PLATFORM_ID="platform:f29"
PRETTY_NAME="Fedora 29 (Workstation Edition)"
ANSI_COLOR="0;34"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:29"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f29/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=29
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=29
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Workstation Edition"
VARIANT_ID=workstation

If you need any more info, just ask. I will go back to trying the Ubuntu based derivative for now to continue testing.

consider adding 'easyrsa', to enable juju k8s e2e workflow

Looking at how to run the k8s e2e tests (note that this is an Ubuntu-specific page, but I'm struggling to find a more generic k8s e2e test page right now), it notes that one should relate the kubernetes-e2e charm to the existing kubernetes-master nodes and to easyrsa.
@ganeshmaharaj also notes that easyrsa is one of the options for client certificate authentication, which is a requirement for any level of 'production-ready' stack-up.

Should we consider adding easyrsa to our stackup?

Multiple Cluster level Networks causing failures

The cloud native setup creates two networks for the cluster.

  • The primary network for pod traffic
  • A network for storage traffic

However, when both networks are enabled, kubectl logs does not work from the master for pods running on the workers.

An interactive shell also does not work:
kubectl run -i --tty busybox --image=busybox -- sh # Run pod as interactive shell

Both of these features used to work on older Clear Linux releases, and the root cause of the failure is still unknown.

Cilium network plugin does not work with CRIO

The default upstream Cilium plugin does not work with CRI-O, which means it does not work on Clear Linux today.

kubectl create -f https://raw.githubusercontent.com/cilium/cilium/v1.5/examples/kubernetes/1.14/cilium.yaml
  Normal   Scheduled    4m32s                 default-scheduler  Successfully assigned kube-system/cilium-ch9mr to clr-02
  Warning  FailedMount  22s (x10 over 4m31s)  kubelet, clr-02    MountVolume.SetUp failed for volume "docker-socket" : hostPath type check failed: /var/run/docker.sock is not a socket file
  Warning  FailedMount  14s (x2 over 2m29s)   kubelet, clr-02    Unable to mount volumes for pod "cilium-ch9mr_kube-system(f966519a-96d7-11e9-aba4-525400f0e781)": timeout expired waiting for vo

Enable CSI storage

It would be useful to have CSI-based storage enabled. CSI reached 1.0/GA in Kubernetes 1.13.

containerd: containerd-shim-v2

Kata now supports containerd-shim-v2. We need to support this setup, as it will allow us to do proper resource accounting. We will need to hold containerd at a version that supports devicemapper, so this is not ideal from a security point of view, but it is useful to ensure that auto-scaling and resource accounting are stress-tested with Kata.
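As a rough sketch, registering the shim-v2 runtime in the containerd CRI config might look like the following (section names assumed for the containerd 1.2.x-era config layout; verify against the installed containerd before use):

# Add a Kata runtime handler that uses the shim-v2 interface (layout assumed)
cat <<EOT | sudo tee -a /etc/containerd/config.toml
[plugins.cri.containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
EOT

sudo systemctl restart containerd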

updates for v1.14.0

Hello.
I started testing k8s version v1.14.0, and saw:

I0326 15:20:57.920018 2334 version.go:237] remote version is much newer: v1.14.0; falling back to: stable-1.13

The following error shows that the configuration scripts will need updates:

+ sudo -E kubeadm init --config=./kubeadm.yaml
your configuration file uses a deprecated API spec:
"kubeadm.k8s.io/v1alpha3". Please use 'kubeadm config migrate --old-
config old.yaml --new-config new.yaml', which will write the new,
similar spec using a newer API version.

Is there a separate branch where this update is being worked on?
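For reference, the migration the error message points at would look roughly like this (the new-config file name is an arbitrary choice):

# Rewrite the deprecated v1alpha3 spec using a newer API version
sudo kubeadm config migrate --old-config ./kubeadm.yaml --new-config ./kubeadm-new.yaml

# Then re-run init against the migrated config
sudo -E kubeadm init --config=./kubeadm-new.yaml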

error running create_stack.sh

I'm using the master branch @f863bb7f, and running create_stack.sh returned with this error:

daemonset.apps/kata-deploy created
error: the path "8-kata/runtimeclass_crd.yaml" does not exist

Looks like the latest commit removed a file that is still required elsewhere. Could anyone please check?

Canal Calico crashes

kubectl logs -n kube-system   canal-2d9cw -c calico-node

2019-04-19 16:56:55.642 [FATAL][1436] int_dataplane.go 824: Kernel's RPF check is set to 'loose'.  This would allow endpoints to spoof their IP address.  Calico requires net.ipv4.conf.all.rp_filter to be set to 0 or 1. If you require loose RPF and you are not concerned about spoofing, this check can be disabled by setting the IgnoreLooseRPF configuration parameter to 'true'

Enable Ansible by adding switch for no terminal

Hi,

Currently the create_stack.sh script uses [ -t 0 ] to detect a terminal, but this causes Ansible to hang when it tries to run create_stack.sh. I kludged around it with a sed to make the check never true, but it would be preferable to have a flag to disable it; one possible shape is sketched below. Would a PR to add such a flag be accepted, or is there another mechanism we should use?
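One possible shape for such a flag (a sketch only; the flag name and prompt are made up, not taken from the current script):

#!/bin/bash
# Hypothetical --no-prompt flag to bypass the interactive terminal check
NO_PROMPT=false
[ "$1" = "--no-prompt" ] && NO_PROMPT=true

if [ -t 0 ] && [ "$NO_PROMPT" = false ]; then
    echo "Running interactively; press Enter to continue or Ctrl-C to abort"
    read -r
fi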

Configure CNI dirs

With CRI-O v1.14.0 we have:

CNI integration now supports multiple directories for loading plugins.

Adapt to that by adding /usr/libexec/cni/ and /opt/cni/bin to the configs and cleaning up the symlinks.
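A quick way to check what an existing node is actually configured with (plain inspection commands; paths as used elsewhere in this repo):

# Compare the CNI plugin directories listed in the live config and the distro default
grep -A 4 plugin_dirs /etc/crio/crio.conf /usr/share/defaults/crio/crio.conf

# Confirm the plugin binaries exist in both locations and are not dangling symlinks
ls -l /usr/libexec/cni/ /opt/cni/bin/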

Ubuntu version: instance does not work after vagrant halt/up cycle

Hi @ganeshmaharaj - I'm reporting here (but really against your https://github.com/ganeshmaharaj/vagrant-stuff/tree/master/k8s, which I believe is a derivative?) as I thought we'd get more exposure and eyes on it this way....

I was using your Ubuntu-based instance fine, but after a host reboot, when I brought it back up, kubectl no longer worked. I had a dig around, and I suspect some things do not come back up after the reboot. I used a vagrant halt/up cycle to make it more refined and hopefully repeatable. Here are my logs:


Try to shutdown and restart stack:

vagrant up it

run the setup script

Note, I don't add the two slave nodes here... just working with the master node.

check we are up:

~/cloud-native-setup/clr-k8s-examples$ kubectl get nodes
NAME        STATUS   ROLES    AGE    VERSION
ubuntu-01   Ready    master   110s   v1.13.4
vagrant@ubuntu-01:~/cloud-native-setup/clr-k8s-examples$ 
vagrant@ubuntu-01:~/cloud-native-setup/clr-k8s-examples$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:f4:a0:2b brd ff:ff:ff:ff:ff:ff
    inet 192.168.121.170/24 brd 192.168.121.255 scope global dynamic eth0
       valid_lft 2899sec preferred_lft 2899sec
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:5c:ba:9c brd ff:ff:ff:ff:ff:ff
    inet 192.52.100.11/24 brd 192.52.100.255 scope global eth1
       valid_lft forever preferred_lft forever
4: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0a:58:0a:58:00:01 brd ff:ff:ff:ff:ff:ff
    inet 10.88.0.1/16 scope global cni0
       valid_lft forever preferred_lft forever
5: vethfd79254a@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default 
    link/ether 16:8c:cb:2a:fd:53 brd ff:ff:ff:ff:ff:ff link-netnsid 0
6: vetha05bf341@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default 
    link/ether aa:6a:ff:b9:cc:5b brd ff:ff:ff:ff:ff:ff link-netnsid 1
7: veth9adc6a7d@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default 
    link/ether be:e1:57:ef:a1:8b brd ff:ff:ff:ff:ff:ff link-netnsid 2
8: vethfdd2b8a2@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni0 state UP group default 
    link/ether e6:85:14:26:27:b4 brd ff:ff:ff:ff:ff:ff link-netnsid 3

vagrant halt it

[gwhaley@fido k8s]$ vagrant halt
==> ubuntu-03: Halting domain...
==> ubuntu-02: Halting domain...
==> ubuntu-01: Halting domain...
[gwhaley@fido k8s]$ vagrant status
Current machine states:

ubuntu-01                 shutoff (libvirt)
ubuntu-02                 shutoff (libvirt)
ubuntu-03                 shutoff (libvirt)

vagrant up it

# vagrant up --provider=libvirt

and fail

[gwhaley@fido k8s]$ vagrant ssh ubuntu-01
Last login: Wed Mar 20 04:13:25 2019 from 192.168.121.1
vagrant@ubuntu-01:~$ kubectl get nodes
The connection to the server 192.168.121.170:6443 was refused - did you specify the right host or port?

Wed 2019-03-20 04:21:47 PDT. --
e: Failed with result 'exit-code'.
e: Main process exited, code=exited, status=255/n/a
:47.776497    2117 server.go:261] failed to run Kubelet: failed to create kubelet: rpc error: code = Unav
:47.776368    2117 kuberuntime_manager.go:184] Get runtime version failed: rpc error: code = Unavailable 
:47.776217    2117 remote_runtime.go:72] Version from runtime service failed: rpc error: code = Unavailab
:47.775764    2117 util_unix.go:77] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please con
:47.775617    2117 util_unix.go:77] Using "/var/run/crio/crio.sock" as endpoint is deprecated, please con
:47.774978    2117 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list
:47.774897    2117 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:444: Failed to list *v1.Ser
:47.774332    2117 reflector.go:134] k8s.io/kubernetes/pkg/kubelet/kubelet.go:453: Failed to list *v1.Nod
:47.757270    2117 kubelet.go:306] Watching apiserver
:47.757233    2117 kubelet.go:281] Adding pod path: /etc/kubernetes/manifests
:47.757167    2117 state_mem.go:92] [cpumanager] updated cpuset assignments: "map[]"
:47.757149    2117 state_mem.go:84] [cpumanager] updated default cpuset: ""
:47.757061    2117 state_mem.go:36] [cpumanager] initializing new in-memory state store
:47.757029    2117 container_manager_linux.go:272] Creating device plugin manager: true
:47.756872    2117 container_manager_linux.go:253] Creating Container Manager object based on Node Config
:47.756844    2117 container_manager_linux.go:248] container manager verified user specified cgroup-root 
:47.756590    2117 server.go:666] --cgroups-per-qos enabled, but --cgroup-root was not specified.  defaul
:47.743312    2117 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-cli
:47.740768    2117 plugins.go:103] No cloud provider specified.
:47.740588    2117 server.go:407] Version: v1.13.4
lv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's 
lv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's 
t: The Kubernetes Node Agent.
t: The Kubernetes Node Agent.

did we maybe change IP address?

vagrant@ubuntu-01:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:f4:a0:2b brd ff:ff:ff:ff:ff:ff
    inet 192.168.121.170/24 brd 192.168.121.255 scope global dynamic eth0
       valid_lft 3500sec preferred_lft 3500sec
    inet6 fe80::5054:ff:fef4:a02b/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 52:54:00:5c:ba:9c brd ff:ff:ff:ff:ff:ff
    inet 192.52.100.11/24 brd 192.52.100.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe5c:ba9c/64 scope link 
       valid_lft forever preferred_lft forever

ah, no CNI ?? This output is significantly different from our first successful run...

kubeadm join fails

Changes in 1.14 (https://github.com/kubernetes/kubernetes/pull/69366/files) change the usage of the --cri-socket argument. After setting up the k8s master with PR #80, on the client I'm getting the following errors:

kubeadm join 10.219.128.45:6443 --token ctcivd.d2dwdsi660hkjmrd     --discovery-token-ca-cert-hash sha256:3bf9103c4558c31314846e691ff7136a1b337b6242f338476adca1286f746db2
Found multiple CRI sockets, please use --cri-socket to select one: /var/run/dockershim.sock, /var/run/crio/crio.sock
root@kube1 ~ # kubeadm join 10.219.128.45:6443 --token ctcivd.d2dwdsi660hkjmrd     --discovery-token-ca-cert-hash sha256:3bf9103c4558c31314846e691ff7136a1b337b6242f338476adca1286f746db2 --cri-socket /var/run/crio/crio.sock
[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: Found multiple CRI sockets, please use --cri-socket to select one: /var/run/dockershim.sock, /var/run/crio/crio.sock

root@kube1 ~ # file /var/run/crio/crio.sock /var/run/dockershim.sock
/var/run/crio/crio.sock:  socket
/var/run/dockershim.sock: cannot open `/var/run/dockershim.sock' (No such file or directory)


HPA: Metrics server may not be working

Testing as per test-autoscale.sh. The pods are not scaling.

Name:                                                  php-apache-test
Namespace:                                             default
Labels:                                                <none>
Annotations:                                           <none>
CreationTimestamp:                                     Tue, 04 Dec 2018 22:36:01 +0000
Reference:                                             Deployment/php-apache-test
Metrics:                                               ( current / target )
  resource cpu on pods  (as a percentage of request):  <unknown> / 50%
Min replicas:                                          1
Max replicas:                                          10
Deployment pods:                                       1 current / 0 desired
Conditions:
  Type           Status  Reason                   Message
  ----           ------  ------                   -------
  AbleToScale    True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: unable to get metrics for resource cpu: no metrics returned from resource metrics API
Events:
  Type     Reason                        Age                From                       Message
  ----     ------                        ----               ----                       -------
  Warning  FailedComputeMetricsReplicas  27s (x12 over 3m)  horizontal-pod-autoscaler  failed to get cpu utilization: unable to get metrics for resource cpu: no metrics returned from resource
metrics API
  Warning  FailedGetResourceMetric       12s (x13 over 3m)  horizontal-pod-autoscaler  unable to get metrics for resource cpu: no metrics returned from resource metrics API
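A quick way to confirm whether the resource metrics API is serving at all (a minimal check; assumes metrics-server is the intended provider):

# The APIService should report Available=True once metrics-server is healthy
kubectl get apiservice v1beta1.metrics.k8s.io

# If these return data, the HPA should be able to read CPU metrics
kubectl top nodes
kubectl top pods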

kubeadm init in ./create_stack.sh fails

After a successful execution of cloud-native-setup/clr-k8s-examples/setup_system.sh, the "kubeadm init" call from "create_stack.sh minimal" reports the following error:

error execution phase upload-config/kubelet: Error writing Crisocket information for the control-plane node: timed out waiting for the condition
Do you know what I need to set up in order to fix this problem?

Full log is in: https://github.intel.com/gist/anonymous/476dd91c52ca3a9a3ecfb69d6394ec62

Adding Intel contributions to k8s stack

@mcastelino /cc @amshinde

Add an upgraded version of amshinde/kata-sriov-dp, which has example manifests for deploying Multus-CNI, SRIOV-CNI and the SRIOV device plugin. Optionally, include manifests for intel-device-plugins-for-kubernetes and userspace-cni-network-plugin.

Also look into enabling the CPUManager by default in the kubelet configuration, along with #8:

  --cpu-manager-policy=static 
  --cpu-manager-reconcile-period=5s 
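A sketch of the same settings expressed in the kubelet config file instead of flags (field names from the KubeletConfiguration API; the config path is the one this repo's kubelet already uses; note the static policy also needs a CPU reservation, e.g. via kubeReserved):

# Append CPU manager settings and restart the kubelet (sketch; adjust the reservation to taste)
cat <<EOT | sudo tee -a /var/lib/kubelet/config.yaml
cpuManagerPolicy: static
cpuManagerReconcilePeriod: 5s
kubeReserved:
  cpu: 500m
EOT
# Switching policies on an existing node may also require removing
# /var/lib/kubelet/cpu_manager_state before restarting
sudo systemctl restart kubelet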

Flannel is broken in Clearlinux due to CNI directory setup in crio.conf

Core DNS Pod logs

  Warning  FailedCreatePodSandBox  98s                    kubelet, clr-02    Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_coredns-fb8b8dccf-9fpfn_kube-system_50f10a4d-947c-11e9-9e0f-52540041e026_0(fd2c062948fa94deb8214b9c088acbe3a6f69dd2606f8261f56aae85209d876b): failed to find plugin "loopback" in path [/opt/cni/bin/ /opt/cni/bin/]

Clearlinux version

sudo swupd info
Installed version: 30030

CRIO Version

sudo crictl version
Version:  0.1.0
RuntimeName:  cri-o
RuntimeVersion:  1.14.4
RuntimeApiVersion:  v1alpha1

Root cause

# Paths to directories where CNI plugin binaries are located.
plugin_dirs = [
        "/usr/libexec/cni",
        "/opt/cni/bin/",
]

/usr/share/defaults/crio/crio.conf has the proper values, but /etc/crio/crio.conf is incorrect

It looks like it depends on which version of CRI-O you were on: when Clear Linux is updated from a version where CRI-O did not support the plugin_dirs list to one that does, the existing /etc/crio/crio.conf still carries the older value.

Hence, on a CRI-O update, the user unfortunately needs to delete /etc/crio/crio.conf. We should delete crio.conf and restart CRI-O, which will automatically set itself up again based on the latest default values.
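A minimal sketch of that cleanup on an affected node:

# Remove the stale config; CRI-O then picks up /usr/share/defaults/crio/crio.conf,
# which already lists both CNI plugin directories
sudo rm -f /etc/crio/crio.conf
sudo systemctl restart crio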

kubelet fails to start

I tried vagrant up --provider=virtualbox today. It updated to Clear Linux 29940. Then it failed with kubelet refusing to start:

    clr-01: Created symlink /etc/systemd/system/multi-user.target.wants/kubelet.service → /usr/lib/systemd/system/kubelet.service.
    clr-01: Created symlink /etc/systemd/system/multi-user.target.wants/crio.service → /usr/lib/systemd/system/crio.service.
    clr-01: Job for kubelet.service failed because the control process exited with error code.
    clr-01: See "systemctl status kubelet.service" and "journalctl -xe" for details.

Logging in and checking the journal shows:

Jun 17 14:27:20 clr-01 kubelet-version-check.sh[4166]: /usr/bin/kubelet-version-check.sh: line 8: /var/lib/kubelet/kubelet_version: No such file or directory
Jun 17 14:27:20 clr-01 kubelet[4165]: F0617 14:27:20.140206    4165 server.go:194] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kub>

/var/lib/kubelet/config.yaml does not exist.

Add support for firecracker

Clear Linux now supports and bundles Firecracker. Add support for Firecracker as an additional runtime class in Kata. This allows runc, Kata with QEMU, and Kata with Firecracker to live alongside each other in the same cluster on all nodes.

Add a new script along the lines of setup_firecracker.sh

#!/bin/bash

# Ensure the config directory exists before writing the Firecracker configuration
mkdir -p /etc/kata-containers/
cat <<EOT | tee /etc/kata-containers/configuration_firecracker.toml

[hypervisor.firecracker]
path = "/usr/bin/firecracker"
kernel = "/usr//share/kata-containers/vmlinux.container"
image = "/usr//share/kata-containers/kata-containers.img"
kernel_params = ""
default_vcpus = 1
default_memory = 4096
default_maxvcpus = 0
default_bridges = 1
block_device_driver = "virtio-mmio"
disable_block_device_use = false
enable_debug = true
use_vsock = true

[shim.kata]
path = "/usr//libexec/kata-containers/kata-shim"

[runtime]
internetworking_model="tcfilter"
EOT

# Quote the heredoc delimiter so "$@" is written literally into the wrapper
cat <<'EOT' | tee /usr/bin/kata-runtime-fire
#!/bin/bash

/usr/bin/kata-runtime --kata-config /etc/kata-containers/configuration_firecracker.toml "$@"
EOT
chmod +x /usr/bin/kata-runtime-fire

mkdir -p /etc/crio/
cp /usr/share/defaults/crio/crio.conf /etc/crio/crio.conf

echo -e "\n[crio.runtime.runtimes.kata]\nruntime_path = \"/usr/bin/kata-runtime\"" >> /etc/crio/crio.conf
echo -e "\n[crio.runtime.runtimes.fire]\nruntime_path = \"/usr/bin/kata-runtime-fire\"" >> /etc/crio/crio.conf

sed -i 's|\(\[crio\.runtime\]\)|\1\nmanage_network_ns_lifecycle = true|' /etc/crio/crio.conf

sed -i 's/storage_driver = \"overlay\"/storage_driver = \"devicemapper\"\
storage_option = [\
  \"dm.basesize=8G\",\
  \"dm.directlvm_device=\/dev\/vdb\",\
  \"dm.directlvm_device_force=true\",\
  \"dm.fs=ext4\"\
]/g' /etc/crio/crio.conf


systemctl daemon-reload
systemctl restart crio

And a simple test case along the lines of tests/test-deploy-fire.yaml

apiVersion: apps/v1 
kind: Deployment
metadata:
  labels:
    run: php-apache-fire
  name: php-apache-fire
spec:
  replicas: 1
  selector:
    matchLabels:
      run: php-apache-fire
  template:
    metadata:
      labels:
        run: php-apache-fire
    spec:
      runtimeClassName: fire
      containers:
      - image: k8s.gcr.io/hpa-example
        imagePullPolicy: Always
        name: php-apache
        ports:
        - containerPort: 80
          protocol: TCP
        resources:
          requests:
            cpu: 200m
      restartPolicy: Always
---
apiVersion: v1 
kind: Service
metadata:
  name: php-apache-fire
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: php-apache-fire
  sessionAffinity: None
  type: ClusterIP
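A quick usage check once the script and manifest are in place (assuming the manifest is saved as tests/test-deploy-fire.yaml and a RuntimeClass named "fire" has been registered to match the crio.conf entry above):

kubectl apply -f tests/test-deploy-fire.yaml
kubectl get pods -l run=php-apache-fire -o wide
# The pod should reach Running backed by the Firecracker runtime; if it stays Pending,
# check that the "fire" RuntimeClass exists and that the node's CRI-O was restarted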

deploying kata example fails (cannot stat '/etc/crio/crio.conf')

Trying to follow the create_stack.sh example, and running:
kubectl apply -f tests/deploy-svc-ing/test-deploy-kata-qemu.yaml

Running kubectl get pod, I get:

NAME                                    READY   STATUS    RESTARTS   AGE
php-apache-kata-qemu-7d4647498f-srlmw   0/1     Pending   0          8m26s

Looking at /var/log/containers/kata-deploy-v22pt_kube-system_kube-kata-18d156ed33052c5a7d2bda4637d147f30d1d92f4812746f3e196a5479f98ea09.log, I can see:

2019-04-19T17:18:13.158164470+01:00 stdout F copying kata artifacts onto host
2019-04-19T17:18:13.831374115+01:00 stdout F Add Kata Containers as a supported runtime for CRIO:
2019-04-19T17:18:13.832274647+01:00 stderr F cp: cannot stat '/etc/crio/crio.conf': No such file or directory

Something weird seems broken wrt crio :-/
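A possible workaround, borrowing the approach the Firecracker script above uses: seed /etc/crio/crio.conf from the Clear Linux defaults before kata-deploy runs (paths as used elsewhere in this repo):

# kata-deploy edits /etc/crio/crio.conf in place, so make sure it exists first
sudo mkdir -p /etc/crio/
sudo cp /usr/share/defaults/crio/crio.conf /etc/crio/crio.conf
sudo systemctl restart crio

# Then re-apply the test deployment
kubectl apply -f tests/deploy-svc-ing/test-deploy-kata-qemu.yaml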

Discuss version pinning in the yamls

As per this comment #13 (comment), we want to avoid version pins. Today all the yamls have images pinned, since it's considered best practice, and the yamls themselves pin the API versions of the different Kubernetes objects they create. Since k8s versions will keep moving forward with Clear, we need a way to keep these yamls up to date too.

What is the plan for upgrading the yamls?

  • Should we register an endpoint for each directory and check for updates?
  • Should we leverage helm for getting latest stable yamls where possible?

Available helm charts -

@mcastelino

scripts (or k8s) presume to use first ethernet controller

I think, by default, the scripts will set up the k8s network hanging off the first Ethernet controller on the master node.
In my case that is not correct: I have two Ethernet cards, and my node pool hangs off the second one.

I'll see if I can figure out what needs to be told to use the second (or rather, a specific) network controller to talk to the nodes. If anybody wants to chip in with some input, please do :-)
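For reference, the knob that usually controls which address the control plane advertises is kubeadm's advertise address; a hedged sketch (the IP is a placeholder for the address bound to the second NIC):

# When driving kubeadm directly with flags:
sudo kubeadm init --apiserver-advertise-address=<ip-on-second-nic>

# When using a config file (as create_stack.sh does), the equivalent is the API
# endpoint's advertiseAddress field in kubeadm.yaml rather than an extra flag.
# The CNI (canal) may also need its interface pinned separately.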

'dirty' nodes need /var/lib/rook deleting

If you tear down and try to re-join a slave/node, then it may fail to bring up rook-ceph (fairly silently), as the installation may find an existing /var/lib/rook directory and fail to initialise.
We should see if we can find a way to remove /var/lib/rook, or to error out nicely, when initialising or joining nodes.
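The manual cleanup amounts to something like this on the node being re-joined (run before re-running the join/setup scripts; it destroys any old Ceph state on that node):

sudo rm -rf /var/lib/rook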
