
Systemk: virtual kubelet for systemd

This is a virtual kubelet provider that uses systemd as its backend.

Every Linux system has systemd nowadays. By utilizing K3s (just one Go binary) and this virtual kubelet you can provision a system using the Kubernetes API. The networking is the host's network, so it makes sense to use this for more heavyweight (stateful?) applications. The filesystem is semi-hidden, but emptyDir and the like work.

It is hoped this setup allows you to use the Kubernetes API without the need to immerse yourself in the (large) world of Kubernetes (overlay networking, ingress objects, etc.). However, this does imply that networking and (host) DNS discovery are already working on the system you're deploying this to. How to get a system into a state where it has, and can run, k3s and systemk is an open question (a ready-made image? a tiny bit of config management?).

Systemk uses systemd to start pods as service unit(s), relying on systemd's cgroup implementation. This allows us to set resource limits by specifying them in the unit file (almost a copy-paste from the podSpec). Generally there is no (real container) isolation in this setup, although disk access is fairly contained, e.g. the service account token under /var/run/secrets/kubernetes.io/ is bind mounted into each unit.

You basically use the Kubernetes control plane to start Linux processes. No separate address space is allocated to the pods; you are using the host's networking.

"Images" reference (Debian) packages, which will be apt-get installed. Discovering that an installed package is no longer used is hard, so this is not done. systemk reuses the unit file that comes with the install, almost exclusively to find the ExecStart option; a lot of extra data is injected into it to make it work fully for systemk. If there isn't a unit file (e.g. you use bash as the image), one will be synthesized.

Each scheduled unit will adhere to a naming scheme so systemk knows which ones are managed by it.

Is This Useful?

Likely. I personally consider this useful enough to manage a small farm of machines. Say you administer 200 machines and you need saner management and introspection than config management can give you: with kubectl you can find which machines didn't run a DNS server, with Deployments you can push out upgrades more safely, and with "insert favorite Kubernetes feature here" you can do canarying.

Monitoring, for instance, only requires Prometheus to discover the pods via the Kubernetes API, vastly simplifying that particular setup. See prometheus.yaml for a complete, working setup that discovers running Pods and scrapes them. The Prometheus config is just kubernetes_sd_configs.

Currently I manage 2 (Debian) machines and this is all manual, i.e. login, apt-get upgrade, fiddle with config files, etc. It may turn out that k3s + systemk is a better way of handling even 2 machines.

Note that getting to the stage where this all runs, is secured and everything has the correct TLS certs (that are also rotated) is an open question. See https://github.com/miekg/systemk/issues/39 for some ideas there.

Current Status

Multiple containers in a pod can be run and they can see each other's storage. Creating, deleting, and inspecting Pods all work. Higher-level abstractions (ReplicaSet, Deployment) work too. Init containers are also implemented.

EmptyDir, configMap, hostPath, and Secret are implemented; all except hostPath are backed by a bind mount. The entire filesystem is made available, but read-only; paths declared as volumeMounts are read-only or read-write depending on settings. When configMaps and Secrets are mutated, the new contents are updated on disk. These directories are set up under /var/run/{emptydirs, secrets, configmaps}.
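
As an illustrative sketch (the names and paths below are hypothetical, not from the repo), a configMap volume looks like it would on any kubelet; systemk materializes the contents under /var/run/configmaps and bind mounts them at the declared mountPath:

spec:
  containers:
  - name: show-config
    image: bash
    command: ["bash", "-c"]
    args: ["while true; do cat /etc/demo/config.txt; sleep 5; done"]
    volumeMounts:
    - name: demo-config
      mountPath: /etc/demo
      readOnly: true
  volumes:
  - name: demo-config
    configMap:
      name: demo-config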

Retrieving pod logs also works, but setting up TLS is not automated.

Has been tested on:

  • Ubuntu 20.04 and 18.04
  • Arch Linux (maybe?)

Building

Use go build in the top-level directory; this should give you a systemk binary, which is the virtual kubelet. Note you'll need libsystemd-dev (deb) or systemd-devel (rpm) installed to build systemk, and you need to be able to build with cgo.
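
For example, on Debian/Ubuntu this amounts to (assuming a working Go toolchain and C compiler):

sudo apt-get install libsystemd-dev
CGO_ENABLED=1 go build
# the resulting ./systemk binary is the virtual kubelet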

Design

Pods can contain multiple containers; each container is a new unit tracked by systemd. The named image is assumed to be a package and will be installed via the normal means (apt-get, etc.). If systemk installs the package, the official system unit will be disabled; if the package already exists we leave the existing unit alone. If the install doesn't come with a unit file (say you install zsh), we synthesize a small service unit file; for this to work the podSpec needs to at least define a command to run.

On a CreatePod call we call out to systemd to create a unit per container in the pod. Each unit is named systemk.<pod-namespace>.<pod-name>.<image-name>.service. If a command is given it replaces the first word of ExecStart, leaving any options in place. If args are also given, the entire ExecStart is replaced with those. If only args are given, the command stays and only the options/args are replaced.

We store a bunch of Kubernetes metadata inside the unit in an [X-Kubernetes] section. Whenever we want to know a pod's state, systemk queries systemd and reads the unit file back. This way we know the status and have access to all the metadata, like the pod UID and whether the unit is an init container.
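
Putting this together, a generated unit looks roughly like the sketch below (condensed and partly hypothetical; the ExecStart comes from the package's own unit file, and real examples appear in the issues further down):

# systemk.default.uptimed.uptimed.service
[Service]
ExecStart=/usr/sbin/uptimed
Environment=HOSTNAME=draak
Environment=KUBERNETES_SERVICE_PORT=6444
Environment=KUBERNETES_SERVICE_HOST=127.0.0.1

[X-Kubernetes]
Namespace=default
ClusterName=
Id=<pod UID>
Image=uptimed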

Specifying Images

In general the image you specify is a (distro) package, e.g. bash-$version.deb. But there are alternatives that give you some flexibility.

Fetching Image Remotely

If the image name starts with https:// it is assumed to be a URL and the package is fetched from there and installed. The package name is the first string up until the _ in the file name: https://example.org/tmp/coredns_1.7.1-bla_amd64.deb will download the package from that URL and coredns will be the package name.
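
In a podSpec this looks like (the container name is illustrative):

containers:
- name: coredns
  image: https://example.org/tmp/coredns_1.7.1-bla_amd64.deb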

Binary Exists in File System

If the image name starts with a / it's assumed to be a path to a binary that already exists on the system; nothing is installed in that case. Basically this tells systemk that the image is not used, so it can serve as documentation. It's likely command and/or args in the podSpec will reference the same path.
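
A sketch (the path is purely illustrative):

containers:
- name: backup
  image: /usr/local/bin/backup
  command: ["/usr/local/bin/backup"]
  args: ["--verbose"]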

Addresses

Addresses are configured with the systemk command line flags --node-ip and --node-external-ip; these may be IPv4 or IPv6. In the future this may be expanded to allow both (i.e. dual-stack support). The primary Pod address will be the value from --node-external-ip.

If --node-ip is not given, systemk will try to find an RFC 1918 address on the interfaces and use the first one found.

If --node-external-ip is not given, systemk will try to find a non-RFC 1918 address on the interfaces and use the first one found.

If after all this one of the values is still not set, the other existing value is copied, i.e. internal == external in that case. If both are empty, systemk exits with a fatal error.
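
For example, to set both addresses explicitly (the IPs below are illustrative):

sudo ./systemk --kubeconfig ~/.rancher/k3s/server/cred/admin.kubeconfig \
  --node-ip 192.168.1.10 --node-external-ip 203.0.113.7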

Environment Variables

The following environment variables are made available to the units (a usage sketch follows the list):

  • HOSTNAME, KUBERNETES_SERVICE_PORT and KUBERNETES_SERVICE_HOST (same as the kubelet).
  • SYSTEMK_NODE_INTERNAL_IP the internal IP address.
  • SYSTEMK_NODE_EXTERNAL_IP the external IP address.
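
A container can read these like any other environment variable; an illustrative sketch:

containers:
- name: show-ip
  image: bash
  command: ["bash", "-c"]
  args: ["while true; do echo internal=$SYSTEMK_NODE_INTERNAL_IP external=$SYSTEMK_NODE_EXTERNAL_IP; sleep 60; done"]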

Using username in securityContext

To specify a username in a securityContext you need to use the windowsOptions:

spec:
  securityContext:
    windowsOptions:
      runAsUserName: "prometheus"

The primary group will be found by systemk and both a User and Group will be injected into the unit file. The files created on disk for configMap/secrets/emptyDir will be owned by the same user/group.
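
The effect on the generated unit is roughly (a sketch; the names follow the example above):

[Service]
User=prometheus
Group=prometheus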

Running Without Root Permissions

This is not possible. The tiniest thing we need is BindPaths, which is not allowed when not running as root (or with CAP_SYS_ADMIN). This means systemk needs close to full root permissions to run.

Limitations

By using systemd and the host's network stack we have weak isolation between pods, i.e. no more than process isolation. Starting two pods that use the same port is guaranteed to fail for one of them. To expand on this: everything runs as if .spec.hostNetwork: true were specified. Port clashes are (probably?) more likely for health check ports; two pods using the same (health check) port thus can't be scheduled on the same machine. We have some ideas to get around this (like generating an environment variable with a unique port number, or using $RANDOM, that can be used for health checking), but we are also open to suggestions, including not doing anything.

Use with K3S

Download k3s from its releases on GitHub; you just need the k3s binary. Use the k3s/k3s shell script to start it - this assumes k3s sits in "~/tmp/k3s". The script starts k3s with basically everything disabled.

Compile systemk and start it with:

sudo ./systemk --kubeconfig ~/.rancher/k3s/server/cred/admin.kubeconfig --disable-taint

We need root to be allowed to install packages.

Now a k3s kubectl get nodes should show the virtual kubelet as a node:

NAME    STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE            KERNEL-VERSION     CONTAINER-RUNTIME
draak   Ready    agent   6s    v1.18.4   <none>        <none>        Ubuntu 20.04.1 LT   5.4.0-53-generic   systemd 245 (245.4-4ubuntu3.3)

draak is my machine's name. You can now try to schedule a pod: k3s/kubectl apply -f k3s/uptimed.yaml.

Logging works but, due to TLS, is a bit fiddly: you need to start systemk with --certfile and --keyfile to make the HTTPS endpoint happy (enough). Once that's done you can get the logs with:

% ./kubectl logs --insecure-skip-tls-verify-backend=true uptimed
-- Logs begin at Mon 2020-08-24 09:00:18 CEST, end at Thu 2020-11-19 15:40:02 CET. --
nov 19 12:12:27 draak systemd[1]: Started uptime record daemon.
nov 19 12:14:44 draak uptimed[15245]: uptimed: no useable database found.
nov 19 12:14:44 draak systemd[1]: Stopping uptime record daemon...
nov 19 12:14:44 draak systemd[1]: systemk.default.uptimed.uptimed.service: Succeeded.
nov 19 12:14:44 draak systemd[1]: Stopped uptime record daemon.
nov 19 13:38:54 draak systemd[1]: Started uptime record daemon.
nov 19 13:39:26 draak systemd[1]: Stopping uptime record daemon...
nov 19 13:39:26 draak systemd[1]: systemk.default.uptimed.uptimed.service: Succeeded.

Prometheus discovering targets to Scrape

The included manifests k3s/prometheus.yaml and k3s/uptimed.yaml install a Prometheus with Kubernetes support and uptimed. This allows Prometheus to find the open ports for uptimed and start scraping them for metrics (uptimed doesn't have metrics, but that's a minor detail). Here is the status of Prometheus finding those targets and trying to scrape:

(Screenshot: Prometheus, running as a systemd unit, finding targets.)
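
For reference, the kind of scrape configuration kubernetes_sd_configs enables looks roughly like the sketch below; the repo's k3s/prometheus.yaml is the authoritative, working version:

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod   # discover running Pods via the Kubernetes API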

Debian/Ubuntu

  1. Install k3s and compile the virtual kubelet.

I'm using uptimed as a very simple daemon that you (probably) haven't got installed, so we can check the entire flow.

  2. ./k3s/kubectl apply -f uptimed.yaml

The above should yield:

NAME      READY   STATUS    RESTARTS   AGE
uptimed   1/1     Running   0          7m42s

You can then delete the pod.

Testing

Set-up Node authentication

systemk impersonates a Node and, as such, can identify itself through a TLS client certificate. Let's set that up.

  1. Install cert-manager, required to provision a TLS certificate that will later be used when authenticating systemk as a Kubernetes Node.

    kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.1.0/cert-manager.yaml
  2. Expose private CA resources as a Secret, a requirement for the next step.

    1. If using kind:

      docker exec -e KUBECONFIG=/etc/kubernetes/admin.conf kind-control-plane sh -c \
         'kubectl -n cert-manager create secret tls priv-ca \
         --cert=/etc/kubernetes/pki/ca.crt \
         --key=/etc/kubernetes/pki/ca.key'
    2. If using minikube:

      minikube ssh
      
      sudo /var/lib/minikube/binaries/v1.20.1/kubectl \
        --kubeconfig /etc/kubernetes/admin.conf \
        -n cert-manager create secret tls priv-ca \
        --cert=/var/lib/minikube/certs/ca.crt \
        --key=/var/lib/minikube/certs/ca.key
      
      exit
  3. Create a cluster-wide certificate issuer that re-uses the cluster PKI.

    cat <<EOF | kubectl apply -f -
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: priv-ca-issuer
    spec:
      ca:
        secretName: priv-ca
    EOF
    
    kubectl wait --for=condition=Ready --timeout=1m clusterissuer priv-ca-issuer
  4. Create a certificate signing request that cert-manager can sign with the CA set up during your Kubernetes cluster bootstrap.

    NODENAME=systemk
    cat <<EOF | kubectl apply -f -
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
      name: $NODENAME
      namespace: default
    spec:
      secretName: $NODENAME-tls
      duration: 8760h # 1 year
      renewBefore: 4380h # 6 months
      subject:
        organizations:
        - system:nodes
      commonName: system:node:$NODENAME
      issuerRef:
        name: priv-ca-issuer
        kind: ClusterIssuer
    EOF
    
    kubectl wait --for=condition=Ready --timeout=1m certificate $NODENAME
  5. Initialize a kubeconfig file systemk will rely on.

    1. If using kind:

      rm -f $NODENAME.kubeconfig
      kind get kubeconfig > $NODENAME.kubeconfig
      
      kubectl --kubeconfig=$NODENAME.kubeconfig config unset contexts.kind-kind
      kubectl --kubeconfig=$NODENAME.kubeconfig config unset users.kind-kind
    2. If using minikube:

      rm -f $NODENAME.kubeconfig
      KUBECONFIG=$NODENAME.kubeconfig minikube update-context
      
      kubectl --kubeconfig=$NODENAME.kubeconfig config unset contexts.minikube
      kubectl --kubeconfig=$NODENAME.kubeconfig config unset users.minikube
  6. Extract the signed certificate and respective private key, and populate the kubeconfig file appropriately.

    kubectl get secret $NODENAME-tls -o jsonpath='{.data.tls\.crt}' | base64 -d > $NODENAME.crt
    kubectl get secret $NODENAME-tls -o jsonpath='{.data.tls\.key}' | base64 -d > $NODENAME.key
    
    kubectl --kubeconfig=$NODENAME.kubeconfig config set-credentials $NODENAME \
     --client-certificate=$NODENAME.crt \
     --client-key=$NODENAME.key
    
    kubectl --kubeconfig=$NODENAME.kubeconfig config set-context $NODENAME --cluster=minikube --user=$NODENAME
    kubectl --kubeconfig=$NODENAME.kubeconfig config use-context $NODENAME

Running the Node

  1. Allow systemk to view ConfigMaps and Secrets across the cluster. This is far from ideal but works for now.

    kubectl create clusterrolebinding $NODENAME-view --clusterrole=system:node --user=system:node:$NODENAME
  2. Finally, start systemk.

    NODENAME=systemk
    ./systemk --kubeconfig=$NODENAME.kubeconfig --nodename=$NODENAME


systemk's Issues

export metrics

Things are working well enough that it is time to think about exporting metrics and what would be a good set of metrics to have.

metrics server

Check to see if we can easily run metrics server to provide resource metrics.

This probably means exporting a bunch of metrics from the systemk side.

add k3s token support

This is taken from the k3s site, and even though it's k3s-specific it will simplify my intended use case, namely using systemk with k3s first.

sudo k3s server &
# Kubeconfig is written to /etc/rancher/k3s/k3s.yaml
sudo k3s kubectl get node

# On a different node run the below. NODE_TOKEN comes from /var/lib/rancher/k3s/server/node-token
# on your server
sudo k3s agent --server https://myserver:6443 --token ${NODE_TOKEN}

This still doesn't solve the TLS cert issue, but makes bootstrapping slightly easier.

Tracking VK changes

This is just a way for us to track changes to VK that we'll want to adopt for various reasons, such as reducing boilerplate code, improvements, bugfixes, etc. I'm not expecting this issue to be closed by a PR.

Reduce boilerplate

Flag --nodename doesn't work

In VK, it is possible to define the hostname that a virtual node will be registered with in Kubernetes. However, this is not possible here, given a bunch of calls to:

// Hostname returns the machine's host name.
// If the environment variable HOSTNAME is present this takes precedence.
func Hostname() string {
	// there is also a flag for this to systemk? use that instead?
	if h := os.Getenv("HOSTNAME"); h != "" {
		return h
	}
	h, _ := os.Hostname()
	return h
}
These calls happen in the included NodeProvider implementation, systemk/systemd/node.go, lines 24 to 33 (at 745f958):

node.ObjectMeta = metav1.ObjectMeta{
	Name: system.Hostname(),
	Labels: map[string]string{
		"type":                              "virtual-kubelet",
		"kubernetes.io/role":                "agent",
		"kubernetes.io/hostname":            system.Hostname(),
		corev1.LabelZoneFailureDomainStable: "localhost",
		corev1.LabelZoneRegionStable:        system.Hostname(),
	},
}

I think the flag should either be removed (inherited from virtual-kubelet/node-cli) or somehow be respected in the first block.

Figure out how to do e2e tests

Having (copying from k8s?) a (simple?) e2e test framework would help. Case in point: if I update a Secret, will systemk update the file(s) on disk? There is currently no code in the repo that captures systemk's behavior other than a few unit tests.

This requires a way to script these tests, capture the change in addition to having a control plane running.

enhancement: systemd-nspawn to launch real Image= container

Using systemd-nspawn would let us run real containers, under systemd. From the man page's description:

systemd-nspawn may be used to run a command or OS in a light-weight namespace container. In many ways it is similar to chroot(1), but more powerful since it fully virtualizes the file system hierarchy, as well as the process tree, the various IPC subsystems and the host and domain name.

It would be neat to add a mode to systemk that allows us to use systemd-nspawn to run full containers. Image= could be real images. systemd-nspawn supports a variety of network modes, such as macvlan and ipvlan. There is a machinectl shell which would enable runInContainer, a systemk feature request tracked in #37. Pretty specific/fancy stuff, but there are also things like the ability to clone a btrfs subvolume and launch that.

Running a systemd-nspawn container is a couple-step process:

  1. Create a filesystem in /var/lib/machines/ with the expanded image. systemd's machinectl tool includes import-tar and import-fs helpers which could help load images. To fetch an image, we could use docker save, docker export, or something like what nspawn-oci does to load the image (using skopeo to fetch and oci-image-tools to expand an image).
  2. Run the container. This can be done ephemerally, or we can create a unit:
    a. use systemd-nspawn to run the container as a one-off, or
    b. create a /etc/systemd/nspawn/my-service.nspawn unit file to configure a container, then run machinectl start to start it in a managed way.

Note that two years ago systemd seemed interested in becoming a runc-compatible runtime, and if that happens my understanding is we could just run containerd directly against it, which might be a better idea than adapting systemk.

background: current systemk architecture

Some notes I took while investigating how systemk works now.

Kubernetes pods are backed by systemd .service units created by the unit manager.

These systemd services use the local system to run. Image= refers to Debian packages that are run on the system.

ExecStart has 'need'

[Service]
ProtectSystem=true
ProtectHome=tmpfs
PrivateMounts=true
ReadOnlyPaths=/
StandardOutput=journal
StandardError=journal
User=0
Group=0
RemainAfterExit=true
ExecStart=need "--config.file=/etc/prometheus/prometheus.yml" "--storage.tsdb.path=/tmp/prometheus"
TemporaryFileSystem=/var /run
Environment=HOSTNAME=localhost
Environment=KUBERNETES_SERVICE_PORT=6444
Environment=KUBERNETES_SERVICE_HOST=127.0.0.1

[X-Kubernetes]
Namespace=default
ClusterName=
Id=aa-bb

This is wrong:

ExecStart=need "--config.file=/etc/prometheus/prometheus.yml" "--storage.tsdb.path=/tmp/prometheus"

It may be wrong because of testing, but it's wrong nonetheless.

Force mountPaths to be below directory

To further restrict what a podspec can do we could enforce that only mount points under /opt/systemk (or whatever) are acceptable. This prevents systemk from chowning directories it shouldn't.

When run in user mode use /var/run/$UID instead of /var/run for bindmounts

With the -u flag we run in user mode, talking to the user's systemd. This also moves the unit dir to the per-user tmp dir in /var/run. However, the bind mount directories are still created under /var/run/, which means that will still fail.

Those directories should also be created under the /var/run/$UID dir.

chore: address code quality

It's always a good idea - or so I believe - to adopt some best-practices early on, before a project moves past the prototype phase. While growing functionality is always great, there's added value in ensuring the code is properly documented.

Now, linting the code at this point will make at least one of us go through it and learn. And, following #15, automatic linting should be run by CI so every contributor comments their code appropriately. That said, don't forget to set --set_error_status when calling golint in the GitHub workflow.

log support requires CGO

#70 and #69 add log support, but it calls a C library, so we no longer create a static build; we now reference libc and friends.

This also makes releasing a no-go, because I will probably have an incompatible libc.

go bc -ldflags="-extldflags=-static"  
# github.com/virtual-kubelet/systemk
/usr/bin/ld: /tmp/go-link-075005359/000015.o: in function `_cgo_d744bec0c646_Cfunc_dlopen':
/tmp/go-build/cgo-gcc-prolog:90: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: /tmp/go-link-075005359/000025.o: in function `mygetgrouplist':
/home/miek/up/go/src/os/user/getgrouplist_unix.go:16: warning: Using 'getgrouplist' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: /tmp/go-link-075005359/000024.o: in function `mygetgrgid_r':
/home/miek/up/go/src/os/user/cgo_lookup_unix.go:38: warning: Using 'getgrgid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: /tmp/go-link-075005359/000024.o: in function `mygetgrnam_r':
/home/miek/up/go/src/os/user/cgo_lookup_unix.go:43: warning: Using 'getgrnam_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: /tmp/go-link-075005359/000024.o: in function `mygetpwnam_r':
/home/miek/up/go/src/os/user/cgo_lookup_unix.go:33: warning: Using 'getpwnam_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: /tmp/go-link-075005359/000024.o: in function `mygetpwuid_r':
/home/miek/up/go/src/os/user/cgo_lookup_unix.go:28: warning: Using 'getpwuid_r' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
/usr/bin/ld: /tmp/go-link-075005359/000020.o: in function `_cgo_26061493d47f_C2func_getaddrinfo':
/tmp/go-build/cgo-gcc-prolog:58: warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking

This creates a static binary on my setup.
Another downside of cgo is that the per-goroutine stack is no longer 2k; it's 1M or something.

Implement runInContainer

I'm thinking of implementing run-in-container, but I'm wondering about auth. I don't fully understand it yet from the k8s side, but I do think it makes sense to have defense in depth and also check the local user on the system.

So I'm thinking of utilizing ssh: use the user ID from the kubectl call (how to get that, though?), then perform an ssh USER@localhost and drop you in the /var/run/systemk/UID directory, which you can escape since you are free to roam the system, just like normal ssh. Commands given are dropped for the time being, but we may just want to go with garbage in, garbage out.

securityContext

securityContext can be applied at the pod level or even per container. This focuses on the pod-level one.

For starters it only allows integers to be specified. systemd allows user/group names (and also numbers?). Anyway, I want to run Prometheus with the 'prometheus' user. As the package (on Debian) gets installed, it will also add the user 'prometheus' and reference it in the unit file. This works perfectly fine from the systemd perspective.

Now... from the k8s side you can't guess the user ID of the prometheus user, so you will probably want to omit the securityContext and let the unit figure it out. (1)

Or (2) you make sure users exist and the securityContext can reference those by ID.

So I don't want to go the (2) route because I don't want to saddle myself with syncing user IDs as well. But doing (1) has implications too, mostly for chowning the dirs on disk for volumes.

Example:

Say the user is prometheus and the securityContext is empty. An empty context assumes "root" in k8s land, and of course doing everything as root will work. But we want to use 'prometheus' here. So we need to figure out what user the unit file has and delay making changes on disk until we know this.

But how does this work with multiple units (= containers) and different user IDs? If they share an emptyDir, what perms should we use there?

Also, when there are multiple units, which one do we look at for the user? The first? Or one with some annotation?

Fix Image == Name confusion

Currently it is assumed that image == pod name, which is generally not true. There is also an Image function that should actually be Name.

Clear up this confusion and use the right terms in the right place.

crashloopBackoff

When we see this:
systemk.monitoring.prometheus.prometheus.service: Scheduled restart job, restart counter is at 5.
we should probably signal CrashLoopBackOff to the API server.

add configmap updates

One of the last things I need is a watch on configMaps so the file in the unit gets updated.

With that piece only the infrastructure side of TLS certificates remains

remove klog flags that are not needed

I've used one klog flag today: v. It seems odd that the logging library has the majority of the flags of the binary.

I propose we ditch them all except for the v one

flags.Lookup("nodename") does not work

We are being smart about how --nodename is used by looking up the flag. But that code does not seem to be executed, because the flags.Lookup() func is looking in the wrong flag set. We need to look this up in the original VK flag set, but that's hidden from us in cli.Command. That type doesn't give us access to cobra.Command, which does have a method to access the flag set.

This also means we can't remove flags from the original set even though they are useless in systemk - they still show up in --help 👎

ownership of host mountpath

When setting up a bindmount there needs to be a dir on the host fs that is used to hook the bind mount on.

    volumeMounts:
    - mountPath: /data/cdrom
      name: demo-volume

The directory /data/cdrom needs to exist; we currently chown it to the uid/gid of the user running the pod (or via the security context). This means when 2 pods use the same mountPath but run as different users, we chown it twice and to different users.

A better idea is to just create this with the uid/gid of the user running systemk and set a liberal mode on it.

Test k8s Jobs

I need something to run once (across all nodes - which Jobs don't do), but Jobs make heavy use of restartPolicy, so we should check how they fare with systemk.

user mode is not feasible

I've gutted everything I can think of, but I keep getting:

feb 02 16:18:53 draak systemd[66422]: systemk.default.date.date.service: Failed to set up mount namespacing: Operation not supported
feb 02 16:18:53 draak systemd[66422]: systemk.default.date.date.service: Failed at step NAMESPACE spawning /usr/bin/bash: Operation not supported
feb 02 16:18:53 draak systemd[1762]: systemk.default.date.date.service: Main process exited, code=exited, status=226/NAMESPACE
feb 02 16:18:53 draak systemd[1762]: systemk.default.date.date.service: Failed with result 'exit-code'.

This is with a unit file:

[Unit]
Description=systemk
Documentation=man:systemk(8)

[Install]
WantedBy=multi-user.target

[Service]
PrivateMounts=false
NoNewPrivileges=true
StandardOutput=journal
StandardError=journal
RemainAfterExit=true
ExecStart=/usr/bin/bash -c "while true; do date;  sleep 1; done"
BindPaths=/var/run/user/1000/da80f0e1-5cae-421a-906f-6cd80a7f1bd6/secrets/#0:/var/run/secrets/kubernetes.io/serviceaccount
Environment=HOSTNAME=draak
Environment=KUBERNETES_SERVICE_PORT=6444
Environment=KUBERNETES_SERVICE_HOST=127.0.0.1
Environment=SYSTEMK_NODE_INTERNAL_IP=192.168.86.22
Environment=SYSTEMK_NODE_EXTERNAL_IP=192.168.86.22

[X-Kubernetes]
Namespace=default
ClusterName=
Id=da80f0e1-5cae-421a-906f-6cd80a7f1bd6
Image=bash

I might also drop the BindPaths entirely (that's the last thing that's still there), but that breaks the setup because that path is assumed to be /var/run/secrets.
But, then again, maybe it's something else entirely that I'm missing.

root-less: syntax for not installing packages

I have a use case where the binaries are mounted into the system via some automounter (not that important). But this means the binaries are just there, nothing to install, which means we don't need root access (#30), just the ability to speak to systemd (systemd should be configured to disallow privilege escalation, or maybe we can build something into systemk).

Anyway, to do this now you do the following:

  - name: while
    image: bash
    command: ["bash", "-c"]
    args: ["while true; do cat /etc/uptimed/test.txt; sleep 1; done"]

where bash will be installed, but the args and command don't need to be in the image, they can reference anything on the filesystem.

So, to run root-less and prevent installing software, we can do 2 things: (1) a bogus image name or (2) something more structured. I.e. (1):

  - name: while
    image: never-install-this
    command: ["bash", "-c"]
    args: ["while true; do cat /etc/uptimed/test.txt; sleep 1; done"]

This would mean never-install-this is a special name for which we skip installation.

Or (2):

  - name: while
    image: /this/image/contains/slashes/don't/install
    command: ["bash", "-c"]
    args: ["while true; do cat /etc/uptimed/test.txt; sleep 1; done"]

Here the image name contains slashes (or starts with a slash). We take this as a cue to not install anything. This approach is similar to the https:// prefix that fetches the package and installs it.

I like approach (2) the best. Upgrading and getting a new version means changing the image name/path, which is how it should be.
If agreed, the actual implementation is fairly trivial.

[question] systemk <-> k3s server tunnel ?

Does systemk (the VK) use the k3s remotedialer to set up a tunnel with the k3s server? I don't see it referenced in this repo. If not, does it mean a VK (or any worker node) can join a k3s cluster without a tunnel?

liveness (network) probes

Implement liveness probes: httpGet and tcpSocket. Execing a command in the container is hard(er), as we may need to get into the namespace - this is doable as we know the volumes that exist - but we should guard against an rm -rf / command since we run as root...
httpGet and tcpSocket are much less destructive and can simply be built into systemk.
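
For context, the podSpec constructs in question are the standard probes, e.g. (the path and port are illustrative):

containers:
- name: uptimed
  image: uptimed
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10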

feature: kubectl logs options

Unit logs are made available to journald, and that's great but that's an implementation detail. For any Kubernetes user, kubectl logs [--follow] <pod name> is what's expected to work.

At the time of this writing, there are hints that we agree this feature makes sense:

systemk/systemd/pod.go, lines 265 to 274 (at 745f958):

func (p *P) GetContainerLogs(ctx context.Context, namespace, podName, containerName string, opts api.ContainerLogOpts) (io.ReadCloser, error) {
	log.Printf("GetContainerLogs called")
	unitname := UnitPrefix(namespace, podName) + separator + containerName
	args := []string{"-u", unitname}
	cmd := exec.Command("journalctl", args...)
	// returns the buffers? What about following, use pipes here or something?
	buf, err := cmd.CombinedOutput()
	return ioutil.NopCloser(bytes.NewReader(buf)), err
}

The way I see it, the implementation could do with a bit more love. As I was looking into what could be done on that front, I recalled that moby (dockerd) provides a journald logging driver, which relies on https://github.com/coreos/go-systemd/blob/master/sdjournal/journal.go, but it is quite verbose. So I'm starting this issue to brainstorm whether it makes sense to adopt such code, whether there are any known alternatives or, ultimately, whether this becomes a non-goal entirely.

internal/external address

Right now there is a bit of code that figures out which addresses are external and which are internal. This assumes RFC 1918 address space, which may be a stretch.

We probably need to make flags for this, for when the defaults don't suit you. A pod can then use these by inspecting the node object (I think that's how you do it) and set some env vars which can then be used.

My use case: prom metrics should be exposed on the private port.

feature: support different versions of systemd

For instance, in RHEL7 (systemd 219), user-mode systemd doesn't seem to be a thing. Also, certain unit attributes we populate today are unsupported, such as:

  • BindReadOnlyPaths
  • PrivateMounts
  • ReadOnlyPaths
  • TemporaryFileSystem

I understand older distros may not be a goal, but if we want this project to be used by bigger companies, we'll have to account for this.

Stop chowning paths from systemk

I think (not sure here) that we want to outsource the creation/chowning of files and directories to systemd. I thought of this before, but the thinking back then was that it would be too much shell scripting.

I'm coming back to this because of permissions: if we (the systemk binary) don't do it, we can run root-less pretty easily and just outsource it to systemd.

Alternatively, we can write a little Go utility that does this and gets called from systemd, the main point(s) being that systemk doesn't do this; something called from a unit file does. This also removes state tracking from systemk, as we just create more units. Ordering would be done with Before= and After=.
This can just be done with (multiple) ExecStartPre and ExecStopPost entries (see comments below, and the sketch right after this paragraph).
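
A sketch of that idea (the directives are standard systemd options; the path and user are illustrative):

[Service]
# The "+" prefix runs the pre-start step with full privileges even if User= is set,
# so systemd, not systemk, creates and owns the directory.
ExecStartPre=+/usr/bin/install -d -o prometheus -g prometheus /var/run/emptydirs/demo
User=prometheus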

Keeping systemk stateless is also worthwhile.

static pods

Should we do something like static pods? We can probably use the k3s server for this, but something needs to mirror them in the API server, and systemk must do that.

Would simplify a bunch of things as we can just throw some yaml in a directory and stuff will happen.

unit file getter function

Right now we have a bunch of write functions for unit files (Overwrite, Delete and Insert) but we don't have a Read function, so there is a bunch of u.Contents["Service"]["ExecStart"] to get to a value.

A getter function would be nice here, maybe:

// Content returns the values set in section, under name. If not found nil is returned.
func (u *File) Content(section, name string) []string { .... }

Should be added in internal/unit/file.go and then all callers should be updated.

warn when unimplemented podspec constructs are seen

Right now there is no good way (except reading the source of pod.go) to know if a specific construct in a podSpec is implemented or not. It would be good to log/alert/fail when we see such a construct. This would also make it clear what is still "to be implemented".

run root-less?

Can we run as non-root, potentially letting systemd do all the heavy lifting?

Of course being able to send arbitrary unit to systemd is indistinguishable from actually being root... But it might benefit some use cases.

It could very much turn out this is not possible.

TLS certs for handlers we run

For the log handler we have the tls-cert and tls-key flags that read a cert. It would be nice(r) if we just embedded an ACME client that gets these certs from a CA.

Alternatively this could be outsourced, but then we would need to periodically re-read the key and cert from disk.

I think at this point I prefer ACME as I need to run my own CA anyhow.
