machine-controller's Introduction

Overview / User Guides

Kubermatic Kubernetes Platform is an open source project to centrally manage the global automation of thousands of Kubernetes clusters across multicloud, on-prem and edge with unparalleled density and resilience.

All user documentation is available at the Kubermatic Kubernetes Platform docs website.

Editions

There are two editions of Kubermatic Kubernetes Platform:

Kubermatic Kubernetes Platform Community Edition (CE) is available freely under the Apache License, Version 2.0.

Kubermatic Kubernetes Platform Enterprise Edition (EE) includes premium features that are most useful for organizations with large-scale Kubernetes installations with more than 50 clusters. To access the Enterprise Edition and get official support, please become a subscriber.

Licensing

See the LICENSE file for licensing information as it pertains to files in this repository.

Installation

We strongly recommend that you use an official release of Kubermatic Kubernetes Platform. Follow the instructions under the Installation section of our documentation to get started.

The code and sample YAML files in the main branch of the kubermatic repository are under active development and are not guaranteed to be stable. Use them at your own risk!

More information

The documentation provides a getting started guide, plus information about building from source, architecture, extending kubermatic, and more.

Please use the version selector at the top of the site to ensure you are using the appropriate documentation for your version of kubermatic.

Troubleshooting

If you encounter issues, file an issue or talk to us on the #kubermatic channel on the Kubermatic Community Slack (click here to join).

Contributing

Thanks for taking the time to join our community and start contributing!

Before you start

  • Please familiarize yourself with the Code of Conduct before contributing.
  • See CONTRIBUTING.md for instructions on the developer certificate of origin that we require.

Repository layout

├── addons    # Default Kubernetes addons
├── charts    # The Helm charts we use to deploy
├── cmd       # Various Kubermatic binaries for the controller-managers, operator etc.
├── codegen   # Helper programs to generate Go code and Helm charts
├── docs      # Some basic developer-oriented documentation
├── hack      # Scripts for development and CI
└── pkg       # Most of the actual codebase

Development environment

git clone [email protected]:kubermatic/kubermatic.git
cd kubermatic

There are a couple of scripts in the hack directory to aid in running the components locally for testing purposes.

Running components locally

user-cluster-controller-manager

In order to instrument the seed-controller to allow for a local user-cluster-controller-manager, you need to add a worker-name label with your local machine's name as its value. Additionally, you need to scale down the already running deployment.

# Using a kubeconfig, which points to the seed-cluster
export cluster_id="<id-of-your-user-cluster>"
kubectl label cluster ${cluster_id} worker-name=$(uname -n)
kubectl scale deployment -n cluster-${cluster_id} usercluster-controller --replicas=0

Afterwards, you can start your local user-cluster-controller-manager.

# Using a kubeconfig, which points to the seed-cluster
./hack/run-user-cluster-controller-manager.sh
seed-controller-manager

./hack/run-seed-controller-manager.sh

master-controller-manager

./hack/run-master-controller-manager.sh

Run linters

Before every push, make sure you run:

make lint

Run tests

make test

Update code generation

The Kubernetes code-generator tool does not work outside of GOPATH (upstream issue), so the script below will automatically run the code generation in a Docker container.

hack/update-codegen.sh

Pull requests

  • We welcome pull requests. Feel free to dig through the issues and jump in.

Changelog

See the list of releases to find out about feature changes.

machine-controller's Issues

Vsphere E2E tests

Add the following test cases to the existing E2E test suite.

Ubuntu + Docker 1.13
Ubuntu + Docker 17.03
Ubuntu + CRI-O 1.9

Clean up .circle/config.yaml so that it doesn't run the tests via the create-and-destroy-machine.sh script

Test and document centos on vsphere

As a user I want to be able to spin up worker nodes on vsphere that use CentOS as their distribution.

Acceptance criteria:

  • There is documentation on how to import/create a suitable image for CentOS on vsphere
  • An image in our vsphere test cluster was created by following the steps documented
  • The e2e tests are extended to also test CentOS on vsphere

add prometheus metrics

Add the following metrics (a registration sketch follows the list):

  • Total number of errors
  • Total number of machines
  • Total number of nodes
  • Time it took to create/delete an instance at the cloud provider
  • Time difference between node.CreationTimestamp and machine.CreationTimestamp
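
A minimal sketch of how these metrics could be registered with prometheus/client_golang; the metric names, label set and package layout are illustrative, not the controller's actual metrics:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
)

var (
    // Total number of errors the controller ran into.
    errorsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "machine_controller_errors_total",
        Help: "Total number of errors encountered by the machine-controller.",
    })

    // Current number of machine objects / nodes.
    machines = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "machine_controller_machines",
        Help: "Total number of machine objects.",
    })
    nodes = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "machine_controller_nodes",
        Help: "Total number of nodes created by the machine-controller.",
    })

    // Duration of create/delete calls at the cloud provider, labeled by operation.
    instanceOperationDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name: "machine_controller_instance_operation_duration_seconds",
        Help: "Time it took to create/delete an instance at the cloud provider.",
    }, []string{"operation"})

    // Difference between node.CreationTimestamp and machine.CreationTimestamp.
    nodeJoinDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name: "machine_controller_node_join_duration_seconds",
        Help: "Time between machine creation and node creation.",
    })
)

func init() {
    prometheus.MustRegister(errorsTotal, machines, nodes, instanceOperationDuration, nodeJoinDuration)
}

The histograms would then be observed around the cloud provider create/delete calls and when a node first appears for a machine.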

Not possible to delete machines using AWS

We have 2 clusters on dev.kubermatic.io which cannot be deleted because the machine-controller is not able to delete the machines.

Logs:
kubectl -n cluster-dt56ds7tsb logs machine-controller-559788b7f9-89q9v

E0411 07:29:28.561133       1 machine.go:200] machine-kubermatic-dt56ds7tsb-gf4xr failed with: failed to delete machine at cloudprovider, due to instance not found
E0411 07:29:28.594842       1 machine.go:200] machine-kubermatic-dt56ds7tsb-d5pgz failed with: failed to delete machine at cloudprovider, due to instance not found
E0411 07:29:28.613675       1 machine.go:200] machine-kubermatic-dt56ds7tsb-64ql4 failed with: failed to delete machine at cloudprovider, due to instance not found

e2e tests modify manifest by providing a field selector

At the moment tests replace desired fields in the manifest based on string matching. For example:

params = fmt.Sprintf("%s,<< MACHINE_NAME >>=%s,<< NODE_NAME >>=%s", params, machineName, nodeName)
params = fmt.Sprintf("%s,<< OS_NAME >>=%s,<< CONTAINER_RUNTIME >>=%s,<< CONTAINER_RUNTIME_VERSION >>=%s", params, testCase.osName, testCase.containerRuntime, testCase.containerRuntimeVersion)

We would like to change that by providing the field path instead, for example spec.providerConfig.cloudProvider. This would not only look better, but would also allow us to consume the manifests under the examples directory.
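
A hedged sketch of what such a field-path substitution could look like, using the unstructured helpers from apimachinery together with sigs.k8s.io/yaml; the function name and the example path are assumptions:

package e2e

import (
    "fmt"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "sigs.k8s.io/yaml"
)

// setField decodes a machine manifest, sets value at the given field path
// (e.g. "spec", "providerConfig", "cloudProvider") and re-encodes the result.
func setField(manifest []byte, value string, path ...string) ([]byte, error) {
    obj := map[string]interface{}{}
    if err := yaml.Unmarshal(manifest, &obj); err != nil {
        return nil, fmt.Errorf("failed to unmarshal manifest: %v", err)
    }
    if err := unstructured.SetNestedField(obj, value, path...); err != nil {
        return nil, fmt.Errorf("failed to set %v: %v", path, err)
    }
    return yaml.Marshal(obj)
}

A call such as setField(manifest, "aws", "spec", "providerConfig", "cloudProvider") would then replace the placeholder-based templating.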

Extend circle pipeline

What's missing:
Building a Docker image

  • On push using the commit hash as docker tag
  • On tag using the git tag as docker tag & latest

Running into AWS rate-limits

When creating 5 machines simultaneously, we're getting rate limited by AWS - on all machines.

It seems it happens during validation. Thus the errors we get from AWS are being handled as terminal.

Make container runtime version optional

Based upon the entered Kubernetes version and the selected OS we should default to a docker/cri-o version.

For now the logic should be (a defaulting sketch follows the list):

  • cri-o
    • Kubernetes v1.8 + Ubuntu 16.04 -> error, as there's no cri-o 1.8 package in the repos
    • Kubernetes v1.9 + Ubuntu 16.04 -> cri-o 1.9
    • Kubernetes v1.8 + Container Linux -> error, as there's no cri-o for CoreOS
    • Kubernetes v1.9 + Container Linux -> error, as there's no cri-o for CoreOS
  • docker
    • Kubernetes v1.8 + Ubuntu 16.04 -> docker 1.13
    • Kubernetes v1.9 + Ubuntu 16.04 -> docker 1.13
    • Kubernetes v1.8 + Container Linux -> docker 1.12
    • Kubernetes v1.9 + Container Linux -> docker 1.12
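
A small sketch of what such defaulting could look like; the function signature and the hard-coded matrix below simply mirror the list above and are not the final implementation:

package defaults

import "fmt"

// DefaultContainerRuntimeVersion mirrors the matrix above; the signature and
// the hard-coded values are illustrative only.
func DefaultContainerRuntimeVersion(runtime, operatingSystem string, kubernetesMinor int) (string, error) {
    switch runtime {
    case "cri-o":
        if operatingSystem == "ubuntu" && kubernetesMinor >= 9 {
            return "1.9", nil
        }
        return "", fmt.Errorf("no default cri-o version for %s with Kubernetes 1.%d", operatingSystem, kubernetesMinor)
    case "docker":
        if operatingSystem == "container-linux" {
            return "1.12", nil
        }
        return "1.13", nil
    default:
        return "", fmt.Errorf("unknown container runtime %q", runtime)
    }
}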

Defaulting for Openstack

Usage of the Openstack provider would be easier if there was defaulting for

  • availabilityZone
  • Region
  • Network
  • Subnet
  • FloatingIPPool

To achieve this, the machine-controller should request a list of the given resource, check whether there is exactly one, and if so, default to it.
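
A minimal sketch of the "default only when exactly one candidate exists" rule; the surrounding OpenStack list calls (e.g. via gophercloud) are omitted and the helper name is illustrative:

package openstack

import "fmt"

// defaultIfSingle applies the rule described above: if exactly one candidate
// exists, use it; otherwise require the user to set the field explicitly.
func defaultIfSingle(field string, candidates []string) (string, error) {
    switch len(candidates) {
    case 0:
        return "", fmt.Errorf("no %s found, please specify one in the machine spec", field)
    case 1:
        return candidates[0], nil
    default:
        return "", fmt.Errorf("found %d candidates for %s, please specify one in the machine spec", len(candidates), field)
    }
}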

RBAC broken

#56 apparently broke RBAC:

GET https://10.96.0.1:443/api/v1/configmaps?resourceVersion=13569859&timeoutSeconds=346&watch=true
I0203 01:42:59.303274       1 round_trippers.go:439] Response Status: 403 Forbidden in 1 milliseconds
I0203 01:42:59.303287       1 round_trippers.go:442] Response Headers:
I0203 01:42:59.303293       1 round_trippers.go:445]     Content-Type: application/json
I0203 01:42:59.303298       1 round_trippers.go:445]     X-Content-Type-Options: nosniff
I0203 01:42:59.303302       1 round_trippers.go:445]     Content-Length: 277
I0203 01:42:59.303307       1 round_trippers.go:445]     Date: Sat, 03 Feb 2018 01:42:59 GMT
I0203 01:42:59.303628       1 request.go:873] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"configmaps is forbidden: User \"system:serviceaccount:kube-system:machine-controller\" cannot watch configmaps at the cluster scope","reason":"Forbidden","details":{"kind":"configmaps"},"code":403}

What surprises me a little is that it watches all configmaps; shouldn't a watch on the cluster-info configmap in the kube-public namespace be enough?
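
For comparison, a sketch of the narrower access pattern suggested above, reading only the cluster-info ConfigMap in kube-public with client-go (assuming a recent client-go where Get takes a context); this would allow a namespace-scoped RBAC rule instead of cluster-wide watch permissions:

package clusterinfo

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// getClusterInfo reads only the cluster-info ConfigMap in kube-public instead
// of watching configmaps at the cluster scope, so the required RBAC rule can
// be limited to a single namespaced resource.
func getClusterInfo(ctx context.Context, client kubernetes.Interface) (map[string]string, error) {
    cm, err := client.CoreV1().ConfigMaps("kube-public").Get(ctx, "cluster-info", metav1.GetOptions{})
    if err != nil {
        return nil, err
    }
    return cm.Data, nil
}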

Allow to specify ssh-key via flag

Current state:
On initial start, we check if a secret with a private ssh key exists.
If no secret is found, we generate a secret with a private key.

This ssh key will later be used when creating instances at cloud providers.
This was done so the user does not have to specify an ssh public key in the machine manifest, as some cloud providers require a public key to be specified when creating an instance (AWS).

All public keys from the machine manifest are getting deployed via cloud-init.

Desired state:
The controller should accept a path to a private key via a command line flag.
If the flag is specified and a valid key is found, that key should be used.
If no flag is specified or the key cannot be found, the old secret-based logic should apply.
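
A sketch of the proposed flag handling, assuming a flag name of -ssh-private-key; the fallback function stands in for the existing secret-based logic and is hypothetical:

package controller

import (
    "flag"
    "os"
)

// Hypothetical flag; the name is illustrative.
var sshKeyPath = flag.String("ssh-private-key", "", "path to a private SSH key used when creating instances (optional)")

// loadPrivateKey prefers the key given via the flag and falls back to the
// existing secret-based logic otherwise. The fallback is passed in because it
// stands in for code that already exists elsewhere in the controller.
func loadPrivateKey(fallback func() ([]byte, error)) ([]byte, error) {
    if *sshKeyPath != "" {
        if data, err := os.ReadFile(*sshKeyPath); err == nil {
            return data, nil
        }
    }
    return fallback()
}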

Make security-group creation on aws a fallback

We need to add a config variable for the securityGroups and should only create a security group on AWS when none is defined, as a convenience/quickstart help.
We should probably also log this at loglevel 2.

Openstack: Floating IPs are not reused which may result in FIP exhaustion

Basically the title; from the machine-controller log:

E0120 13:47:35.431740       1 machine.go:162] machine-controller failed with: failed to create machine at cloudprovider: failed to allocate a floating ip: Expected HTTP response code [201 202] when accessing [POST http://192.168.0.39:9696/v2.0/floatingips], but got 409 instead
{"NeutronError": {"message": "No more IP addresses available on network 06fb6e98-4e98-4320-9f00-34e028ed53cb.", "type": "IpAddressGenerationFailure", "detail": ""}}

I'd expect the machine-controller to reuse already assigned but unused FIPs instead of requesting a new one.

running machine-controller through leader election is optional

The machine controller has been incorporated into kubermatic and is an inherent part of every cluster.
That made local development/testing impossible, as it is highly likely that the machine controller running inside kubermatic will acquire the lock before the local instance does.

Making leader election optional seems to remedy this issue.

Do not use cluster-info configmap anymore

Right now the machine-controller uses the cluster-info configmap to get the CACert and the endpoint for the apiserver.

Instead it should get the CACert from its kubeconfig or from /run and the apiserver endpoints from its kubeconfig or from the endpoints of the kubernetes service when running in-cluster.

This will reduce the configuration overhead and help people get started faster.
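
A sketch of how the client configuration could be built instead, assuming a kubeconfig path passed via flag: the resulting rest.Config already carries the API server endpoint (Host) and the CA data, either from the kubeconfig or from the in-cluster service account environment:

package config

import (
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
)

// clientConfig returns a rest.Config whose Host and TLS CA come either from
// the given kubeconfig or, when running in-cluster, from the service account
// environment and the kubernetes service.
func clientConfig(kubeconfig string) (*rest.Config, error) {
    if kubeconfig != "" {
        return clientcmd.BuildConfigFromFlags("", kubeconfig)
    }
    return rest.InClusterConfig()
}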

Use `kubeadm join` instead of manually maintaining kubelet config

Right now we maintain the kubelet config as part of the distro-specific templates. This has some drawbacks:

  • We may miss important configuration parameters
  • Whenever we change something, we have to change it at multiple places
  • There is no way to have different configs based on Kubelet version

Instead, it would be easier if we just used kubeadm join to configure the Kubelet.

Extend e2e testing

We should add the following test cases:

  • Hetzner
    • Ubuntu + Docker 1.13
    • Ubuntu + Docker 17.03
    • Ubuntu + CRI-O 1.9
  • Digitalocean
    • Ubuntu + Docker 1.13
    • Ubuntu + Docker 17.03
    • Ubuntu + CRI-O 1.9
    • CoreOS + Docker 1.13
    • CoreOS + Docker 17.03
  • AWS
    • Ubuntu + Docker 1.13
    • Ubuntu + Docker 17.03
    • Ubuntu + CRI-O 1.9
    • CoreOS + Docker 1.13
    • CoreOS + Docker 17.03
  • Openstack (We need a sponsor here)
    • Ubuntu + Docker 1.13
    • Ubuntu + Docker 17.03
    • Ubuntu + CRI-O 1.9
    • CoreOS + Docker 1.13
    • CoreOS + Docker 17.03

Hetzner E2E tests

Add the following test cases to the existing E2E test suite.

Ubuntu + Docker 1.13
Ubuntu + Docker 17.03
Ubuntu + CRI-O 1.9

Clean up .circle/config.yaml so that it doesn't run the tests via the create-and-destroy-machine.sh script

Add integration testing script

To be able to properly validate the machine-controller is working as intended, we need some kind of integration testing.

Because it is not possible to both test external PRs automatically and be sure they are not used to steal credentials, this script is not supposed to be executed automatically. Instead it will:

  • Take credentials for a cloud provider, e.g. from the environment
  • Take an ssh pubkey
  • Create a single node k8s cluster via kubeadm at cloudprovider
  • Deploy machine-controller in a version built from git HEAD into the newly created cluster using the deployment in the repo
  • Verify machine-controller is running
  • Provide a teardown functionality

Schedule nightly E2E test runs

Since running the complete e2e suite takes too long, as a temporary step we could schedule a nightly test run. Running the tests frequently would increase confidence and hopefully reveal potential issues that might crop up.

Move cloudprovider secrets out of machine definition and into a secret

Right now the machine definition contains all access secrets to the cloud provider it is spawned on. This has two drawbacks:

  • Anyone who is supposed to have the permission to create machines has to know these credentials
  • There are some objects (e.g. security groups, ssh keys) whose lifetime is not coupled to a machine but to the usage of the cloud provider, meaning as long as any machine uses that cloud provider, they have to exist

Instead we want to move the cloudprovider secrets into an actual secret which is then referenced by machines.

Add end-to-end testing to CircleCI pipeline

To better catch bugs introduced by PRs, we should add the existing end-to-end tests to the CircleCI pipeline.

This requires:

  • Configuring the ssh keypair name at the cloudprovider with a random prefix
  • Deleting the ssh keypair before ending the e2e test
  • Adding the test-e2e target to circleci

Trigger events

With the implementation of transient and terminal errors we now correctly set machine.status.errorReason & machine.status.errorMessage when the controller runs into a terminal error.

Transient errors, though, are not reported back; the only way to see them is by investigating the logs.
Instead of just logging, we should trigger an event which is attached to the machine.
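
A sketch of how such an event could be emitted with client-go's record package; setting up the EventBroadcaster/EventRecorder is omitted and the reason string is illustrative:

package controller

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/tools/record"
)

// recordTransientError attaches a warning event to the machine object instead
// of only logging the error. The recorder would be created once from an
// EventBroadcaster; "machine" is whatever runtime.Object represents the
// Machine resource.
func recordTransientError(recorder record.EventRecorder, machine runtime.Object, err error) {
    recorder.Event(machine, corev1.EventTypeWarning, "TransientError", err.Error())
}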

simple e2e test tool

Having a simple command line tool that would verify whether a node has been created serves not only as a good warm-up exercise but also as a handy test tool.

The idea is that we would have a list of predefined machine manifests that would need some customisation in terms of credentials. The credentials could be accepted as command line arguments and passed all the way down to the manifests. After POST'ing the given manifests to the kube-apiserver, the test tool would read the current cluster state in order to determine the correctness of the machine-controller.

The test tool would use the standard client-go library to talk to the API server and would read the kubeconfig configuration file to discover where the cluster is actually located.

assumptions:

  • cluster was created manually
  • kube config is accessible
  • there is a list of predefined machine manifests

For example, running the following command: verify -input path_to_manifest -parameters key=value,key2=value would print a machine "node-docker" has been created to stdout.
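
A stripped-down sketch of the verification step only (manifest templating and the POST are omitted), assuming a recent client-go and illustrative flag names:

package main

import (
    "context"
    "flag"
    "fmt"
    "os"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    kubeconfig := flag.String("kubeconfig", os.Getenv("KUBECONFIG"), "path to the kubeconfig file")
    nodeName := flag.String("node-name", "", "name of the node that is expected to appear")
    flag.Parse()

    // Read the kubeconfig to discover where the cluster is located.
    cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
    if err != nil {
        fmt.Fprintf(os.Stderr, "failed to load kubeconfig: %v\n", err)
        os.Exit(1)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Check the current cluster state: did the expected node show up?
    if _, err := client.CoreV1().Nodes().Get(context.Background(), *nodeName, metav1.GetOptions{}); err != nil {
        fmt.Fprintf(os.Stderr, "node %q not found: %v\n", *nodeName, err)
        os.Exit(1)
    }
    fmt.Printf("a node %q has been created\n", *nodeName)
}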

Parse versions via a semver library

We need to parse the user-given versions (kubelet & container runtime) to process them correctly, especially so we can accept both v1.9.2 and 1.9.2 as input.
Currently we require the kubelet version to have a leading v, but we don't require it for the container runtime version.
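
A sketch of the parsing step with a semver library (e.g. github.com/Masterminds/semver, which accepts versions with and without the leading v); the helper name is illustrative:

package util

import "github.com/Masterminds/semver/v3"

// normalizeVersion parses a user-supplied version; the library accepts both
// "v1.9.2" and "1.9.2", so later comparisons no longer depend on the prefix.
func normalizeVersion(v string) (*semver.Version, error) {
    return semver.NewVersion(v)
}

normalizeVersion("v1.9.2") and normalizeVersion("1.9.2") would then yield the same parsed version for both kubelet and container runtime inputs.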

deadlock when trying to delete a machine

Steps to reproduce:

  1. Create an invalid machine - you can use the following manifest that doesn't specify required credentials https://github.com/kubermatic/machine-controller/blob/master/examples/machine-digitalocean.yaml
  2. Delete the previously created machine.
  3. List machine resources

Result:
The machine was not deleted and the server keeps saying machine1 failed with: failed to get instance for machine machine1 after the deletion was triggered.
The only way of getting out of this situation is to manually edit the machine object and remove the finalizers.

In general the described state exists because we add finalizers to a machine before creating a node, since we want to prevent deletion of the machine resource.

Since the call that requests a node can fail for many reasons, this issue could help us track the discussion on possible solutions.

process machines which were annotated

The machine controller has been incorporated into kubermatic and is an inherent part of every cluster. That made local development/testing impossible, as every machine is processed by the in-cluster machine controller.

We could annotate a machine manifest with some arbitrary data and at the same time introduce a new command line flag. On a successful match a controller should continue; otherwise it should leave the machine to others. An empty annotation means there is no preference.
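
A sketch of the matching rule described above; the flag wiring is omitted and the function name is illustrative:

package controller

// shouldProcess implements the matching rule: an empty annotation on the
// machine means no preference, so any controller may handle it; otherwise the
// value must match the one this controller was started with.
func shouldProcess(machineAnnotation, controllerValue string) bool {
    if machineAnnotation == "" {
        return true
    }
    return machineAnnotation == controllerValue
}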

Add support for accepting cloud-provider credentials as EnvVar's

The Machine object accepts multiple sources for cloudProviderSpec fields:

  • Direct value
...
spec:
...
  providerConfig:
    cloudProvider: "aws"
    cloudProviderSpec:
      accessKeyId: "foo"
  • Secret ref
...
spec:
...
  providerConfig:
    cloudProvider: "aws"
    cloudProviderSpec:
      accessKeyId:
        secretKeyRef:
          namespace: kube-system
          name: machine-controller-aws
          key: accessKeyId
  • ConfigMap ref
...
spec:
...
  providerConfig:
    cloudProvider: "aws"
    cloudProviderSpec:
      accessKeyId:
        configMapKeyRef:
          namespace: kube-system
          name: machine-controller-aws
          key: accessKeyId

It should also be possible to pass in the secret values implicitly as environment variables.
The secret values differ per cloud provider:

  • AWS
    • Access Key ID
    • Secret Access Key
  • Hetzner
    • Token
  • Digitalocean
    • Token
  • OpenStack
    • Username
    • Password

Each secret field needs one specific environment variable key, like AWS_ACCESS_KEY_ID.
During processing of the cloudProviderSpec we would need to check whether the environment variable is set, and if so, use its value.
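
A sketch of that lookup order; the helper name is illustrative and the mapping from provider fields to environment keys (such as AWS_ACCESS_KEY_ID) would live elsewhere:

package providerconfig

import "os"

// valueFromEnvOrSpec implements the lookup order: a set environment variable
// (e.g. AWS_ACCESS_KEY_ID) wins over whatever was resolved from the
// cloudProviderSpec (direct value, secret ref or configmap ref).
func valueFromEnvOrSpec(envKey, specValue string) string {
    if v, ok := os.LookupEnv(envKey); ok && v != "" {
        return v
    }
    return specValue
}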

Reason: In scenarios where the master components are managed by an external entity (Loodse Kubermatic / SAP Gardener), it might not be possible to expose the cloud-provider-specific secrets to the users.

Create temporary ssh key during instance creation when required. Delete afterwards

Current state:
On initial start, we check if a secret with a private ssh key exists.
If no secret is found, we generate a secret with a private key.

This ssh key will later be used when creating instances at cloud providers.
This was done so the user does not have to specify an ssh public key in the machine manifest, as some cloud providers require a public key to be specified when creating an instance (DigitalOcean).

All public keys from the machine manifest are getting deployed via cloud-init.

Desired state:
The whole ssh key logic should be removed.
If a cloud provider requires an ssh key during instance creation (see the sketch after the list):

  • Create a temporary key before the instance gets created
  • Use the temporary key for instance creation
  • Delete the temporary key after the instance has been created
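
A sketch of the key-generation part of that flow using crypto/rsa and golang.org/x/crypto/ssh; registering the key at the cloud provider and deleting it again afterwards are left to the caller:

package sshutil

import (
    "crypto/rand"
    "crypto/rsa"

    "golang.org/x/crypto/ssh"
)

// generateTemporaryKey creates a throwaway key pair for instance creation and
// returns the public key in authorized_keys format. Uploading the key to the
// cloud provider and removing it after instance creation are up to the caller.
func generateTemporaryKey() (*rsa.PrivateKey, []byte, error) {
    privateKey, err := rsa.GenerateKey(rand.Reader, 4096)
    if err != nil {
        return nil, nil, err
    }
    pub, err := ssh.NewPublicKey(&privateKey.PublicKey)
    if err != nil {
        return nil, nil, err
    }
    return privateKey, ssh.MarshalAuthorizedKey(pub), nil
}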
