kubermatic / machine-controller
License: Apache License 2.0
It should be possible to set the metadata on an OpenStack instance via a machine.yaml, very similar to tags on AWS.
At the moment, running the complete test suite takes ~1h. This prevents us from running it on a per-pull-request basis.
We would like to take a deeper look into this and improve test time as much as possible.
The completed runs:
https://circleci.com/gh/kubermatic/machine-controller/2738
Add the following test cases to the existing E2E test suite:
Ubuntu + Docker 1.13
Ubuntu + Docker 17.03
Ubuntu + CRI-O 1.9
Clean up .circleci/config.yml so that it doesn't run the tests using the create-and-destroy-machine.sh script.
The machine controller has been incorporated into Kubermatic and is an inherent part of every cluster. That made local development/testing impossible, as every machine is processed by the in-cluster machine controller.
We could annotate a machine manifest with some arbitrary data and at the same time introduce a new command-line flag. On a successful match a controller should continue, otherwise it should leave the machine to others. An empty annotation means there is no preference; see the sketch below.
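A minimal Go sketch of that matching rule; the annotation key and the flag it matches against are illustrative assumptions, not the project's actual names:

    package controller

    // shouldProcess reports whether this controller instance should handle a
    // machine, given the machine's annotations and the value of a hypothetical
    // -controller-name flag.
    func shouldProcess(annotations map[string]string, controllerName string) bool {
        owner := annotations["machine.k8s.io/controller"] // assumed annotation key
        // An empty annotation means no preference: any controller may proceed.
        return owner == "" || owner == controllerName
    }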
Steps to reproduce:
Result:
The machine was not deleted, and the machine-controller keeps saying machine1 failed with: failed to get instance for machine machine1 after the delete was triggered.
The only way out of this situation is to manually edit the machine's spec and remove the finalizers.
In general the described state exists because we add finalizers to a machine before creating a node, since we want to prevent deletion of the machine resource.
Since the call that requests a node can fail for many reasons, this issue can help us track discussion of possible solutions.
Since running the complete e2e suite takes too long, as a temporary step we could schedule a nightly test run. Running the tests frequently would increase confidence and hopefully reveal potential issues early.
Usage of the OpenStack provider would be easier if there were defaulting for
To achieve this, the machine-controller should request a list of the given resource, check if there is exactly one, and if so default to it; see the sketch below.
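A minimal sketch of that defaulting rule in Go, assuming a generic listing function (all names illustrative):

    package defaulting

    import "fmt"

    // defaultResource returns the only available resource, or an error when the
    // choice is ambiguous and the user has to pick explicitly.
    func defaultResource(list func() ([]string, error)) (string, error) {
        items, err := list()
        if err != nil {
            return "", err
        }
        if len(items) != 1 {
            return "", fmt.Errorf("cannot default: found %d candidates, need exactly 1", len(items))
        }
        return items[0], nil
    }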
CircleCI needs to build when we push a tag.
The tag should be used for tagging the Docker image.
The machine controller has been incorporated into Kubermatic and is an inherent part of every cluster.
That made local development/testing impossible, as it is highly likely that the machine controller running inside Kubermatic will acquire the leader lock right before the local instance does.
Making leader election optional seems to remedy this issue.
Based on the entered Kubernetes version and the selected OS, we should default to a Docker/CRI-O version.
For now the logic should be:
Right now the machine-controller uses the cluster-info configmap to get the CA cert and the endpoint for the apiserver.
Instead it should get the CA cert from its kubeconfig or from /run, and the apiserver endpoints from its kubeconfig or from the endpoints of the kubernetes service when running in-cluster.
This will reduce the configuration overhead and help people get started faster; see the sketch below.
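For reference, client-go can already resolve both pieces from a kubeconfig or from in-cluster defaults. A minimal sketch (the fallback order shown is an assumption):

    package main

    import (
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/clientcmd"
    )

    // getConfig prefers an explicit kubeconfig; otherwise it uses the in-cluster
    // defaults, where the CA cert comes from the service-account mount under
    // /var/run/secrets/kubernetes.io/serviceaccount and the apiserver endpoint
    // from the KUBERNETES_SERVICE_* environment variables.
    func getConfig(kubeconfig string) (*rest.Config, error) {
        if kubeconfig != "" {
            return clientcmd.BuildConfigFromFlags("", kubeconfig)
        }
        return rest.InClusterConfig()
    }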
When creating 5 machines simultaneously, we're getting rate-limited by AWS, on all machines.
It seems to happen during validation, so the errors we get from AWS are being handled as terminal.
Aside from Container Linux and Ubuntu, we should also support Enterprise Linux based distros, e.g. CentOS.
The Machine object accepts multiple sources for cloudProviderSpec fields.
As a plain value:

    ...
    spec:
      ...
      providerConfig:
        cloudProvider: "aws"
        cloudProviderSpec:
          accessKeyId: "foo"

Via a secret reference:

    ...
    spec:
      ...
      providerConfig:
        cloudProvider: "aws"
        cloudProviderSpec:
          accessKeyId:
            secretKeyRef:
              namespace: kube-system
              name: machine-controller-aws
              key: accessKeyId

Via a configmap reference:

    ...
    spec:
      ...
      providerConfig:
        cloudProvider: "aws"
        cloudProviderSpec:
          accessKeyId:
            configMapKeyRef:
              namespace: kube-system
              name: machine-controller-aws
              key: accessKeyId
It should also be possible to pass in the secret values implicitly as environment variables. The secret values differ per cloud provider; each secret field needs one specific environment key, like AWS_ACCESS_KEY_ID. During the processing of the cloudProviderSpec we would need to check whether the environment variable is set, and if so use its value; a sketch follows below.
Reason: in scenarios where the master components are managed by an external entity (Loodse Kubermatic / SAP Gardener) it might not be possible to expose the cloud-provider-specific secrets to the users.
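A minimal sketch of that lookup, assuming each spec field is mapped to one well-known environment key:

    package provider

    import "os"

    // resolveSecretValue returns the value of the field's environment variable
    // when it is set (e.g. AWS_ACCESS_KEY_ID), as proposed above, and otherwise
    // falls back to the value given in the cloudProviderSpec.
    func resolveSecretValue(specValue, envKey string) string {
        if v := os.Getenv(envKey); v != "" {
            return v
        }
        return specValue
    }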
Basically the title; from the machine-controller log:
E0120 13:47:35.431740 1 machine.go:162] machine-controller failed with: failed to create machine at cloudprovider: failed to allocate a floating ip: Expected HTTP response code [201 202] when accessing [POST http://192.168.0.39:9696/v2.0/floatingips], but got 409 instead
{"NeutronError": {"message": "No more IP addresses available on network 06fb6e98-4e98-4320-9f00-34e028ed53cb.", "type": "IpAddressGenerationFailure", "detail": ""}}
I'd expect the machine-controller to reuse already assigned but unused FIPs instead of requesting a new one.
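A rough sketch of such reuse with gophercloud's networking/v2 floating-IP API, assuming a FIP without a port is safe to treat as unused:

    package openstack

    import (
        "github.com/gophercloud/gophercloud"
        "github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/layer3/floatingips"
        "github.com/gophercloud/gophercloud/pagination"
    )

    // findUnusedFloatingIP returns the first floating IP that is not attached
    // to any port, or nil if none exists and a new one must be allocated.
    func findUnusedFloatingIP(client *gophercloud.ServiceClient) (*floatingips.FloatingIP, error) {
        var unused *floatingips.FloatingIP
        err := floatingips.List(client, floatingips.ListOpts{}).EachPage(
            func(page pagination.Page) (bool, error) {
                fips, err := floatingips.ExtractFloatingIPs(page)
                if err != nil {
                    return false, err
                }
                for i := range fips {
                    if fips[i].PortID == "" { // unattached => reusable
                        unused = &fips[i]
                        return false, nil // stop paging
                    }
                }
                return true, nil
            })
        return unused, err
    }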
Right now we maintain the kubelet config as part of the distro-specific templates. This has some drawbacks:
Instead it would be easier if we just used kubeadm join to configure the kubelet.
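For illustration, joining a node then boils down to a single kubeadm invocation on the instance (endpoint, token, and hash are placeholders):

    kubeadm join 10.0.0.1:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>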
At the moment tests replace desired fields in the manifest based on string matching. For example:
    params = fmt.Sprintf("%s,<< MACHINE_NAME >>=%s,<< NODE_NAME >>=%s", params, machineName, nodeName)
    params = fmt.Sprintf("%s,<< OS_NAME >>=%s,<< CONTAINER_RUNTIME >>=%s,<< CONTAINER_RUNTIME_VERSION >>=%s", params, testCase.osName, testCase.containerRuntime, testCase.containerRuntimeVersion)
We would like to change that to providing a field path, for example spec.providerConfig.cloudProvider. This would not only look better but would also allow consuming the manifests under the example directory; a sketch follows.
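A minimal sketch of field-path based substitution, using gopkg.in/yaml.v2 (the helper name is illustrative):

    package main

    import (
        "fmt"
        "strings"

        "gopkg.in/yaml.v2"
    )

    // setField walks a dotted path like "spec.providerConfig.cloudProvider"
    // through a decoded YAML document and sets the final key to value.
    func setField(doc map[interface{}]interface{}, path string, value interface{}) error {
        keys := strings.Split(path, ".")
        cur := doc
        for _, k := range keys[:len(keys)-1] {
            next, ok := cur[k].(map[interface{}]interface{})
            if !ok {
                return fmt.Errorf("path element %q not found", k)
            }
            cur = next
        }
        cur[keys[len(keys)-1]] = value
        return nil
    }

    func main() {
        manifest := []byte("spec:\n  providerConfig:\n    cloudProvider: \"\"\n")
        var doc map[interface{}]interface{}
        if err := yaml.Unmarshal(manifest, &doc); err != nil {
            panic(err)
        }
        if err := setField(doc, "spec.providerConfig.cloudProvider", "aws"); err != nil {
            panic(err)
        }
        out, _ := yaml.Marshal(doc)
        fmt.Print(string(out))
    }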
We should add the following test cases:
To be able to properly validate the machine-controller is working as intended, we need some kind of integration testing.
Because it is not possible to both test external PRs automatically and be sure they are not used to steal credentials, this script is not supposed to be executed automatically. Instead it will:
Integration testing for the machine-controller based on the proposal.
Having a simple command-line tool that verifies whether a node has been created serves not only as a good warm-up exercise but also as a handy test tool.
The idea is that we would have a list of predefined machine manifests that need some customisation in terms of credentials. The credentials could be accepted as command-line arguments and passed all the way down to the manifests. After POST'ing the given manifests to the kube-apiserver, the test tool would read the current cluster state in order to determine the correctness of the machine-controller.
The test tool would use the standard client-go library to talk to the API server and would read the kubeconfig configuration file to discover where the cluster is actually located.
Assumptions: the kubeconfig is accessible.
For example, running the following command:

    verify -input path_to_manifest -parameters key=value,key2=value

would print a machine "node-docker" has been created to stdout. A sketch of the node-waiting logic follows.
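A minimal sketch of that wait, assuming a client-go of that era (pre-context Get signatures) and that the machine name is reused as the node name:

    package e2e

    import (
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
    )

    // waitForNode polls until a node with the given name shows up.
    func waitForNode(client kubernetes.Interface, name string, timeout time.Duration) error {
        return wait.Poll(5*time.Second, timeout, func() (bool, error) {
            _, err := client.CoreV1().Nodes().Get(name, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                return false, nil // keep polling
            }
            if err != nil {
                return false, err
            }
            return true, nil
        })
    }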
What's missing:
Building a Docker image
#56 apparently broke RBAC:
GET https://10.96.0.1:443/api/v1/configmaps?resourceVersion=13569859&timeoutSeconds=346&watch=true
I0203 01:42:59.303274 1 round_trippers.go:439] Response Status: 403 Forbidden in 1 milliseconds
I0203 01:42:59.303287 1 round_trippers.go:442] Response Headers:
I0203 01:42:59.303293 1 round_trippers.go:445] Content-Type: application/json
I0203 01:42:59.303298 1 round_trippers.go:445] X-Content-Type-Options: nosniff
I0203 01:42:59.303302 1 round_trippers.go:445] Content-Length: 277
I0203 01:42:59.303307 1 round_trippers.go:445] Date: Sat, 03 Feb 2018 01:42:59 GMT
I0203 01:42:59.303628 1 request.go:873] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"configmaps is forbidden: User \"system:serviceaccount:kube-system:machine-controller\" cannot watch configmaps at the cluster scope","reason":"Forbidden","details":{"kind":"configmaps"},"code":403}
What surprises me a little is that it watches all configmaps; shouldn't a watch on the cluster-info configmap in the kube-public namespace be enough?
Add a MarshalJSON function to all configVar* types.
Also check if we can refactor the logic in the Unmarshal functions. A sketch follows.
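A hypothetical sketch of what such a MarshalJSON could look like for a string-valued config variable; the type and field names are assumptions, mirroring the plain-value/secret-reference split shown above:

    package providerconfig

    import "encoding/json"

    type GlobalSecretKeySelector struct {
        Namespace string `json:"namespace"`
        Name      string `json:"name"`
        Key       string `json:"key"`
    }

    type configVarString struct {
        Value        string
        SecretKeyRef *GlobalSecretKeySelector
    }

    // MarshalJSON emits either a bare JSON string or a secretKeyRef object,
    // mirroring what the Unmarshal logic accepts.
    func (c configVarString) MarshalJSON() ([]byte, error) {
        if c.SecretKeyRef == nil {
            return json.Marshal(c.Value)
        }
        return json.Marshal(struct {
            SecretKeyRef *GlobalSecretKeySelector `json:"secretKeyRef"`
        }{c.SecretKeyRef})
    }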
Add the following metrics:
For reference, see also:
#159 (comment)
#159
#129
#178 (e2e tests modify manifest by providing a field selector)
As a user I want to be able to spin up worker nodes on vSphere that use CentOS as the distribution.
Acceptance criteria:
We have 2 clusters on dev.kubermatic.io which cannot be deleted because the machine-controller is not able to delete the machines.
Logs:
kubectl -n cluster-dt56ds7tsb logs machine-controller-559788b7f9-89q9v
E0411 07:29:28.561133 1 machine.go:200] machine-kubermatic-dt56ds7tsb-gf4xr failed with: failed to delete machine at cloudprovider, due to instance not found
E0411 07:29:28.594842 1 machine.go:200] machine-kubermatic-dt56ds7tsb-d5pgz failed with: failed to delete machine at cloudprovider, due to instance not found
E0411 07:29:28.613675 1 machine.go:200] machine-kubermatic-dt56ds7tsb-64ql4 failed with: failed to delete machine at cloudprovider, due to instance not found
Current state:
On initial start, we check if a secret with a private SSH key exists.
If no secret is found, we generate one with a private key.
This SSH key is later used when creating instances at cloud providers.
This was done so the user does not have to specify an SSH public key in the machine manifest, as some cloud providers require a public key when creating an instance (DigitalOcean).
All public keys from the machine manifest are deployed via cloud-init.
Desired state:
The whole SSH key logic should be removed.
If a cloud provider requires an SSH key during instance creation:
It should be possible to set the AWS AMI ID.
Right now if a node gets deleted for whatever reason (e.g. manually), the machine-controller will recreate it. This is fine, but results in the node not having an ownerRef.
We need to parse the user-given versions (kubelet & container runtime) to process them correctly in the end, especially so we can accept both v1.9.2 and 1.9.2 as input.
Currently we require the kubelet version to have a leading v, but we don't require it for the container runtime version; see the sketch below.
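A minimal normalisation sketch (whether to strip or to add the leading v is a design choice; stripping is shown here):

    package version

    import "strings"

    // normalizeVersion accepts both "v1.9.2" and "1.9.2" and returns the
    // version without the leading v.
    func normalizeVersion(s string) string {
        return strings.TrimPrefix(strings.TrimSpace(s), "v")
    }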
We need to add a config variable for the securityGroups and should only create a security group on AWS when none is defined, as a convenience/quickstart help.
Also, we should probably log this at loglevel 2.
Right now the machine definition contains all access secrets for the cloud provider it is spawned on. This has two drawbacks:
Instead we want to move the cloud-provider secrets into an actual secret which is then referenced by machines.
To better know if PRs add bugs, we should add the existing end-to-end tests to the CircleCI pipeline.
This requires:
Adding a test-e2e target to CircleCI.
With the implementation of transient and terminal errors, we now correctly set machine.status.errorReason and machine.status.errorMessage when the controller runs into a terminal error.
Transient errors, though, are not reported back; the only way to see those is by investigating the logs.
Instead of just logging, we should emit an event which is attached to the machine; see the sketch below.
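A minimal sketch using client-go's event recorder, assuming an EventRecorder is already wired up in the controller (reason string is illustrative):

    package controller

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/runtime"
        "k8s.io/client-go/tools/record"
    )

    // reportTransient surfaces a transient reconcile error as a warning event
    // attached to the machine object.
    func reportTransient(recorder record.EventRecorder, machine runtime.Object, err error) {
        recorder.Eventf(machine, corev1.EventTypeWarning, "TransientError", "reconcile failed: %v", err)
    }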
In https://github.com/kubermatic/machine-controller/blob/master/cmd/controller/main.go#L211 we start the informerFactories in a separate goroutine, even though informerFactory.Start(stopCh) itself is non-blocking.
This might be the reason I sometimes see the controller trying to create something that already exists: the lister just doesn't have it yet. See the sketch below.
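A common fix, sketched with client-go's shared informer factory: start the factory and block until the caches have synced before running any workers.

    package controller

    import (
        "fmt"

        "k8s.io/client-go/informers"
    )

    // startAndSync starts all informers and waits for their caches to fill, so
    // listers don't race against objects that already exist on the apiserver.
    func startAndSync(factory informers.SharedInformerFactory, stopCh <-chan struct{}) error {
        factory.Start(stopCh) // non-blocking
        for typ, synced := range factory.WaitForCacheSync(stopCh) {
            if !synced {
                return fmt.Errorf("cache for %v did not sync", typ)
            }
        }
        return nil
    }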
Just observed that deleting a machine does delete the machine at the cloud provider but does not delete the Kubernetes node, at least for Hetzner. Didn't try other providers so far.
First we need general Prometheus support to be merged with #49.
See docs: https://godoc.org/github.com/heptiolabs/healthcheck#example-package--Metrics
Use the leaderelection component from the Kubernetes go-client; a sketch follows.
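A minimal sketch against a recent client-go (older versions use a ConfigMap-based lock and a RunOrDie signature without a context); lease name, namespace, and timings are illustrative:

    package main

    import (
        "context"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    // runWithLeaderElection blocks, invoking run only while holding the lease.
    func runWithLeaderElection(client kubernetes.Interface, id string, run func(context.Context)) {
        lock := &resourcelock.LeaseLock{
            LeaseMeta:  metav1.ObjectMeta{Name: "machine-controller", Namespace: "kube-system"},
            Client:     client.CoordinationV1(),
            LockConfig: resourcelock.ResourceLockConfig{Identity: id},
        }
        leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
            Lock:          lock,
            LeaseDuration: 15 * time.Second,
            RenewDeadline: 10 * time.Second,
            RetryPeriod:   2 * time.Second,
            Callbacks: leaderelection.LeaderCallbacks{
                OnStartedLeading: run,
                OnStoppedLeading: func() { /* exit or re-campaign */ },
            },
        })
    }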
Right now the e2e tests sometimes fail due to kubectl trying to create the secret when it's already there.
We should probably just not use a secret in the e2e tests and instead put the required credentials into the machine spec.
Doing a make machine-controller results in a binary owned by root.
Can we pass in the UID from the user and do a chown in the build container?
We use operatingSystem: coreos. As it got renamed to Container Linux, we should adapt.
Current state:
On initial start, we check if a secret with a private SSH key exists.
If no secret is found, we generate one with a private key.
This SSH key is later used when creating instances at cloud providers.
This was done so the user does not have to specify an SSH public key in the machine manifest, as some cloud providers require a public key when creating an instance (AWS).
All public keys from the machine manifest are deployed via cloud-init.
Desired state:
The controller should accept a path to a private key via a command-line flag.
If the flag is specified and a valid key is found, that key should be used.
If no flag was specified or the key was not found, the old logic with the secret should apply; a sketch follows.
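A minimal sketch of that flag handling; the PKCS#1/PEM format is an assumption, as is signalling the secret-based fallback via a nil key:

    package main

    import (
        "crypto/rsa"
        "crypto/x509"
        "encoding/pem"
        "fmt"
        "io/ioutil"
    )

    // loadPrivateKey returns the key at path, or nil when no path was given so
    // the caller can fall back to the generated-secret logic.
    func loadPrivateKey(path string) (*rsa.PrivateKey, error) {
        if path == "" {
            return nil, nil // no flag: keep the old secret-based behaviour
        }
        data, err := ioutil.ReadFile(path)
        if err != nil {
            return nil, err
        }
        block, _ := pem.Decode(data)
        if block == nil {
            return nil, fmt.Errorf("no PEM data found in %s", path)
        }
        return x509.ParsePKCS1PrivateKey(block.Bytes)
    }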
We should only create a security group on OpenStack when none is defined.
Also, we should probably log this at loglevel 2.