
eks-anywhere's Introduction

Amazon EKS Anywhere

Release Downloads · Go Report Card · CII Best Practices · Contributors · License

Build status

Conformance test status

Amazon EKS Anywhere is a new deployment option for Amazon EKS that enables you to easily create and operate Kubernetes clusters on-premises with your own virtual machines or bare metal hosts. It brings a consistent AWS management experience to your data center, building on the strengths of Amazon EKS Distro, the same distribution of Kubernetes that powers EKS on AWS. Its goal is to include full lifecycle management of multiple Kubernetes clusters that are capable of operating completely independently of any AWS services.

Here are the steps for getting started with EKS Anywhere. Full documentation for releases can be found at https://anywhere.eks.amazonaws.com.

Development

EKS Anywhere is tested using Prow, the Kubernetes CI system. EKS operates an installation of Prow, which is visible at https://prow.eks.amazonaws.com/. Please read our CONTRIBUTING guide before making a pull request. Refer to our end-to-end guide to run E2E tests locally.

The dependencies which make up EKS Anywhere are defined and built via the build-tooling repo. To update dependencies, please review the README for the specific dependency before opening a PR.

See Cherry picking for backporting to release branches

Security

If you discover a potential security issue in this project, or think you may have discovered a security issue, we ask that you notify AWS Security via our vulnerability reporting page. Please do not create a public GitHub issue.

License

This project is licensed under the Apache-2.0 License.

eks-anywhere's People

Contributors

a-cool-train, abhay-krishna, abhinavmpandey08, ahreehong, chrisdoherty4, chrisnegus, cxbrowne1207, d8660091, danbudris, dependabot[bot], eks-distro-pr-bot, ewollesen, g-gaston, gwesterfieldjr, ivyostosh, jiayiwang7, jonahjon, jonathanmeier5, maxdrib, mitalipaygude, mrajashree, panktishah26, pokearu, ptrivedi, rahulbabu95, taneyland, tatlat, terryhowe, vignesh-goutham, vivek-koppuru


eks-anywhere's Issues

Add support to create and manage a "management cluster"

Currently, eks-anywhere clusters are self-managed: the eks-anywhere objects and controller are installed in every cluster, and users interact with these objects to modify their cluster infrastructure. This may not be the desired setup for all organizations. We should introduce the ability for users to create a management cluster that can be used to manage their other clusters. This would also enable GitOps-based operations for all changes to managed clusters, including their creation.

Air gap support

What would you like to be added:
We need air gap support for users who want disconnected environments.

Add reference creds for VCenter and GitOps in Cluster spec

Three environment variables can be set for an EKS-A cluster: the vSphere username and password, and the GitHub token (if GitOps is enabled). These secrets can be loaded onto the created Kubernetes cluster and referenced from the Cluster spec. Below are just examples of how that might look:

VSphereDatacenterConfig:

apiVersion: anywhere.eks.amazonaws.com/v1
kind: VSphereDatacenterConfig
metadata:
  name: test-datacenter
  namespace: default
spec:
  ...
  credentialsRef: # populated by CLI when passed in as env var
   secretName: "" # needs to be a basic-auth type secret

GitOpsConfig:

apiVersion: anywhere.eks.amazonaws.com/v1
kind: GitOpsConfig
metadata:
  name: test-gitops
  namespace: default
spec:
  flux:
    github:
      ...
      authTokenRef: # populated by CLI if passed in as flag/env var
        secretName: "" # needs to be an opaque secret
        secretKey: "personal-access-token"
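
For illustration only, here is a minimal client-go sketch (in Go) of how the CLI might materialize these referenced secrets from the existing environment variables. The secret names and overall flow are hypothetical; only the environment variable names and secret types come from the proposal above.

package main

import (
	"context"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Basic-auth secret that VSphereDatacenterConfig.spec.credentialsRef could point to.
	vsphereCreds := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "test-datacenter-credentials", Namespace: "default"}, // hypothetical name
		Type:       corev1.SecretTypeBasicAuth,
		StringData: map[string]string{
			"username": os.Getenv("EKSA_VSPHERE_USERNAME"),
			"password": os.Getenv("EKSA_VSPHERE_PASSWORD"),
		},
	}

	// Opaque secret that GitOpsConfig.spec.flux.github.authTokenRef could point to.
	githubToken := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "test-gitops-token", Namespace: "default"}, // hypothetical name
		Type:       corev1.SecretTypeOpaque,
		StringData: map[string]string{
			"personal-access-token": os.Getenv("EKSA_GITHUB_TOKEN"),
		},
	}

	for _, s := range []*corev1.Secret{vsphereCreds, githubToken} {
		if _, err := client.CoreV1().Secrets(s.Namespace).Create(context.TODO(), s, metav1.CreateOptions{}); err != nil {
			log.Fatalf("creating secret %s: %v", s.Name, err)
		}
	}
}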

Proxmox Provider Support

Support for deploying EKS Anywhere using a provider for Proxmox.

Many labs, homelabs, and test systems run Proxmox as the underlying VM host. Having an easy way to deploy to a single Proxmox instance or a Proxmox cluster would be very useful for testing/development/training.

Supporting as many ways as possible to deploy a full cluster onto a single server host would rapidly increase adoption in this space.

Deployment fails when template stored on VMFS5 and deployment done on VMFS6

What happened:
When eksctl anywhere create ... created the bottlerocket template, it ended up on a VMFS5 volume by default. When the installer subsequently attempted to clone the template to VMs, it got stuck in a seemingly infinite loop of failed operations, referencing "the specified delta disk format 'redoLogFormat' is not supported". Thousands of attempts were made, and the only way to stop it (well, the way I realized at the time at least) was to convert the template to a VM and then delete it.

What you expected to happen:
Cloning the template to VMs should work.

How to reproduce it (as minimally and precisely as possible):

  1. VM Template on VMFS5
  2. EKSA pointing to a VMFS6 datastore for the VMs
  3. Run eksctl anywhere create ...

Anything else we need to know?:

Converting the template to a VM, using vMotion to move it from the VMFS5 datastore to a VMFS6 datastore, and converting it back to a template solved the issue for me.

Environment:

  • EKS Anywhere Release: 0.66
  • EKS Distro Release:

Waiting on namespace which has no pods running

Tried to create the local dev cluster as demonstrated in the blog, but the process hung at:

2021-09-16T10:16:28.374Z        V4      Task start      {"task_name": "workload-cluster-init"}
2021-09-16T10:16:28.374Z        V0      Creating new workload cluster
2021-09-16T10:16:28.374Z        V6      Executing command       {"cmd": "/usr/bin/docker run -i --network host -v /home/ec2-user:/home/ec2-user -w /home/ec2-user -v /var/run/docker.sock:/var/run/docker.sock --entrypoint kubectl public.ecr.aws/eks-anywhere/cli-tools:v0.1.0-eks-a-1 apply -f dev-cluster/generated/dev-cluster-eks-a-cluster.yaml --namespace eksa-system --kubeconfig dev-cluster/generated/dev-cluster.kind.kubeconfig"}
2021-09-16T10:16:29.956Z        V5      Retry execution successful      {"retries": 1, "duration": "1.581950842s"}
2021-09-16T10:16:29.956Z        V3      Waiting for external etcd to be ready
2021-09-16T10:16:29.957Z        V6      Executing command       {"cmd": "/usr/bin/docker run -i --network host -v /home/ec2-user:/home/ec2-user -w /home/ec2-user -v /var/run/docker.sock:/var/run/docker.sock --entrypoint kubectl public.ecr.aws/eks-anywhere/cli-tools:v0.1.0-eks-a-1 wait --timeout 60m --for=condition=ManagedEtcdReady clusters.cluster.x-k8s.io/dev-cluster --kubeconfig dev-cluster/generated/dev-cluster.kind.kubeconfig -n eksa-system"}
2021-09-16T10:17:01.603Z        V3      External etcd is ready
2021-09-16T10:17:01.603Z        V3      Waiting for control plane to be ready
2021-09-16T10:17:01.603Z        V6      Executing command       {"cmd": "/usr/bin/docker run -i --network host -v /home/ec2-user:/home/ec2-user -w /home/ec2-user -v /var/run/docker.sock:/var/run/docker.sock --entrypoint kubectl public.ecr.aws/eks-anywhere/cli-tools:v0.1.0-eks-a-1 wait --timeout 60m --for=condition=ControlPlaneReady clusters.cluster.x-k8s.io/dev-cluster --kubeconfig dev-cluster/generated/dev-cluster.kind.kubeconfig -n eksa-system"}

Noticed the command was waiting on the namespace "eksa-system", which has no pods running:

NAMESPACE                           NAME                                                              READY   STATUS    RESTARTS   AGE
capd-system                         capd-controller-manager-659dd5f8bc-6b7pp                          2/2     Running   0          24m
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-69889cb844-nmp8l        2/2     Running   0          24m
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-6ddc66fb75-ppblp    2/2     Running   0          24m
capi-system                         capi-controller-manager-db59f5789-ps89q                           2/2     Running   0          24m
capi-webhook-system                 capi-controller-manager-64b8c548db-fsxbn                          2/2     Running   0          24m
capi-webhook-system                 capi-kubeadm-bootstrap-controller-manager-68b8cc9759-lnt8k        2/2     Running   0          24m
capi-webhook-system                 capi-kubeadm-control-plane-controller-manager-7dc88f767d-2t5ps    2/2     Running   0          24m
cert-manager                        cert-manager-5f6b885b4-f29hz                                      1/1     Running   0          25m
cert-manager                        cert-manager-cainjector-bb6d9bcb5-zntdt                           1/1     Running   0          25m
cert-manager                        cert-manager-webhook-56cbc8f5b8-8sctg                             1/1     Running   0          25m
etcdadm-bootstrap-provider-system   etcdadm-bootstrap-provider-controller-manager-54476b7bf9-k7tkg    2/2     Running   0          24m
etcdadm-controller-system           etcdadm-controller-controller-manager-d5795556-2bqfc              2/2     Running   0          24m
kube-system                         coredns-7c68f85774-vbkwm                                          1/1     Running   0          25m
kube-system                         coredns-7c68f85774-wcl8h                                          1/1     Running   0          25m
kube-system                         etcd-dev-cluster-eks-a-cluster-control-plane                      1/1     Running   0          25m
kube-system                         kindnet-wtjm9                                                     1/1     Running   0          25m
kube-system                         kube-apiserver-dev-cluster-eks-a-cluster-control-plane            1/1     Running   0          25m
kube-system                         kube-controller-manager-dev-cluster-eks-a-cluster-control-plane   1/1     Running   0          25m
kube-system                         kube-proxy-p6hzl                                                  1/1     Running   0          25m
kube-system                         kube-scheduler-dev-cluster-eks-a-cluster-control-plane            1/1     Running   0          25m
local-path-storage                  local-path-provisioner-666bfc797f-ldgjp                           1/1     Running   0          25m

Workload cluster creation failed as a result. Is this a bug, or does something need to change in the generated spec?

Add status to Cluster object

Add fields to the Cluster status that give users useful information about the state of their EKS-A cluster object. Below is just an example of what the status object could look like:

status:
  version: ""
  controlPlaneMachineCount: 1
  workerNodeGroups:
  - machineCount: 1
  externalEtcdMachineCount: 1
  gitOps:
    gitOpsRef: ""
    repository: ""
    owner: ""
    branch: ""
    latestCommit: ""
    status: pending/synced/paused/unavailable # reflects whether our eks-a controller is syncing with the repository or not
  bundlesRef: "" # References a bundle CRD that contains information about all versions of container images that we rely on
  licenseRef: ""
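
For reference, a rough Go sketch of status types that could back the YAML above; all type and field names here are illustrative, not the actual EKS-A API.

package v1alpha1

// Illustrative status types mirroring the example above; not the real API.
type GitOpsStatus struct {
	GitOpsRef    string `json:"gitOpsRef,omitempty"`
	Repository   string `json:"repository,omitempty"`
	Owner        string `json:"owner,omitempty"`
	Branch       string `json:"branch,omitempty"`
	LatestCommit string `json:"latestCommit,omitempty"`
	// Status reflects whether the eks-a controller is syncing with the
	// repository: pending, synced, paused or unavailable.
	Status string `json:"status,omitempty"`
}

type WorkerNodeGroupStatus struct {
	MachineCount int `json:"machineCount,omitempty"`
}

type ClusterStatus struct {
	Version                  string                  `json:"version,omitempty"`
	ControlPlaneMachineCount int                     `json:"controlPlaneMachineCount,omitempty"`
	WorkerNodeGroups         []WorkerNodeGroupStatus `json:"workerNodeGroups,omitempty"`
	ExternalEtcdMachineCount int                     `json:"externalEtcdMachineCount,omitempty"`
	GitOps                   *GitOpsStatus           `json:"gitOps,omitempty"`
	// BundlesRef references a bundle CRD that lists all container image
	// versions the cluster relies on.
	BundlesRef string `json:"bundlesRef,omitempty"`
	LicenseRef string `json:"licenseRef,omitempty"`
}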

Update the Flux CLI and controller images on ECR

What would you like to be added:

Hi team, I've noticed that the Flux container images used by EKS Anywhere are being pushed to ECR. The Flux images are behind the upstream distribution (critical bugs were fixed and new features were added).

Can you please update the Flux images on the Public ECR and the Flux binary included in EKS Anywhere?

Is there a public repo with the build scripts? If so, I could update them myself.

Why is this needed:

It would be great if EKS Anywhere would be in sync with the latest stable Flux release.

$ flux -v
  flux version 0.17.0

$ flux install --export | grep ghcr
  image: ghcr.io/fluxcd/helm-controller:v0.11.2
  image: ghcr.io/fluxcd/kustomize-controller:v0.14.0
  image: ghcr.io/fluxcd/notification-controller:v0.16.0
  image: ghcr.io/fluxcd/source-controller:v0.15.4

Business logic in executables package

The executables package now contains business logic, when it was supposed to be just a wrapper around the low-level operations provided by the binaries. Any higher-level operation should be composed of these low-level operations and moved up, either to a higher-level client, a provider, or a new type/module.

Moreover, the govc module has some side effects: specifically, it updates some of the vSphere cluster config structs. We should avoid this, since it makes the code more difficult to understand and debug. Any updates to such structs should be done by the type that owns them; in this case, that seems to be the vSphere provider or some other vSphere type/submodule.
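
As a rough illustration of the suggested layering (all names here are hypothetical, not the current code): the executables wrapper exposes only thin govc operations, while the vSphere provider composes them and owns any updates to its own config structs.

package vsphere

import "context"

// GovcClient is the thin wrapper around the govc binary: no business logic,
// no mutation of provider config structs.
type GovcClient interface {
	TemplateHasSnapshot(ctx context.Context, template string) (bool, error)
	GetTags(ctx context.Context, path string) ([]string, error)
}

type Config struct {
	Template     string
	TemplateTags []string
}

type Provider struct {
	govc GovcClient
	cfg  *Config // owned and mutated only by the provider
}

// ResolveTemplate is a higher-level operation composed of low-level govc
// calls; it updates the provider's own config instead of having the govc
// module reach into it as a side effect.
func (p *Provider) ResolveTemplate(ctx context.Context) error {
	tags, err := p.govc.GetTags(ctx, p.cfg.Template)
	if err != nil {
		return err
	}
	p.cfg.TemplateTags = tags
	return nil
}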

Improve cleanup of resources on failure

Currently the CLI leaves resources hanging on failure (such as VM nodes, the kind bootstrap cluster, etc.), which can be troublesome for users to clean up on their own. Instead, the default for create could be to clean up on failure based on the phase of the workflow, unless the user opts out through a flag like --skip-cleanup. We can also make the appropriate govc calls to clean up VMs in case cleaning them up through the bootstrap cluster fails.
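
A minimal sketch of that behavior, assuming a hypothetical --skip-cleanup flag and phase abstraction (not the actual workflow code):

package workflow

import "context"

// phase is a single step of the create workflow that knows how to undo itself.
type phase interface {
	Run(ctx context.Context) error
	Cleanup(ctx context.Context)
}

// createCluster runs the workflow and, on failure, cleans up everything the
// completed phases created unless --skip-cleanup was passed.
func createCluster(ctx context.Context, phases []phase, skipCleanup bool) (err error) {
	var completed []phase
	defer func() {
		if err != nil && !skipCleanup {
			// Undo in reverse order: workload VMs first, then the kind
			// bootstrap cluster, and so on.
			for i := len(completed) - 1; i >= 0; i-- {
				completed[i].Cleanup(ctx)
			}
		}
	}()
	for _, p := range phases {
		if err = p.Run(ctx); err != nil {
			return err
		}
		completed = append(completed, p)
	}
	return nil
}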

Cleanup docker volumes/resources

Running the command multiple times on the same machine might leave volumes behind and fill up disk space. Instead, the CLI should clean up the resources after use for all the commands that are run within docker containers, as well as any other resources left behind by the creation process.

"Bring your own Registry" support

Currently we pull all container images from public.ecr. To make it easier for users who prefer to limit outgoing traffic, we should support a local registry, such as Harbor, so the images don't have to be pulled from the internet for every node. This becomes increasingly important as cluster size increases.

  • Allow local registry configuration via cluster spec
  • Provide script/subcommand to download all images from a given release manifest and push to local registry
  • capi patch to pass registry mirror config to bottlerocket user-data (@abhinavmpandey08). A PR for this is already open: Add capi patch for bottlerocket registry mirror support eks-anywhere-build-tooling#41
  • Update the deployment template to pass mirror config for br based on the capi patch (@abhinavmpandey08)
  • etcd changes to add registry mirror settings to bottlerocket etcd nodes (@mrajashree)
  • CAPI patch to fix BR proxy template error (@mrajashree)
  • e2e test for testing registry mirror
  • docs update to reflect the changes (@mdsgabriel)

High-level instructions that beta users were able to follow to support this workflow manually:

  1. Convert the ova template to a VM
  2. Power on the VM and get shell access. The easiest way of doing this is by resetting the VM a few times until you get to the GRUB menu and then boot into recovery mode (Advanced options for Ubuntu -> recovery mode -> Drop to root shell prompt).
  3. Append the following section to the /etc/containerd/config.toml file. Make sure to replace <YOUR-LOCAL-REGISTRY-ENDPOINT> with your container registry endpoint:
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."public.ecr.aws"]
        endpoint = ["https://<YOUR-LOCAL-REGISTRY-ENDPOINT>"]
      [plugins."io.containerd.grpc.v1.cri".registry.configs."<YOUR-LOCAL-REGISTRY-ENDPOINT>".tls]
        insecure_skip_verify = true
    
  4. If you want to use TLS verification, you can follow the steps listed here: https://github.com/containerd/containerd/blob/main/docs/cri/registry.md#configure-registry-tls-communication
  5. Save the file and power off the VM
  6. Convert the ova back to a template and use this modified template in your clusterconfig yaml

Manage the content of cluster config file pushed to Github when Flux is enabled

What would you like to be added:

Similar to how we generate the cluster config, we need a centralized place to filter/restrict the fields we write to the local git repo and push to the remote.

Refer to the spec design doc for the fields we want to include.

Eg.

  • remove metadata.creationTimestamp
  • exclude status field for each resource

Implementation

Code to start: https://github.com/aws/eks-anywhere/blob/main/pkg/clustermarshaller/clustermarshaller.go#L14

The filter also applies to the cluster config file that's written to the local path (WriteClusterConfigTask)

Why is this needed:

Currently the cluster config file pushed to GitHub has no filter. Every field added during creation/upgrade is pushed to the repo, e.g. metadata.creationTimestamp: null. We want full control over which fields to include/exclude in the file that's pushed to the repo.
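
A minimal sketch of the kind of filter this could be (not the actual clustermarshaller code), applied to each resource's YAML before it is written locally or pushed:

package clustermarshaller

import "sigs.k8s.io/yaml"

// filterClusterConfig drops fields we don't want in the repo: the status
// block and metadata.creationTimestamp (which otherwise marshals as "null").
func filterClusterConfig(in []byte) ([]byte, error) {
	var obj map[string]interface{}
	if err := yaml.Unmarshal(in, &obj); err != nil {
		return nil, err
	}
	// Exclude the status field for each resource.
	delete(obj, "status")
	// Remove metadata.creationTimestamp.
	if md, ok := obj["metadata"].(map[string]interface{}); ok {
		delete(md, "creationTimestamp")
	}
	return yaml.Marshal(obj)
}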

Add proxy configuration to etcd nodes for bottlerocket

What would you like to be added:
Proxy configuration for pulling images on etcd nodes in bottlerocket

Why is this needed:
For bottlerocket, etcd nodes need to pull some images from public ECR. The proxy settings are not configured for etcd, which results in image pull issues in environments that rely on a proxy.

E2E tests improvements

  • Developer experience when debugging failed tests (needed tools in ami, docs/wiki, support bundle generated and stored on s3, etc)
  • Tracking failures over time
  • Flake handling process
  • Define Oncall responsibilities
  • Clean up left over VMs

E2E code changes:

providerID is empty

What happened:
Cluster setup failed

How to reproduce it (as minimally and precisely as possible):
vSphere is not connected to the internet, but the node network has a connection to the internet. The OVA was imported into vCenter and specified in VSphereMachineConfig.
The DHCP pool has generated forward and reverse DNS zones. During cluster creation, 6 VMs were created in vSphere.

eks-1-md-0-5df5857bc4-cjzkp
eks-1-kgtph
eks-1-etcd-z47r7
eks-1-etcd-fx8nz
eks-1-md-0-5df5857bc4-p8ptw
eks-1-etcd-899g8

Anything else we need to know?:

Environment: vSphere 7.0.2

  • EKS Anywhere Release: 0.5.0
  • EKS Distro Release: v1.21.2-eks-1-21-4

Cluster YAML

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: eks-1
spec:
  clusterNetwork:
    cni: cilium
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: "%value%"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: eks-1-master
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: eks-1
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: eks-1-etcd
  kubernetesVersion: "1.21"
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: eks-1-worker

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: eks-1
spec:
  datacenter: "%value%"
  insecure: false
  network: "%value%"
  server: "%value%"
  thumbprint: "%value%"

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: eks-1-master
spec:
  datastore: "%value%"
  diskGiB: 25
  folder: "%value%"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: "%value%"
  template: "bottlerocket-vmware-k8s-1.21-x86_64-1.2.0-ccf1b754"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa %value%

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: eks-1-worker
spec:
  datastore: "%value%"
  diskGiB: 25
  folder: "%value%"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: "%value%"
  template: "bottlerocket-vmware-k8s-1.21-x86_64-1.2.0-ccf1b754"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa %value%

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: eks-1-etcd
spec:
  datastore: "%value%"
  diskGiB: 25
  folder: "%value%"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: "%value%"
  template: "bottlerocket-vmware-k8s-1.21-x86_64-1.2.0-ccf1b754"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa %value%

---
$ eksctl anywhere create cluster -f eks-1.yaml -v 5
........
2021-09-13T18:14:33.150+0300	V3	Waiting for external etcd to be ready
2021-09-13T18:16:31.045+0300	V3	External etcd is ready
2021-09-13T18:16:31.045+0300	V3	Waiting for control plane to be ready
2021-09-13T19:17:38.138+0300	V4	Task finished	{"task_name": "workload-cluster-init", "duration": "1h3m7.627303368s"}
2021-09-13T19:17:38.138+0300	V4	----------------------------------
2021-09-13T19:17:38.138+0300	V4	Tasks completed	{"duration": "1h6m19.137271592s"}
Error: failed to create cluster: error waiting for workload cluster control plane to be ready: error executing wait: error: timed out waiting for the condition on clusters/eks-1
$ kubectl --kubeconfig eks-1.kind.kubeconfig -n eksa-system get machine
NAME                          PROVIDERID                                       PHASE          VERSION
eks-1-etcd-899g8              vsphere://420384b5-5262-e671-f018-9cd42973c693   Running
eks-1-etcd-fx8nz              vsphere://4203ef27-07de-76c0-36be-4f23be23ee57   Running
eks-1-etcd-z47r7              vsphere://420395c3-4f97-6ffe-8fb4-aa777208faba   Running
eks-1-kgtph                   vsphere://42034c54-335a-e5dc-7f99-2bf9675f5625   Provisioning   v1.21.2-eks-1-21-4
eks-1-md-0-5df5857bc4-cjzkp   vsphere://4203e9e3-eec6-89c9-7c89-9120f8019654   Provisioning   v1.21.2-eks-1-21-4
eks-1-md-0-5df5857bc4-p8ptw   vsphere://420370b5-46df-e8f1-2e1a-2795e0c78c3c   Provisioning   v1.21.2-eks-1-21-4

capi logs

E0914 08:08:58.179206       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="eks-10-16-17-74.vm.local"
E0914 08:08:58.179375       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="eks-10-16-17-74.vm.local"
E0914 08:08:58.740079       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="eks-10-16-17-75.vm.local"
E0914 08:08:58.740151       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="eks-10-16-17-75.vm.local"
E0914 08:08:59.108376       1 machine_controller_noderef.go:181] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="eks-10-16-17-73.vm.local"
E0914 08:08:59.108497       1 machine_controller_noderef.go:181] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="eks-10-16-17-74.vm.local"
E0914 08:08:59.108525       1 machine_controller_noderef.go:181] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="eks-10-16-17-75.vm.local"
E0914 08:09:02.332249       1 machine_controller_noderef.go:181] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="eks-10-16-17-73.vm.local"

Support Macs with Apple Silicon and arm processors

What would you like to be added:
Currently the bootstrap cluster and clusters using the docker provider cannot run on M1 Macs. It would be good to support these options for administrative machines and local clusters.

Why is this needed:
To allow more users to try EKS-A and manage clusters.

Add robust generate command to show all optional fields

When running eksctl anywhere generate clusterconfig, the default output doesn't include all the fields that can be configured, as mentioned in the documentation. Adding an option to show all the optional fields that can be filled in, like a template, would make it easier to see everything that can be configured.

Add support for upgrading eks-anywhere components

Currently the upgrade command can only upgrade Kubernetes versions. We need to support updating the eks-anywhere components, including capi/capv.

Until we can support this, we may need to introduce a new release mode to support generating new release bundles with only Kubernetes (eks-distro) updates included so customers can update to new Kubernetes versions.


Design doc

In order to gain some upgrade capabilities as soon as possible (due to the dependency of other features on this one), we split the work into 3 chunks. These could be moved to separate issues if necessary (for example, if we decide to release the first chunk, or the first and second, in an early release):

Minimal functionality

This provides upgrades for all components except for Cilium and any CAPI version bump that involves an API version change. It doesn't include E2E tests, and any Bundle/patch CLI version would involve significant manual work and testing. It is preferred, if possible, to release this together with the second set of tasks.

Automation

This provides proper automated testing and reduces the amount of manual work required for new Bundle releases to a minimum.

Moved to #551

Full feature

Moved to #551

After restarting the machine that was running the single-system method EKS Anywhere, I lost my EKS-A Cluster.

What happened:
After restarting the machine that was running the single-system method EKS Anywhere, I lost my EKS-A Cluster.

What you expected to happen:
After restarting the machine, the single-system method EKS Anywhere cluster should come back up too.

How to reproduce it (as minimally and precisely as possible):
Restarting the machine that was running the single-system method EKS Anywhere.

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.5.0
  • EKS Distro Release:


govc tasks should be internally optional

When running eksctl anywhere create on a system where the tags (e.g. os:bottlerocket) or the tag categories (e.g. os) already exist, the whole installation errors out. Rerunning it will succeed (though only after the user manually assigns the tags to the template so the installer can proceed). Either existence should be checked explicitly in the workflow and handled accordingly, or the error handling should be improved so that the workflow can continue.
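
A sketch of the check-before-create approach, with hypothetical method names on the govc wrapper; the same pattern would apply to tags as well as tag categories:

package vsphere

import "context"

// tagger is a hypothetical slice of the govc wrapper used by the workflow.
type tagger interface {
	ListCategories(ctx context.Context) ([]string, error)
	CreateCategory(ctx context.Context, name string) error
}

// ensureCategory treats an already-existing category (e.g. "os" from a
// previous run) as success instead of failing the whole installation.
func ensureCategory(ctx context.Context, t tagger, name string) error {
	existing, err := t.ListCategories(ctx)
	if err != nil {
		return err
	}
	for _, c := range existing {
		if c == name {
			return nil // already there; nothing to do
		}
	}
	return t.CreateCategory(ctx, name)
}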

vSphere 6.7 Update 3 support

What would you like to be added:
Official support for vSphere 6.7 Update 3. We have had a couple users try this out with success, but we do not have an official testing environment and strategy in place.

Why is this needed:

timeout on cert-manager install

What happened:

Timeout on cert-manager install on a local machine in Docker mode

What you expected to happen:

Installing cert-manager Version="v1.1.0"
Waiting for cert-manager to be available...
Error: timed out waiting for the condition
2021-09-15T08:00:24.463Z V4 Task finished {"task_name": "bootstrap-cluster-init", "duration": "12m30.072230612s"}
2021-09-15T08:00:24.472Z V4 ----------------------------------
2021-09-15T08:00:24.472Z V4 Tasks completed {"duration": "12m30.143671565s"}
2021-09-15T08:00:24.617Z V6 Executing command {"cmd": "/usr/bin/docker run -i --network host -v /home/cloud/anywhere:/home/cloud/anywhere -w /home/cloud/anywhere -v /var/run/docker.sock:/var/run/docker.sock --entrypoint kubectl public.ecr.aws/eks-anywhere/cli-tools:v0.1.0-eks-a-1 --kubeconfig dev-cluster/generated/dev-cluster.kind.kubeconfig logs deployment/cert-manager-webhook -n cert-manager"}

Cluster to be fully deployed

How to reproduce it (as minimally and precisely as possible):

eksctl anywhere create cluster -f dev-cluster.yaml --verbosity 9

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: dev-cluster
spec:
  clusterNetwork:
    cni: cilium
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
  datacenterRef:
    kind: DockerDatacenterConfig
    name: dev-cluster
  externalEtcdConfiguration:
    count: 1
  kubernetesVersion: "1.21"
  workerNodeGroupConfigurations:
  - count: 1

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: DockerDatacenterConfig
metadata:
  name: dev-cluster
spec: {}

---

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: 0.5
  • EKS Distro Release: 1.21

Add release process to update just EKS-D versions

Currently release bundles include both EKS-A and EKS-D components and we do not have a way to handle updating EKS-A components. To provide users a way to update Kubernetes versions we need a process that can produce new release bundles for existing releases with just updated EKS-D components + new OVA and Kind images

ESXI Standalone Provider Support

Ability to deploy EKS Anywhere, using a provider, to a standalone instance of VMware ESXi without vCenter.

Many lab machines and home lab machines run VMware ESXi in a single-server configuration. While this is not production grade, it would make bootstrapping a cluster for development/testing/training trivial.
Currently you cannot do this with the existing providers, because you do not define a datacenter name on a single instance, which results in this error:

Validation failed {"validation": "vsphere Provider setup is valid", "error": "error validating vCenter setup: failed to get datacenter: govc: datacenter 'localhost.localdomain' not found\n", "remediation": ""}
Error: failed to create cluster: validations failed

Support node taints in cluster config

What would you like to be added:

Add a node taints option in the cluster config for both control plane and worker node groups, so that the defined taints can be attached to the node(s) automatically during cluster creation/upgrade.

cluster config

We add a taints option at the cluster level, e.g.:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: cluster-1
spec:
  ...
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: "198.18.100.27"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: cluster-1-cp
    taints: # optional
    - key: key1
      value: val1 # optional
      effect: NoSchedule # Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: cluster-1
    taints: # optional
    - key: key2
      value: val2 # optional
      effect: PreferNoSchedule # Valid effects are NoSchedule, PreferNoSchedule and NoExecute.

implementation

Cluster API already supports taints for both the control plane and machine deployments.

We might easily pass the taints option down to the CAPI spec before applying.

The CAPI taints option does not work for worker nodes (kubeadmconfigtemplates). We might add taints for worker nodes in the CLI workflow (https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/controllers/controller.go#L267), or as postKubeadmCommands defined in the kubeadmconfigtemplates.
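
As a rough sketch (the TaintSpec type and function name are illustrative, not the actual implementation), the mapping from the proposed cluster-config taints to the corev1.Taint values CAPI expects could look like this; the result would be set on the kubeadm NodeRegistration taints in the generated control plane and worker templates:

package taints

import corev1 "k8s.io/api/core/v1"

// TaintSpec mirrors the proposed cluster-config fields above.
type TaintSpec struct {
	Key    string
	Value  string // optional
	Effect string // NoSchedule, PreferNoSchedule or NoExecute
}

// toCAPITaints converts the cluster-config taints into the core Kubernetes
// taint type used by Cluster API's kubeadm configuration.
func toCAPITaints(in []TaintSpec) []corev1.Taint {
	out := make([]corev1.Taint, 0, len(in))
	for _, t := range in {
		out = append(out, corev1.Taint{
			Key:    t.Key,
			Value:  t.Value,
			Effect: corev1.TaintEffect(t.Effect),
		})
	}
	return out
}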

Why is this needed:

Users have requested that node taints be retained during an upgrade. Currently the rolling update recreates VMs, and the new nodes that join do not have the taints defined on the original nodes. Users need to manually attach those taints again after the upgrade.

We'd like to provide a declarative way to let users define node taints in the cluster config, so that nodes being created/upgraded get the proper taints (if defined) automatically, without manual setup.

TODO list

TBD:

  • support worker nodes taints update without VM recreate

Support bare metal deployments

It would be great if I could bring my own servers and EKS Anywhere will install Kubernetes directly on top of an existing OS or provision the OS and Kubernetes for me.

Please 👍 if you would like this feature.

Define to be Released work processes

  • How do we build a release Changelog?
  • How do we handle docs when merging features for unreleased versions?
  • ?

Sync with the BottleRocket team to see how they handle this type of management.

Support bundle improvements

The current support bundle functionality can pull relevant pod logs and analyze the current state of the cluster. We should improve the gathered data to better support debugging failures via the support bundle.

Add retry logic to govc calls

Most vSphere-related preflight checks are govc calls that check for things like a valid user/password, an existing template, an existing network, etc. Most of these should be retryable without much consequence, which would avoid temporary failures due to networking or vSphere blips.
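
A minimal retry sketch with exponential backoff (the attempt count, wait times, and function name are arbitrary assumptions) that such preflight checks could be wrapped in:

package retrier

import (
	"context"
	"time"
)

// withRetry runs op up to attempts times, doubling the wait between tries so
// short network or vSphere blips get absorbed instead of failing the check.
func withRetry(ctx context.Context, attempts int, initialWait time.Duration, op func() error) error {
	wait := initialWait
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
		wait *= 2
	}
	return err
}

Each preflight govc call would then be passed in as the op closure, so a transient failure only surfaces after the retries are exhausted.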

Failed to pull image in bootstrap cluster. certificate signed by unknown authority. Custom CA

Error from server (NotFound): namespaces "capi-webhook-system" not found
Error from server (NotFound): namespaces "capi-kubeadm-bootstrap-system" not found
Error from server (NotFound): namespaces "capv-system" not found
Error from server (NotFound): namespaces "capi-webhook-system" not found
Error from server (BadRequest): container "cert-manager" in pod "cert-manager-webhook-56cbc8f5b8-7fhvt" is waiting to start: trying and failing to pull image
Error from server (BadRequest): container "cert-manager" in pod "cert-manager-cainjector-bb6d9bcb5-jtz48" is waiting to start: trying and failing to pull image
Error from server (BadRequest): container "cert-manager" in pod "cert-manager-5f6b885b4-pg5fh" is waiting to start: trying and failing to pull image

Describe failed pods shows -

Failed to pull image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-controller:v1.1.0-eks-a-1": rpc error: code = Unknown desc = failed to pull and unpack image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-controller:v1.1.0-eks-a-1": failed to resolve reference "public.ecr.aws/eks-anywhere/jetstack/cert-manager-controller:v1.1.0-eks-a-1": failed to do request: Head "https://public.ecr.aws/v2/eks-anywhere/jetstack/cert-manager-controller/manifests/v1.1.0-eks-a-1": x509: certificate signed by unknown authority
Failed to pull image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-cainjector:v1.1.0-eks-a-1": rpc error: code = Unknown desc = failed to pull and unpack image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-cainjector:v1.1.0-eks-a-1": failed to resolve reference "public.ecr.aws/eks-anywhere/jetstack/cert-manager-cainjector:v1.1.0-eks-a-1": failed to do request: Head "https://public.ecr.aws/v2/eks-anywhere/jetstack/cert-manager-cainjector/manifests/v1.1.0-eks-a-1": x509: certificate signed by unknown authority
Failed to pull image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-webhook:v1.1.0-eks-a-1": rpc error: code = Unknown desc = failed to pull and unpack image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-webhook:v1.1.0-eks-a-1": failed to resolve reference "public.ecr.aws/eks-anywhere/jetstack/cert-manager-webhook:v1.1.0-eks-a-1": failed to do request: Head "https://public.ecr.aws/v2/eks-anywhere/jetstack/cert-manager-webhook/manifests/v1.1.0-eks-a-1": x509: certificate signed by unknown authority

I'm in a secure corp environment with corp CA's used for filtering all web requests.

What happened:
Local cluster creation times out after many minutes.

What you expected to happen:
Local cluster is created.

How to reproduce it (as minimally and precisely as possible):
Not sure

Anything else we need to know?:
macOS Big Sur 11.5.2

Environment:

  • EKS Anywhere Release: v0.5.0
  • EKS Distro Release:

Refactor support bundle to use `troubleshoot.sh` binary under-the-hood

EKS-A currently uses troubleshoot.sh to provide support-bundle functionality for the clusters we provision. This functionality allows our customers to quickly and easily gather diagnostic information about their cluster for their own debugging and to share with AWS support/engineering.

In our initial iteration, we directly consumed troubleshoot.sh as a Go library. This had the advantage of allowing us to easily construct support bundles programmatically using the existing abstractions, requiring less initial development overhead. However, using troubleshoot.sh as a library posed some challenges when it came time to upgrade to the latest version, an upgrade which would allow us to take advantage of the important copy-from-host functionality.

The latest version of troubleshoot introduced a new dependency, longhorn-manager, which vendors its k8s.io dependencies. This meant that troubleshoot, the initial consumer of longhorn-manager, had to use replace directives in its go.mod to specify which versions of the vendored dependencies to use. Since replace directives are not transitive, each consumer of the troubleshoot.sh Go package is now required to specify similar replace directives; without them, the package will not build. This means that we must either specify similar replace directives or not use the latest version of the troubleshoot.sh library.

We have decided to move forward with refactoring the support bundle functionality to wrap the troubleshoot binary rather than directly consuming its library (a rough sketch of the wrapper shape follows the list below). This allows us to:

  • avoid polluting our dependencies, and the dependencies of downstream consumers of EKS-A, with fixed versions and replace directives
  • avoid unnecessary complexity in our dependency management from troubleshoot, now and in the future
  • avoid tightly coupling our Diagnostic Bundle package to the internal troubleshoot.sh implementation, by building an interface to the binary
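
A rough sketch of what that wrapper shape could look like; the type names, binary path, and flags are illustrative assumptions, not the final design:

package diagnostics

import (
	"context"
	"os/exec"
)

// BundleCollector is the interface the rest of EKS-A would program against,
// keeping the troubleshoot.sh implementation details behind it.
type BundleCollector interface {
	Collect(ctx context.Context, bundleSpecPath, kubeconfig string) error
}

// execCollector shells out to a troubleshoot support-bundle binary instead of
// importing troubleshoot.sh as a Go library.
type execCollector struct {
	binaryPath string // e.g. a support-bundle binary shipped alongside cli-tools
}

func (c execCollector) Collect(ctx context.Context, bundleSpecPath, kubeconfig string) error {
	// Flags here are illustrative; the real invocation would match whatever
	// the bundled binary accepts.
	cmd := exec.CommandContext(ctx, c.binaryPath, bundleSpecPath, "--kubeconfig", kubeconfig)
	return cmd.Run()
}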

eks-connector

What happened:
Hello,

I tried to configure eks-connector with my running eks-anywhere cluster in docker mode.

I've followed this documentation to create the IAM role for the EKS connector.

The eks-connector pod seems to be running fine, and the local cluster is registered in the AWS Console; nonetheless I get an IAM error that I do not understand:

users "arn:aws:iam::xxxxxxxx:role/AWSReservedSSO_AWSAdministratorAccess_8efa0b9a2a4cd738" is forbidden: User "system:serviceaccount:eks-connector:eks-connector" cannot impersonate resource "users" in API group "" at the cluster scope

What you expected to happen:

The cluster is registered in the AWS Console, but I cannot see any workloads.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • EKS Anywhere Release:
  • EKS Distro Release:

Allow to get logs without waiting for cli to fail

What would you like to be added:
A mechanism to collect logs ad hoc: either with a new command, or automatically when the cli is killed

Why is this needed:
If, during the create command, there is a problem and the CLI gets stuck waiting for a component that will never come up, the CLI only collects logs after a timeout is reached. If users want the logs, they need to wait until the CLI fails, which is also not communicated in the logs. This could be a very long time.
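
A small sketch of the automatic variant (signal handling only; the collectLogs hook stands in for whatever log-collection step already exists):

package cmd

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// runWithAdHocLogs traps SIGINT/SIGTERM, collects logs before cancelling the
// command, so users who kill the CLI still get diagnostics without waiting
// for the timeout.
func runWithAdHocLogs(ctx context.Context, run func(context.Context) error, collectLogs func()) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-sigs
		collectLogs() // gather diagnostics before the command dies
		cancel()
	}()

	return run(ctx)
}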

DNS resolver

What happened:

How can I add a DNS resolver (8.8.8.8) to the /etc/resolv.conf of my pods using the cluster description YAML file?
I'm using the docker provider, and in /etc/resolv.conf I have the IP of the CoreDNS service.

My pods do not resolve external names such as amazon.com.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • EKS Anywhere Release:
  • EKS Distro Release:

Pod CIDR in the generated spec is not used for cluster provision

The pod CIDR was updated in the generated spec:

clusterNetwork:
  cni: cilium
  pods:
    cidrBlocks:
    - 10.10.0.0/16

Despite the change, kubectl cluster-info dump --kubeconfig dev-cluster-eks-a-cluster.kubeconfig | grep -i cidr reported:

 "io.cilium.network.ipv4-pod-cidr": "192.168.0.0/24",
                "podCIDR": "192.168.0.0/24",
                "podCIDRs": [
                    "io.cilium.network.ipv4-pod-cidr": "192.168.1.0/24",
                "podCIDR": "192.168.1.0/24",
                "podCIDRs": [
                            "--allocate-node-cidrs=true",
                            "--cluster-cidr=192.168.0.0/16",

Is modifying the spec the correct way to configure the cluster CIDR and direct Cilium to follow it?

Add more meaningful log messages when the cluster creation fails while waiting for control plane

What would you like to be added:
Provide better logs when failures are caused by something going wrong on a control plane node during cluster creation.

Why is this needed:
Currently, if something goes wrong on the workload cluster (an image pull fails, a pod errors out, etc.), the CLI waits until the timeout and then errors out with error waiting for workload cluster control plane to be ready.

One thing we can do is, while the CLI waits for the control plane to be ready, also monitor the workload cluster for any issues/failures and report them back through the logs.
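
A rough sketch of such a monitor, assuming a client-go clientset pointed at the workload or bootstrap cluster (the polling interval and log format are arbitrary):

package monitor

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reportUnhealthyPods runs alongside the main wait and periodically surfaces
// pods that are stuck, so failures show up in the logs before the timeout.
func reportUnhealthyPods(ctx context.Context, client kubernetes.Interface, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
			if err != nil {
				continue // transient API errors are not worth reporting here
			}
			for _, p := range pods.Items {
				if p.Status.Phase != corev1.PodRunning && p.Status.Phase != corev1.PodSucceeded {
					log.Printf("pod %s/%s is %s: %s", p.Namespace, p.Name, p.Status.Phase, p.Status.Reason)
				}
			}
		}
	}
}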
