
eks-anywhere's Introduction

Amazon EKS Anywhere

Release Downloads · Go Report Card · CII Best Practices · Contributors · License

Build status

Conformance test status

Amazon EKS Anywhere is a new deployment option for Amazon EKS that enables you to easily create and operate Kubernetes clusters on-premises with your own virtual machines or bare metal hosts. It brings a consistent AWS management experience to your data center, building on the strengths of Amazon EKS Distro, the same distribution of Kubernetes that powers EKS on AWS. Its goal is to include full lifecycle management of multiple Kubernetes clusters that are capable of operating completely independently of any AWS services.

Here are the steps for getting started with EKS Anywhere. Full documentation for releases can be found at https://anywhere.eks.amazonaws.com.

Development

EKS Anywhere is tested using Prow, the Kubernetes CI system. EKS operates an installation of Prow, which is visible at https://prow.eks.amazonaws.com/. Please read our CONTRIBUTING guide before making a pull request. Refer to our end-to-end guide to run E2E tests locally.

The dependencies which make up EKS Anywhere are defined and built via the build-tooling repo. To update dependencies, please review the README for the specific dependency before opening a PR.

See Cherry picking for backporting to release branches

Security

If you discover a potential security issue in this project, or think you may have discovered a security issue, we ask that you notify AWS Security via our vulnerability reporting page. Please do not create a public GitHub issue.

License

This project is licensed under the Apache-2.0 License.

eks-anywhere's People

Contributors

a-cool-train, abhay-krishna, abhinavmpandey08, ahreehong, chrisdoherty4, chrisnegus, cxbrowne1207, d8660091, danbudris, dependabot[bot], eks-distro-pr-bot, ewollesen, g-gaston, gwesterfieldjr, ivyostosh, jiayiwang7, jonahjon, jonathanmeier5, maxdrib, mitalipaygude, mrajashree, panktishah26, pokearu, ptrivedi, rahulbabu95, taneyland, tatlat, terryhowe, vignesh-goutham, vivek-koppuru


eks-anywhere's Issues

Add support to create and manage a "management cluster"

Currently, eks-anywhere clusters are self-managed: the eks-anywhere objects and controller are installed in every cluster, and users interact with these objects to modify their cluster infrastructure. This may not be the desired setup for all organizations. We should introduce the ability for users to create a management cluster that can be used to manage their other clusters. This would also enable GitOps-based operations for all changes to managed clusters, including their creation.

Air gap support

What would you like to be added:
We need air gap support for users who want disconnected environments.

Add reference creds for VCenter and GitOps in Cluster spec

Three environment variables can be set for an EKS-A cluster: the vSphere username and password, and the GitHub token (if GitOps is enabled). These secrets can be loaded onto the created Kubernetes cluster and referenced from the Cluster spec. Below are just examples of how that might look:

VSphereDatacenterConfig:

apiVersion: anywhere.eks.amazonaws.com/v1
kind: VSphereDatacenterConfig
metadata:
  name: test-datacenter
  namespace: default
spec:
  ...
  credentialsRef: # populated by CLI when passed in as env var
   secretName: "" # needs to be a basic-auth type secret

GitOpsConfig:

apiVersion: anywhere.eks.amazonaws.com/v1
kind: GitOpsConfig
metadata:
  name: test-gitops
  namespace: default
spec:
  flux:
    github:
      ...
      authTokenRef: # populated by CLI if passed in as flag/env var
        secretName: "" # needs to be an opaque secret
        secretKey: "personal-access-token"
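
For illustration only, here is a minimal client-go sketch (in Go) of how the CLI might materialize these referenced secrets from the existing environment variables. The secret names and overall flow are hypothetical; only the environment variable names and secret types come from the proposal above.

package main

import (
	"context"
	"log"
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Basic-auth secret that VSphereDatacenterConfig.spec.credentialsRef could point to.
	vsphereCreds := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "test-datacenter-credentials", Namespace: "default"}, // hypothetical name
		Type:       corev1.SecretTypeBasicAuth,
		StringData: map[string]string{
			"username": os.Getenv("EKSA_VSPHERE_USERNAME"),
			"password": os.Getenv("EKSA_VSPHERE_PASSWORD"),
		},
	}

	// Opaque secret that GitOpsConfig.spec.flux.github.authTokenRef could point to.
	githubToken := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "test-gitops-token", Namespace: "default"}, // hypothetical name
		Type:       corev1.SecretTypeOpaque,
		StringData: map[string]string{
			"personal-access-token": os.Getenv("EKSA_GITHUB_TOKEN"),
		},
	}

	for _, s := range []*corev1.Secret{vsphereCreds, githubToken} {
		if _, err := client.CoreV1().Secrets(s.Namespace).Create(context.TODO(), s, metav1.CreateOptions{}); err != nil {
			log.Fatalf("creating secret %s: %v", s.Name, err)
		}
	}
}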

Proxmox Provider Support

Support for deploying EKS Anywhere using a provider for Proxmox.

Many labs, homelabs, and test systems run Proxmox as the underlying VM host. Having an easy way to deploy to a single Proxmox instance or a Proxmox cluster would be very useful for testing/development/training.

Supporting as many ways as possible to deploy a full cluster onto a single server host would rapidly increase adoption in this space.

Deployment fails when template stored on VMFS5 and deployment done on VMFS6

What happened:
When eksctl anywhere create ... created the bottlerocket template, it ended up on a VMFS5 volume by default. When the installer subsequently attempted to clone the template to VMs, it got stuck in a seemingly infinite loop of failed operations, referencing "the specified delta disk format 'redoLogFormat' is not supported". Thousands of attempts were made, and the only way to stop it (well, the way I realized at the time at least) was to convert the template to a VM and then delete it.

What you expected to happen:
Cloning the template to VMs should work.

How to reproduce it (as minimally and precisely as possible):

  1. VM Template on VMFS5
  2. EKSA pointing to a VMFS6 datastore for the VMs
  3. Run eksctl anywhere create ...

Anything else we need to know?:

Converting the template to a VM, using vMotion to move it from the VMFS5 datastore to a VMFS6 datastore, and converting it back to a template solved the issue for me.

Environment:

  • EKS Anywhere Release: 0.66
  • EKS Distro Release:

Waiting on namespace which has no pods running

Tried to create the local dev cluster as demonstrated in the blog, but the process hung at:

2021-09-16T10:16:28.374Z        V4      Task start      {"task_name": "workload-cluster-init"}
2021-09-16T10:16:28.374Z        V0      Creating new workload cluster
2021-09-16T10:16:28.374Z        V6      Executing command       {"cmd": "/usr/bin/docker run -i --network host -v /home/ec2-user:/home/ec2-user -w /home/ec2-user -v /var/run/docker.sock:/var/run/docker.sock --entrypoint kubectl public.ecr.aws/eks-anywhere/cli-tools:v0.1.0-eks-a-1 apply -f dev-cluster/generated/dev-cluster-eks-a-cluster.yaml --namespace eksa-system --kubeconfig dev-cluster/generated/dev-cluster.kind.kubeconfig"}
2021-09-16T10:16:29.956Z        V5      Retry execution successful      {"retries": 1, "duration": "1.581950842s"}
2021-09-16T10:16:29.956Z        V3      Waiting for external etcd to be ready
2021-09-16T10:16:29.957Z        V6      Executing command       {"cmd": "/usr/bin/docker run -i --network host -v /home/ec2-user:/home/ec2-user -w /home/ec2-user -v /var/run/docker.sock:/var/run/docker.sock --entrypoint kubectl public.ecr.aws/eks-anywhere/cli-tools:v0.1.0-eks-a-1 wait --timeout 60m --for=condition=ManagedEtcdReady clusters.cluster.x-k8s.io/dev-cluster --kubeconfig dev-cluster/generated/dev-cluster.kind.kubeconfig -n eksa-system"}
2021-09-16T10:17:01.603Z        V3      External etcd is ready
2021-09-16T10:17:01.603Z        V3      Waiting for control plane to be ready
2021-09-16T10:17:01.603Z        V6      Executing command       {"cmd": "/usr/bin/docker run -i --network host -v /home/ec2-user:/home/ec2-user -w /home/ec2-user -v /var/run/docker.sock:/var/run/docker.sock --entrypoint kubectl public.ecr.aws/eks-anywhere/cli-tools:v0.1.0-eks-a-1 wait --timeout 60m --for=condition=ControlPlaneReady clusters.cluster.x-k8s.io/dev-cluster --kubeconfig dev-cluster/generated/dev-cluster.kind.kubeconfig -n eksa-system"}

Noticed the command was waiting on the namespace "eksa-system", which has no pods running:

NAMESPACE                           NAME                                                              READY   STATUS    RESTARTS   AGE
capd-system                         capd-controller-manager-659dd5f8bc-6b7pp                          2/2     Running   0          24m
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-69889cb844-nmp8l        2/2     Running   0          24m
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-6ddc66fb75-ppblp    2/2     Running   0          24m
capi-system                         capi-controller-manager-db59f5789-ps89q                           2/2     Running   0          24m
capi-webhook-system                 capi-controller-manager-64b8c548db-fsxbn                          2/2     Running   0          24m
capi-webhook-system                 capi-kubeadm-bootstrap-controller-manager-68b8cc9759-lnt8k        2/2     Running   0          24m
capi-webhook-system                 capi-kubeadm-control-plane-controller-manager-7dc88f767d-2t5ps    2/2     Running   0          24m
cert-manager                        cert-manager-5f6b885b4-f29hz                                      1/1     Running   0          25m
cert-manager                        cert-manager-cainjector-bb6d9bcb5-zntdt                           1/1     Running   0          25m
cert-manager                        cert-manager-webhook-56cbc8f5b8-8sctg                             1/1     Running   0          25m
etcdadm-bootstrap-provider-system   etcdadm-bootstrap-provider-controller-manager-54476b7bf9-k7tkg    2/2     Running   0          24m
etcdadm-controller-system           etcdadm-controller-controller-manager-d5795556-2bqfc              2/2     Running   0          24m
kube-system                         coredns-7c68f85774-vbkwm                                          1/1     Running   0          25m
kube-system                         coredns-7c68f85774-wcl8h                                          1/1     Running   0          25m
kube-system                         etcd-dev-cluster-eks-a-cluster-control-plane                      1/1     Running   0          25m
kube-system                         kindnet-wtjm9                                                     1/1     Running   0          25m
kube-system                         kube-apiserver-dev-cluster-eks-a-cluster-control-plane            1/1     Running   0          25m
kube-system                         kube-controller-manager-dev-cluster-eks-a-cluster-control-plane   1/1     Running   0          25m
kube-system                         kube-proxy-p6hzl                                                  1/1     Running   0          25m
kube-system                         kube-scheduler-dev-cluster-eks-a-cluster-control-plane            1/1     Running   0          25m
local-path-storage                  local-path-provisioner-666bfc797f-ldgjp                           1/1     Running   0          25m

Workload cluster creation failed as a result. Is this a bug, or does something need to change in the generated spec?

Add status to Cluster object

Add fields to the Cluster status that give users useful information about the state of their EKS-A cluster object. Below is just an example of what the status object could look like:

status:
  version: ""
  controlPlaneMachineCount: 1
  workerNodeGroups:
  - machineCount: 1
  externalEtcdMachineCount: 1
  gitOps:
    gitOpsRef: ""
    repository: ""
    owner: ""
    branch: ""
    latestCommit: ""
    status: pending/synced/paused/unavailable # reflects whether our eks-a controller is syncing with the repository or not
  bundlesRef: "" # References a bundle CRD that contains information about all versions of container images that we rely on
  licenseRef: ""
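
For reference, a rough Go sketch of status types that could back the YAML above; all type and field names here are illustrative, not the actual EKS-A API.

package v1alpha1

// Illustrative status types mirroring the example above; not the real API.
type GitOpsStatus struct {
	GitOpsRef    string `json:"gitOpsRef,omitempty"`
	Repository   string `json:"repository,omitempty"`
	Owner        string `json:"owner,omitempty"`
	Branch       string `json:"branch,omitempty"`
	LatestCommit string `json:"latestCommit,omitempty"`
	// Status reflects whether the eks-a controller is syncing with the
	// repository: pending, synced, paused or unavailable.
	Status string `json:"status,omitempty"`
}

type WorkerNodeGroupStatus struct {
	MachineCount int `json:"machineCount,omitempty"`
}

type ClusterStatus struct {
	Version                  string                  `json:"version,omitempty"`
	ControlPlaneMachineCount int                     `json:"controlPlaneMachineCount,omitempty"`
	WorkerNodeGroups         []WorkerNodeGroupStatus `json:"workerNodeGroups,omitempty"`
	ExternalEtcdMachineCount int                     `json:"externalEtcdMachineCount,omitempty"`
	GitOps                   *GitOpsStatus           `json:"gitOps,omitempty"`
	// BundlesRef references a bundle CRD that lists all container image
	// versions the cluster relies on.
	BundlesRef string `json:"bundlesRef,omitempty"`
	LicenseRef string `json:"licenseRef,omitempty"`
}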

Update the Flux CLI and controller images on ECR

What would you like to be added:

Hi team, I've noticed that the Flux container images used by EKS Anywhere are being pushed to ECR. The Flux images are behind the upstream distribution (critical bugs were fixed and new features were added).

Can you please update the Flux images on the Public ECR and the Flux binary included in EKS Anywhere?

Is there a public repo with the build scripts? If so, I could update them myself.

Why is this needed:

It would be great if EKS Anywhere would be in sync with the latest stable Flux release.

$ flux -v
  flux version 0.17.0

$ flux install --export | grep ghcr
  image: ghcr.io/fluxcd/helm-controller:v0.11.2
  image: ghcr.io/fluxcd/kustomize-controller:v0.14.0
  image: ghcr.io/fluxcd/notification-controller:v0.16.0
  image: ghcr.io/fluxcd/source-controller:v0.15.4

Business logic in executables package

The executables package now contains business logic, when it was supposed to be just a wrapper around the low-level operations provided by the binaries. Any higher-level operation should be composed of these low-level operations and moved up, either to a higher-level client, a provider, or a new type/module.

Moreover, the govc module has some side effects: specifically, it updates some of the vSphere cluster config structs. We should avoid this, since it makes the code more difficult to understand and debug. Any updates to such structs should be done by the type that owns them; in this case, that seems to be the vSphere provider or some other vSphere type/submodule.
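
As a rough illustration of the suggested layering (all names here are hypothetical, not the current code): the executables wrapper exposes only thin govc operations, while the vSphere provider composes them and owns any updates to its own config structs.

package vsphere

import "context"

// GovcClient is the thin wrapper around the govc binary: no business logic,
// no mutation of provider config structs.
type GovcClient interface {
	TemplateHasSnapshot(ctx context.Context, template string) (bool, error)
	GetTags(ctx context.Context, path string) ([]string, error)
}

type Config struct {
	Template     string
	TemplateTags []string
}

type Provider struct {
	govc GovcClient
	cfg  *Config // owned and mutated only by the provider
}

// ResolveTemplate is a higher-level operation composed of low-level govc
// calls; it updates the provider's own config instead of having the govc
// module reach into it as a side effect.
func (p *Provider) ResolveTemplate(ctx context.Context) error {
	tags, err := p.govc.GetTags(ctx, p.cfg.Template)
	if err != nil {
		return err
	}
	p.cfg.TemplateTags = tags
	return nil
}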

Improve cleanup of resources on failure

Currently the CLI leaves resources hanging on failure (such as VM nodes, the kind bootstrap cluster, etc.), which can be troublesome for users to clean up on their own. Instead, the default for create could be to clean up on failure based on the phase of the workflow, unless the user opts out through a flag like --skip-cleanup. We can also make the appropriate govc calls to clean up VMs in case cleaning them up through the bootstrap cluster fails.
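
A minimal sketch of that behavior, assuming a hypothetical --skip-cleanup flag and phase abstraction (not the actual workflow code):

package workflow

import "context"

// phase is a single step of the create workflow that knows how to undo itself.
type phase interface {
	Run(ctx context.Context) error
	Cleanup(ctx context.Context)
}

// createCluster runs the workflow and, on failure, cleans up everything the
// completed phases created unless --skip-cleanup was passed.
func createCluster(ctx context.Context, phases []phase, skipCleanup bool) (err error) {
	var completed []phase
	defer func() {
		if err != nil && !skipCleanup {
			// Undo in reverse order: workload VMs first, then the kind
			// bootstrap cluster, and so on.
			for i := len(completed) - 1; i >= 0; i-- {
				completed[i].Cleanup(ctx)
			}
		}
	}()
	for _, p := range phases {
		if err = p.Run(ctx); err != nil {
			return err
		}
		completed = append(completed, p)
	}
	return nil
}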

Cleanup docker volumes/resources

Running the command multiple times on the same machine might leave volumes behind and fill up disk space. Instead, the CLI should clean up the resources after use for all the commands that are run within docker containers, as well as any other resources left behind by the creation process.

"Bring your own Registry" support

Currently we pull all container images from public.ecr. To make it easier for users who prefer to limit outgoing traffic, we should support a local registry, such as Harbor, so the images don't have to be pulled from the internet for every node. This becomes increasingly important as cluster size increases.

  • Allow local registry configuration via cluster spec
  • Provide script/subcommand to download all images from a given release manifest and push to local registry
  • capi patch to pass registry mirror config to bottlerocket user-data (@abhinavmpandey08). A PR for this is already open: Add capi patch for bottlerocket registry mirror support eks-anywhere-build-tooling#41
  • Update the deployment template to pass mirror config for br based on the capi patch (@abhinavmpandey08)
  • etcd changes to add registry mirror settings to bottlerocket etcd nodes (@mrajashree)
  • CAPI patch to fix BR proxy template error (@mrajashree)
  • e2e test for testing registry mirror
  • docs update to reflect the changes (@mdsgabriel)

High-level instructions that beta users were able to follow to support this workflow manually:

  1. Convert the ova template to a VM
  2. Power on the VM and get shell access. The easiest way of doing this is by resetting the VM a few times until you get to the GRUB menu and then boot into recovery mode (Advanced options for Ubuntu -> recovery mode -> Drop to root shell prompt).
  3. Append the following section to the /etc/containerd/config.toml file. Make sure to replace <YOUR-LOCAL-REGISTRY-ENDPOINT> with your container registry endpoint:
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."public.ecr.aws"]
        endpoint = ["https://<YOUR-LOCAL-REGISTRY-ENDPOINT>"]
      [plugins."io.containerd.grpc.v1.cri".registry.configs."<YOUR-LOCAL-REGISTRY-ENDPOINT>".tls]
        insecure_skip_verify = true
    
  4. If you want to use TLS verification, you can follow the steps listed here: https://github.com/containerd/containerd/blob/main/docs/cri/registry.md#configure-registry-tls-communication
  5. Save the file and power off the VM
  6. Convert the ova back to a template and use this modified template in your clusterconfig yaml

Manage the content of cluster config file pushed to Github when Flux is enabled

What would you like to be added:

Similar to how we generate the cluster config, we need a centralized place to filter/restrict the fields we write to the local git repo and push to the remote.

Refer to the spec design doc for the fields we want to include.

Eg.

  • remove metadata.creationTimestamp
  • exclude status field for each resource

Implementation

Code to start: https://github.com/aws/eks-anywhere/blob/main/pkg/clustermarshaller/clustermarshaller.go#L14

The filter also applies to the cluster config file that's written to the local path (WriteClusterConfigTask)

Why is this needed:

Currently the cluster config file pushed to GitHub has no filter. Every field added during creation/upgrade is pushed to the repo, e.g. metadata.creationTimestamp: null. We want full control over which fields to include/exclude in the file that's pushed to the repo.
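
A minimal sketch of the kind of filter this could be (not the actual clustermarshaller code), applied to each resource's YAML before it is written locally or pushed:

package clustermarshaller

import "sigs.k8s.io/yaml"

// filterClusterConfig drops fields we don't want in the repo: the status
// block and metadata.creationTimestamp (which otherwise marshals as "null").
func filterClusterConfig(in []byte) ([]byte, error) {
	var obj map[string]interface{}
	if err := yaml.Unmarshal(in, &obj); err != nil {
		return nil, err
	}
	// Exclude the status field for each resource.
	delete(obj, "status")
	// Remove metadata.creationTimestamp.
	if md, ok := obj["metadata"].(map[string]interface{}); ok {
		delete(md, "creationTimestamp")
	}
	return yaml.Marshal(obj)
}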

Add proxy configuration to etcd nodes for bottlerocket

What would you like to be added:
Proxy configuration for pulling images on etcd nodes in bottlerocket

Why is this needed:
For bottlerocket, etcd nodes need to pull some images from public ECR. The proxy settings are not configured for etcd, which results in image pull issues in environments that rely on a proxy.

E2E tests improvements

  • Developer experience when debugging failed tests (needed tools in ami, docs/wiki, support bundle generated and stored on s3, etc)
  • Tracking failures over time
  • Flake handling process
  • Define Oncall responsibilities
  • Clean up left over VMs

E2E code changes:

providerID is empty

What happened:
Cluster setup failed

How to reproduce it (as minimally and precisely as possible):
vSphere is not connected to the internet, but the node network has a connection to the internet. The OVA was imported into vCenter and specified in VSphereMachineConfig.
The DHCP pool has generated forward and reverse DNS zones. During cluster creation, 6 VMs were created in vSphere.

eks-1-md-0-5df5857bc4-cjzkp
eks-1-kgtph
eks-1-etcd-z47r7
eks-1-etcd-fx8nz
eks-1-md-0-5df5857bc4-p8ptw
eks-1-etcd-899g8

Anything else we need to know?:

Environment: vSphere 7.0.2

  • EKS Anywhere Release: 0.5.0
  • EKS Distro Release: v1.21.2-eks-1-21-4

Cluster YAML

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: eks-1
spec:
  clusterNetwork:
    cni: cilium
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: "%value%"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: eks-1-master
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: eks-1
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: eks-1-etcd
  kubernetesVersion: "1.21"
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: eks-1-worker

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: eks-1
spec:
  datacenter: "%value%"
  insecure: false
  network: "%value%"
  server: "%value%"
  thumbprint: "%value%"

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: eks-1-master
spec:
  datastore: "%value%"
  diskGiB: 25
  folder: "%value%"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: "%value%"
  template: "bottlerocket-vmware-k8s-1.21-x86_64-1.2.0-ccf1b754"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa %value%

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: eks-1-worker
spec:
  datastore: "%value%"
  diskGiB: 25
  folder: "%value%"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: "%value%"
  template: "bottlerocket-vmware-k8s-1.21-x86_64-1.2.0-ccf1b754"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa %value%

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: eks-1-etcd
spec:
  datastore: "%value%"
  diskGiB: 25
  folder: "%value%"
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: "%value%"
  template: "bottlerocket-vmware-k8s-1.21-x86_64-1.2.0-ccf1b754"
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa %value%

---
$ eksctl anywhere create cluster -f eks-1.yaml -v 5
........
2021-09-13T18:14:33.150+0300	V3	Waiting for external etcd to be ready
2021-09-13T18:16:31.045+0300	V3	External etcd is ready
2021-09-13T18:16:31.045+0300	V3	Waiting for control plane to be ready
2021-09-13T19:17:38.138+0300	V4	Task finished	{"task_name": "workload-cluster-init", "duration": "1h3m7.627303368s"}
2021-09-13T19:17:38.138+0300	V4	----------------------------------
2021-09-13T19:17:38.138+0300	V4	Tasks completed	{"duration": "1h6m19.137271592s"}
Error: failed to create cluster: error waiting for workload cluster control plane to be ready: error executing wait: error: timed out waiting for the condition on clusters/eks-1
$ kubectl --kubeconfig eks-1.kind.kubeconfig -n eksa-system get machine
NAME                          PROVIDERID                                       PHASE          VERSION
eks-1-etcd-899g8              vsphere://420384b5-5262-e671-f018-9cd42973c693   Running
eks-1-etcd-fx8nz              vsphere://4203ef27-07de-76c0-36be-4f23be23ee57   Running
eks-1-etcd-z47r7              vsphere://420395c3-4f97-6ffe-8fb4-aa777208faba   Running
eks-1-kgtph                   vsphere://42034c54-335a-e5dc-7f99-2bf9675f5625   Provisioning   v1.21.2-eks-1-21-4
eks-1-md-0-5df5857bc4-cjzkp   vsphere://4203e9e3-eec6-89c9-7c89-9120f8019654   Provisioning   v1.21.2-eks-1-21-4
eks-1-md-0-5df5857bc4-p8ptw   vsphere://420370b5-46df-e8f1-2e1a-2795e0c78c3c   Provisioning   v1.21.2-eks-1-21-4

capi logs

E0914 08:08:58.179206       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="eks-10-16-17-74.vm.local"
E0914 08:08:58.179375       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="eks-10-16-17-74.vm.local"
E0914 08:08:58.740079       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="eks-10-16-17-75.vm.local"
E0914 08:08:58.740151       1 machine_controller.go:685] controllers/Machine "msg"="Unable to retrieve machine from node" "error"="no matching Machine"  "node"="eks-10-16-17-75.vm.local"
E0914 08:08:59.108376       1 machine_controller_noderef.go:181] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="eks-10-16-17-73.vm.local"
E0914 08:08:59.108497       1 machine_controller_noderef.go:181] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="eks-10-16-17-74.vm.local"
E0914 08:08:59.108525       1 machine_controller_noderef.go:181] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="eks-10-16-17-75.vm.local"
E0914 08:09:02.332249       1 machine_controller_noderef.go:181] controllers/Machine "msg"="Failed to parse ProviderID" "error"="providerID is empty" "providerID"={} "node"="eks-10-16-17-73.vm.local"

Support Macs with Apple Silicon and arm processors

What would you like to be added:
Currently the bootstrap cluster and clusters using the docker provider cannot run on M1 Macs. It would be good to support these options for administrative machines and local clusters.

Why is this needed:
To allow more users to try EKS-A and manage clusters.

Add robust generate command to show all optional fields

When running eksctl anywhere generate clusterconfig, the default output doesn't include all the fields that can be configured, as mentioned in the documentation. Adding an option to show all the optional fields that can be filled in, like a template, would make it easier to see everything that can be configured.

Add support for upgrading eks-anywhere components

Currently the upgrade command can only upgrade Kubernetes versions. We need to support updating the eks-anywhere components, including capi/capv.

Until we can support this, we may need to introduce a new release mode to support generating new release bundles with only Kubernetes (eks-distro) updates included so customers can update to new Kubernetes versions.


Design doc

In order to gain some upgrade capabilities as soon as possible (due to the dependency of other features on this one), we split the work into 3 chunks. These could be moved to separate issues if necessary (for example, if we decide to release the first chunk, or the first and second, in an early release):

Minimal functionality

This provides upgrades for all components except for Cilium and any CAPI version bump that involves an API version change. It doesn't include E2E tests, and any Bundle/patch CLI version would involve significant manual work and testing. It is preferred, if possible, to release this together with the second set of tasks.

Automation

This provides proper automated testing and reduces the amount of manual work required for new Bundle releases to a minimum.

Moved to #551

Full feature

Moved to #551

After restarting the machine that was running the single-system method EKS Anywhere, I lost my EKS-A Cluster.

What happened:
After restarting the machine that was running the single-system method EKS Anywhere, I lost my EKS-A Cluster.

What you expected to happen:
After restarting the machine, the single-system method EKS Anywhere cluster should come back up too.

How to reproduce it (as minimally and precisely as possible):
Restarting the machine that was running the single-system method EKS Anywhere.

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: v0.5.0
  • EKS Distro Release:


govc tasks should be internally optional

When running eksctl anywhere create on a system where the tags (e.g. os:bottlerocket) or the tag categories (e.g. os) already exist, the whole installation errors out. Rerunning it will succeed (though only after the user manually assigns the tags to the template so the installer can proceed). Either existence should be checked explicitly in the workflow and handled accordingly, or the error handling should be improved so that the workflow can continue.
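
A sketch of the check-before-create approach, with hypothetical method names on the govc wrapper; the same pattern would apply to tags as well as tag categories:

package vsphere

import "context"

// tagger is a hypothetical slice of the govc wrapper used by the workflow.
type tagger interface {
	ListCategories(ctx context.Context) ([]string, error)
	CreateCategory(ctx context.Context, name string) error
}

// ensureCategory treats an already-existing category (e.g. "os" from a
// previous run) as success instead of failing the whole installation.
func ensureCategory(ctx context.Context, t tagger, name string) error {
	existing, err := t.ListCategories(ctx)
	if err != nil {
		return err
	}
	for _, c := range existing {
		if c == name {
			return nil // already there; nothing to do
		}
	}
	return t.CreateCategory(ctx, name)
}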

vSphere 6.7 Update 3 support

What would you like to be added:
Official support for vSphere 6.7 Update 3. We have had a couple users try this out with success, but we do not have an official testing environment and strategy in place.

Why is this needed:

timeout on cert-manager install

What happened:

Timeout on cert-manager install on a local machine in Docker mode

What you expected to happen:

Installing cert-manager Version="v1.1.0"
Waiting for cert-manager to be available...
Error: timed out waiting for the condition
2021-09-15T08:00:24.463Z V4 Task finished {"task_name": "bootstrap-cluster-init", "duration": "12m30.072230612s"}
2021-09-15T08:00:24.472Z V4 ----------------------------------
2021-09-15T08:00:24.472Z V4 Tasks completed {"duration": "12m30.143671565s"}
2021-09-15T08:00:24.617Z V6 Executing command {"cmd": "/usr/bin/docker run -i --network host -v /home/cloud/anywhere:/home/cloud/anywhere -w /home/cloud/anywhere -v /var/run/docker.sock:/var/run/docker.sock --entrypoint kubectl public.ecr.aws/eks-anywhere/cli-tools:v0.1.0-eks-a-1 --kubeconfig dev-cluster/generated/dev-cluster.kind.kubeconfig logs deployment/cert-manager-webhook -n cert-manager"}

Cluster to be fully deployed

How to reproduce it (as minimally and precisely as possible):

eksctl anywhere create cluster -f dev-cluster.yaml --verbosity 9

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: dev-cluster
spec:
  clusterNetwork:
    cni: cilium
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
  datacenterRef:
    kind: DockerDatacenterConfig
    name: dev-cluster
  externalEtcdConfiguration:
    count: 1
  kubernetesVersion: "1.21"
  workerNodeGroupConfigurations:
  - count: 1

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: DockerDatacenterConfig
metadata:
  name: dev-cluster
spec: {}

---

Anything else we need to know?:

Environment:

  • EKS Anywhere Release: 0.5
  • EKS Distro Release: 1.21

Add release process to update just EKS-D versions

Currently release bundles include both EKS-A and EKS-D components and we do not have a way to handle updating EKS-A components. To provide users a way to update Kubernetes versions we need a process that can produce new release bundles for existing releases with just updated EKS-D components + new OVA and Kind images

ESXI Standalone Provider Support

Ability to deploy EKS Anywhere, using a provider, to a standalone instance of VMware ESXi without vCenter.

Many lab machines and home lab machines run VMware ESXi in a single-server configuration. While this is not production grade, it would make bootstrapping a cluster for development/testing/training trivial.
Currently you cannot do this with the existing providers, because you do not define a datacenter name on a single instance, which results in this error:

Validation failed {"validation": "vsphere Provider setup is valid", "error": "error validating vCenter setup: failed to get datacenter: govc: datacenter 'localhost.localdomain' not found\n", "remediation": ""}
Error: failed to create cluster: validations failed

Support node taints in cluster config

What would you like to be added:

Add a node taints option in the cluster config for both control plane and worker node groups, so that the defined taints can be attached to the node(s) automatically during cluster creation/upgrade.

cluster config

We add a taints option at the cluster level, e.g.:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: cluster-1
spec:
  ...
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: "198.18.100.27"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: cluster-1-cp
    taints: # optional
    - key: key1
      value: val1 # optional
      effect: NoSchedule # Valid effects are NoSchedule, PreferNoSchedule and NoExecute.
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: cluster-1
    taints: # optional
    - key: key2
      value: val2 # optional
      effect: PreferNoSchedule # Valid effects are NoSchedule, PreferNoSchedule and NoExecute.

implementation

Cluster API already supports taints for both the control plane and machine deployments.

We might easily pass the taints option down to the CAPI spec before applying.

The CAPI taints option does not work for worker nodes (kubeadmconfigtemplates). We might add taints for worker nodes in the CLI workflow (https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/controllers/controller.go#L267), or as postKubeadmCommands defined in the kubeadmconfigtemplates.
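
As a rough sketch (the TaintSpec type and function name are illustrative, not the actual implementation), the mapping from the proposed cluster-config taints to the corev1.Taint values CAPI expects could look like this; the result would be set on the kubeadm NodeRegistration taints in the generated control plane and worker templates:

package taints

import corev1 "k8s.io/api/core/v1"

// TaintSpec mirrors the proposed cluster-config fields above.
type TaintSpec struct {
	Key    string
	Value  string // optional
	Effect string // NoSchedule, PreferNoSchedule or NoExecute
}

// toCAPITaints converts the cluster-config taints into the core Kubernetes
// taint type used by Cluster API's kubeadm configuration.
func toCAPITaints(in []TaintSpec) []corev1.Taint {
	out := make([]corev1.Taint, 0, len(in))
	for _, t := range in {
		out = append(out, corev1.Taint{
			Key:    t.Key,
			Value:  t.Value,
			Effect: corev1.TaintEffect(t.Effect),
		})
	}
	return out
}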

Why is this needed:

Users have requested that node taints be retained during an upgrade. Currently the rolling update recreates VMs, and the new nodes that join do not have the taints defined on the original nodes. Users need to manually attach those taints again after the upgrade.

We'd like to provide a declarative way to let users define node taints in the cluster config, so that nodes being created/upgraded get the proper taints (if defined) automatically, without manual setup.

TODO list

TBD:

  • support worker nodes taints update without VM recreate

Support bare metal deployments

It would be great if I could bring my own servers and EKS Anywhere will install Kubernetes directly on top of an existing OS or provision the OS and Kubernetes for me.

Please 👍 if you would like this feature.

Define to be Released work processes

  • How do we build a release Changelog?
  • How do we handle docs when merging features for unreleased versions?
  • ?

Sync with the BottleRocket team to see how they handle this type of management.

Support bundle improvements

The current support bundle functionality can pull relevant pod logs and analyze the current state of the cluster. We should improve the gathered data to better support debugging failures via the support bundle.

Add retry logic to govc calls

Most vSphere-related preflight checks are govc calls that check for things like a valid user/password, an existing template, an existing network, etc. Most of these should be retryable without much consequence, which would avoid temporary failures due to networking or vSphere blips.
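
A minimal retry sketch with exponential backoff (the attempt count, wait times, and function name are arbitrary assumptions) that such preflight checks could be wrapped in:

package retrier

import (
	"context"
	"time"
)

// withRetry runs op up to attempts times, doubling the wait between tries so
// short network or vSphere blips get absorbed instead of failing the check.
func withRetry(ctx context.Context, attempts int, initialWait time.Duration, op func() error) error {
	wait := initialWait
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
		}
		wait *= 2
	}
	return err
}

Each preflight govc call would then be passed in as the op closure, so a transient failure only surfaces after the retries are exhausted.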

Failed to pull image in bootstrap cluster. certificate signed by unknown authority. Custom CA

Error from server (NotFound): namespaces "capi-webhook-system" not found
Error from server (NotFound): namespaces "capi-kubeadm-bootstrap-system" not found
Error from server (NotFound): namespaces "capv-system" not found
Error from server (NotFound): namespaces "capi-webhook-system" not found
Error from server (BadRequest): container "cert-manager" in pod "cert-manager-webhook-56cbc8f5b8-7fhvt" is waiting to start: trying and failing to pull image
Error from server (BadRequest): container "cert-manager" in pod "cert-manager-cainjector-bb6d9bcb5-jtz48" is waiting to start: trying and failing to pull image
Error from server (BadRequest): container "cert-manager" in pod "cert-manager-5f6b885b4-pg5fh" is waiting to start: trying and failing to pull image

Describe failed pods shows -

Failed to pull image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-controller:v1.1.0-eks-a-1": rpc error: code = Unknown desc = failed to pull and unpack image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-controller:v1.1.0-eks-a-1": failed to resolve reference "public.ecr.aws/eks-anywhere/jetstack/cert-manager-controller:v1.1.0-eks-a-1": failed to do request: Head "https://public.ecr.aws/v2/eks-anywhere/jetstack/cert-manager-controller/manifests/v1.1.0-eks-a-1": x509: certificate signed by unknown authority
Failed to pull image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-cainjector:v1.1.0-eks-a-1": rpc error: code = Unknown desc = failed to pull and unpack image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-cainjector:v1.1.0-eks-a-1": failed to resolve reference "public.ecr.aws/eks-anywhere/jetstack/cert-manager-cainjector:v1.1.0-eks-a-1": failed to do request: Head "https://public.ecr.aws/v2/eks-anywhere/jetstack/cert-manager-cainjector/manifests/v1.1.0-eks-a-1": x509: certificate signed by unknown authority
Failed to pull image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-webhook:v1.1.0-eks-a-1": rpc error: code = Unknown desc = failed to pull and unpack image "public.ecr.aws/eks-anywhere/jetstack/cert-manager-webhook:v1.1.0-eks-a-1": failed to resolve reference "public.ecr.aws/eks-anywhere/jetstack/cert-manager-webhook:v1.1.0-eks-a-1": failed to do request: Head "https://public.ecr.aws/v2/eks-anywhere/jetstack/cert-manager-webhook/manifests/v1.1.0-eks-a-1": x509: certificate signed by unknown authority

I'm in a secure corp environment with corp CA's used for filtering all web requests.

What happened:
Local cluster creation times out after many minutes.

What you expected to happen:
Local cluster is created.

How to reproduce it (as minimally and precisely as possible):
Not sure

Anything else we need to know?:
macOS Big Sur 11.5.2

Environment:

  • EKS Anywhere Release: v0.5.0
  • EKS Distro Release:

Refactor support bundle to use `troubleshoot.sh` binary under-the-hood

EKS-A currently uses troubleshoot.sh to provide support-bundle functionality for the clusters we provision. This functionality allows our customers to quickly and easily gather diagnostic information about their cluster for their own debugging and to share with AWS support/engineering.

In our initial iteration, we directly consumed troubleshoot.sh as a Go library. This had the advantage of allowing us to easily construct support bundles programmatically using the existing abstractions, requiring less initial development overhead. However, using troubleshoot.sh as a library posed some challenges when it came time to upgrade to the latest version, an upgrade which would allow us to take advantage of the important copy-from-host functionality.

The latest version of troubleshoot introduced a new dependency, longhorn-manager, which vendors its k8s.io dependencies. This meant that troubleshoot, the initial consumer of longhorn-manager, had to use replace directives in its go.mod to specify which versions of the vendored dependencies to use. Since replace directives are not transitive, each consumer of the troubleshoot.sh Go package is now required to specify similar replace directives; without them, the package will not build. This means that we must either specify similar replace directives or not use the latest version of the troubleshoot.sh library.

We have decided to move forward with refactoring the support bundle functionality to wrap the troubleshoot binary rather than directly consuming its library (a rough sketch of the wrapper shape follows the list below). This allows us to:

  • avoid polluting our dependencies, and the dependencies of downstream consumers of EKS-A, with fixed versions and replace directives
  • avoid unnecessary complexity in our dependency management from troubleshoot, now and in the future
  • avoid tightly coupling our Diagnostic Bundle package to the internal troubleshoot.sh implementation, by building an interface to the binary
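
A rough sketch of what that wrapper shape could look like; the type names, binary path, and flags are illustrative assumptions, not the final design:

package diagnostics

import (
	"context"
	"os/exec"
)

// BundleCollector is the interface the rest of EKS-A would program against,
// keeping the troubleshoot.sh implementation details behind it.
type BundleCollector interface {
	Collect(ctx context.Context, bundleSpecPath, kubeconfig string) error
}

// execCollector shells out to a troubleshoot support-bundle binary instead of
// importing troubleshoot.sh as a Go library.
type execCollector struct {
	binaryPath string // e.g. a support-bundle binary shipped alongside cli-tools
}

func (c execCollector) Collect(ctx context.Context, bundleSpecPath, kubeconfig string) error {
	// Flags here are illustrative; the real invocation would match whatever
	// the bundled binary accepts.
	cmd := exec.CommandContext(ctx, c.binaryPath, bundleSpecPath, "--kubeconfig", kubeconfig)
	return cmd.Run()
}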

eks-connector

What happened:
Hello,

I tried to configure eks-connector with my running eks-anywhere cluster in docker mode.

I've followed this documentation to create the IAM role for the EKS connector.

The eks-connector pod seems to be running fine, and the local cluster is registered in the AWS Console; nonetheless I get an IAM error that I do not understand:

users "arn:aws:iam::xxxxxxxx:role/AWSReservedSSO_AWSAdministratorAccess_8efa0b9a2a4cd738" is forbidden: User "system:serviceaccount:eks-connector:eks-connector" cannot impersonate resource "users" in API group "" at the cluster scope

What you expected to happen:

The cluster is registered in the AWS Console, but I cannot see any workloads.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • EKS Anywhere Release:
  • EKS Distro Release:

Allow to get logs without waiting for cli to fail

What would you like to be added:
A mechanism to collect logs ad hoc: either with a new command, or automatically when the cli is killed

Why is this needed:
If, during the create command, there is a problem and the CLI gets stuck waiting for a component that will never come up, the CLI only collects logs after a timeout is reached. If users want the logs, they need to wait until the CLI fails, which is also not communicated in the logs. This could be a very long time.
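
A small sketch of the automatic variant (signal handling only; the collectLogs hook stands in for whatever log-collection step already exists):

package cmd

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// runWithAdHocLogs traps SIGINT/SIGTERM, collects logs before cancelling the
// command, so users who kill the CLI still get diagnostics without waiting
// for the timeout.
func runWithAdHocLogs(ctx context.Context, run func(context.Context) error, collectLogs func()) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-sigs
		collectLogs() // gather diagnostics before the command dies
		cancel()
	}()

	return run(ctx)
}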

DNS resolver

What happened:

How can I add a DNS resolver (8.8.8.8) to the /etc/resolv.conf of my pods using the cluster description YAML file?
I'm using the docker provider, and in /etc/resolv.conf I have the IP of the CoreDNS service.

My pods do not resolve external names such as amazon.com.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • EKS Anywhere Release:
  • EKS Distro Release:

Pod CIDR in the generated spec is not used for cluster provision

The pod CIDR was updated in the generated spec:

clusterNetwork:
  cni: cilium
  pods:
    cidrBlocks:
    - 10.10.0.0/16

Despite the change, kubectl cluster-info dump --kubeconfig dev-cluster-eks-a-cluster.kubeconfig | grep -i cidr reported:

 "io.cilium.network.ipv4-pod-cidr": "192.168.0.0/24",
                "podCIDR": "192.168.0.0/24",
                "podCIDRs": [
                    "io.cilium.network.ipv4-pod-cidr": "192.168.1.0/24",
                "podCIDR": "192.168.1.0/24",
                "podCIDRs": [
                            "--allocate-node-cidrs=true",
                            "--cluster-cidr=192.168.0.0/16",

Is modifying the spec the correct way to configure the cluster CIDR and direct Cilium to follow it?

Add more meaningful log messages when the cluster creation fails while waiting for control plane

What would you like to be added:
Provide better logs when failures are caused by something going wrong on a control plane node during cluster creation.

Why is this needed:
Currently, if something goes wrong on the workload cluster (an image pull fails, a pod errors out, etc.), the CLI waits until the timeout and then errors out with error waiting for workload cluster control plane to be ready.

One thing we can do is, while the CLI waits for the control plane to be ready, also monitor the workload cluster for any issues/failures and report them back through the logs.
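
A rough sketch of such a monitor, assuming a client-go clientset pointed at the workload or bootstrap cluster (the polling interval and log format are arbitrary):

package monitor

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reportUnhealthyPods runs alongside the main wait and periodically surfaces
// pods that are stuck, so failures show up in the logs before the timeout.
func reportUnhealthyPods(ctx context.Context, client kubernetes.Interface, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
			if err != nil {
				continue // transient API errors are not worth reporting here
			}
			for _, p := range pods.Items {
				if p.Status.Phase != corev1.PodRunning && p.Status.Phase != corev1.PodSucceeded {
					log.Printf("pod %s/%s is %s: %s", p.Namespace, p.Name, p.Status.Phase, p.Status.Reason)
				}
			}
		}
	}
}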
