
eks-workshop-v2's Introduction


Amazon Elastic Kubernetes Service Workshop

Welcome to the repository for the Amazon Elastic Kubernetes Service workshop. It contains the source for the website content as well as the accompanying infrastructure-as-code to set up a workshop lab environment in your AWS account. Please review the Introduction chapter of the workshop for more details.

Introduction

The Amazon EKS Workshop is built to help users learn about Amazon EKS features and integrations with popular open-source projects. The workshop is abstracted into high-level learning modules, including Networking, Security, DevOps Automation, and more. These are further broken down into standalone labs focusing on a particular feature, tool, or use-case. To ensure a consistent and predictable learning experience, the Amazon EKS Workshop closely adheres to the following tenets:

Tenets:

  • Modular: The workshop is made up of standalone modules that can be individually completed, allowing you to start at any module and easily switch between them.
  • Consistent sample app: The workshop uses the same sample retail store application across all modules: AWS Containers Retail Sample.
  • Amazon EKS-focused: Although the workshop covers some Kubernetes basics, it primarily focuses on familiarizing the user with concepts directly related to Amazon EKS.
  • Continuously tested: We automatically test the infrastructure provisioning and CLI steps in the workshop, allowing us to keep the workshop updated and tracking the latest versions of Amazon EKS.

Navigating the repository

The top-level repository can be split into several areas.

Site content

The workshop content itself is a Docusaurus site. All workshop content is written in Markdown and can be found in the website directory.

Contributing content

To learn how to author content on this repository, read CONTRIBUTING.md and the authoring content guide.

Workshop infrastructure

The infrastructure required to run the workshop content (EKS cluster configuration, VPC networking, components like Helm charts) is defined as Terraform infrastructure-as-code configuration in the terraform directory.

Learner environment

There are several tools required to complete the workshop, such as kubectl, that a participant needs to have installed. This "learner environment" can be set up automatically using the scripts and other artifacts in the environment directory. This includes scripts to install all the prerequisite tools, as well as container images to easily re-create a consistent environment.

Community

Governance

Meetings

2nd Thursday every month at 8am PT (3pm UTC)

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.


eks-workshop-v2's Issues

error: no matches for kind "OpenTelemetryCollector" in version "opentelemetry.io/v1alpha1"

What happened?

reset-environment
Resetting the environment, please wait
error: no matches for kind "OpenTelemetryCollector" in version "opentelemetry.io/v1alpha1"

What did you expect to happen?

Running reset-environment should complete without errors.

How can we reproduce it?

reset-environment
Resetting the environment, please wait
error: no matches for kind "OpenTelemetryCollector" in version "opentelemetry.io/v1alpha1"

OR

kubectl apply -k /workspace/modules/observability/oss-metrics/adot
serviceaccount/adot-collector unchanged
clusterrole.rbac.authorization.k8s.io/otel-prometheus-role unchanged
clusterrolebinding.rbac.authorization.k8s.io/otel-prometheus-role-binding unchanged
configmap/otel-env-6kcf2bbb72 unchanged
error: unable to recognize "/workspace/modules/observability/oss-metrics/adot": no matches for kind "OpenTelemetryCollector" in version "opentelemetry.io/v1beta1"

Anything else we need to know?

Is there a workaround to get the EKS Workshop working?

EKS version

1.23

Add note about Cloud9 ownership

What would you like to be added?

A note should be added to the "Accessing the IDE" page mentioning that the Cloud9 environment may need to be shared with the console role/user if the Terraform was run as a separate user.

aws cloud9 create-environment-membership --environment-id XXXXXXXXXXXXXXXXX  --user-arn arn:aws:sts::1234567890:assumed-role/Admin/somerole --permissions read-write

Why is this needed?

This scenario will cause the Cloud9 instance to appear missing and confuse users.

[Bug]: Error: couldn't find key name in Secret catalog/catalog-db

What happened?

In the StatefulSet with EBS Volume module,
the catalog-mysql-ebs pod is unable to find the secret value and reports the error Error: couldn't find key name in Secret catalog/catalog-db.

What did you expect to happen?

Pod recreates without any errors.

How can we reproduce it?

$ pwd
/home/ec2-user/eks-workshop-v2/environment/workspace/modules/fundamentals/storage/ebs

$ kubectl apply -k .
namespace/catalog unchanged
serviceaccount/catalog unchanged
configmap/catalog unchanged
secret/catalog-db unchanged
service/catalog unchanged
service/catalog-mysql unchanged
service/catalog-mysql-ebs unchanged
deployment.apps/catalog unchanged
statefulset.apps/catalog-mysql unchanged
statefulset.apps/catalog-mysql-ebs configured

$ kubectl get pod catalog-mysql-ebs-0 -n catalog
NAME                  READY   STATUS                       RESTARTS   AGE
catalog-mysql-ebs-0   0/1     CreateContainerConfigError   0          2m20s

$ kubectl describe pod catalog-mysql-ebs-0 -n catalog | tail -n2
  Warning  Failed                  24s (x12 over 2m34s)  kubelet                  Error: couldn't find key name in Secret catalog/catalog-db
  Normal   Pulled                  24s (x11 over 2m34s)  kubelet                  Container image "public.ecr.aws/docker/library/mysql:5.7" already present on machine

Anything else we need to know?

No response

EKS version

v1.23.14-eks-ffeb93d

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.9", GitCommit:"c1de2d70269039fe55efb98e737d9a29f9155246", GitTreeState:"clean", BuildDate:"2022-07-13T14:26:51Z", GoVersion:"go1.17.11", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.14-eks-ffeb93d", GitCommit:"96e7d52c98a32f2b296ca7f19dc9346cf79915ba", GitTreeState:"clean", BuildDate:"2022-11-29T18:43:31Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}

WAF with ALB Ingress

What would you like to be added/changed?

Add a section to the ingress module which adds a WAF to the ALB with the appropriate annotation.

The WAF should be pre-created in Terraform and the WAF ID exported using the environment variables mechanism. Configure it with the core rule set and SQL protection. Add a custom rule that allows us to demonstrate the WAF working in the retail sample app front end (block viewing a specific product?).
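A minimal sketch of what the annotation could look like on the Ingress, assuming the WAF web ACL ARN is exported by Terraform as an environment variable (the variable name and the Ingress details below are illustrative, not the workshop's actual manifest):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ui
  namespace: ui
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # Associate the pre-created WAFv2 web ACL with the ALB; the ARN would be
    # substituted from the exported environment variable (hypothetical name)
    alb.ingress.kubernetes.io/wafv2-acl-arn: ${EKS_WORKSHOP_WAF_ACL_ARN}
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ui
                port:
                  number: 80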

[Bug]: workshop does not install using current instructions

What happened?

  1. Provisioned a brand new AWS Account.
  2. Invited account to an AWS Organization.
  3. Followed instructions here: https://www.eksworkshop.com/docs/introduction/setup/your-account
  4. When executing the step terraform apply --auto-approve I received the following error:

[ec2-user@ip-172-31-3-204 terraform]$ terraform apply --auto-approve
module.ide.module.cloud9_bootstrap_lambda.data.external.archive_prepare[0]: Reading...
module.ide.module.cloud9_bootstrap_lambda.data.external.archive_prepare[0]: Read complete after 1s [id=-]

│ Error: Invalid provider configuration

│ Provider "registry.terraform.io/hashicorp/aws" requires explicit configuration. Add a provider block to the root module and configure the provider's required arguments as
│ described in the provider documentation.



│ Error: configuring Terraform AWS Provider: error validating provider credentials: retrieving caller identity from STS: operation error STS: GetCallerIdentity, failed to resolve service endpoint, an AWS region is required, but was not found

│ with provider["registry.terraform.io/hashicorp/aws"],
│ on line 0:
│ (source code not available)



What did you expect to happen?

Expected to have a working workshop deployed.

How can we reproduce it?

Follow instructions on https://www.eksworkshop.com/docs/introduction/setup/your-account

Anything else we need to know?

No response

EKS version

None

Cloud9 should use VPC created in Terraform

What would you like to be added/changed?

Currently the Cloud9 instance that is created will randomly select a public subnet in the default VPC of the account. This is problematic since not all accounts have a default VPC, which will cause Terraform failures.

The configuration should be changed to select one of the public subnets in the VPC that is created by the Terraform, so there are fewer assumptions/dependencies on infrastructure not explicitly created.

fix: GitOps sidebar titles too long

The sidebar titles in the GitOps sidebar are longer than the rest of the modules and look a little out of place. Can we shorten them down to be a little more succinct?

Container Insights Prometheus with ADOT

What would you like to be added/changed?

Expand the Container Insights module to include sending scraped custom metrics to Container Insights Prometheus. Here is an example ADOT configuration showing the basic concept:

https://github.com/aws-observability/aws-otel-collector/blob/edc99d9c60603f3684474ebea74fb0e582af90cd/config/eks/prometheus/config-all.yaml#L230

We might need to understand how to send only custom metrics, and not re-send the "core" Container Insights metrics.
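A rough sketch of the collector pipeline this could use, assuming the standard ADOT prometheus receiver and awsemf exporter (the scrape job, namespace and log group name are illustrative):

receivers:
  prometheus:
    config:
      scrape_configs:
        # Illustrative scrape job for pods exposing custom metrics
        - job_name: custom-app-metrics
          kubernetes_sd_configs:
            - role: pod
exporters:
  awsemf:
    # Ship scraped metrics to CloudWatch as EMF for Container Insights Prometheus
    namespace: ContainerInsights/Prometheus
    log_group_name: /aws/containerinsights/${CLUSTER_NAME}/prometheus
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [awsemf]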

Module on How to debug k8s

What would you like to be added?

A module on how to debug a broken Kubernetes system for common issues. For example: why is my pod not being exposed, or why is my database crash-looping (say, the volume is full)?

Justin Garrison shared this with me on the AWS Twitch Stream (2023-02-08). Might be a good starting point
https://thenewstack.io/living-with-kubernetes-12-commands-to-debug-your-workloads/

Why is this needed?

It's needed because in the real world things WILL go wrong at some point. There might be a multitude of reasons why it goes wrong and a very important part of being a developer/devops engineer is figuring out what's wrong, and eventually finding a proper solution.

enhancement: Kubecost module accessibility

Originally we decided to use kubectl port-forward to access Kubecost, but I'm not sure this will work when the user is in Cloud9.

Additionally, after we tell the user that Kubecost is already installed, we should add a command snippet that describes the deployment, something like:

kubectl get deployment -n kubecost

We can also attach a test hook to this so we can verify that kubecost is installed correctly and the UI is up.

Change how Terraform creates VPC private subnet architecture

Currently the Terraform creates 6 "private subnets":

  • 3 from the VPC primary CIDR
  • 3 from the secondary VPC CIDR for custom networking

The way this is done is to use the private_subnets parameter of the VPC module for all of them. This causes other Terraform code that needs to retrieve the subnet IDs to be more complex than necessary. For example, to get the first set of 3 subnet IDs you need to do something like:

private_subnet_ids = slice(module.aws_vpc.private_subnets, 0, 3)

We should explore if we can use other parameters for this, like intra_subnets or database_subnets. This will more cleanly separate the subnet ID outputs and make them easier to consume.

fix: Include commands to create Spot Service Linked Role

Expected Behavior

All of the steps necessary to run Karpenter in a new account should be outlined, including creating the EC2 Spot Service Linked Role.

https://karpenter.sh/v0.17.0/getting-started/getting-started-with-terraform/#create-the-ec2-spot-service-linked-role

Actual Behavior

This step is omitted, which will cause the Karpenter content to break in accounts without the role.

Error in CloudTrail logs:

The provided credentials do not have permission to create the service-linked role for EC2 Spot Instances.

Steps to Reproduce the Problem

Run the Karpenter scenarios either manually or using the automation. Ensure that the Spot Service Linked Role is not present in the account.

bug: Terraform destroy times out due to orphaned ENI

Expected Behavior

Running terraform destroy should remove all infrastructure

Actual Behavior

ENIs can be left dangling due to this issue:

aws/amazon-vpc-cni-k8s#1223

This prevents one or more subnets from being deleted and the Terraform process times out.

Steps to Reproduce the Problem

This only occurs intermittently and is difficult to reproduce on purpose.

new: Secrets Manager integration

What would you like to be added/changed?

Create a lab using AWS Secrets and Configuration Provider for the Kubernetes Secrets Store CSI Driver. The scenario should create a Secrets Manager entry for the catalog database username and password, use ASCP to sync this to a Kubernetes Secret and then modify the catalog deployment and corresponding MySQL to use this secret. It could also set up the volume mounts to show this in parallel.
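A minimal sketch of the SecretProviderClass such a lab could use, assuming the Secrets Manager entry stores the username and password as JSON (the secret name, namespace, aliases and keys are hypothetical):

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: catalog-db
  namespace: catalog
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "eks-workshop/catalog-db"   # hypothetical Secrets Manager secret
        objectType: "secretsmanager"
        jmesPath:
          - path: username
            objectAlias: db-username
          - path: password
            objectAlias: db-password
  # Sync the values into a Kubernetes Secret so the catalog deployment can keep
  # consuming them via secretKeyRef
  secretObjects:
    - secretName: catalog-db
      type: Opaque
      data:
        - objectName: db-username
          key: username
        - objectName: db-password
          key: password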

Terraform 1.3.0 compatibility

What would you like to be added/changed?

Currently we pin the Terraform version to 1.2.9 and explicitly do not allow anything higher. This was originally due to EKS Blueprints issues with TF 1.3.0. Now that this is resolved upstream we should support 1.3.0.

Initial testing shows that this version has changed how module outputs are handled during destroy, and using 1.3.0 seems to give an error.

Tasks for this issue:

  • Make any Terraform changes necessary for compatibility
  • Relax provider requirement
  • Update version pins for GitHub Actions and CFN CodeBuild to the latest 1.3.X version

ACK module

The module will showcase how to use AWS Controllers for Kubernetes (ACK) to deploy the app for production using AWS services such as RDS and Amazon MQ.

First time Setup Error in AWS account

Error while installing hashicorp/time v0.9.1: mkdir .terraform/providers/registry.terraform.io/hashicorp/time: no space left on device

I'm facing errors after the terraform init command.

Please let me know how I can fix this.

Add AWS VPC Lattice to the workshop

What would you like to be added?

This chapter will focus on installing the Gateway API controller and will explain the different APIs it supports. We will demonstrate Lattice integrations and functions by building secure application networking with the EKS Workshop sample app, including a canary deployment. This module will also show how you can bring in vanity names and bring your own certificate.
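A rough sketch of the Gateway API resources such a chapter could build on, including a weighted canary route (the gateway class name, service names and weights are assumptions, not the final lab content):

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: eks-workshop
spec:
  gatewayClassName: amazon-vpc-lattice   # assumption: class registered by the AWS Gateway API controller
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: checkout
spec:
  parentRefs:
    - name: eks-workshop
  rules:
    - backendRefs:
        # Weighted canary between two versions of the checkout service (hypothetical names)
        - name: checkout
          kind: Service
          port: 80
          weight: 90
        - name: checkout-v2
          kind: Service
          port: 80
          weight: 10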

Why is this needed?

This is a launch feature and enables customers to use Amazon VPC Lattice.

[Bug]: Cloud9 IDE fails to communicate with SSH

What happened?

Hi,

When building this workshop using terraform apply --auto-approve

I noticed that the Cloud9 IDE that is created fails to create properly (communication issues with SSH).

Changing ide.tf in modules/ide to include:

connection_type = "CONNECT_SSM"

resolved the issue, and I'd suggest it is probably preferable to using SSH anyway.

I also think we should add to the docs to encourage folks to put their console role/user into the top-level variables.tf for Cloud9, otherwise the Cloud9 IDE that gets created may not be visible (if you're starting from another Cloud9 with an instance profile).

What did you expect to happen?

Cloud9 IDE to create without issues

How can we reproduce it?

Run terraform apply from a Cloud9 instance using an instance profile with temporary credentials disabled.

Anything else we need to know?

No response

EKS version

1.23 - workshop default.

Additional Karpenter scenarios

What would you like to be added/changed?

The Karpenter section should include more of the scenarios that Karpenter facilitates:

  • Consolidation
  • Provisioner flexibility with regards to things like GPU instances
  • Using multiple CPU architectures (x86/arm64)
  • Multiple provisioners with labels, weights

Fargate logging

What would you like to be added/changed?

There should be a section in the Observability module that covers Fargate logging and configuring the ConfigMap.
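For reference, a minimal sketch of the Fargate logging ConfigMap such a section would walk through (the region and log group name are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  # Fargate log routing is configured via a ConfigMap named aws-logging
  # in the aws-observability namespace
  name: aws-logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match *
        region us-west-2
        log_group_name /eks-workshop/fargate-fluentbit-logs
        log_stream_prefix fargate-
        auto_create_group true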

Add cost estimation

What would you like to be added?

We should provide a blurb saying how much the lab environment costs to run, something like:

Cost to complete: This workshop will cost you less than $XX for 3 hours.*

*This estimate assumes you follow the recommended configurations throughout the tutorial and terminate all resources immediately after you complete the tutorial.

Why is this needed?

Users should know the expense of the Terraform they are deploying ahead of time.

Incorrect affinity/Anti-affinity module flow

Currently we have an affinity on the redis deployment, meaning it won't start unless you have a node with a checkout service running. This is backwards: one instance of redis should be running on every node, and the checkout service should check that it is there before starting.

We need to remove the affinity on redis and add it to checkout.

FYI @smirman
https://preview--eksworkshop-v2-next.netlify.app/docs/fundamentals/managed-node-groups/affinity/
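A minimal sketch of what moving the affinity to checkout could look like, as a patch on the checkout Deployment's pod template (the label selector is an assumption about how the redis pods are labelled):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: checkout
spec:
  template:
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            # Only schedule checkout onto nodes that already run a checkout-redis pod
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/component: redis   # assumption: actual labels may differ
              topologyKey: kubernetes.io/hostname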

Topology Aware Hints in EKS 1.24

What would you like to be added/changed?

EKS support for Kubernetes 1.24 adds the ability to leverage Topology Aware Hints for trying to reduce cross-zone traffic. There should be a lab in the networking section that demonstrates this functionality.

Suggested flow:

  • Scale up application so it has 3 replicas of each service
  • Turn on load generation component
  • Show initial setup spreads traffic across all zones with no bias
  • Activate TAH on a backend service like catalog
  • Illustrate change in traffic

Although this is a generic Kubernetes feature it has implications for EKS users with regards to inter-AZ data transfer cost optimization.
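A minimal sketch of the step that activates TAH on the catalog Service, using the Kubernetes 1.24 annotation (shown as a patch; the rest of the Service spec stays as-is):

apiVersion: v1
kind: Service
metadata:
  name: catalog
  namespace: catalog
  annotations:
    # Opt this Service into Topology Aware Hints (Kubernetes 1.24)
    service.kubernetes.io/topology-aware-hints: auto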

Fundamentals > Stateful Workload using EBS

In this module we will review how to deploy a MySQL database using a StatefulSet and Amazon Elastic Block Store (EBS) as the PersistentVolume.
1. Inspect existing MySQL containers storing data in-memory
2. Provide a blurb on IRSA and how the driver is pre-installed
3. Dynamic provisioning with storage classes and PVCs (see the StorageClass sketch below)
4. Configure node affinities and taints with Kustomize
5. Show that MySQL data is preserved even when containers are restarted
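A minimal sketch of the StorageClass the dynamic provisioning step could use, assuming the EBS CSI driver is installed; the StatefulSet would then reference it from a volumeClaimTemplate via storageClassName (the name and parameters are illustrative):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3                          # hypothetical name
provisioner: ebs.csi.aws.com             # EBS CSI driver
volumeBindingMode: WaitForFirstConsumer  # bind when the pod is scheduled so the volume lands in the right AZ
parameters:
  type: gp3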

Migrate Security Groups for Pods lab to use sample application

What would you like to be added/changed?

Currently the SG for Pods module leverages a different sample application; this should be moved to use the shared sample application. There is already a MySQL database created by the Terraform for this purpose which exports the following environment variables:

ORDERS_RDS_ENDPOINT
ORDERS_RDS_USERNAME
ORDERS_RDS_PASSWORD
ORDERS_RDS_DATABASE_NAME
ORDERS_RDS_SG_ID

This should be used with the catalog service, which will automatically populate the schema in a new database when the pods start up.

Steps:

  1. Rename the Terraform resources and environment variables to be named appropriately for catalog service instead of orders
  2. Migrate the content to follow a flow similar to the IRSA module
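For reference, a minimal sketch of the SecurityGroupPolicy that could attach the database security group to the catalog pods after the rename (the labels, names and CATALOG_* variable are assumptions):

apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: catalog-rds-access
  namespace: catalog
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: catalog    # assumption: actual labels may differ
  securityGroups:
    groupIds:
      # Security group that grants access to the RDS database, exported by Terraform
      # (hypothetical variable name after the rename from ORDERS_*)
      - ${CATALOG_RDS_SG_ID}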

Checkov pull request check

What would you like to be added/changed?

Add a pull request check and an associated script to run locally that uses Checkov to assess the Terraform configuration for basic issues. The markdown linting mechanism can probably be used as a guide.

Network Policies

What would you like to be added/changed?

Lab demonstrating how to use Kubernetes Network Policies to restrict traffic in the EKS cluster. Install Cilium in the cluster with the Blueprints addon: https://github.com/aws-ia/terraform-aws-eks-blueprints/tree/main/modules/kubernetes-addons/cilium

Suggested flow:

  1. Describe how Cilium is installed
  2. Show that all Pods can communicate with each other (exec into catalog and curl checkout?)
  3. Implement a deny-all rule and show that the website is now broken (see the sketch after this list)
  4. Add policies to allow all required communication
  5. Demonstrate that Pods can no longer all communicate with each other (replicate step 2)
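A minimal deny-all sketch for step 3, assuming Cilium enforces the standard Kubernetes NetworkPolicy API (the target namespace is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: catalog
spec:
  # An empty podSelector matches every pod in the namespace
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress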

Visualization for load generator

What would you like to be added/changed?

Several scenarios in the workshop benefit from the system being under load, for example:

  1. Observability
  2. Operational changes that shouldn't cause disruption to "end users" (MNG updates etc)

The sample application includes artillery scenarios to generate load on all of the components. However, it would be useful to be able to visualize statistics coming from artillery to understand when user requests are being impacted by changes.

The initial suggestion is to use the existing integration to push metrics from artillery to a Prometheus instance and leverage the existing Grafana to present a dashboard to the workshop user.

https://www.artillery.io/docs/guides/plugins/plugin-publish-metrics#prometheus-pushgateway

This should all be automatically set up in Terraform.

[Bug]: Time out waiting for condition in resource "helm_release" "addon"

What happened?

Run terraform init and then terraform apply. I'm getting the following error:

Error: timed out waiting for the condition

│ with module.cluster.module.ec2.helm_release.addon[0],
│ on .terraform\modules\cluster.ec2\modules\kubernetes-addons\helm-addon\main.tf line 1, in resource "helm_release" "addon":
│ 1: resource "helm_release" "addon" {

It seems to be related to a package and not the code itself. Maybe some packages/dependencies are broken upstream.

What did you expect to happen?

Not get any errors during Terraform apply, and be able to see the addons in the EKS cluster.

How can we reproduce it?

Follow the steps in this article: https://www.eksworkshop.com/docs/introduction/setup/your-account

Anything else we need to know?

No response

EKS version

Kubernetes version: 1.23 | platform version: eks.5


Rework ADOT configuration steps in Observability labs

What would you like to be added/changed?

Currently the YAML for the ADOT collector uses envsubst to render the final manifest which is then applied. This is more work than necessary and it is possible to orchestrate this whole process with kustomize.

There are two places this needs to be fixed:

  1. OSS metrics: https://github.com/aws-samples/eks-workshop-v2/tree/main/environment/workspace/modules/observability/oss-metrics/adot
  2. Container insights: https://github.com/aws-samples/eks-workshop-v2/tree/main/environment/workspace/modules/observability/container-insights/adot

Steps:

  1. Expand vars to include any other required environment variables: https://github.com/aws-samples/eks-workshop-v2/blob/main/environment/workspace/modules/observability/oss-metrics/adot/kustomization.yaml#L7
  2. Add env section to collector YAML with appropriate variables and substitution
  3. Add a configurations section to the kustomization with the respective file that enables vars to work with the OpenTelemetryCollector CRD at the appropriate field (example: https://github.com/aws-containers/retail-store-sample-app/blob/main/deploy/kubernetes/kustomize/components/opentelemetry/kustomization.yaml#L30)
  4. Update the content to reflect it no longer needs envsubst https://github.com/aws-samples/eks-workshop-v2/blob/main/website/docs/observability/container-insights/collect-metrics-adot-ci.md?plain=1#L44
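A rough sketch of what the kustomization could end up looking like, based on the steps above (file names, variable names and field paths are illustrative):

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - opentelemetrycollector.yaml
configMapGenerator:
  - name: otel-env
    envs:
      - config.properties
vars:
  - name: AMP_ENDPOINT                  # hypothetical variable name
    objref:
      apiVersion: v1
      kind: ConfigMap
      name: otel-env
    fieldref:
      fieldpath: data.AMP_ENDPOINT
configurations:
  - vars-config.yaml

# vars-config.yaml - lets $(AMP_ENDPOINT) be substituted inside the CRD field
varReference:
  - kind: OpenTelemetryCollector
    path: spec/config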

[Bug]: Karpenter nodes can't communicate with MNG nodes

What happened?

When deploying workloads on Karpenter nodes they can't communicate with MNG nodes because they don't share the same security groups.

  • For instance, the order workload can't be deployed on Karpenter as it will crash
  • We can't use HPA on Karpenter as metrics-server can't reach the Karpenter metrics endpoints
  • Pods on Karpenter can't reach CoreDNS pods on MNG

What did you expect to happen?

Workloads to work on Karpenter nodes the same way they work on MNG nodes.

How can we reproduce it?

Deploy workloads on Karpenter-provisioned nodes as described above.

Anything else we need to know?

I suggest using the same cluster security group for Karpenter nodes as the one used for the MNG nodes.
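A rough sketch of how the suggestion could be applied in the Karpenter AWSNodeTemplate, selecting the EKS cluster security group by tag (the tag keys and cluster name are assumptions):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: eks-workshop          # assumption: discovery tag on the private subnets
  securityGroupSelector:
    # Select the cluster security group shared with the MNG nodes so Karpenter-launched
    # nodes can reach CoreDNS, metrics-server, etc.
    "aws:eks:cluster-name": eks-workshop          # assumption: tag on the EKS cluster security group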

EKS version

Does not matter.

AWS EKS Addon - Service Mesh deployment

What would you like to be added?

This request is to add the following content: a module demonstrating the ability to deploy a service mesh in an EKS cluster with the AWS EKS add-on offering. The material contains multiple easy-to-consume examples of applying service mesh principles to EKS-hosted workloads.

Why is this needed?

Customers are increasingly willing to use AWS technologies in conjunction with service mesh support.

The requested content would help users (a) get familiar with the AWS add-on process, (b) get up to speed with basic service mesh use cases, and (c) start automating the process of deploying service-mesh-enabled applications in EKS clusters.

[Bug]: [Security] [Sealed Secrets] Catalog secret has only 2 vars, not 4

What happened?

In the workshop:

kubectl -n catalog get deployment catalog -o yaml | yq '.spec.template.spec.containers[] | .env'
- name: DB_USER
  valueFrom:
    secretKeyRef:
      key: username
      name: catalog-db
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      key: password
      name: catalog-db
- name: DB_NAME
  valueFrom:
    configMapKeyRef:
      key: name
      name: catalog
- name: DB_READ_ENDPOINT
  valueFrom:
    secretKeyRef:
      key: endpoint
      name: catalog-db
- name: DB_ENDPOINT
  valueFrom:
    secretKeyRef:
      key: endpoint
      name: catalog-db

and in reality:

- name: DB_USER
  valueFrom:
    secretKeyRef:
      key: username
      name: catalog-db
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      key: password
      name: catalog-db

What did you expect to happen?

So some commands will fail, like:

kubectl -n catalog get secrets catalog-db --template {{.data.endpoint}} | base64 -d
catalog-mysql:3306%
~
$
kubectl -n catalog get secrets catalog-db --template {{.data.name}} | base64 -d
catalog%

as they don't exist.

How can we reproduce it?

Just deploy the catalog service.

Anything else we need to know?

No response

EKS version

1.23

Instructions not clear in /docs/observability/kubecost/costallocation.md

What would you like to be added?

Finish the sentence in "The application we installed in the Introduction section created several of these components. These components also have .... Next, we'll drill into the costs of this application by using these dimensions."

Why is this needed?

To clarify what the components have

observability module

The module will add the observability component for the EKS Workshop. This issue will focus on using ADOT, Amazon Managed Service for Prometheus, and Amazon Managed Grafana.

[Bug]: Security Groups prevent nodes from joining cluster

What happened?

When doing a terraform apply, the node groups fail to join the cluster.

Looking in detail at how this is set up:

The security group on the control plane ENIs allows outbound ports 12050 and 443 to the security group "eks-workshop-node".

But the node ENIs have a different security group attached, "eks-workshop-cluster". Is this correct?
I don't see how the k8s endpoint can communicate with the nodes.

When I added the security group "eks-workshop-node" to the nodegroup instances and reran /etc/eks/bootstrap.sh, they then all joined the cluster OK.

What did you expect to happen?

nodes to join the cluster without issues

How can we reproduce it?

For me, from a Cloud9 with admin privileges and temporary credentials turned off, just run:

terraform init
terraform apply --auto-approve

eventually I saw:

│ Error: error waiting for EKS Node Group (eks-workshop-awsandy:managed-ondemand-20230210131156708800000024) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ * i-05779928e286f360e, i-07f95eb10d9b1e0b5: NodeCreationFailure: Instances failed to join the kubernetes cluster



│ with module.cluster.module.eks_blueprints.module.aws_eks_managed_node_groups["mg_5"].aws_eks_node_group.managed_ng,
│ on .terraform/modules/cluster.eks_blueprints/modules/aws-eks-managed-node-groups/main.tf line 1, in resource "aws_eks_node_group" "managed_ng":
│ 1: resource "aws_eks_node_group" "managed_ng" {



│ Error: error waiting for EKS Node Group (eks-workshop-awsandy:managed-system-20230210131156925400000027) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 1 error occurred:
│ * i-045f2dee415c0dd90: NodeCreationFailure: Instances failed to join the kubernetes cluster



│ with module.cluster.module.eks_blueprints.module.aws_eks_managed_node_groups["system"].aws_eks_node_group.managed_ng,
│ on .terraform/modules/cluster.eks_blueprints/modules/aws-eks-managed-node-groups/main.tf line 1, in resource "aws_eks_node_group" "managed_ng":
│ 1: resource "aws_eks_node_group" "managed_ng" {

Anything else we need to know?

No response

EKS version

1.23 - workshop code default

[Bug]: When terraform apply fails at certain points, destroy will fail too.

What happened?

I attempted to deploy an EKS cluster with the provided template to my region ca-central-1. It first failed with a Fargate error for a specific availability zone:

╷
│ Error: creating Prometheus Workspace: RequestError: send request failed
│ caused by: Post "https://aps.ca-central-1.amazonaws.com/workspaces": dial tcp: lookup aps.ca-central-1.amazonaws.com on 172.16.153.81:53: no such host
│
│   with module.cluster.aws_prometheus_workspace.this,
│   on modules/cluster/amp.tf line 1, in resource "aws_prometheus_workspace" "this":
│    1: resource "aws_prometheus_workspace" "this" {
│
╵
╷
│ Error: creating EKS Fargate Profile (eks-workshop:checkout-profile): InvalidParameterException: Fargate Profile creation for the Availability Zone ca-central-1d for Subnet subnet-01854b3cde7463031 is not supported
│ {
│   RespMetadata: {
│     StatusCode: 400,
│     RequestID: "aaac7cb7-e02a-4c7e-b950-6822aebd7107"
│   },
│   Message_: "Fargate Profile creation for the Availability Zone ca-central-1d for Subnet subnet-01854b3cde7463031 is not supported"
│ }
│
│   with module.cluster.module.eks_blueprints.module.aws_eks_fargate_profiles["checkout_profile"].aws_eks_fargate_profile.eks_fargate,
│   on .terraform/modules/cluster.eks_blueprints/modules/aws-eks-fargate-profiles/main.tf line 5, in resource "aws_eks_fargate_profile" "eks_fargate":
│    5: resource "aws_eks_fargate_profile" "eks_fargate" {
│
╵

Acknowledging this as a limitation, I attempted to destroy in order to change region. However, during destroy, Terraform failed with another error:

╷
│ Error: Unsupported attribute
│
│   on ide.tf line 20, in locals:
│   20: AMP_ENDPOINT=${module.cluster.amp_endpoint}
│     ├────────────────
│     │ module.cluster is object with 31 attributes
│
│ This object does not have an attribute named "amp_endpoint".

It appears that my first failure was at a point where the creation of AMP was not done. But when I tried to destroy, it expects the value of AMP_ENDPOINT, which should be null-checked.

What did you expect to happen?

If Terraform apply fails, we should be able to destroy without failure, no matter at which point the apply failure occurred.

How can we reproduce it?

Set the default region to ca-central-1, and the template will use a zone that does not support Fargate profile creation.

Anything else we need to know?

No response

EKS version

code: 1fbb013
EKS: 1.23

Managed Node Group AMI update cordon/drain

What would you like to be added/changed?

The Managed Node Group AMI update lab should demonstrate cordon/drain to show Pods are gracefully rescheduled by MNG as part of the process.

This will require scaling up all of the components in order to allow for "zero downtime" demonstration. A load test can be run to show effects on user traffic.
