cloudbees / terraform-aws-cloudbees-ci-eks-addon
CloudBees CI Add-on for AWS EKS
Home Page: https://registry.terraform.io/modules/cloudbees/cloudbees-ci-eks-addon/aws
License: MIT License
Ticket created asking for help: DOCS-9778
I reproduced it in the upstream code and then reported it upstream.
Based on a feedback session with the engineering team (@jtnord), the getting-started best practice should also use gp3
as the default storage class.
Add the modern dashboards from https://github.com/dotdc/grafana-dashboards-kubernetes to explore Node Exporter metrics.
Add the mentioned dashboards to the Grafana Helm chart values.
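A hedged sketch of what the values change could look like, following the Grafana Helm chart's dashboardProviders/dashboards convention. The nesting under grafana: assumes Grafana is deployed via kube-prometheus-stack, and the dashboard file name in the dotdc repo is an assumption:

```yaml
# Sketch only: load one dotdc dashboard from its raw GitHub URL.
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          folder: Kubernetes
          type: file
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      k8s-views-nodes:
        url: https://raw.githubusercontent.com/dotdc/grafana-dashboards-kubernetes/master/dashboards/k8s-views-nodes.json
```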
Add full links in the main README so references resolve correctly from the Terraform Registry to the GitHub repository.
Include a disaster recovery blueprint here:
Integrate the content from https://github.com/cloudbees-oss/cbci-eks-dr-demo as a new blueprint in this repository. Terraform all of the code and add it to this workflow.
Add shared libraries in the monorepo, including Pipeline Template Catalogs, like in https://github.com/cloudbees-oss/pipeline-home-demo
Migrate documentation and architectural charts from https://aws.amazon.com/quickstart/architecture/cloudbees-ci/ to this repository.
Charts will use https://app.diagrams.net/, so existing diagrams need to be migrated to that format.
Ideas
1/ Terragrunt comes with extra features that help solve transient errors during pipeline execution, like:
It requires adding the tool to the Docker image, for example:
ENV TG_VERSION=0.55.1 \
ARCH=amd64
RUN curl -sLO https://github.com/gruntwork-io/terragrunt/releases/download/v${TG_VERSION}/terragrunt_linux_${ARCH} && \
mv terragrunt_linux_${ARCH} /usr/bin/terragrunt && \
chmod +x /usr/bin/terragrunt
It also requires restructuring the code using terragrunt.hcl
(see QuickStart) ==> it seems very time-costly.
2/ A Bash approach like the one explained in this article, but it does not seem compatible with Terraform.
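For reference, a minimal retry wrapper in plain sh. This is a sketch only: the function name, attempt count, and backoff are placeholders, and whether it interacts safely with Terraform state locking is untested.

```shell
#!/bin/sh
# Sketch: retry a command a fixed number of times to ride out transient errors.
# In the pipeline the payload would be e.g. "terraform apply -auto-approve".
retry() { # usage: retry <max_attempts> <command> [args...]
  max=$1; shift
  n=1
  until "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $n attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep 1 # crude fixed backoff
  done
}
# Example: retry 3 terraform plan -input=false
```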
Using the following configuration inside the EKS Blueprints Addons module:
helm_releases = {
  helm-openldap = {
    namespace        = "openldap-stack-ha"
    create_namespace = true
    chart            = "openldap-stack-ha"
    chart_version    = "4.2.2"
    repository       = "https://jp-gouin.github.io/helm-openldap/"
    values           = [file("k8s/helm-openldap-values.yml")]
  }
}
Values files
Testing LDAP validation at the operations center fails with:
Login
Authentication: failed for user "Jean Dupond"
Lookup
User lookup: user "Jean Dupond" may or may not exist.
Does the Manager DN have permissions to perform user lookup?
LDAP Group lookup: could not verify.
Please try with a user that is a member of at least one LDAP group.
Lockout
The user "Jean Dupond" will be unable to login with the supplied password.
If this is your own account this would mean you would be locked out!
Are you sure you want to save this configuration?
Advanced Configuration
Validate the log recorder with:
aws logs describe-log-streams --log-group-name /aws/containerinsights/cbci-bp02-eks/application --order-by LastEventTime --region us-east-1 --no-descending --query 'logStreams[?creationTime > `xxxxxxxxxx`]' | jq .
Ref: aws/aws-cli#4227 (comment)
Terraform implementation
Replace credentials by configuring CloudBees OIDC with AWS.
See the CloudBees internal doc: https://docs.google.com/document/d/1ZHKQbpMCXVqxUH4go6E4VLaUdnJsoIHa1xcGkAOMZvA/edit
For backup, include this section (it is enabled by default, but it is interesting to know): https://github.com/terraform-aws-modules/terraform-aws-efs/blob/v1.3.1/examples/complete/main.tf#L97-L98 until we have a best practice defined per #39
Run the blueprint pipelines ONLY IF .tf files were modified:
modified_tf_files=$(git show --name-only --oneline HEAD | tail -n +2 | grep '\.tf$')
if [ -n "$modified_tf_files" ]; then
  # Run the terraform phase
fi
It requires distinguishing where the modification happened: at the root, bp01, or bp02.
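The distinction could be sketched with a small path classifier. The blueprint directory names below are assumptions, not the repo's actual layout:

```shell
#!/bin/sh
# Sketch: map a changed .tf file path to the pipeline stage that should run.
# The directory names below are assumptions, not the repo's actual layout.
classify() {
  case "$1" in
    blueprints/01-*/*) echo "bp01" ;;
    blueprints/02-*/*) echo "bp02" ;;
    *)                 echo "root" ;;
  esac
}
# The changed-file list would come from:
#   git show --name-only --oneline HEAD | tail -n +2 | grep '\.tf$'
classify "blueprints/01-getting-started/main.tf" # -> bp01
classify "variables.tf"                          # -> root
```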
No other alternatives yet
No additional context
I would like to use the Node Termination Handler for the at-scale blueprint, but I am getting the following error:
│ Error: creating IAM Policy (aws-node-termination-handler-20231218170201623700000007): MalformedPolicyDocument: Policy statement must contain resources.
│ status code: 400, request id: cc751055-d447-4206-9046-0afc3546c91c
│
│ with module.eks_blueprints_addons.module.aws_node_termination_handler.aws_iam_policy.this[0],
│ on .terraform/modules/eks_blueprints_addons.aws_node_termination_handler/main.tf line 242, in resource "aws_iam_policy" "this":
│ 242: resource "aws_iam_policy" "this" {
│
╵
The pod is in RUNNING status, and I can see the following in the logs:
2023/12/18 17:07:50 WRN There was a problem monitoring for events error="AccessDenied: User: arn:aws:sts::324005994172:assumed-role/aws-node-termination-handler-20231218170135041900000006/1702918930359930054 is not authorized to perform: sqs:receivemessage on resource: arn:aws:sqs:us-east-1:324005994172:aws-nth-cbci-bp02-i318-eks because no identity-based policy allows the sqs:receivemessage action\n\tstatus code: 403, request id: 8b956325-4f0e-5082-8761-3edf31a835b4" event_type=SQS_MONITOR
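Separately from the MalformedPolicyDocument error above, the 403 in the log suggests the role is missing SQS read permissions. A hedged sketch of the missing statement — the queue ARN is copied from the log, and the data-source name is a placeholder:

```hcl
# Sketch only: grant the handler's role the SQS read actions it is denied.
data "aws_iam_policy_document" "nth_sqs_read" {
  statement {
    actions = [
      "sqs:ReceiveMessage",
      "sqs:DeleteMessage",
    ]
    resources = ["arn:aws:sqs:us-east-1:324005994172:aws-nth-cbci-bp02-i318-eks"]
  }
}
```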
From time to time, the terraform command fails with:
Error: creating KMS Alias (alias/eks/cbci-bp01-ci-v2-eks): AlreadyExistsException: An alias with the name arn:aws:kms:us-east-1:324005994172:alias/eks/cbci-bp01-ci-v2-eks already exists
with module.eks.module.kms.aws_kms_alias.this["cluster"],
on .terraform/modules/eks.kms/main.tf line 255, in resource "aws_kms_alias" "this":
255: resource "aws_kms_alias" "this" {
It only happens in the CI pipeline, which uses S3 as the backend.
If your request is for a new feature, please use the Feature request template.
Before you submit an issue, please perform the following first:
Remove the .terraform directory (ONLY if state is stored remotely, which hopefully you are doing as a best practice): rm -rf .terraform/
Re-run terraform init
Module version [Required]:
Terraform version:
Steps to reproduce the behavior:
It is similar to what is explained in https://stackoverflow.com/questions/62654684/terraform-alreadyexistsexception-an-alias-with-the-name-arnawskmsxxxxxxxxxx. Two hypotheses for this behaviour:
After discussions with @Vlatombe, it seems a better approach to use Pod Identity
than the current instance-profile configuration (obsolete). Ref:
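A minimal sketch of such wiring, assuming the AWS provider's aws_eks_pod_identity_association resource; the namespace, service account, and role names are placeholders:

```hcl
# Sketch only: bind an IAM role to a service account with EKS Pod Identity.
resource "aws_eks_pod_identity_association" "cbci" {
  cluster_name    = module.eks.cluster_name
  namespace       = "cbci" # placeholder
  service_account = "cjoc" # placeholder
  role_arn        = aws_iam_role.cbci.arn
}
```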
Please provide a clear and concise description of the issue you are encountering, and a reproduction of your configuration. The reproduction MUST be executable by running terraform init && terraform apply
without any further changes.
While upgrading to 3.17108.0, using K8s version 1.28 for Blueprint 02:
The cluster comes up with no issues.
Node groups using Graviton do not complete the creation process: they pick up an AMI type for 1.27,
and the operation never finishes, it keeps waiting indefinitely.
The EKS Terraform module supports Windows managed node groups.
CloudBees supports builds on Windows nodes inside a K8s cluster:
More examples at: https://github.com/pipeline-demo-caternberg/pipeline-parallel-windows-linux/
JNLP images:
When the HA controller tries to deploy an agent onto a Graviton instance, it fails with "Error to download image".
A cron action that checks for CloudBees CI Helm chart updates, for example:
helm search repo cloudbees/cloudbees-core --output json | jq -r '.[0].version'
and automatically creates a new PR.
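A sketch of the check itself, with a hypothetical pinned version and a fallback that keeps it runnable when helm or jq are unavailable; actually opening the PR (e.g. via a bot or gh) is left out:

```shell
#!/bin/sh
# Sketch: compare the latest published chart version with the pinned one.
# The pinned value is a placeholder; the fallback keeps this runnable offline.
pinned="3.15.1"
latest_version() {
  v=$(helm search repo cloudbees/cloudbees-core --output json 2>/dev/null \
    | jq -r '.[0].version' 2>/dev/null)
  # Fall back to the pinned version if helm/jq are missing or returned nothing.
  [ -n "$v" ] && echo "$v" || echo "$pinned"
}
needs_bump() { # usage: needs_bump <pinned> <latest>
  [ "$1" != "$2" ]
}
if needs_bump "$pinned" "$(latest_version)"; then
  echo "chart update available: open a PR"
fi
```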
Install https://plugins.jenkins.io/cloudbees-jenkins-advisor/ as an at-scale plugin.
Use Kubernetes Secrets to configure the plugin, using the same email as the license activation.
It will increase the telemetry data on usage.
The current documented simple example shows the variables hostname and temp_license:
module "eks_blueprints_addon_cbci" {
  source   = "REPLACE_ME"
  hostname = "example.domain.com"
  cert_arn = "arn:aws:acm:us-east-1:0000000:certificate/0000000-aaaa-bbb-ccc-thisIsAnExample"
  temp_license = {
    first_name = "Foo"
    last_name  = "Bar"
    email      = "[email protected]"
    company    = "Acme Inc."
  }
}
But the variable names are actually hosted_zone and trial_license.
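Presumably the documented snippet should therefore read as follows — a sketch using the corrected variable names, with values taken from the original example:

```hcl
module "eks_blueprints_addon_cbci" {
  source      = "REPLACE_ME"
  hosted_zone = "example.domain.com"
  cert_arn    = "arn:aws:acm:us-east-1:0000000:certificate/0000000-aaaa-bbb-ccc-thisIsAnExample"
  trial_license = {
    first_name = "Foo"
    last_name  = "Bar"
    email      = "[email protected]"
    company    = "Acme Inc."
  }
}
```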
Could possibly be a race condition
╷
│ Warning: EC2 Default Network ACL (acl-0bb1c751ce6d3b468) not deleted, removing from state
│
╵
╷
│ Error: deleting EC2 Subnet (subnet-0d21d5729923852e9): DependencyViolation: The subnet 'subnet-0d21d5729923852e9' has dependencies and cannot be deleted.
│ status code: 400, request id: f381ee78-69dc-4c57-940b-f3aac1ad945f
│
╵
╷
│ Error: deleting EC2 Subnet (subnet-03d24ee9f4bd6e5be): DependencyViolation: The subnet 'subnet-03d24ee9f4bd6e5be' has dependencies and cannot be deleted.
│ status code: 400, request id: 8dbec214-976f-417e-ae73-eeca60abac76
│
╵
╷
│ Error: deleting EC2 Subnet (subnet-08b859259dae46484): DependencyViolation: The subnet 'subnet-08b859259dae46484' has dependencies and cannot be deleted.
│ status code: 400, request id: 349641d0-3c79-431a-9171-673fc1ed45c5
│
╵
╷
│ Error: deleting EC2 Internet Gateway (igw-0e686cb558c91368b): detaching EC2 Internet Gateway (igw-0e686cb558c91368b) from VPC (vpc-018c607b6927cd288): DependencyViolation: Network vpc-018c607b6927cd288 has some mapped public address(es). Please unmap those public address(es) before detaching the gateway.
│ status code: 400, request id: e9835b5c-ea6e-4d94-a0c5-1771e8820b47
│
╵
╷
│ Error: uninstallation completed with 1 error(s): context deadline exceeded
│
╵
After deleting the load balancer, tf destroy fails with:
╷
│ Error: deleting EC2 VPC (vpc-018c607b6927cd288): operation error EC2: DeleteVpc, https response error StatusCode: 400, RequestID: c73aca26-db62-4cb2-bd82-ee1cda52a579, api error DependencyViolation: The vpc 'vpc-018c607b6927cd288' has dependencies and cannot be deleted.
╵
The dependencies are two security groups:
After those are deleted, tf destroy worked.
https://www.cloudbees.com/products/saas-platform
Example: https://github.com/cloudbees/cloudbees-terraform-example/blob/main/.cloudbees/workflows/workflow.yaml#L23-L39
Running inside the cloudbees organization because of two things:
Combining in the same EKS cluster (example)
Fargate References:
https://docs.cloudbees.com/docs/cloudbees-ci/latest/backup-restore/velero-dr
Spot architecture best practices per https://aws.amazon.com/blogs/compute/cost-optimization-and-resilience-eks-with-spot-instances/. It is applied in the Jenkins infra here: https://github.com/jenkins-infra/aws/blob/main/cik8s-cluster.tf
See also EKS Spot Best Practice: https://repost.aws/knowledge-center/eks-spot-instance-best-practices
Include issue templates similar to https://github.com/aws-ia/terraform-aws-eks-blueprints-addons/tree/main/.github/ISSUE_TEMPLATE
For bugs related to upstream add-ons, make sure reporters validate that the feature works in the blueprints or in a test case.
CloudBees Grafana dashboards like these examples:
The template for filtering by controllers does not work as it used to.
enable_amazon_eks_aws_ebs_csi_driver = true
amazon_eks_aws_ebs_csi_driver_config = {
  configuration_values = jsonencode(
    {
      controller = {
        extraVolumeTags = {
          cb-environment = "demo"
          cb-owner       = "devops-consultants"
          cb-user        = "${local.derived_user}"
        }
      }
    }
  )
}
The following selection starts returning no records:
eval $(terraform output --raw aws_logstreams_fluentbit) | jq '.[] | select(.logStreamName | contains("jenkins"))'
jq: error (at :5894): Cannot index array with string "logStreamName"
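That error usually means jq received an array of arrays, so `.[]` yields inner arrays instead of objects. A hedged sketch of a fix, flattening first — the real pipeline would keep `eval $(terraform output --raw aws_logstreams_fluentbit)` as the producer; the sample JSON below is invented for illustration:

```shell
#!/bin/sh
# Sketch: flatten the nested array before selecting on logStreamName.
command -v jq >/dev/null 2>&1 || exit 0 # skip where jq is unavailable
printf '%s' '[[{"logStreamName":"jenkins-a"}],[{"logStreamName":"other"}]]' \
  | jq 'flatten | .[] | select(.logStreamName | contains("jenkins"))'
```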
Create an external KMS key (like this) and use it for encryption.
Remove continue-on-error: true for terraform fmt.
There are two options to prevent a possible node-affinity conflict during controller restarts when using EBS volumes: make volumes topology-aware so they stay in the same AZ, or design Auto Scaling groups following the AWS article Creating Kubernetes Auto Scaling Groups for Multiple Availability Zones (one ASG per AZ for EBS volumes, and a single ASG across multiple AZs for EFS volumes). At the moment of publishing these blueprints, terraform-aws-modules/eks/aws
does not support the availability_zones attribute for the embedded aws_autoscaling_group
resource, so the first option is applied in the gp3 storage class.
Setting only the gp3 topology storage class is not enough: it is also required to pin the Pod to a single AZ.
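A minimal sketch of the first option, assuming the Terraform Kubernetes provider; the StorageClass name and the AZ are placeholders:

```hcl
# Sketch only: a topology-aware gp3 StorageClass pinned to a single AZ, so
# EBS-backed pods are always scheduled into the same zone as their volume.
resource "kubernetes_storage_class_v1" "gp3_single_az" {
  metadata {
    name = "gp3"
  }
  storage_provisioner = "ebs.csi.aws.com"
  volume_binding_mode = "WaitForFirstConsumer"
  parameters = {
    type = "gp3"
  }
  allowed_topologies {
    match_label_expressions {
      key    = "topology.ebs.csi.aws.com/zone"
      values = ["us-east-1a"] # placeholder AZ
    }
  }
}
```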
Currently, there is no EKS best-practice recommendation for EFS PVs, but there is one for EBS (Velero).
EFS backup/restore based on AWS Backup (tutorial) is not prepared for EKS (the restore happens on a different drive).
See context in this Slack thread.
On the first apply, I encountered this error:
╷
│ Error: Invalid for_each argument
│
│ on .terraform/modules/eks/main.tf line 97, in resource "aws_ec2_tag" "cluster_primary_security_group":
│ 97: for_each = { for k, v in merge(var.tags, var.cluster_tags) :
│ 98: k => v if local.create && k != "Name" && var.create_cluster_primary_security_group_tags && v != null
│ 99: }
│ ├────────────────
│ │ local.create is true
│ │ var.cluster_tags is empty map of string
│ │ var.create_cluster_primary_security_group_tags is true
│ │ var.tags is map of string with 2 elements
│
│ The "for_each" map includes keys derived from resource attributes that cannot be determined until apply, and so Terraform cannot determine the full set of keys that will identify the instances of this resource.
│
│ When working with unknown values in for_each, it's better to define the map keys statically in your configuration and place apply-time results only in the map values.
│
│ Alternatively, you could use the -target planning option to first apply only the resources that the for_each value depends on, and then apply a second time to fully converge.
╵
I noticed that the tags key-value collection was taken from locals:
locals {
  name = "cbci-bp01-i${random_integer.ramdom_id.result}"
  ...
  tags = merge(var.tags, {
    "tf:blueprint"  = local.name
    "tf:repository" = "github.com/cloudbees/terraform-aws-cloudbees-ci-eks-addon"
  })
}
During the first apply, the value of random_integer.ramdom_id.result
is not yet known, which triggers the error above.
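One hedged way to avoid this, following the error message's own advice: keep apply-time results out of anything that can feed for_each, e.g. by deriving the suffix from a plan-time variable instead of random_integer (the variable name and default are placeholders). Alternatively, `terraform apply -target=random_integer.ramdom_id` first, then a full apply.

```hcl
# Sketch only: a plan-time suffix keeps every tag value known during plan.
variable "suffix" {
  description = "Deterministic blueprint suffix (placeholder for random_integer)"
  type        = string
  default     = "001"
}

locals {
  name = "cbci-bp01-i${var.suffix}"
  tags = merge(var.tags, {
    "tf:blueprint"  = local.name
    "tf:repository" = "github.com/cloudbees/terraform-aws-cloudbees-ci-eks-addon"
  })
}
```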
What would be used for EFS?