managedkube / kubernetes-ops
Running Kubernetes in production
License: Apache License 2.0
How to monitor and alert on k8s audit logs?
Simple example where policy docs can be defined in the repo and a role created/applied for specified pods to access an S3 bucket. Similar to the EFS example: https://github.com/ManagedKube/kubernetes-ops/blob/main/terraform-modules/aws/eks-efs-csi-driver/main.tf#L45
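A rough sketch of the shape this could take, mirroring the EFS example linked above (the bucket name, namespace, and service account below are placeholders, and the exact inputs would come from the module's variables):

```hcl
# Policy document defined in the repo; the bucket ARN is a placeholder.
data "aws_iam_policy_document" "s3_access" {
  statement {
    actions = ["s3:ListBucket", "s3:GetObject", "s3:PutObject"]
    resources = [
      "arn:aws:s3:::my-app-bucket",
      "arn:aws:s3:::my-app-bucket/*",
    ]
  }
}

resource "aws_iam_policy" "s3_access" {
  name   = "eks-pod-s3-access"
  policy = data.aws_iam_policy_document.s3_access.json
}

# IRSA role bound to a specific service account, so only the specified
# pods can assume it and reach the bucket.
module "iam_assumable_role_s3" {
  source           = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  create_role      = true
  role_name        = "eks-pod-s3-access"
  provider_url     = replace(var.eks_cluster_oidc_issuer_url, "https://", "")
  role_policy_arns = [aws_iam_policy.s3_access.arn]
  oidc_fully_qualified_subjects = ["system:serviceaccount:my-namespace:my-service-account"]
}
```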
Hi, I tried to upgrade EKS Kubernetes to 1.22. The autoscaler and cert-manager were timing out, but after updating the aws and helm Terraform providers to their latest versions and changing the API version from alpha to beta, they started working:
api_version = "client.authentication.k8s.io/v1beta1"
Now I'm stuck with the NGINX ingress, possibly because of the announced ingress-nginx changes:
apiVersion: networking.k8s.io/v1
Here is the error I receive:
module.ingress-nginx-external.helm_release.helm_chart: Destroying... [id=ingress-nginx]
module.ingress-nginx-external.helm_release.helm_chart: Still destroying... [id=ingress-nginx, 10s elapsed]
...
module.ingress-nginx-external.helm_release.helm_chart: Still destroying... [id=ingress-nginx, 1m20s elapsed]
module.ingress-nginx-external.helm_release.helm_chart: Destruction complete after 1m28s
module.ingress-nginx-external.helm_release.helm_chart: Creating...
module.ingress-nginx-external.helm_release.helm_chart: Still creating... [10s elapsed]
...
module.ingress-nginx-external.helm_release.helm_chart: Still creating... [5m30s elapsed]
╷
│ Warning: Helm release "ingress-nginx" was created but has a failed status. Use the `helm` command to investigate the error, correct it, then run Terraform again.
│
│ with module.ingress-nginx-external.helm_release.helm_chart,
│ on .terraform/modules/ingress-nginx-external/terraform-modules/aws/helm/helm_generic/main.tf line 1, in resource "helm_release" "helm_chart":
│ 1: resource "helm_release" "helm_chart" {
│
╵
Any suggestions? If it is just a matter of changing the API version, how can I do that?
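If the failure is due to the Ingress API moving to networking.k8s.io/v1, the usual remedy is to move to a 4.x release of the ingress-nginx chart, which targets the v1 API and supports Kubernetes 1.22. A sketch only — the variable name helm_version is an assumption based on the helm_generic module in this repo, and the exact chart version should be checked against the ingress-nginx release notes:

```hcl
# Assumption: the ingress-nginx module passes this through to the
# helm_release "version" argument, as the helm_generic module does.
module "ingress-nginx-external" {
  source       = "..."   # unchanged from your existing config
  helm_version = "4.1.4" # a 4.x chart release; 3.x charts predate the v1 Ingress API
  # ...other existing inputs unchanged...
}
```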
Make it easy to see how to set up ingress using:
external ELB
internal ELB
external NLB
internal NLB
external/internal ALB
Graviton instances are based on the 64-bit ARM architecture and offer a great price/performance ratio.
I tried adding a new node group (ng2) for Graviton instances:
node_groups = {
ng1 = {
disk_size = 20
desired_capacity = 2
max_capacity = 4
min_capacity = 1
instance_types = ["t3.small"]
capacity_type = "SPOT"
additional_tags = var.tags
k8s_labels = {}
}
ng2 = {
disk_size = 20
desired_capacity = 1
max_capacity = 4
min_capacity = 1
instance_types = ["t4g.small"]
capacity_type = "SPOT"
additional_tags = var.tags
k8s_labels = {}
}
}
Applying the Terraform code results in an error. The error message shows it tries to use the x86 Amazon Linux 2 AMI, which is not valid, since t4g instances need the ARM64 AMI:
│ Error: error creating EKS Node Group (staging:staging-ng2-enhanced-grubworm): InvalidParameterException: [t4g.small] is not a valid instance type for requested amiType AL2_x86_64
│ {
│ RespMetadata: {
│ StatusCode: 400,
│ RequestID: "73318df5-e6c3-4e1e-ad3b-7b209bc182f6"
│ },
│ ClusterName: "staging",
│ Message_: "[t4g.small] is not a valid instance type for requested amiType AL2_x86_64",
│ NodegroupName: "staging-ng2-enhanced-grubworm"
│ }
│
│ with module.eks.module.eks.module.node_groups.aws_eks_node_group.workers["ng2"],
│ on .terraform/modules/eks.eks/modules/node_groups/node_groups.tf line 1, in resource "aws_eks_node_group" "workers":
│ 1: resource "aws_eks_node_group" "workers" {
│
Thank you!
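A likely fix, assuming the module passes ami_type through per node group (the upstream terraform-aws-eks node_groups submodule does): request the ARM64 AMI for ng2 explicitly.

```hcl
ng2 = {
  disk_size        = 20
  desired_capacity = 1
  max_capacity     = 4
  min_capacity     = 1
  instance_types   = ["t4g.small"]
  capacity_type    = "SPOT"
  ami_type         = "AL2_ARM_64" # Graviton (ARM64) AMI instead of the AL2_x86_64 default
  additional_tags  = var.tags
  k8s_labels       = {}
}
```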
Turn this into a module for AWS so it can get the ACM cert. Make this optional for people who want to use ACM instead of cert-manager.
Could be interesting to use this for ssh access into private VM instances
https://cloud.google.com/solutions/building-internet-connectivity-for-private-vms
Can we make these clusters and processes SOC 2 compliant from the start, and produce evidence that the infrastructure and process are compliant?
A guide to being compliant:
https://pages.datree.io/hubfs/SOC2-compliance-Git-guide-Datree.pdf
Are there open source tools out there to help us here?
Components like Prometheus do not ship with authentication. We can use oauth2-proxy for authentication:
https://github.com/helm/charts/tree/master/stable/oauth2-proxy
How to monitor and dashboard VPC flow logs?
Create a module to run containers on Fargate within the cluster
This project is very interesting in that it has a bunch of GitHub Actions for TF:
https://github.com/dflook/terraform-github-actions
The linter apparently posts a comment at the line number that had the offense: https://github.com/dflook/terraform-github-actions#linting
https://github.com/dflook/terraform-github-actions#checking-for-drift
Add a multi_az variable to the module so that it can be false for dev environments.
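A sketch of what that could look like (the variable and local names here are assumptions; the module's actual AZ input may differ):

```hcl
# Hypothetical variable for the VPC module.
variable "multi_az" {
  description = "Deploy subnets/NAT across multiple AZs; set to false for dev environments"
  type        = bool
  default     = true
}

variable "availability_zones" {
  description = "AZs available to this environment"
  type        = list(string)
}

# When multi_az is false, collapse down to a single AZ to save cost.
locals {
  effective_azs = var.multi_az ? var.availability_zones : [var.availability_zones[0]]
}
```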
This is pretty cool. Should set this up.
We should make these clusters secure by default and have reasonable security measures in place from the start.
For example, this analysis is a good start on what we should do and enable: https://blog.cloudsploit.com/a-technical-analysis-of-the-capital-one-hack-a9b43d7c8aea
For AWS:
Move this module out to its own repository: https://github.com/ManagedKube/kubernetes-ops/tree/main/terraform-modules/aws/vpc
This will allow us to more cleanly update this module's lifecycle and provide clear releases for it.
Todo
This looks like a very robust and flexible logging-operator that can handle various scenarios, like routing the logs from each namespace to a different logging backend (ES, Loki, S3, etc.).
https://github.com/banzaicloud/logging-operator
Add this into the kubernetes-ops and operationalize it.
Show how to use dynamic secrets and create a production workflow for MySQL RDS.
https://github.com/falcosecurity/falco
k8s audit log config: https://github.com/falcosecurity/falco/tree/dev/examples/k8s_audit_config
Build Kibana dashboards from this data?
Send Prometheus remote_write data to ES
We can probably replicate SumoLogic's Kubernetes offering.
HashiCorp has released their version of the Helm chart. Is this better than the stable/vault-operator Helm chart by CoreOS?
https://github.com/hashicorp/vault-helm
We should investigate.
Has anyone had this issue on AWS EKS Kubernetes 1.23 in the staging environment?
kubernetes-ops/terraform-environments/aws/staging/helm/external-dns
kubernetes-ops/terraform-environments/aws/staging/helm/ingress-nginx-external
Elasticsearch has an operator for Kubernetes that looks very interesting.
https://www.elastic.co/guide/en/cloud-on-k8s/current/index.html
This seems to be a widely asked question that I get.
How do we get traffic into the cluster?
How do the containers inside find other containers like consul/redis/mysql?
Use draw.io to create a doc with diagrams explaining how traffic gets in. Also check in the draw.io XML file.
This would be interesting to look at if most people are on Github anyways.
We would like to add Digital Ocean to the kubernetes-ops so that it can be managed in the same way as AWS.
We would like the same "ease" of management for Digital Ocean.
We should use this to check for drift on a continuous basis: https://github.com/dflook/terraform-github-actions#checking-for-drift
Then it should update a markdown page with the status on every run.
This would be cool because then we know what the status of the cluster is.
Add the k8s spot rescheduler usage into kubernetes-ops. This can be a very good tool to use when spot instances are used.
From Craig
Wondering about installation order - it was easier for me to go
ingress-nginx
external-dns
cert-manager
THEN
kube-prometheus-stack
grafana-loki-stack
my-application
Keep the numbering scheme the same in the repos but make a note of the order, or just leave it alone? I changed the docs back to the original, but maybe we just change kube-prometheus-stack to 55 or 60 and grafana-loki to 56 or 61.
Enable ModSecurity by default
https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#lua-resty-waf
Version:
Terraform cli: 1.3.2
kubernetes-ops release: v2.0.47
module: eks
Issue:
When trying to run a plan, it errors out with error configuring Terraform AWS Provider: no valid credential sources for Terraform AWS Provider found.
I've updated my VPC to match the release versions, and plans for other modules in the same tenant are working (credentials are shared across, so it shouldn't be a credentials issue). I do see a call to the metadata API IP that is getting a connection refused. I don't think I have anything 'incorrect', but I can't seem to get it to work either.
Full Error message:
Terraform v1.3.2
on linux_amd64
Initializing plugins and modules...
data.terraform_remote_state.vpc: Reading...
data.terraform_remote_state.vpc: Read complete after 1s
╷
│ Warning: Redundant ignore_changes element
│
│ on .terraform/modules/eks.eks/main.tf line 305, in resource "aws_eks_addon" "this":
│ 305: resource "aws_eks_addon" "this" {
│
│ Adding an attribute name to ignore_changes tells Terraform to ignore future
│ changes to the argument in configuration after the object has been created,
│ retaining the value originally configured.
│
│ The attribute modified_at is decided by the provider alone and therefore
│ there can be no configured value to compare with. Including this attribute
│ in ignore_changes has no effect. Remove the attribute from ignore_changes
│ to quiet this warning.
╵
╷
│ Error: error configuring Terraform AWS Provider: no valid credential sources for Terraform AWS Provider found.
│
│ Please see https://registry.terraform.io/providers/hashicorp/aws
│ for more information about providing credentials.
│
│ Error: failed to refresh cached credentials, no EC2 IMDS role found, operation error ec2imds: GetMetadata, request send failed, Get "http://169.254.169.254/latest/meta-data/iam/security-credentials/": dial tcp 169.254.169.254:80: i/o timeout
│
│
│ with provider["registry.terraform.io/hashicorp/aws"],
│ on main.tf line 35, in provider "aws":
│ 35: provider "aws" {
│
╵
Operation failed: failed running terraform plan (exit 1)
This seems like a good tool for network sniffing in Kubernetes. We should investigate its usage and write a doc or link to this blog here.
For metrics-server, is there a different way it should be added to the cluster than just running
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
I'm wondering how the state of that is tracked in TF, if at all. This was needed for HPA to work properly when I deployed my app.
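One way to have Terraform track metrics-server state, rather than applying the raw manifest with kubectl, is a helm_release of the community metrics-server chart. A sketch only — it assumes a helm provider is already configured against the cluster, and uses the community chart repository rather than anything this repo ships:

```hcl
# Community metrics-server chart, managed by Terraform instead of
# `kubectl apply`, so drift and upgrades show up in plans.
resource "helm_release" "metrics_server" {
  name       = "metrics-server"
  repository = "https://kubernetes-sigs.github.io/metrics-server/"
  chart      = "metrics-server"
  namespace  = "kube-system"
}
```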
EKS autoscaler
config:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.37.0"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "2.5.0"
    }
  }
}

module "cluster-autoscaler" {
  source = "github.com/ManagedKube/kubernetes-ops//terraform-modules/aws/cluster-autoscaler?ref=v2.0.82"
}
error:
Error: Unsupported attribute
│
│ on main.tf line 58, in provider "helm":
│ 58: host = data.terraform_remote_state.eks.outputs.cluster_endpoint
│ ├────────────────
│ │ data.terraform_remote_state.eks.outputs is object with no attributes
│
│ This object does not have an attribute named "cluster_endpoint".
╵
╷
│ Error: Unsupported attribute
│
│ on main.tf line 59, in provider "helm":
│ 59: cluster_ca_certificate = base64decode(data.terraform_remote_state.eks.outputs.cluster_certificate_authority_data)
│ ├────────────────
│ │ data.terraform_remote_state.eks.outputs is object with no attributes
│
│ This object does not have an attribute named
│ "cluster_certificate_authority_data".
╵
╷
│ Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1"
│
│ with module.cluster-autoscaler.module.cluster-autoscaler.helm_release.helm_chart,
│ on .terraform/modules/cluster-autoscaler.cluster-autoscaler/terraform-modules/aws/helm/helm_generic/main.tf line 1, in resource "helm_release" "helm_chart":
│ 1: resource "helm_release" "helm_chart" {
│
╵
Operation failed: failed running terraform apply (exit 1)
kube-prometheus-stack
config:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.37.0"
    }
    random = {
      source = "hashicorp/random"
    }
    helm = {
      source  = "hashicorp/helm"
      version = "2.5.0"
    }
    kubectl = {
      source  = "gavinbunney/kubectl"
      version = ">= 1.7.0"
    }
  }
}

module "kube-prometheus-stack" {
  source = "github.com/ManagedKube/kubernetes-ops//terraform-modules/aws/helm/kube-prometheus-stack?ref=v2.0.82"
}
error:
Error: unable to build kubernetes objects from release manifest: unable to recognize "": no matches for kind "PodSecurityPolicy" in version "policy/v1beta1"
│
│ with module.kube-prometheus-stack.helm_release.helm_chart,
│ on .terraform/modules/kube-prometheus-stack/terraform-modules/aws/helm/kube-prometheus-stack/main.tf line 1, in resource "helm_release" "helm_chart":
│ 1: resource "helm_release" "helm_chart" {
│
╵
Operation failed: failed running terraform apply (exit 1)
It would be nice to have the cost of the TF plan/apply in the PRs.
We can do this via Infracost: https://github.com/infracost/infracost-gh-action
Can we do it with just their open source version, without having to sign up and get an API key?
An interesting way to secure an internal application that doesn't have its own auth, like Prometheus.
Do you already have something to copy containers from the dev ECR to stage, then stage to prod? Basically, the container gets built and tested in dev; once it's good, there's no need to touch it, just move the container into the other repos for deployment.
https://seethatgo.slack.com/archives/C023NPLHCJD/p1628008537040700
"Secrets and service discovery are always something I'm interested in seeing how people approach with containers/kubernetes. From what I can tell, it looks like you're using vault for secrets (love hashicorp) and standard kube services with selectors for discovery. All in all, pretty solid."
Security scanners flag a security group that can pass all inbound or outbound traffic as a hole: the default security group, even though it is not used, allows all traffic in and out of the VPC.
We should make sure the VPC module of this project creates a locked-down default security group by default.
vpc module: https://github.com/ManagedKube/kubernetes-ops/tree/main/terraform-modules/aws/vpc
On an apply, there should be a set of E2E tests that run to make sure everything still works.
https://github.com/ManagedKube/kubernetes-ops/actions/runs/1321582892
Something simple like hitting Grafana's URL is fine.
There will be times when there is a need to connect two clusters together, possibly for DR purposes. This looks like an interesting way of doing it:
Prototype out this operator. Looks to be very complete on ops aspects.
https://github.com/banzaicloud/kafka-operator/blob/master/README.md
In most situations Karpenter will give more optimal cluster autoscaling than CA. Source: https://towardsdev.com/karpenter-vs-cluster-autoscaler-dd877b91629b
The current AWS VPC Terraform is using version 0.11.x. Using the newer version gets us a good path forward and some more parameterization use cases with Terragrunt that will make using everything easier.
This seems like a good tool to simulate traffic:
https://github.com/BuoyantIO/bb
This can help with testing things out.
In the kube-prometheus-stack TF module - https://github.com/ManagedKube/kubernetes-ops/tree/main/terraform-modules/aws/helm/kube-prometheus-stack
There is a syntax issue that causes Terraform to fail if the module is included as a dependency in another module, i.e.:
module "kube-prometheus-stack" {
source = "github.com/ManagedKube/kubernetes-ops/terraform-modules/aws/helm/kube-prometheus-stack"
helm_values = file("${path.module}/values.yaml")
depends_on = [
data.terraform_remote_state.eks
]
}
The error that occurs is this:
Waiting for the plan to start...
Terraform v1.2.6
on linux_amd64
Initializing plugins and modules...
╷
│ Error: Invalid function argument
│
│ on .terraform/modules/kube-prometheus-stack/terraform-modules/aws/helm/kube-prometheus-stack/main.tf line 17, in resource "helm_release" "helm_chart":
│ 17: templatefile("./values_local.yaml", {
│ 18: enable_grafana_aws_role = var.enable_iam_assumable_role_grafana
│ 19: aws_account_id = var.aws_account_id
│ 20: role_name = local.k8s_service_account_name
│ 21: }),
│
│ Invalid value for "path" parameter: no file exists at
│ "./values_local.yaml"; this function works only with files that are
│ distributed as part of the configuration source code, so if this file will
│ be created by a resource in this configuration you must instead obtain this
│ result from an attribute of that resource.
╵
Operation failed: failed running terraform plan (exit 1)
This is due to this line of code: https://github.com/ManagedKube/kubernetes-ops/blob/main/terraform-modules/aws/helm/kube-prometheus-stack/main.tf#L17
resource "helm_release" "helm_chart" {
chart = "kube-prometheus-stack"
namespace = var.namespace
create_namespace = "true"
name = var.chart_name
version = var.helm_version
verify = var.verify
repository = "https://prometheus-community.github.io/helm-charts"
values = [
# templatefile("${path.module}/values.yaml", {
--> templatefile("./values_local.yaml", {
enable_grafana_aws_role = var.enable_iam_assumable_role_grafana
aws_account_id = var.aws_account_id
role_name = local.k8s_service_account_name
}),
var.helm_values,
]
}
Changing this line to:
templatefile("${path.module}/values_local.yaml", {
fixes the issue.
How can we incorporate gVisor into the Kubernetes clusters?
For external-dns, it would be nice to have an example of using sealed secrets for the base64-encoded block of credentials for the Route53 IAM user. A step-by-step for setting that up in the context of this module would be nice, since it's something we're applying.
Let's give the IAM permission to the external-dns pod so it has access.
Review Access: a kubectl plugin to show an access matrix for k8s server resources.
https://github.com/corneliusweig/rakkess
Create a doc on how to use this with our cluster.
Create a script to facilitate the creation of different Terraform Cloud workspaces, with the associated sensitive environment variables for AWS access keys and secrets.
How to monitor logins?