
terraform-equinix-metal-anthos-on-baremetal's Introduction


Automated Anthos on Baremetal via Terraform for Equinix Metal

These files will allow you to use Terraform to deploy Google Cloud's Anthos on Baremetal on Equinix Metal's Bare Metal Cloud offering.

Terraform will create an Equinix Metal project complete with Linux machines for your Anthos on Baremetal cluster registered to Google Cloud. To use an existing Equinix Metal project instead, see this section for instructions.

Environment Diagram

Users are responsible for providing their Equinix Metal account and Anthos subscription as described in this readme.

The build (with default settings) typically takes 25-30 minutes.

The automation in this repo is COMMUNITY SUPPORTED ONLY. If the installation succeeds and you run the Anthos Platform Validation, the cluster is production grade and supportable by Google for Anthos and by Equinix Metal for infrastructure. If you have any questions, please consult Equinix Metal Support via a support ticket.

Join us on Slack

We use Slack as our primary communication tool for collaboration. You can join the Equinix Metal Community Slack group by going to slack.equinixmetal.com and submitting your email address. You will receive a message with an invite link. Once you enter the Slack group, join the #google-anthos channel! Feel free to introduce yourself there, but know it's not mandatory.

Latest Updates

See the Releases page for a changelog describing the tagged releases.

Prerequisites

To use these Terraform files, you need the following prerequisites:

  • An Anthos subscription
  • Google Cloud command-line tool gcloud and Terraform installed and configured, see this section
  • An Equinix Metal organization ID and API key
  • Optionally, Google Cloud service-account keys; see this section

Associated Equinix Metal Costs

The default variables make use of 6 c3.small.x86 servers. These servers are $0.50 per hour list price (resulting in a total solution price of roughly $3.00 per hour). This deployment has been tested with as few as 2 c3.small.x86 servers (1 control plane node and 1 worker node) for a total cost of roughly $1.00 per hour.

Tested Anthos on Baremetal versions

The Terraform has been successfully tested with the following versions of Anthos on Baremetal:

  • 1.7.0
  • 1.6.0

To simplify setup, this project is designed to use manual load balancing with the Kube-VIP load balancer. No other load balancer support is planned at this time.

Select the version of Anthos you wish to install by setting the anthos_version variable in your terraform.tfvars file.
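For example, to pin the install to one of the tested releases, your terraform.tfvars could contain:

anthos_version = "1.7.0"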

Install tools

Install gcloud

The gcloud command-line tool is used to configure GCP for use by Terraform. Download and install the tool from the download page.

Once installed, run the following command to log in, configure the tool and your project:

gcloud init

This will prompt you to select a GCP project that you will use to register the Anthos cluster. This project must be linked to an Anthos subscription.

Next, run the following command to configure credentials that can be used by Terraform:

gcloud auth application-default login

Install Terraform

Terraform is just a single binary. Visit their download page, choose your operating system, make the binary executable, and move it into your path.

Here is an example for macOS:

curl -LO https://releases.hashicorp.com/terraform/0.14.2/terraform_0.14.2_darwin_amd64.zip
unzip terraform_0.14.2_darwin_amd64.zip
chmod +x terraform
sudo mv terraform /usr/local/bin/
rm -f terraform_0.14.2_darwin_amd64.zip

Here is an example for Linux:

curl -LO https://releases.hashicorp.com/terraform/0.14.2/terraform_0.14.2_linux_amd64.zip
unzip terraform_0.14.2_linux_amd64.zip
chmod +x terraform
sudo mv terraform /usr/local/bin/
rm -f terraform_0.14.2_linux_amd64.zip
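Once the binary is on your path, you can verify the installation:

terraform version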

Manage your GCP Keys for your service accounts

The Anthos on Baremetal install requires several service accounts and keys to be created. See the Google documentation for more details. By default, Terraform will create and manage these service accounts and keys for you (recommended). Alternatively, you can create these keys manually, or use a provided helper script to make the keys for you.

If you choose to manage the keys yourself, the Terraform files expect the keys to use the following naming convention, matching that of the Google documentation:

util
|_keys
  |_cloud-ops.json
  |_connect.json
  |_gcr.json
  |_register.json
  |_bmctl.json

If you create the keys manually, place each of them in a folder named keys within the util folder. The service accounts also need IAM roles assigned to them; to do this manually, you'll need to follow the instructions from Google.
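As a rough illustration only (the exact service-account names and IAM roles come from the Google documentation; connect-agent-sa and the role below are placeholders):

gcloud iam service-accounts create connect-agent-sa
gcloud iam service-accounts keys create util/keys/connect.json \
  --iam-account=connect-agent-sa@YOUR_GCP_PROJECT_ID.iam.gserviceaccount.com
gcloud projects add-iam-policy-binding YOUR_GCP_PROJECT_ID \
  --member="serviceAccount:connect-agent-sa@YOUR_GCP_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/gkehub.connect"

Repeat the key creation and role binding for each key listed above.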

Much easier (and recommended) is to use the helper script setup_gcp_project.sh, located in the util directory, to create these keys and assign the IAM roles. The script walks you through logging into GCP with your user account and selecting the GCP project for your Anthos cluster.

You can run this script as follows:

util/setup_gcp_project.sh

Prompts will guide you through the setup.

Note that if you choose to manage the service accounts and keys outside Terraform, you will need to provide the gcp_keys_path variable to Terraform (see table below).
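For example, assuming you keep the keys in the util/keys layout shown above, your terraform.tfvars would include:

gcp_keys_path = "./util/keys"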

Download this project

To download this project, run the following command:

git clone https://github.com/equinix/terraform-metal-anthos-on-baremetal.git
cd terraform-metal-anthos-on-baremetal

Initialize Terraform

Terraform uses modules to deploy infrastructure. To initialize the modules, simply run:

terraform init

This should download seven modules into a hidden directory .terraform.

Modify your variables

There are many variables which can be set to customize your install within variables.tf. The default variables bring up a 6-node Anthos cluster with an HA control plane (3 nodes) and 3 worker nodes using Equinix Metal's c3.small.x86 servers. Change each default variable at your own risk.

There are some variables you must set in a terraform.tfvars file. You need to set metal_auth_token and metal_organization_id to connect to Equinix Metal, and metal_project_name for the project that will be created in Equinix Metal. For the GCP side, you need to set gcp_project_id so that Terraform can enable APIs and initialize the project, and it's a good idea to set cluster_name to identify your cluster in the GCP portal. Note that the GCP project must already exist; Terraform will not create the GCP project for you.

The Anthos variables include anthos_version and anthos_user_cluster_name.

Here is a quick command plus sample values to create a starter file for you (make sure you adjust the variables to match your environment):

cat <<EOF >terraform.tfvars
metal_auth_token = "cefa5c94-e8ee-4577-bff8-1d1edca93ed8"
metal_organization_id = "42259e34-d300-48b3-b3e1-d5165cd14169"
metal_project_name = "anthos-metal-project-1"
gcp_project_id = "anthos-gcp-project-1"
cluster_name = "anthos-metal-1"
EOF

Available Variables

A complete list of variables can be found at https://registry.terraform.io/modules/equinix/anthos-on-baremetal/metal/latest?tab=inputs.

| Variable Name | Type | Default Value | Description |
|---------------|------|---------------|-------------|
| metal_auth_token | string | n/a | Equinix Metal API key |
| metal_project_id | string | n/a | Equinix Metal project ID |
| metal_organization_id | string | n/a | Equinix Metal organization ID |
| hostname | string | anthos-baremetal | The hostname for nodes |
| metro | string | ny | Equinix Metal metro to deploy into |
| cp_plan | string | c3.small.x86 | Equinix Metal device type to deploy for control plane nodes |
| worker_plan | string | c3.small.x86 | Equinix Metal device type to deploy for worker nodes |
| ha_control_plane | boolean | true | Do you want a highly available control plane? |
| worker_count | number | 3 | Number of baremetal worker nodes |
| operating_system | string | ubuntu_20_04 | The operating system of the nodes |
| billing_cycle | string | hourly | How the nodes will be billed (not usually changed) |
| cluster_name | string | equinix-metal-gke-cluster | The name of the GKE cluster |
| metal_create_project | string | true | Create a new project for this deployment? |
| metal_project_name | string | baremetal-anthos | The name of the project if 'metal_create_project' is 'true' |
| gcp_project_id | string | n/a | The GCP project ID to use |
| gcp_keys_path | string | n/a | The path to a directory with GCP service account keys |
| bgp_asn | string | 65000 | BGP ASN to peer with Equinix Metal |
| ccm_version | string | v3.2.2 | The version of Cloud Provider Equinix Metal |
| kube_vip_version | string | 0.3.8 | The version of Kube-VIP to install |
| anthos_version | string | 1.8.3 | The version of Google Anthos to install |
| ccm_deploy_url | string | Too long to put here... | The deploy URL for the Equinix Metal CCM |
| storage_module | string | n/a | Enable a storage module (examples: "portworx", "rook") |
| storage_options | map | n/a | Options specific to the storage module |

Supported Operating Systems

| Name | API Slug |
|------|----------|
| CentOS 8 | centos_8 |
| Ubuntu 18.04 | ubuntu_18_04 |
| Ubuntu 20.04 | ubuntu_20_04 |

Coming Soon

| Name | API Slug |
|------|----------|
| Red Hat Enterprise Linux 8 | rhel_8 |

Deploy the Anthos on Baremetal cluster onto Equinix Metal

All that is left to do now is deploy the cluster:

terraform apply --auto-approve

This should end with output similar to this:

Apply complete! Resources: 28 added, 0 changed, 0 destroyed.

Outputs:

Control_Plane_Public_IPs = [
  "136.144.50.115",
  "136.144.50.117",
  "136.144.50.119",
]
Control_Plane_VIP = "145.40.65.107"
Ingress_VIP = "145.40.65.106"
Kubeconfig_location = "/home/cloud-user/git/baremetal-anthos/equinix-metal-gke-cluster-vomqb-kubeconfig"
Worker_Public_IPs = [
  "136.144.50.123",
  "145.40.64.221",
  "136.144.50.105",
]
ssh_key_location = "/home/cloud-user/.ssh/bm-cluster-20201211211054"

You can see this output again at any time by running terraform output.
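For example, to point kubectl at the new cluster directly from the Terraform outputs:

KUBECONFIG=$(terraform output Kubeconfig_location) kubectl get nodes

(On newer Terraform releases you may need terraform output -raw to strip the surrounding quotes.)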

Use an existing Equinix Metal project

If you have an existing Equinix Metal project, you can use it. YOU MUST ENABLE BGP PEERING ON YOUR PROJECT WITHOUT A PASSWORD.

To get your project ID, navigate to the project in the console.equinixmetal.com console, click PROJECT SETTINGS, and copy the PROJECT ID.

Add the following variables to your terraform.tfvars:

metal_create_project              = false
metal_project_id                  = "YOUR-PROJECT-ID"
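If you would rather manage the BGP configuration with Terraform than through the console, a minimal sketch (assuming the metal provider's metal_project_bgp_config resource; adjust to your provider version) looks like:

resource "metal_project_bgp_config" "bgp" {
  project_id      = "YOUR-PROJECT-ID"
  deployment_type = "local"
  asn             = 65000
}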

Google Anthos Documentation

Once Anthos is deployed on Equinix Metal, all of the documentation for using Google Anthos is located on the Anthos Documentation Page.

Storage Providers

Storage providers are made available through optional storage modules. These storage providers include CSI (Container Storage Interface) StorageClasses.

Changing or disabling a storage provider is not currently supported.

To enable a storage module, set the storage_module variable to the name of the included module.

  • portworx: To enable the Pure Storage Portworx installation, use the following settings in terraform.tfvars:

    storage_module = "portworx"
    storage_options = {
      # portworx_version = "2.6"
      # portworx_license = "c0ffe-fefe-activation-123"
    }

    When enabled, Portworx will manage the local disks attached to each worker node, providing a fault tolerant distributed storage solution.

    Read more about the Portworx module (also available on the Terraform Registry).

  • rook: To enable the Rook Ceph installation, use the following settings in terraform.tfvars:

    storage_module = "rook"
    storage_options = {
      # rook_version = "v1.6.0"
    }

When enabled, Rook Ceph will manage the local disks attached to each worker node, providing a fault tolerant distributed storage solution.

Read more about the Rook module (also available on the Terraform Registry).

terraform-equinix-metal-anthos-on-baremetal's People

Contributors

amrabdelrazik, bikashrc25, c0dyhi11, codinja1188, ctreatma, denisj3030, displague, henrybell, joshpadilla, puneith, tahaozket


terraform-equinix-metal-anthos-on-baremetal's Issues

Generate a token to register in Anthos Console

The README.md does not walk users through connecting the new cluster to the Anthos UI.

While we do not want to repeat all of the Anthos Baremetal documentation in this project, a helper script or set of copy/paste commands would ease the process. (The Google instructions must be hand edited and selected before they can be copy/pasted; we can be opinionated here or take values from Terraform.)

https://cloud.google.com/anthos/gke/docs/bare-metal/1.6/how-to/anthos-ui#authn

Should Terraform preconfigure this?
What names should we use?
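A rough sketch of the kind of helper commands the README could offer (the console-login names are placeholders, the exact ClusterRoles come from the Google doc above, and this token flow assumes Kubernetes versions that still auto-create service-account token secrets):

kubectl create serviceaccount console-login
kubectl create clusterrolebinding console-login-view \
  --clusterrole=view --serviceaccount=default:console-login
SECRET=$(kubectl get serviceaccount console-login -o jsonpath='{.secrets[0].name}')
kubectl get secret "$SECRET" -o jsonpath='{.data.token}' | base64 --decode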

Control codes from apt cause terminal issues

The apt command by default uses terminal control codes to display a progress status bar when installing packages, which will break some terminals when the templates/pre_reqs.sh script is run via Terraform, requiring a reset. This could be fixed by:

  • Reverting to the traditional apt-get instead, which should be better behaved, or
  • Configuring apt not to display progress status

The first of these has the advantage of making install tool use consistent in the ubuntu_pre_reqs function.
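A minimal sketch of the second option, assuming the script is allowed to drop an apt config snippet before installing packages:

echo 'Dpkg::Progress-Fancy "0";' > /etc/apt/apt.conf.d/99-no-progress-bar

Alternatively, simply call apt-get instead of apt inside ubuntu_pre_reqs.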

Terraform script error for Google Anthos Baremetal Setup

I get an error during SSH key creation, at line 35 of the main.tf Terraform script:

line 35: "resource "metal_ssh_key" "ssh_pub_key" {"

i get a "error: not found". I noticed though that there is a public key created in packet in my project and a private key in my .ssh folder. But because of the error the script fails at this point

Unable to progress past BGP peering step on Anthos 1.11.2

I am trying to test out Anthos 1.11.2 so that I can leverage some newer features that take advantage of Equinix Metal's SR-IOV features in the baremetal hardware. My preference is to use centos_8 as the backend, and I patched the script to get past some errors I was having (I can send the patch in a PR), but the issue appears to happen on the default ubuntu_20_04 release as well, so it doesn't appear to be OS-related.

terraform.tfvars

// this should be your personal token, not the project token
metal_auth_token = "sanitized"
metal_organization_id = "sanitized"
metal_project_id = "sanitized"
// don't create a new project, use an existing
metal_create_project = false
gcp_project_id = "sanitized"
cluster_name = "anthos-metal-1"
// 1.11.X is necessary to get the latest multi-nic pieces for sriov
anthos_version = "1.11.2"
// ideally we want rhel_7 here but saw a couple bugs for rhel
// operating_system = "rhel_8"
// operating_system = "centos_8"
operating_system = "ubuntu_20_04"
facility = "dc13"

I get up to the null_resource.kube_vip_install_first_cp step and it never completes. I've even let it run overnight and it never completes even after 15 hours.

null_resource.kube_vip_install_first_cp (remote-exec): /root/bootstrap/vip.yaml FOUND!
null_resource.kube_vip_install_first_cp (remote-exec): BGP peering initiated! Cluster should be completed in about 5 minutes.
null_resource.kube_vip_install_first_cp: Creation complete after 9m23s [id=7216402651719392522]
***
***
***
null_resource.deploy_anthos_cluster: Still creating... [15h22m26s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m36s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m46s elapsed]
null_resource.deploy_anthos_cluster: Still creating... [15h22m56s elapsed]
^CStopping operation...
Interrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...
╷
│ Error: execution halted
│ 
│ Error: remote-exec provisioner error
│ 
│   with null_resource.deploy_anthos_cluster,
│   on main.tf line 239, in resource "null_resource" "deploy_anthos_cluster":
│  239:   provisioner "remote-exec" {
│ 
│ error executing "/tmp/terraform_925104650.sh": wait: remote command exited without exit status or exit signal
╵

Since my values are fairly standard except for the new Anthos, I presume the issue is likely a change with the BGP peering that perhaps hasn't been accounted for.

Incorporate HA local storage ready CSI providers

This Equinix Metal + Google Anthos integration would benefit from a storage provider that could take advantage of the local disks attached to the provisioned nodes and the fast networking between devices.

From https://github.com/c0dyhi11/terraform-metal-anthos-on-baremetal/releases/tag/v0.2.0:

The only thing you'll need is to bring your own CSI, see our Anthos Ready Storage Partners Here

What does this look like?

When installing this Terraform module, the user would toggle an option that would enable one of a number of CSI integration options.

The Terraform resource enabling this could be a late install that is applied once the Kubeconfig output variable is available.
This resource could perhaps depend on the Terraform Helm and Terraform Kubernetes providers, or it could be executed with shell commands through a null_resource provisioner.

The CSI provider could use the full disk that is made available within the device.

If the CSI provider requires raw partitions or disks, the Equinix Metal CPR (Custom Partitioning and Raid) features could be introduced during the device creation:

https://registry.terraform.io/providers/equinix/metal/latest/docs/resources/device (search "storage")
https://metal.equinix.com/developers/docs/servers/custom-partitioning-raid/

The introduced modules (if any), variables, and outputs should include one-line descriptions.

The integration should be described within the project README.md.


As an alternative to baking all of these opinions into the existing project, a new Terraform module (hosted in a different repo) could consume this project as a module.

That parent module could use this project’s kubeconfig output variable. The parent module may need to send disk configuration parameters into this project, and if so we may need to introduce those parameters into this module.

This parent module could then express other opinions, like adding Cloudflare or other providers, to point DNS records at the IPs included in this module's output variables.

Configurable userdata

The module should expose control-plane and worker userdata variables.

This userdata would be combined with the userdata that is already defined using Terraform's cloud-init handlers (gzip + multipart mime).
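A minimal sketch of how that merge could look, assuming a new worker_userdata variable and the hashicorp/cloudinit provider (the part contents are placeholders, not the module's actual templates):

data "cloudinit_config" "worker" {
  gzip          = true
  base64_encode = true

  # userdata the module already renders today
  part {
    content_type = "text/x-shellscript"
    content      = file("templates/pre_reqs.sh")
  }

  # new user-supplied userdata appended as another MIME part
  part {
    content_type = "text/cloud-config"
    content      = var.worker_userdata
  }
}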

Make Anthos version upgradeable

By running terraform apply with an updated Anthos version, a user may expect this module to upgrade the installed Anthos version.

That is not the case today.

Doing so should upgrade the installed cluster.

terraform apply -var anthos_version=1.6.0 # cluster has been installed
terraform apply -var anthos_version=1.7.0 # cluster has been upgraded

relocate this project

We would like to have this project hosted at github.com/equinix/terraform-equinix-metal-anthos/ and published to the Terraform Registry. This follows the publishing guidelines for naming Terraform modules: terraform-{provider}-{module name}.

This migration will require some text renames within the project but ultimately depends on packethost/terraform-provider-packet#288

Update documentation for Terraform module use

We should be able to simplify some of the documentation now that the module is published to provide a less developer-centric installation.

If the GCP Terraform provider can be incorporated (#14), we could further simplify the installation instructions because Terraform's built-in variable prompts would run on apply and we wouldn't need the util/setup_gcp_project.sh script.

Here are some early thoughts on what simplified instructions could offer:

No need to git clone:

terraform init --from-module=equinix/anthos-on-baremetal/metal
util/setup_gcp_project.sh # We should explain where all the Google Container Registry (GCR) service account values can be found
terraform apply

To avoid entering all the variables every time you run terraform apply, you can create a terraform.tfvars file; any of the input variables can be stored in this file.

When the terraform apply is done (with a green text summary), you can do things like:

KUBECONFIG=$(terraform output Kubeconfig_location) kubectl get nodes

and

ssh -i $(terraform output ssh_key_location) root@$(terraform output -json Worker_Public_IPs | jq '.[0]' ) # 0 is the first worker

A complete list of output variables is available at https://registry.terraform.io/modules/equinix/anthos-on-baremetal/metal/latest?tab=outputs.

To use this module in a larger configuration:

# example that demonstrates consuming the Kubeconfig variable 
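A minimal sketch of what that example might look like, assuming the module is consumed from the registry and its Kubeconfig_location output is passed along (values are placeholders):

module "anthos" {
  source = "equinix/anthos-on-baremetal/metal"

  metal_auth_token      = var.metal_auth_token
  metal_organization_id = var.metal_organization_id
  metal_project_name    = "anthos-metal-project-1"
  gcp_project_id        = var.gcp_project_id
}

output "kubeconfig_path" {
  value = module.anthos.Kubeconfig_location
}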

If you plan to contribute to the project, rather than running terraform init --from-module=equinix/anthos-on-baremetal/metal, you will want to run:

git clone git@github.com:equinix/terraform-metal-anthos-on-baremetal
terraform init

centos_8_2/8_3 is not a valid operating system

Error: centos_8_2 is not a valid operating system
on main.tf line 76, in resource "metal_device" "control_plane":
76: resource "metal_device" "control_plane" {

Error: centos_8_2 is not a valid operating system
on main.tf line 91, in resource "metal_device" "worker_nodes":
91: resource "metal_device" "worker_nodes" {

Error: centos_8_3 is not a valid operating system
on main.tf line 76, in resource "metal_device" "control_plane":
76: resource "metal_device" "control_plane" {

Error: centos_8_3 is not a valid operating system
on main.tf line 91, in resource "metal_device" "worker_nodes":
91: resource "metal_device" "worker_nodes" {

Do we still need the "packet-cloud-config" secret?

@displague & @thebsdbox :

Do we still need the packet-cloud-config? The CCM now references metal-cloud-config; I just kept this to be compatible with the older version of Kube-VIP. Since Kube-VIP is now updated to the latest... is it still needed?

https://github.com/equinix/terraform-metal-anthos-on-baremetal/blob/79169cdcbbd39efeea2ba14bf693570b0e5d85dc/templates/ccm_secret.yaml#L15

It looks like we're deploying the Control Plane Kube-VIP pods by providing the creds directly:
https://github.com/equinix/terraform-metal-anthos-on-baremetal/blob/79169cdcbbd39efeea2ba14bf693570b0e5d85dc/templates/kube_vip_install.sh#L36

We're still referencing this secret in the DS here:
https://github.com/equinix/terraform-metal-anthos-on-baremetal/blob/79169cdcbbd39efeea2ba14bf693570b0e5d85dc/templates/kube_vip_ds.yaml#L85

I think at a minimum we can replace the packet-cloud-config with metal-cloud-config in the kube_vip_ds.yaml.

And I think at that point we should be able to remove the packet-cloud-config secret altogether, but I'm not 100% sure...

Convert gcloud usage to GCP Terraform provider

Using https://registry.terraform.io/providers/hashicorp/google/latest (https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/google_project_iam) it should be possible to remove any requirements for gcloud and defer to a few Terraform variables and upfront statements about what the project will do and what opinions will be expressed. The terraform plan will act as a better gate to the changes that will be applied to the account than what is provided currently through the setup-gcp script.
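A minimal sketch of that direction, assuming the hashicorp/google provider (the service-account and role names here are illustrative, not the module's final choices):

provider "google" {
  project = var.gcp_project_id
}

resource "google_service_account" "connect_agent" {
  account_id = "connect-agent-sa"
}

resource "google_project_iam_member" "connect_agent" {
  project = var.gcp_project_id
  role    = "roles/gkehub.connect"
  member  = "serviceAccount:${google_service_account.connect_agent.email}"
}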

Call to function "cidrhost" failed: invalid CIDR expression: invalid CIDR address: /0.

Following the README step for step, I get this error:

╷
│ Error: Error in function call
│
│   on main.tf line 141, in data "template_file" "deploy_anthos_cluster":
│  141:     ingress_vip      = cidrhost(metal_reserved_ip_block.ingress_vip.cidr_notation, 0)
│     ├────────────────
│     │ metal_reserved_ip_block.ingress_vip.cidr_notation is "/0"
│
│ Call to function "cidrhost" failed: invalid CIDR expression: invalid CIDR address: /0.
╵
╷
│ Error: Error in function call
│
│   on output.tf line 22, in output "Ingress_VIP":
│   22:   value       = cidrhost(metal_reserved_ip_block.ingress_vip.cidr_notation, 0)
│     ├────────────────
│     │ metal_reserved_ip_block.ingress_vip.cidr_notation is "/0"
│
│ Call to function "cidrhost" failed: invalid CIDR expression: invalid CIDR address: /0.

Add output variables linking to Google/Equinix UI pages

Users would benefit from a shortcut to the relevant Google Cloud panels for the new cluster.

The Equinix Metal project URLs could be emitted too.

RHEL 8 hosts are created without resolv.conf links

This leads to errors like this when Anthos tries to join the node to the cluster.
Warning FailedCreatePodSandBox 3m42s (x26 over 9m16s) kubelet Failed to create pod sandbox: open /run/systemd/resolve/resolv.conf: no such file or directory
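A possible workaround sketch for the node prep step, assuming systemd-resolved is acceptable on these hosts (untested here):

systemctl enable --now systemd-resolved
ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf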

gcloud application-default login required on GCE vm or terraform cluster create hangs

null_resource.deploy_anthos_cluster (remote-exec): Creating Anthos Cluster. This will take about 20 minutes...
null_resource.deploy_anthos_cluster (remote-exec): Cluster Created!
null_resource.deploy_anthos_cluster: Creation complete after 5s [id=2553823925035852511]
null_resource.download_kube_config: Creating...
null_resource.download_kube_config: Provisioning with 'local-exec'...
null_resource.download_kube_config (local-exec): Executing: ["/bin/sh" "-c" "scp -i ~/.ssh/anthos-abm-blue-p17o3 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null  root@139.178.86.49:/root/baremetal/bmctl-workspace/abm-blue-p17o3/abm-blue-p17o3-kubeconfig ."]
null_resource.download_kube_config (local-exec): Warning: Permanently added '139.178.86.49' (ECDSA) to the list of known hosts.
null_resource.download_kube_config (local-exec): scp: /root/baremetal/bmctl-workspace/abm-blue-p17o3/abm-blue-p17o3-kubeconfig: No such file or directory

null_resource.kube_vip_install_first_cp: Still creating... [10s elapsed]
null_resource.kube_vip_install_first_cp (remote-exec): Waiting for '/etc/kubernetes/manifests' to be created...
null_resource.kube_vip_install_first_cp: Still creating... [20s elapsed]
null_resource.kube_vip_install_first_cp (remote-exec): Waiting for '/etc/kubernetes/manifests' to be created...

Investigate alternatives to CPEM IP Address provisioning

In issue #55, we seek to make Cloud Provider: Equinix Metal (CPEM) optional during installation.

In lieu of CPEM, users will need to provision IP addresses in some other way (manually, using another controller, using Terraform) and assign these addresses statically on all subsequent LoadBalancer resources.

What solutions can we provide here? What are the obstacles to those approaches?

  • Use Crossplane to provision IP ranges? crossplane-contrib/provider-equinix-metal#41
    Would we preinstall and configure Crossplane?
    How would the provisioned addresses update the Service records? How would crossplane detect the need for more IP addresses from Service records without addresses?
  • Create a custom controller for this purpose?
  • Build this functionality into Kube-VIP or MetalLB (are plugins supported)?
  • Let Terraform pre-provision ranges and make these ranges available to kube-vip or metallb or the user (see the sketch below)? Is this a configmap? Is this arguments to a controller?
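A minimal sketch of that last option, reusing the metal_reserved_ip_block resource this module already depends on (the pool size is arbitrary):

resource "metal_reserved_ip_block" "lb_pool" {
  project_id = var.metal_project_id
  metro      = var.metro
  type       = "public_ipv4"
  quantity   = 4
}

output "lb_pool_cidr" {
  value = metal_reserved_ip_block.lb_pool.cidr_notation
}

The resulting CIDR could then be handed to kube-vip or MetalLB as their address range, or surfaced to the user.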

provide kube api configuration as a module output

While the API may need to be publicly exposed first (depends on #9) users would benefit from the ability to treat this provider like a module that can be plugged into broader solutions.

This will require access to the Kubernetes API of the Anthos cluster:

module "anthos" {
 source = "equinix/equinix-metal/anthos" # this repository, once registered
}

provider "kubernetes" {
 // configure using module.anthos.some_output_variables
}

resource "kubernetes_service" "nginx" {
  metadata {
    name = "nginx-example"
  }
  spec {
    selector = {
      App = kubernetes_pod.nginx.metadata[0].labels.App
    }
    port {
      port        = 80
      target_port = 80
    }

    type = "LoadBalancer"
  }
}

output "lb_ip" {
  value = kubernetes_service.nginx.load_balancer_ingress[0].ip
}

Optionally support Metal-network-internal API and ingress/service endpoints

Currently, clusters are created with Internet-facing Kubernetes API endpoints, and are configured such that Kubernetes ingress and service endpoints are also created with Internet-facing VIPs. This is done using kube-vip to provision Elastic IPs for control plane and ingress/service use, and managing them with BGP.

While this is perfect for demo/PoC work, to more closely replicate “real-world” deployments, it would be useful to be able to create clusters configured to expose API and/or ingress/service endpoints with Metal-network-internal IP addresses. Based on a conversation with @c0dyhi11 a while back, this may also require moving from an L3 to an L2 model for the internal network.

BGP is not enabled on some generated projects

As a new Equinix Metal user, additional steps may be needed to get BGP enabled on generated Equinix Metal Projects.

The user will see long wait times that ultimately fail on the null_resource.deploy_anthos_cluster provisioner.
When canceled, the user will see "Error: BGP is not enabled".

After manually enabling BGP on the project, a subsequent terraform apply will succeed within seconds.

Also, the README file currently states:

## Use an existing Equinix Metal project
If you have an existing Equinix Metal project you can use it.
**YOU MUST ENABLE BGP PEERING ON YOUR PROJECT WITHOUT A PASSWORD**

It is not clear if this means that the project must already have BGP enabled, or if the focus is on the password-less configuration.

Any steps outside of terraform apply that first-time users should follow to get BGP to work should be noted.

bundle Anthos-ready testing tools

We should bundle the Anthos ready testing tools:

  • Developers and maintainers will benefit from simplified testing and verification.
  • Users will benefit from the ability to independently verify the Anthos readiness. This is especially helpful for supporting the platform when users may use untested device configurations.

The requirements for these tests include network diagrams and scripted test results.

  • A testing folder should be provisioned on the control plane node(s).

  • Files named network-logical.pdf and network-physical.pdf should be generated or included in this project and deposited in the testing folder.

  • A testing script should be included in the testing folder (example):

    #!/bin/sh
    mkdir -p /root/test/report && cd /root/test
    wget -q https://storage.googleapis.com/anthos_ready_test_script/apv-ts;
    wget -q -O testcfg.apv.yaml.orig https://storage.googleapis.com/anthos_ready_test_script/testcfg.apv.yaml
    chmod +x apv-ts;
    
    [ -f testcfg.apv.yaml ] && ./apv-ts --config testcfg.apv.yaml
  • A testing configuration should be included in the testing folder (this would replace testcfg.apv.yaml from the example script above). Terraform should be equipped to supply the values needed by the test configuration.
    Here's an example: https://gist.github.com/displague/6ec57e2e5c6bdf15af67a1c42a8bc022

Cloud Monitoring workspace for GCP project required or cluster bootstrap fails

Readme needs updating to indicate:

For a GCP project, the Cloud Monitoring workspace needs to be added to the project via the GCP console.

"msg"="Failed to bootstrap." "error"="Error validating cluster config: 1 error occurred:\n\t* ClusterOperations check failed: failed to access Cloud Monitoring service, please create workspace for the project anthos-cpe-clusters following https://cloud.google.com/monitoring/workspaces/create and try again: googleapi: Error 400: 'projects/' is not a Workspace., badRequest\n\n"

GCP project id should be optional

The default Terraform provider project id can be determined with:

data "google_project" "project" {
}

To make the project_id optional, we can set the project_id search attribute on this resource when set. Any references to the gcp_project_id variable would need to refer to this data.google_project.project.
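A minimal sketch of how the fallback could be wired, assuming gcp_project_id defaults to an empty string (names are illustrative):

variable "gcp_project_id" {
  type    = string
  default = ""
}

data "google_project" "project" {
}

locals {
  gcp_project_id = coalesce(var.gcp_project_id, data.google_project.project.project_id)
}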

Make Cloud Provider Equinix Metal optional

In some environments, users may wish to use an independent Cloud Provider controller (aka CCM) that can treat all Equinix Metal devices and devices from other cloud providers as pure baremetal behind a Layer2 network.

In these environments, the 'Cloud Provider: Equinix Metal' (CPEM) controller should not set the Node providerID nor manage labels and annotations on Nodes. Another CCM may take on those responsibilities in a baremetal + L2 (cloud-agnostic) way.

When CPEM is removed, users still wanting to take advantage of LoadBalancer services may continue to do so using MetalLB or Kube-VIP. A benefit of CPEM is that it will provision or discover available IP reservations from Equinix Metal APIs and populate the Service address accordingly.

Users choosing to omit CPEM from their Anthos installation will want to find an alternative approach to dynamic provisioning of IP Addresses, or use static addressing.

Terraform script does not create the logical volume for PX KVDB

I used the following steps to set up Google Anthos on Equinix Metal with PX.

gcloud init
gcloud auth application-default login
mkdir -p equinix-px
cd equinix-px/
 export TF_VAR_metal_auth_token=...
export TF_VAR_metal_project_id=...
export TF_VAR_gcp_project_id=...

terraform init
# variables.tf
variable "metal_auth_token" {}
variable "metal_project_id" {}
variable "gcp_project_id" {}

# main.tf
module "anthos-on-baremetal" {
  source  = "equinix/anthos-on-baremetal/metal" 
  version = "0.5.1"
 
  gcp_project_id = var.gcp_project_id
  metal_auth_token = var.metal_auth_token
  metal_project_id = var.metal_project_id
  metal_create_project = false
  ha_control_plane = false
  facility = "da11"
  cp_plan = "c3.medium.x86"
  worker_plan = "c3.medium.x86"
  worker_count = 3
  storage_module = "portworx"
  storage_options = {}
}
terraform apply

After the cluster is up and running with PX, I notice that the separate LV for the PX KVDB is missing on all three worker nodes. We definitely need to fix this problem ASAP, as it prolongs the PX setup in the cluster.

root@eqnx-metal-gke-ww35x-worker-03:~# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 447.1G  0 disk 
sdb      8:16   0 447.1G  0 disk 
sdc      8:32   0 223.6G  0 disk 
sdd      8:48   0 223.6G  0 disk 
├─sdd1   8:49   0     2M  0 part 
├─sdd2   8:50   0   1.9G  0 part 
└─sdd3   8:51   0 221.7G  0 part /
root@eqnx-metal-gke-ww35x-worker-03:~# 
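As a stopgap until the script is fixed, the missing volume group and logical volume could be created by hand on each affected worker (the device name below is an example only; pick an unused disk from lsblk):

pvcreate /dev/sdc
vgcreate pwx_vg /dev/sdc
lvcreate -n pwxkvdb -l 100%FREE pwx_vg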

Portworx LVM script may choose the wrong disks

Here are some of the inconsistencies:

  1. Two out of three nodes did not create an LVM volume group for the PX KVDB. Only one node successfully did it.
  2. The third node, which did create the pwx_vg, picked up the larger 480GB drive instead of the 240GB.

I am including some of the snippets.

This is worker node 1 where it could not create the pwx_vg and you can clearly see the warning message.

root@equinix-metal-gke-cluster-yk9or-worker-01:~# pxctl status
Status: PX is operational
License: Trial (expires in 31 days)
Node ID: 1534165d-4b6b-41df-b8e1-03e8c8d5c4d1
    IP: 145.40.77.105 
    Local Storage Pool: 2 pools
    POOL    IO_PRIORITY RAID_LEVEL  USABLE  USED    STATUS  ZONE    REGION
    0   HIGH        raid0       447 GiB 10 GiB  Online  default default
    1   HIGH        raid0       224 GiB 10 GiB  Online  default default
    Local Storage Devices: 2 devices
    Device  Path        Media Type      Size        Last-Scan
    0:1 /dev/sdb    STORAGE_MEDIUM_SSD  447 GiB     12 Feb 21 17:34 UTC
    1:1 /dev/sdc    STORAGE_MEDIUM_SSD  224 GiB     12 Feb 21 17:34 UTC
    * Internal kvdb on this node is sharing this storage device /dev/sdc  to store its data.
    total       -   671 GiB
    Cache Devices:
     * No cache devices
Cluster Summary
    Cluster ID: equinix-metal-gke-cluster-yk9or
    Cluster UUID: 47eb0c51-b2c1-456b-a254-e5c849a7d1db
    Scheduler: kubernetes
    Nodes: 3 node(s) with storage (3 online)
    IP      ID                  SchedulerNodeName               StorageNode Used    Capacity    Status  StorageStatus   Version     Kernel          OS
    145.40.77.101   9afd9a30-0eb3-4a8d-937f-86f5cf63c4bc    equinix-metal-gke-cluster-yk9or-worker-03   Yes     20 GiB  671 GiOnline    Up      2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
    145.40.77.211   99a6f578-6c6f-4b09-b516-8dd332beef7e    equinix-metal-gke-cluster-yk9or-worker-02   Yes     20 GiB  668 GiOnline    Up      2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
    145.40.77.105   1534165d-4b6b-41df-b8e1-03e8c8d5c4d1    equinix-metal-gke-cluster-yk9or-worker-01   Yes     20 GiB  671 GiOnline    Up (This node)  2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
    Warnings: 
         WARNING: Internal Kvdb is not using dedicated drive on nodes [145.40.77.105]. This configuration is not recommended for production clusters.
Global Storage Pool
    Total Used      :  60 GiB
    Total Capacity  :  2.0 TiB
root@equinix-metal-gke-cluster-yk9or-worker-01:~# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 447.1G  0 disk 
sdb      8:16   0 447.1G  0 disk 
sdc      8:32   0 223.6G  0 disk 
sdd      8:48   0 223.6G  0 disk 
├─sdd1   8:49   0     2M  0 part 
├─sdd2   8:50   0   1.9G  0 part 
└─sdd3   8:51   0 221.7G  0 part /
root@equinix-metal-gke-cluster-yk9or-worker-01:~# 

This is worker node 2, where there is no pwx_vg for the KVDB.

root@equinix-metal-gke-cluster-yk9or-worker-02:~# pxctl status
Status: PX is operational
License: Trial (expires in 31 days)
Node ID: 99a6f578-6c6f-4b09-b516-8dd332beef7e
    IP: 145.40.77.211 
    Local Storage Pool: 2 pools
    POOL    IO_PRIORITY RAID_LEVEL  USABLE  USED    STATUS  ZONE    REGION
    0   HIGH        raid0       447 GiB 10 GiB  Online  default default
    1   HIGH        raid0       221 GiB 10 GiB  Online  default default
    Local Storage Devices: 2 devices
    Device  Path        Media Type      Size        Last-Scan
    0:1 /dev/sdb    STORAGE_MEDIUM_SSD  447 GiB     12 Feb 21 17:47 UTC
    1:1 /dev/sdc2   STORAGE_MEDIUM_SSD  221 GiB     12 Feb 21 17:47 UTC
    * Internal kvdb on this node is sharing this storage device /dev/sdc2  to store its data.
    total       -   668 GiB
    Cache Devices:
     * No cache devices
    Journal Device: 
    1   /dev/sdc1   STORAGE_MEDIUM_SSD
Cluster Summary
    Cluster ID: equinix-metal-gke-cluster-yk9or
    Cluster UUID: 47eb0c51-b2c1-456b-a254-e5c849a7d1db
    Scheduler: kubernetes
    Nodes: 3 node(s) with storage (3 online)
    IP      ID                  SchedulerNodeName               StorageNode Used    Capacity    Status  StorageStatus   Version     Kernel          OS
    145.40.77.101   9afd9a30-0eb3-4a8d-937f-86f5cf63c4bc    equinix-metal-gke-cluster-yk9or-worker-03   Yes     20 GiB  671 GiOnline    Up      2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
    145.40.77.211   99a6f578-6c6f-4b09-b516-8dd332beef7e    equinix-metal-gke-cluster-yk9or-worker-02   Yes     20 GiB  668 GiOnline    Up (This node)  2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
    145.40.77.105   1534165d-4b6b-41df-b8e1-03e8c8d5c4d1    equinix-metal-gke-cluster-yk9or-worker-01   Yes     20 GiB  671 GiOnline    Up      2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
    Warnings: 
         WARNING: Internal Kvdb is not using dedicated drive on nodes [145.40.77.105 145.40.77.211]. This configuration is not recommended for production clusters.
Global Storage Pool
    Total Used      :  60 GiB
    Total Capacity  :  2.0 TiB
root@equinix-metal-gke-cluster-yk9or-worker-02:~# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 447.1G  0 disk 
sdb      8:16   0 447.1G  0 disk 
sdc      8:32   0 223.6G  0 disk 
├─sdc1   8:33   0     3G  0 part 
└─sdc2   8:34   0 220.6G  0 part 
sdd      8:48   0 223.6G  0 disk 
├─sdd1   8:49   0     2M  0 part 
├─sdd2   8:50   0   1.9G  0 part 
└─sdd3   8:51   0 221.7G  0 part /
root@equinix-metal-gke-cluster-yk9or-worker-02:~# 

Finally, this is worker node 3. This node created the pwx_vg on the larger-capacity drive.

root@equinix-metal-gke-cluster-yk9or-worker-03:~# pxctl status
Status: PX is operational
License: Trial (expires in 31 days)
Node ID: 9afd9a30-0eb3-4a8d-937f-86f5cf63c4bc
    IP: 145.40.77.101 
    Local Storage Pool: 2 pools
    POOL    IO_PRIORITY RAID_LEVEL  USABLE  USED    STATUS  ZONE    REGION
    0   HIGH        raid0       447 GiB 10 GiB  Online  default default
    1   HIGH        raid0       224 GiB 10 GiB  Online  default default
    Local Storage Devices: 2 devices
    Device  Path        Media Type      Size        Last-Scan
    0:1 /dev/sdb    STORAGE_MEDIUM_SSD  447 GiB     12 Feb 21 17:34 UTC
    1:1 /dev/sdc    STORAGE_MEDIUM_SSD  224 GiB     12 Feb 21 17:34 UTC
    total           -           671 GiB
    Cache Devices:
     * No cache devices
    Kvdb Device:
    Device Path     Size
    /dev/pwx_vg/pwxkvdb 447 GiB
     * Internal kvdb on this node is using this dedicated kvdb device to store its data.
Cluster Summary
    Cluster ID: equinix-metal-gke-cluster-yk9or
    Cluster UUID: 47eb0c51-b2c1-456b-a254-e5c849a7d1db
    Scheduler: kubernetes
    Nodes: 3 node(s) with storage (3 online)
    IP      ID                  SchedulerNodeName               StorageNode Used    Capacity    Status  StorageStatus   Version     Kernel          OS
    145.40.77.101   9afd9a30-0eb3-4a8d-937f-86f5cf63c4bc    equinix-metal-gke-cluster-yk9or-worker-03   Yes     20 GiB  671 GiOnline    Up (This node)  2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
    145.40.77.211   99a6f578-6c6f-4b09-b516-8dd332beef7e    equinix-metal-gke-cluster-yk9or-worker-02   Yes     20 GiB  668 GiOnline    Up      2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
    145.40.77.105   1534165d-4b6b-41df-b8e1-03e8c8d5c4d1    equinix-metal-gke-cluster-yk9or-worker-01   Yes     20 GiB  671 GiOnline    Up      2.6.3.0-4419aa4 5.4.0-52-generic    Ubuntu 20.04.1 LTS
Global Storage Pool
    Total Used      :  60 GiB
    Total Capacity  :  2.0 TiB
root@equinix-metal-gke-cluster-yk9or-worker-03:~# lsblk
NAME             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                8:0    0 447.1G  0 disk 
└─pwx_vg-pwxkvdb 253:0    0 447.1G  0 lvm  
sdb                8:16   0 447.1G  0 disk 
sdc                8:32   0 223.6G  0 disk 
sdd                8:48   0 223.6G  0 disk 
├─sdd1             8:49   0     2M  0 part 
├─sdd2             8:50   0   1.9G  0 part 
└─sdd3             8:51   0 221.7G  0 part /
root@equinix-metal-gke-cluster-yk9or-worker-03:~# 

Any thoughts regarding these inconsistencies?

Originally posted by @bikashrc25 in #37 (comment)

GCP permissions for application-default login

Use case - you don't own the target GCP project for the deployment. Owner of the project provides the service-account keys and you do the rest.

This may be a corner-case issue, but in the above scenario the keys and permissions generated by the setup_gcp_project.sh script are not enough. You need to provide GCP application-default credentials for Terraform to use. The instructions in the README use gcloud auth application-default login for this, but if you don't have access to the target GCP project, it will error out about 3 minutes into the terraform run.

Prior to changing the super-admin SA to bmctl SA and reducing permissions, you could use the super-admin SA key for the application-default credentials (link). That account had project editor and IAM admin roles.
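A possible alternative in that scenario is to point application-default credentials at a key the project owner supplies rather than running the interactive login (the key path below is an example):

gcloud auth activate-service-account --key-file=util/keys/bmctl.json
export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/util/keys/bmctl.json

Whether this works still depends on the roles the project owner granted that key.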

Terraform Deprecation Warnings

Hi,

Getting these deprecation warnings when running apply/delete from the main branch.

│ Warning: "facilities": [DEPRECATED] Use metro attribute instead
│
│   with metal_device.control_plane,
│   on main.tf line 76, in resource "metal_device" "control_plane":
│   76: resource "metal_device" "control_plane" {
│
│ (and 7 more similar warnings elsewhere)

I am guessing the provider has changed.
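If so, a likely fix is for the module to pass metro instead of facilities on its metal_device resources, along the lines of (a sketch, not the module's actual code):

resource "metal_device" "control_plane" {
  # ...existing arguments...
  metro = var.metro
}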

Thanks
