
deployment-automation's Introduction

Terraform and Ansible Deployment for Redpanda


Terraform and Ansible configuration to easily provision a Redpanda cluster on AWS, GCP, Azure, or IBM.

Installation Prerequisites

Here are the prerequisites you'll need to install to run the content in this repo. You can also use the Dockerfile_FEDORA or Dockerfile_UBUNTU Dockerfiles to build a local client if you'd rather not install Terraform and Ansible on your machine.

On Mac OS X:

You can use brew to install the prerequisites. You will also need to install gnu-tar:

brew tap hashicorp/tap
brew install hashicorp/tap/terraform
brew install ansible
brew install gnu-tar

Basic Usage:

# Set required ansible variables
export CLOUD_PROVIDER=aws
export ANSIBLE_COLLECTIONS_PATHS=${PWD}/artifacts/collections
export ANSIBLE_ROLES_PATH=${PWD}/artifacts/roles
export ANSIBLE_INVENTORY=${PWD}/${CLOUD_PROVIDER}/hosts.ini

# Assumes default private and public key names; if these aren't correct for you, set them to the correct values

# Deploy VM
# ASSUMES YOU HAVE A DEFAULT VPC; if you don't, create one and set vpc_id and subnet_id
cd $CLOUD_PROVIDER
terraform init
terraform apply --auto-approve -var='public_key_path=~/.ssh/id_rsa.pub' -var='deployment_prefix=go-rp'
cd ..

# Install collections and roles
ansible-galaxy install -r ./requirements.yml

# Run a Playbook
# Pick the playbook that matches your deployment; in this case we use provision-basic-cluster
ansible-playbook ansible/provision-basic-cluster.yml --private-key ~/.ssh/id_rsa

Additional Documentation

More information on consuming this collection is available in our official documentation.

Troubleshooting

On Mac OS X, Python unable to fork workers

If you see something like this:

ok: [34.209.26.177] => {"changed": false, "stat": {"exists": false}}
objc[57889]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[57889]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
ERROR! A worker was found in a dead state

You might try resolving by setting an environment variable: export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

See: https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr

Contribution Guide

Testing with a specific branch of redpanda-ansible-collection

Change the redpanda.cluster entry in your requirements.yml file to the following:

  - name: https://github.com/redpanda-data/redpanda-ansible-collection.git
    type: git
    version: <<<YOUR BRANCH NAME>>>

pre-commit

We use pre-commit to ensure good code health on this repo. To install pre-commit, see the pre-commit documentation. The basic idea is that a fairly comprehensive checkup happens on each commit, ensuring that everything is properly formatted and validated. You may also need to install some prerequisite tools for pre-commit to work correctly. At the time of writing this includes:

Ansible Linter Skip List Whys and Wherefores

These rules were skipped because a lot of effort to bring the linter and IDE into alignment produced no meaningful improvement in readability, outcomes, or correctness:

  • jinja[spacing]
  • yaml[brackets]
  • yaml[line-length]

Skipped because enforcing it (setting pipefail) breaks the play: intermediate commands in the pipe return nonzero (but irrelevant) exit codes.

  • risky-shell-pipe


deployment-automation's Issues

Enable SI on AWS

Make it possible to enable Shadow Indexing (SI) via a flag, so that deploying a new environment with SI works out of the box (after running the Terraform and Ansible scripts).

Ansible fails when enabling TLS due to safe restart

The code for safe restart uses rpk cluster maintenance enable {{ node_id }} which unfortunately uses the new rpk settings before TLS is enabled.

We either need a way to single-thread the whole of the last few plays (to ensure that maintenance mode is enabled, the node is restarted, and then maintenance mode is disabled, all in a single action), or we need to fold the updating of the redpanda.yaml file into that single shell command.

When I wrote this I couldn't find a way of having a block with throttle set within a role.
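
As a possible workaround (a minimal sketch only, not verified against this repo's roles), Ansible's throttle keyword can be applied to a single task inside a role so that only one host runs it at a time:

# Hypothetical task inside the role: serialize the maintenance/restart
# sequence by throttling this one task to a single host at a time.
- name: Enable maintenance mode, restart, then disable maintenance mode
  ansible.builtin.shell: |
    set -e
    rpk cluster maintenance enable {{ node_id }}
    sudo systemctl restart redpanda
    rpk cluster maintenance disable {{ node_id }}
  throttle: 1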

Validate/test if cluster configuration property of type string can be un-set

At the time of writing, rpk cluster config set has the following logic:
https://github.com/redpanda-data/redpanda/blob/c3cca097b01c250715fd4012bac175f09c26778e/src/go/rpk/pkg/cli/cmd/cluster/config/set.go#L85-L99

			if meta.Nullable && value == "null" {
				// Nullable types may be explicitly set to null
				upsert[key] = nil
			} else if meta.Type != "string" && (value == "") {
				// Non-string types that receive an empty string
				// are reset to default
				remove = append(remove, key)
			} else if meta.Type == "array" {
				var a anySlice
				err = yaml.Unmarshal([]byte(value), &a)
				out.MaybeDie(err, "invalid list syntax")
				upsert[key] = a
			} else {
				upsert[key] = value
			}

This means that any property of type string whose default value differs from "" (the empty string) cannot be unset.

I didn't test it, but it might be worth checking the rpk cluster config import
https://github.com/redpanda-data/redpanda/blob/c3cca097b01c250715fd4012bac175f09c26778e/src/go/rpk/pkg/cli/cmd/cluster/config/import.go#L98-L223

Related issue redpanda-data/helm-charts#395

Standardize client deployment across cloud providers

Right now the client instance is sometimes deployed and sometimes not, depending on the cloud provider chosen: the AWS default is false, the Azure default is true, etc. The same default should be used across all cloud providers.

(Defaults across cloud providers likely affect more than just client deployment; more issues can be created to track the others.)

by default only install *minimal* dependencies

Break out the dependencies into required and non-required.

Redpanda only needs redpanda itself: no JDK, Java, iotop, etc. All of those tools are just supplementary aids for debugging a cluster.

By default, only a minimal profile should be installed, e.g. minimal_profile=true (I made up that variable name); a sketch follows below.
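
A minimal sketch of how the extra tools could be gated behind such a (hypothetical) variable; the package list below is illustrative only:

# defaults/main.yml (hypothetical variable name from this issue)
minimal_profile: true

# tasks/main.yml
- name: Install supplementary debugging tools
  ansible.builtin.package:
    name:
      - iotop
      - sysstat
    state: present
  when: not (minimal_profile | bool)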

Terraform AWS deploy always regenerates all resources

Every time you run a terraform command, almost every resource (all the hosts, in particular) is torn down and replaced, so all the IP addresses change, any existing state is lost, etc.

The plan shows that the cause is the use of timestamp() as part of the deployment_id. Unlike the random UUID, which changes only when the deployment is initially created, the timestamp is different on every call, meaning that the deployment ID changes every time and we can't apply changes to an existing deployment: it effectively creates a new deployment every time.

As a minimal example, if you run two consecutive terraform apply commands with the AWS modules, you would expect the second to do nothing: after all, nothing has changed and the first apply put everything into the desired state. However, currently all hosts will be torn down and new ones created. The only thing that changed was the SSH key name, which is based on the deployment ID.

Add an integration testing GH Action

We need some verification that both the terraform and ansible parts work when we merge new changes.
There is a minimal terratest module for the tf code but we also need to test that the ansible modules work as expected at least for a minimal use case.

Hard to tell who created instances

It's hard to tell who created a given node, which makes it difficult to find abandoned machines or for users to check what they have running.

I propose we add a tag with the IAM username to make it clearer.

excessive polling for geerlingguy.node_exporter can trigger GitHub rate limits

Looks like geerlingguy.node_exporter may trigger rate limiting on GitHub's side because the role checks for the latest version each time Ansible runs. We need to figure out whether there's a way to disable this check or limit how often it runs.

TASK [geerlingguy.node_exporter : Check current node_exporter version.] *********************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/main.yml:2
ok: [XX.YY.138.212] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007202", "end": "2023-01-31 01:48:20.308344", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.301142", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}
ok: [XX.YY.97.33] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007200", "end": "2023-01-31 01:48:20.685369", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.678169", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}
ok: [XX.YY.156.16] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007467", "end": "2023-01-31 01:48:20.889350", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.881883", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}

TASK [geerlingguy.node_exporter : Configure latest version] *********************************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/main.yml:8
included: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/config-version.yaml for XX.YY.156.16, XX.YY.97.33, XX.YY.138.212
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (1 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (1 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (1 retries left).

TASK [geerlingguy.node_exporter : Determine latest GitHub release (local)] ******************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/config-version.yaml:2
fatal: [XX.YY.97.33 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
fatal: [XX.YY.138.212 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
fatal: [XX.YY.156.16 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}

Ansible-lint: provision-node.yml cleanup

Work through the remaining issues noted for provision-node.yml and its related roles and tasks.

warn_list:  # or 'skip_list' to silence them completely
  - command-instead-of-module  # Using command rather than module.
  - command-instead-of-shell  # Use shell only when shell functionality is required.
  - experimental  # all rules tagged as experimental
  - fqcn[action-core]  # Use FQCN for builtin actions.
  - name[missing]  # Rule for checking task and play names.
  - no-changed-when  # Commands should not change things if nothing needs doing.
  - risky-shell-pipe  # Shells that use pipes should set the pipefail option.
  - yaml[line-length]  # Violations reported by yamllint.

                               Rule Violation Summary
 count tag                       profile    rule associated tags
     1 command-instead-of-module basic      command-shell, idiom
     1 command-instead-of-shell  basic      command-shell, idiom
     2 key-order[task]           basic      formatting, experimental (warning)
    11 jinja[spacing]            basic      formatting (warning)
     8 name[missing]             basic      idiom
    11 name[play]                basic      idiom (warning)
     8 yaml[line-length]         basic      formatting, yaml
    22 name[casing]              moderate   idiom (warning)
     3 risky-file-permissions    safety     unpredictability, experimental (warning)
     1 risky-shell-pipe          safety     command-shell
     2 no-changed-when           shared     command-shell, idempotency
     1 fqcn[action-core]         production formatting
     2 fqcn[action]              production formatting (warning)
     2 args[module]                         syntax, experimental (warning)

Failed after min profile: 22 failure(s), 53 warning(s) on 10 files.
A new release of ansible-lint is available: 6.11.0 → 6.12.0 Upgrade by running: pip install --upgrade ansible-lint

s3 setup for archival storage

When deploying a cluster, it would be great to have an option that sets up infrastructure support for the archival storage feature in Redpanda. Briefly, this would be:

  1. s3 bucket
  2. query credentials, urls, etc...
  3. deploy redpanda.yml with archival settings from (1) and (2)

@Lazin do you have the set of configuration elements that we'll need? I thought they were in configuration.h but I don't see them in there. Maybe they haven't merged yet?
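
For reference, a rough sketch of the kind of cluster properties such an option would need to populate (property names taken from Redpanda's tiered storage configuration; values are placeholders and should be checked against current docs):

# Illustrative archival/tiered storage settings produced from steps (1) and (2).
cloud_storage_enabled: true
cloud_storage_bucket: my-redpanda-archival-bucket
cloud_storage_region: us-west-2
cloud_storage_access_key: "<access key from step 2>"
cloud_storage_secret_key: "<secret key from step 2>"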

Stranded Compute in CI builds

Currently there is a risk that resources can be stranded when a build is canceled. This will need to be resolved for long-term use of this CI to be safe and cost-effective. Mitigation techniques in place include:

  • skipping intermediate builds
  • wrapping tests in a trap script to ensure any interrupts are concluded with a destroy
  • using non-randomized naming so that stranded assets cause conflicts and are detected immediately

"as for cleaning up stranded compute which could be a risk upon every push to a PR, i can see 2 paths:

  1. test terraform apply only on manual request like via github label trigger. it would help if the aws resources could have a tag that could point back to the PR and git commit so resource cleanup can be easier.
  2. hook up this repo to terraform cloud to deploy only on merge to main and have the PR check lint and terraform plan. then the cloud will keep track of the state file and manage cleanup."

Originally posted by @andrewhsu in #137 (review)

Irregular failure in ansible script for grafana repository

There is an irregular failure when adding the Grafana repository that happens in about 5% of builds and causes a fatal failure.

TASK [cloudalchemy.grafana : Add Grafana repository [Debian/Ubuntu]] ****************************************************************************
FAILED - RETRYING: [35.90.68.22]: Add Grafana repository [Debian/Ubuntu] (5 retries left).
FAILED - RETRYING: [35.90.68.22]: Add Grafana repository [Debian/Ubuntu] (4 retries left).

Ansible code using the shell built-in is broken on Ansible releases 2.14+

If using Ansible 2.14 or later, it appears that we get failures for any usage of the shell command that has the warn parameter defined.

See background on the deprecation/removal: ansible/ansible#79379

There are two places where this occurs in our automation (one in our code, one in a dependency).

  1. roles/redpanda_broker/tasks/install-redpanda.yml, in the add the redpanda repo task
  2. the cloudalchemy/node-exporter dependency, in its preflight checks.

We can fix the first. The second is going to be harder because node-exporter appears to no longer be maintained and is possibly moving to a different repo per this open issue in their GitHub repo:

cloudalchemy/ansible-node-exporter#279

Looks like we need to use the prometheus-community version of node-exporter which patches this warn issue in the preflight.

prometheus-community/ansible@ec6c857
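
A minimal sketch of the first fix, assuming the task only needs the deprecated warn argument removed (the rest of the task stays as it is):

- name: add the redpanda repo
  ansible.builtin.shell: |
    curl -1sLf {{ redpanda_repo_script }} | sudo -E bash
  # The args/warn block is simply dropped; Ansible 2.14+ rejects it.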

Error connecting to kafkaJS with TLS.

I'll share in detail the steps I took and the errors I got.

First of all, I have a basic typescript project. And kafkaJS is installed in it. ( https://kafka.js.org/docs/getting-started )

My steps:

First, installing 1 broker and 1 monitoring node on AWS:

  • cd aws

  • terraform apply (the necessary parameters and public keys were created for the machines)

and the machines I wanted were successfully created on AWS.

Installing Redpanda and Grafana on the broker with Ansible:

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -e advertise_public_ips=true -e grafana_admin_pass=******* -i hosts.ini -v ansible/playbooks/provision-node.yml

and Redpanda installed successfully on the broker. I can also connect to Grafana with the password I specified.

private createKafkaConsumer(clientId: string, groupId: string): Consumer {

        const kafka = new Kafka({
            clientId: clientId,
            brokers: ['*.*.*.*:9092']
        })
        const consumer = kafka.consumer({ groupId })
        return consumer
    }

I can successfully connect to the broker with the code snippet above.

Up to this stage I can use Redpanda successfully, but the connection is not secure, so I have to enable TLS.

Here are the steps I need to follow according to the documentation for TLS:

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -v ansible/playbooks/create-ca.yml

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/generate-csrs.yml

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/issue-certs.yml

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/install-certs.yml

After these steps, the redpanda.yml file on the broker is updated accordingly. It is updated as in the template specified for redpanda.yml in the ansible playbook section.

Now I'm trying to connect to the broker again.

private createKafkaConsumer(clientId: string, groupId: string): Consumer {

        const kafka = new Kafka({
            clientId: clientId,
            brokers: ['*.*.*.*:9092'],
            ssl: {
                rejectUnauthorized: false,
                ca: [fs.readFileSync('/tls/ca/ca.crt', 'utf-8')]
            }
        })
        const consumer = kafka.consumer({ groupId })
        return consumer
    }

I can no longer connect to the broker that I was able to connect to without TLS. I'm doing something wrong, but I can't find it.

This is the error I got:

{"level":"INFO","timestamp":"2023-01-05T20:48:59.897Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:48:59.897Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"WebSiteProducer_Group"}
{"level":"INFO","timestamp":"2023-01-05T20:49:00.198Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"172.31.17.111:9092","clientId":"71b80823-8058-4279-9814-4561eb3840a4"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSConnectionError: Connection timeout","groupId":"WebSiteProducer_Group","stack":"KafkaJSConnectionError: Connection timeout\n    at Timeout.onTimeout [as _onTimeout] (/home/emredarak/repo/consumer/node_modules/kafkajs/src/network/connection.js:223:23)\n    at listOnTimeout (node:internal/timers:559:17)\n    at processTimers (node:internal/timers:502:7)"}
{"level":"INFO","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.346Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"WebSiteProducer_Group"}
{"level":"INFO","timestamp":"2023-01-05T20:49:01.647Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"WebSiteProducer_Group"}

setting 'advertise_public_ips=false' doesn't work

Setting -e 'advertise_public_ips=true' causes public IPs to be advertised, so -e 'advertise_public_ips=false' should be the opposite, right?

Not so fast! Setting it to false behaves the same as true because of the string-to-boolean coercion that happens in {% if ... %}: any non-empty string is truthy.

We should use the sort-of idiomatic var | d() | bool instead, which handles all cases: everything except the "yaml truthy" values is treated as false.
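
A minimal illustration of the difference (this template snippet is illustrative, not taken verbatim from the playbooks):

# Problematic: any non-empty string, including "false", is truthy in a bare test.
use_public_ips: "{% if advertise_public_ips %}true{% else %}false{% endif %}"

# Preferred: coerce explicitly; only the YAML-truthy strings evaluate to true.
use_public_ips: "{{ advertise_public_ips | default(false) | bool }}"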

Debug task referencing package_result.results appears to break ansible runs permanently.

Discovered while working on PR-119. It would be good to run this down, or at least figure out why removing the debug task doesn't allow the ansible run to resume running correctly. There's probably some other level of caching that I'm not catching; once package_result.results gets wedged, it's not clearing itself out and repopulating.

Somehow I'm managing to break the ansible runs with a debug statement that references a variable that (sometimes?) may not contain anything, and once it gets that way, future ansible runs break, even if the debug task is removed. I've tried running with --flush-cache on the ansible run and that's not fixing it. What's also concerning is that once it breaks, even with removing the debug, it impacts the Establish whether restart required task.

TASK [redpanda_broker : Establish whether restart required] *********************************************************************************************************************************************************
task path: /Users/tcampbell/Documents/github/prs/pr-119-redpanda-data-deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml:223
fatal: [XX.XX.XX.XX]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'results'. 'dict object' has no attribute 'results'\n\nThe error appears to be in '/Users/tcampbell/Documents/github/prs/pr-119-redpanda-data-deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml': line 223, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n#\n- name: Establish whether restart required\n  ^ here\n"}

It looks like it might be something within the package_result variable.

Here's the example debug I was working with:

- name: debug
  ansible.builtin.debug:
    msg:
    - "is_initialized is {{ is_initialized }}"
    - "restart_required_rc is {{ restart_required_rc.stdout }}"
    - "package_result is {{ package_result }}"
    - "nodeconfig_result is {{ nodeconfig_result }}"
    - "restart_node is {{ restart_node | default('true') | bool }}"
    - "is restart_required_rc.stdout true: {{ restart_required_rc.stdout }}"
    - "is nodeconfig_result changed {{ nodeconfig_result.changed }}"
    - "is package_result.results removed or ugprade {{ package_result.results }}"
#    - "what is_initialized and result of (nodeconfig_result.changed or package_result) {{ is_initialized and (nodeconfig_result.changed or 'Removed' in package_result.results or '1 upgraded' in package_result.results) }}"
#    - "restart_required: {{ ('true' in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or 'Removed' in package_result.results or '1 upgraded' in package_result.results))) and (restart_node | default('true') | bool) }}"

Revisit client configuration

  • Terraform is configured to deploy a client EC2 instance (if var.clients is non-zero), but there is no ansible configuration to install the Redpanda CLI
  • var.clients is an unbounded integer value, but maybe it should just be a boolean
  • Wouldn't users still need to connect directly to nodes at some point (viewing logs, etc.)? Having a client seems like an added step, and we would need to explain to the user when to connect to the client and when to just connect to the node. Possibly a better route would be to remove the client.

tls/install-certs.yml playbook deletes rpc_server

tls/install-certs.yml: This part of the code sets the redpanda section again, so it deletes whatever we had inside the redpanda field in redpanda.yaml. We might be deleting important user configurations such as rpc_server or additional node properties.

We should instead set kafka_api_tls and rpc_server_tls separately via rpk:

instead of:

  - name: Configure via RPK
    shell:
      cmd: |
        rpk redpanda config set redpanda "
          kafka_api:
          - name: default
            address: {{ hostvars[inventory_hostname]['private_ip'] }}
            port: 9092
          advertised_kafka_api:
          - name: default
            address: {{ inventory_hostname }}
            port: 9092
          kafka_api_tls:
          - name: default
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
          rpc_server_tls:
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
        " --format yaml

do:

  - name: Configure via RPK
    shell:
      cmd: |
        set -e
         
        rpk redpanda config set redpanda.kafka_api_tls "
          - name: default
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
         " --format yaml

        rpk redpanda config set redpanda.rpc_server_tls "
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
         " --format yaml

AWS Cloudformation demo

I was asked to move redpanda-data/redpanda#2844 here.

For engineers on AWS, it would be amazing to have a vectorized.io zipfile you could place on S3, with enterprise license features stripped out, built for Ubuntu 20.04 - and a CloudFormation script similar to the one for AWS MSK, so it is almost a drop-in replacement for testing/benchmarking.

For testing it would also be amazing if you released a free https://github.com/localstack/localstack plugin to provide a Redpanda mock of AWS MSK on localhost. That is the easiest way for me to get Redpanda code into the enterprise git repository.

Setup VPC and stop relying on IAM user's default VPC

Right now the script relies on the IAM user associated with the access/secret keys to have a default VPC configured. Without a default VPC, the following error appears when running terraform apply:

Error: error creating Security Group (redpanda-9b2beded-7423-d346-b28e-896c27dd58de-2022-10-20T14:29:42Z-node-sec-group): VPCIdNotSpecified: No default VPC for this user
│       status code: 400, request id: 9ec37fe0-bd45-4c1f-81b4-9070e9faeadc
│ 
│   with aws_security_group.node_sec_group,
│   on cluster.tf line 65, in resource "aws_security_group" "node_sec_group":
│   65: resource "aws_security_group" "node_sec_group" {
│ 
╵
exit 1

The script could instead create a VPC rather than relying on a default VPC being present.

rpk config set fails because of list vs single issue

As described in redpanda-data/redpanda#2958 things can go wrong when the list syntax for some properties are mixed with the single map syntax.

This happens in the Ansible deploy, where we use the single syntax, but it gets transformed into a list; subsequent Redpanda configuration runs then fail to set that property because a list can't be overwritten with a single value.

The list format is preferred and supported by all tools, while the single value is effectively deprecated and not supported by some tools. We should use list syntax.
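
For illustration, the two shapes in redpanda.yaml (addresses are placeholders); the list form is the one to standardize on:

# Single-map syntax (effectively deprecated):
advertised_kafka_api:
  address: 10.0.0.1
  port: 9092

# List syntax (preferred and supported by all tools):
advertised_kafka_api:
  - address: 10.0.0.1
    port: 9092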

Using curl bash to install local repo is probably not a great idea

Currently we are downloading and setting up a local repo on the servers using a curl script, and then installing from that local repo with yum, etc.

This should likely be changed to use our public repos with the appropriate package tool, an explicit version flag, and a related variable.

Current code:

- name: add the redpanda repo
  shell: |
    curl -1sLf {{ redpanda_repo_script }} | sudo -E bash
  args:
    warn: no
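
A minimal sketch of a package-module approach instead (the repository URL, GPG key URL, and version variable are illustrative placeholders, not the real Redpanda repo values):

- name: Add the Redpanda yum repository (illustrative values)
  ansible.builtin.yum_repository:
    name: redpanda
    description: Redpanda package repository
    baseurl: https://example.com/redpanda/rpm/el/$releasever/$basearch  # placeholder URL
    gpgcheck: true
    gpgkey: https://example.com/redpanda/gpg.key  # placeholder key URL

- name: Install a pinned Redpanda version
  ansible.builtin.yum:
    name: "redpanda-{{ redpanda_version }}"  # explicit version pin via variable
    state: present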

Steps with multiple-line shell scripts only catch failures on the last command

Consider the configure redpanda step:

  - name: configure redpanda
    notify:
      - restart redpanda-tuner
      - restart redpanda
    vars:
      use_public_ips: "{{ advertise_public_ips | default(false, true) | bool }}"
    shell: |
      rpk config set cluster_id 'redpanda'
      rpk config set organization 'redpanda-test'
      rpk config set redpanda.advertised_kafka_api '{
      {% if use_public_ips %}
        address: {{ inventory_hostname }},
      {% else %}
        address: {{ hostvars[inventory_hostname].private_ip }},
      {% endif %}
        port: 9092
      }' --format yaml
      rpk config set redpanda.advertised_rpc_api '{
        address: {{ hostvars[inventory_hostname].private_ip }},
        port: 33145
      }' --format yaml
      sudo rpk mode production

      {% if hostvars[groups['redpanda'][0]].id == hostvars[inventory_hostname].id %}
      sudo rpk config bootstrap \
        --id {{ hostvars[inventory_hostname].id }} \
        --self {{ hostvars[inventory_hostname].private_ip }}

      {% else %}

      sudo rpk config bootstrap \
        --id {{ hostvars[inventory_hostname].id }} \
        --self {{ hostvars[inventory_hostname].private_ip }} \
        --ips {{ groups["redpanda"] | map('extract', hostvars) | map(attribute='private_ip') | join(',') }}
      {% endif %}

If any shell command fails other than the last one, this step will succeed when it should fail, since these steps are effectively run as a shell script whose return value is that of the last command to run.

The easiest fix is probably just a set -e at the top of the script.
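
A minimal sketch of that fix: add set -e as the first line of the script so the task fails on the first failing command (only the top of the task is shown; the remaining commands stay as they are):

  - name: configure redpanda
    shell: |
      set -e   # abort on the first failing command instead of only reporting the last one
      rpk config set cluster_id 'redpanda'
      rpk config set organization 'redpanda-test'
      # ... rest of the script unchanged ...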

Improve EC2 instance naming

  1. add instance_name_prefix variable and include at beginning of name tag
  2. include count.index at end of name tag

Establish restart fails in some conditions because package_result isn't populated right

In start-redpanda.yml, the task that establishes whether a restart is necessary seems to fail on occasion when the package_result registered variable is checked for the results key. Not sure why it's missing in some cases, but this seems to at least be the fix for now. Documenting it so we can get a branch made and tested.

index 1ff9680..42a9827 100644
--- a/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml
+++ b/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml
@@ -210,7 +210,8 @@
 #
 - name: Establish whether restart required
   ansible.builtin.set_fact:
-    restart_required: '{{ ("true" in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or "Removed" in package_result.results or "1 upgraded" in package_result.results))) and (restart_node | default("true") | bool) }}'
+    restart_required: '{{ ("true" in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or package_result.results is defined and ("Removed" in package_result.results or "1 upgraded" in package_result.results)))) and (restart_node | default("true") | bool) }}'


 # serial: 1 would be the proper solution here, but that can only be set at the play level

[Slack Message](https://redpandadata.slack.com/archives/C049AE7V4U8/p1678491547657399?thread_ts=1678485045.593989&cid=C049AE7V4U8)

Update deprecated rpk commands

We use several deprecated commands:

  • rpk tune
  • rpk config
  • rpk mode

We should update to the new commands to avoid printing deprecation warnings and confusing users.

High message counts cause critical failures in benchmark processes

Recently, while attempting to generate 1.5 GB/sec of traffic to an appropriately sized cluster using 100-byte messages (the customer is moving from SQS), we encountered difficulties with the OMB producers. Throughput falls apart somewhere around 1 to 1.2 million messages per second. The same cluster can handle 1.5 GB/sec with 1024-byte messages with excellent performance; 1.5 GB/sec with 100-byte messages (same batch size, etc.) results in producers erroring out and aborting.

@travisdowns has additional details around the nature of these failures he can add to this issue.

Additional experiments using other client technologies have been in progress by @larsenpanda .

Ultimately we need to update our client so that these types of workloads succeed out of the box, and define limits for the client in documentation so customers know what the upper boundaries of the producers and consumers are and do not infer poor performance by Redpanda.

jmespath install is needed prior to running json_query_filter

When running the provision node ansible playbook, the redpanda nodes succeed but the prometheus node fails with the following message:

TASK [cloudalchemy.prometheus : Get all file_sd files from scrape_configs] *************************************************************************************************************************************************************
fatal: [35.84.208.32]: FAILED! => {"msg": "You need to install \"jmespath\" prior to running json_query filter"}
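
The jmespath Python library must be present on the machine running Ansible, since the json_query filter executes locally. A minimal sketch of installing it from a playbook (equivalently, just run pip install jmespath on the controller):

- name: Ensure jmespath is available for the json_query filter
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Install jmespath on the Ansible controller
      ansible.builtin.pip:
        name: jmespath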

Allow for use of hostnames in `advertised_kafka_api` and other locations

Presently, the Ansible scripts work off a set of IP addresses to configure broker nodes. These IP addresses are used to identify listener bindings, but also to come up with the list of seed brokers and advertised addresses. We should allow hostnames to be passed for advertised addresses and other settings.
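
For illustration, what the broker configuration might look like with a hostname instead of a raw IP (hostname and port below are placeholders):

# Illustrative redpanda.yaml fragment using a hostname for the advertised listener.
redpanda:
  advertised_kafka_api:
    - name: default
      address: broker-0.example.internal
      port: 9092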
