
deployment-automation's Introduction

Terraform and Ansible Deployment for Redpanda


Terraform and Ansible configuration to easily provision a Redpanda cluster on AWS, GCP, Azure, or IBM.

Installation Prerequisites

Here are the prerequisites you'll need to install to run the content in this repo. You can also use the Dockerfile_FEDORA or Dockerfile_UBUNTU Dockerfiles to build a local client if you'd rather not install Terraform and Ansible on your machine.

On Mac OS X:

You can use brew to install the prerequisites. You will also need to install gnu-tar:

brew tap hashicorp/tap
brew install hashicorp/tap/terraform
brew install ansible
brew install gnu-tar

Basic Usage:

# Set required ansible variables
export CLOUD_PROVIDER=aws
export ANSIBLE_COLLECTIONS_PATHS=${PWD}/artifacts/collections
export ANSIBLE_ROLES_PATH=${PWD}/artifacts/roles
export ANSIBLE_INVENTORY=${PWD}/${CLOUD_PROVIDER}/hosts.ini

# Assumes default private and public key names; if these aren't correct for you, set them to the correct values

# Deploy VM
# ASSUMES YOU HAVE A DEFAULT VPC; if you don't, create one and set vpc_id and subnet_id
cd $CLOUD_PROVIDER
terraform init
terraform apply --auto-approve -var='public_key_path=~/.ssh/id_rsa.pub' -var='deployment_prefix=go-rp'
cd ..

# Install collections and roles
ansible-galaxy install -r ./requirements.yml

# Run a Playbook
# Pick the playbook that matches your deployment; in this case we use provision-basic-cluster
ansible-playbook ansible/provision-basic-cluster.yml --private-key ~/.ssh/id_rsa

Additional Documentation

More information on consuming this collection is available in our official documentation.

Troubleshooting

On Mac OS X, Python unable to fork workers

If you see something like this:

ok: [34.209.26.177] => {"changed": false, "stat": {"exists": false}}
objc[57889]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[57889]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
ERROR! A worker was found in a dead state

You might try resolving by setting an environment variable: export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

See: https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr

Contribution Guide

Testing with a specific branch of redpanda-ansible-collection

Change the redpanda.cluster entry in your requirements.yml file to the following:

  - name: https://github.com/redpanda-data/redpanda-ansible-collection.git
    type: git
    version: <<<YOUR BRANCH NAME>>>

pre-commit

We use pre-commit to ensure good code health on this repo. To install pre-commit, see the pre-commit documentation. The basic idea is that a fairly comprehensive checkup happens on each commit, ensuring that everything is properly formatted and validated. You may also need to install some prerequisite tools for pre-commit to work correctly. At the time of writing this includes:

Ansible Linter Skip List Whys and Wherefores

These rules were skipped because a lot of effort to bring the linter and IDE into alignment produced no meaningful improvement in readability, outcomes, or correctness:

  • jinja[spacing]
  • yaml[brackets]
  • yaml[line-length]

Skipped because enforcing it (setting pipefail) breaks the play: intermediate commands in the pipe return nonzero (but irrelevant) exit codes.

  • risky-shell-pipe


deployment-automation's Issues

Enable SI on AWS

Make it possible to enable Shadow Indexing (SI) via a flag, so that deploying a new environment with SI works out of the box (after running the Terraform and Ansible scripts).

Ansible fails when enabling TLS due to safe restart

The code for safe restart uses rpk cluster maintenance enable {{ node_id }} which unfortunately uses the new rpk settings before TLS is enabled.

We either need a way to single-thread the whole of the last few plays (to ensure that maintenance mode is enabled, the node is restarted, and then maintenance mode is disabled, all in a single action), or we need to fold the updating of the redpanda.yaml file into that single shell command.

When I wrote this I couldn't find a way of having a block with throttle set within a role.
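
As a possible workaround (a minimal sketch only, not verified against this repo's roles), Ansible's throttle keyword can be applied to a single task inside a role so that only one host runs it at a time:

# Hypothetical task inside the role: serialize the maintenance/restart
# sequence by throttling this one task to a single host at a time.
- name: Enable maintenance mode, restart, then disable maintenance mode
  ansible.builtin.shell: |
    set -e
    rpk cluster maintenance enable {{ node_id }}
    sudo systemctl restart redpanda
    rpk cluster maintenance disable {{ node_id }}
  throttle: 1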

Validate/test if cluster configuration property of type string can be un-set

At the time of writing, rpk cluster config set has the following logic:
https://github.com/redpanda-data/redpanda/blob/c3cca097b01c250715fd4012bac175f09c26778e/src/go/rpk/pkg/cli/cmd/cluster/config/set.go#L85-L99

			if meta.Nullable && value == "null" {
				// Nullable types may be explicitly set to null
				upsert[key] = nil
			} else if meta.Type != "string" && (value == "") {
				// Non-string types that receive an empty string
				// are reset to default
				remove = append(remove, key)
			} else if meta.Type == "array" {
				var a anySlice
				err = yaml.Unmarshal([]byte(value), &a)
				out.MaybeDie(err, "invalid list syntax")
				upsert[key] = a
			} else {
				upsert[key] = value
			}

This means that any property of type string whose default value differs from "" (the empty string) cannot be unset.

I didn't test it, but it might be worth checking the rpk cluster config import
https://github.com/redpanda-data/redpanda/blob/c3cca097b01c250715fd4012bac175f09c26778e/src/go/rpk/pkg/cli/cmd/cluster/config/import.go#L98-L223

Related issue redpanda-data/helm-charts#395

Standardize client deployment across cloud providers

Right now the client instance is sometimes deployed and sometimes not, depending on the cloud provider chosen: the AWS default is false, the Azure default is true, etc. The same default should be used across all cloud providers.

(Defaults across cloud providers likely affect more than just client deployment; more issues can be created to track the others.)

by default only install *minimal* dependencies

Break out the dependencies into required and non-required.

Redpanda only needs redpanda itself: no JDK, Java, iotop, etc. All of those tools are just supplementary aids for debugging a cluster.

By default, only a minimal profile should be installed, e.g. minimal_profile=true (I made up that variable name); a sketch follows below.
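
A minimal sketch of how the extra tools could be gated behind such a (hypothetical) variable; the package list below is illustrative only:

# defaults/main.yml (hypothetical variable name from this issue)
minimal_profile: true

# tasks/main.yml
- name: Install supplementary debugging tools
  ansible.builtin.package:
    name:
      - iotop
      - sysstat
    state: present
  when: not (minimal_profile | bool)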

Terraform AWS deploy always regenerates all resources

Every time you run a terraform command, almost every resource (all the hosts, in particular) is torn down and replaced, so all the IP addresses change, any existing state is lost, etc.

The plan shows that the cause is the use of timestamp() as part of the deployment_id. Unlike the random UUID, which changes only when the deployment is initially created, the timestamp is different on every call, meaning that the deployment ID changes every time and we can't apply changes to an existing deployment: it effectively creates a new deployment every time.

As a minimal example, if you run two consecutive terraform apply commands with the AWS modules, you would expect the second to do nothing: after all, nothing has changed and the first apply put everything into the desired state. However, currently all hosts will be torn down and new ones created. The only thing that changed was the SSH key name, which is based on the deployment ID.

Add an integration testing GH Action

We need some verification that both the terraform and ansible parts work when we merge new changes.
There is a minimal terratest module for the tf code but we also need to test that the ansible modules work as expected at least for a minimal use case.

Hard to tell who created instances

It's hard to tell who created a given node, which makes it difficult to find abandoned machines or for users to check what they have running.

I propose we add a tag with the IAM username to make it clearer.

excessive polling for geerlingguy.node_exporter can trigger GitHub rate limits

Looks like geerlingguy.node_exporter may trigger rate limiting on GitHub's side because the role checks for the latest version each time Ansible runs. We need to figure out whether there's a way to disable this check or limit how often it runs.

TASK [geerlingguy.node_exporter : Check current node_exporter version.] *********************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/main.yml:2
ok: [XX.YY.138.212] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007202", "end": "2023-01-31 01:48:20.308344", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.301142", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}
ok: [XX.YY.97.33] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007200", "end": "2023-01-31 01:48:20.685369", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.678169", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}
ok: [XX.YY.156.16] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007467", "end": "2023-01-31 01:48:20.889350", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.881883", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}

TASK [geerlingguy.node_exporter : Configure latest version] *********************************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/main.yml:8
included: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/config-version.yaml for XX.YY.156.16, XX.YY.97.33, XX.YY.138.212
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (1 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (1 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (1 retries left).

TASK [geerlingguy.node_exporter : Determine latest GitHub release (local)] ******************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/config-version.yaml:2
fatal: [XX.YY.97.33 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
fatal: [XX.YY.138.212 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
fatal: [XX.YY.156.16 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}

Ansible-lint: provision-node.yml cleanup

Work through the remaining issues noted for provision-node.yml and its related roles and tasks.

warn_list:  # or 'skip_list' to silence them completely
  - command-instead-of-module  # Using command rather than module.
  - command-instead-of-shell  # Use shell only when shell functionality is required.
  - experimental  # all rules tagged as experimental
  - fqcn[action-core]  # Use FQCN for builtin actions.
  - name[missing]  # Rule for checking task and play names.
  - no-changed-when  # Commands should not change things if nothing needs doing.
  - risky-shell-pipe  # Shells that use pipes should set the pipefail option.
  - yaml[line-length]  # Violations reported by yamllint.

                               Rule Violation Summary
 count tag                       profile    rule associated tags
     1 command-instead-of-module basic      command-shell, idiom
     1 command-instead-of-shell  basic      command-shell, idiom
     2 key-order[task]           basic      formatting, experimental (warning)
    11 jinja[spacing]            basic      formatting (warning)
     8 name[missing]             basic      idiom
    11 name[play]                basic      idiom (warning)
     8 yaml[line-length]         basic      formatting, yaml
    22 name[casing]              moderate   idiom (warning)
     3 risky-file-permissions    safety     unpredictability, experimental (warning)
     1 risky-shell-pipe          safety     command-shell
     2 no-changed-when           shared     command-shell, idempotency
     1 fqcn[action-core]         production formatting
     2 fqcn[action]              production formatting (warning)
     2 args[module]                         syntax, experimental (warning)

Failed after min profile: 22 failure(s), 53 warning(s) on 10 files.
A new release of ansible-lint is available: 6.11.0 → 6.12.0 Upgrade by running: pip install --upgrade ansible-lint

s3 setup for archival storage

When deploying a cluster, it would be great to have an option that sets up infrastructure support for the archival storage feature in Redpanda. Briefly, this would be:

  1. s3 bucket
  2. query credentials, urls, etc...
  3. deploy redpanda.yml with archival settings from (1) and (2)

@Lazin do you have the set of configuration elements that we'll need? I thought they were in configuration.h but I don't see them in there. Maybe they haven't merged yet?
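
For reference, a rough sketch of the kind of cluster properties such an option would need to populate (property names taken from Redpanda's tiered storage configuration; values are placeholders and should be checked against current docs):

# Illustrative archival/tiered storage settings produced from steps (1) and (2).
cloud_storage_enabled: true
cloud_storage_bucket: my-redpanda-archival-bucket
cloud_storage_region: us-west-2
cloud_storage_access_key: "<access key from step 2>"
cloud_storage_secret_key: "<secret key from step 2>"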

Stranded Compute in CI builds

Currently there is a risk that resources can be stranded when a build is canceled. This will need to be resolved for long-term use of this CI to be safe and cost-effective. Mitigation techniques in place include:

  • skipping intermediate builds
  • wrapping tests in a trap script to ensure any interrupts are concluded with a destroy
  • using non-randomized naming so that stranded assets cause conflicts and are detected immediately

"as for cleaning up stranded compute which could be a risk upon every push to a PR, i can see 2 paths:

  1. test terraform apply only on manual request like via github label trigger. it would help if the aws resources could have a tag that could point back to the PR and git commit so resource cleanup can be easier.
  2. hook up this repo to terraform cloud to deploy only on merge to main and have the PR check lint and terraform plan. then the cloud will keep track of the state file and manage cleanup."

Originally posted by @andrewhsu in #137 (review)

Irregular failure in ansible script for grafana repository

There is an irregular failure when adding the Grafana repository that happens in about 5% of builds and causes a fatal failure.

TASK [cloudalchemy.grafana : Add Grafana repository [Debian/Ubuntu]] ****************************************************************************
FAILED - RETRYING: [35.90.68.22]: Add Grafana repository [Debian/Ubuntu] (5 retries left).
FAILED - RETRYING: [35.90.68.22]: Add Grafana repository [Debian/Ubuntu] (4 retries left).

Ansible code using the shell built-in is broken on Ansible releases 2.14+

If using Ansible 2.14 or later, it appears that we get failures for any usage of the shell command that has the warn parameter defined.

See background on the deprecation/removal: ansible/ansible#79379

There are two places where this occurs in our automation (one in our code, one in a dependency).

  1. roles/redpanda_broker/tasks/install-redpanda.yml, in the add the redpanda repo task
  2. the cloudalchemy/node-exporter dependency, in its preflight checks.

We can fix the first. The second is going to be harder because node-exporter appears to no longer be maintained and is possibly moving to a different repo per this open issue in their GitHub repo:

cloudalchemy/ansible-node-exporter#279

Looks like we need to use the prometheus-community version of node-exporter which patches this warn issue in the preflight.

prometheus-community/ansible@ec6c857
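
A minimal sketch of the first fix, assuming the task only needs the deprecated warn argument removed (the rest of the task stays as it is):

- name: add the redpanda repo
  ansible.builtin.shell: |
    curl -1sLf {{ redpanda_repo_script }} | sudo -E bash
  # The args/warn block is simply dropped; Ansible 2.14+ rejects it.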

Error connecting to kafkaJS with TLS.

I'll share in detail the steps I took and the errors I got.

First of all, I have a basic typescript project. And kafkaJS is installed in it. ( https://kafka.js.org/docs/getting-started )

My steps:

First, installing 1 broker and 1 monitoring node on AWS:

  • cd aws

  • terraform apply (the necessary parameters and public keys were created for the machines)

and the machines I wanted were successfully created on AWS.

Installing Redpanda and Grafana on the broker with Ansible:

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -e advertise_public_ips=true -e grafana_admin_pass=******* -i hosts.ini -v ansible/playbooks/provision-node.yml

and Redpanda installed successfully on the broker. I can also connect to Grafana with the password I specified.

private createKafkaConsumer(clientId: string, groupId: string): Consumer {

        const kafka = new Kafka({
            clientId: clientId,
            brokers: ['*.*.*.*:9092']
        })
        const consumer = kafka.consumer({ groupId })
        return consumer
    }

I can successfully connect to the broker with the code snippet above.

Up to this stage I can use Redpanda successfully, but the connection is not secure, so I have to enable TLS.

Here are the steps I need to follow according to the documentation for TLS:

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -v ansible/playbooks/create-ca.yml

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/generate-csrs.yml

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/issue-certs.yml

  • ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/install-certs.yml

After these steps, the redpanda.yml file on the broker is updated accordingly. It is updated as in the template specified for redpanda.yml in the ansible playbook section.

Now I'm trying to connect to the broker again.

private createKafkaConsumer(clientId: string, groupId: string): Consumer {

        const kafka = new Kafka({
            clientId: clientId,
            brokers: ['*.*.*.*:9092'],
            ssl: {
                rejectUnauthorized: false,
                ca: [fs.readFileSync('/tls/ca/ca.crt', 'utf-8')]
            }
        })
        const consumer = kafka.consumer({ groupId })
        return consumer
    }

I can no longer connect to the broker that I was able to connect to without TLS. I'm doing something wrong, but I can't find it.

This is the error I got:

{"level":"INFO","timestamp":"2023-01-05T20:48:59.897Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:48:59.897Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"WebSiteProducer_Group"}
{"level":"INFO","timestamp":"2023-01-05T20:49:00.198Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"172.31.17.111:9092","clientId":"71b80823-8058-4279-9814-4561eb3840a4"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSConnectionError: Connection timeout","groupId":"WebSiteProducer_Group","stack":"KafkaJSConnectionError: Connection timeout\n    at Timeout.onTimeout [as _onTimeout] (/home/emredarak/repo/consumer/node_modules/kafkajs/src/network/connection.js:223:23)\n    at listOnTimeout (node:internal/timers:559:17)\n    at processTimers (node:internal/timers:502:7)"}
{"level":"INFO","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.346Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"WebSiteProducer_Group"}
{"level":"INFO","timestamp":"2023-01-05T20:49:01.647Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"WebSiteProducer_Group"}

setting 'advertise_public_ips=false' doesn't work

Setting -e 'advertise_public_ips=true' causes public IPs to be advertised, so -e 'advertise_public_ips=false' should be the opposite, right?

Not so fast! Setting it to false behaves the same as true because of the string-to-boolean coercion that happens in {% if ... %}: any non-empty string is truthy.

We should use the sort-of idiomatic var | d() | bool instead, which handles all cases: everything except the "yaml truthy" values is treated as false.
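
A minimal illustration of the difference (this template snippet is illustrative, not taken verbatim from the playbooks):

# Problematic: any non-empty string, including "false", is truthy in a bare test.
use_public_ips: "{% if advertise_public_ips %}true{% else %}false{% endif %}"

# Preferred: coerce explicitly; only the YAML-truthy strings evaluate to true.
use_public_ips: "{{ advertise_public_ips | default(false) | bool }}"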

Debug task referencing package_result.results appears to break ansible runs permanently.

Discovered while working on PR-119. It would be good to run this down, or at least figure out why removing the debug task doesn't allow the ansible run to resume running correctly. There's probably some other level of caching that I'm not catching; once package_result.results gets wedged, it's not clearing itself out and repopulating.

Somehow I'm managing to break the ansible runs with a debug statement that references a variable that (sometimes?) may not contain anything, and once it gets that way, future ansible runs break, even if the debug task is removed. I've tried running with --flush-cache on the ansible run and that's not fixing it. What's also concerning is that once it breaks, even with removing the debug, it impacts the Establish whether restart required task.

TASK [redpanda_broker : Establish whether restart required] *********************************************************************************************************************************************************
task path: /Users/tcampbell/Documents/github/prs/pr-119-redpanda-data-deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml:223
fatal: [XX.XX.XX.XX]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'results'. 'dict object' has no attribute 'results'\n\nThe error appears to be in '/Users/tcampbell/Documents/github/prs/pr-119-redpanda-data-deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml': line 223, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n#\n- name: Establish whether restart required\n  ^ here\n"}

It looks like it might be something within the package_result variable.

Here's the example debug I was working with:

- name: debug
  ansible.builtin.debug:
    msg:
    - "is_initialized is {{ is_initialized }}"
    - "restart_required_rc is {{ restart_required_rc.stdout }}"
    - "package_result is {{ package_result }}"
    - "nodeconfig_result is {{ nodeconfig_result }}"
    - "restart_node is {{ restart_node | default('true') | bool }}"
    - "is restart_required_rc.stdout true: {{ restart_required_rc.stdout }}"
    - "is nodeconfig_result changed {{ nodeconfig_result.changed }}"
    - "is package_result.results removed or ugprade {{ package_result.results }}"
#    - "what is_initialized and result of (nodeconfig_result.changed or package_result) {{ is_initialized and (nodeconfig_result.changed or 'Removed' in package_result.results or '1 upgraded' in package_result.results) }}"
#    - "restart_required: {{ ('true' in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or 'Removed' in package_result.results or '1 upgraded' in package_result.results))) and (restart_node | default('true') | bool) }}"

Revisit client configuration

  • Terraform is configured to deploy a client EC2 instance (if var.clients is non-zero), but there is no ansible configuration to install the Redpanda CLI
  • var.clients is an unbounded integer value, but maybe it should just be a boolean
  • Wouldn't users still need to connect directly to nodes at some point (viewing logs, etc.)? Having a client seems like an added step, and we would need to explain to the user when to connect to the client and when to just connect to the node. Possibly a better route would be to remove the client.

tls/install-certs.yml playbook deletes rpc_server

tls/install-certs.yml: This part of the code sets the redpanda section again, so it deletes whatever we had inside the redpanda field in redpanda.yaml. We might be deleting important user configurations such as rpc_server or additional node properties.

We should instead set kafka_api_tls and rpc_server_tls separately via rpk:

instead of:

  - name: Configure via RPK
    shell:
      cmd: |
        rpk redpanda config set redpanda "
          kafka_api:
          - name: default
            address: {{ hostvars[inventory_hostname]['private_ip'] }}
            port: 9092
          advertised_kafka_api:
          - name: default
            address: {{ inventory_hostname }}
            port: 9092
          kafka_api_tls:
          - name: default
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
          rpc_server_tls:
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
        " --format yaml

do:

  - name: Configure via RPK
    shell:
      cmd: |
        set -e
         
        rpk redpanda config set redpanda.kafka_api_tls "
          - name: default
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
         " --format yaml

        rpk redpanda config set redpanda.rpc_server_tls "
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
         " --format yaml

AWS Cloudformation demo

I was asked to move redpanda-data/redpanda#2844 here.

For engineers on AWS, it would be amazing to have a vectorized.io zipfile you could place on S3, with enterprise license features stripped out, built for Ubuntu 20.04 - and a CloudFormation script similar to the one for AWS MSK, so it is almost a drop-in replacement for testing/benchmarking.

For testing it would also be amazing if you released a free https://github.com/localstack/localstack plugin to provide a Redpanda mock of AWS MSK on localhost. That is the easiest way for me to get Redpanda code into the enterprise git repository.

Setup VPC and stop relying on IAM user's default VPC

Right now the script relies on the IAM user associated with the access/secret keys to have a default VPC configured. Without a default VPC, the following error appears when running terraform apply:

Error: error creating Security Group (redpanda-9b2beded-7423-d346-b28e-896c27dd58de-2022-10-20T14:29:42Z-node-sec-group): VPCIdNotSpecified: No default VPC for this user
│       status code: 400, request id: 9ec37fe0-bd45-4c1f-81b4-9070e9faeadc
│ 
│   with aws_security_group.node_sec_group,
│   on cluster.tf line 65, in resource "aws_security_group" "node_sec_group":
│   65: resource "aws_security_group" "node_sec_group" {
│ 
╵
exit 1

The script could instead create a VPC rather than relying on a default VPC being present.

rpk config set fails because of list vs single issue

As described in redpanda-data/redpanda#2958 things can go wrong when the list syntax for some properties are mixed with the single map syntax.

This happens in the Ansible deploy, where we use the single syntax, but it gets transformed into a list; subsequent Redpanda configuration runs then fail to set that property because a list can't be overwritten with a single value.

The list format is preferred and supported by all tools, while the single value is effectively deprecated and not supported by some tools. We should use list syntax.
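
For illustration, the two shapes in redpanda.yaml (addresses are placeholders); the list form is the one to standardize on:

# Single-map syntax (effectively deprecated):
advertised_kafka_api:
  address: 10.0.0.1
  port: 9092

# List syntax (preferred and supported by all tools):
advertised_kafka_api:
  - address: 10.0.0.1
    port: 9092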

Using curl bash to install local repo is probably not a great idea

Currently we are downloading and setting up a local repo on the servers using a curl script, and then installing from that local repo with yum, etc.

This should likely be changed to use our public repos with the appropriate package tool, an explicit version flag, and a related variable.

Current code:

- name: add the redpanda repo
  shell: |
    curl -1sLf {{ redpanda_repo_script }} | sudo -E bash
  args:
    warn: no
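
A minimal sketch of a package-module approach instead (the repository URL, GPG key URL, and version variable are illustrative placeholders, not the real Redpanda repo values):

- name: Add the Redpanda yum repository (illustrative values)
  ansible.builtin.yum_repository:
    name: redpanda
    description: Redpanda package repository
    baseurl: https://example.com/redpanda/rpm/el/$releasever/$basearch  # placeholder URL
    gpgcheck: true
    gpgkey: https://example.com/redpanda/gpg.key  # placeholder key URL

- name: Install a pinned Redpanda version
  ansible.builtin.yum:
    name: "redpanda-{{ redpanda_version }}"  # explicit version pin via variable
    state: present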

Steps with multiple-line shell scripts only catch failures on the last command

Consider the configure redpanda step:

  - name: configure redpanda
    notify:
      - restart redpanda-tuner
      - restart redpanda
    vars:
      use_public_ips: "{{ advertise_public_ips | default(false, true) | bool }}"
    shell: |
      rpk config set cluster_id 'redpanda'
      rpk config set organization 'redpanda-test'
      rpk config set redpanda.advertised_kafka_api '{
      {% if use_public_ips %}
        address: {{ inventory_hostname }},
      {% else %}
        address: {{ hostvars[inventory_hostname].private_ip }},
      {% endif %}
        port: 9092
      }' --format yaml
      rpk config set redpanda.advertised_rpc_api '{
        address: {{ hostvars[inventory_hostname].private_ip }},
        port: 33145
      }' --format yaml
      sudo rpk mode production

      {% if hostvars[groups['redpanda'][0]].id == hostvars[inventory_hostname].id %}
      sudo rpk config bootstrap \
        --id {{ hostvars[inventory_hostname].id }} \
        --self {{ hostvars[inventory_hostname].private_ip }}

      {% else %}

      sudo rpk config bootstrap \
        --id {{ hostvars[inventory_hostname].id }} \
        --self {{ hostvars[inventory_hostname].private_ip }} \
        --ips {{ groups["redpanda"] | map('extract', hostvars) | map(attribute='private_ip') | join(',') }}
      {% endif %}

If any shell command fails other than the last one, this step will succeed when it should fail, since these steps are effectively run as a shell script whose return value is that of the last command to run.

The easiest fix is probably just a set -e at the top of the script.
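
A minimal sketch of that fix: add set -e as the first line of the script so the task fails on the first failing command (only the top of the task is shown; the remaining commands stay as they are):

  - name: configure redpanda
    shell: |
      set -e   # abort on the first failing command instead of only reporting the last one
      rpk config set cluster_id 'redpanda'
      rpk config set organization 'redpanda-test'
      # ... rest of the script unchanged ...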

Improve EC2 instance naming

  1. add instance_name_prefix variable and include at beginning of name tag
  2. include count.index at end of name tag

Establish restart fails in some conditions because package_result isn't populated right

In start-redpanda.yml, the task that establishes whether a restart is necessary seems to fail on occasion when the package_result registered variable is checked for the results key. Not sure why it's missing in some cases, but this seems to at least be the fix for now. Documenting it so we can get a branch made and tested.

index 1ff9680..42a9827 100644
--- a/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml
+++ b/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml
@@ -210,7 +210,8 @@
 #
 - name: Establish whether restart required
   ansible.builtin.set_fact:
-    restart_required: '{{ ("true" in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or "Removed" in package_result.results or "1 upgraded" in package_result.results))) and (restart_node | default("true") | bool) }}'
+    restart_required: '{{ ("true" in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or package_result.results is defined and ("Removed" in package_result.results or "1 upgraded" in package_result.results)))) and (restart_node | default("true") | bool) }}'


 # serial: 1 would be the proper solution here, but that can only be set at the play level

[Slack Message](https://redpandadata.slack.com/archives/C049AE7V4U8/p1678491547657399?thread_ts=1678485045.593989&cid=C049AE7V4U8)

Update deprecated rpk commands

We use several deprecated commands:

  • rpk tune
  • rpk config
  • rpk mode

We should update to the new commands to avoid printing deprecation warnings and confusing users.

High message counts cause critical failures in benchmark processes

Recently, while attempting to generate 1.5 GB/sec of traffic to an appropriately sized cluster using 100-byte messages (the customer is moving from SQS), we encountered difficulties with the OMB producers. Throughput falls apart somewhere around 1 to 1.2 million messages per second. The same cluster can handle 1.5 GB/sec with 1024-byte messages with excellent performance; 1.5 GB/sec with 100-byte messages (same batch size, etc.) results in producers erroring out and aborting.

@travisdowns has additional details around the nature of these failures he can add to this issue.

Additional experiments using other client technologies have been in progress by @larsenpanda .

Ultimately we need to update our client so that these types of workloads succeed out of the box, and define limits for the client in documentation so customers know what the upper boundaries of the producers and consumers are and do not infer poor performance by Redpanda.

jmespath install is needed prior to running json_query_filter

When running the provision node ansible playbook, the redpanda nodes succeed but the prometheus node fails with the following message:

TASK [cloudalchemy.prometheus : Get all file_sd files from scrape_configs] *************************************************************************************************************************************************************
fatal: [35.84.208.32]: FAILED! => {"msg": "You need to install \"jmespath\" prior to running json_query filter"}
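
The jmespath Python library must be present on the machine running Ansible, since the json_query filter executes locally. A minimal sketch of installing it from a playbook (equivalently, just run pip install jmespath on the controller):

- name: Ensure jmespath is available for the json_query filter
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Install jmespath on the Ansible controller
      ansible.builtin.pip:
        name: jmespath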

Allow for use of hostnames in `advertised_kafka_api` and other locations

Presently, the Ansible scripts work off a set of IP addresses to configure broker nodes. These IP addresses are used to identify listener bindings, but also to come up with the list of seed brokers and advertised addresses. We should allow hostnames to be passed for advertised addresses and other settings.
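
For illustration, what the broker configuration might look like with a hostname instead of a raw IP (hostname and port below are placeholders):

# Illustrative redpanda.yaml fragment using a hostname for the advertised listener.
redpanda:
  advertised_kafka_api:
    - name: default
      address: broker-0.example.internal
      port: 9092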
