redpanda-data / deployment-automation

Cluster configuration best practices

Home Page: https://redpanda.com

License: Apache License 2.0

HCL 74.75% Smarty 1.01% Jinja 0.10% Shell 24.14%

ansible redpanda

deployment-automation's Issues

Terraform should tag resources

All three terraform modules should tag resources with an 'owner' tag based either on their username/iam name, or based on something that can be configured.

We do this for AWS today: https://github.com/redpanda-data/deployment-automation/blob/main/aws/cluster.tf#L14 but not for Azure or GCP.

ansible: avoid end-user breakage for any roles changes

When we redo roles, let's detail how our changes may break end user workflows

What breakage might an end user see, and why
Document ways around it / path forward / how the end-user should upgrade

Latest version of rp has issues with redpanda-tuner starting from ansible

First, I don't know why redpanda-tuner runs as a systemd service. 🤷‍♂️
Second and most importantly, redpanda-tuner now fails trying to run as a systemd service, at least when deployed with ansible.
Rogger Vasquez saw it in action and is looking into it!

Add molecule tests for redpanda role

https://molecule.readthedocs.io/en/latest/

Ideally we should add Github actions to help with regression testing as well.

Irregular failure in ansible script for grafana repository

There is an irregular failure in adding the grafana repository that happens in about 5% of builds which causes a fatal failure.

TASK [cloudalchemy.grafana : Add Grafana repository [Debian/Ubuntu]] ****************************************************************************
FAILED - RETRYING: [35.90.68.22]: Add Grafana repository [Debian/Ubuntu] (5 retries left).
FAILED - RETRYING: [35.90.68.22]: Add Grafana repository [Debian/Ubuntu] (4 retries left).

Standardize client deployment across cloud providers

Right now the client instance may or may not be deployed, depending on the cloud provider chosen. The AWS default is false, the Azure default is true, etc. The same default should be used across all cloud providers.

(Inconsistent defaults across cloud providers likely affect more than just client deployment; more issues can be created to track the others.)

Restructure ansible to be a Galaxy Collection

We need to restructure the ansible directory to be formatted as an Ansible Galaxy Collection before submitting to https://galaxy.ansible.com/.

For details of Ansible Galaxy Collections see https://docs.ansible.com/ansible/latest/collections_guide/index.html

Always reconcile redpanda.yaml as it has node configuration

The following condition:

deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml

Line 63 in 496d91f

when: not is_initialized

should be dropped, as node configuration cannot be updated using the Admin API.
It can be updated only by replacing redpanda.yaml and restarting the Redpanda process.

Support for Graviton instance and ami

Graviton instances cannot currently be selected.

Terraform AWS deploy always regenerates all resources

Every time you run a terraform command, almost every resource (all the hosts, in particular) is torn down and replaced, so all the IP addresses change, any existing state is lost, etc.

The plan shows that the cause is the use of timestamp() as part of the deployment_id. Unlike the random UUID which will change only when the deployment is initially created, the timestamp is different on every call, meaning that the deployment ID changes every time and we can't apply changes to an existing deployment: it effectively creates a new deployment every time.

As a minimal example, if you run two consecutive terraform apply with the AWS modules, you would expect the second does nothing: after all, nothing has changed and the first apply put everything into the desired state. However, currently all hosts will be torn down and new ones created. The only thing that changed was the ssh key name, which is based on the deployment ID.
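The effect described above can be illustrated outside Terraform. This is a minimal Python sketch, not the module's actual code: the ID format (`<uuid>-<timestamp>`), the function names, and the example UUID are assumptions made for illustration. It shows that any name derived from a per-run timestamp differs between two consecutive runs, while a name derived only from a stored UUID does not.

```python
from datetime import datetime, timedelta, timezone

def timestamped_id(stored_uuid: str, now: datetime) -> str:
    """Hypothetical ID: a stored UUID plus the time of the current run."""
    return f"{stored_uuid}-{now.strftime('%Y-%m-%dT%H:%M:%SZ')}"

def stable_id(stored_uuid: str) -> str:
    """Hypothetical ID: only the UUID, which is generated once and kept in state."""
    return stored_uuid

stored_uuid = "11111111-2222-3333-4444-555555555555"  # illustrative value
first_run = datetime(2022, 10, 20, 14, 29, 42, tzinfo=timezone.utc)
second_run = first_run + timedelta(minutes=5)

# The timestamped ID changes between runs, so anything named after it
# (such as the ssh key) looks like a different resource on the second apply.
changed = timestamped_id(stored_uuid, first_run) != timestamped_id(stored_uuid, second_run)

# The UUID-only ID is identical on both runs, so nothing needs replacing.
unchanged = stable_id(stored_uuid) == stable_id(stored_uuid)

print(changed, unchanged)
```

Dropping the timestamp from the ID, or computing it once and storing it, would be two ways to get the second behaviour; which one fits the modules is left open here.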

start-redpanda.yml tasks are somewhat opaque and need better comments

There's some somewhat complex interactions in this playbook, like the fact that there are multiple stages for generating node configs. Need to run through and add some additional explanation based on discussions with @tmgstevens so future readers can understand what's going on.

tls/install-certs.yml playbook deletes rpc_server

tls/install-certs.yml: this part of the code sets the whole redpanda field again, so it deletes whatever was already inside the redpanda field in redpanda.yaml. We might be deleting important user configuration such as rpc_server or additional node properties.

We should instead set kafka_api_tls and rpc_server_tls separately via rpk:

instead of:

  - name: Configure via RPK
    shell:
      cmd: |
        rpk redpanda config set redpanda "
          kafka_api:
          - name: default
            address: {{ hostvars[inventory_hostname]['private_ip'] }}
            port: 9092
          advertised_kafka_api:
          - name: default
            address: {{ inventory_hostname }}
            port: 9092
          kafka_api_tls:
          - name: default
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
          rpc_server_tls:
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
        " --format yaml

do:

  - name: Configure via RPK
    shell:
      cmd: |
        set -e

        rpk redpanda config set redpanda.kafka_api_tls "
          - name: default
            enabled: true
            require_client_auth: false
            cert_file: /etc/redpanda/certs/node.crt
            key_file: /etc/redpanda/certs/node.key
            truststore_file: /etc/redpanda/certs/truststore.pem
        " --format yaml

        rpk redpanda config set redpanda.rpc_server_tls "
          enabled: true
          require_client_auth: false
          cert_file: /etc/redpanda/certs/node.crt
          key_file: /etc/redpanda/certs/node.key
          truststore_file: /etc/redpanda/certs/truststore.pem
        " --format yaml

Ansible code using shell command built-in broken on ansible releases 2.14+

With Ansible 2.14 or later, any use of the shell command that has the warn parameter defined appears to fail.

See background on the deprecation/removal: ansible/ansible#79379

There are two places where this occurs in our automation (one in our code, one in a dependency).

roles/redpanda_broker/tasks/install-redpanda.yml, in the task that adds the redpanda repo
the cloudalchemy/node-exporter dependency, in its preflight checks

We can fix the first. The second is going to be harder because the node-exporter role appears to no longer be maintained and is possibly moving to a different repo, per this open issue in its GitHub repo:

cloudalchemy/ansible-node-exporter#279

Looks like we need to use the prometheus-community version of node-exporter which patches this warn issue in the preflight.

prometheus-community/ansible@ec6c857

Allow for use of hostnames in `advertised_kafka_api` and other locations

Presently, the ansible scripts work off a set of IP addresses to configure broker nodes. These IP addresses are used to identify listener bindings but also to come up with the list of seed brokers and advertised addresses. We should allow for the passing of hostnames to be used for advertised addresses and other things.

Enable rack awareness

The user should be able to opt-in for rack awareness with an ansible variable: -e enable-rack-awareness=true

See: #65 (comment)

Revisit client configuration

Terraform is configured to deploy a client EC2 instance (if var.clients is non-zero), but there is no ansible configuration to install the Redpanda CLI
var.clients is an unbounded integer value, but maybe it should just be a boolean
Wouldn't users still need to connect directly to nodes at some point (viewing logs, etc.)? Having a client seems like an added step, and we would need to explain to the user when to connect to the client and when to just connect to the node. Possibly a better route would be to remove the client.

Switch to /public_metrics endpoint

The grafana dashboard is generated based off the internal /metrics endpoint. This should be switched to the public-facing /public_metrics endpoint.

deployment-automation/ansible/playbooks/deploy-prometheus-grafana.yml

Line 5 in 01d1c3f

  rpk generate grafana-dashboard --datasource prometheus --metrics-endpoint 'http://{{hostvars[inventory_hostname].private_ip}}:9644/metrics' > '/tmp/redpanda-grafana.json' 

Add a playbook to install redpanda console

This needs a change in the terraform scripts as well.

An example of what the playbook would look like: https://redpandacommunity.slack.com/archives/C039U14NY04/p1662538501878789?thread_ts=1662124131.079259&cid=C039U14NY04

Set the Redpanda license

Redpanda will eventually enforce a license for enterprise features like tiered storage. It would be good to set it up.
Example:
https://github.com/redpanda-data/helm-charts/blob/1c1048b68950f0dbe454d41448db420b7b115423/charts/redpanda/templates/post-install-upgrade-job.yaml#L107
https://github.com/redpanda-data/redpanda/blob/dev/src/go/rpk/pkg/cli/cmd/cluster/license/set.go

Ansible-lint: provision-node.yml cleanup

Work through the remaining issues noted for provision-node.yml and its related roles and tasks.

warn_list:  # or 'skip_list' to silence them completely
  - command-instead-of-module  # Using command rather than module.
  - command-instead-of-shell  # Use shell only when shell functionality is required.
  - experimental  # all rules tagged as experimental
  - fqcn[action-core]  # Use FQCN for builtin actions.
  - name[missing]  # Rule for checking task and play names.
  - no-changed-when  # Commands should not change things if nothing needs doing.
  - risky-shell-pipe  # Shells that use pipes should set the pipefail option.
  - yaml[line-length]  # Violations reported by yamllint.

                               Rule Violation Summary
 count tag                       profile    rule associated tags
     1 command-instead-of-module basic      command-shell, idiom
     1 command-instead-of-shell  basic      command-shell, idiom
     2 key-order[task]           basic      formatting, experimental (warning)
    11 jinja[spacing]            basic      formatting (warning)
     8 name[missing]             basic      idiom
    11 name[play]                basic      idiom (warning)
     8 yaml[line-length]         basic      formatting, yaml
    22 name[casing]              moderate   idiom (warning)
     3 risky-file-permissions    safety     unpredictability, experimental (warning)
     1 risky-shell-pipe          safety     command-shell
     2 no-changed-when           shared     command-shell, idempotency
     1 fqcn[action-core]         production formatting
     2 fqcn[action]              production formatting (warning)
     2 args[module]                         syntax, experimental (warning)

Failed after min profile: 22 failure(s), 53 warning(s) on 10 files.
A new release of ansible-lint is available: 6.11.0 → 6.12.0 Upgrade by running: pip install --upgrade ansible-lint

by default only install minimal dependencies

Break out the dependencies into required, and non-required.

Redpanda only needs redpanda: no JDK, Java, iotop, etc. All of these tools are just there as supplementary aids when debugging a cluster.

By default, only the minimal set should be installed, e.g. with minimal_profile=true (I made up that variable name).

Add TLS support

Add TLS support when deploying redpanda.

jmespath install is needed prior to running json_query_filter

When running the provision node ansible playbook, the redpanda nodes succeed but the prometheus node fails with the following message:

TASK [cloudalchemy.prometheus : Get all file_sd files from scrape_configs] *************************************************************************************************************************************************************
fatal: [35.84.208.32]: FAILED! => {"msg": "You need to install \"jmespath\" prior to running json_query filter"}

Enable SI on AWS

Make it possible to enable shadow indexing (SI) with a flag, so that a new environment deployed with the terraform and ansible scripts has SI working out of the box.

Debug task referencing package_result.results appears to break ansible runs permanently.

Discovered while working on PR-119. Would be good to run this down, or at least, figure out why removing the debug task doesn't allow the ansible run to resume correctly running again. There's probably some other level of caching that I'm not catching and once package_result.results gets wedged, it's not clearing itself out and repopulating.

Somehow I'm managing to break the ansible runs with a debug statement that references a variable that (sometimes?) may not contain anything, and once it gets that way, future ansible runs break, even if the debug task is removed. I've tried running with --flush-cache on the ansible run and that's not fixing it. What's also concerning is that once it breaks, even with removing the debug, it impacts the Establish whether restart required task.

TASK [redpanda_broker : Establish whether restart required] *********************************************************************************************************************************************************
task path: /Users/tcampbell/Documents/github/prs/pr-119-redpanda-data-deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml:223
fatal: [XX.XX.XX.XX]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'results'. 'dict object' has no attribute 'results'\n\nThe error appears to be in '/Users/tcampbell/Documents/github/prs/pr-119-redpanda-data-deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml': line 223, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n#\n- name: Establish whether restart required\n  ^ here\n"}

It looks like the problem might be something within the package_result variable.

Here's the example debug I was working with

- name: debug
  ansible.builtin.debug:
    msg:
    - "is_initialized is {{ is_initialized }}"
    - "restart_required_rc is {{ restart_required_rc.stdout }}"
    - "package_result is {{ package_result }}"
    - "nodeconfig_result is {{ nodeconfig_result }}"
    - "restart_node is {{ restart_node | default('true') | bool }}"
    - "is restart_required_rc.stdout true: {{ restart_required_rc.stdout }}"
    - "is nodeconfig_result changed {{ nodeconfig_result.changed }}"
    - "is package_result.results removed or upgrade {{ package_result.results }}"
#    - "what is_initialized and result of (nodeconfig_result.changed or package_result) {{ is_initialized and (nodeconfig_result.changed or 'Removed' in package_result.results or '1 upgraded' in package_result.results) }}"
#    - "restart_required: {{ ('true' in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or 'Removed' in package_result.results or '1 upgraded' in package_result.results))) and (restart_node | default('true') | bool) }}"

Add an integration testing GH Action

We need some verification that both the terraform and ansible parts work when we merge new changes.
There is a minimal terratest module for the tf code but we also need to test that the ansible modules work as expected at least for a minimal use case.

setting 'advertise_public_ips=false' doesn't work

Setting -e 'advertise_public_ips=true' causes public IPs to be advertised, so -e 'advertise_public_ips=false' should do the opposite, right?

Not so fast! Setting it to false is the same as true as a result of the coercion from a string to boolean that happens in {% if ... %}.

We should use the sort-of idiomatic var | d() | bool instead, which handles all cases. Everything except the "yaml truthy" values is set to false.
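The coercion problem can be reproduced in plain Python, since Jinja evaluates a string in {% if ... %} with Python truthiness and values passed with -e key=value arrive as strings. The helper below is only an approximation of what the bool filter does, written for illustration; `ansible_like_bool` is a made-up name, not an Ansible API.

```python
def ansible_like_bool(value):
    """Approximate the behaviour of Ansible's `bool` filter (illustrative only)."""
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, str):
        value = value.lower()
    # Only the usual "truthy" spellings count as true; everything else is false.
    return value in ("yes", "on", "1", "true", 1)

# What `{% if advertise_public_ips %}` effectively does with a string value:
naive_false = bool("false")               # non-empty string, so it is truthy
naive_true = bool("true")

# What `advertise_public_ips | d() | bool` does instead:
filtered_false = ansible_like_bool("false")
filtered_true = ansible_like_bool("true")
filtered_unset = ansible_like_bool("")    # d() with no argument yields an empty string

print(naive_false, naive_true)
print(filtered_false, filtered_true, filtered_unset)
```

The first print shows why 'false' and 'true' behave the same in the current template; the second shows the filter separating them and treating an unset variable as false.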

AWS Cloudformation demo

I was asked to move redpanda-data/redpanda#2844 here.

For engineers on AWS, it would be amazing to have a vectorized.io zipfile you could place on S3 with enterprise licence features stripped out built for Ubuntu 20.04 - and a cloud formation script similar to the one for AWS MSK so it is almost a drop-in replacement for testing/benchmarking.

For testing it would also be amazing if you released a free https://github.com/localstack/localstack plugin to provide Red Panda mock AWS MSK on localhost. That is the easiest way for me to get Red Panda code into the Enterprise git repository.

Stranded Compute in CI builds

Currently there is a risk that resources can be stranded when a build is canceled. This will need to be resolved for long term use of this CI to be safe and cost effective. Mitigation techniques in place include:

skipping intermediate builds
wrap tests in a trap script to ensure any interrupts are concluded with a destroy
use non-randomized naming to ensure stranded assets cause conflicts and are detected immediately

"as for cleaning up stranded compute which could be a risk upon every push to a PR, i can see 2 paths:

test terraform apply only on manual request like via github label trigger. it would help if the aws resources could have a tag that could point back to the PR and git commit so resource cleanup can be easier.
hook up this repo to terraform cloud to deploy only on merge to main and have the PR check lint and terraform plan. then the cloud will keep track of the state file and manage cleanup."

Originally posted by @andrewhsu in #137 (review)

Update deprecated rpk commands

We use several deprecated commands:

rpk tune
rpk config
rpk mode

We should switch to the new commands so that deprecation messages are not printed and do not confuse users.

Pass ansible variables in via config file rather than flags

Currently ansible commands take flags, and this can make the commands long and difficult to read. Create a config file of variables that is initially populated with working default values, and update the docs to mention it.

Validate/test if cluster configuration property of type string can be un-set

The rpk cluster config set at the moment of writing has the following logic:
https://github.com/redpanda-data/redpanda/blob/c3cca097b01c250715fd4012bac175f09c26778e/src/go/rpk/pkg/cli/cmd/cluster/config/set.go#L85-L99

			if meta.Nullable && value == "null" {
				// Nullable types may be explicitly set to null
				upsert[key] = nil
			} else if meta.Type != "string" && (value == "") {
				// Non-string types that receive an empty string
				// are reset to default
				remove = append(remove, key)
			} else if meta.Type == "array" {
				var a anySlice
				err = yaml.Unmarshal([]byte(value), &a)
				out.MaybeDie(err, "invalid list syntax")
				upsert[key] = a
			} else {
				upsert[key] = value
			}

This means that any property of type string whose default value is something other than "" (empty string) cannot be unset.
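To make the branch order easier to follow, here is a small Python model of the Go snippet above. It is a simplified sketch rather than rpk's code: `plan_config_change` is a hypothetical name, and `json.loads` stands in for the YAML parsing of array values.

```python
import json

def plan_config_change(prop_type: str, nullable: bool, value: str):
    """Return ("upsert", new_value) or ("remove", None), mirroring the branches above."""
    if nullable and value == "null":
        # Nullable types may be explicitly set to null.
        return ("upsert", None)
    if prop_type != "string" and value == "":
        # Non-string types that receive an empty string are reset to default.
        return ("remove", None)
    if prop_type == "array":
        # Stand-in for yaml.Unmarshal in the original.
        return ("upsert", json.loads(value))
    return ("upsert", value)

# An integer property given "" is removed, i.e. reset to its default.
integer_reset = plan_config_change("integer", False, "")

# A string property given "" is stored as the empty string, not reset,
# so a non-empty default can never be restored this way.
string_reset = plan_config_change("string", False, "")

print(integer_reset, string_reset)
```

In this model the only way a string property reaches the "remove" branch is if it never does, which is the gap the issue describes.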

I didn't test it, but it might be worth checking the rpk cluster config import
https://github.com/redpanda-data/redpanda/blob/c3cca097b01c250715fd4012bac175f09c26778e/src/go/rpk/pkg/cli/cmd/cluster/config/import.go#L98-L223

Related issue redpanda-data/helm-charts#395

IBM Cloud Support

Add Terraform scripts for Redpanda deployment to IBM Cloud.

Setup VPC and stop relying on IAM user's default VPC

Right now the script relies on the IAM user associated with the access/secret keys to have a default VPC configured. Without a default VPC, the following error appears when running terraform apply:

Error: error creating Security Group (redpanda-9b2beded-7423-d346-b28e-896c27dd58de-2022-10-20T14:29:42Z-node-sec-group): VPCIdNotSpecified: No default VPC for this user
│       status code: 400, request id: 9ec37fe0-bd45-4c1f-81b4-9070e9faeadc
│ 
│   with aws_security_group.node_sec_group,
│   on cluster.tf line 65, in resource "aws_security_group" "node_sec_group":
│   65: resource "aws_security_group" "node_sec_group" {
│ 
╵
exit 1

The script could instead create a VPC and not rely on the default value.

excessive polling for geerlingguy.node_exporter can trigger GitHub rate limits

It looks like geerlingguy.node_exporter can trigger rate limiting on GitHub's side because the role checks for the latest version each time ansible runs. We need to figure out if there's a way to disable the check or reduce its frequency.

TASK [geerlingguy.node_exporter : Check current node_exporter version.] *********************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/main.yml:2
ok: [XX.YY.138.212] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007202", "end": "2023-01-31 01:48:20.308344", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.301142", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}
ok: [XX.YY.97.33] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007200", "end": "2023-01-31 01:48:20.685369", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.678169", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}
ok: [XX.YY.156.16] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007467", "end": "2023-01-31 01:48:20.889350", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.881883", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n  build user:       root@6e7732a7b81b\n  build date:       20221129-18:59:09\n  go version:       go1.19.3\n  platform:         linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", "  build user:       root@6e7732a7b81b", "  build date:       20221129-18:59:09", "  go version:       go1.19.3", "  platform:         linux/amd64"]}

TASK [geerlingguy.node_exporter : Configure latest version] *********************************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/main.yml:8
included: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/config-version.yaml for XX.YY.156.16, XX.YY.97.33, XX.YY.138.212
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (1 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (1 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (1 retries left).

TASK [geerlingguy.node_exporter : Determine latest GitHub release (local)] ******************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/config-version.yaml:2
fatal: [XX.YY.97.33 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
fatal: [XX.YY.138.212 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
fatal: [XX.YY.156.16 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
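The failure output above already says when the limit lifts: x_ratelimit_reset is a Unix timestamp and date is the time of the response. A minimal Python sketch for reading those two fields follows; the function name is made up for illustration and this is not part of the role.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def seconds_until_reset(date_header: str, reset_epoch: str) -> int:
    """Seconds between the response time and the rate-limit reset time."""
    now = parsedate_to_datetime(date_header)
    if now.tzinfo is None:
        # Treat a naive result as UTC, which is what HTTP dates use.
        now = now.replace(tzinfo=timezone.utc)
    return max(0, int(reset_epoch) - int(now.timestamp()))

# Values copied from the failed task output above.
wait = seconds_until_reset("Tue, 31 Jan 2023 01:48:49 GMT", "1675129929")
print(wait)  # 200
```

With a limit of 60 unauthenticated requests per window, as x_ratelimit_limit shows, several hosts each retrying five times from the same control machine can use up the budget quickly.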

Steps with multiple-line shell scripts only catch failures on the last command

Consider the configure redpanda step:

  - name: configure redpanda
    notify:
      - restart redpanda-tuner
      - restart redpanda
    vars:
      use_public_ips: "{{ advertise_public_ips | default(false, true) | bool }}"
    shell: |
      rpk config set cluster_id 'redpanda'
      rpk config set organization 'redpanda-test'
      rpk config set redpanda.advertised_kafka_api '{
      {% if use_public_ips %}
        address: {{ inventory_hostname }},
      {% else %}
        address: {{ hostvars[inventory_hostname].private_ip }},
      {% endif %}
        port: 9092
      }' --format yaml
      rpk config set redpanda.advertised_rpc_api '{
        address: {{ hostvars[inventory_hostname].private_ip }},
        port: 33145
      }' --format yaml
      sudo rpk mode production

      {% if hostvars[groups['redpanda'][0]].id == hostvars[inventory_hostname].id %}
      sudo rpk config bootstrap \
        --id {{ hostvars[inventory_hostname].id }} \
        --self {{ hostvars[inventory_hostname].private_ip }}

      {% else %}

      sudo rpk config bootstrap \
        --id {{ hostvars[inventory_hostname].id }} \
        --self {{ hostvars[inventory_hostname].private_ip }} \
        --ips {{ groups["redpanda"] | map('extract', hostvars) | map(attribute='private_ip') | join(',') }}
      {% endif %}

If any shell command other than the last one fails, this step still succeeds when it should fail, because the commands are effectively run as a single shell script whose exit status is that of the last command to run.

The easiest fix is probably just a set -e at the top of the script.
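A minimal sketch of the behavior described above, assuming a POSIX sh (Ansible's shell module runs commands through /bin/sh by default). Here `false` stands in for a failing rpk command, and the script names and the `run_status` helper are hypothetical:

```shell
#!/bin/sh
# Hypothetical throwaway scripts; `false` stands in for a failing rpk command.
tmpdir=$(mktemp -d)

# Without set -e: the failure is ignored and the final echo decides the status.
cat > "$tmpdir/without.sh" <<'EOF'
false
echo "still running"
EOF

# With set -e: the script stops at the first failing command.
cat > "$tmpdir/with.sh" <<'EOF'
set -e
false
echo "never reached"
EOF

# run_status SCRIPT: print the exit status of running SCRIPT with sh.
run_status() {
  if sh "$1" >/dev/null 2>&1; then echo 0; else echo "$?"; fi
}

status_without=$(run_status "$tmpdir/without.sh")
status_with=$(run_status "$tmpdir/with.sh")
rm -rf "$tmpdir"

echo "without set -e: exit status $status_without"
echo "with set -e: exit status $status_with"
```

The first script reports status 0 despite the failed command, while the second reports 1. Note that `set -e` does not trigger for commands used as `if` conditions or inside `&&`/`||` chains, so it is a safety net rather than a complete check.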

update docs in azure

README states that:

vm_data_disk_gb: Size of the Premium_LRS data disk in GiB (default 512 P20)

However, the actual default is 2048 (P40): https://github.com/redpanda-data/deployment-automation/blob/main/azure/vars.tf#L31

Hard to tell who created instances

It's hard to tell who created a given node, e.g., when looking for abandoned machines or when users want to check what they have running.

I propose we add a tag with the IAM username to make it clearer.

Use AWS region from AWS config by default

Read from ~/.aws/config to get the user's default region rather than hard-coding aws_region to us-west-2.
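One way to sketch the lookup, assuming the region is stored under the `[default]` profile of an INI-style config file. The helper name `default_aws_region` and the sample file are hypothetical, and the real change would more likely live in the Terraform configuration itself:

```shell
#!/bin/sh
# default_aws_region FILE: print the region set under the [default] profile.
# Hypothetical helper; only handles simple "key = value" lines.
default_aws_region() {
  awk -F' *= *' '
    /^\[/ { in_default = ($0 == "[default]") }
    in_default && $1 == "region" { print $2; exit }
  ' "$1"
}

# Sample config standing in for ~/.aws/config.
sample_config=$(mktemp)
cat > "$sample_config" <<'EOF'
[profile other]
region = eu-west-1

[default]
region = us-east-2
output = json
EOF

region=$(default_aws_region "$sample_config")
rm -f "$sample_config"
echo "default region: $region"
```

In practice the value could be handed to Terraform through the environment, for example `TF_VAR_aws_region=$(default_aws_region ~/.aws/config)`, assuming the variable keeps the name aws_region.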

s3 setup for archival storage

When deploying a cluster, it would be great to have an option that sets up infrastructure support for the archival storage feature in Redpanda. Briefly, this would be:

1. s3 bucket
2. query credentials, urls, etc.
3. deploy redpanda.yml with archival settings from (1) and (2)

@Lazin do you have the set of configuration elements that we'll need? I thought they were in configuration.h but I don't see them in there. Maybe they haven't merged yet?

Add Upgrade Playbook

We should have a playbook on performing rolling upgrades.

Establish restart fails in some conditions because package_result isn't populated right

In start-redpanda.yml, the task that establishes whether a restart is necessary occasionally fails when the registered package_result variable is checked for the results key. It's not clear why the key is missing in some cases, but the change below seems to fix it for now. Documenting it here so we can get a branch made and tested.

index 1ff9680..42a9827 100644
--- a/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml
+++ b/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml
@@ -210,7 +210,8 @@
 #
 - name: Establish whether restart required
   ansible.builtin.set_fact:
-    restart_required: '{{ ("true" in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or "Removed" in package_result.results or "1 upgraded" in package_result.results))) and (restart_node | default("true") | bool) }}'
+    restart_required: '{{ ("true" in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or package_result.results is defined and ("Removed" in package_result.results or "1 upgraded" in package_result.results)))) and (restart_node | default("true") | bool) }}'


 # serial: 1 would be the proper solution here, but that can only be set on play level

[Slack Message](https://redpandadata.slack.com/archives/C049AE7V4U8/p1678491547657399?thread_ts=1678485045.593989&cid=C049AE7V4U8)

rpk config set fails because of list vs single issue

As described in redpanda-data/redpanda#2958, things can go wrong when the list syntax for some properties is mixed with the single map syntax.

This happens in the Ansible deployment: we use the single syntax, but the value gets transformed into a list, and subsequent configuration runs then fail to set that property because a list can't be overwritten with a single value.

The list format is preferred and supported by all tools, while the single value is effectively deprecated and not supported by some tools. We should use list syntax.

High qty message count causes critical failures in benchmark processes

While recently attempting to generate 1.5 GB/sec of traffic against an appropriately sized cluster using 100-byte messages (the customer is moving from SQS), we encountered difficulties with the OMB producers. Throughput seems to fall apart somewhere around 1 to 1.2 million messages per second. The same cluster can handle 1.5 GB/sec with 1024-byte messages with excellent performance, but 1.5 GB/sec with 100-byte messages (same batch size, etc.) results in producers erroring out and aborting.

@travisdowns has additional details around the nature of these failures he can add to this issue.

Additional experiments using other client technologies have been in progress by @larsenpanda .

Ultimately, we need to update our client so that these types of workloads succeed out of the box. We should also define the client's limits in the documentation, so customers know the upper bounds of the producers and consumers and do not attribute poor performance to Redpanda.

Add tags to generated key pair

This will allow easy filtering of key pairs in the console based on which resources were created by this script.

Improve EC2 instance naming

add instance_name_prefix variable and include at beginning of name tag
include count.index at end of name tag

Add playbook for adding/removing a node

Error connecting to kafkaJS with TLS.

Below are the detailed steps I took and the errors I got.

First of all, I have a basic TypeScript project with kafkaJS installed in it ( https://kafka.js.org/docs/getting-started ).

My steps:

First, creating 1 broker and 1 monitoring machine on AWS:

cd aws
terraform apply ( Necessary parameters and public_keys have been created for the machine. )

The machines I wanted were successfully created on AWS.

Next, installing Redpanda and Grafana on the broker with Ansible:

ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -e advertise_public_ips=true -e grafana_admin_pass=******* -i hosts.ini -v ansible/playbooks/provision-node.yml

Redpanda was installed successfully on the broker, and I can also connect to Grafana with the password I specified.

private createKafkaConsumer(clientId: string, groupId: string): Consumer {

        const kafka = new Kafka({
            clientId: clientId,
            brokers: ['*.*.*.*:9092']
        })
        const consumer = kafka.consumer({ groupId })
        return consumer
    }

I can successfully connect to the broker with the code snippet above.

Up to this stage I can use Redpanda successfully, but the connection is not secure, so I have to enable TLS.

Here are the steps I followed according to the documentation for TLS.

ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -v ansible/playbooks/create-ca.yml
ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/generate-csrs.yml
ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/issue-certs.yml
ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/install-certs.yml

After these steps, the redpanda.yml file on the broker is updated to match the redpanda.yml template specified in the Ansible playbook section.

Now I'm trying to connect to the broker again.

private createKafkaConsumer(clientId: string, groupId: string): Consumer {

        const kafka = new Kafka({
            clientId: clientId,
            brokers: ['*.*.*.*:9092'],
            ssl: {
                rejectUnauthorized: false,
                ca: [fs.readFileSync('/tls/ca/ca.crt', 'utf-8')]
            }
        })
        const consumer = kafka.consumer({ groupId })
        return consumer
    }

I can no longer connect to the broker I was able to connect to without TLS. I'm doing something wrong but I can't find it.

This is the error I got.

{"level":"INFO","timestamp":"2023-01-05T20:48:59.897Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:48:59.897Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"WebSiteProducer_Group"}
{"level":"INFO","timestamp":"2023-01-05T20:49:00.198Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"172.31.17.111:9092","clientId":"71b80823-8058-4279-9814-4561eb3840a4"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSConnectionError: Connection timeout","groupId":"WebSiteProducer_Group","stack":"KafkaJSConnectionError: Connection timeout\n    at Timeout.onTimeout [as _onTimeout] (/home/emredarak/repo/consumer/node_modules/kafkajs/src/network/connection.js:223:23)\n    at listOnTimeout (node:internal/timers:559:17)\n    at processTimers (node:internal/timers:502:7)"}
{"level":"INFO","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.346Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"WebSiteProducer_Group"}
{"level":"INFO","timestamp":"2023-01-05T20:49:01.647Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"WebSiteProducer_Group"}

Ansible fails when enabling TLS due to safe restart

The code for safe restart uses rpk cluster maintenance enable {{ node_id }} which unfortunately uses the new rpk settings before TLS is enabled.

We either need a way to single-thread the whole of the last few plays (to ensure that maintenance mode is enabled, the node is restarted and then MM disabled all in a single action), or we need to nestle the updating of the redpanda.yaml file into that single shell command.

When I wrote this I couldn't find a way of having a block with throttle set within a role.

Using curl bash to install local repo is probably not a great idea

Currently we download and set up a local repo on the servers using a curl script, and then install rpk (via yum, etc.) from that local repo.

This should likely be changed to use our public repos with the appropriate package tool, an explicit version flag, and a related variable.

Current code:

- name: add the redpanda repo
  shell: |
    curl -1sLf {{ redpanda_repo_script }} | sudo -E bash
  args:
    warn: no
