redpanda-data / deployment-automation
Cluster configuration best practices
Home Page: https://redpanda.com
License: Apache License 2.0
All three terraform modules should tag resources with an 'owner' tag based either on their username/iam name, or based on something that can be configured.
We do this for AWS today: https://github.com/redpanda-data/deployment-automation/blob/main/aws/cluster.tf#L14 but not for Azure or GCP.
When we redo roles, let's detail how our changes may break end user workflows
First, I don't know why redpanda-tuner runs as a systemd service.
Second and most importantly, redpanda-tuner now fails trying to run as a systemd service, at least when deployed with ansible.
Rogger Vasquez saw it in action and is looking into it!
https://molecule.readthedocs.io/en/latest/
Ideally we should add Github actions to help with regression testing as well.
There is an irregular failure in adding the grafana repository that happens in about 5% of builds which causes a fatal failure.
TASK [cloudalchemy.grafana : Add Grafana repository [Debian/Ubuntu]] ****************************************************************************
FAILED - RETRYING: [35.90.68.22]: Add Grafana repository [Debian/Ubuntu] (5 retries left).
FAILED - RETRYING: [35.90.68.22]: Add Grafana repository [Debian/Ubuntu] (4 retries left).
Right now the client instance may or may not be deployed depending on the cloud provider chosen: the AWS default is false, the Azure default is true, etc. The same default should be used across all cloud providers.
(Inconsistent defaults across cloud providers likely impact more than just client deployment; more issues can be created to track the others.)
We need to restructure the ansible directory to be formatted as an Ansible Galaxy Collection before submitting to https://galaxy.ansible.com/.
For details of Ansible Galaxy Collections see https://docs.ansible.com/ansible/latest/collections_guide/index.html
The following condition:
Currently can't select graviton instances
Every time you run a terraform command, almost every resource (all the hosts, in particular) is torn down and replaced, so all the IP addresses change, any existing state is lost, etc.
The plan shows that the cause is the use of timestamp() as part of the deployment_id. Unlike the random UUID, which changes only when the deployment is initially created, the timestamp is different on every call, meaning that the deployment ID changes every time and we can't apply changes to an existing deployment: it effectively creates a new deployment every time.
As a minimal example, if you run two consecutive terraform apply commands with the AWS modules, you would expect the second to do nothing: after all, nothing has changed and the first apply put everything into the desired state. However, currently all hosts will be torn down and new ones created. The only thing that changed was the ssh key name, which is based on the deployment ID.
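The effect can be illustrated outside Terraform. The following is a minimal Python sketch, not the module's code: `deployment_id` and `needs_replacement` are hypothetical helpers, and the ID format and timestamps are assumptions chosen only for illustration.

```python
import uuid


def deployment_id(stored_uuid, now=None):
    """Build an ID; passing `now` mimics embedding timestamp() in the ID."""
    return stored_uuid if now is None else f"{stored_uuid}-{now}"


def needs_replacement(previous_id, new_id):
    # Resources named after the ID are replaced whenever the ID changes.
    return previous_id != new_id


stored = str(uuid.uuid4())  # generated once, then kept in state

# ID built from the stored UUID only: a second apply sees no change.
stable_first = deployment_id(stored)
stable_second = deployment_id(stored)

# ID that also embeds the current time: every apply sees a new ID.
drifting_first = deployment_id(stored, now="2022-10-20T14:29:42Z")
drifting_second = deployment_id(stored, now="2022-10-20T14:31:05Z")

print(needs_replacement(stable_first, stable_second))
print(needs_replacement(drifting_first, drifting_second))
```

Only the second pair differs between runs, which matches the observation that everything named after the deployment ID is recreated on each apply.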
There are some fairly complex interactions in this playbook, such as the multiple stages for generating node configs. We need to run through it and add some additional explanation based on discussions with @tmgstevens so future readers can understand what's going on.
tls/install-certs.yml: This part of the code sets redpanda again, so it deletes whatever we had inside the redpanda field in redpanda.yaml. We might be deleting important user configuration such as rpc_server or additional node properties.
We should instead set kafka_api_tls and rpc_server_tls separately via rpk.
Instead of:
- name: Configure via RPK
  shell:
    cmd: |
      rpk redpanda config set redpanda "
      kafka_api:
      - name: default
        address: {{ hostvars[inventory_hostname]['private_ip'] }}
        port: 9092
      advertised_kafka_api:
      - name: default
        address: {{ inventory_hostname }}
        port: 9092
      kafka_api_tls:
      - name: default
        enabled: true
        require_client_auth: false
        cert_file: /etc/redpanda/certs/node.crt
        key_file: /etc/redpanda/certs/node.key
        truststore_file: /etc/redpanda/certs/truststore.pem
      rpc_server_tls:
        enabled: true
        require_client_auth: false
        cert_file: /etc/redpanda/certs/node.crt
        key_file: /etc/redpanda/certs/node.key
        truststore_file: /etc/redpanda/certs/truststore.pem
      " --format yaml
do:
- name: Configure via RPK
  shell:
    cmd: |
      set -e
      rpk redpanda config set redpanda.kafka_api_tls "
      - name: default
        enabled: true
        require_client_auth: false
        cert_file: /etc/redpanda/certs/node.crt
        key_file: /etc/redpanda/certs/node.key
        truststore_file: /etc/redpanda/certs/truststore.pem
      " --format yaml
      rpk redpanda config set redpanda.rpc_server_tls "
      enabled: true
      require_client_auth: false
      cert_file: /etc/redpanda/certs/node.crt
      key_file: /etc/redpanda/certs/node.key
      truststore_file: /etc/redpanda/certs/truststore.pem
      " --format yaml
If using Ansible 2.14 or later, it appears that we get failures for any usage of the shell command that has the warn parameter defined.
See background on the deprecation/removal: ansible/ansible#79379
There are two places where this occurs in our automation (one in our code, one in a dependency).
add the redpanda repo
We can fix the first. The second is going to be harder because the node-exporter role appears to no longer be maintained and is possibly moving to a different repo per this open issue in their GitHub repo:
cloudalchemy/ansible-node-exporter#279
Looks like we need to use the prometheus-community version of node-exporter, which patches this warn issue in the preflight.
prometheus-community/ansible@ec6c857
We can fix the first, but
Presently, the ansible scripts work off a set of IP addresses to configure broker nodes. These IP addresses are used to identify listener bindings but also to come up with the list of seed brokers and advertised addresses. We should allow for the passing of hostnames to be used for advertised addresses and other things.
The user should be able to opt-in for rack awareness with an ansible variable: -e enable-rack-awareness=true
See: #65 (comment)
var.clients is non-zero), but there is no ansible configuration to install the Redpanda CLI
var.clients is an unbounded integer value, but maybe it should just be a boolean
The grafana dashboard is generated based off the internal /metrics endpoint. This should be switched to the public-facing /public_metrics endpoint.
This needs a change in the terraform scripts as well.
An example of what the playbook would look like: https://redpandacommunity.slack.com/archives/C039U14NY04/p1662538501878789?thread_ts=1662124131.079259&cid=C039U14NY04
Redpanda will eventually enforce a license for enterprise features like tiered storage. It would be good to set it up.
Example:
https://github.com/redpanda-data/helm-charts/blob/1c1048b68950f0dbe454d41448db420b7b115423/charts/redpanda/templates/post-install-upgrade-job.yaml#L107
https://github.com/redpanda-data/redpanda/blob/dev/src/go/rpk/pkg/cli/cmd/cluster/license/set.go
Work through the remaining issues noted for provision-node.yml and its related roles and tasks.
warn_list: # or 'skip_list' to silence them completely
- command-instead-of-module # Using command rather than module.
- command-instead-of-shell # Use shell only when shell functionality is required.
- experimental # all rules tagged as experimental
- fqcn[action-core] # Use FQCN for builtin actions.
- name[missing] # Rule for checking task and play names.
- no-changed-when # Commands should not change things if nothing needs doing.
- risky-shell-pipe # Shells that use pipes should set the pipefail option.
- yaml[line-length] # Violations reported by yamllint.
Rule Violation Summary
count  tag                        profile     rule associated tags
    1  command-instead-of-module  basic       command-shell, idiom
    1  command-instead-of-shell   basic       command-shell, idiom
    2  key-order[task]            basic       formatting, experimental (warning)
   11  jinja[spacing]             basic       formatting (warning)
    8  name[missing]              basic       idiom
   11  name[play]                 basic       idiom (warning)
    8  yaml[line-length]          basic       formatting, yaml
   22  name[casing]               moderate    idiom (warning)
    3  risky-file-permissions     safety      unpredictability, experimental (warning)
    1  risky-shell-pipe           safety      command-shell
    2  no-changed-when            shared      command-shell, idempotency
    1  fqcn[action-core]          production  formatting
    2  fqcn[action]               production  formatting (warning)
    2  args[module]                           syntax, experimental (warning)
Failed after min profile: 22 failure(s), 53 warning(s) on 10 files.
A new release of ansible-lint is available: 6.11.0 → 6.12.0. Upgrade by running: pip install --upgrade ansible-lint
Break out the dependencies into required and non-required.
Redpanda only needs redpanda: no JDK, Java, iotop, etc. All of these tools are just there as supplementary aids when debugging a cluster.
By default, only the minimal profile (minimal_profile=true) should be installed (I made up that variable name).
Add TLS support when deploying redpanda.
When running the provision node ansible playbook, the redpanda nodes succeed but the prometheus node fails with the following message:
TASK [cloudalchemy.prometheus : Get all file_sd files from scrape_configs] *************************************************************************************************************************************************************
fatal: [35.84.208.32]: FAILED! => {"msg": "You need to install \"jmespath\" prior to running json_query filter"}
Make it possible to enable shadow indexing from a flag, so that deploying a new environment with SI works out of the box (after running the terraform and ansible scripts).
Discovered while working on PR-119. It would be good to run this down, or at least figure out why removing the debug task doesn't allow the ansible run to resume running correctly. There's probably some other level of caching that I'm not catching, and once package_result.results gets wedged, it's not clearing itself out and repopulating.
Somehow I'm managing to break the ansible runs with a debug statement that references a variable that (sometimes?) may not contain anything, and once it gets that way, future ansible runs break, even if the debug task is removed. I've tried running with --flush-cache on the ansible run and that's not fixing it. What's also concerning is that once it breaks, even after removing the debug, it impacts the Establish whether restart required task.
TASK [redpanda_broker : Establish whether restart required] *********************************************************************************************************************************************************
task path: /Users/tcampbell/Documents/github/prs/pr-119-redpanda-data-deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml:223
fatal: [XX.XX.XX.XX]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'results'. 'dict object' has no attribute 'results'\n\nThe error appears to be in '/Users/tcampbell/Documents/github/prs/pr-119-redpanda-data-deployment-automation/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml': line 223, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n#\n- name: Establish whether restart required\n ^ here\n"}
It looks like it might be something within the package_result variable.
Here's the example debug I was working with:
- name: debug
  ansible.builtin.debug:
    msg:
      - "is_initialized is {{ is_initialized }}"
      - "restart_required_rc is {{ restart_required_rc.stdout }}"
      - "package_result is {{ package_result }}"
      - "nodeconfig_result is {{ nodeconfig_result }}"
      - "restart_node is {{ restart_node | default('true') | bool }}"
      - "is restart_required_rc.stdout true: {{ restart_required_rc.stdout }}"
      - "is nodeconfig_result changed {{ nodeconfig_result.changed }}"
      - "is package_result.results removed or upgrade {{ package_result.results }}"
      # - "what is_initialized and result of (nodeconfig_result.changed or package_result) {{ is_initialized and (nodeconfig_result.changed or 'Removed' in package_result.results or '1 upgraded' in package_result.results) }}"
      # - "restart_required: {{ ('true' in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or 'Removed' in package_result.results or '1 upgraded' in package_result.results))) and (restart_node | default('true') | bool) }}"
We need some verification that both the terraform and ansible parts work when we merge new changes.
There is a minimal terratest module for the tf code but we also need to test that the ansible modules work as expected at least for a minimal use case.
Setting -e 'advertise_public_ips=true' causes public IPs to be advertised, so -e 'advertise_public_ips=false' should do the opposite, right?
Not so fast! Setting it to false is the same as setting it to true, as a result of the coercion from a string to a boolean that happens in {% if ... %}.
We should use the sort-of idiomatic var | d() | bool instead, which handles all cases. Everything except the "yaml truthy" values is set to false.
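The coercion problem can be reproduced in plain Python, whose truthiness rules Jinja follows for strings. This is a sketch only: `ansible_like_bool` is a hypothetical stand-in that approximates what Ansible's bool filter does, not the filter itself.

```python
def ansible_like_bool(value):
    """Approximate Ansible's `bool` filter: only a few values count as true."""
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, str):
        value = value.lower()
    return value in ("yes", "on", "1", "true", 1)


# Plain truthiness: any non-empty string is true, including "false".
naive_true = bool("true")
naive_false = bool("false")

# Filter-style conversion: the string is interpreted, not just tested for emptiness.
converted_true = ansible_like_bool("true")
converted_false = ansible_like_bool("false")

print(naive_true, naive_false)
print(converted_true, converted_false)
```

Values passed with -e key=value arrive as strings, which is why the explicit conversion matters there.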
I was asked to move redpanda-data/redpanda#2844 here.
For engineers on AWS, it would be amazing to have a vectorized.io zipfile you could place on S3 with enterprise licence features stripped out built for Ubuntu 20.04 - and a cloud formation script similar to the one for AWS MSK so it is almost a drop-in replacement for testing/benchmarking.
For testing it would also be amazing if you released a free https://github.com/localstack/localstack plugin to provide Red Panda mock AWS MSK on localhost. That is the easiest way for me to get Red Panda code into the Enterprise git repository.
Currently there is a risk that resources can be stranded when a build is canceled. This will need to be resolved for long term use of this CI to be safe and cost effective. Mitigation techniques in place include:
"as for cleaning up stranded compute which could be a risk upon every push to a PR, i can see 2 paths:
Originally posted by @andrewhsu in #137 (review)
We use several deprecated commands:
We should update to the new commands to avoid printing deprecation logs and confusing users.
Currently ansible commands take flags, and this can make the commands long and difficult to read. Create a config file of variables that are initially populated with working default values, and update the docs to mention this.
The rpk cluster config set command, at the time of writing, has the following logic:
https://github.com/redpanda-data/redpanda/blob/c3cca097b01c250715fd4012bac175f09c26778e/src/go/rpk/pkg/cli/cmd/cluster/config/set.go#L85-L99
if meta.Nullable && value == "null" {
	// Nullable types may be explicitly set to null
	upsert[key] = nil
} else if meta.Type != "string" && (value == "") {
	// Non-string types that receive an empty string
	// are reset to default
	remove = append(remove, key)
} else if meta.Type == "array" {
	var a anySlice
	err = yaml.Unmarshal([]byte(value), &a)
	out.MaybeDie(err, "invalid list syntax")
	upsert[key] = a
} else {
	upsert[key] = value
}
This means that any property of type string whose default value is something other than "" (empty string) cannot be unset.
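To make the consequence concrete, here is a Python transliteration of the branch order in the Go snippet quoted above. It is a sketch: `plan_set` is a hypothetical name, the "integer" type label is an assumption, and the YAML list parsing of the array branch is deliberately left out.

```python
def plan_set(meta_type, nullable, value):
    """Mirror the branch order of the quoted Go snippet.

    Returns ("upsert", new_value) or ("remove", None).
    """
    if nullable and value == "null":
        return ("upsert", None)
    if meta_type != "string" and value == "":
        return ("remove", None)  # reset to default
    if meta_type == "array":
        # The Go code parses the value as a YAML list here; omitted in this sketch.
        return ("upsert", value)
    return ("upsert", value)


# A non-string property given "" is removed, i.e. reset to its default.
integer_reset = plan_set("integer", False, "")

# A string property given "" is stored as the empty string instead.
string_reset = plan_set("string", False, "")

print(integer_reset)
print(string_reset)
```

The string case never reaches the remove branch, so an empty value overwrites the property rather than restoring its default.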
I didn't test it, but it might be worth checking rpk cluster config import:
https://github.com/redpanda-data/redpanda/blob/c3cca097b01c250715fd4012bac175f09c26778e/src/go/rpk/pkg/cli/cmd/cluster/config/import.go#L98-L223
Related issue redpanda-data/helm-charts#395
Right now the script relies on the IAM user associated with the access/secret keys having a default VPC configured. Without a default VPC, the following error appears when running terraform apply:
│ Error: error creating Security Group (redpanda-9b2beded-7423-d346-b28e-896c27dd58de-2022-10-20T14:29:42Z-node-sec-group): VPCIdNotSpecified: No default VPC for this user
│ status code: 400, request id: 9ec37fe0-bd45-4c1f-81b4-9070e9faeadc
│
│ with aws_security_group.node_sec_group,
│ on cluster.tf line 65, in resource "aws_security_group" "node_sec_group":
│ 65: resource "aws_security_group" "node_sec_group" {
│
╵
exit 1
The script could instead create a VPC and not rely on the default value.
Looks like geerlingguy.node_exporter may trigger rate limiting on GitHub's side because the role checks for the latest version each time ansible runs. We need to figure out if there's a way to disable this check or limit its frequency.
TASK [geerlingguy.node_exporter : Check current node_exporter version.] *********************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/main.yml:2
ok: [XX.YY.138.212] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007202", "end": "2023-01-31 01:48:20.308344", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.301142", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n build user: root@6e7732a7b81b\n build date: 20221129-18:59:09\n go version: go1.19.3\n platform: linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", " build user: root@6e7732a7b81b", " build date: 20221129-18:59:09", " go version: go1.19.3", " platform: linux/amd64"]}
ok: [XX.YY.97.33] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007200", "end": "2023-01-31 01:48:20.685369", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.678169", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n build user: root@6e7732a7b81b\n build date: 20221129-18:59:09\n go version: go1.19.3\n platform: linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", " build user: root@6e7732a7b81b", " build date: 20221129-18:59:09", " go version: go1.19.3", " platform: linux/amd64"]}
ok: [XX.YY.156.16] => {"changed": false, "cmd": ["/usr/local/bin/node_exporter", "--version"], "delta": "0:00:00.007467", "end": "2023-01-31 01:48:20.889350", "failed_when_result": false, "msg": "", "rc": 0, "start": "2023-01-31 01:48:20.881883", "stderr": "", "stderr_lines": [], "stdout": "node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)\n build user: root@6e7732a7b81b\n build date: 20221129-18:59:09\n go version: go1.19.3\n platform: linux/amd64", "stdout_lines": ["node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)", " build user: root@6e7732a7b81b", " build date: 20221129-18:59:09", " go version: go1.19.3", " platform: linux/amd64"]}
TASK [geerlingguy.node_exporter : Configure latest version] *********************************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/main.yml:8
included: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/config-version.yaml for XX.YY.156.16, XX.YY.97.33, XX.YY.138.212
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (5 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (4 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (3 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (2 retries left).
FAILED - RETRYING: [XX.YY.138.212 -> localhost]: Determine latest GitHub release (local) (1 retries left).
FAILED - RETRYING: [XX.YY.97.33 -> localhost]: Determine latest GitHub release (local) (1 retries left).
FAILED - RETRYING: [XX.YY.156.16 -> localhost]: Determine latest GitHub release (local) (1 retries left).
TASK [geerlingguy.node_exporter : Determine latest GitHub release (local)] ******************************************************************************************************************************************
task path: /Users/hcoyote/.ansible/roles/geerlingguy.node_exporter/tasks/config-version.yaml:2
fatal: [XX.YY.97.33 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
fatal: [XX.YY.138.212 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
fatal: [XX.YY.156.16 -> localhost]: FAILED! => {"access_control_allow_origin": "*", "access_control_expose_headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Used, X-RateLimit-Resource, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset", "attempts": 5, "changed": false, "connection": "close", "content_length": "279", "content_security_policy": "default-src 'none'; style-src 'unsafe-inline'", "content_type": "application/json; charset=utf-8", "date": "Tue, 31 Jan 2023 01:48:49 GMT", "elapsed": 0, "json": {"documentation_url": "https://docs.github.com/rest/overview/resources-in-the-rest-api#rate-limiting", "message": "API rate limit exceeded for XX.YY.27.195. (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"}, "msg": "Status code was 403 and not [200]: HTTP Error 403: rate limit exceeded", "redirected": false, "referrer_policy": "origin-when-cross-origin, strict-origin-when-cross-origin", "server": "Varnish", "status": 403, "strict_transport_security": "max-age=31536000; includeSubdomains; preload", "url": "https://api.github.com/repos/prometheus/node_exporter/releases/latest", "x_content_type_options": "nosniff", "x_frame_options": "deny", "x_github_media_type": "github.v3; format=json", "x_github_request_id": "REDACTED", "x_ratelimit_limit": "60", "x_ratelimit_remaining": "0", "x_ratelimit_reset": "1675129929", "x_ratelimit_resource": "core", "x_ratelimit_used": "60", "x_xss_protection": "1; mode=block"}
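The failed responses above carry both a date header and an x_ratelimit_reset value (epoch seconds), so the time until the limit resets can be computed from them. A small Python sketch, where `seconds_until_reset` is a hypothetical helper and the two inputs are copied from the log:

```python
from email.utils import parsedate_to_datetime


def seconds_until_reset(date_header, ratelimit_reset):
    """Seconds between the response time and the rate limit reset (epoch seconds)."""
    response_time = parsedate_to_datetime(date_header).timestamp()
    return max(0, int(ratelimit_reset) - int(response_time))


wait = seconds_until_reset("Tue, 31 Jan 2023 01:48:49 GMT", "1675129929")
print(wait)
```

With these values the reset was 200 seconds away when the task failed, and x_ratelimit_remaining was already 0, so any retry made before the reset would presumably hit the same limit.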
Consider the configure redpanda step:
- name: configure redpanda
  notify:
    - restart redpanda-tuner
    - restart redpanda
  vars:
    use_public_ips: "{{ advertise_public_ips | default(false, true) | bool }}"
  shell: |
    rpk config set cluster_id 'redpanda'
    rpk config set organization 'redpanda-test'
    rpk config set redpanda.advertised_kafka_api '{
      {% if use_public_ips %}
      address: {{ inventory_hostname }},
      {% else %}
      address: {{ hostvars[inventory_hostname].private_ip }},
      {% endif %}
      port: 9092
    }' --format yaml
    rpk config set redpanda.advertised_rpc_api '{
      address: {{ hostvars[inventory_hostname].private_ip }},
      port: 33145
    }' --format yaml
    sudo rpk mode production
    {% if hostvars[groups['redpanda'][0]].id == hostvars[inventory_hostname].id %}
    sudo rpk config bootstrap \
      --id {{ hostvars[inventory_hostname].id }} \
      --self {{ hostvars[inventory_hostname].private_ip }}
    {% else %}
    sudo rpk config bootstrap \
      --id {{ hostvars[inventory_hostname].id }} \
      --self {{ hostvars[inventory_hostname].private_ip }} \
      --ips {{ groups["redpanda"] | map('extract', hostvars) | map(attribute='private_ip') | join(',') }}
    {% endif %}
If any shell command other than the last one fails, this step will succeed when it should fail, since these steps are effectively run as a shell script whose return value is that of the last command to run.
The easiest fix is probably just a set -e at the top of the script.
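The behaviour is easy to confirm with a two-line script. The Python sketch below assumes a POSIX sh is available on the machine running it; `run_script` is a hypothetical helper, and `false` and `true` stand in for the rpk commands.

```python
import subprocess


def run_script(script):
    """Run a multi-line script with sh and return its exit status."""
    return subprocess.run(["sh", "-c", script]).returncode


# `false` fails, but the script's status is that of the last command (`true`).
without_set_e = run_script("false\ntrue\n")

# With `set -e` the shell stops at the first failing command.
with_set_e = run_script("set -e\nfalse\ntrue\n")

print(without_set_e, with_set_e)
```

Only the second script reports a non-zero status, which is what Ansible needs in order to mark the task as failed.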
The README states that:
vm_data_disk_gb: Size of the Premium_LRS data disk in GiB (default 512 P20)
whereas the actual default is 2048 (P40): https://github.com/redpanda-data/deployment-automation/blob/main/azure/vars.tf#L31
It's hard to tell who created a given node, i.e., to find abandoned machines or so users can check what they have running.
I propose we add a tag with the IAM username to make it clearer.
Read from ~/.aws/config to get the user's default region rather than hard-coding aws_region to us-west-2.
When deploying a cluster it would be great to have an option that sets up infra support for the archival storage feature in Redpanda. Briefly this would be:
@Lazin do you have the set of configuration elements that we'll need? I thought they were in configuration.h but I don't see them in there. Maybe they haven't merged yet?
We should have a playbook on performing rolling upgrades.
In start-redpanda.yml, the task that establishes whether a restart is necessary seems to fail on occasion when the package_result registered variable is checked for the results key. Not sure why it's missing in some cases, but this seems to at least be a fix for now. Documenting it here so we can get a branch made and tested.
index 1ff9680..42a9827 100644
--- a/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml
+++ b/ansible/playbooks/roles/redpanda_broker/tasks/start-redpanda.yml
@@ -210,7 +210,8 @@
#
- name: Establish whether restart required
ansible.builtin.set_fact:
- restart_required: '{{ ("true" in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or "Removed" in package_result.results or "1 upgraded" in package_result.results))) and (restart_node | default("true") | bool) }}'
+ restart_required: '{{ ("true" in restart_required_rc.stdout or (is_initialized and (nodeconfig_result.changed or package_result.results is defined and ("Removed" in package_result.results or "1 upgraded" in package_result.results)))) and (restart_node | default("true") | bool) }}'
# serial: 1 would be the proper solution here, but that can only be set on play level
[Slack Message](https://redpandadata.slack.com/archives/C049AE7V4U8/p1678491547657399?thread_ts=1678485045.593989&cid=C049AE7V4U8)
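The guarded condition from the diff can be restated in Python to show how it behaves when the results key is absent. This is a sketch under assumptions: `restart_required` is a hypothetical function, and plain dicts stand in for the registered Ansible variables.

```python
def restart_required(restart_required_stdout, is_initialized,
                     nodeconfig_changed, package_result, restart_node=True):
    """Evaluate the restart condition, tolerating a missing 'results' key."""
    results = package_result.get("results")
    # Only inspect the package results when they were actually registered.
    package_changed = results is not None and (
        "Removed" in results or "1 upgraded" in results
    )
    return bool(
        ("true" in restart_required_stdout
         or (is_initialized and (nodeconfig_changed or package_changed)))
        and restart_node
    )


# No 'results' key: the expression evaluates cleanly instead of raising.
missing_results = restart_required("false", True, False, {})

# 'results' present and reporting a change: a restart is required.
changed_results = restart_required("false", True, False, {"results": ["Removed"]})

print(missing_results, changed_results)
```

This works around the failure but, as noted, does not explain why the key is sometimes missing.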
As described in redpanda-data/redpanda#2958, things can go wrong when the list syntax for some properties is mixed with the single map syntax.
This happens in the ansible deploy, where we use the single syntax, but it gets transformed to a list, and then subsequent redpanda configure runs fail to set that property because we can't overwrite a list with a single value.
The list format is preferred and supported by all tools, while the single value is effectively deprecated and not supported by some tools. We should use list syntax.
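One way to picture the change is a normalisation step that always emits the list form. The Python sketch below is illustrative only: `as_listener_list` is a hypothetical helper and the address is a placeholder, not a value from the playbooks.

```python
def as_listener_list(value):
    """Wrap a single map in a list; leave an existing list untouched."""
    if isinstance(value, dict):
        return [value]
    return list(value)


single = {"address": "10.0.0.1", "port": 9092}  # placeholder values
normalized = as_listener_list(single)

print(normalized)
print(as_listener_list(normalized) == normalized)
```

Applying the step twice gives the same result, so a property that is already in list form is never rewritten into a different shape.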
Recently, in attempting to generate 1.5 GB/sec of traffic to an appropriately sized cluster using 100 byte messages (the customer is moving from SQS), we have encountered difficulties with the OMB producers. Throughput seems to fall apart somewhere around 1 to 1.2 million messages per second. The same cluster can handle 1.5 GB/sec with 1024 byte messages with excellent performance. 1.5 GB/sec with 100 byte messages (same batch size, etc.) results in producers erroring out and aborting.
@travisdowns has additional details around the nature of these failures he can add to this issue.
Additional experiments using other client technologies have been in progress by @larsenpanda .
Ultimately we need to update our client so that these types of workloads succeed out of the box, and document the client's limits so customers know the upper bounds of the producers and consumers and do not attribute poor performance to Redpanda.
This will allow easy filtering of key pairs in the console based on which resources were created by this script.
instance_name_prefix variable and include at beginning of name tag
count.index at end of name tag
I share in detail the steps I took and the errors I got.
First of all, I have a basic TypeScript project with KafkaJS installed ( https://kafka.js.org/docs/getting-started ).
My steps:
First, installing 1 broker and 1 monitoring node on AWS:
cd aws
terraform apply (the necessary parameters and public keys had been created for the machines)
The machines I wanted were successfully created on AWS.
Next, installing Redpanda and Grafana on the broker with ansible:
ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -e advertise_public_ips=true -e grafana_admin_pass=******* -i hosts.ini -v ansible/playbooks/provision-node.yml
and successful redpanda installation on broker. I can also connect to grafana with the password I specified.
private createKafkaConsumer(clientId: string, groupId: string): Consumer {
  const kafka = new Kafka({
    clientId: clientId,
    brokers: ['*.*.*.*:9092']
  })
  const consumer = kafka.consumer({ groupId })
  return consumer
}
I can successfully connect to the broker with the code snippet above.
Up to this stage I can use Redpanda successfully. But this setup is not secure, so I have to enable TLS.
Here are the steps the documentation gives for TLS.
ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -v ansible/playbooks/create-ca.yml
ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/generate-csrs.yml
ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/issue-certs.yml
ansible-playbook --private-key ./Production_Redpanda_Broker_OpenSSH -i hosts.ini -e -v ansible/playbooks/install-certs.yml
After these steps, the redpanda.yml file in the broker is updated accordingly. It is updated as in the template specified for redpanda.yml in the ansible playbook section.
Now I'm trying to connect to the broker again.
private createKafkaConsumer(clientId: string, groupId: string): Consumer {
  const kafka = new Kafka({
    clientId: clientId,
    brokers: ['*.*.*.*:9092'],
    ssl: {
      rejectUnauthorized: false,
      ca: [fs.readFileSync('/tls/ca/ca.crt', 'utf-8')]
    }
  })
  const consumer = kafka.consumer({ groupId })
  return consumer
}
I can no longer connect to the broker I was able to connect to without TLS. I'm doing something wrong but I can't find it.
This is the error I got.
{"level":"INFO","timestamp":"2023-01-05T20:48:59.897Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:48:59.897Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"WebSiteProducer_Group"}
{"level":"INFO","timestamp":"2023-01-05T20:49:00.198Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"172.31.17.111:9092","clientId":"71b80823-8058-4279-9814-4561eb3840a4"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSConnectionError: Connection timeout","groupId":"WebSiteProducer_Group","stack":"KafkaJSConnectionError: Connection timeout\n at Timeout.onTimeout [as _onTimeout] (/home/emredarak/repo/consumer/node_modules/kafkajs/src/network/connection.js:223:23)\n at listOnTimeout (node:internal/timers:559:17)\n at processTimers (node:internal/timers:502:7)"}
{"level":"INFO","timestamp":"2023-01-05T20:49:01.345Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"WebSiteProducer_Group"}
{"level":"ERROR","timestamp":"2023-01-05T20:49:01.346Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"WebSiteProducer_Group"}
{"level":"INFO","timestamp":"2023-01-05T20:49:01.647Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"WebSiteProducer_Group"}
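The connection timeout in the log names 172.31.17.111:9092, which is the address the client dialed. Kafka clients connect to the addresses brokers advertise in metadata, so this need not match the entry in brokers. One possible reading, which the report does not confirm, is that the broker advertises a private address: the provisioning run passed -e advertise_public_ips=true, while the TLS playbook runs shown above did not. The sketch below only helps check that reading by extracting the dialed address from a KafkaJS JSON log line and testing it against the RFC 1918 private ranges. The helpers `isPrivateIPv4` and `dialedBroker` are hypothetical and not part of KafkaJS.

```typescript
// Hypothetical helpers (not part of KafkaJS) for inspecting a KafkaJS JSON log line.
interface KafkaLogLine {
  level: string;
  message: string;
  broker?: string; // present on connection-level log entries, e.g. "172.31.17.111:9092"
}

// True if host is a dotted-quad IPv4 address inside an RFC 1918 private range:
// 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16.
function isPrivateIPv4(host: string): boolean {
  const octets = host.split(".");
  const valid =
    octets.length === 4 &&
    octets.every((o) => /^\d{1,3}$/.test(o) && Number(o) <= 255);
  if (!valid) {
    return false;
  }
  const a = Number(octets[0]);
  const b = Number(octets[1]);
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Parse one log line and return the broker address it mentions, or null if
// the entry has no usable "broker" field.
function dialedBroker(
  logLine: string
): { host: string; port: number; isPrivate: boolean } | null {
  const entry = JSON.parse(logLine) as KafkaLogLine;
  if (!entry.broker) {
    return null;
  }
  const sep = entry.broker.lastIndexOf(":");
  if (sep < 0) {
    return null;
  }
  const host = entry.broker.slice(0, sep);
  const port = Number(entry.broker.slice(sep + 1));
  return { host, port, isPrivate: isPrivateIPv4(host) };
}
```

Passing the "[Connection] Connection timeout" line from the log above to dialedBroker yields host 172.31.17.111, port 9092, and isPrivate true. A private address is only reachable from inside the same network, so a client outside AWS would time out against it regardless of the TLS settings.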
The code for safe restart uses rpk cluster maintenance enable {{ node_id }} which unfortunately uses the new rpk settings before TLS is enabled.
We either need a way to single-thread the whole of the last few plays (to ensure that maintenance mode is enabled, the node is restarted and then MM disabled all in a single action), or we need to nestle the updating of the redpanda.yaml file into that single shell command.
When I wrote this I couldn't find a way of having a block with throttle set within a role.
Currently we are downloading and setting up a local repo on the servers using a curl script and then doing a rpk yum etc from local repo.
This should likely be changed to use our public repos with the appropriate package tool, an explicit version flag, and a related variable.
Current code:
- name: add the redpanda repo
  shell: |
    curl -1sLf {{ redpanda_repo_script }} | sudo -E bash
  args:
    warn: no