ansible-alertmanager's People

Contributors

cloudalchemybot, corny, etcet, gajowi, ialys, iwagner-inmar, jstaffans, lunarthegrey, mdschmitt, mgrecar, mjbnz, nikosgraser, nikosmeds, noraab, paulfantom, porkepix, rdemachkovych, rlex, s3rk, sammcadams-8451, sardarhalip, schewara, slomo, soloradish, superq, thirdeye-oleksandr, till, walczakp, wikro, wookietreiber

ansible-alertmanager's Issues

Config validation not done on templates

What happened?

I made a mistake in my Go template, and Ansible restarted Alertmanager. The mistake caused Alertmanager to crash on startup; no validation was done.

Did you expect to see something different?
It would be nice to have an amtool check-config run before the restart.

Unfortunately, amtool check-config only accepts alertmanager.yml and not the .tmpl files directly, which blocks the use of the validate option. A possible solution would be a task that manually runs amtool check-config and restarts only if the templates were updated and the check passes.
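A rough sketch of the kind of validation task I have in mind, assuming amtool is on the PATH and the config lives at /etc/alertmanager/alertmanager.yml (both are assumptions about where the role puts things):

- name: Validate alertmanager config and templates
  ansible.builtin.command: amtool check-config /etc/alertmanager/alertmanager.yml
  changed_when: false
  become: true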

How to reproduce it (as minimally and precisely as possible):
Mess up a Go template and run your playbook.

Environment

  • Role version:

0.17.2

  • Ansible version information:
ansible 2.8.5
  config file = /var/lib/ansible/ansible.cfg
  configured module search path = ['/home/vos/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.7/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.7.3 (default, Apr  3 2019, 05:39:12) [GCC 8.3.0]
  • Variables:
insert role variables relevant to the issue
  • Ansible playbook execution Logs:
TASK [cloudalchemy.alertmanager : copy alertmanager config] *******************************************************************************************************************************************************
ok: [prometheus1]

TASK [cloudalchemy.alertmanager : create systemd service unit] ****************************************************************************************************************************************************
ok: [prometheus1]

TASK [cloudalchemy.alertmanager : copy alertmanager template files] ***********************************************************************************************************************************************
--- before: /etc/alertmanager/templates/vos.tmpl
+++ after: /var/lib/ansible/playbooks/files/alertmanager/templates/vos.tmpl
@@ -1,10 +1,11 @@
+{{ invalidtemplatecode }}
 {{ define "slack.default.text" }}
     [PromQL Expression]({{ (index .Alerts 0).GeneratorURL }})
     {{ if gt (len .Alerts.Firing) 0 }}
         {{ range .Alerts.Firing }}{{ .Annotations.description }}
         {{ end }}
     {{ end }}
     {{ if gt (len .Alerts.Resolved) 0 }}
         **Resolved:**
         {{ range .Alerts.Resolved }}~~{{ .Annotations.description }}~~
         _start: {{ .StartsAt }}_

changed: [prometheus1] => (item=/var/lib/ansible/playbooks/files/alertmanager/templates/vos.tmpl)

TASK [cloudalchemy.alertmanager : ensure alertmanager service is started and enabled] *****************************************************************************************************************************
ok: [prometheus1]

RUNNING HANDLER [cloudalchemy.alertmanager : restart alertmanager] ************************************************************************************************************************************************
changed: [prometheus1]
May 10 22:18:41 prometheus1 systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
May 10 22:18:41 prometheus1 systemd[1]: alertmanager.service: Failed with result 'exit-code'.

Anything else we need to know?:

Add a real life example playbook

Hi

Please add a real-life example playbook. An "empty" one that just includes the role isn't enough, since the preflight assert enforces that you define a receiver and a route, and coming up with the proper YAML is a bit tedious. Something like the following would have helped to really get started:

---
# https://github.com/cloudalchemy/ansible-alertmanager

alertmanager_web_external_url: alertmanager.domain.tld

alertmanager_receivers:
  - name: infra-ml
    email_configs:
      - to: "[email protected]"
        from: "[email protected]"
        smarthost: "smtp.gmail.com:587"
        auth_username: "[email protected]"
        auth_identity: "[email protected]"
        auth_password: UseGmailAppToken

alertmanager_route:
  group_by: ["..."]
  receiver: infra-ml

Best regards,

Private IP problem

If, say, your container doesn't have an RFC 1918 private IP address, Alertmanager will fail to start.
Quick fix: add --cluster.advertise-address=127.0.0.1:$YOUR_PORT as a CLI parameter.

This could be done in the playbook, I guess.
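A hedged sketch of how that could look with role variables, assuming the alertmanager_config_flags_extra entries (visible in the defaults dump further down this page) are rendered as --key=value CLI flags; 9094 is only an example port:

alertmanager_config_flags_extra:
  cluster.advertise-address: "127.0.0.1:9094"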

Permissions of alertmanager config file

What happened?

A successful installation of Alertmanager creates a configuration file, e.g. /etc/alertmanager/alertmanager.yml.

The permissions on said config file are as follows:

$ ls -ald /etc/alertmanager/alertmanager.yml
-rw-r--r-- 1 alertmanager alertmanager 1050 Apr 30 13:02 /etc/alertmanager/alertmanager.yml

which is defined here:

It is a security problem because Alertmanager receivers (e.g. email_configs) include secrets in plain text, which would be visible to every user logged into the Alertmanager host.

Did you expect to see something different?

I would expect the config to not be readable by "others", i.e.:

$ ls -ald /etc/alertmanager/alertmanager.yml
-rw-r----- 1 alertmanager alertmanager 1050 Apr 30 13:02 /etc/alertmanager/alertmanager.yml

How to reproduce it (as minimally and precisely as possible):

Standard task, e.g.

    - name: Install Alertmanager
      ansible.builtin.import_role:
        name: cloudalchemy.alertmanager

Environment

  • Role version:

    0.19.1

  • Ansible version information:

$ ansible --version
ansible 2.10.8
  config file = /Users/weakcamel/git/auto/ansible.cfg
  configured module search path = ['/Users/weakcamel/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /Users/weakcamel/git/auto/.venv/lib/python3.9/site-packages/ansible
  executable location = /Users/weakcamel/git/auto/.venv/bin/ansible
  python version = 3.9.4 (default, Apr  5 2021, 01:50:46) [Clang 12.0.0 (clang-1200.0.32.29)]
  • Variables:
n/a
  • Ansible playbook execution Logs:
n/a

Anything else we need to know?:
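Until the role restricts the mode itself, a follow-up task along these lines could tighten the permissions (path and ownership taken from the ls output above):

- name: Restrict permissions on the Alertmanager config
  ansible.builtin.file:
    path: /etc/alertmanager/alertmanager.yml
    owner: alertmanager
    group: alertmanager
    mode: "0640"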

Can't use custom templates for receivers

I tried adding custom templates, using alertmanager_template_files, which works. What doesn't work is when I configure the receiver:

alertmanager_receivers:
  - name: "receiver"
    pagerduty_configs:
      - routing_key: "xxx"
        send_resolved: true
        description: '{{ template "pagerduty.custom.description" . }}'

The job fails in preflight.yml, on the "Fail when there are no receivers defined" task, and Ansible says

Error was a <class 'ansible.errors.AnsibleError'>, original message: template error while templating string: expected token 'end of print statement', got 'string'. String: {{ template \"pagerduty.custom.description\" . }}

What am I expected to write to use custom templates?
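The "slack template issue" further down this page wraps Go template calls in {% raw %}...{% endraw %} so that Jinja2 does not try to evaluate them; presumably the same approach applies here:

alertmanager_receivers:
  - name: "receiver"
    pagerduty_configs:
      - routing_key: "xxx"
        send_resolved: true
        description: '{% raw %}{{ template "pagerduty.custom.description" . }}{% endraw %}'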

binary checksum seems broken

What happened?
The checksum always failed and the whole playbook failed. I suspect the Ansible url lookup doesn't follow the HTTP 302 redirect.

Did you expect to see something different?
The checksum should be pulled correctly.

How to reproduce it (as minimally and precisely as possible):

git clone https://github.com/cloudalchemy/ansible-alertmanager roles/alertmanager
Create a basic playbook and include it as a role.

Environment

$ curl -s https://github.com/prometheus/alertmanager/releases/download/v0.20.0/sha256sums.txt
<html><body>You are being <a href="https://github-production-release-asset-2e65be.s3.amazonaws.com/11452538/a510c800-1c60-11ea-8d4c-414fc4fea6b5?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200303%2Fus-east-1%2Fs3%2Faws4_request&amp;X-Amz-Date=20200303T081307Z&amp;X-Amz-Expires=300&amp;X-Amz-Signature=857a1313e6f3cf669e7cef5c2fceb7b529cab239a1835ba201aad78b68bc6748&amp;X-Amz-SignedHeaders=host&amp;actor_id=0&amp;response-content-disposition=attachment%3B%20filename%3Dsha256sums.txt&amp;response-content-type=application%2Foctet-stream">redirected</a>.</body></html>

$ curl -s https://github.com/prometheus/alertmanager/releases/download/v0.20.0/sha256sums.txt -L
78d741b3bdcb910619f498d7662969cae363c9d5840d7cef1e5481f103de59ca  alertmanager-0.20.0.darwin-386.tar.gz
5134585c6200856ca70f61502a460413f6a0e6c848e7156724e5d568f77aba56  alertmanager-0.20.0.darwin-amd64.tar.gz
8bc8ac50a4a7545b0a2e6a6d710ab4c51e8ccd85d519013db9495e4549546f93  alertmanager-0.20.0.dragonfly-amd64.tar.gz
810fabaad75f1d5d172f48eeafb6099f669c37baababff458e798474e006969e  alertmanager-0.20.0.freebsd-386.tar.gz
5789adb5d4da773ec8480458c3d445985d9fbb3ce8cbe939090508b0a96f436d  alertmanager-0.20.0.freebsd-amd64.tar.gz
645fd8b1bb541a360521b12694e7483017f9c5b95152a313630b8f3c06cbeb3e  alertmanager-0.20.0.freebsd-armv6.tar.gz
48d3b69ca5618bd6632b10563eab1e7331ffdf9e1b6943cc34002beaccdec7cb  alertmanager-0.20.0.freebsd-armv7.tar.gz
0f922a82a7358a33736d388faa9b44c661f4844c15a4c2eadeb71d1f6738bd66  alertmanager-0.20.0.linux-386.tar.gz
3a826321ee90a5071abf7ba199ac86f77887b7a4daa8761400310b4191ab2819  alertmanager-0.20.0.linux-amd64.tar.gz
ee219113b4dad6042f3f88dccea48ee15ac5e7d5c84933bc90f320819b71e1c5  alertmanager-0.20.0.linux-arm64.tar.gz
11d92562c72d9fc747db45bcf48d181f3db7b178a254af4877f74ab20f986a6a  alertmanager-0.20.0.linux-armv5.tar.gz
89762e97cb18b4a47557cf74734fb398645ea5d8191b71b248b0dc515073e370  alertmanager-0.20.0.linux-armv6.tar.gz
5ebd33da8d61cef6ea1aab2ecc73310ff3fe4320eb76851ae71e22e3c5ddbc36  alertmanager-0.20.0.linux-armv7.tar.gz
fbd6ab4471b4c9c167d7fbe8f4b90cca2415b2d5426a7ce74734a9182613573e  alertmanager-0.20.0.linux-mips64.tar.gz
5ea2ee935119d15247359d9bf1124c00dfb8e62882be8862721a477ff728b3b4  alertmanager-0.20.0.linux-mips64le.tar.gz
66c1aa886c48e6aef7cda3d835fa985254d59a2a811b204f69aa806e7796e806  alertmanager-0.20.0.linux-ppc64.tar.gz
1cf6e81a3964e63019026518574722922fd6d98fb256f3dba49efdcff20b14ff  alertmanager-0.20.0.linux-ppc64le.tar.gz
53e8be5b029dc00fce97d1f79d5202a54ad5b20aa5ca135fc168f5eefd0f6b5c  alertmanager-0.20.0.linux-s390x.tar.gz
d53382b389139876e13e22f500c19cd79fde67ab899dae51961bdb0e097734e0  alertmanager-0.20.0.netbsd-386.tar.gz
378a19ab208631f989ab353b0b3e3e4c1637202ca1b6c47f134ce1470560912f  alertmanager-0.20.0.netbsd-amd64.tar.gz
2defc8b8ab59a291aa0e81c032aa27185e9c4ea702858e7e994f0df1dafd4164  alertmanager-0.20.0.netbsd-armv6.tar.gz
8463f50957935c1723d1f2ec5d386c9a1379442dde655d8ea430d6ded1a9d47e  alertmanager-0.20.0.netbsd-armv7.tar.gz
476edebdff737cc7356556637b8caaea7b3708a2b8b370ab8d3201bd2c52cc36  alertmanager-0.20.0.openbsd-386.tar.gz
039d0fc1cc00e710f94ceda53f1bb4ec22341965c0e4bc623ab511edf1d61930  alertmanager-0.20.0.openbsd-amd64.tar.gz
28ad18728935412dcddc10a6ab7bdd4858bd7c832444bfb1880266caf2a310f6  alertmanager-0.20.0.windows-386.tar.gz
5887902bea633d8b3396804760308acc0a2631b7ca85df75d8c526cd1985d62b  alertmanager-0.20.0.windows-amd64.tar.gz
  • Role version:

    de29024767b37a6df1073cbf86254ea908a378a0

  • Ansible version information:

    $ ansible --version
ansible 2.9.4
  config file = /Users/teochenglim/thunes/sysadmin/ansible/projects/cobra/ansible.cfg
  configured module search path = ['/Users/teochenglim/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/Cellar/ansible/2.9.4_1/libexec/lib/python3.8/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.8.1 (default, Dec 27 2019, 18:06:00) [Clang 11.0.0 (clang-1100.0.33.16)]
  • Variables:

My workaround was to hard-code it:

go_arch: "amd64"
alertmanager_checksum: "3a826321ee90a5071abf7ba199ac86f77887b7a4daa8761400310b4191ab2819"
  • Ansible playbook execution Logs:
 ___________________________________________________________
< TASK [alertmanager : Get checksum for amd64 architecture] >
 -----------------------------------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

objc[72571]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called.
objc[72571]: +[__NSCFConstantString initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
ERROR! A worker was found in a dead state

Anything else we need to know?:

I commented out one task in tasks/preflight.yml.
I hard-coded the variable using group_vars.
I am running this on macOS.

# - name: "Get checksum for {{ go_arch }} architecture"
#   set_fact:
#     alertmanager_checksum: "{{ item.split(' ')[0] }}"
#   with_items:
#     - "{{ lookup('url', 'https://github.com/prometheus/alertmanager/releases/download/v' + alertmanager_version + '/sha256sums.txt', wantlist=True) | list }}"
#   when:
#     - "('linux-' + go_arch + '.tar.gz') in item"
#     - alertmanager_binary_local_dir | length == 0
go_arch: "amd64"
alertmanager_checksum: "3a826321ee90a5071abf7ba199ac86f77887b7a4daa8761400310b4191ab2819"

Configure for Telegram

Hello,
It looks like this role can't be used with Telegram out of the box. Is that right?

Not able to configure a mesh

Hello,

with this role it is not possible to configure a mesh.
To configure a mesh, each peer needs to know its peers. The documentation says:

--mesh.peer value: initial peers (repeat flag for each additional peer)

so I need to be able to repeat the mesh.peer flag in the role var "alertmanager_cli_flags"

I tried

alertmanager_cli_flags:
  mesh.peer=123.456.789.1:6783
  mesh.peer=123.456.789.2:6783

but only the last "mesh.peer" is passed to the CLI as a parameter. I looked at the code and it seems that a dictionary/map is used, which always overrides the same key with the new value. That's why only the last entry is passed.

A workaround I found is setting a default value and passing "more":

alertmanager_cli_flags:
  log.level: "info --mesh.peer=123.456.789.1:6783 --mesh.peer=123.456.789.2:6783"

Disable preflight checks if custom config path

I want to provide my own config template with receivers and routes hard coded. So currently I silence the preflight checks using:

alertmanager_routes: null
alertmanager_receivers: null

Maybe this could be built-in behaviour? I can provide a PR if needed.

"No package policycoreutils-python available."

When running the playbook against a Centos 8 host, I receive the following error:

failed: [host] (item=policycoreutils-python) => {"ansible_loop_var": "item", "attempts": 5, "changed": false, "failures": ["No package policycoreutils-python available."], "item": "policycoreutils-python", "msg": "Failed to install some of the specified packages", "rc": 1, "results": []}

Could this be resolved by having the same centos.yml and centos-8.yml files in the vars directory as in https://github.com/cloudalchemy/ansible-prometheus/tree/master/vars?
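For reference, on CentOS/RHEL 8 the SELinux Python bindings ship as policycoreutils-python-utils (alongside python3-libselinux). A hedged sketch of what such a vars file could contain, mirroring the ansible-prometheus layout; the file and variable names here are assumptions, not necessarily the role's actual ones:

# vars/centos-8.yml (hypothetical name, mirroring ansible-prometheus)
alertmanager_selinux_packages:
  - python3-libselinux
  - policycoreutils-python-utils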

Http config unable to set basic auth

When creating an http_config via the default template, it is unable to handle basic auth. The issue appears to be here:

{% if alertmanager_http_config | length %}
  http_config:
{% endif %}
{% for key, value in alertmanager_http_config.items() %}
    {{ key }}: {{ value | quote }}
{% endfor %}

I was able to fix it locally by adding a new variable and modifying the template like so:

{% if alertmanager_http_config | length %}
  http_config:
{% endif %}
{% if alertmanager_http_config_basic_auth | length %}
    basic_auth:
{% for key, value in alertmanager_http_config_basic_auth.items() %}
      {{ key }}: {{ value | quote }}
{% endfor %}
{% endif %}
{% for key, value in alertmanager_http_config.items() %}
    {{ key }}: {{ value | quote }}
{% endfor %}

This can be fixed by using a custom template, but it seems like it should be part of the default; I would be happy to put in a PR. Also, the alertmanager_http_config variable appears to be missing from the README altogether.

When will 0.13.13 be released?

Hi,

When will the new release be made?
I've seen in CHANGELOG.md that you've updated ansible-alertmanager to use the latest version of Alertmanager, but there's no release yet :(

Thanks !

alertmanager_checksum_url gives an error after a commit from 7 days ago

What happened?
After commit 6f050af, I received the following error:

TASK [ansible-alertmanager : Get checksum for amd64 architecture] **************
fatal: [promcmt-test01]: FAILED! => {"msg": "template error while templating string: expected token 'end of print statement', got 'select'. String: {{ lookup('url', 'https://github.com/prometheus/alertmanager/releases/download/v' + alertmanager_version + '/sha256sums.txt', wantlist=True) | list select('contains', 'linux-' + go_arch + '.tar.gz') | list | first).split(' ')[0] }}"}
...ignoring

TASK [ansible-alertmanager : Checksum lookup error message] ********************
fatal: [promcmt-test01]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'alertmanager_checksum_url' is undefined\n\nThe error appears to be in '/tmp/awx_183634_xomgifxg/requirements_roles/ansible-alertmanager/tasks/preflight.yml': line 51, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Checksum lookup error message\n ^ here\n"}

Did you expect to see something different?

How to reproduce it (as minimally and precisely as possible):

Get latest version of amd64 alertmanager installed

Environment

  • Role version:

0.21
Insert release version/galaxy tag or Git SHA here

  • Ansible version information:

    ansible --version

  • Variables:

insert role variables relevant to the issue
  • Ansible playbook execution Logs:
insert Ansible logs relevant to the issue here

Anything else we need to know?:
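The "Unable to get the checksum due to errors in the Ansible" issue at the bottom of this page hits the same error and reports a fix: the expression is missing an opening parenthesis and a pipe before select. The working form quoted there is:

alertmanager_checksum: "{{ (lookup('url', 'https://github.com/prometheus/alertmanager/releases/download/v' + alertmanager_version + '/sha256sums.txt', wantlist=True) | list | select('contains', 'linux-' + go_arch + '.tar.gz') | list | first).split(' ')[0] }}"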

RFE: Add a way to create silences from ansible

It might be beneficial to add an option to create silences from ansible.

What needs to be researched:

  • checking whether a silence is already added
  • adding silences (see the sketch below)
  • figuring out a silence format/model for storing in an Ansible variable
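A hedged sketch of the "adding silences" part, using the uri module against the Alertmanager v2 API; the URL, matcher and timestamps below are placeholders:

- name: Create a silence in Alertmanager
  ansible.builtin.uri:
    url: "http://localhost:9093/api/v2/silences"
    method: POST
    body_format: json
    body:
      matchers:
        - name: alertname
          value: NodeDown
          isRegex: false
      startsAt: "2023-01-01T00:00:00Z"
      endsAt: "2023-01-01T02:00:00Z"
      createdBy: ansible
      comment: "Planned maintenance"
    status_code: 200
  register: _silence

Checking whether an equivalent silence already exists would mean a GET against the same endpoint and comparing matchers before posting.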

Request for more TAGS

Hi,

Commit 6f050af broke this role for me.
Can someone please create a tag from before this change?

Kind regards,

Mathias

Error while getting checksum list for version `latest`

What happened?
An error occurs when the checksums for the version are fetched.

Did you expect to see something different?
The download should work with version "latest" as documented in README.md.

How to reproduce it (as minimally and precisely as possible):

Use the role in another role:

playbook.yml:

---
- hosts: monitoringserver
  roles:
    - testrole

roles/testrole/tasks/main.yml: (example from README.md)

---
- name: configure prometheus alertmanager
  ansible.builtin.include_role:
    name: ansible-alertmanager
  vars:
    alertmanager_version: latest
    alertmanager_slack_api_url: "http://example.org"
    alertmanager_receivers:
      - name: slack
        slack_configs:
          - send_resolved: true
            channel: '#alerts'
    alertmanager_route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: slack

Environment

  • Role version:

    0.19.1 / also tested with master

  • Ansible version information:

    ansible [core 2.12.4]
    config file = /home/chris/Documents/shared_projects/ansible-server/ansible.cfg
    configured module search path = ['/home/chris/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
    ansible python module location = /usr/lib/python3/dist-packages/ansible
    ansible collection location = /home/chris/.ansible/collections:/usr/share/ansible/collections
    executable location = /usr/bin/ansible
    python version = 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0]
    jinja version = 3.0.3
    libyaml = True
    
  • Variables:

alertmanager_version: latest
  • Ansible playbook execution Logs:
TASK [cloudalchemy.alertmanager : Set prometheus version to 0.24.0] ************************************************************************************************************************************************************************************************************************
ok: [spinach]

TASK [cloudalchemy.alertmanager : Get checksum for amd64 architecture] *********************************************************************************************************************************************************************************************************************
fatal: [spinach]: FAILED! => {"msg": "An unhandled exception occurred while running the lookup plugin 'url'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Received HTTP error for https://github.com/prometheus/alertmanager/releases/download/vlatest/sha256sums.txt : HTTP Error 404: Not Found. Received HTTP error for https://github.com/prometheus/alertmanager/releases/download/vlatest/sha256sums.txt : HTTP Error 404: Not Found"}

Anything else we need to know?:
Probably goes wrong here:

- name: "Set alertmanager version to {{ _latest_release.json.tag_name[1:] }}"
set_fact:
alertmanager_version: "{{ _latest_release.json.tag_name[1:] }}"
alertmanager_checksum_url: "https://github.com/prometheus/alertmanager/releases/download/v{{ alertmanager_version }}/sha256sums.txt"

I guess that's because of the variable precedence rules described in this issue: ansible/ansible#22025 (comment) or here in the docs, but to be honest I don't really understand what the problem could be.

alertmanager_template_files variable config is incorrect

What happened?
If you run Ansible with your alert templates in the templates directory, Ansible will not copy them, based on the default config here:
https://github.com/cloudalchemy/ansible-alertmanager/blob/master/defaults/main.yml#L10

Did you expect to see something different?
It should copy all the files (by default *.tmpl) into the Alertmanager template directory (/etc/alertmanager/templates/).

  • Ansible version information:

    ansible 2.9.27

  • Variables:
    alertmanager_template_files

insert role variables relevant to the issue
  • Ansible playbook execution Logs:
    TASK [copy alertmanager template files] *****************************************************************************************************
    [WARNING]: Unable to find 'alertmanager/templates' in expected paths (use -vvvvv to see paths)




How you can fix this: the correct config is:

alertmanager_template_files:
  - templates/*.tmpl

You have to remove the alertmanager directory from the path.

Service command line flags set during `install` but not during `configure`

Not sure this is a bug, but it was a surprise to me and it's causing me some pain.

I have found that when trying to set the alertmanager_external_url value, it is only set when running the install tag, and not the configure tag. Digging in, the reason is that the web.external-url value is provided to the binary as a command line flag, which is baked into the alertmanager.service template, and that templating is only done during the install step.

This is a problem for me, because I'm trying to bake AMIs in AWS with most of the installation done ahead of time, and then deployment instantiates these AMIs and applies dynamic configuration to them, things like IP addresses, ports, URLs, etc. I've found that the setup here makes it so that I really can't do that.

Since those arguments to the binary are really (at least, in my mind) configuration, shouldn't the configure task update the flag as appropriate or re-template the service?

Prevent misconfiguration

We need to add preflight checks to ensure Alertmanager is configured properly. Currently the configuration isn't checked, which can lead to an unusable Alertmanager.

Deprecation warning about `include`

More or less duplicated this issue from ansible-prometheus

What happened?

When using this role I get the following deprecation warning:

[DEPRECATION WARNING]: "include" is deprecated, use include_tasks/import_tasks instead. 

This feature will be removed in version 2.16. Deprecation warnings can be disabled by 
setting deprecation_warnings=False in ansible.cfg.

Did you expect to see something different?

No deprecation warnings.

How to reproduce it (as minimally and precisely as possible):

Use this playbook:

- hosts: all
  roles: [cloudalchemy.alertmanager]

Environment

  • Role version:
    0.19.1

  • Variables:
    none

  • Ansible playbook execution Logs:
    none, the warning is displayed before anything else is done by the playbook.

Anything else we need to know?:

The relevant code seems to be here:

- include: preflight.yml

- include: install.yml
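A hedged sketch of what the warning asks for (the role's actual task files may carry tags and conditions that would need to be preserved):

- name: Run preflight checks
  ansible.builtin.import_tasks: preflight.yml

- name: Install alertmanager
  ansible.builtin.import_tasks: install.yml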

Download URL

Hello,
Can you put the binary download URL in a variable?
I'm installing on a host that doesn't have internet access, and currently the download URL is hardcoded.
Thank you

slack template issue

What did you do?
I want to set up templated notifications in Slack.
Did you expect to see something different?

Environment

---
alertmanager_version: latest
alertmanager_binary_local_dir: ''

alertmanager_config_dir: /etc/alertmanager
alertmanager_db_dir: /var/lib/alertmanager

alertmanager_config_file: 'alertmanager.yml.j2'

alertmanager_template_files:
  - templates/*.tmpl

alertmanager_web_listen_address: '0.0.0.0:9093'
alertmanager_web_external_url: 'http://localhost:9093/'

alertmanager_http_config: {}

alertmanager_resolve_timeout: 3m

alertmanager_config_flags_extra: {}
# alertmanager_config_flags_extra:
#   data.retention: 10

# SMTP default params
alertmanager_smtp: {}
# alertmanager_smtp:
#   from: ''
#   smarthost: ''
#   auth_username: ''
#   auth_password: ''
#   auth_secret: ''
#   auth_identity: ''
#   require_tls: "True"

# Default values you can see here -> https://prometheus.io/docs/alerting/configuration/
alertmanager_slack_api_url: 'XXX'
alertmanager_pagerduty_url: ''
alertmanager_opsgenie_api_key: ''
alertmanager_opsgenie_api_url: ''
alertmanager_victorops_api_key: ''
alertmanager_victorops_api_url: ''
alertmanager_hipchat_api_url: ''
alertmanager_hipchat_auth_token: ''
alertmanager_wechat_url: ''
alertmanager_wechat_secret: ''
alertmanager_wechat_corp_id: ''

# First read: https://github.com/prometheus/alertmanager#high-availability
alertmanager_cluster:
  listen-address: ""
# alertmanager_cluster:
#   listen-address: "{{ ansible_default_ipv4.address }}:6783"
#   peers:
#     - "{{ ansible_default_ipv4.address }}:6783"
#     - "demo.cloudalchemy.org:6783"

# alertmanager_receivers: []
alertmanager_receivers:
  - name: slack
    slack_configs:
      - send_resolved: true
        channel: '#it'
        icon_url: https://avatars3.githubusercontent.com/u/3380462
        # title: '{{ template "slack.title" . }}'
        # text: '{% raw %}{{ template "slack.text" . }}{% endraw %}'
        title: '{% raw %}{{ template "slack.monzo.title" . }}{% endraw %}'
        icon_emoji: '{% raw %}{{ template "slack.monzo.icon_emoji" . }}{% endraw %}'
        color: '{% raw %}{{ template "slack.monzo.color" . }}{% endraw %}'
        text: '{% raw %}{{ template "slack.monzo.text" . }}{% endraw %}'
        actions:
        - type: button
          text: 'Runbook :green_book:'
          url: '{% raw %}{{ (index .Alerts 0).Annotations.runbook }}{% endraw %}'
        - type: button
          text: 'Query :mag:'
          url: '{% raw %}{{ (index .Alerts 0).GeneratorURL }}{% endraw %}'
        - type: button
          text: 'Dashboard :grafana:'
          url: '{% raw %}{{ (index .Alerts 0).Annotations.dashboard }}{% endraw %}'
        - type: button
          text: 'Silence :no_bell:'
          url: '{% raw %}{{ template "__alert_silence_link" . }}{% endraw %}'
        - type: button
          text: '{% raw %}{{ template "slack.monzo.link_button_text" . }}{% endraw %}'
          url: '{% raw %}{{ .CommonAnnotations.link_url }}{% endraw %}'

alertmanager_inhibit_rules: []
# alertmanager_inhibit_rules:
#   - target_match:
#       label: value
#     source_match:
#       label: value
#     equal: ['dc', 'rack']
#   - target_match_re:
#       label: value1|value2
#     source_match_re:
#       label: value3|value5

# alertmanager_route: {}
alertmanager_route:
  repeat_interval: 1h
  receiver: slack
#   # This routes performs a regular expression match on alert labels to
#   # catch alerts that are related to a list of services.
#     - match_re:
#         service: ^(foo1|foo2|baz)$
#       receiver: team-X-mails
#       # The service has a sub-route for critical alerts, any alerts
#       # that do not match, i.e. severity != critical, fall-back to the
#       # parent node and are sent to 'team-X-mails'
#       routes:
#         - match:
#             severity: critical
#           receiver: team-X-pager
#     - match:
#         service: files
#       receiver: team-Y-mails
#       routes:
#         - match:
#             severity: critical
#           receiver: team-Y-pager
#     # This route handles all alerts coming from a database service. If there's
#     # no team to handle it, it defaults to the DB team.
#     - match:
#         service: database
#       receiver: team-DB-pager
#       # Also group alerts by affected database.
#       group_by: [alertname, cluster, database]
#       routes:
#         - match:
#             owner: team-X
#           receiver: team-X-pager
#         - match:
#             owner: team-Y
#           receiver: team-Y-pager

# The template for amtool's configuration
alertmanager_amtool_config_file: 'amtool.yml.j2'

# Location (URL) of the alertmanager
alertmanager_amtool_config_alertmanager_url: "{{ alertmanager_web_external_url }}"

# Extended output of `amtool` commands, use '' for less verbosity
alertmanager_amtool_config_output: 'extended'

  • Ansible playbook execution Logs:
fatal: [10.10.10.151]: FAILED! => {"changed": false, "checksum": "ac7b62613e1c55fb5592a36f500c9fb6399e8cb0", "exit_status": 1, "msg": "failed to validate", "stderr": "amtool: error: failed to validate 1 file(s)\n\n", "stderr_lines": ["amtool: error: failed to validate 1 file(s)", ""], "stdout": "Checking '/root/.ansible/tmp/ansible-tmp-1601894343.1803482-191455237155978/source'  SUCCESS\nFound:\n - global config\n - route\n - 0 inhibit rules\n - 1 receivers\n - 1 templates\n  FAILED: template: slack.tmpl:14: unexpected EOF\n\n", "stdout_lines": ["Checking '/root/.ansible/tmp/ansible-tmp-1601894343.1803482-191455237155978/source'  SUCCESS", "Found:", " - global config", " - route", " - 0 inhibit rules", " - 1 receivers", " - 1 templates", "  FAILED: template: slack.tmpl:14: unexpected EOF", ""]}

Anything else we need to know?:
This is my slack.tmpl template

# This builds the silence URL.  We exclude the alertname in the range
# to avoid the issue of having trailing comma separator (%2C) at the end
# of the generated URL
{{ define "__alert_silence_link" -}}
    {{ .ExternalURL }}/#/silences/new?filter=%7B
    {{- range .CommonLabels.SortedPairs -}}
        {{- if ne .Name "alertname" -}}
            {{- .Name }}%3D"{{- .Value -}}"%2C%20
        {{- end -}}
    {{- end -}}
    alertname%3D"{{ .CommonLabels.alertname }}"%7D
{{- end }}



{{ define "__alert_severity_prefix" -}}
    {{ if ne .Status "firing" -}}
    :lgtm:
    {{- else if eq .Labels.severity "critical" -}}
    :fire:
    {{- else if eq .Labels.severity "warning" -}}
    :warning:
    {{- else -}}
    :question:
    {{- end }}
{{- end }}

{{ define "__alert_severity_prefix_title" -}}
    {{ if ne .Status "firing" -}}
    :lgtm:
    {{- else if eq .CommonLabels.severity "critical" -}}
    :fire:
    {{- else if eq .CommonLabels.severity "warning" -}}
    :warning:
    {{- else if eq .CommonLabels.severity "info" -}}
    :information_source:
    {{- else -}}
    :question:
    {{- end }}
{{- end }}


{{/* First line of Slack alerts */}}
{{ define "slack.monzo.title" -}}
    [{{ .Status | toUpper -}}
    {{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{- end -}}
    ] {{ template "__alert_severity_prefix_title" . }} {{ .CommonLabels.alertname }}
{{- end }}


{{/* Color of Slack attachment (appears as line next to alert )*/}}
{{ define "slack.monzo.color" -}}
    {{ if eq .Status "firing" -}}
        {{ if eq .CommonLabels.severity "warning" -}}
            warning
        {{- else if eq .CommonLabels.severity "critical" -}}
            danger
        {{- else -}}
            #439FE0
        {{- end -}}
    {{ else -}}
    good
    {{- end }}
{{- end }}


{{/* Emoji to display as user icon (custom emoji supported!) */}}
{{ define "slack.monzo.icon_emoji" }}:prometheus:{{ end }}

{{/* The text to display in the alert */}}
{{ define "slack.monzo.text" -}}
    {{ range .Alerts }}
        {{- if .Annotations.message }}
            {{ .Annotations.message }}
        {{- end }}
        {{- if .Annotations.description }}
            {{ .Annotations.description }}
        {{- end }}
    {{- end }}
{{- end }}



{{- /* If none of the below matches, send to #monitoring-no-owner, and we 
can then assign the expected code_owner to the alert or map the code_owner
to the correct channel */ -}}
{{ define "__get_channel_for_code_owner" -}}
    {{- if eq . "platform-team" -}}
        platform-alerts
    {{- else if eq . "security-team" -}}
        security-alerts
    {{- else -}}
        monitoring-no-owner
    {{- end -}}
{{- end }}

{{- /* Select the channel based on the code_owner. We only expect to get
into this template function if the code_owners label is present on an alert.
This is to defend against us accidentally breaking the routing logic. */ -}}
{{ define "slack.monzo.code_owner_channel" -}}
    {{- if .CommonLabels.code_owner }}
        {{ template "__get_channel_for_code_owner" .CommonLabels.code_owner }}
    {{- else -}}
        monitoring
    {{- end }}
{{- end }}

{{ define "slack.monzo.link_button_text" -}}
    {{- if .CommonAnnotations.link_text -}}
        {{- .CommonAnnotations.link_text -}}
    {{- else -}}
        Link
    {{- end }} :link:
{{- end }}

Add to README.md an example playbook showing that alertmanager_route and alertmanager_receivers are needed

What is missing?

Add to README.md an example playbook showing that alertmanager_route and alertmanager_receivers are needed for the role to achieve the desired state.

Even though the variables are documented, the docs do not explain that they are mandatory, and since the demonstration project is archived, I think it would be useful to provide a basic example (functional or not) with these two variables.

Why do we need it?

A quickstart for anyone who wants to use this excellent role.

Environment

  • Role version:

    3f4f089cae9fb6cf7a71c27745a3a5e6eac0b929

  • Ansible version information:

    ansible --version

Anything else we need to know?:

If desired, I can open a PR with the content below (a simple boilerplate for Telegram) that I am using in another repository, for example:

- name: AlertManager role
  hosts: monitoring
  become: yes
  roles:
    - role: cloudalchemy.alertmanager
      vars:
        alertmanager_route:
          group_by: [job]
          receiver: default
        alertmanager_receivers:
          - name: default
            webhook_configs:
            - send_resolved: True
              url: http://localhost:9087/alert/-chatid

Thanks for the excellent role!

Tasks "download alertmanager binary to local folder" and "unpack alertmanager binaries" always changed

What happened?

These tasks always report a changed state, in both normal and check mode. They shouldn't make noise if Alertmanager is already installed.

TASK [cloudalchemy.alertmanager : download alertmanager binary to local folder] ***
changed: [myhost]
TASK [cloudalchemy.alertmanager : unpack alertmanager binaries] ****************
changed: [myhost]

Did you expect to see something different?

After a first successful run, these tasks shouldn't report a changed state on subsequent runs (assuming no changes on the host, of course).

How to reproduce it (as minimally and precisely as possible):

Install this role on a Ubuntu 20.04.1 LTS.

Environment

  • Role version:

    0.19.1

  • Ansible version information:

    2.10.2

  • Variables:

insert role variables relevant to the issue
  • Ansible playbook execution Logs:
insert Ansible logs relevant to the issue here

Anything else we need to know?:

Are the check_mode: false statements correctly used here?

log.level=debug

What did you do?

Tried to install Alertmanager with log.level=debug.

Did you expect to see something different?

No.

Environment

dev

  • Role version:

    Insert release version/galaxy tag or Git SHA here
    latest

  • Ansible version information:

    ansible --version

  • Variables:

insert role variables relevant to the issue
  • Ansible playbook execution Logs:
insert Ansible logs relevant to the issue here

Anything else we need to know?:

does this role support log.level=debug or any log.level?
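The defaults above already carry alertmanager_config_flags_extra (with a commented data.retention example); assuming those entries are rendered as --key=value flags in the service unit, the log level could presumably be set like this:

alertmanager_config_flags_extra:
  log.level: debug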

Having the choice to install the package or not

Hi,

In some corporate environments, servers don't have external access.
This playbook fails as it tries to download from GitHub.

It would be nice to have a 'flag' to tell whether the binaries are managed by this playbook or not.
Thanks
Nicolas
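For what it's worth, the defaults shown elsewhere on this page (and the "offline mode fails" issue below) suggest the role already exposes alertmanager_binary_local_dir, which points it at pre-downloaded binaries instead of GitHub; something like:

# Placeholder path: a directory containing the already-unpacked release binaries
alertmanager_binary_local_dir: "/opt/stage/alertmanager-0.21.0.linux-amd64"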

Problem with alertmanager_child_routes

Hi,
if I put this per-severity routing, which I am using in my cluster:

alertmanager_child_routes:
    - match:
        severity: Lowest
      receiver: slack

   - match:
       severity: Low
     receiver: jira
     continue: true
   - match:
       severity: Low
     receiver: slack

   - match:
        severity: High
     receiver: jira
     continue: true
   - match:
         severity: High
     receiver: slack

In alertmanager.yml I get:

routes:
  - match:
      severity: Lowest
    receiver: slack
  - continue: true
    match:
      severity: Low
    receiver: jira
  - match:
      severity: Low
    receiver: slack
  - continue: true
    match:
      severity: High
    receiver: jira
  - match:
      severity: High
    receiver: slack
  - continue: true

Expected:

- match:
    severity: Lowest
  receiver: slack

- match:
    severity: Low
  receiver: jira
  continue: true
- match:
    severity: Low
  receiver: slack

- match:
    severity: High
  receiver: jira
  continue: true
- match:
    severity: High
  receiver: slack

systemd: Failed to start Alertmanager (when Prometheus is not running yet)

The Alertmanager service fails to start when Prometheus has not started yet. We observe this mainly after a machine reboot:

# journalctl -u alertmanager.service --boot
-- Logs begin at Thu 2019-07-04 01:20:09 UTC, end at Wed 2019-07-10 08:03:29 UTC. --
Jul 10 00:00:19 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:19 prometheus.example.com alertmanager[2859]: level=info ts=2019-07-10T00:00:19.771Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:19 prometheus.example.com alertmanager[2859]: level=info ts=2019-07-10T00:00:19.773Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:19 prometheus.example.com alertmanager[2859]: level=warn ts=2019-07-10T00:00:19.799Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:19 prometheus.example.com alertmanager[2859]: level=error ts=2019-07-10T00:00:19.815Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:19 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:19 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 1.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com alertmanager[3231]: level=info ts=2019-07-10T00:00:20.213Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3231]: level=info ts=2019-07-10T00:00:20.213Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3231]: level=warn ts=2019-07-10T00:00:20.221Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:20 prometheus.example.com alertmanager[3231]: level=error ts=2019-07-10T00:00:20.227Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 2.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com alertmanager[3355]: level=info ts=2019-07-10T00:00:20.468Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3355]: level=info ts=2019-07-10T00:00:20.468Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3355]: level=warn ts=2019-07-10T00:00:20.472Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:20 prometheus.example.com alertmanager[3355]: level=error ts=2019-07-10T00:00:20.476Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 3.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:20 prometheus.example.com alertmanager[3790]: level=info ts=2019-07-10T00:00:20.874Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3790]: level=info ts=2019-07-10T00:00:20.877Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:20 prometheus.example.com alertmanager[3790]: level=warn ts=2019-07-10T00:00:20.882Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:20 prometheus.example.com alertmanager[3790]: level=error ts=2019-07-10T00:00:20.885Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:20 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 4.
Jul 10 00:00:21 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:21 prometheus.example.com systemd[1]: Started Prometheus Alertmanager.
Jul 10 00:00:21 prometheus.example.com alertmanager[3918]: level=info ts=2019-07-10T00:00:21.109Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
Jul 10 00:00:21 prometheus.example.com alertmanager[3918]: level=info ts=2019-07-10T00:00:21.110Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
Jul 10 00:00:21 prometheus.example.com alertmanager[3918]: level=warn ts=2019-07-10T00:00:21.115Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jul 10 00:00:21 prometheus.example.com alertmanager[3918]: level=error ts=2019-07-10T00:00:21.118Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP addres
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Main process exited, code=exited, status=1/FAILURE
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Service hold-off time over, scheduling restart.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Scheduled restart job, restart counter is at 5.
Jul 10 00:00:21 prometheus.example.com systemd[1]: Stopped Prometheus Alertmanager.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Start request repeated too quickly.
Jul 10 00:00:21 prometheus.example.com systemd[1]: alertmanager.service: Failed with result 'exit-code'.
Jul 10 00:00:21 prometheus.example.com systemd[1]: Failed to start Prometheus Alertmanager.

We're running Prometheus and Alertmanager on the same host (deployed using your Ansible roles 👍), so waiting for Prometheus seems a good measure:

diff templates/alertmanager.service.j2
 [Unit]
-After=network.target
+After=network.target prometheus.service

I realise this might need a new variable and a conditional for general usage (i.e. when both services run on different hosts). Alternatively (or additionally), it might also be useful to add a delay between the retries to give Prometheus a fair chance to start (as you can see in the log above, the restart attempts all happened within 2s):

diff templates/alertmanager.service.j2
 Restart=always
+RestartSec=5s

cluster setup broken in default

Unless the clustering setup is modified, alertmanager fails with:
"create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"

This can be avoided by having:
alertmanager_cluster:
  listen-address: ""
as documented at: https://github.com/prometheus/alertmanager#high-availability
--cluster.listen-address string: cluster listen address (default "0.0.0.0:9094"; empty string disables HA mode)

I propose this be set as the default.

I suspect this is from a recent upstream change. Maybe: prometheus/alertmanager@78c9ebc

CI did not catch this so perhaps it is testing successful deployment but not successful service startup.

I could make a PR for the proposed (trivial!) default change, but I'm not yet well positioned to delve into the CI setup.

offline mode fails

What happened?

Even after setting alertmanager_binary_local_dir,
the task "Get checksum for {{ go_arch }} architecture" still gets executed and fails (the URL lookup fails because we are offline).

Did you expect to see something different?

No failure (the task should be skipped).

How to reproduce it (as minimally and precisely as possible):

Environment

  • Role version:

3f4f089

  • Ansible version information:

2.9.9

  • Variables:
alertmanager_binary_local_dir: "alertmanager-0.21.0.linux-amd64"
  • Ansible playbook execution Logs:
TASK [ansible-alertmanager-master : Get checksum for amd64 architecture] *********************************************************************************************************************************************************************
task path: /home/user/ansible/roles/ansible-alertmanager-master/tasks/preflight.yml:46
url lookup connecting to https://github.com/prometheus/alertmanager/releases/download/v0.21.0/sha256sums.txt
fatal: FAILED! => {
    "msg": "An unhandled exception occurred while running the lookup plugin 'url'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Failed lookup url for https://github.com/prometheus/alertmanager/releases/download/v0.21.0/sha256sums.txt : <urlopen error [Errno -2] Name or service not known>"
}

too many open files error

What happened?

http: Accept error: accept tcp [::]:9093: accept4: too many open files; retrying in 1s
http: Accept error: accept tcp [::]:9093: accept4: too many open files; retrying in 1s
http: Accept error: accept tcp [::]:9093: accept4: too many open files; retrying in 1s

is appearing in the Alertmanager journal, and Alertmanager stops sending alerts.

adding this should fix the error:

[Service]
LimitNOFILE=16000
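Note that LimitNOFILE= sets both the soft and hard limit; there is no separate LimitNOFILESoft= directive. If editing the unit template the role manages isn't an option, a hedged sketch of applying the limit as a systemd drop-in from a playbook (paths and the limit value are just the ones proposed above):

- name: Create drop-in directory for alertmanager.service
  ansible.builtin.file:
    path: /etc/systemd/system/alertmanager.service.d
    state: directory
    mode: "0755"

- name: Raise the open-files limit for Alertmanager
  ansible.builtin.copy:
    dest: /etc/systemd/system/alertmanager.service.d/limits.conf
    content: |
      [Service]
      LimitNOFILE=16000
    mode: "0644"
  register: _limits_dropin

- name: Reload systemd and restart Alertmanager when the drop-in changes
  ansible.builtin.systemd:
    name: alertmanager
    state: restarted
    daemon_reload: true
  when: _limits_dropin is changed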

Unable to get the checksum due to errors in the Ansible

alertmanager_checksum: "{{ lookup('url', 'https://github.com/prometheus/alertmanager/releases/download/v' + alertmanager_version + '/sha256sums.txt', wantlist=True) | list \

macOS 10.14.6, SSH into CentOS Linux release 7.9.2009

Ansible Version

ansible [core 2.11.4]
  config file = None
  configured module search path = ['/Users/yoakum/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /Users/yoakum/venv/ansible/lib/python3.9/site-packages/ansible
  ansible collection location = /Users/yoakum/.ansible/collections:/usr/share/ansible/collections
  executable location = /Users/yoakum/venv/ansible/bin/ansible
  python version = 3.9.6 (default, Jun 29 2021, 04:45:03) [Clang 11.0.0 (clang-1100.0.33.17)]
  jinja version = 2.11.2
  libyaml = True

Errors:

TASK [alertmanager : Get checksum for amd64 architecture] **********************************************************************************************************************************************************************************************************************
fatal: [3.239.126.164]: FAILED! => {"msg": "template error while templating string: expected token 'end of print statement', got 'select'. String: {{ lookup('url', 'https://github.com/prometheus/alertmanager/releases/download/v' + alertmanager_version + '/sha256sums.txt',
 wantlist=True) | list select('contains', 'linux-' + go_arch + '.tar.gz') | list | first).split(' ')[0] }}"}
...ignoring

TASK [alertmanager : Checksum lookup error message] ****************************************************************************************************************************************************************************************************************************
fatal: [3.239.126.164]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'alertmanager_checksum_url' is undefined\n\nThe error appears to be in '/Users/yoakum/git/analytics-techops/analytics-bakery/ansible/roles/alertmanager/tasks
/preflight.yml': line 51, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Checksum lookup error message\n  ^ here\n"}

PLAY RECAP *********************************************************************************************************************************************************************************************************************************************************************
3.239.126.164              : ok=5    changed=0    unreachable=0    failed=1    skipped=2    rescued=0    ignored=1

I was able to get past part of this issue by adding a '(' before the lookup and a '|' after the first list filter. I also had to add the checksum URL to my main vars because alertmanager_checksum_url was not being set.

"{{ (lookup('url', 'https://github.com/prometheus/alertmanager/releases/download/v' + alertmanager_version + '/sha256sums.txt', wantlist=True) | list | select('contains', 'linux-' + go_arch + '.tar.gz') | list | first).split(' ')[0] }}"
