
hotsos's Introduction

Hotsos

Use hotsos to implement repeatable analysis and extract useful information from common cloud applications and subsystems. Write analysis Scenarios using a high-level language and helpful Python libraries. A catalog of analysis implementations is included.

The design of hotsos is oriented around “plugins” that provide easy access to the state of applications and subsystems. Hotsos is run against a Data Root which can either be the host it is run on or a sosreport. The output is a summary containing key information from each plugin along with any issues or known bugs detected and suggestions on what actions can be taken to handle them.

Documentation

Installation and Usage: https://hotsos.readthedocs.io/en/latest/install/index.html

Contributing: https://hotsos.readthedocs.io/en/latest/contrib/index.html

Full documentation: https://hotsos.readthedocs.io/en/latest/

hotsos's People

Contributors

alanbach, aserdean, bilboer, brianphaley, ddstreet, dnegreira, dosaboy, drencrom, freyes, gerald-yang, hemanthnakkina, jay-vosburgh, joalif, kellenrenshaw, lathiat, mfoliveira, mustafakemalgilor, nicolasbock, nkshirsagar, pponnuvel, rodrigogansobarbieri, sombrafam, taodd, tpsilva, xtrusia, zhhuabj

hotsos's Issues

Detect ceph version mismatches

If daemons aren't restarted properly (e.g. after an upgrade), a mix of daemon versions may be left running. This can show up as unexpected problems.

discover any qemu arch

Include any type of qemu-system-* process when identifying VMs (not just qemu-system-x86_64).

Further bcache tuning checks

Similar to what we do in the bcache-tuning charm, we can also do a few more checks to ensure the cache set is configured optimally (a sketch follows the list below):

  1. sequential_cutoff=0

  2. cache_mode=writeback

  3. writeback_percent=10

  4. congested_read_threshold_us=0

  5. congested_write_threshold_us=0
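A minimal sketch of how such a check could look, assuming the usual bcache sysfs layout (per-backing-device settings under /sys/block/bcache*/bcache, per-cache-set congestion thresholds under /sys/fs/bcache/<set-uuid>); the paths, value formats and expected values here are illustrative, not the hotsos implementation:

#!/usr/bin/env python3
"""Sketch: flag bcache settings that differ from the recommended values."""
import glob
import os

# Recommended values from the bcache-tuning checks above (illustrative).
BDEV_EXPECTED = {
    'sequential_cutoff': '0.0k',   # sysfs shows this with a unit suffix
    'cache_mode': 'writeback',     # shown as e.g. "writethrough [writeback] ..."
    'writeback_percent': '10',
}
CSET_EXPECTED = {
    'congested_read_threshold_us': '0',
    'congested_write_threshold_us': '0',
}

def read_sysfs(path):
    try:
        with open(path) as fd:
            return fd.read().strip()
    except OSError:
        return None

def check_bcache_tuning():
    issues = []
    for bdev in glob.glob('/sys/block/bcache*/bcache'):
        for key, expected in BDEV_EXPECTED.items():
            actual = read_sysfs(os.path.join(bdev, key))
            if actual is None:
                continue
            if key == 'cache_mode':
                # cache_mode lists all modes with the active one in brackets.
                actual = actual.split('[', 1)[-1].split(']', 1)[0]
            if actual != expected:
                issues.append(f"{bdev}/{key}={actual} (expected {expected})")
    for cset in glob.glob('/sys/fs/bcache/*-*'):
        for key, expected in CSET_EXPECTED.items():
            actual = read_sysfs(os.path.join(cset, key))
            if actual is not None and actual != expected:
                issues.append(f"{cset}/{key}={actual} (expected {expected})")
    return issues

if __name__ == '__main__':
    for issue in check_bcache_tuning():
        print(issue)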

Report EOL of Openstack/Ceph releases

We can detect whether a particular version of Openstack/Ceph is already EOL or is approaching EOL (Openstack has fixed dates, so we could report this ahead of the EOL happening).

We probably need to report this as a warning, distinguishing the upstream EOL from the Ubuntu/UCA EOLs.

The aim is to inform users that they need to upgrade (or consider upgrading) to avoid going out of support and/or running into "old" bugs.

Add check for potential collision between network activity and pinned VMs/processes

We've seen performance become an issue when a VM or application running on a
CPU is preempted by network packet processing and loses time.
This can happen whether or not any form of pinning is in use.

  1. Check pinning for VMs/processes or the presence of isolcpus.
  2. Check steal time metrics if possible.
  3. Check the softirq distribution (whether it is heavily concentrated on some CPUs,
    and whether those are isolated CPUs or CPUs that have tasks pinned to them).

(This needs some discussion and tweaking to improve what is currently a coarse check; see the sketch below.)
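As a rough starting point for item 3, a sketch (not hotsos code) of measuring how concentrated NET_RX softirq work is across CPUs using /proc/softirqs; the 25% hotspot threshold is arbitrary:

#!/usr/bin/env python3
"""Sketch: report which CPUs handle the bulk of NET_RX softirqs."""

def net_rx_distribution(path='/proc/softirqs'):
    with open(path) as fd:
        lines = [line.strip() for line in fd if line.strip()]
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        if line.startswith('NET_RX:'):
            counts = [int(c) for c in line.split()[1:]]
            total = sum(counts) or 1
            return dict(zip(cpus, (c / total for c in counts)))
    return {}

if __name__ == '__main__':
    dist = net_rx_distribution()
    # Flag CPUs handling a disproportionate share of NET_RX work.
    for cpu, share in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        flag = ' <-- hotspot?' if share > 0.25 else ''
        print(f"{cpu}: {share:.1%}{flag}")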

[juju] Check for tainted openstack charms

An openstack charm installed from the Juju charm store will have a repo-info file containing the commit from the upstream repo at the time the charm was built. The commit id in that file should match /var/lib/juju/agents/unit--0/charm/version. If it does not, this implies the charm was upgraded using a version not from the charm store, e.g. a locally modified copy. We should show a list of tainted charms in the output since this may be useful.
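A hedged sketch of what such a comparison could look like; the agents directory layout is the usual Juju one but treated as an assumption here, and the commit match is a simple substring check rather than a parse of specific repo-info fields:

#!/usr/bin/env python3
"""Sketch: list charms whose running version does not match their repo-info."""
import glob
import os

def tainted_charms(agents_dir='/var/lib/juju/agents'):
    tainted = []
    for charm_dir in glob.glob(os.path.join(agents_dir, 'unit-*', 'charm')):
        version_file = os.path.join(charm_dir, 'version')
        repo_info = os.path.join(charm_dir, 'repo-info')
        if not (os.path.exists(version_file) and os.path.exists(repo_info)):
            continue
        with open(version_file) as fd:
            version = fd.read().strip()
        with open(repo_info) as fd:
            info = fd.read()
        # If the commit recorded at build time is not referenced in repo-info,
        # the charm was probably upgraded from a locally modified copy.
        if version and version not in info:
            tainted.append(os.path.basename(os.path.dirname(charm_dir)))
    return tainted

if __name__ == '__main__':
    for unit in tainted_charms():
        print(f"possibly tainted charm: {unit}")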

get_ceph_pg_imbalance needs to report >500 PGs per osd separately

The current get_ceph_pg_imbalance check only reports OSDs that are more than 10% above the optimal 200 PGs.

However, there is also a hard limit: when an OSD tries to create more than 750 PGs it stops creating them, which hangs the cluster. We should therefore report a separate, clearer hard error when an OSD has more than 500 PGs, flagging it as requiring immediate and urgent attention, and include it in the summary reported to support cases rather than only in the detailed output.
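A sketch of the proposed classification, assuming we already have a per-OSD PG count from wherever hotsos obtains it (e.g. parsed from ceph osd df); the thresholds follow the numbers above:

#!/usr/bin/env python3
"""Sketch: classify per-OSD PG counts into warnings and hard errors."""

OPTIMAL_PGS = 200
WARN_THRESHOLD = int(OPTIMAL_PGS * 1.10)   # existing 10%-over-optimal check
ERROR_THRESHOLD = 500                      # proposed hard-error threshold

def classify_pg_counts(pgs_per_osd):
    """pgs_per_osd: mapping of osd id -> number of PGs (source-agnostic)."""
    warnings, errors = {}, {}
    for osd, pgs in pgs_per_osd.items():
        if pgs > ERROR_THRESHOLD:
            errors[osd] = pgs
        elif pgs > WARN_THRESHOLD:
            warnings[osd] = pgs
    return warnings, errors

if __name__ == '__main__':
    # Example input only - real counts would come from the data root.
    warnings, errors = classify_pg_counts({'osd.0': 210, 'osd.1': 310, 'osd.2': 640})
    for osd, pgs in errors.items():
        print(f"ERROR: {osd} has {pgs} PGs (>{ERROR_THRESHOLD}) - needs urgent attention")
    for osd, pgs in warnings.items():
        print(f"WARNING: {osd} has {pgs} PGs (>10% above optimal {OPTIMAL_PGS})")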

sed: can't read sos_commands/block/udevadm_info_.dev.bcache0: No such file or directory

I'm not sure whether these failures are a symptom of an issue with the sosreport, of the system where it was taken, or simply of a bcache setup that wasn't considered by hotsos.

$ /snap/bin/hotsos -a sosreport-juju-XXXXX
sed: can't read sos_commands/block/udevadm_info_.dev.bcache0: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache0: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache1: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache1: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache2: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache2: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache3: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache3: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache4: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache4: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache5: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache5: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.nvme0n1: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.nvme0n1: No such file or directory

[ovn] [metadata] Identify known bug lp1907686

Hotsos should have a plugin to identify lp1907686 as a known issue.
https://bugs.launchpad.net/charm-ovn-chassis/+bug/1907686

Typical problems because of this bug:

  • Instance VIF creation timeout and VM instance not launched
  • Instance does not get metadata

Ways to find out (see also the sketch below):

  • On a neutron-api unit, check the neutron-server logs for "ovsdbapp.exceptions.TimeoutException".
    • Fix/workaround: restart the neutron-server service:
      systemctl restart neutron-server.service
  • On a nova-compute unit, check the neutron-ovn-metadata-agent logs for "ovsdbapp.exceptions.TimeoutException".
    • Fix/workaround: restart the neutron-ovn-metadata-agent service:
      systemctl restart neutron-ovn-metadata-agent.service
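A sketch of the detection side only, searching the agent logs for the exception string quoted above; the log paths are typical defaults and an assumption here:

#!/usr/bin/env python3
"""Sketch: look for ovsdbapp timeouts that are symptomatic of lp1907686."""

PATTERN = 'ovsdbapp.exceptions.TimeoutException'
LOG_PATHS = [
    '/var/log/neutron/neutron-server.log',              # neutron-api units
    '/var/log/neutron/neutron-ovn-metadata-agent.log',  # nova-compute units
]

def find_ovsdbapp_timeouts():
    hits = {}
    for path in LOG_PATHS:
        try:
            with open(path, errors='replace') as fd:
                count = sum(PATTERN in line for line in fd)
        except OSError:
            continue
        if count:
            hits[path] = count
    return hits

if __name__ == '__main__':
    for path, count in find_ovsdbapp_timeouts().items():
        print(f"{path}: {count} ovsdbapp TimeoutException(s) - possible lp1907686")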

openstack plugin is broken

Something isn't right with the openstack plugin: it prints a lot of text under vms.

I haven't looked at the code - it appears to be a sed/grep/awk gone-wrong problem.

I haven't posted the problematic example here due to restrictions (it was on sosreports from AT&T). To reproduce, run hotsos on the sosreport of 00279613.

check filestore to bluestore migration is setup properly

In a user environment, we noticed that:

  • bluestore is set to false
  • journal device is left configured from a previous filestore deployment.

Hotsos could check for these, and possibly for anything else left over from filestore; a sketch follows.
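A sketch of what the ceph.conf side of this check could look like; the option names ('bluestore', 'osd journal', 'osd journal size') and the flat parse of ceph.conf are assumptions to be validated against real deployments:

#!/usr/bin/env python3
"""Sketch: flag filestore leftovers in ceph.conf on a supposedly bluestore OSD."""

def parse_ceph_conf(path='/etc/ceph/ceph.conf'):
    options = {}
    with open(path) as fd:
        for line in fd:
            line = line.strip()
            if not line or line.startswith(('#', ';', '[')):
                continue
            if '=' in line:
                key, value = line.split('=', 1)
                # Normalise "osd_journal" / "osd-journal" / "osd journal".
                options[key.strip().replace('-', ' ').replace('_', ' ')] = value.strip()
    return options

def filestore_leftovers(options):
    issues = []
    if options.get('bluestore', '').lower() in ('false', '0', 'no'):
        issues.append('bluestore is explicitly set to false')
    # Journal settings only make sense for filestore; their presence suggests
    # leftovers from a previous filestore deployment.
    for key in ('osd journal', 'osd journal size'):
        if key in options:
            issues.append(f"filestore-era option still configured: {key}")
    return issues

if __name__ == '__main__':
    for issue in filestore_leftovers(parse_ceph_conf()):
        print(issue)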

Ceph OSD Map Trimming Warning

Report when the ceph_report .osdmap_manifest.pinned_maps list is too long:

jq '.osdmap_manifest.pinned_maps | length'

$ cat report-0.txt |jq '.osdmap_manifest.pinned_maps | length'

It seems that 5000 is 'bad' and 500 is 'normal', per https://docs.ceph.com/en/latest/dev/mon-osdmap-prune/

Can be related to this bug:
https://tracker.ceph.com/issues/44184

This doesn't exist on Luminous, so the check needs to degrade gracefully if the data is not there (and sometimes the output is not valid JSON if the report was taken on an OSD node).
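A sketch that degrades gracefully as described: a missing osdmap_manifest key (Luminous) or invalid JSON is tolerated, and the 5000 threshold follows the docs quoted above; the report path is illustrative:

#!/usr/bin/env python3
"""Sketch: warn if the osdmap manifest has an excessive number of pinned maps."""
import json

PINNED_MAPS_WARN = 5000   # "bad" per the mon-osdmap-prune docs referenced above

def pinned_maps_count(report_path):
    try:
        with open(report_path) as fd:
            report = json.load(fd)
    except (OSError, ValueError):
        # Not valid JSON (e.g. report taken on an OSD node) - skip gracefully.
        return None
    manifest = report.get('osdmap_manifest')
    if not manifest:
        # Key absent on Luminous - nothing to check.
        return None
    return len(manifest.get('pinned_maps', []))

if __name__ == '__main__':
    count = pinned_maps_count('sos_commands/ceph/ceph_report')
    if count is not None and count > PINNED_MAPS_WARN:
        print(f"osdmap_manifest has {count} pinned maps (> {PINNED_MAPS_WARN}) - "
              "osdmap trimming may not be keeping up")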

Add /proc/net/{snmp, netstat, softnet_stat} as sources

To add a few more network checks, including implementing previous
issue #113, we need to check items from the following sources:

  1. /proc/net/snmp
  2. /proc/net/netstat
  3. /proc/net/softnet_stat
    (The /proc/ directory is usually captured in default sosreport commands,
    but might not be. If not, sosreport uses the output of 'netstat -s' to capture
    the first two).
  4. output of 'tc -s qdisc show'
    (which sosreport saves as sos_commands/networking/tc_-s_qdisc_show)

If it helps the design of hotsos to have these implemented in helper infrastructure,
they can be implemented separately; otherwise we can just use them directly in the
networking checks. Feel free to close this request if so.
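To illustrate item 3, a sketch parsing /proc/net/softnet_stat; the column meanings (processed, dropped, time_squeeze) are the commonly documented ones and are treated as assumptions across kernel versions:

#!/usr/bin/env python3
"""Sketch: summarise per-CPU drops and squeezes from /proc/net/softnet_stat."""

def parse_softnet_stat(path='/proc/net/softnet_stat'):
    stats = []
    with open(path) as fd:
        for cpu, line in enumerate(fd):
            fields = [int(v, 16) for v in line.split()]
            stats.append({
                'cpu': cpu,
                'processed': fields[0],
                'dropped': fields[1],       # backlog full - packets dropped
                'time_squeeze': fields[2],  # budget/time exhausted in net_rx_action
            })
    return stats

if __name__ == '__main__':
    for entry in parse_softnet_stat():
        if entry['dropped'] or entry['time_squeeze']:
            print("cpu{cpu}: dropped={dropped} time_squeeze={time_squeeze}".format(**entry))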

suppress ceph client error messages

If the storage plugin is run on a host that runs ceph daemons but does not have full cephx credentials, you can see:

2021-04-29 10:36:58.069 7f214319a700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-04-29 10:36:58.069 7f214319a700 -1 monclient: ERROR: missing keyring, cannot use cephx for authentication
[errno 2] error connecting to the cluster

These are non-fatal but should be hidden since they are misleading.

add check for apparmor errors

If the application under analysis is in enforce mode and is triggering apparmor DENIED events, we should report that, since it may be having negative side effects.
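A sketch of such a check, counting DENIED audit records per profile in the kernel log; the log path and the apparmor="DENIED" ... profile="..." message format are the usual ones but assumptions here:

#!/usr/bin/env python3
"""Sketch: count apparmor DENIED events per profile from the kernel log."""
import collections
import re

DENIED_RE = re.compile(r'apparmor="DENIED".*?profile="(?P<profile>[^"]+)"')

def apparmor_denials(path='/var/log/kern.log'):
    counts = collections.Counter()
    with open(path, errors='replace') as fd:
        for line in fd:
            match = DENIED_RE.search(line)
            if match:
                counts[match.group('profile')] += 1
    return counts

if __name__ == '__main__':
    for profile, count in apparmor_denials().most_common():
        print(f"{profile}: {count} DENIED event(s)")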

check if sysctls defined in /etc/sysctl.d/* are out of sync with what's reported by `sysctl -a`

hotsos could verify that all the sysctl settings defined in /etc/sysctl.d/* are actually in effect on the machine. This is important for situations where a sysctl could not be set at boot time because the module that defines it had not been loaded yet.

More details on the background issue can be found at https://review.opendev.org/c/openstack/charm-neutron-gateway/+/738116 and https://bugs.launchpad.net/charm-neutron-gateway/+bug/1885192
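A sketch of the comparison, reading key=value pairs from /etc/sysctl.d/*.conf and comparing them with the live values from `sysctl -a` (a sosreport-based check would read the captured sysctl output instead); the whitespace normalisation is deliberately simplistic:

#!/usr/bin/env python3
"""Sketch: find sysctls whose configured value differs from the running value."""
import glob
import subprocess

def configured_sysctls(pattern='/etc/sysctl.d/*.conf'):
    settings = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as fd:
            for line in fd:
                line = line.strip()
                if not line or line.startswith(('#', ';')) or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                settings[key.strip()] = value.strip()
    return settings

def running_sysctls():
    output = subprocess.run(['sysctl', '-a'], capture_output=True, text=True).stdout
    running = {}
    for line in output.splitlines():
        if '=' in line:
            key, value = line.split('=', 1)
            running[key.strip()] = value.strip()
    return running

if __name__ == '__main__':
    running = running_sysctls()
    for key, expected in configured_sysctls().items():
        actual = running.get(key)
        if actual is not None and ' '.join(actual.split()) != ' '.join(expected.split()):
            print(f"{key}: configured={expected!r} running={actual!r}")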

Validate require_osd_release and v2 messenger bindings

After upgrading to Nautilus, Octopus, or Pacific it is required to run the following command:
sudo ceph osd require-osd-release {nautilus,octopus,pacific}

For Nautilus, in particular, this step is critical to prevent communication failures. See the Nautilus release notes [0]:

This step is mandatory. Failure to execute this step will make it impossible for OSDs to communicate after msgrv2 is enabled.

There are two checks I recommend implementing here:

  1. Parse ceph osd dump to confirm require_osd_release aligns with the ceph major version.
  2. Parse ceph osd dump and ceph mon dump for any mons/osds that are not binding to v2 addresses.

[0] https://docs.ceph.com/en/latest/releases/nautilus/#instructions
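A rough sketch of both checks against the plain-text outputs of ceph osd dump and ceph mon dump; the line formats assumed here ("require_osd_release <name>", mon lines carrying v1:/v2: addresses) and the sosreport paths should be verified against the Ceph versions in question:

#!/usr/bin/env python3
"""Sketch: check require_osd_release and mon v2 bindings from dump output."""

def check_require_osd_release(osd_dump_text, expected_release):
    for line in osd_dump_text.splitlines():
        if line.startswith('require_osd_release'):
            current = line.split()[-1]
            if current != expected_release:
                return (f"require_osd_release is '{current}' but the cluster is "
                        f"running {expected_release} - run "
                        f"'ceph osd require-osd-release {expected_release}'")
            return None
    return "require_osd_release not found in osd dump"

def mons_without_v2(mon_dump_text):
    """Return mon lines whose addresses include v1 but no v2 endpoint."""
    missing = []
    for line in mon_dump_text.splitlines():
        if 'mon.' in line and 'v1:' in line and 'v2:' not in line:
            missing.append(line.strip())
    return missing

if __name__ == '__main__':
    with open('sos_commands/ceph/ceph_osd_dump') as fd:
        issue = check_require_osd_release(fd.read(), 'nautilus')
    if issue:
        print(issue)
    with open('sos_commands/ceph/ceph_mon_dump') as fd:
        for line in mons_without_v2(fd.read()):
            print(f"mon not bound to v2: {line}")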

RabbitMQ unsynchronised queue issues

Alert on the following log lines:
rmq3 | 2019-06-27 09:20:59.926 [warning] <0.943.0> Mirrored queue 'rmq-two-queue' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

09:26:25.793 [error] Discarding message {'$gen_call',{<0.753.0>,#Ref<0.989368845.173015041.56949>},{info,[name,pid,slave_pids,synchronised_slave_pids]}} from <0.753.0> to <0.943.0> in an old incarnation (3) of this node (1)

And suggest stopping all rabbitmq nodes, verifying they are stopped, then starting them all again; from past experience this recovers the issue.

Reference: rabbitmq/rabbitmq-server#2045

Add network check to see if lacp bonding is actually aggregating

Bonding using LACP is common in client environments. The physical
interfaces configured under the bond interface have their traffic
aggregated (counted together under the bond interface). Under
network misconfiguration or other issues, the interfaces might
not actually sync up as bond members and the switch
might not be aggregating their traffic. Check the status in
/proc/net/bonding/<bond> and the "Aggregator ID" field for all the
configured interfaces under the bond (they need to be the same).

Not an alert, just available in the report in case network throughput
or latency issues arise.
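A sketch of the check, parsing /proc/net/bonding/<bond> and comparing the per-member "Aggregator ID" values; the field names follow the usual bonding procfs format but are assumptions here:

#!/usr/bin/env python3
"""Sketch: verify all LACP bond members share the same aggregator ID."""
import glob
import os

def slave_aggregator_ids(bond_path):
    ids = {}
    current_slave = None
    with open(bond_path) as fd:
        for line in fd:
            line = line.strip()
            if line.startswith('Slave Interface:'):
                current_slave = line.split(':', 1)[1].strip()
            elif line.startswith('Aggregator ID:') and current_slave:
                ids[current_slave] = line.split(':', 1)[1].strip()
    return ids

if __name__ == '__main__':
    for bond_path in glob.glob('/proc/net/bonding/*'):
        ids = slave_aggregator_ids(bond_path)
        if len(set(ids.values())) > 1:
            print(f"{os.path.basename(bond_path)}: members are in different "
                  f"aggregators, traffic is not being aggregated: {ids}")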

[storage] Check ceph crush failure domain host balance

Some deployments configure the Ceph crush failure domain to fail by rack rather than the default of by host. It is sometimes the case that not all racks have the same number of nodes. If the disparity is large this can have a significant impact on Ceph, which still needs to write the same number of objects to all racks. If a sosreport contains information such as ceph osd tree we would be able to determine this and report a warning.
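A sketch of how the rack balance could be derived from ceph osd tree output; the parse is naive (it keys off the 'rack' and 'host' type tokens) and the disparity threshold is illustrative:

#!/usr/bin/env python3
"""Sketch: count hosts per rack from `ceph osd tree` output."""
import collections

def hosts_per_rack(osd_tree_text):
    counts = collections.Counter()
    current_rack = None
    for line in osd_tree_text.splitlines():
        tokens = line.split()
        if 'rack' in tokens:
            current_rack = tokens[tokens.index('rack') + 1]
        elif 'host' in tokens and current_rack:
            counts[current_rack] += 1
    return counts

if __name__ == '__main__':
    with open('sos_commands/ceph/ceph_osd_tree') as fd:
        counts = hosts_per_rack(fd.read())
    # Threshold is illustrative - "large disparity" needs defining.
    if counts and max(counts.values()) - min(counts.values()) > 1:
        print(f"uneven number of hosts per rack: {dict(counts)}")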

Handle missing /proc/slabinfo

In VMs/containers, /proc/slabinfo may be missing if slub is disabled.

In that case the 01kernel plugin errors out, e.g.:

cat: ./proc/slabinfo: No such file or directory
/home/ponnuvel/myGit/hotsos/plugins/kernel/01kernel: line 75: 0*/1024: syntax error: operand expected (error token is "/1024")

storage plugin ceph error

Traceback (most recent call last):
  File "/home/user1/git/dosaboy/hotsos/plugins/storage/01ceph", line 189, in <module>
    get_ceph_versions_mismatch()
  File "/home/user1/git/dosaboy/hotsos/plugins/storage/01ceph", line 52, in get_ceph_versions_mismatch
    data = json.loads(''.join(lines))
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)

Raise an issue if repeated checksum errors occur

We currently check for crc checksum errors and report them in the summary, but we don't raise an issue. There is apparently a known issue in OSDs where a checksum error that succeeds on retry can be ignored. However, if we see a significant number of these errors within a short space of time we should alert with an issue.

hotsos snap encoding issue

Running hotsos (snap) against a sosreport with cdk installed happens to trigger this when parsing snap names:

Traceback (most recent call last):
  File "/snap/hotsos/160/plugins/kubernetes/01general", line 100, in <module>
    get_snap_info()
  File "/snap/hotsos/160/plugins/kubernetes/01general", line 56, in get_snap_info
    snap_list_all = helpers.get_snap_list_all()
  File "/snap/hotsos/160/plugins/kubernetes/common/helpers.py", line 64, in catch_exception_inner2
    return f(*args, **kwargs)
  File "/snap/hotsos/160/plugins/kubernetes/common/helpers.py", line 218, in get_snap_list_all
    return open(path, 'r').readlines()
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1604: ordinal not in range(128)
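One possible minimal fix (a sketch only, not the project's actual change) is to open the captured `snap list --all` output with an explicit encoding instead of relying on the locale default, which is ASCII inside the snap environment:

# Sketch of a possible fix: read the snap list output as UTF-8 rather than
# with the locale default codec; the path argument is whatever the helper
# currently receives.
def get_snap_list_all(path):
    with open(path, 'r', encoding='utf-8', errors='replace') as fd:
        return fd.readlines()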

fsid identification is buggy for Ceph

The offset field is calculated on the assumption that "osd id" appears before "osd fsid", but this isn't always the case. In one of the sosreports, ceph-volume_lvm_list contains:

===== osd.979 ======

  [block]       /dev/ceph-e8ac8bb7-944d-4015-beba-d7ebcf497022/osd-block-626dbbf0-a72f-4146-9d46-e49709a9e820

      block device              /dev/ceph-e8ac8bb7-944d-4015-beba-d7ebcf497022/osd-block-626dbbf0-a72f-4146-9d46-e49709a9e820
      block uuid                kGu17z-cHXs-PYt0-ShJZ-giOg-2BKi-FTZp9L
      cephx lockbox secret      
      cluster fsid              e0e71965-4698-4145-b2e0-67fcd2d430d4
      cluster name              ceph
      crush device class        None
      encrypted                 0   
      osd fsid                  626dbbf0-a72f-4146-9d46-e49709a9e820
      osd id                    979 
      type                      block
      vdo                       0   
      devices                   /dev/sdr

(here "osd id" appears after "osd fsid").

So the offset calculation needs fixing.
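A sketch of an order-independent parse: key/value pairs are collected per "===== osd.N =====" section, so it does not matter whether "osd id" or "osd fsid" appears first; the sosreport path is illustrative:

#!/usr/bin/env python3
"""Sketch: map OSD ids to fsids from ceph-volume lvm list output,
independent of field order within each section."""
import re

SECTION_RE = re.compile(r'^=+\s*osd\.(?P<id>\d+)\s*=+$')

def osd_fsids(lvm_list_text):
    fsids = {}
    current_osd = None
    for line in lvm_list_text.splitlines():
        line = line.strip()
        match = SECTION_RE.match(line)
        if match:
            current_osd = match.group('id')
        elif current_osd and line.startswith('osd fsid'):
            fsids[current_osd] = line.split()[-1]
    return fsids

if __name__ == '__main__':
    with open('sos_commands/ceph/ceph-volume_lvm_list') as fd:
        for osd, fsid in osd_fsids(fd.read()).items():
            print(f"osd.{osd}: {fsid}")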

check for default RX/TX queue values in nova.conf

Alert if nova.conf does not set rx_queue_size and tx_queue_size (DPDK) under [libvirt].
Also alert if either value is under 1024.

Basically we are looking for the following to be set in nova.conf and alert otherwise:

[libvirt]
rx_queue_size=1024
tx_queue_size=1024
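A sketch of the check using configparser; nova.conf layouts vary (duplicate sections, multi-valued options), so strict=False is used and this is illustrative only:

#!/usr/bin/env python3
"""Sketch: alert on missing or undersized virtio queue settings in nova.conf."""
import configparser

MIN_QUEUE_SIZE = 1024

def check_queue_sizes(path='/etc/nova/nova.conf'):
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read(path)
    issues = []
    for option in ('rx_queue_size', 'tx_queue_size'):
        value = parser.get('libvirt', option, fallback=None)
        if value is None:
            issues.append(f"[libvirt] {option} is not set")
        elif int(value) < MIN_QUEUE_SIZE:
            issues.append(f"[libvirt] {option}={value} (< {MIN_QUEUE_SIZE})")
    return issues

if __name__ == '__main__':
    for issue in check_queue_sizes():
        print(issue)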

Coloured output from hotsos?

As hotsos output grows, it would be nice to colour-code findings, for example:

red - "definitely wrong"
yellow - "possibly wrong, look more closely"
etc

Add check if oom killer has gone off

Check the kernel log to see whether the OOM killer has been invoked and the kernel
has killed processes. If so this should be an alert, because
there's little point debugging any other issue while the kernel
is killing processes due to an OOM situation.
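A sketch of such a check; the "invoked oom-killer" and "Out of memory:" markers are the usual kernel log messages but their exact wording varies across kernel versions, so they are assumptions here:

#!/usr/bin/env python3
"""Sketch: report OOM-killer invocations found in the kernel log."""

OOM_MARKERS = ('invoked oom-killer', 'Out of memory:')

def oom_events(path='/var/log/kern.log'):
    events = []
    with open(path, errors='replace') as fd:
        for line in fd:
            if any(marker in line for marker in OOM_MARKERS):
                events.append(line.strip())
    return events

if __name__ == '__main__':
    events = oom_events()
    if events:
        print(f"ALERT: {len(events)} OOM-related kernel log entries, e.g.:")
        print(events[-1])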

add smartctl analysis

Sosreports contain smartctl status information, which we could and should check for hardware errors, reporting any that are found.

detect and analyse laggy Ceph PGs

Sometimes Ceph reports OSD PGs as "laggy" and we may be able to identify hotspots, i.e. nodes that for some reason have slower OSDs than others.

report sosreport plugin timeout

Sometimes plugins time out and results will be missing. You can detect this from the following sources:
[./sos_reports/manifest.json]
.components.report.plugins[].timeout_hit

[./sos_logs/ui.log]
2021-05-17 08:20:48,602 ERROR:
Plugin networking timed out

Recently we had an issue with a networking plugin timeout, simply because there was not enough time given the number of interfaces on a busy dvr_snat iptables_hybrid host. It took me a while to realise that was why I was missing info and why ovs-stat was erroring out about the network-control snap interface not being connected: it couldn't find the ip_netns file.
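A sketch of the manifest.json check; the .components.report.plugins[].timeout_hit path is taken from the issue text, and the code also tolerates the plugins entry being a dict keyed by plugin name rather than a list:

#!/usr/bin/env python3
"""Sketch: list sosreport plugins that hit their timeout, per manifest.json."""
import json

def timed_out_plugins(manifest_path='sos_reports/manifest.json'):
    with open(manifest_path) as fd:
        manifest = json.load(fd)
    plugins = manifest.get('components', {}).get('report', {}).get('plugins', {})
    if isinstance(plugins, dict):
        items = plugins.items()
    else:  # some formats use a list of {"name": ..., "timeout_hit": ...}
        items = ((p.get('name', '?'), p) for p in plugins)
    return [name for name, data in items if data.get('timeout_hit')]

if __name__ == '__main__':
    for name in timed_out_plugins():
        print(f"sosreport plugin timed out: {name}")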
