
hotsos's Introduction

Hotsos

Use hotsos to implement repeatable analysis and extract useful information from common cloud applications and subsystems. Write analysis Scenarios using a high-level language and helpful Python libraries. A catalog of analysis implementations is included.

The design of hotsos is oriented around “plugins” that provide easy access to the state of applications and subsystems. Hotsos is run against a Data Root which can either be the host it is run on or a sosreport. The output is a summary containing key information from each plugin along with any issues or known bugs detected and suggestions on what actions can be taken to handle them.

Documentation

Installation and Usage: https://hotsos.readthedocs.io/en/latest/install/index.html

Contributing: https://hotsos.readthedocs.io/en/latest/contrib/index.html

Full documentation: https://hotsos.readthedocs.io/en/latest/

hotsos's People

Contributors

alanbach, aserdean, bilboer, brianphaley, ddstreet, dnegreira, dosaboy, drencrom, freyes, gerald-yang, hemanthnakkina, jay-vosburgh, joalif, kellenrenshaw, lathiat, mfoliveira, mustafakemalgilor, nicolasbock, nkshirsagar, pponnuvel, rodrigogansobarbieri, sombrafam, taodd, tpsilva, xtrusia, zhhuabj

hotsos's Issues

Detect ceph version mismatches

If daemons aren't restarted properly (e.g. after an upgrade), a mix of daemon versions may be left running. This can show up as unexpected problems.

discover any qemu arch

Include any type of qemu-system-* process when identifying VMs (not just qemu-system-x86_64).

Further bcache tuning checks

Similar to what we do in the bcache-tuning charm, we can also do a few more checks to ensure the cache set is configured optimally (a sketch follows the list below):

  1. sequential_cutoff=0

  2. cache_mode=writeback

  3. writeback_percent=10

  4. congested_read_threshold_us=0

  5. congested_write_threshold_us=0
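A minimal sketch of how such a check could look, assuming the usual bcache sysfs layout (per-backing-device settings under /sys/block/bcache*/bcache, per-cache-set congestion thresholds under /sys/fs/bcache/<set-uuid>); the paths, value formats and expected values here are illustrative, not the hotsos implementation:

#!/usr/bin/env python3
"""Sketch: flag bcache settings that differ from the recommended values."""
import glob
import os

# Recommended values from the bcache-tuning checks above (illustrative).
BDEV_EXPECTED = {
    'sequential_cutoff': '0.0k',   # sysfs shows this with a unit suffix
    'cache_mode': 'writeback',     # shown as e.g. "writethrough [writeback] ..."
    'writeback_percent': '10',
}
CSET_EXPECTED = {
    'congested_read_threshold_us': '0',
    'congested_write_threshold_us': '0',
}

def read_sysfs(path):
    try:
        with open(path) as fd:
            return fd.read().strip()
    except OSError:
        return None

def check_bcache_tuning():
    issues = []
    for bdev in glob.glob('/sys/block/bcache*/bcache'):
        for key, expected in BDEV_EXPECTED.items():
            actual = read_sysfs(os.path.join(bdev, key))
            if actual is None:
                continue
            if key == 'cache_mode':
                # cache_mode lists all modes with the active one in brackets.
                actual = actual.split('[', 1)[-1].split(']', 1)[0]
            if actual != expected:
                issues.append(f"{bdev}/{key}={actual} (expected {expected})")
    for cset in glob.glob('/sys/fs/bcache/*-*'):
        for key, expected in CSET_EXPECTED.items():
            actual = read_sysfs(os.path.join(cset, key))
            if actual is not None and actual != expected:
                issues.append(f"{cset}/{key}={actual} (expected {expected})")
    return issues

if __name__ == '__main__':
    for issue in check_bcache_tuning():
        print(issue)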

Report EOL of Openstack/Ceph releases

We can detect whether a particular version of Openstack/Ceph is already EOL or is approaching EOL (Openstack has fixed dates, so we could report this ahead of the EOL happening).

We probably need to report this as a warning, distinguishing the upstream EOL from the Ubuntu/UCA EOLs.

The aim is to inform users that they need to upgrade (or consider upgrading) to avoid going out of support and/or running into "old" bugs.

Add check for potential collision between network activity and pinned VMs/processes

We've seen performance become an issue when a VM or application running on a
CPU is preempted by network packet processing and loses time.
This can happen whether or not any form of pinning is in use.

  1. Check pinning for VMs/processes or the presence of isolcpus.
  2. Check steal time metrics if possible.
  3. Check the softirq distribution (whether it is heavily concentrated on some CPUs,
    and whether those are isolated CPUs or CPUs that have tasks pinned to them).

(This needs some discussion and tweaking to improve what is currently a coarse check; see the sketch below.)
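As a rough starting point for item 3, a sketch (not hotsos code) of measuring how concentrated NET_RX softirq work is across CPUs using /proc/softirqs; the 25% hotspot threshold is arbitrary:

#!/usr/bin/env python3
"""Sketch: report which CPUs handle the bulk of NET_RX softirqs."""

def net_rx_distribution(path='/proc/softirqs'):
    with open(path) as fd:
        lines = [line.strip() for line in fd if line.strip()]
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        if line.startswith('NET_RX:'):
            counts = [int(c) for c in line.split()[1:]]
            total = sum(counts) or 1
            return dict(zip(cpus, (c / total for c in counts)))
    return {}

if __name__ == '__main__':
    dist = net_rx_distribution()
    # Flag CPUs handling a disproportionate share of NET_RX work.
    for cpu, share in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        flag = ' <-- hotspot?' if share > 0.25 else ''
        print(f"{cpu}: {share:.1%}{flag}")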

[juju] Check for tainted openstack charms

An openstack charm installed from the Juju charm store will have a repo-info file containing the commit from the upstream repo at the time the charm was built. The commit id in that file should match /var/lib/juju/agents/unit--0/charm/version. If it does not, this implies the charm was upgraded using a version not from the charm store, e.g. a locally modified copy. We should show a list of tainted charms in the output since this may be useful.
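A hedged sketch of what such a comparison could look like; the agents directory layout is the usual Juju one but treated as an assumption here, and the commit match is a simple substring check rather than a parse of specific repo-info fields:

#!/usr/bin/env python3
"""Sketch: list charms whose running version does not match their repo-info."""
import glob
import os

def tainted_charms(agents_dir='/var/lib/juju/agents'):
    tainted = []
    for charm_dir in glob.glob(os.path.join(agents_dir, 'unit-*', 'charm')):
        version_file = os.path.join(charm_dir, 'version')
        repo_info = os.path.join(charm_dir, 'repo-info')
        if not (os.path.exists(version_file) and os.path.exists(repo_info)):
            continue
        with open(version_file) as fd:
            version = fd.read().strip()
        with open(repo_info) as fd:
            info = fd.read()
        # If the commit recorded at build time is not referenced in repo-info,
        # the charm was probably upgraded from a locally modified copy.
        if version and version not in info:
            tainted.append(os.path.basename(os.path.dirname(charm_dir)))
    return tainted

if __name__ == '__main__':
    for unit in tainted_charms():
        print(f"possibly tainted charm: {unit}")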

get_ceph_pg_imbalance needs to report >500 PGs per osd separately

The current get_ceph_pg_imbalance check only reports OSDs that are more than 10% above the optimal 200 PGs.

However, there is also a hard limit: when an OSD tries to create more than 750 PGs it stops creating them, which hangs the cluster. We should therefore report a separate, clearer hard error when an OSD has more than 500 PGs, flagging it as requiring immediate and urgent attention, and include it in the summary reported to support cases rather than only in the detailed output.
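A sketch of the proposed classification, assuming we already have a per-OSD PG count from wherever hotsos obtains it (e.g. parsed from ceph osd df); the thresholds follow the numbers above:

#!/usr/bin/env python3
"""Sketch: classify per-OSD PG counts into warnings and hard errors."""

OPTIMAL_PGS = 200
WARN_THRESHOLD = int(OPTIMAL_PGS * 1.10)   # existing 10%-over-optimal check
ERROR_THRESHOLD = 500                      # proposed hard-error threshold

def classify_pg_counts(pgs_per_osd):
    """pgs_per_osd: mapping of osd id -> number of PGs (source-agnostic)."""
    warnings, errors = {}, {}
    for osd, pgs in pgs_per_osd.items():
        if pgs > ERROR_THRESHOLD:
            errors[osd] = pgs
        elif pgs > WARN_THRESHOLD:
            warnings[osd] = pgs
    return warnings, errors

if __name__ == '__main__':
    # Example input only - real counts would come from the data root.
    warnings, errors = classify_pg_counts({'osd.0': 210, 'osd.1': 310, 'osd.2': 640})
    for osd, pgs in errors.items():
        print(f"ERROR: {osd} has {pgs} PGs (>{ERROR_THRESHOLD}) - needs urgent attention")
    for osd, pgs in warnings.items():
        print(f"WARNING: {osd} has {pgs} PGs (>10% above optimal {OPTIMAL_PGS})")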

sed: can't read sos_commands/block/udevadm_info_.dev.bcache0: No such file or directory

I'm not sure whether these failures are a symptom of an issue with the sosreport, of the system where it was taken, or simply of a bcache setup that wasn't considered by hotsos.

$ /snap/bin/hotsos -a sosreport-juju-XXXXX
sed: can't read sos_commands/block/udevadm_info_.dev.bcache0: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache0: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache1: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache1: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache2: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache2: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache3: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache3: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache4: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache4: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.bcache5: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.bcache5: No such file or directory
sed: can't read sos_commands/block/udevadm_info_.dev.nvme0n1: No such file or directory
grep: ./sos_commands/block/udevadm_info_.dev.*: No such file or directory
grep: sos_commands/block/udevadm_info_.dev.nvme0n1: No such file or directory

[ovn] [metadata] Identify known bug lp1907686

Hotsos should have a plugin to identify lp1907686 as a known issue.
https://bugs.launchpad.net/charm-ovn-chassis/+bug/1907686

Typical problems because of this bug:

  • Instance VIF creation timeout and VM instance not launched
  • Instance does not get metadata

Ways to find out (see also the sketch below):

  • On a neutron-api unit, check the neutron-server logs for "ovsdbapp.exceptions.TimeoutException".
    • Fix/workaround: restart the neutron-server service:
      systemctl restart neutron-server.service
  • On a nova-compute unit, check the neutron-ovn-metadata-agent logs for "ovsdbapp.exceptions.TimeoutException".
    • Fix/workaround: restart the neutron-ovn-metadata-agent service:
      systemctl restart neutron-ovn-metadata-agent.service
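A sketch of the detection side only, searching the agent logs for the exception string quoted above; the log paths are typical defaults and an assumption here:

#!/usr/bin/env python3
"""Sketch: look for ovsdbapp timeouts that are symptomatic of lp1907686."""

PATTERN = 'ovsdbapp.exceptions.TimeoutException'
LOG_PATHS = [
    '/var/log/neutron/neutron-server.log',              # neutron-api units
    '/var/log/neutron/neutron-ovn-metadata-agent.log',  # nova-compute units
]

def find_ovsdbapp_timeouts():
    hits = {}
    for path in LOG_PATHS:
        try:
            with open(path, errors='replace') as fd:
                count = sum(PATTERN in line for line in fd)
        except OSError:
            continue
        if count:
            hits[path] = count
    return hits

if __name__ == '__main__':
    for path, count in find_ovsdbapp_timeouts().items():
        print(f"{path}: {count} ovsdbapp TimeoutException(s) - possible lp1907686")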

openstack plugin is broken

Something isn't right with the openstack plugin: it prints a lot of text under vms.

I haven't looked at the code - it appears to be a sed/grep/awk gone-wrong problem.

I haven't posted the problematic example here due to restrictions (it was on sosreports from AT&T). To reproduce, run hotsos on the sosreport of 00279613.

check filestore to bluestore migration is setup properly

In a user environment, we noticed that:

  • bluestore is set to false
  • journal device is left configured from a previous filestore deployment.

Hotsos could check for these, and possibly for anything else left over from filestore; a sketch follows.
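A sketch of what the ceph.conf side of this check could look like; the option names ('bluestore', 'osd journal', 'osd journal size') and the flat parse of ceph.conf are assumptions to be validated against real deployments:

#!/usr/bin/env python3
"""Sketch: flag filestore leftovers in ceph.conf on a supposedly bluestore OSD."""

def parse_ceph_conf(path='/etc/ceph/ceph.conf'):
    options = {}
    with open(path) as fd:
        for line in fd:
            line = line.strip()
            if not line or line.startswith(('#', ';', '[')):
                continue
            if '=' in line:
                key, value = line.split('=', 1)
                # Normalise "osd_journal" / "osd-journal" / "osd journal".
                options[key.strip().replace('-', ' ').replace('_', ' ')] = value.strip()
    return options

def filestore_leftovers(options):
    issues = []
    if options.get('bluestore', '').lower() in ('false', '0', 'no'):
        issues.append('bluestore is explicitly set to false')
    # Journal settings only make sense for filestore; their presence suggests
    # leftovers from a previous filestore deployment.
    for key in ('osd journal', 'osd journal size'):
        if key in options:
            issues.append(f"filestore-era option still configured: {key}")
    return issues

if __name__ == '__main__':
    for issue in filestore_leftovers(parse_ceph_conf()):
        print(issue)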

Ceph OSD Map Trimming Warning

Report when the ceph_report .osdmap_manifest.pinned_maps list is too long:

jq '.osdmap_manifest.pinned_maps | length'

$ cat report-0.txt |jq '.osdmap_manifest.pinned_maps | length'

It seems that 5000 is 'bad' and 500 is 'normal', per https://docs.ceph.com/en/latest/dev/mon-osdmap-prune/

Can be related to this bug:
https://tracker.ceph.com/issues/44184

This doesn't exist on Luminous, so the check needs to degrade gracefully if the data is not there (and sometimes the output is not valid JSON if the report was taken on an OSD node).
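A sketch that degrades gracefully as described: a missing osdmap_manifest key (Luminous) or invalid JSON is tolerated, and the 5000 threshold follows the docs quoted above; the report path is illustrative:

#!/usr/bin/env python3
"""Sketch: warn if the osdmap manifest has an excessive number of pinned maps."""
import json

PINNED_MAPS_WARN = 5000   # "bad" per the mon-osdmap-prune docs referenced above

def pinned_maps_count(report_path):
    try:
        with open(report_path) as fd:
            report = json.load(fd)
    except (OSError, ValueError):
        # Not valid JSON (e.g. report taken on an OSD node) - skip gracefully.
        return None
    manifest = report.get('osdmap_manifest')
    if not manifest:
        # Key absent on Luminous - nothing to check.
        return None
    return len(manifest.get('pinned_maps', []))

if __name__ == '__main__':
    count = pinned_maps_count('sos_commands/ceph/ceph_report')
    if count is not None and count > PINNED_MAPS_WARN:
        print(f"osdmap_manifest has {count} pinned maps (> {PINNED_MAPS_WARN}) - "
              "osdmap trimming may not be keeping up")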

Add /proc/net/{snmp, netstat, softnet_stat} as sources

To add a few more network checks, including implementing previous
issue #113, we need to check items from the following sources:

  1. /proc/net/snmp
  2. /proc/net/netstat
  3. /proc/net/softnet_stat
    (The /proc/ directory is usually captured in default sosreport commands,
    but might not be. If not, sosreport uses the output of 'netstat -s' to capture
    the first two).
  4. output of 'tc -s qdisc show'
    (which sosreport saves as sos_commands/networking/tc_-s_qdisc_show)

If it helps the design of hotsos to have these implemented in helper infrastructure,
they can be implemented separately; otherwise we can just use them directly in the
networking checks. Feel free to close this request if so.
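To illustrate item 3, a sketch parsing /proc/net/softnet_stat; the column meanings (processed, dropped, time_squeeze) are the commonly documented ones and are treated as assumptions across kernel versions:

#!/usr/bin/env python3
"""Sketch: summarise per-CPU drops and squeezes from /proc/net/softnet_stat."""

def parse_softnet_stat(path='/proc/net/softnet_stat'):
    stats = []
    with open(path) as fd:
        for cpu, line in enumerate(fd):
            fields = [int(v, 16) for v in line.split()]
            stats.append({
                'cpu': cpu,
                'processed': fields[0],
                'dropped': fields[1],       # backlog full - packets dropped
                'time_squeeze': fields[2],  # budget/time exhausted in net_rx_action
            })
    return stats

if __name__ == '__main__':
    for entry in parse_softnet_stat():
        if entry['dropped'] or entry['time_squeeze']:
            print("cpu{cpu}: dropped={dropped} time_squeeze={time_squeeze}".format(**entry))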

suppress ceph client error messages

If the storage plugin is run on a host that runs ceph daemons but does not have full cephx credentials, you can see:

2021-04-29 10:36:58.069 7f214319a700 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2021-04-29 10:36:58.069 7f214319a700 -1 monclient: ERROR: missing keyring, cannot use cephx for authentication
[errno 2] error connecting to the cluster

These are non-fatal but should be hidden since they are misleading.

add check for apparmor errors

If the application under analysis is in enforce mode and is triggering apparmor DENIED events, we should report that, since it may be having negative side effects.
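A sketch of such a check, counting DENIED audit records per profile in the kernel log; the log path and the apparmor="DENIED" ... profile="..." message format are the usual ones but assumptions here:

#!/usr/bin/env python3
"""Sketch: count apparmor DENIED events per profile from the kernel log."""
import collections
import re

DENIED_RE = re.compile(r'apparmor="DENIED".*?profile="(?P<profile>[^"]+)"')

def apparmor_denials(path='/var/log/kern.log'):
    counts = collections.Counter()
    with open(path, errors='replace') as fd:
        for line in fd:
            match = DENIED_RE.search(line)
            if match:
                counts[match.group('profile')] += 1
    return counts

if __name__ == '__main__':
    for profile, count in apparmor_denials().most_common():
        print(f"{profile}: {count} DENIED event(s)")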

check if sysctls defined in /etc/sysctl.d/* are out of sync with what's reported by `sysctl -a`

hotsos could verify that all the sysctl settings defined in /etc/sysctl.d/* are actually in effect on the machine. This is important for situations where a sysctl could not be set at boot time because the module that defines it had not been loaded yet.

More details on the background issue can be found at https://review.opendev.org/c/openstack/charm-neutron-gateway/+/738116 and https://bugs.launchpad.net/charm-neutron-gateway/+bug/1885192
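A sketch of the comparison, reading key=value pairs from /etc/sysctl.d/*.conf and comparing them with the live values from `sysctl -a` (a sosreport-based check would read the captured sysctl output instead); the whitespace normalisation is deliberately simplistic:

#!/usr/bin/env python3
"""Sketch: find sysctls whose configured value differs from the running value."""
import glob
import subprocess

def configured_sysctls(pattern='/etc/sysctl.d/*.conf'):
    settings = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as fd:
            for line in fd:
                line = line.strip()
                if not line or line.startswith(('#', ';')) or '=' not in line:
                    continue
                key, value = line.split('=', 1)
                settings[key.strip()] = value.strip()
    return settings

def running_sysctls():
    output = subprocess.run(['sysctl', '-a'], capture_output=True, text=True).stdout
    running = {}
    for line in output.splitlines():
        if '=' in line:
            key, value = line.split('=', 1)
            running[key.strip()] = value.strip()
    return running

if __name__ == '__main__':
    running = running_sysctls()
    for key, expected in configured_sysctls().items():
        actual = running.get(key)
        if actual is not None and ' '.join(actual.split()) != ' '.join(expected.split()):
            print(f"{key}: configured={expected!r} running={actual!r}")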

Validate require_osd_release and v2 messenger bindings

After upgrading to Nautilus, Octopus, or Pacific it is required to run the following command:
sudo ceph osd require-osd-release {nautilus,octopus,pacific}

For Nautilus, in particular, this step is critical to prevent communication failures. See the Nautilus release notes [0]:

This step is mandatory. Failure to execute this step will make it impossible for OSDs to communicate after msgrv2 is enabled.

There are two checks I recommend implementing here:

  1. Parse ceph osd dump to confirm require_osd_release aligns with the ceph major version.
  2. Parse ceph osd dump and ceph mon dump for any mons/osds that are not binding to v2 addresses.

[0] https://docs.ceph.com/en/latest/releases/nautilus/#instructions
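A rough sketch of both checks against the plain-text outputs of ceph osd dump and ceph mon dump; the line formats assumed here ("require_osd_release <name>", mon lines carrying v1:/v2: addresses) and the sosreport paths should be verified against the Ceph versions in question:

#!/usr/bin/env python3
"""Sketch: check require_osd_release and mon v2 bindings from dump output."""

def check_require_osd_release(osd_dump_text, expected_release):
    for line in osd_dump_text.splitlines():
        if line.startswith('require_osd_release'):
            current = line.split()[-1]
            if current != expected_release:
                return (f"require_osd_release is '{current}' but the cluster is "
                        f"running {expected_release} - run "
                        f"'ceph osd require-osd-release {expected_release}'")
            return None
    return "require_osd_release not found in osd dump"

def mons_without_v2(mon_dump_text):
    """Return mon lines whose addresses include v1 but no v2 endpoint."""
    missing = []
    for line in mon_dump_text.splitlines():
        if 'mon.' in line and 'v1:' in line and 'v2:' not in line:
            missing.append(line.strip())
    return missing

if __name__ == '__main__':
    with open('sos_commands/ceph/ceph_osd_dump') as fd:
        issue = check_require_osd_release(fd.read(), 'nautilus')
    if issue:
        print(issue)
    with open('sos_commands/ceph/ceph_mon_dump') as fd:
        for line in mons_without_v2(fd.read()):
            print(f"mon not bound to v2: {line}")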

RabbitMQ unsynchronised queue issues

Alert on the following log lines:
rmq3 | 2019-06-27 09:20:59.926 [warning] <0.943.0> Mirrored queue 'rmq-two-queue' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

09:26:25.793 [error] Discarding message {'$gen_call',{<0.753.0>,#Ref<0.989368845.173015041.56949>},{info,[name,pid,slave_pids,synchronised_slave_pids]}} from <0.753.0> to <0.943.0> in an old incarnation (3) of this node (1)

And suggest stopping all rabbitmq nodes, verifying they are stopped, then starting them all again; from past experience this recovers the issue.

Reference: rabbitmq/rabbitmq-server#2045

Add network check to see if lacp bonding is actually aggregating

Bonding using LACP is common in client environments. The physical
interfaces configured under the bond interface have their traffic
aggregated (counted together under the bond interface). Under
network misconfiguration or other issues, the interfaces might
not actually sync up as bond members and the switch
might not be aggregating their traffic. Check the status in
/proc/net/bonding/<bond> and the "Aggregator ID" field for all the
configured interfaces under the bond (they need to be the same).

Not an alert, just available in the report in case network throughput
or latency issues arise.
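A sketch of the check, parsing /proc/net/bonding/<bond> and comparing the per-member "Aggregator ID" values; the field names follow the usual bonding procfs format but are assumptions here:

#!/usr/bin/env python3
"""Sketch: verify all LACP bond members share the same aggregator ID."""
import glob
import os

def slave_aggregator_ids(bond_path):
    ids = {}
    current_slave = None
    with open(bond_path) as fd:
        for line in fd:
            line = line.strip()
            if line.startswith('Slave Interface:'):
                current_slave = line.split(':', 1)[1].strip()
            elif line.startswith('Aggregator ID:') and current_slave:
                ids[current_slave] = line.split(':', 1)[1].strip()
    return ids

if __name__ == '__main__':
    for bond_path in glob.glob('/proc/net/bonding/*'):
        ids = slave_aggregator_ids(bond_path)
        if len(set(ids.values())) > 1:
            print(f"{os.path.basename(bond_path)}: members are in different "
                  f"aggregators, traffic is not being aggregated: {ids}")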

[storage] Check ceph crush failure domain host balance

Some deployments configure the Ceph crush failure domain to fail by rack rather than the default of by host. It is sometimes the case that not all racks have the same number of nodes. If the disparity is large this can have a significant impact on Ceph, which still needs to write the same number of objects to all racks. If a sosreport contains information such as ceph osd tree we would be able to determine this and report a warning.
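A sketch of how the rack balance could be derived from ceph osd tree output; the parse is naive (it keys off the 'rack' and 'host' type tokens) and the disparity threshold is illustrative:

#!/usr/bin/env python3
"""Sketch: count hosts per rack from `ceph osd tree` output."""
import collections

def hosts_per_rack(osd_tree_text):
    counts = collections.Counter()
    current_rack = None
    for line in osd_tree_text.splitlines():
        tokens = line.split()
        if 'rack' in tokens:
            current_rack = tokens[tokens.index('rack') + 1]
        elif 'host' in tokens and current_rack:
            counts[current_rack] += 1
    return counts

if __name__ == '__main__':
    with open('sos_commands/ceph/ceph_osd_tree') as fd:
        counts = hosts_per_rack(fd.read())
    # Threshold is illustrative - "large disparity" needs defining.
    if counts and max(counts.values()) - min(counts.values()) > 1:
        print(f"uneven number of hosts per rack: {dict(counts)}")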

Handle missing /proc/slabinfo

In VMs/containers, /proc/slabinfo may be missing if slub is disabled.

In that case the 01kernel plugin errors out, e.g.:

cat: ./proc/slabinfo: No such file or directory
/home/ponnuvel/myGit/hotsos/plugins/kernel/01kernel: line 75: 0*/1024: syntax error: operand expected (error token is "/1024")

storage plugin ceph error

Traceback (most recent call last):
  File "/home/user1/git/dosaboy/hotsos/plugins/storage/01ceph", line 189, in <module>
    get_ceph_versions_mismatch()
  File "/home/user1/git/dosaboy/hotsos/plugins/storage/01ceph", line 52, in get_ceph_versions_mismatch
    data = json.loads(''.join(lines))
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)

Raise an issue if repeated checksum errors occur

We currently check for crc checksum errors and report them in the summary, but we don't raise an issue. There is apparently a known issue in OSDs where a checksum error that succeeds on retry can be ignored. However, if we see a significant number of these errors within a short space of time we should alert with an issue.

hotsos snap encoding issue

Running hotsos (snap) against a sosreport with cdk installed happens to trigger this when parsing snap names:

Traceback (most recent call last):
  File "/snap/hotsos/160/plugins/kubernetes/01general", line 100, in <module>
    get_snap_info()
  File "/snap/hotsos/160/plugins/kubernetes/01general", line 56, in get_snap_info
    snap_list_all = helpers.get_snap_list_all()
  File "/snap/hotsos/160/plugins/kubernetes/common/helpers.py", line 64, in catch_exception_inner2
    return f(*args, **kwargs)
  File "/snap/hotsos/160/plugins/kubernetes/common/helpers.py", line 218, in get_snap_list_all
    return open(path, 'r').readlines()
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1604: ordinal not in range(128)
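One possible minimal fix (a sketch only, not the project's actual change) is to open the captured `snap list --all` output with an explicit encoding instead of relying on the locale default, which is ASCII inside the snap environment:

# Sketch of a possible fix: read the snap list output as UTF-8 rather than
# with the locale default codec; the path argument is whatever the helper
# currently receives.
def get_snap_list_all(path):
    with open(path, 'r', encoding='utf-8', errors='replace') as fd:
        return fd.readlines()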

fsid identification is buggy for Ceph

The offset field is calculated on the assumption that "osd id" appears before "osd fsid", but this isn't always the case. In one of the sosreports, ceph-volume_lvm_list contains:

===== osd.979 ======

  [block]       /dev/ceph-e8ac8bb7-944d-4015-beba-d7ebcf497022/osd-block-626dbbf0-a72f-4146-9d46-e49709a9e820

      block device              /dev/ceph-e8ac8bb7-944d-4015-beba-d7ebcf497022/osd-block-626dbbf0-a72f-4146-9d46-e49709a9e820
      block uuid                kGu17z-cHXs-PYt0-ShJZ-giOg-2BKi-FTZp9L
      cephx lockbox secret      
      cluster fsid              e0e71965-4698-4145-b2e0-67fcd2d430d4
      cluster name              ceph
      crush device class        None
      encrypted                 0   
      osd fsid                  626dbbf0-a72f-4146-9d46-e49709a9e820
      osd id                    979 
      type                      block
      vdo                       0   
      devices                   /dev/sdr

(here "osd id" appears after "osd fsid").

So the offset calculation needs fixing.
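A sketch of an order-independent parse: key/value pairs are collected per "===== osd.N =====" section, so it does not matter whether "osd id" or "osd fsid" appears first; the sosreport path is illustrative:

#!/usr/bin/env python3
"""Sketch: map OSD ids to fsids from ceph-volume lvm list output,
independent of field order within each section."""
import re

SECTION_RE = re.compile(r'^=+\s*osd\.(?P<id>\d+)\s*=+$')

def osd_fsids(lvm_list_text):
    fsids = {}
    current_osd = None
    for line in lvm_list_text.splitlines():
        line = line.strip()
        match = SECTION_RE.match(line)
        if match:
            current_osd = match.group('id')
        elif current_osd and line.startswith('osd fsid'):
            fsids[current_osd] = line.split()[-1]
    return fsids

if __name__ == '__main__':
    with open('sos_commands/ceph/ceph-volume_lvm_list') as fd:
        for osd, fsid in osd_fsids(fd.read()).items():
            print(f"osd.{osd}: {fsid}")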

check for default RX/TX queue values in nova.conf

Alert if nova.conf does not set rx_queue_size and tx_queue_size (DPDK) under [libvirt].
Also alert if either value is under 1024.

Basically we are looking for the following to be set in nova.conf and alert otherwise:

[libvirt]
rx_queue_size=1024
tx_queue_size=1024
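A sketch of the check using configparser; nova.conf layouts vary (duplicate sections, multi-valued options), so strict=False is used and this is illustrative only:

#!/usr/bin/env python3
"""Sketch: alert on missing or undersized virtio queue settings in nova.conf."""
import configparser

MIN_QUEUE_SIZE = 1024

def check_queue_sizes(path='/etc/nova/nova.conf'):
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read(path)
    issues = []
    for option in ('rx_queue_size', 'tx_queue_size'):
        value = parser.get('libvirt', option, fallback=None)
        if value is None:
            issues.append(f"[libvirt] {option} is not set")
        elif int(value) < MIN_QUEUE_SIZE:
            issues.append(f"[libvirt] {option}={value} (< {MIN_QUEUE_SIZE})")
    return issues

if __name__ == '__main__':
    for issue in check_queue_sizes():
        print(issue)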

Coloured output from hotsos?

As hotsos output grows, it would be nice to colour-code findings, for example:

red - "definitely wrong"
yellow - "possibly wrong, look more closely"
etc

Add check if oom killer has gone off

Check the kernel log to see whether the OOM killer has been invoked and the kernel
has killed processes. If so this should be an alert, because
there's little point debugging any other issue while the kernel
is killing processes due to an OOM situation.
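A sketch of such a check; the "invoked oom-killer" and "Out of memory:" markers are the usual kernel log messages but their exact wording varies across kernel versions, so they are assumptions here:

#!/usr/bin/env python3
"""Sketch: report OOM-killer invocations found in the kernel log."""

OOM_MARKERS = ('invoked oom-killer', 'Out of memory:')

def oom_events(path='/var/log/kern.log'):
    events = []
    with open(path, errors='replace') as fd:
        for line in fd:
            if any(marker in line for marker in OOM_MARKERS):
                events.append(line.strip())
    return events

if __name__ == '__main__':
    events = oom_events()
    if events:
        print(f"ALERT: {len(events)} OOM-related kernel log entries, e.g.:")
        print(events[-1])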

add smartctl analysis

Sosreports contain smartctl status information, which we could and should check for hardware errors, reporting any that are found.

detect and analyse laggy Ceph PGs

Sometimes Ceph reports OSD PGs as "laggy" and we may be able to identify hotspots, i.e. nodes that for some reason have slower OSDs than others.

report sosreport plugin timeout

Sometimes plugins time out and results will be missing. You can detect this from the following sources:
[./sos_reports/manifest.json]
.components.report.plugins[].timeout_hit

[./sos_logs/ui.log]
2021-05-17 08:20:48,602 ERROR:
Plugin networking timed out

Recently we had an issue with a networking plugin timeout, simply because there was not enough time given the number of interfaces on a busy dvr_snat iptables_hybrid host. It took me a while to realise that was why I was missing info and why ovs-stat was erroring out about the network-control snap interface not being connected: it couldn't find the ip_netns file.
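A sketch of the manifest.json check; the .components.report.plugins[].timeout_hit path is taken from the issue text, and the code also tolerates the plugins entry being a dict keyed by plugin name rather than a list:

#!/usr/bin/env python3
"""Sketch: list sosreport plugins that hit their timeout, per manifest.json."""
import json

def timed_out_plugins(manifest_path='sos_reports/manifest.json'):
    with open(manifest_path) as fd:
        manifest = json.load(fd)
    plugins = manifest.get('components', {}).get('report', {}).get('plugins', {})
    if isinstance(plugins, dict):
        items = plugins.items()
    else:  # some formats use a list of {"name": ..., "timeout_hit": ...}
        items = ((p.get('name', '?'), p) for p in plugins)
    return [name for name, data in items if data.get('timeout_hit')]

if __name__ == '__main__':
    for name in timed_out_plugins():
        print(f"sosreport plugin timed out: {name}")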
