
Comments (29)

sempervictus commented on June 18, 2024

One additional note: the devices comprising the vdevs (/dev/disk/by-id/ata-XXX-YYY) are used as full disks. However, the plugin detects /dev/disk/by-id/ata-XXX-YYY-part1 as the member, whereas zpool status properly shows ata-XXX-YYY. This could explain why it's not finding the disk: if it searches for a -part1 device, the OS may have removed the base path preceding it.

EDIT: we didn't catch this earlier because our other zpools use dm-crypt devices as their backing stores; zdb -L on those systems shows /dev/disk/by-id/dm-name-..., whereas on the raw disks zdb -L actually shows the -part1 suffix. Still not sure why this is causing the ZenPack to report the offline disk as ready, though...
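
As a sketch of that whole-disk handling, the modeler could strip the partition suffix so the member name matches what zpool status reports (hypothetical helper, not the ZenPack's actual code):

```python
import re

def normalize_member(dev):
    """Drop a trailing '-partN' suffix from a whole-disk member name
    so it matches the name `zpool status` shows (hypothetical helper)."""
    return re.sub(r'-part\d+$', '', dev)

print(normalize_member('ata-XXX-YYY-part1'))  # ata-XXX-YYY
print(normalize_member('ata-XXX-YYY'))        # unchanged
```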

from zenpacks.daviswr.zfs.

daviswr commented on June 18, 2024

I think I've fixed the problem in the zpool status parser, and the zpool modeler should now drop the partition/slice number from the device name if it's a whole disk. I was initially resigned to not doing vdev templates, at least not yet, due to the naming difference between zdb and zpool output, but I guess my half-baked health checks changed that... I need to redo the zpool status parser, though. As a result, I can probably do vdev I/O graphs soon, too.
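
A redone parser could collect every member's state from a full `zpool status -v` instead of bailing at the first match; a minimal sketch, assuming the usual config-section layout (this is not the ZenPack's actual parser):

```python
import re

# Matches indented config-section lines: member name, state, and
# optional READ/WRITE/CKSUM counters.
LINE_RE = re.compile(
    r'^\s+(?P<name>\S+)\s+'
    r'(?P<state>ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED|AVAIL|INUSE)'
    r'(?:\s+(?P<read>\d+)\s+(?P<write>\d+)\s+(?P<cksum>\d+))?'
)

def parse_states(status_output):
    """Return {member: state} for every pool, vdev, and device line."""
    states = {}
    for line in status_output.splitlines():
        match = LINE_RE.match(line)
        if match:
            states[match.group('name')] = match.group('state')
    return states
```

A real parser would also track indentation to key members by their position in the vdev tree; the flat dict above only illustrates continuing past the first status match.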

Is your failed disk marked as REMOVED, UNAVAIL, or not present at all in zpool status?


sempervictus commented on June 18, 2024

Thanks, the failed disk is marked as REMOVED.


sempervictus commented on June 18, 2024

Also interesting: a pool that is in the SUSPENDED state due to too many failures shows up as still being up (though the health threshold error works fine).


daviswr commented on June 18, 2024

If you're talking about the Status line on the component's Details, that seems to be controlled by the presence of an event of class /Status for that component. A /Status event will change it from Up to Down, but I'm not sure how to exert any finer-grained control over it. Poking around in zendmd, components have getStatus() and getStatusString() methods but not a corresponding setStatus().

I still need to create some event transforms for this ZenPack, especially to make the health status meaningful. Aside from assigning different severity levels, I could re-class the events that need to mark a component as truly "down" (or not).


sempervictus commented on June 18, 2024

@daviswr: any chance you figured out the transforms and vdev status bit in a private branch somewhere? Pool status and VDEV status always come back as 0/ONLINE, despite pools and their VDEVs showing as degraded. I tried editing the zpool status parser; the way it's written, it seems like it'll bail on a pool's VDEV members at the first match on status if the input is a full zpool status -v.


sempervictus commented on June 18, 2024

Unfortunately 5c6f23e did not fix the issue:
[screenshot]


daviswr commented on June 18, 2024

Indeed not. Hm. The expected value's getting into the RRD, but the threshold isn't working as expected. I'm investigating.


sempervictus commented on June 18, 2024

Thanks as always, sir.


sempervictus commented on June 18, 2024

I haven't looked into the sources in a while, but if they parse the error counters in the status output, that might be another thing to look for; if not, a /\s+0\s+0\s+0$/ sort of approach to checking VDEV status could serve as a backup.
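
That backup check could be as simple as testing each member line for three trailing zero counters, per the regex suggested above (a sketch, not the ZenPack's code):

```python
import re

# READ/WRITE/CKSUM are the last three columns of a member line in
# `zpool status`; all zeros means no errors have been logged.
CLEAN_COUNTERS = re.compile(r'\s+0\s+0\s+0$')

def counters_clean(status_line):
    """True if the line's error counters are all zero."""
    return bool(CLEAN_COUNTERS.search(status_line))

print(counters_clean('    sda  ONLINE   0  0  0'))  # True
print(counters_clean('    sdb  ONLINE   0  3  0'))  # False
```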


daviswr commented on June 18, 2024

I think I've got a solution, but I'm doing some more testing (something the last commit sorely lacked, if I'm honest...) before a commit.
I hadn't previously considered the error counters, but that's good data to collect.


sempervictus commented on June 18, 2024

Much appreciated; I'll keep an eye on the tab for when you need another tester.


daviswr commented on June 18, 2024

Could I get your opinion on the severity mappings for states?

    # https://docs.oracle.com/cd/E19253-01/819-5461/gamno/index.html
    # https://docs.oracle.com/cd/E19253-01/819-5461/gcvcw/index.html
    severities = {
        # The device or virtual device is in normal working order
        'ONLINE': SEVERITY_CLEAR,
        # Available hot spare
        'AVAIL': SEVERITY_CLEAR,
        # Hot spare that is currently in use
        'INUSE': SEVERITY_INFO,
        # The virtual device has experienced a failure but can still function
        'DEGRADED': SEVERITY_WARNING,
        # The device or virtual device is completely inaccessible
        'FAULTED': SEVERITY_CRITICAL,
        # The device has been explicitly taken offline by the administrator
        'OFFLINE': SEVERITY_WARNING,
        # The device or virtual device cannot be opened
        'UNAVAIL': SEVERITY_CRITICAL,
        # The device was physically removed while the system was running
        'REMOVED': SEVERITY_ERROR,
        'SUSPENDED': SEVERITY_ERROR,
        }


sempervictus commented on June 18, 2024

I'd map any state where the disk is inoperable as critical; Zenoss has a tendency to rate a lot of things as error, which creates alert fatigue for that class.
OFFLINE & DEGRADED I'd make an error.
REMOVED I'd promote to critical, ditto SUSPENDED.
I think AVAIL should probably be info, so we know when we have spares hanging out.
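
Applying those suggestions, the mapping would come out roughly as below. The SEVERITY_* values are stand-ins on Zenoss's usual 0-5 scale; the ZenPack's real constants may differ.

```python
# Stand-in severity constants (assumed Zenoss-style 0-5 scale).
SEVERITY_CLEAR, SEVERITY_INFO, SEVERITY_WARNING = 0, 2, 3
SEVERITY_ERROR, SEVERITY_CRITICAL = 4, 5

severities = {
    'ONLINE': SEVERITY_CLEAR,
    # Info for spares, so we know when we have them hanging out
    'AVAIL': SEVERITY_INFO,
    'INUSE': SEVERITY_INFO,
    # Pool still functions but needs attention
    'DEGRADED': SEVERITY_ERROR,
    'OFFLINE': SEVERITY_ERROR,
    # Inoperable states go critical to avoid alert fatigue at error
    'FAULTED': SEVERITY_CRITICAL,
    'UNAVAIL': SEVERITY_CRITICAL,
    'REMOVED': SEVERITY_CRITICAL,
    'SUSPENDED': SEVERITY_CRITICAL,
}
```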


daviswr commented on June 18, 2024

Give d24d30a a try.


sempervictus commented on June 18, 2024

Thank you - success.


sempervictus commented on June 18, 2024

There may be an edge case or two. After a manual model run, it shows the state of some VDEVs on some systems as TBD:
[screenshot]

The vast majority are working correctly now - detecting root VDEV and leaf VDEV states, alerting, the works.

EDIT: it seems the ZFS command parser is failing on that one host: it's incorrectly reading a single snapshot as all of the datasets (snapshots are set to be ignored for all of these), so root and disk VDEVs are in TBD status and the dataset view is confused.


daviswr commented on June 18, 2024

TBD's a temporary state for VDevs until the real value's polled. zdb doesn't list their health state, and due to how the zpool modeler's written, it'd be, shall we say, cumbersome to get vdev health at model time.


daviswr commented on June 18, 2024

I haven't touched any of the dataset-related stuff. Was that host working before?


sempervictus commented on June 18, 2024

It was working before, but after modeling they all seem to be setting themselves to TBD. The datasets issue was a bookmark showing up as a ZFS dataset, seemingly unrelated to the TBD thing. What's weird is that right after installing the updated plugin, VDEVs were showing state, but now I'm seeing TBD on a bunch. Any chance they reset from a modeling pass?


daviswr commented on June 18, 2024

Yeah, right now all vdevs will reset to TBD temporarily after modeling. Zenoss seemed to want something in that field for the property to display. The default cycle time for command polling is 60 seconds, so they should pick up the real value soon.

In typing this I might've thought of a couple of things to play with, but basically I'm going to have to spend a lot of "quality time" with the zpool modeler to fix it properly. Time/effort tradeoff and all.


sempervictus commented on June 18, 2024

Unfortunately, they're not coming back from the TBD state, even after nearly 30 minutes.


daviswr commented on June 18, 2024

Do non-ONLINE ones show up? If it's just devices that should be ONLINE showing as TBD, I think I know what it is. The health threshold currently doesn't trip for ONLINE (only for clearing a non-ONLINE state), so the call to update the model is never reached unless the device was in a different (non-TBD) state beforehand. I should've caught that.
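
In other words, the update path should trip on any state change, including a transition back to ONLINE; as a tiny sketch of the idea (not the ZenPack's code):

```python
def should_update_model(modeled_state, polled_state):
    """Update the modeled health whenever the polled value differs,
    so devices leave the TBD placeholder even when they're healthy."""
    return polled_state != modeled_state

print(should_update_model('TBD', 'ONLINE'))     # True: clear the placeholder
print(should_update_model('ONLINE', 'ONLINE'))  # False: nothing to do
```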


sempervictus commented on June 18, 2024

That theory makes sense; I've re-enabled monitoring for a dead VDEV (we tick 'em off, unless they're data-critical, once a case is created to deal with it) to verify.

Yeah, DEGRADED VDEVs show up, and good ones are TBD. Waiting for a critical one to cycle through for the leaf VDEV.


sempervictus commented on June 18, 2024

Confirmed: all "not doing well" states show up, and all good states are TBD.


sempervictus commented on June 18, 2024

I took a look over the last few commits and have a question: if component health is already set to whatever the last polling interval provided, could the modeling process skip updating it when the value to update with is TBD?


daviswr commented on June 18, 2024

Tried a few different things and came up with what I probably should have done in the first place for displaying current health. Whew. As of affe923, the display looks at the actual current datapoint rather than relying on the transform to update the model. The transform still updates the model, though; it's just not what's displayed in that particular field.

Added bonuses are error graphs and scrub/resilver events.

I don't know if there's a way to get a current attribute or datapoint value during modeling.


sempervictus commented on June 18, 2024

They're showing ONLINE again, graphs work, and offline/failed devices work. So far so good.
Want to close this out? I'll open new ones as I find any latent issues.


daviswr commented on June 18, 2024

Works for me.

