
Comments (13)

Napsty commented on June 14, 2024

The SMART attributes of an NVMe drive are available as log identifier 02h on the device.
The information below is based on the current NVMe specification, 1.3d (https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3d-2019.03.20-Ratified.pdf).

Attributes worth checking for NVMe devices

Critical Warning

This field indicates critical warnings for the state of the controller. Each bit corresponds to a critical warning type; multiple bits may be set. If a bit is cleared to ‘0’, then that critical warning does not apply. Critical warnings may result in an asynchronous event notification to the host. Bits in this field represent the current associated state and are not persistent.

Bit 0: If set to ‘1’, then the available spare capacity has fallen below the threshold.
Bit 1: If set to ‘1’, then a temperature is above an over-temperature threshold or below an under-temperature threshold.
Bit 2: If set to ‘1’, then the NVM subsystem reliability has been degraded due to significant media related errors or any internal error that degrades NVM subsystem reliability.
Bit 3: If set to ‘1’, then the media has been placed in read only mode.
Bit 4: If set to ‘1’, then the volatile memory backup device has failed. This field is only valid if the controller has a volatile memory backup solution.

So, to my current understanding, a value of 0x00 means everything is OK so far and this NVMe does not have a memory backup. A value of 0x10 would mean serious degradation of the device.
Nope, that's wrong. I'm trying to find the relevant info or specs on what the value would actually look like. I believe I found two cases so far:

Any hint in the right direction to understand how the bits are actually set and how this represents the final value would be much appreciated!

Update: Yes! Seems I found it in the smartmontools source code: https://github.com/smartmontools/smartmontools/blob/e3fdde7aff4cd069e629ee987bf33ac8ccd621ad/smartmontools/nvmeprint.cpp#L300

These are the possible values for attribute Critical Warning as of now:

  • 0x01 = available spare has fallen below threshold
  • 0x02 = temperature is above or below threshold
  • 0x04 = NVM subsystem reliability has been degraded
  • 0x08 = media has been placed in read only mode
  • 0x10 = volatile memory backup device has failed
  • any bit outside the 0x1f mask = unknown critical warning(s)

What I still don't understand is what happens if multiple warnings occur at the same time, e.g. available spare (0x01) and temperature threshold (0x02). Would that result in 0x03? I have not seen an example like this anywhere.
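
Since the spec text quoted above says that each bit is a separate warning type and that multiple bits may be set, simultaneous warnings should simply combine by bitwise OR, so 0x01 together with 0x02 would read 0x03. A minimal Python sketch of decoding the byte, assuming the bit meanings listed above and treating any bit outside the 0x1f mask as unknown (my reading of the linked smartmontools source). `decode_critical_warning` is a hypothetical helper, not part of smartctl or check_smart:

```python
# Bit masks for the Critical Warning byte, as listed above.
CRITICAL_WARNING_BITS = {
    0x01: "available spare has fallen below threshold",
    0x02: "temperature is above or below threshold",
    0x04: "NVM subsystem reliability has been degraded",
    0x08: "media has been placed in read only mode",
    0x10: "volatile memory backup device has failed",
}
KNOWN_MASK = 0x1F  # union of the five defined bits


def decode_critical_warning(value):
    """Return one message per bit set in `value` (hypothetical helper)."""
    messages = [text for bit, text in CRITICAL_WARNING_BITS.items() if value & bit]
    # Assumption: bits outside the known mask are reported as unknown.
    unknown = value & ~KNOWN_MASK & 0xFF
    if unknown:
        messages.append("unknown critical warning(s) (0x%02x)" % unknown)
    return messages


# 0x03 = 0x01 | 0x02, i.e. spare and temperature warnings at the same time.
print(decode_critical_warning(0x03))
```

A value of 0x00 yields an empty list, which corresponds to the "PASSED" case in the smartctl output.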

According to the source code, smartctl itself already reports a failure on the self-assessment check (step 1 in check_smart) when this attribute is non-zero. In that case we could skip this attribute and focus on the other ones with performance data.

Available Spare

Contains a normalized percentage (0 to 100%) of the remaining spare capacity available.

This means that as soon as the value drops below 100%, the device is slowly wearing out. It is an important indicator of when a device is likely to be "too old/too used" and needs to be replaced.

Percentage Used

Contains a vendor specific estimate of the percentage of NVM subsystem life used based on the actual usage and the manufacturer’s prediction of NVM life.

Not sure yet whether this should be taken into account.

Media and Data Integrity Errors

Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field.

Probably the most important attribute to check. Similar to "bad sectors" on a hard drive.

Error Information Log Entries

Contains the number of Error Information log entries over the life of the controller.

Not sure yet, however this could be a helpful hint to see increasing issues on a device.

Performance data to be collected

All attributes except "Critical Warning"

  • Temperature
  • Available Spare
  • Percentage Used
  • Data Units Read
  • Data Units Written
  • Host Read Commands
  • Host Write Commands
  • Controller Busy Time
  • Power Cycles
  • Power On Hours
  • Unsafe Shutdowns
  • Media and Data Integrity Errors
  • Error Information Log Entries
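
The collection of these attributes could be sketched roughly as follows. This is only an illustration in Python, not the plugin's actual code, and `parse_nvme_perfdata` and `SAMPLE` are hypothetical names. It assumes the `smartctl -a` text layout seen in the outputs posted in this thread, where thousands separators may be spaces or commas, and it assumes that spaces in labels become underscores:

```python
import re

# Excerpt in the layout of the smartctl -a outputs posted in this thread.
SAMPLE = """\
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        32 Celsius
Available Spare:                    100%
Data Units Read:                    4 092 023 [2,09 TB]
Power On Hours:                     2 114
Warning  Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 64 entries)
"""


def parse_nvme_perfdata(smartctl_output):
    """Extract numeric SMART/Health values as a dict (hypothetical helper)."""
    perfdata = {}
    in_health = False
    for line in smartctl_output.splitlines():
        if line.startswith("SMART/Health Information"):
            in_health = True
            continue
        if in_health and not line.strip():
            break  # a blank line ends the SMART/Health section
        if not in_health or ":" not in line:
            continue
        label, _, raw = line.partition(":")
        if label == "Critical Warning":
            continue  # a bit field, not performance data
        raw = raw.split("[")[0]  # drop the bracketed size, e.g. "[2,09 TB]"
        match = re.match(r"\s*(\d[\d ,]*)", raw)
        if match:
            digits = re.sub(r"[ ,]", "", match.group(1))  # strip separators
            perfdata[label.replace(" ", "_")] = int(digits)
    return perfdata


result = parse_nvme_perfdata(SAMPLE)
print(" ".join("%s=%d" % (key, value) for key, value in result.items()))
```

Units such as "Celsius" and "%" are dropped, so only the bare number ends up in the performance data.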

from check_smart.

Napsty commented on June 14, 2024

Working on it. Someone gave me remote access to a server with an NVMe drive.


Napsty commented on June 14, 2024

NVMe support was officially released with 6.7.0.


Napsty commented on June 14, 2024

Good idea. Could you please share the smartctl -a output of an NVMe drive?


Rohlik commented on June 14, 2024

Of course.

[root@fooo ~]# smartctl --all /dev/nvme1n1
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.5.1.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQLW960HMJP-00003
Serial Number:                      S35XNX0KA02248
Firmware Version:                   CXV8601Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 960 197 124 096 [960 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          960 197 124 096 [960 GB]
Namespace 1 Utilization:            803 477 762 048 [803 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Wed Apr 17 10:10:54 2019 CEST
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000e):   Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     84 Celsius
Critical Comp. Temp. Threshold:     85 Celsius
Namespace 1 Features (0x02):        NA_Fields

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.00W       -        -    0  0  0  0        5       5

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        32 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    4 092 023 [2,09 TB]
Data Units Written:                 3 044 102 [1,55 TB]
Host Read Commands:                 31 971 434
Host Write Commands:                23 782 181
Controller Busy Time:               128
Power Cycles:                       25
Power On Hours:                     2 114
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      9
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               32 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          9     0  0x001b  0x4004  0x028            0     0     -
  1          8     0  0x001b  0x4004  0x028            0     0     -
  2          7     0  0x001b  0x4004  0x028            0     0     -
  3          6     0  0x001b  0x4004  0x028            0     0     -
  4          5     0  0x001b  0x4004  0x028            0     0     -
  5          4     0  0x001b  0x4004  0x028            0     0     -
  6          3     0  0x001b  0x4004  0x028            0     0     -
  7          2     0  0x001b  0x4004  0x028            0     0     -
  8          1     0  0x001b  0x4004  0x028            0     0     -
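
A note on the Data Units Read and Data Units Written counters in the output above: as far as I know, the NVMe specification defines one data unit as 1000 blocks of 512 bytes, which is consistent with the bracketed sizes smartctl prints next to the raw counters. A small Python sketch of the conversion, where `data_units_to_bytes` is a hypothetical helper name:

```python
# One data unit = 1000 blocks of 512 bytes (assumption based on the
# NVMe SMART/Health log definition of the Data Units counters).
BYTES_PER_DATA_UNIT = 1000 * 512


def data_units_to_bytes(units):
    """Convert a Data Units Read/Written counter to bytes (hypothetical helper)."""
    return units * BYTES_PER_DATA_UNIT


# Data Units Read from the output above: 4 092 023
read_bytes = data_units_to_bytes(4092023)
print(read_bytes, "bytes, or", read_bytes / 1e12, "TB")
```

This matters when graphing the counters: the raw value is what ends up in the performance data, so any conversion to bytes has to happen on the monitoring side.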


Napsty commented on June 14, 2024

Here's the smartctl output of another NVMe:

# smartctl -a /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.18.5-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       UCS-SDHPCIE 800GB
Serial Number:                      STM0001BAE33
Firmware Version:                   KMCCP108
PCI Vendor ID:                      0x1c58
PCI Vendor Subsystem ID:            0x1137
IEEE OUI Identifier:                0x000cca
Controller ID:                      414
Number of Namespaces:               1
Namespace 1 Size/Capacity:          800,166,076,416 [800 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            000cca 00602ba300
Local Time is:                      Thu Jun  6 08:06:40 2019 UTC
Firmware Updates (0x09):            4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0    15000   15000
 1 +    20.00W       -        -    1  1  1  1    15000   15000
 2 +    15.00W       -        -    2  2  2  2    15000   15000
 3 +    10.00W       -        -    3  3  3  3    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -     512       8         2
 2 -    4096       0         0
 3 -    4096       8         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,667,851 [853 GB]
Data Units Written:                 5,430,405 [2.78 TB]
Host Read Commands:                 11,553,415
Host Write Commands:                23,371,696
Controller Busy Time:               89
Power Cycles:                       92
Power On Hours:                     6,563
Unsafe Shutdowns:                   80
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, max 63 entries)
No Errors Logged

And another one:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.75] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       UCS-SDHPCIE 1.6TB
Serial Number:                      CJH00100C4C9
Firmware Version:                   KMCCP105
PCI Vendor ID:                      0x1c58
PCI Vendor Subsystem ID:            0x1137
IEEE OUI Identifier:                0x000cca
Controller ID:                      415
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,600,321,314,816 [1.60 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Jun  6 04:23:50 2019 EDT
Firmware Updates (0x09):            4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0    15000   15000
 1 +    20.00W       -        -    1  1  1  1    15000   15000
 2 +    15.00W       -        -    2  2  2  2    15000   15000
 3 +    10.00W       -        -    3  3  3  3    15000   15000
 4 -    10.00W       -        -    3  3  3  3    15000   15000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 1 -     512       8         2
 2 -    4096       0         0
 3 -    4096       8         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    1,394,509 [713 GB]
Data Units Written:                 173,186,724 [88.6 TB]
Host Read Commands:                 131,022,884
Host Write Commands:                11,448,977,782
Controller Busy Time:               51,941
Power Cycles:                       43
Power On Hours:                     18,856
Unsafe Shutdowns:                   40
Media and Data Integrity Errors:    0
Error Information Log Entries:      3

Error Information (NVMe Log 0x01, max 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          3     1  0x00f7  0xdead  0x000            0     0  0xf7
  1          2     1  0x00f2  0xdead  0x000            0     0  0xf2
  2          1     1  0x0109  0xdead  0x000            0     0  0x09


roben commented on June 14, 2024

Hi, is there any news on this? Can I help with something?


Napsty commented on June 14, 2024

@roben I have the code "in my mind" already, but I need a system with an NVMe drive to test on. Would anyone be willing to give me remote access to a system with an NVMe drive? Contact me on https://www.claudiokuenzler.com/about/.


roben commented on June 14, 2024

Sorry, I only have company servers available, and I can't provide access to them.

I stumbled upon this, though: https://github.com/thomas-krenn/check_smart_attributes#NVMedevices
It seems to do similar checks and already supports NVMe drives, so maybe it can help to confirm your ideas.


Napsty commented on June 14, 2024

@Rohlik @roben
Please test https://github.com/Napsty/check_smart/blob/nvme/check_smart.pl in the nvme branch.


roben commented on June 14, 2024

Thanks! It looks good:

/usr/lib/nagios/plugins/check_nrpe -H xxx -c check_smart_nvme_all
OK: [/dev/nvme0] - Device is clean --- [/dev/nvme1] - Device is clean|

with

command[check_smart_nvme_all]=/usr/local/.../check_smart.pl -g "/dev/nvme[0-9]" -i nvme

It's hard to test for the faulty drive case, though, because they are all working fine.


Napsty commented on June 14, 2024

@roben Thanks for testing. I just pushed another important change (regex adjusted). Can you please test again with the newest version from the nvme branch:

https://raw.githubusercontent.com/Napsty/check_smart/nvme/check_smart.pl

Please also run a check against a single NVMe drive if you can, to see whether the performance data appears correctly (it worked on the server I got access to).


roben commented on June 14, 2024

Here's the single device check:

./check_smart.pl -d /dev/nvme0 -i nvme
OK: Drive  KXG60ZNV1T02 TOSHIBA S/N 89CS10Z1T0RM: no SMART errors detected. |Temperature=42 Available_Spare=100 Available_Spare_Threshold=10 Percentage_Used=0 Data_Units_Read=13608073 Data_Units_Written=6004240 Host_Read_Commands=3734157080 Host_Write_Commands=41754653 Controller_Busy_Time=684 Power_Cycles=6 Power_On_Hours=2906 Unsafe_Shutdowns=2 Media_and_Data_Integrity_Errors=0 Error_Information_Log_Entries=0 Warning__Comp._Temperature_Time=0 Critical_Comp._Temperature_Time=0 Temperature_Sensor_1=42

The output for the multi device check was the same as above.

