The SMART attributes of an NVMe drive are exposed as log identifier 02h (SMART / Health Information) on the NVMe device.
Information based on the current NVMe specification (https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3d-2019.03.20-Ratified.pdf)
Attributes worth checking on NVMe devices
Critical Warning
This field indicates critical warnings for the state of the controller. Each bit corresponds to a critical warning type; multiple bits may be set. If a bit is cleared to ‘0’, then that critical warning does not apply. Critical warnings may result in an asynchronous event notification to the host. Bits in this field represent the current associated state and are not persistent.
Bit 0: If set to ‘1’, then the available spare capacity has fallen below the threshold.
Bit 1: If set to ‘1’, then a temperature is above an over-temperature threshold or below an under-temperature threshold.
Bit 2: If set to ‘1’, then the NVM subsystem reliability has been degraded due to significant media related errors or any internal error that degrades NVM subsystem reliability.
Bit 3: If set to ‘1’, then the media has been placed in read only mode.
Bit 4: If set to ‘1’, then the volatile memory backup device has failed. This field is only valid if the controller has a volatile memory backup solution.
So, to my current understanding, a value of 0x00 means everything is OK so far and this NVMe does not have a memory backup. A value of 0x10 would mean serious degradation of the device.
Nope, that's wrong. I'm trying to find the relevant information or specs on what the value would actually look like. I believe I have found two cases so far:
- A value of 0x04 seems to be bit 4 set, meaning the volatile memory backup device has failed (https://www.reddit.com/r/homelab/comments/88ash6/proxmox_shows_nvme_drives_as_failed_zpool_says/)
- A value of 0x02 seems to be bit 2 set, which would mean serious degradation of the device; however, the SMART self-assessment test indicates a temperature alert (https://ubuntuforums.org/showthread.php?t=2348089&s=c0f1bc10beae9004a6c07e9f69493208)
Any hint in the right direction on how the bits are actually set and how they make up the final value would be much appreciated!
Update: Yes! It seems I found it in the smartmontools source code: https://github.com/smartmontools/smartmontools/blob/e3fdde7aff4cd069e629ee987bf33ac8ccd621ad/smartmontools/nvmeprint.cpp#L300
These are the possible values for the Critical Warning attribute as of now:
- 0x01 = available spare has fallen below threshold
- 0x02 = temperature is above or below threshold
- 0x04 = NVM subsystem reliability has been degraded
- 0x08 = media has been placed in read only mode
- 0x10 = volatile memory backup device has failed
- any bit outside the mask 0x1f = unknown critical warning(s)
But what I still don't understand is what happens if multiple errors occur at the same time, e.g. available spare (0x01) and temperature threshold (0x02). Would that result in 0x03? I haven't seen an example like this anywhere.
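The spec text quoted above says that each bit is one warning type and that multiple bits may be set, so the field is a bitmask: bit n has the value 1 << n (bit 2 is 0x04, bit 4 is 0x10), and simultaneous warnings are OR-ed together, which makes spare plus temperature 0x03. A minimal Python sketch of the decoding follows. The function name `decode_critical_warning` is made up for this illustration, and treating any bit outside 0x1f as "unknown" is an assumption about how the linked smartmontools source handles such bits.

```python
# Bit values and texts as listed above (smartmontools nvmeprint.cpp).
CRITICAL_WARNING_BITS = {
    0x01: "available spare has fallen below threshold",
    0x02: "temperature is above or below threshold",
    0x04: "NVM subsystem reliability has been degraded",
    0x08: "media has been placed in read only mode",
    0x10: "volatile memory backup device has failed",
}
KNOWN_MASK = 0x1F  # bits 0..4 are the ones described in the spec excerpt


def decode_critical_warning(value):
    """Return the list of warnings encoded in the Critical Warning byte.

    Hypothetical helper for illustration; not part of check_smart.
    """
    warnings = [text for bit, text in CRITICAL_WARNING_BITS.items() if value & bit]
    # Assumption: bits outside the known mask are reported as "unknown".
    unknown = value & ~KNOWN_MASK & 0xFF
    if unknown:
        warnings.append("unknown critical warning(s) (0x%02x)" % unknown)
    return warnings


if __name__ == "__main__":
    for raw in ("0x00", "0x03", "0x10"):
        print(raw, decode_critical_warning(int(raw, 16)))
```

An empty list corresponds to 0x00, i.e. no critical warning is currently active.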
According to the source code, smartctl itself will already report a failure in the self-assessment check (step 1 in check_smart). In this case we could skip this attribute and focus on the other ones with performance data.
Available Spare
Contains a normalized percentage (0 to 100%) of the remaining spare capacity available.
This means that as soon as the value drops below 100%, the device is slowly wearing out. It is an important indicator of when a device is likely to be "too old/too used" and needs to be replaced.
Percentage Used
Contains a vendor specific estimate of the percentage of NVM subsystem life used based on the actual usage and the manufacturer’s prediction of NVM life.
Not sure yet if this should be counted in.
Media and Data Integrity Errors
Contains the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this field.
Probably the most important attribute to be checked. Similar to "bad sectors" of a hard drive.
Error Information Log Entries
Contains the number of Error Information log entries over the life of the controller.
Not sure yet, however this could be a helpful hint to see increasing issues on a device.
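To make the notes on these attributes concrete, here is a rough Python sketch of how they could be turned into a Nagios-style status. The policy is purely hypothetical and is not what check_smart implements: spare below the drive's own "Available Spare Threshold" (a value that appears in smartctl output) and any media/data integrity error are treated as critical, while Percentage Used and Error Information Log Entries are only reported, since it is not yet clear whether they should count. The names `evaluate`, `OK`, `WARNING` and `CRITICAL` are invented for this example.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # usual Nagios plugin exit codes


def evaluate(attrs):
    """Map a dict of NVMe health values to (status, messages).

    Hypothetical policy for illustration only.
    """
    status, messages = OK, []

    # Spare capacity: compare against the threshold reported by the drive.
    if attrs["Available Spare"] < attrs["Available Spare Threshold"]:
        status = CRITICAL
        messages.append("available spare %d%% is below threshold %d%%" % (
            attrs["Available Spare"], attrs["Available Spare Threshold"]))

    # Unrecovered data integrity errors, comparable to bad sectors.
    if attrs["Media and Data Integrity Errors"] > 0:
        status = CRITICAL
        messages.append("%d media and data integrity errors" % (
            attrs["Media and Data Integrity Errors"]))

    # Informational only: does not change the status in this sketch.
    if attrs.get("Error Information Log Entries", 0) > 0:
        messages.append("%d error information log entries" % (
            attrs["Error Information Log Entries"]))

    return status, messages


if __name__ == "__main__":
    healthy = {
        "Available Spare": 100,
        "Available Spare Threshold": 10,
        "Media and Data Integrity Errors": 0,
        "Error Information Log Entries": 9,
    }
    print(evaluate(healthy))
```

Keeping the log entry count informational reflects the uncertainty above; a real check would probably want configurable thresholds instead of hard-coded rules.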
Performance data to be collected
All attributes except "Critical Warning"
- Temperature
- Available Spare
- Percentage Used
- Data Units Read
- Data Units Written
- Host Read Commands
- Host Write Commands
- Controller Busy Time
- Power Cycles
- Power On Hours
- Unsafe Shutdowns
- Media and Data Integrity Errors
- Error Information Log Entries
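As a sketch of how these attributes could be collected, the Python snippet below turns "Name: value" lines from the SMART/Health section of smartctl output into `name=value` performance data. It is not the check_smart.pl implementation; `parse_health`, `format_perfdata` and `data_units_to_bytes` are hypothetical names. It assumes thousands separators are either spaces or commas, and it only makes sense when fed the health section, not the whole smartctl output. The conversion helper relies on the NVMe definition of a data unit as 1000 units of 512 bytes.

```python
import re

# A few representative lines in the two number formats smartctl may print.
SAMPLE = """\
Critical Warning: 0x00
Temperature: 32 Celsius
Available Spare: 100%
Data Units Read: 4 092 023 [2,09 TB]
Data Units Written: 1,667,851 [853 GB]
Power On Hours: 2 114
"""


def parse_health(text):
    """Return an ordered dict of attribute name -> integer value."""
    perfdata = {}
    for line in text.splitlines():
        name, sep, raw = line.partition(":")
        name = name.strip()
        if not sep or name == "Critical Warning":
            continue  # not a "Name: value" line, or excluded on purpose
        raw = raw.split("[")[0]  # drop the human-readable "[2,09 TB]" part
        match = re.match(r"\s*(\d[\d ,]*)", raw)
        if not match:
            continue  # value does not start with a number
        digits = re.sub(r"[ ,]", "", match.group(1))  # remove separators
        perfdata[name.replace(" ", "_")] = int(digits)
    return perfdata


def format_perfdata(perfdata):
    """Join the values into a Nagios-style performance data string."""
    return " ".join("%s=%d" % item for item in perfdata.items())


def data_units_to_bytes(units):
    """One NVMe data unit is 1000 units of 512 bytes."""
    return units * 1000 * 512


if __name__ == "__main__":
    data = parse_health(SAMPLE)
    print(format_perfdata(data))
    print(data_units_to_bytes(data["Data_Units_Read"]))
```

Stripping the unit suffix and the separators keeps the values purely numeric, which is what graphing tools consuming performance data generally expect.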
from check_smart.
Working on it. Someone gave me remote access to a server with an NVMe drive.
NVMe support officially released with 6.7.0.
Good idea. Could you please share the smartctl -a output of an NVMe drive?
Of course.
[root@fooo ~]# smartctl --all /dev/nvme1n1
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.5.1.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZQLW960HMJP-00003
Serial Number: S35XNX0KA02248
Firmware Version: CXV8601Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 960 197 124 096 [960 GB]
Unallocated NVM Capacity: 0
Controller ID: 2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 960 197 124 096 [960 GB]
Namespace 1 Utilization: 803 477 762 048 [803 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Apr 17 10:10:54 2019 CEST
Firmware Updates (0x17): 3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x000e): Format Frmw_DL NS_Mngmt
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 9.00W - - 0 0 0 0 5 5
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 4096 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 32 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 4 092 023 [2,09 TB]
Data Units Written: 3 044 102 [1,55 TB]
Host Read Commands: 31 971 434
Host Write Commands: 23 782 181
Controller Busy Time: 128
Power Cycles: 25
Power On Hours: 2 114
Unsafe Shutdowns: 20
Media and Data Integrity Errors: 0
Error Information Log Entries: 9
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 32 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 9 0 0x001b 0x4004 0x028 0 0 -
1 8 0 0x001b 0x4004 0x028 0 0 -
2 7 0 0x001b 0x4004 0x028 0 0 -
3 6 0 0x001b 0x4004 0x028 0 0 -
4 5 0 0x001b 0x4004 0x028 0 0 -
5 4 0 0x001b 0x4004 0x028 0 0 -
6 3 0 0x001b 0x4004 0x028 0 0 -
7 2 0 0x001b 0x4004 0x028 0 0 -
8 1 0 0x001b 0x4004 0x028 0 0 -
Here's the smartctl output of another NVMe:
# smartctl -a /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.18.5-arch1-1-ARCH] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: UCS-SDHPCIE 800GB
Serial Number: STM0001BAE33
Firmware Version: KMCCP108
PCI Vendor ID: 0x1c58
PCI Vendor Subsystem ID: 0x1137
IEEE OUI Identifier: 0x000cca
Controller ID: 414
Number of Namespaces: 1
Namespace 1 Size/Capacity: 800,166,076,416 [800 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 000cca 00602ba300
Local Time is: Thu Jun 6 08:06:40 2019 UTC
Firmware Updates (0x09): 4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 15000 15000
1 + 20.00W - - 1 1 1 1 15000 15000
2 + 15.00W - - 2 2 2 2 15000 15000
3 + 10.00W - - 3 3 3 3 15000 15000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 512 8 2
2 - 4096 0 0
3 - 4096 8 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 1,667,851 [853 GB]
Data Units Written: 5,430,405 [2.78 TB]
Host Read Commands: 11,553,415
Host Write Commands: 23,371,696
Controller Busy Time: 89
Power Cycles: 92
Power On Hours: 6,563
Unsafe Shutdowns: 80
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Error Information (NVMe Log 0x01, max 63 entries)
No Errors Logged
And another one:
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.75] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: UCS-SDHPCIE 1.6TB
Serial Number: CJH00100C4C9
Firmware Version: KMCCP105
PCI Vendor ID: 0x1c58
PCI Vendor Subsystem ID: 0x1137
IEEE OUI Identifier: 0x000cca
Controller ID: 415
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,600,321,314,816 [1.60 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Thu Jun 6 04:23:50 2019 EDT
Firmware Updates (0x09): 4 Slots, Slot 1 R/O
Optional Admin Commands (0x0006): Format Frmw_DL
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 25.00W - - 0 0 0 0 15000 15000
1 + 20.00W - - 1 1 1 1 15000 15000
2 + 15.00W - - 2 2 2 2 15000 15000
3 + 10.00W - - 3 3 3 3 15000 15000
4 - 10.00W - - 3 3 3 3 15000 15000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
1 - 512 8 2
2 - 4096 0 0
3 - 4096 8 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 29 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 1,394,509 [713 GB]
Data Units Written: 173,186,724 [88.6 TB]
Host Read Commands: 131,022,884
Host Write Commands: 11,448,977,782
Controller Busy Time: 51,941
Power Cycles: 43
Power On Hours: 18,856
Unsafe Shutdowns: 40
Media and Data Integrity Errors: 0
Error Information Log Entries: 3
Error Information (NVMe Log 0x01, max 63 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 3 1 0x00f7 0xdead 0x000 0 0 0xf7
1 2 1 0x00f2 0xdead 0x000 0 0 0xf2
2 1 1 0x0109 0xdead 0x000 0 0 0x09
Hi, is there any news on this? Can I offer help with something?
@roben I have the code "in my mind" already, but I need a system with an NVMe drive to test it on. Would anyone be willing to give me remote access to such a system? Contact me on https://www.claudiokuenzler.com/about/.
Sorry, I only have company servers available, and I can't provide access to those.
I stumbled upon this, though: https://github.com/thomas-krenn/check_smart_attributes#NVMedevices
It seems to do similar checks and already supports NVMe drives, so maybe it can help to confirm your ideas.
@Rohlik @roben
Please test https://github.com/Napsty/check_smart/blob/nvme/check_smart.pl in the nvme branch.
Thanks! It looks good:
/usr/lib/nagios/plugins/check_nrpe -H xxx -c check_smart_nvme_all
OK: [/dev/nvme0] - Device is clean --- [/dev/nvme1] - Device is clean|
with
command[check_smart_nvme_all]=/usr/local/.../check_smart.pl -g "/dev/nvme[0-9]" -i nvme
It's hard to test the faulty drive case, though, because all of them are working fine.
@roben Thanks for testing. I just pushed another important change (regex adjusted). Can you please test again with the newest version from the nvme branch:
https://raw.githubusercontent.com/Napsty/check_smart/nvme/check_smart.pl
Please also run a check against a single NVMe drive, if you can, to see whether the performance data appears correctly (it worked on the server I got access to).
Here's the single device check:
./check_smart.pl -d /dev/nvme0 -i nvme
OK: Drive KXG60ZNV1T02 TOSHIBA S/N 89CS10Z1T0RM: no SMART errors detected. |Temperature=42 Available_Spare=100 Available_Spare_Threshold=10 Percentage_Used=0 Data_Units_Read=13608073 Data_Units_Written=6004240 Host_Read_Commands=3734157080 Host_Write_Commands=41754653 Controller_Busy_Time=684 Power_Cycles=6 Power_On_Hours=2906 Unsafe_Shutdowns=2 Media_and_Data_Integrity_Errors=0 Error_Information_Log_Entries=0 Warning__Comp._Temperature_Time=0 Critical_Comp._Temperature_Time=0 Temperature_Sensor_1=42
The output for the multi device check was the same as above.