Monitoring S.M.A.R.T. Metrics with Prometheus and PMM

In his excellent blog post, Pavel Trukhanov showed the value of collecting S.M.A.R.T. metrics, so I wondered how hard it would be to enable their collection in Percona Monitoring and Management (PMM).

A quick search led me to the textfile_collector plugin SmartMon, which can be easily integrated with any Prometheus installation.

For PMM, Vadim Yalovets recently showed how to do custom integrations based on textfile_collector.

Let’s put those together:

  1. Ensure you have the smartctl tool installed. It is available in the repositories of most Linux distributions.
  2. Get smartmon.sh and place it in /usr/local/bin or another location.
  3. Install the cron job:
    echo "*/5 * * * * root bash /usr/local/bin/smartmon.sh > /tmp/smart_metrics.prom" > /etc/cron.d/smartmon
  4. Enable textfile_collector as described in this blog post.
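
One caveat with redirecting output straight into the .prom file: the collector may read the file while it is only half written. A common way around this is to write to a temporary file in the same directory and then rename it into place. Here is a minimal Python sketch of that idea. It is not part of smartmon.sh or PMM, and the helper names write_atomically and collect are my own.

```python
import os
import subprocess
import tempfile


def write_atomically(target_path: str, text: str) -> None:
    """Write text to target_path so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # The temporary file must live in the same directory (same filesystem)
    # as the target, otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".smartmon-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # mkstemp creates the file with mode 0600; relax it so an exporter
        # running as a different user can still read the metrics.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target_path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def collect(command: list[str], target_path: str) -> None:
    """Run a collector command and publish its stdout atomically."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    write_atomically(target_path, result.stdout)
```

With this in place, the cron job could call collect(["bash", "/usr/local/bin/smartmon.sh"], "/tmp/smart_metrics.prom") instead of using a shell redirect. The temporary file deliberately does not end in .prom, so the textfile collector ignores it until the rename.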

That’s it! You should get your data flowing. Now you can use Prometheus to query device information:

[Screenshot: using Prometheus to query device information]

Or, if you want to get a specific S.M.A.R.T. value, such as the media_wearout indicator:

[Screenshot: querying a specific S.M.A.R.T. value, the wearout indicator]
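
If you would rather run such queries from a script than from the web UI, the Prometheus HTTP API exposes instant queries at /api/v1/query. The following is a rough Python sketch under a few assumptions: the base URL http://localhost:9090 is hypothetical (PMM may serve Prometheus under a different address or path), the helper names are mine, and the sample payload is hand-written in the documented response shape rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: str) -> list[tuple[dict, float]]:
    """Turn an instant-vector response into (labels, value) pairs."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    # Each sample carries its labels in "metric" and [timestamp, "value"] in "value".
    return [(s["metric"], float(s["value"][1])) for s in body["data"]["result"]]


# Hypothetical base URL; adjust to wherever your Prometheus is reachable.
url = build_query_url(
    "http://localhost:9090",
    'smartmon_temperature_celsius_raw_value{disk="/dev/sda"}',
)

# Hand-written example of the response shape, not real server output.
sample_payload = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {
                "__name__": "smartmon_temperature_celsius_raw_value",
                "disk": "/dev/sda",
                "type": "sat",
                "smart_id": "194",
            },
            "value": [1535666337, "31"],
        }],
    },
})

for labels, value in parse_instant_vector(sample_payload):
    print(labels["disk"], value)
```

To query a live server, you would fetch url (for example with urllib.request.urlopen) and pass the response body to parse_instant_vector.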

If you would like a nicer visualization in Grafana, you can install the appropriate dashboard from the Grafana website.

[Screenshot: S.M.A.R.T. metrics visualized using Grafana]

The number and kind of metrics you get depend on the storage device vendor and model. Here is an example list from one of my test systems:

# HELP smartmon_smartctl_version SMART metric smartctl_version
# TYPE smartmon_smartctl_version gauge
smartmon_smartctl_version{version="6.5"} 1
# HELP smartmon_current_pending_sector_raw_value SMART metric current_pending_sector_raw_value
# TYPE smartmon_current_pending_sector_raw_value gauge
smartmon_current_pending_sector_raw_value{disk="/dev/sda",type="sat",smart_id="197"} 0.000000e+00
# HELP smartmon_current_pending_sector_threshold SMART metric current_pending_sector_threshold
# TYPE smartmon_current_pending_sector_threshold gauge
smartmon_current_pending_sector_threshold{disk="/dev/sda",type="sat",smart_id="197"} 0
# HELP smartmon_current_pending_sector_value SMART metric current_pending_sector_value
# TYPE smartmon_current_pending_sector_value gauge
smartmon_current_pending_sector_value{disk="/dev/sda",type="sat",smart_id="197"} 100
# HELP smartmon_current_pending_sector_worst SMART metric current_pending_sector_worst
# TYPE smartmon_current_pending_sector_worst gauge
smartmon_current_pending_sector_worst{disk="/dev/sda",type="sat",smart_id="197"} 100
# HELP smartmon_device_info SMART metric device_info
# TYPE smartmon_device_info gauge
smartmon_device_info{disk="/dev/sda",type="sat",vendor="",product="",revision="",lun_id="",model_family="",device_model="Crucial_CT275MX300SSD1",serial_number="16431465B53F",firmware_version="M0CR031"} 1
# HELP smartmon_device_smart_available SMART metric device_smart_available
# TYPE smartmon_device_smart_available gauge
smartmon_device_smart_available{disk="/dev/sda",type="sat"} 1
# HELP smartmon_device_smart_enabled SMART metric device_smart_enabled
# TYPE smartmon_device_smart_enabled gauge
smartmon_device_smart_enabled{disk="/dev/sda",type="sat"} 1
# HELP smartmon_device_smart_healthy SMART metric device_smart_healthy
# TYPE smartmon_device_smart_healthy gauge
smartmon_device_smart_healthy{disk="/dev/sda",type="sat"} 1
# HELP smartmon_end_to_end_error_raw_value SMART metric end_to_end_error_raw_value
# TYPE smartmon_end_to_end_error_raw_value gauge
smartmon_end_to_end_error_raw_value{disk="/dev/sda",type="sat",smart_id="184"} 0.000000e+00
# HELP smartmon_end_to_end_error_threshold SMART metric end_to_end_error_threshold
# TYPE smartmon_end_to_end_error_threshold gauge
smartmon_end_to_end_error_threshold{disk="/dev/sda",type="sat",smart_id="184"} 0
# HELP smartmon_end_to_end_error_value SMART metric end_to_end_error_value
# TYPE smartmon_end_to_end_error_value gauge
smartmon_end_to_end_error_value{disk="/dev/sda",type="sat",smart_id="184"} 100
# HELP smartmon_end_to_end_error_worst SMART metric end_to_end_error_worst
# TYPE smartmon_end_to_end_error_worst gauge
smartmon_end_to_end_error_worst{disk="/dev/sda",type="sat",smart_id="184"} 100
# HELP smartmon_offline_uncorrectable_raw_value SMART metric offline_uncorrectable_raw_value
# TYPE smartmon_offline_uncorrectable_raw_value gauge
smartmon_offline_uncorrectable_raw_value{disk="/dev/sda",type="sat",smart_id="198"} 0.000000e+00
# HELP smartmon_offline_uncorrectable_threshold SMART metric offline_uncorrectable_threshold
# TYPE smartmon_offline_uncorrectable_threshold gauge
smartmon_offline_uncorrectable_threshold{disk="/dev/sda",type="sat",smart_id="198"} 0
# HELP smartmon_offline_uncorrectable_value SMART metric offline_uncorrectable_value
# TYPE smartmon_offline_uncorrectable_value gauge
smartmon_offline_uncorrectable_value{disk="/dev/sda",type="sat",smart_id="198"} 100
# HELP smartmon_offline_uncorrectable_worst SMART metric offline_uncorrectable_worst
# TYPE smartmon_offline_uncorrectable_worst gauge
smartmon_offline_uncorrectable_worst{disk="/dev/sda",type="sat",smart_id="198"} 100
# HELP smartmon_power_cycle_count_raw_value SMART metric power_cycle_count_raw_value
# TYPE smartmon_power_cycle_count_raw_value gauge
smartmon_power_cycle_count_raw_value{disk="/dev/sda",type="sat",smart_id="12"} 2.000000e+01
# HELP smartmon_power_cycle_count_threshold SMART metric power_cycle_count_threshold
# TYPE smartmon_power_cycle_count_threshold gauge
smartmon_power_cycle_count_threshold{disk="/dev/sda",type="sat",smart_id="12"} 0
# HELP smartmon_power_cycle_count_value SMART metric power_cycle_count_value
# TYPE smartmon_power_cycle_count_value gauge
smartmon_power_cycle_count_value{disk="/dev/sda",type="sat",smart_id="12"} 100
# HELP smartmon_power_cycle_count_worst SMART metric power_cycle_count_worst
# TYPE smartmon_power_cycle_count_worst gauge
smartmon_power_cycle_count_worst{disk="/dev/sda",type="sat",smart_id="12"} 100
# HELP smartmon_power_on_hours_raw_value SMART metric power_on_hours_raw_value
# TYPE smartmon_power_on_hours_raw_value gauge
smartmon_power_on_hours_raw_value{disk="/dev/sda",type="sat",smart_id="9"} 1.313300e+04
# HELP smartmon_power_on_hours_threshold SMART metric power_on_hours_threshold
# TYPE smartmon_power_on_hours_threshold gauge
smartmon_power_on_hours_threshold{disk="/dev/sda",type="sat",smart_id="9"} 0
# HELP smartmon_power_on_hours_value SMART metric power_on_hours_value
# TYPE smartmon_power_on_hours_value gauge
smartmon_power_on_hours_value{disk="/dev/sda",type="sat",smart_id="9"} 100
# HELP smartmon_power_on_hours_worst SMART metric power_on_hours_worst
# TYPE smartmon_power_on_hours_worst gauge
smartmon_power_on_hours_worst{disk="/dev/sda",type="sat",smart_id="9"} 100
# HELP smartmon_raw_read_error_rate_raw_value SMART metric raw_read_error_rate_raw_value
# TYPE smartmon_raw_read_error_rate_raw_value gauge
smartmon_raw_read_error_rate_raw_value{disk="/dev/sda",type="sat",smart_id="1"} 0.000000e+00
# HELP smartmon_raw_read_error_rate_threshold SMART metric raw_read_error_rate_threshold
# TYPE smartmon_raw_read_error_rate_threshold gauge
smartmon_raw_read_error_rate_threshold{disk="/dev/sda",type="sat",smart_id="1"} 0
# HELP smartmon_raw_read_error_rate_value SMART metric raw_read_error_rate_value
# TYPE smartmon_raw_read_error_rate_value gauge
smartmon_raw_read_error_rate_value{disk="/dev/sda",type="sat",smart_id="1"} 100
# HELP smartmon_raw_read_error_rate_worst SMART metric raw_read_error_rate_worst
# TYPE smartmon_raw_read_error_rate_worst gauge
smartmon_raw_read_error_rate_worst{disk="/dev/sda",type="sat",smart_id="1"} 100
# HELP smartmon_reallocated_sector_ct_raw_value SMART metric reallocated_sector_ct_raw_value
# TYPE smartmon_reallocated_sector_ct_raw_value gauge
smartmon_reallocated_sector_ct_raw_value{disk="/dev/sda",type="sat",smart_id="5"} 0.000000e+00
# HELP smartmon_reallocated_sector_ct_threshold SMART metric reallocated_sector_ct_threshold
# TYPE smartmon_reallocated_sector_ct_threshold gauge
smartmon_reallocated_sector_ct_threshold{disk="/dev/sda",type="sat",smart_id="5"} 10
# HELP smartmon_reallocated_sector_ct_value SMART metric reallocated_sector_ct_value
# TYPE smartmon_reallocated_sector_ct_value gauge
smartmon_reallocated_sector_ct_value{disk="/dev/sda",type="sat",smart_id="5"} 100
# HELP smartmon_reallocated_sector_ct_worst SMART metric reallocated_sector_ct_worst
# TYPE smartmon_reallocated_sector_ct_worst gauge
smartmon_reallocated_sector_ct_worst{disk="/dev/sda",type="sat",smart_id="5"} 100
# HELP smartmon_reported_uncorrect_raw_value SMART metric reported_uncorrect_raw_value
# TYPE smartmon_reported_uncorrect_raw_value gauge
smartmon_reported_uncorrect_raw_value{disk="/dev/sda",type="sat",smart_id="187"} 0.000000e+00
# HELP smartmon_reported_uncorrect_threshold SMART metric reported_uncorrect_threshold
# TYPE smartmon_reported_uncorrect_threshold gauge
smartmon_reported_uncorrect_threshold{disk="/dev/sda",type="sat",smart_id="187"} 0
# HELP smartmon_reported_uncorrect_value SMART metric reported_uncorrect_value
# TYPE smartmon_reported_uncorrect_value gauge
smartmon_reported_uncorrect_value{disk="/dev/sda",type="sat",smart_id="187"} 100
# HELP smartmon_reported_uncorrect_worst SMART metric reported_uncorrect_worst
# TYPE smartmon_reported_uncorrect_worst gauge
smartmon_reported_uncorrect_worst{disk="/dev/sda",type="sat",smart_id="187"} 100
# HELP smartmon_smartctl_run SMART metric smartctl_run
# TYPE smartmon_smartctl_run gauge
smartmon_smartctl_run{disk="/dev/sda",type="sat"} 1535666337
# HELP smartmon_temperature_celsius_raw_value SMART metric temperature_celsius_raw_value
# TYPE smartmon_temperature_celsius_raw_value gauge
smartmon_temperature_celsius_raw_value{disk="/dev/sda",type="sat",smart_id="194"} 3.100000e+01
# HELP smartmon_temperature_celsius_threshold SMART metric temperature_celsius_threshold
# TYPE smartmon_temperature_celsius_threshold gauge
smartmon_temperature_celsius_threshold{disk="/dev/sda",type="sat",smart_id="194"} 0
# HELP smartmon_temperature_celsius_value SMART metric temperature_celsius_value
# TYPE smartmon_temperature_celsius_value gauge
smartmon_temperature_celsius_value{disk="/dev/sda",type="sat",smart_id="194"} 69
# HELP smartmon_temperature_celsius_worst SMART metric temperature_celsius_worst
# TYPE smartmon_temperature_celsius_worst gauge
smartmon_temperature_celsius_worst{disk="/dev/sda",type="sat",smart_id="194"} 59
# HELP smartmon_udma_crc_error_count_raw_value SMART metric udma_crc_error_count_raw_value
# TYPE smartmon_udma_crc_error_count_raw_value gauge
smartmon_udma_crc_error_count_raw_value{disk="/dev/sda",type="sat",smart_id="199"} 0.000000e+00
# HELP smartmon_udma_crc_error_count_threshold SMART metric udma_crc_error_count_threshold
# TYPE smartmon_udma_crc_error_count_threshold gauge
smartmon_udma_crc_error_count_threshold{disk="/dev/sda",type="sat",smart_id="199"} 0
# HELP smartmon_udma_crc_error_count_value SMART metric udma_crc_error_count_value
# TYPE smartmon_udma_crc_error_count_value gauge
smartmon_udma_crc_error_count_value{disk="/dev/sda",type="sat",smart_id="199"} 100
# HELP smartmon_udma_crc_error_count_worst SMART metric udma_crc_error_count_worst
# TYPE smartmon_udma_crc_error_count_worst gauge
smartmon_udma_crc_error_count_worst{disk="/dev/sda",type="sat",smart_id="199"} 100
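
Each attribute in the list above comes as four series: raw_value, value, worst, and threshold. By the usual S.M.A.R.T. convention, an attribute counts as failed once its normalized value drops to or below its threshold. As a quick illustration of how the output can be consumed, here is a small Python sketch that parses text in this format and lists such attributes. The function names are my own, and the parser is deliberately simplified: it assumes label values contain no "}" or escaped quotes, which holds for the output shown here.

```python
import re
from collections import defaultdict

SAMPLE_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')
# Order matters: "_raw_value" must be checked before the shorter "_value".
SUFFIXES = ("_raw_value", "_threshold", "_worst", "_value")


def parse_samples(text):
    """Yield (metric_name, labels, value) for every labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"]))
            yield match["name"], labels, float(match["value"])


def failing_attributes(text):
    """Return sorted (disk, attribute) pairs whose value is <= threshold."""
    fields = defaultdict(dict)
    for name, labels, value in parse_samples(text):
        if not name.startswith("smartmon_") or "smart_id" not in labels:
            continue  # only per-attribute series carry a smart_id label
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                attribute = name[len("smartmon_"):-len(suffix)]
                fields[(labels.get("disk", ""), attribute)][suffix] = value
                break
    return sorted(
        key for key, f in fields.items()
        if "_value" in f and "_threshold" in f and f["_value"] <= f["_threshold"]
    )


sample = """\
smartmon_reallocated_sector_ct_raw_value{disk="/dev/sda",type="sat",smart_id="5"} 0.000000e+00
smartmon_reallocated_sector_ct_threshold{disk="/dev/sda",type="sat",smart_id="5"} 10
smartmon_reallocated_sector_ct_value{disk="/dev/sda",type="sat",smart_id="5"} 100
"""
print(failing_attributes(sample))  # prints [] because 100 is above the threshold of 10
```

In practice you would more likely express the same comparison as a Prometheus alerting rule; the sketch is only meant to show how the four series for one attribute relate to each other.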

The post Monitoring S.M.A.R.T. Metrics with Prometheus and PMM appeared first on Percona Database Performance Blog.
