How to check block storage and ZFS pool health

I need to know the health of my disks because I use my computers until they fail. In general, the power supply fails first, then the hard disks, then the RAM and CPU.

When a disk fails, I restore the data from backup. With the ZFS filesystem, I check the integrity of my data and my backups.

Disks

The disk health information is provided by SMART and is displayed with smartctl. All commands have to be run as root.

To install smartctl, run:

apt-get install smartmontools

Then choose the disk you want to check:

lsblk
#or
ls /dev/disk/by-id/

Then run:

smartctl -a /dev/sdX
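
To check every disk instead of a single /dev/sdX, the device list can be built from lsblk. This is only a minimal sketch: the helper name disk_devices is made up for illustration, and it assumes lsblk is called with -d (whole devices only), -n (no header line) and -o NAME,TYPE.

```shell
# disk_devices: hypothetical helper. Reads "NAME TYPE" lines on stdin
# (as printed by `lsblk -dn -o NAME,TYPE`) and prints the /dev path of
# every whole disk, skipping loop devices, CD-ROM drives, etc.
disk_devices() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# Example (as root):
#   lsblk -dn -o NAME,TYPE | disk_devices | while read -r dev; do
#       smartctl -a "$dev"
#   done
```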

Newer disks report additional attributes such as Form_Factor and Head_Flying_Hours.

Among the most important attributes to check are Reallocated_Sector_Ct and Current_Pending_Sector. Reallocated_Sector_Ct counts the sectors that the drive found to be bad and remapped to one of its spare sectors, relocating the data they contained. Current_Pending_Sector counts the bad sectors that are still waiting to be remapped. To learn more about the S.M.A.R.T. attributes and their meaning, the Wikipedia S.M.A.R.T. page is a good starting point.
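
These two counters can be pulled out of the attribute table with a few lines of shell. The following is a sketch under assumptions, not a definitive check: the function name bad_sectors is invented here, and it assumes the ATA attribute table printed by smartctl -A, where the attribute name is the 2nd column and the raw value starts in the 10th.

```shell
# bad_sectors: hypothetical helper. Reads `smartctl -A` output on stdin,
# prints NAME=RAW_VALUE for the two attributes, and returns 1 if either
# raw value is greater than zero (0 otherwise).
bad_sectors() {
    awk '
        BEGIN { bad = 0 }
        $2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
            print $2 "=" $10
            if ($10 + 0 > 0) bad = 1
        }
        END { exit bad }
    '
}

# Example (as root):
#   smartctl -A /dev/sdX | bad_sectors || echo "sdX has bad sectors"
```

A non-zero count does not always mean the disk is about to die, but a value that keeps growing between checks is a good reason to verify the backups.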

smartctl can also be used to start the self-tests:

smartctl -t short /dev/sdX

When the test is finished, the result is shown with the command:

smartctl -a /dev/sdX

For more information about the self-tests, read man smartctl.
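
To look only at the self-test results instead of the full -a output, smartctl -l selftest prints just the self-test log. The sketch below checks the newest entry; the helper name last_selftest_ok is made up, and it assumes the ATA log layout in which the most recent test is the line numbered "# 1".

```shell
# last_selftest_ok: hypothetical helper. Reads `smartctl -l selftest`
# output on stdin and succeeds only if the newest log entry ("# 1")
# reports "Completed without error".
last_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Example (as root):
#   smartctl -l selftest /dev/sdX | last_selftest_ok || echo "self-test failed"
```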

On my Toshiba NVMe SSD, smartctl doesn't give much information and it is not possible to run self-tests.

ZFS pools

I use ZFS on my disks. To check the health of the file system, I run:

zpool list
zpool scrub myPool

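The scrub command returns within a few seconds, even for multiple TB of data, because zpool scrub only starts a background process that checks the pool. The scrub itself takes longer (minutes to hours, depending on the amount of data). Its status is displayed with the command: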

zpool status
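
For scripts, the health column is easier to parse than the full status report. A minimal sketch, assuming the tab-separated output of zpool list -H -o name,health; the function name unhealthy_pools is made up for illustration.

```shell
# unhealthy_pools: hypothetical helper. Reads "name<TAB>health" lines on
# stdin (as printed by `zpool list -H -o name,health`), prints every pool
# whose health is not ONLINE, and returns 1 if it found at least one.
unhealthy_pools() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NF >= 2 && $2 != "ONLINE" { print $1 ": " $2; bad = 1 }
        END { exit bad }
    '
}

# Example:
#   zpool list -H -o name,health | unhealthy_pools
```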

I want to run scrub regularly and get an email when my pools are unhealthy, as described in this Server Fault post: how-to-run-a-command-once-a-zfs-scrub-completes

On Debian, I use zed

The configuration for ZED (the ZFS Event Daemon) is located in /etc/zfs/zed.d/zed.rc

I set my email address and my email program (mutt):

ZED_EMAIL_ADDR="myemail@example.com"
ZED_EMAIL_PROG="mutt"

zed sends an email only when the pool is degraded, like this:

ZFS has finished a scrub:

   eid: 23
 class: scrub_finish
  host: nuc
  time: 2022-02-06 18:08:12+0200
  pool: rpool
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:05:24 with 2 errors on Sun Feb  6 18:08:12 2022
config:

        NAME                                             STATE     READ WRITE CKSUM
        rpool                                            DEGRADED     0     0     0
          ata-WDC_WDS100T1B0A-00H9H0_164710800985-part4  DEGRADED     0     0     2  too many errors

errors: 2 data errors, use '-v' for a list
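
Whether a mail is also sent for a healthy pool is controlled in the same zed.rc by ZED_NOTIFY_VERBOSE. The fragment below is only a sketch of that setting, assuming a zed.rc that ships with this option (check the comments in your own file); 0 is the value that matches the behaviour described above.

```shell
# /etc/zfs/zed.d/zed.rc (fragment)
# 0: suppress the notification if the pool is healthy
# 1: send the notification regardless of pool health
ZED_NOTIFY_VERBOSE=0
```

Setting it to 1 for one scrub is a simple way to confirm that mail delivery works at all.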

I also set up a cron job that scrubs my pools every Thursday at 01:00:

crontab -e
0 1 * * 4 /root/bin/scrub.sh

# scrub.sh:
zpool scrub rpool
zpool scrub bpool
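
If pools are added or renamed, the script above has to be edited. As an alternative sketch (not the script I run), the pool names can be read from zpool list -H -o name. The function name scrub_pools and the ZPOOL_CMD variable are made up here; ZPOOL_CMD only exists so the loop can be tried without starting a real scrub.

```shell
# scrub_pools: hypothetical helper. Reads one pool name per line on stdin
# and starts a scrub on each. ZPOOL_CMD is an illustrative override for
# dry runs; it defaults to the real zpool command.
scrub_pools() {
    while IFS= read -r pool; do
        [ -n "$pool" ] || continue          # skip empty lines
        ${ZPOOL_CMD:-zpool} scrub "$pool"
    done
}

# Example (as root):
#   zpool list -H -o name | scrub_pools
# Dry run that only prints what would be executed:
#   ZPOOL_CMD="echo zpool"; zpool list -H -o name | scrub_pools
```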

On FreeBSD, I set up two cron jobs

These jobs are set up in the root crontab.

The first cron job scrubs the pools and the second job checks the string returned by zpool status -x poolName, which should be:

pool 'poolName' is healthy

When this string is not found, a mail is sent.
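
The second job can be sketched as follows. This is an illustration under assumptions rather than my exact script: the function name status_is_healthy is invented, the healthy output is assumed to be either "all pools are healthy" (no pool named) or "pool 'NAME' is healthy" (pool named), and the mail command and the root recipient in the example are placeholders.

```shell
# status_is_healthy: hypothetical helper. Reads `zpool status -x` output
# on stdin and succeeds only if it contains one of the two "healthy"
# lines; any other output (a full status report) counts as unhealthy.
status_is_healthy() {
    grep -Eq "^(all pools are healthy|pool '.*' is healthy)$"
}

# Example for the root crontab script (recipient is a placeholder):
#   if ! zpool status -x | status_is_healthy; then
#       zpool status -x | mail -s "ZFS problem on $(hostname)" root
#   fi
```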

hashtags: #zfs