How to check block storage and ZFS pool health
I need to know the health of my disks because I use my computers until they fail. In general, the power supply fails first, then the hard disks, then the RAM and CPU.
When a disk fails, I restore the data from backup. With the ZFS filesystem, I check the integrity of my data and my backups.
Disks
The disk health information is provided by SMART and is displayed with smartctl.
All commands have to be run as root.
To install smartctl, run:
apt-get install smartmontools
Then choose the disk you want to check:
lsblk
#or
ls /dev/disk/by-id/
Then run:
smartctl -a /dev/sdX
Newer disks provide more information like Form_Factor, Head_Flying_Hours, ...
Two very important attributes to check, among others, are Reallocated_Sector_Ct and Current_Pending_Sector. Reallocated_Sector_Ct is the count of sectors on the block device that could not be used correctly and have been remapped: when such a sector is found, it is remapped to one of the available spare sectors of the storage device, and the data it contained is relocated. Current_Pending_Sector, in contrast, is the count of bad sectors that are still waiting to be remapped. A raw value above zero for either attribute, especially one that keeps growing, is a warning sign. If you want to know more about the S.M.A.R.T. attributes and their meaning, the Wikipedia S.M.A.R.T. page is a good place to start.
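The two attributes above can also be pulled out of the attribute table by a script. The following is a minimal sketch, assuming an ATA disk whose smartctl -A table has the usual ten columns with the raw value in the last one. The function names smart_raw_value and check_bad_sectors are made up for this example.

```shell
#!/bin/sh
# Sketch only: the function names are hypothetical, not part of smartmontools.

# Print the raw value (10th column) of the attribute named $1.
# Expects the output of "smartctl -A /dev/sdX" on stdin.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Read "smartctl -A" output on stdin and warn about non-zero counters.
# Returns 0 when both counters are zero or not reported, 1 otherwise.
check_bad_sectors() {
    report=$(cat)
    status=0
    for attr in Reallocated_Sector_Ct Current_Pending_Sector; do
        value=$(printf '%s\n' "$report" | smart_raw_value "$attr")
        if [ -n "$value" ] && [ "$value" -ne 0 ]; then
            echo "WARNING: $attr = $value"
            status=1
        fi
    done
    return "$status"
}
```

It would be used as: smartctl -A /dev/sdX | check_bad_sectors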
smartctl can also be used to start the self-tests:
smartctl -t short /dev/sdX
When the test is finished, the result is shown with the command:
smartctl -a /dev/sdX
For more information about the self-tests, read man smartctl.
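Instead of reading the whole -a report, the self-test log alone can be printed with smartctl -l selftest. Here is a minimal sketch that checks the most recent entry, assuming an ATA disk and the usual log layout where the newest test is the line starting with "# 1". last_selftest_ok is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: last_selftest_ok is a hypothetical helper, not a smartctl feature.

# Read "smartctl -l selftest /dev/sdX" output on stdin.
# Returns 0 if the newest log entry ("# 1") completed without error.
last_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}
```

After starting a test with smartctl -t short /dev/sdX and waiting for the time that smartctl announces, it would be used as: smartctl -l selftest /dev/sdX | last_selftest_ok && echo "self-test passed"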
On my Toshiba NVMe SSD, smartctl doesn't give much information and it is not possible to run self-tests.
ZFS pools
I use ZFS on my disks. To check the health of the file system, I run:
zpool list
zpool scrub myPool
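The zpool scrub command itself returns within a few seconds, even for multiple TB of data, because it only starts a background process that checks the pool. The scrub then keeps running for minutes or hours, depending on the amount of data. Its status is displayed with the command: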
zpool status
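Because the scrub runs in the background, a script has to look at the scan: line of zpool status to know whether it has finished. Here is a minimal sketch, assuming that line reads "scrub in progress ..." while the scrub is running and "scrub repaired ..." once it is done, as in the report further down. scrub_state is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: scrub_state is a hypothetical helper name.

# Read "zpool status <pool>" output on stdin and print one word:
# running, finished, or unknown (no scrub information found).
scrub_state() {
    awk '
        /^ *scan: scrub in progress/ { state = "running" }
        /^ *scan: scrub repaired/    { state = "finished" }
        END { print (state == "" ? "unknown" : state) }
    '
}
```

It would be used as: zpool status myPool | scrub_state. Recent OpenZFS releases also have a zpool wait subcommand (zpool wait -t scrub myPool) that blocks until the scrub is done, if your version ships it.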
I want to run scrub regularly and get an email when my pools are unhealthy, as described in this Server Fault post: how-to-run-a-command-once-a-zfs-scrub-completes
On Debian, I use zed, the ZFS Event Daemon (ZED).
The configuration for ZED is located in /etc/zfs/zed.d/zed.rc
I set my email address and my email program (mutt):
ZED_EMAIL_ADDR="myemail@example.com"
ZED_EMAIL_PROG="mutt"
zed sends an email only when the pool is degraded, like this:
ZFS has finished a scrub:
eid: 23
class: scrub_finish
host: nuc
time: 2022-02-06 18:08:12+0200
pool: rpool
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:05:24 with 2 errors on Sun Feb 6 18:08:12 2022
config:
NAME STATE READ WRITE CKSUM
rpool DEGRADED 0 0 0
ata-WDC_WDS100T1B0A-00H9H0_164710800985-part4 DEGRADED 0 0 2 too many errors
errors: 2 data errors, use '-v' for a list
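If you would rather get a mail after every scrub, not only for unhealthy pools, zed.rc has a verbosity switch. Below is a sketch of the relevant lines, assuming the stock zed.rc shipped with OpenZFS, where ZED_NOTIFY_VERBOSE defaults to 0. Check the comments in your own zed.rc, because the available settings differ between versions.

```shell
# /etc/zfs/zed.d/zed.rc (fragment; the file uses shell variable syntax)
ZED_EMAIL_ADDR="myemail@example.com"
ZED_EMAIL_PROG="mutt"
# 0: suppress the notification when the pool is healthy
# 1: send a notification regardless of pool health
ZED_NOTIFY_VERBOSE=1
```

After editing the file, the daemon has to be restarted. On Debian the systemd unit is normally called zfs-zed, so systemctl restart zfs-zed should do it.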
I also set up a cron job to scrub my pools regularly:
crontab -e
0 1 * * 4 /root/bin/scrub.sh
# scrub.sh:
zpool scrub rpool
zpool scrub bpool
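The pool names do not have to be hard-coded. Here is a minimal sketch of an alternative scrub.sh that scrubs every imported pool, assuming zpool list -H -o name prints one pool name per line. scrub_all_pools is a hypothetical function name, and the PATH line is an assumption about where zpool is installed, since cron runs jobs with a short PATH.

```shell
#!/bin/sh
# Sketch only: scrub_all_pools is a hypothetical name.
# Assumption: zpool lives in one of these directories.
PATH=/usr/local/sbin:/usr/sbin:/sbin:$PATH
export PATH

# Start a scrub on every imported pool; report pools where it could not start.
# Returns 0 if every scrub was started, 1 otherwise.
scrub_all_pools() {
    failed=0
    for pool in $(zpool list -H -o name); do
        if ! zpool scrub "$pool"; then
            echo "could not start scrub on $pool" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Saved as /root/bin/scrub.sh with a final line that calls scrub_all_pools, it can be run from the same crontab entry.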
On FreeBSD, I set up two cron jobs.
These jobs are set up in the root crontab.
The first cron job scrubs the pools and the second job checks the string returned by zpool status -x, which should be:
pool poolName is healthy.
When this string is not found, a mail is sent.
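A check of this kind can be sketched as follows. This is a minimal version of the idea, assuming zpool status -x prints a single "healthy" line when nothing is wrong and that mail(1) is available. MAIL_TO and the function names are placeholders.

```shell
#!/bin/sh
# Sketch only: MAIL_TO, is_healthy_report and check_pools are placeholder names.
MAIL_TO="myemail@example.com"

# Return 0 when stdin contains the one-line "healthy" message of "zpool status -x".
is_healthy_report() {
    grep -Eq '^(all pools are healthy|pool .+ is healthy)\.?$'
}

# Mail the full report when the healthy message is missing.
check_pools() {
    report=$(zpool status -x 2>&1)
    if ! printf '%s\n' "$report" | is_healthy_report; then
        printf '%s\n' "$report" | mail -s "ZFS problem on $(hostname)" "$MAIL_TO"
    fi
}
```

The pattern accepts both the message for a single named pool and the "all pools are healthy" message that is printed when no pool name is given. A script ending with a call to check_pools can then be run from the root crontab.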
hashtags: #zfs