ZFS kernel panic
Last night my backup server panicked during a zfs recv command. I'm running Debian bullseye:
Linux 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux
Source: zfs-linux
Version: 2.0.3-9
Feb 10 00:20:41 kernel: [5032630.764473] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764474] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764476] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764480] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764482] Showing stack for process 2333
Feb 10 00:20:41 kernel: [5032630.764482] Showing stack for process 2327
Feb 10 00:20:41 kernel: [5032630.764484] CPU: 3 PID: 2327 Comm: dp_sync_taskq Tainted: P OE 5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764485] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764486] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764496] dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764508] spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764617] ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.764667] zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.764687] dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.764687] VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
Feb 10 00:20:41 kernel: [5032630.764688] PANIC at zfs_quota.c:89:zpl_get_file_info()
Feb 10 00:20:41 kernel: [5032630.764689] Showing stack for process 2328
Feb 10 00:20:41 kernel: [5032630.764710] dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.764715] ? cpumask_next+0x17/0x20
Feb 10 00:20:41 kernel: [5032630.764717] ? __update_idle_core+0x55/0xa0
Feb 10 00:20:41 kernel: [5032630.764721] ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.764724] ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.764728] ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764746] sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.764754] taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.764757] ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.764761] ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.764763] kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.764765] ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.764766] ret_from_fork+0x1f/0x30
Feb 10 00:20:41 kernel: [5032630.764769] CPU: 4 PID: 2328 Comm: dp_sync_taskq Tainted: P OE 5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764773] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764774] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764779] dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764784] spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764818] ? zio_free_sync+0xda/0xf0 [zfs]
Feb 10 00:20:41 kernel: [5032630.764821] ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764825] ? irq_exit_rcu+0x3e/0xc0
Feb 10 00:20:41 kernel: [5032630.764847] ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.764880] zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.764904] dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.764930] dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.764932] ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.764934] ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.764935] ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.764959] sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.764963] taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.764965] ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.764969] ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.764971] kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.764972] ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.764974] ret_from_fork+0x1f/0x30
Feb 10 00:20:41 kernel: [5032630.764977] CPU: 1 PID: 2333 Comm: dp_sync_taskq Tainted: P OE 5.10.0-9-amd64 #1 Debian 5.10.70-1
Feb 10 00:20:41 kernel: [5032630.764979] Hardware name: ASUS System Product Name/ROG STRIX B460-I GAMING, BIOS 0211 03/13/2020
Feb 10 00:20:41 kernel: [5032630.764980] Call Trace:
Feb 10 00:20:41 kernel: [5032630.764982] dump_stack+0x6b/0x83
Feb 10 00:20:41 kernel: [5032630.764986] spl_panic+0xd4/0xfc [spl]
Feb 10 00:20:41 kernel: [5032630.764989] ? __wake_up_common+0x80/0x180
Feb 10 00:20:41 kernel: [5032630.765010] ? dbuf_sync_leaf+0x44f/0x590 [zfs]
Feb 10 00:20:41 kernel: [5032630.765039] zpl_get_file_info+0x9b/0x220 [zfs]
Feb 10 00:20:41 kernel: [5032630.765057] dmu_objset_userquota_get_ids+0x11d/0x490 [zfs]
Feb 10 00:20:41 kernel: [5032630.765077] dnode_sync+0x11a/0xa30 [zfs]
Feb 10 00:20:41 kernel: [5032630.765079] ? __switch_to_asm+0x42/0x70
Feb 10 00:20:41 kernel: [5032630.765081] ? __switch_to+0x114/0x460
Feb 10 00:20:41 kernel: [5032630.765082] ? _cond_resched+0x16/0x40
Feb 10 00:20:41 kernel: [5032630.765100] sync_dnodes_task+0x58/0x130 [zfs]
Feb 10 00:20:41 kernel: [5032630.765104] taskq_thread+0x2da/0x520 [spl]
Feb 10 00:20:41 kernel: [5032630.765105] ? wake_up_q+0xa0/0xa0
Feb 10 00:20:41 kernel: [5032630.765109] ? taskq_thread_spawn+0x50/0x50 [spl]
Feb 10 00:20:41 kernel: [5032630.765110] kthread+0x11b/0x140
Feb 10 00:20:41 kernel: [5032630.765111] ? __kthread_bind_mask+0x60/0x60
Feb 10 00:20:41 kernel: [5032630.765113] ret_from_fork+0x1f/0x30
I found 2 issues with similar stack traces:
After the panic, sshd had lots of defunct processes with ppid 1 and I couldn't ssh to the backup server. The shutdown command did not work either (thanks, systemd), so I pressed the power button. The shutdown command must always work.
root@backup ~ 02-10 08:40> shutdown
Failed to set wall message, ignoring: Connection timed out
Failed to call ScheduleShutdown in logind, no action will be taken: Connection timed out
root@backup ~ 02-10 08:41> echo $?
1
I think that in my case the panic was caused by my Intel NUC, which runs zfs send, sending garbage data.
I fried the CPU and the RAM and I get regular bit flips. This hardware is completely unreliable, and I see checksum errors in the zfs datasets.
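As a quick sanity check on the bit-flip theory, the two numbers printed in the VERIFY3 line can be compared bit by bit. This is a minimal Python sketch that uses only the values from the log above; the variable names are illustrative.

```python
# Values copied from the kernel log line:
# VERIFY3(sa.sa_magic == SA_MAGIC) failed (2576474 == 3100762)
observed = 2576474   # left-hand side: sa.sa_magic as read from the received data
expected = 3100762   # right-hand side: the value the check compares against

# XOR leaves a 1 in every bit position where the two values differ.
diff = observed ^ expected

# Collect the positions (0 = least significant bit) of the differing bits.
flipped_bits = [i for i in range(diff.bit_length()) if (diff >> i) & 1]

print(hex(observed), hex(expected), hex(diff), flipped_bits)
```

The two values are 0x27505a and 0x2f505a, and they differ in exactly one bit (bit 19). That is consistent with a single bit flip somewhere along the way, although on its own it does not show which machine or component flipped it.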
hashtags: #zfs