ZFS File System: Handling a Disk Failure and Replacing the Disk

The storage server is built with an NVMe system disk plus 36 x 14 TB enterprise SAS hard drives.

It started with a monitoring alert reporting a disk failure.

I then logged in to the system to check.

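The check shows that one disk has failed.
At this point, a check with smartctl returns a passing status (SMART Health Status: OK),
which nominally means the disk is healthy. The log is as follows: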

sudo smartctl -a /dev/sdu
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.18.0-477.27.2.el8_8.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: IBM-ESXS
Product: ST14000NM0288 E
Revision: ECH8
Compliance: SPC-5
User Capacity: 13,902,809,137,152 bytes [13.9 TB]
Logical block size: 4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c500a75c161b
Serial number: ZHZ150AG0000C917PFA8
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Wed Feb 21 00:19:25 2024 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Disabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Grown defects during certification
Total blocks reassigned during format
Total new blocks reassigned
Power on minutes since format
Current Drive Temperature: 38 C
Drive Trip Temperature: 65 C
Accumulated power on time, hours:minutes 35206:58
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      523         0       523        547     738215.494           3
write:         0        0         0         0          0     693544.406           0
verify:        0        0         0         0          0      32918.965           0
Non-medium error count: 103
Pending defect count:2 Pending Defects: index, LBA and accumulated_power_on_hours follow
1: 0x235cd6f , 31883
Smartctl: Exception: json.cpp(39): Assertion failed: 'a' <= c && c <= 'z'
Please inform [email protected], including output of smartctl -V.

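SMART Health Status reads OK, but smartctl throws an exception right below it.
Together with the 3 uncorrected read errors and the pending defect count of 2 in the log, this is enough to conclude that the disk has physical bad sectors and must be replaced.
If the disk turns out to have no errors, you can try running zpool clear, which clears the error state so that the disk rejoins the storage pool.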

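Judging from the smartctl output, the drive's SMART Health Status is "OK", which means the drive itself is not currently reporting a health problem. The temperature (38 C, against a trip temperature of 65 C) and the accumulated power-on time are also within the normal range.

One thing to note, however, is that the output also shows ECC errors and non-medium errors. Specifically, the "Errors Corrected by ECC" columns show 523 read errors that were corrected by delayed ECC, and the Non-medium error count is 103.

Corrected errors and non-medium errors on their own do not necessarily mean the disk has bad sectors; they can be caused by transient noise or transport problems. A drive that shows only these and still reports SMART Health Status "OK" can usually stay in service. Once errors have appeared, though, keep monitoring the drive's health, especially if the counts keep growing.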
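
Since the health status alone says OK, the counters in the log are the more useful signal. Below is a minimal sketch, not the author's procedure, that reads smartctl -a text on stdin and flags the drive when any "total uncorrected errors" value or the pending defect count is nonzero. The function name smart_verdict and the "any nonzero value means replace" threshold are assumptions made for illustration, and the parsing only covers the SCSI/SAS layout shown above.

```shell
# Sketch (assumption: SCSI/SAS-style `smartctl -a` text arrives on stdin).
# Prints "REPLACE" if any uncorrected error or pending defect is recorded,
# otherwise "OK". The threshold is illustrative, not a vendor rule.
smart_verdict() {
  awk '
    # Rows of the error counter log: the last field is "total uncorrected errors".
    $1 == "read:" || $1 == "write:" || $1 == "verify:" { uncorrected += $NF }
    # "Pending defect count:2 Pending Defects: ..." -> number after the first colon.
    /^Pending defect count:/ {
      split($0, parts, ":")
      pending = parts[2] + 0
    }
    END {
      if (uncorrected > 0 || pending > 0) print "REPLACE"
      else print "OK"
    }
  '
}
```

Typical use would be: sudo smartctl -a /dev/sdu | smart_verdict. Fed the log above, it reports REPLACE because of the 3 uncorrected read errors and the 2 pending defects.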

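As for the Smartctl: Exception: json.cpp(39): Assertion failed message, it is an exception raised inside smartctl itself, possibly caused by the smartctl version or a related library. You can try updating smartmontools to the latest version, or report the problem to the smartmontools support team.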

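Replacing the disk: hot-swap is generally supported these days, so there is no need to shut down or cut power.
First determine the disk ID (with this many disks, looking them up one by one is not practical). I usually simplify it like this: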
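
The command used at this step did not survive in this copy of the post. As a stand-in, here is a minimal sketch, assuming a Linux system with GNU readlink, that lists the /dev/disk/by-id names pointing at a given device node such as /dev/sdu. The function name by_id_names and its optional second argument are hypothetical and exist only for this example.

```shell
# Sketch: print every by-id symlink name that resolves to the given device node.
# $1 = device node (e.g. /dev/sdu)
# $2 = directory of symlinks (hypothetical parameter, defaults to /dev/disk/by-id)
by_id_names() {
  local dev dir link
  dev=$(readlink -f "$1")
  dir=${2:-/dev/disk/by-id}
  for link in "$dir"/*; do
    # Compare the fully resolved target of each symlink with the device node.
    if [ "$(readlink -f "$link")" = "$dev" ]; then
      basename "$link"
    fi
  done
}
```

Calling by_id_names /dev/sdu prints the persistent names (the scsi-... form and any others present) for that drive, which are the names zpool status uses in this pool.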

Then run:
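
The exact command is also missing from this copy. The usual way to swap a member of a pool is zpool replace with the pool name, the old device and the new device, so the sketch below wraps that call. The wrapper name replace_disk and the RUN switch are assumptions added so the command can be reviewed before it touches the pool.

```shell
# Sketch: dry-run wrapper around `zpool replace <pool> <old-device> <new-device>`.
# By default it only prints the command; set RUN=1 to execute it
# (needs root and a real pool). RUN and replace_disk are hypothetical names.
replace_disk() {
  local pool old new
  pool=$1
  old=$2
  new=$3
  if [ "${RUN:-0}" = "1" ]; then
    zpool replace "$pool" "$old" "$new"
  else
    echo "zpool replace $pool $old $new"
  fi
}
```

With the names that appear in the status output of this pool, the call would look like: replace_disk storpool 13074863735725225853 /dev/disk/by-id/scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2L91E0000C925CKWR, where the first device argument is the GUID of the missing disk and the second is the newly inserted one.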

Then watch the ZFS status:

zpool status
  pool: storpool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Feb 22 09:15:23 2024
        105T scanned at 29.2G/s, 6.33T issued at 1.76G/s, 432T total
        134G resilvered, 1.47% done, 2 days 20:59:16 to go
config:

  NAME                                                        STATE     READ WRITE CKSUM
  storpool                                                    DEGRADED     0     0     0
    raidz2-0                                                  DEGRADED     0     0     0
      scsi-SIBM-ESXS_ST14000NM004G_E_ZTM0B9W40000C2044R23     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM004G_E_ZTM0BB0Y0000C2044P91     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM004G_E_ZTM0G0X20000C213RJP6     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM004G_E_ZTM0LQ0J0000C2233MDD     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ124GJ0000C8471FRW     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ14FVB0000C917PHJR     ONLINE       0     0     0
      replacing-6                                             DEGRADED     0     0     0
        13074863735725225853                                  UNAVAIL      0     0     0  was /dev/disk/by-id/scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ150AG0000C917PFA8-part1
        scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2L91E0000C925CKWR   ONLINE       0     0     0  (resilvering)
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ1F6ZM0000C9192A5W     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ1PA0L0000C9247S4R     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ230R80000C9218F1J     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JZ6Q0000C9342YRY     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ28S8K0000C931KYUK     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2AZA20000C843F6Y3     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2E91F0000C9334G9R     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2ER3D0000C931KZHJ     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2G24J0000C925CJN4     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2J3P00000C9346K15     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2J6AB0000C9346JLP     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JDTT0000C9358NA0     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JJDQ0000C9346011     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JKTH0000C93460L6     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JLF20000C9345ZFV     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JLHW0000C9218L10     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JN2W0000C9334KM6     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JNAW0000C9334JYP     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JNC50000C9334K0M     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JNXT0000C931KZT6     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JPJC0000C9220XG2     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JX1P0000C9355R8W     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JY2C0000C9342Y2F     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JY900000C9218LLZ     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JYAS0000C9218K9B     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JZ660000C9342YNM     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2JZ6J0000C9342YQJ     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2K5FX0000C934304Z     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2KH3C0000C9355NVC     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2KQTZ0000C9362CL2     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2KZT70000C9247VA9     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2L0E50000C9362EFW     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2L5FE0000C9247Q0D     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2L6KC0000C9342Z96     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2L99G0000C922FX4M     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ2LNQF0000C93460MP     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ3E3DE0000C951084R     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ3JWSB0000C00341G0     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ49K3Q0000C0079W6X     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ571CJ0000C011KG7H     ONLINE       0     0     0
      scsi-SIBM-ESXS_ST14000NM0288_E_ZHZ25VRP0000C8471H1H     ONLINE       0     0     0

errors: No known data errors

Now just wait for the resilver to finish. After a few minutes, run zpool status again to see the percentage done and the estimated time remaining.
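
To avoid scrolling through dozens of device lines on every check, the progress line can be pulled out on its own. This is a small sketch that assumes the scan section has the same wording as the output above ("... resilvered, N% done, ... to go"); other OpenZFS versions may phrase it differently, and the function name resilver_progress is hypothetical.

```shell
# Sketch: read `zpool status` text on stdin and print "<percent> <time remaining>".
# Assumes a line of the form "134G resilvered, 1.47% done, 2 days 20:59:16 to go".
resilver_progress() {
  awk '/resilvered,/ && /% done/ {
    # The percentage is the field immediately before the word "done,".
    for (i = 1; i <= NF; i++) if ($i == "done,") pct = $(i - 1)
    # Strip everything up to "done, " and the trailing " to go" to leave the ETA.
    sub(/^.*done, /, "")
    sub(/ to go.*$/, "")
    print pct, $0
  }'
}
```

For example, zpool status storpool | resilver_progress would print 1.47% 2 days 20:59:16 for the state captured above.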

Once it completes, zpool status shows every device as ONLINE, which means all disks are working normally.
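
For a scripted check instead of reading the full status by eye, one option is zpool list -H -o health, which prints one health value per pool. The helper below is a sketch under that assumption; the name all_online is made up for this example.

```shell
# Sketch: succeed (exit 0) only if every health value on stdin is ONLINE.
# Intended input: output of `zpool list -H -o health`, one value per line.
all_online() {
  awk 'NF && $1 != "ONLINE" { bad = 1 } END { exit (bad ? 1 : 0) }'
}
```

A monitoring job could run: zpool list -H -o health | all_online || echo "pool needs attention".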

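When the system is idle, running zpool scrub storpool is recommended.

zpool scrub is a ZFS pool operation that checks and repairs the consistency of the data blocks in the pool. It reads every allocated data block and compares it against its checksum in order to detect data corruption. Where a block fails its checksum and the pool has redundancy (raidz2 in this pool), the scrub tries to repair the block from the redundant data.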
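
ZFS runs only one of scrub or resilver at a time, so it is worth checking the scan state before starting a scrub right after a disk replacement. The guard below is a sketch that looks for the "in progress" wording in zpool status output; the function name scan_in_progress is an assumption for illustration.

```shell
# Sketch: succeed (exit 0) if `zpool status` text on stdin reports a running
# resilver or scrub, based on the "... in progress" wording of the scan line.
scan_in_progress() {
  grep -Eq '(resilver|scrub) in progress'
}

# Example use (commented out so that sourcing this file changes nothing):
# if zpool status storpool | scan_in_progress; then
#   echo "scan already running, try again later"
# else
#   zpool scrub storpool
# fi
```

Run from a scheduled job during off-hours, this keeps the scrub from being requested while the resilver shown earlier is still going.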
