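Yesterday I ran into a very strange problem.
Monitoring raised an alarm saying the RAID array was degraded:
Adapter 0 VirtualDrive 0 state is Degraded
This server runs a RAID 6 built from 24 x 10 TB disks on an LSI hardware RAID controller,
so it tolerates at most 2 failed disks.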
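As a quick sanity check on those numbers, here is a minimal sketch of the nominal RAID 6 arithmetic. The function names are made up for illustration, and the figures ignore filesystem and controller overhead.

```python
def raid6_usable_tb(num_disks: int, disk_tb: float) -> float:
    """Nominal usable capacity: two disks' worth of space goes to parity."""
    if num_disks <= 2:
        raise ValueError("RAID 6 needs more than 2 disks")
    return (num_disks - 2) * disk_tb


def remaining_tolerance(expected_disks: int, present_disks: int) -> int:
    """How many more disks may be lost before the array fails (RAID 6 tolerates 2)."""
    lost = expected_disks - present_disks
    return max(0, 2 - lost)


if __name__ == "__main__":
    print(raid6_usable_tb(24, 10))      # nominal usable TB for 24 x 10 TB
    print(remaining_tolerance(24, 23))  # margin left with one disk gone
```

With one disk gone, the array is Degraded but can still survive one more loss, which is why the alarm deserves attention even though no data has been lost yet.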
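After the alarm I started checking the server.
The system log showed no errors at all,
and nothing reported a disk error, a disk failure, or a disk going offline.
I checked every disk first: all of them were Online, with no problems.
I spent a long time on this.
Finally I checked the controller firmware log with "MegaCli64 -FwTermLog -Dsply -aALL" and found: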
02/06/25 9:00:12: C0:iopiEvent: EVENT_SAS_DISCOVERY
02/06/25 9:00:12: C0:DM_HandleDiscEvent: Discovery started on Port 0
02/06/25 9:00:12: C0:DevId [110] Reduce Queue Depth recursive retry: maxQDepth 1 : maxDepthChanged 0 : curQDepth 0
02/06/25 9:00:12: C0:iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
02/06/25 9:00:12: C0: DM_HandleDevStatusChgEvent: devHandle=x0014 SASAdd=5000cca2663da381 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x00000000 IOCStatus x0000 ReasonCode x0e - INTERNAL_DEVICE_RESET complete
02/06/25 9:00:12: C0:iopiEvent: MPI2_EVENT_SAS_TOPOLOGY_CHANGE_LIST
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt: PhysicalPort=0 NumberOfPhys=x33 NumEntries=x29 StartPhy=xa
02/06/25 9:00:12: C0:ExpStatus=x03 PhysicalPort=0 EnclosureHandle=x0002 Expander devHandle=x0009 - Exp RESPONDING
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 14 has no matching devId
02/06/25 9:00:12: C0:Phy changed - phy 0a devHandle 0014 linkRate bb curLinkRate b
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt: curr_lr=0xb, prev_lr=0xb, DM_DevMgrIsReady=1 devId=0xffff
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 0 has no matching devId
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 0 has no matching devId
...
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 1 has no matching devId
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 1 has no matching devId
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 1 has no matching devId
...
02/06/25 9:00:12: C0:Phy changed - phy 2b devHandle 0000 linkRate 00 curLinkRate 0
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt: curr_lr=0x0, prev_lr=0x0, DM_DevMgrIsReady=1 devId=0xffff
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 0 has no matching devId
02/06/25 9:00:12: C0:Phy changed - phy 2c devHandle 0000 linkRate 00 curLinkRate 0
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt: curr_lr=0x0, prev_lr=0x0, DM_DevMgrIsReady=1 devId=0xffff
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 0 has no matching devId
02/06/25 9:00:12: C0:Phy changed - phy 2d devHandle 0000 linkRate 00 curLinkRate 0
The log kept looping on SASAdd=5000cca2663da381: the physical link was unstable, coming and going.
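Spotting a flapping device by eye in a long firmware log is tedious. The sketch below counts device status change events per SAS address. It assumes the log lines look like the DM_HandleDevStatusChgEvent line quoted above; the function name and the sample text are only for illustration, and other firmware versions may format these lines differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   C0: DM_HandleDevStatusChgEvent: devHandle=x0014 SASAdd=5000cca2663da381 ...
STATUS_RE = re.compile(r"DM_HandleDevStatusChgEvent:.*?SASAdd=([0-9a-fA-F]+)")


def count_status_changes(log_text: str) -> Counter:
    """Count device status change events per SAS address."""
    counts = Counter()
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


sample = """\
02/06/25 9:00:12: C0:iopiEvent: EVENT_SAS_DEVICE_STATUS_CHANGE
02/06/25 9:00:12: C0: DM_HandleDevStatusChgEvent: devHandle=x0014 SASAdd=5000cca2663da381 TaskTag=xffff ASC=x00 ASCQ=x00 IOCLogInfo x00000000 IOCStatus x0000 ReasonCode x0e - INTERNAL_DEVICE_RESET complete
02/06/25 9:00:12: C0:DM_HandleTopologyChgEvnt devH 14 has no matching devId
"""

if __name__ == "__main__":
    # Addresses with the most events are the likeliest suspects.
    for address, events in count_status_changes(sample).most_common():
        print(address, events)
```

In practice the input would be the saved output of "MegaCli64 -FwTermLog -Dsply -aALL"; an address that shows up far more often than the others is the one to look at first.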
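I then tried searching for this SAS disk address and got "not found".
In other words, this disk had disappeared from the server.
Only then did I realize the disk had burned out. Normally, whether a disk is failing or completely dead, the controller should still report a state such as Missing or Offline.
This disk was spooky, though: it simply vanished without leaving any information.
Then I checked the disk count, and sure enough,
MegaCLI -AdpAllInfo -aAll printed: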
Device Present
================
Virtual Drives : 1
Degraded : 1
Offline : 0
Physical Devices : 26
Disks : 23
Critical Disks : 0
Failed Disks : 0
The reported disk count had dropped to 23, one short of the 24 disks in the array.
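Since the controller never marked this disk as Failed, a check that compares the disk count against the expected total would have caught it sooner. The following is a rough sketch, assuming the input is the "Device Present" section in the format shown above; EXPECTED_DISKS and the function names are assumptions for illustration, not part of MegaCLI.

```python
import re

EXPECTED_DISKS = 24  # assumption: the disk total of the array in this post

# Matches "Name : number" lines, with or without leading indentation.
COUNTER_RE = re.compile(r"\s*([A-Za-z ]+?)\s*:\s*(\d+)\s*$")


def parse_device_present(text: str) -> dict:
    """Extract the 'Name : number' counters from the Device Present section."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def missing_disks(text: str, expected: int = EXPECTED_DISKS) -> int:
    """Return how many disks are absent compared with the expected total."""
    return expected - parse_device_present(text)["Disks"]


sample = """\
Device Present
================
Virtual Drives : 1
Degraded : 1
Offline : 0
Physical Devices : 26
Disks : 23
Critical Disks : 0
Failed Disks : 0
"""

if __name__ == "__main__":
    counters = parse_device_present(sample)
    # Failed Disks is 0 here, so only the count comparison reveals the problem.
    print("failed:", counters["Failed Disks"], "missing:", missing_disks(sample))
```

The point of the design is that it does not rely on the Failed or Critical counters at all, which both stayed at 0 in this incident.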
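So the cause was finally found.
I notified the on-duty staff, who located the disk that had stopped spinning and swapped in a new one.
The array rebuilds automatically; it took more than 10 hours for the rebuild to finish.
With that, the array was back to full health.
This is truly the first time I have seen this. Normally:
when an array is built, the total number of disks is known;
when a disk is about to fail, the array raises an alarm and asks for a replacement;
when a disk has failed, the array also raises an alarm, reports the disk failure, takes the disk Offline, and marks it Failed.
Here one disk simply went missing with no information at all, which I had never seen before.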