Intel RAID error messages - how to predict a disk failure in advance...

The_Unbeliever

This morning at around 08:46 we lost a HDD in our mailserver's RAID.

Luckily I had a spare. A quick >chunka<>chunka< and a rebuild later, all is well, and I can relax once more.

Of course we do have backups. ;)

Anyways.

I had a shufty at the Intel RAID Management log, and picked up this interesting titbit :

(Note that you have to read the log from the bottom up)

Code:
ID = 113
SEQUENCE NUMBER = 107886
TIME = 03-07-2011 09:00:27
LOCALIZED MESSAGE = Controller ID:  0   Unexpected sense:   PD       =   Ports 0-3:1:1,   CDB   =    0x1a  0x08  0x19  0x00  0xff  0x00     ,   Sense   =    0x70  0x00  0x05  0x00  0x00  0x00  0x00  0x0a  0x00  0x00  0x00  0x00  0x24  0x00  0x00  0x00  0x00  0x00

There were a couple of these at the same time on the same date. Quite nice to see that the RAID controller tried to correct the error by itself.
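For anyone who wants to know what those hex strings actually say, here is a rough Python sketch that decodes one "Unexpected sense" message. It assumes the sense bytes follow the standard SCSI fixed-format layout (sense key in the low nibble of byte 2, LBA in bytes 3-6 when the VALID bit is set, ASC/ASCQ in bytes 12-13). The function names and lookup tables are my own and only cover the values seen in this thread. If I'm reading the layout right, the 03-07 entry is an ILLEGAL REQUEST against a MODE SENSE(6) command, which on its own says little about the disk surface, while the 14-07 entries further down are MEDIUM ERROR / unrecovered read errors. The medium errors are the ones to watch for.

```python
import re

# Lookup tables limited to the values that show up in this thread.
OPCODES = {0x1A: "MODE SENSE(6)", 0x28: "READ(10)", 0x2E: "WRITE AND VERIFY(10)"}
SENSE_KEYS = {0x03: "MEDIUM ERROR", 0x05: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x11, 0x00): "UNRECOVERED READ ERROR",
    (0x24, 0x00): "INVALID FIELD IN CDB",
}


def hex_bytes(text):
    """Turn '0x28  0x00 ...' into a list of ints."""
    return [int(b, 16) for b in re.findall(r"0x([0-9a-fA-F]{2})", text)]


def decode_unexpected_sense(message):
    """Decode the CDB and fixed-format sense bytes of one 'Unexpected sense' message."""
    m = re.search(r"CDB\s*=(.*?)Sense\s*=(.*)", message)
    if m is None:
        return None
    cdb, sense = hex_bytes(m.group(1)), hex_bytes(m.group(2))
    # Fixed-format sense data uses response code 0x70 or 0x71 (top bit is VALID).
    if not cdb or len(sense) < 14 or (sense[0] & 0x7F) not in (0x70, 0x71):
        return None
    key = sense[2] & 0x0F
    code = (sense[12], sense[13])
    info_valid = bool(sense[0] & 0x80)
    return {
        "opcode": OPCODES.get(cdb[0], hex(cdb[0])),
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "asc_ascq": ASC_ASCQ.get(code, "0x%02x/0x%02x" % code),
        # INFORMATION field (bytes 3-6) holds the LBA when VALID is set.
        "lba": int.from_bytes(bytes(sense[3:7]), "big") if info_valid else None,
    }


# One of the 14-07 messages from this thread, whitespace squeezed.
line = ("Unexpected sense: PD = Ports 0-3:1:1, CDB = 0x28 0x00 0x17 0x05 0x4e 0x00 "
        "0x00 0x00 0x80 0x00 , Sense = 0xf0 0x00 0x03 0x17 0x05 0x4e 0x63 0x0a "
        "0x00 0x00 0x00 0x00 0x11 0x00 0x00 0x00 0x00 0x00")
print(decode_unexpected_sense(line))
```

The decoded LBA matches the "Location 0x17054e63" that the consistency check reports for the same drive.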

<doomsday voice> So if you start seeing lots of these types of errors in your log files, then know that the end is near. </doomsday voice>

Then, this morning disaster! Panic! Chaos! :

Code:
ID = 61
SEQUENCE NUMBER = 109048
TIME = 14-07-2011 08:46:32
LOCALIZED MESSAGE = Controller ID:  0   Consistency Check failed on VD:       0

ID = 114
SEQUENCE NUMBER = 109047
TIME = 14-07-2011 08:46:32
LOCALIZED MESSAGE = Controller ID:  0   State change:   PD       =   Ports 0-3:1:1  Previous   =   Online      Current   =   Failed

ID = 251
SEQUENCE NUMBER = 109046
TIME = 14-07-2011 08:46:32
LOCALIZED MESSAGE = Controller ID:  0  VD is now DEGRADED   VD       0

ID = 81
SEQUENCE NUMBER = 109045
TIME = 14-07-2011 08:46:32
LOCALIZED MESSAGE = Controller ID:  0   State change on VD:   0      Previous   =   Optimal  Current   =       Degraded

ID = 87
SEQUENCE NUMBER = 109044
TIME = 14-07-2011 08:46:32
LOCALIZED MESSAGE = Controller ID:  0   Error:   Ports 0-3:1:1      ( Error   2)

ID = 108
SEQUENCE NUMBER = 109043
TIME = 14-07-2011 08:46:32
LOCALIZED MESSAGE = Controller ID:  0   Reassign write operation failed:   PD   Ports 0-3:1:1      Location   0x5170000

ID = 113
SEQUENCE NUMBER = 109042
TIME = 14-07-2011 08:46:31
LOCALIZED MESSAGE = Controller ID:  0   Unexpected sense:   PD       =   Ports 0-3:1:1,   CDB   =    0x2e  0x00  0x17  0x05  0x4e  0x00  0x00  0x00  0x80  0x00     ,   Sense   =    0xf0  0x00  0x03  0x17  0x05  0x4e  0x63  0x0a  0x00  0x00  0x00  0x00  0x11  0x00  0x00  0x00  0x00  0x00 

ID = 63
SEQUENCE NUMBER = 109041
TIME = 14-07-2011 08:46:28
LOCALIZED MESSAGE = Controller ID:  0   Consistency Check found inconsistent parity on VD     strip:       ( VD   =   0,   strip       =   0x5c1538)

ID = 57
SEQUENCE NUMBER = 109040
TIME = 14-07-2011 08:46:28
LOCALIZED MESSAGE = Controller ID:  0   Consistency Check corrected medium error:       ( VD   0  Location   0x17054e63,       PD   Ports 0-3:1:1  Location   0x17054e63)

ID = 113
SEQUENCE NUMBER = 109039
TIME = 14-07-2011 08:46:24
LOCALIZED MESSAGE = Controller ID:  0   Unexpected sense:   PD       =   Ports 0-3:1:1,   CDB   =    0x28  0x00  0x17  0x05  0x4e  0x00  0x00  0x00  0x80  0x00     ,   Sense   =    0xf0  0x00  0x03  0x17  0x05  0x4e  0x63  0x0a  0x00  0x00  0x00  0x00  0x11  0x00  0x00  0x00  0x00  0x00
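
Since the log above is newest-first, it can help to flip it into chronological order and count how often each drive throws an "Unexpected sense" event. The Python below is only a sketch of that idea: it assumes the export keeps the four-line ID / SEQUENCE NUMBER / TIME / LOCALIZED MESSAGE layout shown in this thread, that dates are day-month-year, and that event ID 113 always means "Unexpected sense" (true for this log, not checked against other firmware). The function names are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches one four-line entry as it appears in this thread.
ENTRY_RE = re.compile(
    r"ID = (\d+)\s*\n"
    r"SEQUENCE NUMBER = (\d+)\s*\n"
    r"TIME = ([\d-]+ [\d:]+)\s*\n"
    r"LOCALIZED MESSAGE = (.*)"
)


def parse_log(text):
    """Return the entries oldest-first, sorted by sequence number."""
    entries = []
    for event_id, seq, stamp, message in ENTRY_RE.findall(text):
        entries.append({
            "id": int(event_id),
            "seq": int(seq),
            "time": datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S"),
            "message": " ".join(message.split()),  # squeeze the padding
        })
    return sorted(entries, key=lambda e: e["seq"])


def sense_events_per_drive(entries):
    """Count event ID 113 ('Unexpected sense') per physical drive."""
    counts = Counter()
    for entry in entries:
        if entry["id"] == 113:
            m = re.search(r"PD = (.+?),", entry["message"])
            if m:
                counts[m.group(1)] += 1
    return counts


# Three entries from the log above, in file order (newest first).
SAMPLE = """ID = 114
SEQUENCE NUMBER = 109047
TIME = 14-07-2011 08:46:32
LOCALIZED MESSAGE = Controller ID:  0   State change:   PD       =   Ports 0-3:1:1  Previous   =   Online      Current   =   Failed

ID = 113
SEQUENCE NUMBER = 109042
TIME = 14-07-2011 08:46:31
LOCALIZED MESSAGE = Controller ID:  0   Unexpected sense:   PD       =   Ports 0-3:1:1,   CDB   =    0x2e  0x00  0x17  0x05  0x4e  0x00  0x00  0x00  0x80  0x00     ,   Sense   =    0xf0  0x00  0x03  0x17  0x05  0x4e  0x63  0x0a  0x00  0x00  0x00  0x00  0x11  0x00  0x00  0x00  0x00  0x00

ID = 113
SEQUENCE NUMBER = 109039
TIME = 14-07-2011 08:46:24
LOCALIZED MESSAGE = Controller ID:  0   Unexpected sense:   PD       =   Ports 0-3:1:1,   CDB   =    0x28  0x00  0x17  0x05  0x4e  0x00  0x00  0x00  0x80  0x00     ,   Sense   =    0xf0  0x00  0x03  0x17  0x05  0x4e  0x63  0x0a  0x00  0x00  0x00  0x00  0x11  0x00  0x00  0x00  0x00  0x00
"""

entries = parse_log(SAMPLE)
for entry in entries:
    print(entry["seq"], entry["time"], entry["message"][:60])
print(sense_events_per_drive(entries))
```

Run against a full export on a schedule, a rising count for one drive would be the cue to order a spare before the controller marks it Failed.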

All I did was remove this specific HDD and plug in a spare. No need to power the server down (hot-swap cage). At first I was worried about getting the RAID controller to notice the new HDD, but it picked it up automatically and started the rebuild process.

A little over an hour after the rebuild started (this depends on the size of your HDDs, the RAID setup and the server load), the RAID was fully operational.

Code:
ID = 114
SEQUENCE NUMBER = 109060
TIME = 14-07-2011 10:09:12
LOCALIZED MESSAGE = Controller ID:  0   State change:   PD       =   Ports 0-3:1:1  Previous   =   Offline      Current   =   Rebuild

ID = 106
SEQUENCE NUMBER = 109059
TIME = 14-07-2011 10:09:12
LOCALIZED MESSAGE = Controller ID:  0   Rebuild automatically started:   PD       Ports 0-3:1:1

ID = 114
SEQUENCE NUMBER = 109058
TIME = 14-07-2011 10:09:12
LOCALIZED MESSAGE = Controller ID:  0   State change:   PD       =   Ports 0-3:1:1  Previous   =   Unconfigured Good      Current   =   Offline

ID = 114
SEQUENCE NUMBER = 109057
TIME = 14-07-2011 10:09:12
LOCALIZED MESSAGE = Controller ID:  0   State change:   PD       =   Ports 0-3:1:1  Previous   =   Unconfigured Bad      Current   =   Unconfigured Good

ID = 247
SEQUENCE NUMBER = 109056
TIME = 14-07-2011 10:09:12
LOCALIZED MESSAGE = Controller ID:  0  Device inserted   Device Type:       Disk  Device Id:   Ports 0-3:1:1

ID = 91
SEQUENCE NUMBER = 109055
TIME = 14-07-2011 10:09:12
LOCALIZED MESSAGE = Controller ID:  0   PD inserted:       Ports 0-3:1:1

Then the final message that says it all :

Code:
ID = 114
SEQUENCE NUMBER = 109164
TIME = 14-07-2011 11:15:48
LOCALIZED MESSAGE = Controller ID:  0   State change:   PD       =   Ports 0-3:1:1  Previous   =   Rebuild      Current   =   Online

ID = 249
SEQUENCE NUMBER = 109163
TIME = 14-07-2011 11:15:48
LOCALIZED MESSAGE = Controller ID:  0  VD is now OPTIMAL   VD       0

ID = 81
SEQUENCE NUMBER = 109162
TIME = 14-07-2011 11:15:48
LOCALIZED MESSAGE = Controller ID:  0   State change on VD:   0      Previous   =   Degraded  Current   =       Optimal

ID = 100
SEQUENCE NUMBER = 109161
TIME = 14-07-2011 11:15:48
LOCALIZED MESSAGE = Controller ID:  0   Rebuild complete       Ports 0-3:1:1
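
For the record, the timings can be worked out straight from the TIME fields in the entries above. A quick Python sketch, assuming the timestamps are day-month-year and all come from the same controller clock (the `elapsed` helper is just an illustrative name):

```python
from datetime import datetime

FMT = "%d-%m-%Y %H:%M:%S"  # day-month-year, as in the log entries


def elapsed(start, end):
    """Time between two log timestamps, as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Timestamps copied from the log entries in this post.
drive_failed = "14-07-2011 08:46:32"     # PD state change: Online -> Failed
rebuild_started = "14-07-2011 10:09:12"  # Rebuild automatically started
rebuild_done = "14-07-2011 11:15:48"     # Rebuild complete, VD is now OPTIMAL

print("Rebuild took:     ", elapsed(rebuild_started, rebuild_done))
print("Array degraded for:", elapsed(drive_failed, rebuild_done))
```

The second figure is the one that matters for risk: it is how long the array ran without redundancy.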

BUT ... a RAID is NOT a substitute for a backup. Things can go horribly wrong if you're not lucky.
 
This morning at around 08:46 we lost a HDD in our mailserver's RAID.

BUT ... a RAID is NOT a substitute for a backup. Things can go horribly wrong if you're not lucky.

No admin should ever rely on luck. If I could offer a suggestion, create emergency situations for yourself and TEST the backups. I've made a policy of doing systematic full system restores for some of my SLA clients. Once every six months seems to be adequate.
 
BUT ... a RAID is NOT a substitute for a backup. Things can go horribly wrong if you're not lucky.

Yep, I had a RAID fail, then try to rebuild from a damaged drive to a healthy drive...

Hilarity did not ensue.
 
While we are on the RAID failure issue, has anyone found a decent bare metal restore utility/software? We are not too worried about our data, as this is on a SAN, but all our server OSes are on RAID 1+0. We have tested so many different solutions, and none are really reliable or easy to use.
 