mdadm RAID10 help

Mal1ce · Jan 18, 2018

Hi everyone, hoping for some guidance here (Linux noob).

Helping a client out with a CentOS 6 file server, and it looks like they have had a drive missing from their software RAID10 for a while now. Output of cat /proc/mdstat below:

md0 : active raid1 sdb1[1] sdc1[2] sdd1[3]
1048512 blocks super 1.0 [4/3] [_UUU]
bitmap: 1/1 pages [4KB], 65536KB chunk

md2 : active raid10 sdc3[2] sdd3[3] sdb3[1]
3871111168 blocks super 1.1 512K chunks 2 near-copies [4/3] [_UUU]
bitmap: 21/29 pages [84KB], 65536KB chunk

md1 : active raid10 sdc2[2] sdd2[3] sdb2[1]
33523712 blocks super 1.1 512K chunks 2 near-copies [4/3] [_UUU]

unused devices: <none>
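
For a quick check of which arrays are degraded, the `[4/3] [_UUU]` fields can be filtered mechanically. A minimal sketch, assuming the usual /proc/mdstat layout shown above; the function name `degraded_arrays` is made up for illustration:

```shell
# Hypothetical helper: reads /proc/mdstat-style text on stdin and prints
# the name of every array whose member map (e.g. [_UUU]) has a missing slot.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/     { name = $1 }    # remember the array this block describes
        /\[[U_]*_[U_]*\]/ { print name }   # an underscore in the map marks an absent member
    '
}
```

Run as `degraded_arrays < /proc/mdstat`. On the output above it should list all three arrays, since each one shows `[_UUU]`.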

/dev/sda appears to be missing, but I can see it from fdisk -l:

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x000ef4d8

Device Boot Start End Blocks Id System
/dev/sda1 * 1 131 1048576 fd Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2 131 2220 16778240 fd Linux raid autodetect
/dev/sda3 2220 243202 1935686656 fd Linux raid autodetect

From what I've read, the event count is very high and the drive shouldn't be added to the array in its current state?

Internal Bitmap : 8 sectors from superblock
Update Time : Sat Dec 24 01:38:49 2016
Bad Block Log : 512 entries available at offset 72 sectors
Checksum : 22dec0cf - correct
Events : 6585

Layout : near=2
Chunk Size : 512K

Device Role : Active device 0
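
As far as I understand it, what matters is less the absolute event count than how it compares with the members still in the array: a member whose count lags behind the others is stale. A rough sketch for pulling the counter out of `mdadm --examine` output, assuming the `Events : N` line format shown above (the helper name `events_of` is hypothetical):

```shell
# Hypothetical helper: extract the event counter from "mdadm --examine"
# output supplied on stdin.
events_of() {
    awk -F' : ' '$1 ~ /^ *Events$/ { print $2 }'
}

# Example use (as root): compare the stale partition with its peers.
# for p in /dev/sd[abcd]3; do
#     printf '%s %s\n' "$p" "$(mdadm --examine "$p" | events_of)"
# done
```

The commented loop uses the md2 member partitions from this thread; adjust the glob for md0 and md1.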

Any guidance on how to re-add the missing disk to the array? Or what I should do to test to ensure all is OK before doing so?

Many thanks!

ginggs · Jan 18, 2018

Check the SMART information on the suspect drive:

Code:

sudo smartctl -a /dev/sda

Mal1ce · Jan 18, 2018

SMART info here:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5
  3 Spin_Up_Time            0x0027   182   169   021    Pre-fail  Always       -       3866
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       111
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       22778
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       111
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       152
194 Temperature_Celsius     0x0022   114   102   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

This is the only drive of the 4 showing any value other than 0 for Raw_Read_Error_Rate.

Rest of it looks OK, /dev/sd[b,c,d] look fine as well.
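
The attributes usually watched most closely for failing media are 5, 197 and 198 (reallocated, pending and offline-uncorrectable sectors). A small sketch to pull their raw values out of `smartctl -A` output, assuming the ten-column attribute table shown above; `smart_key_attrs` is a made-up name:

```shell
# Hypothetical helper: from "smartctl -A" output on stdin, print the name
# and raw value of attributes 5, 197 and 198.
smart_key_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2, $NF }'
}
```

Run as `smartctl -A /dev/sda | smart_key_attrs`. It takes the last field as the raw value, which holds for these three attributes in the table above but not for every attribute on every drive.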

PsyWulf · Jan 18, 2018

Probably got dropped and desynced due to a read error

Get the SMART statuses (I like HDD Sentinel on Windows for the fancy metrics).
If the disk is failing, replace it ASAP.

ginggs · Jan 18, 2018

Looks fine to me.

Mal1ce said:
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0

Try stopping and starting the arrays:

Code:

mdadm --stop /dev/md0
mdadm --stop /dev/md1
mdadm --stop /dev/md2
mdadm --assemble --scan

Hopefully the arrays start rebuilding.

If not, try forcing the arrays online, one by one:

Code:

mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
mdadm --assemble --force /dev/md1 /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
mdadm --assemble --force /dev/md2 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3

Check that the above lists of devices are correct for your setup, but I think they are.

Mal1ce · Jan 18, 2018

Thanks, will wait for the weekend with a verified full backup in hand before I tackle this on Saturday morning.

Mal1ce · Jan 20, 2018

First hurdle:

mdadm --stop /dev/md0
mdadm: Cannot get exclusive access to /dev/md0

Perhaps a running process, mounted filesystem or active volume group?

Tried stopping SAMBA, no joy.

Do I need to unmount the RAID first?
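
The error message itself lists the usual suspects. One way to see whether a filesystem on an array is still mounted is to filter /proc/mounts. A sketch, assuming the arrays are mounted directly as /dev/mdN rather than through LVM or device-mapper names (the helper name `md_mounts` is hypothetical):

```shell
# Hypothetical helper: from /proc/mounts-style text on stdin, print each
# md device that is mounted together with its mount point.
md_mounts() {
    awk '$1 ~ /^\/dev\/md[0-9]+$/ { print $1, $2 }'
}
```

Run as `md_mounts < /proc/mounts`. Anything it lists would have to be unmounted before `mdadm --stop` can succeed; swap on an array shows up in /proc/swaps instead.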

Can I not just remove /dev/sda[1,2,3] from the arrays, wipe the drive, copy partitions from another working drive, re-add /dev/sda[1,2,3] and it should rebuild/resync?
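
For reference, that plan could look roughly like the sketch below. It only prints the commands rather than running them. The partition-to-array mapping (sda1 to md0, sda2 to md1, sda3 to md2) is inferred from the /proc/mdstat output earlier in the thread, and `readd_plan` is a made-up name. Since fdisk already shows three raid partitions on /dev/sda, the partition-table copy may not be needed at all.

```shell
# Hypothetical helper: print (do not execute) the commands for rebuilding
# one stale disk from a healthy one, using this thread's array layout.
readd_plan() {
    stale=$1   # disk to be rebuilt, e.g. sda
    good=$2    # healthy member to copy the partition layout from, e.g. sdb
    echo "sfdisk -d /dev/$good | sfdisk /dev/$stale"
    n=1
    for md in md0 md1 md2; do
        # Clear the old md metadata, then add the partition as a fresh member.
        echo "mdadm --zero-superblock /dev/$stale$n"
        echo "mdadm /dev/$md --add /dev/$stale$n"
        n=$((n + 1))
    done
}
```

Calling `readd_plan sda sdb` prints seven commands to review before running anything as root. Because /proc/mdstat no longer lists any sda partition as a member, there is no fail/remove step, and adding to a running degraded array does not require stopping it. The rebuild progress then appears in /proc/mdstat.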

SauRoNZA · Jan 20, 2018

You absolutely have to unmount it.

Even says so right there in your error.
