Righteous Wrath Online Community

General => Tech Chat => Topic started by: Tom on February 08, 2015, 02:36:13 PM

Title: NAS and Server rescue
Post by: Tom on February 08, 2015, 02:36:13 PM
So... I had some fun working on two computers this week.

Basically, my NAS dropped 2 of the 7 drives in its RAID 5 array. Usually this means your data is gone. Just gone. So as you can imagine, I was a bit upset. Not only that, but my backup of the NAS array had been "borrowed" while I was playing with that new big server I built last summer, so I had no backup of any of the data on the array. In addition, that backup array had also kicked out two members and was DOA. I suspect these SAS cards dislike it when a malfunctioning drive is connected, and cause delays or spurious errors when accessing the other drives, which makes mdraid kick those out too.

I've spent quite a bit of time this week trying to rescue the data on both arrays.

I learned about:

Tools used:

The process so far has gone like so:

I tell you what, when I got the old array back up and things seemed ok, I was SO happy.

It seems that the XFS filesystem is fault tolerant enough to withstand quite a bit of shenanigans. So is mdraid: it will let you re-assemble an existing array even with a disk that doesn't "match" the rest of the array.

You have to be careful with the "mdadm --assemble --force" command though, because it will modify your drives. In particular, it updates the "event count" in the mdraid superblock of the out-of-date disk to match the rest of the disks. If it finds two disks it thinks are ok to re-add, it'll start resyncing. If a disk was far enough behind, that can and will cause corruption, so just hope and pray ;)

If that doesn't work, you can try "mdadm --create --assume-clean" with exactly the same settings you created the array with to begin with (RAID level, number and order of devices, chunk size, metadata version). That will give you an array that assembles, but it is very dangerous: it assumes the drives are clean and that the parity matches reasonably well. If the parity or data doesn't match well enough, you are guaranteed some significant corruption, and you will have to do some more serious data recovery (ie: photorec).
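
For anyone who wants to see how far behind a disk is before forcing an assemble, here is a rough Python sketch that compares the event counts printed by "mdadm --examine". It assumes the plain "Events : N" line format and a "/dev/...:" header line per member. The sample text, device names and numbers are made up for illustration, and the function names are my own, not anything from mdadm.

```python
import re

# Hypothetical, trimmed `mdadm --examine` output for three members.
# Device names and event numbers are invented for this example.
SAMPLE = """\
/dev/sdb1:
          Magic : a92b4efc
         Events : 48211
/dev/sdc1:
          Magic : a92b4efc
         Events : 48211
/dev/sdd1:
          Magic : a92b4efc
         Events : 47102
"""

def event_counts(examine_text):
    """Map each device to the 'Events' value found under its header line."""
    counts = {}
    device = None
    for line in examine_text.splitlines():
        header = re.match(r"^(/dev/\S+):\s*$", line)
        if header:
            device = header.group(1)
            continue
        events = re.match(r"^\s*Events\s*:\s*(\d+)\s*$", line)
        if events and device is not None:
            counts[device] = int(events.group(1))
    return counts

def stale_members(counts):
    """Return {device: how many events it lags behind the newest member}."""
    newest = max(counts.values())
    return {dev: newest - n for dev, n in counts.items() if n < newest}

counts = event_counts(SAMPLE)
print(stale_members(counts))
```

A member that lags by a large number of events was out of the array while a lot of writes happened, so that is the one most likely to cause corruption if it gets forced back in.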

During the recovery, I made a few serious mistakes:


In the end, I just gave XFS a blank 128MB file-backed loop device for the external log. Despite all of that, XFS was fine, and there were very few errors on mount. I will not be using an external log again :D (It can improve performance, as can an external mdraid write-intent bitmap, since it hits a different "spindle" and causes less disk thrashing.)

I had some additional problems with the server. The version of systemd it uses treats ALL mounts in /etc/fstab as SUPER IMPORTANT, so if any one of them fails it falls back to an emergency login prompt. That honestly isn't too terrible, but it was launching TWO of those prompts on the same console, causing the input to be split between them and making it nearly impossible to do anything (see attached image). I eventually booted by changing init (init=/bin/bash), which is where I did most of the steps to rebuild the backup array. Then, to fix the boot issue, I manually told systemd to boot into emergency mode instead of waiting for a failure (so it didn't even bother to mount the NFS share), and removed the failing entry (the NFS share on the NAS) from fstab. I can't tell you how mad that all made me last night. So pissed off. I'm getting angry just thinking about it. lol.

Hopefully someone finds this interesting at the very least, if not useful :)
Title: Re: NAS and Server rescue
Post by: Melbosa on February 08, 2015, 03:07:37 PM
Very Interesting and Good Job!
Title: Re: NAS and Server rescue
Post by: Tom on February 08, 2015, 03:22:38 PM
Here's a suggestion to everyone. If you have ANY Seagate drives, especially from the past few years, GO CHECK for firmware updates NOW. I think I have had 4-5 Seagates fail in the past 5 years or so: a couple of 1TB drives (if not more), a couple of 2TB drives, and possibly one 3TB (it isn't dead yet, and I'm hoping the firmware update will keep it going until I can afford a replacement). Secondary to that, do not buy any 3TB Seagate that was made in the past few years. 40% failure rate.
Title: Re: NAS and Server rescue
Post by: Tom on February 08, 2015, 03:54:30 PM
SMART info from 4 drives that have failed in the past few years:


DEVICE   ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
/dev/sda   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       366
/dev/sda   5 Reallocated_Sector_Ct   0x0033   002   002   036    Pre-fail  Always   FAILING_NOW 4015
/dev/sda 183 Runtime_Bad_Block       0x0032   001   001   000    Old_age   Always       -       249
/dev/sda 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
/dev/sdd   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       27
/dev/sdd   5 Reallocated_Sector_Ct   0x0033   094   094   036    Pre-fail  Always       -       9088
/dev/sdd 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
/dev/sdd 198 Offline_Uncorrectable   0x0010   089   087   000    Old_age   Offline      -       1824
/dev/sde   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       92
/dev/sde   5 Reallocated_Sector_Ct   0x0033   086   086   036    Pre-fail  Always       -       18416
/dev/sde 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
/dev/sde 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
/dev/sdf   4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       84
/dev/sdf   5 Reallocated_Sector_Ct   0x0033   072   053   036    Pre-fail  Always       -       37096
/dev/sdf 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
/dev/sdf 198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       27720
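
If you want to read a listing like this programmatically, here is a small Python sketch. It assumes the column order smartctl uses for attributes (ID, name, flag, value, worst, threshold, type, updated, when failed, raw) with the device name stuck on the front as in the rows above, and it applies the usual rule that an attribute is failing when its normalized value is at or below the threshold. The function names are just mine.

```python
# Rows copied from the listing in this post (attribute 5 for each drive, plus one 198).
ROWS = """\
/dev/sda   5 Reallocated_Sector_Ct   0x0033   002   002   036    Pre-fail  Always   FAILING_NOW 4015
/dev/sdd   5 Reallocated_Sector_Ct   0x0033   094   094   036    Pre-fail  Always       -       9088
/dev/sde   5 Reallocated_Sector_Ct   0x0033   086   086   036    Pre-fail  Always       -       18416
/dev/sdf   5 Reallocated_Sector_Ct   0x0033   072   053   036    Pre-fail  Always       -       37096
/dev/sdf 198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       27720
"""

def parse_row(line):
    """Split one listing row into named fields; numeric columns become ints."""
    (dev, attr_id, name, flag, value, worst,
     thresh, kind, updated, when_failed, raw) = line.split()
    return {
        "dev": dev,
        "id": int(attr_id),
        "name": name,
        "value": int(value),
        "worst": int(worst),
        "thresh": int(thresh),
        "raw": int(raw),
    }

def threshold_status(row):
    """Recompute the WHEN_FAILED column from value, worst and threshold."""
    if row["value"] <= row["thresh"]:
        return "FAILING_NOW"
    if row["worst"] <= row["thresh"]:
        return "In_the_past"
    return "-"

for row in map(parse_row, ROWS.splitlines()):
    note = "  <- nonzero raw count" if row["raw"] > 0 else ""
    print(row["dev"], row["id"], row["name"], threshold_status(row), row["raw"], note)
```

Only /dev/sda actually trips its threshold here, even though all four drives have thousands of reallocated sectors, which is why I'd watch the raw counts and not wait for the drive to declare itself failed.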
Title: Re: NAS and Server rescue
Post by: Tom on February 08, 2015, 05:08:03 PM
Those four drives consist of one RMA replacement 1TB and THREE 2TB Seagates. Of the 2TBs, one went last spring or last winter (can't remember now), and two failed more recently, one before June and the other in October. So dumb.
Title: Re: NAS and Server rescue
Post by: Tom on April 06, 2015, 08:16:47 PM
Fun story: one of the 3TB Seagates that I was hoping I had saved with the firmware update has started to give some SMART errors. Thank god I set up smartd again on that box (this is the backup array in my home server, so no fancy web admin for any of this).

It seems it spewed two errors every day for a week: CurrentPendingSector and OfflineUncorrectableSector. They have gone away now though, and there are no new emails. That was suspicious enough to make me check the drive with smartctl, and now I have 5 ReportedUncorrectableSectors. So yay.

So I get to buy two new drives. Woo hoo. (Two because I want a spare.) I was looking at Backblaze's latest HDD roundup, and the "best" 3TB drive that I can get (that isn't a Seagate) is a Toshiba. NCIX has them on sale for like $120, and I might go with those, but I don't know if I should risk it, as the Newegg reviews are absolutely horrible. HALF of the reviews report drives that were essentially DOA or died within a year. It's that, or I go with WD Reds for $150 :o which also don't have great reviews... *sigh*
Title: Re: NAS and Server rescue
Post by: Thorin on April 06, 2015, 08:41:58 PM
Are the WD Reds better than the WD Blacks?
Title: Re: NAS and Server rescue
Post by: Tom on April 06, 2015, 08:44:28 PM
Quote from: Thorin on April 06, 2015, 08:41:58 PM
Are the WD Reds better than the WD Blacks?
They are "meant" for NAS duties. I don't know if that truly means they are better at it than Blacks. But WD will actually warranty them for NAS/RAID use, and they include TLER (time-limited error recovery), which the Blacks haven't supported in years. TLER is good for RAID, as it lets the array recover the data ASAP rather than waiting up to a couple of minutes for the drive to attempt, and potentially fail, its own recovery.
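
To make the timing argument concrete, here is a tiny Python sketch. It assumes the drive's recovery limit is given in tenths of a second (the unit "smartctl -l scterc" reports) and that the kernel's per-command timeout is the Linux default of 30 seconds (readable from /sys/block/<dev>/device/timeout). The function name and the labels it returns are made up for illustration.

```python
KERNEL_DEFAULT_TIMEOUT_S = 30  # Linux's default SCSI command timeout, in seconds

def erc_verdict(erc_deciseconds, kernel_timeout_s=KERNEL_DEFAULT_TIMEOUT_S):
    """Classify a drive's error-recovery limit against the kernel timeout.

    erc_deciseconds: the drive's recovery limit in tenths of a second,
    or None when the drive has no limit set or doesn't support one.
    """
    if erc_deciseconds is None:
        # The drive may keep retrying past the kernel timeout,
        # which is when the RAID layer tends to drop it.
        return "unbounded"
    if erc_deciseconds / 10 < kernel_timeout_s:
        # The drive gives up first and reports the error,
        # so the array can rebuild that block from parity.
        return "ok"
    return "too-long"

print(erc_verdict(70))    # a 7.0 second limit
print(erc_verdict(None))  # no TLER / ERC available
```

The point is just that the drive has to give up before the kernel does. If the drive has no limit, the other common workaround is raising the kernel timeout instead.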
Title: Re: NAS and Server rescue
Post by: Thorin on April 06, 2015, 09:16:50 PM
Hmm, I'm pretty sure I'm just using all WD Blacks in the Drobo.  And it's been humming along for several years now.
Title: Re: NAS and Server rescue
Post by: Tom on April 06, 2015, 09:29:46 PM
Quote from: Thorin on April 06, 2015, 09:16:50 PM
Hmm, I'm pretty sure I'm just using all WD Blacks in the Drobo.  And it's been humming along for several years now.
You can get away without TLER, especially if the machine doesn't use RAID. Blacks are quite good though, and if they are old enough, they may just have TLER available (they did for a long time, till people actually started buying them instead of their higher priced enterprise drives!).
Title: Re: NAS and Server rescue
Post by: Thorin on April 06, 2015, 09:52:21 PM
Yay, I bought something good without really realizing it and without intending to!

I just, I'd read about the Caviar Greens, or whatever they're called, and how they weren't meant for always-on usage.  Which just seemed completely counter to what I'd want my hard drive to be designed for...

Oh, and hopefully you get some drives that work better for you.
Title: Re: NAS and Server rescue
Post by: Tom on April 06, 2015, 10:18:13 PM
Quote from: Thorin on April 06, 2015, 09:52:21 PM
Yay, I bought something good without really realizing it and without intending to!
Well, WD Blacks have always been intended for "enthusiasts". They are WD's high-end consumer line. Something you'd put in a workstation or game rig back in the day (now you'd just use SSDs :o)

Quote from: Thorin on April 06, 2015, 09:52:21 PM
I just, I'd read about the Caviar Greens, or whatever they're called, and how they weren't meant for always-on usage.  Which just seemed completely counter to what I'd want my hard drive to be designed for...
They are meant for regular consumer workloads, which means they are only on for like 4-6 hours a day. I have two Greens still, but I'm suspicious of at least one of them. I think I'd only use them for stuff that is mostly idle (ie: not a NAS device, or Linux, where it likes to keep the disk on 24/7).

Quote from: Thorin on April 06, 2015, 09:52:21 PM
Oh, and hopefully you get some drives that work better for you.
I'm going with the WD Reds. It was that or the HGST NAS drives from Newegg, but I prefer dealing with NCIX atm when I have to worry about returns.
Title: Re: NAS and Server rescue
Post by: Tom on August 06, 2015, 08:31:24 PM
I just had another Seagate die in my NAS. Not even a plain old URE or read/write error. It just fell off the bus. I'll be messing with it to figure out what went wrong :o Luckily I have a spare for that box.