Showing posts with label RAID. Show all posts

Consumer or Enterprise Drives for RAID? (Part 2 of 2)

Monday, July 4, 2011 at 9:03 AM

In my last post, I described a couple of "ideal" scenarios that involved a standalone consumer-class hard drive, along with enterprise-class drives connected to a RAID controller. For the big finale, let’s look at the non-ideal scenario:

Scenario #3: Let’s say your data is stored on a RAID array using consumer-class drives.

You go to print your paper and one of the hard drives is unable to read a sector. What happens now? As mentioned previously, consumer-class drives don't give up quickly. On the flip side, RAID controllers don't have much patience. After a handful of seconds, the controller says, “That drive is not responding to commands, so it must have failed. I’m going to kick it out of the array and get on with my day.” The controller detaches the drive from the array, recreates the missing data from the remaining drives, and you’re able to print your paper.
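Here is a rough sketch of that race between the drive and the controller. The function name and all of the timings are my own assumptions for illustration (real controllers and drives vary); the only number taken from these posts is the 7-second limit on enterprise drives.

```python
def controller_outcome(drive_retry_s, controller_timeout_s, erc_limit_s=None):
    """Toy model of what happens when a drive hits an unreadable sector.

    drive_retry_s        -- how long the drive would keep retrying on its own (assumed)
    controller_timeout_s -- how long the controller waits for an answer (assumed)
    erc_limit_s          -- TLER/CCTL/ERC limit, or None for a consumer-class drive
    """
    # With ERC the drive stops retrying at its limit and reports the error.
    if erc_limit_s is not None:
        time_to_answer = min(drive_retry_s, erc_limit_s)
    else:
        time_to_answer = drive_retry_s

    if time_to_answer > controller_timeout_s:
        return "drive kicked out of the array"
    return "error reported in time, drive stays in the array"


# Hypothetical numbers: a consumer drive that retries for 60 s, a controller that waits 8 s.
consumer = controller_outcome(drive_retry_s=60, controller_timeout_s=8)
enterprise = controller_outcome(drive_retry_s=60, controller_timeout_s=8, erc_limit_s=7)
print(consumer)
print(enterprise)
```

The point of the model is only that the outcome depends on which clock runs out first, not on whether the drive is actually dead.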

So far that’s not such a bad thing, at least as far as your paper is concerned. You were able to print it out and go on your way. Due to the nature of RAID, all you should have to do is put a new hard drive into the array and it will rebuild the missing data from the other drives. Right?

Unfortunately, this leaves you in a somewhat precarious position. The data on your array is now at risk (assuming RAID5). You don’t have any redundancy until the array has been completely rebuilt. What are the chances that you’d have two drives fail at the same time? Pretty low. What about the chances of a single read error on one of the two remaining consumer-class drives during the rebuild process? Much greater. And guess what happens when one of those other drives encounters a read error, takes heroic measures to recover the sector, and gets kicked out of the array by the controller? Very, very bad news for your data. Kiss it goodbye; you'd better have backups.
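To put a rough number on "much greater", here is a back-of-the-envelope sketch. Everything in it is an assumption I'm supplying for illustration: 2 TB drives, a spec-sheet rate of one unrecoverable read error per 10^14 bits read, and errors that are independent of one another. Real drives may do better or worse than their spec sheet, so treat the result as an order-of-magnitude estimate only.

```python
import math


def rebuild_read_error_probability(surviving_drives, drive_bytes, errors_per_bit):
    """Chance of at least one unrecoverable read error while reading every
    surviving drive end to end, assuming independent errors per bit."""
    bits_read = surviving_drives * drive_bytes * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 so the tiny p doesn't get rounded away.
    return -math.expm1(bits_read * math.log1p(-errors_per_bit))


# Assumed example: 3-drive RAID5 with one drive already kicked out, so 2 drives to read.
p = rebuild_read_error_probability(
    surviving_drives=2,
    drive_bytes=2 * 10**12,   # 2 TB per drive (assumption)
    errors_per_bit=1e-14,     # assumed consumer-class spec-sheet figure
)
print(f"Chance of hitting a read error during the rebuild: {p:.0%}")
```

Under those assumptions the chance comes out to a bit over one in four, which is a lot worse than the odds of two whole drives dying on the same day.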

Scenario #4: Let’s compare this same situation using enterprise-class drives. You go to print, there’s a read error on one of the drives, the drive gives up after 7 seconds and notifies the controller, the controller recreates the data for that one sector from the other drives, and ALL of your drives stay in the array! The controller can then write the recreated data somewhere else on the drive that reported the error, and you’re as good as new!
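If you're wondering how the controller "recreates" a sector it can't read, here is a minimal sketch of the XOR parity idea behind RAID5. The block contents and the helper name are made up for illustration, and a real controller works on whole stripes in hardware or firmware rather than four-byte strings.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks in one stripe (hypothetical contents) plus their parity block.
d0, d1, d2 = b"term", b"pape", b"r!!!"
parity = xor_blocks([d0, d1, d2])

# The drive holding d1 reports a read error. XOR everything that is still readable:
rebuilt = xor_blocks([d0, d2, parity])
print(rebuilt == d1)
```

Because XOR-ing a value with itself cancels it out, combining the surviving data blocks with the parity block leaves exactly the missing block.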

The moral of the story is: TLER/CCTL/ERC ensures that your hard drives stay in the array even when they encounter an error. Consumer-class drives are much more likely to be kicked out of an array under similar circumstances – and that’s bad news for your data.
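If you want to see whether a particular drive supports this feature, one option (not something I covered above, so consider it an aside) is smartmontools, whose `smartctl -l scterc` option reads or sets the error recovery timeouts in tenths of a second. The helper names below are my own, and the sketch only builds the command lines rather than running them. Many consumer drives don't support the setting at all, and on drives that do, it often doesn't survive a power cycle.

```python
def scterc_query_command(device):
    """Command line that asks a drive for its current error recovery timeouts."""
    return ["smartctl", "-l", "scterc", device]


def scterc_set_command(device, read_seconds=7.0, write_seconds=7.0):
    """Command line that sets the read/write error recovery timeouts.

    smartctl takes the values in deciseconds, so 7 seconds becomes 70.
    """
    read_ds = int(round(read_seconds * 10))
    write_ds = int(round(write_seconds * 10))
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


# "/dev/sdX" is a placeholder; substitute your own device before running anything.
print(" ".join(scterc_query_command("/dev/sdX")))
print(" ".join(scterc_set_command("/dev/sdX")))
```

You could hand either list to `subprocess.run` (usually as root), but check the output of the query first: a drive that rejects the command is telling you it belongs in Scenario #3.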

This happened to me with some slight variations. I was using RAID6, which preserves data even with two drive failures. When one drive failed, I replaced it with a different one. During the rebuild, another drive was kicked out of the array, and during a subsequent rebuild a third drive was kicked out as well. This toasted the data on my array. It took weeks, $$$, and a lot of effort to gather that data back together. That was probably a lot more than the cost delta between consumer-class and enterprise-class drives, and definitely more than a decent backup solution.

I've since moved to a ZFS-based storage appliance (NexentaStor) and religiously back up all of my data.

Consumer or Enterprise Drives for RAID? (Part 1 of 2)

Wednesday, April 27, 2011 at 9:55 AM

Enterprise-class hard drives have a few features that make them more appropriate for use in RAID arrays. One well-known technology is Western Digital's Time Limited Error Recovery (TLER). Samsung has something similar in their Command Completion Time Limit (CCTL), and Seagate calls theirs Error Recovery Control (ERC).

What's the big deal about TLER/CCTL/ERC? Feel free to hit the links above if you would like the long-winded manufacturer's answer. The short of it is that these hard drives will "give up" fairly quickly when they experience a read error.

I'm sure you're thinking "WHAT??? IT GIVES UP MORE QUICKLY?" Yes, in a RAID array, giving up quickly is a good thing(tm). Let's look at two different scenarios:

Scenario 1: You spent all week working on a paper for school. The document was stored on your hard drive in a sector that's just starting to develop a problem. The next day when you go to print it, the hard drive is unable to read from the sector where your paper was stored.

Pop Quiz: What do you want your hard drive to do?
- A.) give up after a few seconds and say “sorry, you’ve lost your paper”
- B.) be heroic, keep at it, attempt reading the failing sector over and over again until the data is recovered, no matter how long it takes.

Yes, “B” was the correct answer, and that’s exactly what consumer-class hard drives do. When they experience a read error, they keep at it. I'm not sure for how long, but as far as you're concerned, the drive can take as long as it wants because you need that data.

Scenario 2: Same situation, but instead of storing your paper on a consumer-class hard drive, you store it on a RAID array using enterprise-class drives with TLER/CCTL/ERC. When it comes time to print the paper, one of the hard drives is unable to read one of the sectors where your paper is stored. This isn’t a problem because the drive will give up after a few seconds. The drive will notify the RAID controller that it couldn’t read a sector. The RAID controller will then recreate the missing data from the other drives. You print your paper and you’re off to school. In this situation, giving up quickly is a good thing. You only had to wait a couple of seconds and you were able to print your paper.

In Scenario #1, while the drive is attempting to get your data, the computer is unresponsive. It acts like it's frozen, and it can't do much else until it's able to complete reading your file - and that's okay. You want that file and there's only one place to get it.

In Scenario #2, delays like this are completely unacceptable. Can you imagine if hundreds of employees had to sit around twiddling their thumbs for who-knows-how-long because "the server" (with a consumer-class drive) took heroic measures to read a failing sector? Lost productivity. What about hundreds of customers on your website trying to buy stuff? Lost revenue. Either way, you want that drive to give up quickly so that the RAID array can do its thing and your server can get back to supporting your business.

There's one more scenario, the one where you use consumer-class hard drives in the RAID array. That's a longer story and one I'll save for my next post. Let's just say that a drive taking "heroic measures" isn't awarded any medals by the RAID controller. Instead it is taken out back and shot in the head. It's not a pretty sight.
