Top Ten Differences between Disk-based Archive & Disk-based Storage

 By Jonathan Morgan

This is a long post! You could just read the ten headings. Or just this: a disk-based archive exists to keep your data safe, protected, forever. You can stop reading now if you want to :-)

It’s a common challenge. You present a new way of doing things, and the first thing people want to know is, why?

I’ve always lived without it, why do I need it?

What is it? Or more generally, what is a disk based archive? Is it just a bunch of disks?

Disk-based archives aren’t new. One of the first modern-day clustered archives was Centera from EMC. Some would argue that the product had a slow start: in fact, it wasn’t until Sarbanes-Oxley and e-mail retention rules came in that the product really started to fly.

But of course, the first product to market isn’t always the best. For example, Java wasn’t the first programming language, but once people* “got it”, it took over from other languages, which then quickly became marginalised. (*Let me just say at this stage that I didn’t get it for a while: until there was a case in point where hackers could inject code, I thought running code in a sandbox was more like a three-legged hurdles race than a state-of-the-art stadium!)

So. What are the advantages we are looking for in a disk-based storage cluster ARCHIVE? Why not use SANs? Why not use ZFS? Why not a SATA beast?

Reason 1.

If you are archiving your data, it’s probably because you don’t want to lose it.

Raison d’être for a disk-based archive? To keep data safe. For a SAN? Speed of delivery, QoS… You wouldn’t put 256-bit delivery checksums into a SAN; SANs cut corners on flushing to disk; SANs don’t build in search, audit trails or security; SANs can go down completely because of single points of failure in the hardware; a bad software update in a SAN and… Don’t do it. With nursing care and attention they can run fine for years, but they are inherently tightly coupled, software-version sensitive, high maintenance, error prone and hardware-technology dependent, even if they are brilliant at fast storage and delivery of information.

A disk-based archive must be: loosely coupled and free from dependencies between hardware components on independent nodes (surely the greatest example of a loosely coupled solution is the world-wide web; you have no fear on the web that a server going down, say, one hosting an IBM site, is going to bring down another in Cupertino!); free from requiring constant updates to the latest software/firmware; able to guarantee safe delivery and storage of data; and basically, able to safely and securely store and protect data for year upon year, without complications, manual intervention, spanners…
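To make “guaranteed safe delivery and storage” a little more concrete, here is a minimal sketch of a delivery checksum. It assumes SHA-256 as the 256-bit digest (this post does not name an algorithm), and it uses a plain dict as a stand-in for a storage node; `sha256_hex` and `store_verified` are hypothetical names, not any product’s API.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the 256-bit SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def store_verified(store: dict, key: str, data: bytes, sender_digest: str) -> bool:
    """Keep data only if it matches the digest the sender computed before sending.

    `store` is a plain dict standing in for a storage node (an assumption
    made purely for illustration).
    """
    # Reject anything that changed in transit.
    if sha256_hex(data) != sender_digest:
        return False
    store[key] = data
    # Read the object back and check again, so a bad write is caught now
    # rather than years later.
    return sha256_hex(store[key]) == sender_digest


node = {}
payload = b"project rushes, reel 1"
digest = sha256_hex(payload)                                   # computed by the sender
print(store_verified(node, "reel-1", payload, digest))         # True
print(store_verified(node, "reel-2", payload + b"!", digest))  # False: corrupted in transit
```

The point of the design is that the digest travels with the data, so the archive never has to take the network, or its own write path, on trust.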

Reason 2.

There’ll be bigger, better, cheaper, more efficient disks in 2009, and in 2010, and in 2011…

Will there be bigger, better, cheaper, more energy efficient storage devices coming out this year, and every year that follows? Yes, of course there will be.

In your SAN, do you have to mirror between like-sized devices? What happens when one of those devices goes down in two years’ time? Do you end up throwing away the good device? In your SAN, can you bolt on new technologies as they arrive: holographic disks that store 10TB a shot, or new fibre connectors?

In ZFS, can you decommission part of a storage pool, replacing it with new storage devices, without significant bleeding-edge techniques and without disrupting the rest? Ideally, it would be great to bolt new technologies into an archive as and when they arrive, rolling out old technologies once they reach the point of diminishing returns; to be able to do that whilst always seeing a single archive storage cluster; and without a maintenance or data migration headache, or should I say, without risk. A disk-based archive can achieve that, if selected carefully.

Reason 3.

Vendor tie-in is more like Vendor hand-cuffs.

OK, this isn’t strictly about SAN vs disk-based archiving, but the fact of the matter is that most SANs and other disk-based storage solutions tie you in to a particular vendor. That is great when they are supplying the ‘best-in-class’ solution at the time of purchase, but not quite so clever when you come to upgrade that solution a year down the line and they aren’t offering the best in class anymore.

The archive should be vendor independent; otherwise, for many reasons, you’re just creating tomorrow’s headache with a solution from yesteryear.

Reason 4.

Administrators like Coffee.

This could have been my number 1 reason… data is lost to human error as much as for any other reason. Be that the administrator accidentally deleting the wrong directory, a software update having the wrong selection ticked, or a project manager saying “we don’t need data x now” (when he meant to say data ‘y’).

Disk-based data archives, from Centera to MatrixStore, allow you to set policies when data is being stored to guarantee that the data is locked from removal, e.g., for a period of time. The data is locked down, practically at an operating system level: and if a policy is set to be irrevocable, then guess what.

Put data on a SAN for long-term storage, and there but for the grace of correct coffee consumption it shall live…
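The retention-lock idea can be sketched in a few lines. This is only an illustration under stated assumptions: `Archive`, `RetentionPolicy`, `RetentionError` and `change_lock` are hypothetical names invented here, not the Centera or MatrixStore API, and a real archive enforces the lock far lower down than application code.

```python
from dataclasses import dataclass
from datetime import datetime


class RetentionError(Exception):
    """Raised when an operation would violate a retention policy."""


@dataclass(frozen=True)
class RetentionPolicy:
    locked_until: datetime     # deletion is refused before this moment
    irrevocable: bool = False  # if True, the lock can never be shortened


class Archive:
    """Toy in-memory archive; hypothetical, for illustration only."""

    def __init__(self) -> None:
        self._objects: dict = {}  # key -> (data, policy)

    def store(self, key: str, data: bytes, policy: RetentionPolicy) -> None:
        # Assumption: overwriting would bypass the lock, so it is refused.
        if key in self._objects:
            raise RetentionError(f"{key} already exists")
        self._objects[key] = (data, policy)

    def delete(self, key: str, now: datetime) -> None:
        _, policy = self._objects[key]
        if now < policy.locked_until:
            raise RetentionError(f"{key} is locked until {policy.locked_until:%Y-%m-%d}")
        del self._objects[key]

    def change_lock(self, key: str, new_until: datetime) -> None:
        data, policy = self._objects[key]
        # An irrevocable policy may be extended but never shortened.
        if policy.irrevocable and new_until < policy.locked_until:
            raise RetentionError(f"{key} has an irrevocable policy")
        self._objects[key] = (data, RetentionPolicy(new_until, policy.irrevocable))
```

With a policy like this in place, the under-caffeinated administrator’s delete simply raises an error until the lock date has passed, however sure he is that he meant data ‘x’.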

Reason 5.

Viruses. Hackers.

Choice one:

  • “out of the box” configured with encryption, firewalled, data locked down, all access to data routed through PPK, all maintenance functionality requiring 256-bit passwords

Choice two:

  • bolt on each of the above to your favourite SAN/filesystem. Wait five years as your conglomerate of software solutions evolves (along with the workforce) and cross your fingers.

A disk-based archive must be secure out-of-the-box.

Reason 6.

If you are putting it on disk, it’s probably because you want to use it.

For the same reasons as number 1. You really don’t want to have to back up your SAN/A.N. Other solution to tape.

Reason 7.

RAID 5 ain’t good enough. RAID 50 ain’t good enough.

e.g.: http://storagemojo.com/2007/09/19/cerns-data-corruption-research/

In MatrixStore, we could have said to just RAID5 data (in one node, across multiple nodes…). But we didn’t. We did the maths. And we came out with the following conclusions:

  1. If you RAID5, you are taking a risk. And your data is either in a single location (gulp) or is spread across some very, very inter-dependent nodes. Possibly relying on a controller.
  2. If you mirror, 2 disks down = bye bye data. A bad batch of disks and… how important >was< that data?
  3. If you RAID50… much, much better. But if the system is always able to maintain two good copies of data (e.g., if a node goes down), then you have a data set that can organically recreate itself, without human training, trial and error or intervention, for as long as it is required.
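The “always maintain two good copies” idea in point 3 can be sketched as a re-replication pass that runs whenever a node drops out. This is not MatrixStore’s actual algorithm, which this post does not describe; `heal`, the node names and the placement map are all hypothetical, and the sketch picks replacement nodes alphabetically where a real cluster would weigh capacity and load.

```python
from __future__ import annotations


def heal(placement: dict[str, set[str]], live_nodes: set[str],
         copies: int = 2) -> dict[str, set[str]]:
    """Return a new placement with `copies` replicas of each object on live nodes.

    `placement` maps an object name to the set of nodes holding a copy.
    An object whose copies were all on failed nodes maps to an empty set,
    because there is nothing left to copy from.
    """
    healed: dict[str, set[str]] = {}
    for obj, nodes in placement.items():
        surviving = nodes & live_nodes
        if not surviving:
            healed[obj] = set()
            continue
        # Copy from a surviving replica onto live nodes that lack one.
        candidates = sorted(live_nodes - surviving)
        needed = max(0, copies - len(surviving))
        healed[obj] = surviving | set(candidates[:needed])
    return healed


before = {"clip-a": {"n1", "n2"}, "clip-b": {"n2", "n3"}}
after = heal(before, live_nodes={"n1", "n3", "n4"})  # node n2 has gone down
print(all(len(nodes) == 2 for nodes in after.values()))  # True
```

Because each node only needs to know which objects it holds and which peers are alive, nothing in this loop depends on a shared controller, which is the loose coupling argued for in Reason 1.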

Reason 8.

Because you’ve got an admin with a lot of know-how to install your SAN/ZFS.

Enough said.

Reason 9.

Because new-fangled technology slips up.

Pop quiz. What’s more stable: a SAN or XFS? ZFS or XFS? Disk-based archiving might have new ways of managing the interactions between storage nodes, but for safely storing data it certainly should not be using anything other than methods that have been tried and tested to the 1000th degree.

Reason 10.

Cost.

Oh… OK… forgive me, but I just had to throw this in, even though I know that mirroring and RAID5’ing data isn’t that cheap (we could easily have settled for less in our solution, but we believe in data safety first and forever in all of our decisions). But hey, SANs with all their fibre etc. are expensive, and do-it-yourself solutions have costs both today and tomorrow in time and effort. We believe that if cost is an issue (which would normally be secondary to data safety), then thanks to the lower time and hardware requirements, a good piece of disk-based archiving software has no reason not to be the star turn there too!

