Traditionally, when we design a disk array for an I/O load, we take the total load and divide it by the rated speed of a disk drive to come up with the total number of spindles needed to handle the load.
In some cases the I/O load outstrips the required capacity, causing an interesting situation. We have seen cases where the required capacity is 100 TB, but the I/O load is so intense that the number of spindles needed pushes the array well past that, yielding 350 TB of space. This then drives up cost.
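The sizing arithmetic above can be sketched in a few lines of Python. The function name and all of the input figures below (63,000 IOPS of load, 180 IOPS per drive, 1 TB drives) are hypothetical, chosen only so the result lands on the 100 TB versus 350 TB case described above. The sketch also ignores RAID overhead, hot spares and cache effects.

```python
def size_array(total_iops, iops_per_drive, drive_capacity_tb, required_capacity_tb):
    """Return the spindle count and the raw capacity it drags along.

    All inputs are illustrative assumptions, not vendor ratings.
    """
    # Spindles needed to satisfy the I/O load (integer ceiling division).
    by_io = (total_iops + iops_per_drive - 1) // iops_per_drive
    # Spindles needed to satisfy the capacity requirement alone.
    by_capacity = -(-required_capacity_tb // drive_capacity_tb)
    # The array must satisfy both, so the larger count wins.
    spindles = max(by_io, by_capacity)
    return {
        "spindles": spindles,
        "raw_capacity_tb": spindles * drive_capacity_tb,
        "driven_by": "io" if by_io > by_capacity else "capacity",
    }


# Hypothetical workload: 100 TB needed, but the I/O load dictates the drive count.
result = size_array(
    total_iops=63_000,
    iops_per_drive=180,
    drive_capacity_tb=1,
    required_capacity_tb=100,
)
print(result)
```

With these assumed inputs the I/O load, not the capacity requirement, sets the spindle count, which is the situation that drives up cost.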
Most array manufacturers have this design issue because they all process data in the array in a similar fashion and they all have access to the same drive technology. So what if you could dedupe the data before it hits a spindle? Would this be a better way to get more I/O with less spindle count? The short answer is “yes,” with some caveats.
Data Domain originally entered the storage market as a niche player targeting backup. They designed for deduplicated storage because of all the copies we keep under our backup policies. In fact, the first appliances from Data Domain were optimized for large blobs, like the ones output by NetBackup. They accomplished this through an inline, real-time deduplication process.
If you need more I/O and have a storage requirement over 36 TB, then you should consider solid-state disk drives.
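To make "inline" concrete, here is a minimal sketch of deduplicating data before it reaches a spindle. It is not Data Domain's algorithm: the class name, the fixed 4 KB segment size, the SHA-256 fingerprints and the in-memory dictionary standing in for disk are all assumptions made for illustration.

```python
import hashlib


class InlineDedupeStore:
    """Toy inline dedupe: fingerprint each segment and store only new ones."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size  # assumed fixed-size segments
        self.segments = {}                # fingerprint -> bytes (stand-in for disk)
        self.disk_writes = 0              # segments actually written

    def write(self, data):
        """Ingest data and return the list of fingerprints that rebuilds it."""
        recipe = []
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.segments:
                # Only previously unseen segments cost a disk write.
                self.segments[fingerprint] = segment
                self.disk_writes += 1
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe):
        """Reassemble the original data from its fingerprints."""
        return b"".join(self.segments[fp] for fp in recipe)


store = InlineDedupeStore()
backup = b"A" * 4096 + b"B" * 4096   # two distinct segments
first = store.write(backup)          # first backup: both segments are new
second = store.write(backup)         # repeat backup: nothing new to write
print(store.disk_writes)
```

In this toy the repeat backup costs hashing and lookups but no additional segment writes, which is the trade the post describes: CPU work in exchange for fewer disk writes.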
Visit Adexis at www.adexisstorage.com
Actually the idea that the deduped data segments are randomly scattered across the disk is not true in the case of Data Domain. I ran across a paper on the web the other day where they explain what they do to avoid this problem.
And the idea that deduping will always be slower than just writing to VTL assumes that after deduping, you still have to write to VTL. That isn't true.
If you find you have already stored the data once, you don't have to store it again.
Posted by: Brad Jensen | May 21, 2009 at 07:37 PM
If only Data Domain's performance were solely constrained by CPU. The dedupe algorithm and layout employed by Data Domain, by their very definition, destroy any temporal locality of data laid out on its volumes, meaning that disk heads need to fly all over the place to reassemble data when requested. This makes it impossible to employ any efficient read-ahead methodology and optimize read cache at the block layer, as seek times will always do you in.
I do agree that SSD drives will mask much of this problem. But until the long-term reliability issue of SSD drives is resolved (and we're a ways off) along with the apparent performance problems even they face, I wouldn't be architecting any primary disk solutions with Data Domain that require even modest I/O performance.
Posted by: Glenn Dekhayser | May 20, 2009 at 09:47 AM
This entire blog post is baffling in that it says a lot without saying anything. For instance:
"...deduping data before it gets to a spindle is faster than not deduping the data before it gets to the spindle"
The alternative to Data Domain is post-process VTL/deduplication which does not require the CPU overhead that an inline dedupe (like Data Domain) requires. Backing up to disk directly via VTL is ALWAYS faster than inline.
It's simple physics: You are moving an object from 'point A' to 'point B'. One way is direct, the other way requires a process in the path of A to B. Any processor that Data Domain can use to accelerate dedupe, no matter how fast, ALWAYS creates additional CPU overhead that direct VTL does not.
Your assumption is that the bottleneck is always at the disk level. More often than not, the larger the environment (and the larger the disk capacity and ultimately spindle count), the more likely that the bottleneck is at the controller, not the disk. Deduping inline does reduce the amount being written to disk, but it cannot overcome the processing time required to dedupe in the first place.
Posted by: Gary Aknos | May 19, 2009 at 03:37 PM