Decreasing data build-up through 'deduplication'

ANALYSIS

I have yet to meet an IT manager who doesn't complain about having to manage too much data. Many, it seems, feel that accumulating data has become something like the weather — everyone talks about it, but apparently there's just not a lot anybody can do.

The superabundance of data in corporate data centres results from many things, including:

  • Substantial growth in the use of rich media (for example, video-on-demand movies)
  • The digitising of analogue data to make it more rapidly and usefully accessible (for example, making X-rays part of a patient's online medical records)
  • The use of mirrored disks, clones and replicated volumes as part of corporate data-protection schemes

There are also less valid reasons behind data build-up, such as keeping the accounting department's football pool on corporate storage, and the fact that stored data, however irrelevant or worthless, never seems to get discarded. Once things get stored, they tend to stay stored forever, and forever is a very long time. You can solve some aspects of the problem through technology, such as data deduplication.

Email and file attachments
The most blatant examples of redundant data are multiple copies of file attachments, a problem with which every Exchange administrator is all too familiar. A typical scenario runs as follows: an initial email message with a 2MB attachment gets sent to 50 recipients, each of whom saves their own copy of the attachment in the Exchange database, perhaps for only a few days but maybe for many months. The original 2MB of data now takes up 100MB of storage and affects every service applied to the Exchange database: backups take longer, data retrieval takes longer, and so on. The second-order effects are more difficult to calculate, but they extend well beyond storage and are hardly subtle: network traffic increases as services are applied to the data, and unplanned-for hotspots create constriction points that result in network brownouts.

The problem doesn't stop with Exchange, of course. Users have a habit of saving multiple instances of the data they create themselves. Users often maintain, for example, multiple iterations of a PowerPoint presentation they are working on, or keep several copies of a document that is in group review in order to capture its history. But is it really necessary to keep all of that data in order to maintain the same amount of information? The answer is no.

Deduplication solves some of the problem
Data-deduplication techniques ensure that only one instance of each significant piece of information is kept on the system; every other instance — even gigabyte-sized objects — will be replaced by a pointer to the initial instance. Using deduplication, an Exchange environment that needed 100MB of storage to accommodate 50 separate instances of a 2MB file would now only need 2MB of storage for the file itself, plus several additional bytes of storage to accommodate each of the file pointers. In an Exchange environment with 1,000 mailboxes, the potential for saving disk space is enormous.
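
The arithmetic above can be checked with a minimal sketch of single-instance storage. This is an illustration only, not how Exchange or any vendor's product works: the class name `SingleInstanceStore`, its methods and the use of SHA-256 digests as pointers are all assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical store: one copy of each distinct payload, pointers elsewhere."""

    def __init__(self):
        self.blobs = {}     # digest -> payload bytes, stored once
        self.pointers = {}  # (mailbox, name) -> digest of the payload

    def save(self, mailbox, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        # Keep the payload only if this content has not been seen before.
        self.blobs.setdefault(digest, payload)
        self.pointers[(mailbox, name)] = digest

    def load(self, mailbox, name):
        return self.blobs[self.pointers[(mailbox, name)]]

    def stored_bytes(self):
        # Payload bytes only; the small per-pointer overhead is not counted.
        return sum(len(blob) for blob in self.blobs.values())


attachment = b"x" * (2 * 1024 * 1024)  # stand-in for a 2MB attachment
store = SingleInstanceStore()
for i in range(50):
    store.save(f"user{i}", "slides.ppt", attachment)

naive_bytes = 50 * len(attachment)
print(naive_bytes // (1024 * 1024), "MB without deduplication")          # 100
print(store.stored_bytes() // (1024 * 1024), "MB with deduplication")   # 2
```

Every mailbox can still read its attachment through `load`, but the payload itself occupies storage only once.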

Equally interesting is that many vendors' products can deduplicate at the "sub-file" level. Products working at this level can identify identical "chunks" (that is, byte aggregations) of data. Once this "byte-level differencing" identifies those chunks, it replaces the repeated byte strings with pointers. This is particularly useful during backups because less data has to be sent to the backup device. But it really proves its value during recoveries, where it delivers substantial performance enhancements.
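
To make the sub-file idea concrete, here is a rough sketch that splits data into fixed-size chunks and stores each distinct chunk once. It is a simplified stand-in rather than any vendor's byte-level differencing: the 4,096-byte chunk size, the function names `dedupe_chunks` and `restore`, and the SHA-256 digests used as pointers are assumptions for illustration, and commercial products may chunk and compare data quite differently.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, chosen only for the example


def dedupe_chunks(data, index):
    """Store unseen chunks in `index`; return the list of pointers (digests)."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:
            index[digest] = chunk  # first time this chunk has been seen
        recipe.append(digest)
    return recipe


def restore(recipe, index):
    """Rebuild the original bytes by following the pointers in order."""
    return b"".join(index[digest] for digest in recipe)


index = {}
version1 = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
version2 = b"A" * 4096 + b"Z" * 4096 + b"C" * 4096  # only the middle chunk differs

recipe1 = dedupe_chunks(version1, index)
recipe2 = dedupe_chunks(version2, index)

print(len(index), "distinct chunks stored for", len(recipe1) + len(recipe2), "chunks written")
```

Two versions of a file that differ in one chunk share the other two, so the second backup adds only the changed chunk to the store.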

There are, of course, several ways to implement not only byte-level differencing but every other aspect of data deduplication. Data Domain, EMC Avamar, FalconStor and Quantum do deduplication at the block level, while ExaGrid, Diligent and Sepaton do it at the byte level. Some people deduplicate the original data, installing agents on servers to accomplish this, while others prefer to leave the original data alone and do their deduplication on a virtual tape or other secondary storage device. When you keep data at remote or branch offices, deduplication is less of a storage issue and more of a concern for network administrators. In fact, some of the earliest deduplication technology comes out of the wide area file services (WAFS) segment, where network administrators emphasised reducing redundant data so as to improve bandwidth utilisation of expensive WAN assets.

The value of deduping
Just about every form of data deduplication delivers value by reducing the total amount of data. Vendors claim data reductions of anywhere from 30 to 300 percent, and such claims are likely justified. Whether or not they are significant is another question. The significance depends on the vendor's product, as well as on the environment to which the technology is being applied. Not all data lends itself well to deduplication at the byte level; MRIs and digitised X-rays, for example, clearly do not.
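
Reduction figures are easier to compare when they are all expressed the same way. The short sketch below converts between a reduction ratio (original size divided by stored size) and the percentage of space saved, which can never exceed 100. The function names are hypothetical, and treating a vendor's figure as a ratio is an assumption about how it was measured.

```python
def percent_saved(ratio):
    """Percentage of space saved for a reduction ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return (1 - 1 / ratio) * 100


def ratio_from_sizes(original_bytes, stored_bytes):
    """Reduction ratio given the size before and after deduplication."""
    return original_bytes / stored_bytes


# 100MB of attachments held as a single 2MB instance is a 50:1 ratio.
print(ratio_from_sizes(100, 2))
print(round(percent_saved(50), 2))
# A 300:1 ratio saves just under 100 percent of the space.
print(round(percent_saved(300), 2))
```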

Selecting a technology
I would suggest you use six parameters as a guideline in selecting any deduplication technology:

  • Performance (different measures will apply depending on where the deduplication takes place)
  • Capacity
  • Scalability
  • Deduplication ratio (only useful in making like-to-like comparisons, such as comparing VTLs to one another)
  • Data types (what will be deduped)
  • Data location (data centre or remote location)

If you take these criteria into consideration, it is a pretty good bet that you can find a product that will significantly streamline your data-management operations.

Mike Karp is a senior analyst with Enterprise Management Associates, an industry research firm focused on IT management.

Talkback

I love the reference to a 300% size reduction.

A real mathematician at work. (For the less mathematical, once it's all gone -- 100% reduction -- how do you reduce more?)

Dori Schmetterling

46426 4 May, 2007 18:07

He meant a dedupe ratio of 300 rather than a percentage.
If you want a percentage it is 1 - (1/300) = 99.67%

Obama is the one who needs a math lesson...

(please don't tell him what comes after a trillion!)

mikedutch 1 April, 2010 23:35