Enterprise Strategy Group | Getting to the bigger truth.TM

Market Reports: Data Deduplication Diversity: Evaluating Software- vs. Hardware-Based Approaches
Published on Tuesday, April 21st, 2009 at 8:28 am
Categories: Backup and Recovery Software | Data Protection Software & Services | IT Infrastructure | Information and Risk Management | Market Reports | Purpose-built Disk Storage Systems and Appliances | Storage
Authors: Lauren Whitehouse

Introduction

Deduplication dramatically improves the value proposition of disk-based data protection as it eliminates the redundancy typically seen in secondary storage processes.  The use of deduplication will drive further backup-to-disk adoption and deliver associated performance and reliability benefits.

Selecting a strategy for data deduplication requires consideration of several factors in order to avoid surprises later.  Clearly understanding how deduplication works—especially in conjunction with other requirements such as performance, ease of use, and offsite copy creation—should go a long way toward selecting and designing a solution that delivers maximum business, operational, and financial benefits.

Deduplication has been more popularly deployed in target storage hardware, including virtual tape libraries (VTLs) and storage appliances; however, recent introductions by backup vendors have shifted the spotlight.  How do hardware and software approaches differ? What should be considered when evaluating solutions?

External Forces Contribute to IT Challenges

There are a number of external forces working against IT organizations today, including data growth, compliance, and economic difficulties. These realities are impacting not only data protection processes and secondary storage environments, but also data center environmentals such as power, cooling, and floor space, as well as bandwidth between primary and secondary recovery sites.

Relentless information growth is necessitating greater investments in IT infrastructure.  ESG estimates that database data is growing at 25% per annum, with unstructured data increasing at two to three times that rate.[1] Data protection processes, such as backup and replication, compound capacity growth since multiple copies of primary data are made for operational and disaster recovery. ESG research respondents cited data protection as the application that will be most responsible for storage growth over the next 24 months (see Figure 1).[2]

Figure 1. Data Protection’s Complicity in Storage Growth

The need to retain information for longer periods of time on accessible media, such as disk, for compliance, eDiscovery, and business intelligence purposes also contributes to capacity overabundance and stress on the data protection infrastructure.  Moreover, in an effort to improve the performance and reliability of backup and recovery operations, organizations have been increasingly using disk as both the initial and final resting place of backup copies.  While capital investments in disk can increase costs, disk-based data protection can contribute to lower operational expenses and improvements in backup and recovery service level agreements (SLAs).

In the midst of global financial turmoil, IT organizations are highly motivated to reduce costs and optimize efficiency—but not at the expense of introducing risk or impacting value.  Technologies that create efficiency and deliver rapid ROI without sacrificing organizations’ other goals—meeting backup and recovery SLAs, for example—are those being considered and deployed.  One such technology is data deduplication.

Data Deduplication

Data deduplication identifies and eliminates redundancy, minimizing bandwidth and storage capacity requirements.  Deduplication, while not new, has gained even greater popularity today:  The increased use of disk in backup and recovery and data protection’s aforementioned contribution to storage capacity growth make deduplication attractive.  In fact, ESG research found that data reduction tops the list of respondents’ top five storage priorities over the next 24 months (see Figure 2).[3]

Figure 2. Top Storage-related Initiatives by Enterprise Organizations Citing Cost Reduction as a Major Factor Impacting Storage Spending

Deduplication in backup processes ensures that only unique data is stored and duplicate data is not.  Initially, data is backed up to the storage device and all subsequently written data is examined for redundancy, with only unique data being written to storage.  When duplicate data is found, only a pointer linked to the original unique piece of data is stored.  This pointer consumes significantly less space than storing the whole item multiple times.

The effectiveness of deduplication is often expressed as a reduction ratio denoting the ratio of protected capacity to the actual physical capacity stored.  A 10:1 ratio means that 10 times more data is protected than the physical space required to store it and a 20:1 ratio means that 20 times more data can be protected.  Factoring in data growth, retention, and assuming deduplication ratios in the 20:1 range, 2 TB of storage capacity could protect up to 40 TB of retained backup data.
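The ratio arithmetic above can be sketched in a few lines of Python.  This is only an illustration of the relationship between protected capacity, physical capacity, and the reduction ratio; the function names are hypothetical and do not come from any vendor tool.

```python
def protected_capacity_tb(physical_tb: float, ratio: float) -> float:
    """Backup data (TB) that physical_tb of disk can hold at a ratio:1 reduction."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return physical_tb * ratio


def physical_capacity_tb(protected_tb: float, ratio: float) -> float:
    """Physical disk (TB) needed to hold protected_tb at a ratio:1 reduction."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return protected_tb / ratio


def space_savings_pct(ratio: float) -> float:
    """Share of capacity avoided, as a percentage, at a ratio:1 reduction."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1 - 1 / ratio) * 100


# The example from the text: 2 TB of disk at 20:1.
print(protected_capacity_tb(2, 20))
```

Expressed as space savings, a 10:1 ratio avoids 90% of the capacity and a 20:1 ratio avoids 95%, so doubling the ratio from 10:1 to 20:1 recovers only a further five percentage points.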

The benefit of storing less data is obvious when it comes to reducing storage requirements and saving money.  Another way of looking at it is that storing data more efficiently on disk allows for either longer retention periods or the “recapture” of disk and the ability to extend disk-based data protection to more workloads, contributing to improvements in recovery objectives.  Efficient storage of data on disk can be a catalyst to reduce or eliminate tape media.

When it comes to selecting deduplication technology, many factors should be considered.  After surveying organizations using or considering deduplication, ESG research found that, not surprisingly, the cost of the solution was the most frequently cited factor (although savings from capacity reduction often overcome financial objections to deploying deduplication).  Beyond cost, the data suggests that ease of deployment and ease of use, as well as the impact on backup/recovery performance, were more important considerations than technical implementation details such as inline versus post-process approaches or the deduplication ratio (see Figure 3).[4]

Figure 3. Considerations for Selecting Data Deduplication Technology

Data deduplication is a ground-breaking technology that changes the economics of disk-based backup and recovery, so the decision to adopt it should be easy.  However, organizations must familiarize themselves with the many facets of deduplication solutions and consider them prior to purchase, which can make the evaluation and selection of a solution more complex.

Deduplication in Secondary Storage Processes

Deduplication is a feature of both software- and hardware-based data protection solutions.  Vendors offering this feature have taken different approaches to how, where, and when deduplication occurs and possess varying limitations in the scope of deduplication.

How Deduplication Occurs

Deduplication solutions either have knowledge about the data in the backup stream or they don’t.  Those that do are content-aware—they can look at patterns in the data stream (the bytes that make up a file) and determine the optimal segment boundaries, which maximizes the likelihood of identifying duplicates.  Backup software understands the content, whereas target-side deduplication solutions typically do not. Targets simply receive a “stream” of data from the backup application.  Those target-side deduplication devices that are content-aware typically have to extract the metadata associated with the backup and “reverse engineer” the backup stream to understand its contents.

Hash-Based Algorithms

Deduplication solutions may depend on a hash algorithm to determine redundancy.  Traditionally, hash algorithms were used to compare data read vs. data written by performing a calculation on a “chunk” of the data.  If the result is identical, then the data read is the same as the data written.  The same technique has been applied to identifying unique data by “fingerprinting” chunks of data.  In backup, multiple segments of the backup data stream are fingerprinted, and the ID of each chunk of incoming data is compared against a central index. Unique IDs are stored in the index and unique data is written to disk. Any duplicates are discarded and a pointer to the existing data is stored instead.
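A minimal sketch of the fingerprint-and-index flow described above follows.  It assumes fixed-size 4 KB chunks and SHA-256 fingerprints; neither choice comes from this report, and real products differ in both chunking and hashing.  The DedupStore class is hypothetical and keeps everything in memory purely for illustration.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed segment size; actual products vary


class DedupStore:
    """Toy hash-based deduplicating store (illustrative only)."""

    def __init__(self):
        self.index = {}    # fingerprint -> position of the chunk in self.chunks
        self.chunks = []   # unique chunks, standing in for data written to disk

    def write(self, data: bytes) -> list:
        """Ingest a backup stream and return its recipe (a list of pointers)."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.index:
                # Unique chunk: record its ID in the index and store the data.
                self.index[fingerprint] = len(self.chunks)
                self.chunks.append(chunk)
            # Unique or duplicate, the stream keeps only a pointer.
            recipe.append(self.index[fingerprint])
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild the original stream by following the pointers."""
        return b"".join(self.chunks[i] for i in recipe)


store = DedupStore()
block_a = b"A" * CHUNK_SIZE
block_b = b"B" * CHUNK_SIZE
recipe = store.write(block_a + block_b + block_a)
print(recipe, len(store.chunks))
```

Here the third chunk repeats the first, so the stream is recorded as three pointers while only two chunks are physically stored.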

Considerations with hash-based methods are the size and location of the index.  Hash comparisons made with a memory-resident index will be considerably faster than with a disk-based one.  The index may be kept in RAM, but its size may be constrained by the memory limitations of the solution.  An index stored on disk could grow large; however, disk seeks may impact performance.  These factors, therefore, may impact the capacity of storage contained in a single system.
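To see why index size can limit the capacity of a single system, a rough estimate helps.  The sketch below assumes an 8 KB average chunk and a 40-byte index entry (a 32-byte fingerprint plus an 8-byte disk location); these figures are illustrative assumptions, not numbers from this report or from any vendor.

```python
def index_size_bytes(unique_bytes: int,
                     avg_chunk_bytes: int = 8192,
                     entry_bytes: int = 40) -> int:
    """Estimate index size: one entry per unique chunk stored."""
    chunk_count = -(-unique_bytes // avg_chunk_bytes)  # ceiling division
    return chunk_count * entry_bytes


TIB = 2 ** 40
GIB = 2 ** 30

# Index needed to track 10 TiB of unique (post-deduplication) data.
print(index_size_bytes(10 * TIB) / GIB, "GiB")
```

Under these assumptions, 10 TiB of unique data needs an index of about 50 GiB, which illustrates why a memory-resident index constrains capacity and why larger systems may fall back to a disk-based index at some cost in lookup speed.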

Delta Differencing

Another deduplication approach is delta differencing.  With this method, there is a level of content awareness.  This means that backup streams can be compared from one to the next, i.e., the backup performed today is compared to yesterday’s backup—an approach often taken by content-aware solutions.  Only the new or changed blocks or bytes (differences) are stored.  Old or recurring blocks or bytes are discarded.  This approach may be faster than hash-based approaches, but cannot deduplicate across backup streams from different backup applications.
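The idea of comparing one backup to the next can be sketched as follows.  This simplified version compares fixed-size blocks at the same offsets, so it does not handle insertions that shift later data; the function names and the tiny block size are hypothetical and chosen only to keep the example readable.

```python
BLOCK = 4  # unrealistically small block size, for readability


def delta(previous: bytes, current: bytes, block: int = BLOCK) -> dict:
    """Return {block_number: data} for blocks of current that differ from previous."""
    changes = {}
    for offset in range(0, len(current), block):
        new_block = current[offset:offset + block]
        if previous[offset:offset + block] != new_block:
            changes[offset // block] = new_block  # store only the difference
    return changes


def apply_delta(previous: bytes, changes: dict, length: int,
                block: int = BLOCK) -> bytes:
    """Rebuild the current backup from the previous one plus stored differences."""
    out = bytearray(previous[:length].ljust(length, b"\0"))
    for number, data in changes.items():
        start = number * block
        out[start:start + len(data)] = data
    return bytes(out)


yesterday = b"AAAABBBBCCCC"
today = b"AAAAXXXXCCCCDDDD"
changes = delta(yesterday, today)
print(changes)
```

Only the changed second block and the newly appended fourth block are kept; the unchanged blocks are recovered from yesterday's backup at restore time.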

Pattern Matching

Other vendors use different approaches for finding duplicate data.  Pattern matching doesn’t rely on hashing; instead, this technique uses an advanced pattern recognition and differencing algorithm to find and keep track of duplicates.  Like delta differencing, this approach examines the incoming data stream to see if similar data was received in the past.  However, that similar data is further scrutinized to find any differences and only the unique bytes are saved.  This method may be faster than traditional hashing methods because it is less CPU- and memory-intensive.  The size of the index is smaller than with traditional hashing methods, often resulting in greater levels of performance and scalability.

Where Deduplication Occurs

In data protection, deduplication can occur at one or more places in the data path: at the system being backed up (source-side deduplication), the backup media server (proxy deduplication), or the destination storage device (target-side deduplication).

Some backup applications deduplicate at the data source via client agent technology.  In this case, client software running on an application server identifies and transfers only unique data to the backup media server and target storage device, providing greater network efficiency.  Other backup software solutions deduplicate the backup stream at the backup server—removing any performance burden from production application servers.  Further, some distribute the deduplication process throughout the data path, performing hashing at the client and deduplication at the media server.  It will be important to understand if and/or how deduplication solutions optimize performance and distribute the deduplication workload in software-based approaches.

Deduplicating data after it has passed through the media server is referred to as target-side deduplication.  This approach typically leverages powerful purpose-built storage appliances to accommodate processing of the entire (non-deduplicated) backup load either pre- or post-ingestion.

There are pros and cons to every approach, and selecting one over another depends on multiple factors, such as performance requirements, flexibility, scalability, and cost.  One of the drawbacks of a software approach is that adopting that feature could require a switch or upgrade in backup application or client agents. However, software-based deduplication may offer more flexibility, especially for disk vendor selection. As a built-in feature of backup, a big benefit may be no added cost.  Performance may or may not be a drawback.  This will depend on the characteristics of the hardware where deduplication takes place, whether or not deduplication processing is distributed, and the aforementioned method for identifying duplicates.

Performance is often considered less of an issue with hardware-based deduplication as it typically leverages powerful purpose-built storage appliances.  The trade-offs could be flexibility (in disk vendor) and scalability, depending on the solution.  The key is finding a solution that allows for capacity and performance growth without necessitating a “forklift” upgrade.  This could mean a single highly-scalable system or multiple systems that are managed and monitored from a single management interface.

When Deduplication Occurs

Deduplication can occur before data is written to disk (inline processing) or after it is written to disk (post-processing). Inline approaches inspect and deduplicate data at the source, at the media server, or upon ingest at the disk.  The tradeoffs with this approach are related to performance, which depends on a few factors such as how duplicates are identified, the granularity of deduplication, how the deduplication processing workload is distributed, network performance, and more.  An inline approach may be preferred for workloads if replication to an offsite location is needed immediately.

Post-process deduplication will write the backup image to disk before initiating deduplication, which allows the backup to complete at full disk performance. Oftentimes, the trade-off with this approach is the amount of disk capacity required for the solution as disk capacity will be required to temporarily store the backup stream plus the deduplicated backup.  Some post-process solutions perform deduplication on a job-by-job basis, acting on data as it arrives and releasing the space once the deduplication process is completed for that job, while others deduplicate as data is ingested, minimizing the need for a landing area.  A post-process approach may be preferred if the workload includes a lot of new data, if the backup window is small, or if replicating data to an off-site location can afford some lag time.
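The capacity trade-off described above can be estimated with simple arithmetic.  The sketch below is a rough model under stated assumptions: a uniform deduplication ratio across all jobs, and, for the job-by-job case, jobs that run one at a time so the landing area only needs to hold the largest job.  The mode names are hypothetical labels, not vendor terminology.

```python
def required_capacity_tb(job_sizes_tb: list, ratio: float, mode: str) -> float:
    """Estimate peak disk capacity (TB) for one backup cycle."""
    total = sum(job_sizes_tb)
    store = total / ratio  # deduplicated data that remains on disk
    if mode == "inline":
        return store                      # no landing area needed
    if mode == "post_process_full":
        return total + store              # whole backup lands before processing
    if mode == "post_process_per_job":
        return max(job_sizes_tb) + store  # space released after each job
    raise ValueError(f"unknown mode: {mode}")


jobs = [4, 6, 10]  # TB per backup job (illustrative)
for mode in ("inline", "post_process_full", "post_process_per_job"):
    print(mode, required_capacity_tb(jobs, 10, mode))
```

For 20 TB of backup jobs at an assumed 10:1 ratio, the model gives 2 TB for inline, 22 TB when the full backup lands first, and 12 TB when space is released job by job.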

A few vendors offer both inline and post-process options on a per job basis, which offers additional flexibility.  This makes it possible to customize deduplication strategies for specific workloads.

Deduplication Domains

The deduplication domain refers to the realm of data used for subsequent comparisons when identifying duplicates.  Local deduplication only compares data against other data passing through the same system.  Most target-side deduplication solutions fall into this category.  The good news is that this approach is more field-proven.  The bad news is that local deduplication is often the consequence of scalability limitations.

Conversely, global deduplication makes comparisons within and across systems. This capability is more often seen in software-based and grid-architecture approaches, but may also be supported for target deduplication systems that replicate in a hub-and-spoke fashion (with global deduplication occurring at the hub).  Global deduplication can result in higher deduplication ratios as data is deduplicated within and across backup sources, and greater economies of scale with respect to operational overhead and capital costs.
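The difference between local and global domains can be illustrated by counting the unique chunks that must be stored.  In this sketch, each system's backup data is represented as a list of already-segmented chunks, and SHA-256 is an assumed fingerprint; the chunk contents and names are hypothetical.

```python
import hashlib


def unique_chunk_count(systems: list) -> int:
    """Count unique chunks across the given systems, which share one index."""
    seen = set()
    for chunks in systems:
        for chunk in chunks:
            seen.add(hashlib.sha256(chunk).digest())
    return len(seen)


system_a = [b"os-image", b"db-page-1"]
system_b = [b"os-image", b"mail-store"]

# Local: each system compares data only against its own index.
local_stored = unique_chunk_count([system_a]) + unique_chunk_count([system_b])

# Global: comparisons are made within and across systems.
global_stored = unique_chunk_count([system_a, system_b])

print(local_stored, global_stored)
```

The chunk common to both systems is stored twice under local deduplication but only once under global deduplication, which is the source of the higher ratios noted above.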

Another aspect of the deduplication domain is the storage tiers where deduplication can be applied.  Target-side deduplication solutions are limited to disk-based storage, while backup software with media management capabilities may extend to the tape tier, too.  The ability to move data in its deduplicated state from disk to tape introduces capacity savings for long-term archiving.

Considerations for Evaluating Deduplication

Ease of Deployment and Use

As previously noted, ESG research found that the ability to integrate with existing backup processes and overall ease of use are of greater importance to users than more specific technical considerations. If a deduplication solution is not easy to manage and does not benignly integrate with existing data protection processes, even the best-performing product with the latest whiz-bang features will be a non-starter.

Software incorporating deduplication will be either the easiest option to deploy and use or the most disruptive.  The outcome depends on whether a switch from the incumbent software is required and whether disk is already incorporated in the backup process.

Hardware-based deduplication solutions have garnered popularity as they are easy to deploy and less disruptive to existing backup environments and processes.  Most vendors’ solutions are delivered as target storage systems, appearing as a file server over Ethernet or as a VTL over Fibre Channel.  They offer a plug-and-play experience and don’t require client software.  Compatibility with existing backup software depends on the deduplication approach since content-aware target-side solutions require some development efforts for each backup application supported.  One of the drawbacks to this approach is that the quantity of backup applications compatible with a target deduplication solution may be limited—and the vendor may be slow to add support.

Performance

To determine the optimal deduplication strategy and its impact on overall backup performance, organizations need to examine their backup data sets: size, frequency, criticality, and whether or not deduplication makes sense for each.  Policy-based deduplication (the ability to turn it on or off depending on the workload and its requirements) gives the flexibility to enable deduplication for data sets with lower backup performance requirements or high data redundancy, and disable it for data sets with high backup performance requirements or little data redundancy.  Policy-based deduplication also extends to choosing between inline and post-process implementations.  Post-process approaches have less impact on backup windows, while an inline method could introduce some impact on performance.

Deduplication solutions require regular housekeeping operations called “cleaning” or “garbage collection.”  This process reorganizes stored data, reclaiming space freed by expired data and consolidating free capacity.  Housekeeping operations could impact performance, so it is important to understand if and how the process can be scheduled to avoid peak backup windows.  Enabling housekeeping during an evaluation will provide a more realistic picture of system performance.  When evaluating deduplication solutions, test over an extended period of time: a minimum of one to two weeks or several backup cycles.

Recovery performance is equally important. With deduplication making it much more economical to store backup data sets for longer periods of time, it is more likely that data will be restored from deduplicated data. It will, therefore, be important to test how a deduplication engine performs in several recovery scenarios, especially for data stored over a longer period of time, to judge the potential impact of deduplication in the environment.

Scalability

Data deduplication should mitigate the need to expand storage capacity.  However, it is still important to understand what the upper threshold of capacity is for the solution and, when additional capacity is required, how easy or difficult it is to augment it.  For example, can the solution’s repository expand on a per system basis or will a device upgrade require a new system to be deployed and data to be migrated?  The worst case scenario is for the IT organization to manage an ever-growing number of independent silos.

Manageability

Manageability is a key concern that is often overlooked.  Backup is typically managed from the backup application.  Configuring policy settings, monitoring operations, and reporting results and statistics are centralized.  Adding target-side deduplication creates another point of management where deduplication-specific policies must be set.

While target-side deduplication is simple to implement, what is the management impact as backup capacity grows and the environment scales?  Fewer silos mean fewer points of management; therefore, what are the long-term prospects for managing the backup environment with multiple target deduplication systems?  Does each device have to be managed individually?  Can data be deduplicated across target devices or is each a silo?

Centralized management of policies simplifies administration, decreases complexity, and reduces operational costs.  Backup software with deduplication consolidates policy management and may provide better visibility of operations.

Offsite Copies

Typically, disk-based backup with deduplication is a replacement for tape-based backup.  If that’s the case, then how can backup sets be moved offsite for DR purposes?  Hardware-based deduplication solutions often offer device-to-device remote replication.  While there may be an added cost for acquiring and deploying a second system at a remote location, doing so will provide a safeguard in the event that the primary site (or the backup set managed at that site) is unavailable.  Some backup applications can also replicate data from site to site. It’s important to understand whether data replicated between sites, by either hardware or software solutions, remains in its deduplicated state to optimize bandwidth.

For many environments, physical tape creation is still necessary to fulfill DR and retention requirements.  Most backup software solutions and some VTLs with deduplication offer the added capability of creating physical tape media.  Most solutions supporting tape must “reinflate” data prior to it being copied to tape, eliminating the benefits of deduplication.  One backup solution does offer the ability to move deduplicated data from disk to tape, minimizing the number of tapes required to store data for long-term archiving.  When a recovery is required, tape-based deduplicated data must be copied back to disk, and will then be available to the end-user or application.

Summary

Before seeking out specific vendors, it is most important to understand the organization’s deduplication needs to ensure the right fit for the environment and ease of integration.  This process includes some capacity planning to make sure that the pursued solution will have some longevity and that capacity scaling needs are well understood before purchase.

Once a vendor short list is determined, vet the company and its deduplication product.  Seek out references, understand how many active deployments of the technology are in place, leverage the vendor’s ROI model, compare the results versus competing solutions, and get a glimpse into the product roadmap.  It is important to understand the vendor’s business success and long-term viability, its support capability, how well it communicates with you, and what other services or products it could offer to you today and over time.

Next, test, test, test.  Beginning with installation and configuration, test viable solutions using real data based on policies in place in the current data protection environment.  Record deduplication ratio results, backup and recovery performance (single stream and aggregate performance), and replication performance over an extended period of time—a week or two weeks at a minimum.  These tests should include deleting and expiring data, as well as simulating expected change rates.  If applicable, test the physical tape creation process.  Finally, simulate system failure to test resiliency.

Choosing a deduplication strategy is not a simple task.  Technology maturity varies considerably and the vendor landscape is in flux.  As solutions are considered, cut through the hyperbole by requesting real-world references and proof points to vendor claims.  Test backup and, importantly, restore performance.  Thorough due diligence up front may stave off surprises later.


[1] Source: ESG Research Report, Database Archiving Survey, December 2007.

[2] Source: ESG Research Report, Enterprise Storage Survey, December 2008.

[3] Source: ESG Research Report, Enterprise Storage Survey, December 2008.

[4] Source: ESG Research Report, Data Protection Market Trends, January 2008.
