DEDUPLICATION NOW …
AND WHERE IT’S HEADING
Lauren Whitehouse
Senior Analyst, Enterprise Strategy Group
Need Dedupe?
Before/After Dedupe
[Diagram labels: Production Data; Deduplication in Backup Process; Backup Disk; Deduplication]
Dedupe Evolution
• File-level deduplication or single-instance storage
– Eliminate redundancy across files
• Block-level deduplication technology
– Eliminate redundancy within and between files
• WAN optimization
– Optimizes network bandwidth; aids with data transport between sites
• Deduplication appliances
– Changes the economics of disk-to-disk backup
• VTL with deduplication
– Tape-centric disk-to-disk now optimized
• Multi-node/grid solutions
– Multi-node configurations introduce ability to deliver HA, load balancing, performance increase and global deduplication
• Symantec OST interface
– Symantec solves catalog tracking of deduped copies
• Backup with deduplication
– Deduplication becomes a more pervasive feature in backup software
• Dedupe on tape
– Ability to create tapes for long-term retention that contain deduplicated data
• ?
– What’s next?
Data Growth Out of Control?
Managing the Data Deluge
At approximately what rate do you believe your total volume of data is growing annually?
(Percent of respondents)

Annual growth rate        100 or fewer servers (N=247)    More than 100 servers (N=246)
1% to 10% annually        20%                             9%
11% to 20% annually       42%                             28%
21% to 30% annually       23%                             24%
31% to 40% annually       9%                              9%
More than 40% annually    6%                              30%
• 62% of respondents with 100 or fewer servers report 20% or less growth per year
• 63% of respondents with more than 100 servers report more than 20% growth per year
Storage Spending Priorities
• Increase use of flash-based SSDs: 8%
• Unified storage systems: 9%
• Converged data and storage networking: 9%
• Storage encryption solution: 12%
• Advanced file storage / file system technology…: 14%
• Purchase new NAS storage systems: 15%
• Tape replacement: 15%
• Use cloud storage services as way to source…: 17%
• Tiered storage: 17%
• Purchase more power-efficient storage hardware: 18%
• Data reduction technologies: 18%
• Storage virtualization: 21%
• Improved storage management software tools: 21%
• Purchase new SAN storage systems: 23%
• Data replication solution for off-site disaster…: 24%
• Backup and recovery solutions: 36%
In which data storage areas will your organization make the most significant investments over the next 12-18 months?
(Percent of respondents, five responses accepted, N=289)
Why Do We Need Dedupe?
Data Growth
• Financial benefits
– Reduce disk costs; delay capital expenditures
– Lower bandwidth costs
– Reduce power & cooling costs
– Tape replacement savings
• Operational benefits
– Reduce operational overhead in backup
– Reduce time and resources needed for recovery
• Business benefits
– Increase retention periods
– Improve recovery objectives
– Improve backup consolidation from ROBOs
– Improve DR
Deduplication Creates Efficiencies
in D2D Backup
Best Dedupe Fit?
• “Traditional” file-level backup
• ROBO use cases
• Virtualized environments
… and Worst Fit?
• Pre-compressed or encrypted data
• File types that don’t have versions
(multimedia)
What Impacts Reduction Ratios?
• Backup strategy
(full vs. incremental or differential)
• Change rate between backups
• Retention
• Whether data is encrypted or compressed
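To make the effect of retention and change rate concrete, here is a minimal sketch of a simplified model. It assumes weekly full backups of a constant-size data set in which a fixed fraction of the data is new in each later backup; the function name `estimated_ratio` and the sample inputs are hypothetical, and real ratios also depend on the data type and the product.

```python
def estimated_ratio(retained_fulls: int, change_rate: float) -> float:
    """Logical data protected divided by unique data stored.

    Simplifying assumption: every full backup is the same size, and a
    fraction `change_rate` of it is new compared with the previous full.
    """
    if retained_fulls < 1:
        raise ValueError("retained_fulls must be at least 1")
    if not 0.0 <= change_rate <= 1.0:
        raise ValueError("change_rate must be between 0 and 1")
    # Both quantities are measured in units of one full backup.
    protected = float(retained_fulls)
    stored = 1.0 + (retained_fulls - 1) * change_rate  # first full + changes
    return protected / stored


# Hypothetical inputs: longer retention and lower change rates raise the ratio.
for fulls in (1, 4, 12):
    for rate in (0.02, 0.10):
        ratio = estimated_ratio(fulls, rate)
        print(f"{fulls:>2} retained fulls, {rate:.0%} change: {ratio:.1f}x")
```

Under this model a single retained full never dedupes against anything, which is one reason short retention periods tend to show low ratios.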
Typical Dedupe Ratios
• Less than 10x reduction: 29%
• 10x to 20x reduction: 56%
• More than 20x reduction: 11%
• Don’t know: 5%
On average, what degree of capacity reduction has your organization experienced by using data deduplication technology?
(Percent of respondents, N=140)
Capacity Savings
• Weekly full backup over 8 weeks
• 6 week retention
• 20:1 deduplication ratio
[Chart: Protected Capacity (TB) vs. Stored Capacity (TB) by Retention Period (weeks); vertical axis 5 to 40 TB]
Data labels by week:
Week         1     2     3     4     5     6     7     8
Data label   1.25  1.67  1.88  1.67  1.79  1.76  1.84  2.00
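A deduplication ratio converts to stored capacity and percent savings with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 40 TB protected capacity is an assumed input, not a figure stated in the deck.

```python
def stored_capacity_tb(protected_tb: float, ratio: float) -> float:
    """Disk capacity needed after deduplication at `ratio`:1."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return protected_tb / ratio


def percent_saved(ratio: float) -> float:
    """Share of protected capacity that no longer needs to be stored."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return 100.0 * (ratio - 1.0) / ratio


protected = 40.0  # TB, assumed for illustration
for ratio in (10.0, 20.0):
    print(f"{ratio:.0f}:1 -> {stored_capacity_tb(protected, ratio):.2f} TB stored, "
          f"{percent_saved(ratio):.1f}% saved")
```

Note how the savings flatten out: moving from 10:1 to 20:1 doubles the ratio but adds only five percentage points of savings.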
Which Dedupe Approach Is Best?
Backup Software
VTL Gateway Appliance
NAS Dedupe Device
Identifying Duplicates
• Hash algorithms
– More popular approach
• Fixed block size
• Variable block size
• Sliding window block size
– “Hash collisions” (false positives) a remote risk
– Central index of IDs
• Delta differences
– Faster
– No “false positives”
– Global deduplication across different backup streams is a
limitation
• Hybrid approach
– Combines delta differencing & hash calculation
– Less CPU- and memory-intensive
– Index is smaller
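The hash-based approach with a fixed block size and a central index of IDs can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the 4 KB block size, the choice of SHA-256, the in-memory dictionary standing in for the index, and the function names are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size in bytes


def dedupe_fixed(data: bytes, index: dict[bytes, bytes]) -> list[bytes]:
    """Split data into fixed-size blocks and store each unique block once.

    Returns the ordered list of block IDs needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        block_id = hashlib.sha256(block).digest()  # ID kept in the central index
        index.setdefault(block_id, block)          # write only blocks not seen before
        recipe.append(block_id)
    return recipe


def restore(recipe: list[bytes], index: dict[bytes, bytes]) -> bytes:
    """Re-assemble ("rehydrate") the original data from its block IDs."""
    return b"".join(index[block_id] for block_id in recipe)


index: dict[bytes, bytes] = {}
backup1 = b"A" * 8192 + b"B" * 4096
backup2 = b"A" * 8192 + b"C" * 4096  # same data with the last block changed
recipe1 = dedupe_fixed(backup1, index)
recipe2 = dedupe_fixed(backup2, index)

print(len(recipe1) + len(recipe2), "blocks referenced")
print(len(index), "unique blocks stored")
```

Two different blocks producing the same ID would be the "hash collision" case; with a cryptographic hash of this size the chance is remote, as the slide notes.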
Data Deduplication – Where?
[Diagram columns: Backup Source; Backup Initiator; Backup Target. Labels shown: Remote or Branch Office; ESX Server hosting VMs (OS, Apps); WAN]
Data Deduplication – When?
• Inline deduplication – before data is written to disk
• Post-process deduplication – after data is written to disk
[Diagram columns: Backup Source; Backup Initiator; Backup Target. Labels shown: Remote or Branch Office; ESX Server hosting VMs (OS, Apps); WAN]
Inline vs. Post-Process
Inline
• Requires less I/O
• Replication can begin immediately
• Re-assembly of data for recovery could impact performance
• Examples:
– EMC Data Domain
– IBM ProtecTIER
– NEC HYDRAstor
– Symantec NBU 5000 Series
– Typically all software approaches
Post-Process
• Requires more I/O
• Requires disk landing zone (staging area)
• Dedupe & replication processes overlap
• Most recent full kept in native format
• Examples:
– ExaGrid
– FalconStor
– GreenBytes
– HP VLS
– Quantum DXi
– Sepaton DeltaStor
Single- vs. Multi-Node Solutions
Single-Node Dedupe
• Performance & capacity are limited to an upper threshold
– Forklift upgrade
– Add more islands of dedupe
– Over-purchase to accommodate future growth
• Examples:
– EMC Data Domain
– Fujitsu CS
– GreenBytes
– Quantum
Multi-Node Dedupe
• Manages multiple deduplication systems as one
• More linear throughput & capacity scaling
• Load balancing
• Examples:
– IBM ProtecTIER
– EMC Avamar
– ExaGrid EX Series
– FalconStor FDS
– HP VLS
– NEC HYDRAstor
– Sepaton DeltaStor
– Symantec NetBackup 5000 Series
Local vs. Global Dedupe
Local
• Single domain: backup data passes through an individual system and is compared only with data passing through the same system
• Examples:
– EMC Data Domain
– Fujitsu
– GreenBytes
– Quantum
Global
• Deduplication across domains means backup data is compared with data within its system as well as other systems in the domain
• Can result in higher dedupe ratios
• Examples:
– ExaGrid
– FalconStor
– HP VLS
– IBM ProtecTIER
– NEC
– Sepaton
– Symantec NBU 5000 Series
– Typically most backup software solutions
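The difference between local and global deduplication can be illustrated by counting stored chunks when two systems receive overlapping data. This is a toy sketch under stated assumptions: the 4-byte chunk size, the system names, and the function names are hypothetical, and a Python set stands in for each deduplication index.

```python
import hashlib


def chunk_ids(data: bytes, size: int = 4) -> list[str]:
    """Hash fixed-size chunks; the tiny chunk size keeps the example readable."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def stored_chunks_local(streams_by_system: dict[str, list[bytes]]) -> int:
    """Each system keeps its own index and compares only its own data."""
    total = 0
    for streams in streams_by_system.values():
        index = set()
        for stream in streams:
            index.update(chunk_ids(stream))
        total += len(index)
    return total


def stored_chunks_global(streams_by_system: dict[str, list[bytes]]) -> int:
    """All systems in the domain share one index."""
    index = set()
    for streams in streams_by_system.values():
        for stream in streams:
            index.update(chunk_ids(stream))
    return len(index)


streams = {
    "system-1": [b"AAAABBBBCCCC"],
    "system-2": [b"AAAABBBBDDDD"],  # shares two chunks with system-1
}
print("local :", stored_chunks_local(streams), "chunks stored")
print("global:", stored_chunks_global(streams), "chunks stored")
```

Because the shared chunks are stored once under the global index, the same protected data occupies less disk, which is the sense in which global dedupe can result in higher ratios.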
Dedupe Approaches
Software-Based
• Content-aware; dedupe can be
policy-based
• Can be more cost-effective
• Flexibility in disk selection
• End-to-end bandwidth
efficiency; remote site backup
• Global dedupe
• Simplified management –
single console, policy engine
• Can extend to tape
• Examples:
– Arkeia
– Asigra
– Atempo
– CA
– Cofio
– CommVault
– Druva
– EMC Avamar
– i365
– IBM
– PHD Virtual
– Quest
– Symantec NBU & BE
– Veeam
Hardware-Based
• Multiple backup vendor environments
• No impact on application performance
• Optimized replication
• Scalability of some solutions may cause disruptive upgrades or dedupe “islands”
• Examples:
– EMC
– ExaGrid
– FalconStor
– Fujitsu
– GreenBytes
– HP
– IBM
– NEC
– Quantum
– Sepaton
– Symantec
High-Value Feature
• Target system integration with backup catalogs
and lifecycle policies
– Symantec OpenStorage (OST)
– EMC NetWorker
What’s New in Dedupe?
• New dedupe techniques
– Example: Arkeia Progressive
• Dedupe on tape
– Example: CommVault
• Target solutions moving processes “upstream”
– Example: Data Domain Boost
• Modular dedupe
– Example: HP StoreOnce
• Dedupe in hardware/software from same
vendor
– Example: Symantec
• Ongoing improvements in capacity and
performance
Disruptive Trends
Purchase Considerations
• When deduplication occurs: 9%
• Experience of vendor in backup implementation: 10%
• Deduplication ratio: 12%
• Granularity of deduplication: 14%
• Where deduplication occurs: 17%
• Existing relationship with vendor: 17%
• Ability to replicate deduplicated data off-site: 21%
• Ability to deduplicate across systems/data sets as…: 23%
• Vendor service and support: 24%
• Scalability of solution: 31%
• Integration with existing backup processes: 33%
• Impact on backup/recovery performance: 35%
• Ease of implementation/use: 46%
• Cost of solution: 64%
Which of the following considerations would you say are most important in your organization’s evaluation and selection of data deduplication
technology? (Percent of respondents, N=145, five responses accepted)
Before Seeking Out Solutions …
• Understand your needs
– Capacity and throughput requirements/planning
• Full backup size; incremental backup size
• Number of full/incremental backups per week
• Change rate of data
• Projected growth rate
• Retention policies
• Full backup window
• Offsite copy window
– Performance requirements
– Requirements for offsite copies
– Budget
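The planning inputs listed above can be combined into rough sizing figures before talking to vendors. The sketch below is a simplified, pre-deduplication estimate under stated assumptions: the `BackupProfile` class, its field names, and every sample value are hypothetical, and it ignores growth and change rate for brevity.

```python
from dataclasses import dataclass


@dataclass
class BackupProfile:
    full_tb: float              # size of one full backup
    incremental_tb: float       # size of one incremental backup
    fulls_per_week: int
    incrementals_per_week: int
    retention_weeks: int
    full_window_hours: float    # time allowed for a full backup

    def protected_tb(self) -> float:
        """Logical capacity retained before any deduplication."""
        weekly = (self.full_tb * self.fulls_per_week
                  + self.incremental_tb * self.incrementals_per_week)
        return weekly * self.retention_weeks

    def required_throughput_tb_per_hour(self) -> float:
        """Ingest rate needed to finish a full backup inside its window."""
        return self.full_tb / self.full_window_hours


# Sample values are placeholders, not recommendations.
profile = BackupProfile(
    full_tb=5.0,
    incremental_tb=0.25,
    fulls_per_week=1,
    incrementals_per_week=6,
    retention_weeks=6,
    full_window_hours=10.0,
)
print(f"Protected capacity: {profile.protected_tb():.1f} TB")
print(f"Required throughput: {profile.required_throughput_tb_per_hour():.2f} TB/hour")
```

Dividing the protected capacity by an expected reduction ratio then gives a first estimate of the disk to purchase; the offsite copy window can be checked the same way against replication bandwidth.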
How is Dedupe Evolving?
• Mix of hardware & software approaches
• Scale requirements
– Performance
– Capacity
• Focus on recovery considerations
– Speed of “rehydration” and restore
– Reliability
– Criticality of the index … how is it protected?
• New architectures
• New packaging
• New dedupe techniques
APPENDIX
Fixed- vs. Variable-Length Blocks
Fixed-Length Blocks
• Initial examination: Block A, Block B, Block C, Block D, Block E
• Subsequent examination (change in file): Block A, Block B, Block F, Block G, Block H
• Downstream blocks F, G & H change = no duplication detected after the change
Variable-Length Blocks
• Initial examination: Block A, Block B, Block C, Block D
• Subsequent examination (change in file): Block A, Block E, Block C, Block D
• Downstream blocks C & D unchanged = duplication detected
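The shift effect described above can be reproduced with a small experiment. This is a toy sketch: the variable-length chunker cuts after every space character as a stand-in for the content-defined boundary rules real products use (typically a hash over a sliding window), and the function names, the 6-byte fixed size, and the sample data are all assumptions.

```python
def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, boundary: int = ord(" ")) -> list[bytes]:
    """Cut after each boundary byte, so cut points follow the content."""
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte == boundary:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing data without a boundary
    return chunks


def new_chunks(before: list[bytes], after: list[bytes]) -> int:
    """Count chunks in the later examination that were not seen earlier."""
    seen = set(before)
    return sum(1 for chunk in after if chunk not in seen)


initial = b"alpha beta gamma delta "
changed = b"alpha XX beta gamma delta "  # three bytes inserted near the start

print("fixed   :", new_chunks(fixed_chunks(initial, 6), fixed_chunks(changed, 6)),
      "new chunks")
print("variable:", new_chunks(variable_chunks(initial), variable_chunks(changed)),
      "new chunks")
```

With fixed offsets the insertion shifts every later block, so only the first block is recognized as a duplicate; with content-defined cut points only the chunk containing the change is new.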
[Diagram: Time to DR. Two timelines, one for inline dedupe and one for post-process dedupe, each showing the Backup Job and Replication phases against Time]