Disaster Recovery for Big Data

About us

We are nerds!

Started working in Big Data for international companies

Founded a start-up a few years ago with colleagues working in related technical areas

And who also knew business stuff!

We’ve been participating in different Big Data projects


“I already have HDFS replication and High Availability in my services, why would I need Disaster Recovery (or backup)?”


High Availability (HA)

Protects from failing components: disks, servers, network

Is generally a “systems” issue

Redundant: components are duplicated

Generally has strict network requirements

Fully automated, immediate


Backup

Allows you to go back to a previous state in time: daily, monthly, etc.

It is a “data” issue

Protects from accidental deletion or modification

Also used to check for unwanted modifications

Takes some time to restore


Disaster Recovery

Allows you to work


It is a “business” issue

Protects you from main site failures such as electric power or network outages, fires, floods, or building damage

Similar to having insurance

Medium time to be back online

The ideal Disaster Recovery

High Availability for datacenters

Exact duplicate of the main site

Seamless operation (no changes required)

Same performance

Same data

This is often very expensive and sometimes downright impossible

DR considerations

So, can we build a cheap(ish) DR? We must evaluate some tradeoffs:

What’s the cost of the service not being available? (Murphy’s Law: accidents will happen when you are busiest)

Is all information equally important? Can we lose a small amount of data?

Can we wait until we recover certain data from backup?

Can I find other uses for the DR site?

DR considerations

Near or far?

Availability


Legal considerations

DR considerations

Synchronous vs Asynchronous

Synchronous replication requires a FAST connection

Synchronous works at transaction level and is necessary for operational systems

Asynchronous replication converges over time

Asynchronous replication is not affected by delays, nor does it create them
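The difference between the two modes can be sketched in a few lines. This is a toy in-memory model, not a real replication engine: `Site`, `SyncReplicator` and `AsyncReplicator` are hypothetical names made up for illustration. The point it shows is that a synchronous write waits for both sites, while an asynchronous write returns after the main site and leaves a backlog that converges later.

```python
from collections import deque


class Site:
    """In-memory stand-in for a storage site (hypothetical, illustration only)."""

    def __init__(self, name):
        self.name = name
        self.records = []

    def write(self, record):
        self.records.append(record)


class SyncReplicator:
    """A write is acknowledged only after BOTH sites have stored it."""

    def __init__(self, main, dr):
        self.main, self.dr = main, dr

    def write(self, record):
        self.main.write(record)
        self.dr.write(record)  # the caller waits for the remote round trip


class AsyncReplicator:
    """A write is acknowledged after the main site; DR catches up later."""

    def __init__(self, main, dr):
        self.main, self.dr = main, dr
        self.pending = deque()

    def write(self, record):
        self.main.write(record)
        self.pending.append(record)  # no waiting on the remote site

    def drain(self, max_records=None):
        """Ship queued records to DR; returns how many were shipped."""
        shipped = 0
        while self.pending and (max_records is None or shipped < max_records):
            self.dr.write(self.pending.popleft())
            shipped += 1
        return shipped

    def lag(self):
        """Records that would be missing in DR if the main site failed now."""
        return len(self.pending)
```

In this model, `lag()` is the amount of data you accept losing in a disaster, which is the tradeoff asked about earlier ("Can we lose a small amount of data?").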

Big Data DR

Can’t generally be copied synchronously

No VM replication

Other DR rules apply:

Since it impacts users, someone is in charge of the “starting gun”

DNS and network changes to point clients

Main types:

Storage replication

Dual ingestion

Storage replication

Similar to non-Big Data solutions, where central storage is replicated

Generally implemented using distcp and HDFS snapshots

Data is ingested in source cluster and then copied
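A minimal sketch of one snapshot-based copy cycle, assuming the source directory has already been made snapshottable and that both clusters are reachable by URI. The function names, snapshot names and the `hdfs://main-nn` / `hdfs://dr-nn` URIs are hypothetical. The commands are only built and printed by default, not executed.

```python
import subprocess


def create_snapshot_cmd(path, snapshot):
    """Command that snapshots a (snapshottable) HDFS directory."""
    return ["hdfs", "dfs", "-createSnapshot", path, snapshot]


def initial_copy_cmd(src_uri, dst_uri, snapshot):
    """First full copy: read from the read-only snapshot path on the source."""
    return ["hadoop", "distcp", "-update",
            f"{src_uri}/.snapshot/{snapshot}", dst_uri]


def incremental_copy_cmd(src_uri, dst_uri, prev_snapshot, new_snapshot):
    """Later runs: copy only what changed between two source snapshots.

    DistCp's -diff option requires -update, and expects the destination to
    hold prev_snapshot as well, unmodified since that snapshot was taken.
    """
    return ["hadoop", "distcp", "-update", "-diff",
            prev_snapshot, new_snapshot, src_uri, dst_uri]


def replication_cycle(src_uri, dst_uri, prev_snapshot, new_snapshot):
    """Commands for one incremental cycle, in the order they should run."""
    return [
        create_snapshot_cmd(src_uri, new_snapshot),
        incremental_copy_cmd(src_uri, dst_uri, prev_snapshot, new_snapshot),
        # Snapshot the destination under the same name so that the next
        # cycle can use new_snapshot as its starting point.
        create_snapshot_cmd(dst_uri, new_snapshot),
    ]


def run(cmd, dry_run=True):
    """Print the command, or execute it when dry_run is False."""
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.run(cmd, check=True).returncode


if __name__ == "__main__":
    # Hypothetical cluster URIs and snapshot names.
    for cmd in replication_cycle("hdfs://main-nn/data", "hdfs://dr-nn/data",
                                 "s20240101", "s20240102"):
        run(cmd)
```

Copying from a snapshot rather than from the live directory means the copy job sees a consistent view even while ingestion continues on the source cluster.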

Storage replication

Administrative overhead:

Copy jobs must be


Metadata changes must be tracked

Good enough for data that comes from traditional ETLs such as daily batches

Dual Ingestion

No files, just streams

Generally ingested from multiple outside sources through Kafka

Streams must be directed to both sites
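One way to picture "directing streams to both sites" is a thin wrapper that sends every message twice. The sketch below is an assumption-laden illustration: `DualWriter` is a hypothetical name, and the two senders are plain callables standing in for whatever producer client each site uses (for example, a Kafka producer's send call).

```python
class DualWriter:
    """Send every message to both sites and remember what failed."""

    def __init__(self, send_main, send_dr):
        # Each sender is a callable (topic, message) -> None that raises
        # on failure. Both are stand-ins for real producer clients.
        self.senders = {"main": send_main, "dr": send_dr}
        self.failed = []  # (site, topic, message) tuples awaiting retry

    def send(self, topic, message):
        """Try both sites; one site failing does not block the other."""
        results = {}
        for site, sender in self.senders.items():
            try:
                sender(topic, message)
                results[site] = True
            except Exception:
                results[site] = False
                self.failed.append((site, topic, message))
        return results

    def retry_failed(self):
        """Resend failed messages; returns how many are still failing."""
        still_failed = []
        for site, topic, message in self.failed:
            try:
                self.senders[site](topic, message)
            except Exception:
                still_failed.append((site, topic, message))
        self.failed = still_failed
        return len(still_failed)
```

The `failed` list is where the two sites start to diverge, which is why dual ingestion needs consistency checks on top.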

Dual Ingestion

Adds complexity to apps

NiFi can be set up as a front-end to both


Data consistency must be checked

Can be automatically set up via monitoring

Consolidation processes (such as a monthly re-sync) might be needed
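A consistency check between sites can be as simple as comparing a count and a digest per partition. The sketch below is one possible approach, not a description of any specific monitoring product: `summarize` and `compare_sites` are hypothetical helpers, and records are assumed to be available as bytes grouped by a partition key (for example, a day or a topic).

```python
import hashlib

_MOD = 1 << 256  # keep the accumulated digest at a fixed width


def summarize(records):
    """Return (count, digest) for an iterable of bytes records.

    The digest is the sum of per-record SHA-256 values, so it does not
    depend on arrival order, which usually differs between sites.
    """
    count, acc = 0, 0
    for record in records:
        acc = (acc + int.from_bytes(hashlib.sha256(record).digest(), "big")) % _MOD
        count += 1
    return count, f"{acc:064x}"


def compare_sites(main_partitions, dr_partitions):
    """Return the sorted partition keys whose content differs between sites.

    Both arguments map a partition key to an iterable of bytes records.
    A partition missing on one side is treated as empty.
    """
    mismatched = []
    for key in sorted(set(main_partitions) | set(dr_partitions)):
        main_summary = summarize(main_partitions.get(key, []))
        dr_summary = summarize(dr_partitions.get(key, []))
        if main_summary != dr_summary:
            mismatched.append(key)
    return mismatched
```

The partitions reported by `compare_sites` are the candidates for a consolidation run such as the monthly re-sync.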


Ingestion replication

Variant of the dual ingestion

A consumer is set up in the source Kafka that in turn writes to a destination Kafka

Bottleneck if the initial streams were generated by many producers
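The consumer-that-writes pattern can be sketched with in-memory logs. `Topic` and `Mirror` are hypothetical stand-ins for a Kafka topic and a mirroring consumer; real deployments would use actual Kafka clients or a dedicated replication tool. The sketch makes the bottleneck visible: many producers can append to the source, but a single mirror copies at its own pace.

```python
class Topic:
    """Append-only in-memory log standing in for a Kafka topic."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset of the new message

    def read(self, offset, max_messages):
        return self.log[offset:offset + max_messages]


class Mirror:
    """Consumes from the source topic and writes to the destination topic."""

    def __init__(self, source, destination, batch_size=100):
        self.source = source
        self.destination = destination
        self.batch_size = batch_size
        self.offset = 0  # next source offset to copy

    def poll_once(self):
        """Copy at most one batch; returns how many messages were copied."""
        batch = self.source.read(self.offset, self.batch_size)
        for message in batch:
            self.destination.append(message)
        self.offset += len(batch)
        return len(batch)

    def lag(self):
        """Messages in the source not yet copied to the destination."""
        return len(self.source.log) - self.offset
```

If producers append faster than `poll_once` can copy, `lag()` keeps growing, which is the bottleneck described above.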

Mixed: Previous solutions are not mutually exclusive

Storage replication for batch processes’ results

Dual ingestion for streams

Commercial offerings

Solutions that ease DR setup:

Cloudera BDR: Coordinates HDFS snapshots and copy

WANdisco Fusion: Continuous storage replication

Confluent Multi-site: Allows multi-site Kafka data replication


Big Data clusters have many nodes

Costly to replicate

Performance / Capacity tradeoff

We can use cheaper servers in DR, since we don’t expect to use them often


Document and test procedures

DR is rarely fully automated, so responsibilities and actions should be clearly defined

Plan for (at least) a yearly DR run

Track changes in software and configuration
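Tracking configuration changes can start with a plain diff between the two sites. The helper below is a hypothetical sketch that assumes each site's settings have already been exported as a flat key/value dictionary; how they are exported depends on your tooling.

```python
MISSING = "<missing>"  # marker for a key that exists on one site only


def config_drift(main_config, dr_config):
    """Return {key: (main_value, dr_value)} for every key that differs."""
    drift = {}
    for key in sorted(set(main_config) | set(dr_config)):
        main_value = main_config.get(key, MISSING)
        dr_value = dr_config.get(key, MISSING)
        if main_value != dr_value:
            drift[key] = (main_value, dr_value)
    return drift
```

An empty result before each DR run is a cheap sanity check that the DR site still matches what the procedures assume.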


Once you have a DR solution, other uses will surface

DR site can be used for backup

Maintain HDFS


DR data can be used for testing / reporting

Warning: it may alter stored data


Balance HA / Backup / DR as needed; they are not mutually exclusive:

Different costs

Different impact

Big Data DR is different:

Dedicated hardware

No VMs, no storage arrays

Plan for DATA CENTRIC solutions


