
The challenge of serving large amount of batch-computed data

Description:
These are the slides from our meetup on the subject, based on the SimilarGroup case. http://www.meetup.com/HadoopIsrael/events/142131622/
Transcript
Page 1: The challenge of serving large amount of batch-computed data

A very BIG data Company

The challenge of serving massive batch-computed data sets on-line

1 / 10 / 2013

Page 2: The challenge of serving large amount of batch-computed data

The challenge of serving massive batch-computed data sets online

David Gruzman

Page 3: The challenge of serving large amount of batch-computed data

Serving batch-computed data

by David Gruzman

► Today we will discuss the case of a multi-terabyte dataset that is periodically recalculated and has to be served in real time.

► SimilarWeb allowed us to reveal the internals of their operation in order to give context to the problem.

Page 4: The challenge of serving large amount of batch-computed data

Similar Web data flow – the context

► The company assembles billions of events from its panel on a daily basis.

► A fast-growing Hadoop cluster is used to process this data with various kinds of statistical analysis and machine learning.

► The data model is “web scale”. The data derived from the raw events is processed into “top pages”, “demography”, “keywords” and the many other metrics the company assembles.

► The problem dimensionality is: per domain, per day, per country. More dimensions might appear.

Page 5: The challenge of serving large amount of batch-computed data

How data is calculated

► Data is imported into HDFS from a farm of application servers.

► A set of MR jobs and Hive scripts does the data processing.

► The result data has a common key-value structure, where the key is our dimensions or a subset of them. For example:

Key: “cnn.com_01012013_USA”

Value: “Top Pages: Page1, ...; statistics: ...”
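
A tiny illustrative sketch (not SimilarWeb's actual code) of composing such a composite key and value in Java; the method name and the "_" separator are assumptions taken from the example above.

public final class MetricKey {

    /** Builds a row key like "cnn.com_01012013_USA" from the three dimensions. */
    static String rowKey(String domain, String dateDdMMyyyy, String country) {
        return domain + "_" + dateDdMMyyyy + "_" + country;
    }

    public static void main(String[] args) {
        String key = rowKey("cnn.com", "01012013", "USA");
        String value = "Top Pages: Page1, ...; statistics: ...";
        System.out.println(key + " -> " + value);
    }
}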

Page 6: The challenge of serving large amount of batch-computed data

Abstract schema of the relevant part of SimilarWeb IT

[Diagram: AppServers → Hadoop MapReduce → HBase staging cluster → two HBase production clusters]

Page 7: The challenge of serving large amount of batch-computed data

HBase under heavy inserts

► First of all – it does work.

► The question is – what was done...

Page 8: The challenge of serving large amount of batch-computed data

HBase: split storms

► When you insert data evenly into many regions, all of them start splitting at roughly the same time. HBase does not like it: it becomes unavailable, insertion jobs fail, leases expire, etc.

► Solution: pre-split the table and disable automatic splitting (a sketch follows below).

► Price: it is hard to achieve an even distribution of the data among regions, so hotspots are possible.
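
A minimal sketch, assuming the current HBase Java client API (the deck predates it), of creating a pre-split table with automatic splitting disabled via DisabledRegionSplitPolicy; the table name, column family and split points are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // Split points should match the expected key distribution
            // (here: a crude alphabetic split over the domain prefix).
            byte[][] splitKeys = {
                Bytes.toBytes("d"), Bytes.toBytes("h"), Bytes.toBytes("m"),
                Bytes.toBytes("q"), Bytes.toBytes("u")
            };

            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("metrics"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    // Turn off automatic region splitting for this table.
                    .setRegionSplitPolicyClassName(
                        "org.apache.hadoop.hbase.regionserver.DisabledRegionSplitPolicy")
                    .build(),
                splitKeys);
        }
    }
}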

Page 9: The challenge of serving large amount of batch-computed data

Compaction storms

► Under heavy load on all regions, all of them start minor compactions at the same time.

► The results are similar to the split storm... Nothing good.

Page 10: The challenge of serving large amount of batch-computed data

Inherent problem – delayed work

► HBase does not do ALL of the work required during an insert.

► Part of the work is delayed until compaction.

► A system that delays work is inherently problematic under prolonged high load.

► It is well suited to spikes of activity, not to a steady heavy load.

Page 11: The challenge of serving large amount of batch-computed data

Massive insert problem

► There is a lot of overhead in randomly inserting data.

► What happens is that MapReduce produces already-sorted data and HBase sorts it again.

► HBase sorts data constantly, while MR does it in batch, which is inherently more efficient.

► HBase is a strongly consistent system, and under heavy load all kinds of (lease-related) problems happen.

Page 12: The challenge of serving large amount of batch-computed data

Domino effect

Page 13: The challenge of serving large amount of batch-computed data

HBase snapshots come to the rescue

► A snapshot is the capability to capture a “point in time” state of a table.

► Technically a snapshot is the list of files which constitute the table, so taking a snapshot is a pure metadata operation.

► When files of the table are to be deleted, they are moved to the archive directory instead.

► Thus operations like clone and restore are just file renames and metadata changes (a small sketch follows below).
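
A minimal sketch of the snapshot and clone operations described above, using the HBase Admin API; the table and snapshot names are placeholders.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            // Point-in-time snapshot: a metadata-only operation that records
            // the list of files currently making up the table.
            admin.snapshot("metrics_20131001", TableName.valueOf("metrics"));

            // Cloning materialises a new table from those file references,
            // again without copying the data itself.
            admin.cloneSnapshot("metrics_20131001", TableName.valueOf("metrics_serving"));
        }
    }
}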

Page 14: The challenge of serving large amount of batch-computed data

HBase – snapshot export

[Diagram: a region's files “before 1” and “before 2” plus the file written after the snapshot; on deletion the “before” files are moved/renamed into the archive directory]

Page 15: The challenge of serving large amount of batch-computed data

HBase – snapshot export

► Snapshots have an additional capability – export.

► Technically it works like DISTCP and does not even require a live HBase cluster on the destination side; only HDFS has to be operational (see the sketch after this list).

► What we gain – DISTCP speed and scalability.

► What happens – the files are copied into the archive directory, and HBase uses its structure as the metadata.

► Note – you have to disable archive cleanup during the copy... (to verify)
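
A hedged sketch of driving the export programmatically; in practice it is usually run from the command line as documented, e.g. "hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <name> -copy-to <hdfs-uri> -mappers 16". The destination URI, snapshot name and mapper count below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.snapshot.ExportSnapshot;
import org.apache.hadoop.util.ToolRunner;

public class ExportSnapshotExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Copies the snapshot's files DistCp-style to the destination HDFS;
        // only HDFS needs to be up on that side.
        int rc = ToolRunner.run(conf, new ExportSnapshot(), new String[] {
            "-snapshot", "metrics_20131001",
            "-copy-to", "hdfs://production-cluster:8020/hbase",
            "-mappers", "16"
        });
        System.exit(rc);
    }
}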

Page 16: The challenge of serving large amount of batch-computed data

So how do snapshots help us?

► As you remember, SimilarWeb has several HBase clusters: one is used as the company data warehouse and two are used to serve production.

► So we prepare the data on the cluster where we can tolerate long timeouts, and then move it to the production clusters using snapshots.

Page 17: The challenge of serving large amount of batch-computed data

So we get to the following solution

[Diagram: AppServers → Hadoop MapReduce → HBase staging cluster → snapshot export → two HBase production clusters]

Page 18: The challenge of serving large amount of batch-computed data

Is it ideal?

► We effectively minimized the impact on the HBase region servers.

► But we are left with the HBase high-availability problem.

► Currently we have two production HBase clusters to overcome it.

► It works, but it is far from ideal hardware utilization.

► HBase read latency is not ideal either.

Page 19: The challenge of serving large amount of batch-computed data

Conceptual problem

► In production we do not need strong consistency, yet in CAP-theorem terms we pay for it when partitions happen; in practice it is an availability problem.

► We do not need random writes, and most of HBase is built around them.

► We actually have a more complex system than we need.

Page 20: The challenge of serving large amount of batch-computed data

BigTable vs Dynamo

► There are two kinds of NoSQL systems – those modeled after BigTable (HBase, Hypertable) and those modeled after Dynamo (Cassandra, Voldemort, ...).

► BigTable-style – good for a data warehouse, where the capability to scan ranges of data is important.

► Dynamo-style – good for online serving, since these systems are more highly available.

Page 21: The challenge of serving large amount of batch-computed data

Evaluation process

► We decided to research which system better suits the need.

► The need was formulated as “to be able to prepare the data files offline and copy them into the system at the file level.”

► In addition, high availability is a must, so systems built around the consistent-hashing idea were preferred.

Page 22: The challenge of serving large amount of batch-computed data

ElephantDB

► https://github.com/nathanmarz/elephantdb

► This is a system created exactly for this case.

► It is capable of serving data from an index prepared offline.

► It is very simple – about 5K lines of code.

► Main drawback – it is little known, with very few known usages.

Page 23: The challenge of serving large amount of batch-computed data

ElephantDB

► Berkeley DB Java Edition is used to serve the local indexes; this is shared with Voldemort, which also has such an option (see the sketch after this list).

► An MR job (Cascading) is used to prepare the indexes.

► The indexes are cached locally by the servers in the ring.

► There is an MR job for incremental changes to the data.
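
To make the “local index” idea concrete, here is a minimal read-only lookup against a Berkeley DB Java Edition store. It is an illustration under assumed paths and names, not ElephantDB's actual code.

import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class LocalIndexLookup {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setReadOnly(true);            // shards are served read-only
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setReadOnly(true);

        Environment env = new Environment(new File("/data/shard-0"), envConfig);
        Database db = env.openDatabase(null, "metrics", dbConfig);
        try {
            DatabaseEntry key = new DatabaseEntry(
                "cnn.com_01012013_USA".getBytes(StandardCharsets.UTF_8));
            DatabaseEntry value = new DatabaseEntry();

            // A point lookup against the locally held index file.
            if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                System.out.println(new String(value.getData(), StandardCharsets.UTF_8));
            }
        } finally {
            db.close();
            env.close();
        }
    }
}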

Page 24: The challenge of serving large amount of batch-computed data

ElephantDB – batch read

► Having the data sitting in the DFS in an MR-friendly format enables us to run scans right there.

► The opposite example – we usually scan an HBase table in order to process it with MR; when there is no filtering / predicate push-down, that is a serious waste of resources.

Page 25: The challenge of serving large amount of batch-computed data

ElephantDB – drawbacks

► The first one – rare use; we already mentioned it.

► It is read-only. In case we also need random writes, we will need to deploy another NoSQL store.

Page 26: The challenge of serving large amount of batch-computed data

Voldemort...

Page 27: The challenge of serving large amount of batch-computed data

Project - Voldemort

► NoSQL

► Pluggable Storage engines

► Pluggable serialization (TBD)

► Consistent hashing (illustrated in the sketch after this list)

► Eventual consistency

► Support for batch-computed read-only stores
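
An illustrative consistent-hashing ring in plain Java (not Voldemort's actual implementation): each node owns several virtual points on a hash ring, and a key is served by the first node clockwise from the key's hash, so adding or losing a node only remaps a small fraction of the keys.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 64;
    private final TreeMap<Long, String> ring = new TreeMap<>();

    /** Places VIRTUAL_NODES points for the node on the ring. */
    public void addNode(String node) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    /** Returns the node owning the first ring point at or after the key's hash. */
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println(ring.nodeFor("cnn.com_01012013_USA"));
    }
}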

Page 28: The challenge of serving large amount of batch-computed data

Voldemort logical architecture

Page 29: The challenge of serving large amount of batch-computed data

How building the data works

► The build job gets the whole cluster configuration as a parameter.

► Therefore it can build the data specific to each node.

Page 30: The challenge of serving large amount of batch-computed data

Pull vs Push

► It was an interesting decision by the LinkedIn engineers to implement pull rather than push.

► The explanation is that Voldemort, as the serving system, should be able to throttle the data load in order to prevent performance degradation.

Page 31: The challenge of serving large amount of batch-computed data

Performance

We tested on a dedicated 3-node cluster with SSDs.

► Throughput – 5-6K reads per second barely change the CPU level. The documentation talks about 20K requests per node.

► Latency – 10-15 milliseconds for non-cached data. We are still investigating this number; it sounds too high for SSDs.

► 1-1.5 milliseconds for cached data.

Page 32: The challenge of serving large amount of batch-computed data

Caching remarks

► Voldemort (like MongoDB) does not implement its own caching mechanism but offloads it to the OS.

► This is done by MMAPing the data files (see the sketch below).

► In my opinion this is an inferior approach, since the OS does not have application-specific statistics and it adds unneeded context switches.
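
A minimal sketch of the MMAP approach, assuming a placeholder file path: the store file is mapped read-only and the OS page cache decides which pages stay resident.

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapRead {
    public static void main(String[] args) throws Exception {
        Path dataFile = Paths.get("/data/shard-0/0.data");
        try (FileChannel ch = FileChannel.open(dataFile, StandardOpenOption.READ)) {
            // Map the whole file read-only; the OS, not the application,
            // decides what is cached in memory.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // An uncached page triggers a page fault and a disk read, which is
            // where the higher latencies show up.
            System.out.println("first byte: " + buf.get(0));
        }
    }
}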

Page 33: The challenge of serving large amount of batch-computed data

Voldemort summary

For: ► Easy to install – it took 2 hours to build the cluster, even without an installer.

► Pluggable storage engines.

► Support for efficient import of batch-computed data

► Open Source

Against: ► Very little documentation and support

► No commercial support available

Page 34: The challenge of serving large amount of batch-computed data

Method limitation

There is a limit to the pre-computing approach when the number of dimensions grows.

What we do – we have a proprietary layer built on LINQ and C# which performs the missing aggregations.

We are also evaluating JethroData, which can do it the SQL way.

It is an RDBMS engine running on top of HDFS that provides full indexing with join and group-by capability.

Page 35: The challenge of serving large amount of batch-computed data

ElephantDB information used

► http://www.slideshare.net/nathanmarz/elephantdb

► http://computerhelpkansascity.blogspot.co.il/2012/06/introducing-elephantdb-distributed_26.html

