
2013: year of real-time access to Big Data?

Geoffrey Hendrey

@geoffhendrey

@vertascale

Agenda

• Motivation

• Hadoop stack & data formats

• File access times and mechanics

• Key-based indexing systems (HBase)

• MapReduce, Hive/Pig

• MPP approaches & alternatives

Motivation

• Big Data is more opaque than small data

– Spreadsheets choke

– BI tools can’t scale

– Small samples often fail to replicate issues

• Engineers, data scientists, analysts need:

– Faster “time to answer” on Big Data

– Rapid “find, quantify, extract”

• Solve “I don’t know what I don’t know”

Survey of real-time capabilities

• Real-time, in-situ, self-service access is the "Holy Grail" for the business analyst

• A spectrum of real-time capabilities exists on Hadoop

[Spectrum diagram: Easy → Hard; Available in Hadoop → Proprietary; HDFS → HBase → Drill]

Hadoop Stack Review

Real-time spectrum on Hadoop

Use Case | Support | Real-time?
Seek to a particular byte in a distributed file | HDFS | YES
Seek to a particular value in a distributed file, by key (1-dimensional indexing) | HBase | YES
Answer complex questions expressible in code (e.g. matching users to music albums); data science | MapReduce (Hive, Pig) | NO
Ad-hoc query for scattered records given simple constraints (field[4]=="music" && field[9]=="dvd") | MPP architectures | YES

Hadoop Underpinned By HDFS

• Hadoop Distributed File System (HDFS)

• Inspired by the Google File System (GFS)

• Underpins every piece of data in "Hadoop"

• The Hadoop FileSystem API is pluggable (see the sketch below)

• HDFS can be replaced with other suitable distributed filesystems:

– S3

– Kosmos (KFS)

– etc.
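A minimal sketch of the pluggable FileSystem API (the URI, path, and offset are hypothetical): the URI scheme selects the implementation, so the same client code can seek into a file on HDFS, S3, or another plugged-in store.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PluggableFsDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The scheme (hdfs://, s3n://, file://, ...) picks the FileSystem implementation.
            URI uri = URI.create("hdfs://namenode:8020/logs/events.dat"); // hypothetical
            FileSystem fs = FileSystem.get(uri, conf);
            try (FSDataInputStream in = fs.open(new Path(uri))) {
                in.seek(1_000_000L);          // jump to an arbitrary byte offset
                byte[] buf = new byte[4096];
                int n = in.read(buf);         // read a block starting at that offset
                System.out.println("read " + n + " bytes");
            }
        }
    }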

Amazon S3

Distributed File System

Distributed File System Mechanics

HDFS performance characteristics

• HDFS was designed for high throughput, not low seek latency

• Best-case configurations have shown HDFS performing 92K random reads per second [http://hadoopblog.blogspot.com/]

• Personal experience: HDFS is very robust. Fault tolerance is "real": I've unplugged machines and never lost data.

Typical Hadoop File Formats

MapFile for real-time access?

– The index file must be loaded by the client (slow)

– The index file must fit in the client's RAM by default

– Lookups scan an average of 50% of the sampling interval

– Large records make scanning intolerable

– Not a viable "real world" solution for random access (see the lookup sketch below)
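For reference, a minimal sketch of a MapFile lookup (the path and key are hypothetical); note that constructing the Reader is what loads the whole index into client RAM.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Opening the reader loads the entire index file into client memory.
            MapFile.Reader reader = new MapFile.Reader(fs, "/data/users.map", conf); // hypothetical
            Text value = new Text();
            // get() binary-searches the in-memory index, seeks into the data file,
            // then scans forward (on average half the sampling interval) to the key.
            Text found = (Text) reader.get(new Text("user-42"), value); // null if absent
            System.out.println(found);
            reader.close();
        }
    }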

Apache HBase

• Clone of Google's Bigtable

• Key-based access mechanism (see the sketch below)

• Designed to hold billions of rows

• "Tables" stored in HDFS

• Supports MapReduce over tables, and into tables

• Requires you to think hard, and commit to a key design
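A minimal sketch of that key-based access with the 2013-era HBase client API (the table, column family, and row key are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");             // hypothetical table
            Get get = new Get(Bytes.toBytes("user-42"));          // lookup by row key
            Result result = table.get(get);                       // single-row random read
            byte[] email = result.getValue(Bytes.toBytes("info"),    // column family
                                           Bytes.toBytes("email"));  // qualifier
            System.out.println(email == null ? "miss" : Bytes.toString(email));
            table.close();
        }
    }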

HBase Architecture

HBase random read performance

Benchmark from http://hstack.org/hbase-performance-testing/:

• 7 servers, each with:

– 8 cores

– 32GB DDR3 RAM

– 24 x 146GB SAS 2.0 10K RPM disks

• HBase table:

– 3 billion records

– 6,600 regions

– 128-256 bytes per row, spread across 1 to 5 columns

MapReduce

• "MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers" -Wikipedia

• MapReduce is strongly tied to HDFS in Hadoop

• Systems built on HDFS (e.g. HBase) leverage this common foundation for integration with the MR paradigm (see the word count sketch below)
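As a concrete reference point, the canonical word count shows the shape of the paradigm; this is a minimal sketch trimmed to the mapper and reducer (job setup omitted):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit (word, 1) for every token in the input split.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: after the copy/shuffle/sort phases, sum the counts per word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }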

MapReduce and Data Science

• Many complex algorithms can be expressed in the MapReduce paradigm:

– NLP

– Graph processing

– Image codecs

• The more complex the algorithm, the more the Map and Reduce phases become complex programs in their own right

• Complex pipelines often cascade multiple MR jobs in succession

A very bad* diagram

*This diagram makes it appear that data flows through the master node.

A better picture

Is MapReduce real-time?

• MapReduce on Hadoop has certain latencies that are hard to improve:

– Copy

– Shuffle, sort

– Iterate

• Runtime depends on both the size of the input data and the number of processors available

• In a nutshell, it's a "batch process" and isn't "real-time"

Hive and Pig

• Run on top of MapReduce

• Provide a "table" metaphor familiar to SQL users

• Provide SQL-like (in Hive's case, nearly identical) syntax

• Store a "schema" in a database, mapping tables to HDFS files

• Translate "queries" to MapReduce jobs (see the sketch below)

• No more real-time than MapReduce
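For illustration, a query submitted to Hive over JDBC (the HiveServer2 URL and table are hypothetical); behind executeQuery, Hive compiles the SQL into one or more MapReduce jobs, which is exactly why latency stays batch-like.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default"); // hypothetical HiveServer2
            Statement stmt = conn.createStatement();
            // Compiled to MapReduce: a table scan plus a shuffle for the GROUP BY.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sales GROUP BY category"); // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }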

MPP Architectures

• Massively Parallel Processing

• Lots of machines, so also lots of memory

Examples:

• Spark – a general-purpose data science framework; sort of like real-time MapReduce

• Dremel – a columnar approach, geared toward answering SQL-like aggregations and BI-style questions

Spark

• Originally designed for iterative machine learning problems at Berkeley

• MapReduce does not do a great job on iterative workloads

• Spark makes more explicit use of memory caches than Hadoop (see the sketch below)

• Spark can load data from any Hadoop input source
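A minimal sketch of that caching behavior with Spark's Java API (the master URL and path are hypothetical; assumes Java 8 lambdas and a Spark 1.x-or-later client): the second pass over the cached RDD is served from memory instead of re-reading HDFS.

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkCacheDemo {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[*]", "cache-demo");
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/events"); // hypothetical path
            lines.cache(); // mark for in-memory caching; filled by the first action

            long errors = lines.filter(s -> s.contains("ERROR")).count(); // reads HDFS, fills cache
            long warns  = lines.filter(s -> s.contains("WARN")).count();  // served from memory
            System.out.println(errors + " errors, " + warns + " warnings");
            sc.stop();
        }
    }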

Effect of Memory Caching in Spark

Is Spark Real-time?

• If data fits in memory, execution time for most algorithms still depends on:

– the amount of data to be processed

– the number of processors

• So, it still "depends"

• ...but Spark is definitely more focused on fast time-to-answer

• Interactive Scala and Java shells

Dremel MPP architecture

• MPP architecture for ad-hoc query on nested data

• Apache Drill is an open-source clone of Dremel

• Dremel was originally developed at Google

• Features "in situ" data analysis

• "Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations." -Dremel: Interactive Analysis of Web-Scale Datasets

In Situ Analysis

• Moving Big Data is a nightmare

• In situ: the ability to access data in place

– in HDFS

– in Bigtable

Uses For Dremel At Google

• Analysis of crawled web documents.

• Tracking install data for applications on Android Market.

• Crash reporting for Google products.

• OCR results from Google Books.

• Spam analysis.

• Debugging of map tiles on Google Maps.

• Tablet migrations in managed Bigtable instances.

• Results of tests run on Google's distributed build system.

• Etc., etc.

Why so many uses for Dremel?

• On any Big Data problem or application, the dev team faces these problems:

– "I don't know what I don't know" about the data

– Debugging often requires finding and correlating specific needles in the haystack

– Support and marketing often require segmentation analysis (identifying and characterizing wide swaths of data)

• Every developer/analyst wants:

– Faster time to answer

– Fewer trips around the mulberry bush

Column Oriented Approach

Dremel MPP query execution tree

Is Dremel real-time?

Alternative approaches?

• Both MapReduce and MPP query architectures take a "throw hardware at the problem" approach

• Alternatives?

– Use MapReduce to build distributed indexes on the data

– Combine columnar storage and inverted indexes to create columnar inverted indexes (see the toy sketch below)

– Aim for the sweet spot for data scientists and engineers: ad-hoc queries with results returned in seconds on a single processing node
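To illustrate the idea only (a toy sketch with invented classes and data, not the author's implementation): keep one inverted index per column, mapping each value to the row IDs that contain it, so a conjunctive predicate like field[4]=="music" && field[9]=="dvd" becomes an intersection of posting lists.

    import java.util.*;

    // Toy columnar inverted index: per column, value -> sorted set of row IDs.
    class ColumnIndex {
        private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

        void add(int rowId, String value) {
            postings.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
        }

        SortedSet<Integer> lookup(String value) {
            return postings.getOrDefault(value, new TreeSet<>());
        }
    }

    class Demo {
        public static void main(String[] args) {
            ColumnIndex col4 = new ColumnIndex();
            ColumnIndex col9 = new ColumnIndex();
            col4.add(0, "music"); col9.add(0, "dvd");  // row 0 matches both predicates
            col4.add(1, "music"); col9.add(1, "book"); // row 1 matches only column 4
            SortedSet<Integer> hits = new TreeSet<>(col4.lookup("music"));
            hits.retainAll(col9.lookup("dvd"));        // intersect the posting lists
            System.out.println(hits);                  // prints [0]
        }
    }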

Contact Info

Email:

[email protected]

Twitter:

@geoffhendrey

@vertascale

www:

http://vertascale.com

References

• http://www.ebaytechblog.com/2010/10/29/hadoop-the-power-of-the-elephant/

• http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/

• http://yoyoclouds.wordpress.com/tag/hdfs/

• http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

• http://hadoopblog.blogspot.com/

• http://lunarium.info/arc/images/Hbase.gif

• http://www.zdnet.com/i/story/60/01/073451/zdnet-amazon-s3_growth_2012_q1_1.png

• http://hstack.org/hbase-performance-testing/

• http://en.wikipedia.org/wiki/File:Mapreduce_Overview.svg

• http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png

• Dremel: Interactive Analysis of Web-Scale Datasets

