Hadoop workshop

Page 1: Hadoop workshop

Hadoop workshop
Cloud Connect Shanghai
Sep 15, 2013

Ari Flink – Operations Architect

Mac Fang – Manager, Hadoop development

Dean Zhu – Hadoop Developer

Page 2: Hadoop workshop


Agenda

1. Introductions (5 minutes)

2. Hadoop and Big Data Concepts (20 minutes)

3. Cisco Webex Hadoop architecture (10 minutes)

4. Cisco UCS Hadoop Common Platform Architecture (10 minutes)

5. Exercise 1 (30 minutes) – Configure a Hadoop single-node VM on a laptop

6. Hive and Impala concepts (15 minutes)

7. Exercise 2 (30 minutes) – Analytics using Apache Hive and Cloudera Impala

8. Q & A

2

Page 3: Hadoop workshop


Hadoop and Big Data Overview
– Enterprise data management and big data
– Problems, opportunities and use case examples
– Hadoop architecture concepts

3

Page 4: Hadoop workshop


For our purposes, big data refers to distributed computing architectures specifically aimed at the “3 V’s” of data: Volume, Velocity, and Variety

What is Big Data?

4

Page 5: Hadoop workshop


Traditional Enterprise Data Management

[Diagram: Operational (OLTP) systems -> ETL -> EDW -> BI/Reports]

OLTP: Online Transactional Processing
ETL: Extract, Transform, and Load (batch processing)
EDW: Enterprise Data Warehouse
BI: Business Intelligence

Page 6: Hadoop workshop


Traditional Business Intelligence Questions

Transactional Data (e.g. OLTP)

Real-time, but limited reporting/analytics

• What are the top 5 most active stocks traded in the last hour?

• How many new purchase orders have we received since noon?

Enterprise Data Warehouse

High value, structured, indexed, cleansed

• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?

• What were the top 10 most frequently back-ordered products over the past year?

Page 7: Hadoop workshop


So what has changed?
The Explosion of Unstructured Data

7


• More than 90% is unstructured data

• Approx. 500 quadrillion files

• Quantity doubles every 2 years

• Most unstructured data is neither stored nor analyzed!

1.8 trillion gigabytes of data was created in 2011…

[Chart: GB of data (in billions), structured vs. unstructured, 2005–2015. Source: Cloudera]

Page 8: Hadoop workshop


Enterprise Data Management with Big Data

[Diagram: Operational (OLTP), machine, and web data sources feed ETL into an MPP EDW (with BI/reports, dashboards, and in-memory analytics) and into a big data platform (Hadoop, etc.)]

Page 9: Hadoop workshop


Traditional Business Intelligence Questions

Transactional Data (e.g. OLTP)

Fast data, real-time

• What are the top 5 most active stocks traded in the last hour?

• How many new purchase orders have we received since noon?

Enterprise Data Warehouse

High value, structured, indexed, cleansed

• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?

• What were the top 10 most frequently back-ordered products over the past year?

Big Data

Lower value, semi-structured, multi-source, raw/“dirty”

• Which products do customers click on the most and/or spend the most time browsing without buying?

• How do we optimally set pricing for each product in each store for individual customers every day?

• Did the recent marketing launch generate the expected online buzz, and did that translate to sales?

Page 10: Hadoop workshop


Example: Web and Location Analytics

iPhone searches Amazon for Vizio TVs in Electronics

1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"

Page 11: Hadoop workshop


Big Data and Key Infrastructure Attributes

What big data isn't:
• Usually not blade servers (not enough local storage)
• Usually not virtualized (hypervisor only adds overhead)
• Usually not highly oversubscribed (significant east-west traffic)
• Usually not SAN/NAS ($$$)

Instead: move the compute to the storage, on a low-cost, DAS-based, scale-out clustered filesystem.

Page 12: Hadoop workshop


Cost, Performance, and Capacity

[Chart: data storage capacity vs. cost per TB, by platform]

• Enterprise database: ~$20K/TB – structured data, relational database
• Massive scale-out column store: ~$10K/TB
• Hadoop / NoSQL: ~$300–$1K/TB – unstructured data: machine logs, web click streams, call data records, satellite feeds, GPS data, sensor readings, sales data, blogs, emails, video

HW:SW cost splits range from 70:30 at one end of the spectrum to 30:70 at the other.

12

Page 13: Hadoop workshop


Big Data Software Architectures

13

Page 14: Hadoop workshop


Three basic big data software architectures

MPP Relational Database – scale-out BI/DW:
• Greenplum DB (Pivotal DB)*
• ParAccel*
• Vertica
• Netezza
• Teradata

Batch-oriented Hadoop – heavy lifting, processing:
• Cloudera*
• MapR*
• Intel Hadoop*
• Pivotal HD*

Real-time NoSQL – fast key-value store/retrieve:
• HBase (part of Apache Hadoop)*
• DataStax (Cassandra)*
• Oracle NoSQL*
• Amazon Dynamo

*Cisco Partners

14

Page 15: Hadoop workshop


Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.

Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.

What Is Hadoop?

15

Page 16: Hadoop workshop


Hadoop Components and Operations
Hadoop Distributed File System (HDFS)


• Scalable & fault tolerant
• Filesystem is distributed, stored across all data nodes in the cluster
• Files are divided into multiple large blocks – 64MB default, typically 128MB–512MB
• Data is stored reliably – each block is replicated 3 times by default
• Types of node functions:
  – Name Node – manages HDFS
  – Job Tracker – manages MapReduce jobs
  – Data Node/Task Tracker – stores blocks / does the work

[Diagram: a File split into blocks, distributed across data nodes in three racks behind ToR FEX/switches; the Name Node and Job Tracker run on separate nodes]

16
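To make the block and replication numbers above concrete, here is a minimal sketch using the Hadoop Java FileSystem API: it writes a file with an explicit 128 MB block size and 3x replication. The NameNode URI and path are illustrative assumptions; in practice the defaults come from dfs.blocksize and dfs.replication in the cluster configuration.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: write a file to HDFS with an explicit block size and replication factor.
    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // assumed NameNode
        long blockSize = 128L * 1024 * 1024;   // 128 MB blocks instead of the 64 MB default
        short replication = 3;                 // each block stored on three data nodes
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/events.log"),
                                                true, 4096, replication, blockSize)) {
          out.writeBytes("hello hdfs\n");
        }
        fs.close();
      }
    }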

Page 17: Hadoop workshop


HDFS Architecture

[Diagram: 15 data nodes across three ToR FEX/switches, connected through a switch to the Name Node]

Name Node

/usr/sean/foo.txt: blk_1, blk_2
/usr/jacob/bar.txt: blk_3, blk_4

Data node 1: blk_1
Data node 2: blk_2, blk_3
Data node 3: blk_3

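The file-to-block and block-to-data-node mapping shown above can be queried directly from the Name Node. A minimal sketch using the Hadoop Java FileSystem API, reusing the example path from the slide (the NameNode URI is an assumption):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: ask the Name Node which data nodes hold each block of a file.
    public class HdfsBlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/usr/sean/foo.txt"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.printf("offset=%d length=%d hosts=%s%n",
              block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
        fs.close();
      }
    }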

Page 18: Hadoop workshop


Rack Awareness

Rack Awareness provides Hadoop the optional ability to group nodes together in logical “racks” (i.e. failure domains)

Logical “racks” may or may not correspond to physical data center racks

Distributes blocks across different “racks” to avoid failure domain of a single “rack”

It can also lessen block movement between “racks”

[Diagram: blocks 1–4 each replicated across data nodes in “Rack” 1, “Rack” 2, and “Rack” 3]
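Rack awareness is enabled by telling the Name Node and Job Tracker how to map each node to a logical rack path such as /rack1, usually via a topology script (or an equivalent Java mapping class). A minimal, illustrative sketch only: the property name is the Hadoop 2.x one (older releases used topology.script.file.name), and the script path and subnet-per-rack rule are assumptions, not part of the workshop setup.

    import org.apache.hadoop.conf.Configuration;

    // Illustrative sketch: rack awareness means Hadoop can resolve each node to a rack path.
    public class RackAwarenessSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Point Hadoop at a topology script that prints one rack path per input host (assumed path).
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");

        // The kind of mapping such a script would compute, e.g. third octet = "rack":
        System.out.println(rackFor("10.1.2.15"));   // -> /rack2
        System.out.println(rackFor("10.1.3.20"));   // -> /rack3
      }

      static String rackFor(String ip) {
        String[] octets = ip.split("\\.");
        return "/rack" + octets[2];
      }
    }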

Page 19: Hadoop workshop


MapReduce Example: Word Count

Input:
• "the quick brown fox"
• "the fox ate the mouse"
• "how now brown cow"

Map output (one (word, 1) pair per word):
• the, 1  quick, 1  brown, 1  fox, 1
• the, 1  fox, 1  ate, 1  the, 1  mouse, 1
• how, 1  now, 1  brown, 1  cow, 1

Shuffle & sort groups all pairs for a given word onto one reducer.

Reduce output:
• Reducer 1: brown, 2  fox, 2  how, 1  now, 1  the, 3
• Reducer 2: ate, 1  cow, 1  mouse, 1  quick, 1

Stages: Input, Map, Shuffle & Sort, Reduce, Output

19
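The word-count flow above is what the canonical Hadoop WordCount job implements. A self-contained version against the Hadoop 2.x org.apache.hadoop.mapreduce API (input and output paths come from the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: emit (word, 1) for every word in the input split
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts delivered by the shuffle for each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine map output locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }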

Page 20: Hadoop workshop


MapReduce Architecture

[Diagram: 15 Task Trackers across three ToR FEX/switches, connected through a switch to the Job Tracker]

Job Tracker

Job1: TT1: Mapper1, Mapper2
Job1: TT4: Mapper3, Reducer1
Job2: TT6: Reducer2
Job2: TT7: Mapper1, Mapper3


Page 21: Hadoop workshop


Cisco Webex Cloud and Hadoop Architecture

21

Page 22: Hadoop workshop


Cisco WebEx Collaboration Cloud

Datacenter / PoP

Leased network link

Global Scale: 13 datacenters & iPoPs around the globe

Dedicated network: dual path 10G circuits between DCs

Multi-tenant: 95k sites

Real-time collaboration: voice, desktop sharing, video, chat

Page 23: Hadoop workshop


Things happen ..

Datacenter / PoP

Leased network link

People make mistakes
Hardware fails
Software fails
Even failovers sometimes fail

Page 24: Hadoop workshop


Cisco WebEx log collection overview

[Diagram: Flume collects unstructured/semi-structured data – application state & APIs via Log4j, file, Avro, Syslog, Thrift, AMQP, and HTTP/REST – into an HDFSSink (raw data on HDFS) and a SolrSink (Solr index on SolrCloud); structured data moves between MySQL/RDBMS and Hadoop with Sqoop. Storage: Cisco UCS C240 M3 servers, 12 x 3TB = 36 TB per server]

Page 25: Hadoop workshop


Cisco UCS and Big Data

Building a big data cluster with the UCS Common Platform Architecture (CPA)

CPA Networking
CPA Sizing and Scaling

25

Page 26: Hadoop workshop


The evolution of big data deployments

Experimental use of big data:
• Deployed into IT Ops mandated infrastructures
• “Skunk works”
• Small to medium clusters
• Runs in the general purpose IT data center, on generic x86 IT servers alongside VMware, web, and SAP workloads

Big data with established business value:
• App team mandated infrastructure
• Purpose built for big data
• Performance matters
• Large or small clusters
• Runs in a dedicated “pod” for big data

Page 27: Hadoop workshop


Hadoop Hardware Evolving in the Enterprise

Typical 2009 Hadoop node:
• 1RU server
• 4 x 1TB 3.5” spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache
• $

Economics favor “fat” nodes:
• 6x–9x more data/node
• 3x–6x more IOPS/node
• Saturated gigabit, 10GE on the rise
• Fewer total nodes lowers licensing/support costs
• Increased significance of node and switch failure

Typical 2013 Hadoop node:
• 2RU server
• 12 x 3TB 3.5” or 24 x 1TB 2.5” spindles
• 2 x 8-core CPU
• 1–2 x 10GE
• 128 GB RAM
• Dual PSU
• Running commercial/licensed distribution
• $$$

Page 28: Hadoop workshop


Cisco UCS Common Platform Architecture (CPA)
Building Blocks for Big Data

UCS 6200 Series Fabric Interconnects

Nexus 2232 Fabric Extenders

UCS Manager

UCS C240 M3 Servers

LAN, SAN, Management

Page 29: Hadoop workshop


CPA Network Design for Big Data

29

Page 30: Hadoop workshop


CPA: Topology
Single wire for data and management

• 8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port channel (static pinning)
• 2 x 10GE links per server for all traffic, data and management

Page 31: Hadoop workshop


CPA Recommended FEX Connectivity
2 FEXs and 2 FIs

• The 2232 FEX has 4 buffer groups: ports 1–8, 9–16, 17–24, 25–32
• Distribute servers across port groups to maximize buffer performance and predictably distribute static pinning on uplinks

Page 32: Hadoop workshop


Can Hadoop really push 10GE?

Analytic workloads tend to be lighter on the network

Transform workloads tend to be heavier on the network

Hadoop has numerous parameters which affect network

Take advantage of 10GE CPA by tuning parameters such as the following (see the sketch after this slide):
– mapred.reduce.slowstart.completed.maps
– dfs.balance.bandwidthPerSec
– mapred.reduce.parallel.copies
– mapred.reduce.tasks
– mapred.tasktracker.reduce.tasks.maximum
– mapred.compress.map.output

It can, depending on workload, so tune for it!

32
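A hedged sketch of setting the parameters listed above from Java. The values are illustrative starting points only, and some of these (the balancer bandwidth and the TaskTracker slot count) are normally cluster-side settings in hdfs-site.xml or mapred-site.xml rather than per-job overrides; tune against your own workload.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch with illustrative values only; not a recommended configuration.
    public class ShuffleTuningSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f); // start reducers after 80% of maps finish
        conf.setLong("dfs.balance.bandwidthPerSec", 100L * 1024 * 1024); // let the HDFS balancer use ~100 MB/s
        conf.setInt("mapred.reduce.parallel.copies", 20);                // parallel shuffle fetches per reducer
        conf.setInt("mapred.reduce.tasks", 64);                          // reducers for the job
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 8);       // reduce slots per TaskTracker
        conf.setBoolean("mapred.compress.map.output", true);             // compress intermediate map output
        Job job = Job.getInstance(conf, "network-tuned job");
        // ... set mapper, reducer, and input/output paths as usual, then job.waitForCompletion(true)
      }
    }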

Page 33: Hadoop workshop


CPA Sizing and Scaling for Big Data

33

Page 34: Hadoop workshop


Cisco UCS Reference Configurations for Big Data

Full Rack UCS Solutions Bundle for Hadoop – Capacity:
• 2 x UCS 6296, 2 x Nexus 2232 PP, 16 x C240 M3 (LFF)
• E5-2640 (12 cores), 128GB RAM
• 12 x 3TB 7.2K SATA per server

Full Rack UCS Solutions Bundle for Hadoop/NoSQL – Performance:
• 2 x UCS 6296, 2 x Nexus 2232 PP, 16 x C240 M3 (SFF)
• 2 x E5-2665 (16 cores), 256GB RAM
• 24 x 1TB 7.2K SAS per server

Page 35: Hadoop workshop


Sizing

Start with the current storage requirement:
– Factor in replication (typically 3x) and compression (varies by data set)
– Factor in 20–30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
– Factor in average daily/weekly data ingest rate
– Factor in expected growth rate (i.e. increase in ingest rate over time)

If I/O requirement known, use next table for guidance

Most big data architectures are very linear, so more nodes = more capacity and better performance

Strike a balance between price/performance of individual nodes vs. total # of nodes

Part science, part art

35
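A back-of-the-envelope sketch of the sizing arithmetic described above. Every input number is a made-up example, not workshop data; the 36 TB per node figure matches the 12 x 3TB capacity configuration mentioned earlier.

    // Sketch: rough cluster sizing from current data, ingest rate, replication, and headroom.
    public class HadoopSizingSketch {
      public static void main(String[] args) {
        double currentDataTB = 100;    // data to load today, after compression (example)
        double dailyIngestTB = 0.5;    // average daily ingest, after compression (example)
        double ingestGrowth  = 1.5;    // ingest rate assumed to be 1.5x in year two (example)
        double replication   = 3.0;    // HDFS replication factor
        double tempHeadroom  = 1.25;   // keep 20-30% free for MapReduce temp/spill space

        double year1TB   = 365 * dailyIngestTB;
        double year2TB   = year1TB * ingestGrowth;
        double logicalTB = currentDataTB + year1TB + year2TB;      // data over a 2-year horizon
        double rawTB     = logicalTB * replication * tempHeadroom; // raw HDFS capacity needed

        double perNodeTB = 36;                                     // e.g. 12 x 3TB per C240 M3
        int nodes = (int) Math.ceil(rawTB / perNodeTB);
        System.out.printf("~%.0f TB raw capacity -> about %d data nodes%n", rawTB, nodes);
      }
    }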

Page 36: Hadoop workshop


CPA sizing and application guidelines

Option 1 – best performance (applications: MPP DB, NoSQL):
• Server: 2 x E5-2690, 256 GB memory, 24 x 600GB 10K drives, 2.6 GB/sec IO bandwidth
• Rack-level: 256 cores, 4 TB memory, 225 TB capacity, 41.3 GB/sec IO bandwidth

Option 2 (applications: Hadoop, NoSQL):
• Server: 2 x E5-2665, 256 GB memory, 24 x 1TB 7.2K drives, 2.0 GB/sec IO bandwidth
• Rack-level: 256 cores, 4 TB memory, 384 TB capacity, 31.9 GB/sec IO bandwidth

Option 3 – best price/TB (applications: Hadoop):
• Server: 2 x E5-2640, 128 GB memory, 12 x 3TB 7.2K drives, 1.1 GB/sec IO bandwidth
• Rack-level: 192 cores, 2 TB memory, 576 TB capacity, 16.9 GB/sec IO bandwidth

Page 37: Hadoop workshop


Scaling the CPA

Single Rack 16 servers

Single Domain Up to 10 racks, 160 servers

37

Multiple Domains

L2/L3 Switching

Page 38: Hadoop workshop


Scaling the Common Platform Architecture
Multiple domains based on 16 servers per rack and 2 x 2232 FEXs

Consider intra- and inter-domain bandwidth (all figures per fabric):

• 160 servers per domain (pair of Fabric Interconnects): 16 available north-bound 10GE ports, 2:1 southbound oversubscription, 5:1 northbound oversubscription, 5 Gbit/s intra-domain server-to-server bandwidth, 1 Gbit/s inter-domain server-to-server bandwidth
• 144 servers per domain: 24 north-bound ports, 2:1 southbound, 3:1 northbound, 5 Gbit/s intra-domain, 1.67 Gbit/s inter-domain
• 128 servers per domain: 32 north-bound ports, 2:1 southbound, 2:1 northbound, 5 Gbit/s intra-domain, 2.5 Gbit/s inter-domain

38
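A short sketch of where the oversubscription and bandwidth figures above come from, assuming 16 servers per rack, one 10GE link per server per fabric, and 8 FEX uplinks per fabric (this reproduces the 160-server row).

    // Sketch of the arithmetic behind the 160-server row of the table above.
    public class CpaOversubSketch {
      public static void main(String[] args) {
        int serverGbps = 10, serversPerRack = 16, fexUplinks = 8;
        int serversPerDomain = 160, northboundPorts = 16;

        // FEX: 16 x 10GE server-facing vs. 8 x 10GE uplinks -> 2:1 southbound
        double southbound = (double) serversPerRack / fexUplinks;
        // FI: 10 racks x 8 uplinks = 800 Gbit/s in vs. 16 x 10GE northbound = 160 Gbit/s out -> 5:1
        int racks = serversPerDomain / serversPerRack;
        double northbound = (double) (racks * fexUplinks) / northboundPorts;

        double intraGbps = serverGbps / southbound;                // 5 Gbit/s per fabric
        double interGbps = serverGbps / (southbound * northbound); // 1 Gbit/s per fabric
        System.out.printf("southbound %.0f:1, northbound %.0f:1, intra-domain %.1f Gbit/s, inter-domain %.1f Gbit/s%n",
            southbound, northbound, intraGbps, interGbps);
      }
    }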

Page 39: Hadoop workshop


Multi-Domain CPA Customer Example

39

• 10 Gbits/sec Intra-Domain Server to Server NW Bandwidth

• 5 Gbits/sec Inter-Domain Server to Server NW Bandwidth

• Static pinning from FEX to FI (no port-channel)

Page 40: Hadoop workshop


Recommendations: UCS Domains and Racks

40

Single Domain Recommendation

Turn off or enable at physical rack level

• For simplicity and ease of use, leave Rack Awareness off

• Consider turning it on to limit physical rack level fault domain (e.g. localized failures due to physical data center issues – water, power, cooling, etc.)

Multi Domain Recommendation

Create one Hadoop rack per UCS Domain

• With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack

• Provides HDFS data protection across domains

• Helps minimize cross-domain traffic

Page 41: Hadoop workshop


Exercise 1

Set up a single-node VM cluster on the laptop
– Step 1: copy files from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc.

41

Page 42: Hadoop workshop


Page 43: Hadoop workshop


Hive

An SQL-like interface to Hadoop

Top level Apache project – http://hive.apache.org/

Hive history:
– Created at Facebook to allow people to quickly and easily leverage Hadoop without the effort of writing Java MapReduce
– Currently used at many companies for log processing, business intelligence and analytics

43

Page 44: Hadoop workshop


Hive Components

Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe

Page 45: Hadoop workshop


Data Model

Tables
– Typed columns (int, float, string, boolean)
– Also complex types such as list and map (for JSON-like data)

Partitions
– For example, range-partition tables by date

Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
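A minimal sketch of this data model in DDL, submitted through the HiveServer2 JDBC driver (HiveServer2 appears later in these slides). The host, port, table, and column names are illustrative assumptions, not workshop data.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Sketch: create a date-partitioned, bucketed Hive table over HiveServer2 JDBC.
    public class HiveDataModelExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
          // One partition (HDFS subdirectory) per day; buckets help sampling and join optimization
          stmt.execute("CREATE TABLE IF NOT EXISTS page_views ("
              + " user_id INT, url STRING, referrer STRING, props MAP<STRING,STRING>)"
              + " PARTITIONED BY (dt STRING)"
              + " CLUSTERED BY (user_id) INTO 32 BUCKETS"
              + " STORED AS RCFILE");
        }
      }
    }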

Page 46: Hadoop workshop


Hive

46

DBMS vs. Hive:

• Language: DBMS – SQL-92 standard; Hive – subset of SQL-92 plus Hive extensions
• Updates: DBMS – INSERT, UPDATE, DELETE; Hive – INSERT OVERWRITE, no UPDATE or DELETE
• Transactions: DBMS – yes; Hive – no
• Latency: DBMS – sub-second; Hive – minutes to hours
• Indexes: DBMS – any number of indexes, important to performance; Hive – no indexes, data is always scanned in parallel
• Dataset size: DBMS – TBs; Hive – PBs

Page 47: Hadoop workshop


Metastore

Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and other relational databases

Source: cc-licensed slide by Cloudera

Page 48: Hadoop workshop


Hive components

Source: cc-licensed slide by Cloudera

[Diagram: Hive reads data from the Hadoop cluster through an InputFormat and SerDe, using schema information from the Hive MetaStore]

Page 49: Hadoop workshop


Hive MetaStore

[Diagram: the MetaStore, backed by an RDBMS, is shared by Impala, HCatalog/Pig, HiveServer2, the Hive CLI, and the Beeline CLI]

Page 50: Hadoop workshop


Hive Physical Layout

Warehouse directory in HDFS
– E.g., /user/hive/warehouse

Tables stored in subdirectories of the warehouse
– Partitions form subdirectories of tables

Actual data stored in HDFS files
– E.g. text, SequenceFile, RCFile, Avro
– Arbitrary formats with a custom SerDe

Page 51: Hadoop workshop


External and Hive managed tables

Hive managed tables
– Data is moved to a location under /user/hive/warehouse
– Can be stored in a more efficient format than text, e.g. RCFile
– If you drop the table, the raw data is lost

hive> CREATE TABLE test(id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

External tables
– Can overlay multiple tables all pointing to the same raw data
– To create an external table, simply point to the location of the data while creating the table

hive> CREATE EXTERNAL TABLE test (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/home/test/data';

Page 52: Hadoop workshop


Hive: Example

Hive looks similar to an SQL database. Relational join on two tables:

– Table of word counts from the Shakespeare collection
– Table of word counts from the Bible

SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 5;

the   25848   62394
I     23031   8854
and   19671   38985
to    18038   13526
of    16700   34654

Page 53: Hadoop workshop


Impala

Page 54: Hadoop workshop


Impala

General purpose MPP SQL query engine for Hadoop
– Query latency from milliseconds to hours; interactive data exploration
– Runs on the existing Hadoop cluster, on existing HDFS files and hardware

High performance
– Written in C++
– Direct access to HDFS and HBase data, no MapReduce

Unified platform
– Uses existing Hive metadata and query language (HiveQL)
– Submit queries via ODBC or the Thrift API

Performance
– Disk throughput limited by hardware to ~100MB/sec
– 3x–90x faster than Hive, depending on the type of query
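Because Impala speaks the HiveServer2 protocol, queries can also be submitted from Java over JDBC in addition to the ODBC and Thrift APIs mentioned above. A hedged sketch; the host, the 21050 port, the auth=noSasl option, and the table queried are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch: submit a query to an impalad through the Hive JDBC driver.
    public class ImpalaQueryExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://impalad-host:21050/default;auth=noSasl");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url ORDER BY hits DESC LIMIT 10")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }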

Page 55: Hadoop workshop


Impala Details

55

[Diagram: each data node runs an impalad with a query planner, query coordinator, and query exec engine, reading the local HDFS DataNode and HBase directly; a SQL application connects via ODBC, with the Hive Metastore, HDFS NameNode, and statestored providing the HiveQL interface and unified metadata]

Page 56: Hadoop workshop


Impala Details

56


The impalad daemons keep contact with the statestored to update their state and to receive metadata for query planning.

Page 57: Hadoop workshop


Impala Details

57


The query coordinator initiates execution on the remote impalad daemons.

Page 58: Hadoop workshop


Impala Details

58


Intermediate results are streamed between the impalad daemons, and query results are streamed back to the client.

Page 59: Hadoop workshop


Exercise 2

Analytics with Hive and Impala
– Step 1: copy test dataset from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc.

59

Page 60: Hadoop workshop
