Hadoop workshop

Page 1: Hadoop workshop

Hadoop workshop
Cloud Connect Shanghai
Sep 15, 2013

Ari Flink – Operations Architect

Mac Fang – Manager, Hadoop development

Dean Zhu – Hadoop Developer

Page 2: Hadoop workshop


Agenda

1. Introductions (5 minutes)

2. Hadoop and Big Data Concepts (20 minutes)

3. Cisco Webex Hadoop architecture (10 minutes)

4. Cisco UCS Hadoop Common Platform Architecture (10 minutes)

5. Exercise 1 (30 minutes) – Configure a Hadoop single-node VM on a laptop

6. Hive and Impala concepts (15 minutes)

7. Exercise 2 (30 minutes) – Analytics using Apache Hive and Cloudera Impala

8. Q & A

2

Page 3: Hadoop workshop


Hadoop and Big Data Overview
– Enterprise data management and big data
– Problems, opportunities and use case examples
– Hadoop architecture concepts

3

Page 4: Hadoop workshop


For our purposes, big data refers to distributed computing architectures specifically aimed at the “3 V’s” of data: Volume, Velocity, and Variety

What is Big Data?

4

Page 5: Hadoop workshop


Traditional Enterprise Data Management

[Diagram: Operational (OLTP) systems -> ETL -> EDW -> BI/Reports]

OLTP: Online Transactional Processing
ETL: Extract, Transform, and Load (batch processing)
EDW: Enterprise Data Warehouse
BI: Business Intelligence

Page 6: Hadoop workshop


Traditional Business Intelligence Questions

Transactional Data (e.g. OLTP)

Real-time, but limited reporting/analytics

• What are the top 5 most active stocks traded in the last hour?

• How many new purchase orders have we received since noon?

Enterprise Data Warehouse

High value, structured, indexed, cleansed

• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?

• What were the top 10 most frequently back-ordered products over the past year?

Page 7: Hadoop workshop


So what has changed?
The Explosion of Unstructured Data

7


• More than 90% is unstructured data

• Approx. 500 quadrillion files

• Quantity doubles every 2 years

• Most unstructured data is neither stored nor analyzed!

1.8 trillion gigabytes of data was created in 2011…

[Chart: GB of data (in billions), structured vs. unstructured, 2005–2015. Source: Cloudera]

Page 8: Hadoop workshop


Enterprise Data Management with Big Data

[Diagram: Operational (OLTP), machine, and web data sources feed ETL into an MPP EDW (with BI/reports, dashboards, and in-memory analytics) and into a big data platform (Hadoop, etc.)]

Page 9: Hadoop workshop


Traditional Business Intelligence Questions

Transactional Data (e.g. OLTP)

Fast data, real-time

• What are the top 5 most active stocks traded in the last hour?

• How many new purchase orders have we received since noon?

Enterprise Data Warehouse

High value, structured, indexed, cleansed

• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?

• What were the top 10 most frequently back-ordered products over the past year?

Big Data

Lower value, semi-structured, multi-source, raw/“dirty”

• Which products do customers click on the most and/or spend the most time browsing without buying?

• How do we optimally set pricing for each product in each store for individual customers every day?

• Did the recent marketing launch generate the expected online buzz, and did that translate to sales?

Page 10: Hadoop workshop


Example: Web and Location Analytics

iPhone searches Amazon for Vizio TVs in Electronics

1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"

Page 11: Hadoop workshop


Big Data and Key Infrastructure Attributes

What big data isn't:
• Usually not blade servers (not enough local storage)
• Usually not virtualized (hypervisor only adds overhead)
• Usually not highly oversubscribed (significant east-west traffic)
• Usually not SAN/NAS ($$$)

Instead: move the compute to the storage, on a low-cost, DAS-based, scale-out clustered filesystem.

Page 12: Hadoop workshop


Cost, Performance, and Capacity

[Chart: data storage capacity vs. cost per TB, by platform]

• Enterprise database: ~$20K/TB – structured data, relational database
• Massive scale-out column store: ~$10K/TB
• Hadoop / NoSQL: ~$300–$1K/TB – unstructured data: machine logs, web click streams, call data records, satellite feeds, GPS data, sensor readings, sales data, blogs, emails, video

HW:SW cost splits range from 70:30 at one end of the spectrum to 30:70 at the other.

12

Page 13: Hadoop workshop


Big Data Software Architectures

13

Page 14: Hadoop workshop


Three basic big data software architectures

MPP Relational Database – scale-out BI/DW:
• Greenplum DB (Pivotal DB)*
• ParAccel*
• Vertica
• Netezza
• Teradata

Batch-oriented Hadoop – heavy lifting, processing:
• Cloudera*
• MapR*
• Intel Hadoop*
• Pivotal HD*

Real-time NoSQL – fast key-value store/retrieve:
• HBase (part of Apache Hadoop)*
• DataStax (Cassandra)*
• Oracle NoSQL*
• Amazon Dynamo

*Cisco Partners

14

Page 15: Hadoop workshop


Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.

Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.

What Is Hadoop?

15

Page 16: Hadoop workshop


Hadoop Components and Operations
Hadoop Distributed File System (HDFS)


• Scalable & fault tolerant
• Filesystem is distributed, stored across all data nodes in the cluster
• Files are divided into multiple large blocks – 64MB default, typically 128MB–512MB
• Data is stored reliably – each block is replicated 3 times by default
• Types of node functions:
  – Name Node – manages HDFS
  – Job Tracker – manages MapReduce jobs
  – Data Node/Task Tracker – stores blocks / does the work

[Diagram: a File split into blocks, distributed across data nodes in three racks behind ToR FEX/switches; the Name Node and Job Tracker run on separate nodes]

16
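To make the block and replication numbers above concrete, here is a minimal sketch using the Hadoop Java FileSystem API: it writes a file with an explicit 128 MB block size and 3x replication. The NameNode URI and path are illustrative assumptions; in practice the defaults come from dfs.blocksize and dfs.replication in the cluster configuration.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: write a file to HDFS with an explicit block size and replication factor.
    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // assumed NameNode
        long blockSize = 128L * 1024 * 1024;   // 128 MB blocks instead of the 64 MB default
        short replication = 3;                 // each block stored on three data nodes
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/events.log"),
                                                true, 4096, replication, blockSize)) {
          out.writeBytes("hello hdfs\n");
        }
        fs.close();
      }
    }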

Page 17: Hadoop workshop


HDFS Architecture

[Diagram: 15 data nodes across three ToR FEX/switches, connected through a switch to the Name Node]

Name Node

/usr/sean/foo.txt: blk_1, blk_2
/usr/jacob/bar.txt: blk_3, blk_4

Data node 1: blk_1
Data node 2: blk_2, blk_3
Data node 3: blk_3

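The file-to-block and block-to-data-node mapping shown above can be queried directly from the Name Node. A minimal sketch using the Hadoop Java FileSystem API, reusing the example path from the slide (the NameNode URI is an assumption):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: ask the Name Node which data nodes hold each block of a file.
    public class HdfsBlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/usr/sean/foo.txt"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.printf("offset=%d length=%d hosts=%s%n",
              block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
        fs.close();
      }
    }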

Page 18: Hadoop workshop


Rack Awareness

Rack Awareness provides Hadoop the optional ability to group nodes together in logical “racks” (i.e. failure domains)

Logical “racks” may or may not correspond to physical data center racks

Distributes blocks across different “racks” to avoid failure domain of a single “rack”

It can also lessen block movement between “racks”

[Diagram: blocks 1–4 each replicated across data nodes in “Rack” 1, “Rack” 2, and “Rack” 3]
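Rack awareness is enabled by telling the Name Node and Job Tracker how to map each node to a logical rack path such as /rack1, usually via a topology script (or an equivalent Java mapping class). A minimal, illustrative sketch only: the property name is the Hadoop 2.x one (older releases used topology.script.file.name), and the script path and subnet-per-rack rule are assumptions, not part of the workshop setup.

    import org.apache.hadoop.conf.Configuration;

    // Illustrative sketch: rack awareness means Hadoop can resolve each node to a rack path.
    public class RackAwarenessSketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Point Hadoop at a topology script that prints one rack path per input host (assumed path).
        conf.set("net.topology.script.file.name", "/etc/hadoop/conf/topology.sh");

        // The kind of mapping such a script would compute, e.g. third octet = "rack":
        System.out.println(rackFor("10.1.2.15"));   // -> /rack2
        System.out.println(rackFor("10.1.3.20"));   // -> /rack3
      }

      static String rackFor(String ip) {
        String[] octets = ip.split("\\.");
        return "/rack" + octets[2];
      }
    }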

Page 19: Hadoop workshop


MapReduce Example: Word Count

Input:
• "the quick brown fox"
• "the fox ate the mouse"
• "how now brown cow"

Map output (one (word, 1) pair per word):
• the, 1  quick, 1  brown, 1  fox, 1
• the, 1  fox, 1  ate, 1  the, 1  mouse, 1
• how, 1  now, 1  brown, 1  cow, 1

Shuffle & sort groups all pairs for a given word onto one reducer.

Reduce output:
• Reducer 1: brown, 2  fox, 2  how, 1  now, 1  the, 3
• Reducer 2: ate, 1  cow, 1  mouse, 1  quick, 1

Stages: Input, Map, Shuffle & Sort, Reduce, Output

19
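The word-count flow above is what the canonical Hadoop WordCount job implements. A self-contained version against the Hadoop 2.x org.apache.hadoop.mapreduce API (input and output paths come from the command line):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map: emit (word, 1) for every word in the input split
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the counts delivered by the shuffle for each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combine map output locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }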

Page 20: Hadoop workshop


MapReduce Architecture

[Diagram: 15 Task Trackers across three ToR FEX/switches, connected through a switch to the Job Tracker]

Job Tracker

Job1: TT1: Mapper1, Mapper2
Job1: TT4: Mapper3, Reducer1
Job2: TT6: Reducer2
Job2: TT7: Mapper1, Mapper3


Page 21: Hadoop workshop


Cisco Webex Cloud and Hadoop Architecture

21

Page 22: Hadoop workshop


Cisco WebEx Collaboration Cloud

Datacenter / PoP

Leased network link

Global Scale: 13 datacenters & iPoPs around the globe

Dedicated network: dual path 10G circuits between DCs

Multi-tenant: 95k sites

Real-time collaboration: voice, desktop sharing, video, chat

Page 23: Hadoop workshop


Things happen ..

Datacenter / PoP

Leased network link

People make mistakes
Hardware fails
Software fails
Even failovers sometimes fail

Page 24: Hadoop workshop


Cisco WebEx log collection overview

[Diagram: Flume collects unstructured/semi-structured data – application state & APIs via Log4j, file, Avro, Syslog, Thrift, AMQP, and HTTP/REST – into an HDFSSink (raw data on HDFS) and a SolrSink (Solr index on SolrCloud); structured data moves between MySQL/RDBMS and Hadoop with Sqoop. Storage: Cisco UCS C240 M3 servers, 12 x 3TB = 36 TB per server]

Page 25: Hadoop workshop


Cisco UCS and Big Data

Building a big data cluster with the UCS Common Platform Architecture (CPA)

CPA Networking
CPA Sizing and Scaling

25

Page 26: Hadoop workshop


The evolution of big data deployments

Experimental use of big data:
• Deployed into IT Ops mandated infrastructures
• “Skunk works”
• Small to medium clusters
• Runs in the general purpose IT data center, on generic x86 IT servers alongside VMware, web, and SAP workloads

Big data with established business value:
• App team mandated infrastructure
• Purpose built for big data
• Performance matters
• Large or small clusters
• Runs in a dedicated “pod” for big data

Page 27: Hadoop workshop


Hadoop Hardware Evolving in the Enterprise

Typical 2009 Hadoop node:
• 1RU server
• 4 x 1TB 3.5” spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache
• $

Economics favor “fat” nodes:
• 6x–9x more data/node
• 3x–6x more IOPS/node
• Saturated gigabit, 10GE on the rise
• Fewer total nodes lowers licensing/support costs
• Increased significance of node and switch failure

Typical 2013 Hadoop node:
• 2RU server
• 12 x 3TB 3.5” or 24 x 1TB 2.5” spindles
• 2 x 8-core CPU
• 1–2 x 10GE
• 128 GB RAM
• Dual PSU
• Running commercial/licensed distribution
• $$$

Page 28: Hadoop workshop


Cisco UCS Common Platform Architecture (CPA)
Building Blocks for Big Data

UCS 6200 Series Fabric Interconnects

Nexus 2232 Fabric Extenders

UCS Manager

UCS C240 M3 Servers

LAN, SAN, Management

Page 29: Hadoop workshop


CPA Network Design for Big Data

29

Page 30: Hadoop workshop


CPA: Topology
Single wire for data and management

• 8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port channel (static pinning)
• 2 x 10GE links per server for all traffic, data and management

Page 31: Hadoop workshop


CPA Recommended FEX Connectivity
2 FEXs and 2 FIs

• The 2232 FEX has 4 buffer groups: ports 1–8, 9–16, 17–24, 25–32
• Distribute servers across port groups to maximize buffer performance and predictably distribute static pinning on uplinks

Page 32: Hadoop workshop


Can Hadoop really push 10GE?

Analytic workloads tend to be lighter on the network

Transform workloads tend to be heavier on the network

Hadoop has numerous parameters which affect network

Take advantage of 10GE CPA by tuning parameters such as the following (see the sketch after this slide):
– mapred.reduce.slowstart.completed.maps
– dfs.balance.bandwidthPerSec
– mapred.reduce.parallel.copies
– mapred.reduce.tasks
– mapred.tasktracker.reduce.tasks.maximum
– mapred.compress.map.output

It can, depending on workload, so tune for it!

32
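A hedged sketch of setting the parameters listed above from Java. The values are illustrative starting points only, and some of these (the balancer bandwidth and the TaskTracker slot count) are normally cluster-side settings in hdfs-site.xml or mapred-site.xml rather than per-job overrides; tune against your own workload.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch with illustrative values only; not a recommended configuration.
    public class ShuffleTuningSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f); // start reducers after 80% of maps finish
        conf.setLong("dfs.balance.bandwidthPerSec", 100L * 1024 * 1024); // let the HDFS balancer use ~100 MB/s
        conf.setInt("mapred.reduce.parallel.copies", 20);                // parallel shuffle fetches per reducer
        conf.setInt("mapred.reduce.tasks", 64);                          // reducers for the job
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 8);       // reduce slots per TaskTracker
        conf.setBoolean("mapred.compress.map.output", true);             // compress intermediate map output
        Job job = Job.getInstance(conf, "network-tuned job");
        // ... set mapper, reducer, and input/output paths as usual, then job.waitForCompletion(true)
      }
    }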

Page 33: Hadoop workshop


CPA Sizing and Scaling for Big Data

33

Page 34: Hadoop workshop


Cisco UCS Reference Configurations for Big Data

Full Rack UCS Solutions Bundle for Hadoop – Capacity:
• 2 x UCS 6296, 2 x Nexus 2232 PP, 16 x C240 M3 (LFF)
• E5-2640 (12 cores), 128GB RAM
• 12 x 3TB 7.2K SATA per server

Full Rack UCS Solutions Bundle for Hadoop/NoSQL – Performance:
• 2 x UCS 6296, 2 x Nexus 2232 PP, 16 x C240 M3 (SFF)
• 2 x E5-2665 (16 cores), 256GB RAM
• 24 x 1TB 7.2K SAS per server

Page 35: Hadoop workshop


Sizing

Start with the current storage requirement:
– Factor in replication (typically 3x) and compression (varies by data set)
– Factor in 20–30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
– Factor in average daily/weekly data ingest rate
– Factor in expected growth rate (i.e. increase in ingest rate over time)

If I/O requirement known, use next table for guidance

Most big data architectures are very linear, so more nodes = more capacity and better performance

Strike a balance between price/performance of individual nodes vs. total # of nodes

Part science, part art

35
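A back-of-the-envelope sketch of the sizing arithmetic described above. Every input number is a made-up example, not workshop data; the 36 TB per node figure matches the 12 x 3TB capacity configuration mentioned earlier.

    // Sketch: rough cluster sizing from current data, ingest rate, replication, and headroom.
    public class HadoopSizingSketch {
      public static void main(String[] args) {
        double currentDataTB = 100;    // data to load today, after compression (example)
        double dailyIngestTB = 0.5;    // average daily ingest, after compression (example)
        double ingestGrowth  = 1.5;    // ingest rate assumed to be 1.5x in year two (example)
        double replication   = 3.0;    // HDFS replication factor
        double tempHeadroom  = 1.25;   // keep 20-30% free for MapReduce temp/spill space

        double year1TB   = 365 * dailyIngestTB;
        double year2TB   = year1TB * ingestGrowth;
        double logicalTB = currentDataTB + year1TB + year2TB;      // data over a 2-year horizon
        double rawTB     = logicalTB * replication * tempHeadroom; // raw HDFS capacity needed

        double perNodeTB = 36;                                     // e.g. 12 x 3TB per C240 M3
        int nodes = (int) Math.ceil(rawTB / perNodeTB);
        System.out.printf("~%.0f TB raw capacity -> about %d data nodes%n", rawTB, nodes);
      }
    }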

Page 36: Hadoop workshop


CPA sizing and application guidelines

Option 1 – best performance (applications: MPP DB, NoSQL):
• Server: 2 x E5-2690, 256 GB memory, 24 x 600GB 10K drives, 2.6 GB/sec IO bandwidth
• Rack-level: 256 cores, 4 TB memory, 225 TB capacity, 41.3 GB/sec IO bandwidth

Option 2 (applications: Hadoop, NoSQL):
• Server: 2 x E5-2665, 256 GB memory, 24 x 1TB 7.2K drives, 2.0 GB/sec IO bandwidth
• Rack-level: 256 cores, 4 TB memory, 384 TB capacity, 31.9 GB/sec IO bandwidth

Option 3 – best price/TB (applications: Hadoop):
• Server: 2 x E5-2640, 128 GB memory, 12 x 3TB 7.2K drives, 1.1 GB/sec IO bandwidth
• Rack-level: 192 cores, 2 TB memory, 576 TB capacity, 16.9 GB/sec IO bandwidth

Page 37: Hadoop workshop


Scaling the CPA

Single Rack 16 servers

Single Domain Up to 10 racks, 160 servers

37

Multiple Domains

L2/L3 Switching

Page 38: Hadoop workshop


Scaling the Common Platform Architecture
Multiple domains based on 16 servers per rack and 2 x 2232 FEXs

Consider intra- and inter-domain bandwidth (all figures per fabric):

• 160 servers per domain (pair of Fabric Interconnects): 16 available north-bound 10GE ports, 2:1 southbound oversubscription, 5:1 northbound oversubscription, 5 Gbit/s intra-domain server-to-server bandwidth, 1 Gbit/s inter-domain server-to-server bandwidth
• 144 servers per domain: 24 north-bound ports, 2:1 southbound, 3:1 northbound, 5 Gbit/s intra-domain, 1.67 Gbit/s inter-domain
• 128 servers per domain: 32 north-bound ports, 2:1 southbound, 2:1 northbound, 5 Gbit/s intra-domain, 2.5 Gbit/s inter-domain

38
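A short sketch of where the oversubscription and bandwidth figures above come from, assuming 16 servers per rack, one 10GE link per server per fabric, and 8 FEX uplinks per fabric (this reproduces the 160-server row).

    // Sketch of the arithmetic behind the 160-server row of the table above.
    public class CpaOversubSketch {
      public static void main(String[] args) {
        int serverGbps = 10, serversPerRack = 16, fexUplinks = 8;
        int serversPerDomain = 160, northboundPorts = 16;

        // FEX: 16 x 10GE server-facing vs. 8 x 10GE uplinks -> 2:1 southbound
        double southbound = (double) serversPerRack / fexUplinks;
        // FI: 10 racks x 8 uplinks = 800 Gbit/s in vs. 16 x 10GE northbound = 160 Gbit/s out -> 5:1
        int racks = serversPerDomain / serversPerRack;
        double northbound = (double) (racks * fexUplinks) / northboundPorts;

        double intraGbps = serverGbps / southbound;                // 5 Gbit/s per fabric
        double interGbps = serverGbps / (southbound * northbound); // 1 Gbit/s per fabric
        System.out.printf("southbound %.0f:1, northbound %.0f:1, intra-domain %.1f Gbit/s, inter-domain %.1f Gbit/s%n",
            southbound, northbound, intraGbps, interGbps);
      }
    }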

Page 39: Hadoop workshop


Multi-Domain CPA Customer Example

39

• 10 Gbits/sec Intra-Domain Server to Server NW Bandwidth

• 5 Gbits/sec Inter-Domain Server to Server NW Bandwidth

• Static pinning from FEX to FI (no port-channel)

Page 40: Hadoop workshop


Recommendations: UCS Domains and Racks

40

Single Domain Recommendation

Turn off or enable at physical rack level

• For simplicity and ease of use, leave Rack Awareness off

• Consider turning it on to limit physical rack level fault domain (e.g. localized failures due to physical data center issues – water, power, cooling, etc.)

Multi Domain Recommendation

Create one Hadoop rack per UCS Domain

• With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack

• Provides HDFS data protection across domains

• Helps minimize cross-domain traffic

Page 41: Hadoop workshop


Exercise 1

Set up a single-node VM cluster on the laptop
– Step 1: copy files from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc.

41

Page 42: Hadoop workshop


Page 43: Hadoop workshop


Hive

An SQL-like interface to Hadoop

Top level Apache project – http://hive.apache.org/

Hive history:
– Created at Facebook to allow people to quickly and easily leverage Hadoop without the effort of writing Java MapReduce
– Currently used at many companies for log processing, business intelligence and analytics

43

Page 44: Hadoop workshop


Hive Components

Shell: allows interactive queries
Driver: session handles, fetch, execute
Compiler: parse, plan, optimize
Execution engine: DAG of stages (MR, HDFS, metadata)
Metastore: schema, location in HDFS, SerDe

Page 45: Hadoop workshop


Data Model

Tables
– Typed columns (int, float, string, boolean)
– Also complex types such as list and map (for JSON-like data)

Partitions
– For example, range-partition tables by date

Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
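A minimal sketch of this data model in DDL, submitted through the HiveServer2 JDBC driver (HiveServer2 appears later in these slides). The host, port, table, and column names are illustrative assumptions, not workshop data.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Sketch: create a date-partitioned, bucketed Hive table over HiveServer2 JDBC.
    public class HiveDataModelExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
          // One partition (HDFS subdirectory) per day; buckets help sampling and join optimization
          stmt.execute("CREATE TABLE IF NOT EXISTS page_views ("
              + " user_id INT, url STRING, referrer STRING, props MAP<STRING,STRING>)"
              + " PARTITIONED BY (dt STRING)"
              + " CLUSTERED BY (user_id) INTO 32 BUCKETS"
              + " STORED AS RCFILE");
        }
      }
    }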

Page 46: Hadoop workshop


Hive

46

DBMS vs. Hive:

• Language: DBMS – SQL-92 standard; Hive – subset of SQL-92 plus Hive extensions
• Updates: DBMS – INSERT, UPDATE, DELETE; Hive – INSERT OVERWRITE, no UPDATE or DELETE
• Transactions: DBMS – yes; Hive – no
• Latency: DBMS – sub-second; Hive – minutes to hours
• Indexes: DBMS – any number of indexes, important to performance; Hive – no indexes, data is always scanned in parallel
• Dataset size: DBMS – TBs; Hive – PBs

Page 47: Hadoop workshop


Metastore

Database: namespace containing a set of tables
Holds table definitions (column types, physical layout)
Holds partitioning information
Can be stored in Derby, MySQL, and other relational databases

Source: cc-licensed slide by Cloudera

Page 48: Hadoop workshop


Hive components

Source: cc-licensed slide by Cloudera

[Diagram: Hive reads data from the Hadoop cluster through an InputFormat and SerDe, using schema information from the Hive MetaStore]

Page 49: Hadoop workshop


Hive MetaStore

[Diagram: the MetaStore, backed by an RDBMS, is shared by Impala, HCatalog/Pig, HiveServer2, the Hive CLI, and the Beeline CLI]

Page 50: Hadoop workshop


Hive Physical Layout

Warehouse directory in HDFS
– E.g., /user/hive/warehouse

Tables stored in subdirectories of the warehouse
– Partitions form subdirectories of tables

Actual data stored in HDFS files
– E.g. text, SequenceFile, RCFile, Avro
– Arbitrary formats with a custom SerDe

Page 51: Hadoop workshop


External and Hive managed tables

Hive managed tables
– Data is moved to a location under /user/hive/warehouse
– Can be stored in a more efficient format than text, e.g. RCFile
– If you drop the table, the raw data is lost

hive> CREATE TABLE test(id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;

External tables
– Can overlay multiple tables all pointing to the same raw data
– To create an external table, simply point to the location of the data while creating the table

hive> CREATE EXTERNAL TABLE test (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/home/test/data';

Page 52: Hadoop workshop


Hive: Example

Hive looks similar to an SQL database. Relational join on two tables:

– Table of word counts from the Shakespeare collection
– Table of word counts from the Bible

SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 5;

the   25848   62394
I     23031   8854
and   19671   38985
to    18038   13526
of    16700   34654

Page 53: Hadoop workshop


Impala

Page 54: Hadoop workshop


Impala

General purpose MPP SQL query engine for Hadoop
– Query latency from milliseconds to hours; interactive data exploration
– Runs on the existing Hadoop cluster, on existing HDFS files and hardware

High performance
– Written in C++
– Direct access to HDFS and HBase data, no MapReduce

Unified platform
– Uses existing Hive metadata and query language (HiveQL)
– Submit queries via ODBC or the Thrift API

Performance
– Disk throughput limited by hardware to ~100MB/sec
– 3x–90x faster than Hive, depending on the type of query
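Because Impala speaks the HiveServer2 protocol, queries can also be submitted from Java over JDBC in addition to the ODBC and Thrift APIs mentioned above. A hedged sketch; the host, the 21050 port, the auth=noSasl option, and the table queried are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch: submit a query to an impalad through the Hive JDBC driver.
    public class ImpalaQueryExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://impalad-host:21050/default;auth=noSasl");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url ORDER BY hits DESC LIMIT 10")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }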

Page 55: Hadoop workshop


Impala Details

55

[Diagram: each data node runs an impalad with a query planner, query coordinator, and query exec engine, reading the local HDFS DataNode and HBase directly; a SQL application connects via ODBC, with the Hive Metastore, HDFS NameNode, and statestored providing the HiveQL interface and unified metadata]

Page 56: Hadoop workshop


Impala Details

56


The impalad daemons keep contact with the statestored to update their state and to receive metadata for query planning.

Page 57: Hadoop workshop


Impala Details

57


The query coordinator initiates execution on the remote impalad daemons.

Page 58: Hadoop workshop


Impala Details

58


Intermediate results are streamed between the impalad daemons, and query results are streamed back to the client.

Page 59: Hadoop workshop


Exercise 2

Analytics with Hive and Impala
– Step 1: copy test dataset from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc.

59

Page 60: Hadoop workshop
