Hadoop workshop
Cloud Connect Shanghai
Sep 15, 2013
Ari Flink – Operations Architect
Mac Fang – Manager, Hadoop development
Dean Zhu – Hadoop Developer
© 2013 Cisco and/or its affiliates. All rights reserved. Cloud Connect 2013 Shanghai
Agenda
1. Introductions (5 minutes)
2. Hadoop and Big Data Concepts (20 minutes)
3. Cisco WebEx Hadoop architecture (10 minutes)
4. Cisco UCS Hadoop Common Platform Architecture (10 minutes)
5. Exercise 1 (30 minutes): Configure a single-node Hadoop VM on a laptop
6. Hive and Impala concepts (15 minutes)
7. Exercise 2 (30 minutes): Analytics using Apache Hive and Cloudera Impala
8. Q & A
Hadoop and Big Data Overview
– Enterprise data management and big data
– Problems, opportunities and use case examples
– Hadoop architecture concepts
What is Big Data?
For our purposes, big data refers to distributed computing architectures specifically aimed at the “3 V’s” of data: Volume, Velocity, and Variety.
Traditional Enterprise Data Management
Operational (OLTP) → ETL → EDW → BI/Reports
• OLTP: Online Transactional Processing
• ETL: Extract, Transform, and Load (batch processing)
• EDW: Enterprise Data Warehouse
• BI: Business Intelligence
Traditional Business Intelligence Questions
Transactional Data (e.g. OLTP)
Real-time, but limited reporting/analytics
• What are the top 5 most active stocks traded in the last hour?
• How many new purchase orders have we received since noon?
Enterprise Data Warehouse
High value, structured, indexed, cleansed
• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?
• What were the top 10 most frequently back-ordered products over the past year?
So what has changed? The Explosion of Unstructured Data
• 1.8 trillion gigabytes of data was created in 2011
• More than 90% is unstructured data
• Approx. 500 quadrillion files
• Quantity doubles every 2 years
• Most unstructured data is neither stored nor analyzed!
[Chart: GB of data (in billions), scale 0 to 10,000, for 2005, 2010, and 2015; structured data vs. unstructured data]
Source: Cloudera
Enterprise Data Management with Big Data
[Diagram labels: data sources (Machine, Web, Operational OLTP); ETL; Big Data (Hadoop, etc.); MPP EDW; BI/Reports; Dashboards; In-memory analytics]
Traditional Business Intelligence Questions
Transactional Data (e.g. OLTP)
Fast data, real-time
• What are the top 5 most active stocks traded in the last hour?
• How many new purchase orders have we received since noon?
Enterprise Data Warehouse
High value, structured, indexed, cleansed
• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?
• What were the top 10 most frequently back-ordered products over the past year?
Big Data
Lower value, semi-structured, multi-source, raw/“dirty”
• Which products do customers click on the most and/or spend the most time browsing without buying?
• How do we optimally set pricing for each product in each store for individual customers every day?
• Did the recent marketing launch generate the expected online buzz, and did that translate to sales?
Example: Web and Location Analytics
An iPhone searches Amazon for Vizio TVs in Electronics
1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"
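A minimal sketch of how such a log line could be turned into fields for analytics. The field meanings (Unix timestamp, client IP, cache result/HTTP status, bytes, method, URL, quoted user agent) are an assumption read off the sample line, not a documented format, and the URL is kept truncated exactly as on the slide.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlsplit, parse_qs

# Sample line copied from the slide; the URL is truncated there with "…".
LINE = (
    '1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET '
    'http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… '
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) '
    'AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 '
    'Mobile/9A405 Safari/7534.48.3"'
)

# Assumed field layout: timestamp, client, result/status, bytes, method, URL, "user agent".
PATTERN = re.compile(
    r'(?P<ts>\d+\.\d+) (?P<client>\S+) (?P<result>[A-Z_]+)/(?P<status>\d{3}) '
    r'(?P<size>\d+) (?P<method>[A-Z]+) (?P<url>\S+) "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Extract the fields a web/location analytics job might care about."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError("unrecognized log line")
    url = urlsplit(m["url"])
    query = parse_qs(url.query)
    return {
        "when": datetime.fromtimestamp(float(m["ts"]), tz=timezone.utc),
        "client": m["client"],
        "status": int(m["status"]),
        "host": url.hostname,
        "search": query.get("k", [""])[0],   # search keywords, "+" decoded to spaces
        "is_iphone": "iPhone" in m["agent"],
    }

record = parse_line(LINE)
print(record)
```

In a real pipeline this kind of parsing would typically run inside the map step, one log line at a time.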
Big Data and Key Infrastructure Attributes
(What big data isn’t)
• Usually not blade servers (not enough local storage)
• Usually not virtualized (hypervisor only adds overhead)
• Usually not highly oversubscribed (significant east-west traffic)
• Usually not SAN/NAS
Move the compute to the storage
Low-cost, DAS-based, scale-out clustered filesystem
$$$
Cost, Performance, and Capacity
[Chart axes: Cost; Data Storage Capacity. Tiers: Enterprise Database; Massive Scale-Out Column Store; Hadoop, NoSQL]
• Structured Data: Relational Database
• Unstructured Data: Machine Logs, Web Click Stream, Call Data Records, Satellite Feeds, GPS Data, Sensor Readings, Sales Data, Blogs, Emails, Video
• Cost per TB shown: $20K/TB, $10K/TB, $300-$1K/TB
• HW:SW $ split shown: 70:30 and 30:70
Big Data Software Architectures
Three basic big data software architectures
MPP Relational Database (scale-out BI/DW)
• Greenplum DB (Pivotal DB)*
• ParAccel*
• Vertica
• Netezza
• Teradata
Batch-oriented Hadoop (heavy lifting, processing)
• Cloudera*
• MapR*
• Intel Hadoop*
• Pivotal HD*
Real-time NoSQL (fast key-value store/retrieve)
• HBase (part of Apache Hadoop)*
• DataStax (Cassandra)*
• Oracle NoSQL*
• Amazon Dynamo
*Cisco Partners
What Is Hadoop?
Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.
Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.
Hadoop Components and Operations
Hadoop Distributed File System (HDFS)
• Scalable & fault tolerant
• Filesystem is distributed, stored across all data nodes in the cluster
• Files are divided into multiple large blocks: 64MB default, typically 128MB to 512MB
• Data is stored reliably. Each block is replicated 3 times by default
• Types of node functions
  – Name Node: manages HDFS
  – Job Tracker: manages MapReduce jobs
  – Data Node/Task Tracker: stores blocks/does work
[Diagram: a file split into Blocks 1-6, stored across Data nodes 1-13 behind three ToR FEX/switches, with a Name Node and a Job Tracker]
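The block and replication figures above translate into simple arithmetic. The sketch below is illustrative only: the function name and the 1000 MB example file are made up, while the 64MB block size and 3x replication are the defaults quoted on the slide.

```python
import math

BLOCK_SIZE_MB = 64   # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The final block only occupies the bytes it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

print(hdfs_footprint(1000))        # hypothetical 1000 MB file, 64MB blocks
print(hdfs_footprint(1000, 128))   # same file with 128MB blocks
```

Larger blocks mean fewer entries for the Name Node to track for the same amount of data, which is one reason larger block sizes are typical in practice.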
HDFS Architecture
Name Node metadata, file to blocks:
/usr/sean/foo.txt: blk_1, blk_2
/usr/jacob/bar.txt: blk_3, blk_4
Name Node metadata, data node to blocks:
Data node 1: blk_1
Data node 2: blk_2, blk_3
Data node 3: blk_3
[Diagram: Data nodes 1-15 behind three ToR FEX/switches and an upstream switch, with replicas of blocks 1-4 spread across the data nodes]
Rack Awareness
Rack Awareness provides Hadoop the optional ability to group nodes together in logical “racks” (i.e. failure domains)
Logical “racks” may or may not correspond to physical data center racks
Distributes blocks across different “racks” to avoid failure domain of a single “rack”
It can also lessen block movement between “racks”
[Diagram: three logical “racks” of five data nodes each (“Rack” 1: Data nodes 1-5; “Rack” 2: Data nodes 6-10; “Rack” 3: Data nodes 11-15), with three replicas each of blocks 1-4 distributed across them]
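To make the idea of spreading replicas across failure domains concrete, here is a simplified placement sketch. It follows the commonly described default HDFS approach (one replica in one rack, the remaining replicas on different nodes of a second rack), but it is an illustration under that assumption, not Hadoop's actual implementation; the rack and node names are hypothetical.

```python
import random

def place_replicas(racks, replication=3, rng=random):
    """Pick nodes so that the replicas of one block span two logical racks."""
    if replication < 2 or len(racks) < 2:
        raise ValueError("need at least 2 replicas and 2 racks")
    first_rack, second_rack = rng.sample(sorted(racks), 2)
    first_node = rng.choice(racks[first_rack])
    # Remaining replicas go to distinct nodes in the second rack.
    other_nodes = rng.sample(racks[second_rack], replication - 1)
    return [(first_rack, first_node)] + [(second_rack, n) for n in other_nodes]

# Hypothetical layout matching the diagram: three racks of five data nodes.
RACKS = {
    "rack1": [f"datanode{i}" for i in range(1, 6)],
    "rack2": [f"datanode{i}" for i in range(6, 11)],
    "rack3": [f"datanode{i}" for i in range(11, 16)],
}

placement = place_replicas(RACKS, rng=random.Random(0))
print(placement)
```

Losing any single rack still leaves at least one replica of the block, while only one copy has to cross between racks at write time.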
MapReduce Example: Word Count
Stages: Input → Map → Shuffle & Sort → Reduce → Output
Input (three splits)
• the quick brown fox
• the fox ate the mouse
• how now brown cow
Map output
• the, 1; quick, 1; brown, 1; fox, 1
• the, 1; fox, 1; the, 1; ate, 1; mouse, 1
• how, 1; now, 1; brown, 1; cow, 1
Reduce output
• brown, 2; fox, 2; how, 1; now, 1; the, 3
• ate, 1; cow, 1; mouse, 1; quick, 1
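The word count flow can be imitated on a single machine. This is a plain Python sketch of the map, shuffle/sort, and reduce stages using the three input lines from the slide; it does not use Hadoop, and the function names are illustrative.

```python
from collections import defaultdict

# The three input splits from the slide.
SPLITS = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

def map_phase(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_and_sort(mapped):
    """Shuffle & sort: group all emitted values by key, keys in sorted order."""
    groups = defaultdict(list)
    for pairs in mapped:
        for word, one in pairs:
            groups[word].append(one)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: sum the values for each key."""
    return {word: sum(ones) for word, ones in groups.items()}

counts = reduce_phase(shuffle_and_sort(map_phase(s) for s in SPLITS))
print(counts)
```

In a real cluster each split is mapped on a different Task Tracker and the keys are partitioned across several reducers, which is why the slide shows two separate reduce outputs.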
MapReduce Architecture
Job Tracker task assignments:
Job1: TT1: Mapper1, Mapper2
Job1: TT4: Mapper3, Reducer1
Job2: TT6: Reducer2
Job2: TT7: Mapper1, Mapper3
[Diagram: Task Trackers 1-15 behind three ToR FEX/switches and an upstream switch, running mapper (M1-M3) and reducer (R1, R2) tasks assigned by the Job Tracker]
Cisco WebEx Cloud and Hadoop Architecture
C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved.
Cisco WebEx Collaboration Cloud
Datacenter / PoP
Leased network link
Global Scale: 13 datacenters & iPoPs around the globe
Dedicated network: dual path 10G circuits between DCs
Multi-tenant: 95k sites
Real-time collaboration: voice, desktop sharing, video, chat
Things happen...
[Map legend: Datacenter / PoP; Leased network link]
• People make mistakes
• Hardware fails
• Software fails
• Even failovers sometimes fail
Cisco WebEx log collection overview
[Diagram labels: Application state & APIs; Flume; Log4j; File; Avro; Syslog; Thrift; AMQP; HTTP/REST; HDFSSink; SolrSink; Other Sinks; RDBMS; MySQL; Sqoop; HDFS; SolrCloud; Raw data; Solr index; Unstructured/semi-structured data; Structured data]
Cisco UCS C240 M3 servers: 12 x 3TB = 36 TB / server
Cisco UCS and Big Data
Building a big data cluster with the UCS Common Platform Architecture (CPA)
• CPA Networking
• CPA Sizing and Scaling
The evolution of big data deployments
Experimental use of Big Data
Deployed into IT Ops mandated infrastructures
“Skunk works”
Small to medium clusters
App team mandated infrastructure
Purpose built for Big Data
Big Data has established business value
Performance matters
Large or small clusters
[Diagram labels: General Purpose IT Data Center (IT Infrastructure, generic IT servers, VMware, WEB, SAP, Big Data); Dedicated “Pod” for Big Data (x86 servers, Big Data)]
Hadoop Hardware Evolving in the Enterprise
Typical 2009 Hadoop node
• 1RU server
• 4 x 1TB 3.5” spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache
• $
Economics favor “fat” nodes
• 6x-9x more data/node
• 3x-6x more IOPS/node
• Saturated gigabit, 10GE on the rise
• Fewer total nodes lowers licensing/support costs
• Increased significance of node and switch failure
Typical 2013 Hadoop node
• 2RU server
• 12 x 3TB 3.5” or 24 x 1TB 2.5” spindles
• 2 x 8-core CPU
• 1-2 x 10GE
• 128 GB RAM
• Dual PSU
• Running commercial/licensed distribution
• $$$
Cisco UCS Common Platform Architecture (CPA)
Building Blocks for Big Data
UCS 6200 Series Fabric Interconnects
Nexus 2232 Fabric Extenders
UCS Manager
UCS C240 M3 Servers
LAN, SAN, Management
CPA Network Design for Big Data
CPA: Topology
Single wire for data and management
8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port channel (static pinning)
2 x 10GE links per server for all traffic, data and management
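The 2:1 figure follows from the port counts. The sketch below assumes that each server's two 10GE links go to two different FEXs, so one FEX sees one 10GE link from each of the 16 servers in the rack; the function name is illustrative.

```python
def oversubscription(servers, server_link_gbps, uplinks, uplink_gbps=10):
    """Ratio of server-facing bandwidth to uplink bandwidth on one FEX."""
    downstream = servers * server_link_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream

# Assumption: per FEX, 16 servers x 1 x 10GE down, 8 x 10GE up.
ratio = oversubscription(servers=16, server_link_gbps=10, uplinks=8)
per_server_gbps = 10 / ratio   # worst-case share of uplink bandwidth per server, per fabric

print(f"{ratio:.0f}:1 oversubscription, {per_server_gbps:.0f} Gbit/s per server per fabric")
```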
CPA Recommended FEX Connectivity
2 FEXs and 2 FIs
• 2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32
• Distribute servers across port groups to maximize buffer performance and predictably distribute static pinning on uplinks
Can Hadoop really push 10GE?
It can, depending on workload, so tune for it!
• Analytic workloads tend to be lighter on the network
• Transform workloads tend to be heavier on the network
• Hadoop has numerous parameters which affect network
• Take advantage of 10GE CPA:
  – mapred.reduce.slowstart.completed.maps
  – dfs.balance.bandwidthPerSec
  – mapred.reduce.parallel.copies
  – mapred.reduce.tasks
  – mapred.tasktracker.reduce.tasks.maximum
  – mapred.compress.map.output
CPA Sizing and Scaling for Big Data
Cisco UCS Reference Configurations for Big Data
Full Rack UCS Solutions Bundle for Hadoop Capacity
• 2 x UCS 6296
• 2 x Nexus 2232 PP
• 16 x C240 M3 (LFF)
  – 2 x E5-2640 (12 cores)
  – 128GB
  – 12 x 3TB 7.2K SATA
Full Rack UCS Solutions Bundle for Hadoop, NoSQL Performance
• 2 x UCS 6296
• 2 x Nexus 2232 PP
• 16 x C240 M3 (SFF)
  – 2 x E5-2665 (16 cores)
  – 256GB
  – 24 x 1TB 7.2K SAS
Sizing
Start with current storage requirement
– Factor in replication (typically 3x) and compression (varies by data set)
– Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
– Factor in average daily/weekly data ingest rate
– Factor in expected growth rate (i.e. increase in ingest rate over time)
If the I/O requirement is known, use the next table for guidance
Most big data architectures are very linear, so more nodes = more capacity and better performance
Strike a balance between price/performance of individual nodes vs. total # of nodes
Part science, part art
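The sizing factors above can be combined into a rough calculation. This is a sketch under stated assumptions: the formula is one reasonable way to combine the factors, and every input in the example (100 TB today, 0.5 TB/day ingest, 2:1 compression, one-year horizon) is hypothetical. Only the 3x replication, the 20-30% free space range, and the 36 TB per server figure come from the slides.

```python
import math

def required_raw_tb(current_tb, daily_ingest_tb, days, replication=3,
                    compression_ratio=1.0, free_space=0.25, daily_growth=0.0):
    """Estimate raw disk capacity (TB) needed at the end of a planning horizon."""
    # Logical data: what exists today plus what arrives, with the ingest
    # rate optionally growing by a fixed fraction each day.
    ingest = sum(daily_ingest_tb * (1 + daily_growth) ** day for day in range(days))
    logical = current_tb + ingest
    # Compression shrinks the data, replication multiplies it.
    stored = logical / compression_ratio * replication
    # Keep a fraction of every disk free for temp/intermediate data.
    return stored / (1 - free_space)

# Hypothetical example inputs.
raw_tb = required_raw_tb(current_tb=100, daily_ingest_tb=0.5, days=365,
                         compression_ratio=2.0, free_space=0.25)
nodes = math.ceil(raw_tb / 36)   # 36 TB per server: 12 x 3TB drives, as on the slides

print(f"{raw_tb:.0f} TB raw, about {nodes} servers")
```

With these inputs the estimate lands at one full 16-server rack; changing the compression ratio or free-space fraction moves the result substantially, which is the "part art" in the sizing.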
CPA sizing and application guidelines
(Columns run from Best Performance on the left to Best Price/TB on the right.)
Server
CPU | 2 x E5-2690 | 2 x E5-2665 | 2 x E5-2640
Memory (GB) | 256 | 256 | 128
Disk Drives | 24 x 600GB 10K | 24 x 1TB 7.2K | 12 x 3TB 7.2K
IO Bandwidth (GB/sec) | 2.6 | 2.0 | 1.1
Rack-Level
Cores | 256 | 256 | 192
Memory (TB) | 4 | 4 | 2
Capacity (TB) | 225 | 384 | 576
IO Bandwidth (GB/sec) | 41.3 | 31.9 | 16.9
Applications | MPP DB, NoSQL | Hadoop, NoSQL | Hadoop
Scaling the CPA
• Single Rack: 16 servers
• Single Domain: up to 10 racks, 160 servers
• Multiple Domains: L2/L3 switching
Scaling the Common Platform Architecture
Multiple domains based on 16 servers per rack and 2 x 2232 FEXs
Consider intra- and inter-domain bandwidth:
Servers per domain (pair of Fabric Interconnects) | Available northbound 10GE ports (per fabric) | Southbound oversubscription (per fabric) | Northbound oversubscription (per fabric) | Intra-domain server-to-server bandwidth (per fabric, Gbits/sec) | Inter-domain server-to-server bandwidth (per fabric, Gbits/sec)
160 | 16 | 2:1 | 5:1 | 5 | 1
144 | 24 | 2:1 | 3:1 | 5 | 1.67
128 | 32 | 2:1 | 2:1 | 5 | 2.5
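The rows of the table can be reproduced with a few lines of arithmetic. The sketch assumes 96 usable 10GE ports per Fabric Interconnect (consistent with the table, where FEX uplinks plus northbound ports always total 96), one FEX per rack per fabric, and 8 uplinks per FEX; the constant and function names are illustrative.

```python
FI_PORTS = 96            # assumption: usable 10GE ports on one Fabric Interconnect
SERVERS_PER_RACK = 16
UPLINKS_PER_FEX = 8
LINK_GBPS = 10

def domain_bandwidth(servers):
    """Per-fabric port budget and bandwidth for one UCS domain."""
    racks = servers // SERVERS_PER_RACK
    southbound_ports = racks * UPLINKS_PER_FEX          # FI ports used by FEX uplinks
    northbound_ports = FI_PORTS - southbound_ports      # FI ports left for uplinks out of the domain
    south_oversub = SERVERS_PER_RACK / UPLINKS_PER_FEX  # server links vs. FEX uplinks
    north_oversub = southbound_ports / northbound_ports
    intra = LINK_GBPS / south_oversub                   # Gbit/s per server inside the domain
    inter = intra / north_oversub                       # Gbit/s per server between domains
    return northbound_ports, south_oversub, north_oversub, intra, round(inter, 2)

for servers in (160, 144, 128):
    print(servers, domain_bandwidth(servers))
```

Giving up one or two racks per domain frees Fabric Interconnect ports for northbound links, which is what raises the inter-domain bandwidth in the lower rows.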
Multi-Domain CPA Customer Example
• 10 Gbits/sec Intra-Domain Server to Server NW Bandwidth
• 5 Gbits/sec Inter-Domain Server to Server NW Bandwidth
• Static pinning from FEX to FI (no port-channel)
Recommendations: UCS Domains and Racks
Single Domain Recommendation
Turn off or enable at physical rack level
• For simplicity and ease of use, leave Rack Awareness off
• Consider turning it on to limit physical rack level fault domain (e.g. localized failures due to physical data center issues – water, power, cooling, etc.)
Multi Domain Recommendation
Create one Hadoop rack per UCS Domain
• With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack
• Provides HDFS data protection across domains
• Helps minimize cross-domain traffic
Exercise 1
Set up a single node VM cluster on the laptop
– Step 1: copy files from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc
Hive
An SQL-like interface to Hadoop
Top level Apache project – http://hive.apache.org/
Hive history
– Created at Facebook to allow people to quickly and easily leverage Hadoop without the effort of writing Java MapReduce
– Currently used at many companies for log processing, business intelligence and analytics
Hive Components
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe
Data Model
Tables
– Typed columns (int, float, string, boolean)
– Also list and map types (for JSON-like data)
Partitions
– For example, range-partition tables by date
Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
Hive
 | DBMS | Hive
Language | SQL-92 standard | Subset of SQL-92 plus Hive extensions
Updates | INSERT, UPDATE, DELETE | INSERT OVERWRITE; no UPDATE or DELETE
Transactions | Yes | No
Latency | Sub-second | Minutes to hours
Indexes | Any number of indexes, important to performance | No indexes, data is always scanned in parallel
Dataset size | TBs | PBs
Metastore
• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and other relational databases
Source: cc-licensed slide by Cloudera
Hive components
[Diagram labels: Hive MetaStore; SerDe; InputFormat; Hadoop cluster]
Source: cc-licensed slide by Cloudera
Hive MetaStore
[Diagram labels: MetaStore; RDBMS; Impala; HCatalog; Pig; HiveServer2; HiveCLI; BeelineCLI]
Hive Physical Layout
Warehouse directory in HDFS
– E.g., /user/hive/warehouse
Tables stored in subdirectories of warehouse
– Partitions form subdirectories of tables
Actual data stored in HDFS files
– E.g. text, SequenceFile, RCFile, Avro
– Arbitrary format with a custom SerDe
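The directory layout described above can be sketched as a small path-building helper. It assumes the default database and the usual `column=value` naming for partition directories; the table name `weblogs` and the partition column `dt` are hypothetical.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"   # example warehouse directory from the slide

def partition_dir(table, partition_spec=(), warehouse=WAREHOUSE):
    """Build the HDFS directory for a table, or for one of its partitions.

    partition_spec is an ordered sequence of (column, value) pairs; each pair
    becomes one "column=value" subdirectory under the table directory.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(warehouse, table, *parts)

# Hypothetical table partitioned by date.
print(partition_dir("weblogs"))
print(partition_dir("weblogs", [("dt", "2013-09-15")]))
```

Because each partition is its own directory, a query restricted to one date only needs to read the files under that directory.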
External and Hive managed tables
Hive managed tables
– Data moved to location /user/hive/warehouse
– Can be stored in a more efficient format than text, e.g. RCFile
– If you drop the table, the raw data is lost
hive> CREATE TABLE test (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
External tables
– Can overlay multiple tables all pointing to the same raw data
– To create an external table, use CREATE EXTERNAL TABLE and point to the location of the data
– If you drop the table, only the table definition is removed; the raw data stays in place
hive> CREATE EXTERNAL TABLE test (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/home/test/data';
Hive: Example
Hive looks similar to an SQL database
Relational join on two tables:
– Table of word counts from Shakespeare collection
– Table of word counts from the Bible
SELECT s.word, s.freq, k.freq
FROM shakespeare s JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 5;
the 25848 62394
I 23031 8854
and 19671 38985
to 18038 13526
of 16700 34654
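For readers less familiar with SQL, the same join, filter, sort, and limit logic can be written out in plain Python. This is only an illustration of what the query computes: the two dictionaries contain just the five rows shown in the result above, not the full word-count tables, and the function name is made up.

```python
# Only the five rows visible in the query result; the real tables are much larger.
shakespeare = {"the": 25848, "I": 23031, "and": 19671, "to": 18038, "of": 16700}
bible = {"the": 62394, "I": 8854, "and": 38985, "to": 13526, "of": 34654}

def top_common_words(left, right, limit=5):
    """Inner join on word, keep freq >= 1 on both sides, order by left freq descending."""
    joined = [
        (word, freq, right[word])
        for word, freq in left.items()
        if word in right and freq >= 1 and right[word] >= 1
    ]
    joined.sort(key=lambda row: row[1], reverse=True)
    return joined[:limit]

for word, s_freq, k_freq in top_common_words(shakespeare, bible):
    print(word, s_freq, k_freq)
```

Hive compiles the equivalent query into MapReduce jobs so that the join and sort run in parallel across the cluster instead of in one process.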
Impala
Impala
General purpose MPP SQL query engine for Hadoop
– Query latency milliseconds to hours, interactive data exploration
– Runs on the existing Hadoop cluster on existing HDFS files and hardware
High performance
– C++
– Direct access to HDFS and HBase data, no MapReduce
Unified platform
– Use existing Hive metadata and query language (HiveQL)
– Submit queries via ODBC or Thrift API
Performance
– Disk throughput limited by hardware to 100MB/sec
– 3x to 90x faster than Hive, depending on the type of query
Impala Details
[Diagram: a SQL App connects via ODBC; shared services are the Hive Metastore, HDFS NN, and StateStored (HiveQL interface, unified metadata); each of three impalad instances contains a Query Planner, Query Coordinator, and Query Exec Engine, and is shown with an HDFS DN and HBase]
Each impalad stays in contact with StateStored to update its state and to receive metadata for query planning
The query coordinator initiates execution on remote impalads
Intermediate results are streamed between impalads and query results are streamed back to the client
Exercise 2
Analytics with Hive and Impala
– Step 1: copy test dataset from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc