Date posted: 26-Jan-2015
Category: Technology
Uploaded by: emc-academic-alliance
1 © Copyright 2013 EMC Corporation. All rights reserved.
Virtualize Big Data to Make the Elephant Dance
June Yang, Senior Director of Product Management, VMware
Dan Baskette, Senior Consultant Technologist, Pivotal
Unstructured Data is Exploding… Hadoop is Driving Growth
Unstructured data driving growth
[Chart: structured vs. unstructured data volume, 2011 to 2020]
Complex unstructured data forecasted to outpace structured relational data by 10x by 2020
Hadoop adoption is ramping
[Chart: Hadoop adoption status]
• Evaluating: 53%
• In production: 23%
• Piloting: 18%
• Testing: 2%
• Don't know: 2%
• Other: 2%
Source: Forrester Survey of 60 CIOs, September 2011
• Unstructured data explosion and Hadoop capabilities are causing CIOs to reconsider enterprise data strategy
• Gartner predicts +800% data growth over the next 5 years
• Hadoop's ability to process raw data at cost presents an intriguing value proposition for CIOs
Broad Application of Hadoop Technology
Use Cases
• Log Processing / Click Stream Analytics
• Machine Learning / sophisticated data mining
• Web crawling / text processing
• Extract Transform Load (ETL) replacement
• Image / XML message processing
• General archiving / compliance
Vertical Industries
• Financial Services
• Mobile / Telecom
• Internet Retailer
• Scientific Research
• Pharmaceutical / Drug Discovery
• Social Media
Hadoop is a platform that will revolutionize how Enterprises handle data
The Big Data Journey in the Enterprise
Stage 1: Hadoop Piloting
• Often start with line of business
• Try 1 or 2 use cases to explore the value of Hadoop
Stage 2: Hadoop Production
• Serve a few departments
• More use cases
• Growing # and size of clusters
• Core Hadoop + components
Stage 3: Cloud Analytics Platform
• Serve many departments
• Often part of mission critical workflow
• Fully integrated with analytics/BI tools
[Chart axes: Scale (0 nodes, 10's, 100's) vs. Integrated]
Deploy Hadoop Clusters in Minutes
One click to scale out your cluster on the fly
Customize your Hadoop/HBase Cluster
Customize with Cluster Specification File
Cluster Spec File Details
Cluster Specification File
  "groups": [
    { "name": "master",
      "roles": [ "hadoop_namenode", "hadoop_jobtracker" ],
      "storage": { "type": "SHARED", "sizeGB": 20 },
      "instance_type": "MEDIUM",
      "instance_num": 1,
      "ha": true },
    { "name": "worker",
      "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
      "instance_type": "SMALL",
      "instance_num": 5,
      "ha": false
  …
Callouts
• Resource configuration (instance_type)
• Storage configuration: choice of shared storage or local disk (storage)
• High availability option (ha)
• # of Hadoop nodes (instance_num)
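A spec like the one on this slide can be checked programmatically before it is used. The sketch below is a minimal Python illustration, not part of Serengeti: it assumes the truncated fragment is closed into a complete JSON object with a top-level "groups" array, and the helper name summarize_spec is hypothetical. The field names are the ones shown on the slide.

```python
import json

# Hypothetical, completed version of the slide's fragment. The closing
# brackets are an assumption: the slide truncates the file with "…".
SPEC_TEXT = """
{
  "groups": [
    {"name": "master",
     "roles": ["hadoop_namenode", "hadoop_jobtracker"],
     "storage": {"type": "SHARED", "sizeGB": 20},
     "instance_type": "MEDIUM",
     "instance_num": 1,
     "ha": true},
    {"name": "worker",
     "roles": ["hadoop_datanode", "hadoop_tasktracker"],
     "instance_type": "SMALL",
     "instance_num": 5,
     "ha": false}
  ]
}
"""


def summarize_spec(spec_text):
    """Return (total node count, names of HA-enabled groups, nodes per role)."""
    spec = json.loads(spec_text)
    total_nodes = 0
    ha_groups = []
    nodes_per_role = {}
    for group in spec["groups"]:
        count = group["instance_num"]
        total_nodes += count
        # Treat a missing "ha" key as "HA disabled" (assumed default).
        if group.get("ha", False):
            ha_groups.append(group["name"])
        for role in group["roles"]:
            nodes_per_role[role] = nodes_per_role.get(role, 0) + count
    return total_nodes, ha_groups, nodes_per_role


if __name__ == "__main__":
    total, ha_groups, per_role = summarize_spec(SPEC_TEXT)
    print("total nodes:", total)
    print("HA groups:", ha_groups)
    print("nodes per role:", per_role)
```

For the spec above, the summary reports one master node with HA enabled and five worker nodes without it.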
Your Choice of Hadoop Distributions and Tools
[Logo groups: Community Projects, Distributions]
• Flexibility to choose and try out major distributions
• Support for multiple projects
• Open architecture to welcome industry participation
• Contributing Hadoop Virtualization Extensions (HVE) to open source community
Proactive Monitoring with VCOPs
• Gain comprehensive visibility
• Eliminate manual processes with intelligent automation
• Proactively manage operations
• Alternatively, use monitoring tools like Nagios, Ganglia
Beyond day 1 - Automation of Hadoop Cluster lifecycle management
Deploy
Customize
Load data
Execute jobs
Tune configuration
Scaling
…
The Big Data Journey in the Enterprise
Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle
Stage 2: Hadoop Production
• Serve a few departments
• More use cases
• Growing # and size of clusters
• Core Hadoop + components
[Chart axes: Scale (0 nodes, 10's, 100's) vs. Integrated]
Achieve HA for the Entire Hadoop Stack
[Diagram: Hadoop stack]
• BI Reporting, ETL Tools
• Pig (Data Flow), Hive (SQL), HCatalog
• MapReduce (Job Scheduling/Execution System)
• HBase (Key-Value store)
• HDFS (Hadoop Distributed File System)
• ZooKeeper (Coordination), Management Server, RDBMS
[Callouts: Namenode, Jobtracker, Hive MetaDB, HCatalog MDB, Server]
• vSphere HA is battle-tested high availability technology
• Single mechanism to achieve HA for the entire Hadoop stack
• One click to enable HA and/or FT
Challenges of Running Hadoop in Enterprises
[Diagram: each department runs its own Production, Test, and Experimentation clusters]
• Dept A: recommendation engine
• Dept B: ad targeting
• Data sources: log files, social data, transaction data, historical customer behavior
• On the horizon: NoSQL, real-time SQL, …
Pain Points:
1. Cluster sprawl
2. Redundant common data in separate clusters
3. Difficult to use the right tool for the right problem
4. Peak compute and I/O resource is limited to the number of nodes in each independent cluster
What if you can…
[Diagram: the separate Production, Test, and Experimentation clusters for the recommendation engine and ad targeting are replaced by Experimentation, Production recommendation engine, Production Ad Targeting, and Test/Dev clusters on one platform]
One physical platform to support multiple virtual big data clusters
Bigger is Better
Hadoop is linearly scalable: more nodes mean better performance. The same job will take
– 2 hours to complete on a 50 node cluster
– 1 hour to complete on a 100 node cluster
– 30 min to complete on a 200 node cluster
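The three figures on this slide follow from an idealized model in which runtime is inversely proportional to node count. The short Python sketch below reproduces that arithmetic; the function name estimated_runtime_minutes is hypothetical, and the model assumes perfectly linear scaling as claimed on the slide, with no coordination or I/O overhead.

```python
def estimated_runtime_minutes(base_nodes, base_minutes, nodes):
    """Ideal linear scaling: runtime is inversely proportional to node count."""
    if base_nodes <= 0 or nodes <= 0:
        raise ValueError("node counts must be positive")
    return base_minutes * base_nodes / nodes


if __name__ == "__main__":
    # Baseline from the slide: 2 hours (120 minutes) on 50 nodes.
    for n in (50, 100, 200):
        minutes = estimated_runtime_minutes(50, 120, n)
        print(n, "nodes:", minutes, "min")
```

Doubling the node count halves the estimate, which is the slide's argument for pooling hardware into one larger shared cluster.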
You may ask
What about differentiated SLAs?
– Production Hadoop jobs need high priority
– Experimental Hadoop jobs get lower priority
Will I have noisy neighbor problems with a shared infrastructure approach?
VM Containers with Isolation are a Tried and Tested Approach
[Diagram: Hungry Workload 1, Reckless Workload 2, and Noisy Workload 3 contained in VMs on hosts running VMware vSphere + Serengeti]
Shared infrastructure: Three big types of Isolation are Required
Resource Isolation
• Control the greedy noisy neighbor
• Reserve resources to meet needs
Version Isolation
• Allow concurrent OS, App, Distro versions
Security Isolation
• Provide privacy between users/groups
• Runtime and data privacy required
[Diagram: multiple hosts running VMware vSphere + Serengeti]
With virtualization, you can have your cake and eat it too
One physical platform to support multiple virtual big data clusters
– Share data to minimize copying
– Single infrastructure to maintain
– Bigger cluster for better performance
– Share hardware resources to achieve higher utilization
Virtualization ensures strong isolation between clusters
– Resource isolation
– Failure isolation
– Configuration isolation
– Security isolation
[Diagram: Experimentation, Production recommendation engine, Production Ad Targeting, and Test/Dev clusters, labeled High Priority or Low Priority, in a compute layer above a data layer on VMware vSphere + Serengeti]
Elastic Hadoop with Virtualization
Combined Storage/Compute
• Unmodified Hadoop node in a VM
• VM lifecycle determined by Datanode
• Limited elasticity
Separate Compute from Storage
• Separate compute from data
• Stateless compute
• Elastic compute
Separate Virtual Compute Clusters per Tenant
• Separate virtual compute cluster per tenant
• Stronger VM-grade security and resource isolation
[Diagram: a Hadoop node VM combining compute and storage; compute VMs separated from storage VMs; per-tenant compute VMs (T1, T2) over shared storage]
Scale in/out Hadoop dynamically
• Deploy separate compute clusters for different tenants sharing HDFS
• Commission/decommission task trackers according to priority and available resources
[Diagram: Experimentation and Production recommendation engine compute clusters, each with its own Job Tracker and a set of Compute VMs in a dynamic resource pool, over a common data layer on VMware vSphere + Serengeti]
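Commissioning and decommissioning task trackers by priority can be pictured as a simple allocation over a fixed pool of compute VMs. The Python sketch below is only an illustration of that idea and does not show how Serengeti or vSphere actually schedules resources: the function names, the "lower number means higher priority" convention, the pool size of 16, and the demand figures are all assumptions.

```python
def allocate_compute_vms(capacity, demands):
    """Assign compute VMs to clusters in priority order.

    capacity: total compute VMs the shared pool can run (assumed value).
    demands: list of (cluster_name, priority, requested_vms); a lower
             priority number means higher priority (assumed convention).
    Returns {cluster_name: granted_vms}.
    """
    granted = {}
    remaining = capacity
    for name, _priority, requested in sorted(demands, key=lambda d: d[1]):
        grant = min(requested, remaining)
        granted[name] = grant
        remaining -= grant
    return granted


def rebalance(current, target):
    """Return {cluster: delta}; positive = commission, negative = decommission."""
    clusters = set(current) | set(target)
    return {c: target.get(c, 0) - current.get(c, 0) for c in clusters}


if __name__ == "__main__":
    current = {"production": 8, "experimentation": 8}
    # Production demand rises to 12 VMs; the pool holds 16 in total.
    target = allocate_compute_vms(
        16, [("experimentation", 2, 8), ("production", 1, 12)]
    )
    print("target allocation:", target)
    print("changes:", rebalance(current, target))
```

In this example the production cluster gains four task-tracker VMs and the experimentation cluster gives up four. The shared HDFS data layer is untouched, which is what makes stateless compute VMs cheap to add and remove.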
The Big Data Journey in the Enterprise
Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle
Stage 2: Hadoop Production
• High Availability
• Consolidation
• Differentiated SLAs
• Elastic Scaling
Stage 3: Cloud Analytics Platform
• Serve many departments
• Often part of mission critical workflow
• Fully integrated with analytics/BI tools
[Chart axes: Scale (0 nodes, 10's, 100's) vs. Integrated]
Cloud Analytics Platform
[Diagram components]
• ETL, Real Time Streams
• HDFS, Real Time Structured Database, Data Warehouse
• Unstructured and Batch Processing, Stream Processing
• Automated Models, Business Intelligence, Machine Learning, CETAS, Data Visualization, …
• Cloud Infrastructure: Compute, Storage, Networking
Big Data Tools and Characteristics
Framework | Scale of data | Scale of cluster | Computable data? | Local disks?
Map-reduce: Hadoop | 100s PB | 10s to 1,000s | Yes | Yes, for cost, bandwidth and availability
Big-SQL: HAWQ, Aster Data, Impala, … | PBs | 10s to 100s | Some | Yes, for cost and bandwidth
No-SQL: Cassandra, HBase, … | Trillions of rows | 10s to 100s | Some | Yes, for cost and availability
In-Memory: Redis, GemFire, Membase, … | Billions of rows | 10s to 100s | Yes | Primarily memory
Choose a platform that…
• Allows users to pick the right tools at the right time
• Puts resources where needed based on SLA policy
In-house Hadoop as a Service – (Hadoop + Hadoop)
[Diagram: multiple hosts running VMware vSphere + Serengeti]
• Compute layer: Ad hoc data mining, Production recommendation engine, Production ETL of log files
• Data layer: HDFS
Integrated Big Data Production – (Mixed big data workloads)
[Diagram: compute layer and data layer on multiple hosts running VMware vSphere + Serengeti]
• Workloads: Hadoop batch analysis, HBase real-time queries, NoSQL – Cassandra key-value store, MPP DBMS – Analysis of structured data
• Storage: HDFS
Integrated Hadoop and Webapps – (Big Data + Other Workloads)
[Diagram: compute layer and data layer on multiple hosts running VMware vSphere + Serengeti]
• Workloads: Short-lived Hadoop compute cluster, Hadoop compute cluster, Web servers for ecommerce site
• Storage: HDFS
The Big Data Journey in the Enterprise
Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle
Stage 2: Hadoop Production
• High Availability
• Consolidation
• Differentiated SLAs
• Elastic Scaling
Stage 3: Cloud Analytics Platform
• Mixed workloads
• Right tool at the right time
• Flexible and elastic infrastructure
[Chart axes: Scale (0 nodes, 10's, 100's) vs. Integrated]
Learn More
• Download and try Serengeti
– projectserengeti.org
• VMware Hadoop site
– vmware.com/hadoop
• Hadoop performance on vSphere white paper
– http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
• Hadoop virtualization extensions (HVE) Whitepaper
– http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
Thank You!
June Yang, Senior Director, VMware
Dan Baskette, Senior Consultant Technologist
Pivotal Sessions at EMC World
Session | Presenter | Dates/Times
The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications | Josh Klahr | Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action | Noelle Sio | Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench | Clinton Ooi, Bhavin Modi | Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights | SK Krishnamurthy | Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
Pivotal: Big & Fast data – merging real-time data and deep analytics | Michael Crutcher | Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
Pivotal: Virtualize Big Data to Make The Elephant Dance | June Yang, Dan Baskette | Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
Hadoop Design Patterns | Don Miner | Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005