© 2009 VMware Inc. All rights reserved
Serengeti - Virtualize Your Big Data Applications
蔺永华
VMware, Inc.
Agenda
• Today’s big data system
• Why virtualize Hadoop?
• Serengeti introduction
• Common questions about virtualization
• Serengeti solution
• Deep insight into Serengeti
• Summary
• Q & A
Today’s Big Data System:
[Diagram labels: ETL; Real Time Streams; Unstructured Data (HDFS); Real Time Structured Database; Big SQL; Data Parallel Batch Processing; Real-Time Processing (S4, Storm); Analytics]
Challenges of Running Hadoop on Physical Infrastructure
Deployment
• Difficult to deploy: it can take several people days or even months
• Difficult to tune cluster performance
Low Efficiency
• Hadoop clusters typically do not fully utilize all hardware resources.
• Difficult to share resources safely between different workloads
Single Point of Failure
• NameNode and JobTracker are single points of failure
• No HA for Hive, HCatalog, etc.
Why Virtualize Hadoop? - Get your Hadoop cluster in minutes
Manual process, takes days:
• Hadoop installation and configuration
• Network configuration
• OS installation
• Server preparation
Automated by Serengeti on vSphere with best practices:
• Fully automated process: 10 minutes to get a Hadoop/HBase cluster from scratch
• 1/1000 of the human effort
• Minimal Hadoop operations knowledge required
Why Virtualize Hadoop? - Consolidate sprawling clusters
Cluster Sprawl
• Single-purpose clusters for various business applications lead to cluster sprawl.
Cluster Consolidation
• Clusters share servers with strong isolation
Simplify
• Single Hardware Infrastructure
• Unified operations
Optimize
• Shared Resources = higher utilization
• Elastic resources = faster on-demand access
[Diagram labels: Hadoop Dev, Hadoop Prod, HBase, Finance Hadoop, Portal Hadoop; Virtualization Platform]
30% lower CAPEX
Why Virtualize Hadoop? - Utilize all your resources to solve the priority problem
• 50%+ of resources sit idle while a high-priority job burns up its cluster.
• Utilize all resources from the pool on demand.
• Dynamic elastic scaling on a shared resource pool
3X faster to get analytic results
vSphere High Availability (HA) - protection against unplanned downtime
• Protection against host and VM failures
• Automatic failure detection (host, guest OS)
• Automatic virtual machine restart in minutes, on any available host in cluster
• OS and application-independent; does not require complex configuration changes
High Availability for the Hadoop Stack
[Diagram labels]
• Stack: BI Reporting, ETL Tools; Pig (Data Flow), Hive (SQL), HCatalog; MapReduce (Job Scheduling/Execution System); HBase (Key-Value store); HDFS (Hadoop Distributed File System); ZooKeeper (Coordination); Management Server; RDBMS
• Server: Namenode, Jobtracker, Hive MetaDB, HCatalog MDB
vSphere Fault Tolerance provides continuous protection
[Diagram: App/OS virtual machines on two VMware ESX hosts, labeled FT and HA]
• Single identical VMs running in lockstep on separate hosts
• Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
• Integrated with VMware HA/DRS
• No complex clustering or specialized hardware required
• Single common mechanism for all applications and operating systems
Zero downtime for NameNode, JobTracker and other components in Hadoop clusters
Serengeti: easy and rapid deployment and management
• Open source project launched in June 2012; version 0.8 was released in April, and 0.9 is planned for June
• Toolkit that leverages virtualization to simplify Hadoop deployment and operations
• Deploy a cluster in 10 minutes, fully automated
• Customize Hadoop and HBase clusters
• Automated cluster operation
• Comes with ecosystem components
• Supports all popular Hadoop distributions
Demo: 10 minutes to a Hadoop cluster with Serengeti
Common questions about virtualization
Local Disk
• Can local disks be used in a virtualized environment?
Flexibility and Scalability
• How can resources be scheduled flexibly between clusters and different applications, as mentioned above?
Data stability
• In a virtual environment, how can data be distributed across hosts and racks?
Data locality
• Hadoop schedules compute tasks near the data to reduce network I/O for data reads and writes. Can a virtual environment achieve the same result?
Performance
• How does Hadoop perform in a virtual environment?
Can I use local disk easily?
Serengeti Extends the Virtual Storage Architecture to Include Local Disk
Shared Storage: SAN or NAS
• Easy to provision
• Automated cluster rebalancing
Hybrid Storage
• SAN for boot images, other
workloads
• Local disk for Hadoop & HDFS
[Diagram: six hosts, each running a mix of Hadoop VMs and other VMs]
How can a cluster scale in and out flexibly?
How can resources be scheduled flexibly between clusters and different applications?
Evolution of Hadoop on VMs - Data/Compute separation
Current Hadoop: Combined Storage/Compute (Hadoop in VM)
• VM lifecycle determined by Datanode
• Limited elasticity
Separate Storage
• Separate compute from data
• Remove the elasticity constraint imposed by the Datanode
• Elastic compute
• Raise utilization
Separate Compute Clusters
• Separate virtual compute
• Compute cluster per tenant
• Stronger VM-grade security and resource isolation
[Diagram labels: Compute, Storage, T1, T2, VM]
Serengeti Node Scale Out / Scale In
[Diagram: hosts running the NameNode, the JobTracker, and slave nodes; hosts carry nodes marked D and C]
Serengeti Ballooning Enhancement for Java Applications
[Diagram labels: JVM, Guest OS, Host]
How can data be kept stable?
How can data be accessed locally if the data node and compute node are located in different VMs?
Distributed and Data/Compute Associated VM Placement
[Diagram, data node and task tracker combined cluster: master host (name node, job tracker) and worker hosts (data node, task tracker), spread across Rack 1 and Rack 2]
[Diagram, data/compute separated cluster: an HDFS cluster plus compute-only clusters (job tracker, task trackers), spread across Rack 1 and Rack 2]
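One way to picture the placement idea on this slide is the sketch below. It is a simplified illustration and not Serengeti's placement algorithm: the function name `place_vms`, the round-robin strategy, and the VM naming are all assumptions made for the example. Data-node VMs are spread across hosts, and compute VMs are placed only on hosts that already hold a data node, so that compute stays next to data.

```python
from collections import defaultdict
from itertools import cycle


def place_vms(hosts, num_data, num_compute):
    """Spread data-node VMs over hosts, then co-locate compute VMs with them.

    Hypothetical helper for illustration only.
    """
    placement = defaultdict(list)

    # Round-robin the data nodes so they are distributed across hosts.
    host_cycle = cycle(hosts)
    for i in range(num_data):
        placement[next(host_cycle)].append(f"data-{i}")

    # Compute VMs are "associated" with data: only hosts with a data node qualify.
    data_hosts = [h for h in hosts if placement[h]]
    if num_compute and not data_hosts:
        raise ValueError("no host holds a data node to associate compute with")

    compute_cycle = cycle(data_hosts)
    for i in range(num_compute):
        placement[next(compute_cycle)].append(f"compute-{i}")

    return dict(placement)


result = place_vms(["h1", "h2", "h3"], num_data=2, num_compute=4)
for host, vms in result.items():
    print(host, vms)
```

A real placement engine would also weigh rack membership, free capacity, and storage type; the sketch only shows the distribution and association constraints.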
Hadoop Topology Awareness - Serengeti HVE
[Figure: Hadoop Topology Changes for Virtualization. Two topology trees rooted at "/", with data centers D1-D2, racks R1-R4, and hosts H1-H12; the virtualization-aware tree adds a node-group layer, N1-N8.]
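The effect of an extra topology layer can be shown with a small sketch. It assumes topology paths written as /datacenter/rack/nodegroup/node and counts the tree edges between two leaves. The path strings and the `distance` helper are hypothetical and written for this example; they are not Hadoop's API.

```python
def distance(path_a: str, path_b: str) -> int:
    """Count tree edges between two leaves given as '/'-separated paths."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    if len(a) != len(b):
        raise ValueError("paths must have the same number of layers")

    # Length of the shared prefix, i.e. depth of the closest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1

    # Hops up from each leaf to the common ancestor.
    return (len(a) - common) + (len(b) - common)


same_group = distance("/d1/r1/n1/h1", "/d1/r1/n1/h2")
same_rack = distance("/d1/r1/n1/h1", "/d1/r1/n2/h3")
other_rack = distance("/d1/r1/n1/h1", "/d1/r2/n3/h5")
print(same_group, same_rack, other_rack)
```

With the added layer, two nodes in the same node group are closer to each other than two nodes that merely share a rack, which gives a scheduler or a replica policy something to distinguish them by.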
Hadoop Virtualization Extensions (HVE) for Topology
• Hadoop Common: Network Topology Extension
• HDFS: Replica Placement Policy Extension, Replica Removal Policy Extension, Replica Choosing Policy Extension, Balancer Policy Extension
• MapReduce: Task Scheduling Policy Extension
JIRAs: HADOOP-8468 (Umbrella JIRA), HADOOP-8469, HADOOP-8470, HADOOP-8472, HDFS-3495, HDFS-3498, MAPREDUCE-4309, MAPREDUCE-4310
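As a rough illustration of what a virtualization-aware replica placement policy has to guarantee, the sketch below picks replica targets so that no two replicas share a node group. It is a simplified sketch under stated assumptions, not the HVE implementation: the `Node` type, the `choose_replicas` function, and the reading of a node group as the set of VMs on one physical host are all assumptions made for the example.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    rack: str
    node_group: str  # assumed: identifies the physical host the VM runs on


def choose_replicas(writer, candidates, replication=3):
    """Greedy choice: distinct node groups, second replica off-rack if possible."""
    chosen = [writer]
    remaining = [n for n in candidates if n != writer]
    while len(chosen) < replication:
        used_groups = {c.node_group for c in chosen}
        eligible = [n for n in remaining if n.node_group not in used_groups]
        if len(chosen) == 1:
            # Prefer another rack for the second replica; fall back if none exists.
            off_rack = [n for n in eligible if n.rack != writer.rack]
            eligible = off_rack or eligible
        if not eligible:
            break  # not enough distinct node groups to satisfy replication
        pick = eligible[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


vm1 = Node("vm1", "r1", "hostA")
pool = [
    Node("vm2", "r1", "hostA"),
    Node("vm3", "r1", "hostB"),
    Node("vm4", "r2", "hostC"),
    Node("vm5", "r2", "hostC"),
    Node("vm6", "r2", "hostD"),
]
replicas = choose_replicas(vm1, pool)
print([n.name for n in replicas])
```

Without the node-group check, vm1 and vm2 would look like two independent nodes on the same rack even though a single host failure takes out both.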
Is there significant performance degradation in a virtualized environment?
Is there any performance data?
Virtualized Hadoop Performance
Native versus Virtual Platforms, 32 hosts, 16 disks/host
Source: http://www.vmware.com/resources/techresources/10360
Serengeti architecture diagram
[Diagram components]
• Clients: CLI Client (Spring Shell), UI Client (Flex UI)
• Serengeti Web Service: Rest API, Spring Batch, Hibernate/DAO, Meta DB (vPostgres)
• Steps: Update Meta DB step, VM Placement calculation, VM Provision step, Software Mgmt step, VHM step
• VC adapter, vCenter
• Ironfan service: Thrift Service, Ironfan, Progress report
• Chef server: Rest API, Cookbook
• RabbitMQ, VM runtime Manager
• Package repository
• Virtualization Platform: Hosts running Hadoop Nodes (Chef Client, HA kit)
Customizing your Hadoop/HBase cluster with Serengeti
Choice of distros
Storage configuration
• Choice of shared storage or Local disk
Resource configuration
High availability option
# of nodes
…
"distro":"apache",
"groups":[
  { "name":"master",
    "roles":[ "hadoop_namenode", "hadoop_jobtracker" ],
    "storage": { "type": "SHARED", "sizeGB": 20 },
    "instance_type":"MEDIUM",
    "instance_num":1,
    "ha":true },
  { "name":"worker",
    "roles":[ "hadoop_datanode", "hadoop_tasktracker" ],
    "instance_type":"SMALL",
    "instance_num":5,
    "ha":false
…
One command to scale out your cluster with Serengeti
> cluster resize --name <clustername> --nodegroup worker --instanceNum <#>
Configure/reconfigure Hadoop with ease by Serengeti
Modify Hadoop cluster configuration from Serengeti
• Use the “configuration” section of the JSON spec file
• Specify Hadoop attributes in core-site.xml, hdfs-site.xml, mapred-site.xml,
hadoop-env.sh, log4j.properties
• Apply new Hadoop configuration using the edited spec file
"configuration": {
"hadoop": {
"core-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/core-default.html
},
"hdfs-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/hdfs-default.html
},
"mapred-site.xml": {
// check for all settings at http://hadoop.apache.org/common/docs/r1.0.0/mapred-default.html
"io.sort.mb": "300"
} ,
"hadoop-env.sh": {
// "HADOOP_HEAPSIZE": "",
// "HADOOP_NAMENODE_OPTS": "",
// "HADOOP_DATANODE_OPTS": "",
…
> cluster config --name myHadoop --specFile /home/serengeti/myHadoop.json
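A spec fragment like the one above can also be generated rather than edited by hand. The sketch below builds the "configuration" section as a Python dict and serializes it with the standard `json` module. The helper name `build_config_section` is hypothetical; the keys and the `io.sort.mb` value are taken from the slide. Note that the slide's `//` comments are not valid in strict JSON, so the generated text omits them; whether Serengeti's own parser accepts comments is not stated here.

```python
import json


def build_config_section(mapred_site, hadoop_env=None):
    """Return a dict shaped like the "configuration" section on the slide."""
    return {
        "configuration": {
            "hadoop": {
                "core-site.xml": {},
                "hdfs-site.xml": {},
                "mapred-site.xml": dict(mapred_site),
                "hadoop-env.sh": dict(hadoop_env or {}),
            }
        }
    }


# Same override as on the slide: raise the map-side sort buffer.
section = build_config_section({"io.sort.mb": "300"})
spec_text = json.dumps(section, indent=2)
print(spec_text)
```

The resulting text would be merged into the cluster's spec file, which is then applied with the `cluster config` command shown above.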
Freedom of Choice and Open Source
Community Projects Distributions
• Flexibility to choose from major distributions
• Support for multiple projects
• Open architecture to welcome industry participation
• Contributing Hadoop Virtualization Extensions (HVE) to open
source community
cluster create --name myHadoop --distro apache
HDFS2 with Namenode Federation and HA
Deploy CDH4 Hadoop cluster
• Name Node Federation
• Name Node HA
• MapReduce v1
• HBase, Pig, Hive, and Hive Server
CDH4 configurations
Scale out
Elasticity
JobTracker HA/FT
[Diagram: Namenode Group 1 and Namenode Group 2, each with an Active Namenode and a Standby Namenode; a Zookeeper Group (three ZK nodes) coordinates each group; quorum-based metadata store; eight Datanodes send block reports]
Proactive monitoring and tuning with VCOPs
Proactively monitoring through VCOPs
Gain comprehensive visibility
Eliminate manual processes with intelligent automation
Proactively manage operations
VMware brings Agility, Efficiency, and Elasticity to Big Data
Agility
• Deploy, configure and monitor Hadoop clusters on the fly
• Dynamic reconfiguring of Hadoop to meet changing business demands
Efficiency
• Consolidate Hadoop to achieve higher utilization
• Pool resources to allow for increased performance and priority job processing
Elasticity
• Enable full elasticity through separation of Data and Compute
• Scale Hadoop in/out under resource constraints
Serengeti Resources
Download and try Serengeti
• projectserengeti.org
VMware Hadoop site
• vmware.com/hadoop
Hadoop performance on vSphere
• http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
Hadoop High Availability solution
• vmware.com/files/pdf/Apache-Hadoop-VMware-HA-solution.pdf