© Copyright 2012 EMC Corporation. All rights reserved.
Greenplum Analytics Workbench
APURVA DESAI
Overview
What is Hadoop?
– Distributed computing paradigm
– File system – HDFS
– Processing framework – MapReduce
– Languages – Pig, Hive
– Key-value store – HBase
Why is it important?
– Big Data is everywhere
– Big Data is mostly unstructured
– Need affordable, scalable NoSQL processing
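The MapReduce processing model named above can be sketched in a few lines. This is a minimal single-process illustration, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Sum the counts collected for one word.
    return key, sum(values)


def word_count(lines):
    # Run map over every line, shuffle by key, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; here everything runs in one process to show the data flow only.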
Analytics Workbench - Motivation
Open source
– Hadoop industry is nascent
– Big Data development needs scale
Greenplum
– Innovation & experimentation platform
– Contribute to the community
– GPDB & GPHD: mixed-mode environment
Greenplum Vision
Buildout Pre-requisites
Hardware systems integration
Hadoop experience
Program Management
Partner ecosystem
Greenplum has in-house expertise
Team Introduction
System Integration – Greg, Eric, Don, Dave, Patrick
Program Management – Mike, Joe
Hadoop – Apurva, Judes, Clinton, Chandra, Ashwin
Partners
Intel – 2,000 Westmere CPUs
Mellanox – 1,000+ NICs, 72 IB switches
Micron – 6,000 8 GB DRAM modules
Seagate – 12,000 2 TB drives
Supermicro – 1,000 chassis/motherboards
Partners
Switch – Hosting Facilities
VMware – Operational Support
– Rubicon
Peek @ the Cluster
Cluster Statistics
Number of physical hosts: > 1,000 (> 10,000 with VMs)
Number of racks: 54 (50 just for the DataNodes)
Number of processors: > 24,000
Amount of RAM: > 48 TB
Amount of disk capacity: > 24 PB
– “Equivalent to nearly half of the entire written works of mankind from the beginning of recorded history”
Largest cluster for Apache Hadoop validation!
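The RAM and disk totals can be cross-checked against the component counts on the Partners slide. The sketch below does that arithmetic in decimal units. The processor figure depends on an assumption that is not stated in the slides: that each Westmere CPU exposes 12 logical processors (6 cores with Hyper-Threading). The function name `cluster_totals` is hypothetical.

```python
# Component counts as listed on the Partners slide.
CPUS = 2_000      # Intel Westmere CPUs
DIMMS = 6_000     # Micron 8 GB DRAM modules
DRIVES = 12_000   # Seagate 2 TB drives
CHASSIS = 1_000   # Supermicro chassis/motherboards

# Assumption (not from the slides): 6 cores with Hyper-Threading per CPU,
# i.e. 12 logical processors per socket.
LOGICAL_PROCESSORS_PER_CPU = 12


def cluster_totals():
    # Aggregate capacity, using decimal units (1 TB = 1,000 GB).
    return {
        "processors": CPUS * LOGICAL_PROCESSORS_PER_CPU,
        "ram_tb": DIMMS * 8 / 1_000,
        "disk_pb": DRIVES * 2 / 1_000,
        # Per-chassis figures implied by the same counts.
        "cpus_per_host": CPUS // CHASSIS,
        "ram_gb_per_host": DIMMS * 8 // CHASSIS,
        "disk_tb_per_host": DRIVES * 2 // CHASSIS,
    }
```

Under these assumptions the totals come to 24,000 processors, 48 TB of RAM, and 24 PB of raw disk, which line up with the lower bounds quoted on the slide.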
Namenode
Job Tracker
CPU
Use Cases
Hadoop Review
Hadoop Shuffle
Initial Use Cases
Apache Hadoop Validation
Mellanox UDA
Terasort Benchmark
Apache Hadoop Validation
Purpose
– Run Apache Hadoop validation at scale
– Validate cluster configuration
Various configurations validated
– Standard out-of-the-box configs
– Configs modified for I/O-intensive processing
Apache Hadoop Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation, 1000 nodes. Y-axis: Execution Time (Min), 0 to 1.2.]
Apache Hadoop Findings
Apache BigTop for integration tests
Functional validation passed as expected
Next Steps
– Identify integration cases
– Contribute back to BigTop
– Stabilize Hadoop 0.23
Mellanox UDA - Overview
RDMA in the Hadoop shuffle stage
Register Map & Reduce task buffers
Hadoop JobTracker (JT) for task completion
Copy sorted map task output to reduce input
Perform in-memory merge at reduce
Avoid disk spills for large inputs
Reduce CPU load for sort & merge
GP + Mellanox collaboration
– Open-sourcing UDA
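The in-memory merge at the reducer can be illustrated with a streaming k-way merge over already-sorted map outputs. This sketch shows the merge step only; it does not model RDMA transfers or buffer registration, and it is not UDA's implementation. The function name `merge_sorted_map_outputs` is hypothetical.

```python
import heapq


def merge_sorted_map_outputs(map_outputs):
    # Each element of map_outputs is a list of (key, value) pairs that one
    # map task has already sorted by key. heapq.merge performs a k-way merge
    # that holds one pending record per input, so nothing is re-sorted.
    return list(heapq.merge(*map_outputs, key=lambda pair: pair[0]))
```

For example, merging `[("a", 1), ("c", 3)]` with `[("b", 2), ("d", 4)]` yields the four pairs in key order. Because the inputs arrive sorted, the reducer only compares the heads of each stream, which is the work the slide describes as cheaper than a full sort.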
Mellanox UDA Preliminary Results
Preliminary UDA results provided by Mellanox
Show improvement with UDA vs. vanilla Hadoop
Better CPU utilization
Reduced execution time
Next Steps
– Run on Analytics Workbench, scheduled for June 2012
– Configuration on the workbench to turn it on/off
TeraSort Benchmark
Industry standard benchmark
Good validation of configuration
3 Steps
– TeraGen – Generate 1 TB of data
– TeraSort – Sort generated data
– TeraValidate – Validate the sort
Measure time for each step
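The three steps and their per-step timing can be mimicked at toy scale in a single process. This is an analogue for illustration, not the Hadoop example jobs: records here are random bytes, while the 100-byte record and 10-byte key sizes follow the TeraGen format. The function names are hypothetical.

```python
import os
import time

RECORD_SIZE = 100  # bytes per record, as in TeraGen
KEY_SIZE = 10      # leading bytes of each record used as the sort key


def teragen(num_records):
    # Generate random fixed-size records.
    return [os.urandom(RECORD_SIZE) for _ in range(num_records)]


def terasort(records):
    # Sort records by their leading key bytes.
    return sorted(records, key=lambda record: record[:KEY_SIZE])


def teravalidate(records):
    # Check that every adjacent pair of keys is in non-decreasing order.
    return all(a[:KEY_SIZE] <= b[:KEY_SIZE] for a, b in zip(records, records[1:]))


def run_benchmark(num_records):
    # Time each step separately, mirroring "measure time for each step".
    timings = {}
    start = time.perf_counter()
    data = teragen(num_records)
    timings["teragen"] = time.perf_counter() - start

    start = time.perf_counter()
    ordered = terasort(data)
    timings["terasort"] = time.perf_counter() - start

    start = time.perf_counter()
    valid = teravalidate(ordered)
    timings["teravalidate"] = time.perf_counter() - start
    return valid, timings
```

At 100 bytes per record, a 1 TB run (decimal) corresponds to 10^10 records; `run_benchmark(1000)` exercises the same three steps on roughly 100 KB.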
TeraSort Benchmark Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation - TeraSort. X-axis: # of TB generated and sorted (1 TB, 10 TB). Y-axis: Execution Time in Sec, 0 to 9. Series: TeraGen, TeraSort.]
TeraSort Benchmark Findings
Minimal tuning of configuration
Results are within expected range.
Next Steps
– Tune the cluster for optimal performance
– Use the benchmark for every new release
Lessons Learnt
Buildout Progress
[Chart: Number of nodes (0 to 1,200) by month, Dec '11 to April '12. Series: racked, ready.]
“Real” Hadoop Cluster
Categories
Racking & Stacking
Networking
Non Hadoop Hosts
Base OS Setup
Hadoop Deployment
Post deployment
Process
In Closing
Upcoming work
Workbench Tasks
– Load various data sets
– Load GPDB, Hive, HBase, ZooKeeper, etc.
– Load Chorus, Command Center, UAP stack
– VM provisioning
– Various audits
On-boarding candidates
– HD Education
– Apache Hadoop Build & Validate
– Mellanox UDA
– Intel HiBench
– Big data benchmarking
– High-resolution image processing, etc.
A day in the life @ Switch
Q & A
Other Relevant Greenplum Sessions
Session | Presenter | Times
Unified Analytics Platform Introduction | Brian Wilson | Tues 10:00-11:00, Thurs 1:00-2:00
Greenplum Database Overview | Michael Crutcher | Mon 8:30-9:30, Wed 10:00-11:00
Greenplum Hadoop Overview | Susheel Kaushik | Mon 10:00-11:00, Wed 4:15-5:15
Greenplum DCA Overview | Hanxi Chen | Mon 4:00-5:00, Thurs 10:00-11:00
Greenplum Analytics Workbench | Apurva Desai | Wed 8:30-9:30, Thurs 10:00-11:00
Analytics on Hadoop | Don Miner | Tues 11:30-12:30, Thurs 8:30-9:30
Optimizing Greenplum Database on VMware Virtualized Infrastructure | Kevin O’Leary | Mon 4:00-5:00, Tues 4:15-5:15
Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers) | Mike Maxey | Wed 4:15-5:15, Thurs 11:30-12:30
Analytics for Business Value: Collaboration | Josh Klahr | Mon 10:00-11:00, Wed 2:45-3:45
Disruptive Data Science: How Data Science and Big Data are Transforming Business, IT and People | Annika Jimenez, David Dietrich | Tues 4:15-5:15, Thurs 11:30-12:30