Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?

© Copyright 2012 EMC Corporation. All rights reserved.

Greenplum Analytics Workbench

APURVA DESAI


Overview


What is Hadoop?

What is Hadoop?

– Distributed computing paradigm

– File system – HDFS

– Processing framework – MapReduce

– Languages – Pig, Hive

– Key-value store – HBase
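As an illustration of the MapReduce processing model listed above, here is a minimal single-process sketch in Python. It is not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and the shuffle that Hadoop performs across the network is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster this grouping moves data between nodes; here it is
    only a dictionary held in memory.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data is everywhere", "big data is mostly unstructured"]))
```

The three functions correspond to the stages a Hadoop job passes through; on a cluster, each map and reduce call would run as a separate task against data stored in HDFS.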

Why is it important?

– BIG Data is everywhere

– BIG Data is mostly unstructured

– Need affordable, scalable NoSQL processing


Analytics Workbench - Motivation

Open source

– Hadoop industry is nascent

– BIG Data development needs scale

Greenplum

– Innovation & Experimentation platform

– Contribute to the community

– GPDB & GPHD - Mixed mode environment


Greenplum Vision


Buildout Pre-requisites

Hardware systems integration

Hadoop experience

Program Management

Partner ecosystem

Greenplum has In-house Expertise


Team Introduction

System Integration – Greg, Eric, Don, Dave, Patrick

Program Management – Mike, Joe

Hadoop – Apurva, Judes, Clinton, Chandra, Ashwin


Partners

Intel – 2,000 Westmere CPUs

Mellanox – 1,000+ NICs

– 72 IB switches

Micron – 6,000 8GB DRAM

Seagate – 12,000 2TB Drives

Supermicro – 1,000 chassis/motherboards


Partners

Switch – Hosting Facilities

VMware – Operational Support

– Rubicon


Peek @ the Cluster


Cluster Statistics

# Of Physical Hosts : > 1,000 (> 10,000 with VMs)

# Of Racks : 54 (50 just for the DataNodes)

# Of Processors : > 24,000

Amount Of RAM : > 48TB

Amount of Disk Capacity : > 24PB

– “Equivalent to nearly half of the entire written works of mankind from the beginning of recorded history”

Largest cluster for Apache Hadoop validation!
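The RAM and disk figures above can be cross-checked against the component counts on the Partners slide (6,000 8GB DRAM, 12,000 2TB drives). The sketch below does that arithmetic in Python using decimal units. The usable-capacity function is an assumption added for illustration: it applies the HDFS default replication factor of 3, which the slides do not confirm as the workbench setting, and it ignores space reserved for non-HDFS use.

```python
# Component counts taken from the Partners slide.
DRIVES = 12_000
TB_PER_DRIVE = 2
DIMMS = 6_000
GB_PER_DIMM = 8


def raw_disk_pb(drives=DRIVES, tb_per_drive=TB_PER_DRIVE):
    """Raw disk capacity in petabytes (decimal: 1 PB = 1000 TB)."""
    return drives * tb_per_drive / 1000


def total_ram_tb(dimms=DIMMS, gb_per_dimm=GB_PER_DIMM):
    """Total memory in terabytes (decimal: 1 TB = 1000 GB)."""
    return dimms * gb_per_dimm / 1000


def usable_hdfs_pb(raw_pb, replication=3):
    """Rough usable HDFS capacity.

    ASSUMPTION: replication=3 is the HDFS default, not a figure from the
    slides; real usable space is lower once temporary and non-HDFS space
    is reserved.
    """
    return raw_pb / replication


if __name__ == "__main__":
    raw = raw_disk_pb()
    print(f"Raw disk: {raw:.0f} PB")
    print(f"RAM: {total_ram_tb():.0f} TB")
    print(f"Usable at 3x replication: {usable_hdfs_pb(raw):.0f} PB")
```

The partner components alone account for 24 PB of disk and 48 TB of RAM, which matches the slide's "> 24PB" and "> 48TB" if the remaining hosts contribute the excess.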


Namenode


Job Tracker


CPU


Use Cases


Hadoop Review


Hadoop Shuffle


Initial Use Cases

Apache Hadoop Validation

Mellanox UDA

Terasort Benchmark


Apache Hadoop Validation

Purpose

– Run Apache Hadoop Validation at Scale

– Validate cluster configuration

Various Configurations Validated

– Standard Out Of The Box Configs

– Configs Modified For IO Intensive Processing


Apache Hadoop Preliminary Results

[Chart: Apache Hadoop-1.0.0 validation, 1000 Nodes. Y-axis: Execution Time (Min), 0 to 1.2.]


Apache Hadoop Findings

Apache BigTop for integration tests

Functional validation passed as expected

Next Steps

– Identify integration cases

– Contribute back to BigTop

– Stabilize Hadoop 0.23


Mellanox UDA - Overview

RDMA in Hadoop Shuffle stage

Register Map & Reduce task buffers

Hadoop JobTracker (JT) for task completion

Copy sorted map task output to reduce input

Perform in-memory merge at reduce

Avoid disk spills for large inputs

Reduce CPU load for sort & merge

GP + Mellanox collaboration

– Open-sourcing UDA
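The in-memory merge at the reduce side can be pictured as a k-way merge over already-sorted map outputs. The Python sketch below shows only that merge logic, using the standard-library `heapq.merge`. It does not model RDMA, buffer registration, or UDA itself; the function names `merge_map_outputs` and `reduce_sorted` are hypothetical.

```python
import heapq


def merge_map_outputs(map_outputs):
    """Lazily merge several sorted lists of (key, value) pairs into one
    sorted stream, without re-sorting and without writing to disk."""
    return heapq.merge(*map_outputs, key=lambda pair: pair[0])


def reduce_sorted(stream):
    """Sum the values of each run of equal keys in a key-sorted stream."""
    current_key, total = None, 0
    for key, value in stream:
        if current_key is not None and key != current_key:
            yield current_key, total
            total = 0
        current_key = key
        total += value
    if current_key is not None:
        yield current_key, total


if __name__ == "__main__":
    # Each inner list stands for the sorted output of one map task.
    outputs = [
        [("a", 1), ("c", 1)],
        [("a", 2), ("b", 5)],
    ]
    print(list(reduce_sorted(merge_map_outputs(outputs))))
```

Because each map output is already sorted, the reducer only ever needs to compare the heads of the incoming streams, which is why keeping the merge in memory can avoid disk spills when the buffers fit.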


Mellanox UDA Preliminary Results

Preliminary UDA results provided by Mellanox

Show improvement with UDA vs Vanilla Hadoop.

Better CPU utilization

Reduced execution time

Next Steps

– Run on Analytics Workbench, scheduled for June 2012

– Configuration on the workbench to turn it on/off


TeraSort Benchmark

Industry standard benchmark

Good validation of configuration

3 Steps

– TeraGen – Generate 1TB of data

– TeraSort – Sort generated data

– TeraValidate – Validate the sort

Measure time for each step
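The generate, sort, validate, and time-each-step flow can be mimicked at toy scale in Python. This is a sketch, not the Hadoop benchmark: it runs in one process on random bytes, and the helper names (`rows_for`, `teragen`, `terasort`, `teravalidate`, `timed`) are hypothetical. It does reuse the record layout of the real benchmark, where each row is 100 bytes and the first 10 bytes are the sort key, so 1TB corresponds to 10 billion rows.

```python
import os
import time

ROW_BYTES = 100  # TeraGen writes 100-byte rows
KEY_BYTES = 10   # the first 10 bytes of each row form the sort key


def rows_for(total_bytes):
    """Number of 100-byte rows needed to produce total_bytes of data."""
    return total_bytes // ROW_BYTES


def teragen(num_rows):
    """Toy generator: random rows (the real TeraGen uses its own row format)."""
    return [os.urandom(ROW_BYTES) for _ in range(num_rows)]


def terasort(rows):
    """Sort rows by their 10-byte key."""
    return sorted(rows, key=lambda row: row[:KEY_BYTES])


def teravalidate(rows):
    """Check that keys are in non-decreasing order."""
    return all(
        rows[i][:KEY_BYTES] <= rows[i + 1][:KEY_BYTES]
        for i in range(len(rows) - 1)
    )


def timed(step, *args):
    """Run one step and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = step(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    print("Rows for 1 TB:", rows_for(10**12))
    data, gen_s = timed(teragen, 10_000)  # 1 MB here, not 1 TB
    ordered, sort_s = timed(terasort, data)
    ok, val_s = timed(teravalidate, ordered)
    print(f"gen {gen_s:.4f}s  sort {sort_s:.4f}s  validate {val_s:.4f}s  ok={ok}")
```

Timing each step separately, as the slide suggests, makes it possible to tell whether a slow run is limited by data generation (write throughput) or by the sort itself (shuffle and merge).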


TeraSort Benchmark Preliminary Results

[Chart: Apache Hadoop-1.0.0 validation - TeraSort. Series: TeraGen, TeraSort. X-axis: # of TB Generated and Sorted (1 TB, 10 TB). Y-axis: Execution Time in Sec, 0 to 9.]


TeraSort Benchmark Findings

Minimal tuning of configuration

Results are within expected range.

Next Steps

– Tune the cluster for optimal performance

– Use the benchmark for every new release


Lessons Learnt


Buildout Progress

[Chart: Buildout Progress. X-axis: Month (Dec '11, Jan '12, Feb '12, Mar '12, April '12). Y-axis: Number of nodes, 0 to 1200. Legend: racked ready.]


“Real” Hadoop Cluster


Categories

Racking & Stacking

Networking

Non Hadoop Hosts

Base OS Setup

Hadoop Deployment

Post deployment

Process


In Closing


Upcoming work

Workbench Tasks

– Load various data sets

– Load GPDB, Hive, HBase, ZooKeeper, etc.

– Load Chorus, Command Center, UAP stack

– VM provisioning

– Various audits

On-boarding candidates

– HD Education

– Apache Hadoop Build & Validate

– Mellanox UDA

– Intel HiBench

– Big data benchmarking

– High-resolution image processing, etc.


A day in the life @ Switch


Q & A


Other Relevant Greenplum Sessions

Session | Presenter | Times

Unified Analytics Platform Introduction | Brian Wilson | Tues 10:00-11:00, Thurs 1:00-2:00

Greenplum Database Overview | Michael Crutcher | Mon 8:30-9:30, Wed 10:00-11:00

Greenplum Hadoop Overview | Susheel Kaushik | Mon 10:00-11:00, Wed 4:15-5:15

Greenplum DCA Overview | Hanxi Chen | Mon 4:00-5:00, Thurs 10:00-11:00

Greenplum Analytics Workbench | Apurva Desai | Wed 8:30-9:30, Thurs 10:00-11:00

Analytics on Hadoop | Don Miner | Tues 11:30-12:30, Thurs 8:30-9:30

Optimizing Greenplum Database on VMware Virtualized Infrastructure | Kevin O’Leary | Mon 4:00-5:00, Tues 4:15-5:15

Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers) | Mike Maxey | Wed 4:15-5:15, Thurs 11:30-12:30

Analytics for Business Value: Collaboration | Josh Klahr | Mon 10:00-11:00, Wed 2:45-3:45

Disruptive Data Science — How Data Science and Big Data are Transforming Business, IT and People | Annika Jimenez, David Dietrich | Tues 4:15-5:15, Thurs 11:30-12:30

Date post:	12-Jun-2015
Category:	Technology
Upload:	emc-academic-alliance