Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?

Date post: 12-Jun-2015
Upload: emc-academic-alliance
Description:
This session discusses the rationale behind the Greenplum Analytics Workbench initiative: its goals, its present status, and the roadmap for this first-of-a-kind initiative. Enterprises learn how a Hadoop cloud can help unlock revenue opportunities from the data within the cluster.
Transcript
1 © Copyright 2012 EMC Corporation. All rights reserved.

Greenplum Analytics Workbench

APURVA DESAI

Overview

What is Hadoop?

– Distributed computing paradigm
– File system – HDFS
– Processing framework – MapReduce
– Languages – Pig, Hive
– Key-value store – HBase

Why is it important?

– Big Data is everywhere
– Big Data is mostly unstructured
– Need affordable, scalable NoSQL processing
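
To make the MapReduce processing model concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) as a word count. It is only an illustration in plain Python, not Hadoop's API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three stages in sequence on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data is everywhere", "big data is mostly unstructured"]))
```

On a real cluster the map and reduce calls run as separate tasks on many hosts and the shuffle moves data over the network; the sketch keeps everything in one process so the data flow is easy to follow.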

Analytics Workbench - Motivation

Open source
– Hadoop industry is nascent
– Big Data development needs scale

Greenplum
– Innovation & experimentation platform
– Contribute to the community
– GPDB & GPHD – mixed-mode environment

Greenplum Vision

Buildout Pre-requisites

Hardware systems integration

Hadoop experience

Program Management

Partner ecosystem

Greenplum has in-house expertise

Team Introduction

System Integration – Greg, Eric, Don, Dave, Patrick
Program Management – Mike, Joe
Hadoop – Apurva, Judes, Clinton, Chandra, Ashwin

Partners

Intel – 2,000 Westmere CPUs
Mellanox – 1,000+ NICs; 72 IB switches
Micron – 6,000 × 8 GB DRAM
Seagate – 12,000 × 2 TB drives
Supermicro – 1,000 chassis/motherboards

Partners

Switch – Hosting Facilities

VMware – Operational Support

– Rubicon

Peek @ the Cluster

Cluster Statistics

# of physical hosts: > 1,000 (> 10,000 with VMs)
# of racks: 54 (50 just for the DataNodes)
# of processors: > 24,000
Amount of RAM: > 48 TB
Amount of disk capacity: > 24 PB
– “Equivalent to nearly half of the entire written works of mankind from the beginning of recorded history”

Largest cluster for Apache Hadoop validation!
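
The disk and RAM totals above follow directly from the part counts on the Partners slide, and a short calculation shows how. The sketch below is a back-of-the-envelope check, not a statement about how the cluster was actually configured. It assumes decimal units, an even spread of parts across the 1,000 chassis, and HDFS's default replication factor of 3; none of these assumptions is stated in the slides.

```python
# Part counts as listed on the Partners slide.
DRIVES = 12_000      # Seagate drives
DRIVE_TB = 2         # capacity per drive, in TB
DIMMS = 6_000        # Micron DRAM parts
DIMM_GB = 8          # capacity per part, in GB
CHASSIS = 1_000      # Supermicro chassis

# Assumption (not from the slides): HDFS's default replication factor.
REPLICATION = 3


def raw_disk_pb() -> float:
    """Total raw disk capacity in PB, using decimal units (1 PB = 1,000 TB)."""
    return DRIVES * DRIVE_TB / 1_000


def total_ram_tb() -> float:
    """Total RAM in TB, using decimal units (1 TB = 1,000 GB)."""
    return DIMMS * DIMM_GB / 1_000


def usable_hdfs_pb(replication: int = REPLICATION) -> float:
    """Rough HDFS capacity once every block is stored `replication` times."""
    return raw_disk_pb() / replication


def per_chassis() -> dict:
    """Per-chassis averages, assuming parts are spread evenly (an assumption)."""
    drives = DRIVES // CHASSIS
    return {
        "drives": drives,
        "disk_tb": drives * DRIVE_TB,
        "ram_gb": DIMMS * DIMM_GB // CHASSIS,
    }


if __name__ == "__main__":
    print(f"raw disk: {raw_disk_pb()} PB, RAM: {total_ram_tb()} TB")
    print(f"usable HDFS at replication {REPLICATION}: {usable_hdfs_pb()} PB")
    print(per_chassis())
```

Under these assumptions the raw figures match the slide (24 PB of disk, 48 TB of RAM), while the space available for unique data in HDFS would be roughly a third of the raw disk.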

Namenode

Job Tracker

CPU

Use Cases

Hadoop Review

Hadoop Shuffle

Initial Use Cases

Apache Hadoop Validation

Mellanox UDA

TeraSort Benchmark

Apache Hadoop Validation

Purpose
– Run Apache Hadoop validation at scale
– Validate cluster configuration

Various configurations validated
– Standard out-of-the-box configs
– Configs modified for I/O-intensive processing

Apache Hadoop Preliminary Results

[Figure: chart titled "Apache Hadoop-1.0.0 validation". Y-axis: Execution Time (Min), 0 to 1.2. Series: 1000 Nodes.]

Apache Hadoop Findings

Apache Bigtop for integration tests
Functional validation passed as expected

Next Steps
– Identify integration cases
– Contribute back to Bigtop
– Stabilize Hadoop 0.23

Mellanox UDA - Overview

RDMA in the Hadoop shuffle stage
– Register map and reduce task buffers
– Hadoop JobTracker (JT) for task completion
– Copy sorted map task output to reduce input
– Perform in-memory merge at the reducer
– Avoid disk spills for large inputs
– Reduce CPU load for sort and merge

GP + Mellanox collaboration – Open sourcing UDA
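
The "in-memory merge at the reducer" step can be pictured as a k-way merge of map outputs that are already sorted by key. The sketch below shows only that merge idea, using Python's standard `heapq.merge`; it is not UDA's implementation and does not model RDMA or buffer registration. The names `merge_map_outputs` and `reduce_merged` and the sample data are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int]


def merge_map_outputs(segments: Iterable[List[Record]]) -> Iterator[Record]:
    """Lazily merge map output segments, each already sorted by key,
    into a single key-ordered stream without re-sorting or spilling."""
    return heapq.merge(*segments, key=itemgetter(0))


def reduce_merged(stream: Iterable[Record]) -> List[Record]:
    """Sum the values of adjacent records that share a key."""
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(stream, key=itemgetter(0))
    ]


if __name__ == "__main__":
    # Three hypothetical map task outputs, each sorted by key.
    segments = [
        [("apple", 1), ("cat", 2)],
        [("apple", 3), ("bird", 1)],
        [("cat", 1), ("dog", 4)],
    ]
    print(reduce_merged(merge_map_outputs(segments)))
```

Because every segment arrives sorted, the merge only needs to compare the head of each segment, which is why keeping the segments in memory avoids both disk spills and a full re-sort.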

Mellanox UDA Preliminary Results

Preliminary UDA results, provided by Mellanox, show improvement with UDA vs. vanilla Hadoop
– Better CPU utilization
– Reduced execution time

Next Steps
– Run on the Analytics Workbench, scheduled for June 2012
– Configuration on the workbench to turn it on/off

TeraSort Benchmark

Industry-standard benchmark
Good validation of configuration

3 steps
– TeraGen – Generate 1 TB of data
– TeraSort – Sort generated data
– TeraValidate – Validate the sort

Measure time for each step
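
The three steps and the per-step timing can be mimicked at toy scale on a single machine. The sketch below is a stand-in for illustration, not the Hadoop example programs themselves: it generates random 100-byte rows (the row size TeraGen uses, with a 10-byte key), sorts them by key, checks the order, and times each step. The function names are hypothetical, and the validation here checks key order only.

```python
import os
import time
from typing import Callable, Dict, List, Tuple

ROW_BYTES = 100   # TeraGen rows are 100 bytes each
KEY_BYTES = 10    # the first 10 bytes of a row are its sort key


def rows_for(total_bytes: int) -> int:
    """Number of 100-byte rows needed to reach a target data size."""
    return total_bytes // ROW_BYTES


def teragen(num_rows: int) -> List[bytes]:
    """Step 1: generate random rows (toy stand-in for TeraGen)."""
    return [os.urandom(ROW_BYTES) for _ in range(num_rows)]


def terasort(rows: List[bytes]) -> List[bytes]:
    """Step 2: sort rows by their key prefix (toy stand-in for TeraSort)."""
    return sorted(rows, key=lambda row: row[:KEY_BYTES])


def teravalidate(rows: List[bytes]) -> bool:
    """Step 3: confirm that keys never decrease from one row to the next."""
    return all(a[:KEY_BYTES] <= b[:KEY_BYTES] for a, b in zip(rows, rows[1:]))


def timed(step: Callable, *args) -> Tuple[object, float]:
    """Run one step and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = step(*args)
    return result, time.perf_counter() - start


def run_benchmark(num_rows: int) -> Dict[str, object]:
    """Run generate, sort and validate in order, timing each step separately."""
    rows, t_gen = timed(teragen, num_rows)
    ordered, t_sort = timed(terasort, rows)
    valid, t_val = timed(teravalidate, ordered)
    return {"teragen_s": t_gen, "terasort_s": t_sort,
            "teravalidate_s": t_val, "valid": valid}


if __name__ == "__main__":
    # 1 TB (taken here as 10**12 bytes, an assumption) corresponds to 10**10 rows.
    print("rows for 1 TB:", rows_for(10**12))
    print(run_benchmark(100_000))
```

Timing the steps separately, as the slide suggests, makes it possible to tell whether a configuration change helped the write-heavy generation step, the shuffle-heavy sort step, or the read-heavy validation step.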

TeraSort Benchmark Preliminary Results

[Figure: chart titled "Apache Hadoop-1.0.0 validation - TeraSort". X-axis: # of TB Generated and Sorted (1 TB, 10 TB). Y-axis: Execution Time in Sec, 0 to 9. Series: TeraGen, TeraSort.]

TeraSort Benchmark Findings

Minimal tuning of configuration
Results are within expected range

Next Steps
– Tune the cluster for optimal performance
– Use the benchmark for every new release

Lessons Learnt

Buildout Progress

[Figure: chart of buildout progress. X-axis: Month (Dec '11, Jan '12, Feb '12, Mar '12, April '12). Y-axis: Number of nodes, 0 to 1200. Series: racked, ready.]

“Real” Hadoop Cluster

Categories

Racking & Stacking

Networking

Non Hadoop Hosts

Base OS Setup

Hadoop Deployment

Post deployment

Process

In Closing

Upcoming work

Workbench Tasks
– Load various data sets
– Load GPDB, Hive, HBase, ZooKeeper, etc.
– Load Chorus, Command Center, UAP stack
– VM provisioning
– Various audits

On-boarding candidates
– HD Education
– Apache Hadoop Build & Validate
– Mellanox UDA
– Intel HiBench
– Big data benchmarking
– High-resolution image processing, etc.

A day in the life @ Switch

Q & A

Other Relevant Greenplum Sessions

Session | Presenter | Times
Unified Analytics Platform Introduction | Brian Wilson | Tues 10:00-11:00, Thurs 1:00-2:00
Greenplum Database Overview | Michael Crutcher | Mon 8:30-9:30, Wed 10:00-11:00
Greenplum Hadoop Overview | Susheel Kaushik | Mon 10:00-11:00, Wed 4:15-5:15
Greenplum DCA Overview | Hanxi Chen | Mon 4:00-5:00, Thurs 10:00-11:00
Greenplum Analytics Workbench | Apurva Desai | Wed 8:30-9:30, Thurs 10:00-11:00
Analytics on Hadoop | Don Miner | Tues 11:30-12:30, Thurs 8:30-9:30
Optimizing Greenplum Database on VMware Virtualized Infrastructure | Kevin O’Leary | Mon 4:00-5:00, Tues 4:15-5:15
Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers) | Mike Maxey | Wed 4:15-5:15, Thurs 11:30-12:30
Analytics for Business Value: Collaboration | Josh Klahr | Mon 10:00-11:00, Wed 2:45-3:45
Disruptive Data Science: How Data Science and Big Data are Transforming Business, IT and People | Annika Jimenez, David Dietrich | Tues 4:15-5:15, Thurs 11:30-12:30
