Greenplum, IIIT Allahabad

Post on 14-Apr-2017


Data Computing Division

© Copyright 2011 EMC Corporation. All rights reserved.

Presented by

Brijesh Kumar Awasthi

IMP2014002

What is Greenplum?

• Greenplum, the company, was founded in September 2003 by Scott Yara and Luke Lonergan.

• It was a merger of two smaller companies, Metapa in Los Angeles and Didera in Fairfax, Virginia.

• Greenplum, based in San Mateo, California, released its database management system software in April 2005, calling it Bizgres.


EMC Acquires Greenplum

Greenplum Becomes the Foundation of EMC's Data Computing Division


“Greenplum, with expertise in the massively parallel arena, will give the storage giant a boost in big-data computing.”

– InformationWeek –

“For three years, Gartner has identified Greenplum as the most advanced vendor in the visionary quadrant of its data warehouse DBMS Magic Quadrant….”

– Gartner –

What the COO of EMC said about Greenplum and BI


New Realities… New Demands!

• Do it faster
– Ingest more data
– Ingest it faster
– Keep it unsummarised, keep it for longer

• Be more responsive
– Unpredictable queries, rapidly evolving bespoke analytics
– New tools: Hadoop, MapReduce, Hive, HBase, “R”

• Manage new data types
– Manage and allow queries across structured, semi-structured and unstructured data

• Do it at a lower cost

Big Data will revolutionize Data Warehousing and analysis.


Why Greenplum?

Fast Data Loading

Extreme Performance & Elastic Scalability

Unified Data Access


• EMC Greenplum is a shared nothing, massively parallel processing (MPP) data warehouse system

• Core principle of data computing is to move the processing dramatically closer to the data and to the people


Architecture diagram (labels recovered from the slide):

• Clients: standard Business Intelligence and analytical tools (SQL, BI tools, analytical tools). Clients see a single database.

• Master Server: query planning & dispatch. Primary server, plus hot failover.

• Network Interconnect

• Segment Servers: query processing & data storage. Queries are distributed across all available resources.

• Data Sources: loading, streaming, etc. External files, URLs, Hadoop (HDFS), Web Services (including from other DBs), O/S pipes (including from other DBs). Data loading also takes advantage of the MPP architecture.

• Hadoop MapReduce

Shared Nothing, Massively Parallel Processing means no bottlenecks and linear scalability.

Greenplum handles structured, semi-structured and unstructured data.


Why is MPP different?

Greenplum is a scale-out architecture on standard commodity hardware.

MPP
• Queries shipped to each node simultaneously
• Execute in parallel on each segment instance
• Multiple pipelines of data
• Highly scalable topology
• Locks and buffers not shared

Traditional
• Single database buffer used by all user operations
• More locks mean a more complex lock management system
• Single pipe to data
• Limited scalability
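The MPP side of this comparison can be illustrated with a minimal Python sketch. It is not Greenplum code: the data, `segment_sum` and `master_total` are hypothetical names chosen for illustration, and threads merely stand in for separate segment servers. Each worker scans only its own slice, and the master only combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice owned by one segment.
SEGMENTS = [
    [12, 111, 42],
    [64, 32, 12],
    [34, 213, 15],
    [102, 82, 55],
]


def segment_sum(rows):
    """Work done locally on one segment: scan its own rows only."""
    return sum(rows)


def master_total(segments):
    """Master role: ship the same request to every segment, then combine."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_sum, segments))
    return sum(partials), partials


if __name__ == "__main__":
    total, partials = master_total(SEGMENTS)
    print("partial sums per segment:", partials)
    print("combined total:", total)
```

Because no worker reads another worker's slice, there is no shared buffer or lock to coordinate, which is the point of the "locks and buffers not shared" bullet.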

Partitioning: The Key to Parallelism

Strategy: Spread data evenly across as many nodes (and disks) as possible

Greenplum Database High Speed Loader


Table: Order

Order #   Order Date    Customer ID
43        Oct 20 2005   12
64        Oct 20 2005   111
45        Oct 20 2005   42
46        Oct 20 2005   64
77        Oct 20 2005   32
48        Oct 20 2005   12
50        Oct 20 2005   34
56        Oct 20 2005   213
63        Oct 20 2005   15
44        Oct 20 2005   102
53        Oct 20 2005   82
55        Oct 20 2005   55
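The strategy of spreading rows evenly across nodes can be sketched as hash distribution on a key. This is a minimal illustration under stated assumptions: the order numbers are those read from the slide's table, `segment_for` and `distribute` are hypothetical helper names, and CRC32 is only a stand-in, not the hash function Greenplum actually uses.

```python
import zlib
from collections import defaultdict

# Order numbers as read from the slide's table.
ORDER_IDS = [43, 64, 45, 46, 77, 48, 50, 56, 63, 44, 53, 55]


def segment_for(key, num_segments):
    """Map a distribution key to a segment with a deterministic hash.

    CRC32 is an assumption made for this sketch, not Greenplum's hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(keys, num_segments):
    """Return {segment index: [keys stored on that segment]}."""
    placement = defaultdict(list)
    for key in keys:
        placement[segment_for(key, num_segments)].append(key)
    return dict(placement)


if __name__ == "__main__":
    for segment, keys in sorted(distribute(ORDER_IDS, 4).items()):
        print(f"segment {segment}: {keys}")
```

A key with many distinct values tends to spread rows more evenly; a key with few distinct values would pile rows onto a few segments.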

Greenplum Database: Powerful Data Loading Capabilities

• Industry-leading performance:
– >10TB per hour per rack

• Innovative, parallel-everything architecture:
– Scatter-Gather Streaming™ provides true linear scaling
– Support for both large-batch and continuous real-time loading strategies
– Enable complex data transformations “in-flight”
– Transparent interfaces to loading via support for files, applications and services
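The idea of loading through several streams at once, with transformations applied "in-flight", can be sketched in a few lines of Python. This is only an illustration of the general pattern, not the Scatter-Gather Streaming implementation: the `order_id,customer_id` input format and the names `transform`, `load_chunk` and `parallel_load` are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(line):
    """In-flight transformation: parse 'order_id,customer_id' into ints."""
    order_id, customer_id = line.strip().split(",")
    return int(order_id), int(customer_id)


def load_chunk(chunk):
    """One loader stream: transform rows while they are being loaded."""
    return [transform(line) for line in chunk]


def parallel_load(lines, num_streams):
    """Scatter the input round-robin over streams, load concurrently, gather."""
    chunks = [lines[i::num_streams] for i in range(num_streams)]
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        return list(pool.map(load_chunk, chunks))


if __name__ == "__main__":
    raw = ["43,12\n", "64,111\n", "45,42\n", "46,64\n", "77,32\n"]
    for stream, rows in enumerate(parallel_load(raw, 2)):
        print(f"stream {stream}: {rows}")
```

Adding streams adds loading capacity without funnelling every row through a single process, which is the property the slide attributes to parallel loading.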


Traditional Loading vs Greenplum DB Parallel Loading

[Diagram: two panels, conventional loading and Greenplum parallel loading, each showing ETL servers, the interconnect and the segment nodes.]


Advanced pipeline process for fast operation

[Diagram 1: the client sends a sort request to the master server, which passes it to the segment servers. Values shown on the segments: 9 6 10 | 2 11 5 | 4 3 12 | 1 7 8]

[Diagram 2: master server, segment servers and client. Values shown on the segments: 1 3 5 | 2 6 8 | 4 7 10 | 9 11 12]
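One common way to run a sort across a master and several segments is for each segment to sort its own rows and for the master to merge the sorted runs. The sketch below shows that pattern in Python; the slides do not describe Greenplum's actual algorithm, so this is an assumption. The values are those read from the first diagram (their grouping into four segments is also assumed), and `segment_sort` and `master_merge` are hypothetical names.

```python
import heapq

# Values as read from the slide; the grouping into four segments is assumed.
SEGMENT_VALUES = [
    [9, 6, 10],
    [2, 11, 5],
    [4, 3, 12],
    [1, 7, 8],
]


def segment_sort(values):
    """Local step: each segment sorts only the rows it stores."""
    return sorted(values)


def master_merge(sorted_runs):
    """Master step: merge the pre-sorted runs into one ordered stream."""
    return list(heapq.merge(*sorted_runs))


if __name__ == "__main__":
    runs = [segment_sort(values) for values in SEGMENT_VALUES]
    print(master_merge(runs))
```

`heapq.merge` consumes its inputs lazily, so the master can start returning rows before any segment has delivered its whole run.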

Greenplum Database: Extreme Performance

• Optimized for BI and analytics
– Rich ecosystem of partners

• Provides automatic parallelization
– Just load and query like any database
– Tables are automatically distributed across nodes
– No need for manual partitioning or tuning

• Extremely scalable MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes


Platform Independence Delivers Choice and Flexibility

Virtualized Infrastructure
• Pool resources
• Elastic scalability

Data Computing Appliance
• Optimized price/performance
• Minimum time-to-value
• Ideal for production environments

Software-Only
• On your x86 hardware
• Flexibility for any workload


Greenplum Polymorphic Data Storage

[Diagram: table ‘Customer’ split into monthly partitions from Jan ’09 to Nov ’09, with partitions stored using column-oriented archival compression, column-oriented fast compression or row-oriented fast compression.]

• Greenplum Database's engine provides a flexible storage model
– Four table types: heap, row-oriented, column-oriented, external
– Block compression: Gzip (levels 1-9), QuickLZ

• Storage types can be mixed within a database, and even within a table
– Fully configurable via table DDL and partitioning syntax
– You may also choose to index some partitions and not others

• Gives customers the choice of processing model for any table or partition
– Tables/partitions of different storage types can be joined together without restriction
– Highly tuned, e.g. columnar does efficient pre-projection and parallel execution
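To get a feel for how orientation and compression level interact, the following Python sketch serializes the same rows in a row-oriented and a column-oriented layout and compresses each with zlib (the DEFLATE algorithm behind gzip) at levels 1 and 9. The data and the text serialization are invented for illustration and are not Greenplum's on-disk format; `row_layout`, `column_layout` and `compressed_sizes` are hypothetical names.

```python
import zlib

# Hypothetical rows: (order_id, order_date, customer_id).
ROWS = [(i, "2005-10-20", i % 7) for i in range(1000)]


def row_layout(rows):
    """Row-oriented: values of one row are stored next to each other."""
    return "\n".join(",".join(str(v) for v in row) for row in rows).encode()


def column_layout(rows):
    """Column-oriented: all values of one column are stored together."""
    columns = zip(*rows)
    return "\n".join(",".join(str(v) for v in col) for col in columns).encode()


def compressed_sizes(payload, levels=(1, 9)):
    """Compressed size in bytes for each requested zlib level."""
    return {level: len(zlib.compress(payload, level)) for level in levels}


if __name__ == "__main__":
    for name, layout in (("row", row_layout), ("column", column_layout)):
        payload = layout(ROWS)
        print(name, len(payload), compressed_sizes(payload))
```

Running it on your own sample data shows how much each layout and level saves; the result depends heavily on how repetitive each column is.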

Unified Data Access Across The Enterprise

• Workload Management
– Connection management controls how many users can be connected and assigns them to a queue
– User-based resource queues allow for control of the total number or cost of queries allowed at any point in time

• Dynamic Query Prioritization
– Patent-pending technique of dynamically balancing resources across running queries
– Allows DBAs to control query priorities in real time, or determine default priorities by resource queue
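The resource-queue idea, capping the number or total cost of queries running at once, can be sketched as a toy admission controller. This is an illustration only, not Greenplum's implementation: the `ResourceQueue` class, its fields and the cost figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceQueue:
    """Toy admission control: cap the number and total cost of running queries."""

    max_active: int
    max_cost: float
    running: dict = field(default_factory=dict)  # query id -> estimated cost
    waiting: list = field(default_factory=list)  # (query id, cost), arrival order

    def _fits(self, cost):
        total = sum(self.running.values())
        return len(self.running) < self.max_active and total + cost <= self.max_cost

    def submit(self, query_id, cost):
        """Run the query now if the limits allow, otherwise make it wait."""
        if not self.waiting and self._fits(cost):
            self.running[query_id] = cost
            return "running"
        self.waiting.append((query_id, cost))
        return "waiting"

    def finish(self, query_id):
        """Release a finished query and admit waiting queries in order."""
        self.running.pop(query_id)
        while self.waiting and self._fits(self.waiting[0][1]):
            waiting_id, cost = self.waiting.pop(0)
            self.running[waiting_id] = cost


if __name__ == "__main__":
    queue = ResourceQueue(max_active=2, max_cost=100.0)
    print(queue.submit("q1", 60.0))
    print(queue.submit("q2", 30.0))
    print(queue.submit("q3", 20.0))
    queue.finish("q1")
    print(sorted(queue.running))
```

Queries are admitted strictly in arrival order here; a priority-aware variant would reorder the waiting list instead.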


Greenplum Performance Monitor

Highly interactive web-based performance monitoring

Real-time and historic views of:

• Resource utilization

• Queries and query internals


Key Technical Requirements for HPA


Technical Values
• Performance: massively parallel architecture
• Load speeds: 10TB/hr
• Integration with SAS
• In-database analytics using Java, PL/R, etc.
• Integration with many more BI and analytical tools
• Integration with Hadoop for unstructured data analysis

Financial Value
• Lower total cost of ownership
• Best price/performance ratio in the industry for EDW/analytical appliance

Operational Values
• No index maintenance
• Backup recovery solution
• Most robust disaster recovery solution in the industry
• Best technical and customer support organization backing

Greenplum Customers -- Government

• Pacific Northwest National Labs (Dept. of Energy) does cyberanalytics.

• USAspending.gov traces the outlays of the US Federal Government.

• The Federal Reserve Bank of Kansas City does economic analysis mostly related to the housing market.

• Recently, the Internal Revenue Service purchased a DCA to do work related to fraudulent tax returns.

• ATO uses GP as an investigatory tool in their Compliance and Audit Logging Unit.


High Performance Analytics

‘The power to know fast’


Thank you

Questions?
