Post on 14-Apr-2017
Data Computing Division
© Copyright 2011 EMC Corporation. All rights reserved.
Presented by
Brijesh Kumar Awasthi
IMP2014002
What is Greenplum?
• Greenplum, the company, was founded in September 2003 by Scott Yara and Luke Lonergan.
• It was a merger of two smaller companies: Metapa in Los Angeles and Didera in Fairfax, Virginia.
• Greenplum, based in San Mateo, California, released its database management system software in April 2005, calling it Bizgres.
EMC Acquires Greenplum
Greenplum Becomes the Foundation of EMC's Data Computing Division
“Greenplum, with expertise in the massively parallel arena, will give the storage giant a boost in big-data computing.”
– InformationWeek –
“For three years, Gartner has identified Greenplum as the most advanced vendor in the visionary quadrant of its data warehouse DBMS Magic Quadrant….”
– Gartner –
What the COO of EMC said about Greenplum and BI
New Realities… New Demands!
• Do it faster
– Ingest more data
– Ingest it faster
– Keep it unsummarised, keep it for longer
• Be more responsive
– Unpredictable queries, rapidly evolving bespoke analytics
– New tools: Hadoop, MapReduce, Hive, HBase, “R”
• Manage new data types
– Manage and allow queries across structured, semi-structured and unstructured data
• Do it at a lower cost
Big Data will revolutionize Data Warehousing and analysis.
Why Greenplum?
• Fast Data Loading
• Extreme Performance & Elastic Scalability
• Unified Data Access
• EMC Greenplum is a shared nothing, massively parallel processing (MPP) data warehouse system
• Core principle of data computing is to move the processing dramatically closer to the data and to the people
Architecture diagram (labels and callouts):
• Master Server: query planning & dispatch
• Network Interconnect
• Segment Servers: query processing & data storage
• Data Sources: loading, streaming, etc.
• External Files, URLs, Hadoop (HDFS), Web Services (including from other DBs), O/S Pipes (including from other DBs)
• Hadoop MapReduce
• Standard Business Intelligence and analytical tools: SQL, BI tools, analytical tools
• Clients see a single database: primary server, plus hot failover
• Queries distributed across all available resources
• Shared Nothing, Massively Parallel Processing means no bottlenecks and linear scalability
• Data loading also takes advantage of MPP architecture
• Greenplum handles structured, semi-structured and unstructured data
Why is MPP different?
…Greenplum is a Scale-Out Architecture on standard commodity hardware
MPP
• Queries shipped to each node simultaneously
• Execute in parallel on each segment instance
• Multiple pipelines of data
• Highly scalable topology
• Locks and buffers not shared
Traditional
• Single database buffer used by all user operations
• More locks mean a more complex lock management system
• Single pipe to data
• Limited scalability
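The MPP column above can be illustrated with a minimal scatter-gather sketch in Python. It is not Greenplum code: the segment data, the function names and the use of threads as stand-ins for segment instances are all assumptions made for illustration. Each "segment" aggregates only its own rows, and the "master" combines the small partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list is the slice of rows owned by one segment.
SEGMENTS = [
    [12, 111, 42, 64],
    [32, 12, 34, 213],
    [15, 102, 82, 55],
]

def segment_partial(rows):
    """Runs on one segment: aggregate only the local rows."""
    return sum(rows), len(rows)

def master_average(segments):
    """Ship the same work to every segment at once, then combine the partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_partial, segments))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

average = master_average(SEGMENTS)
print(average)
```

Only two numbers per segment travel back to the master, which is the point of pushing the processing to where the data lives.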
Partitioning: The Key to Parallelism
Strategy: Spread data evenly across as many nodes (and disks) as possible
Greenplum Database High Speed Loader
Table: Order
Order #   Order Date    Customer ID
43        Oct 20 2005   12
64        Oct 20 2005   111
45        Oct 20 2005   42
46        Oct 20 2005   64
77        Oct 20 2005   32
48        Oct 20 2005   12
50        Oct 20 2005   34
56        Oct 20 2005   213
63        Oct 20 2005   15
44        Oct 20 2005   102
53        Oct 20 2005   82
55        Oct 20 2005   55
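The "spread data evenly" strategy can be sketched as hashing a distribution key to pick a segment. This is a rough illustration only: the segment count, the choice of the order number as the key, and the use of CRC32 as the hash are assumptions, not Greenplum's actual hashing scheme.

```python
import zlib
from collections import Counter

NUM_SEGMENTS = 4  # hypothetical cluster size

def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to a segment index using a stable hash."""
    digest = zlib.crc32(str(key).encode("utf-8"))
    return digest % num_segments

# Distribute 10,000 hypothetical order numbers and count rows per segment.
order_numbers = range(1, 10001)
rows_per_segment = Counter(segment_for(n) for n in order_numbers)

for segment, rows in sorted(rows_per_segment.items()):
    print(f"segment {segment}: {rows} rows")
```

A key with many distinct values tends to give each segment a similar share of rows; a key with few distinct values would leave some segments with most of the work.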
Greenplum Database
Powerful Data Loading Capabilities
• Industry leading performance:
– >10TB per hour per rack
• Innovative, parallel-everything architecture:
– Scatter-Gather Streaming™ provides true linear scaling
– Support for both large-batch and continuous real-time loading strategies
– Enable complex data transformations “in-flight”
– Transparent interfaces to loading via support files, application and services
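To put the quoted load rate in more familiar units, here is a small worked calculation in Python. It treats ">10TB per hour per rack" as exactly 10 decimal terabytes and assumes the linear scaling the slide claims; the 50 TB data set and the rack counts are made-up inputs.

```python
TB = 10**12  # decimal terabyte; an assumption, the slide does not say TB vs TiB
RATE_PER_RACK_BYTES_PER_HOUR = 10 * TB  # ">10TB per hour per rack" taken as 10

def load_hours(data_bytes, racks):
    """Hours to load data_bytes if throughput scales linearly with racks."""
    return data_bytes / (RATE_PER_RACK_BYTES_PER_HOUR * racks)

# Per-rack rate expressed in gigabytes per second.
per_rack_gb_per_second = RATE_PER_RACK_BYTES_PER_HOUR / 3600 / 10**9

# Hypothetical 50 TB load on one rack versus four racks.
hours_one_rack = load_hours(50 * TB, racks=1)
hours_four_racks = load_hours(50 * TB, racks=4)

print(round(per_rack_gb_per_second, 2), hours_one_rack, hours_four_racks)
```

Under these assumptions one rack sustains a little under 3 GB per second, and quadrupling the racks cuts the load time to a quarter.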
Traditional Loading vs Greenplum DB Parallel Loading
[Diagram: Conventional Loading compared with Greenplum parallel loading; labels shown are ETL Servers, Interconnect and Segment nodes]
Advanced pipeline process for fast operation
Diagram labels: Client, Sort Request, Master Server, Segment Servers
Values on the Segment Servers before the sort:
9 6 10
2 11 5
4 3 12
1 7 8
Values on the Segment Servers after the sort step (each column is in ascending order):
1 3 5
2 6 8
4 7 10
9 11 12
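The sort example can be sketched in Python as a local sort on each segment followed by a merge on the master. Reading the slide's numbers as three segment servers holding four values each is an assumption (it is the grouping under which the second set of numbers is the sorted form of the first), and the function names are hypothetical.

```python
import heapq

# Assumed layout: one list per segment server, values as shown before the sort.
segments = [
    [9, 2, 4, 1],
    [6, 11, 3, 7],
    [10, 5, 12, 8],
]

def sort_on_segment(values):
    """Each segment sorts only the rows it stores."""
    return sorted(values)

def merge_on_master(sorted_runs):
    """The master merges already-sorted runs instead of re-sorting everything."""
    return list(heapq.merge(*sorted_runs))

sorted_runs = [sort_on_segment(values) for values in segments]
result = merge_on_master(sorted_runs)

print(sorted_runs)
print(result)
```

The expensive part of the sort happens in parallel on the segments; the master only performs a streaming merge of the sorted runs.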
Greenplum Database Extreme Performance
• Optimized for BI and Analytics
– Rich eco-system of partners
• Provides automatic parallelization
– Just load and query like any database
– Tables are automatically distributed across nodes
– No need for manual partitioning or tuning
• Extremely scalable MPP shared-nothing Architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
Platform Independence Delivers Choice and Flexibility
Virtualized Infrastructure
• Pool resources
• Elastic scalability
Data Computing Appliance
• Optimized Price/Performance
• Minimum time-to-value
• Ideal for Production Environments
Software-Only
• On your x86 hardware
• Flexibility for any workload
[Diagram: Table ‘Customer’ with monthly segments Jan ’09, Feb ’09, Mar ’09, Apr ’09, May ’09, Jun ’09, Jul ’09, Aug ’09, Sept ’09, Oct ’09, Nov ’09; storage labels shown are Column-Oriented Archival Compression, Column-Oriented Fast Compression and Row-Oriented Fast Compression]
Greenplum Polymorphic Data Storage
• Greenplum Database's engine provides a flexible storage model
– Four table types: heap, row-oriented, column-oriented, external
– Block compression: Gzip (levels 1-9), QuickLZ
• Storage types can be mixed within a database, and even within a table
– Fully configurable via table DDL and partitioning syntax
– You may also choose to index some partitions and not others
• Gives customers the choice of processing model for any table or partition
– Tables/partitions of different storage types can be joined together without restriction
– Highly tuned, e.g. columnar does efficient pre-projection and parallel execution
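As a rough sketch of mixing storage types within one table, the Python below picks storage options for a monthly partition based on its age and renders them as a WITH clause. The age thresholds and helper names are hypothetical. The option names follow the form Greenplum's append-only table DDL is understood to use, and mapping the slide's "Gzip" to compresstype=zlib is an assumption; both should be checked against the documentation for the version in use.

```python
from datetime import date

def months_between(older, newer):
    """Whole months from older to newer (both first-of-month dates)."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)

def storage_clause(partition_month, today):
    """Pick storage options by partition age; thresholds are hypothetical."""
    age = months_between(partition_month, today)
    if age >= 6:
        # Old data: column-oriented with heavy (archival) compression.
        opts = {"appendonly": "true", "orientation": "column",
                "compresstype": "zlib", "compresslevel": "9"}
    elif age >= 3:
        # Mid-age data: column-oriented with fast compression.
        opts = {"appendonly": "true", "orientation": "column",
                "compresstype": "quicklz"}
    else:
        # Recent data: row-oriented with fast compression.
        opts = {"appendonly": "true", "orientation": "row",
                "compresstype": "quicklz"}
    return "WITH (" + ", ".join(f"{k}={v}" for k, v in opts.items()) + ")"

today = date(2009, 11, 1)
oldest_clause = storage_clause(date(2009, 1, 1), today)
newest_clause = storage_clause(date(2009, 10, 1), today)
print(oldest_clause)
print(newest_clause)
```

The generated clause would be attached to an individual partition definition, so that older partitions trade write speed for compression while recent ones stay cheap to update.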
Unified Data Access Across The Enterprise
• Workload Management
– Connection management controls how many users can be connected and assigns them to a queue
– User-based resource queues allow for control of the total number or cost of queries allowed at any point in time
• Dynamic Query Prioritization
– Patent-pending technique of dynamically balancing resources across running queries
– Allows DBAs to control query priorities in real time, or determine default priorities by resource queue
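The idea of a resource queue that caps the number of queries running at once can be modelled in a few lines of Python. This is a toy model, not Greenplum's implementation: the class, its method names and the limit of two active queries are assumptions, and cost-based limits and priorities are left out.

```python
from collections import deque

class ResourceQueue:
    """Toy resource queue: at most active_limit queries run at the same time."""

    def __init__(self, active_limit):
        self.active_limit = active_limit
        self.running = set()
        self.waiting = deque()

    def submit(self, query_id):
        """Admit the query if a slot is free, otherwise make it wait."""
        if len(self.running) < self.active_limit:
            self.running.add(query_id)
            return "running"
        self.waiting.append(query_id)
        return "waiting"

    def finish(self, query_id):
        """Release a slot and admit the longest-waiting query, if any."""
        self.running.remove(query_id)
        if self.waiting:
            self.running.add(self.waiting.popleft())

queue = ResourceQueue(active_limit=2)
states = [queue.submit(q) for q in ("q1", "q2", "q3")]
queue.finish("q1")
print(states, sorted(queue.running))
```

The third query waits until a running one finishes, which is how a queue limit keeps ad hoc users from starving the rest of the workload.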
Highly interactive web-based performance monitoring
Real-time and historic views of:
• Resource utilization
• Queries and query internals
Greenplum Performance Monitor
Key Technical Requirements for HPA
Technical Values
• Performance: massively parallel architecture
• Load speeds: 10TB/hr
• Integration with SAS
• In-database analytics using Java, PL/R, etc.
• Integration with many more BI and analytical tools
• Integration with Hadoop for unstructured data analysis
Financial Value
• Lower total cost of ownership
• Best price/performance ratio in the industry for EDW/analytical appliance
Operational Values
• No index maintenance
• Backup recovery solution
• Most robust disaster recovery solution in the industry
• Best technical and customer support organization backing
Greenplum Customers -- Government
• Pacific Northwest National Labs (Dept. of Energy) does cyberanalytics.
• USAspending.gov traces the outlays of the US Federal Government.
• The Federal Reserve Bank of Kansas City does economic analysis mostly related to the housing market.
• Recently, the Internal Revenue Service purchased a DCA to do work related to fraudulent tax returns.
• ATO uses GP as an investigatory tool in their Compliance and Audit Logging Unit.
High Performance Analytics
‘The power to know fast’
Thank you
Questions?