SOS10 Panel Session: Challenges in Data-Intensive Computing Bill Blake Maui, 07 March 2006
Transcript
Page 1: Challenges in Data-Intensive Computing

SOS10 Panel Session: Challenges in Data-Intensive Computing

Bill Blake
Maui, 07 March 2006

Page 2

Peta-Scale Data-Intensive Computing is a Reality in Commercial IT Shops Today

• This is driven by the need to understand customers and manage the business “at the sub-transaction level”

• Examples:
> Wireless Telephone Companies
> Web Stores
> Credit/Financial Analysis
> Retailers
> ISPs

Page 3

The “Within Application” Approach

• Business Intelligence demands drive the industry towards specialized database/storage blades
> Focus of virtualization centered on the analytic DB

• Systems will be highly specific
> Tuned to meet the performance needs of applications versus general purpose “virtualization”

Page 4

The Database Status Quo

[Figure: SQL goes in and DATA comes back through a stack of separately managed layers, with repeated CACHE and I/O labels between them. Labels recovered from the figure:]

• Column labels: Database, Server, Storage
• Product label: Red Brick
• Function groups:
> DSS Query, FastLoad, Partitioning, Pre Mat Views
> Optimizer, Logging, Lock Mgr
> KAIO, Raw/Cook, Lock Mgr
> Raid, Vol Mgr, Lock Mgr
• Clients: Source Systems, Client, High Performance Loader, 3rd Party Apps, DBA CLI, ETL Server
• Platforms: SOLARIS, LINUX, HP-UX, AIX, WINDOWS, TRU64

Page 5

Data Flow – The Traditional Way

[Figure: Client Applications and Local Applications run against SMP HOST 1, SMP HOST 2, ... SMP HOST N (Processing), which reach Storage over a SAN. Other labels: "Hours or Days"; "$$$" (twice); sample data cells C6, F7, C12, F13, C38, F39, G13, G22.]

Page 6

What is Needed Now?

Well, if moving all the data to the processors doesn’t do the job…

Then why not move the processors to where the large data resides?

It is hard to do this in a data-intensive way when not working within the application

Page 7

Streaming Data Flow

[Figure: Netezza Performance Server. Client BI Applications and Local Applications connect through ODBC 3.X, JDBC Type 4, SQL/92 and a Fast Loader/Unloader to an SMP Host, which fronts an array of SPUs, each paired with an FPGA. A table with columns A to H and rows 1 to 42 is spread across the SPUs (for example cells C12, C13, F12, F13, G12, G13 on one SPU). Cells shown at the SMP Host: C6, F7, C12, F13, C38, F39, G13, G22.]

• Streaming data, joins, & aggs @ 50 MB/sec
• Up to 120 TB/hour cross-section bandwidth per 8650 system
• Bulk data movement: 250 GB/hour, uncompressed (1 TB/hour target)
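To put the quoted rates side by side, here is a small back-of-the-envelope sketch. It uses only the figures on the slide; decimal units, the 10 TB example size, and reading 50 MB/sec as a per-stream rate are assumptions made for illustration, not statements from the talk.

```python
# Figures quoted on the slide. Decimal units (1 TB = 10**12 bytes) are an assumption.
MB = 10**6
GB = 10**9
TB = 10**12

BULK_RATE_BYTES_PER_HOUR = 250 * GB       # bulk data movement, uncompressed
BULK_TARGET_BYTES_PER_HOUR = 1 * TB       # stated target for bulk movement
CROSS_SECTION_BYTES_PER_HOUR = 120 * TB   # "up to", per 8650 system
STREAM_RATE_BYTES_PER_SEC = 50 * MB       # streaming rate quoted on the slide


def hours_to_move(num_bytes, bytes_per_hour):
    """Hours needed to push num_bytes through a link of the given hourly rate."""
    return num_bytes / bytes_per_hour


def streams_needed(aggregate_bytes_per_hour, per_stream_bytes_per_sec):
    """How many streams of the per-second rate add up to the aggregate hourly rate."""
    per_stream_bytes_per_hour = per_stream_bytes_per_sec * 3600
    return aggregate_bytes_per_hour / per_stream_bytes_per_hour


if __name__ == "__main__":
    dataset = 10 * TB  # hypothetical dataset size, chosen only for the example
    print("bulk move, hours:", hours_to_move(dataset, BULK_RATE_BYTES_PER_HOUR))
    print("bulk move at target, hours:", hours_to_move(dataset, BULK_TARGET_BYTES_PER_HOUR))
    print("in-place scan, minutes:", 60 * hours_to_move(dataset, CROSS_SECTION_BYTES_PER_HOUR))
    print("50 MB/sec streams in 120 TB/hour:",
          streams_needed(CROSS_SECTION_BYTES_PER_HOUR, STREAM_RATE_BYTES_PER_SEC))
```

Under these assumptions, shipping the data out is measured in hours while scanning it where it lives is measured in minutes, which is the contrast the slide is drawing.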

Page 8

Netezza Found Clues In Late ’90s Computer Science Research

• Active Disk architectures
> Integrated processing power and memory into disk units
> Scaled processing power as the dataset grew

• Decision support algorithms offloaded to Active Disks to support key decision support tasks
> Active Disk architectures use a stream-based model that is ideal for the software architecture of relational databases

In Netezza’s NPS® System: “Snippet Processing Units” take streams as inputs and generate streams as outputs
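The stream-in, stream-out idea can be sketched with Python generators. This is an illustration of the general model only, not Netezza's implementation: the operator names (scan, restrict, project), the sample rows, and the ISO date strings are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def scan(rows: Iterable[Record]) -> Iterator[Record]:
    """Emit stored records one at a time, as a disk scan would."""
    for row in rows:
        yield row


def restrict(stream: Iterable[Record], keep: Callable[[Record], bool]) -> Iterator[Record]:
    """Pass through only the records that satisfy the predicate."""
    for row in stream:
        if keep(row):
            yield row


def project(stream: Iterable[Record], columns: List[str]) -> Iterator[Record]:
    """Keep only the named columns of each record."""
    for row in stream:
        yield {name: row[name] for name in columns}


# Hypothetical sample data; ISO dates are used so string comparison orders correctly.
LINEITEMS = [
    {"l_orderkey": 1, "l_shipdate": "1995-01-15", "l_comment": "a"},
    {"l_orderkey": 2, "l_shipdate": "1995-02-03", "l_comment": "b"},
    {"l_orderkey": 3, "l_shipdate": "1995-01-31", "l_comment": "c"},
]


def in_january_1995(row: Record) -> bool:
    return "1995-01-01" <= row["l_shipdate"] <= "1995-01-31"


if __name__ == "__main__":
    # Each stage consumes a stream and produces a stream; no stage holds the whole table.
    pipeline = project(restrict(scan(LINEITEMS), in_january_1995), ["l_orderkey"])
    for record in pipeline:
        print(record)
```

Because every stage is lazy, a record is filtered and trimmed as soon as it is read, so unwanted rows and columns never travel further down the pipeline.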

Page 9

A Closer Look Inside

[Figure: Netezza Performance Server, Snippet Processing Unit (SPU), connected over Gigabit Ethernet. Components shown:]

• PowerPC Query Engine: Joining, Sorting, Grouping
• Snippet Queue
• Replication Manager
• Main Memory
• Streaming Record Processor: Project, Restrict
• Transaction/Lock Manager
• Disk Control
• Dual Disk: Primary, SPU Swap, Mirror

Page 10

Active Disks as Intelligent Storage Nodes

[Figure: Netezza Performance Server, Snippet Processing Unit (SPU)]

Netezza added:
• Highly optimized query planning
• Code generation
• Stream processing

Result: 10X to 100X performance speedup over existing systems

Page 11

Asymmetric Massively Parallel Processing® Architecture

[Figure: Netezza Performance Server® System. Components shown:]

• Clients: Source Systems, Client, High Performance Loader, 3rd Party Apps, DBA CLI, ETL Server
• Platforms: SOLARIS, LINUX, HP-UX, AIX, WINDOWS, TRU64
• Interfaces: ODBC 3.X, JDBC Type 4, SQL/92, High-speed Loader/Unloader
• SMP Host: Front End, DBOS, SQL Compiler, Query Plan, Optimize, Admin, Execution Engine
• Gigabit Ethernet between the SMP Host and the SPUs
• Massively Parallel Intelligent Storage: Snippet Processing Units (SPUs) 1, 2, 3, ... 1000+, each with a processor & streaming DB logic
• High-Performance Database Engine: streaming joins, aggregations, sorts, etc.
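The asymmetric split between one host and many storage-side workers can be sketched as a two-level aggregation. This is a minimal illustration, not the NPS design: round-robin distribution, the function names, and the sample rows are assumptions, and amounts are kept in integer cents so that partial sums merge exactly.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (customer name, order total in cents); hypothetical schema


def distribute(rows: Iterable[Row], num_spus: int) -> List[List[Row]]:
    """Spread rows across workers round-robin (the distribution scheme is an assumption)."""
    slices: List[List[Row]] = [[] for _ in range(num_spus)]
    for index, row in enumerate(rows):
        slices[index % num_spus].append(row)
    return slices


def spu_partial_sums(slice_rows: Iterable[Row]) -> Dict[str, int]:
    """Work done next to the data: group and sum only the local slice."""
    totals: Dict[str, int] = defaultdict(int)
    for name, cents in slice_rows:
        totals[name] += cents
    return dict(totals)


def host_merge(partials: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Work done on the host: combine the small partial results."""
    merged: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for name, cents in partial.items():
            merged[name] += cents
    return dict(merged)


def run_query(rows: Iterable[Row], num_spus: int) -> Dict[str, int]:
    return host_merge(spu_partial_sums(s) for s in distribute(rows, num_spus))


# Hypothetical sample data.
ORDERS: List[Row] = [
    ("Customer#1", 100_00),
    ("Customer#2", 250_50),
    ("Customer#1", 40_25),
    ("Customer#3", 10_00),
    ("Customer#2", 5_50),
]

if __name__ == "__main__":
    print(run_query(ORDERS, num_spus=4))
```

Only the per-group partial totals cross the network to the host, so the traffic scales with the number of groups rather than with the number of rows.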

Page 12

Binary Compiled Queries Executed on Massively Parallel Grid

    select c_name, sum(o_totalprice) price
    from customer, orders
    where o_orderkey in (select l_orderkey from lineitem2
                         where o_orderkey=l_orderkey
                           and l_shipdate>='01-01-1995'
                           and l_shipdate<='01-31-1995')
      and c_custkey=o_custkey
    group by c_name;" test_tim >test.out

    /********* Code **********/
    void GenPlan1(CPlan *plan, char *bufStarts, char *bufEnds, bool lastCall) {
        //
        // Setup for next loop (nodes 00..07)
        //
        // node 00 (TScanNode)
        TScanNode *node0 = (TScanNode*)plan->m_nodeArray[0];
        // For ScanNode:
        TScan0 *Scan0 = BADPTR(TScan0*);
        CTable *tScan0 = plan->m_nodeArray[0]->m_result;
        char *nullsScan0P = BADPTR(char *);
        // node 01 (TRestrictNode)
        TRestrictNode *node1 = (TRestrictNode*)plan->m_nodeArray[1];
        // node 02 (TProjectNode)
        TProjectNode *node2 = (TProjectNode*)plan->m_nodeArray[2];
        // node 03 (TSaveTempNode)
        TSaveTempNode *node3 = (TSaveTempNode*)plan->m_nodeArray[3];
        // For SaveTemp Node:
        TSaveTemp3 *SaveTemp3 = BADPTR(TSaveTemp3*);
        CTable *tSaveTemp3 = node3->m_result;
        CRecordStore *recStore3 = tSaveTemp3->m_recStore;
        // node 04 (THashNode)

    …101101010101010101011111010101010010010101011101010100101111010101001010111101101001010101011101010110010101010101111101001001010101010101010101010010101001111110101010101010101001010101010010100101101001111111101010101010011010010101010100101010101010010101010101010010101010100111010101010101010101010…

    c_name              | price
    --------------------+-----------
    Customer#000000796  | 318356.97
    Customer#000001052  | 293680.56
    Customer#000001949  | 215280.98
    Customer#000002093  | 282531.93
    Customer#000005656  | 335297.31
    Customer#000005861  | 233691.03
    Customer#000006002  | 267000.92
    Customer#000006343  | 595819.82
    Customer#000006532  | 442254.91
    ….

    real 0m0.552s
    user 0m0.010s
    sys  0m0.000s
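The query-to-code step can be illustrated in miniature: instead of interpreting a plan row by row, emit source code specialized to one query and compile it once. The sketch below generates Python rather than the C++ shown on the slide, and everything in it (function names, the fixed l_shipdate filter, ISO date strings, sample rows) is hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def generate_snippet_source(columns: List[str], first_day: str, last_day: str) -> str:
    """Emit source for a scan loop with the date range and column list baked in."""
    projected = ", ".join(f"{name!r}: row[{name!r}]" for name in columns)
    return (
        "def snippet(rows):\n"
        "    for row in rows:\n"
        f"        if {first_day!r} <= row['l_shipdate'] <= {last_day!r}:\n"
        f"            yield {{{projected}}}\n"
    )


def compile_snippet(source: str) -> Callable[[Iterable[Record]], Iterator[Record]]:
    """Compile the generated source once and return the resulting function."""
    namespace: Dict[str, Any] = {}
    exec(compile(source, "<generated snippet>", "exec"), namespace)
    return namespace["snippet"]


# Hypothetical sample data; ISO dates so that string comparison orders correctly.
ROWS = [
    {"l_orderkey": 1, "l_shipdate": "1995-01-15"},
    {"l_orderkey": 2, "l_shipdate": "1995-03-01"},
]

if __name__ == "__main__":
    source = generate_snippet_source(["l_orderkey"], "1995-01-01", "1995-01-31")
    print(source)                      # the specialized code for this one query
    snippet = compile_snippet(source)
    print(list(snippet(ROWS)))
```

The generated function contains no generic plan-walking logic, only the comparisons and column picks this query needs, which is the motivation for compiling queries rather than interpreting them.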

Page 13

It’s All About Scaling, Streaming and Asymmetry

• Sandia: TeraFLOP to PetaFLOP
> Specialized Node Function
> Linux + lightweight kernels
> System Interconnection is “secret sauce” for high BW low latency MPP performance gains

• Netezza: TeraByte to PetaByte
> Specialized Node Function
> Linux + lightweight kernels
> Storage/processor/DB integration is “secret sauce” for streaming query processing MPP perf gains

[Figure: two node-partition diagrams. Labels in the first: Net I/O, System Support, Service, Sys Admin, Users, File I/O, Compute, /home. Labels in the second: Users and Sys Admin, Net I/O, File I/O, DB Queries Processed by Snippet Processing Arrays.]

Page 14

Thank You

