NENUG Apr14 Talk - data modeling for netezza

Data Modeling and Netezza

Biju Nair NENUG Talk

30-‐Apr-‐2014

Mo@va@on and Goal

•  Mo@va@on – Performance degrada@on with data volume

•  Goal – Highlight considera@ons while modeling for NZ

2

Data Modeling

•  Logical Data Modeling – Business domain data representa@on –  Independent of DBMS technology – NZ is an appliance for analy@cs

•  Set based processing vs row based processing •  De-‐normaliza@on •  Snow flake/Star schema

•  Physical Data Modeling – Takes in the DBMS features and constraints – Need to understand the DBMS architecture

3

NZ Architecture

Host

Snippet Processors

Snippet Processors

Snippet Processors

Data Data Data

-‐ Parse, Op+mize and Compile query -‐ Schedule snippets -‐ Distribute data

Executes snippets

-‐ Shared nothing MPP -‐ Custom IP backbone

-‐  Appliance efficiency is maximized when -‐  snippet processors can run independently i.e. data independence -‐  All snippet processors are u@lized uniformly 4

Snippet Processors

Data

Accelerator (FPGA) CPU

Compute

Host

Reads compressed

data

Un-‐compress Remove columns

Restrict rows (where)

Perform computa+on

Send data to host

-‐  Disk reads are incredibly slow rela@ve to other components especially seek @me -‐  While the CPU overhead is reduced, volume of data read will impact performance

5

Data Storage

Data

Snippet Processors

Host

Extend

Page

Extend

Extend

…

Meta-‐Data

-‐ Meta data iden@fies extends/pages to read or skip 6

Modeling Priori@es

•  U@liza@on of all snippet processors – Need to be able to u@lize uniformly

•  Maximize MPP capability of snippet processors –  Ideally snippet processors should be independent

•  Minimize data read from disk – Minimize data stored

•  Improve computa@on in snippet processor –  Compounded with data volume will help performance

7

Snippet Processor U@liza@on Data Distribu+on

Host

Snippet Processors

Snippet Processors

Snippet Processors

1,MA,1212,… 3,MA,0414,…

2,CA,0113,…

1,MA,1212,… 2,CA,0113,… 3,MA,0414,…

Distribute by state

-‐  NZ will pick one of the columns to distribute if none specified in table defini@on -‐  First column in the table

Data Skew

8


Host

Snippet Processors

Snippet Processors

Snippet Processors

1,MA,1212,… 2,CA,0113,… 3,MA,0414,…

1,MA,1212,… 2,CA,0113,… 3,MA,0414,…

Distribute by mo-‐yr

-‐  Snippet processors are u@lized uniformly -‐  What if most of the query is on for the current month?

-‐  Processing skew 9


Host

Snippet Processors

Snippet Processors

Snippet Processors

1,MA,1212,… 4,CA,0414,…

2,CA,0113,… 3,MA,0414,…

1,MA,1212,… 2,CA,0113,… 3,MA,0414,… 4,CA,0414,…

Distribute random

-‐  Snippet processors are u@lized uniformly -‐  Helps prevent processing skew

10

3,ORD1,ITEM1,… 4,ORD1,ITEM1,… 2,ORD1,ITEM1,…

1,ORD1,ITEM1,…

Snippet Processor U@liza@on Data Distribu+on and Table Joins

Host

Snippet Processors

Snippet Processors

Snippet Processors

4,CA,0414,… 1,MA,1212,… 3,MA,0414,…

2,CA,0113,…

1,ORD1,ITEM1,… 2,ORD1,ITEM1,… 3,ORD1,ITEM1,… 4,ORD1,ITEM1,…

Distribute random

-‐  Snippet processors are u@lized uniformly -‐  Makes snippet processors dependent on others impac@ng MPP maximiza@on

Need to redistribute data from both tables

11



Host

Snippet Processors

Snippet Processors

Snippet Processors

4,CA,0414,… 1,MA,1212,… 3,MA,0414,…

2,CA,0113,…


Distribute on Join column -‐ cid

-‐  Snippet processors are u@lized uniformly -‐  Makes snippet processors dependent on others impac@ng MPP maximiza@on

-‐  Becer than the previous scenario

Need to redistribute data from one table

12



Host

Snippet Processors

Snippet Processors

Snippet Processors

1,MA,1212,… 4,CA,0414,…

2,CA,0113,… 3,MA,0414,…

Distribute both tables on Join column -‐ cid

-‐  Snippet processors are u@lized uniformly -‐  Makes snippet processors independent maximizing MPP

13


•  Iden@fy keys to distribute data uniformly – Avoid data and processing skew – Try using join columns as the distribu@on keys – Choose same data types for join columns

•  If table size is small random distribu@on is fine –  If one of the join table is small, NZ will broadcast

•  Redistribu@on may not be an overkill for small data – For e.g., selec@ng a small number if columns

14

Snippet Processor U@liza@on Distribu+on and Query Time

0

0.5

1

1.5

2

2.5

3

3.5

Random 1 Correct Distribu@on 2 Correct Distribu@on

Query Time For Different Distribu+ons

Time (min)

15

Snippet Processor U@liza@on Join Column Type and Query Time

0

0.5

1

1.5

2

2.5

Incorrect Data Types Correct Data Types

Join query +me -‐ same and diff data types

Time (min)

16

Minimize Data Read From Disk Zone Maps

•  Data types which supports Zone Maps – All integer data types

•  int1 •  int2 •  int4 •  int8

– Date – Timestamp

17

Refer to the product manual for the version of NZ used for the complete list of zone map able data types

Data

Snippet Processors

Host

Extend 1

Page

Extend 2

Extend 3

…

Meta-‐Data

-‐ May end up reading all data from disk


Table column (cid) is numeric(10,0)

No zone map for cid

18

Data

Snippet Processors

Host

Extend

Page

Extend

Extend

…

Meta-‐Data

-‐ Zone maps can be used to minimize data read from disk


Table column (cid) is bigint

19

Minimize Data Read From Disk Zone Maps and Query Time

1.3

1.35

1.4

1.45

1.5

1.55

1.6

Incorrect Data Type Correct Data Type

Query +me with and without zone map

Time (min)

20

Minimize Data Read From Disk Clustered Base Tables

•  NZ stores data with same organize keys closely •  Addi@onal data types are zone map able

–  char –  varchar –  nchar –  nvarchar –  float –  double –  bool –  @me –  Interval

•  Helps improve performance of mul@ table join

21

Extend

Extend

Extend

Minimize Data Read From Disk Clustered Base Tables

Table distributed on cid

1,MA,Boston,…

5,CA,LA,…

3,FL,Tampa,,…

1,MA,Salem,…

5,CA,SF,…

3,FL,Orlando,…

1,MA,Lowell,…

5,CA,Pasadena,… 3,FL,Miami,…

Extend

Extend

Extend

Table distributed on cid organize on state

1,MA,Boston,…

1,MA,Salem,…

1,MA,Lowell,…

3,FL,Tampa,,…

3,FL,Orlando,…

3,FL,Miami,…

5,CA,LA,…

5,CA,SF,…

5,CA,Pasadena,…

Extend

1,AL,Alabama,…

5,CA,California,… 10,FL,Florida,…

State

22,MA,Mass,…

22

Minimize Data Read From Disk Clustered Base Table

0

0.5

1

1.5

2

2.5

No Org+Correct Dist Org+Correct Dist

Query Time with and without org

Time (min)

23

Minimize Data Read From Disk Materialized Views

•  View with frequently used columns of base table – Unlike views, materialized view stores data – Reduced data read from disk – Addi@onal storage required – Need to be refreshed if base table data changes

•  Can be used as an index against base table – Stores loca@on of base table data loc in a column

24

Extend

Extend

Extend


T_EMP distributed on cid

1,MA,Mike,…

5,CA,Fally,…

3,FL,Chris,…

4,MA,Robert,…

7,CA,Mary,…

2,FL,Jus+n,…

6,MA,Harini,…

8,CA,Mike,…

9,FL,Martha,…

SELECT ID, NAME FROM T_EMP;

Extend

MV on T_EMP order by state, name

5,CA,Fally

7,CA,Mary

8,CA,Mike

3,FL,Chris

2,FL,Jus+n

9,FL,Martha

6,MA,Harini

1,MA,Mike

4,MA,Robert

SELECT * FROM T_EMP WHERE STATE = ‘CA’

25


0

0.2

0.4

0.6

0.8

1

1.2

1.4

No MV With MV

Query +me on single table -‐MV vs No MV

Time (min)

26


0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

No MV With MV

Join Query Time -‐ With and without MV

Time (min)

27

Minimize Data Read From Disk Minimize Data Stored

•  Choose storage efficient data types – Difference between bigint and int is 4 bytes

•  Use char instead of varchar if the data length is fixed – varchar has a 2 byte overhead

•  Define columns as “not null” where possible

•  Store only the required data in table columns

•  Encode duplicate data stored in rows 28

Improve Computa@on in Snippet Processor NZ Object Defini+ons

•  Define columns as “not null” where possible – Removes logic to check nulls

•  Define table keys and rela@onships – Helps NZ query op@mizer to generate efficient code

29

30

[email protected]

blog.asquareb.com

https://github.com/bijugs

@gsbiju

Date post:	25-Jan-2015
Category:	Technology
Upload:	biju-nair
View:	662 times
Download:	1 times

NENUG Apr14 Talk - data modeling for netezza

Technology