+ All Categories
Home > Technology > NENUG Apr14 Talk - data modeling for netezza

NENUG Apr14 Talk - data modeling for netezza

Date post: 25-Jan-2015
Category:
Upload: biju-nair
View: 662 times
Download: 1 times
Share this document with a friend
Description:
 
Popular Tags:
30
Data Modeling and Netezza Biju Nair NENUG Talk 30Apr2014
Transcript
Page 1: NENUG Apr14 Talk - data modeling for netezza

Data  Modeling  and  Netezza    

Biju  Nair  NENUG  Talk    

30-­‐Apr-­‐2014  

Page 2: NENUG Apr14 Talk - data modeling for netezza

Mo@va@on  and  Goal  

•  Mo@va@on  – Performance  degrada@on  with  data  volume  

•  Goal  – Highlight  considera@ons  while  modeling  for  NZ  

2  

Page 3: NENUG Apr14 Talk - data modeling for netezza

Data  Modeling  

•  Logical  Data  Modeling  – Business  domain  data  representa@on  –  Independent  of  DBMS  technology  – NZ  is  an  appliance  for  analy@cs  

•  Set  based  processing  vs  row  based  processing  •  De-­‐normaliza@on  •  Snow  flake/Star  schema  

•  Physical  Data  Modeling  – Takes  in  the  DBMS  features  and  constraints  – Need  to  understand  the  DBMS  architecture  

3  

Page 4: NENUG Apr14 Talk - data modeling for netezza

NZ  Architecture  

Host  

Snippet  Processors  

Snippet  Processors  

Snippet  Processors  

Data   Data   Data  

-­‐  Parse,  Op+mize  and  Compile  query  -­‐  Schedule  snippets  -­‐  Distribute  data  

Executes  snippets  

-­‐  Shared  nothing  MPP  -­‐  Custom  IP  backbone  

-­‐  Appliance  efficiency  is  maximized  when    -­‐  snippet  processors  can  run  independently  i.e.  data  independence  -­‐  All  snippet  processors  are  u@lized  uniformly   4  

Page 5: NENUG Apr14 Talk - data modeling for netezza

Snippet  Processors  

Data  

Accelerator  (FPGA)   CPU  

Compute  

Host  

Reads  compressed  

data  

Un-­‐compress   Remove  columns  

Restrict  rows  (where)  

Perform  computa+on  

Send  data  to  host  

-­‐  Disk  reads  are  incredibly  slow  rela@ve  to  other  components  especially  seek  @me  -­‐  While  the  CPU  overhead  is  reduced,  volume  of  data  read  will  impact  performance  

5  

Page 6: NENUG Apr14 Talk - data modeling for netezza

Data  Storage  

Data  

Snippet  Processors  

Host  

Extend  

Page  

Extend  

Extend  

…  

Meta-­‐Data  

-­‐  Meta  data  iden@fies  extends/pages  to  read  or  skip  6  

Page 7: NENUG Apr14 Talk - data modeling for netezza

Modeling  Priori@es  

•  U@liza@on  of  all  snippet  processors  – Need  to  be  able  to  u@lize  uniformly  

•  Maximize  MPP  capability  of  snippet  processors  –  Ideally  snippet  processors  should  be  independent  

•  Minimize  data  read  from  disk  – Minimize  data  stored  

•  Improve  computa@on  in  snippet  processor  –  Compounded  with  data  volume  will  help  performance      

7  

Page 8: NENUG Apr14 Talk - data modeling for netezza

Snippet  Processor  U@liza@on  Data  Distribu+on  

Host  

Snippet  Processors  

Snippet  Processors  

Snippet  Processors  

1,MA,1212,…  3,MA,0414,…  

2,CA,0113,…  

1,MA,1212,…  2,CA,0113,…  3,MA,0414,…  

Distribute  by  state  

-­‐  NZ  will  pick  one  of  the  columns  to  distribute  if  none  specified  in  table  defini@on  -­‐  First  column  in  the  table  

Data  Skew  

8  

Page 9: NENUG Apr14 Talk - data modeling for netezza

Snippet  Processor  U@liza@on  Data  Distribu+on  

Host  

Snippet  Processors  

Snippet  Processors  

Snippet  Processors  

1,MA,1212,…   2,CA,0113,…   3,MA,0414,…  

1,MA,1212,…  2,CA,0113,…  3,MA,0414,…  

Distribute  by  mo-­‐yr  

-­‐  Snippet  processors  are  u@lized  uniformly  -­‐  What  if  most  of  the  query  is  on  for  the  current  month?  

-­‐  Processing  skew  9  

Page 10: NENUG Apr14 Talk - data modeling for netezza

Snippet  Processor  U@liza@on  Data  Distribu+on  

Host  

Snippet  Processors  

Snippet  Processors  

Snippet  Processors  

1,MA,1212,…  4,CA,0414,…  

2,CA,0113,…   3,MA,0414,…  

1,MA,1212,…  2,CA,0113,…  3,MA,0414,…  4,CA,0414,…  

Distribute  random  

-­‐  Snippet  processors  are  u@lized  uniformly  -­‐  Helps  prevent  processing  skew  

10  

Page 11: NENUG Apr14 Talk - data modeling for netezza

3,ORD1,ITEM1,…  4,ORD1,ITEM1,…  2,ORD1,ITEM1,…  

1,ORD1,ITEM1,…  

Snippet  Processor  U@liza@on  Data  Distribu+on  and  Table  Joins  

Host  

Snippet  Processors  

Snippet  Processors  

Snippet  Processors  

4,CA,0414,…  1,MA,1212,…  3,MA,0414,…  

2,CA,0113,…  

1,ORD1,ITEM1,…  2,ORD1,ITEM1,…  3,ORD1,ITEM1,…  4,ORD1,ITEM1,…  

Distribute  random  

-­‐  Snippet  processors  are  u@lized  uniformly  -­‐  Makes  snippet  processors  dependent  on  others  impac@ng  MPP  maximiza@on  

Need  to  redistribute  data  from  both  tables  

11  

Page 12: NENUG Apr14 Talk - data modeling for netezza

3,ORD1,ITEM1,…  2,ORD1,ITEM1,…  1,ORD1,ITEM1,…  4,ORD1,ITEM1,…  

Snippet  Processor  U@liza@on  Data  Distribu+on  and  Table  Joins  

Host  

Snippet  Processors  

Snippet  Processors  

Snippet  Processors  

4,CA,0414,…  1,MA,1212,…  3,MA,0414,…  

2,CA,0113,…  

1,ORD1,ITEM1,…  2,ORD1,ITEM1,…  3,ORD1,ITEM1,…  4,ORD1,ITEM1,…  

Distribute  on  Join  column  -­‐  cid  

-­‐  Snippet  processors  are  u@lized  uniformly  -­‐  Makes  snippet  processors  dependent  on  others  impac@ng  MPP  maximiza@on  

-­‐  Becer  than  the  previous  scenario  

Need  to  redistribute  data  from  one  table  

12  

Page 13: NENUG Apr14 Talk - data modeling for netezza

3,ORD1,ITEM1,…  2,ORD1,ITEM1,…  1,ORD1,ITEM1,…  4,ORD1,ITEM1,…  

Snippet  Processor  U@liza@on  Data  Distribu+on  and  Table  Joins  

Host  

Snippet  Processors  

Snippet  Processors  

Snippet  Processors  

1,MA,1212,…  4,CA,0414,…  

2,CA,0113,…   3,MA,0414,…  

Distribute  both  tables  on  Join  column  -­‐  cid  

-­‐  Snippet  processors  are  u@lized  uniformly  -­‐  Makes  snippet  processors  independent  maximizing  MPP  

13  

Page 14: NENUG Apr14 Talk - data modeling for netezza

Snippet  Processor  U@liza@on  Data  Distribu+on  

•  Iden@fy  keys  to  distribute  data  uniformly  – Avoid  data  and  processing  skew  – Try  using  join  columns  as  the  distribu@on  keys  – Choose  same  data  types  for  join  columns  

•  If  table  size  is  small  random  distribu@on  is  fine  –  If  one  of  the  join  table  is  small,  NZ  will  broadcast  

•  Redistribu@on  may  not  be  an  overkill  for  small  data  – For  e.g.,  selec@ng  a  small  number  if  columns  

14  

Page 15: NENUG Apr14 Talk - data modeling for netezza

Snippet  Processor  U@liza@on  Distribu+on  and  Query  Time  

0  

0.5  

1  

1.5  

2  

2.5  

3  

3.5  

Random   1  Correct  Distribu@on   2  Correct  Distribu@on  

Query  Time  For  Different  Distribu+ons  

Time  (min)  

15  

Page 16: NENUG Apr14 Talk - data modeling for netezza

Snippet  Processor  U@liza@on  Join  Column  Type  and  Query  Time  

0  

0.5  

1  

1.5  

2  

2.5  

Incorrect  Data  Types   Correct  Data  Types  

Join  query  +me  -­‐  same  and  diff  data  types  

Time  (min)  

16  

Page 17: NENUG Apr14 Talk - data modeling for netezza

Minimize  Data  Read  From  Disk  Zone  Maps  

•  Data  types  which  supports  Zone  Maps  – All  integer  data  types  

•  int1  •  int2  •  int4  •  int8  

– Date  – Timestamp  

17  

Refer  to  the  product  manual  for  the  version  of  NZ  used  for  the  complete  list  of  zone  map  able  data  types    

Page 18: NENUG Apr14 Talk - data modeling for netezza

Data  

Snippet  Processors  

Host  

Extend  1  

Page  

Extend  2  

Extend  3  

…  

Meta-­‐Data  

-­‐  May  end  up  reading  all  data  from  disk  

Minimize  Data  Read  From  Disk  Zone  Maps  

Table  column  (cid)  is  numeric(10,0)  

No  zone  map  for  cid  

18  

Page 19: NENUG Apr14 Talk - data modeling for netezza

Data  

Snippet  Processors  

Host  

Extend  

Page  

Extend  

Extend  

…  

Meta-­‐Data  

-­‐  Zone  maps  can  be  used  to  minimize  data  read  from  disk  

Minimize  Data  Read  From  Disk  Zone  Maps  

Table  column  (cid)  is  bigint  

19  

Page 20: NENUG Apr14 Talk - data modeling for netezza

Minimize  Data  Read  From  Disk  Zone  Maps  and  Query  Time  

1.3  

1.35  

1.4  

1.45  

1.5  

1.55  

1.6  

Incorrect  Data  Type   Correct  Data  Type  

Query  +me  with  and  without  zone  map  

Time  (min)  

20  

Page 21: NENUG Apr14 Talk - data modeling for netezza

Minimize  Data  Read  From  Disk  Clustered  Base  Tables  

•  NZ  stores  data  with  same  organize  keys  closely  •  Addi@onal  data  types  are  zone  map  able  

–  char  –  varchar  –  nchar  –  nvarchar  –  float  –  double  –  bool  –  @me  –  Interval  

•  Helps  improve  performance  of  mul@  table  join  

21  

Page 22: NENUG Apr14 Talk - data modeling for netezza

Extend  

Extend  

Extend  

Minimize  Data  Read  From  Disk  Clustered  Base  Tables  

Table  distributed  on  cid  

1,MA,Boston,…  

5,CA,LA,…  

3,FL,Tampa,,…  

1,MA,Salem,…  

5,CA,SF,…  

3,FL,Orlando,…  

1,MA,Lowell,…  

5,CA,Pasadena,…  3,FL,Miami,…  

Extend  

Extend  

Extend  

Table  distributed  on  cid  organize  on  state  

1,MA,Boston,…  

1,MA,Salem,…  

1,MA,Lowell,…  

3,FL,Tampa,,…  

3,FL,Orlando,…  

3,FL,Miami,…  

5,CA,LA,…  

5,CA,SF,…  

5,CA,Pasadena,…  

Extend  

1,AL,Alabama,…  

5,CA,California,…  10,FL,Florida,…  

State  

22,MA,Mass,…  

22  

Page 23: NENUG Apr14 Talk - data modeling for netezza

Minimize  Data  Read  From  Disk  Clustered  Base  Table  

0  

0.5  

1  

1.5  

2  

2.5  

No  Org+Correct  Dist   Org+Correct  Dist  

Query  Time  with  and  without  org  

Time  (min)  

23  

Page 24: NENUG Apr14 Talk - data modeling for netezza

Minimize  Data  Read  From  Disk  Materialized  Views  

•  View  with  frequently  used  columns  of  base  table  – Unlike  views,  materialized  view  stores  data  – Reduced  data  read  from  disk  – Addi@onal  storage  required  – Need  to  be  refreshed  if  base  table  data  changes  

•  Can  be  used  as  an  index  against  base  table  – Stores  loca@on  of  base  table  data  loc  in  a  column  

24  

Page 25: NENUG Apr14 Talk - data modeling for netezza

Extend  

Extend  

Extend  

Minimize  Data  Read  From  Disk  Materialized  Views  

T_EMP  distributed  on  cid  

1,MA,Mike,…  

5,CA,Fally,…  

3,FL,Chris,…  

4,MA,Robert,…  

7,CA,Mary,…  

2,FL,Jus+n,…  

6,MA,Harini,…  

8,CA,Mike,…  

9,FL,Martha,…  

SELECT  ID,  NAME  FROM  T_EMP;  

Extend  

MV  on  T_EMP  order  by  state,  name  

5,CA,Fally  

7,CA,Mary  

8,CA,Mike  

3,FL,Chris  

2,FL,Jus+n  

9,FL,Martha  

6,MA,Harini  

1,MA,Mike  

4,MA,Robert  

SELECT  *  FROM  T_EMP  WHERE  STATE  =  ‘CA’  

25  

Page 26: NENUG Apr14 Talk - data modeling for netezza

Minimize  Data  Read  From  Disk  Materialized  Views  

0  

0.2  

0.4  

0.6  

0.8  

1  

1.2  

1.4  

No  MV   With  MV  

Query  +me  on  single  table  -­‐MV  vs  No  MV  

Time  (min)  

26  

Page 27: NENUG Apr14 Talk - data modeling for netezza

Minimize  Data  Read  From  Disk  Materialized  Views  

0  

0.2  

0.4  

0.6  

0.8  

1  

1.2  

1.4  

1.6  

1.8  

No  MV   With  MV  

Join  Query  Time  -­‐  With  and  without  MV  

Time  (min)  

27  

Page 28: NENUG Apr14 Talk - data modeling for netezza

Minimize  Data  Read  From  Disk  Minimize  Data  Stored  

•  Choose  storage  efficient  data  types  – Difference  between  bigint  and  int  is  4  bytes  

•  Use  char  instead  of  varchar  if  the  data  length  is  fixed  – varchar  has  a  2  byte  overhead  

•  Define  columns  as  “not  null”  where  possible  

•  Store  only  the  required  data  in  table  columns  

•  Encode  duplicate  data  stored  in  rows  28  

Page 29: NENUG Apr14 Talk - data modeling for netezza

Improve  Computa@on  in  Snippet  Processor  NZ  Object  Defini+ons  

•  Define  columns  as  “not  null”  where  possible  – Removes  logic  to  check  nulls  

•  Define  table  keys  and  rela@onships  – Helps  NZ  query  op@mizer  to  generate  efficient  code  

29  

Page 30: NENUG Apr14 Talk - data modeling for netezza

30  

[email protected]

blog.asquareb.com

https://github.com/bijugs

@gsbiju


Recommended