Transcript
Page 1: Invisible loading

Invisible  Loading  

Yalies:  Azza  Abouzied,    Daniel  Abadi,  Avi  Silberschatz  

BigData  2012  

Page 2: Invisible loading

Problem:  The  Crying  Baby  

Page 3: Invisible loading

Two  ways  to  deal  with  this:  

Immediate gratification: long-term $$$ costs

Misery & sleep deprivation: long-term benefits

Page 4: Invisible loading

The Crying Baby Problem: Wants Attention Now!

The Impatient Boss Problem: Wants Answers Now!

≈  

Page 5: Invisible loading

Two  ways  to  analyze  data  

MapReduce way: Locate File → Determine Key Attributes → Hack it: Parse + Map + Reduce

Immediate gratification. Long-term cumulative costs because MR is slow!

DB & HadoopDB way: Locate File → Determine Key Attributes → Figure out schema → Load File → Organize or Index DB tables → Query: Process without Parse

Misery & sleep deprivation. Long-term benefits.

Page 6: Invisible loading

The  Problem  

Can we get the immediate gratification of working with MapReduce and still make progress towards the performance advantages of working with databases?

Page 7: Invisible loading

Our Solution

Begin with the MapReduce way: Locate File → Determine Key Attributes → Write Map/Reduce Scripts → Run it!

INCREMENTALLY, BEHIND THE SCENES, PER JOB: Figure out schema → Load File → Organize or Index DB tables

File System → Database System

Page 8: Invisible loading

P1) How to automatically figure out a schema?

Short answer: DON'T

•  Split the map phase into Parse and Map phases.
•  Enforce a simple Parse API: the Parser has one output method, getField(int id).
•  Name a table after its Parser implementation and label attributes with their field id.
•  Different parsers on the same file result in different tables.
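A minimal sketch of what such a Parse API could look like. Only the single output method getField(int id) comes from the slide; the class names, the parse() method, and the "f<id>" column label format are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Parser(ABC):
    """Hypothetical Parse-phase interface with one output method."""

    @abstractmethod
    def parse(self, line: str) -> None:
        """Consume one raw input line (assumed input hook, not from the slide)."""

    @abstractmethod
    def get_field(self, field_id: int) -> str:
        """Return field `field_id` of the last parsed line (getField on the slide)."""


class CommaParser(Parser):
    """Example parser for comma-separated lines."""

    def parse(self, line: str) -> None:
        self._fields = line.rstrip("\n").split(",")

    def get_field(self, field_id: int) -> str:
        return self._fields[field_id]


def table_name(parser: Parser) -> str:
    # The table is named after the Parser implementation.
    return type(parser).__name__


def column_name(field_id: int) -> str:
    # Attributes are labeled with their field id; the "f" prefix is an assumption.
    return f"f{field_id}"


p = CommaParser()
p.parse("7,42,foo")
print(table_name(p), column_name(1), p.get_field(1))
```

Because the Map code only ever calls get_field, the system can record which field ids a job touches without knowing anything else about the file's schema.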

Figure out schema

Page 9: Invisible loading

P2)  How  to  load  files  with  minimal  marginal  costs?  

•  Load only touched attributes (vertical partition)
– Requires a column-store

•  Load only parts of a column (horizontal partition)
– After a file split is processed by Map, its touched attributes are loaded entirely
– The number of file splits loaded is a tunable parameter
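The two partitioning ideas can be sketched together. This is a toy model under stated assumptions: ColumnStore, load_next_splits, and the in-memory lists are hypothetical stand-ins, not the system's real storage layer.

```python
from collections import defaultdict


class ColumnStore:
    """Toy column store: one Python list per (table, field id)."""

    def __init__(self):
        self.columns = defaultdict(list)

    def append(self, table, field_id, values):
        self.columns[(table, field_id)].extend(values)


def load_next_splits(splits, next_split, splits_per_job, touched, store, table):
    """Load a few more splits of the touched columns and return the new position.

    splits         -- list of file splits, each a list of parsed rows
    next_split     -- index of the first split not yet loaded
    splits_per_job -- the tunable parameter: how many splits to load now
    touched        -- field ids the Map function actually read
    """
    end = min(next_split + splits_per_job, len(splits))
    for split in splits[next_split:end]:       # horizontal partition
        for field_id in touched:               # vertical partition
            store.append(table, field_id, [row[field_id] for row in split])
    return end


splits = [[(1, 2, 3), (4, 5, 6)], [(7, 8, 9)]]
store = ColumnStore()
pos = load_next_splits(splits, 0, 1, {0, 2}, store, "CommaParser")
print(pos, dict(store.columns))
```

A smaller splits_per_job keeps the extra cost of each job low, at the price of needing more jobs before a column is fully loaded.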

 

Incrementally Load File

Page 10: Invisible loading

Tuple construction

Some columns are at different loading stages.
– Maintain OIDs for each column: an address column
  •  The OIDs assigned are equivalent to the insertion order
– Keep a catalog to track loading progress

[Figure: columns a, b, c, d at different loading stages, with labels "Process in DB" and "Use File System".]
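One way to picture the catalog is sketched below. The slide does not describe the catalog's format, so the dictionary layout and the plan_scan helper are assumptions. The sketch relies only on the stated property that OIDs follow insertion order, so the first n rows of a column are the ones already in the database.

```python
def plan_scan(catalog, columns, total_rows):
    """Split a scan over `columns` into a database part and a file-system part.

    catalog -- dict: column name -> number of rows already loaded
               (hypothetical layout; OIDs 0..n-1 are the loaded rows)
    Returns (rows to process in the DB, rows to read from the file system).
    """
    # A row can be served by the DB only if every requested column has it.
    loaded = min(catalog.get(c, 0) for c in columns)
    return range(0, loaded), range(loaded, total_rows)


catalog = {"a": 100, "b": 100, "c": 40}   # column d has not been touched yet
in_db, in_fs = plan_scan(catalog, ["a", "c"], total_rows=100)
print(in_db, in_fs)
```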

Page 11: Invisible loading

P3) How to index a partially-loaded table?

If a selection filter is applied on an attribute, we organize it.

Dealing with partially loaded attributes

Incrementally Organize file

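[Figure: two loaded slices of column c1 with their address column, before and after each slice is sorted.
Before: c1 = 11, 0, 5 | 1, 8, 5, 4, 10; address column = 1, 2, 3 | 4, 5, 6, 7, 8
After: c1 = 0, 5, 11 | 1, 4, 5, 8, 10; address column = 2, 3, 1 | 4, 7, 6, 5, 8
c1 and c2 are combined with a JOIN.]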
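A small sketch of why the address column matters once a column is organized. It assumes that c2 is still stored in insertion order and that addresses start at 1. The function names and the c2 values are made up for illustration.

```python
def sort_slice(values, addresses):
    """Sort one slice of a column, carrying each value's address (OID) along."""
    pairs = sorted(zip(values, addresses))
    return [v for v, _ in pairs], [a for _, a in pairs]


def join_on_address(sorted_values, sorted_addresses, other_column, first_oid=1):
    """Rebuild (c1, c2) tuples after c1 was reordered.

    other_column is assumed to be in insertion order, so an address
    indexes directly into it.
    """
    return [(v, other_column[a - first_oid])
            for v, a in zip(sorted_values, sorted_addresses)]


c1, addr = sort_slice([11, 0, 5], [1, 2, 3])
c2 = ["x", "y", "z"]                      # hypothetical values of column c2
print(c1, addr, join_on_address(c1, addr, c2))
```

Without the address column, sorting c1 would lose the link between a c1 value and the other attributes of the same tuple.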

Page 12: Invisible loading

Choosing an organization strategy

•  Why  not  use  merge  sort?    

[Figure: Load & Sort with a full merge sort.
Splits in HDFS File System: 11, 0, 5 | 1, 8, 5, 4, 10 | 6, 9, 12 | 3, 13, 2, 7, 15
Slices of the column in database: 0, 5, 11 | 1, 4, 5, 8, 10 | 6, 9, 12 | 2, 3, 7, 13, 15
After the first merge: 0, 1, 4, 5, 5, 8, 10, 11 | 2, 3, 6, 7, 9, 12, 13, 15
After the final merge: 0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15]

Page 13: Invisible loading

Incremental  Merge  Sort  

[Figure: Load & Sort with incremental merge sort.
Splits in HDFS File System: 11, 0, 5 | 1, 8, 5, 4, 10 | 6, 9, 12 | 3, 13, 2, 7, 15
Slices of the column in database: 0, 5, 11 | 1, 4, 5, 8, 10 | 6, 9, 12 | 2, 3, 7, 13, 15
Phase 1 (split-bit: 1000b): 0, 1, 4, 5, 5 | 8, 10, 11 | 2, 3, 6, 7 | 9, 12, 13, 15
Phase 2 (split-bit: 0100b): 0, 1, 2, 3 | 8, 9, 10, 11 | 4, 5, 5, 6, 7 | 12, 13, 15
Simple index: 0: 0-3, 1: 8-11, 2: 4-7, 3: 12-15]
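A minimal sketch of one way to realize the phases pictured on this slide: merge two sorted slices, then split the result on a single key bit so that both halves stay the size of a slice. The pairing of slices, the in-place slot assignment, 4-bit keys, and a power-of-two slice count are assumptions inferred from the figure, not the authors' stated algorithm. See the paper for the real one.

```python
from heapq import merge


def ims_phase(slices, phase, key_bits):
    """Run one merge phase (1-based) and return the new list of slices."""
    stride = 1 << (phase - 1)            # partner slot index differs in this bit
    split_bit = 1 << (key_bits - phase)  # 0b1000 in phase 1 for 4-bit keys
    out = list(slices)
    for i in range(len(slices)):
        if i & stride:
            continue                     # upper partner, handled with slot i - stride
        j = i | stride
        merged = list(merge(slices[i], slices[j]))
        # Split the merged run on one key bit: keys with the bit clear stay
        # in slot i, keys with the bit set go to slot j. Both stay sorted.
        out[i] = [v for v in merged if not v & split_bit]
        out[j] = [v for v in merged if v & split_bit]
    return out


def sparse_index(slices):
    """Map slot number -> (smallest key, largest key) held in that slot."""
    return {i: (s[0], s[-1]) for i, s in enumerate(slices) if s}


slices = [[0, 5, 11], [1, 4, 5, 8, 10], [6, 9, 12], [2, 3, 7, 13, 15]]
after_1 = ims_phase(slices, phase=1, key_bits=4)
after_2 = ims_phase(after_1, phase=2, key_bits=4)
print(after_1)
print(after_2)
print(sparse_index(after_2))
```

Each call performs one bounded round of merging, so the reorganization can be spread over several jobs instead of being paid all at once as in a full merge sort.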

Page 14: Invisible loading

EVALUATION  

Page 15: Invisible loading

Setup  

•  Single-machine experiments
– Embarrassingly parallel
– No distributed reorganization or partitioning

•  MonetDB (hacked to support IMS)
•  Hadoop
•  2 GB file of 5 integer attributes: 107,374,182 tuples

•  See paper for more details

Page 16: Invisible loading

The  big  picture  

[Figure: time in seconds (0 to 800) per job vs. job sequence (ticks at 1, 10, 100). Series: SQL Pre-load, Incremental Reorganization (5/5), Incremental Reorganization (2/5), Invisible Loading (5/5), Invisible Loading (2/5), MapReduce.]

Page 17: Invisible loading

Cumulative costs

[Figure: cumulative time spent in seconds (ticks at 100, 1000, 10000, 100000) vs. job sequence (ticks at 1, 10, 100). Series: SQL Pre-load, Incremental Reorganization (5/5), Incremental Reorganization (2/5), Invisible Loading (5/5), Invisible Loading (2/5), MapReduce.]

Page 18: Invisible loading

Change the access pattern

[Figure: time in seconds (0 to 800) per job vs. job sequence, in two panels: log scale (ticks at 1, 10) and linear scale (jobs 83 to 93). Series: SQL Pre-load, Incremental Reorganization (5/5), Incremental Reorganization (2/5), Invisible Loading (5/5), Invisible Loading (2/5), MapReduce.]

Page 19: Invisible loading

Further Evaluation (Paper)

•  In-depth study of IMS
– Comparison with Cracking and Pre-sorting
– Effect of integrating lightweight compression into IMS

•  Little mini-experiments
– Insertion vs. Copy
– Processing in DB vs. using DB as a fast access medium with all processing in MapReduce

Page 20: Invisible loading

Conclusion: Lessons Learned

•  Engineering Nightmare
– Many complementary technologies
  •  Manimal, Adaptive Merging …
– In the era of Big Data we need to design more modular, plug-n-play tools

•  Can of worms
– Most Big Data problems look deceptively simple until you start mucking around.

Page 21: Invisible loading

Some  problems  are  easier  than  others  

Page 22: Invisible loading

Thanks!  

Questions?

Page 23: Invisible loading

Why is loading this log file hard?

[Figure: an Apache error log. Sample entries:]
[Sun Jul 15 21:02:20 2012] [error] [client 127.0.0.1] File does not exist: /Library/WebServer/Documents/favicon.ico
[Tue Jul 17 20:11:40 2012] [notice] caught SIGTERM, shutting down
[Fri Jul 20 19:13:19 2012] [warn] Init: Session Cache is not configured [hint: SSLSessionCache]
[Fri Jul 20 19:13:20 2012] [notice] Digest: generating secret for digest authentication ...

What is the base schema? Time, Type, Message?

Message field varies depending on application!

Context-dependent Schema Awareness: Different analysts know the schema of what they are looking for and don’t care about other log messages

Different  tables  for  each  type?  
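A sketch of context-dependent schema awareness, assuming log lines of the form "[time] [type] message". Both parser classes, their field numbering, and the regular expressions are illustrative assumptions. The point is only that two parsers over the same file expose different fields and would therefore feed different tables.

```python
import re

# Base schema assumed from the slide: Time, Type, Message.
LINE = re.compile(r"^\[(?P<time>[^\]]+)\] \[(?P<type>\w+)\] (?P<message>.*)$")


class BaseLogParser:
    """Fields: 0 = time, 1 = type, 2 = message."""

    def parse(self, line):
        m = LINE.match(line.rstrip("\n"))
        if m is None:
            raise ValueError("unrecognized log line")
        self.fields = [m.group("time"), m.group("type"), m.group("message")]

    def get_field(self, field_id):
        return self.fields[field_id]


class MissingFileParser(BaseLogParser):
    """One analyst's view: adds field 3 = path of a missing file, else None."""

    def parse(self, line):
        super().parse(line)
        m = re.search(r"File does not exist: (\S+)", self.fields[2])
        self.fields.append(m.group(1) if m else None)


line = ("[Sun Jul 15 21:02:20 2012] [error] [client 127.0.0.1] "
        "File does not exist: /Library/WebServer/Documents/favicon.ico")
p = MissingFileParser()
p.parse(line)
print(type(p).__name__, p.get_field(1), p.get_field(3))
```

An analyst interested in shutdown notices would write a different parser over the same file and never pay for loading the missing-file paths.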

