Invisible loading

Yalies: Azza Abouzied, Daniel Abadi, Avi Silberschatz

BigData 2012

Problem: The Crying Baby

Two ways to deal with this:

Immediate gratification, long-term $$$ costs

Misery & sleep deprivation, long-term benefits

The Crying Baby Problem: Wants Attention Now!

The Impatient Boss Problem: Wants Answers Now!

≈

Two ways to analyze data

MapReduce way
•  Locate File
•  Determine Key Attributes
•  Hack it: Parse + Map + Reduce
Immediate gratification, long-term cumulative costs because MR is slow!

DB & HadoopDB way
•  Locate File
•  Figure out schema
•  Load File
•  Organize or Index DB tables
•  Query: Process without Parse
Misery & sleep deprivation, long-term benefits

The Problem

Can we get the immediate gratification of working with MapReduce and still make progress towards the performance advantages of working with databases?

Our Solution

Begin with the MapReduce Way
•  Locate File
•  Write Map/Reduce Scripts
•  Run it!

INCREMENTALLY, BEHIND THE SCENES, PER JOB
•  Figure out schema
•  Load File
•  Organize or Index DB tables

File System

Database System

P1) How to automatically figure out a schema?

Short answer: DON'T.
•  Split the map phase into Parse and Map phases.
•  Enforce a simple Parse API: a Parser has one output method, getField(int id).
•  Name a table after its Parser implementation and label attributes with their field id.
•  Different parsers on the same file result in different tables.

Figure out schema
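A minimal sketch of the Parse/Map split, assuming a Python rendering of the slide's getField(int id) method. The class names, the `touched` set, the `f<id>` column labels, and the comma-separated input format are illustrative assumptions, not the authors' implementation.

```python
from abc import ABC, abstractmethod


class Parser(ABC):
    """Hypothetical Python rendering of the Parse API on the slide."""

    def __init__(self):
        self.touched = set()  # field ids requested by Map so far

    @abstractmethod
    def parse(self, record: str) -> None:
        """Consume one raw record; called by the framework, not by Map."""

    @abstractmethod
    def _field(self, field_id: int):
        """Return the parsed value of one field."""

    def get_field(self, field_id: int):
        """The single output method that Map is allowed to call."""
        self.touched.add(field_id)
        return self._field(field_id)


class CsvIntParser(Parser):
    """Example parser: comma-separated integers (an assumed file format)."""

    def parse(self, record):
        self._values = [int(v) for v in record.split(",")]

    def _field(self, field_id):
        return self._values[field_id]


def derived_schema(parser):
    """Table named after the Parser class, columns labeled by field id."""
    return type(parser).__name__, [f"f{i}" for i in sorted(parser.touched)]


parser = CsvIntParser()
total = 0
for record in ["1,20,300", "2,40,600"]:
    parser.parse(record)
    if parser.get_field(0) > 1:        # Map logic only touches fields 0 and 2
        total += parser.get_field(2)

print(total)                    # 600
print(derived_schema(parser))   # ('CsvIntParser', ['f0', 'f2'])
```

Because Map can only reach the data through one method, the system learns which attributes a job touches without ever asking the user for a schema.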

P2) How to load files with minimal marginal costs?

•  Load only touched attributes (Vertical Partition)
– Requires a Column-Store
•  Load only parts of a column (Horizontal Partition)
– After a file-split is processed by Map, its touched attributes are loaded entirely
– The number of splits of a file is a tunable parameter.

Incrementally Load File
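The two partitioning ideas can be sketched together as follows. This is a toy model under stated assumptions: the column store is a plain dict of lists, the catalog counts loaded splits per attribute, and `splits_per_job` is a hypothetical name for the per-job loading budget.

```python
def load_increment(splits, touched, store, catalog, splits_per_job):
    """Piggy-back loading on one MapReduce job (illustrative only).

    splits:  the file as a list of splits, each a list of row tuples
    touched: attribute positions the job's Map function asked for
    store:   toy column store, attribute position -> list of loaded values
    catalog: attribute position -> number of splits loaded so far
    """
    for attr in sorted(touched):                 # vertical: touched attributes only
        start = catalog.get(attr, 0)
        end = min(start + splits_per_job, len(splits))
        for split in splits[start:end]:          # horizontal: a few splits per job
            store.setdefault(attr, []).extend(row[attr] for row in split)
        catalog[attr] = end


# A made-up file of 4 splits, 2 rows each, 3 attributes per row.
splits = [
    [(1, 10, 100), (2, 20, 200)],
    [(3, 30, 300), (4, 40, 400)],
    [(5, 50, 500), (6, 60, 600)],
    [(7, 70, 700), (8, 80, 800)],
]
store, catalog = {}, {}

load_increment(splits, {0}, store, catalog, splits_per_job=2)      # job 1
load_increment(splits, {0, 2}, store, catalog, splits_per_job=2)   # job 2

print(catalog)   # {0: 4, 2: 2}
print(store[2])  # [100, 200, 300, 400]
```

After two jobs, attribute 0 is fully loaded, attribute 2 is half loaded, and attribute 1 was never touched, so it costs nothing.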

Tuple construction

Some columns are at different loading stages.
– Maintain OIDs for each column: an address column
•  The OIDs assigned are equivalent to the insertion order
– Keep a catalog to track loading progress

[Figure: columns a, b, c, d at different loading stages, with the labels "Process in DB" and "Use File System"]
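One way to read this slide: a query can be answered from the database for the OID range that every needed column has already loaded, and from the file for the rest. The sketch below assumes that reading. The catalog here counts loaded rows per column, OIDs are simply list positions (insertion order), and all names and data are made up for illustration.

```python
COLUMNS = {"a": 0, "b": 1, "c": 2, "d": 3}   # column name -> position in a file row


def scan(needed, store, catalog, file_rows):
    """Yield (source, tuple) pairs for the requested columns."""
    # Only the prefix that every needed column has loaded can come from the DB.
    in_db = min(catalog.get(name, 0) for name in needed)
    for oid in range(in_db):
        yield "db", tuple(store[name][oid] for name in needed)
    for row in file_rows[in_db:]:
        yield "file", tuple(row[COLUMNS[name]] for name in needed)


file_rows = [(1, 10, 100, 1000), (2, 20, 200, 2000),
             (3, 30, 300, 3000), (4, 40, 400, 4000)]
store = {"a": [1, 2, 3, 4], "c": [100, 200]}   # column c is only half loaded
catalog = {"a": 4, "c": 2}

result = list(scan(["a", "c"], store, catalog, file_rows))
print(result)
# [('db', (1, 100)), ('db', (2, 200)), ('file', (3, 300)), ('file', (4, 400))]
```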

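P3) How to index a partially-loaded table?

If a selection filter is applied on an attribute, we organize it.

Dealing with partially loaded attributes

Incrementally Organize file

[Figure: two loaded splits of column c1, each stored with its address column.
Before organizing: c1 = 11, 0, 5 with addresses 1, 2, 3; c1 = 1, 8, 5, 4, 10 with addresses 4, 5, 6, 7, 8.
After organizing: c1 = 0, 5, 11 with addresses 2, 3, 1; c1 = 1, 4, 5, 8, 10 with addresses 4, 7, 6, 5, 8.
c1 is matched with c2 through a JOIN on the address column.]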
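A small sketch of why the address column matters once a column is reorganized: sorting c1 breaks positional alignment with c2, so the OIDs travel with the values and are used to fetch the matching c2 entries. The c1 values and OIDs 4 to 8 follow the second split in the slide's figure; the c2 values and both function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def organize(values, first_oid):
    """Sort one loaded split of a column, carrying each value's OID along."""
    pairs = sorted(zip(values, range(first_oid, first_oid + len(values))))
    return [v for v, _ in pairs], [oid for _, oid in pairs]


def select_join(c1_sorted, c1_addr, c2, first_oid, low, high):
    """Return (c1, c2) pairs with low <= c1 <= high, fetching c2 by address."""
    lo = bisect_left(c1_sorted, low)
    hi = bisect_right(c1_sorted, high)
    return [(c1_sorted[i], c2[c1_addr[i] - first_oid]) for i in range(lo, hi)]


c1_sorted, c1_addr = organize([1, 8, 5, 4, 10], first_oid=4)
c2 = ["p", "q", "r", "s", "t"]   # hypothetical values, still in insertion order

print(c1_sorted)  # [1, 4, 5, 8, 10]
print(c1_addr)    # [4, 7, 6, 5, 8]
print(select_join(c1_sorted, c1_addr, c2, 4, low=4, high=8))
# [(4, 's'), (5, 'r'), (8, 'q')]
```

The range filter runs as a binary search on the organized column, and only the qualifying addresses are used to reach into c2.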

Choosing an organization strategy

•  Why not use merge sort?

[Figure: merge sort over four splits.
Splits in HDFS File System: (11, 0, 5), (1, 8, 5, 4, 10), (6, 9, 12), (3, 13, 2, 7, 15)
Load & Sort, giving slices of the column in database: (0, 5, 11), (1, 4, 5, 8, 10), (6, 9, 12), (2, 3, 7, 13, 15)
First merge: (0, 1, 4, 5, 5, 8, 10, 11), (2, 3, 6, 7, 9, 12, 13, 15)
Second merge: (0, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15)]

Incremental Merge Sort

[Figure: incremental merge sort over four splits.
Splits in HDFS File System: (11, 0, 5), (1, 8, 5, 4, 10), (6, 9, 12), (3, 13, 2, 7, 15)
Load & Sort, giving slices of the column in database: (0, 5, 11), (1, 4, 5, 8, 10), (6, 9, 12), (2, 3, 7, 13, 15)
Phase 1, Split-bit: 1000b: (0, 1, 4, 5, 5), (8, 10, 11), (2, 3, 6, 7), (9, 12, 13, 15)
Phase 2, Split-bit: 0100b: (0, 1, 2, 3), (8, 9, 10, 11), (4, 5, 5, 6, 7), (12, 13, 15)
Simple index: 0: 0-3, 1: 8-11, 2: 4-7, 3: 12-15]
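The phases can be sketched in a few lines, using the sixteen 4-bit example values from the slide's figure. This is a reading of the figure under stated assumptions, not the MonetDB implementation: the pairing rule (slice i merges with slice i XOR 2^(phase-1)) is inferred from the example, and the function names are hypothetical.

```python
from heapq import merge


def merge_split(a, b, split_bit):
    """Merge two sorted slices, then split the result on one bit of the value."""
    merged = list(merge(a, b))
    low = [v for v in merged if not v & split_bit]
    high = [v for v in merged if v & split_bit]
    return low, high


def ims_phase(slices, phase, value_bits):
    """Run one phase (1-based) over all slices and return the new slices."""
    split_bit = 1 << (value_bits - phase)   # phase 1 uses the most significant bit
    stride = 1 << (phase - 1)               # assumed pairing: i with i XOR stride
    out = list(slices)
    for i in range(len(slices)):
        j = i ^ stride
        if i < j:
            out[i], out[j] = merge_split(slices[i], slices[j], split_bit)
    return out


splits = [[11, 0, 5], [1, 8, 5, 4, 10], [6, 9, 12], [3, 13, 2, 7, 15]]
slices = [sorted(s) for s in splits]             # Load & Sort
after_phase1 = ims_phase(slices, 1, value_bits=4)        # split-bit 1000b
after_phase2 = ims_phase(after_phase1, 2, value_bits=4)  # split-bit 0100b
index = {i: (s[0], s[-1]) for i, s in enumerate(after_phase2)}

print(after_phase2)  # [[0, 1, 2, 3], [8, 9, 10, 11], [4, 5, 5, 6, 7], [12, 13, 15]]
print(index)         # {0: (0, 3), 1: (8, 11), 2: (4, 7), 3: (12, 15)}
```

Each call to `merge_split` touches only two slices, so the reorganization work attached to any single job stays bounded, whereas the final step of an ordinary merge sort has to rewrite the whole column. After the last phase the slices hold disjoint value ranges, which is what makes the simple range index sufficient.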

EVALUATION

Setup

•  Single-Machine Experiments
– Embarrassingly parallel
– No distributed reorganization or partitioning
•  MonetDB (hacked to support IMS)
•  Hadoop
•  2 GB file of 5 integer attributes: 107,374,182 tuples.

•  See paper for more details
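As a quick sanity check on the tuple count, the figure is consistent with 2 GB read as 2 GiB and each integer attribute occupying 4 bytes. Both of those are assumptions; the slide does not state the on-disk encoding.

```python
FILE_BYTES = 2 * 1024**3   # 2 GB, read here as 2 GiB (assumption)
INT_BYTES = 4              # assumed width of one integer attribute
ATTRIBUTES = 5

tuples = FILE_BYTES // (ATTRIBUTES * INT_BYTES)
print(f"{tuples:,}")       # 107,374,182
```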

The big picture

[Figure: time in seconds (0 to 800) against job sequence (log scale, 1 to 100). Series: SQL Pre-load, Incremental Reorganization (5/5), Incremental Reorganization (2/5), Invisible Loading (5/5), Invisible Loading (2/5), MapReduce]

Cumulative costs

[Figure: cumulative time spent in seconds (log scale, 100 to 100,000) against job sequence (log scale, 1 to 100). Labeled series: MapReduce]

Change the access pattern

[Figure: time in seconds (0 to 800) against job sequence, shown on a log scale for jobs 1 to 10 and on a linear scale for jobs 83 to 93. Labeled series: MapReduce]

Further Evaluation (Paper)

•  In-depth study of IMS
– Comparison with Cracking and Pre-sorting
– Effect of integrating lightweight compression into IMS
•  Little mini-experiments
– Insertion vs. Copy
– Processing in DB vs. using DB as a fast access medium with all processing in MapReduce

Conclusion: Lessons Learned

•  Engineering Nightmare
– Many complementing technologies
•  Manimal, Adaptive Merging …
– In the era of Big Data we need to design more modular, plug-n-play tools
•  Can of worms
– Most BigData problems look deceptively simple until you start mucking around.

Some problems are easier than others

Thanks!

Questions?

Why is loading this log file hard?

[Figure: excerpt of an Apache web server error log that interleaves many kinds of entries. Sample entries:]
[Sun Jul 15 21:02:20 2012] [error] [client 127.0.0.1] File does not exist: /Library/WebServer/Documents/favicon.ico
[Tue Jul 17 20:11:40 2012] [notice] caught SIGTERM, shutting down
[Fri Jul 20 19:13:19 2012] [warn] Init: Session Cache is not configured [hint: SSLSessionCache]

What is the base schema? Time, Type, Message ?

Message field varies depending on application!

Context-dependent Schema Awareness

Different analysts know the schema of what they are looking for and don't care about other log messages

Different tables for each type?
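A sketch of what context-dependent schemas could look like in practice: two analysts write two parsers over the same log, and each parser yields its own table. The parser classes, regular expressions, and the `load` helper are hypothetical; the three log lines follow the Apache error log layout and are used purely as sample input.

```python
import re

LOG = [
    "[Sun Jul 15 21:02:20 2012] [error] [client 127.0.0.1] "
    "File does not exist: /Library/WebServer/Documents/favicon.ico",
    "[Tue Jul 17 20:11:40 2012] [notice] caught SIGTERM, shutting down",
    "[Fri Jul 20 19:13:19 2012] [warn] Init: Session Cache is not configured "
    "[hint: SSLSessionCache]",
]


class RegexParser:
    """Shared plumbing: parse() reports whether the line fits this parser."""
    PATTERN = None

    def parse(self, line):
        match = self.PATTERN.fullmatch(line)
        self._fields = match.groups() if match else None
        return match is not None

    def get_field(self, field_id):
        return self._fields[field_id]


class MissingFileParser(RegexParser):
    """Analyst A: only 'File does not exist' errors (time, client, path)."""
    PATTERN = re.compile(
        r"\[(.+?)\] \[error\] \[client (.+?)\] File does not exist: (.+)")


class SeverityParser(RegexParser):
    """Analyst B: every entry, but only as (time, severity, message)."""
    PATTERN = re.compile(r"\[(.+?)\] \[(\w+)\] (.*)")


def load(parser, lines, field_ids):
    """Build one table per parser, named after the parser class."""
    rows = [tuple(parser.get_field(i) for i in field_ids)
            for line in lines if parser.parse(line)]
    return type(parser).__name__, rows


tables = dict([load(MissingFileParser(), LOG, [1, 2]),
               load(SeverityParser(), LOG, [1])])
print(tables["MissingFileParser"])
print(tables["SeverityParser"])
```

Neither analyst has to agree on a base schema for the whole file; lines that do not fit a parser are simply skipped by that parser's table.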

Date post:	01-Dec-2014
Category:	Technology
Upload:	daniel-abadi