Post on 17-Jun-2020
CompSci 516: Data Intensive Computing Systems
Lecture 21: Datalog
Instructor: Sudeepa Roy
Duke CS, Fall 2016
Announcement
• HW3 due next Wednesday: 11/16
Today
• Datalog: for recursion in database queries
• A quick look at Incremental View Maintenance (IVM)
Reading Material: Datalog
Optional:
1. The Datalog chapters in the “Alice Book”: Foundations of Databases, Abiteboul-Hull-Vianu. Available online: http://webdam.inria.fr/Alice/
2. Datalog tutorial, SIGMOD 2011: “Datalog and Emerging Applications: An Interactive Tutorial”
Brief History of Datalog
• Motivated by Prolog
  – started back in the 1970s-80s
  – then quiet for a long time
• A long argument in the database community over whether recursion should be supported in query languages
  – “No practical applications of recursive query theory ... have been found to date” (Michael Stonebraker, 1998, Readings in Database Systems, 3rd Edition, Stonebraker and Hellerstein, eds.)
  – Recent work by Hellerstein et al. on Datalog extensions to build networking protocols and distributed systems. [Link]
Datalog is resurging!
• A number of papers and tutorials in DB conferences
• Applications in
  – data integration, declarative networking, program analysis, information extraction, network monitoring, security, and cloud computing
• Systems supporting Datalog in both academia and industry:
  – Lixto (information extraction)
  – LogicBlox (enterprise decision automation)
  – Semmle (program analysis)
  – BOOM/Dedalus (Berkeley)
  – Coral
  – LDL++
Recall our drinker example in RC (Lecture 4)
Schema: Likes(drinker, beer), Frequents(drinker, bar), Serves(bar, beer)
Find drinkers that frequent some bar that serves some beer they like.
Q(x) = ∃y. ∃z. Frequents(x, y) ∧ Serves(y, z) ∧ Likes(x, z)
The drinker example is from slides by Profs. Balazinska and Suciu and the [GUW] book
Write it as a Datalog Rule
Find drinkers that frequent some bar that serves some beer they like.
RC: Q(x) = ∃y. ∃z. Frequents(x, y) ∧ Serves(y, z) ∧ Likes(x, z)
Datalog: Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
• Quick differences:
  – Uses “:-” not “=”
  – No need for ∃ (assumed by default)
  – Uses “,” on the right-hand side (RHS)
  – Everything on the RHS of :- is assumed to be combined with ∧ by default
  – ∀ and ⇒ are not allowed; they have to be expressed using negation ¬
  – Standard “Datalog” does not allow negation
  – Negation is allowed in Datalog with negation
• How to specify disjunction (OR / ∨)?
Example: OR in Datalog
Find drinkers that (a) either frequent some bar that serves some beer they like, (b) or like beer “BestBeer”
Datalog:
Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
Q(x) :- Likes(x, “BestBeer”)
RC: Q(x) = [∃y. ∃z. Frequents(x, y) ∧ Serves(y, z) ∧ Likes(x, z)] ∨ [Likes(x, “BestBeer”)]
Example: OR in Datalog
Find drinkers that (a) either frequent some bar that serves some beer they like, (b) or like beer “BestBeer”, (c) or frequent bars that “Joe” frequents
Datalog:
JoeFrequents(w) :- Frequents(“Joe”, w)
Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
Q(x) :- Likes(x, “BestBeer”)
Q(x) :- Frequents(x, w), JoeFrequents(w)
RC: Q(x) = [∃y. ∃z. Frequents(x, y) ∧ Serves(y, z) ∧ Likes(x, z)] ∨ [Likes(x, “BestBeer”)] ∨ [∃w. Frequents(x, w) ∧ Frequents(“Joe”, w)]
• To specify “OR”, write multiple rules with the same “Head”
• Next: terminology for Datalog
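As a rough illustration of "OR = several rules with the same head", the four rules can be evaluated by hand in Python, with one set comprehension per rule and a union at the end. The relation contents below are a made-up toy instance (an assumption, not data from the lecture), and the variable names are chosen only for this sketch.

```python
# Hypothetical toy instance of the three EDB relations (not from the lecture).
likes = {("Ann", "BestBeer"), ("Bob", "Lager"), ("Cat", "Stout"), ("Dan", "Ale")}
frequents = {("Joe", "BarA"), ("Bob", "BarB"), ("Cat", "BarA"), ("Dan", "BarC")}
serves = {("BarB", "Lager"), ("BarA", "Ale")}

# JoeFrequents(w) :- Frequents("Joe", w)
joe_frequents = {bar for (drinker, bar) in frequents if drinker == "Joe"}

# Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
q1 = {x for (x, y) in frequents for (y2, z) in serves
      if y == y2 and (x, z) in likes}

# Q(x) :- Likes(x, "BestBeer")
q2 = {x for (x, beer) in likes if beer == "BestBeer"}

# Q(x) :- Frequents(x, w), JoeFrequents(w)
q3 = {x for (x, w) in frequents if w in joe_frequents}

# Rules with the same head are combined by union: this is the "OR".
q = q1 | q2 | q3
print(sorted(q))
```

Each rule body is a conjunction (the `if` conditions), while the union across the three `Q` rules plays the role of ∨ in the RC formula.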
Datalog Rules
• Each rule is of the form Head :- Body
• Each variable in the head of each rule must appear in the body of the rule
Four rules:
JoeFrequents(w) :- Frequents(“Joe”, w)
Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
Q(x) :- Likes(x, “BestBeer”)
Q(x) :- Frequents(x, w), JoeFrequents(w)
• In the first rule, JoeFrequents(w) is the head and Frequents(“Joe”, w) is the body
• Each relation occurrence, such as Frequents(x, y), is an atom; x, y, z, w are variables
EDBs and IDBs
JoeFrequents(w) :- Frequents(“Joe”, w)
Q(x) :- Frequents(x, y), Serves(y, z), Likes(x, z)
Q(x) :- Likes(x, “BestBeer”)
Q(x) :- Frequents(x, w), JoeFrequents(w)
• Intensional DataBases (IDBs)
  – Relations that are derived
  – Can be intermediate or final output tables
  – e.g. JoeFrequents, Q
  – Can be on the LHS or RHS (e.g. JoeFrequents)
• Extensional DataBases (EDBs)
  – Input relation names
  – e.g. Likes, Frequents, Serves
  – can only be on the RHS of a rule
• A tuple in an EDB or an IDB is a FACT: it either belongs to a given EDB relation, or is derived in an IDB relation
Graph Example
[Figure: directed graph on the vertices a, b, c, d, e]
E (edge relation)
V1 V2
a  c
b  a
b  d
c  d
d  a
d  e
Example 1
Write a Datalog program to find paths of length two (output start and finish vertices)
P2(x, y) :- E(x, z), E(z, y)
Example 1: Execution
P2
V1 V2
a  d
b  a
b  c
b  e
c  a
c  e
d  c
Same as E ⨝_{E.V2 = E.V1} E (a self-join of E, projected onto the start and finish vertices)
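The single rule for P2 is just a self-join of E, which can be checked with a short Python sketch over the edge relation of the graph example. The set-of-pairs encoding is an assumption made for illustration.

```python
# Edge relation E from the graph example, encoded as a set of (V1, V2) pairs.
E = {("a", "c"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "a"), ("d", "e")}

# P2(x, y) :- E(x, z), E(z, y)
# Join E with itself on the middle vertex z and keep only (x, y).
P2 = {(x, y) for (x, z) in E for (z2, y) in E if z == z2}

print(sorted(P2))
```

Note that (b, a) appears in P2 through the path b → d → a, even though (b, a) is also a direct edge.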
Example 2
Write a Datalog program to find all pairs of vertices (u, v) such that v is reachable from u
• Can you write a SQL/RA/RC query for reachability?
• NO: RA, RC, and SQL without recursion cannot express reachability (recursive SQL, i.e. WITH RECURSIVE, is a later extension)
Example 2
Write a Datalog program to find all pairs of vertices (u, v) such that v is reachable from u
Option 1 (linear):
R(x, y) :- E(x, y)
R(x, y) :- E(x, z), R(z, y)
Option 2 (linear):
R(x, y) :- E(x, y)
R(x, y) :- R(x, z), E(z, y)
Option 3 (non-linear):
R(x, y) :- E(x, y)
R(x, y) :- R(x, z), R(z, y)
Linear Datalog
• Linear rule
  – at most one atom in the body that is recursive with the head of the rule
  – e.g. R(x, y) :- E(x, z), R(z, y)
• Linear Datalog program
  – if all rules are linear
  – like linear recursion
• Top-down and bottom-up evaluation are possible
  – we will focus on bottom-up
Example 2: Execution
Option 1:
R(x, y) :- E(x, y)
R(x, y) :- E(x, z), R(z, y)
Iteration 1: R = E (vertices reachable in 1 hop by a direct edge)
  R = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}
Iteration 2 (vertices reachable in at most 2 hops), new tuples in R:
  (a,d), (b,c), (b,e), (c,a), (c,e), (d,c)
Iteration 3 (vertices reachable in at most 3 hops), new tuples in R:
  (a,e), (a,a), (c,c), (d,d)
Iteration 4: R unchanged, so stop
Examples 3 and 4
Write a Datalog program to find all vertices reachable from b
R(x, y) :- E(x, y)
R(x, y) :- E(x, z), R(z, y)
QB(y) :- R(“b”, y)
Write a Datalog program to find all vertices u reachable from themselves, i.e. R(u, u)
R(x, y) :- E(x, y)
R(x, y) :- E(x, z), R(z, y)
Q(x) :- R(x, x)
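Both programs reuse the reachability relation R and then apply one extra non-recursive rule. A minimal Python sketch, assuming E is stored as a set of pairs and using a hypothetical helper `reachability` that iterates the two R rules until nothing changes:

```python
# Edge relation E from the graph example.
E = {("a", "c"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "a"), ("d", "e")}

def reachability(edges):
    """Iterate R(x,y) :- E(x,y) and R(x,y) :- E(x,z), R(z,y) to a fixpoint."""
    r = set(edges)
    while True:
        derived = {(x, y) for (x, z) in edges for (z2, y) in r if z == z2}
        new_r = r | derived
        if new_r == r:        # no new facts: fixpoint reached
            return r
        r = new_r

R = reachability(E)

# QB(y) :- R("b", y)   -- the constant "b" becomes a selection on the first column
QB = {y for (x, y) in R if x == "b"}

# Q(x) :- R(x, x)      -- the repeated variable becomes an equality test
Q = {x for (x, y) in R if x == y}

print(sorted(QB), sorted(Q))
```

On this graph b reaches every other vertex but not itself, and the vertices on the cycle a → c → d → a are exactly those reachable from themselves.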
Termination of a Datalog Program
Q. A Datalog program always terminates. Why?
• Because the values of the variables come from the “active domain” of the input relations (EDBs)
• Active domain = the (finite) set of values from the (possibly infinite) domain that appear in the instance of a database
  – e.g. age can be any integer (infinite), but only finitely many ages appear in R(id, name, age)
• Therefore the number of possible tuples in each of the IDBs is finite
• e.g. in the reachability example R(x, y), the values of x and y come from {a, b, c, d, e}
  – at most 5 x 5 = 25 tuples possible in the IDB R(x, y)
  – in every iteration except the last, at least one new tuple is added to at least one IDB
  – must stop after finitely many steps
  – e.g. the maximum number of iterations in the reachability example for any graph with five vertices is 25 (it was only 4 in our example)
Bottom-up Evaluation of a Datalog Program
• Naïve evaluation
• Semi-naïve evaluation
Naïve evaluation
E = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}
• In all subsequent iterations, check if any of the rules can be applied
• Take the union of all the rules with the same head IDB
Iteration 1: R = E = R1 (say)
Iteration 2: R = E ∪ (E ⨝ R1) = R2 (say)
  R2 = R1 ∪ {(a,d), (b,c), (b,e), (c,a), (c,e), (d,c)}
  R1 ≠ R2 so continue
Iteration 3: R = E ∪ (E ⨝ R2) = R3 (say)
  R3 = R2 ∪ {(a,e), (a,a), (c,c), (d,d)}
  R2 ≠ R3 so continue
Iteration 4: R = E ∪ (E ⨝ R3) = R4 (say)
  R3 = R4 so STOP
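The four iterations above can be reproduced with a small Python sketch of naïve evaluation for the Option 1 program. The function name `naive_reachability` and the `history` list are illustrative choices, not part of the lecture.

```python
# Edge relation E from the graph example.
E = {("a", "c"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "a"), ("d", "e")}

def naive_reachability(edges):
    """Naive evaluation: recompute R = E ∪ (E ⨝ R_prev) from scratch each round."""
    history = [set(edges)]                 # iteration 1: R1 = E
    while True:
        prev = history[-1]
        # E ⨝ R_prev: join on E.V2 = R.V1, keep (E.V1, R.V2)
        joined = {(x, y) for (x, z) in edges for (z2, y) in prev if z == z2}
        cur = set(edges) | joined          # union of both rules for head R
        history.append(cur)
        if cur == prev:                    # R unchanged: stop
            return history

history = naive_reachability(E)
print([len(r) for r in history])
```

The list holds R1 through R4. Every round re-derives all earlier tuples, which is the inefficiency that semi-naïve evaluation avoids.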
Problem with Naïve Evaluation
• The same IDB facts are discovered again and again
  – e.g. in each iteration all edges in E are included in R
  – In the 2nd-4th iterations, the first six tuples in R are computed repeatedly
• Solution: Semi-Naïve Evaluation
• Work only with the new tuples generated in the previous iteration
Semi-Naïve evaluation
E = {(a,c), (b,a), (b,d), (c,d), (d,a), (d,e)}
Initially: R = ∅
Iteration 1: R = E = R1 (say), ΔR1 = R1
Iteration 2: R = R1 ∪ (E ⨝ ΔR1) = R2 (say)
  ΔR2 = R2 - R1 = {(a,d), (b,c), (b,e), (c,a), (c,e), (d,c)}
  ΔR2 ≠ ∅ so continue
Iteration 3: R = R2 ∪ (E ⨝ ΔR2) = R3 (say)
  ΔR3 = R3 - R2 = {(a,e), (a,a), (c,c), (d,d)}
  ΔR3 ≠ ∅ so continue
Iteration 4: R = R3 ∪ (E ⨝ ΔR3) = R4 (say)
  ΔR4 = R4 - R3 = ∅ (check!) so STOP
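A minimal Python sketch of semi-naïve evaluation for the same linear program follows. It joins E only with the delta from the previous round; the names `semi_naive_reachability` and `deltas` are assumptions made for this illustration.

```python
# Edge relation E from the graph example.
E = {("a", "c"), ("b", "a"), ("b", "d"), ("c", "d"), ("d", "a"), ("d", "e")}

def semi_naive_reachability(edges):
    """Semi-naive evaluation of R(x,y) :- E(x,y) and R(x,y) :- E(x,z), R(z,y)."""
    r = set(edges)                  # iteration 1: R1 = E
    delta = set(edges)              # ΔR1 = R1
    deltas = [set(delta)]
    while delta:
        # Join E only with the tuples that were new in the previous round.
        derived = {(x, y) for (x, z) in edges for (z2, y) in delta if z == z2}
        delta = derived - r         # keep only genuinely new facts
        r |= delta
        deltas.append(set(delta))
    return r, deltas

R, deltas = semi_naive_reachability(E)
print(len(R), [len(d) for d in deltas])
```

The result is the same 16-tuple relation as in naïve evaluation, but each join reads at most six tuples of R instead of the whole relation.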
Incremental View Maintenance (IVM)
• Why did the semi-naïve algorithm work?
• Because of the generic technique of Incremental View Maintenance (IVM)
• Suppose you have
  – a database D = (R1, R2, R3)
  – a query Q that gives answer Q(D)
  – D = (R1, R2, R3) gets updated to D’ = (R1’, R2’, R3’)
  – e.g. R1’ = R1 ∪ ΔR1 (insertion), R2’ = R2 - ΔR2 (deletion), etc.
• IVM: Can you compute Q(D’) using Q(D) and ΔR1, ΔR2, ΔR3, without computing it from scratch (i.e. without rerunning the query Q)?
IVM Example: Selection
R
V1 V2
a  c
b  a
d  a
c  d
ΔR
V1 V2
b  d
d  e
R’ = R ∪ ΔR
V1 V2
a  c
b  a
d  a
c  d
b  d
d  e
σ_{V1=b} R = {(b,a)}, σ_{V1=b} ΔR = {(b,d)}, σ_{V1=b} R’ = {(b,a), (b,d)}
• σ_{V1=b}(R ∪ ΔR) = σ_{V1=b} R ∪ σ_{V1=b} ΔR
• It suffices to apply the selection condition only on ΔR
  – and include the result with the original solution
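The selection identity can be checked directly on the example relations. This is a small sketch; `select_v1` is a hypothetical helper for σ_{V1=value}.

```python
# R and ΔR from the selection example, as sets of (V1, V2) pairs.
R = {("a", "c"), ("b", "a"), ("d", "a"), ("c", "d")}
delta_R = {("b", "d"), ("d", "e")}

def select_v1(rel, value):
    """σ_{V1=value}: keep the tuples whose first column equals value."""
    return {t for t in rel if t[0] == value}

old_view = select_v1(R, "b")                      # materialized before the update
new_view = old_view | select_v1(delta_R, "b")     # incremental: touch only ΔR
recomputed = select_v1(R | delta_R, "b")          # from scratch, for comparison

print(sorted(new_view), new_view == recomputed)
```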
IVM Example: Projection
R
V1 V2
a  c
b  a
d  a
d  e
ΔR
V1 V2
b  d
c  d
R’ = R ∪ ΔR
V1 V2
a  c
b  a
d  a
d  e
b  d
c  d
π_{V1} R = {a, b, d}, π_{V1} ΔR = {b, c}, π_{V1} R’ = {a, b, d, c}
• π_{V1}(R ∪ ΔR) = π_{V1} R ∪ π_{V1} ΔR
• It suffices to apply the projection only on ΔR
  – and include the result with the original solution
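The same check works for projection under set semantics with insertions only, which is the case shown on the slide. The helper name `project_v1` is an assumption for this sketch.

```python
# R and ΔR from the projection example, as sets of (V1, V2) pairs.
R = {("a", "c"), ("b", "a"), ("d", "a"), ("d", "e")}
delta_R = {("b", "d"), ("c", "d")}

def project_v1(rel):
    """π_{V1}: keep the first column and drop duplicates (set semantics)."""
    return {v1 for (v1, _v2) in rel}

old_view = project_v1(R)
new_view = old_view | project_v1(delta_R)   # incremental: project only ΔR
recomputed = project_v1(R | delta_R)        # from scratch, for comparison

print(sorted(new_view), new_view == recomputed)
```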
IVM Example: Join
(R ∪ ΔR) ⨝ (S ∪ ΔS) = (R ⨝ S) ∪ (R ⨝ ΔS) ∪ (ΔR ⨝ S) ∪ (ΔR ⨝ ΔS)
R(A, B) = {(a1, b1)}, ΔR = {(a2, b2), (a3, b1)}, R’ = R ∪ ΔR
S(B, C) = {(b1, c1)}, ΔS = {(b2, c2)}, S’ = S ∪ ΔS
R ⨝ S = {(a1, b1, c1)}
R ⨝ ΔS = ∅
ΔR ⨝ S = {(a3, b1, c1)}
ΔR ⨝ ΔS = {(a2, b2, c2)}
R’ ⨝ S’
A  B  C
a1 b1 c1
a3 b1 c1
a2 b2 c2
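The four-way decomposition of the join can be verified on the example with a short sketch. `join` is a hypothetical helper for the natural join on B.

```python
def join(r, s):
    """Natural join of r(A, B) and s(B, C) on the shared column B."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

# Relations and deltas from the join example.
R, delta_R = {("a1", "b1")}, {("a2", "b2"), ("a3", "b1")}
S, delta_S = {("b1", "c1")}, {("b2", "c2")}

old_view = join(R, S)                               # materialized before the update
incremental = (old_view
               | join(R, delta_S)
               | join(delta_R, S)
               | join(delta_R, delta_S))            # only the three delta joins are new work
recomputed = join(R | delta_R, S | delta_S)         # from scratch, for comparison

print(sorted(incremental), incremental == recomputed)
```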
IVM for Linear Datalog Rule
(R ∪ ΔR) ⨝ (S ∪ ΔS) = (R ⨝ S) ∪ (R ⨝ ΔS) ∪ (ΔR ⨝ S) ∪ (ΔR ⨝ ΔS)
Example with S’ = S (no change to S):
R(A, B) = {(a1, b1)}, ΔR = {(a2, b2), (a3, b1)}, R’ = R ∪ ΔR
S(B, C) = {(b1, c1)}
R ⨝ S = {(a1, b1, c1)}
R’ ⨝ S’ = {(a1, b1, c1), (a3, b1, c1)}
• R(x, y) :- E(x, z), R(z, y)
  – i.e. Rnew = E ⨝ R
• But E is an EDB
  – ΔE = ∅
• Therefore, E ⨝ (R ∪ ΔR) = (E ⨝ R) ∪ (E ⨝ ΔR)
• It suffices to join with the difference ΔR and include the result E ⨝ R from the previous round
• Advantage of having a “linear rule”
(Non-recursive) Datalog with Negation
• Recursion and negation together make Datalog execution complicated
  – there is a notion called “stratified semantics” for this purpose
  – compute IDB relations in strata/layers before taking a negation
  – not covered in this class
• We will only do negation for non-recursive Datalog
Unsafe/Safe Datalog Rules
Find drinkers who like beer “BestBeer”
Q(x) :- Likes(x, “BestBeer”)
Find drinkers who DO NOT like beer “BestBeer”
Q(x) :- ¬Likes(x, “BestBeer”)
• What is the problem with the second rule?
• What should this rule return?
  – names of all drinkers in the world?
  – names of all drinkers in the USA?
  – names of all drinkers in Durham?
Problem with Negation in Datalog Rules
• What is the problem with the rule Q(x) :- ¬Likes(x, “BestBeer”)?
• Dependent on the “domain” of drinkers
  – domain-dependent
  – infinite answers possible too
    • keep generating “names”
• Solution:
• Restrict to the “active domain” of drinkers from the input Likes (or Frequents) relation
  – “domain-independence”
  – same finite answer always
• Becomes a “safe rule”:
Q(x) :- Likes(x, y), ¬Likes(x, “BestBeer”)
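A short Python sketch of the safe rule: the positive atom Likes(x, y) supplies the candidate values of x (the active domain), and the negated atom only filters them. The contents of `likes` are a hypothetical toy instance, not data from the lecture.

```python
# Hypothetical toy instance of Likes(drinker, beer).
likes = {("Ann", "BestBeer"), ("Ann", "Ale"), ("Bob", "Ale"), ("Cat", "Stout")}

# Drinkers for which Likes(x, "BestBeer") holds.
best_beer_fans = {x for (x, beer) in likes if beer == "BestBeer"}

# Q(x) :- Likes(x, y), ¬Likes(x, "BestBeer")
# x ranges only over drinkers that appear in Likes, so the answer is finite
# and does not depend on any larger "domain" of possible names.
q = {x for (x, y) in likes if x not in best_beer_fans}

print(sorted(q))
```

The unsafe rule has no such generator for x: there is no finite set to iterate over, which is exactly the domain-dependence problem described above.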
RC → Datalog with negation → SQL
Revisit example from Lecture 4 (Ack: slides by Profs. Balazinska and Suciu)
Query: Find drinkers that like some beer (so much) that they frequent all bars that serve it
Q(x) = ∃y. Likes(x, y) ∧ ∀z.(Serves(z, y) ⇒ Frequents(x, z))
Step 1: Replace ∀ with ∃ using de Morgan’s Laws
  P ⇒ Q is the same as ¬P ∨ Q
  ∀x P(x) is the same as ¬∃x ¬P(x)
  ¬(¬P ∨ Q) is the same as P ∧ ¬Q
Q(x) ≡ ∃y. Likes(x, y) ∧ ∀z.(¬Serves(z, y) ∨ Frequents(x, z))
Q(x) = ∃y. Likes(x, y) ∧ ¬∃z.(Serves(z, y) ∧ ¬Frequents(x, z))
Step 2 (new): Make all subqueries domain independent
Q(x) = ∃y. Likes(x, y) ∧ ¬∃z.(Likes(x, y) ∧ Serves(z, y) ∧ ¬Frequents(x, z))
Step 3 (new): Create a Datalog rule for some subexpressions of the form ∃x ∃y … R(…) ∧ S(…) ∧ T(…) ∧ …
Q(x) = ∃y. Likes(x, y) ∧ ¬∃z.(Likes(x, y) ∧ Serves(z, y) ∧ ¬Frequents(x, z))
The subexpression ∃z.(Likes(x, y) ∧ Serves(z, y) ∧ ¬Frequents(x, z)) becomes H(x, y):
H(x, y) :- Likes(x, y), Serves(z, y), not Frequents(x, z)
Q(x) :- Likes(x, y), not H(x, y)
Step 4: Write it in SQL
H(x, y) :- Likes(x, y), Serves(z, y), not Frequents(x, z)
Q(x) :- Likes(x, y), not H(x, y)

SELECT DISTINCT L.drinker
FROM Likes L
WHERE NOT EXISTS
  (SELECT *
   FROM Likes L2, Serves S
   WHERE L2.drinker = L.drinker AND L2.beer = L.beer
     AND L2.beer = S.beer
     AND NOT EXISTS (SELECT *
                     FROM Frequents F
                     WHERE F.drinker = L2.drinker
                       AND F.bar = S.bar))
Sometimes the SQL query can be simplified by using an unsafe Datalog rule. Correctness is ensured by the safe outermost rule.
H(x, y) :- Serves(z, y), not Frequents(x, z)      (unsafe rule: Likes(x, y) is dropped from the body)
Q(x) :- Likes(x, y), not H(x, y)

SELECT DISTINCT L.drinker
FROM Likes L
WHERE NOT EXISTS
  (SELECT *
   FROM Serves S
   WHERE L.beer = S.beer
     AND NOT EXISTS (SELECT *
                     FROM Frequents F
                     WHERE F.drinker = L.drinker
                       AND F.bar = S.bar))
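To see that the safe query and its simplified form agree, both can be run on a small in-memory SQLite database from Python. This is only a sketch: the table contents are a hypothetical toy instance, and the names `SAFE_SQL` and `SIMPLIFIED_SQL` are introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Likes(drinker TEXT, beer TEXT);
    CREATE TABLE Frequents(drinker TEXT, bar TEXT);
    CREATE TABLE Serves(bar TEXT, beer TEXT);
""")
# Hypothetical toy data (not from the lecture).
conn.executemany("INSERT INTO Likes VALUES (?, ?)",
                 [("Ann", "Ale"), ("Bob", "Ale"), ("Bob", "Lager"), ("Cat", "Stout")])
conn.executemany("INSERT INTO Serves VALUES (?, ?)",
                 [("BarA", "Ale"), ("BarB", "Ale"), ("BarB", "Lager")])
conn.executemany("INSERT INTO Frequents VALUES (?, ?)",
                 [("Ann", "BarA"), ("Ann", "BarB"), ("Bob", "BarA"), ("Cat", "BarA")])

# Query derived from the safe rule H(x,y) :- Likes(x,y), Serves(z,y), not Frequents(x,z).
SAFE_SQL = """
SELECT DISTINCT L.drinker FROM Likes L
WHERE NOT EXISTS
  (SELECT * FROM Likes L2, Serves S
   WHERE L2.drinker = L.drinker AND L2.beer = L.beer AND L2.beer = S.beer
     AND NOT EXISTS (SELECT * FROM Frequents F
                     WHERE F.drinker = L2.drinker AND F.bar = S.bar))
"""

# Simplified query: the inner copy of Likes is dropped.
SIMPLIFIED_SQL = """
SELECT DISTINCT L.drinker FROM Likes L
WHERE NOT EXISTS
  (SELECT * FROM Serves S
   WHERE L.beer = S.beer
     AND NOT EXISTS (SELECT * FROM Frequents F
                     WHERE F.drinker = L.drinker AND F.bar = S.bar))
"""

safe = {row[0] for row in conn.execute(SAFE_SQL)}
simplified = {row[0] for row in conn.execute(SIMPLIFIED_SQL)}
print(sorted(safe), sorted(simplified))
```

In this instance Ann frequents every bar that serves Ale, Cat qualifies vacuously because no bar serves Stout, and Bob is excluded because he does not frequent BarB.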
An Overview of Data Provenance with Annotations
Selected/adapted slides from the keynote by Prof. Val Tannen, EDBT 2010
(optional material: the full slide deck is available on Val’s webpage)
Optional/additional slides
Lineage
• [Cui-Widom-Wiener ’00]
• Lineage:
  – Given a data item in the view
  – Determine the source data that produced it
  – The process by which it was produced
Applications
• OLAP/OLAM (mining)
  – origin of anomalous data to verify reliability
• Scientific databases
  – how an answer was produced from raw data
• Online network monitoring and diagnosis systems
  – identify a faulty sensor from network monitors
• Cleansed data feedback
  – clean raw data and send a report to the sources
• Materialized view schema evolution
  – if the view schema is changed (new column added), recomputation may not be necessary
• View update
  – translate view updates to base data updates
Provenance Example
HasAsthma (annotation)
Ann  x1
Bob  x2
Friend (annotation)
Ann Joe  y1
Ann Tom  y2
Bob Tom  y3
Smoker (annotation)
Joe  z1
Tom  z2
Boolean query Q() :- HasAsthma(x), Friend(x, y), Smoker(y)
• Each annotation x_i, y_j, z_k ∈ {0, 1}
• 1 = present
• 0 = absent
• There are three alternative ways to derive the “true” answer
Provenance F_{Q,D} = x1 y1 z1 + x1 y2 z2 + x2 y3 z2
Application to “Deletion Propagation”
HasAsthma: Ann (x1), Bob (x2)
Friend: Ann Joe (y1), Ann Tom (y2), Bob Tom (y3)
Smoker: Joe (z1), Tom (z2)
Boolean query Q: ∃x ∃y HasAsthma(x) ∧ Friend(x, y) ∧ Smoker(y)
Provenance F_{Q,D} = x1 y1 z1 + x1 y2 z2 + x2 y3 z2 [Green et al. ’07]
• Each annotation x_i, y_j, z_k ∈ {0, 1}
• What happens if Ann is deleted?
  – Does the answer change from true to false?
• No need to re-evaluate the query
  – just plug in x1 = 0 and evaluate: F_{Q,D} = 0 + 0 + x2 y3 z2 = 1, so the answer stays true
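Deletion propagation with the provenance expression amounts to evaluating a Boolean formula. A minimal sketch, assuming each annotation is kept in a dictionary keyed by its index (1 = present, 0 = absent); the function name `provenance` is chosen only for this illustration.

```python
def provenance(x, y, z):
    """F_{Q,D} = x1*y1*z1 + x1*y2*z2 + x2*y3*z2, with * as AND and + as OR."""
    return bool((x[1] and y[1] and z[1])
                or (x[1] and y[2] and z[2])
                or (x[2] and y[3] and z[2]))

# All tuples present.
x = {1: 1, 2: 1}          # HasAsthma: Ann, Bob
y = {1: 1, 2: 1, 3: 1}    # Friend: (Ann, Joe), (Ann, Tom), (Bob, Tom)
z = {1: 1, 2: 1}          # Smoker: Joe, Tom
before = provenance(x, y, z)

# Delete Ann from HasAsthma: plug in x1 = 0 instead of re-running the query.
after_ann_deleted = provenance({1: 0, 2: 1}, y, z)

# Delete both Ann and Bob: no derivation survives.
after_both_deleted = provenance({1: 0, 2: 0}, y, z)

print(before, after_ann_deleted, after_both_deleted)
```

After deleting Ann, the derivation through Bob, (Bob, Tom) and Tom still holds, so the Boolean answer does not change.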