CompSci 590.02 Instructor: Ashwin Machanavajjhala

Date post: 29-Nov-2021
Algorithms for Big‐Data Management CompSci 590.02 Instructor: Ashwin Machanavajjhala 1 Lecture 1 : 590.02 Spring 13
Transcript
Page 1: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Algorithms for Big-Data Management

CompSci 590.02
Instructor: Ashwin Machanavajjhala

Lecture 1 : 590.02 Spring 13

Page 2: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Administrivia
http://www.cs.duke.edu/courses/spring13/compsci590.2/
•  Tue/Thu 3:05–4:20 PM
•  "Reading Course + Project"
    –  No exams!
    –  Every class is based on 1 (or 2) assigned papers that students must read.
•  Projects: (50% of grade)
    –  Individual or groups of size 2-3
•  Class participation + assignments (other 50%)
•  Office hours: by appointment

Page 3: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Administrivia
•  Projects: (50% of grade)
    –  Ideas will be posted in the coming weeks
•  Goals:
    –  Literature review
    –  Some original research/implementation
•  Timeline (details will be posted on the website soon)
    –  ≤ Feb 12: Choose project (ideas will be posted … new ideas welcome)
    –  Feb 21: Project proposal (1-4 pages describing the project)
    –  Mar 21: Mid-project review (2-3 page report on progress)
    –  Apr 18: Final presentations and submission (6-10 page conference-style paper + 20 minute talk)

Page 4: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Why should you take this course?
•  Industry, academic and government research identifies the value of analyzing large data collections in all walks of life.
    –  "What Next? A Half-Dozen Data Management Research Goals for Big Data and Cloud", Surajit Chaudhuri, Microsoft Research
    –  "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute Report, 2011

Page 5: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Why should you take this course?
•  Very active field and tons of interesting research.
   We will read papers in:
    –  Data Management
    –  Theory
    –  Machine Learning
    –  …

Page 6: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Why should you take this course?
•  Intro to research by working on a cool project
    –  Read scientific papers
    –  Formulate a problem
    –  Perform a scientific evaluation

Page 7: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Today
•  Course overview
•  An algorithm for sampling

Page 8: CompSci 590.02 Instructor: Ashwin Machanavajjhala

INTRODUCTION


Page 9: CompSci 590.02 Instructor: Ashwin Machanavajjhala

What is Big Data?

Page 10: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Source: http://visual.ly/what-big-data

Page 11: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Source: http://visual.ly/what-big-data

Page 12: CompSci 590.02 Instructor: Ashwin Machanavajjhala

3 Key Trends
•  Increased data collection
•  (Shared nothing) Parallel processing frameworks on commodity hardware
•  Powerful analysis of trends by linking data from heterogeneous sources

Page 13: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Big-Data impacts all aspects of our life

Page 14: CompSci 590.02 Instructor: Ashwin Machanavajjhala

The value in Big-Data …
•  Recommended links
•  Personalized News Interests
•  Top Searches

+250% clicks vs. editorial one size fits all
+79% clicks vs. randomly selected
+43% clicks vs. editor selected

Page 15: CompSci 590.02 Instructor: Ashwin Machanavajjhala

The value in Big-Data …
"If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year."
McKinsey Global Institute Report

Page 16: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Example: Google Flu

Page 17: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Source: http://www.ccs.neu.edu/home/amislove/twittermood/

Page 18: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Course Overview
•  Sampling
    –  Reservoir Sampling
    –  Sampling with indices
    –  Sampling from Joins
    –  Markov chain Monte Carlo sampling
    –  Graph Sampling & PageRank

Page 19: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Course Overview
•  Sampling
•  Streaming Algorithms
    –  Sketches
    –  Online Aggregation
    –  Windowed queries
    –  Online learning

Page 20: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Course Overview
•  Sampling
•  Streaming Algorithms
•  Parallel Architectures & Algorithms
    –  PRAM
    –  MapReduce
    –  Graph processing architectures: Bulk Synchronous Parallel and asynchronous models
    –  (Graph connectivity, Matrix Multiplication, Belief Propagation)

Page 21: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Course Overview
•  Sampling
•  Streaming Algorithms
•  Parallel Architectures & Algorithms
•  Joining datasets & Record Linkage
    –  Theta Joins: or how to optimally join two large datasets
    –  Clustering similar documents using minHash
    –  Identifying matching users across social networks
    –  Correlation Clustering
    –  Markov Logic Networks

Page 22: CompSci 590.02 Instructor: Ashwin Machanavajjhala

SAMPLING


Page 23: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Why Sampling?
•  Approximately compute quantities when
    –  Processing the entire dataset takes too long.
       How many tweets mention Obama?
    –  Computation is intractable.
       Number of satisfying assignments for a DNF.
    –  We do not have access, or it is expensive to get access, to the entire data.
       How many restaurants does Google know about?
       Number of users in Facebook whose birthday is today.
       What fraction of the population has the flu?

Page 24: CompSci 590.02 Instructor: Ashwin Machanavajjhala

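Zero-One Estimator Theorem
Input: A universe of items $U$ (e.g., all tweets)
       A subset $G \subseteq U$ (e.g., tweets mentioning Obama)
Goal: Estimate $\mu = |G| / |U|$
Algorithm:
•  Pick $N$ samples from $U$: $\{x_1, x_2, \ldots, x_N\}$
•  For each sample, let $Y_i = 1$ if $x_i \in G$, and $Y_i = 0$ otherwise.
•  Output: $Y = \frac{1}{N} \sum_{i=1}^{N} Y_i$
Theorem: Let $\varepsilon < 2$. If $N > \frac{1}{\mu} \cdot \frac{4 \ln(2/\delta)}{\varepsilon^2}$, then $\Pr[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu] > 1 - \delta$.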

Page 25: CompSci 590.02 Instructor: Ashwin Machanavajjhala

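Zero-One Estimator Theorem
Algorithm:
•  Pick $N$ samples from $U$: $\{x_1, x_2, \ldots, x_N\}$
•  For each sample, let $Y_i = 1$ if $x_i \in G$, and $Y_i = 0$ otherwise.
•  Output: $Y = \frac{1}{N} \sum_{i=1}^{N} Y_i$
Theorem: Let $\varepsilon < 2$. If $N > \frac{1}{\mu} \cdot \frac{4 \ln(2/\delta)}{\varepsilon^2}$, then $\Pr[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu] > 1 - \delta$.
Proof: Homework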
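The estimator above can be sketched in a few lines of Python. This is a minimal illustration, not the course's code: the function names, the toy universe, and the parameter `mu_lower_bound` are assumptions made for the example. Since the theorem's sample size depends on the unknown $\mu$, the sketch assumes a lower bound on $\mu$ is available, and it assumes samples are drawn uniformly with replacement.

```python
import math
import random


def required_samples(mu_lower_bound, eps, delta):
    """Smallest integer N with N > (1/mu) * 4 * ln(2/delta) / eps^2.

    mu_lower_bound is a hypothetical parameter: an assumed lower bound on mu,
    used because the true mu is what we are trying to estimate.
    """
    bound = (1.0 / mu_lower_bound) * 4.0 * math.log(2.0 / delta) / eps ** 2
    return math.floor(bound) + 1


def zero_one_estimate(universe, in_g, num_samples, rng):
    """Estimate |G|/|U| as the fraction of uniform samples that fall in G."""
    hits = sum(1 for _ in range(num_samples) if in_g(rng.choice(universe)))
    return hits / num_samples


# Toy universe: the integers 0..9999; G = multiples of 4, so mu = 0.25.
universe = list(range(10_000))
n_samples = required_samples(mu_lower_bound=0.25, eps=0.1, delta=0.05)
estimate = zero_one_estimate(
    universe, lambda x: x % 4 == 0, n_samples, random.Random(0)
)
```

With these toy parameters the theorem says the estimate falls within 10% of 0.25 with probability greater than 0.95.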

Page 26: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Simple Random Sample
•  Given a table of size N, pick a subset of n rows, such that each subset of n rows is equally likely.
•  How to sample n rows?
•  … if we don't know N?

Page 27: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Reservoir Sampling
Highlights:
•  Make one pass over the data
•  Maintain a reservoir of n records.
•  After reading t rows, the reservoir is a simple random sample of the first t rows.

Page 28: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Reservoir Sampling [Vitter ACM ToMS '85]
Algorithm R:
•  Initialize reservoir to the first n rows.
•  For the (t+1)st row R,
    –  Pick a random integer m uniformly between 1 and t+1
    –  If m <= n, then replace the m-th row in the reservoir with R
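A minimal Python sketch of Algorithm R as described on this slide. The function name `reservoir_sample_r` and the optional `rng` argument are illustrative choices, not taken from the paper.

```python
import random


def reservoir_sample_r(rows, n, rng=None):
    """One-pass reservoir sample of up to n rows from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for t, row in enumerate(rows):  # t = number of rows read before this one
        if t < n:
            # The first n rows fill the reservoir.
            reservoir.append(row)
        else:
            # This is the (t+1)st row: draw m uniformly from 1..t+1.
            m = rng.randint(1, t + 1)
            if m <= n:
                reservoir[m - 1] = row  # replace the m-th reservoir row
    return reservoir


sample = reservoir_sample_r(iter(range(1000)), 10, random.Random(1))
```

Because the input is consumed as an iterator, the total number of rows never has to be known in advance.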

Page 29: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Proof


Page 30: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Proof
•  If N = n, then P[row is in sample] = 1. Hence, the reservoir contains all the rows in the table.
•  Suppose for N = t, the reservoir is a simple random sample. That is, each row has an n/t chance of appearing in the sample.
•  For N = t+1:
    –  The (t+1)st row is included in the sample with probability n/(t+1)
    –  Any other row:
       $P[\text{row is in reservoir}] = P[\text{row is in reservoir after } t \text{ steps}] \cdot P[\text{row is not replaced}] = \frac{n}{t}\left(1 - \frac{1}{t+1}\right) = \frac{n}{t+1}$
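The induction step can be checked with exact rational arithmetic. The sketch below is only an illustration (the function name and the choice n = 3, N = 50 are assumptions): it starts from probability 1 at N = n and repeatedly multiplies by the survival probability 1 - 1/(t+1), confirming that the result is n/(t+1) at every step.

```python
from fractions import Fraction


def inclusion_probabilities(n, big_n):
    """Map t -> P[a fixed one of the first n rows is in the reservoir after t rows]."""
    prob = Fraction(1)  # base case N = n: every row is in the reservoir
    probs = {n: prob}
    for t in range(n, big_n):
        # The row survives step t+1 unless its own slot is the one replaced.
        prob *= 1 - Fraction(1, t + 1)
        probs[t + 1] = prob
    return probs


probs = inclusion_probabilities(n=3, big_n=50)
```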

Page 31: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Complexity
•  Running time: O(N)
•  Number of calls to random number generator: O(N)
•  Expected number of elements that may appear in the reservoir:
   $n + \sum_{t=n}^{N-1} \frac{n}{t+1} = n(1 + H_N - H_n) \approx n(1 + \ln(N/n))$
   where $H_k$ denotes the $k$-th harmonic number.
•  Is there a way to sample faster, in time O(n(1 + ln(N/n)))?
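As a quick numeric check of the approximation above, the following sketch compares the exact sum with n(1 + ln(N/n)). The function names and the values n = 10, N = 1000 are chosen only for illustration.

```python
import math
from fractions import Fraction


def expected_insertions(n, big_n):
    """Exact value of n + sum_{t=n}^{N-1} n/(t+1)."""
    return n + sum(Fraction(n, t + 1) for t in range(n, big_n))


def log_approximation(n, big_n):
    """The closed-form approximation n * (1 + ln(N/n))."""
    return n * (1 + math.log(big_n / n))


exact = float(expected_insertions(10, 1000))
approx = log_approximation(10, 1000)
```

For these values the two quantities differ by less than 2%, while both are far smaller than N = 1000.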

Page 32: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Faster algorithm
•  Algorithm R skips over (does not insert into the reservoir) a number of records (about N - n(1 + ln(N/n)) in expectation)
•  At any step t, let S(n,t) denote the number of rows skipped by Algorithm R after row t, before the next insertion.
    –  Involves O(S) time and O(S) calls to the random number generator.
•  P[S(n,t) = s] = ?

Page 33: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Faster algorithm
•  At any step t, let S(n,t) denote the number of rows skipped by Algorithm R.
•  P[S(n,t) = s] = P[for all t < x <= t+s, row x was not inserted into the reservoir, but row t+s+1 is inserted]
   $= \left(1 - \frac{n}{t+1}\right)\left(1 - \frac{n}{t+2}\right)\cdots\left(1 - \frac{n}{t+s}\right)\cdot\frac{n}{t+s+1}$
•  We can derive an expression for the CDF:
   $P[S(n,t) \le s] = 1 - \frac{t}{t+s+1}\cdot\frac{t-1}{t+s}\cdot\frac{t-2}{t+s-1}\cdots\frac{t-n+1}{t+s-n+2}$
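The two expressions for the skip distribution can be cross-checked exactly. This sketch (function names are illustrative, and t >= n is assumed) computes the probability mass function as the product of non-insertion probabilities and the CDF from the closed form, so that summing the former should reproduce the latter.

```python
from fractions import Fraction


def skip_pmf(n, t, s):
    """P[S(n,t) = s]: rows t+1..t+s are skipped and row t+s+1 is inserted."""
    prob = Fraction(1)
    for j in range(1, s + 1):
        prob *= 1 - Fraction(n, t + j)
    return prob * Fraction(n, t + s + 1)


def skip_cdf(n, t, s):
    """P[S(n,t) <= s] from the closed-form product with n factors."""
    tail = Fraction(1)
    for i in range(n):
        tail *= Fraction(t - i, t + s + 1 - i)
    return 1 - tail


# Cumulative sums of the PMF for n = 3, t = 5.
running = Fraction(0)
partial_sums = []
for s in range(21):
    running += skip_pmf(3, 5, s)
    partial_sums.append(running)
```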

Page 34: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Faster Algorithm
Algorithm X
•  Initialize reservoir with first n rows.
•  After seeing t rows, randomly sample a skip s = S(n,t) from the CDF
•  Pick a number m uniformly at random between 1 and n
•  Replace the m-th row in the reservoir with the (t+s+1)st row.
•  Set t = t+s+1

Page 35: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Faster Algorithm
Algorithm X
•  Initialize reservoir with first n rows.
•  After seeing t rows, randomly sample a skip s = S(n,t) from the CDF
    –  Pick a random U between 0 and 1
    –  Find the minimum s such that P[S(n,t) <= s] >= U (equivalently, P[S(n,t) > s] <= 1 - U)
•  Pick a number m uniformly at random between 1 and n
•  Replace the m-th row in the reservoir with the (t+s+1)st row.
•  Set t = t+s+1
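A hedged Python sketch of Algorithm X follows. It is one possible arrangement, not the paper's code: the function name is illustrative, and the sequential search for the skip is interleaved with reading the stream so that the work per skip never runs past the end of the data. The variable `tail` tracks P[S(n,t) > s], and a row is inserted at the first s where that tail drops to 1 - U or below.

```python
import random
from itertools import islice


def reservoir_sample_x(rows, n, rng=None):
    """Reservoir sampling that draws skip lengths by inverting the skip CDF."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = rng or random.Random()
    it = iter(rows)
    reservoir = list(islice(it, n))  # the first n rows
    if len(reservoir) < n:
        return reservoir
    t = n                # rows seen at the time of the last insertion
    skipped = 0          # rows skipped since then
    u = rng.random()     # one uniform draw per skip
    tail = 1.0           # P[S(n,t) > skipped - 1]
    for row in it:
        # After this update, tail = P[S(n,t) > skipped].
        tail *= (t + skipped + 1 - n) / (t + skipped + 1)
        if tail <= 1 - u:
            # skipped is the minimum s with P[S(n,t) <= s] >= u: insert this row.
            reservoir[rng.randrange(n)] = row
            t += skipped + 1
            skipped = 0
            u = rng.random()
            tail = 1.0
        else:
            skipped += 1
    return reservoir


sample = reservoir_sample_x(range(1000), 10, random.Random(7))
```

This arrangement still touches every row, so its running time is O(N), but it calls the random number generator only about twice per insertion rather than once per row.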

Page 36: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Algorithm X
•  Running time:
   Each skip takes O(s) time to compute
   Total time = sum of all the skips = O(N)
•  Expected number of calls to the random number generator
   = 2 * expected number of rows in the reservoir
   = O(n(1 + ln(N/n))), which is optimal!
See the paper for an algorithm which has optimal running time

Page 37: CompSci 590.02 Instructor: Ashwin Machanavajjhala

Summary
•  Sampling is an important technique for computation when data is too large, or the computation is intractable, or if access to data is limited.
•  Reservoir sampling techniques allow computing a sample even without knowledge of the size of the data.
    –  Also can do weighted sampling [Efraimidis, Spirakis IPL 2006]
•  Very useful for sampling from streams (e.g., Twitter stream)

Page 38: CompSci 590.02 Instructor: Ashwin Machanavajjhala

References
•  J. Vitter, "Random Sampling with a Reservoir", ACM Transactions on Mathematical Software, 1985
•  P. Efraimidis, P. Spirakis, "Weighted random sampling with a reservoir", Information Processing Letters, 97(5), 2006
•  R. Karp, M. Luby, N. Madras, "Monte Carlo Approximation Algorithms for Enumeration Problems", Journal of Algorithms, 1989

