Introduction to Big Data
Big Data - Philosophical perspective
What is more valuable, if you had to pick one?
  • experience or intelligence?
• Traditional (computer) science: logic! [intelligence]
  • understand the problem, build model/algorithm
  • answer question from implementation of model
• New science: statistics! [experience]
  • collect data
  • answer question from data (what did others do?)
Questions and (some) answers
• Find a spouse?
• Should Adam bite into the apple?
• 1 + 1?
• Cure for cancer?
• How to treat a cough?
• Should I give Donald a loan?
• Premium for fire insurance?
• When should my son come home?
• Which book should I read next?
• Translate from German to English.
Questions and (some) answers
• Find a spouse? I do not want to know!
• Should Adam bite into the apple? If you believe ...
• 1 + 1? Definition
• Cure for cancer? I do not know. Maybe.
• How to treat a cough? Yes. (Google Insight)
• Should I give Donald a loan? Yes. (e.g., Schufa)
• Premium for fire insurance? Yes. (e.g., … )
• When should my son come home? No! But ...
• Which book should I read next? Yes. (Amazon)
• Translate from German to English. Yes. (Google Translate)
Data Science
• New approach to do science
  • Step 1: Collect data
  • Step 2: Generate hypotheses
  • Step 3: Validate hypotheses
  • Step 4: (Go to Step 1 or 2)
• Why is this a good approach?
  • Automated: no thinking, less error
• Why is this a bad approach?
  • How to debug without a ground truth?
• More generally, interdisciplinary emerging field (see images)
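The four-step loop above can be sketched in a few lines of Python. This is only an illustration under invented assumptions: the data source (`collect`), the hypothesis family (threshold rules of the form "label is true when x > t"), the 10% label noise and the target accuracy are all hypothetical and not taken from the slides.

```python
import random


def collect(n, rng):
    """Step 1: collect n observations (x, label).

    Hypothetical process: the label is (x > 0.6), flipped in 10% of cases.
    """
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.6
        if rng.random() < 0.1:  # label noise
            label = not label
        data.append((x, label))
    return data


def accuracy(threshold, data):
    """Fraction of observations that the rule 'x > threshold' labels correctly."""
    return sum((x > threshold) == label for x, label in data) / len(data)


def generate_hypothesis(data):
    """Step 2: pick the candidate threshold that best fits the exploration data."""
    candidates = [i / 20 for i in range(1, 20)]
    return max(candidates, key=lambda t: accuracy(t, data))


def data_science_loop(target=0.85, batch=200, max_rounds=5, seed=0):
    rng = random.Random(seed)
    explore, holdout = [], []
    threshold, score = None, 0.0
    for round_no in range(1, max_rounds + 1):
        explore += collect(batch, rng)             # Step 1
        holdout += collect(batch, rng)             # separate data for validation
        threshold = generate_hypothesis(explore)   # Step 2
        score = accuracy(threshold, holdout)       # Step 3
        if score >= target:
            break                                  # hypothesis accepted
        # Step 4: otherwise go back to Step 1 and collect more data
    return threshold, score, round_no


if __name__ == "__main__":
    print(data_science_loop())
```

The sketch validates on data that was not used to generate the hypothesis. That split is a design choice of this example, since validating on the exploration data itself would make almost any hypothesis look good.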
“Big” data - Pros & Cons
• Pros
  • tolerate errors
  • discover the long tail and corner cases; machine learning works much better
• Cons
  • More data, more error (e.g., semantic heterogeneity)
  • With enough data you can prove anything
  • still need humans to ask the right questions
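The point "with enough data you can prove anything" can be made concrete with a small simulation. The sketch below assumes purely random, unrelated variables (all names and parameters are hypothetical) and reports the strongest correlation found between a random target and any of the random features. The more features are tried, the stronger the best spurious "finding" tends to look.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_spurious_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a random target and random features.

    Target and features are independent noise, so any correlation
    found here is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    for k in (10, 100, 1000):
        print(k, round(strongest_spurious_correlation(20, k), 2))
```

With a fixed seed, the first features generated are the same in every run, so the reported maximum can only grow as more features are examined.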
Big Data Success Story
• Google Translate
  • You collect snippets of translations
  • You match sentences to snippets
  • You continuously debug your system
• Why does it work?
  • There are tons of snippets on the Web
  • There is a ground truth that helps to debug the system
Google Translate is based on something called "statistical machine translation". This means that they gather as much text as they can find that seems to be parallel between two languages, and then they crunch their data to find the likelihood that something in Language A corresponds to something in Language B. This method works to some extent for language pairs where a lot of more-or-less parallel data is available, for example English-Spanish. […] (quora.com)
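The idea of "crunching parallel text to find the likelihood that something in Language A corresponds to something in Language B" can be illustrated with a deliberately naive sketch. It is not Google's method: it only counts which words co-occur in aligned sentence pairs and normalizes the counts, whereas real statistical translation systems use far more elaborate alignment and language models. The four-sentence corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy parallel corpus (German, English), invented for illustration.
PARALLEL = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
    ("ein haus", "a house"),
]


def translation_table(pairs):
    """Estimate P(target word | source word) from raw co-occurrence counts."""
    cooc = defaultdict(Counter)
    for src, tgt in pairs:
        for s in src.split():
            for t in tgt.split():
                cooc[s][t] += 1
    table = {}
    for s, counts in cooc.items():
        total = sum(counts.values())
        table[s] = {t: c / total for t, c in counts.items()}
    return table


def translate_word(table, word):
    """Return the most likely target word, or None if the word was never seen."""
    candidates = table.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = translation_table(PARALLEL)
    print([translate_word(table, w) for w in "ein haus".split()])
```

Even on this tiny corpus the counts single out "house" for "haus", because "haus" appears with "house" in both of its sentences but with "the" and "a" only once each. With more parallel snippets the estimates become sharper, which is why the approach depends on large amounts of parallel data.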
Big Data - Business perspective
It is a new business model
• People pay with data, e.g. Facebook, Google, Twitter:
  • use service, give data
  • Google sells your data to advertisers
  • you pay advertisers indirectly
• 23andMe, Amazon:
  • pay for service + give data
  • sells data and
  • uses data to improve service
Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, June 2011
Big Data - Technical perspective
• You collect all data
  • the more the better -> statistical relevance
  • keeping all is cheaper than deciding what to keep
• You decide independently what to do with data
  • run experiments on data when question arises
• Huge difference to traditional information systems
  • Design upfront what data to keep and why!!! (e.g., waterfall model of software engineering!)
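The contrast between "keep everything, decide later" and "design up front" can be sketched as follows. The example assumes raw events are stored unchanged as JSON lines and are only interpreted once a question arises. The event fields and function names are hypothetical.

```python
import json

# Hypothetical raw event log, kept exactly as it arrived (one JSON object per line).
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": 340, "referrer": "ad"}',
    '{"user": "a", "page": "/cart", "ms": 95}',
]


def values_of(raw_events, field):
    """Answer a question that arises later: which values does `field` take?

    Records that lack the field are skipped, since no schema was fixed up front.
    """
    values = []
    for line in raw_events:
        record = json.loads(line)
        if field in record:
            values.append(record[field])
    return values


def average_load_time(raw_events, page):
    """Another later question: mean of the 'ms' field for one page."""
    times = [
        record["ms"]
        for record in map(json.loads, raw_events)
        if record.get("page") == page and "ms" in record
    ]
    return sum(times) / len(times) if times else None


if __name__ == "__main__":
    print(values_of(RAW_EVENTS, "referrer"))
    print(average_load_time(RAW_EVENTS, "/cart"))
```

A system designed up front would have had to decide in advance whether a field such as `referrer` is worth storing. Here the field is simply there when someone asks, at the cost of every query having to cope with missing or inconsistent fields.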
Consequences
• Volume: data at rest
  • it is going to be a lot of data
• Velocity (Speed): data in motion
  • it is going to arrive fast
• Variety (Diversity): data in many formats
  • Different shapes (e.g., different versions, different sources)
• Veracity: data in doubt
  • do you know what you have?