ECE6980
An Algorithmic and Information Theoretic
Toolbox for Massive Data
Instructor: Jayadev Acharya
Email: [email protected]
Lectures: TuTh 1.25-2.40, 203 Phillips
Office Hours: MoTh 3-4, 304 Rhodes
Website: http://people.csail.mit.edu/jayadev/ece6980
Logistics
Grading
• Scribe a lecture: 10%
  • Encouraged to fill in the details, provide examples
• Assignments: 30-60%
  • 2-3 assignments
  • Typeset?
• Project report and presentation: 40-60%
  • Read a new related paper
  • Present a summary in your own words
  • Can choose from a list
• Interruptions: 5%
Lectures
• Lectures primarily on the board
• Derive things (mostly from scratch)
Course overview
• Lot of interest in data science
  • Number of courses on offer
  • Many aspects can be covered
• This course:
  • Core primitives
  • Efficient algorithms
  • Fundamental limits
  • Mostly theoretical, encourage implementation
Prerequisites
• Undergraduate probability / random processes
• Basic combinatorics
• What is the variance of a random variable?
• What is a binomial distribution?
• When are two random variables independent?
What you should learn?
• Fast algorithms for statistical problems
  • Learning discrete distributions
  • Finite sample hypothesis testing
• How to prove information theoretic lower bounds
Probabilistic thinking
Classical
Small domain 𝐷
𝑛 large, 𝐷 small
Modern
Large domain 𝐷
𝑛 small, 𝐷 large
Old questions, new issues
Domain:
𝑛 = 1000 tosses
Asymptotic analysis
Computation not crucial
New challenges
One human genome
Domain:
Resources
Samples
• How much data needed?
• Inference when data is scarce
Computation
• How does run-time scale with data and domain size?
• Even quadratic might be prohibitive
Other resources
• Storage: Not enough space to store all data
• Communication: Distributed data across servers
Goals
For statistical inference
• Design efficient algorithms
• Understand fundamental limits
INFORMATION THEORY
MACHINE LEARNING
STATISTICS
ALGORITHMS
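Distribution learning
A simple setting
• Support set $\mathcal{X}$
• Distribution $p: \mathcal{X} \to \mathbb{R}_{\ge 0}$, such that $\sum_{x \in \mathcal{X}} p(x) = 1$
• Samples $x^n = x_1 x_2 \ldots x_n$ drawn from $p$
• Output a distribution $q(x^n)$ after observing $x^n$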
Toss a coin: HTTTHTTH
Throw a die: 31344536
What is a good estimator
• Would like 𝑞 to be close to 𝑝
• 𝐿(𝑝, 𝑞): Loss for estimating 𝑝 with 𝑞
  • Total variation distance, KL divergence, …
• Find an estimator with small 𝐿
• For a given loss function, how many samples needed?
• Empirical estimators:
  • $q(H) = \frac{3}{8}$ (3 heads in the 8 coin tosses HTTTHTTH)
• Analyze the performance of empirical estimators
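The empirical estimator and the two losses named above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function names are made up for this example, the inputs are the coin and die sequences from the earlier slide, and the fair coin and fair die are assumed reference distributions.

```python
from collections import Counter
import math

def empirical_distribution(samples, support):
    """Return q(x) = (number of times x appears) / n for every x in support."""
    counts = Counter(samples)
    n = len(samples)
    return {x: counts[x] / n for x in support}

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(p[x] - q[x]) for x in p)

def kl_divergence(p, q):
    """KL divergence D(p || q) in nats; infinite if q misses mass that p has."""
    total = 0.0
    for x in p:
        if p[x] > 0:
            if q[x] == 0:
                return math.inf
            total += p[x] * math.log(p[x] / q[x])
    return total

# Sequences from the slides; the "fair" references are assumptions.
coin_q = empirical_distribution("HTTTHTTH", "HT")
die_q = empirical_distribution("31344536", "123456")
fair_coin = {"H": 0.5, "T": 0.5}
fair_die = {face: 1 / 6 for face in "123456"}

print(coin_q)
print(total_variation(fair_coin, coin_q))
print(kl_divergence(fair_die, die_q))
```

The die example never shows the face 2, so the empirical estimate puts zero mass on it and the KL divergence from a fair die is infinite. That is one reason the choice of loss matters when analyzing empirical estimators.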
Learning
• Given samples from a Gaussian distribution $N(\mu, \sigma^2)$
• Learn with a Gaussian distribution
  • Relatively simple
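One way to see why the single-Gaussian case is relatively simple: the sample mean and sample variance already give a fitted Gaussian. The sketch below assumes this plug-in approach (the slides do not specify a method), and the true parameters, sample size, and seed are arbitrary choices for illustration.

```python
import random
import statistics

def fit_gaussian(samples):
    """Estimate (mu, sigma^2) by the sample mean and the mean squared deviation."""
    mu_hat = statistics.fmean(samples)
    var_hat = statistics.pvariance(samples, mu_hat)
    return mu_hat, var_hat

# Assumed ground truth, used only to generate synthetic samples.
rng = random.Random(0)
true_mu, true_sigma = 2.0, 3.0
samples = [rng.gauss(true_mu, true_sigma) for _ in range(10_000)]

mu_hat, var_hat = fit_gaussian(samples)
print(mu_hat, var_hat)  # should land near 2.0 and 9.0
```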
Learning
Ratio of breadth to height of 1000 crabs by W. Weldon
Not normally distributed, more than one species?
Karl Pearson: Mixtures of Gaussians (much harder!!)
Distribution testing
Polish Multilotek:
• Picks 20 numbers between 1, …, 80
Is it fair?
Testing uniformity
Thanks to Krzysztof Onak (pointer) and Eric Price (graph)
(Figure by Onak, Price, Rubinfeld)
Testing uniformity (contd)
A simple setting
• $|\mathcal{X}| = k$
• $u$: uniform distribution over $\mathcal{X}$
• $x^n$: $n$ samples from a distribution $p$
Question: Is $p = u$ OR $\|p - u\|_1 \ge \varepsilon$?
How many samples do we need? Take a guess…
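The slides leave the tester open at this point. As one possible illustration, the sketch below counts collisions: the fraction of equal sample pairs has expectation $\sum_x p(x)^2$, which equals $1/k$ for the uniform distribution and is at least $(1 + \varepsilon^2)/k$ whenever $\|p - u\|_1 \ge \varepsilon$. The threshold halfway between the two, the sample size, and the "far" distribution are assumptions chosen for this example and are not claimed to be optimal.

```python
import random
from collections import Counter

def collision_fraction(samples):
    """Fraction of unordered sample pairs that are equal.

    Its expectation is sum_x p(x)^2, which is 1/k exactly when p is uniform.
    """
    n = len(samples)
    counts = Counter(samples)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding_pairs / (n * (n - 1) / 2)

def looks_uniform(samples, k, eps):
    """Accept 'p = u' when the collision fraction is below an assumed threshold."""
    threshold = (1 + eps ** 2 / 2) / k  # illustrative midpoint, not tuned
    return collision_fraction(samples) <= threshold

rng = random.Random(0)
domain_size, eps, n = 100, 0.5, 5000  # assumed values for the demo
domain = list(range(domain_size))

uniform_samples = rng.choices(domain, k=n)

# A distribution at L1 distance 0.5 from uniform: half the symbols get
# probability 1.5/k and the other half get 0.5/k.
far_weights = [1.5 if x < domain_size // 2 else 0.5 for x in domain]
far_samples = rng.choices(domain, weights=far_weights, k=n)

print(looks_uniform(uniform_samples, domain_size, eps))
print(looks_uniform(far_samples, domain_size, eps))
```

With these settings the two cases should separate comfortably. How small $n$ can be made as a function of $k$ and $\varepsilon$ is exactly the question the slide asks.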
$X^n$: a sports article
$Y^n$: a religious article
$Z$: one word
Q: Is $Z$ more likely to appear in sports or religion?
Necessarily assign $Z$ to where it appears more often?
A simple classification problem
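The naive rule that the slide questions can be written down directly: assign $Z$ to the article in which its relative frequency is higher. The sketch below is only that baseline, under the assumption that articles are given as word lists. The two toy articles are invented for this example and are not from the course.

```python
from collections import Counter

def naive_label(word, sports_words, religion_words):
    """Assign the word to the article where its relative frequency is higher."""
    f_sports = Counter(sports_words)[word] / len(sports_words)
    f_religion = Counter(religion_words)[word] / len(religion_words)
    if f_sports > f_religion:
        return "sports"
    if f_religion > f_sports:
        return "religion"
    return "tie"

# Hypothetical toy articles, made up for illustration.
sports_article = "the team won the game and the coach praised the team".split()
religion_article = "the congregation sang and the choir led the prayer".split()

for z in ["team", "prayer", "and", "umpire"]:
    print(z, naive_label(z, sports_article, religion_article))
```

The word "and" appears once in each article, yet the rule labels it "religion" only because that article is shorter, and an unseen word such as "umpire" gets no decision at all. Both cases suggest why the answer to the slide's question need not be yes.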
Property estimation
Predicting new elements
...
How many new species?
Corbet collected butterflies in Malaya for one year
Applications of estimating the unseen
Corbet collected butterflies in Malaya for 1 year
Frequency  1    2   3   4   5   6   7   ..
Species    118  74  44  24  29  22  20  ..
# of seen species = 118 + 74 + 44 + 24 + . . .
# of new species in the next year?
# of words in a book..
How many new species if he goes for one more year?
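The slides do not name an estimator here. One classical choice, assumed for this sketch, is the Good-Toulmin alternating sum: if $\Phi_i$ is the number of species seen exactly $i$ times, the number of new species expected in a second period of equal length is estimated by $\Phi_1 - \Phi_2 + \Phi_3 - \ldots$. The table above is truncated (".."), so the code uses only the seven listed counts and its output is a partial sum for illustration, not an estimate for the full data set.

```python
def good_toulmin_new_species(prevalences):
    """Alternating sum Phi_1 - Phi_2 + Phi_3 - ...

    prevalences[i] is the number of species seen exactly i + 1 times.
    """
    return sum((-1) ** i * phi for i, phi in enumerate(prevalences))

# Only the counts listed on the slide; the real table continues past 7.
listed = [118, 74, 44, 24, 29, 22, 20]

seen_partial = sum(listed)
new_partial = good_toulmin_new_species(listed)
print(seen_partial)  # 331 species among the listed frequencies
print(new_partial)   # 91, the alternating sum over the listed entries only
```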
Entropy estimation
Measuring Randomness in Data
Estimating randomness of the observed data:
Neural signal processing
Feature selection for machine learning
Image registration
Approach: Estimate the “entropy” of the generating distribution
Shannon entropy $H(p) \stackrel{\text{def}}{=} \sum_x -p_x \log p_x$
How much randomness in neural spikes?
How to estimate entropy from observations?
Entropy estimation
• $|\mathcal{X}| = k$
• $x^n$: $n$ samples from a distribution $p$
Question: Estimate $H(p)$
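A natural first attempt, assumed here as a baseline rather than taken from the slides, is the plug-in estimator: compute the empirical distribution and evaluate the entropy formula on it. The sketch uses base-2 logarithms (bits) and reuses the coin sequence from the distribution learning slide as input.

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Entropy (in bits) of the empirical distribution of the samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

h_coin = plugin_entropy("HTTTHTTH")  # empirical distribution (3/8, 5/8)
h_four = plugin_entropy("abcd")      # four distinct symbols, each seen once
print(h_coin)
print(h_four)
```

Four samples can never yield more than 2 bits, whatever the true $k$ is. The plug-in estimate tends to fall below $H(p)$ when $n$ is small relative to $k$, which is part of what makes the question interesting.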
Resource constraints
Data too big to be stored in a single machine
Lot of recent interest
[Figure: sample stream, limited memory, decision; distributed data, limited communication]