Introduction and Overview of the Foundation of Data Science 2019 Summer School
at the Georgia Institute of Technology
August 5, 2019
Xiaoming Huo, Georgia Tech
Agenda
I. Overview of this summer school
II. Foundation of data science: Convex relaxation
III. Fast algorithms in statistical computing
IV. Conclusion
TRIAD: Transdisciplinary Research Institute for Advancing Data Science
triad.gatech.edu
August 5, 2019, TRIAD summer school at Georgia Tech
Speakers
Arkadi Nemirovski
Mark Davenport
Polo Chau
Vladimir Koltchinskii
Yao Xie
Speakers (2)
• Arkadi Nemirovski: Convex optimization for statistical inference
• Mark Davenport: Sparse recovery, matrix completion, and applications
• Polo Chau: Data visualization
• Vladimir Koltchinskii: Probabilistic tools for high-dimensional statistics
• Xiaoming Huo: Overview of FDS summer school
• Yao Xie: Sequential data analysis and change detection
• (Industry Guest Speaker) Huan Yan: Data Science at Wells Fargo
Schedule
• Location: the ISyE Main Building, room 228
• Monday, August 5
• Visit our web site
Time         Event     Speaker – Topic
9:30-10:30   Lecture   Xiaoming Huo (ISyE) – Introduction and overview of FDS summer school
10:30-11:00  Break
11:00-12:00  Lecture   Mark Davenport (ECE) – Sparse recovery, matrix completion, and applications
12:00-1:30   Lunch
1:30-2:30    Lecture   Mark Davenport (ECE) – Sparse recovery, matrix completion, and applications
2:30-3:00    Break
3:00-4:00    Lecture   Vladimir I. Koltchinskii (Math) – Probabilistic tools for high-dimensional statistics
4:00-6:00    Poster Session / Reception
6:00         Dinner
Schedule (2)

• Tuesday, August 6

Time         Event     Speaker – Topic
9:30-10:30   Lecture   Arkadi S. Nemirovski (ISyE) – Convex optimization for statistical inference
10:30-11:00  Break
11:00-12:00  Lecture   Arkadi S. Nemirovski (ISyE) – Convex optimization for statistical inference
12:00-1:30   Lunch
1:30-2:30    Lecture   Vladimir I. Koltchinskii (Math) – Probabilistic tools for high-dimensional statistics
2:30-3:00    Break
3:00-4:00    Lecture   Vladimir I. Koltchinskii (Math) – Probabilistic tools for high-dimensional statistics
4:00-5:00    Lab
6:00         Dinner

• Wednesday, August 7

Time         Event     Speaker – Topic
9:30-10:30   Lecture   Arkadi S. Nemirovski (ISyE) – Convex optimization for statistical inference
10:30-11:00  Break
11:00-12:00  Lecture   Arkadi S. Nemirovski (ISyE) – Convex optimization for statistical inference
12:00-1:00   Lunch
1:00-1:30    Lecture   Huan Yan (Wells Fargo) – Industry guest speaker
1:30-2:30    Lecture   Mark Davenport (ECE) – Sparse recovery, matrix completion, and applications
2:30-3:00    Break
3:00-4:00    Lecture   Mark Davenport (ECE) – Sparse recovery, matrix completion, and applications
4:00-5:00    Tour of Georgia Tech IRIM (Institute for Robotics and Intelligent Machines)
6:00         Dinner
Schedule (3)
• Thursday, August 8
Time         Event     Speaker – Topic
9:30-10:30   Lecture   Yao Xie (ISyE) – Sequential data analysis and change detection
10:30-11:00  Break
11:00-12:00  Lecture   Polo Chau (CSE) – Data visualization
12:00-12:15  Closing remarks – Yao Xie and Xiaoming Huo
12:15-1:30   Lunch
The Surroundings
Safety!!!
• http://ipat.gatech.edu/news/data-driven-policing
Agenda
I. Overview of this summer school
II. Foundation of data science: Convex relaxation
III. Fast algorithms in statistical computing
IV. Conclusion
Foundation of Data Science

• Programming, databases
• Statistics, mathematics, theoretical computer science (TCS)
• Domain expertise
L1 relaxation in statistics

• Statistical inference: X ⇒ Y
• Modeling: Ŷ = f(X) ≈ Y
• Regression: Y = β₀ + β₁X₁ + ⋯ + β_p X_p + e
• Least squares: min_β ‖Y − Xβ‖₂²
• Difficulty: Y ∈ ℝⁿ, X ∈ ℝ^(n×p)
• Q: What if p ≫ n?
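As a quick numerical sketch of the two regimes (the data and variable names below are synthetic and illustrative), ordinary least squares recovers β when n > p, but becomes underdetermined when p ≫ n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Well-posed case: n = 100 observations, p = 5 predictors.
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
Y = X @ beta_true + 0.01 * rng.standard_normal(n)

# Ordinary least squares: min_beta ||Y - X beta||_2^2
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # close to beta_true

# Underdetermined case (p >> n): infinitely many beta fit the data
# exactly, so least squares alone cannot identify the true coefficients.
X_wide = rng.standard_normal((10, 1000))
Y_wide = X_wide @ rng.standard_normal(1000)
beta_wide, *_ = np.linalg.lstsq(X_wide, Y_wide, rcond=None)
residual = Y_wide - X_wide @ beta_wide   # essentially zero: a perfect fit
```

`lstsq` returns the minimum-norm solution in the wide case, one of infinitely many exact fits, which is precisely why extra structure (sparsity) is needed when p ≫ n.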
L1 relaxation in statistics (2)

• Statistical theory (1960s, 70s, …): AIC, BIC, …
• Subset Ω ⊂ {1, 2, …, p}
• E.g., AIC(Ω) = ‖Y − X_Ω β_Ω‖₂² + C|Ω|
• Problem: there are 2^p subsets Ω

  p     2^p
  10    ≈ 10³
  20    ≈ 10⁶
  30    ≈ 10⁹ (1 billion)
  60    ≈ 10¹⁸ (practically impossible)
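The combinatorial explosion can be seen directly: a brute-force best-subset search (scored with an AIC-style criterion as above; the penalty C = 2 and all data below are illustrative) must visit every one of the 2^p − 1 nonempty subsets:

```python
from itertools import combinations
import numpy as np

def best_subset(X, Y, C=2.0):
    """Brute force over all nonempty subsets Omega of {0, ..., p-1},
    scoring each with ||Y - X_Omega beta_Omega||_2^2 + C * |Omega|."""
    n, p = X.shape
    best_score, best_omega, visited = np.sum(Y ** 2), (), 0
    for k in range(1, p + 1):
        for omega in combinations(range(p), k):
            visited += 1
            beta, *_ = np.linalg.lstsq(X[:, omega], Y, rcond=None)
            score = np.sum((Y - X[:, omega] @ beta) ** 2) + C * k
            if score < best_score:
                best_score, best_omega = score, omega
    return best_omega, visited

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
Y = X[:, [1, 4]] @ np.array([3.0, -2.0]) + 0.1 * rng.standard_normal(50)
omega, visited = best_subset(X, Y)   # visited == 2**10 - 1 == 1023
```

Even at p = 10 the search fits 1023 regressions; at p = 60 the count reaches ~10¹⁸, which is what motivates the convex relaxation on the next slide.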
L1 relaxation in statistics (3)

• 1996: Lasso, Basis Pursuit, …
• min_β ‖Y − Xβ‖₂² + λ‖β‖₁
• where ‖β‖₁ = Σ_{j=1}^p |β_j|
• Convex relaxation
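The relaxed problem can be solved with simple first-order methods. Below is a minimal proximal-gradient (ISTA) sketch with soft-thresholding, an illustration under synthetic data, not one of the production solvers the slides allude to:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=3000):
    """Minimize ||Y - X b||_2^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ b - Y)
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(2)
n, p = 60, 200                            # p >> n
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[3, 17, 42]] = [4.0, -3.0, 5.0]
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

b = lasso_ista(X, Y, lam=2.0)
support = np.nonzero(np.abs(b) > 1.0)[0]  # indices of the large coefficients
```

Despite p ≫ n, the L1 penalty recovers the three-element support, with no combinatorial search.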
L1 relaxation in statistics (4)

• Convex ⇒ polynomial-time algorithms (vs. NP-hard)
• Linear programming
• Many good existing solvers
• 2001, Donoho and Huo, IEEE Trans. on Information Theory
• Under certain conditions, the relaxation delivers the identical solution as the original formulation (which is potentially NP-hard).
• Compressive sensing, …
Foundation of Data Science

• Foundation is powerful and critical
• Interdisciplinary approach (borrow the strengths)
• Multi-stage → coherent
Agenda
I. Overview of this summer school
II. Foundation of data science: Convex relaxation
III. Fast algorithms in statistical computing
IV. Conclusion
Distance Covariance
• Let’s take a look at a statistical example…
Linear Dependency (Pearson's)

Pearson's linear correlation coefficient:

  Corr(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

Karl Pearson (1895)

[Figure: scatter plots of (X, Y) with Corr = 1.0, 1.0, 0.8, 0.4, and 0.0]
Nonlinearity

Dependency could be complicated.

[Figure: scatter plots of nonlinear relationships between X and Y]
Pearson's corr. not effective

[Figure: scatter plots with clear structure whose Pearson correlations (0.8, 1.0, 0.0, −0.0, −0.0, −0.1, −0.1, 0.0, −0.0) fail to reflect the dependence]
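A quick numerical illustration of this failure mode: take Y = X² with X symmetric about zero, so Y is fully determined by X, yet the population Pearson correlation is exactly zero:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=10_000)
y = x ** 2                      # y is a deterministic function of x

r = np.corrcoef(x, y)[0, 1]     # Pearson correlation: near 0
```

Since Cov(X, X²) = E[X³] − E[X] E[X²] = 0 for a symmetric X, linear correlation misses a perfect (but nonlinear) dependence.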
How to measure statistical dependence?

Independence: f(x, y) = f(x) f(y)
◦ The joint density is the product of the two marginal densities.

Hope:
◦ X and Y are independent if and only if corr(X, Y) = 0
◦ If X = c₁·Y + c₂ (c₁ ≠ 0), then |corr(X, Y)| = 1

Pearson's correlation coefficient is not effective here.
Distance covariance

Gabor J. Szekely, 2005, 2007 (AoS), 2009 (AoAS), 2012 (SPL), 2014 (AoS)

Distance covariance (population version):

  𝒱²(X, Y) = ‖φ_{X,Y}(t, s) − φ_X(t) φ_Y(s)‖²_w
           := ∫_{ℝ^{p+q}} |φ_{X,Y}(t, s) − φ_X(t) φ_Y(s)|² w(t, s) dt ds,

where φ_{X,Y}, φ_X, and φ_Y are the characteristic functions.

The weight w(t, s) = (|t|_p^{1+p} |s|_q^{1+q})^{−1} ensures the above integral is well defined.
Sample Distance Covariance

Pairwise distances: a_ij = |X_i − X_j|, 1 ≤ i, j ≤ n; similarly, b_ij = |Y_i − Y_j|.

Centered matrix:

  A_ij = a_ij − (Σ_{ℓ=1}^n a_iℓ)/(n−2) − (Σ_{k=1}^n a_kj)/(n−2) + (Σ_{k,ℓ=1}^n a_kℓ)/((n−1)(n−2)),  i ≠ j;
  A_ij = 0,  i = j.

Similarly, B_ij.

An unbiased estimator of 𝒱²(X, Y):

  (A · B) = (Σ_{i≠j} A_ij B_ij) / (n(n−3))
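A direct O(n²) implementation of this unbiased estimator for scalar X and Y, following the U-centering formulas above (the test data is illustrative):

```python
import numpy as np

def u_center(d):
    """U-centering of an n x n pairwise-distance matrix (requires n > 3)."""
    n = d.shape[0]
    row = d.sum(axis=1, keepdims=True) / (n - 2)
    col = d.sum(axis=0, keepdims=True) / (n - 2)
    total = d.sum() / ((n - 1) * (n - 2))
    a = d - row - col + total
    np.fill_diagonal(a, 0.0)           # A_ij = 0 on the diagonal
    return a

def dcov2_unbiased(x, y):
    """Unbiased estimator of the squared distance covariance V^2(X, Y)."""
    n = len(x)
    a = u_center(np.abs(x[:, None] - x[None, :]))
    b = u_center(np.abs(y[:, None] - y[None, :]))
    return (a * b).sum() / (n * (n - 3))

rng = np.random.default_rng(4)
x = rng.standard_normal(300)
v_dep = dcov2_unbiased(x, x ** 2)                    # nonlinear dependence: clearly positive
v_ind = dcov2_unbiased(x, rng.standard_normal(300))  # independence: near zero
```

Unlike Pearson's coefficient, the estimate is large for Y = X² and near zero for an independent sample, matching the characterization of independence above.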
Comparison

Comparison between dependence measures:

  Coefficient      Computational cost
  Pearson's ρ      n
  Spearman's ρ     n log n
  Kendall's τ      n log n
  CCA              n
  KCCA             n³
  ACE              n
  MIC              2ⁿ
  MMD              n²
  CMMD             n²
  RDC              n log n
  dCor             n² → n log n
Main ideas towards an O(n log n) algorithm

We designed a dyadic updating scheme to compute

  Σ_{i≠j} a_ij b_ij = Σ_{i≠j} |x_i − x_j| · |y_i − y_j|,

yielding an O(n log n) algorithm.
Fast Method (3)

For each i, we need to compute

  Σ_{j: j<i, y_j<y_i} c_j

via a dyadic partitioning/updating scheme:

[Figure: dyadic partition of the indices j preceding i]
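One standard way to realize such a dyadic updating scheme is a Fenwick (binary indexed) tree over the ranks of y. The sketch below illustrates the idea and is not necessarily the authors' exact data structure:

```python
import numpy as np

class Fenwick:
    """Binary indexed tree: point updates and prefix sums in O(log n)."""
    def __init__(self, n):
        self.t = [0.0] * (n + 1)

    def add(self, i, v):        # add v at 1-based position i
        while i < len(self.t):
            self.t[i] += v
            i += i & (-i)

    def prefix(self, i):        # sum over positions 1..i
        s = 0.0
        while i > 0:
            s += self.t[i]
            i -= i & (-i)
        return s

def partial_sums(y, c):
    """For each i, return the sum of c_j over j < i with y_j < y_i.
    Assumes distinct y values. Total cost: O(n log n)."""
    rank = np.argsort(np.argsort(y)) + 1   # 1-based ranks of y
    ft = Fenwick(len(y))
    out = np.empty(len(y))
    for i in range(len(y)):
        out[i] = ft.prefix(rank[i] - 1)    # all inserted c_j with y_j < y_i
        ft.add(rank[i], c[i])              # make c_i visible to later i's
    return out

y = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 2.0])
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
s = partial_sums(y, c)   # [0, 0, 3, 2, 10, 6]
```

Each of the n partial sums costs O(log n), replacing the naive O(n²) double loop and giving the overall O(n log n) complexity claimed on the slide.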
[Figure: several sets of (x, y) points, with the Pearson correlation coefficient]

[Figure: several sets of points, with the distance correlation coefficient]
Main message(s)

● Foundational research is key to data science
● Computing, statistics, theoretical computer science, math, … play important roles ⟹ interdisciplinary
● New paradigm in data science technologies
● This summer school: Foundation of Data Science – future activities…

Thank you! Email: huo@gatech.edu
Acknowledgment
• Kathy Huggins