Probabilistic Graphical Models
Introduction to GM
Eric Xing
Lecture 1, January 13, 2020
© Eric Xing @ CMU, 2005-2020
[Figure: cellular signaling network with variables X1-X8: Receptor A (X1), Receptor B (X2), Kinase C (X3), Kinase D (X4), Kinase E (X5), TF F (X6), Gene G (X7), Gene H (X8)]
Reading: see class homepage
Logistics
- Class webpage: http://www.cs.cmu.edu/~epxing/Class/10708-20/
Logistics
- Textbooks:
  - Daphne Koller and Nir Friedman, Probabilistic Graphical Models
  - M. I. Jordan, An Introduction to Probabilistic Graphical Models (chapters will be made available)
- Class announcements and discussion: Piazza
- Homework submission: Gradescope
- TAs:
  - Xun Zheng
  - Ben Lengerich
  - Haohan Wang
  - Yiwen Yuan
  - Xiang Si
  - Junxian He
- Lecturer: Eric Xing
- Class Assistant: Amy Protos
Logistics
- 4 homework assignments: 50% of grade
  - Theory exercises, implementation exercises
- Scribe duties: 10% (once or twice for the whole semester)
- Final project: 40% of grade
  - Applying PGMs to the development of a real, substantial ML system, e.g.:
    - Design and implement a (record-breaking) distributed logistic regression, gradient-boosted tree, deep network, or topic model on Petuum and apply it to ImageNet, Wikipedia, and/or other data
    - Build a web-scale topic or storyline tracking system for news media, or a paper recommendation system for conference review matching
    - An online car, people, or event detector for web images and webcams
    - An automatic "what's up here?" or "photo album" service on iPhone
  - Theoretical and/or algorithmic work, e.g.:
    - A more efficient approximate inference or optimization algorithm, e.g., based on stochastic approximation, proximal average, or other new techniques
    - A distributed sampling scheme with convergence guarantees
  - 3- or 4-member teams formed in the first three weeks; proposal, mid-way report, presentation & demo, final report, possibly a conference submission!
- Bonus:
  - Contribution to discussion on Piazza
  - Complete the mid-semester evaluation
Past projects:
- We will have a prize for the best project(s) …
- Award-winning projects:
  - J. Yang, Y. Liu, E. P. Xing and A. Hauptmann, Harmonium-Based Models for Semantic Video Representation and Classification, Proceedings of the Seventh SIAM International Conference on Data Mining (SDM 2007 best paper)
  - Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, Noah A. Smith, Retrofitting Word Vectors to Semantic Lexicons, NAACL 2015 best paper
  - Others, such as a KDD 2014 best paper
- Other projects:
  - Andreas Krause, Jure Leskovec and Carlos Guestrin, Data Association for Topic Intensity Tracking, 23rd International Conference on Machine Learning (ICML 2006)
  - M. Sachan, A. Dubey, S. Srivastava, E. P. Xing and Eduard Hovy, Spatial Compactness meets Topical Consistency: Jointly Modeling Links and Content for Community Detection, Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM 2014)
P(X1, X2, X3, X4, X5, X6, X7, X8)
Recap of Basic Prob. Concepts
- Representation: what is the joint probability distribution on multiple variables?
  - How many state configurations in total? 2^8
  - Do they all need to be represented explicitly?
  - Do we get any scientific/medical insight?
- Learning: where do we get all these probabilities?
  - Maximum-likelihood estimation? But how much data do we need?
  - Are there other estimation principles?
  - Where do we put domain knowledge, in terms of plausible relationships between variables and plausible values of the probabilities?
- Inference: if not all variables are observable, how do we compute the conditional distribution of latent variables given evidence?
  - Computing p(H|A) would require summing over all 2^6 configurations of the unobserved variables
Multivariate Distribution in High-D Space
- A possible world for cellular signal transduction:
[Figure: cell diagram with Membrane and Cytosol regions; Receptor A (X1), Receptor B (X2), Kinase C (X3), Kinase D (X4), Kinase E (X5), TF F (X6), Gene G (X7), Gene H (X8)]
A Structured View From Domain Experts
- Dependencies among variables
What are graphical models?
- Informally, a GM is just a graph representing relationships among random variables
  - Nodes: random variables (features, not examples)
  - Edges (or absence of edges): relationships
- Looks simple!
  - But detail matters, as always.
  - What exactly do we mean by "relationship"?
Relationship between two random variables
- Many types of relationships exist:
  - X and Y are correlated
  - X and Y are dependent
  - X and Y are independent
  - X and Y are partially correlated given Z
  - X and Y are conditionally dependent given Z
  - X and Y are conditionally independent given Z
  - X causes Y
  - Y causes X
  - ...
- Many of them can be measured by a "one-number summary"
Measure of association between two random variables
- Measures of association:
  - Pearson's correlation
  - Mutual information
  - Hilbert-Schmidt Independence Criterion (HSIC)
  - Partial correlation
  - …
- Why study them (rather than directly diving into graphical models)?
  - Gives a better understanding of what graphical models really mean
  - Useful when estimating the graph from data (later in the course)
Probability 101: Pearson’s correlation
- Normalized covariance:

  ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))

- Captures linear dependency
- Linear regression from X to Y gives slope β = Cov(X, Y) / Var(X)
- Important properties:
  - X ⫫ Y implies ρ(X, Y) = 0 (Why?)
  - ρ(X, Y) = 0 does not imply X ⫫ Y (Counterexamples?)
- Q1: Is there any measure that implies independence?
- Q2: What kind of dependency should they consider?
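The properties above can be checked numerically. A minimal sketch (assuming NumPy; the variables are illustrative): Y = X² is fully determined by X, yet for X ~ N(0, 1) the correlation is essentially zero.

```python
import numpy as np

def pearson(x, y):
    """Sample version of rho(X,Y) = Cov(X,Y) / sqrt(Var(X) Var(Y))."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = x**2                                # fully dependent on x, but not linearly

print(round(pearson(x, 2 * x + 1), 3))  # affine relationship: rho = 1.0
print(abs(pearson(x, y)) < 0.05)        # near zero despite full dependence
```

This is the classic counterexample for the second property: dependence without linear correlation.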
Strong measures of association

- Q1: Is there any measure that implies independence?
  - A1: Yes, many! We will mention two of them today.
- Q2: What kind of dependency should they consider?
  - A2: Nonlinear dependency.
- One way to construct such a measure of dependence:
  - If X ⫫ Y, then the joint pdf factorizes: P_XY = P_X P_Y
  - Measure a "distance" between P_XY and P_X P_Y
  - Distance == 0 if and only if X ⫫ Y
Mutual information
- Distance between two distributions?
  - Recall our old friend, the Kullback–Leibler divergence:

    KL(P, Q) = ∫ P(x) log [P(x) / Q(x)] dx

- Apply it, and we get another old friend, mutual information:

    I(X, Y) = KL(P_XY, P_X P_Y)

- Foundation of many topics later in the course
- I(X, Y) = 0 if and only if X ⫫ Y
Hilbert-Schmidt Independence Criterion (HSIC)
- A relatively recent(?) finding by Gretton et al. 2005
- Uses maximum mean discrepancy (MMD) as the distance metric:

    MMD(P, Q) = ‖μ_k(P) − μ_k(Q)‖_{H_k}
    HSIC(X, Y) = MMD(P_XY, P_X P_Y)
    μ_k(P) = E_{Z∼P}[φ(Z)]  (kernel embedding of P)
    φ(Z) = feature map of kernel k

- Looks scary! No need to know what it means for now.
- Will cover later if anyone's interested :)
- HSIC(X, Y) = 0 if and only if X ⫫ Y
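A standard biased empirical estimator of HSIC has a compact closed form, tr(KHLH)/n², where K and L are kernel Gram matrices of the X and Y samples and H is the centering matrix. A sketch under that estimator (assuming NumPy; the bandwidth, sample size, and seed are arbitrary choices, not from the slides):

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    """RBF kernel Gram matrix for a 1-D sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / n^2, with H = I - (1/n) 11^T."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return float(np.trace(K @ H @ L @ H) / n**2)

rng = np.random.default_rng(0)
x = rng.normal(size=400)
z = rng.normal(size=400)           # independent of x
print(hsic(x, x**2) > hsic(x, z))  # nonlinear dependence is detected
```

Unlike Pearson's correlation, HSIC flags the Y = X² dependence that correlation misses.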
But what do they have to do with graphical models?
- The marginal correlation/dependency graph for X = {X_1, …, X_d}:
  - The most primitive form of graphical model one can think of
  - Connect variables that have nontrivial pairwise correlation / mutual information / HSIC / etc.
- Not very informative. Why?
  - X = height of a kid
  - Y = vocabulary of a kid
  - Z = age of a kid
  - Q1: What is the marginal dependency graph?
  - Q2: What graph do you think would make more sense?
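To answer Q1 concretely: because age drives both height and vocabulary, every pair is marginally correlated, so the marginal dependency graph is fully connected. A simulated sketch (assuming NumPy; the linear coefficients and noise scales are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(2, 10, size=5000)                      # Z
height = 80 + 6 * age + rng.normal(scale=3, size=5000)   # X, driven by age
vocab = 500 * age + rng.normal(scale=300, size=5000)     # Y, driven by age

names = ["height", "vocab", "age"]
C = np.corrcoef(np.column_stack([height, vocab, age]), rowvar=False)
edges = [(names[i], names[j]) for i in range(3) for j in range(i + 1, 3)
         if abs(C[i, j]) > 0.1]
print(edges)  # all three pairs appear: the marginal graph is fully connected
```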
Partial correlation: accounting for other variables
- Partial correlation between X and Y given a random vector Z:
  - Correlation measured after eliminating the linear effect of Z
  - i.e., the correlation between the residuals from regressing X on Z and Y on Z:

    ρ(X, Y | Z) = ρ(e_X, e_Y) = Cov(e_X, e_Y) / √(Var(e_X) Var(e_Y))
    e_X = X − (β_X^T Z + intercept_X)
    e_Y = Y − (β_Y^T Z + intercept_Y)

- Similar to Pearson's correlation:
  - X ⫫ Y | Z implies ρ(X, Y | Z) = 0
  - ρ(X, Y | Z) = 0 does not imply X ⫫ Y | Z
Partial correlation graphs
- The partial correlation graph for X = {X_1, …, X_d}:
  - A more informative graphical model than the marginal dependency graph
  - Connect variables with nontrivial partial correlation given all the rest
  - Recall the height-vocab-age example (assuming everything is linear)
- A deeper look at the d × d partial correlation matrix R, with R_ij = ρ(X_i, X_j | X_{−ij}):
  - Looks scary! (So many regressions to run?!)
  - But it turns out R is just a rescaled inverse covariance matrix:

    R_ij = −Θ_ij / √(Θ_ii Θ_jj),  where Θ = Σ^{−1}

  - Homework :)
Conditional independence
- How do we measure conditional (in)dependence?
  - After seeing strong dependency measures and partial correlation, conditional independence appears to be harder than we thought…
- Ancient wisdom: if something is hard, assume Gaussian.
  - If (X, Y, Z) are jointly Gaussian, ρ(X, Y | Z) = 0 if and only if X ⫫ Y | Z
  - We will see later that many papers with a Gaussian assumption rely on this fact, even though it is rarely stated explicitly
Short Summary
Measures of association between X and Y:

- Marginal, linear: Pearson's correlation
- Marginal, nonlinear: dist(P_XY, P_X P_Y), instantiated as mutual information (KL divergence) or the Hilbert-Schmidt Independence Criterion (max mean discrepancy)
- Non-marginal (partial), linear: partial correlation
- Non-marginal (partial), nonlinear: conditional independence
What’s next?
Lecture 2: Conditional independence graph
- Goes by many different names:
  - Conditional independence graphs (CIG)
  - Markov networks (MN)
  - Markov random fields (MRF)
  - Undirected graphical models (UG)
- Many interesting properties; widely used in physics, statistics, computer vision, NLP, deep learning, bioinformatics, coding theory, finance, …

[Figure: Ising/Potts model with nodes X, Y1, Y2]
Lecture 3: Directed graphical models
- Another major class of models; also has many names:
  - Directed graphical models
  - Directed acyclic graphs (DAG) (cyclic models exist but are hard to work with)
  - Bayesian networks (BN)
  - Structural equation models (SEM)
  - Structural causal models (SCM)
  - …
- A powerful language to express structured knowledge, e.g.:

  P(X1, X2, X3, X4, X5, X6, X7, X8)
  = P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)
Lecture 4-13 (tentative): Inference and Learning
- Given a graphical model representing our knowledge
- Inference:
  - What is the marginal/conditional density?
  - What is the mean of the marginal/conditional?
  - What is the mode of the marginal/conditional?
  - Can we draw samples from the marginal/conditional?
  - …
- Learning: statistical parameter estimation and model selection
Lecture 15-end (tentative): Modern GMs
- The relationship between deep learning and graphical models
- Deep generative models and their unified view
- Reinforcement learning as probabilistic inference
- GMs on functions and sets
- Bayesian nonparametrics
- Large-scale algorithms and systems
- 2-3 open slots:
  - We will list several candidate topics
  - Your voice matters!
Why graphical models
- A language for communication
- A language for computation
- A language for development
- Origins:
  - Wright, 1920s
  - Independently developed by Spiegelhalter and Lauritzen in statistics and by Pearl in computer science in the late 1980s
Probabilistic Graphical Models
- If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

  P(X1, X2, X3, X4, X5, X6, X7, X8)
  = P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)

  2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 in representation cost!

- Why might we favor a PGM?
  - Incorporation of domain knowledge and causal (logical) structures
  - Modular combination of heterogeneous parts: data fusion
  - Bayesian philosophy: knowledge meets data
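The 36-vs-2^8 arithmetic comes from adding up conditional probability table sizes: a binary variable with k parents needs a table of 2^(k+1) entries. A sketch of the count, with the parent sets read off the factorization above:

```python
# Parent sets from the factorization
# P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)
parents = {
    "X1": [], "X2": [], "X3": ["X1"], "X4": ["X2"], "X5": ["X2"],
    "X6": ["X3", "X4"], "X7": ["X6"], "X8": ["X5", "X6"],
}

# Each binary CPT P(X | parents) has 2^(1 + #parents) entries
factored_cost = sum(2 ** (1 + len(p)) for p in parents.values())
full_joint_cost = 2 ** len(parents)

print(factored_cost)    # 2+2+4+4+4+8+4+8 = 36
print(full_joint_cost)  # 256, roughly an 8-fold difference
```

The gap widens exponentially as the number of variables grows while the parent sets stay small.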
--- M. Jordan
Why graphical models
- Probability theory provides the glue whereby the parts are combined, ensuring that the system as a whole is consistent, and providing ways to interface models to data.
- The graph-theoretic side of graphical models provides both an intuitively appealing interface by which humans can model highly interacting sets of variables, and a data structure that lends itself naturally to the design of efficient general-purpose algorithms.
- Many of the classical multivariate probabilistic systems studied in fields such as statistics, systems engineering, information theory, pattern recognition and statistical mechanics are special cases of the general graphical model formalism.
- The graphical model framework provides a way to view all of these systems as instances of a common underlying formalism.
Questions?
Appendix
What Are Graphical Models?
Graph, Model M, Data D = {x_1^(n), x_2^(n), …, x_d^(n)}_{n=1}^N
Reasoning under uncertainty!
Speech recognition
Information retrieval
Computer vision
Robotic control
Planning
Games
Evolution
Pedigree
The Fundamental Questions
- Representation
  - How to capture/model uncertainties in possible worlds?
  - How to encode our domain knowledge/assumptions/constraints?
- Inference
  - How do I answer questions/queries according to my model and/or based on given data?
- Learning
  - What model is "right" for my data?
e.g., inference: P(X_i | D)
e.g., learning: M = argmax_{M ∈ F} F(M; D)

P(X1, X2, X3, X4, X5, X6, X7, X8)
Recap of Basic Prob. Concepts
- Representation: what is the joint probability distribution on multiple variables?
  - How many state configurations in total? 2^8
  - Do they all need to be represented explicitly?
  - Do we get any scientific/medical insight?
- Learning: where do we get all these probabilities?
  - Maximum-likelihood estimation? But how much data do we need?
  - Are there other estimation principles?
  - Where do we put domain knowledge, in terms of plausible relationships between variables and plausible values of the probabilities?
- Inference: if not all variables are observable, how do we compute the conditional distribution of latent variables given evidence?
  - Computing p(H|A) would require summing over all 2^6 configurations of the unobserved variables
What is a Graphical Model? --- Multivariate Distribution in High-D Space
- A possible world for cellular signal transduction:
[Figure: cell diagram with Membrane and Cytosol regions; Receptor A (X1), Receptor B (X2), Kinase C (X3), Kinase D (X4), Kinase E (X5), TF F (X6), Gene G (X7), Gene H (X8)]
GM: Structure Simplifies Representation
- Dependencies among variables
P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)
Probabilistic Graphical Models
- If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,
- Why might we favor a PGM?
  - Incorporation of domain knowledge and causal (logical) structures
1+1+2+2+2+4+2+4 = 18, roughly a 14-fold reduction from 2^8 = 256 in representation cost!
Stay tuned for what these independencies are!
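The parameter counts on these slides (18 free parameters on this one, 36 full-table entries on a later one, versus 2^8 = 256 joint entries) follow mechanically from the parent sets; a short sketch:

```python
# Parameter counting for the 8-node binary network on the slides.
# A CPD over a node with k binary parents has 2^k rows: a minimal
# parameterization stores 1 free number per row, a full table stores 2.
parents = {1: [], 2: [], 3: [1], 4: [2], 5: [2], 6: [3, 4], 7: [6], 8: [5, 6]}

minimal = sum(2 ** len(ps) for ps in parents.values())        # free parameters
table   = sum(2 ** (len(ps) + 1) for ps in parents.values())  # full table entries
full_joint = 2 ** len(parents)                                # unfactored cost

print(minimal, table, full_joint)  # 18 36 256
```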
GM: Data Integration
© Eric Xing @ CMU, 2005-2020 37
More Data Integration
q Text + Image + Network → Holistic Social Media
q Genome + Proteome + Transcriptome + Phenome + … → PanOmic Biology
© Eric Xing @ CMU, 2005-2020 38
Probabilistic Graphical Models
q If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,
q Why might we favor a PGM?
q Incorporation of domain knowledge and causal (logical) structures
q Modular combination of heterogeneous parts – data fusion
© Eric Xing @ CMU, 2005-2020 39
2+2+4+4+4+8+4+8 = 36, roughly a 7-fold reduction from 2^8 = 256 in representation cost!
P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X2) P(X4|X2) P(X5|X2) P(X1) P(X3|X1) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)
q This allows us to capture uncertainty about the model in a principled way
q But how can we specify and represent a complicated model?
q Typically the number of genes to be modeled is on the order of thousands!
Rational Statistical Inference
© Eric Xing @ CMU, 2005-2020 40
h: a hypothesis; d: the data

The Bayes Theorem:

    p(h | d) = p(d | h) p(h) / Σ_{h' ∈ H} p(d | h') p(h')

(posterior probability = likelihood × prior probability, with the denominator summing over the space of hypotheses H)
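The theorem is easy to exercise on a toy hypothesis space; the priors and likelihoods below are made-up numbers for illustration:

```python
# Bayes theorem over a small discrete hypothesis space:
# p(h | d) = p(d | h) p(h) / sum_h' p(d | h') p(h')
prior      = {'h1': 0.5, 'h2': 0.3, 'h3': 0.2}       # p(h), illustrative
likelihood = {'h1': 0.10, 'h2': 0.40, 'h3': 0.70}    # p(d | h), illustrative

evidence  = sum(likelihood[h] * prior[h] for h in prior)           # p(d)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}

print({h: round(p, 3) for h, p in posterior.items()})
```

Note how a hypothesis with a small prior (h3) can still dominate the posterior once the data favor it strongly enough.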
GM: MLE and Bayesian Learning
q Probabilistic statements about Θ are conditioned on the values of the observed variables A_obs and the prior p(Θ; χ)
© Eric Xing @ CMU, 2005-2020 41
A = (A,B,C,D,E,…) = (T,F,F,T,F,…)
    (A,B,C,D,E,…) = (T,F,T,T,F,…)
    ……..
    (A,B,C,D,E,…) = (F,T,T,T,F,…)
[Figure: candidate network structures over nodes A–H, each annotated with an example CPT P(F | C,D) (entries such as 0.9/0.1, 0.2/0.8, 0.01/0.99)]
p(Θ; χ)
p(Θ | A; χ) ∝ p(A | Θ; χ) p(Θ; χ)    (posterior ∝ likelihood × prior)
Θ_Bayes = ∫ Θ p(Θ | A; χ) dΘ
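As a sketch of the estimator Θ_Bayes = ∫ Θ p(Θ | A; χ) dΘ, consider a single Bernoulli parameter with a Beta prior (a conjugate pair, so the posterior mean is available in closed form); all numbers here are illustrative:

```python
# Bayesian estimation of a Bernoulli parameter with a Beta(a, b) prior.
# The posterior is Beta(a + heads, b + tails), so the integral defining
# Theta_Bayes has a closed form; we also evaluate it on a grid to compare.
a, b = 2.0, 2.0            # prior pseudo-counts (the role of the prior χ)
data = [1, 1, 0, 1, 1, 0]  # observed samples A, illustrative
heads, tails = sum(data), len(data) - sum(data)

# Closed-form posterior mean of Beta(a + heads, b + tails):
theta_bayes = (a + heads) / (a + b + len(data))

# The same integral done numerically on a midpoint grid:
N = 100000
grid = [(i + 0.5) / N for i in range(N)]
w = [t ** (a + heads - 1) * (1 - t) ** (b + tails - 1) for t in grid]
Z = sum(w)                                      # unnormalized posterior mass
numeric = sum(t * wi for t, wi in zip(grid, w)) / Z

print(round(theta_bayes, 4), round(numeric, 4))
```

The prior pseudo-counts pull the estimate toward 0.5 relative to the raw frequency 4/6, which is exactly the "knowledge meets data" behavior the next slide names.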
Probabilistic Graphical Models
q If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,
q Why might we favor a PGM?
q Incorporation of domain knowledge and causal (logical) structures
q Modular combination of heterogeneous parts – data fusion
q Bayesian Philosophy
l Knowledge meets data
© Eric Xing @ CMU, 2005-2020 42
2+2+4+4+4+8+4+8 = 36, roughly a 7-fold reduction from 2^8 = 256 in representation cost!
P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)
So What Is a PGM After All?
© Eric Xing @ CMU, 2005-2020 43
In a nutshell:
PGM = Multivariate Statistics + Structure
GM = Multivariate Obj. Func. + Structure
So What Is a PGM After All?
q The informal blurb:
q It is a smart way to write/specify/compose/design exponentially large probability distributions without paying an exponential cost, and at the same time endow the distributions with structured semantics
q A more formal description:
q It refers to a family of distributions on a set of random variables that are compatible with all the probabilistic independence propositions encoded by a graph that connects these variables
© Eric Xing @ CMU, 2005-2020 44
P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)
Two types of GMs
l Directed edges give causality relationships (Bayesian Network or Directed Graphical Model):
l Undirected edges simply give correlations between variables (Markov Random Field or Undirected Graphical Model):
© Eric Xing @ CMU, 2005-2020 45
P(X1, X2, X3, X4, X5, X6, X7, X8)
= P(X1) P(X2) P(X3|X1) P(X4|X2) P(X5|X2) P(X6|X3,X4) P(X7|X6) P(X8|X5,X6)
P(X1, X2, X3, X4, X5, X6, X7, X8)
= 1/Z exp{E(X1) + E(X2) + E(X3,X1) + E(X4,X2) + E(X5,X2) + E(X6,X3,X4) + E(X7,X6) + E(X8,X5,X6)}
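The undirected factorization can be sketched the same way: exponentiated clique energies, normalized by the partition function Z, which for this small model we can compute by brute force (the energy functions are arbitrary illustrative choices, not from the lecture):

```python
import itertools
import math

# The same 8-node graph scored by clique energies E(...) instead of CPDs.
# Energies are arbitrary illustrative functions of their arguments.
def E1(x): return 0.5 * x
def E2(x): return -0.3 * x
def E_pair(x, y): return 0.4 * x * y            # shared pairwise energy
def E_trip(x, y, z): return 0.2 * x * y * z     # shared triple-clique energy

def score(x):
    """Unnormalized exp{sum of clique energies}, mirroring the slide."""
    x1, x2, x3, x4, x5, x6, x7, x8 = x
    return math.exp(E1(x1) + E2(x2) + E_pair(x3, x1) + E_pair(x4, x2)
                    + E_pair(x5, x2) + E_trip(x6, x3, x4)
                    + E_pair(x7, x6) + E_trip(x8, x5, x6))

# Z turns the exponentiated energies into a proper distribution.
Z = sum(score(x) for x in itertools.product([0, 1], repeat=8))
P = lambda x: score(x) / Z
total = sum(P(x) for x in itertools.product([0, 1], repeat=8))
print(round(total, 10))  # 1.0
```

Unlike the directed case, nothing here is locally normalized: the global constant Z is what makes P a distribution, and computing it is in general the expensive part.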
[Figure: a node X in a directed graph with its parents, children (Y1, Y2), children's co-parents, ancestors, and descendants]
Bayesian Networks
© Eric Xing @ CMU, 2005-2020 46
q Structure: DAG
q Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket
q Local conditional distributions (CPD) and the DAG completely determine the joint dist.
q Give causality relationships, and facilitate a generative process
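The "generative process" a BN facilitates is ancestral sampling: draw each node in topological order given its already-sampled parents. A sketch with hypothetical CPDs for the eight-node network on the slides:

```python
import random

# Ancestral sampling from the 8-node BN; CPD numbers are illustrative.
random.seed(0)
p1, p2 = 0.6, 0.3
p3 = {0: 0.2, 1: 0.7}
p4 = {0: 0.5, 1: 0.1}
p5 = {0: 0.4, 1: 0.8}
p6 = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9}
p7 = {0: 0.3, 1: 0.8}
p8 = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.95}

def flip(p):
    return 1 if random.random() < p else 0

def sample():
    """Draw each node after its parents, following the DAG order."""
    x1, x2 = flip(p1), flip(p2)
    x3, x4, x5 = flip(p3[x1]), flip(p4[x2]), flip(p5[x2])
    x6 = flip(p6[(x3, x4)])
    x7, x8 = flip(p7[x6]), flip(p8[(x5, x6)])
    return (x1, x2, x3, x4, x5, x6, x7, x8)

draws = [sample() for _ in range(10000)]
freq = sum(d[0] for d in draws) / len(draws)
print(round(freq, 2))  # close to the CPD value P(X1=1) = 0.6
```

This one-pass sampler is exactly what an undirected model lacks "an explicit way" to do, as the next slide notes.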
[Figure: a node X with undirected neighbors Y1, Y2]
Markov Random Fields
© Eric Xing @ CMU, 2005-2020 47
q Structure: undirected graph
q Meaning: a node is conditionally independent of every other node in the network given its direct neighbors
q Local contingency functions (potentials) and the cliques in the graph completely determine the joint dist.
q Give correlations between variables, but no explicit way to generate samples
Towards structural specification of probability distribution
q Separation properties in the graph imply independence properties about the associated variables
q For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents
q The Equivalence Theorem:
For a graph G, let D1 denote the family of all distributions that satisfy I(G), and let D2 denote the family of all distributions that factor according to G. Then D1 ≡ D2.
© Eric Xing @ CMU, 2005-2020 48
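One direction of the theorem ("factors according to G ⇒ satisfies I(G)") can be illustrated numerically on the smallest interesting case, a chain A → B → C, where any distribution P(A) P(B|A) P(C|B) must make A and C independent given B (the CPD values below are arbitrary):

```python
import itertools

# A chain A -> B -> C built by factorization; check A _||_ C | B numerically.
pA = {0: 0.7, 1: 0.3}
pB = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}   # P(B | A)
pC = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # P(C | B)

P = {(a, b, c): pA[a] * pB[a][b] * pC[b][c]
     for a, b, c in itertools.product([0, 1], repeat=3)}

def cond_ac(p_abc, b):
    """p(A, C | B=b) as a dict over (a, c)."""
    marg = sum(p for (a2, b2, c2), p in p_abc.items() if b2 == b)
    return {(a, c): p_abc[(a, b, c)] / marg
            for a, c in itertools.product([0, 1], repeat=2)}

ok = True
for b in (0, 1):
    pac = cond_ac(P, b)
    pa = {a: pac[(a, 0)] + pac[(a, 1)] for a in (0, 1)}   # p(A | B=b)
    pc = {c: pac[(0, c)] + pac[(1, c)] for c in (0, 1)}   # p(C | B=b)
    for a, c in itertools.product([0, 1], repeat=2):
        # Independence given B means the conditional joint is the product.
        ok = ok and abs(pac[(a, c)] - pa[a] * pc[c]) < 1e-12

print(ok)  # True
```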
Density estimation
Regression
Classification
Parametric and nonparametric methods
Linear, conditional mixture, nonparametric
Generative and discriminative approach
[Figure: each task drawn as a small GM, e.g. Θ → X for density estimation, X → Y for regression/classification, with μ, σ as parameters of X]
GMs are your old friends
© Eric Xing @ CMU, 2005-2020 49
Clustering
(Picture by Zoubin Ghahramani and Sam Roweis)
An (incomplete) genealogy of graphical models
© Eric Xing @ CMU, 2005-2020 50
Fancier GMs: reinforcement learning
q Partially observed Markov decision processes (POMDP)
© Eric Xing @ CMU, 2005-2020 51
Fancier GMs: machine translation
© Eric Xing @ CMU, 2005-2020 52
SMT
The HM-BiTAM model (B. Zhao and E. P. Xing, ACL 2006)
Fancier GMs: genetic pedigree
© Eric Xing @ CMU, 2005-2020 53
[Figure: an allele network over pedigree variables A0, A1, Ag, B0, B1, Bg, M0, M1, F0, F1, Fg, C0, C1, Cg, Sg]
Fancier GMs: solid state physics
© Eric Xing @ CMU, 2005-2020 54
Ising/Potts model
Application of GMs
q Machine Learning
q Computational statistics
q Computer vision and graphics
q Natural language processing
q Information retrieval
q Robotic control
q Decision making under uncertainty
q Error-control codes
q Computational biology
q Genetics and medical diagnosis/prognosis
q Finance and economics
q Etc.
© Eric Xing @ CMU, 2005-2020 55