Post on 04-Feb-2016
Information Bottleneck
presented by
Boris Epshtein & Lena Gorelick
Advanced Topics in Computer and Human Vision
Spring 2004
Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application
Motivation
Clustering Problem
Motivation
• “Hard” Clustering – partitioning of the input data into several exhaustive and mutually exclusive clusters
• Each cluster is represented by a centroid
Motivation
• “Good” clustering – should group similar data points together and dissimilar points apart
• Quality of partition – average distortion between the data points and corresponding representatives (cluster centroids)
• “Soft” Clustering – each data point is assigned to all clusters with some normalized probability
• Goal – minimize expected distortion between the data points and cluster centroids
Motivation
Complexity-Precision Trade-off
• Too simple model → poor precision
• Higher precision requires a more complex model
• Too complex model → overfitting
Motivation…
Complexity-Precision Trade-off
• Too complex model
  – can lead to overfitting (poor generalization)
  – is hard to learn
• Too simple model
  – cannot capture the real structure of the data
• Examples of approaches:
  – SRM: Structural Risk Minimization
  – MDL: Minimum Description Length
  – Rate Distortion Theory
Entropy
• The measure of uncertainty about the random variable X:
  H(X) = −Σₓ p(x)·log p(x)
Definitions…
Entropy - Example
– Fair coin (p = 0.5): H(X) = 1 bit
– Unfair coin (p ≠ 0.5): H(X) < 1 bit
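The definition above can be checked with a minimal Python sketch (not from the slides; the function name is illustrative):

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))  # fair coin -> 1.0 bit
print(entropy([0.9, 0.1]))  # unfair coin -> ~0.469 bits, less than 1
```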
Entropy - Illustration
• Entropy is highest for the uniform distribution and lowest (zero) for a deterministic one
Conditional Entropy
• The measure of uncertainty about the random variable X given the value of the variable Y:
  H(X|Y) = −Σ p(x,y)·log p(x|y)
Conditional Entropy - Example
Mutual Information
• The reduction in uncertainty of X due to the knowledge of Y:
  I(X;Y) = H(X) − H(X|Y)
  – Nonnegative: I(X;Y) ≥ 0
  – Symmetric: I(X;Y) = I(Y;X)
  – Convex w.r.t. p(y|x) for a fixed p(x)
Mutual Information - Example
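Mutual information can be computed directly from a joint distribution table; here is an illustrative Python sketch (the function name and table layout are assumptions, not from the slides):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ), in bits.
       `joint` is a nested list with joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # independent -> 0.0
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # fully correlated -> 1.0
```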
Kullback-Leibler Distance
• A distance between distributions p and q over the same alphabet:
  D_KL(p||q) = Σₓ p(x)·log( p(x) / q(x) )
  – Nonnegative: D_KL(p||q) ≥ 0, with equality iff p = q
  – Asymmetric: in general D_KL(p||q) ≠ D_KL(q||p)
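The two properties are easy to verify numerically; a minimal sketch (illustrative names, not from the slides):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log2( p(x) / q(x) ), in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, p))  # identical distributions -> 0.0
print(kl_divergence(p, q))  # -> ~0.737, nonnegative
print(kl_divergence(q, p))  # a different value: the distance is asymmetric
```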
Rate Distortion Theory - Introduction
• Goal: obtain compact clustering of the data with minimal expected distortion
• Distortion measure is a part of the problem setup
• The clustering and its quality depend on the choice of the distortion measure
Rate Distortion Theory
• Obtain compact clustering of the data with minimal expected distortion given fixed set of representatives
Cover & Thomas
• T = X: zero distortion, but not compact
• A single cluster (|T| = 1): very compact, but high distortion
Rate Distortion Theory - Intuition
• The quality of the clustering is determined by two quantities
  – Complexity is measured by I(T;X)
  – Distortion is measured by Ed(X,T)
Rate Distortion Theory – Cont.
• I(T;X) is a.k.a. the rate
Rate Distortion Plane
[Figure: the rate I(T;X) vs. expected distortion Ed(X,T) plane, spanning maximal compression to minimal distortion; D marks the distortion constraint]
• Higher values of D mean a more relaxed distortion constraint
• Stronger compression levels are then attainable
Rate Distortion Function
• Let D be an upper bound constraint on the expected distortion Ed(X,T)
• Given the distortion constraint D, find the most compact model (with the smallest complexity I(T;X))
Rate Distortion Function
• Given:
  – a set of points X with prior p(x)
  – a set of representatives T
  – a distortion measure d(x,t)
• Find:
  – the most compact soft clustering p(t|x) of the points of X that satisfies the distortion constraint D
• Rate Distortion Function:
  R(D) = min I(T;X) over all p(t|x) with Ed(X,T) ≤ D
Rate Distortion Function
• Minimize the functional
  F[p(t|x)] = I(T;X) + β·Ed(X,T)
  – I(T;X): complexity term
  – Ed(X,T): distortion term
  – β: Lagrange multiplier
Rate Distortion Curve
[Figure: the rate-distortion curve in the I(T;X) vs. Ed(X,T) plane, from maximal compression to minimal distortion]
• Minimize F = I(T;X) + β·Ed(X,T), subject to the normalization of p(t|x)
• The minimum is attained on the rate distortion curve
Rate Distortion Function
• Minimize F[p(t|x)] = I(T;X) + β·Ed(X,T) over the unknown p(t|x), subject to the normalization constraint Σₜ p(t|x) = 1
• p(x) and d(x,t) are known
Solution - Analysis
• The solution is implicit:
  p(t|x) = p(t)·exp(−β·d(x,t)) / Z(x,β)
• When x is similar to t, d(x,t) is small → closer points are attached to t with higher probability
Solution - Analysis
• Solution: p(t|x) = p(t)·exp(−β·d(x,t)) / Z(x,β)
• For a fixed x, β → 0 reduces the influence of the distortion:
  – p(t|x) → p(t), which does not depend on x
  – this + maximal compression → a single cluster
Solution - Analysis
• Solution: p(t|x) = p(t)·exp(−β·d(x,t)) / Z(x,β)
• As β → ∞, most of the conditional probability p(t|x) goes to some t with the smallest distortion → hard clustering
[Figures: the solution for a fixed t, for a fixed x, and for varying β]
Solution - Analysis
• For intermediate values of β: intermediate soft clustering, intermediate complexity
Blahut-Arimoto Algorithm
• Input: the prior p(x), the representatives T, the distortion measure d(x,t), and the parameter β
• Randomly initialize p(t)
• Alternate the two update steps until convergence:
  – p(t|x) = p(t)·exp(−β·d(x,t)) / Z(x,β)
  – p(t) = Σₓ p(x)·p(t|x)
• Optimize a convex function over a convex set → the minimum is global
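The two alternating updates can be sketched in Python (an illustrative toy implementation, not the original code; the names and the uniform initialization are assumptions):

```python
import math

def blahut_arimoto(p_x, d, beta, iters=200):
    """Alternate the two Blahut-Arimoto updates:
       p(t|x) = p(t) exp(-beta d(x,t)) / Z(x,beta)  and  p(t) = sum_x p(x) p(t|x)."""
    n, m = len(p_x), len(d[0])       # number of points / representatives
    p_t = [1.0 / m] * m              # cluster prior (uniform init; could be random)
    p_t_given_x = []
    for _ in range(iters):
        p_t_given_x = []
        for i in range(n):
            w = [p_t[t] * math.exp(-beta * d[i][t]) for t in range(m)]
            z = sum(w)               # partition function Z(x, beta)
            p_t_given_x.append([wi / z for wi in w])
        p_t = [sum(p_x[i] * p_t_given_x[i][t] for i in range(n)) for t in range(m)]
    return p_t_given_x, p_t

# two points, two representatives, d[i][t] = distortion matrix
q, p_t = blahut_arimoto([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]], beta=5.0)
print(q[0])  # point 0 is assigned mostly to representative 0
```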
Blahut-Arimoto Algorithm
Advantages:
• Obtains compact clustering of the data with minimal expected distortion
• Optimal clustering given fixed set of representatives
Blahut-Arimoto Algorithm
Drawbacks:
• Distortion measure is a part of the problem setup
  – Hard to obtain for some problems
  – Equivalent to determining the relevant features
• Fixed set of representatives
• Slow convergence
Rate Distortion Theory – Additional Insights
– Another problem would be to find optimal representatives given the clustering.
– Joint optimization of the clustering and the representatives doesn’t have a unique solution (similar to EM or K-means)
Information Bottleneck
• Copes with the drawbacks of Rate Distortion approach
• Compress the data while preserving “important” (relevant) information
• It is often easier to define what information is important than to define a distortion measure.
• Replace the distortion upper bound constraint by a lower bound constraint over the relevant information
Tishby, Pereira & Bialek, 1999
Information Bottleneck - Example
• Given: a joint prior p(Word, Topic), estimated from documents
Information Bottleneck - Example
• Obtain: a partitioning of the words into clusters
[Diagram: Words → Clusters → Topics, with the quantities I(Word;Cluster), I(Cluster;Topic), I(Word;Topic)]
Information Bottleneck - Example
Extreme case 1:
• I(Word;Cluster) = 0 and I(Cluster;Topic) = 0
• Very compact, but not informative
Information Bottleneck - Example
Extreme case 2:
• I(Word;Cluster) = max and I(Cluster;Topic) = max
• Very informative, but not compact
• Goal: minimize I(Word;Cluster) and maximize I(Cluster;Topic)
Information Bottleneck
• Trade-off: compactness vs. relevant information
[Diagram: words → bottleneck clusters → topics]
Relevance Compression Curve
[Figure: the relevance-compression plane, from maximal compression to maximal relevant information; D marks the relevance constraint]
• Let D be the minimal allowed value of the relevant information I(T;Y)
• A smaller D means a more relaxed relevant-information constraint, so stronger compression levels are attainable
Relevance Compression Function
• Given the relevant information constraint D, find the most compact model (with the smallest I(T;X))
Relevance Compression Function
• Minimize the functional
  L[p(t|x)] = I(T;X) − β·I(T;Y)
  – I(T;X): compression term
  – I(T;Y): relevance term
  – β: Lagrange multiplier
Relevance Compression Curve
[Figure: the relevance-compression curve, from maximal compression to maximal relevant information]
• Minimize L subject to the normalization of p(t|x)
• The minimum is attained on the relevance compression curve
Relevance Compression Function
• Minimize L[p(t|x)] = I(T;X) − β·I(T;Y) over p(t|x), subject to the normalization constraint Σₜ p(t|x) = 1
Solution - Analysis
• The solution is implicit:
  p(t|x) = p(t)·exp(−β·KL[p(y|x) || p(y|t)]) / Z(x,β)
• p(x) and p(y|x) are known
Solution - Analysis
• Solution: p(t|x) = p(t)·exp(−β·KL[p(y|x) || p(y|t)]) / Z(x,β)
• The KL distance emerges as the effective distortion measure from the IB principle
• The optimization is also over the cluster representatives p(y|t)
• When p(y|x) is similar to p(y|t), the KL distance is small → such points x are attached to t with higher probability
• For a fixed x, β → 0 reduces the influence of the KL term:
  – p(t|x) → p(t), which does not depend on x
  – this + maximal compression → a single cluster
Solution - Analysis
• Solution: p(t|x) = p(t)·exp(−β·KL[p(y|x) || p(y|t)]) / Z(x,β)
• As β → ∞, most of the conditional probability p(t|x) goes to some t with the smallest KL distance (hard mapping)
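For reference, the implicit IB solution can be written as three self-consistent equations (standard form, reconstructed here from the definitions on the preceding slides):

```latex
p(t|x) = \frac{p(t)}{Z(x,\beta)}\,
         \exp\!\big(-\beta\, D_{KL}\!\left[\,p(y|x)\,\|\,p(y|t)\,\right]\big),
\qquad
p(t) = \sum_x p(x)\,p(t|x),
\qquad
p(y|t) = \frac{1}{p(t)}\sum_x p(x,y)\,p(t|x)
```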
Relevance Compression Curve
[Figure: the relevance-compression curve, from maximal compression to maximal relevant information; the β → ∞ limit reaches a hard mapping]
Iterative Optimization Algorithm (iIB)
• Input: the joint prior p(x,y), the trade-off parameter β, and the cardinality of T
• Randomly initialize p(t|x)
Pereira, Tishby & Lee, 1993; Tishby, Pereira & Bialek, 2001
Iterative Optimization Algorithm (iIB)
• Iterate until convergence:
  – update p(cluster | word)
  – update p(cluster)
  – update p(topic | cluster)
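The three updates can be sketched in Python (an illustrative toy implementation; the names, the random initialization, and the small numerical safeguards are assumptions):

```python
import math
import random

def iib(p_xy, n_clusters, beta, iters=100, seed=0):
    """iIB sketch: alternate the three self-consistent updates for
       p(t|x), p(t), p(y|t), given the joint prior p_xy[i][j] = p(x_i, y_j)."""
    rng = random.Random(seed)
    n_x, n_y = len(p_xy), len(p_xy[0])
    p_x = [sum(row) for row in p_xy]
    p_y_x = [[p_xy[i][j] / p_x[i] for j in range(n_y)] for i in range(n_x)]
    q = []                                     # random soft init of p(t|x)
    for _ in range(n_x):
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        s = sum(w)
        q.append([wi / s for wi in w])
    for _ in range(iters):
        # p(t) = sum_x p(x) p(t|x)
        p_t = [sum(p_x[i] * q[i][t] for i in range(n_x)) for t in range(n_clusters)]
        # p(y|t) = (1/p(t)) sum_x p(x,y) p(t|x)
        p_y_t = [[sum(p_xy[i][j] * q[i][t] for i in range(n_x)) / max(p_t[t], 1e-12)
                  for j in range(n_y)] for t in range(n_clusters)]
        # p(t|x) proportional to p(t) exp(-beta * KL[p(y|x) || p(y|t)])
        for i in range(n_x):
            w = []
            for t in range(n_clusters):
                kl = sum(p_y_x[i][j] * math.log(p_y_x[i][j] / max(p_y_t[t][j], 1e-12))
                         for j in range(n_y) if p_y_x[i][j] > 0)
                w.append(p_t[t] * math.exp(-beta * kl))
            z = sum(w)
            q[i] = [wi / z for wi in w]
    return q
```

With a joint prior containing two groups of x values that carry opposite information about a binary y, a large β drives the solution toward a hard two-cluster mapping, as described on the solution-analysis slides.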
iIB simulation
• Given:
  – 300 instances of X with prior p(x)
  – a binary relevant variable Y
  – the joint prior p(x,y)
• Obtain:
  – the optimal clustering (with minimal I(T;X))
[Figure: the X points and their priors]
iIB simulation…
For each point x, p(y|x) is given by the color of the point on the map
iIB simulation…
Single Cluster – Maximal Compression
[A sequence of frames shows the clusters gradually splitting as β increases]
iIB simulation…
Hard Clustering – Maximal Relevant Information
Iterative Optimization Algorithm (iIB)
• Analogous to K-means or EM
• Optimizes a non-convex functional over 3 convex sets → the minimum is only local
“Semantic change” in the clustering solution
Advantages:
• Defining relevant variable is often easier and more intuitive than defining distortion measure
• Finds local minimum
Iterative Optimization Algorithm (iIB)
Drawbacks:
• Finds local minimum (suboptimal solutions)
• Need to specify the parameters β and the number of clusters
• Slow convergence
• Large data sample is required
Iterative Optimization Algorithm (iIB)
• Iteratively increase the parameter β, and adapt the solution from the previous value of β to the new one
• Track the changes in the solution as the system shifts its preference from compression to relevance
• Tries to reconstruct the relevance-compression curve
Deterministic Annealing-like algorithm (dIB)
Slonim, Friedman, Tishby, 2002
Solution from previous step:
Deterministic Annealing-like algorithm (dIB)
Deterministic Annealing-like algorithm (dIB)
• Duplicate each cluster and apply a small perturbation to the copies
Deterministic Annealing-like algorithm (dIB)
Apply iIB using the duplicated cluster set as initialization
Deterministic Annealing-like algorithm (dIB)
• If the two copies are different → keep the split
• Else → use the old cluster
Deterministic Annealing-like algorithm (dIB) - Illustration
[Figure: which clusters split at which values of β]
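The duplicate-perturb-refine-check outer loop can be sketched as follows (a schematic, with `refine` standing in for the iIB refinement step; all names and parameters are illustrative assumptions):

```python
import random

def dib_outer_loop(clusters, betas, refine, distance,
                   split_threshold=0.05, eps=0.01):
    """dIB outer loop (sketch): for each beta, duplicate-and-perturb every
       cluster, refine the copies, and keep a split only if they diverged.
       `clusters` is a list of p(y|t) vectors; `refine(c, beta)` is a
       stand-in for running iIB from that initialization."""
    rng = random.Random(0)
    for beta in betas:
        candidates = []
        for c in clusters:
            # duplicate each cluster with a small random perturbation
            c1 = [v + eps * rng.uniform(-1, 1) for v in c]
            c2 = [v - eps * rng.uniform(-1, 1) for v in c]
            c1, c2 = refine(c1, beta), refine(c2, beta)
            if distance(c1, c2) > split_threshold:
                candidates += [c1, c2]   # the copies diverged: keep the split
            else:
                candidates.append(c)     # no real split: keep the old cluster
        clusters = candidates
    return clusters
```

With a strict threshold no cluster splits and the solution is simply tracked across β; with a loose threshold every perturbation survives as a split, which is why the threshold needs tuning (one of the drawbacks listed below).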
Advantages:
• Finds local minimum (suboptimal solutions)
• Speeds up convergence by adapting the previous solution
Deterministic Annealing-like algorithm (dIB)
Drawbacks:
• Need to specify and tune several parameters:
- perturbation size
- step size for β (splits might be “skipped”)
- similarity threshold for splitting
- may need to vary parameters during the process
• Finds local minimum (suboptimal solutions)
• Large data sample is required
Deterministic Annealing-like algorithm (dIB)
Agglomerative Algorithm (aIB)
• Find hierarchical clustering tree in a greedy bottom-up fashion
• Results in a different tree for each value of β
• Each tree spans a range of clustering solutions at different resolutions (same β, different resolutions)
Slonim & Tishby 1999
Agglomerative Algorithm (aIB)
• Fix β
• Start with T = X: each point is its own cluster
• For each pair of clusters, compute the cost of merging them (the resulting loss of relevant information)
• Merge the pair that produces the smallest cost
• Continue merging until a single cluster is left
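The greedy bottom-up loop can be sketched in Python (an illustrative toy implementation; the merge cost used here is the standard weighted Jensen-Shannon form of the relevant-information loss, and all names are assumptions):

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(w_i, w_j, p_i, p_j):
    """Loss of relevant information for merging clusters i and j:
       the weighted Jensen-Shannon divergence of their p(y|t) vectors."""
    w = w_i + w_j
    pm = [(w_i * a + w_j * b) / w for a, b in zip(p_i, p_j)]  # merged p(y|t)
    return w_i * kl(p_i, pm) + w_j * kl(p_j, pm)

def aib(weights, p_y_given_t):
    """Greedy aIB sketch: repeatedly merge the cheapest pair of clusters
       until one cluster is left; returns the merge order."""
    clusters = list(range(len(weights)))
    weights, p = list(weights), [list(v) for v in p_y_given_t]
    merges = []
    while len(clusters) > 1:
        # find the pair with the smallest loss of relevant information
        i, j = min(((a, b) for k, a in enumerate(clusters)
                    for b in clusters[k + 1:]),
                   key=lambda ab: merge_cost(weights[ab[0]], weights[ab[1]],
                                             p[ab[0]], p[ab[1]]))
        w = weights[i] + weights[j]
        p[i] = [(weights[i] * a + weights[j] * b) / w
                for a, b in zip(p[i], p[j])]     # merged representative
        weights[i] = w
        clusters.remove(j)
        merges.append((i, j))
    return merges
```

On four points whose p(y|t) vectors form two similar pairs, the two cheap within-pair merges happen first, giving the hierarchical tree described above.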
Agglomerative Algorithm (aIB)
Agglomerative Algorithm (aIB)
Advantages:
• Non-parametric
• Full hierarchy of clusters for each β
• Simple
Agglomerative Algorithm (aIB)
Drawbacks:
• Greedy – is not guaranteed to extract even locally minimal solutions along the tree
• Large data sample is required
Unsupervised Clustering of Images
Modeling assumption:
For a fixed image, the colors and their spatial distribution are generated by a mixture of Gaussians in a 5-dimensional space (color + spatial coordinates)
Applications…
Shiri Gordon et al., 2003
Unsupervised Clustering of Images
Apply EM procedure to estimate the mixture parameters
Mixture of Gaussians model: f(x|Θ) = Σᵢ αᵢ·N(x; μᵢ, Σᵢ)
Unsupervised Clustering of Images
• Assume a uniform prior over the images
• Calculate the conditional distribution over the feature space for each image
• Apply the aIB algorithm to cluster the images
Unsupervised Clustering of Images
[Figures: examples of the resulting image clusters]
Summary
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application
Thank you
Blahut-Arimoto algorithm
• When does it converge to the global minimum?
  – When A and B are convex sets of distributions, plus some requirements on the distance measure
[Figure: alternating minimization of the distance between two convex sets of distributions A and B]
Csiszar & Tusnady, 1984
Blahut-Arimoto algorithm
• Reformulate the problem as an alternating minimization of a distance between the two convex sets A and B
Rate Distortion Theory - Intuition
• T = X: zero distortion, but not compact
• A single cluster: high distortion, but very compact
Information Bottleneck - cont’d
• Assume the Markov relation T ↔ X ↔ Y:
  – T is a compressed representation of X, thus independent of Y if X is given
  – Data processing inequality: I(T;Y) ≤ I(X;Y)