Motivation · Information Theory Background · A Metric Distance Between Distributions · Optimal Order Reduction · Concluding Remarks
Metric Distances Between Probability Distributions of Different Sizes
M. Vidyasagar
Cecil & Ida Green Chair, The University of Texas at Dallas
M.Vidyasagar@utdallas.edu · www.utdallas.edu/∼m.vidyasagar
Johns Hopkins University, 20 October 2011
M. Vidyasagar Distances Between Probability Distributions
Outline
1 Motivation
Problem Formulation
Source of Difficulty
2 Information Theory Background
3 A Metric Distance Between Distributions
Definition of the Metric Distance
Computing the Distance
4 Optimal Order Reduction
5 Concluding Remarks
Motivation
Originally: Given a hidden Markov process with a very large state space, how can one approximate it 'optimally' with another HMP with a much smaller state space?
Analog control ← digital control.
Signals in Rd ← signals over finite sets.
Applications: Control over networks, data compression, reduced-size noise modeling, etc.
If u, y are input and output, view {ut, yt} as a stochastic process over some finite set U × Y; then 'reduced order modeling' is approximating {ut, yt} by another stochastic process over a 'smaller cardinality' set U′ × Y′.
General Problem: Simplified Modeling of Stochastic Processes
Suppose {Xt} is a stochastic process assuming values in a 'large' but finite set A with n elements; we wish to approximate it by another process {Yt} assuming values in a 'small' finite set B with m < n elements.
Questions:
How do we define the 'distance' (think 'modeling error') between the two processes {Xt} and {Yt}?
Given {Xt}, how do we find the 'best possible' reduced order approximation to {Xt} in the chosen distance?
Scope of Today’s Talk
Scope of this talk: i.i.d. (independent, identically distributed) processes.
An i.i.d. process {Xt} is completely described by its 'one-dimensional marginal,' i.e., the distribution of X1 (or of any Xt).
Questions:
Given two probability distributions φ, ψ on finite sets A, B, how can we define a 'distance' between them?
Given a distribution φ with n components, and an integer m < n, how can we find the 'best possible' m-dimensional approximation to φ?
Full paper at: http://arxiv.org/pdf/1104.4521v2
Total Variation Metric
Suppose A = {a1, . . . , an} is a finite set, and φ, ψ are probability distributions on A. Then the total variation metric is

ρ(φ,ψ) = (1/2) ∑_{i=1}^n |φi − ψi| = ∑_{i=1}^n {φi − ψi}+ = −∑_{i=1}^n {φi − ψi}−,

where {x}+ = max{x, 0} and {x}− = min{x, 0}.
ρ is permutation-invariant if the same permutation is applied to the components of φ, ψ (rearranging the elements of A). So if π is a permutation of the elements of A, then
ρ(π(φ), π(ψ)) = ρ(φ,ψ).
But what if φ,ψ are probability measures on different sets?
Permutation Invariance Example
Example:
A = {H, T}, B = {M, W}, φ = [0.55 0.45], ψ = [0.48 0.52].
What is the distance between φ,ψ?
If we identify H ↔M,T ↔W , then ρ(φ,ψ) = 0.07.
But if we identify H ↔W,T ↔M , then ρ(φ,ψ) = 0.03.
Which one is more 'natural'? Answer: There is no natural association!
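The two identifications can be checked directly; a minimal sketch (the function name is mine, not from the talk):

```python
def total_variation(phi, psi):
    """Total variation distance: half the l1 distance between two distributions."""
    assert len(phi) == len(psi)
    return 0.5 * sum(abs(p - q) for p, q in zip(phi, psi))

phi = [0.55, 0.45]                        # distribution on {H, T}
psi = [0.48, 0.52]                        # distribution on {M, W}
d_HM = total_variation(phi, psi)          # identify H <-> M, T <-> W: 0.07
d_HW = total_variation(phi, psi[::-1])    # identify H <-> W, T <-> M: 0.03
```

Both pairings are legitimate, which is precisely why ρ alone does not extend to distributions on different sets.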
Permutation Invariance: An Inherent Feature
Suppose φ, ψ are probability distributions on distinct sets A, B, and we wish to 'compare' them. Suppose d(φ,ψ) is the distance (yet to be defined).
Claim: Suppose π is a permutation on A and ξ is a permutation on B. Then we must have

d(φ,ψ) = d(π(φ), ξ(ψ)).

Ergo: Any definition of the distance must be invariant under possibly different permutations of A and B.
In particular, if A, B are distinct sets, then even if |A| = |B|, our distance cannot reduce to ρ.
Notation
A = {1, . . . , n}, B = {1, . . . , m}; φ is a probability distribution on A, ψ is a probability distribution on B; X is a random variable on A with distribution φ, and Y is an r.v. on B with distribution ψ.

Sn := {v ∈ Rn+ : ∑_{i=1}^n vi = 1}.

So φ ∈ Sn, ψ ∈ Sm. Sn×m denotes the set of n × m 'stochastic matrices', i.e., matrices whose rows add up to one:

Sn×m = {P ∈ [0, 1]^{n×m} : P em = en},

where en is the column vector of n ones.
1We don’t write A = {a1, . . . , an} but that is what we mean.M. Vidyasagar Distances Between Probability Distributions
Entropy
Suppose φ ∈ Sn. Then the (Shannon) entropy of φ is

H(φ) = −∑_{i=1}^n φi log φi.

We can also call this H(X) (i.e., associate the entropy with an r.v. or with its distribution).
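In code, the entropy is a one-liner. This sketch uses the natural logarithm (the talk does not fix a base, and any consistent choice works), with the convention 0 log 0 = 0:

```python
from math import log

def entropy(phi):
    # Shannon entropy H(phi) = -sum_i phi_i log phi_i, with the 0 log 0 = 0 convention.
    return -sum(p * log(p) for p in phi if p > 0)
```

For example, the uniform distribution on n points attains the maximum value log n, while a point mass has entropy 0.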
Mutual Information
Suppose X, Y are r.v.s on A, B, and let θ denote their joint distribution. Then θ ∈ Snm, and θA = φ, θB = ψ (the marginal distributions).

I(X,Y ) := H(X) + H(Y ) − H(X,Y )

is called the mutual information between X and Y.

Alternate formula:

I(X,Y ) = ∑_{i∈A} ∑_{j∈B} θij log( θij / (φi ψj) ).
Note that I(X,Y ) = I(Y,X).
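The two formulas can be checked against each other on a small joint distribution; a sketch in which the numbers in θ are made up:

```python
from math import log

def entropy(p):
    return -sum(x * log(x) for x in p if x > 0)

# A made-up joint distribution theta on a 2 x 2 alphabet.
theta = [[0.4, 0.1],
         [0.2, 0.3]]
phi = [sum(row) for row in theta]                               # marginal theta_A
psi = [sum(theta[i][j] for i in range(2)) for j in range(2)]    # marginal theta_B

H_joint = entropy([x for row in theta for x in row])            # H(X, Y)
I_def = entropy(phi) + entropy(psi) - H_joint                   # H(X) + H(Y) - H(X,Y)
I_alt = sum(theta[i][j] * log(theta[i][j] / (phi[i] * psi[j]))
            for i in range(2) for j in range(2) if theta[i][j] > 0)
```

The two values agree, and both are strictly positive here because θ is not a product distribution.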
Conditional Entropy
The quantity

H(X|Y ) = H(X,Y ) − H(Y )

is called the conditional entropy of X given Y. Note that H(X|Y ) ≠ H(Y |X) in general. In fact,

H(Y |X) = H(X|Y ) + H(Y ) − H(X).
If θ is the joint distribution of X,Y , then
H(X|Y ) = H(θ)−H(ψ), H(Y |X) = H(θ)−H(φ).
Conditional Entropy: Alternate Formulation
Define

Θ = [θij] = [Pr{X = i & Y = j}] ∈ [0, 1]^{n×m},

and define P = [Diag(φ)]^{−1} Θ. Clearly

pij = θij / φi = Pr{Y = j | X = i}.

So P ∈ Sn×m and

H(Y |X) = ∑_{i∈A} φi H(p^i) =: Jφ(P ),

where p^i is the i-th row of P.
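Continuing the small example from the mutual-information slide, the row-conditional form Jφ(P) should agree with H(θ) − H(φ); a sketch with made-up numbers:

```python
from math import log

def entropy(p):
    return -sum(x * log(x) for x in p if x > 0)

theta = [[0.4, 0.1],
         [0.2, 0.3]]
phi = [sum(row) for row in theta]

# P = Diag(phi)^{-1} Theta: row i of P is the distribution of Y given X = i.
P = [[theta[i][j] / phi[i] for j in range(2)] for i in range(2)]

# H(Y|X) = sum_i phi_i H(p^i) = J_phi(P); compare with H(theta) - H(phi).
J = sum(phi[i] * entropy(P[i]) for i in range(2))
H_cond = entropy([x for row in theta for x in row]) - entropy(phi)
```

The rows of P sum to one (P is a stochastic matrix), and the two computations of H(Y|X) coincide.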
Maximizing Mutual Information (MMI)
Given φ ∈ Sn, ψ ∈ Sm, we look for a 'joint' distribution θ ∈ Snm such that θA = φ, θB = ψ, and in addition the entropy H(θ) is minimum, or equivalently the conditional entropy of Y given X is minimum, or again equivalently the mutual information between X and Y is maximum. In other words, define

W (φ,ψ) := min H(θ) s.t. θA = φ, θB = ψ
         = min H(X,Y ) s.t. θA = φ, θB = ψ,

V (φ,ψ) := W (φ,ψ) − H(φ)
         = min H(Y |X) s.t. θA = φ, θB = ψ.
We try to make Y ‘as deterministic as possible’ given X.
The Variation of Information Metric
The quantity

d(φ,ψ) := V (φ,ψ) + V (ψ,φ)
is called the variation of information metric. Since
V (ψ,φ) = V (φ,ψ) +H(φ)−H(ψ),
we need to compute only one of V (ψ,φ), V (φ,ψ).
The quantity d satisfies
d(φ,φ) = 0 and d(φ,ψ) ≥ 0 ∀φ,ψ.
d(φ,ψ) = d(ψ,φ).
The triangle inequality holds:
d(φ, ξ) ≤ d(φ,ψ) + d(ψ, ξ).
Proof of Triangle Inequality
Suppose X, Y, Z are r.v.s on finite sets A, B, C. Then the conditional entropy satisfies a 'one-sided' triangle inequality:
H(X|Y ) ≤ H(X|Z) + H(Z|Y ).
Proof:
H(X|Y ) ≤ H(X,Z|Y ) = H(Z|Y ) + H(X|Z, Y )
        ≤ H(Z|Y ) + H(X|Z).
So if we define

v(X,Y ) = H(X|Y ) + H(Y |X),

then v satisfies the triangle inequality. Note that

d(φ,ψ) = min v(X,Y )

over all joint distributions, subject to X, Y having distributions φ, ψ respectively.
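The one-sided inequality is easy to spot-check numerically on an arbitrary joint distribution of (X, Y, Z); the weights below are made up:

```python
from math import log
from itertools import product

def entropy(p):
    return -sum(x * log(x) for x in p if x > 0)

# A made-up joint distribution of (X, Y, Z) on {0,1}^3 (weights sum to 1).
weights = [0.10, 0.05, 0.15, 0.10, 0.20, 0.05, 0.05, 0.30]
joint = dict(zip(product(range(2), repeat=3), weights))

def H(coords):
    # Entropy of the marginal distribution on the given coordinate positions.
    marg = {}
    for outcome, w in joint.items():
        key = tuple(outcome[i] for i in coords)
        marg[key] = marg.get(key, 0.0) + w
    return entropy(list(marg.values()))

# Conditional entropies via H(U|V) = H(U, V) - H(V); coordinates: X = 0, Y = 1, Z = 2.
H_X_given_Y = H((0, 1)) - H((1,))
H_X_given_Z = H((0, 2)) - H((2,))
H_Z_given_Y = H((2, 1)) - H((1,))
```

For this (and any) joint distribution, H(X|Y) ≤ H(X|Z) + H(Z|Y) holds, and all three conditional entropies are nonnegative.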
MotivationInformation Theory Background
A Metric Distance Between DistributionsOptimal Order Reduction
Concluding Remarks
Definition of the Metric DistanceComputing the Distance
A Key Property
Actually, d is a pseudometric, not a metric. In other words, d(φ,ψ) can be zero even if φ ≠ ψ.
Theorem: d(φ,ψ) = 0 if and only if n = m and φ, ψ are permutations of each other.
Consequence: The metric d is not convex!
Example: Let n = m = 2,
φ = [0.75 0.25], ψ = [0.25 0.75], ξ = 0.5φ + 0.5ψ = [0.5 0.5].

Then

d(φ,φ) = d(φ,ψ) = 0, but d(φ, ξ) > 0.
Computing the Metric Distance
Change the variable of optimization from θ, the joint distribution, to P, the matrix of conditional probabilities:

V (φ,ψ) = min_{θ ∈ Snm} H(θ) − H(φ) s.t. θA = φ, θB = ψ
        = min_{P ∈ Sn×m} Jφ(P ) := ∑_{i∈A} φi H(p^i) s.t. φP = ψ.

Since Jφ(·) is a strictly concave function, and the feasible region is polyhedral (the convex hull of a finite number of extreme points), the minimum occurs at one of these extreme points.

Also, a 'principle of optimality' allows us to break down large problems into smaller problems.
The m = 2 Case
Partition the set {1, . . . , n} into two sets I1, I2 such that

|ψ1 − ∑_{i∈I1} φi| = |ψ2 − ∑_{i∈I2} φi|

is minimized. (The two sides are automatically equal, since φ and ψ each sum to one.)

Interpretation: Given two bins with capacities ψ1, ψ2, assign φ1, . . . , φn to the two bins so that the overflow in one bin (and the under-utilization in the other bin) is minimized.
Theorem: If both bins are filled exactly, then V (φ,ψ) = 0.
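For small n, the optimal partition can simply be found by brute force over all subsets; a sketch (the function name and example numbers are mine). Since the two deviations are equal, it suffices to track bin 1:

```python
from itertools import combinations

# Exhaustive search for the best two-bin partition: exponential in n,
# so only viable for tiny examples.
def best_two_bin_partition(phi, psi1):
    n = len(phi)
    best_gap, best_I1 = None, None
    for r in range(n + 1):
        for I1 in combinations(range(n), r):
            gap = abs(psi1 - sum(phi[i] for i in I1))
            if best_gap is None or gap < best_gap:
                best_gap, best_I1 = gap, set(I1)
    return best_gap, best_I1

# 0.30 + 0.25 fills a bin of capacity 0.55 exactly, so here V(phi, psi) = 0
# (the gap is zero up to floating-point rounding).
gap, I1 = best_two_bin_partition([0.30, 0.25, 0.25, 0.20], 0.55)
```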
The m = 2 Case (Continued)
Theorem: If both bins cannot be filled exactly, and if the smallest element of φ (call it φ0) belongs to the overstuffed bin, then

V (φ,ψ) = φ0 H([u/φ0  (φ0 − u)/φ0]),

where u is the unutilized capacity (or overflow).

If φ0 belongs to the underutilized bin, the above is an upper bound for V (φ,ψ), but may not equal V (φ,ψ).

Bad news: Computing the optimal partitioning is NP-hard!

So we need an approximation procedure to upper bound d(φ,ψ). There is no special reason to restrict to m = 2.
Best Fit Algorithm for Bin-Packing
Think of ψ1, . . . , ψm as capacities of m bins.
Arrange ψj in descending order.
For i = 1, . . . , n, place each φi into the bin with maximumunutilized capacity.
If a bin overflows, don’t fill it anymore.
Complexity is O(n log m).
Provably suboptimal: Total bin size is ≤ 1.25 times the optimal bin size; this is the best-known bound.
The corresponding bound on the overflow is not so good: it is ≤ 0.25 + 1.25 × the optimal value.
Illustration of Best Fit Algorithm
Suppose ψ = [0.45 0.30 0.25], and φ ∈ S14 is given by

φ = 10^{−2} · [14 13 12 9 8 7 6 6 5 5 4 4 4 3].
The resulting assignment (one bin per column in the original figure):

Bin 1 (capacity 0.45): φ1, φ2, φ6, φ8, φ11 (total 0.44)
Bin 2 (capacity 0.30): φ3, φ5, φ9, φ12, φ14 (total 0.32; overflow 0.02)
Bin 3 (capacity 0.25): φ4, φ7, φ10, φ13 (total 0.24)
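The run above can be reproduced in a few lines; working in integer units of 10^-2 keeps the arithmetic exact. This is a sketch of the rule as described (names mine); how ties in unutilized capacity are broken can shuffle individual items between bins, but the total overflow of 0.02 comes out the same:

```python
# Best fit packing, in integer units of 10^-2 so the slide's example is exact.
def best_fit(items, caps):
    rem = sorted(caps, reverse=True)       # remaining capacity, largest bin first
    bins = [[] for _ in rem]
    for x in items:
        j = max(range(len(rem)), key=lambda k: rem[k])  # bin with most room left
        bins[j].append(x)                  # the chosen bin may get overstuffed
        rem[j] -= x
    overflow = sum(-r for r in rem if r < 0)
    return bins, overflow

phi = [14, 13, 12, 9, 8, 7, 6, 6, 5, 5, 4, 4, 4, 3]   # units of 10^-2
bins, overflow = best_fit(phi, [45, 30, 25])
```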
M. Vidyasagar Distances Between Probability Distributions
MotivationInformation Theory Background
A Metric Distance Between DistributionsOptimal Order Reduction
Concluding Remarks
Definition of the Metric DistanceComputing the Distance
A Greedy Algorithm for Bounding d(φ,ψ)
Think of ψ1, . . . , ψm as m bins to be packed. Modify the 'best fit' algorithm: sort ψ in decreasing order, and put each φi into the bin with the largest unutilized capacity. If φi does not fit into any bin, put it aside (the departure from the best fit algorithm).

When all of φ is processed, say k entries have been put aside; then k < m. Now solve a k-by-m bin-packing problem, and repeat. See the full paper for details.

This results in a sequence of bin-packing problems of decreasing size. The outcome is an upper bound for d(φ,ψ). Complexity is O((n + m²) log m).
Problem Formulation
Problem: Given φ ∈ Sn and m < n, find ψ ∈ Sm that minimizes d(φ,ψ).

Definition: ψ ∈ Sm is said to be an aggregation of φ ∈ Sn if there exists a partition I1, . . . , Im of {1, . . . , n} such that

ψj = ∑_{i∈Ij} φi.
Results
Theorem: Any optimal approximation of φ in the variation of information metric d must be an aggregation of φ.

Theorem: An aggregation φ(a) of φ is an optimal approximation of φ in the variation of information metric d if and only if it has maximum entropy (amongst all aggregations).

Note: um, the uniform distribution with m elements, has maximum entropy in Sm. So should we try to minimize ρ(φ(a), um)? This is yet another bin-packing problem, with all bin capacities equal to 1/m. But how valid is this approach?
NP-Hardness of Optimal Order Reduction
Suppose m = 2 and let φ(a) be an aggregation of φ ∈ Sn. Then φ(a) has maximum entropy if and only if the total variation distance ρ(φ(a), u2) is minimized.

Therefore:

When m = 2, minimizing ρ(φ(a), u2) gives an optimal reduced order approximation to φ.
For m ≥ 3, minimizing ρ(φ(a), um) gives a suboptimal reduced order approximation to φ.
Reformulation of Problem
Problem: Given φ ∈ Sn and m < n, find an aggregation φ(a) ∈ Sm with maximum entropy.

Reformulation: Since the uniform distribution um has maximum entropy in Sm, find an aggregation φ(a) such that the total variation distance ρ(φ(a), um) is minimized.

More general problem (not any harder): Given φ ∈ Sn, m < n, and ξ ∈ Sm, find an aggregation φ(a) to minimize the total variation distance ρ(φ(a), ξ).

Use the best-fit algorithm to find an aggregation of φ that is close to the uniform distribution. Complexity is O(n log m).
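A sketch of this reduction step, reusing best fit with m equal bins of capacity 1/m (integer units again so the comparison is exact; names and numbers are mine):

```python
# Aggregate phi toward the uniform distribution u_m using best fit.
# Works in integer units (percent), assuming m divides the total.
def aggregate_best_fit(phi, m, total=100):
    rem = [total // m] * m                 # u_m: m equal bins of capacity 1/m
    psi_a = [0] * m
    for x in phi:
        j = max(range(m), key=lambda k: rem[k])  # bin with most room left
        psi_a[j] += x
        rem[j] -= x
    return psi_a

phi = [30, 25, 25, 20]                     # percentages summing to 100
psi_a = aggregate_best_fit(phi, m=2)       # here u_2 can be matched exactly
```

In this example the aggregation hits u2 exactly (30 + 20 and 25 + 25), so ρ(φ(a), u2) = 0 and the aggregation has maximum entropy.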
Bound on Performance of Best Fit Algorithm
Theorem: Let φ(a) denote the aggregation of φ produced by the best fit algorithm. Then

ρ(φ(a), ξ) ≤ 0.25 m φmax,

where φmax is the largest component of φ and ρ is the total variation metric.

This can be turned into a (messy) bound on the entropy of φ(a).
Achievements
Definition of a proper metric between probability distributions defined on sets of different cardinalities, by maximizing the mutual information between the two distributions.
Study of the properties of the distance, and of the problem of computing the distance; showing the close relationship to the bin-packing problem with overstuffing.
Since bin-packing is NP-hard, adapting the best-fit algorithm to generate upper bounds for the distance in polynomial time.
Characterization of the solution of the optimal order reduction problem in terms of aggregating to maximize entropy.
Formulation as a bin-packing problem with overstuffing.
Upper bound on the performance of the algorithm.
Next Steps
Extension of the metric to Markov processes, hidden Markov processes, and arbitrary (but stationary ergodic) stochastic processes.
Alternatives to the best-fit heuristic algorithm.
Thank You!