
UCSD
27 January 2006

Quantization, Compression, and Classification:
Extracting discrete information from a continuous world

Robert M. Gray
Information Systems Laboratory, Dept. of Electrical Engineering
Stanford, CA 94305
rmgray@stanford.edu

Partially supported by the National Science Foundation, Norsk Electro-Optikk, and Hewlett Packard Laboratories.

A pdf of these slides may be found at http://ee.stanford.edu/~gray/ucsd06.pdf

Introduction

[Diagram: the continuous world is mapped by an encoder α into a discrete representation (a bit stream 011100100 · · ·), which decoders β map back into reproductions, e.g., a "picture of a privateer."]

How well can you do it?

What do you mean by “well”?

How do you do it?

The dictionary (Random House) definition of quantization: the division of a quantity into a discrete number of small parts, often assumed to be integral multiples of a common quantity.

More generally, quantization is the mapping of a continuous quantity into a discrete quantity.

Converting a continuous quantity into a discrete one causes a loss of accuracy and information, and the usual goal is to minimize that loss.

Quantization

Old example: round off real numbers to the nearest integer; estimate densities by histograms [Sheppard (1898)].

More generally, a quantizer of a space A (e.g., ℝ^k or L₂([0, 1]²)) consists of

• An encoder α : A → I, with I an index set, e.g., the nonnegative integers ⇔ a partition S = {S_i; i ∈ I}: S_i = {x : α(x) = i}.

• A decoder β : I → C ⇔ a codebook C = β(α(A)).

Scalar quantizer:

[Figure: the real line partitioned into cells S0, S1, S2, S3, S4, each cell containing its reproduction point β(0), β(1), β(2), β(3), β(4).]
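A minimal sketch (not from the slides) of this encoder/decoder structure for a scalar quantizer in Python; the five codebook values are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical codebook C = {beta(0), ..., beta(4)}; any sorted reals would do.
codebook = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

def encode(x):
    """Encoder alpha: map x to the index of the nearest codeword
    (this induces the partition S_i = {x : alpha(x) = i})."""
    return int(np.argmin((codebook - x) ** 2))

def decode(i):
    """Decoder beta: map an index back to its reproduction value."""
    return codebook[i]

x = 0.7
i = encode(x)
xhat = decode(i)
print(f"x={x}, index={i}, reproduction={xhat}, squared error={(x - xhat)**2:.4f}")
```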

Two dimensional example: a centroidal Voronoi diagram (nearest neighbor partition, Euclidean centroids).

E.g., complex numbers, nearest mail boxes, sensors, repeaters, fish fortresses . . . territories of the male Tilapia mossambica [G. W. Barlow, Hexagonal Territories, Animal Behavior, Volume 22, 1974].

Three dimensions: Voronoi partition around spheres.
http://www-math.mit.edu/dryfluids/gallery/

Quality/Distortion/Cost

Quality of a quantizer is measured by the goodness of the resulting reproduction in comparison to the original.

Assume a distortion measure d(x, y) ≥ 0 which measures the penalty if an input x results in an output y = β(α(x)). Small (large) distortion ⇔ good (bad) quality.

System quality is measured by average distortion.

Assume X is a random object (variable, vector, process, field) with probability distribution P_X: pixel intensities, samples, features, fields, DCT or wavelet transform coefficients, moments, etc.

If A = ℝ^k, then P_X ⇔ a pdf f, a pmf p, or the empirical distribution P_L from a training/learning set L = {x_l; l = 1, 2, . . . , |L|}.

Average distortion wrt P_X:  D(α, β) = E[d(X, β(α(X)))]

Examples:

• Mean squared error (MSE): d(x, y) = ||x − y||² = Σ_{i=1}^k |x_i − y_i|². Useful if y is intended to reproduce or approximate x.

  More generally: input- or output-weighted quadratic with B_x a positive definite matrix, (x − y)^t B_x (x − y) or (x − y)^t B_y (x − y).

• X is the observation, but the unseen Y is the desired information. The "sensor" is described by P_{X|Y}, e.g., X = Y + W. Quantize x, reconstruct ŷ = κ(i), with Bayes risk C(y, ŷ):

  d(x, i) = E[C(Y, κ(i)) | X = x]

  E[d(X, κ(α(X)))] = E[C(Y, κ(α(X)))]

• Y discrete ⇒ classification, detection

• Y continuous ⇒ estimation, regression

[Block diagram: Y → sensor P_{X|Y} → X → encoder α → α(X) = i → decoders κ and β → Ŷ = κ(i), X̂ = β(i)]

Or use a weighted combination of MSE for X and Bayes risk for Y.

Problem: need to know or estimate P_{Y|X}, which is harder to estimate than P_X.

We want the average distortion to be small, which we can do with big codebooks and small cells, but . . .

. . . there is usually a cost r(i) of choosing index i in terms of storage or transmission capacity or computational capacity: more cells imply more memory, more time, more bits.

For example,

• r(i) = ln |C|: all codewords have equal cost. Fixed-rate, classic quantization.

• r(i) = ℓ(i), a length function satisfying the Kraft inequality Σ_i e^{−ℓ(i)} ≤ 1 ⇔ a uniquely decodable lossless code exists with lengths ℓ(i). The number of bits or nats required to communicate/store the index.

• r(i) = ℓ(i) = − ln Pr(α(X) = i), the Shannon codelengths: E[ℓ(α(X))] = H(α(X)) = −Σ_i p_i ln p_i, the Shannon entropy. Optimal choice of ℓ satisfying Kraft; optimal lossless compression of α(X). Minimizes the channel capacity required for communication.

• r(i) = (1 − η)ℓ(i) + η ln |C|, η ∈ [0, 1]: combined transmission length and memory constraints. (current research)
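To make the last two rate measures concrete, here is a small sketch (my own, not from the slides) that computes Shannon codelengths in nats for an assumed index pmf, checks the Kraft inequality, and confirms that the expected length equals the Shannon entropy.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # assumed pmf of alpha(X)

lengths = -np.log(p)                       # Shannon codelengths, in nats
kraft = np.sum(np.exp(-lengths))           # Kraft sum, should be <= 1 (here exactly 1)
entropy = np.sum(p * lengths)              # E[l(alpha(X))] = H(alpha(X)) = -sum p ln p

print(f"Kraft sum = {kraft:.3f}")
print(f"Entropy = expected codelength = {entropy:.4f} nats")
```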

Issues

If we know the distributions:

  What is the optimal tradeoff of distortion vs. rate? (We want both to be small!)

  How do we design good codes?

If we do not know the distributions, how do we use training/learning data to estimate the tradeoffs and design good codes? (statistical learning, machine learning)

Applications

Communications & signal processing: A/D conversion, data compression.

Compression required for efficient transmission
  – send more data in the available bandwidth
  – send the same data in less bandwidth
  – more users on the same bandwidth

and storage
  – can store more data
  – can compress for local storage, put details on cheaper media

Graphic courtesy of Jim Storer.

• Statistical clustering: grouping bird songs, designing Mao suits, grouping gene features, taxonomy.

• Placing points in space: mailboxes, wireless repeaters, wireless sensors.

• How best to approximate continuous probability distributions by discrete ones? Quantize the space of distributions.

• Numerical integration and optimal quadrature rules: estimate the integral ∫ g(x) dx by the sum Σ_{i∈I} V(S_i) g(y_i), where V(S_i) is the volume of cell S_i in the partition S. How best to choose the points y_i and cells S_i? (A small sketch follows this list.)

• Pick the best Gauss mixture model from a finite set based on training data. Plug it into a Bayes estimator.
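A small sketch (mine, not from the slides) of the quadrature bullet above, using equal-width cells on [0, 1] with their midpoints as the points y_i; the test integrand is an arbitrary choice.

```python
import numpy as np

def quantizer_quadrature(g, n_cells=100):
    """Approximate the integral of g over [0, 1] by sum_i V(S_i) * g(y_i),
    using equal-width cells S_i and their midpoints y_i (a crude choice)."""
    edges = np.linspace(0.0, 1.0, n_cells + 1)
    volumes = np.diff(edges)                 # V(S_i)
    points = 0.5 * (edges[:-1] + edges[1:])  # y_i = cell midpoints
    return float(np.sum(volumes * g(points)))

# Example: the integral of x^2 over [0, 1] is 1/3.
print(quantizer_quadrature(lambda x: x**2))
```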

Distortion-rate optimization: q = (α, β, d, r)

D(q) = E[d(X, β(α(X)))]  vs.  R(q) = E[r(α(X))]

Minimize D(q) for R(q) ≤ R:   δ(R) = inf_{q: R(q) ≤ R} D(q)

Minimize R(q) for D(q) ≤ D:   r(D) = inf_{q: D(q) ≤ D} R(q)

Lagrangian approach: ρ_λ(x, i) = d(x, β(i)) + λ r(i), λ ≥ 0. Minimize

  ρ(f, λ, η, q) ≜ E[ρ_λ(X, α(X))]
               = D(q) + λ R(q)
               = E[d(X, β(α(X)))] + λ [(1 − η) E[ℓ(α(X))] + η ln |C|]

  ρ(f, λ, η) = inf_q ρ(f, λ, η, q)

Traditional fixed-rate: η = 1; variable-rate: η = 0.

Three Theories of Quantization

• Rate-distortion theory: the Shannon distortion-rate function D(R) ≤ δ(R), achievable in the asymptopia of large dimension k and fixed rate R. [Shannon (1949, 1959), Gallager (1978)]

• Nonasymptotic (exact) results: necessary conditions for optimal codes ⇒ iterative design algorithms ⇔ statistical clustering. [Steinhaus (1956), Lloyd (1957)]

• High rate theory: optimal performance in the asymptopia of fixed dimension k and large rate R. [Bennett (1948), Lloyd (1957), Zador (1963), Gersho (1979), Bucklew, Wise (1982)]

Lloyd Optimality Properties: q = (α, β, d, r)

Any component can be optimized for the others.

Encoder:  α(x) = argmin_i (d(x, β(i)) + λ r(i))    (minimum distortion)

Decoder:  β(i) = argmin_y E[d(X, y) | α(X) = i]    (Lloyd centroid)

Length function:  ℓ(i) = − ln P_f(α(X) = i)    (Shannon codelength)

Pruning:  remove any index i if the pruned code (α′, β′, r′) satisfies D(α′, β′) + λ R(α′, r′) ≤ D(α, β) + λ R(α, ℓ).

⇒ the Lloyd clustering algorithm (rediscovered as k-means, grouped coordinate descent, alternating optimization, principal points).

For the estimation application, it is often nearly optimal to quantize the optimal nonlinear estimate (Ephraim).
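The conditions above suggest the following sketch (my own, assuming squared error, scalar training data, and the variable-rate cost r(i) = ℓ(i) = −ln p_i) of an entropy-constrained Lloyd iteration; it illustrates the idea rather than serving as a reference implementation.

```python
import numpy as np

def lloyd_ecvq(data, n_codewords=8, lam=0.05, n_iter=50, seed=0):
    """Entropy-constrained Lloyd: alternate the minimum-(d + lambda*l) encoder,
    the centroid decoder, and the Shannon-codelength update on 1-D training data."""
    rng = np.random.default_rng(seed)
    codebook = rng.choice(data, n_codewords, replace=False).astype(float)
    lengths = np.full(n_codewords, np.log(n_codewords))  # start with equal lengths
    for _ in range(n_iter):
        # Encoder: alpha(x) = argmin_i d(x, beta(i)) + lambda * l(i)
        cost = (data[:, None] - codebook[None, :]) ** 2 + lam * lengths[None, :]
        idx = np.argmin(cost, axis=1)
        for i in range(n_codewords):
            cell = data[idx == i]
            if cell.size == 0:
                continue  # (a fuller implementation would prune empty cells)
            codebook[i] = cell.mean()                       # Lloyd centroid
            lengths[i] = -np.log(cell.size / data.size)     # Shannon codelength
    # Final encoding with the last codebook/lengths
    cost = (data[:, None] - codebook[None, :]) ** 2 + lam * lengths[None, :]
    idx = np.argmin(cost, axis=1)
    return codebook, lengths, idx

data = np.random.default_rng(1).normal(size=5000)
codebook, lengths, idx = lloyd_ecvq(data)
mse = np.mean((data - codebook[idx]) ** 2)
rate = np.mean(lengths[idx])
print(f"MSE = {mse:.4f}, entropy rate = {rate:.3f} nats/sample")
```

Sweeping λ traces out the distortion-rate tradeoff: a larger λ penalizes rate more heavily and yields a coarser quantizer.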

Classic case of fixed length, squared error ⇒ Voronoi partition (Dirichlet tessellation, Thiessen polygons, Wigner-Seitz zones, medial axis transforms) and centroidal codebook: a centroidal Voronoi partition.

Voronoi diagrams are common in many fields including biology, cartography, crystallography, chemistry, physics, astronomy, meteorology, geography, anthropology, computational geometry . . .

Scott Snibbe's Boundary Functions: http://www.snibbe.com/scott/bf/

The Voronoi game (competitive facility location): http://www.voronoigame.com

High Rate (Resolution) Theory

Traditional form (Zador, Gersho, Bucklew, Wise): pdf f, MSE

$$\lim_{R\to\infty} e^{2R/k}\,\delta(R) =
\begin{cases}
a_k\,\|f\|_{k/(k+2)} & \text{fixed-rate } (R = \ln N)\\[4pt]
b_k\,e^{2h(f)/k} & \text{variable-rate } (R = H),
\end{cases}$$

where

$$h(f) = -\int f(x)\,\ln f(x)\,dx, \qquad
\|f\|_{k/(k+2)} = \left(\int f(x)^{\frac{k}{k+2}}\,dx\right)^{\frac{k+2}{k}}.$$

a_k ≥ b_k are the Zador constants (they depend on k and d, not on f!)

a_1 = b_1 = 1/12,  a_2 = 5/(18√3),  b_2 = ?,  a_k, b_k = ? for k ≥ 3

lim_{k→∞} a_k = lim_{k→∞} b_k = 1/(2πe)
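A quick numerical sanity check (mine, not from the slides) of the k = 1 constant: for the uniform density on [0, 1] we have ‖f‖_{1/3} = 1, so e^{2R} δ(R) with R = ln N should approach a_1 = 1/12 for a uniform N-level quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200_000)   # uniform source on [0, 1]

for N in (4, 16, 64, 256):
    delta = 1.0 / N
    # Uniform fixed-rate quantizer: map x to the midpoint of its cell.
    xhat = (np.floor(x / delta) + 0.5) * delta
    D = np.mean((x - xhat) ** 2)
    R = np.log(N)                    # rate in nats, R = ln N
    print(f"N={N:4d}  e^(2R)*D = {np.exp(2 * R) * D:.5f}  (theory 1/12 = {1/12:.5f})")
```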

Lagrangian form (Gray, Linder, Li, Gill):

variable-rate:

$$\lim_{\lambda\to 0}\left[\inf_q\left(\frac{\rho(f,\lambda,0,q)}{\lambda}\right)+\frac{k}{2}\ln\lambda\right]
= \theta_k + h(f)$$

$$\theta_k \triangleq \inf_{\lambda>0}\left[\inf_q\left(\frac{\rho(u,\lambda,0,q)}{\lambda}\right)+\frac{k}{2}\ln\lambda\right]
= \frac{k}{2}\ln\frac{2 e b_k}{k}$$

fixed-rate:

$$\lim_{\lambda\to 0}\left[\inf_q\left(\frac{\rho(f,\lambda,1,q)}{\lambda}\right)+\frac{k}{2}\ln\lambda\right]
= \psi_k + \ln\|f\|_{k/(k+2)}^{k/2}$$

$$\psi_k \triangleq \inf_{\lambda>0}\left[\inf_q\left(\frac{\rho(u,\lambda,1,q)}{\lambda}\right)+\frac{k}{2}\ln\lambda\right]
= \frac{k}{2}\ln\frac{2 e a_k}{k}$$

Both have the form: the optimum for the uniform density u plus a function of f.

Ongoing: conjecture of similar results for combined constraints

$$\lim_{\lambda\to 0}\Bigg(\underbrace{\inf_q\bigg(\frac{E_f[d(X,\beta(\alpha(X)))]}{\lambda}+(1-\eta)H_f(q)+\eta\ln N_q\bigg)}_{\rho(f,\lambda,\eta)/\lambda}+\frac{k}{2}\ln\lambda\Bigg)
= \theta_k(\eta) + h(f,\eta),$$

$$\theta_k(\eta) \triangleq \inf_{\lambda>0}\bigg(\frac{\rho(u,\lambda,\eta)}{\lambda}+\frac{k}{2}\ln\lambda\bigg)$$

where h(f, η) is given by a convex minimization:

$$h(f,\eta) = h(f) + \inf_\nu\left[\frac{k}{2}\int dx\,f(x)\left[e^{\nu(x)}-\nu(x)-1\right]+\eta H(f\|\Lambda)\right]$$

where

$$\Lambda(x)=\frac{e^{-k\nu(x)/2}}{\int e^{-k\nu(y)/2}\,dy},\qquad
H(f\|\Lambda)=\int f(x)\,\ln\frac{f(x)}{\Lambda(x)}\,dx. \qquad (1)$$

Proofs

Rigorous proofs of the traditional cases are painfully tedious, but follow Zador's original approach:

Step 1. Prove the result for a uniform density on a cube.

Step 2. Prove the result for a pdf that is piecewise constant on disjoint cubes of equal volume.

Step 3. Prove the result for a general pdf on a cube by approximating it by a piecewise constant pdf on small cubes.

Step 4. Extend the result for a general pdf on the cube to general pdfs on ℝ^k by limiting arguments.

Heuristic proofs developed by Gersho: assume the existence of a quantizer point density function Λ(x) for a sequence of codebooks of size N:

$$\lim_{N\to\infty}\frac{1}{N}\times(\#\text{ reproduction vectors in a set } S)=\int_S \Lambda(x)\,dx \quad\text{for all } S,$$

where $\int_{\mathbb{R}^k}\Lambda(x)\,dx = 1$.

Assume f_X(x) is smooth, the rate is large, and the minimum-distortion quantizer has cells S_i ≈ scaled, rotated, and translated copies of the S* achieving

$$c_k = \min_{\text{tessellating convex polytopes } S} M(S),\qquad
M(S)=\frac{1}{k\,V(S)^{2/k}}\int_S\frac{\|x-\text{centroid}(S)\|^2}{V(S)}\,dx.$$

Then hand-waving approximations of integrals relate N(q), D_f(q), H_f(q):

$$D_f(q)\approx c_k\,E_f\!\left[\left(\frac{1}{N(q)\Lambda(X)}\right)^{2/k}\right],\qquad
H_f(q)\approx h(X)-E\!\left[\ln\frac{1}{N(q)\Lambda(X)}\right]=\ln N(q)-H(f\|\Lambda).$$

Optimizing using Hölder's inequality or Jensen's inequality yields the classic fixed-rate and variable-rate results.

This can also be applied to the combined-constraint case to get

$$\theta(f,\lambda,\eta,\Lambda)=\frac{k}{2}\ln\!\left(\frac{k\,e\,c_k}{2}\right)+(1-\eta)\,h(f)
+\frac{k}{2}\ln\!\left(E_f\!\left[(\Lambda(X))^{-2/k}\right]\right)+(1-\eta)\,E_f\!\left[\ln\Lambda(X)\right], \qquad (2)$$

and the goal is to minimize over Λ.

If we substitute θ_k(η) for the c_k term, this is equivalent to the conjecture proposed here with Λ(x) defined by (1)!

This suggests connections between the rigorous and heuristic proofs, which may help insight and simplify proofs.

Current state: a rigorous proof for u and upper bounds for the other steps. Converses are lacking.

Variable-rate: Asymptotically Optimal Quantizers

Theorem ⇒ for λ_n ↓ 0 there is a λ_n-asymptotically optimal sequence of quantizers q_n = (α_n, β_n, ℓ_n):

$$\lim_{n\to\infty}\left(\frac{E_f[d(X,\beta_n(\alpha_n(X)))]}{\lambda_n}+E_f[\ell_n(\alpha_n(X))]+\frac{k}{2}\ln\lambda_n\right)=\theta_k+h(f)$$

⇒ λ controls the distortion and the entropy separately [DCC 05]:

$$\lim_{n\to\infty}\frac{2\,D_f(q_n)}{k\,\lambda_n}=1,\qquad
\lim_{n\to\infty}\left(H(q_n)+\frac{k}{2}\ln\lambda_n\right)=h(f)+\theta_k.$$

(Similar results hold for fixed-rate and combined constraints.)

Variable-rate codes: Mismatch

What if we design an a.o. sequence of quantizers q_n for a pdf g, but apply it to f? ⇒ mismatch

The answer is in terms of relative entropy:

Theorem [Gray, Linder (2003)]. If q_n is λ_n-a.o. for g, then

$$\lim_{n\to\infty}\left(\frac{D_f(q_n)}{\lambda_n}+E_f[\ell_n(\alpha_n(X))]+\frac{k}{2}\ln\lambda_n\right)=h(f)+\theta_k+H(f\|g).$$
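Since the mismatch term H(f‖g) has a closed form when both densities are Gaussian, here is a short sketch (mine, not from the slides) of that computation; the example parameters are arbitrary.

```python
import numpy as np

def gauss_kl(mu0, K0, mu1, K1):
    """Relative entropy H(N(mu0,K0) || N(mu1,K1)) in nats:
    0.5*[tr(K1^-1 K0) + (mu1-mu0)^T K1^-1 (mu1-mu0) - k + ln(det K1 / det K0)]."""
    k = len(mu0)
    K1_inv = np.linalg.inv(K1)
    diff = np.asarray(mu1) - np.asarray(mu0)
    return 0.5 * (np.trace(K1_inv @ K0) + diff @ K1_inv @ diff - k
                  + np.log(np.linalg.det(K1) / np.linalg.det(K0)))

# Example: mismatch loss when codes designed for g = N(0, 2I) are applied to f = N(0, I).
print(gauss_kl([0.0, 0.0], np.eye(2), [0.0, 0.0], 2 * np.eye(2)))
```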

Worst case (variable-rate): the asymptotic performance depends on f only through h(f) ⇒ the worst-case source under the given constraints is the maximum-h source; e.g., given the mean m and covariance K, the worst case is Gaussian.

If we use the worst-case Gaussian g to design codes and apply them to f, the codes are also robust in the sense of Sakrison: Gaussian codes yield the same performance on other sources with the same covariance. The loss from the optimum is H(f‖g).

⇒ this suggests using relative entropy (mismatch) as a "distance" or "distortion measure" on pdfs in order to "quantize" the space of pdfs and thereby fit models to observed data.

Classified Codes and Mixture Models

Problem with a single worst case: too conservative.

Alternative idea: fit a different Gauss source to distinct groups of inputs instead of to the entire collection of inputs. Robust codes for local behavior.

Suppose we have a partition S = {S_m; m ∈ I} of ℝ^k. If the input falls in S_m, use codes designed for a Gaussian model g_m = N(μ_m, K_m) ∈ M, the space of all nonsingular Gaussian pdfs on ℝ^k.

Design asymptotically optimal codes q_{m,n}; n = 1, 2, . . . for each m and a common λ_n → 0.

Construct the overall composite quantizer sequence q_n by using q_{m,n} if the input falls in S_m. Defining

$$f_m(x)=\begin{cases}f(x)/p_m & x\in S_m\\ 0 & \text{otherwise}\end{cases},\qquad p_m=\int_{S_m}f(x)\,dx \;\Rightarrow$$

$$\lim_{n\to\infty}\left(\frac{D_f(q_n)}{\lambda_n}+E_f[\ell_n(\alpha_n(X))]+\frac{k}{2}\ln\lambda_n\right)
=\theta_k+h(f)+\underbrace{\sum_m p_m\,H(f_m\|g_m)}_{\text{mismatch distortion}}$$

What is the best (smallest) we can make this over all partitions S = {S_m; m ∈ I} and collections G = {g_m, p_m; m ∈ I}?

⇔ Quantize the space of Gauss models using the mismatch distortion.

The outcome of the optimization is a Gauss mixture, an alternative to the Baum-Welch/EM algorithm.

We can use the Lloyd algorithm to optimize. The Lloyd conditions for the mismatch distortion imply:

Decoder (centroid):  $g_m = \arg\min_{g\in\mathcal{M}} H(f_m\|g) = \mathcal{N}(\mu_m, K_m)$

Encoder (partition):  $\alpha(x) = \arg\min_m d(x,m) \triangleq \arg\min_m\left(-\ln p_m + \tfrac{1}{2}\ln|K_m| + \tfrac{1}{2}(x-\mu_m)^t K_m^{-1}(x-\mu_m)\right)$

Length function:  L(m) = − ln p_m

The Lagrange weighting of ln p_m can be adjusted.

Gauss mixture vector quantization (GMVQ)
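One possible rendering (my own simplification: hard assignments, no covariance regularization, unit Lagrange weight on ln p_m) of these GMVQ Lloyd steps on training data:

```python
import numpy as np

def gmvq_lloyd(X, M=2, n_iter=20, seed=0):
    """GMVQ Lloyd design: hard-assign points with the mismatch encoder, then
    refit each component's mean, covariance, and probability (centroid step)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    mu = X[rng.choice(n, M, replace=False)].copy()
    K = np.array([np.cov(X, rowvar=False) for _ in range(M)])
    p = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # Encoder: alpha(x) = argmin_m [-ln p_m + 0.5 ln|K_m| + 0.5 (x-mu_m)^t K_m^-1 (x-mu_m)]
        d = np.empty((n, M))
        for m in range(M):
            Kinv = np.linalg.inv(K[m])
            diff = X - mu[m]
            quad = np.einsum("ij,jk,ik->i", diff, Kinv, diff)
            d[:, m] = -np.log(p[m]) + 0.5 * np.log(np.linalg.det(K[m])) + 0.5 * quad
        idx = np.argmin(d, axis=1)
        # Decoder (centroid): g_m = N(cell mean, cell covariance); length L(m) = -ln p_m
        for m in range(M):
            cell = X[idx == m]
            if len(cell) > k:          # skip degenerate cells in this toy sketch
                mu[m] = cell.mean(axis=0)
                K[m] = np.cov(cell, rowvar=False)
                p[m] = len(cell) / n
        p = p / p.sum()
    return mu, K, p

# Toy usage on synthetic two-cluster data (arbitrary):
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(1600, 2)), rng.normal(loc=[-2.0, 2.0], size=(400, 2))])
mu, K, p = gmvq_lloyd(X)
print(p)
```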

Toy example: GMVQ design of a GM

Source: 2-component GM:

m_1 = (0, 0)^t,  K_1 = [[1, 0], [0, 1]],  p_1 = 0.8

m_2 = (−2, 2)^t,  K_2 = [[1, 1], [1, 1.5]],  p_2 = 0.2


Data and initialization:

[Figure: the training data and the initial model.]

[Figures: GMVQ Lloyd iterations 1 through 12, after which the algorithm has converged.]

The algorithm for fitting a Gauss mixture to a dataset provides a supervised learning method for classifier design: given a collection of classes of interest, design a GMVQ for each.

Given a new vector (e.g., an image):

Traditional approach: plug a density estimate into the optimal Bayes classifier.

Best-codebook approach: code the image using each of the Gauss mixture VQs and select the class corresponding to the quantizer with the smallest average mismatch distortion.

Codebook matching, classification by compression: a "nearest-neighbor" rule that need not (explicitly) estimate P_{Y|X}.
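A sketch (mine, not from the slides) of the best-codebook rule just described, assuming each class is represented by a GMVQ model (means, covariances, probabilities) and a test image is represented by a set of feature vectors.

```python
import numpy as np

def gmvq_distortion(x, mu, K, p):
    """Per-vector GMVQ encoder cost: minimum over components of the mismatch distortion
    -ln p_m + 0.5 ln|K_m| + 0.5 (x-mu_m)^t K_m^-1 (x-mu_m)."""
    costs = []
    for m in range(len(p)):
        diff = x - mu[m]
        quad = diff @ np.linalg.inv(K[m]) @ diff
        costs.append(-np.log(p[m]) + 0.5 * np.log(np.linalg.det(K[m])) + 0.5 * quad)
    return min(costs)

def classify(blocks, class_models):
    """Best-codebook rule: code the blocks with each class's GMVQ and pick the
    class whose quantizer gives the smallest average distortion."""
    avg = {c: np.mean([gmvq_distortion(x, *model) for x in blocks])
           for c, model in class_models.items()}
    return min(avg, key=avg.get)

# Toy usage with two hypothetical single-component class models:
models = {"A": ([np.zeros(2)], [np.eye(2)], [1.0]),
          "B": ([np.array([-2.0, 2.0])], [np.eye(2)], [1.0])}
blocks = np.random.default_rng(0).normal(size=(50, 2))   # feature vectors from a test image
print(classify(blocks, models))
```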

Examples

Aerial images: man-made vs. natural (white: man-made, gray: natural).

Original 8 bpp gray-scale images and hand-labeled classified images. GMVQ achieves a probability of error of 12.23%.

Content-Addressable Databases / Image Retrieval

8 × 8 blocks, 9 training images for each of 50 classes, 500-image database (cross-validated).

Precision (the fraction of retrieved images that are relevant) ≈ recall (the fraction of relevant images that are retrieved) ≈ 0.94.

Query examples:

North Sea Gas Pipeline Data

[Images: normal pipeline image, field joint, longitudinal weld. Which is which?]

Methods:

• GMVQ: best codebook wins

• MAP-GMVQ: plug the ECVQ-produced GM into a MAP rule

• 1-NN

• MART (boosted classification tree)

• Regularized QDA (Gaussian class models)

• MAP-ECVQ (GM class models)

• MAP-EM: plug the EM-produced GM into a MAP rule

            Recall                  Precision               Accuracy
Method      S       V       W       S       V       W
MART        0.9608  0.9000  0.8718  0.9545  0.9000  0.8947  0.9387
Reg. QDA    0.9869  1.0000  0.9487  0.9869  0.9091  1.0000  0.9811
1-NN        0.9281  0.7000  0.8462  0.9221  0.8750  1.0000  0.8915
MAP-ECVQ    0.9737  0.9000  0.9437  0.9739  0.9000  0.9487  0.9623
MAP-EM      0.9739  0.9000  0.9487  0.9739  0.9000  0.9487  0.9623
MAP-GMVQ    0.9935  0.8500  0.9487  0.9682  1.0000  0.9737  0.9717
GMVQ        0.9673  0.8000  0.9487  0.9737  0.7619  0.9487  0.9481

Class  Subclasses
S      Normal, Osmosis Blisters, Black Lines, Small Black Corrosion Dots, Grinder Marks, MFL Marks, Corrosion Blisters, Single Dots
V      Longitudinal Welds
W      Weld Cavity, Field Joint

Recall = Pr(declare as class i | class i),  Precision = Pr(class i | declare as class i)

Implementation is simple, convergence is fast,

classification performance is good.

Parting Thoughts

• Quantization ideas are useful for A/D conversion, compression, classification, detection, and modeling, and provide an alternative to EM for Gauss mixture design.

• Connections with Gersho's conjecture on asymptotically optimal cell shapes (and its implications):

  Gersho's approach provides intuitive, but nonrigorous, proofs of high rate results based on a conjectured geometry of optimal cell shapes. It has many interesting implications that have never been proved:

  – Existence of quantizer point density functions at high rate (known for fixed-rate [Bucklew]).

  – The optimal high-rate variable-rate quantizer point density is uniform.

  – At high rate, optimal fixed-rate quantizers on u are maximum entropy.

  – Equality of fixed-rate and entropy-constrained optimal performance (a_k = b_k).

• The combined-constraint approach provides

  – a unified development of the traditional cases and a formulation consistent with Gersho's approach, allowing rigorous proofs of several of Gersho's steps and parts of his conjecture.

  – a description of the separate behavior of the distortion, entropy, and codebook size of asymptotically optimal quantizers.

  – equivalent conditions for the implications of Gersho's conjectures and a possible means of eventually demonstrating or refuting some of these implications. E.g., if Gersho's conjecture is true, then θ_k(η) is constant with zero derivative, a_k = b_k, and optimal fixed-rate codes have maximum entropy.

  – a new Lloyd optimality condition: pruning.

  – high-rate results that complement the Shannon results: the mixed-constraint theory does not have a Shannon counterpart, since in high dimensions D(R) describes both the asymptotic codebook size and the quantizer entropy (asymptotic equipartition property).

  – Adding components to an optimization problem may help find better local optima.