
Learning Near-Isometric Linear Embeddings

Richard Baraniuk

Rice University

Chinmay Hegde

MIT

Aswin Sankaranarayanan

CMU

Wotao Yin

UCLA

Challenge 1: too much data

Large Scale Datasets

Case in Point: DARPA ARGUS-IS

• 1.8 Gigapixel image sensor

Case in Point: DARPA ARGUS-IS

• 1.8 Gigapixel image sensor
 – video-rate output: 444 Gbits/s
 – comm data rate: 274 Mbits/s

factor of 1600x – way out of reach of existing compression technology

• Reconnaissance without conscience
 – too much data to transmit to a ground station
 – too much data to make effective real-time decisions

Challenge 2: data too expensive

Case in Point: MR Imaging

• Measurements very expensive

• $1-3 million per machine

• 30 minutes per scan

Case in Point: IR Imaging

DIMENSIONALITY REDUCTION

Intrinsic Dimensionality

Intrinsic dimension << Extrinsic dimension!

• Why? Geometry, that's why
• Exploit to perform more efficient analysis and processing of large-scale data

Linear Dimensionality Reduction

[Figure: linear measurement model y = Φx, with measurements y in R^M and signal x in R^N]

Linear Dimensionality Reduction

Goal: create a linear mapping Φ from R^N to R^M with M < N that preserves the key geometric properties of the data

e.g., the configuration of the data points

Dimensionality Reduction

• Given a training set of signals, find the "best" Φ that preserves its geometry

Dimensionality Reduction

• Given a training set of signals, find the "best" Φ that preserves its geometry

• Approach 1: Principal Component Analysis (PCA) via an SVD of the training signals
 – finds the "average" best-fitting subspace in the least-squares sense
 – the average error metric can distort the point-cloud geometry

Embedding

• Given a training set of signals, find the "best" Φ that preserves its geometry

• Approach 2: Inspired by the Restricted Isometry Property (RIP) and the Whitney Embedding Theorem

Isometric Embedding

• Given a training set of signals, find the "best" Φ that preserves its geometry

• Approach 2: Inspired by RIP and Whitney
 – design Φ to preserve inter-point distances (secants)
 – more faithful to the training data

Near-Isometric Embedding

• Given a training set of signals, find the "best" Φ that preserves its geometry

• Approach 2: Inspired by RIP and Whitney
 – design Φ to preserve inter-point distances (secants)
 – more faithful to the training data
 – but exact isometry can be too much to ask


Why Near-Isometry?

• Sensing
 – guarantees existence of a recovery algorithm
• Machine learning applications
 – kernel matrix depends only on pairwise distances
• Approximate nearest neighbors for classification
 – efficient dimensionality reduction
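As a concrete instance of the pairwise-distance point (an illustrative aside, not a slide from the talk): the Gaussian RBF kernel depends on the data only through distances,

K(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right),

so a near-isometric map Φ approximately preserves the entire kernel matrix.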

Existence of Near Isometries

• Johnson-Lindenstrauss Lemma

• Given a set of Q points, there exists a Lipschitz map that achieves near-isometry (with constant δ) provided M = O(δ⁻² log Q)

• Random matrices with iid subgaussian entries work
 – compressive sensing, locality-sensitive hashing, database monitoring, cryptography

• Existence of a solution!
 – but the constants are poor
 – oblivious to data structure

[Johnson-Lindenstrauss, 84] [Frankl and Maehara, 88] [Indyk and Motwani, 99] [Achlioptas, 01] [Dasgupta and Gupta, 02]
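A minimal numerical sketch of this data-oblivious baseline (the variable names and sizes below are illustrative, not from the talk): draw an iid Gaussian Φ and measure the worst secant distortion it induces on a synthetic point cloud.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, Q = 256, 40, 1000
X = rng.standard_normal((Q, N))                  # synthetic data set

# iid Gaussian embedding, scaled so that E||Phi v||^2 = ||v||^2
Phi = rng.standard_normal((M, N)) / np.sqrt(M)

# distortion over a random sample of normalized secants
idx = rng.integers(0, Q, size=(2000, 2))
V = X[idx[:, 0]] - X[idx[:, 1]]
V = V[np.linalg.norm(V, axis=1) > 0]             # drop degenerate pairs
V /= np.linalg.norm(V, axis=1, keepdims=True)
delta = np.abs(np.linalg.norm(V @ Phi.T, axis=1) ** 2 - 1.0)
print("max secant distortion:", delta.max())
```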

Designed Embeddings

• Unfortunately, random projections are data-oblivious (by definition)

• Q: Can we beat random projections?

• Our quest: A new approach for designing linear embeddings for specific datasets

[math alert]

Designing Embeddings

• Normalized secants v_ij = (x_i − x_j) / ‖x_i − x_j‖₂ [Whitney; Kirby; Wakin, B '09]

• Goal: approximately preserve the length of every secant

• Obviously, projecting along the direction of a secant (sending it into the null space of Φ) is a bad idea

Designing Embeddings

• Normalized secants

• Goal: approximately preserve the length of every secant

• Note: the total number of secants is large: S = Q(Q−1)/2 = O(Q²)
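A small helper sketch (illustrative; the names are hypothetical) that forms the normalized secant set used throughout the rest of the talk:

```python
import numpy as np

def normalized_secants(X):
    """All S = Q(Q-1)/2 normalized secants of a point cloud X (Q x N).

    Fine for small Q; for Q = 60000 this is ~1.8 billion secants, which is
    exactly why the column-generation variant discussed later is needed.
    """
    Q = X.shape[0]
    i, j = np.triu_indices(Q, k=1)                 # all unordered pairs
    V = X[i] - X[j]
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm secants
    return V                                        # S x N
```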

“Good” Linear Embedding Design

• Given: normalized secants v_1, …, v_S

• Seek: the "shortest" matrix Φ (fewest rows M) such that every ‖Φ v_i‖₂² lies within δ of 1

• Think of δ as the knob that controls the "maximum distortion" that you are willing to tolerate
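Written out as an optimization problem (reconstructed from the verbal description above, with Φ and δ as just introduced):

\min_{\Phi \in \mathbb{R}^{M \times N}} \; M
\quad \text{s.t.} \quad 1 - \delta \;\le\; \|\Phi v_i\|_2^2 \;\le\; 1 + \delta, \qquad i = 1, \dots, S.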

“Good” Linear Embedding Design

• Given: (normalized) secants v_1, …, v_S

• Seek: the "shortest" matrix Φ such that 1 − δ ≤ ‖Φ v_i‖₂² ≤ 1 + δ for all i

Lifting Trick

• Convert the constraints, which are quadratic in Φ, into constraints that are linear in the lifted variable P = ΦᵀΦ

• Given P, obtain Φ via a matrix square root
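Concretely (a reconstruction consistent with the definitions above): since ‖Φv_i‖₂² = v_iᵀΦᵀΦv_i = v_iᵀPv_i, the feasibility set becomes

1 - \delta \;\le\; v_i^\top P \, v_i \;\le\; 1 + \delta, \qquad P \succeq 0,

which is linear in P (plus one semidefinite cone constraint), and rank(P) = rank(Φ) = M.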

Relaxation

• Relax the rank minimization problem to a nuclear norm minimization problem (for P ⪰ 0 the nuclear norm is simply the trace)

NuMax

• Nuclear norm minimization with Max-norm constraints (NuMax)

• Semi-Definite Program (SDP)
 – solvable by standard interior-point methods

• The rank of the solution is determined by δ
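A minimal sketch of this SDP using an off-the-shelf modeling tool (cvxpy is my choice here, not the released NuMax code; `numax_sdp`, `V`, and `delta` are hypothetical names):

```python
import numpy as np
import cvxpy as cp

def numax_sdp(V, delta):
    """Solve the lifted NuMax program over the rows of V (normalized secants)."""
    S, N = V.shape
    P = cp.Variable((N, N), PSD=True)
    quad = cp.sum(cp.multiply(V @ P, V), axis=1)    # vector of v_k^T P v_k
    constraints = [quad <= 1 + delta, quad >= 1 - delta]
    cp.Problem(cp.Minimize(cp.trace(P)), constraints).solve()

    # Recover an embedding Phi from the (near) low-rank P via its eigendecomposition
    w, U = np.linalg.eigh(P.value)
    keep = w > 1e-6
    return (U[:, keep] * np.sqrt(w[keep])).T         # M x N, M = numerical rank of P
```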

Accelerating NuMax

• Poor scaling with N and S
 – least squares involves matrices with S rows
 – SVD of an N×N matrix

• Several avenues to accelerate:
 – Alternating Direction Method of Multipliers (ADMM)
 – exploit the fact that intermediate estimates of P are low-rank
 – exploit the fact that only a few secants define the optimal embedding ("column generation")

Accelerated NuMax

Can solve for datasets with Q = 100k points in N = 1000 dimensions in a few hours

[/math alert]

App: Linear Compression

• Images of translating blurred squares live on a K=2 dimensional smooth "surface" (manifold) in N=256 dimensional space

• Project a collection of 1000 such images into M-dimensional space while preserving structure (as measured by the distortion constant δ)

N=16x16=256

[Figures: rows of the "optimal" Φ learned for the N = 16x16 = 256 translating-squares images, displayed as 16x16 basis images, alongside the measurements/signal diagram]

App: Linear Compression

• M = 40 linear measurements are enough to ensure an isometry constant of δ = 0.01

Secant Distortion

• Distribution of secant distortions for the translating-squares dataset
• Embedding dimension M = 30
• Input distortion to NuMax is δ = 0.03

• Unlike PCA and random projections, NuMax yields distortions sharply concentrated at δ

Secant Distortion

• Translating-squares dataset
 – N = 16x16 = 256
 – M = 30
 – δ = 0.03

• Histograms of normalized secant distortions

[Histogram panels: random, PCA, NuMax; horizontal axis = distortion, up to 0.06]

MNIST (8) – Near Isometry

M = 14 basis functions achieve δ = 0.05

N=20x20=400

MNIST (8) – Near Isometry

N=20x20=400

App: Image Retrieval

• Goal: preserve the neighborhood structure of a set of images

LabelMe Image Dataset

• N = 512, Q = 4000; M = 45 suffices to preserve 80% of neighborhoods

App: Classification

• MNIST digits dataset
 – N = 20x20 = 400-dim images
 – 10 classes: digits 0-9
 – Q = 60000 training images

• Nearest-neighbor (NN) classifier
 – test on 10000 images

• Mis-classification rate of the NN classifier on the original dataset: 3.63%

App: Classification• MNIST dataset

– N = 20x20 = 400-dim images
 – 10 classes: digits 0-9
 – Q = 60000 training images, so S ≈ 1.8 billion secants!
 – NuMax-CG took 3 hours to process

• Mis-classification rate of the NN classifier: 3.63%

• NuMax provides the best NN-classification rates!

δ                             0.40    0.25    0.1
Rank of NuMax solution          72      97    167
Mis-classification rate (%)
  NuMax                       2.99    3.11    3.31
  Gaussian                    5.79    4.51    3.88
  PCA                         4.40    4.38    4.41

NuMax and Task Adaptivity

• Prune/weight the secants according to the task at hand
 – if the goal is reconstruction / retrieval, then preserve all secants
 – if the goal is signal classification, then treat inter-class secants differently from intra-class secants
 – this preferential weighting approach is akin to "boosting"

Optimized Classification

• Intra-class secants are not expanded
• Inter-class secants are not shrunk

This simple modification improves NN classification rates while using even fewer measurements (see the sketch below)
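A sketch of how the one-sided constraints above change the SDP (my reconstruction; `class_numax_sdp`, `V_intra`, and `V_inter` are hypothetical names, and the structure mirrors the earlier `numax_sdp` sketch):

```python
import cvxpy as cp

def class_numax_sdp(V_intra, V_inter, delta):
    """Class-aware variant: intra-class secants may shrink, inter-class may expand."""
    N = V_intra.shape[1]
    P = cp.Variable((N, N), PSD=True)
    quad = lambda V: cp.sum(cp.multiply(V @ P, V), axis=1)   # v_k^T P v_k
    constraints = [
        quad(V_intra) <= 1 + delta,   # intra-class secants are not expanded
        quad(V_inter) >= 1 - delta,   # inter-class secants are not shrunk
    ]
    cp.Problem(cp.Minimize(cp.trace(P)), constraints).solve()
    return P.value
```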

Optimized Classification• MNIST dataset

– N = 20x20 = 400-dim images
 – 10 classes: digits 0-9
 – Q = 60000 training images, so >1.8 billion secants!
 – NuMax-CG took 3-4 hours to process

1. Significant reduction in number of measurements (M)

2. Significant improvement in classification rate

δ                               0.40              0.25              0.1
Algorithm                   NuMax  NuMax-Class  NuMax  NuMax-Class  NuMax  NuMax-Class
Rank                           72        52       97        69       167       116
Mis-classification rate (%)  2.99      2.68     3.11      2.72      3.31      3.09

Conclusions

• NuMax – a new adaptive data representation that is linear and near-isometric
 – minimizes distortion to preserve the geometric information in a set of training signals

• Posed as a rank-minimization problem
 – relaxed to a semidefinite program (SDP)
 – NuMax solves it very efficiently via ADMM and CG

• Applications: classification, retrieval, compressive sensing, ++

• Nontrivial extension from signal recovery to signal inference

Open Problems

• Equivalence between the solutions of the min-rank and min-trace problems?

• Convergence rate of NuMax
 – preliminary studies show an o(1/k) rate of convergence

• Scaling of the algorithm
 – given a dataset of Q points, the number of secants is O(Q²)
 – are there alternate formulations that scale linearly/sub-linearly in Q?

• More applications

Software

• GNuMax
 – software package at dsp.rice.edu

• PneuMax
 – French-version software package coming soon

References

• C. Hegde, A. C. Sankaranarayanan, W. Yin, and R. G. Baraniuk, "A Convex Approach for Learning Near-Isometric Linear Embeddings," submitted to the Journal of Machine Learning Research, 2012.

• C. Hegde, A. C. Sankaranarayanan, and R. G. Baraniuk, "Near-Isometric Linear Embeddings of Manifolds," IEEE Statistical Signal Processing Workshop (SSP), August 2012.

• Y. Li, C. Hegde, A. Sankaranarayanan, R. Baraniuk, and K. Kelly, "Compressive Classification via Secant Projections," submitted to Optics Express, February 2014.

BONUS SLIDES

Practical Considerations

• In practice N large, Q very large!

• Computational cost per iteration scales with an N×N SVD and a least-squares solve over all O(Q²) secants

Solving NuMax

• Alternating Direction Method of Multipliers (ADMM)
 – solve for P using spectral thresholding
 – solve for L using least-squares
 – solve for q using "clipping"

• Computational/memory cost per iteration is dominated by the N×N SVD and the secant least-squares step
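Two of the ADMM sub-steps have simple closed forms; here is an illustrative sketch (helper names are mine, not the released code):

```python
import numpy as np

def spectral_threshold(A, tau):
    """Soft-threshold the singular values of A by tau (the core of the P-update)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def clip_to_box(q, delta):
    """Project each v_k^T P v_k value onto [1 - delta, 1 + delta] (the q-update)."""
    return np.clip(q, 1.0 - delta, 1.0 + delta)
```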

Accelerating NuMax

• Poor scaling with N and Q
 – least squares involves matrices with O(Q²) rows
 – SVD of an N×N matrix

• Observation 1
 – intermediate estimates of P are low-rank
 – use a low-rank representation to reduce memory and accelerate computations
 – use an incremental SVD for faster computations

Accelerating NuMax

• Observation 2
 – by the KKT conditions (complementary slackness), only constraints that are satisfied with equality determine the solution ("active constraints")

• Analogy: recall support vector machines (SVMs), where we solve a max-margin problem; the solution is determined only by the support vectors – the training points whose margin constraints are active (hold with equality)

NuMax-CG

• Observation 2
 – by the KKT conditions (complementary slackness), only constraints that are satisfied with equality determine the solution ("active constraints")

• Hence, at a feasible solution P*, only the secants v_k for which |v_kᵀ P* v_k − 1| = δ determine the value of P*

• Key: the number of "support secants" << the total number of secants
 – so we only need to track the support secants
 – a "column generation" approach to solving NuMax (sketched below)
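An illustrative outer loop for the column-generation idea (my reconstruction; it reuses the hypothetical `numax_sdp` sketch from earlier, and `batch`/`tol` are made-up knobs):

```python
import numpy as np

def numax_cg(V, delta, n_init=500, batch=500, tol=1e-6, max_iter=20):
    """Solve NuMax on a growing active set of secants instead of all of V at once."""
    rng = np.random.default_rng(0)
    active = set(rng.choice(V.shape[0], size=min(n_init, V.shape[0]), replace=False).tolist())
    for _ in range(max_iter):
        Phi = numax_sdp(V[sorted(active)], delta)             # solve on the active set
        d = np.abs(np.sum((V @ Phi.T) ** 2, axis=1) - 1.0)    # distortion of ALL secants
        violated = np.where(d > delta + tol)[0]
        if violated.size == 0:
            return Phi                                        # every constraint satisfied
        worst = violated[np.argsort(-d[violated])][:batch]    # add most-violated secants
        active |= set(worst.tolist())
    return Phi
```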

• Example from our paper with Yun and Kevin.

• (a) & (b) : example target images (toy bus vs toy car; 1D manifold of rotations)

• (c): PCA basis functions learned from inter-class secants.

• (d): NuMax basis functions learned from inter-class secants.

(Optional) Real-World Expts

• Real-data experiments using the Rice Single-Pixel Camera

• Test scenes: toy bus/car at unknown orientations
• NuMax results:

(Optional) Real-World Expts

• Experimental details:
 – N = 64x64 = 4096, 72 images for each class
 – acquire M measurements using {PCA, Bernoulli-random, NuMax}
 – perform nearest-neighbor classification

NuMax: Analysis

• Performance of NuMax depends upon the tightness of the convex relaxation:

Q. When is this relaxation tight?

A. Open Problem, likely very hard

NuMax: Analysis

However, we can rigorously analyze the case where Φ is further constrained to be orthonormal

• Essentially enforces that the rows of Φ are (i) unit-norm and (ii) pairwise orthogonal

• Upshot: Models a per-sample energy constraint of a CS acquisition system

– Different measurements necessarily probe “new” portions of the signal space

– Measurements remain uncorrelated, so noise/perturbations in the input data are not amplified

Slight Refinement

1. Look at the converse problem: fix the embedding dimension M and solve for the linear embedding with minimum distortion δ as a function of M
 – does not change the problem qualitatively

2. Restrict the problem to the space of orthonormal embeddings
 – orthonormality: Φ Φᵀ = I_M (see below)
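Putting the two refinements together (a reconstruction from the description above, notation as before):

\min_{\Phi \in \mathbb{R}^{M \times N}} \; \delta
\quad \text{s.t.} \quad \bigl|\, \|\Phi v_i\|_2^2 - 1 \,\bigr| \le \delta \;\; \forall i,
\qquad \Phi \Phi^\top = I_M.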

Slight Refinement

• As in NuMax, lifting + trace-norm relaxation:

• Efficient solution algorithms (NuMax, NuMax-CG) remain essentially unchanged

• However, solutions come with guarantees …

Analytical Guarantee

• Theorem [Grant, Hegde, Indyk '13]
 Denote the optimal distortion obtained by a rank-M orthonormal embedding as δ*(M). Then, by solving an SDP, we can efficiently construct a rank-2M embedding whose distortion is close to δ*(M).

i.e., one can get close to the optimal distortion by paying an additional price in the measurement budget (M)

CVDomes Radar Signals

• Training data: 2000 secants (inter-class, joint)
• Test data: 100 signatures from each class