Learning Near-Isometric Linear Embeddings

Richard Baraniuk, Rice University
Chinmay Hegde, MIT
Aswin Sankaranarayanan, CMU
Wotao Yin, UCLA
Challenge 1: too much data
Large Scale Datasets
Case in Point: DARPA ARGUS-IS

• 1.8 Gigapixel image sensor
  – video-rate output: 444 Gbits/s
  – communications data rate: 274 Mbits/s

• A factor of 1600x – way out of reach of existing compression technology
• Reconnaissance without conscience
  – too much data to transmit to a ground station
  – too much data to make effective real-time decisions
Challenge 2: data too expensive
Case in Point: MR Imaging
• Measurements very expensive
• $1-3 million per machine
• 30 minutes per scan
Case in Point: IR Imaging
DIMENSIONALITY REDUCTION
Intrinsic Dimensionality
Intrinsic dimension << Extrinsic dimension!
• Why? Geometry, that’s why
• Exploit to perform more efficient analysis and processing of large-scale data
Linear Dimensionality Reduction

[Figure: linear measurement model – measurements y = Φx computed from signal x]
Goal: create a linear mapping Φ: R^N → R^M with M < N that preserves the key geometric properties of the data (e.g., the configuration of the data points)
Dimensionality Reduction
• Given a training set of signals, find the “best” Φ that preserves its geometry
• Approach 1: Principal Component Analysis (PCA) via the SVD of the training signals
  – finds the “average” best-fitting subspace in the least-squares sense
  – an average error metric can distort point-cloud geometry
Embedding
• Given a training set of signals, find the “best” Φ that preserves its geometry
• Approach 2: Inspired by
  – the Restricted Isometry Property (RIP)
  – the Whitney Embedding Theorem
Isometric Embedding
• Given a training set of signals, find the “best” Φ that preserves its geometry

• Approach 2: Inspired by RIP and Whitney
  – design Φ to preserve inter-point distances (secants)
  – more faithful to training data
Near-Isometric Embedding
• Given a training set of signals, find the “best” Φ that preserves its geometry

• Approach 2: Inspired by RIP and Whitney
  – design Φ to preserve inter-point distances (secants)
  – more faithful to training data
  – but exact isometry can be too much to ask
Why Near-Isometry?

• Sensing
  – guarantees existence of a recovery algorithm
• Machine learning applications
  – kernel matrix depends only on pairwise distances
• Approximate nearest neighbors for classification
  – efficient dimensionality reduction
Existence of Near Isometries
• Johnson-Lindenstrauss Lemma
• Given a set of Q points, there exists a Lipschitz map that achieves near-isometry (with constant δ) provided M = O(δ⁻² log Q)

• Random matrices with i.i.d. sub-Gaussian entries work
  – compressive sensing, locality-sensitive hashing, database monitoring, cryptography

• Existence of a solution!
  – but the constants are poor
  – oblivious to data structure

[Johnson and Lindenstrauss, 84] [Frankl and Maehara, 88] [Indyk and Motwani, 99] [Achlioptas, 01] [Dasgupta and Gupta, 02]
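For concreteness, here is a minimal sketch (ours, not the talk’s software; NumPy assumed) of the J-L construction: a random Gaussian matrix, scaled by 1/√M, roughly preserves pairwise distances without ever looking at the data.

```python
# Minimal J-L sketch: a data-oblivious random projection approximately
# preserves pairwise distances between Q points.
import numpy as np

rng = np.random.default_rng(0)
N, M, Q = 1000, 100, 50
X = rng.standard_normal((Q, N))                  # toy point cloud
Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # i.i.d. Gaussian entries

for i, j in [(0, 1), (2, 7), (10, 40)]:
    d_before = np.linalg.norm(X[i] - X[j])
    d_after = np.linalg.norm(Phi @ (X[i] - X[j]))
    print(f"pair ({i},{j}): {d_before:.2f} -> {d_after:.2f}")
```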
Designed Embeddings
• Unfortunately, random projections are data-oblivious (by definition)
• Q: Can we beat random projections?
• Our quest: A new approach for designing linear embeddings for specific datasets
[math alert]
Designing Embeddings
• Normalized secants: v_ij = (x_i − x_j) / ‖x_i − x_j‖₂  [Whitney; Kirby; Wakin, B ’09]

• Goal: approximately preserve the length of every secant: ‖Φ v_ij‖₂ ≈ 1

• Obviously, projecting along a direction orthogonal to a secant (so that Φ v_ij ≈ 0) is a bad idea
Designing Embeddings

• Note: the total number of secants is large: S = Q(Q−1)/2 = O(Q²) for Q training points
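A small illustration of how quickly the secant set grows (NumPy assumed; the helper below is ours and is reused in later snippets):

```python
# Build the full set of S = Q(Q-1)/2 normalized secants of a point cloud.
import numpy as np
from itertools import combinations

def normalized_secants(X):
    """Unit secants v_ij = (x_i - x_j)/||x_i - x_j|| for all pairs of rows of X."""
    V = np.array([X[i] - X[j] for i, j in combinations(range(len(X)), 2)])
    return V / np.linalg.norm(V, axis=1, keepdims=True)

X = np.random.default_rng(0).standard_normal((100, 64))   # Q=100, N=64
V = normalized_secants(X)
print(V.shape)   # (4950, 64): already ~5k secants for only 100 points
```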
“Good” Linear Embedding Design

• Given: normalized secants {v_k}, k = 1, …, S

• Seek: the “shortest” matrix Φ (the fewest rows M) such that
    1 − δ ≤ ‖Φ v_k‖₂² ≤ 1 + δ   for all k

• Think of δ as the knob that controls the “maximum distortion” that you are willing to tolerate
Lifting Trick

• Convert the quadratic constraints in Φ into linear constraints in the lifted variable P = ΦᵀΦ:
    ‖Φ v_k‖₂² = v_kᵀ P v_k

• Given P, obtain Φ via a matrix square root (e.g., an eigendecomposition of P)
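A quick numerical check of the lifting trick (NumPy assumed; a sketch, not the talk’s code):

```python
# Lifting: the quadratic form ||Phi v||^2 is linear in P = Phi^T Phi,
# and a Phi can be recovered from P by a matrix square root.
import numpy as np

rng = np.random.default_rng(0)
N, M = 64, 8
Phi = rng.standard_normal((M, N))
v = rng.standard_normal(N)
v /= np.linalg.norm(v)

P = Phi.T @ Phi                                   # the lifted variable
assert np.isclose(np.linalg.norm(Phi @ v)**2, v @ P @ v)

# Matrix square root via eigendecomposition: rank(P) rows suffice.
w, U = np.linalg.eigh(P)
keep = w > 1e-10 * w.max()
Phi_rec = np.sqrt(w[keep])[:, None] * U[:, keep].T
assert np.allclose(Phi_rec.T @ Phi_rec, P)        # same secant geometry
```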
Relaxation

• Finding the shortest Φ is a rank-minimization problem over P:
    minimize rank(P)  subject to  |v_kᵀ P v_k − 1| ≤ δ for all k,  P ⪰ 0

• Relax the (NP-hard) rank objective to its convex surrogate, the nuclear norm ‖P‖_* (which equals trace(P) for P ⪰ 0)
NuMax

• Nuclear norm minimization with Max-norm constraints (NuMax):
    minimize trace(P)  subject to  max_k |v_kᵀ P v_k − 1| ≤ δ,  P ⪰ 0

• A Semi-Definite Program (SDP) – solvable by standard interior-point methods; a toy prototype is sketched below

• The rank of the solution (and hence M) is determined by δ
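For intuition, the program above can be prototyped with a generic SDP modeling tool. Below is a minimal sketch assuming cvxpy is installed; this is our illustration, practical only at small problem sizes, not the authors’ NuMax implementation.

```python
import cvxpy as cp
import numpy as np

def numax_sdp(secants, delta):
    """min trace(P) s.t. |v^T P v - 1| <= delta, P >= 0; return Phi (M x N)."""
    N = secants.shape[1]
    P = cp.Variable((N, N), PSD=True)
    constraints = [
        cp.abs(cp.sum(cp.multiply(P, np.outer(v, v))) - 1) <= delta
        for v in secants                 # linear in P, by the lifting trick
    ]
    # trace(P) stands in for the nuclear norm since P is constrained PSD
    cp.Problem(cp.Minimize(cp.trace(P)), constraints).solve()
    w, U = np.linalg.eigh(P.value)
    keep = w > 1e-6 * w.max()            # numerical rank of P gives M
    return np.sqrt(w[keep])[:, None] * U[:, keep].T
```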
Accelerating NuMax

• Poor scaling with N and S
  – least squares involves matrices with S rows
  – SVD of an N×N matrix

• Several avenues to accelerate:
  – Alternating Direction Method of Multipliers (ADMM)
  – exploit the fact that intermediate estimates of P are low-rank
  – exploit the fact that only a few secants define the optimal embedding (“column generation”)
Accelerated NuMax

Can solve for datasets with Q = 100k points in N = 1000 dimensions in a few hours
[/math alert]
App: Linear Compression

• Images of translating blurred squares live on a K = 2 dimensional smooth “surface” (manifold) in N = 16×16 = 256 dimensional space

• Project a collection of 1000 such images into M-dimensional space while preserving structure (as measured by the distortion constant δ)
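A plausible reconstruction of this synthetic dataset (NumPy/SciPy assumed; the talk does not give the square size or blur width, so those parameters are guesses):

```python
# Translating blurred squares: 2 latent parameters (the center), so the
# images trace out a K=2 manifold in N = 16x16 = 256 dimensions.
import numpy as np
from scipy.ndimage import gaussian_filter

def square_image(cx, cy, n=16, side=5, blur=1.0):
    """n x n image of a blurred square centered near (cx, cy), flattened."""
    img = np.zeros((n, n))
    x0 = max(int(round(cx)) - side // 2, 0)
    y0 = max(int(round(cy)) - side // 2, 0)
    img[x0:x0 + side, y0:y0 + side] = 1.0
    return gaussian_filter(img, sigma=blur).ravel()

rng = np.random.default_rng(0)
centers = rng.uniform(4, 12, size=(1000, 2))
X = np.stack([square_image(cx, cy) for cx, cy in centers])
print(X.shape)   # (1000, 256): 1000 images on a 2-D manifold
```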
[Figures: rows of the “optimal” Φ displayed as N = 16×16 = 256 pixel images; measurement model y = Φx (measurements y, signal x)]
App: Linear Compression

• M = 40 linear measurements are enough to ensure an isometry constant of δ = 0.01
Secant Distortion

• Distribution of secant distortions for the translating squares dataset
• Embedding dimension M = 30; input distortion to NuMax is δ = 0.03
• In contrast to PCA and random projections, NuMax yields distortions sharply concentrated at δ
[Figure: histograms of normalized secant distortions for random, PCA, and NuMax embeddings on the translating squares dataset (N = 16×16 = 256, M = 30, δ = 0.03); horizontal axes run to 0.06]
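A sketch of how such histograms can be computed (NumPy assumed, reusing normalized_secants from the earlier snippet):

```python
import numpy as np

def secant_distortions(Phi, secants):
    """| ||Phi v||^2 - 1 | for each unit secant v (rows of `secants`)."""
    return np.abs(np.linalg.norm(secants @ Phi.T, axis=1)**2 - 1.0)

# e.g. np.histogram(secant_distortions(Phi, V), bins=50): for NuMax the
# mass piles up just below delta, for PCA/random it spreads out.
```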
MNIST (8) – Near Isometry

• M = 14 basis functions achieve δ = 0.05 (N = 20×20 = 400)
App: Image Retrieval

• Goal: preserve the neighborhood structure of a set of images

• LabelMe Image Dataset
  – N = 512, Q = 4000; M = 45 suffices to preserve 80% of neighborhoods
App: Classification

• MNIST digits dataset
  – N = 20×20 = 400-dimensional images
  – 10 classes: digits 0-9
  – Q = 60000 training images

• Nearest-neighbor (NN) classifier, tested on 10000 images

• Mis-classification rate of the NN classifier on the original dataset: 3.63%
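A sketch of this evaluation, assuming scikit-learn (our illustration of the experiment, not the authors’ code):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def nn_error_rate(Phi, X_train, y_train, X_test, y_test):
    """1-NN mis-classification rate after linear embedding by Phi."""
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(X_train @ Phi.T, y_train)          # classify in R^M, not R^N
    return 1.0 - clf.score(X_test @ Phi.T, y_test)
```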
App: Classification

• MNIST dataset
  – N = 20×20 = 400-dimensional images
  – 10 classes: digits 0-9
  – Q = 60000 training images, so S ≈ 1.8 billion secants!
  – NuMax-CG took 3 hours to process

• Mis-classification rate of the NN classifier on the original dataset: 3.63%

• NuMax provides the best NN-classification rates!

  δ                              0.40   0.25   0.1
  Rank of NuMax solution           72     97    167

  Mis-classification rate (%):
    NuMax                        2.99   3.11   3.31
    Gaussian                     5.79   4.51   3.88
    PCA                          4.40   4.38   4.41
NuMax and Task Adaptivity
• Prune the secants according to the task at hand
– If goal is reconstruction / retrieval, then preserve all secants
– If goal is signal classification, then preserve inter-class secants differently from intra-class secants
– This preferential weighting approach is akin to “boosting”
Optimized Classification

• Constrain the secants asymmetrically (a sketch follows below):
  – intra-class secants are not expanded: ‖Φv‖₂² ≤ 1 + δ
  – inter-class secants are not shrunk: ‖Φv‖₂² ≥ 1 − δ

• This simple modification improves NN classification rates while using even fewer measurements
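In the cvxpy prototype from earlier, this amounts to replacing the two-sided constraint with one-sided constraints per secant type (illustrative names, not the authors’ API):

```python
import cvxpy as cp
import numpy as np

def class_numax_constraints(P, intra_secants, inter_secants, delta):
    quad = lambda v: cp.sum(cp.multiply(P, np.outer(v, v)))   # v^T P v
    cons = [quad(v) <= 1 + delta for v in intra_secants]   # may shrink,
                                                           # never expand
    cons += [quad(v) >= 1 - delta for v in inter_secants]  # may expand,
                                                           # never shrink
    return cons
```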
Optimized Classification

• MNIST dataset
  – N = 20×20 = 400-dimensional images
  – 10 classes: digits 0-9
  – Q = 60000 training images, so >1.8 billion secants!
  – NuMax-CG took 3-4 hours to process

1. Significant reduction in the number of measurements (M)
2. Significant improvement in classification rate

  δ                                0.40               0.25               0.1
  Algorithm                    NuMax  NuMax-Class  NuMax  NuMax-Class  NuMax  NuMax-Class
  Rank                            72      52         97      69         167     116
  Mis-classification rate (%)   2.99    2.68       3.11    2.72        3.31    3.09
Conclusions

• NuMax – a new adaptive data representation that is linear and near-isometric
  – minimizes distortion to preserve the geometric information in a set of training signals

• Posed as a rank-minimization problem
  – relaxed to a semi-definite program (SDP)
  – NuMax solves it very efficiently via ADMM and column generation (CG)

• Applications: classification, retrieval, compressive sensing, ++

• A nontrivial extension from signal recovery to signal inference
Open Problems

• Equivalence between the solutions of the min-rank and min-trace problems?

• Convergence rate of NuMax
  – preliminary studies show an O(1/k) rate of convergence

• Scaling of the algorithm
  – given a dataset of Q points, the number of secants is O(Q²)
  – are there alternate formulations that scale linearly/sub-linearly in Q?

• More applications
Software

• GNuMax – software package at dsp.rice.edu

• PneuMax – French-version software package coming soon
References

• C. Hegde, A. C. Sankaranarayanan, W. Yin, and R. G. Baraniuk, “A Convex Approach for Learning Near-Isometric Linear Embeddings,” submitted to Journal of Machine Learning Research, 2012.

• C. Hegde, A. C. Sankaranarayanan, and R. G. Baraniuk, “Near-Isometric Linear Embeddings of Manifolds,” IEEE Statistical Signal Processing Workshop (SSP), August 2012.

• Y. Li, C. Hegde, A. Sankaranarayanan, R. Baraniuk, and K. Kelly, “Compressive Classification via Secant Projections,” submitted to Optics Express, February 2014.
BONUS SLIDES
Practical Considerations

• In practice N is large and Q is very large!

• Computational cost per iteration grows with N and with the number of secant constraints S = O(Q²)

Solving NuMax

• Alternating Direction Method of Multipliers (ADMM), sketched below:
  – solve for P using spectral thresholding
  – solve for L using least squares
  – solve for q using “clipping”

• Computational/memory cost per iteration is dominated by these N×N and S-row subproblems
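For illustration, here are the three sub-steps written as stand-alone proximal maps (a structural sketch under our own names; the step sizes, dual updates, and stopping rules of the full ADMM iteration are omitted):

```python
import numpy as np

def spectral_threshold(Z, tau):
    """P-update: eigenvalue soft-thresholding, the prox of the nuclear
    norm over symmetric PSD matrices."""
    w, U = np.linalg.eigh((Z + Z.T) / 2)   # symmetrize for safety
    w = np.maximum(w - tau, 0.0)
    return (U * w) @ U.T                   # U @ diag(w) @ U.T

def clip_secant_lengths(q, delta):
    """q-update: project the lifted secant lengths v^T L v onto the
    feasible band [1 - delta, 1 + delta]."""
    return np.clip(q, 1.0 - delta, 1.0 + delta)

# L-update: a least-squares solve coupling P and q through the secant
# constraints; its system matrix has one row per secant, which is
# exactly the S-row bottleneck noted above.
```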
Accelerating NuMax

• Poor scaling with N and Q
  – least squares involves matrices with O(Q²) rows
  – SVD of an N×N matrix

• Observation 1 – intermediate estimates of P are low-rank
  – use a low-rank representation to reduce memory and accelerate computations
  – use incremental SVD for faster computations
Accelerating NuMax

• Observation 2 – by the KKT conditions (complementary slackness), only constraints that are satisfied with equality determine the solution (“active constraints”)

• Analogy: recall support vector machines (SVMs), where we solve
    minimize ½‖w‖₂²  subject to  y_i (wᵀx_i + b) ≥ 1 for all i

• The solution is determined only by the support vectors – those x_i for which y_i (wᵀx_i + b) = 1
NuMax-CG

• Observation 2 – by the KKT conditions (complementary slackness), only constraints that are satisfied with equality determine the solution (“active constraints”)

• Hence, given a feasible solution P*, only the secants v_k for which |v_kᵀ P* v_k − 1| = δ determine the value of P*

• Key: the number of “support secants” << the total number of secants
  – so we only need to track the support secants
  – a “column generation” approach to solving NuMax, sketched below
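A sketch of that outer loop (our illustration: solve_numax stands for any solver of the reduced problem, e.g. the SDP prototype above; the batch size and tolerance are arbitrary choices):

```python
import numpy as np

def numax_cg(secants, delta, solve_numax, batch=100, tol=1e-4):
    """Column generation: solve on a working set, add the worst violators."""
    rng = np.random.default_rng(0)
    work = set(rng.choice(len(secants), size=batch, replace=False).tolist())
    while True:
        Phi = solve_numax(secants[sorted(work)], delta)
        viol = np.abs(np.linalg.norm(secants @ Phi.T, axis=1)**2 - 1) - delta
        worst = np.argsort(viol)[-batch:]     # most-violated secants
        if viol[worst].max() <= tol:
            return Phi                        # every secant (near-)feasible
        work |= set(worst.tolist())           # track only "support secants"
```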
• Example from our paper with Yun and Kevin [Li et al., Optics Express 2014]

• (a) & (b): example target images (toy bus vs. toy car; a 1-D manifold of rotations)
• (c): PCA basis functions learned from inter-class secants
• (d): NuMax basis functions learned from inter-class secants
(Optional) Real-World Expts
• Real-data experiments using the Rice Single-Pixel Camera
• Test scenes: toy bus/car at unknown orientations
• NuMax results:
(Optional) Real-World Expts
• Experimental details:
  – N = 64×64 = 4096; 72 images for each class
  – acquire M measurements using {PCA, Bernoulli-random, NuMax}
  – perform nearest-neighbor classification
NuMax: Analysis
• Performance of NuMax depends upon the tightness of the convex relaxation of rank(P) by trace(P)

Q. When is this relaxation tight?

A. Open problem – likely very hard
NuMax: Analysis
• However, we can rigorously analyze the case where Φ is further constrained to be orthonormal

• Essentially enforces that the rows of Φ are (i) unit-norm and (ii) pairwise orthogonal

• Upshot: models a per-sample energy constraint of a CS acquisition system
  – different measurements necessarily probe “new” portions of the signal space
  – measurements remain uncorrelated, so noise/perturbations in the input data are not amplified
Slight Refinement

1. Look at the converse problem: fix the embedding dimension M and solve for the linear embedding with minimum distortion δ(M), as a function of M
   – does not change the problem qualitatively

2. Restrict the problem to the space of orthonormal embeddings: ΦΦᵀ = I (orthonormality)
Slight Refinement
• As in NuMax, lifting + trace-norm relaxation (one plausible form is written out below)
• Efficient solution algorithms (NuMax, NuMax-CG) remain essentially unchanged
• However, solutions come with guarantees …
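One plausible way to write the lifted, relaxed program (our reconstruction from the description above, with P = ΦᵀΦ; the constraint set is the standard convex relaxation of the rank-M orthogonal projectors):

```latex
\begin{aligned}
\min_{P,\,\delta}\quad & \delta\\
\text{s.t.}\quad & \bigl|\, v_k^{\top} P\, v_k - 1 \,\bigr| \le \delta
  \quad \text{for all secants } v_k,\\
& 0 \preceq P \preceq I, \qquad \operatorname{trace}(P) = M
\end{aligned}
```

Here 0 ⪯ P ⪯ I with trace(P) = M relaxes the orthonormality condition ΦΦᵀ = I, under which P = ΦᵀΦ is a rank-M orthogonal projector.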
Analytical Guarantee

• Theorem [Grant, Hegde, Indyk ’13]: Denote the optimal distortion obtained by a rank-M orthonormal embedding by δ*(M). Then, by solving an SDP, we can efficiently construct a rank-2M embedding whose distortion nearly matches δ*(M).

• i.e., one can get close to the optimal distortion by paying an additional price in the measurement budget (M)
CVDomes Radar Signals
• Training data: 2000 secants (inter-class, joint)
• Test data: 100 signatures from each class