Evaluation of Distance Metrics for Recognition Based on Non-Negative Matrix Factorization
David Guillamet, Jordi Vitrià
Pattern Recognition Letters 24:1599-1605, June 2003
John Galeotti
Advanced Perception
March 23, 2004
Actually, Two ICPR’02 Papers
Analyzing Non-Negative Matrix Factorization for Image Classification
  David Guillamet, Bernt Schiele, Jordi Vitrià
Determining a Suitable Metric When using Non-negative Matrix Factorization
  David Guillamet, Jordi Vitrià
Non-Negative Matrix Factorization
TLA: NMF
Used for dimensionality reduction
$V_{n \times m} \approx W_{n \times r} H_{r \times m}$, $r < nm/(n+m)$
  V has non-negative training samples as its columns
  W contains the non-negative basis vectors
  H contains the non-negative coefficients to approximate each column of V using W
Results similar in concept to PCA, but with non-negative “basis vectors”
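
The bound r < nm/(n+m) is what makes the factorization a reduction: W and H together store r(n+m) numbers, which is fewer than the nm entries of V only below that threshold. A minimal sketch of the arithmetic follows; the function name and the example sizes are illustrative assumptions, not taken from the papers.

```python
def max_compressing_rank(n: int, m: int) -> int:
    """Largest integer r with r * (n + m) < n * m, i.e. r < nm/(n+m)."""
    # Integer arithmetic avoids floating-point edge cases at the boundary.
    return (n * m - 1) // (n + m)


def storage(n: int, m: int, r: int) -> tuple:
    """Return (entries in V, entries in W plus entries in H)."""
    return n * m, r * (n + m)


# Example: 512-dimensional vectors, 1000 training samples (illustrative sizes).
r_max = max_compressing_rank(512, 1000)
full, factored = storage(512, 1000, r_max)
print(r_max, full, factored)
```

Any r at or below the returned value keeps the factored form smaller than V itself.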
NMF Distinguishing Properties
Requires positive data
Computationally expensive
Part-based decomposition
  Because only additive combinations of original data are allowed
Not an orthonormal basis
Different Decomposition Types
[Figure: 20 Dimensions of Numeric Digits; panels: PCA, NMF]
[Figure: 50 Dimensions of Numeric Digits; panels: PCA, NMF]
Why not just use PCA?
PCA is optimal for reconstruction
PCA is not optimal for separation and recognition of classes
NMF Issues Addressed
If/when is NMF better at dimensionality reduction than PCA for classification?
Can combining PCA and NMF lead to better performance?
What is the best distance metric to use with the nonorthonormal basis of NMF?
How NMF Works
$V_{n \times m} \approx W_{n \times r} H_{r \times m}$, $r < nm/(n+m)$
Begin with an n×m matrix of training data V
  Each column is a vectorized data point
Randomly initialize W and H with positive values
Iterate according to update rules: [equations not preserved]
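
Assuming the rules meant here are Lee and Seung's multiplicative updates for the generalized Kullback-Leibler divergence (an assumption, since the slide's equations are not available), the procedure can be sketched in plain Python. All function names are hypothetical, and the code assumes every row and column of V contains at least one positive entry so that no division by zero occurs.

```python
import math
import random


def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def kl_divergence(V, WH):
    """Generalized KL divergence D(V || WH); 0 * log(0) is treated as 0."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            total += (v * math.log(v / wh) - v + wh) if v > 0 else wh
    return total


def nmf_step(V, W, H):
    """One multiplicative update of H followed by one of W."""
    n, m, r = len(V), len(V[0]), len(H)
    WH = matmul(W, H)
    # H[a][u] <- H[a][u] * (sum_i W[i][a] * V[i][u] / WH[i][u]) / (sum_i W[i][a])
    H = [[H[a][u] * sum(W[i][a] * V[i][u] / WH[i][u] for i in range(n))
          / sum(W[i][a] for i in range(n))
          for u in range(m)] for a in range(r)]
    WH = matmul(W, H)
    # W[i][a] <- W[i][a] * (sum_u H[a][u] * V[i][u] / WH[i][u]) / (sum_u H[a][u])
    W = [[W[i][a] * sum(H[a][u] * V[i][u] / WH[i][u] for u in range(m))
          / sum(H[a][u] for u in range(m))
          for a in range(r)] for i in range(n)]
    return W, H


def nmf(V, r, iterations=200, seed=0):
    """Random positive initialization, then a fixed number of update steps."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0.1, 1.0) for _ in range(r)] for _ in range(n)]
    H = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        W, H = nmf_step(V, W, H)
    return W, H
```

Because each update only multiplies by non-negative factors, W and H stay non-negative without any explicit constraint handling; the fixed iteration count is a simplification, and a real run would more likely stop when the divergence stops changing.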
How NMF Works
In general, NMF requires the non-linear optimization of an objective function
The update rules just given correspond to a popular objective function, and are guaranteed to converge.
That objective function relates to the probability of generating the images in V from the bases W and encodings H: [equation not preserved]
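
A likely candidate, assuming the slide follows Lee and Seung's formulation (the original equation is not recoverable, so this is an assumption), is the log-likelihood of V under a Poisson noise model with mean WH, dropping terms that do not depend on W or H:

```latex
F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ V_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right]
```

Maximizing F over non-negative W and H is equivalent to minimizing the generalized Kullback-Leibler divergence between V and WH, since the two differ only by terms in V alone.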
NMF vs. PCA Experiments
Dataset: 10 classes of natural textures
  Clouds, grass, ice, trees, sand, sky, etc.
  932 color images total
  Each image tessellated into 10x10 patches
  1000 patches for training, 1000 for testing
  Each patch classified as a single texture
Raw feature vectors: Color histograms
  Each region histogrammed into 8 bins per color, 16 colors
  512 dimensional vectors
NMF vs. PCA Experiments
Learn both NMF and PCA subspaces for each class of histogram
For both NMF and PCA:
  Project queries onto the learned subspaces of each class
  Label each query by the subspace that best reconstructs the query
This seems like a poor scheme for NMF
  (Other experiments allow better schemes)
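
The labeling rule on this slide can be sketched as follows, assuming each class subspace is given as a list of orthonormal basis vectors. That holds for PCA (the mean subtraction is omitted here for brevity) but not for NMF, where the coefficients would instead need something like a non-negative least-squares fit. All names are hypothetical.

```python
import math


def reconstruct(query, basis):
    """Project `query` onto the span of orthonormal `basis` vectors and map back."""
    recon = [0.0] * len(query)
    for b in basis:
        coeff = sum(q * x for q, x in zip(query, b))  # projection coefficient
        recon = [r + coeff * x for r, x in zip(recon, b)]
    return recon


def reconstruction_error(query, basis):
    """Euclidean norm of the residual left after projecting onto the subspace."""
    recon = reconstruct(query, basis)
    return math.sqrt(sum((q - r) ** 2 for q, r in zip(query, recon)))


def classify(query, class_bases):
    """Return the label whose subspace reconstructs `query` with the least error."""
    return min(class_bases,
               key=lambda label: reconstruction_error(query, class_bases[label]))


# Toy usage: class "a" spans the first axis, class "b" spans the other two.
class_bases = {
    "a": [[1.0, 0.0, 0.0]],
    "b": [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
print(classify([0.9, 0.1, 0.0], class_bases))
```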
NMF vs. PCA Results
NMF works best for dispersed classes
PCA works best for compact classes
Both seem useful…try combining them
But, why are fewer than half of the sky vectors best reconstructed by PCA when for sky PCA has a mean reconstruction error less than 1/4 that of NMF? Mistakes?
NMF+PCA Experiments
During training, we learned whether NMF or PCA worked best for each class
Project a query to a class using only the method that works best for that class
Result: 2.3% improvement in the recognition rate over NMF alone (PCA: 5.8%), but is this significant at 60%?
Hierarchy Experiments
At level k of the hierarchy, project the query onto each original class’ NMF or PCA subspace
But, to choose the direction to descend the hierarchy, we only care about the level k super-class containing the matching class
Furthermore, for each class the choice of PCA vs. NMF can be independently set at each level of the hierarchy
Hierarchy Results
2% improvement in recognition rate
  I really suspect that this is insignificant, and resulting only from the additional degrees of freedom
They employ various additional neighborhood-based hacks to increase their accuracy further, but I don’t see any relevance to NMF specifically
Need for a better metric
Want to classify based on nearest neighbor, rather than reprojection error
Unfortunately, NMF generates a nonorthonormal basis, and so the relative distance to a base depends on the uniqueness of that base
  Bases will share a lot of pixels in common areas
Earth Mover's Distance (EMD)
Defined as the minimal amount of “work” that must be performed to transform one feature distribution into the other
A special case of the “transportation problem” from linear optimization
  Let I = set of suppliers, J = set of consumers, $c_{ij}$ = cost to ship from supplier i to consumer j, $f_{ij}$ = amount shipped from i to j
Distance = cost to make datasets equal
Earth Mover's Distance (EMD)
Based on finding a measure of correlation between bases to define its cost matrix
The cost matrix weights the transition of one basis ($b_i$) to another ($b_j$)
$c_{ij} = \mathrm{dist}_{angle}(b_i, b_j)$, where $\mathrm{dist}_{angle}(x, y) = -(x \cdot y)/(\|x\| \, \|y\|)$
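
As a minimal sketch of how such a cost matrix could be built from a set of basis vectors (function names are hypothetical, and bases are assumed to be non-zero plain lists of numbers):

```python
import math


def dist_angle(x, y):
    """Negative cosine of the angle between x and y, per the definition above."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return -dot / (norm_x * norm_y)


def cost_matrix(bases):
    """c[i][j] = dist_angle(bases[i], bases[j]) for every pair of bases."""
    return [[dist_angle(bi, bj) for bj in bases] for bi in bases]


# Toy usage with three non-negative 2-D "bases".
C = cost_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
for row in C:
    print(row)
```

For non-negative bases the entries fall between -1 (same direction, heavily overlapping) and 0 (no shared pixels), so moving weight between overlapping bases is the cheapest transition.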
EMD: Transportation Problem
[Equations not preserved; the annotations on them were:]
  $f_{ij}$ = quantity shipped from i to j
  Consumers don’t ship
  Don’t exceed demand
  Don’t exceed supply
  Demand must equal supply for EMD to be a metric
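
Assuming the slide shows the usual transportation-problem formulation of EMD from Rubner et al., the annotations line up with the linear program below. The symbols $x_i$ (supply of supplier i) and $y_j$ (demand of consumer j) are chosen here for illustration and are not taken from the slide.

```latex
\begin{aligned}
\min_{f}\quad & \sum_{i \in I} \sum_{j \in J} c_{ij} f_{ij} \\
\text{s.t.}\quad & f_{ij} \ge 0 && \text{(consumers don't ship)} \\
& \sum_{i \in I} f_{ij} \le y_j && \text{(don't exceed demand)} \\
& \sum_{j \in J} f_{ij} \le x_i && \text{(don't exceed supply)} \\
& \sum_{i \in I} \sum_{j \in J} f_{ij} = \min\Big( \sum_{i \in I} x_i ,\; \sum_{j \in J} y_j \Big)
\end{aligned}
```

When total supply equals total demand, the last constraint forces both inequalities to hold with equality, which is the case the slide singles out as necessary for EMD to be a metric.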
EMD vs. “Other” Experiments
Digit recognition from MNIST digit database
  60,000 training images + 10,000 for test
Classify by NN and 5NN in the subspace
Result: EMD works best in low-dimensional subspaces, but in high-dimensional subspaces EMD does not work well
  More specifically, EMD works well when the bases contain some intersecting pixels
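
The NN and 5NN classification step can be sketched with a pluggable distance function, so that Euclidean distance, dist_angle, or EMD could be swapped in. This is an illustrative sketch with hypothetical names, not the authors' code; only a Euclidean distance is included to keep it self-contained.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Ordinary L2 distance between two coefficient vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_classify(query, train_vectors, train_labels, k, dist):
    """Label `query` by majority vote among its k nearest training vectors."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: dist(query, train_vectors[i]))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy usage: two clusters of subspace coefficients.
train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
labels = ["a", "a", "b", "b", "b"]
print(knn_classify([0.2, 0.1], train, labels, k=1, dist=euclidean))
```

Passing the distance as an argument keeps the classifier identical across metrics, so any difference in recognition rate comes from the metric alone.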
Occlusion Experiments
Randomly occlude either 1 or 2 of the 4 quadrants of an image (25% and 50% occlusion)
Why does dist_angle do so well?
Best subspace & distance with occlusions:
                   Low dim.                  High dim.
  25% Occlusion    NMF+dist_angle            PCA sometimes better
  50% Occlusion    NMF+dist_angle OR EMD     NMF+dist_angle
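
The occlusion procedure itself is easy to sketch. The version below assumes occluded pixels are set to zero and that the image is a list of rows with even dimensions; the fill value and the function name are assumptions, since the slides do not say how the occluded quadrants were filled.

```python
import random


def occlude_quadrants(image, num_quadrants, rng=random, fill=0):
    """Return a copy of `image` with `num_quadrants` random quadrants set to `fill`."""
    rows, cols = len(image), len(image[0])
    half_r, half_c = rows // 2, cols // 2
    # (row range, column range) for top-left, top-right, bottom-left, bottom-right
    quadrants = [
        (range(0, half_r), range(0, half_c)),
        (range(0, half_r), range(half_c, cols)),
        (range(half_r, rows), range(0, half_c)),
        (range(half_r, rows), range(half_c, cols)),
    ]
    occluded = [list(row) for row in image]  # copy, so the input is untouched
    for row_range, col_range in rng.sample(quadrants, num_quadrants):
        for r in row_range:
            for c in col_range:
                occluded[r][c] = fill
    return occluded


# Toy usage: a 28x28 all-ones image (MNIST-sized) with one quadrant removed.
image = [[1] * 28 for _ in range(28)]
result = occlude_quadrants(image, 1, rng=random.Random(0))
print(sum(map(sum, result)))
```

Sampling quadrants without replacement guarantees that asking for two quadrants always removes exactly half the image.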
Demo
NMF difficulties
  EMD experiments instead
Demonstrate using existing code within the desired framework of a cost matrix
  Their code: http://robotics.stanford.edu/~rubner/emd/default.htm
  My code: http://www.vialab.org/john/Pres9-code/
Conclusion
NMF is a parts-based alternative to PCA
NMF and PCA should be combined for minimum-reprojection-error classification
For nearest-neighbor classification, NMF needs a better metric
When the subspace dimensionality is chosen appropriately for good bases, NMF+EMD or NMF+dist_angle have the highest recognition rates