Bell Laboratories Data Complexity Analysis: Linkage between Context and Solution in Classification Tin Kam Ho With contributions from Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law, Erinija Pranckeviciene, Albert Orriols-Puig, Nuria Macia
Transcript
Page 1: Bell Laboratories Data Complexity Analysis: Linkage between Context and Solution in Classification Tin Kam Ho With contributions from Mitra Basu, Ester.

Bell Laboratories

Data Complexity Analysis: Linkage between Context and Solution in Classification

Tin Kam Ho

With contributions from

Mitra Basu, Ester Bernado-Mansilla, Richard Baumgartner, Martin Law, Erinija Pranckeviciene, Albert Orriols-Puig, Nuria Macia


2 All Rights Reserved © Alcatel-Lucent 2008

Pattern Recognition: Research vs. Practice

Steps to solve a practical pattern recognition problem

[Diagram of processing stages and their outputs: Data Collection, Sensory Data, Feature Extraction, Feature Vectors, Classifier Training, Classifier, Classification, Decision]

Study of the Problem Context

Study of the Mathematical Solution

Practical Focus

Research Focus

Danger of Disconnection


Reconnecting Context and Solution

[Diagram labels: Feature Vectors, Study of the Problem Context, Study of the Mathematical Solution]

Data Complexity Analysis: Analysis of the properties of feature vectors

To understand how such properties may impact the classification solution

To understand how changes in the problem set-up and data collection procedures may affect such properties

[Diagram labels: Improvements, Limitations, Expectations]


Focus is on Boundary Complexity

• Kolmogorov complexity

• Boundary length can be exponential in dimensionality

• A trivial description is to list all points & class labels

• Is there a shorter description?


Early Discoveries

•Problems distribute in a continuum in complexity space

•Several key measures provide independent characterization

•There exist identifiable domains of classifier’s dominant competency

•Feature selection and transformation induce variability in complexity estimates


Parameterization of Data Complexity


Complexity Classes vs. Complexity Scales

•Study is driven by observed limits in classifier accuracy, even with new, sophisticated methods (e.g., ensembles, SVM, …)

•Analysis is needed for each instance of a classification problem, not just the worst case of a family of problems

•Linear separability: the earliest attempt to address classification complexity

•Observed in real-world problems: different degrees of linear non-separability

•Continuous scale is needed


Some Useful Measures of Geometric Complexity

Fisher’s Discriminant Ratio

$f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$

Classical measure of class separability

Maximize over all features to find the most discriminating
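As a rough illustration of Fisher's discriminant ratio for a two-class problem, the sketch below computes the ratio per feature and keeps the largest one. The function names are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption; the slide does not specify either.

```python
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Fisher's discriminant ratio of one feature for two classes.

    Assumption: population variance is used for sigma^2.
    """
    numerator = (mean(values_a) - mean(values_b)) ** 2
    denominator = pvariance(values_a) + pvariance(values_b)
    if denominator == 0:
        # Both classes are constant on this feature.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator


def max_fisher_ratio(class_a, class_b):
    """Return (feature index, ratio) of the most discriminating feature.

    class_a and class_b are lists of feature vectors (one row per point).
    """
    n_features = len(class_a[0])
    ratios = [
        fisher_ratio([row[j] for row in class_a], [row[j] for row in class_b])
        for j in range(n_features)
    ]
    best = max(range(n_features), key=lambda j: ratios[j])
    return best, ratios[best]
```

A larger value suggests that at least one single feature separates the two classes well; a small value says nothing about separability along combinations of features.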

Degree of Linear Separability

Find separating hyper-plane by linear programming

Error counts and distances to plane measure separability

Length of Class Boundary

Compute minimum spanning tree

Count class-crossing edges
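The two steps for the length of the class boundary can be sketched as follows: build a minimum spanning tree over all points regardless of class, then count the tree edges whose endpoints carry different labels. This is a minimal sketch assuming Euclidean distance and Prim's algorithm; the function names are hypothetical, and the point-fraction normalization in the last function is one possible choice, not something the slide states.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return edges


def count_class_crossing_edges(points, labels):
    """Number of MST edges that join points of different classes."""
    return sum(labels[a] != labels[b] for a, b in minimum_spanning_tree(points))


def boundary_point_fraction(points, labels):
    """Fraction of points touching a class-crossing MST edge (assumed normalization)."""
    edges = minimum_spanning_tree(points)
    on_boundary = {i for a, b in edges if labels[a] != labels[b] for i in (a, b)}
    return len(on_boundary) / len(points)
```

Two well-separated clusters yield a single crossing edge, while interleaved classes push the count toward the total number of edges.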

Shapes of Class Manifolds

Cover same-class points with maximal balls

Ball counts describe shape of class manifold
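The ball-covering idea can be approximated with a much simplified sketch. Here each point gets a ball that grows until it reaches the nearest point of the opposite class, and a ball is dropped when it lies entirely inside another ball of the same class. This construction, the function name, and the normalization by the number of points are all assumptions made for illustration; the measure used in the talk may be defined differently.

```python
import math


def retained_ball_fraction(points, labels):
    """Fraction of per-point balls left after removing balls nested in others.

    Assumes both classes are present; otherwise no radius can be computed.
    """
    n = len(points)
    # Radius of each ball: distance to the nearest opposite-class point.
    radius = [
        min(
            math.dist(points[i], points[j])
            for j in range(n)
            if labels[j] != labels[i]
        )
        for i in range(n)
    ]
    retained = 0
    for i in range(n):
        absorbed = False
        for j in range(n):
            if i == j or labels[i] != labels[j]:
                continue
            inside = math.dist(points[i], points[j]) + radius[i] <= radius[j]
            # Identical balls: keep only the one with the lowest index.
            if inside and (radius[i] < radius[j] or j < i):
                absorbed = True
                break
        if not absorbed:
            retained += 1
    return retained / n
```

Under these assumptions, a compact class far from the other class collapses into a few large balls, while a class that winds closely around the other needs many small ones.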


Continuous Distributions in Complexity Space

Real-World Data Sets:

Benchmarking data from UC-Irvine archive

844 two-class problems: 452 are linearly separable, 392 non-separable

Synthetic Data Sets:

Random labeling of randomly located points

100 problems in 1-100 dimensions

[Figure: problems plotted by Complexity Metric 1 and Metric 2; legend: random labeling, linearly separable real-world data, linearly non-separable real-world data]


Measures of Geometrical Complexity


The First 6 Principal Components


Interpretation of the First 4 PCs

PC 1: 50% of variance: Linearity of boundary and proximity of opposite class neighbor

PC 2: 12% of variance: Balance between within-class scatter and between-class distance

PC 3: 11% of variance: Concentration & orientation of intrusion into opposite class

PC 4: 9% of variance: Within-class scatter


Problem Distribution in 1st & 2nd Principal Components

• Continuous distribution

• Known easy & difficult problems occupy opposite ends

• Few outliers

• Empty regions

[Figure: problems plotted on the 1st and 2nd principal components; annotated groups: random labels, linearly separable]


Apparent vs. True Complexity: Uncertainty in Measures due to Sampling Density

[Figure panels: 2 points, 10 points, 100 points, 500 points, 1000 points]

Problem may appear deceptively simple or complex with small samples


Observations

•Problems distribute in a continuum in complexity space

•Several key measures/dimensions provide independent characterization

•Need further analysis on uncertainty in complexity estimates due to small sample size effects


Relating Classifier Behavior to Data Complexity


Class Boundaries Inferred by Different Classifiers

XCS: a genetic algorithm

Nearest neighbor classifier

Linear classifier


Accuracy Depends on the Goodness of Match between Classifiers and Problems

[Figure: XCS and NN compared on Problem A and Problem B; error values shown: 0.06%, 1.9%, 0.6%, 0.7%; one classifier is marked "Better!" for each problem]


Domains of Competence of Classifiers

Given a classification problem, we want to determine which classifier is best for it.

Can data complexity give us a hint?

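[Figure: complexity space with axes Complexity metric 1 and Metric 2; regions labeled NN, LC, XCS, and Decision Forest; a new point marked "?" with the caption "Here is my problem!"]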


Domain of Competence Experiment

Use a set of 9 complexity measures:

Boundary, Pretop, IntraInter, NonLinNN, NonLinLP, Fisher, MaxEff, VolumeOverlap, Npts/Ndim

Characterize 392 two-class problems from UCI data, all shown to be linearly non-separable

Evaluate 6 classifiers:

• NN (1-nearest neighbor)

• LP (linear classifier by linear programming)

• Odt (oblique decision tree)

• Pdfc (random subspace decision forest)

• Bdfc (bagging based decision forest)

• XCS (a genetic-algorithm based classifier)

Ensemble methods: Pdfc, Bdfc, XCS


Identifiable Domains of Competence by NN and LP

Best Classifier for Benchmarking Data


Less Identifiable Domains of Competence

Regions in complexity space where the best classifier is (nn, lp, or odt) vs. an ensemble technique

[Figure panels: Boundary-NonLinNN, IntraInter-Pretop, MaxEff-VolumeOverlap; legend: ensemble; nn, lp, odt]


Uncertainty of Estimates at Two Levels

Sparse training data in each problem & complex geometry cause ill-posedness of class boundaries

(uncertainty in feature space)

Sparse sample of problems causes difficulty in identifying regions of dominant competence

(uncertainty in complexity space)


Complexity and Data Dimensionality: Class Separability after Dimensionality Reduction

Feature selection/transformation may change the difficulty of a classification problem:

• Widening the gap between classes

• Compressing the discriminatory information

• Removing irrelevant dimensions

It is often unclear to what extent these happen.

We seek a quantitative description of such changes.

[Diagram labels: Feature selection, Discrimination]


[Figure: boundary versus 1NN classification error, FFS subsets, all datasets; axis labels: Boundary, 1NN error; legend: spectra1, colon, spectra2, eogat, ovarian, spectra3]

Spread of classification accuracy and geometrical complexity due to forward feature selection
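To make the forward feature selection (FFS) procedure concrete, here is a minimal sketch that greedily adds one feature at a time and records the 1NN error of each resulting subset. The selection criterion (leave-one-out 1NN error), the Euclidean distance, and all function names are assumptions for illustration; the slide does not say which criterion produced the plotted subsets.

```python
import math


def loo_1nn_error(rows, labels, features):
    """Leave-one-out error of a 1-nearest-neighbor classifier on chosen features."""

    def project(row):
        return [row[f] for f in features]

    errors = 0
    for i, row in enumerate(rows):
        # Nearest other point in the selected feature subspace.
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: math.dist(project(row), project(rows[j])),
        )
        errors += labels[nearest] != labels[i]
    return errors / len(rows)


def forward_feature_selection(rows, labels, max_features):
    """Greedy FFS; returns a list of (selected feature indices, 1NN error)."""
    selected, history = [], []
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < max_features:
        # Add the feature whose inclusion gives the lowest error.
        best = min(
            remaining,
            key=lambda f: loo_1nn_error(rows, labels, selected + [f]),
        )
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), loo_1nn_error(rows, labels, selected)))
    return history
```

Evaluating a complexity measure on each subset in the returned history would give one point per subset, which is how a spread like the one in the figure could be produced.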


Designing a Strategy for Classifier Evaluation


A Complete Platform for Evaluating Learning Algorithms

To facilitate progress on learning algorithms:

• Need a way to systematically create learning problems

• Provide a complete coverage of the complexity space

• Be representative of all the known problems

i.e., every classification problem arising in the real world should have a close neighbor representing it in the complexity space.

Is this possible?


Ways to Synthesize Classification Problems

• Synthesizing data with targeted levels of complexity

• e.g. compute MST over a uniform point distribution, then assign class-crossing edges randomly [Macia et al. 2008]

• or, create partitions with increasing resolution

• can create continuous cover of complexity space

• but, are the data similar to those arising from reality?
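One way to realize the MST-based idea from the list above is sketched below: draw uniform points, build their minimum spanning tree, pick a target number of tree edges at random to be class-crossing, and then propagate labels through the tree, flipping the class whenever a chosen edge is traversed. This is an illustrative sketch under those assumptions, with hypothetical function names; it is not claimed to reproduce the procedure of Macia et al.

```python
import math
import random


def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    parent[v] = u
    return edges


def synthesize_labels(points, n_crossing, seed=0):
    """Label points so that exactly n_crossing MST edges join different classes."""
    rng = random.Random(seed)
    edges = prim_mst(points)
    crossing = set(rng.sample(range(len(edges)), n_crossing))
    adjacency = {i: [] for i in range(len(points))}
    for k, (a, b) in enumerate(edges):
        flip = k in crossing
        adjacency[a].append((b, flip))
        adjacency[b].append((a, flip))
    labels = [None] * len(points)
    labels[0] = 0
    stack = [0]
    while stack:
        # Walk the tree; flip the class across each chosen edge.
        u = stack.pop()
        for v, flip in adjacency[u]:
            if labels[v] is None:
                labels[v] = labels[u] ^ flip
                stack.append(v)
    return labels, edges


rng = random.Random(42)
points = [(rng.random(), rng.random()) for _ in range(50)]
labels, edges = synthesize_labels(points, n_crossing=10)
```

Because the graph is a tree, the number of class-crossing edges equals `n_crossing` exactly, so the boundary-length measure can be swept from 0 up to the number of tree edges.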


Ways to Synthesize Classification Problems

• Synthesizing data to simulate natural processes

• e.g. Neyman-Scott process

• how many such processes have explicit models?

• how many are needed to cover all real-world problems?

• Systematically degrade real-world datasets

• increase noise, reduce image resolution, …
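The "increase noise" form of degradation could look like the following sketch, which produces a series of progressively noisier copies of the same feature vectors. The choice of additive Gaussian noise, the noise levels, and the function names are assumptions; the slide does not specify a noise model.

```python
import random


def add_gaussian_noise(rows, sigma, seed=0):
    """Return a copy of the feature vectors with N(0, sigma^2) noise added."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]


def degradation_series(rows, sigmas, seed=0):
    """Map each noise level to a degraded copy of the same dataset."""
    return {sigma: add_gaussian_noise(rows, sigma, seed) for sigma in sigmas}
```

Class labels are left untouched, so each degraded copy is a new problem of presumably higher complexity derived from the same real-world source.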


Simplification of Class Geometry


Manifold Learning and Dimensionality Reduction

• Manifold learning techniques that highlight intrinsic dimensions

• But the class boundary may not follow the intrinsic dimensions


Manifold Learning and Dimensionality Reduction

• Supervised manifold learning – seek mappings that exaggerate class separation [de Ridder et al., 2003]

• Ideally, the mapping should be sought to directly minimize some measures of data complexity


Seeking Optimizations Upstream

Back to the application context:

• Use data complexity measures for guidance

• Change the setup, definition of the classification problem

• Collect more samples, in finer resolution, extract more features …

• Alternative representations:

• dissimilarity-based? [Pekalska & Duin 2005]

Data complexity gives an operational definition of learnability

Optimization in the upstream: formalize the intuition of seeking invariance, systematically optimize the problem setup and data acquisition scenario to reduce data complexity


Recent Examples from the Internet


CAPTCHA: Completely Automated Public Turing test to tell Computers and Humans Apart

Also known as

• Reverse Turing Test

• Human Interactive Proofs

[von Ahn et al., CMU 2000]

Exploit limitations in accuracy of machine pattern recognition


The Netflix Challenge

• $1 Million Prize for the first team to improve 10% over the company’s own recommender system

• But is the goal achievable? Do the training data support such a possibility?


Amazon’s Mechanical Turk

• “Crowd-sourcing” tedious human intelligence (pattern recognition) tasks

• Which ones are doable by machines?


Conclusions


Summary

Automatic classification is useful, but can be very difficult. We know the key steps and many promising methods. But we have not fully understood how they work or what else is needed.

We found measures for geometric complexity that are useful to characterize difficulties of classification problems and classifier domains of competence.

Better understanding of how data and classifiers interact can guide practice, and re-establish the linkage between context and solution.


For the Future

Further progress in statistical and machine learning will need systematic, scientific evaluation of the algorithms with problems that are difficult for different reasons.

A “problem synthesizer” will be useful to provide a complete evaluation platform, and reveal the “blind spots” of current learning algorithms.

Rigorous statistical characterization of complexity estimates from limited training data will help gauge the uncertainty, and determine applicability of data complexity methods.

