Outline
Statistical learning algorithms for one-class problems.
Machine Learning and Data Mining 2006
Challenging problems
Data Mining
Challenges in the areas of data storage, organization and searching have led to the new field of data mining.
Vast amounts of data are being generated in many fields, and our job is to extract important patterns and trends, and to understand "what the data says." This is called learning from data.
Machine Learning
The learning problems can be roughly categorized as supervised and
unsupervised.
Supervised: classification, regression and ranking;
Unsupervised: one-class, clustering and PCA.
Application in PR
Pattern recognition system:
Difference
Statistical machine learning: inductive inference from finite samples.
Biometrics
Biometrics refers to the automatic identification of a person based
on his/her physiological or behavioral characteristics.
Bioinformatics
In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of the genomes of many organisms.
Popular sequence databases have been growing at exponential
rates.
This deluge of information has necessitated the careful storage,
organization and indexing of sequence information. Information
science has been applied to biology to produce the field called
Bioinformatics.
ISI
Intelligence and Security Informatics is an emerging field of study
aimed at developing advanced information technologies, systems,
algorithms, and databases for national- and
homeland-security-related applications.
Confusion
Many researchers claim that they are studying statistical machine
learning methods.
Often, however, their interest is only in applying statistical machine learning algorithms.
Machine learning community
To analyze different algorithms theoretically;
To develop new theory and learning algorithms for new
problems.
Performance
Not only prediction accuracy,
but also implementation, speed, understandability, etc.
Theoretical Analysis
Model selection: estimating the performance of different models in order to choose the best one.
Model assessment: having chosen a final model, estimating its prediction error on new data.
Ian Hacking
The quiet statisticians have changed our world:
not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions...
Statistical learning
Not limited to statistical learning theory in Vapnik's sense.
Andreas Buja
There is no true interpretation of anything:
interpretation is a vehicle in the service of human comprehension.
The value of an interpretation is in enabling others to fruitfully think about an idea.
Interpretation of Algorithms
Almost all learning algorithms can be interpreted theoretically and intuitively;
Probabilistic and geometric explanations not only help us understand the algorithms theoretically and intuitively, but also motivate us to develop elegant and practical new algorithms.
Main references
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
Main kinds of theory
Definition of classification;
Definition of regression;
Several well-known algorithms.
Framework of algorithms
Linear to nonlinear (neural networks, kernels); single to ensemble (boosting); pointwise to continuous (one-class problems); local to global (KNN and LMS).
Design of algorithms
Usually, the algorithm for a more complex hypothesis space should contain the one for a simpler hypothesis space as a special case.
The algorithm for the simple hypothesis space serves as the starting point of the complete framework.
Bayesian: classification
Bayesian: regression
Estimating densities
Knowledge of the density functions would allow us to solve whatever problem can be solved on the basis of the available data;
Vapnik's principle: never solve a problem that is more general than the one you actually need to solve.
KNN
Interpretation: KNN
Assume the classifier is well approximated by a locally constant function; conditioning at a point is relaxed to conditioning on a region close to the target point (a sketch follows).
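As a concrete illustration of conditioning on a region, a minimal numpy sketch of KNN classification (the function name is ours; labels are assumed to be non-negative integers):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    """Classify x0 by majority vote among its k nearest training points:
    conditioning at x0 is relaxed to averaging over a neighborhood."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # distances to all training points
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    return np.bincount(y_train[nearest]).argmax()  # majority class label
```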
LMS
Interpretation: LMS
Assume the classifier is well approximated by a globally linear function; the expectation is approximated by averages over the training data (a sketch follows).
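Correspondingly, a minimal numpy sketch of the LMS fit, in which the population expectation is replaced by the average over the training data:

```python
import numpy as np

def lms_fit(X, y):
    """Least squares: fit a globally linear function by minimizing the
    average squared error over the training data."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # stable least squares solve
    return beta
```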
Fisher Discriminant Analysis
To seek a direction for which the projected samples are well
separated.
[Figure: samples of two classes ω1 and ω2 in the (x1, x2)-plane projected onto a direction w; the projected values y1 and y2 are well separated.]
Interpretation: FDA
Generally speaking, it is not optimal.
FDA is the Bayes optimal solution if the two classes are normally distributed with equal covariance (a sketch of computing the Fisher direction follows).
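A minimal numpy sketch of computing the Fisher direction $w \propto S_W^{-1}(m_1 - m_2)$ for two classes (the function name is ours):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction maximizing between-class separation relative to
    within-class scatter, for two classes of samples (one per row)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)
```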
FDA and LMS
Least squares approaches can also be used for classification.
The solution to the least squares problem is in the same direction
as the solution of Fisher’s discriminant.
FDA: a novel interpretation
T. Centeno, N. Lawrence. Optimizing kernel parameters and
regularization coefficients for non-linear discriminant analysis.
JMLR, 7 (2006).
A novel Bayesian interpretation of FDA relating Rayleigh’s
coefficient to a noise model that minimizes a cost based on the
most probable class centers and that abandons the ‘regression to
the labels’ assumption used by other algorithms.
FDA: parameters
Going further, with the use of a Gaussian process prior, they show
the equivalence of their model to a regularized kernel FDA.
A key advantage of their approach is the facility to determine kernel parameters and the regularization coefficient through the optimization of the marginal log-likelihood of the data.
FDA: framework of algorithms
Qing Tao et al. The Theoretical Analysis of FDA and Applications. Pattern Recognition, 39(6):1199-1204.
Similar in spirit to the maximal margin algorithm, FDA with zero within-class variance is proved to serve as the starting point of the complete FDA framework.
Disadvantage
Motivation;
Bias and variance analysis
The bias-variance decomposition is a very powerful and widely-used
tool for understanding machine-learning algorithms;
It was originally developed for squared loss.
Bias-Variance Decomposition
At a fixed point $x_0$, with $Y = f(x_0) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, the expected squared error of $\hat f(x_0)$ decomposes into noise, squared bias, and variance.
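Written out (the standard derivation: add and subtract $\mathbb{E}[\hat f(x_0)]$ inside the square; the cross terms vanish):

$$
\mathbb{E}\big[(Y-\hat f(x_0))^2\big]
= \underbrace{\sigma^2}_{\text{noise}}
+ \underbrace{\big(\mathbb{E}[\hat f(x_0)] - f(x_0)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathrm{Var}\big(\hat f(x_0)\big)}_{\text{Variance}}.
$$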
Bias-Variance Tradeoff
Often, the variance can be significantly reduced by deliberately
introducing a small amount of bias.
Blue area: error σ (noise), centered at the truth;
large yellow circle: variance of the least squares fit, centered at the closest fit in the whole population;
smaller circle: the shrunken fit, with smaller variance but higher bias;
model space: the set of all possible predictions from the model.
Question: what is the definition of "model"? Could we say "linear model"? But then what is the role of least squares, or is it also a model itself?
Ridge regression
LMS: an ill-posed problem;
Compared with LMS, under certain assumptions it introduces a small amount of bias in exchange for lower variance.
Interpretation: ridge regression
Analytic solution: $\hat\beta^{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$ (sketched in code below).
The technique can be viewed as a way to simultaneously reduce the risk and increase the numerical stability of LMS.
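A minimal numpy sketch of this analytic solution (intercept handling omitted; `lam` stands for the coefficient $\lambda$):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Ridge regression via the analytic solution
    beta = (X'X + lam*I)^{-1} X'y; the lam*I term makes the
    normal equations well-posed and shrinks the coefficients."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```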
Interpretation: parameter
Effective degrees of freedom: experimental analysis (see the formula below);
The key result is a dramatic reduction of parameter variance.
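For reference, the effective degrees of freedom of ridge, written via the singular values $d_j$ of $X$ (as in Hastie et al., 2001):

$$
\mathrm{df}(\lambda) = \mathrm{tr}\!\left[X (X^\top X + \lambda I)^{-1} X^\top\right] = \sum_{j} \frac{d_j^2}{d_j^2 + \lambda},
$$

which decreases from the number of parameters at $\lambda = 0$ (ordinary LMS) toward $0$ as $\lambda \to \infty$.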
A note
A new class of generalized Bayes minimax ridge regression estimators. The Annals of Statistics, 33(4), 2005.
The risk reduction aspect of ridge regression was often observed in
simulations but was not theoretically justified. Almost all
theoretical results on ridge regression in the literature depend on
normality.
Other loss functions
P. Domingos. A Unified Bias-Variance Decomposition and its
Applications. ICML, 2000.
The resulting decomposition specializes to the standard one for the squared-loss case, and to a close relative of Kong and Dietterich's (1995) decomposition for the zero-one case.
Interpretation: boosting
Both bagging and boosting reduce error by reducing the variance term (Breiman, 1996);
Bauer and Kohavi (1999) demonstrated that boosting does indeed seem to reduce bias for certain real-world problems.
Interpretation: margin
Domingos (2000):
Schapire’s (1997) notion of “margin” can be expressed as a function
of the zero-one bias and variance, making it possible to formally
relate a classifier ensemble’s generalization error to the base
learner’s bias and variance on training examples.
Interpretation: SVM
G. Valentini and T. Dietterich. Bias-variance analysis of SVM for the development of SVM-based ensemble methods. JMLR, 5, 2004.
They present an extended experimental analysis of the bias-variance decomposition of the error in SVMs, considering Gaussian, polynomial and dot-product kernels.
SVM: experimental analysis
A characterization of the error decomposition is provided, by means
of the analysis of the relationships between bias, variance, kernel
type and its parameters.
The results show that the expected trade-off between bias and variance is sometimes observed, but more complex relationships can be detected, especially with Gaussian and polynomial kernels.
Interpretation: base learners
The effectiveness of ensemble methods depends on the specific
characteristics of the base learners;
The bias-variance decomposition offers a rationale to develop
ensemble methods using SVMs as base learners.
Disadvantage
To estimate the bias and variance, we need to know the actual function being learned; this is unavailable for real-world problems.
In practice, bias and variance are estimated using the bootstrap to replicate the data, as sketched below.
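A minimal sketch of that bootstrap estimate (the `fit`/`predict` callables are placeholders of our own; note that without the true function only the variance, not the true bias, can be estimated directly):

```python
import numpy as np

def bootstrap_variance(X, y, x0, fit, predict, B=200, seed=0):
    """Refit the learner on B bootstrap replicates of (X, y) and
    summarize its predictions at a test point x0."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample n indices with replacement
        preds[b] = predict(fit(X[idx], y[idx]), x0)
    return preds.mean(), preds.var()       # center and spread of the predictions
```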
Generalization bound
PAC Framework;
VC Theory;
PAC Framework
A concept class $F$ is PAC-learnable if, for every target $f \in F$ and all $0 < \epsilon, \delta < 1$, the learner outputs a hypothesis $h$ satisfying $P(h(x) \neq f(x)) \le \epsilon$ with probability at least $1 - \delta$, using time and samples polynomial in $1/\epsilon$ and $1/\delta$.
VC Theory and PAC Bounds
Landmark paper by Blumer et al. (1989);
It greatly influenced the field of machine learning;
VC theory and PAC bounds have been used to analyze the performance of learning systems as diverse as decision trees, neural networks, and others.
PAC Bounds for Classification
$\mathrm{err}_D(h_S) \le \epsilon(l, H, \delta)$ holds with probability $1 - \delta$;
$\epsilon(l, H, \delta)$: a bound depending on the sample size $l$, the hypothesis space $H$, and the confidence $\delta$.
VC Dimension
The largest cardinality of a set that can be shattered by the hypothesis class; e.g., linear separators in the plane shatter some set of three points but no set of four, so their VC dimension is 3.
A consistency problem
Remarks on PAC+VC Bounds
The size of the training set required to ensure good generalization scales linearly with the VC dimension;
Being distribution-free, the bounds are unable to take advantage of benign distributions.
SVM: Linearly Separable
Maximizing the margin.
SVM: Soft Margin
C-SVM (see the primal below):
The impasse of the NP-hardness of minimizing the training error is avoided.
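For reference, the soft-margin (C-SVM) primal problem:

$$
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{l}\xi_i
\quad \text{s.t.}\quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ i = 1,\dots,l,
$$

where the slack variables $\xi_i$ give a convex surrogate for the (NP-hard) count of training errors.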
SVM: algorithms
Hypothesis spaces: linearly separable, linearly inseparable; kernels.
The maximal margin algorithm serves as the starting point of the complete SVM framework (Shawe-Taylor, 1998).
Bound: VC Dimension
With probability $1 - \delta$ over $l$ random examples, a hypothesis $h$ consistent with the training set satisfies $\mathrm{err}_D(h) \le \frac{2}{l}\left(d\log\frac{2el}{d} + \log\frac{2}{\delta}\right)$, provided $d \le l$, where $d$ is the VC dimension.
Bound: VC dimension + errors
Structural risk minimization principle;
Occam’s razor: a simple function that explains most of the data is
preferable to a complex one.
Disadvantages of SRM
According to the SRM principle, the structure has to be defined a priori, before the training data appear (Shawe-Taylor, 1998).
The maximum margin algorithm violates this principle in that the
hierarchy defined depends on the data.
Disadvantage: PAC+VC bound
Not data-dependent: bounds should rely on an effective complexity measure rather than the a priori VC dimension;
They cannot be applied in high-dimensional feature spaces.
Several concepts
fat-shattering dimension;
Covering number.
Generalization Bound: margin
The first paper on margin results: J. Shawe-Taylor and P. L. Bartlett, 1998.
The bound replaces the VC dimension with the fat-shattering dimension $\mathrm{fat}_F(\gamma/8)$, measured at a scale proportional to the margin $\gamma$.
Importance of Margin
Over the last decade, both theory and practice have shown that the concept of margin is central to the success of SVMs and boosting.
Generally, a large margin implies good generalization
performance.
Vapnik’s three periods
1970–1990: development of the basics of statistical learning theory (the VC theory);
1992–2004: development of large margin technology (SVMs);
2005–...: development of non-inductive methods of inference.
Neural networks
MSE criterion and the steepest descent method (sketched below).
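A minimal numpy sketch of steepest descent on the MSE criterion, shown for a single linear unit (back-propagation applies the same update through the layers):

```python
import numpy as np

def mse_steepest_descent(X, y, lr=0.01, steps=1000):
    """Minimize E(w) = (1/l) * sum_i (y_i - w.x_i)^2 by repeatedly
    stepping in the direction of the negative gradient."""
    l, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / l) * X.T @ (X @ w - y)  # gradient of the MSE
        w -= lr * grad                        # steepest-descent update
    return w
```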
Interpretation: neural networks
An important feature of the fat-shattering dimension for these
classes is that it does not depend on the number of parameters
(e.g., weights in a neural network), but rather on their
sizes.
These measures therefore motivate a form of weight decay (see the objective below). Indeed, one consequence of the above result is a justification of the standard error function used in back-propagation optimization incorporating weight decay.
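One common form of the weight-decay objective this motivates ($\lambda > 0$ is the decay coefficient; exact forms vary):

$$
E(w) = \sum_{i=1}^{l}\big(y_i - f(x_i; w)\big)^2 + \lambda \lVert w \rVert^2 .
$$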
BP Algorithms
In the case of neural networks, the question naturally arises as to
whether there might exist a polynomial-time algorithm for
optimizing the soft margin bound.
The analysis has also placed the optimization of the quadratic loss
used in the back-propagation algorithm on a firm footing, though,
in this case, no polynomial time algorithm is known.
Disadvantage
A nice and direct proof of this bound is systematically presented in N. Cristianini and J. Shawe-Taylor (2001).
There are, by now, a great many such bounds.
One-class references
D. Tax and R. Duin. Support vector data description. Machine
Learning, 2004.
B. Schölkopf et al. New support vector algorithms. Neural Computation, 2000.
I. Steinwart et al. A classification framework for anomaly detection. JMLR, 2005.
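As a concrete entry point to these methods, a minimal sketch using scikit-learn's OneClassSVM, an implementation of the ν-formulation of Schölkopf et al. (the data and parameter values here are illustrative only):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))              # one-class ("normal") sample
X_test = np.vstack([rng.normal(0, 1, size=(10, 2)),    # likely inliers
                    rng.uniform(-6, 6, size=(10, 2))]) # likely outliers

# nu upper-bounds the fraction of training points treated as outliers.
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(X_train)
print(clf.predict(X_test))                             # +1 = inlier, -1 = outlier
```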