Classification for Lifelog Management

September 30, 2008

Sung-Bae Cho

Agenda

• Introduction to Classification

– Discriminant Analysis

– Decision Tree

– Density Estimation

– Artificial Neural Networks

• Comparison of Classification Methods

• An Example of Classification for Life Log Analysis

Classification

• Definition

– A procedure in which individual items are placed into groups, based on quantitative information about one or more characteristics inherent in the items (referred to as traits, variables, characters, etc.) and on a training set of previously labeled items.

• Formally, the problem can be stated as follows:

– Given training data {(X1, y1), …, (Xn, yn)}, produce a classifier h : X → Y that maps an object Xi to its classification label yi.

• For example, if the problem is filtering spam, then Xi is some representation of an email and y is either "Spam" or "Non-Spam".

• Classification algorithms are typically used in pattern recognition systems.
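The (Xi, yi) → h setup above can be sketched in code. This is a deliberately toy classifier for the spam example: each email is represented as a set of words, and the learned h predicts the label whose training vocabulary overlaps the email most. The data and the overlap rule are hypothetical illustrations, not a real spam filter.

```python
# Toy illustration of learning a classifier h from labeled (Xi, yi) pairs.
# Each email Xi is a set of words; h maps it to "Spam" or "Non-Spam".

def train(examples):
    """Collect the vocabulary seen for each label in the training set."""
    vocab = {}
    for words, label in examples:
        vocab.setdefault(label, set()).update(words)

    def h(words):
        # predict the label whose training vocabulary overlaps most
        return max(vocab, key=lambda lbl: len(vocab[lbl] & words))

    return h

training_data = [
    ({"win", "money", "now"}, "Spam"),
    ({"cheap", "money", "offer"}, "Spam"),
    ({"meeting", "agenda", "monday"}, "Non-Spam"),
    ({"lunch", "monday", "team"}, "Non-Spam"),
]
h = train(training_data)
print(h({"cheap", "offer", "now"}))  # -> Spam
print(h({"team", "meeting"}))        # -> Non-Spam
```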


Other Types of Classification Problem

• Can consider classification as an estimation problem, where the goal is to estimate a function of the form y = f(x; θ)

– where the feature vector input is x, and the function f is typically parameterized by some parameters θ.

• In the Bayesian approach to this problem, instead of choosing a single parameter vector θ, the result is integrated over all possible θ

– with each θ weighted by how likely it is given the training data D: P(y | x, D) = ∫ P(y | x, θ) P(θ | D) dθ
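The Bayesian average over parameters can be made concrete with a discrete grid of θ values. In this toy sketch the "classifier" is just a biased coin with unknown bias θ, D is a short run of observed outcomes, and the prediction integrates (here: sums) over all candidate θ weighted by their posterior; the grid and data are invented for illustration.

```python
# Discrete sketch of P(y | x, D) = sum_theta P(y | theta) * P(theta | D).

thetas = [i / 10 for i in range(1, 10)]  # candidate parameter values
D = ["H", "H", "T", "H"]                 # observed training data

def likelihood(theta, data):
    p = 1.0
    for outcome in data:
        p *= theta if outcome == "H" else (1 - theta)
    return p

# posterior P(theta | D) under a uniform prior (Bayes' rule, normalized)
weights = [likelihood(t, D) for t in thetas]
total = sum(weights)
posterior = [w / total for w in weights]

# predictive probability of "H", integrating out theta
p_heads = sum(t * p for t, p in zip(thetas, posterior))
print(round(p_heads, 3))  # -> 0.661
```

Note that no single θ is ever selected: the prediction blends all nine candidate coins, weighted by how well each explains D.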


Features of Classification

• Structure of a classification task

– Prior probability: relative frequency with which the classes occur in the population

– Criterion for separating the classes: underlying relation that uses observed attributes to distinguish an individual from each class

– Misclassification cost: cost associated with making a wrong classification

• Benefits

– Mechanical classification is faster

– Human operators may have biases

– Cheaper: e.g., a diagnosis made from external symptoms avoids surgery

– Understanding structural relationships between the response and the measured variables

• Issues

– Accuracy, Speed, Comprehensibility

– Time to learn


Examples of Classification Algorithms

• Discriminant analysis

– Linear / Quadratic discriminant

– Logistic regression

– Support vector machine

• Decision tree

• Density estimation

– k-nearest neighbor

• Artificial neural networks

– Perceptron

– Multi-layer perceptron

– Naive Bayes classifier

• Probabilistic classification

– Bayesian networks*

– Temporal classification

• Hidden Markov models*

• Dynamic Bayesian networks*

* covered in the next lectures



Discriminant Analysis

• Definition

– Divide the sample space by a series of lines in two dimensions, planes in 3-D and, more generally, hyperplanes in many dimensions

– The line dividing two classes is drawn to bisect the line joining the centers of those classes

– The direction of the line is determined by the shape of the clusters of points
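A boundary that bisects the line joining two class centers is equivalent to assigning each point to the nearer center, which can be sketched directly. The 2-D data here is made up for illustration; a full discriminant analysis would also account for the cluster shapes, as the last bullet notes.

```python
# Nearest-centre sketch of the "bisecting line" idea: the implied
# boundary is the perpendicular bisector between the class centres.

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def classify(p, centres):
    # assign p to the class with the nearest centre (squared Euclidean)
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(centres, key=lambda label: d2(p, centres[label]))

class_a = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
class_b = [(4.0, 4.0), (5.0, 5.0), (4.0, 5.0)]
centres = {"A": centroid(class_a), "B": centroid(class_b)}

print(classify((0.5, 0.5), centres))  # -> A
print(classify((4.5, 4.8), centres))  # -> B
```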


Discriminant Analysis: Example


Discriminant Analysis: Issues and Properties

• Requires numerical attributes, no missing values

• Dummy variables for categories

• Maximizes separation between the classes in a least square sense

• Attributes must be linearly independent

• No attribute can be constant within each class (add random noise)

• Canonical variates for multiclass

• Especially useful when the class means are ordered, or lie along a simple curve in attribute-space

• Logistic maximizes conditional P(class|x)

• Assumes that the form of the underlying density functions (or ratio) is known


Support Vector Machines

• Definition:

– A set of related supervised learning methods used for classification

– Viewing input data as two sets of vectors in an n-dimensional space

– Constructing a separating hyperplane in the space

• which maximizes the margin between the two data sets
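One way to see the margin-maximizing hyperplane in code is (sub)gradient descent on the hinge loss with a small L2 penalty. This is only a sketch of the idea on toy 2-D data with ±1 labels (a real SVM would be trained with a proper solver); the data, learning rate, and penalty are all invented for illustration.

```python
# Margin-seeking linear classifier: minimize hinge loss + small L2
# penalty by subgradient descent on toy, linearly separable data.

def train_linear_svm(data, lam=0.01, lr=0.1, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in data:
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:  # point inside the margin: hinge subgradient
                w[0] += lr * (y * x1 - lam * w[0])
                w[1] += lr * (y * x2 - lam * w[1])
                b += lr * y
            else:           # outside: only the regularizer shrinks w
                w[0] -= lr * lam * w[0]
                w[1] -= lr * lam * w[1]
    return w, b

data = [((0.0, 0.0), -1), ((1.0, 0.5), -1),
        ((3.0, 3.0), +1), ((4.0, 3.5), +1)]
w, b = train_linear_svm(data)

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1

print(predict(0.0, 0.0), predict(4.0, 3.5))  # -> -1 1
```

The regularizer keeps shrinking w until some point touches the margin again, which is what pushes the solution toward the widest separating strip.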


Linear Support Vector Machines

(Figure: a separating hyperplane and its margin; the data points lying on the margin are the support vectors.)


Decision Tree

• Definition

– Space is divided into boxes, and at each stage in the procedure, each box is examined to see if it may be split into more boxes

– The split usually being parallel to the coordinate axes

• Structure

(Figure: tree structure with internal test nodes and class-label leaves.)


Decision Tree: Example


Classification Process of a Tree

• Generate_tests:

– Attr = val, Attr < val, Attr In Subset, …

• Test:

– Gini, Entropy, KS, Twoing, …

• Stop_criterion:

– Num cases, significance (e.g., χ²-test), overgrow & prune

• Info:

– Most frequent class ("mode"), …

• Prune:

– Reduced-error, cost complexity, …
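Two of the split tests named above, Gini and entropy, are easy to state as code. The counts below are illustrative per-class totals in a single node; a tree learner would compute these for every candidate split and pick the one that most reduces impurity.

```python
# Node impurity measures used as split tests: lower means purer.
from math import log2

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

print(gini([5, 5]), gini([10, 0]))  # impure node vs. pure node
print(round(entropy([5, 5]), 3))    # -> 1.0 (maximal for 2 classes)
```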

(Figure: a learning algorithm derives classification rules from training data; the rules are then applied to testing data.)



Density Estimation

• Definition

– Classification rule that assigns x to class Ad if p(Ad|x) = maxi p(Ai|x), i.e., maximizing the a posteriori probability

– For two classes Ai and Aj

– Kernel methods

– Naive Bayes assumes variable independence

– Projection methods

• Example: Nearest Neighbors

– 1-nearest neighbor rule, in which the training set is searched for the ‘nearest’ (in a defined sense) previous example, whose class is then assumed for the new case

– K-nearest neighbor, the natural extension taking the most frequent class in the neighbors

– Parameter free model (distance!)

– Two-class cost rule: assign x to Aj when c(i|j) fj(x) > c(j|i) fi(x), where c(i|j) is the cost of misclassifying an Aj case as Ai, and fj is the class-conditional density
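The 1-NN and k-NN rules described above fit in a few lines: search the stored training set for the nearest cases and take the most frequent class among them. The 2-D points and Euclidean distance here are illustrative choices (the slide's "defined sense" of nearness).

```python
# k-nearest-neighbour sketch: store the whole training set, then vote
# among the k nearest examples (k=1 gives the 1-NN rule).
from collections import Counter

def knn_predict(train, query, k=3):
    # sort training items by squared Euclidean distance to the query
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    # majority vote among the k nearest labels
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (0.4, 0.4), k=1))  # 1-NN -> A
print(knn_predict(train, (5.2, 5.1), k=3))  # 3-NN -> B
```

Note how the "parameter free" point shows up here: nothing is fitted, but the entire sample must be kept and scanned at prediction time.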

Nearest Neighbours: Example


Density Estimation: Issues

• The presence of irrelevant variables affects the model

• Requires numerical variables, or a special distance function for categorical variables

• A variable’s scale affects the model

• The entire training sample must be stored



Artificial Neural Networks

• Black Box Modeling

• An artificial neural network (ANN) is a mathematical model or computational model based on biological neural networks

• It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation

• In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase


The Neuron Model

(Figure: inputs I1…I5 with weights w1…w5 feed a summation unit; the sum passes through the activation function f to give the output O.)

sum = Σn wn In, where f is the activation function

Oi = fi(Σn wn In)
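The neuron model on this slide, O = f(Σn wn In), is a one-liner in code. The step activation and the particular inputs and weights below are illustrative choices, not part of the slide.

```python
# A single neuron: weighted sum of inputs passed through an
# activation function f.

def neuron(inputs, weights, f):
    s = sum(w * i for w, i in zip(weights, inputs))  # weighted sum
    return f(s)

def step(s):
    return 1 if s >= 0 else 0  # threshold activation, one choice of f

inputs = [1.0, 0.0, 1.0, 1.0, 0.0]     # I1..I5
weights = [0.5, -0.2, 0.3, -0.4, 0.1]  # w1..w5

print(neuron(inputs, weights, step))  # sum = 0.5+0.3-0.4 = 0.4 -> 1
```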

Multi-Layer Perceptron Model

(Figure: inputs I1…I5 feed the network through weights; the output o1 is computed feedforward, and for each training pattern the error e = d1 − o1 is fed back through weights w31…w34 as errors e31…e34.)

e: error

d: desired output

o: network output

w: weights

• Universal approximator

• Weights are modified to minimize classification error

• The outputs represent the P ( Class | example )

Artificial Neural Networks: Principles

E = ½ Σi Σj (yij − Yij)²

E(w + Δw) ≈ E(w) + (dE(w)/dw) Δw

Δw = −η dE(w)/dw, so w ← w + Δw

Back-propagation Learning

• The error E is a function of the weights and the training data

• Minimize error E using the steepest descent

(Figure: the error surface E over the weights W, with descent toward the minimum.)

Wnew = Wold − η ∂E/∂Wold

η: learning rate

Downhill direction = −∂E/∂W
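The steepest-descent update Wnew = Wold − η ∂E/∂W can be demonstrated on a toy one-dimensional error surface. E(w) = (w − 3)², with its minimum at w = 3, stands in for the network's error; the learning rate and step count are arbitrary illustrative values.

```python
# Gradient descent on E(w) = (w - 3)^2, whose minimum is at w = 3.

def descend(w, eta=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - 3)  # dE/dw for E(w) = (w - 3)^2
        w = w - eta * grad  # move downhill, scaled by the learning rate
    return w

w = descend(w=0.0)
print(round(w, 4))  # -> 3.0 (converged to the minimum)
```

Back-propagation uses exactly this update; its extra machinery is only about computing ∂E/∂w efficiently for every weight in a layered network.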

Artificial Neural Networks: Issues

• To make a generalizable network one has to determine a suitable number of hidden nodes

– Too few: the network does not learn enough

– Too many: over-learning (overfitting)



Comparison of Classification Methods

• MV: Missing Values

• Cost matrix: used in learning (L), testing (T), or not at all (N)

• Interpretability: how easily the classifier is understood

• Method: comprehensibility of the method

• Parameters: good guidelines or automatic selection of parameters

• User-fr: user friendliness

• Data: numerical (N), categorical (C)

Algo.         MV  Cost  Interpr.  Method  Param's  User-fr  Data
Density       N   T     1         5       2        N        N
Discriminant  N   T     3         4       4        Y        N
Tree          Y   LT    5         4       5        Y        NC
Neural Nets   N   T     1         3       3        N        N


VTT Technical Research Centre of Finland

• Representing complex entities: Any single methodology alone is often insufficient

– For example, knowledge representation needs a hierarchy utilizing heterogeneous methodologies (Minsky)

• Van Laerhoven and Cakmakci (2000)

– 4 layers (statistical cues, Kohonen clustering, kNN classification, and a Markov chain supervisory layer)

• Advantage of using high-level features

– Control and an ‘accurate touch’ on the data are maintained up to the classification layer

– If the feature itself is discriminative enough: classification layer is not even necessary

(Figure: processing layers from Sensing through Representation and Modeling & Classifying up to the Service layer, using methods such as Clustering, Fuzzy Logic, Naïve Bayes, and Fuzzy Control.)

Overview

• Disadvantage of using high-level features

– The development of proper signal processing algorithms is laborious (compared to the use of very low-level features)


Context Representation

• A central problem in utilizing context information in application development

– Difficulty of attaching real-world meaning to the extracted features

– “When environment loudness is SILENT, set ringing volume LOW” rather than “When environment loudness is −20 dB, set ringing tone volume to 2.5”

• Context extraction: Abstract raw sensor signals and compress information

– How well features describe some parts of the real world context

– Extracted features: Context atoms (the smallest amount of context information)
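Turning a raw feature into context atoms, as in the SILENT loudness example above, can be sketched as a fuzzy quantization whose outputs lie in [0, 1]. The dB breakpoints and the three atom names below are made up for illustration.

```python
# Map a raw loudness level (dB) to context atoms with values in [0, 1].
# Breakpoints are hypothetical.

def loudness_atoms(db):
    def ramp(x, lo, hi):
        # linear membership rising from 0 at lo to 1 at hi
        if x <= lo:
            return 0.0
        if x >= hi:
            return 1.0
        return (x - lo) / (hi - lo)

    silent = 1.0 - ramp(db, -40, -20)
    loud = ramp(db, -10, 5)
    moderate = max(0.0, 1.0 - silent - loud)
    return {"SILENT": silent, "MODERATE": moderate, "LOUD": loud}

print(loudness_atoms(-35))  # mostly SILENT
print(loudness_atoms(-15))  # mostly MODERATE
```

An application rule can then fire on the symbolic atom ("when SILENT is high, lower the ringing volume") instead of on the raw −20 dB reading.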


Context Information Processing (1)


Context Information Processing (2)

• Context atom vector C (k = maximum number of context atoms; the value of each atom lies between 0 and 1)

• Advantages

– Composing the ontology for sensor-based context information

• Direct control of various types of applications

– Multidimensional context atom vector C: Utilization of explorative data analysis methods

– Extendable (straightforward to add new context atoms into the vector C)


Some of the Quantized Features


Naive Bayes Classifier

(Figure: classifier structure and its joint probability distribution; the conditional probability tables (CPTs) are learned by counting conditional probability distributions from the data.)
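The "learning by counting" idea for a naive Bayes classifier fits in a short sketch: class priors and per-class attribute tables come straight from counts, and prediction multiplies them under the independence assumption. The tiny context data set and the added Laplace smoothing are illustrative choices, not the experiment's actual data.

```python
# Naive Bayes learned by counting: priors and per-attribute CPTs are
# relative frequencies; prediction multiplies them (independence).
from collections import defaultdict

def train_nb(rows):
    class_n = defaultdict(int)
    counts = defaultdict(int)  # (label, (attr, value)) -> count
    for attrs, label in rows:
        class_n[label] += 1
        for av in attrs.items():
            counts[(label, av)] += 1
    total = sum(class_n.values())

    def predict(attrs):
        def score(label):
            p = class_n[label] / total      # prior from counts
            for av in attrs.items():        # one CPT per attribute;
                # +1 / +2 is Laplace smoothing for binary values
                p *= (counts[(label, av)] + 1) / (class_n[label] + 2)
            return p
        return max(class_n, key=score)

    return predict

rows = [({"loud": "low", "moving": "no"}, "office"),
        ({"loud": "low", "moving": "no"}, "office"),
        ({"loud": "high", "moving": "yes"}, "street"),
        ({"loud": "high", "moving": "yes"}, "street")]
predict = train_nb(rows)
print(predict({"loud": "high", "moving": "yes"}))  # -> street
```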

Experiment

• Two kinds of data

– Measured in controlled conditions (1)

– Real world scenario (2)

• Experiment (1)

– Nine individual context class data sets were recorded (each five times, separately, in controlled conditions)

– Length of each nine-class data set: 30 seconds (30 sets of 1-second atoms)

– Nine sets of 150 vectors each (five recordings × 30 vectors)

– To create variability in the data

• Five samples of different songs from two albums

• Speech and acceleration-based contexts by four different persons

– Training data set = test data set


Experiment (2)

– Home environment

– Multiple simultaneous contexts

– Only those contexts that can be defined in each phase are counted

– Five testees (each performed the scenario twice): ten test data sets, each 6–8 minutes long

– The last test data set was ignored (box wires had been broken)

– Final amount of scenario data: 9 sets (60 minutes)

– Different songs, three different cars

– Manual on-the-fly segmentation is not accurate; a lot of disturbances

– 1) training data set = test data set, 2) cross validation

Recognition Results as an Intensity Graph

Resolution = 1 second (5-second moving average)


Classification Accuracy (9 contexts, 2 BN)


Recognition Results

• Recognition Results for 13 Individual Contexts (9 scenarios)

(Legend: BN vs. Direct.)


Context Data

• Data gathering: laptop

• 1-second resolution (4-second sliding window)

• 3 minutes

• Experiments: offline


Experiments with User Scenario

• Fluctuation of control signals

– Oscillation at various information representation levels

– Moving-average or smoothing methods (e.g., when changing font sizes)

• Extension (adding new types of contexts)

– Complicates the development of a control system


User Reactions

• Ten cell phone users were interviewed

– 5 of them normal users (use basic communication applications)

– 5 of them active cell phone users (actively use most of the cell phone applications and features)

• Interview process

– Explaining the scenario presented in the table to the user

– Explaining to the user how adaptive applications operate in general, and particularly during the user scenario

• Control signals and screenshots of adapted service contents

– Users were interviewed

• Application adaptation is an acceptable feature, but mistakes in adaptation are not accepted

• Users want to have control over the device’s operations


Summary

• Classification for context awareness

• Classification method application

– Applying naïve Bayesian classifier in learning and classifying the contexts of a mobile device user

– Mobile device-oriented sensors are embedded (instead of being distributed into the environment)

– Static short-term events are recognized (descriptive and generic, at the intermediate levels of the context representation layer structure); sequences of these events could be utilized at the upper level

– Naïve Bayesian learning and inference: Straightforward and computationally very efficient


Bibliography

• Book: D. Michie, D.J. Spiegelhalter, and C.C. Taylor (Eds.), Machine Learning, Neural and Statistical Classification. Statlog Project.

• Book: S.M. Weiss and C.A. Kulikowski, Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Networks, Machine Learning and Expert Systems. Morgan Kaufmann, San Mateo, CA, 1991.

• Book: K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition. Academic Press, 1990.

• Paper: C.M. van der Walt and E. Barnard, “Data characteristics that determine classifier performance”, in Proceedings of the Sixteenth Annual Symposium of the Pattern Recognition Association of South Africa, pp. 160-165, 2006.

• Classifier showdown: A practical comparison of classification algorithms - Link

• Statistical Pattern Recognition Toolbox for Matlab - Link