Spoken Dialog Systems and VoiceXML:
Intro to Pattern Recognition
Esther Levin
Dept of Computer Science
CCNY
Credits and Acknowledgments
Some materials used in this course were taken from the textbook “Pattern Classification” by Duda et al., John Wiley & Sons, 2001, with the permission of the authors and the publisher, and also from other material on the web:
Dr. A. Aydin Atalan, Middle East Technical University, Turkey
Dr. Djamel Bouchaffra, Oakland University
Dr. Adam Krzyzak, Concordia University
Dr. Joseph Picone, Mississippi State University
Dr. Robi Polikar, Rowan University
Dr. Stefan A. Robila, University of New Orleans
Dr. Sargur N. Srihari, State University of New York at Buffalo
David G. Stork, Stanford University
Dr. Godfried Toussaint, McGill University
Dr. Chris Wyatt, Virginia Tech
Dr. Alan L. Yuille, University of California, Los Angeles
Dr. Song-Chun Zhu, University of California, Los Angeles
Outline
Introduction: What is pattern recognition?
Background material: Probability theory
PATTERN RECOGNITION AREAS
Optical Character Recognition (OCR)
  Sorting letters by postal code. Reconstructing text from printed materials (such as reading machines for blind people).
Analysis and identification of human patterns
  Speech and voice recognition. Fingerprints and DNA mapping.
Banking and insurance applications
  Credit card applicants classified by income, creditworthiness, mortgage amount, number of dependents, etc. Car insurance (pattern including make of car, number of accidents, age, sex, driving habits, location, etc.).
Diagnosis systems
  Medical diagnosis (disease vs. symptoms classification; X-ray, EKG, and test analysis, etc.). Diagnosis of automotive malfunctioning.
Prediction systems
  Weather forecasting (based on satellite data). Analysis of seismic patterns.
Dating services (where pattern includes age, sex, race, hobbies, income, etc.).
More Pattern Recognition Applications
SENSORY
  Vision: Face / Handwriting / Hand
  Speech: Speaker / Speech
  Olfaction: Apple ripe?
DATA
Text Categorization
Information Retrieval
Data Mining
Genome Sequence Matching
What is a pattern?
“A pattern is the opposite of a chaos; it is an entity, vaguely defined, that could be given a name.”
PR Definitions
Theory, algorithms, and systems to put patterns into categories
Classification of noisy or complex data
Relating a perceived pattern to previously perceived patterns
Characters
Ç ş ğ İ ü Ü Ö Ğ چ ك ٤ ٧ ع
К Ц Д
ζ ω Ψ Ω ξ θ
א ם ש ת ד נ
A v t u I h D U w K
Handwriting
Terminology
Features, feature vector
Decision boundary
Error
Cost of error
Generalization
A Fishy Example I
“Sorting incoming Fish on a conveyor according to species using optical sensing”
Salmon or Sea Bass?
Problem Analysis
Set up a camera and take some sample images to extract features
Length
Lightness
Width
Number and shape of fins
Position of the mouth, etc.
This is the set of all suggested features to explore for use in our classifier!
Solution by Stages
Preprocess raw data from camera
Segment isolated fish
Extract features from each fish (length, width, brightness, etc.)
Classify each fish
Preprocessing
Use a segmentation operation to isolate fish from one another and from the background
Information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features
The features are passed to a classifier
Classification
Select the length of the fish as a possible feature for discrimination
The length is a poor feature alone!
Select the lightness as a possible feature.
Threshold decision boundary and cost relationship
Move our decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)
Task of decision theory
“Customers do not want sea bass in their cans of salmon”
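To make the cost-based threshold concrete, here is a minimal Python sketch. The Gaussian lightness distributions and the cost values are hypothetical illustrations, not taken from the textbook example; the point is only that an asymmetric cost shifts the optimal threshold toward smaller lightness, as described above.

```python
import numpy as np

# A minimal sketch with hypothetical data: lightness samples for each species
# are assumed Gaussian; the cost values are made up for illustration.
rng = np.random.default_rng(0)
salmon = rng.normal(4.0, 1.0, 10_000)     # hypothetical salmon lightness
sea_bass = rng.normal(7.0, 1.0, 10_000)   # hypothetical sea bass lightness

COST_BASS_AS_SALMON = 5.0  # sea bass in a salmon can is the expensive error
COST_SALMON_AS_BASS = 1.0

def expected_cost(threshold):
    # Classify as salmon if lightness < threshold, else as sea bass.
    p_bass_as_salmon = np.mean(sea_bass < threshold)
    p_salmon_as_bass = np.mean(salmon >= threshold)
    return (COST_BASS_AS_SALMON * p_bass_as_salmon
            + COST_SALMON_AS_BASS * p_salmon_as_bass)

thresholds = np.linspace(2.0, 9.0, 200)
best = min(thresholds, key=expected_cost)
print(f"cost-minimizing threshold: {best:.2f}")  # lands below the midpoint 5.5
```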
Adopt the lightness and add the width of the fish:
Fish x = [x1, x2], where x1 = lightness and x2 = width.
We might add other features that are not correlated with the ones we already have. Care should be taken not to reduce performance by adding such “noisy features”.
Ideally, the best decision boundary would be the one that provides optimal performance.
However, our satisfaction is premature because the central aim of designing a classifier is to correctly classify novel input
Issue of generalization!
Decision Boundaries
Observe: we can do much better with two features.
Caveat: overfitting!
Occam’s Razor
Entities are not to be multiplied without necessity
William of Occam (1284-1347)
A Complete PR System
Problem formulation:
Input object → Measurements & Preprocessing → Features → Classification → Class label
Basic ingredients:
•Measurement space (e.g., image intensity, pressure)
•Features (e.g., corners, spectral energy)
•Classifier - soft and hard
•Decision boundary
•Training sample
•Probability of error
Pattern Recognition Systems
Sensing
Use of a transducer (camera or microphone). The PR system depends on the bandwidth, resolution, sensitivity, and distortion of the transducer.
Segmentation and grouping
Patterns should be well separated and should not overlap.
Feature extraction
Discriminative features; invariant features with respect to translation, rotation, and scale.
Classification
Use the feature vector provided by the feature extractor to assign the object to a category.
Post-processing
Exploit context-dependent information, other than from the target pattern itself, to improve performance.
The Design Cycle
Data collection
Feature Choice
Model Choice
Training
Evaluation
Computational Complexity
Data Collection
How do we know when we have collected an adequately large and representative set of examples for training and testing the system?
Feature Choice
Depends on the characteristics of the problem domain. Features should be simple to extract, invariant to irrelevant transformations, and insensitive to noise.
Model Choice
Unsatisfied with the performance of our linear fish classifier, we may want to jump to another class of model.
Training
Use data to determine the classifier. Many different procedures for training classifiers and choosing models
Evaluation
Measure the error rate (or performance) and switch from one set of features & models to another one.
Computational Complexity
What is the trade off between computational ease and performance?
(How does an algorithm scale as a function of the number of features, the number of training examples, and the number of patterns or categories?)
Learning and Adaptation
Learning: any method that combines empirical information from the environment with prior knowledge into the design of a classifier, attempting to improve performance with time.
Empirical information: usually in the form of training examples.
Prior knowledge: invariances, correlations.
Supervised learning
A teacher provides a category label or cost for each pattern in the training set.
Unsupervised learning
The system forms clusters or “natural groupings” of the input patterns.
Syntactic Versus Statistical PR
Basic assumption: there is an underlying regularity behind the observed phenomena.
Question: based on noisy observations, what is the underlying regularity?
Syntactic: structure through a common generative mechanism. For example, all the different manifestations of English share a common underlying set of grammatical rules.
Statistical: objects characterized through statistical similarity. For example, all possible digits `2' share some common underlying statistical relationship.
Difficulties
Segmentation
Context
Temporal structure
Missing features
Aberrant data
Noise
Do all these images represent an `A'?
Design Cycle
How do we know what features to select, and how do we select them…?
What type of classifier shall we use? Is there a best classifier…?
How do we train…? How do we combine prior knowledge with empirical data?
How do we evaluate our performance? How do we validate the results? What confidence do we have in a decision?
Conclusion
I expect you are overwhelmed by the number, complexity and magnitude of the sub-problems of Pattern Recognition
Many of these sub-problems can indeed be solved
Many fascinating unsolved problems still remain
Toolkit for PR
Statistics
Decision Theory
Optimization
Signal Processing
Neural Networks
Fuzzy Logic
Decision Trees
Clustering
Genetic Algorithms
AI Search
Formal Grammars
…
Linear algebra
Matrix A:
$$A = [a_{ij}]_{m \times n} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$$

Matrix transpose:
$$B = [b_{ij}]_{n \times m} = A^T \iff b_{ij} = a_{ji}, \quad 1 \le i \le n,\ 1 \le j \le m$$

Vector a:
$$a = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}, \qquad a^T = [a_1, \ldots, a_n]$$
Matrix and vector multiplication

Matrix multiplication:
$$A = [a_{ij}]_{m \times p}, \quad B = [b_{ij}]_{p \times n}, \quad C = AB = [c_{ij}]_{m \times n}, \quad c_{ij} = \mathrm{row}_i(A) \cdot \mathrm{col}_j(B)$$

Outer vector product:
$$a = [a_i]_{m \times 1}, \quad b = [b_j]_{n \times 1}, \quad C = ab^T, \ \text{an } m \times n \text{ matrix}$$

Vector-matrix product:
$$A = [a_{ij}]_{m \times n}, \quad b = [b_j]_{n \times 1}, \quad C = Ab, \ \text{a vector of length } m$$
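These operations map directly to NumPy. A small sketch with illustrative values (the matrices here are arbitrary examples, not from the slides):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])          # A = [a_ij], a 2x3 matrix
B = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])              # B, a 3x2 matrix

C = A @ B                             # matrix product: c_ij = row_i(A) . col_j(B)
print(C.shape)                        # (2, 2)
print(A.T)                            # transpose: (A^T)_ij = a_ji, a 3x2 matrix

a = np.array([1., 2.])                # vector of length 2
b = np.array([3., 4., 5.])            # vector of length 3
print(np.outer(a, b))                 # outer product a b^T, a 2x3 matrix
print(A @ b)                          # vector-matrix product Ab, length-2 vector
```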
Inner Product

Inner (dot) product:
$$a^T b = \sum_{i=1}^{n} a_i b_i$$

Length (Euclidean norm) of a vector:
$$\|a\| = \sqrt{a^T a} = \sqrt{\sum_{i=1}^{n} a_i^2}$$
a is normalized iff $\|a\| = 1$.

The angle between two n-dimensional vectors:
$$\cos\theta = \frac{a^T b}{\|a\|\,\|b\|}$$

An inner product is a measure of collinearity: a and b are orthogonal iff $a^T b = 0$; a and b are collinear iff $|a^T b| = \|a\|\,\|b\|$.

A set of vectors is linearly independent if no vector is a linear combination of the other vectors.
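A corresponding NumPy sketch (the vectors are illustrative):

```python
import numpy as np

a = np.array([1., 0., 1.])
b = np.array([0., 2., 0.])

print(a @ b)                        # inner product sum_i a_i b_i = 0 -> orthogonal
norm_a = np.linalg.norm(a)          # Euclidean length sqrt(a^T a)
print(a / norm_a)                   # normalized vector: ||a / norm_a|| = 1

cos_theta = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)                    # cosine of the angle between a and b
```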
Determinant and Trace

Determinant:
$$A = [a_{ij}]_{n \times n}; \qquad \det(A) = \sum_{j=1}^{n} a_{ij} A_{ij} \ \text{ for any } i = 1, \ldots, n,$$
where the cofactor $A_{ij} = (-1)^{i+j} \det(M_{ij})$ and $M_{ij}$ is the minor of $a_{ij}$.

$$\det(AB) = \det(A)\det(B)$$

Trace:
$$A = [a_{ij}]_{n \times n}; \qquad \mathrm{tr}[A] = \sum_{j=1}^{n} a_{jj}$$
Matrix Inversion

A (n × n) is nonsingular if there exists a matrix B such that
$$AB = BA = I_n, \qquad B = A^{-1}$$

Example: A = [2 3; 2 2], B = [-1 3/2; 1 -1]

A is nonsingular iff $|A| \neq 0$.

Pseudo-inverse for a non-square matrix, provided $A^T A$ is not singular:
$$A^{\#} = [A^T A]^{-1} A^T, \qquad A^{\#} A = I$$
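The 2×2 example above can be checked numerically; the pseudo-inverse line uses a hypothetical 3×2 matrix M for illustration:

```python
import numpy as np

A = np.array([[2., 3.],
              [2., 2.]])
B = np.linalg.inv(A)                       # exists since det(A) = -2 != 0
print(B)                                   # [[-1., 1.5], [1., -1.]], as above
print(np.allclose(A @ B, np.eye(2)))       # AB = I_2 -> True

# Pseudo-inverse A# = (A^T A)^-1 A^T for a hypothetical non-square M:
M = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
M_pinv = np.linalg.inv(M.T @ M) @ M.T
print(np.allclose(M_pinv @ M, np.eye(2)))  # A# A = I -> True
```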
Eigenvectors and Eigenvalues

$$A e_j = \lambda_j e_j, \quad j = 1, \ldots, n; \qquad \|e_j\| = 1$$

Characteristic equation: an n-th order polynomial with n roots:
$$\det[A - \lambda I_n] = 0$$

$$\mathrm{tr}[A] = \sum_{j=1}^{n} \lambda_j, \qquad \det[A] = \prod_{j=1}^{n} \lambda_j$$
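A small numerical check of the eigenvalue, trace, and determinant identities (the symmetric matrix is illustrative):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])
lam, E = np.linalg.eig(A)                          # columns of E are unit eigenvectors
print(lam)                                         # eigenvalues: [3., 1.]
print(np.allclose(A @ E[:, 0], lam[0] * E[:, 0]))  # A e_j = lambda_j e_j -> True
print(np.isclose(np.trace(A), lam.sum()))          # tr[A] = sum of eigenvalues
print(np.isclose(np.linalg.det(A), lam.prod()))    # det[A] = product of eigenvalues
```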
Probability Theory
Primary references: any probability and statistics textbook (e.g., Papoulis); Appendix A.4 in “Pattern Classification” by Duda et al.
The principles of probability theory, describing the behavior of systems with random characteristics, are of fundamental importance to pattern recognition.
Example 1 (Wikipedia)
•Two bowls are full of cookies.
•Bowl #1 has 10 chocolate chip cookies and 30 plain cookies.
•Bowl #2 has 20 of each.
•Fred picks a bowl at random, and then picks a cookie at random.
•The cookie turns out to be a plain one.
•How probable is it that Fred picked it out of bowl #1? That is, what is the probability that Fred picked bowl #1, given that he has a plain cookie?
•Event A: Fred picked bowl #1. Event B: Fred picked a plain cookie. We want Pr(A|B).
Example 1 - continued
Tables of occurrences and relative frequencies
It is often helpful when calculating conditional probabilities to create a simple table containing the number of occurrences of each outcome, or the relative frequencies of each outcome, for each of the independent variables. The tables below illustrate the use of this method for the cookies.
Number of cookies in each bowl, by type of cookie:

                 Bowl #1   Bowl #2   Totals
Chocolate chip        10        20       30
Plain                 30        20       50
Total                 40        40       80

Relative frequency of cookies in each bowl, by type of cookie:

                 Bowl #1   Bowl #2   Totals
Chocolate chip     0.125     0.250    0.375
Plain              0.375     0.250    0.625
Total              0.500     0.500    1.000

The second table is derived from the first by dividing each entry by the total number of cookies under consideration (80).
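For completeness, the same Bayes' rule computation as a short Python sketch (all numbers come from the tables above):

```python
# Bayes' rule for the cookie example: A = "bowl #1", B = "plain cookie".
p_bowl1 = 0.5                     # Fred picks a bowl at random
p_plain_given_1 = 30 / 40         # bowl #1: 30 plain out of 40 cookies
p_plain_given_2 = 20 / 40         # bowl #2: 20 plain out of 40 cookies

p_plain = p_plain_given_1 * p_bowl1 + p_plain_given_2 * (1 - p_bowl1)  # 0.625
p_bowl1_given_plain = p_plain_given_1 * p_bowl1 / p_plain              # Pr(A|B)
print(p_bowl1_given_plain)        # 0.6, i.e. 0.375 / 0.625 from the table
```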
Example 2
1. Power plant operation. The variables X, Y, Z describe the state of three power plants (X = 0 means plant X is idle).
Denote by A the event that plant X is idle, and by B the event that at least two of the three plants are working.
What are P(A) and P(A|B), the probability that X is idle given that at least two of the three are working?
X Y Z P(x,y,z)
0 0 0 0.07
0 0 1 0.04
0 1 0 0.03
0 1 1 0.18
1 0 0 0.16
1 0 1 0.18
1 1 0 0.21
1 1 1 0.13
P(A) = P(0,0,0) + P(0,0,1) + P(0,1,0) + P(0, 1, 1) = 0.07+0.04 +0.03 +0.18 =0.32
P(B) = P(0,1,1) +P(1,0,1) + P(1,1,0)+ P(1,1,1)= 0.18+ 0.18+0.21+0.13=0.7
P(A and B) = P(0,1,1) = 0.18
P(A|B) = P(A and B)/P(B) = 0.18/0.7 =0.257
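A quick sketch of the same computation, marginalizing the joint table directly (values from the table above):

```python
# Power-plant example: marginalize the joint table P(x, y, z) directly.
joint = {(0, 0, 0): 0.07, (0, 0, 1): 0.04, (0, 1, 0): 0.03, (0, 1, 1): 0.18,
         (1, 0, 0): 0.16, (1, 0, 1): 0.18, (1, 1, 0): 0.21, (1, 1, 1): 0.13}

p_A = sum(p for (x, y, z), p in joint.items() if x == 0)           # X idle
p_B = sum(p for (x, y, z), p in joint.items() if x + y + z >= 2)   # >= 2 working
p_AB = sum(p for (x, y, z), p in joint.items() if x == 0 and y + z >= 2)
print(p_A, p_B, round(p_AB / p_B, 3))   # 0.32 0.7 0.257
```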
2. Cars are assembled in four possible locations. Plant I supplies 20% of the cars; plant II, 24%; plant III, 25%; and plant IV, 31%. There is a 1-year warranty on every car.
The company collected data that shows
P(claim|plant I) = 0.05; P(claim|plant II) = 0.11;
P(claim|plant III) = 0.03; P(claim|plant IV) = 0.08;
Cars are sold at random.
An owner just submitted a claim for her car. What are the posterior probabilities that this car was made in plant I, II, III, or IV?
P(claim) = P(claim|plant I)P(plant I) +
P(claim|plant II)P(plant II) +
P(claim|plant III)P(plant III) +
P(claim|plant IV)P(plant IV) =0.0687
P(plant I|claim) =
= P(claim|plant I) * P(plant I)/P(claim) = 0.146
P(plant II|claim) =
= P(claim|plant II) * P(plant II)/P(claim) = 0.384
P(plant III|claim) =
= P(claim|plant III) * P(plant III)/P(claim) = 0.109
P(plant IV|claim) =
= P(claim|plant IV) * P(plant IV)/P(claim) = 0.361
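The same posterior computation as a minimal sketch (priors and likelihoods from the problem statement above):

```python
# Posterior over assembly plants given a warranty claim, via Bayes' rule.
prior = {"I": 0.20, "II": 0.24, "III": 0.25, "IV": 0.31}
likelihood = {"I": 0.05, "II": 0.11, "III": 0.03, "IV": 0.08}   # P(claim | plant)

p_claim = sum(likelihood[k] * prior[k] for k in prior)
posterior = {k: round(likelihood[k] * prior[k] / p_claim, 3) for k in prior}
print(round(p_claim, 4))   # 0.0687
print(posterior)           # {'I': 0.146, 'II': 0.384, 'III': 0.109, 'IV': 0.361}
```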
Example 3
3. It is known that 1% of the population suffers from a particular disease. A blood test has a 97% chance of identifying the disease in a diseased individual, but also has a 6% chance of falsely indicating that a healthy person has the disease.
a. What is the probability that a random person has a positive blood test?
b. If a blood test is positive, what’s the probability that the person has the disease?
c. If a blood test is negative, what’s the probability that the person does not have the disease?
A is the event that a person has a disease. P(A) = 0.01; P(A’) = 0.99.
B is the event that the test result is positive. P(B|A) = 0.97; P(B’|A) = 0.03; P(B|A’) = 0.06; P(B’|A’) = 0.94;
(a) P(B) = P(A) P(B|A) + P(A’)P(B|A’) = 0.01*0.97 +0.99 * 0.06 = 0.0691
(b) P(A|B)=P(B|A)*P(A)/P(B) = 0.97* 0.01/0.0691 = 0.1403
(c) P(A’|B’) = P(B’|A’)P(A’)/P(B’)= P(B’|A’)P(A’)/(1-P(B))= 0.94*0.99/(1-.0691)=0.9997
Sums of Random Variables
z = x + y
Var(z) = Var(x) + Var(y) + 2Cov(x,y)
If x,y independent: Var(z) = Var(x) + Var(y)
$$\mu_z = \mu_x + \mu_y$$
Distribution of z:
$$p_z(z) = (p_x \ast p_y)(z) = \int p_x(x)\, p_y(z - x)\, dx$$
Examples:
x and y are uniform on [0,1]. Find p(z = x + y), E(z), Var(z).
x is uniform on [-1,1], and P(y) = 0.5 for y = 0 and y = 10, and 0 elsewhere. Find p(z = x + y), E(z), Var(z).
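As a sanity check of the first example, a minimal Monte Carlo sketch: the density of the sum of two independent uniforms on [0,1] is triangular on [0,2], with E(z) = 1 and Var(z) = 1/12 + 1/12 = 1/6.

```python
import numpy as np

# Monte Carlo check: z = x + y with x, y independent uniform on [0, 1].
rng = np.random.default_rng(0)
z = rng.uniform(0, 1, 1_000_000) + rng.uniform(0, 1, 1_000_000)
print(z.mean())   # ~1.0    (E(z) = E(x) + E(y) = 0.5 + 0.5)
print(z.var())    # ~0.1667 (Var(z) = Var(x) + Var(y) = 1/6 by independence)
```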
Normal Distributions

Gaussian distribution:
$$p(x) = N(\mu_x, \sigma_x^2) = \frac{1}{\sqrt{2\pi}\,\sigma_x}\, e^{-\frac{(x - \mu_x)^2}{2\sigma_x^2}}$$

Mean: $\mu_x = E(x)$

Variance: $\sigma_x^2 = E[(x - \mu_x)^2]$

The Central Limit Theorem says sums of random variables tend toward a Normal distribution.

Mahalanobis distance:
$$r = \frac{|x - \mu_x|}{\sigma_x}$$
Multivariate Normal Density

x is a vector of d Gaussian variables.

$$\mu = E[x] = \int x\, p(x)\, dx$$
$$\Sigma = E[(x - \mu)(x - \mu)^T] = \int (x - \mu)(x - \mu)^T p(x)\, dx$$
$$p(x) = N(\mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

All conditionals and marginals are also Gaussian.

Mahalanobis distance:
$$r^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$$
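A minimal NumPy sketch of these formulas (the mean and covariance values are illustrative, not from the slides):

```python
import numpy as np

def mvn_density(x, mu, sigma):
    """N(mu, sigma) density at x, straight from the formula above."""
    d = len(mu)
    diff = x - mu
    r2 = diff @ np.linalg.inv(sigma) @ diff        # squared Mahalanobis distance
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * r2) / norm

mu = np.zeros(2)
sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])                     # illustrative covariance
x = np.array([1.0, 1.0])
r2 = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)
print(np.sqrt(r2))                                 # Mahalanobis distance r
print(mvn_density(x, mu, sigma))                   # density p(x)
```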
Bivariate Normal Densities
Level curves are ellipses. The x and y widths are determined by the variances, and the eccentricity by the correlation coefficient.
The principal axes are the eigenvectors of the covariance matrix, and the width in these directions is the square root of the corresponding eigenvalue.
Information theory

Key principles:
What is the information contained in a random event? A less probable event contains more information. For two independent events, the information is the sum:
$$I(x) = -\log_2 P(x)$$

What is the average information, or entropy, of a distribution?
$$H(x) = -\sum_x P(x)\, \log_2 P(x)$$

Examples: uniform distribution, Dirac distribution.

Mutual information: the reduction in uncertainty about one variable due to knowledge of the other variable:
$$I_{x,y} = H(x) - H(x|y) = \sum_{x,y} p(x,y)\, \log_2 \frac{p(x,y)}{p(x)\, p(y)}$$
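A short sketch of these definitions (the joint distribution is illustrative):

```python
import numpy as np

def entropy(p):
    """H(x) = -sum_x P(x) log2 P(x), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over 4 outcomes: 2 bits
print(entropy([1.0, 0.0, 0.0, 0.0]))       # Dirac distribution: 0 bits

# Mutual information from a joint table p(x, y):
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])              # illustrative, dependent variables
px, py = joint.sum(axis=1), joint.sum(axis=0)
mi = sum(joint[i, j] * np.log2(joint[i, j] / (px[i] * py[j]))
         for i in range(2) for j in range(2) if joint[i, j] > 0)
print(mi)                                   # > 0: knowing y reduces uncertainty in x
```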