8/8/2019 Boosting and Applications Yuan
Boosting Algorithm and Its
Application
Dan Yuan
Jan 2005
Gambling Strategies
Rules-of-thumb from gambling experts
Maximizing the advantage by using these rules-of-thumb
How to combine these rules-of-thumb into a highly accurate prediction rule?
Boosting
Definition of boosting: boosting refers to the general problem of producing a very accurate prediction rule by combining rough and moderately inaccurate rules-of-thumb.
Boosting procedure
Given a set of labeled training examples (x_i, y_i), i = 1, …, N, where y_i is the label associated with instance x_i
On each round t = 1, …, T:
The booster devises a distribution (importance weighting) D_t over the example set
The booster requests a weak hypothesis (rule-of-thumb) h_t with low error ε_t
After T rounds, the booster combines the weak hypotheses into a single prediction rule.
Conventional Boosting Algorithm
The intuitive idea
Altering the distribution over the domain in a way that increases the probability of the harder parts of the space, thus forcing the weak learner to generate new hypotheses that make fewer mistakes on those parts.
Disadvantages
Requires prior knowledge of the accuracies of the weak hypotheses
The performance bound depends only on the accuracy of the least accurate weak hypothesis
Adaboost
The framework
The learner receives examples (x_i, y_i), i = 1, …, N, chosen randomly according to some fixed but unknown distribution P on X × Y
The learner finds a hypothesis h_f which is consistent with most of the samples, i.e. h_f(x_i) = y_i for most 1 ≤ i ≤ N
The algorithm
Input variables
P: the distribution from which the training examples are sampled
D: the distribution over all the training examples
WeakLearn: a weak learning algorithm to be boosted
T: the specified number of iterations
Adaboost (contd)
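The algorithm itself appears on this slide only as a figure. A minimal sketch of binary AdaBoost, assuming labels in {-1, +1} and decision stumps as the weak learner (the `stump_learner` helper is illustrative, not part of the original deck):

```python
import numpy as np

def adaboost(X, y, T, weak_learn):
    """Binary AdaBoost: y in {-1, +1}; weak_learn(X, y, D) returns a
    hypothesis h trained to have low error under the distribution D."""
    N = len(y)
    D = np.full(N, 1.0 / N)              # initial distribution over examples
    hyps, alphas = [], []
    for t in range(T):
        h = weak_learn(X, y, D)
        eps = D[h(X) != y].sum()         # weighted error of the weak hypothesis
        if eps >= 0.5:                   # no better than chance: stop boosting
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        hyps.append(h); alphas.append(alpha)
        # Shrink the weight of correctly classified examples, grow the mistakes
        D *= np.exp(-alpha * y * h(X))
        D /= D.sum()
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hyps)))

def stump_learner(X, y, D):
    """Illustrative weak learner: exhaustive 1-D decision stumps."""
    best = None
    for thr in np.unique(X):
        for sign in (+1, -1):
            err = D[sign * np.where(X <= thr, 1, -1) != y].sum()
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda Xq, thr=thr, sign=sign: sign * np.where(Xq <= thr, 1, -1)
```

Run on an interval-shaped concept, which no single stump can represent, a few rounds of reweighting already drive the training error to zero.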
Advantages of adaboost
AdaBoost adapts to the errors of the weak hypotheses returned by WeakLearn.
Unlike the conventional boosting algorithms, the prior error need not be known ahead of time.
The update rule reduces the probability assigned to those examples on which the hypothesis makes a good prediction and increases the probability of the examples on which the prediction is poor.
The error bound
Suppose the weak learning algorithm WeakLearn, when called by AdaBoost, generates hypotheses with errors ε_1, …, ε_T. Then the error ε = Pr_{i∼D}[h_f(x_i) ≠ y_i] of the final hypothesis h_f output by AdaBoost is bounded above by

ε ≤ 2^T ∏_{t=1}^{T} √(ε_t(1 − ε_t))

Note that the errors generated by WeakLearn need not be uniform, and the final error depends on the errors of all of the weak hypotheses. Recall that the errors of the previous boosting algorithms depend only on the maximal error of the weakest hypothesis, ignoring the advantage that can be gained from the hypotheses whose errors are smaller.
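The bound is easy to evaluate numerically; a small illustrative sketch (not from the slides):

```python
import math

def adaboost_error_bound(eps):
    """Upper bound 2^T * prod_t sqrt(eps_t * (1 - eps_t)) on the training
    error of the final hypothesis, given weak-hypothesis errors eps_t."""
    return (2 ** len(eps)) * math.prod(math.sqrt(e * (1 - e)) for e in eps)

# Weak hypotheses only slightly better than chance still drive the bound down:
# each round contributes a factor 2*sqrt(0.4*0.6) = 0.98 < 1.
print(adaboost_error_bound([0.4] * 10))   # ≈ 0.815
```

With ε_t = 0.5 exactly (chance-level learners) every factor equals 1 and the bound stays vacuous, as the formula suggests.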
The error bound (contd)
Alternative formulation: if ε_t = 1/2 − γ_t, then

ε ≤ ∏_{t=1}^{T} √(1 − 4γ_t²) = exp( −Σ_{t=1}^{T} KL(1/2 ‖ 1/2 − γ_t) ) ≤ exp( −2 Σ_{t=1}^{T} γ_t² )

where KL(a ‖ b) = a ln(a/b) + (1 − a) ln((1 − a)/(1 − b)) is the Kullback-Leibler divergence.
Also, if we assume that the errors of all the hypotheses satisfy ε_t ≤ 1/2 − γ, then

ε ≤ exp(−2γ²T)

which means that as the number of iterations goes to infinity, the upper bound on the final hypothesis error approaches zero.
The generalization error
The generalization error, ε_g = Pr_{(x,y)∼P}[h_f(x) ≠ y], is the error of the final hypothesis evaluated outside the training set.
The goal: make the generalization error close to the empirical error on the training set.
One natural way of achieving this is to restrict the weak learner to choose its hypotheses from some class of simple functions, and to restrict T, the number of weak hypotheses.
The generalization error (contd)
The choice of the class of weak hypotheses is specific to the real learning problem, and at the least it should reflect the knowledge about the properties of the unknown concept.
An upper bound on the VC-dimension of the concept class can be used for the choice of T.
Vapnik's Theorem
States how close the empirical error and the generalization error will be.
The generalization error: ε_g = Pr_{(x,y)∼P}[h(x) ≠ y]
The empirical error from N examples: ε̂ = |{i : h(x_i) ≠ y_i}| / N
For any η > 0 we have

Pr[ ∃h ∈ H : |ε_g(h) − ε̂(h)| > O(√(dT/N)) ] ≤ η

where d is the VC-dimension of the weak hypothesis class.
Minimization of generalization error
Let h_f^T be the hypothesis generated by running AdaBoost for T iterations. By combining the observed empirical error of h_f^T with the given bounds, we can compute an upper bound on the generalization error of h_f^T for all T, and then select the hypothesis that minimizes the guaranteed upper bound.
Alternatively, cross-validation can be used for choosing T.
Multi-class Extensions
The previous discussion is restricted to binary classification problems. In general the label set Y can have any number of labels, giving a multi-class problem.
The multi-class case (AdaBoost.M1) requires the accuracy of each weak hypothesis to be greater than 1/2. This condition is stronger in the multi-class case than in the binary classification case.
AdaBoost.M1
The algorithm
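The algorithm slide is a figure in the original deck. A minimal sketch of AdaBoost.M1, following the β_t = ε_t/(1 − ε_t) reweighting and the weighted-vote final hypothesis (the small ε floor is an added numerical guard, not part of the algorithm):

```python
import numpy as np

def adaboost_m1(X, y, labels, T, weak_learn):
    """AdaBoost.M1: multi-class boosting. weak_learn(X, y, D) returns a
    hypothesis h with h(X) yielding one label per example."""
    N = len(y)
    D = np.full(N, 1.0 / N)
    hyps, betas = [], []
    for t in range(T):
        h = weak_learn(X, y, D)
        wrong = h(X) != y
        eps = D[wrong].sum()
        if eps >= 0.5:                    # M1 needs accuracy better than 1/2
            break
        beta = max(eps, 1e-10) / (1 - eps)
        hyps.append(h); betas.append(beta)
        D[~wrong] *= beta                 # shrink correctly classified examples
        D /= D.sum()
    def final(Xq):
        # weighted vote: each h_t casts log(1/beta_t) for its predicted label
        votes = np.stack([sum(np.log(1 / b) * (h(Xq) == l)
                              for h, b in zip(hyps, betas)) for l in labels])
        return np.asarray(labels)[np.argmax(votes, axis=0)]
    return final
```

The final hypothesis weighs each weak learner by log(1/β_t), so more accurate rounds dominate the vote.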
Error Upper Bound of Adaboost.M1
Like the binary classification case, the error of the
final hypothesis is also bounded.
ε ≤ 2^T ∏_{t=1}^{T} √(ε_t(1 − ε_t))
Adaboost.M2
AdaBoost.M2 introduces a degree of belief over all the labels rather than a single-label output.
For instance, h(x, y) measures the degree to which it is believed that y is the correct label associated with x.
The original prediction error is replaced with the pseudo-loss, which can focus the learner on the labels that are hardest to discriminate.
The pseudo-loss
For a fixed training example (x_i, y_i), we use a given hypothesis h to answer k − 1 questions, one for each incorrect label y ≠ y_i:
Which is the label of x_i: y or y_i?
The probability of choosing the incorrect answer y to this question is
(1/2)(1 − h(x_i, y_i) + h(x_i, y))
The weighted average probability (pseudo-loss) of answering all the k − 1 questions incorrectly is
ploss_q(h, i) = (1/2)(1 − h(x_i, y_i) + Σ_{y≠y_i} q(i, y) h(x_i, y))
where q is called the label weighting function and sums to 1 over the incorrect labels.
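The formula above transcribes directly into code; a small sketch (the dictionary-based representation of q(i, ·) is an assumption for illustration):

```python
def pseudo_loss(h, x_i, y_i, q_i, labels):
    """Pseudo-loss of hypothesis h on example (x_i, y_i).
    h(x, y) in [0, 1] is the degree of belief that y is the label of x;
    q_i maps each incorrect label y to its weight q(i, y), summing to 1."""
    wrong = [y for y in labels if y != y_i]
    return 0.5 * (1 - h(x_i, y_i) + sum(q_i[y] * h(x_i, y) for y in wrong))
```

A hypothesis that is confidently correct has pseudo-loss 0, while a uniformly uncertain one has pseudo-loss 1/2, so AdaBoost.M2 only needs weak hypotheses with pseudo-loss slightly below 1/2.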
The pseudo-loss (contd)
The weak learner's goal is to minimize the expected pseudo-loss for a given distribution D and weighting function q:
ploss_{D,q}(h) = E_{i∼D}[ploss_q(h, i)]
As we can see, by manipulating both the distribution over instances and the label weighting function q, the boosting algorithm forces the weak learner to focus not only on the hard instances, but also on the incorrect class labels that are hardest to eliminate.
The algorithm: AdaBoost.M2
Error Upper Bound of Adaboost.M2
Like the previous case, the error of the final hypothesis is also bounded:
ε ≤ (k − 1) 2^T ∏_{t=1}^{T} √(ε_t(1 − ε_t))
where the ε_t are now the pseudo-losses of the weak hypotheses.
Detecting Pedestrians Using Patterns of Motion and Appearance
Paul Viola, Michael J. Jones, Daniel Snow
The System
A pedestrian detection system using image intensity information and motion information, with the detectors trained by AdaBoost.
It is the first approach to combine both the appearance and the motion information in a single detector.
Advantages:
High efficiency
High detection rate and low false positive rate
Rectangle Filters
Rectangle filters measure the difference between region averages at various scales, orientations and aspect ratios.
However, the information each filter carries is limited and needs to be boosted to perform accurate classification.
Motion information
Information about the direction of motion can be extracted from the difference between shifted versions of the second image in time and the first image.
Motion filters (direction, shear, magnitude) operate on 5 images:
Δ = abs(I_t − I_{t+1})
U = abs(I_t − I_{t+1} ↑)
L = abs(I_t − I_{t+1} ←)
R = abs(I_t − I_{t+1} →)
D = abs(I_t − I_{t+1} ↓)
where ↑, ←, → and ↓ denote shifting the image up, left, right and down.
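A minimal sketch of computing the five difference images. The one-pixel shift and the `np.roll` wrap-around boundary handling are simplifying assumptions (a real implementation would pad instead):

```python
import numpy as np

def motion_images(I_t, I_t1, shift=1):
    """The five difference images used by the motion filters:
    abs differences between I_t and shifted copies of I_{t+1}."""
    delta = np.abs(I_t - I_t1)
    U = np.abs(I_t - np.roll(I_t1, -shift, axis=0))  # I_{t+1} shifted up
    D = np.abs(I_t - np.roll(I_t1,  shift, axis=0))  # shifted down
    L = np.abs(I_t - np.roll(I_t1, -shift, axis=1))  # shifted left
    R = np.abs(I_t - np.roll(I_t1,  shift, axis=1))  # shifted right
    return delta, U, L, R, D
```

For a pattern that moved one pixel to the right between frames, shifting I_{t+1} back to the left re-aligns it with I_t, so that one difference image is near zero while the others are not; this asymmetry is what the direction filters exploit.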
An example
Motion Direction and Shear Filters
Motion Direction Filter
f_i = r_i(Δ) − r_i(S),  S ∈ {U, L, R, D}
where r_i is a single-box rectangular sum. These filters extract information related to the likelihood that a particular region is moving in a given direction.
Motion Shear Filter
f_j = φ_j(S),  S ∈ {U, L, R, D}
using the rectangle filters φ_j.
Motion Magnitude Filter and Appearance Filter
Motion Magnitude Filter
f_k = r_k(S),  S ∈ {U, L, R, D}
where r_k is a single-box rectangular sum within the detection window.
Appearance Filter: a rectangle filter that operates on the first input image,
f_m = φ(I_t)
Integral Image
The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:
ii(x, y) = Σ_{x′≤x, y′≤y} i(x′, y′)
where ii(x, y) is the integral image and i(x, y) is the original image. It can be computed in a single pass over the image using
s(x, y) = s(x, y − 1) + i(x, y)
ii(x, y) = ii(x − 1, y) + s(x, y)
where s(x, y) is the cumulative row sum.
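A minimal sketch, using row/column array indexing rather than the (x, y) convention above; the two cumulative sums play the role of the recurrences:

```python
import numpy as np

def integral_image(img):
    """ii[r, c] = sum of img over all pixels above and to the left, inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] from the integral image: four array
    references regardless of the box size."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total
```

This O(1) box sum is what makes evaluating the many rectangle filters cheap enough for a sliding-window detector.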
Scale-invariance
Scale-invariance is achieved during the training process.
A pyramid of different scales is built from a base resolution, and the difference images are computed at each pyramid level l:
Δ_l = abs(I_t^l − I_{t+1}^l)
U_l = abs(I_t^l − I_{t+1}^l ↑)
L_l = abs(I_t^l − I_{t+1}^l ←)
R_l = abs(I_t^l − I_{t+1}^l →)
D_l = abs(I_t^l − I_{t+1}^l ↓)
Training Filters
The rectangle filters can have any size, aspect ratio or position as long as they fit in the detection window; therefore, there is a very large number of possible motion and appearance filters, from which a learning algorithm selects to build classifiers.
The Classifier (contd)
A classifier, C, is a thresholded sum of features:
C(I_t, I_{t+1}) = 1 if Σ_{i=1}^{N} F_i(I_t, I_{t+1}, Δ, U, L, R, D) > θ, and 0 otherwise
A feature, F, is simply a thresholded filter that outputs one of two votes:
F_i(I_t, I_{t+1}, Δ, U, L, R, D) = α if f_i(I_t, I_{t+1}, Δ, U, L, R, D) > t_i, and β otherwise
where t_i is a feature threshold and f_i is one of the motion or appearance filters. The real-valued α and β are computed during AdaBoost learning.
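The feature/classifier structure can be sketched as follows. The dictionary of images and the toy box-sum filter are assumptions for illustration, not the paper's representation:

```python
import numpy as np

def feature(f, t_i, alpha, beta):
    """A feature F: votes alpha if the filter response exceeds its
    threshold t_i, and beta otherwise."""
    return lambda imgs: alpha if f(imgs) > t_i else beta

def classifier(features, theta):
    """A classifier C: thresholded sum of the feature votes."""
    return lambda imgs: 1 if sum(F(imgs) for F in features) > theta else 0

# Toy usage: `imgs` stands in for the (I_t, I_{t+1}, delta, U, L, R, D) set,
# and the filter is a plain box sum over the delta image.
imgs = {"delta": np.ones((2, 2))}                            # box sum = 4
F1 = feature(lambda im: im["delta"].sum(), 3.0, 2.0, -1.0)   # fires: +2.0
F2 = feature(lambda im: im["delta"].sum(), 5.0, 2.0, -1.0)   # silent: -1.0
C = classifier([F1, F2], theta=0.5)
```

The classifier threshold θ is what is later adjusted per cascade stage to trade detection rate against false positives.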
Training Process
The training process uses AdaBoost to select a subset of features F that minimizes the weighted error, in order to construct the classifier.
In each round, the learning algorithm chooses a set of filters from the motion and appearance filters.
It also picks the optimal threshold t_i for each feature, as well as the votes α and β.
The output of AdaBoost is a linear combination of the selected features.
Training Process
A cascade architecture is used to raise the
efficiency of the system.
The true and false positives passed at the
current stage will be used in the next stage of
the cascade. The goal is to reduce the false
positive rate faster than the detection rate.
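The early-rejection behavior of the cascade can be sketched as follows (the `stages` list of stage classifiers and the window representation are illustrative assumptions):

```python
def cascade_detect(stages, window):
    """Evaluate a detection window through a cascade of classifiers.
    A window is rejected as soon as any stage says no, so the many easy
    negatives exit early and never reach the expensive later stages."""
    for stage in stages:
        if stage(window) == 0:
            return 0        # rejected at this stage
    return 1                # survived every stage: report a detection
```

Because each stage only needs to pass nearly all true positives while discarding a fraction of the negatives, the overall false positive rate falls multiplicatively while the detection rate stays high.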
Experiments
Each classifier in the cascade is trained using the original positive examples and the same number of false positives from the previous stage (or negative examples, at the first stage).
The resulting classifier of the previous stage is used as the input of the current stage to build a new classifier with a lower false positive rate.
The detection threshold is set using a validation set of image pairs.
Training samples
A small sample of positive training examples. A pair of image patterns comprises a single example for training.
Training the cascade
A large number of motion and appearance filters is used for training the dynamic pedestrian detector.
A smaller number of appearance filters is used for training the static pedestrian detector.
Training results
The first five filters learned for the dynamic pedestrian detector. The six images used in the motion and appearance representation are shown for each filter.
The first five filters learned for the static pedestrian detector.
Testing
Detection for the dynamic detector
Testing
Detection for the static detector
Thanks