Learning and Design of Principal Curves

Balázs Kégl, Adam Krzyżak, Tamás Linder, Kenneth Zeger

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 3, pp. 281-297, 2000.

Abstract

Principal curves have been defined as "self-consistent" smooth curves which pass through the "middle" of a $d$-dimensional probability distribution or data cloud. They give a summary of the data and also serve as an efficient feature extraction tool. We take a new approach by defining principal curves as continuous curves of a given length which minimize the expected squared distance between the curve and points of the space randomly chosen according to a given distribution. The new definition makes it possible to theoretically analyze principal curve learning from training data and it also leads to a new practical construction. Our theoretical learning scheme chooses a curve from a class of polygonal lines with $k$ segments and with a given total length, to minimize the average squared distance over $n$ training points drawn independently. Convergence properties of this learning scheme are analyzed and a practical version of this theoretical algorithm is implemented. In each iteration of the algorithm a new vertex is added to the polygonal line and the positions of the vertices are updated so that they minimize a penalized squared distance criterion. Simulation results demonstrate that the new algorithm compares favorably with previous methods both in terms of performance and computational complexity, and is more robust to varying data models.

Key words: learning systems, unsupervised learning, feature extraction, vector quantization, curve fitting, piecewise linear approximation.

B. Kégl and T. Linder are with the Department of Mathematics and Statistics, Queen's University, Kingston, Ontario, Canada K7L 3N6 (email: {kegl,linder}@mast.queensu.ca). A. Krzyżak is with the Department of Computer Science, Concordia University, 1450 de Maisonneuve Blvd. West, Montreal PQ, Canada H3G 1M8 (email: [email protected]). K. Zeger is with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093-0407 (email: [email protected]). This research was supported in part by the National Science Foundation and NSERC.

1 Introduction

Principal component analysis is perhaps the best known technique in multivariate analysis and is used in dimension reduction, feature extraction, and in image coding and enhancement. Consider a $d$-dimensional random vector $X = (X_1, \ldots, X_d)$ with finite second moments. The first principal component line for $X$ is a straight line which has the property that the expected value of the squared Euclidean distance from $X$ to the line is minimum among all straight lines. This property makes the first principal component a concise one-dimensional approximation to the distribution of $X$, and the projection of $X$ to this line gives the best linear summary of the data. For elliptical distributions the first principal component is also self-consistent, i.e., any point of the line is the conditional expectation of $X$ over those points of the space which project to this point.

Hastie [1] and Hastie and Stuetzle [2] (hereafter HS) generalized the self-consistency property of principal components and introduced the notion of principal curves. Let $f(t) = (f_1(t), \ldots, f_d(t))$ be a smooth (infinitely differentiable) curve in $\mathbb{R}^d$ parametrized by $t \in \mathbb{R}$, and for any $x \in \mathbb{R}^d$ let $t_f(x)$ denote the largest parameter value $t$ for which the distance between $x$ and $f(t)$ is minimized (see Figure 1). More formally, the projection index $t_f(x)$ is defined by

$$t_f(x) = \sup \left\{ t : \|x - f(t)\| = \inf_{\tau} \|x - f(\tau)\| \right\} \tag{1}$$

where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^d$.

Figure 1: Projecting points to a curve.

By the HS definition, the smooth curve $f(t)$ is a principal curve if the following hold:

(i) $f$ does not intersect itself,

(ii) $f$ has finite length inside any bounded subset of $\mathbb{R}^d$,

(iii) $f$ is self-consistent, i.e., $f(t) = E[X \mid t_f(X) = t]$.

Intuitively, self-consistency means that each point of $f$ is the average (under the distribution of $X$) of all points that project there. Thus, principal curves are smooth self-consistent curves which pass through the "middle" of the distribution and provide a good one-dimensional nonlinear summary of the data.

Based on the self-consistency property, HS developed an algorithm for constructing principal curves. Similar in spirit to the Generalized Lloyd Algorithm (GLA) of vector quantizer design [3], the HS algorithm iterates between a projection step and an expectation step. When the probability density of $X$ is known, the HS algorithm for constructing principal curves is the following.

Step 0: Let $f^{(0)}(t)$ be the first principal component line for $X$. Set $j = 1$.

Step 1: Define $f^{(j)}(t) = E\left[ X \mid t_{f^{(j-1)}}(X) = t \right]$.

Step 2: Set $t_{f^{(j)}}(x) = \max\left\{ t : \|x - f^{(j)}(t)\| = \min_{\tau} \|x - f^{(j)}(\tau)\| \right\}$ for all $x \in \mathbb{R}^d$.

Step 3: Compute $\Delta(f^{(j)}) = E\|X - f^{(j)}(t_{f^{(j)}}(X))\|^2$. If $|\Delta(f^{(j)}) - \Delta(f^{(j-1)})| <$ threshold, then stop. Otherwise, let $j = j + 1$ and go to Step 1.

In practice, the distribution of $X$ is often unknown, but a data set consisting of $n$ samples of the underlying distribution is known instead. In the HS algorithm for data sets, the expectation in Step 1 is replaced by a "smoother" (locally weighted running lines [4]) or a nonparametric regression estimate (cubic smoothing splines). HS provide simulation examples to illustrate the behavior of the algorithm, and describe an application in the Stanford Linear Collider Project [2]. It should be noted that there is no known proof of the convergence of the hypothetical algorithm (Steps 0-3; one main difficulty is that Step 1 can produce nondifferentiable curves while principal curves are differentiable by definition). However, extensive testing on simulated and real examples has not revealed any convergence problems for the practical implementation of the algorithm [2].
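To make the projection/expectation iteration concrete, the following is a minimal sketch of one pass of the data-based variant in Python. A plain running-mean smoother stands in for the locally weighted running-lines or spline smoother that HS actually use, and all names (project_to_polyline, hs_iteration) are our own illustration rather than the authors' code.

```python
import numpy as np

def project_to_polyline(x, vertices):
    """Squared distance and arc-length projection index of the point x
    on the polygonal line with the given vertices (cf. eq. (1))."""
    best_d2, best_t, t0 = np.inf, 0.0, 0.0
    for a, b in zip(vertices[:-1], vertices[1:]):
        seg = b - a
        L = np.linalg.norm(seg)
        u = 0.0 if L == 0 else float(np.clip(np.dot(x - a, seg) / L**2, 0.0, 1.0))
        d2 = float(np.sum((x - (a + u * seg)) ** 2))
        if d2 < best_d2:
            best_d2, best_t = d2, t0 + u * L
        t0 += L
    return best_d2, best_t

def hs_iteration(data, vertices, span=10):
    """One projection/expectation pass: project the points onto the
    current curve, order them by projection index, then replace the
    curve by a running mean of the ordered points (a crude smoother)."""
    t = np.array([project_to_polyline(x, vertices)[1] for x in data])
    order = np.argsort(t)
    return np.array([data[order[max(0, i - span):i + span + 1]].mean(axis=0)
                     for i in range(len(data))])
```

Note that the resulting curve has as many vertices as data points, which matches the observation below that the HS algorithm with scatterplot smoothing produces polygonal lines with $n$ segments.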

Alternative definitions and methods for estimating principal curves have been given subsequent to Hastie and Stuetzle's groundbreaking work. Banfield and Raftery [5] (hereafter BR) modeled the outlines of ice floes in satellite images by closed principal curves, and they developed a robust method which reduces the bias in the estimation process. Their method of clustering about principal curves led to a fully automatic method for identifying ice floes and their outlines. Singh et al. [6] used principal curves to extract skeletal structures of hand-written characters in faded documents. Reinhard and Niranjan [7] applied principal curves to model the short-time spectrum of speech signals. They found that principal curves can be used efficiently to capture transitional information between phones. Chang and Ghosh [8], [9] combined the HS and the BR algorithms to improve the performance of the principal curve algorithm, and used the modified algorithm for nonlinear feature extraction and pattern classification. On the theoretical side, Tibshirani [10] introduced a semiparametric model for principal curves and proposed a method for estimating principal curves using the EM algorithm. Close connections between principal curves and Kohonen's self-organizing maps were pointed out by Mulier and Cherkassky [11]. Recently, Delicado [12] proposed yet another definition based on a property of the first principal components of multivariate normal distributions.

There remains an unsatisfactory aspect of the definition of principal curves in the original HS paper as well as in subsequent works. Although principal curves have been defined to be nonparametric, their existence has been proven only for some special densities such as radially symmetric densities and for the uniform density on a rectangle or an annulus in the plane [13]. At present, it is an open problem whether HS principal curves exist for all "reasonable" densities. It is also unknown how the hypothetical HS algorithm behaves for a probability density for which an HS principal curve does not exist. At the same time, the problem of existence makes it difficult to theoretically analyze (in terms of consistency and convergence rates) any estimation scheme for HS principal curves.

In this paper we propose a new definition of principal curves to resolve this problem. In the new definition, a principal curve is a continuous curve of a given length $L$ which minimizes the expected squared distance between $X$ and the curve. In Section 2 (Lemma 1) we prove that for any $X$ with finite second moments there always exists a principal curve in the new sense. We also discuss connections between the newly defined principal curves and optimal vector quantizers. Then we propose a theoretical learning scheme in which the model classes are polygonal lines with $k$ segments and with a given length, and the algorithm chooses a curve from this class which minimizes the average squared distance over $n$ training points. In Theorem 1 we prove that with $k$ suitably chosen as a function of $n$, the expected squared distance of the curve trained on $n$ data points converges to the expected squared distance of the principal curve at a rate $O(n^{-1/3})$ as $n \to \infty$.

Two main features distinguish this learning scheme from the HS algorithm. First, the polygonal line estimate of the principal curve is determined via minimizing a data-dependent criterion directly related to the definition of principal curves. This facilitates the theoretical analysis of the performance. Second, the complexity of the resulting polygonal line is determined by the number of segments $k$, which when optimally chosen is typically much less than $n$. This agrees with our mental image that principal curves should provide a concise summary of the data. In contrast, for $n$ data points the HS algorithm with scatterplot smoothing produces polygonal lines with $n$ segments.

Though amenable to analysis, our theoretical algorithm is computationally burdensome to implement. In Section 3 we develop a suboptimal algorithm for learning principal curves. The practical algorithm produces polygonal line approximations to the principal curve just as the theoretical method does, but global optimization is replaced by a less complex iterative descent method. In Section 4 we give simulation results and compare our algorithm with previous work. In general, on examples considered by HS, the performance of the new algorithm is comparable with the HS algorithm, while it proves to be more robust to changes in the data generating model. In Section 4 we also report some preliminary results on applying the polygonal line algorithm to find the medial axes ("skeletons") of pixel templates of hand-written characters.

2 Learning Principal Curves with a Length Constraint

A curve in $d$-dimensional Euclidean space is a continuous function $f: I \to \mathbb{R}^d$, where $I$ is a closed interval of the real line. Let the expected squared distance between $X$ and $f$ be denoted by

$$\Delta(f) = E\left[ \inf_t \|X - f(t)\|^2 \right] = E\|X - f(t_f(X))\|^2 \tag{2}$$

where the projection index $t_f(x)$ is given in (1). Let $f$ be a smooth (infinitely differentiable) curve and for $\lambda \in \mathbb{R}$ consider the perturbation $f + \lambda g$ of $f$ by a smooth curve $g$ such that $\sup_t \|g(t)\| \le 1$ and $\sup_t \|g'(t)\| \le 1$. HS proved that $f$ is a principal curve if and only if $f$ is a critical point of the distance function in the sense that for all such $g$,

$$\left. \frac{\partial \Delta(f + \lambda g)}{\partial \lambda} \right|_{\lambda = 0} = 0.$$

It is not hard to see that an analogous result holds for principal component lines if the perturbation $g$ is a straight line. In this sense the HS principal curve definition is a natural generalization of principal components. Also, it is easy to check that principal components are in fact principal curves if the distribution of $X$ is elliptical.

An unfortunate property of the HS definition is that in general it is not known if principal curves exist for a given source density. To resolve this problem we go back to the defining property of the first principal component. A straight line $s(t)$ is the first principal component if and only if

$$E\left[ \min_t \|X - s(t)\|^2 \right] \le E\left[ \min_t \|X - \hat{s}(t)\|^2 \right]$$

for any other straight line $\hat{s}(t)$. We wish to generalize this property of the first principal component and define principal curves so that they minimize the expected squared distance over a class of curves rather than only being critical points of the distance function. To do this it is necessary to constrain the length¹ of the curve, since otherwise for any $X$ with a density and any $\varepsilon > 0$ there exists a smooth curve $f$ such that $\Delta(f) \le \varepsilon$, and thus a minimizing $f$ has infinite length. On the other hand, if the distribution of $X$ is concentrated on a polygonal line and is uniform there, the infimum of the squared distances $\Delta(f)$ is 0 over the class of smooth curves, but no smooth curve can achieve this infimum. For this reason, we relax the requirement that $f$ be differentiable and instead we constrain the length of $f$. Note that by the definition of curves, $f$ is still continuous. We give the following new definition of principal curves.

¹For the definition of length for nondifferentiable curves see Appendix A, where some basic facts concerning curves in $\mathbb{R}^d$ have been collected from [14].

Definition 1. A curve $f^*$ is called a principal curve of length $L$ for $X$ if $f^*$ minimizes $\Delta(f)$ over all curves of length less than or equal to $L$.

A useful advantage of the new definition is that principal curves of length $L$ always exist if $X$ has finite second moments, as the next result shows.

Lemma 1. Assume that $E\|X\|^2 < \infty$. Then for any $L > 0$ there exists a curve $f^*$ with $l(f^*) \le L$ such that

$$\Delta(f^*) = \inf \{ \Delta(f) : l(f) \le L \}.$$

The proof of the lemma is given in Appendix A.

Note that we have dropped the requirement of the HS definition that principal curves be non-intersecting. In fact, Lemma 1 does not hold in general for non-intersecting curves of length $L$ without further restricting the distribution of $X$, since there are distributions for which the minimum of $\Delta(f)$ is achieved only by an intersecting curve even though non-intersecting curves can arbitrarily approach this minimum. Note also that neither the HS nor our definition guarantees the uniqueness of principal curves. In our case, there might exist several principal curves for a given length constraint $L$, but each of these will have the same (minimal) squared loss.

Remark (Connection with vector quantization). Our new definition of principal curves has been inspired by the notion of an optimal vector quantizer. A vector quantizer maps a point in $\mathbb{R}^d$ to the closest point in a fixed set (called a codebook) $\{y_1, \ldots, y_k\} \subset \mathbb{R}^d$. The code points $y_1^*, \ldots, y_k^* \in \mathbb{R}^d$ correspond to an optimal $k$-point vector quantizer if

$$E\left[ \min_i \|X - y_i^*\|^2 \right] \le E\left[ \min_i \|X - y_i\|^2 \right]$$

for any other collection of $k$ points $y_1, \ldots, y_k \in \mathbb{R}^d$. In other words, the points $y_1^*, \ldots, y_k^*$ give the best $k$-point representation of $X$ in the mean squared sense. Optimal vector quantizers are important in lossy data compression, speech and image coding [15], and clustering [16]. There is a strong connection between the definition of an optimal vector quantizer and our definition of a principal curve. Both minimize the same expected squared distance criterion. A vector quantizer is constrained to have at most $k$ points, whereas a principal curve has a constrained length. This connection is further illustrated by a recent work of Tarpey et al. [17] who define points $y_1, \ldots, y_k$ to be self-consistent if

$$y_i = E\left[ X \mid X \in S_i \right],$$

where $S_1, \ldots, S_k$ are the "Voronoi regions" defined as $S_i = \{ x : \|x - y_i\| \le \|x - y_j\|, \; j = 1, \ldots, k \}$ (ties are broken arbitrarily). Thus our principal curves correspond to optimal vector quantizers ("principal points" by the terminology of [17]) while the HS principal curves correspond to self-consistent points.

While principal curves of a given length always exist, it appears difficult to demonstrate concrete examples, unless the distribution of $X$ is discrete or is concentrated on a curve. It is presently unknown what principal curves look like with a length constraint for even the simplest continuous multivariate distributions such as the Gaussian. However, this fact in itself does not limit the operational significance of principal curves. Analogously, for $k \ge 3$ code points there are no known concrete examples of optimal vector quantizers for even the most common model distributions such as Gaussian, Laplacian, or uniform (in a hypercube) in any dimension $d \ge 2$. Nevertheless, algorithms for quantizer design attempting to find near-optimal vector quantizers are of great theoretical and practical interest. In what follows we consider the problem of principal curve design based on training data.

Suppose that $n$ independent copies $X_1, \ldots, X_n$ of $X$ are given. These are called the training data and they are assumed to be independent of $X$. The goal is to use the training data to construct a curve of length at most $L$ whose expected squared loss is close to that of a principal curve for $X$.

Our method is based on a common model in statistical learning theory (e.g., see [18]). We consider classes $\mathcal{S}_1, \mathcal{S}_2, \ldots$ of curves of increasing complexity. Given $n$ data points drawn independently from the distribution of $X$, we choose a curve as the estimator of the principal curve from the $k$th model class $\mathcal{S}_k$ by minimizing the empirical error. By choosing the complexity of the model class appropriately as the size of the training data grows, the chosen curve represents the principal curve with increasing accuracy.

We assume that the distribution of $X$ is concentrated on a closed and bounded convex set $K \subset \mathbb{R}^d$. A basic property of convex sets in $\mathbb{R}^d$ shows that there exists a principal curve of length $L$ inside $K$ (see [19, Lemma 1]), and so we will only consider curves in $K$.

Let $\mathcal{S}$ denote the family of curves taking values in $K$ and having length not greater than $L$. For $k \ge 1$ let $\mathcal{S}_k$ be the set of polygonal (piecewise linear) curves in $K$ which have $k$ segments and whose lengths do not exceed $L$. Note that $\mathcal{S}_k \subset \mathcal{S}$ for all $k$. Let

$$\Delta(x, f) = \min_t \|x - f(t)\|^2 \tag{3}$$

denote the squared distance between a point $x \in \mathbb{R}^d$ and the curve $f$. For any $f \in \mathcal{S}$ the empirical squared error of $f$ on the training data is the sample average

$$\Delta_n(f) = \frac{1}{n} \sum_{i=1}^{n} \Delta(X_i, f) \tag{4}$$

where we have suppressed in the notation the dependence of $\Delta_n(f)$ on the training data. Let our theoretical algorithm² choose an $f_{k,n} \in \mathcal{S}_k$ which minimizes the empirical error, i.e.,

$$f_{k,n} = \arg\min_{f \in \mathcal{S}_k} \Delta_n(f). \tag{5}$$

We measure the efficiency of $f_{k,n}$ in estimating $f^*$ by the difference $J(f_{k,n})$ between the expected squared loss of $f_{k,n}$ and the optimal expected squared loss achieved by $f^*$, i.e., we let

$$J(f_{k,n}) = \Delta(f_{k,n}) - \Delta(f^*) = \Delta(f_{k,n}) - \min_{f \in \mathcal{S}} \Delta(f).$$

Since $\mathcal{S}_k \subset \mathcal{S}$, we have $J(f_{k,n}) \ge 0$. Our main result in this section proves that if the number of data points $n$ tends to infinity, and $k$ is chosen to be proportional to $n^{1/3}$, then $J(f_{k,n})$ tends to zero at a rate $J(f_{k,n}) = O(n^{-1/3})$.

Theorem 1. Assume that $P\{X \in K\} = 1$ for a bounded and closed convex set $K$, let $n$ be the number of training points, and let $k$ be chosen to be proportional to $n^{1/3}$. Then the expected squared loss of the empirically optimal polygonal line with $k$ segments and length at most $L$ converges, as $n \to \infty$, to the squared loss of the principal curve of length $L$ at a rate

$$J(f_{k,n}) = O(n^{-1/3}).$$

The proof of the theorem is given in Appendix B. To establish the result we use techniques from statistical learning theory (e.g., see [20]). First, the approximating capability of the class of curves $\mathcal{S}_k$ is considered, and then the estimation (generalization) error is bounded via covering the class of curves $\mathcal{S}_k$ with $\varepsilon$ accuracy (in the squared distance sense) by a discrete set of curves. When these two bounds are combined, one obtains

$$J(f_{k,n}) \le \sqrt{\frac{k\,C(L, D, d)}{n}} + \frac{DL + 2}{k} + O(n^{-1/2}) \tag{6}$$

where the term $C(L, D, d)$ depends only on the dimension $d$, the length $L$, and the diameter $D$ of the support of $X$, but is independent of $k$ and $n$. The two error terms are balanced by choosing $k$ to be proportional to $n^{1/3}$, which gives the convergence rate of Theorem 1.

²The term "hypothetical algorithm" might appear to be more accurate since we have not shown that an algorithm for finding $f_{k,n}$ exists. However, an algorithm clearly exists which can approximate $f_{k,n}$ with arbitrary accuracy in a finite number of steps (consider polygonal lines whose vertices are restricted to a fine rectangular grid). The proof of Theorem 1 shows that such approximating curves can replace $f_{k,n}$ in the analysis.

Note that although the constant hidden in the $O$ notation depends on the dimension $d$, the exponent of $n$ is dimension-free. This is not surprising in view of the fact that the class of curves $\mathcal{S}$ is equivalent in a certain sense to the class of Lipschitz functions $f: [0, 1] \to K$ such that $\|f(x) - f(y)\| \le L|x - y|$ (see Appendix A). It is known that the $\varepsilon$-entropy, defined by the logarithm of the $\varepsilon$ covering number, is roughly proportional to $1/\varepsilon$ for such function classes [21]. Using this result, the convergence rate $O(n^{-1/3})$ can be obtained by considering $\varepsilon$-covers of $\mathcal{S}$ directly (without using the model classes $\mathcal{S}_k$) and picking the empirically optimal curve in this cover. The use of the classes $\mathcal{S}_k$ has the advantage that they are directly related to the practical implementation of the algorithm given in the next section.

Note also that even though Theorem 1 is valid for any given length constraint $L$, the theoretical algorithm itself gives little guidance about how to choose $L$. This choice depends on the particular application and heuristic considerations are likely to enter here. One example is given in Section 3, where a practical implementation of the polygonal line algorithm is used to recover a "generating curve" from noisy observations.

Finally, we note that the proof of Theorem 1 also provides information on the distribution of the expected squared error of $f_{k,n}$ given the training data $X_1, \ldots, X_n$. In particular, it is shown at the end of the proof that for all $n$ and $k$, and $\delta$ such that $0 < \delta < 1$, with probability at least $1 - \delta$ we have

$$E\left[ \Delta(X, f_{k,n}) \mid X_1, \ldots, X_n \right] - \Delta(f^*) \le \sqrt{\frac{k\,C(L, D, d) - D^4 \log(\delta/2)}{n}} + \frac{DL + 2}{k} \tag{7}$$

where $\log$ denotes the natural logarithm and $C(L, D, d)$ is the same constant as in (6).
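Before turning to the practical algorithm, here is a small illustrative sketch of the empirical criterion (4) that the theoretical algorithm (5) minimizes over $\mathcal{S}_k$; it reuses the project_to_polyline helper from the sketch in Section 1 and is our own illustration, not part of the authors' construction.

```python
def empirical_error(data, vertices):
    """Delta_n(f) of eq. (4): average squared distance of the training
    points from the polygonal line given by `vertices`."""
    return float(np.mean([project_to_polyline(x, vertices)[0] for x in data]))
```

Minimizing this quantity over all polygonal lines in $\mathcal{S}_k$ (for instance, over vertices restricted to a fine grid, as in the footnote above) is exactly what makes the theoretical algorithm computationally burdensome.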

3 The Polygonal Line Algorithm

Given a set of data points $\mathcal{X}_n = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$, the task of finding a polygonal curve with $k$ segments and length $L$ which minimizes $\frac{1}{n}\sum_{i=1}^{n} \Delta(x_i, f)$ is computationally difficult. We propose a suboptimal method with reasonable complexity which also picks the length $L$ of the principal curve automatically. The basic idea is to start with a straight line segment $f_{0,n}$, the shortest segment of the first principal component line which contains all of the projected data points, and in each iteration of the algorithm to increase the number of segments by one by adding a new vertex to the polygonal line $f_{k,n}$ produced in the previous iteration. After adding a new vertex, the positions of all vertices are updated in an inner loop.

The inner loop consists of a projection step and an optimization step. In the projection step the data points are partitioned into "nearest neighbor regions" according to which segment or vertex they project onto. In the optimization step the new position of a vertex $v_i$ is determined by minimizing an average squared distance criterion penalized by a measure of the local curvature, while all other vertices are kept fixed.

Figure 2: The curves $f_{k,n}$ produced by the polygonal line algorithm for $n = 100$ data points. The data was generated by adding independent Gaussian errors to both coordinates of a point chosen randomly on a half circle. (a) $f_{1,n}$, (b) $f_{2,n}$, (c) $f_{4,n}$, (d) $f_{15,n}$ (the output of the algorithm).

These two steps are iterated so that the optimization step is applied to each vertex $v_i$, $i = 1, \ldots, k+1$, in a cyclic fashion (so that after $v_{k+1}$, the procedure starts again with $v_1$), until convergence is achieved and $f_{k,n}$ is produced. Then a new vertex is added.

The algorithm stops when $k$ exceeds a threshold $c(n, \Delta)$. This stopping criterion is based on a heuristic complexity measure, determined by the number of segments $k$, the number of data points $n$, and the average squared distance $\Delta_n(f_{k,n})$.

The flow chart of the algorithm is given in Figure 3. The evolution of the curve produced by the algorithm is illustrated in Figure 2. As with the HS algorithm, we have no formal proof that the practical algorithm converges, but in practice, after extensive testing, it seems to converge consistently.

It should be noted that the two core components of the algorithm, the projection and the vertex optimization steps, are combined with more heuristic elements such as the stopping condition and the form of the penalty term (8) of the optimization step. The heuristic parts of the algorithm have been tailored to the task of recovering an underlying generating curve for a distribution based on a finite data set of randomly drawn points (see the experimental results in Section 4). When the algorithm is intended for an application with a different objective (such as robust signal compression), the core components can be kept unchanged but the heuristic elements may be replaced according to the new objectives.

[Flow chart: START → Initialization → Projection → Vertex optimization → Convergence? (N: back to Projection; Y: continue) → $k > c(n, \Delta)$? (N: Add new vertex and return to Projection; Y: END).]

Figure 3: The flow chart of the polygonal line algorithm.

3.1 The Projection Step

Let $f$ denote a polygonal line with vertices $v_1, \ldots, v_{k+1}$ and line segments $s_1, \ldots, s_k$, such that $s_i$ connects vertices $v_i$ and $v_{i+1}$. In this step the data set $\mathcal{X}_n$ is partitioned into (at most) $2k+1$ disjoint sets $V_1, \ldots, V_{k+1}$ and $S_1, \ldots, S_k$, the nearest neighbor regions of the vertices and segments of $f$, respectively, in the following manner. For any $x \in \mathbb{R}^d$ let $\Delta(x, s_i)$ be the squared distance from $x$ to $s_i$ (see definition (3)), let $\Delta(x, v_i) = \|x - v_i\|^2$, and let

$$V_i = \left\{ x \in \mathcal{X}_n : \Delta(x, v_i) = \Delta(x, f), \; \Delta(x, v_i) < \Delta(x, v_m), \; m = 1, \ldots, i-1 \right\}.$$

Upon setting $V = \bigcup_{i=1}^{k+1} V_i$, the $S_i$ sets are defined by

$$S_i = \left\{ x \in \mathcal{X}_n : x \notin V, \; \Delta(x, s_i) = \Delta(x, f), \; \Delta(x, s_i) < \Delta(x, s_m), \; m = 1, \ldots, i-1 \right\}.$$

The resulting partition is illustrated in Figure 4.

As a result of introducing the nearest neighbor regions $S_i$ and $V_i$, the polygonal line algorithm substantially differs from methods based on the self-organizing map. Although we optimize the positions of the vertices of the curve, the distances of the data points are measured from the segments and vertices of the curve onto which they project, while the self-organizing map measures distances exclusively from the vertices. Our principle makes it possible to use a relatively small number of vertices and still obtain a good approximation to an underlying generating curve.

Figure 4: A nearest neighbor partition of $\mathbb{R}^2$ induced by the vertices and segments of $f$. The nearest point of $f$ to any point in the set $V_i$ is the vertex $v_i$. The nearest point of $f$ to any point in the set $S_i$ is a point of the line segment $s_i$.
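A direct (unoptimized) rendering of this projection step in Python, under the assumption that ties are resolved in favor of vertices as in the definitions above:

```python
import numpy as np

def partition(data, vertices):
    """Split the data among the nearest-neighbor regions V_1..V_{k+1}
    (vertices) and S_1..S_k (segments) of the polygonal line."""
    k = len(vertices) - 1
    V = [[] for _ in range(k + 1)]
    S = [[] for _ in range(k)]
    for x in data:
        dv = [float(np.sum((x - v) ** 2)) for v in vertices]      # Delta(x, v_i)
        ds = []
        for a, b in zip(vertices[:-1], vertices[1:]):             # Delta(x, s_i)
            seg, L2 = b - a, float(np.sum((b - a) ** 2))
            u = 0.0 if L2 == 0 else float(np.clip(np.dot(x - a, seg) / L2, 0, 1))
            ds.append(float(np.sum((x - (a + u * seg)) ** 2)))
        iv, js = int(np.argmin(dv)), int(np.argmin(ds))
        # a point goes to a vertex region when that vertex is as close as
        # the nearest segment point, otherwise to the nearest segment region
        if dv[iv] <= ds[js]:
            V[iv].append(x)
        else:
            S[js].append(x)
    return V, S
```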

3.2 The Vertex Optimization Step

In this step the new position of a vertex $v_i$ is determined. In the theoretical algorithm the average squared distance $\Delta_n(f)$ is minimized subject to the constraint that $f$ is a polygonal line with $k$ segments and length not exceeding $L$. One could use a Lagrangian formulation and attempt to find a new position for $v_i$ (while all other vertices are fixed) such that the penalized squared error $\Delta_n(f) + \lambda l(f)^2$ is minimized. Although this direct length penalty can work well in certain applications, it yields poor results in terms of recovering a smooth generating curve. In particular, this approach is very sensitive to the choice of $\lambda$ and tends to produce curves which, similarly to the HS algorithm, exhibit a "flattening" estimation bias towards the center of the curvature.

To reduce the estimation bias, we penalize sharp angles between line segments. At inner vertices $v_i$, $3 \le i \le k-1$, we penalize the sum of the cosines of the three angles at vertices $v_{i-1}$, $v_i$, and $v_{i+1}$. The cosine function is convex in the interval $[\pi/2, \pi]$ and its derivative is zero at $\pi$, which makes it especially suitable for the steepest descent algorithm. To make the algorithm invariant under scaling, we multiply the cosines by the square of the "radius" of the data, defined by $r = \max_{x \in \mathcal{X}_n} \left\| x - \frac{1}{n}\sum_{y \in \mathcal{X}_n} y \right\|$. Note that the chosen penalty formulation is related to the original principle of penalizing the length of the curve. At inner vertices, since only one vertex is moved at a time, penalizing sharp angles indirectly penalizes long segments. At the endpoints and at their immediate neighbors ($v_i$, $i = 1, 2, k, k+1$), where penalizing sharp angles does not translate to penalizing long line segments, the penalty on a nonexistent angle is replaced by a direct penalty on the squared length of the first (or last) segment.

Formally, let $\gamma_i$ denote the angle at vertex $v_i$, let $\pi(v_i) = r^2 (1 + \cos \gamma_i)$, let $\mu_+(v_i) = \|v_i - v_{i+1}\|^2$, and let $\mu_-(v_i) = \|v_i - v_{i-1}\|^2$. Then the penalty $P(v_i)$ at vertex $v_i$ is given by

$$P(v_i) = \begin{cases} \mu_+(v_i) + \pi(v_{i+1}) & \text{if } i = 1 \\ \mu_-(v_i) + \pi(v_i) + \pi(v_{i+1}) & \text{if } i = 2 \\ \pi(v_{i-1}) + \pi(v_i) + \pi(v_{i+1}) & \text{if } 2 < i < k \\ \pi(v_{i-1}) + \pi(v_i) + \mu_+(v_i) & \text{if } i = k \\ \pi(v_{i-1}) + \mu_-(v_i) & \text{if } i = k+1. \end{cases} \tag{8}$$

The local measure of the average squared distance is calculated from the data points which project to $v_i$ or to the line segment(s) starting at $v_i$ (see the projection step). Accordingly, let

$$\sigma_+(v_i) = \sum_{x \in S_i} \Delta(x, s_i), \qquad \sigma_-(v_i) = \sum_{x \in S_{i-1}} \Delta(x, s_{i-1}), \qquad \nu(v_i) = \sum_{x \in V_i} \Delta(x, v_i).$$

Now define the local average squared distance as a function of $v_i$ by

$$\Delta_n(v_i) = \begin{cases} \nu(v_i) + \sigma_+(v_i) & \text{if } i = 1 \\ \sigma_-(v_i) + \nu(v_i) + \sigma_+(v_i) & \text{if } 1 < i < k+1 \\ \sigma_-(v_i) + \nu(v_i) & \text{if } i = k+1. \end{cases} \tag{9}$$

We use an iterative steepest descent method to minimize

$$G(v_i) = \frac{1}{n} \Delta_n(v_i) + \lambda_p \frac{1}{k+1} P(v_i)$$

where $\lambda_p > 0$. The local squared distance term $\Delta_n(v_i)$ and the local penalty term $P(v_i)$ are normalized by the number of data points $n$ and the number of vertices $(k+1)$, respectively, to keep the global objective function $\sum_{i=1}^{k+1} G(v_i)$ approximately of the same magnitude if the number of data points drawn from the source distribution or the number of line segments of the polygonal curve is changed.

We search for a local minimum of $G(v_i)$ in the direction of the negative gradient of $G(v_i)$ by using a procedure similar to Newton's method. Then the gradient is recomputed and the line search is repeated. The iteration stops when the relative improvement of $G(v_i)$ is less than a preset threshold. It should be noted that $\Delta_n(v_i)$ is not differentiable at any point $v_i$ such that at least one data point falls on the boundary of a nearest neighbor region $S_{i-1}$, $S_i$, or $V_i$. Thus $G(v_i)$ is only piecewise differentiable and the variant of Newton's method we use cannot guarantee that the global objective function $\sum_{i=1}^{k+1} G(v_i)$ will always decrease in the optimization step. During extensive test runs, however, the algorithm was observed to always converge. Furthermore, we note that this part of the algorithm is modular, i.e., the procedure we are using can be substituted with a more sophisticated optimization routine at the expense of increased computational complexity.
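As a stand-in for the Newton-like line search described above, the following sketch performs one plain finite-difference gradient step on $G(v_i)$. It reuses partition() and penalty() from the previous sketches; the step size and the differencing scheme are our own illustrative choices, not the authors' procedure:

```python
def optimize_vertex(data, vertices, i, lam_p, r, step=0.01, h=1e-5):
    """One descent update of vertex v_i (1-based) with the others fixed."""
    n, k = len(data), len(vertices) - 1

    def G(vi):
        w = vertices.copy()
        w[i - 1] = vi
        V, S = partition(data, w)                         # projection step
        local = sum(float(np.sum((x - vi) ** 2)) for x in V[i - 1])  # nu(v_i)
        for j in (i - 1, i):                              # sigma_- and sigma_+
            if 1 <= j <= k:
                a, b = w[j - 1], w[j]
                seg, L2 = b - a, float(np.sum((b - a) ** 2))
                for x in S[j - 1]:
                    u = 0.0 if L2 == 0 else float(np.clip(np.dot(x - a, seg) / L2, 0, 1))
                    local += float(np.sum((x - (a + u * seg)) ** 2))
        return local / n + lam_p * penalty(w, i, r) / (k + 1)

    vi = vertices[i - 1].astype(float)
    grad = np.array([(G(vi + h * e) - G(vi - h * e)) / (2 * h)
                     for e in np.eye(len(vi))])
    vertices[i - 1] = vi - step * grad
    return vertices
```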

One important issue is the amount of smoothing required for a given data set. In the HS algorithm one needs to determine the penalty coefficient of the spline smoother, or the span of the scatterplot smoother. In our algorithm, the corresponding parameter is the curvature penalty factor $\lambda_p$. If some a priori knowledge about the distribution is available, one can use it to determine the smoothing parameter. However, in the absence of such knowledge, the coefficient should be data-dependent. Based on heuristic considerations explained below, and after carrying out practical experiments, we set $\lambda_p = \lambda_p' \, k \, n^{-1/3} \, \Delta_n(f_{k,n})^{1/2} \, r^{-1}$, where $\lambda_p'$ is an experimentally determined constant.

By setting the penalty to be proportional to the average distance of the data points from the curve we avoid the zig-zagging behavior of the curve resulting from overfitting when the noise is relatively large. At the same time, this penalty factor allows the principal curve to closely follow the generating curve when the generating curve itself is a polygonal line with sharp angles and the data is concentrated on this curve (the noise is very small). The penalty is set to be proportional to the number of segments $k$ because in our experiments we have found that the algorithm is more likely to avoid local minima if a small penalty is imposed initially and the penalty is gradually increased as the number of segments grows. Since the stopping condition (Section 3.4) indicates that the final number of line segments is proportional to the cube root of the data size, we normalize $k$ by $n^{1/3}$ in the penalty term. The penalty factor is also normalized by the radius of the data to obtain scale independence. The value of the parameter $\lambda_p'$ was determined by experiments, and was set to the constant $0.13$.

3.3 Adding a New Vertex

We start with the optimized $f_{k,n}$ and choose the segment that has the largest number of data points projecting to it. If more than one such segment exists, we choose the longest one. The midpoint of this segment is selected as the new vertex. Formally, let $I = \{ i : |S_i| \ge |S_j|, \; j = 1, \ldots, k \}$, and $\ell = \arg\max_{i \in I} \|v_i - v_{i+1}\|$. Then the new vertex is $v_{\text{new}} = (v_\ell + v_{\ell+1})/2$.
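In code (reusing the segment partition $S_1, \ldots, S_k$ produced by partition() in Section 3.1's sketch):

```python
def add_vertex(vertices, S):
    """Insert the midpoint of the most populated (longest, on ties)
    segment as a new vertex."""
    counts = np.array([len(s) for s in S])
    candidates = np.flatnonzero(counts == counts.max())      # the set I
    lengths = [np.linalg.norm(vertices[i + 1] - vertices[i]) for i in candidates]
    l = int(candidates[int(np.argmax(lengths))])
    return np.insert(vertices, l + 1, (vertices[l] + vertices[l + 1]) / 2.0, axis=0)
```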

3.4 Stopping Condition

According to the theoretical results of Section 2, the number of segments $k$ is an important factor that controls the balance between the estimation and approximation errors, and it should be proportional to $n^{1/3}$ to achieve the $O(n^{-1/3})$ convergence rate for the expected squared distance. Although the theoretical bounds are not tight enough to determine the optimal number of segments for a given data size, we found that $k \sim n^{1/3}$ works in practice. We also found that, similarly to the penalty factor $\lambda_p$, the final value of $k$ should also depend on the average squared distance to achieve robustness. If the variance of the noise is relatively small, we can keep the approximation error low by allowing a relatively large number of segments. On the other hand, when the variance of the noise is large (implying a high estimation error), a low approximation error does not improve the overall performance significantly, so in this case a smaller number of segments can be chosen.

The stopping condition blends these two considerations. The algorithm stops when $k$ exceeds

$$c\left(n, \Delta_n(f_{k,n})\right) = \beta \, n^{1/3} \, \Delta_n(f_{k,n})^{-1/2} \, r \tag{10}$$

where $\beta$ is a parameter of the algorithm which was determined by experiments and was set to the constant value $0.3$.
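Putting the pieces together, a top-level loop of the algorithm might look as follows. This is a simplified sketch: the initialization follows the description at the start of Section 3 (the principal-component segment spanning the projected data), the inner loop runs a fixed number of cyclic passes instead of testing relative improvement, and the defaults $\beta = 0.3$ and $\lambda_p' = 0.13$ are the constants given in the text. It reuses empirical_error, partition, optimize_vertex, and add_vertex from the earlier sketches:

```python
def polygonal_line_algorithm(data, beta=0.3, lam_prime=0.13):
    """Sketch of the polygonal line algorithm of Section 3."""
    n = len(data)
    center = data.mean(axis=0)
    r = max(float(np.linalg.norm(x - center)) for x in data)    # data radius
    # initialization: segment of the first principal component line
    _, _, vt = np.linalg.svd(data - center, full_matrices=False)
    t = (data - center) @ vt[0]
    vertices = np.array([center + t.min() * vt[0], center + t.max() * vt[0]])
    while True:
        k = len(vertices) - 1
        delta = max(empirical_error(data, vertices), 1e-12)     # Delta_n(f_k,n)
        lam_p = lam_prime * k * n ** (-1 / 3) * delta ** 0.5 / r
        for _ in range(10):                                     # inner loop:
            for i in range(1, k + 2):                           # cyclic passes
                vertices = optimize_vertex(data, vertices, i, lam_p, r)
        if k > beta * n ** (1 / 3) * delta ** (-0.5) * r:       # condition (10)
            return vertices
        _, S = partition(data, vertices)
        vertices = add_vertex(vertices, S)
```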

Note that in a practical sense, the number of segments plays a more important role in determining the computational complexity of the algorithm than in measuring the quality of the approximation. Experiments showed that, due to the data-dependent curvature penalty and the constraint that only one vertex is moved at a time, the number of segments can increase even beyond the number of data points without any indication of overfitting. While increasing the number of segments beyond a certain limit offers only marginal improvement in the approximation, it causes the algorithm to slow down considerably. Therefore, in on-line applications where speed has priority over precision, it is reasonable to use a smaller number of segments than indicated by (10), and if "aesthetic" smoothness is an issue, to fit a spline through the vertices of the curve.

3.5 Computational Complexity

The complexity of the inner loop is dominated by the complexity of the projection step, which is $O(nk)$. Increasing the number of segments one at a time (as described in Section 3.3), the complexity of the algorithm to obtain $f_{k,n}$ is $O(nk^2)$. Using the stopping condition of Section 3.4, the computational complexity of the algorithm becomes $O(n^{5/3})$. This is slightly better than the $O(n^2)$ complexity of the HS algorithm.

The complexity can be dramatically decreased in certain situations. One possibility is to add more than one vertex at a time. For example, if instead of adding only one vertex, a new vertex is placed at the midpoint of every segment, then we can reduce the computational complexity for producing $f_{k,n}$ to $O(nk \log k)$. One can also set $k$ to be a constant if the data size is large, since increasing $k$ beyond a certain threshold brings only diminishing returns. Also, $k$ can naturally be set to a constant in certain applications, giving $O(nk)$ computational complexity. These simplifications work well in certain situations, but the original algorithm is more robust.

4 Experimental Results

We have extensively tested the proposed algorithm on two-dimensional data sets. In most experiments the data was generated by a commonly used (see, e.g., [2], [10], [11]) additive model

$$X = Y + e \tag{11}$$

where $Y$ is uniformly distributed on a smooth planar curve (hereafter called the generating curve) and $e$ is bivariate additive noise which is independent of $Y$.
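For instance, the circular data sets used below (see Figure 5) can be simulated in a few lines; the helper name and the seed are our own illustration:

```python
import numpy as np

def generate_circle_data(n=100, radius=1.0, sigma=0.2, seed=0):
    """Model (11) with a circular generating curve: Y uniform on the
    circle, e zero-mean uncorrelated Gaussian noise (sigma = 0.2 gives
    the variance 0.04 used in Figure 5(a))."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    Y = radius * np.column_stack((np.cos(theta), np.sin(theta)))
    return Y + rng.normal(scale=sigma, size=(n, 2))
```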

In Section 4.1 we compare the polygonal line algorithm, the HS algorithm, and, for closed generating curves, the BR algorithm [5]. The various methods are compared subjectively, based mainly on how closely the resulting curve follows the shape of the generating curve. We use varying generating shapes, noise parameters, and data sizes to demonstrate the robustness of the polygonal line algorithm. For the case of a circular generating curve we also evaluate in a quantitative manner how well the polygonal line algorithm approximates the generating curve as the data size grows and as the noise variance decreases.

In Section 4.2 we show two scenarios in which the polygonal line algorithm (along with the HS algorithm) fails to produce meaningful results. In the first, the high number of abrupt changes in the direction of the generating curve causes the algorithm to oversmooth the principal curve, even when the data is concentrated on the generating curve. This is a typical situation in which the penalty parameter $\lambda_p'$ should be decreased. In the second scenario, the generating curve is too complex (e.g., it contains loops, or it has the shape of a spiral), so the algorithm fails to find the global structure of the data if the process is started from the first principal component. To recover the generating curve, one must replace the initialization step by a more sophisticated routine that approximately captures the global structure of the data.

In Section 4.3 an application in feature extraction is briefly outlined. We depart from the synthetic data generating model in (11) and use an extended version of the polygonal line algorithm to find the medial axes ("skeletons") of pixel templates of hand-written characters. Such skeletons can be used in hand-written character recognition and compression of hand-written documents.

4.1 Experiments with the generating curve model

In general, in simulation examples considered by HS the performance of the new algorithm is comparable with the HS algorithm. Due to the data dependence of the curvature penalty factor and the stopping condition, our algorithm turns out to be more robust to alterations in the data generating model, as well as to changes in the parameters of the particular model.

We use model (11) with varying generating shapes, noise parameters, and data sizes to demonstrate the robustness of the polygonal line algorithm. All plots show the generating curve, the curve produced by our polygonal line algorithm (Polygonal principal curve), and the curve produced by the HS algorithm with spline smoothing (HS principal curve), which we have found to perform better than the HS algorithm using scatterplot smoothing. For closed generating curves we also include the curve produced by the BR algorithm [5] (BR principal curve), which extends the HS algorithm to closed curves. The two coefficients of the polygonal line algorithm are set in all experiments to the constant values $\beta = 0.3$ and $\lambda_p' = 0.13$.

In Figure 5(a) the generating curve is a circle of radius $r = 1$, and $e = (e_1, e_2)$ is a zero-mean bivariate uncorrelated Gaussian with variance $E(e_i^2) = 0.04$, for $i = 1, 2$. The performance of the three algorithms (HS, BR, and the polygonal line algorithm) is comparable, although the HS algorithm exhibits more bias than the other two. Note that the BR algorithm [5] has been tailored to fit closed curves and to reduce the estimation bias. In Figure 5(b), only half of the circle is used as a generating curve and the other parameters remain the same. Here, too, both the HS and our algorithm behave similarly.

When we depart from these usual settings the polygonal line algorithm exhibits better behavior than the HS algorithm. In Figure 6(a) the data was generated similarly to the data set of Figure 5, and then it was linearly transformed using the matrix $\begin{pmatrix} 0.7 & 0.4 \\ -0.8 & 1.0 \end{pmatrix}$. In Figure 6(b) the transformation $\begin{pmatrix} -1.0 & -1.2 \\ 1.0 & -0.2 \end{pmatrix}$ was used. The original data set was generated by an S-shaped generating curve, consisting of two half circles of unit radii, to which the same Gaussian noise was added as in Figure 5. In both cases the polygonal line algorithm produces curves that fit the generator curve more closely. This is especially noticeable in Figure 6(a), where the HS principal curve fails to follow the shape of the distorted half circle.

There are two situations in which we expect our algorithm to perform particularly well. If the distribution is concentrated on a curve, then according to both the HS and our definitions the principal curve is the generating curve itself. Thus, if the noise variance is small, we expect both algorithms to very closely approximate the generating curve. The data in Figure 7(a) was generated using the same additive Gaussian model as in Figure 5, but the noise variance was reduced to $E(e_i^2) = 0.0001$ for $i = 1, 2$. In this case we found that the polygonal line algorithm outperformed both the HS and the BR algorithms.

Figure 5: (a) Circle, 100 data points: the BR and the polygonal line algorithms show less bias than the HS algorithm. (b) Half circle, 100 data points: the HS and the polygonal line algorithms produce similar curves.

Figure 6: Transformed data sets: (a) distorted half circle, 100 data points; (b) distorted S-shape, 100 data points. The polygonal line algorithm still follows the "distorted" shapes fairly closely.

The second case is when the sample size is large. Although the generating curve is not necessarily the principal curve of the distribution, it is natural to expect the algorithm to approximate the generating curve well as the sample size grows. Such a case is shown in Figure 7(b), where $n = 10000$ data points were generated (but only 2000 of these were actually plotted). Here the polygonal line algorithm approximates the generating curve with much better accuracy than the HS algorithm.

Figure 7: Small noise variance (a) and large sample size (b): (a) circle, 100 data points; (b) S-shape, 10000 data points. The curves produced by the polygonal line algorithm are nearly indistinguishable from the generating curves.

Although in the model (11) the generating curve is in general not the principal curve in our definition (or in the HS definition), it is of interest to numerically evaluate how well the polygonal line algorithm approximates the generating curve. In these experiments the generating curve $g(t)$ is a circle of unit radius centered at the origin and the noise is zero-mean bivariate uncorrelated Gaussian. We chose 21 different data sizes ranging from 10 to 10000, and 7 different noise standard deviations ranging from $\sigma = 0.01$ to $\sigma = 0.4$. For the measure of approximation we chose the average distance defined by

$$\delta = \frac{1}{l(f)} \int \min_s \|f(t) - g(s)\| \, dt,$$

where the polygonal line $f$ is parametrized by its arc length. To eliminate the distortion occurring at the endpoints, we initialized the polygonal line algorithm by an equilateral triangle inscribed in the generating circle. For each particular data size and noise variance value, 100 random data sets were generated and the resulting $\delta$ values were averaged over these experiments. The dependence of the average distance $\delta$ on the data size and the noise variance is plotted on a logarithmic scale in Figure 8. The resulting curves justify our informal observation made earlier that the approximation substantially improves as the data size grows, and as the variance of the noise decreases.
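The average distance $\delta$ can be approximated numerically by discretizing both curves. The sketch below assumes the generating curve is supplied as a function g mapping $[0, 1]$ to the plane (e.g., g = lambda t: np.array([np.cos(2*np.pi*t), np.sin(2*np.pi*t)]) for the unit circle); the discretization resolutions are our own choices:

```python
def average_distance(vertices, g, num=1000, gnum=4000):
    """Approximate delta: the average, over the fitted polygonal line
    parametrized by arc length, of the distance to the curve g."""
    segs = np.diff(vertices, axis=0)
    lengths = np.linalg.norm(segs, axis=1)
    cum = np.concatenate(([0.0], np.cumsum(lengths)))
    s = np.linspace(0.0, cum[-1], num)               # arc-length samples
    idx = np.minimum(np.searchsorted(cum, s, side="right") - 1, len(segs) - 1)
    u = (s - cum[idx]) / np.where(lengths[idx] == 0, 1.0, lengths[idx])
    pts = vertices[idx] + u[:, None] * segs[idx]     # points on the fitted line
    gpts = np.array([g(t) for t in np.linspace(0.0, 1.0, gnum)])
    d = np.min(np.linalg.norm(pts[:, None, :] - gpts[None, :, :], axis=2), axis=1)
    return float(d.mean())
```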

Figure 8: The approximation error $\delta$ decreases as $n$ grows or $\sigma$ decreases. (Log-log plot of $\delta$ versus $n$; one curve per noise standard deviation $\sigma = 0.05, 0.1, 0.15, 0.2, 0.3, 0.4$.)

4.2 Failure modes

We describe two specific situations in which the polygonal line algorithm fails to recover the generating curve. In the first scenario, we use zig-zagging generating curves $f_i$ for $i = 2, 3, 4$, consisting of $2^i$ line segments of equal length, such that two consecutive segments join at a right angle (Figure 9). In these experiments, the number of data points generated on a line segment is constant (it is set to 100), and the variance of the bivariate Gaussian noise is $l^2 \cdot 0.0005$, where $l$ is the length of a line segment. Figure 9 shows the principal curves produced by the HS and the polygonal line algorithms in the three experiments. Although the polygonal principal curve follows the generating curve more closely than the HS principal curve in the first two experiments (Figures 9(a) and (b)), the two algorithms produce equally poor results if the number of line segments exceeds a certain limit (Figure 9(c)). The data-dependent penalty term explains this behavior of the polygonal line algorithm. Since the penalty factor $\lambda_p$ is proportional to the number of line segments, the penalty increases relatively as the number of line segments of the generating curve grows. To achieve the same local smoothness across the experiments, the penalty factor should be gradually decreased as the number of line segments of the generating curve grows. Indeed, if the constant of the penalty term is reset to $\lambda_p' = 0.02$ in the third experiment, the polygonal principal curve recovers the generating curve with high accuracy (Figure 11(a)).

The second scenario in which the polygonal line algorithm fails to produce a meaningful result is when the generating curve is too complex, so the algorithm does not find the global structure of the data. To test the gradual degradation of the algorithm we used spiral-shaped generating curves of increasing length, i.e., we set $g_i(t) = (t \sin(i\pi t), t \cos(i\pi t))$ for $t \in [0, 1]$ and $i = 1, \ldots, 6$. The variance of the noise was set to $0.0001$, and we generated 1000 data points in each experiment.

Figure 9: Abrupt changes in the direction of the generating curve ((a)-(c): increasingly many direction changes). The polygonal line algorithm oversmoothes the principal curve as the number of direction changes increases.

Figure 10 shows the principal curves produced by the HS and the polygonal line algorithms in three experiments ($i = 3, 4, 6$). In the first two experiments (Figures 10(a) and (b)), the polygonal principal curve is almost indistinguishable from the generating curve, while the HS algorithm either oversmoothes the principal curve (Figure 10(a)) or fails to recover the shape of the generating curve (Figure 10(b)). In the third experiment both algorithms fail to find the shape of the generating curve (Figure 10(c)). The failure here is due to the fact that the algorithm is stuck in a local minimum between the initial curve (the first principal component) and the desired solution (the generating curve). If this is likely to occur in an application, the initialization step must be replaced by a more sophisticated routine that approximately captures the global structure of the data. Figure 11(b) indicates that this indeed works. Here we manually initialize both algorithms by a polygonal line with eight vertices. Using this "hint", the polygonal line algorithm produces an almost perfect solution, while the HS algorithm still cannot recover the shape of the generating curve.

4.3 Recovering smooth character skeletons

In this section we use the polygonal line algorithm to find smooth skeletons of hand-written character templates. The results reported here are preliminary, and the full treatment of this application will be presented in a future publication. Principal curves have been applied by Singh et al. [6] for similar purposes. In [6] the initial trees are produced using a version of the SOM algorithm and then the HS algorithm is applied to extract skeletal structures of hand-written characters. The focus in [6] was on recovering the topological structure of noisy letters in faded documents. Our aim with the polygonal line algorithm is to produce smooth curves which can be used to recover the trajectory of the pen stroke.

Figure 10: Spiral-shaped generating curves ((a)-(c): increasing spiral length). The polygonal line algorithm fails to find the generating curve as the length of the spiral is increased.

Figure 11: Improved performance of the polygonal line algorithm. (a) The penalty parameter is decreased. (b) The algorithms are initialized manually.

To transform black-and-white character templates into two-dimensional data sets, we place the midpoint of the bottom-most left-most pixel of the template at the center of a coordinate system. The unit length of the coordinate system is set to the width (and height) of a pixel, so the midpoint of each pixel has integer coordinates. Then we add the midpoint of each black pixel to the data set.

Thepolygonalline algorithmwastestedonimagesof isolatedhandwrittendigitsfrom theNIST

SpecialDatabase19 [22]. We found that thepolygonalline algorithmcanbeusedeffectively to

21

Page 23: Learning and Design of Principal Curveskegl/research/PDFs/KeKrLiZe00.pdfLearning and Design of Principal Curves Balazs´ Kegl´ Adam Krzyzak˙ Tamas´ Linder Kenneth Zeger IEEE Transactions

find smoothmedialaxesof simpledigitswhichcontainno loopsor crossingsof strokes.Figure12

showssomeof theseresults.

To find smooth skeletons of more complex characters we modified and extended the polygonal line algorithm. We introduced new types of vertices incident to more than two line segments to handle loops and crossings, and modified the vertex optimization step accordingly. The initialization step was also replaced by a more sophisticated routine based on a thinning method [23] to produce an initial graph that approximately captures the topology of the character. Figure 13 shows some of the results of the extended polygonal line algorithm on more complex characters. Details of the extended polygonal line algorithm and more complete testing results will be presented in the future.

Figure 12: Results produced by the polygonal line algorithm on characters not containing loops or crossings.

5 Conclusion

A new definition of principal curves has been offered. The new definition has significant theoretical appeal; the existence of principal curves under this definition can be proved under very general conditions, and a learning method for constructing principal curves from finite data sets lends itself to theoretical analysis.

Inspired by the new definition and the theoretical learning scheme, we have introduced a new practical polygonal line algorithm for designing principal curves.




Figure 13: Results produced by the extended polygonal line algorithm on characters containing loops or crossings.

Lacking theoretical results concerning both the HS and our polygonal line algorithm, we compared the two methods through simulations. We have found that in general our algorithm has either comparable or better performance than the original HS algorithm, and it exhibits better, more robust behavior when the data generating model is varied. We have also reported preliminary results in applying the polygonal line algorithm to the problem of handwritten character skeletonization. We believe that the new principal curve algorithm may also prove useful in other applications, such as data compression and feature extraction, where a compact and accurate description of a pattern or an image is required. These are issues for future work.

Appendix A

Curves in $\mathbb{R}^d$

Let $f : [a,b] \to \mathbb{R}^d$ be a continuous mapping (curve). The length of $f$ over an interval $[\alpha,\beta] \subseteq [a,b]$, denoted by $l(f,\alpha,\beta)$, is defined by
\[ l(f,\alpha,\beta) = \sup \sum_{i=1}^{N} \| f(t_i) - f(t_{i-1}) \| \tag{A.1} \]
where the supremum is taken over all finite partitions of $[\alpha,\beta]$ with arbitrary subdivision points $\alpha = t_0 \le t_1 \le \cdots \le t_N = \beta$ for $N \ge 1$. The length of $f$ over its domain $[a,b]$ is denoted by $l(f)$.



If $l(f) < \infty$, then $f$ is said to be rectifiable. It is well known that $f = (f_1,\dots,f_d)$ is rectifiable if and only if each coordinate function $f_j : [a,b] \to \mathbb{R}$ is of bounded variation.
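The supremum in (A.1) is approached from below by the lengths of inscribed polygons over ever finer partitions; a small numerical sketch of this (ours, assuming NumPy):

    import numpy as np

    def curve_length(f, a=0.0, b=1.0, n=1000):
        # Length of the inscribed polygon over a uniform (n+1)-point
        # partition; it increases toward l(f) under refinement for
        # rectifiable curves.
        t = np.linspace(a, b, n + 1)
        pts = np.array([f(ti) for ti in t])
        return np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

    # Example: the unit circle has length 2*pi.
    print(curve_length(lambda t: np.array([np.cos(t), np.sin(t)]), 0.0, 2 * np.pi))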

Two curves $f : [a,b] \to \mathbb{R}^d$ and $g : [a',b'] \to \mathbb{R}^d$ are said to be equivalent if there exist two nondecreasing continuous real onto functions $\phi : [0,1] \to [a,b]$ and $\eta : [0,1] \to [a',b']$ such that
\[ f(\phi(t)) = g(\eta(t)), \qquad t \in [0,1]. \]
In this case we write $f \sim g$, and it is easy to see that $\sim$ is an equivalence relation. If $f \sim g$, then $l(f) = l(g)$. A curve $g$ over $[a,b]$ is said to be parametrized by its arc length if $l(g,a,t) = t - a$ for any $a \le t \le b$. Let $f$ be a curve over $[a,b]$ with length $L$. It is not hard to see that there exists a unique arc length parametrized curve $g$ over $[0,L]$ such that $f \sim g$.
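As a concrete example (ours, not from the original): the unit circle traversed as
\[ f(t) = (\cos t, \sin t), \qquad t \in [0, 2\pi], \]
is parametrized by its arc length, since $l(f,0,t) = \int_0^t \|f'(s)\|\, ds = t$; its total length is $L = 2\pi$.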

Let $f$ be any curve with length $L' \le L$, and consider the arc length parametrized curve $g \sim f$ with parameter interval $[0,L']$. By definition (A.1), for all $s_1, s_2 \in [0,L']$ we have $\|g(s_1) - g(s_2)\| \le |s_1 - s_2|$. Define $\hat{g}(t) = g(L' t)$ for $0 \le t \le 1$. Then $f \sim \hat{g}$, and $\hat{g}$ satisfies the following Lipschitz condition: for all $t_1, t_2 \in [0,1]$,
\[ \| \hat{g}(t_1) - \hat{g}(t_2) \| = \| g(L' t_1) - g(L' t_2) \| \le L' |t_1 - t_2| \le L |t_1 - t_2|. \tag{A.2} \]
On the other hand, note that if $g$ is a curve over $[0,1]$ which satisfies the Lipschitz condition (A.2), then its length is at most $L$.

Let $f$ be a curve over $[a,b]$ and denote the squared Euclidean distance from any $x \in \mathbb{R}^d$ to $f$ by
\[ \Delta(x,f) = \inf_{a \le t \le b} \| x - f(t) \|^2. \]
Note that if $l(f) < \infty$, then by the continuity of $f$, its graph
\[ G_f = \{ f(t) : a \le t \le b \} \]
is a compact subset of $\mathbb{R}^d$, and the infimum above is achieved for some $t$. Also, since $G_f = G_g$ if $f \sim g$, we also have that $\Delta(x,f) = \Delta(x,g)$ for all $g \sim f$.
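For the polygonal curves used throughout the paper, the infimum is a minimum over segments, and $\Delta(x,f)$ can be computed segment by segment; a sketch of that computation (ours, not the paper's implementation):

    import numpy as np

    def delta(x, vertices):
        # Squared distance from point x to the polygonal curve whose
        # vertices are the rows of `vertices`.
        x = np.asarray(x, dtype=float)
        v = np.asarray(vertices, dtype=float)
        best = np.inf
        for p, q in zip(v[:-1], v[1:]):
            d = q - p
            denom = d @ d
            # Parameter of the projection of x onto [p, q], clipped to the segment.
            t = 0.0 if denom == 0.0 else float(np.clip((x - p) @ d / denom, 0.0, 1.0))
            best = min(best, float(np.sum((x - (p + t * d)) ** 2)))
        return best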

Proof of Lemma 1. Define
\[ \Delta^* = \inf \{ \Delta(f) : l(f) \le L \}. \]
First we show that the above infimum does not change if we add the restriction that all $f$ lie inside a closed sphere $S(r) = \{ x : \|x\| \le r \}$ of large enough radius $r$, centered at the origin. Indeed,

Page 26: Learning and Design of Principal Curveskegl/research/PDFs/KeKrLiZe00.pdfLearning and Design of Principal Curves Balazs´ Kegl´ Adam Krzyzak˙ Tamas´ Linder Kenneth Zeger IEEE Transactions

without excludingnontrivial cases,wecanassumethat∆ e T E � X � 2. Denotethedistributionof X

by µ andchooser c 3L largeenoughsuchthatáSL r W 3M � x � 2µ � dx � c ∆ e V ε (A.3)

for someε c 0. If f is suchthatGf is notentirelycontainedin S� r � , thenfor all x � S� r ~ 3� wehave

∆ � x � f � co� x � 2 sincethediameterof Gf is at mostL. Then(A.3) impliesthat

∆ � f � r áSL r W 3M ∆ � x � f � µ � dx � c ∆ e V ε

andthus

∆ e � inf f ∆ � f � : l � f � ] L � Gf h S� r � g � (A.4)

In view of (A.4) there exists a sequence of curves $\{f_n\}$ such that $l(f_n) \le L$, $G_{f_n} \subseteq S(r)$ for all $n$, and $\Delta(f_n) \to \Delta^*$. By the discussion preceding (A.2), we can assume without loss of generality that all $f_n$ are defined over $[0,1]$ and
\[ \| f_n(t_1) - f_n(t_2) \| \le L |t_1 - t_2| \tag{A.5} \]
for all $t_1, t_2 \in [0,1]$. Consider the set $\mathcal{C}$ of all curves over $[0,1]$ such that $f \in \mathcal{C}$ if and only if $\| f(t_1) - f(t_2) \| \le L |t_1 - t_2|$ for all $t_1, t_2 \in [0,1]$ and $G_f \subseteq S(r)$. It is easy to see that $\mathcal{C}$ is a closed set under the uniform metric $d(f,g) = \sup_{0 \le t \le 1} \| f(t) - g(t) \|$. Also, $\mathcal{C}$ is an equicontinuous family of functions and $\sup_t \| f(t) \|$ is uniformly bounded over $\mathcal{C}$. Thus $\mathcal{C}$ is a compact metric space by the Arzelà-Ascoli theorem (see, e.g., [24]). Since $f_n \in \mathcal{C}$ for all $n$, it follows that there exists a subsequence $f_{n_k}$ converging uniformly to an $f^* \in \mathcal{C}$.

To simplify the notation let us rename $\{f_{n_k}\}$ as $\{f_n\}$. Fix $x \in \mathbb{R}^d$, assume $\Delta(x,f_n) \ge \Delta(x,f^*)$, and let $t_x$ be such that $\Delta(x,f^*) = \| x - f^*(t_x) \|^2$. Then by the triangle inequality,
\begin{align*}
| \Delta(x,f^*) - \Delta(x,f_n) | &= \Delta(x,f_n) - \Delta(x,f^*) \\
&\le \| x - f_n(t_x) \|^2 - \| x - f^*(t_x) \|^2 \\
&\le \big( \| x - f_n(t_x) \| + \| x - f^*(t_x) \| \big)\, \| f_n(t_x) - f^*(t_x) \|.
\end{align*}
By symmetry, a similar inequality holds if $\Delta(x,f_n) < \Delta(x,f^*)$. Since $G_{f^*}, G_{f_n} \subseteq S(r)$ and $E\|X\|^2$ is finite, there exists an $A > 0$ such that
\[ E | \Delta(X,f_n) - \Delta(X,f^*) | \le A \sup_{0 \le t \le 1} \| f_n(t) - f^*(t) \| \]
and therefore
\[ \Delta^* = \lim_{n \to \infty} \Delta(f_n) = \Delta(f^*). \]
Since the Lipschitz condition on $f^*$ guarantees that $l(f^*) \le L$, the proof is complete. $\Box$



Appendix B

Proof of Theorem 1. Let $f^*_k$ denote the curve in $\mathcal{S}_k$ minimizing the squared loss, i.e.,
\[ f^*_k = \operatorname*{arg\,min}_{f \in \mathcal{S}_k} \Delta(f). \]
The existence of a minimizing $f^*_k$ can easily be shown using a simpler version of the proof of Lemma 1. Then $J(f_{k,n})$ can be decomposed as
\[ J(f_{k,n}) = \big( \Delta(f_{k,n}) - \Delta(f^*_k) \big) + \big( \Delta(f^*_k) - \Delta(f^*) \big) \]
where, using standard terminology, $\Delta(f_{k,n}) - \Delta(f^*_k)$ is called the estimation error and $\Delta(f^*_k) - \Delta(f^*)$ is called the approximation error. We consider these terms separately first, and then choose $k$ as a function of the training data size $n$ to balance the obtained upper bounds in an asymptotically optimal way.

Approximation Error

For any two curves $f$ and $g$ of finite length define their (nonsymmetric) distance by
\[ \rho(f,g) = \max_t \min_s \| f(t) - g(s) \|. \]
Note that $\rho(\hat{f},\hat{g}) = \rho(f,g)$ if $\hat{f} \sim f$ and $\hat{g} \sim g$, i.e., $\rho(f,g)$ is independent of the particular choice of the parametrization within equivalence classes. Next we observe that if the diameter of $K$ is $D$, and $G_f, G_g \subseteq K$, then for all $x \in K$,
\[ \Delta(x,g) - \Delta(x,f) \le 2D\rho(f,g) \tag{B.1} \]
and therefore
\[ \Delta(g) - \Delta(f) \le 2D\rho(f,g). \tag{B.2} \]
To prove (B.1), let $x \in K$ and choose $t'$ and $s'$ such that $\Delta(x,f) = \| x - f(t') \|^2$ and $\min_s \| g(s) - f(t') \| = \| g(s') - f(t') \|$. Then
\begin{align*}
\Delta(x,g) - \Delta(x,f) &\le \| x - g(s') \|^2 - \| x - f(t') \|^2 \\
&= \big( \| x - g(s') \| + \| x - f(t') \| \big) \big( \| x - g(s') \| - \| x - f(t') \| \big) \\
&\le 2D \, \| g(s') - f(t') \| \\
&\le 2D \, \rho(f,g).
\end{align*}

∆ � x � g� � ∆ � x � f � ] � x � g � s_ � � 2 �j� x � f � t _ � � 2� N � x � g � s_ � �çV�� x � f � t _ � � RíN � x � g � s_ � �î�j� x � f � t _ � � R] 2D � g � s_ � � f � t _ � �] 2Dρ � f � g� �Let f �Hs beanarbitraryarc lengthparametrizedcurve over l 0 � L _ m , whereL _ï] L. Defineg as

a polygonalcurve with verticesf � 0� � f � L _ ~ k � ������ f �� k � 1� L _ ~ k � � f � L _ � . For any t �ðl 0 � L _ m we have

26

Page 28: Learning and Design of Principal Curveskegl/research/PDFs/KeKrLiZe00.pdfLearning and Design of Principal Curves Balazs´ Kegl´ Adam Krzyzak˙ Tamas´ Linder Kenneth Zeger IEEE Transactions

K t � iL _ ~ k KU] L ~ï� 2k � for somei �ñf 0 ������ k g . Sinceg � s� � f � iL _ ~ k � for somes, wehave

mins� f � t � � g � s� � ] � f � t � � f � iL _ ~ k � �] K t � iL _ ~ k K�] L

2k �Notethatl � g� ] L _ , by construction,andthusg �òs k. Thusfor every f �òs thereexistsag �òs k such

thatρ � f � g� ] L ~ï� 2k � . Now let g �As k besuchthatρ � f e � g� ] L ~ï� 2k � . Thenby (B.2) we conclude

thattheapproximationerroris upperboundedas

∆ � f ek � � ∆ � f e � ] ∆ � g� � ∆ � f e �] 2Dρ � f e � g�] DLk � (B.3)
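A small sketch of this construction, with an empirical check of $\rho(f,g) \le L'/(2k)$ on a circular arc (our example, assuming NumPy):

    import numpy as np

    def polygonal_vertices(f, L1, k):
        # Vertices f(0), f(L'/k), ..., f(L') of the approximating polygon;
        # f is assumed arc length parametrized on [0, L'].
        return np.array([f(s) for s in np.linspace(0.0, L1, k + 1)])

    L1, k = 2 * np.pi, 20
    f = lambda s: np.array([np.cos(s), np.sin(s)])   # arc length parametrized
    g = polygonal_vertices(f, L1, k)
    # Distance to the nearest vertex already upper-bounds min_s ||f(t) - g(s)||.
    rho = max(np.min(np.linalg.norm(g - f(t), axis=1))
              for t in np.linspace(0.0, L1, 2000))
    assert rho <= L1 / (2 * k)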

Estimation Error

For each $\varepsilon > 0$ and $k \ge 1$ let $S_{k,\varepsilon}$ be a finite set of curves in $K$ which form an $\varepsilon$-cover of $\mathcal{S}_k$ in the following sense: for any $f \in \mathcal{S}_k$ there is an $f' \in S_{k,\varepsilon}$ which satisfies
\[ \sup_{x \in K} | \Delta(x,f) - \Delta(x,f') | \le \varepsilon. \tag{B.4} \]
The explicit construction of $S_{k,\varepsilon}$ is given in [19]. Since $f_{k,n} \in \mathcal{S}_k$ (see (5)), there exists an $f'_{k,n} \in S_{k,\varepsilon}$ such that $| \Delta(x,f_{k,n}) - \Delta(x,f'_{k,n}) | \le \varepsilon$ for all $x \in K$. We introduce the compact notation $X^n = (X_1,\dots,X_n)$ for the training data. Thus we can write
\begin{align*}
E[\Delta(X,f_{k,n}) \mid X^n] - \Delta(f^*_k)
&= E[\Delta(X,f_{k,n}) \mid X^n] - \Delta_n(f_{k,n}) + \Delta_n(f_{k,n}) - \Delta(f^*_k) \\
&\le 2\varepsilon + E[\Delta(X,f'_{k,n}) \mid X^n] - \Delta_n(f'_{k,n}) + \Delta_n(f_{k,n}) - \Delta(f^*_k) \tag{B.5} \\
&\le 2\varepsilon + E[\Delta(X,f'_{k,n}) \mid X^n] - \Delta_n(f'_{k,n}) + \Delta_n(f^*_k) - \Delta(f^*_k) \tag{B.6} \\
&\le 2\varepsilon + 2 \max_{f \in S_{k,\varepsilon} \cup \{f^*_k\}} | \Delta(f) - \Delta_n(f) | \tag{B.7}
\end{align*}
where (B.5) follows from the approximating property of $f'_{k,n}$ and the fact that the distribution of $X$ is concentrated on $K$. (B.6) holds because $f_{k,n}$ minimizes $\Delta_n(f)$ over all $f \in \mathcal{S}_k$, and (B.7) follows because, given $X^n = (X_1,\dots,X_n)$, $E[\Delta(X,f'_{k,n}) \mid X^n]$ is an ordinary expectation of the type $E[\Delta(X,f)]$,


$f \in S_{k,\varepsilon}$. Thus for any $t > 2\varepsilon$ the union bound implies
\begin{align*}
P\big\{ E[\Delta(X,f_{k,n}) \mid X^n] - \Delta(f^*_k) > t \big\}
&\le P\Big\{ \max_{f \in S_{k,\varepsilon} \cup \{f^*_k\}} | \Delta(f) - \Delta_n(f) | > \frac{t}{2} - \varepsilon \Big\} \\
&\le \big( |S_{k,\varepsilon}| + 1 \big) \max_{f \in S_{k,\varepsilon} \cup \{f^*_k\}} P\Big\{ | \Delta(f) - \Delta_n(f) | > \frac{t}{2} - \varepsilon \Big\} \tag{B.8}
\end{align*}

where $|S_{k,\varepsilon}|$ denotes the cardinality of $S_{k,\varepsilon}$.

Recall now Hoeffding's inequality [25], which states that if $Y_1, Y_2, \dots, Y_n$ are independent and identically distributed real random variables such that $0 \le Y_i \le A$ with probability one, then for all $u > 0$,
\[ P\left\{ \left| \frac{1}{n} \sum_{i=1}^{n} Y_i - E Y_1 \right| > u \right\} \le 2 e^{-2nu^2/A^2}. \]
Since the diameter of $K$ is $D$, we have $\| x - f(t) \|^2 \le D^2$ for all $x \in K$ and $f$ such that $G_f \subseteq K$. Thus $0 \le \Delta(X,f) \le D^2$ with probability one, and by Hoeffding's inequality, for all $f \in S_{k,\varepsilon} \cup \{f^*_k\}$ we have
\[ P\Big\{ | \Delta(f) - \Delta_n(f) | > \frac{t}{2} - \varepsilon \Big\} \le 2 e^{-2n(t/2-\varepsilon)^2 / D^4} \]
which implies by (B.8) that
\[ P\big\{ E[\Delta(X,f_{k,n}) \mid X^n] - \Delta(f^*_k) > t \big\} \le 2\big( |S_{k,\varepsilon}| + 1 \big) e^{-2n(t/2-\varepsilon)^2 / D^4} \tag{B.9} \]
for any $t > 2\varepsilon$.
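To get a feel for how fast the right-hand side of (B.9) decays with $n$, a quick numerical sketch (all constants below are made-up illustrative values, not from the paper):

    import math

    def deviation_bound(n, t, eps, D, cover_size):
        # Right-hand side of (B.9): 2(|S_{k,eps}| + 1) exp(-2n(t/2 - eps)^2 / D^4).
        return 2.0 * (cover_size + 1) * math.exp(-2.0 * n * (t / 2.0 - eps) ** 2 / D ** 4)

    for n in (10**3, 10**4, 10**5):
        print(n, deviation_bound(n, t=0.5, eps=0.1, D=1.0, cover_size=10**6))

Even a cover of a million curves is overwhelmed by the exponential term once $n$ is in the thousands.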

Using the fact that $EY = \int_0^{\infty} P\{ Y > t \}\, dt$ for any nonnegative random variable $Y$, we can write for any $u > 0$,
\begin{align*}
\Delta(f_{k,n}) - \Delta(f^*_k)
&= \int_0^{\infty} P\big\{ E[\Delta(X,f_{k,n}) \mid X^n] - \Delta(f^*_k) > t \big\}\, dt \\
&\le u + 2\varepsilon + 2\big( |S_{k,\varepsilon}| + 1 \big) \int_{u+2\varepsilon}^{\infty} e^{-2n(t/2-\varepsilon)^2/D^4}\, dt \tag{B.10} \\
&\le u + 2\varepsilon + 2\big( |S_{k,\varepsilon}| + 1 \big) \frac{D^4\, e^{-nu^2/(2D^4)}}{nu} \tag{B.11} \\
&= \sqrt{\frac{2D^4 \log\big( |S_{k,\varepsilon}| + 1 \big)}{n}} + 2\varepsilon + O(n^{-1/2}) \tag{B.12}
\end{align*}

where (B.11) follows from the inequality $\int_x^{\infty} e^{-t^2/2}\, dt < \frac{1}{x} e^{-x^2/2}$ for $x > 0$, and (B.12) follows by setting $u = \sqrt{2D^4 \log( |S_{k,\varepsilon}| + 1 )/n}$, where $\log$ denotes the natural logarithm.
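Spelled out, the step behind (B.11) is the Gaussian tail bound after two changes of variable (our reconstruction of the omitted calculation):
\begin{align*}
\int_{u+2\varepsilon}^{\infty} e^{-2n(t/2-\varepsilon)^2/D^4}\, dt
&= 2 \int_{u/2}^{\infty} e^{-2ns^2/D^4}\, ds && (s = t/2 - \varepsilon) \\
&= \frac{D^2}{\sqrt{n}} \int_{u\sqrt{n}/D^2}^{\infty} e^{-v^2/2}\, dv && (v = 2s\sqrt{n}/D^2) \\
&\le \frac{D^2}{\sqrt{n}} \cdot \frac{D^2}{u\sqrt{n}}\, e^{-nu^2/(2D^4)} = \frac{D^4}{nu}\, e^{-nu^2/(2D^4)},
\end{align*}
which is exactly the last term of (B.11) up to the factor $2(|S_{k,\varepsilon}|+1)$.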

The following lemma, which is proved in [19], demonstrates the existence of a suitable covering set $S_{k,\varepsilon}$.

Lemma 2 For any $\varepsilon > 0$ there exists a finite collection of curves $S_{k,\varepsilon}$ in $K$ such that for every $f \in \mathcal{S}_k$ some $f' \in S_{k,\varepsilon}$ satisfies
\[ \sup_{x \in K} | \Delta(x,f) - \Delta(x,f') | \le \varepsilon \]



and
\[ |S_{k,\varepsilon}| \le \left( \frac{2LD}{\varepsilon} + 3 \right)^{k+1} V_d^{k+1} \left( \frac{3D^2\sqrt{d}}{\varepsilon} + 4\sqrt{d} + 5 \right)^{d} \left( \frac{3LD\sqrt{d}}{k\varepsilon} + 4\sqrt{d} + 5 \right)^{kd} \]
where $V_d$ is the volume of the $d$-dimensional unit sphere and $D$ is the diameter of $K$.

It is not hard to see that setting $\varepsilon = 1/k$ in Lemma 2 gives the upper bound
\[ 2D^4 \log\big( |S_{k,\varepsilon}| + 1 \big) \le k\, C(L,D,d) \tag{B.13} \]
where $C(L,D,d)$ does not depend on $k$. Combining this with (B.12) and the approximation bound given by (B.3) results in
\[ \Delta(f_{k,n}) - \Delta(f^*) \le \sqrt{\frac{k\, C(L,D,d)}{n}} + \frac{DL + 2}{k} + O(n^{-1/2}). \]
The rate at which $\Delta(f_{k,n})$ approaches $\Delta(f^*)$ is optimized by setting the number of segments $k$ to be proportional to $n^{1/3}$. With this choice $J(f_{k,n}) = \Delta(f_{k,n}) - \Delta(f^*)$ has the asymptotic convergence rate
\[ J(f_{k,n}) = O(n^{-1/3}), \]
and the proof of Theorem 1 is complete.
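To make the choice $k \propto n^{1/3}$ explicit (a worked step we add here; it is not part of the original proof), ignore the $O(n^{-1/2})$ term and minimize the bound over $k$:
\[ g(k) = \sqrt{\frac{k\, C(L,D,d)}{n}} + \frac{DL+2}{k}, \qquad g'(k) = \frac{1}{2}\sqrt{\frac{C(L,D,d)}{nk}} - \frac{DL+2}{k^2} = 0 \iff k^{3/2} = \frac{2(DL+2)\sqrt{n}}{\sqrt{C(L,D,d)}}. \]
The minimizing $k$ therefore grows as $n^{1/3}$, for which both terms, and hence $g(k)$, are $O(n^{-1/3})$.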

To show the bound (7), let $\delta \in (0,1)$ and observe that by (B.9) we have
\[ P\big\{ E[\Delta(X,f_{k,n}) \mid X^n] - \Delta(f^*_k) \le t \big\} \ge 1 - \delta \]
whenever $t > 2\varepsilon$ and
\[ \delta = 2\big( |S_{k,\varepsilon}| + 1 \big) e^{-2n(t/2-\varepsilon)^2/D^4}. \]
Solving this equation for $t$ and letting $\varepsilon = 1/k$ as before, we obtain
\[ t = \sqrt{\frac{2D^4 \log\big( |S_{k,1/k}| + 1 \big) - 2D^4 \log(\delta/2)}{n}} + \frac{2}{k} \le \sqrt{\frac{k\, C(L,D,d) - 2D^4 \log(\delta/2)}{n}} + \frac{2}{k}. \]
Therefore, with probability at least $1 - \delta$, we have
\[ E[\Delta(X,f_{k,n}) \mid X^n] - \Delta(f^*_k) \le \sqrt{\frac{k\, C(L,D,d) - 2D^4 \log(\delta/2)}{n}} + \frac{2}{k}. \]
Combining this bound with the approximation bound $\Delta(f^*_k) - \Delta(f^*) \le DL/k$ gives (7). $\Box$


References

[1] T. Hastie, Principal Curves and Surfaces. PhD thesis, Stanford University, 1984.
[2] T. Hastie and W. Stuetzle, "Principal curves," Journal of the American Statistical Association, vol. 84, pp. 502–516, 1989.
[3] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. COM-28, pp. 84–95, 1980.
[4] W. S. Cleveland, "Robust locally weighted regression and smoothing scatterplots," Journal of the American Statistical Association, vol. 74, pp. 829–835, 1979.
[5] J. D. Banfield and A. E. Raftery, "Ice floe identification in satellite images using mathematical morphology and clustering about principal curves," Journal of the American Statistical Association, vol. 87, pp. 7–16, 1992.

[6] R. Singh, M. C. Wade, and N. P. Papanikolopoulos, "Letter-level shape description by skeletonization in faded documents," in Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision, pp. 121–126, IEEE Comput. Soc. Press, 1998.
[7] K. Reinhard and M. Niranjan, "Subspace models for speech transitions using principal curves," Proceedings of Institute of Acoustics, vol. 20(6), pp. 53–60, 1998.
[8] K. Chang and J. Ghosh, "Principal curves for nonlinear feature extraction and classification," in Applications of Artificial Neural Networks in Image Processing III, vol. 3307, (San Jose, CA), pp. 120–129, SPIE Photonics West '98 Electronic Image Conference, Jan. 24–30, 1998.

[9] K. Chang and J. Ghosh, "Principal curve classifier – a nonlinear approach to pattern classification," in IEEE International Joint Conference on Neural Networks, (Anchorage, AK), pp. 695–700, May 5–9, 1998.

[10] R. Tibshirani, "Principal curves revisited," Statistics and Computing, vol. 2, pp. 183–190, 1992.

[11] F. Mulier and V. Cherkassky, "Self-organization as an iterative kernel smoothing process," Neural Computation, vol. 7, pp. 1165–1177, 1995.
[12] P. Delicado, "Principal curves and principal oriented points," Tech. Rep. 309, Departament d'Economia i Empresa, Universitat Pompeu Fabra, 1998.



[13] T. Duchamp and W. Stuetzle, "Geometric properties of principal curves in the plane," in Robust Statistics, Data Analysis, and Computer Intensive Methods: In Honor of Peter Huber's 60th Birthday (H. Rieder, ed.), vol. 109 of Lecture Notes in Statistics, pp. 135–152, Springer-Verlag, 1996.
[14] A. N. Kolmogorov and S. V. Fomin, Introductory Real Analysis. New York: Dover, 1975.
[15] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Boston: Kluwer, 1992.
[16] J. A. Hartigan, Clustering Algorithms. New York: Wiley, 1975.
[17] T. Tarpey, L. Li, and B. D. Flury, "Principal points and self-consistent points of elliptical distributions," Annals of Statistics, vol. 23, no. 1, pp. 103–112, 1995.
[18] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[19] B. Kégl, Principal Curves: Learning, Design, and Applications. PhD thesis, Concordia University, Montreal, Canada, 1999.

[20] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer, 1996.
[21] A. N. Kolmogorov and V. M. Tikhomirov, "ε-entropy and ε-capacity of sets in function spaces," Translations of the American Mathematical Society, vol. 17, pp. 277–364, 1961.
[22] P. Grother, NIST Special Database 19. National Institute of Standards and Technology, Advanced Systems Division, 1995.
[23] S. Suzuki and K. Abe, "Sequential thinning of binary pictures using distance transformation," in Proceedings of the 8th International Conference on Pattern Recognition, pp. 289–292, 1986.
[24] R. B. Ash, Real Analysis and Probability. New York: Academic Press, 1972.
[25] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, pp. 13–30, 1963.


