Learning and Design of Principal Curves
Balázs Kégl, Adam Krzyżak, Tamás Linder, Kenneth Zeger
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 3, pp. 281-297, 2000.
Abstract
Principal curves have been defined as "self-consistent" smooth curves which pass through the "middle" of a d-dimensional probability distribution or data cloud. They give a summary of the data and also serve as an efficient feature extraction tool. We take a new approach by defining principal curves as continuous curves of a given length which minimize the expected squared distance between the curve and points of the space randomly chosen according to a given distribution. The new definition makes it possible to theoretically analyze principal curve learning from training data and it also leads to a new practical construction. Our theoretical learning scheme chooses a curve from a class of polygonal lines with k segments and with a given total length, to minimize the average squared distance over n training points drawn independently. Convergence properties of this learning scheme are analyzed and a practical version of this theoretical algorithm is implemented. In each iteration of the algorithm a new vertex is added to the polygonal line and the positions of the vertices are updated so that they minimize a penalized squared distance criterion. Simulation results demonstrate that the new algorithm compares favorably with previous methods both in terms of performance and computational complexity, and is more robust to varying data models.
Keywords: learning systems, unsupervised learning, feature extraction, vector quantization, curve fitting, piecewise linear approximation.

B. Kégl and T. Linder are with the Department of Mathematics and Statistics, Queen's University, Kingston, Ontario, Canada K7L 3N6 (email: {kegl,linder}@mast.queensu.ca). A. Krzyzak is with the Department of Computer Science, Concordia University, 1450 de Maisonneuve Blvd. West, Montreal PQ, Canada H3G 1M8 (email: [email protected]). K. Zeger is with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093-0407 (email: [email protected]). This research was supported in part by the National Science Foundation and NSERC.
1 Introduction
Principal component analysis is perhaps the best known technique in multivariate analysis and is used in dimension reduction, feature extraction, and in image coding and enhancement. Consider a d-dimensional random vector X = (X_1, ..., X_d) with finite second moments. The first principal component line for X is a straight line which has the property that the expected value of the squared Euclidean distance from X to the line is minimum among all straight lines. This property makes the first principal component a concise one-dimensional approximation to the distribution of X, and the projection of X to this line gives the best linear summary of the data. For elliptical distributions the first principal component is also self-consistent, i.e., any point of the line is the conditional expectation of X over those points of the space which project to this point.
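For concreteness, the first principal component line of a planar sample can be computed from the sample covariance matrix; the sketch below is our own illustration (not part of the paper's method) and finds the leading eigenvector by power iteration:

```python
import math

def first_principal_component(points):
    """Return (center, direction): the first principal component line of a 2-D
    sample, i.e., the straight line minimizing the average squared Euclidean
    distance to the points. The line passes through `center` along `direction`."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    # entries of the 2x2 sample covariance matrix
    sxx = sum((p[0] - cx) ** 2 for p in points) / n
    syy = sum((p[1] - cy) ** 2 for p in points) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points) / n
    # leading eigenvector of [[sxx, sxy], [sxy, syy]] by power iteration
    vx, vy = 1.0, 0.5  # non-axis-aligned start to avoid a degenerate direction
    for _ in range(200):
        wx, wy = sxx * vx + sxy * vy, sxy * vx + syy * vy
        norm = math.hypot(wx, wy) or 1.0
        vx, vy = wx / norm, wy / norm
    return (cx, cy), (vx, vy)
```

The direction is only defined up to sign, which is all that is needed to describe the line.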
Hastie [1] and Hastie and Stuetzle [2] (hereafter HS) generalized the self-consistency property of principal components and introduced the notion of principal curves. Let f(t) = (f_1(t), ..., f_d(t)) be a smooth (infinitely differentiable) curve in R^d parametrized by t ∈ R, and for any x ∈ R^d let t_f(x) denote the largest parameter value t for which the distance between x and f(t) is minimized (see Figure 1). More formally, the projection index t_f(x) is defined by

t_f(x) = sup{ t : ||x − f(t)|| = inf_τ ||x − f(τ)|| }    (1)

where ||·|| denotes the Euclidean norm in R^d.
Figure 1: Projecting points to a curve.
By the HS definition, the smooth curve f(t) is a principal curve if the following hold:

(i) f does not intersect itself,

(ii) f has finite length inside any bounded subset of R^d,

(iii) f is self-consistent, i.e., f(t) = E[X | t_f(X) = t].

Intuitively, self-consistency means that each point of f is the average (under the distribution of X) of all points that project there. Thus, principal curves are smooth self-consistent curves which pass through the "middle" of the distribution and provide a good one-dimensional nonlinear summary of the data.
Based on the self-consistency property, HS developed an algorithm for constructing principal curves. Similar in spirit to the Generalized Lloyd Algorithm (GLA) of vector quantizer design [3], the HS algorithm iterates between a projection step and an expectation step. When the probability density of X is known, the HS algorithm for constructing principal curves is the following.

Step 0: Let f^(0)(t) be the first principal component line for X. Set j = 1.

Step 1: Define f^(j)(t) = E[X | t_{f^(j−1)}(X) = t].

Step 2: Set t_{f^(j)}(x) = max{ t : ||x − f^(j)(t)|| = min_τ ||x − f^(j)(τ)|| } for all x ∈ R^d.

Step 3: Compute Δ(f^(j)) = E||X − f^(j)(t_{f^(j)}(X))||². If |Δ(f^(j)) − Δ(f^(j−1))| < threshold, then stop. Otherwise, let j = j + 1 and go to Step 1.
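For a finite sample the conditional expectation in Step 1 is not directly available. The sketch below is a deliberately simplified, hypothetical data version of one HS iteration: it replaces the HS smoother (locally weighted running lines or smoothing splines) with a plain running mean over points ordered by their projection index, and represents the curve by a polygonal line:

```python
import math

def project(x, verts):
    """Projection of a 2-D point onto the polygonal line `verts`: returns
    (t, d2), the arc-length parameter of the nearest curve point and the
    squared distance to it."""
    best_t, best_d2, acc = 0.0, float("inf"), 0.0
    for a, b in zip(verts[:-1], verts[1:]):
        sx, sy = b[0] - a[0], b[1] - a[1]
        L2 = sx * sx + sy * sy
        u = 0.0 if L2 == 0 else max(0.0, min(1.0, ((x[0] - a[0]) * sx + (x[1] - a[1]) * sy) / L2))
        px, py = a[0] + u * sx, a[1] + u * sy
        d2 = (x[0] - px) ** 2 + (x[1] - py) ** 2
        if d2 < best_d2:
            best_d2, best_t = d2, acc + u * math.sqrt(L2)
        acc += math.sqrt(L2)
    return best_t, best_d2

def hs_step(data, verts, span=5):
    """One simplified expectation step: E[X | t_f(X) = t] is approximated by a
    running mean over the points whose projection indices are nearest to t
    (our simplification, not the HS smoother)."""
    order = sorted(range(len(data)), key=lambda i: project(data[i], verts)[0])
    xs = [data[i] for i in order]
    new = []
    for i in range(len(xs)):
        w = xs[max(0, i - span): i + span + 1]
        new.append((sum(p[0] for p in w) / len(w), sum(p[1] for p in w) / len(w)))
    return new  # vertices of the updated polygonal curve
```

Iterating `hs_step` and re-projecting mimics the Steps 1-3 loop on data; the names and the running-mean smoother are our assumptions for illustration.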
In practice, the distribution of X is often unknown, but a data set consisting of n samples of the underlying distribution is known instead. In the HS algorithm for data sets, the expectation in Step 1 is replaced by a "smoother" (locally weighted running lines [4]) or a nonparametric regression estimate (cubic smoothing splines). HS provide simulation examples to illustrate the behavior of the algorithm, and describe an application in the Stanford Linear Collider Project [2]. It should be noted that there is no known proof of the convergence of the hypothetical algorithm (Steps 0-3); one main difficulty is that Step 1 can produce nondifferentiable curves while principal curves are differentiable by definition. However, extensive testing on simulated and real examples has not revealed any convergence problems for the practical implementation of the algorithm [2].
Alternative definitions and methods for estimating principal curves have been given subsequent to Hastie and Stuetzle's groundbreaking work. Banfield and Raftery [5] (hereafter BR) modeled the outlines of ice floes in satellite images by closed principal curves and they developed a robust method which reduces the bias in the estimation process. Their method of clustering about principal curves led to a fully automatic method for identifying ice floes and their outlines. Singh et al. [6] used principal curves to extract skeletal structures of hand-written characters in faded documents. Reinhard and Niranjan [7] applied principal curves to model the short time spectrum of speech signals. They found that principal curves can be used efficiently to capture transitional information between phones. Chang and Ghosh [8], [9] combined the HS and the BR algorithms to improve the performance of the principal curve algorithm, and used the modified algorithm for nonlinear feature extraction and pattern classification. On the theoretical side, Tibshirani [10] introduced a semiparametric model for principal curves and proposed a method for estimating principal curves using the EM algorithm. Close connections between principal curves and Kohonen's self-organizing maps were pointed out by Mulier and Cherkassky [11]. Recently, Delicado [12] proposed yet another definition based on a property of the first principal components of multivariate normal distributions.
There remains an unsatisfactory aspect of the definition of principal curves in the original HS paper as well as in subsequent works. Although principal curves have been defined to be nonparametric, their existence has been proven only for some special densities such as radially symmetric densities and for the uniform density on a rectangle or an annulus in the plane [13]. At present, it is an open problem whether HS principal curves exist for all "reasonable" densities. It is also unknown how the hypothetical HS algorithm behaves for a probability density for which an HS principal curve does not exist. At the same time, the problem of existence makes it difficult to theoretically analyze (in terms of consistency and convergence rates) any estimation scheme for HS principal curves.
In this paper we propose a new definition of principal curves to resolve this problem. In the new definition, a principal curve is a continuous curve of a given length L which minimizes the expected squared distance between X and the curve. In Section 2 (Lemma 1) we prove that for any X with finite second moments there always exists a principal curve in the new sense. We also discuss connections between the newly defined principal curves and optimal vector quantizers. Then we propose a theoretical learning scheme in which the model classes are polygonal lines with k segments and with a given length, and the algorithm chooses a curve from this class which minimizes the average squared distance over n training points. In Theorem 1 we prove that with k suitably chosen as a function of n, the expected squared distance of the curve trained on n data points converges to the expected squared distance of the principal curve at a rate O(n^(−1/3)) as n → ∞.
Two main features distinguish this learning scheme from the HS algorithm. First, the polygonal line estimate of the principal curve is determined via minimizing a data-dependent criterion directly related to the definition of principal curves. This facilitates the theoretical analysis of the performance. Second, the complexity of the resulting polygonal line is determined by the number of segments k, which when optimally chosen is typically much less than n. This agrees with our mental image that principal curves should provide a concise summary of the data. In contrast, for n data points the HS algorithm with scatterplot smoothing produces polygonal lines with n segments.
Though amenable to analysis, our theoretical algorithm is computationally burdensome to implement. In Section 3 we develop a suboptimal algorithm for learning principal curves. The practical algorithm produces polygonal line approximations to the principal curve just as the theoretical method does, but global optimization is replaced by a less complex iterative descent method. In Section 4 we give simulation results and compare our algorithm with previous work.
In general, on examples considered by HS, the performance of the new algorithm is comparable with the HS algorithm, while it proves to be more robust to changes in the data generating model. In Section 4 we also report some preliminary results on applying the polygonal line algorithm to find the medial axes ("skeletons") of pixel templates of hand-written characters.
2 Learning Principal Curves with a Length Constraint

A curve in d-dimensional Euclidean space is a continuous function f: I → R^d, where I is a closed interval of the real line. Let the expected squared distance between X and f be denoted by

Δ(f) = E[ inf_t ||X − f(t)||² ] = E ||X − f(t_f(X))||²    (2)
where the projection index t_f(x) is given in (1). Let f be a smooth (infinitely differentiable) curve and for λ ∈ R consider the perturbation f + λg of f by a smooth curve g such that sup_t ||g(t)|| ≤ 1 and sup_t ||g′(t)|| ≤ 1. HS proved that f is a principal curve if and only if f is a critical point of the distance function in the sense that for all such g,

∂Δ(f + λg)/∂λ |_{λ=0} = 0.

It is not hard to see that an analogous result holds for principal component lines if the perturbation g is a straight line. In this sense the HS principal curve definition is a natural generalization of principal components. Also, it is easy to check that principal components are in fact principal curves if the distribution of X is elliptical.
An unfortunate property of the HS definition is that in general it is not known whether principal curves exist for a given source density. To resolve this problem we go back to the defining property of the first principal component. A straight line s(t) is the first principal component if and only if

E[ min_t ||X − s(t)||² ] ≤ E[ min_t ||X − ŝ(t)||² ]

for any other straight line ŝ(t). We wish to generalize this property of the first principal component and define principal curves so that they minimize the expected squared distance over a class of curves rather than only being critical points of the distance function. To do this it is necessary to constrain the length¹ of the curve, since otherwise for any X with a density and any ε > 0 there exists a smooth curve f such that Δ(f) ≤ ε, and thus a minimizing f has infinite length. On the other hand, if the distribution of X is concentrated on a polygonal line and is uniform there, the infimum of the squared distances Δ(f) is 0 over the class of smooth curves, but no smooth curve can achieve this infimum. For this reason, we relax the requirement that f be differentiable and instead we constrain the length of f. Note that by the definition of curves, f is still continuous. We give the following new definition of principal curves.

¹For the definition of length for nondifferentiable curves see Appendix A, where some basic facts concerning curves in R^d have been collected from [14].
Definition 1 A curve f* is called a principal curve of length L for X if f* minimizes Δ(f) over all curves of length less than or equal to L.
A useful advantage of the new definition is that principal curves of length L always exist if X has finite second moments, as the next result shows.
Lemma 1 Assume that E||X||² < ∞. Then for any L > 0 there exists a curve f* with l(f*) ≤ L such that

Δ(f*) = inf { Δ(f) : l(f) ≤ L }.

The proof of the lemma is given in Appendix A.
Note that we have dropped the requirement of the HS definition that principal curves be non-intersecting. In fact, Lemma 1 does not hold in general for non-intersecting curves of length L without further restricting the distribution of X, since there are distributions for which the minimum of Δ(f) is achieved only by an intersecting curve even though non-intersecting curves can arbitrarily approach this minimum. Note also that neither the HS definition nor our definition guarantees the uniqueness of principal curves. In our case, there might exist several principal curves for a given length constraint L, but each of these will have the same (minimal) squared loss.
Remark: (Connection with vector quantization)

Our new definition of principal curves has been inspired by the notion of an optimal vector quantizer. A vector quantizer maps a point in R^d to the closest point in a fixed set (called a codebook) {y_1, ..., y_k} ⊂ R^d. The code points y*_1, ..., y*_k ∈ R^d correspond to an optimal k-point vector quantizer if

E[ min_i ||X − y*_i||² ] ≤ E[ min_i ||X − y_i||² ]

for any other collection of k points y_1, ..., y_k ∈ R^d. In other words, the points y*_1, ..., y*_k give the best k-point representation of X in the mean squared sense. Optimal vector quantizers are important in lossy data compression, speech and image coding [15], and clustering [16]. There is a strong connection between the definition of an optimal vector quantizer and our definition of a principal curve. Both minimize the same expected squared distance criterion. A vector quantizer is constrained to have at most k points, whereas a principal curve has a constrained length. This connection is further illustrated by a recent work of Tarpey et al. [17] who define points y_1, ..., y_k to be self-consistent if

y_i = E[ X | X ∈ S_i ],

where S_1, ..., S_k are the "Voronoi regions" defined as S_i = { x : ||x − y_i|| ≤ ||x − y_j||, j = 1, ..., k } (ties are broken arbitrarily). Thus our principal curves correspond to optimal vector quantizers ("principal points" by the terminology of [17]) while the HS principal curves correspond to self-consistent points.
While principal curves of a given length always exist, it appears difficult to demonstrate concrete examples, unless the distribution of X is discrete or is concentrated on a curve. It is presently unknown what principal curves look like with a length constraint for even the simplest continuous multivariate distributions such as the Gaussian. However, this fact in itself does not limit the operational significance of principal curves. Analogously, for k ≥ 3 code points there are no known concrete examples of optimal vector quantizers for even the most common model distributions such as Gaussian, Laplacian, or uniform (in a hypercube) in any dimension d ≥ 2. Nevertheless, algorithms for quantizer design attempting to find near optimal vector quantizers are of great theoretical and practical interest. In what follows we consider the problem of principal curve design based on training data.
Suppose that n independent copies X_1, ..., X_n of X are given. These are called the training data and they are assumed to be independent of X. The goal is to use the training data to construct a curve of length at most L whose expected squared loss is close to that of a principal curve for X.

Our method is based on a common model in statistical learning theory (e.g., see [18]). We consider classes S_1, S_2, ... of curves of increasing complexity. Given n data points drawn independently from the distribution of X, we choose a curve as the estimator of the principal curve from the kth model class S_k by minimizing the empirical error. By choosing the complexity of the model class appropriately as the size of the training data grows, the chosen curve represents the principal curve with increasing accuracy.

We assume that the distribution of X is concentrated on a closed and bounded convex set K ⊂ R^d. A basic property of convex sets in R^d shows that there exists a principal curve of length L inside K (see [19, Lemma 1]), and so we will only consider curves in K.
Let S denote the family of curves taking values in K and having length not greater than L. For k ≥ 1 let S_k be the set of polygonal (piecewise linear) curves in K which have k segments and whose lengths do not exceed L. Note that S_k ⊂ S for all k. Let

Δ(x, f) = min_t ||x − f(t)||²    (3)

denote the squared distance between a point x ∈ R^d and the curve f. For any f ∈ S the empirical squared error of f on the training data is the sample average

Δ_n(f) = (1/n) Σ_{i=1}^n Δ(X_i, f)    (4)

where we have suppressed in the notation the dependence of Δ_n(f) on the training data. Let our theoretical algorithm² choose an f_{k,n} ∈ S_k which minimizes the empirical error, i.e.,

f_{k,n} = argmin_{f ∈ S_k} Δ_n(f).    (5)
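In code, the squared distance (3) to a polygonal line and the empirical error (4) are straightforward to evaluate; a minimal planar sketch (the helper names are ours, and the hard part, the global minimization (5), is not attempted here):

```python
def dist2_to_segment(x, a, b):
    """Squared Euclidean distance from 2-D point x to the segment [a, b]."""
    sx, sy = b[0] - a[0], b[1] - a[1]
    L2 = sx * sx + sy * sy
    # parameter of the orthogonal projection, clipped to the segment
    u = 0.0 if L2 == 0 else max(0.0, min(1.0, ((x[0] - a[0]) * sx + (x[1] - a[1]) * sy) / L2))
    px, py = a[0] + u * sx, a[1] + u * sy
    return (x[0] - px) ** 2 + (x[1] - py) ** 2

def dist2_to_curve(x, verts):
    """Δ(x, f) of equation (3) for a polygonal line with vertex list `verts`."""
    return min(dist2_to_segment(x, a, b) for a, b in zip(verts[:-1], verts[1:]))

def empirical_error(data, verts):
    """Δ_n(f) of equation (4): average squared distance of the sample to the curve."""
    return sum(dist2_to_curve(x, verts) for x in data) / len(data)
```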
We measure the efficiency of f_{k,n} in estimating f* by the difference J(f_{k,n}) between the expected squared loss of f_{k,n} and the optimal expected squared loss achieved by f*, i.e., we let

J(f_{k,n}) = Δ(f_{k,n}) − Δ(f*) = Δ(f_{k,n}) − min_{f ∈ S} Δ(f).

Since S_k ⊂ S, we have J(f_{k,n}) ≥ 0. Our main result in this section proves that if the number of data points n tends to infinity, and k is chosen to be proportional to n^(1/3), then J(f_{k,n}) tends to zero at a rate J(f_{k,n}) = O(n^(−1/3)).

Theorem 1 Assume that P{X ∈ K} = 1 for a bounded and closed convex set K, let n be the number of training points, and let k be chosen to be proportional to n^(1/3). Then the expected squared loss of the empirically optimal polygonal line with k segments and length at most L converges, as n → ∞, to the squared loss of the principal curve of length L at a rate

J(f_{k,n}) = O(n^(−1/3)).

The proof of the theorem is given in Appendix B. To establish the result we use techniques from statistical learning theory (e.g., see [20]). First, the approximating capability of the class of curves S_k is considered, and then the estimation (generalization) error is bounded via covering the class of curves S_k with ε accuracy (in the squared distance sense) by a discrete set of curves. When these two bounds are combined, one obtains

J(f_{k,n}) ≤ √( k C(L,D,d) / n ) + (DL + 2)/k + O(n^(−1/2))    (6)

where the term C(L,D,d) depends only on the dimension d, the length L, and the diameter D of the support of X, but is independent of k and n. The two error terms are balanced by choosing k to be proportional to n^(1/3), which gives the convergence rate of Theorem 1.

²The term "hypothetical algorithm" might appear to be more accurate since we have not shown that an algorithm for finding f_{k,n} exists. However, an algorithm clearly exists which can approximate f_{k,n} with arbitrary accuracy in a finite number of steps (consider polygonal lines whose vertices are restricted to a fine rectangular grid). The proof of Theorem 1 shows that such approximating curves can replace f_{k,n} in the analysis.
Note that although the constant hidden in the O notation depends on the dimension d, the exponent of n is dimension-free. This is not surprising in view of the fact that the class of curves S is equivalent in a certain sense to the class of Lipschitz functions f: [0,1] → K such that ||f(x) − f(y)|| ≤ L|x − y| (see Appendix A). It is known that the ε-entropy, defined by the logarithm of the ε covering number, is roughly proportional to 1/ε for such function classes [21]. Using this result, the convergence rate O(n^(−1/3)) can be obtained by considering ε-covers of S directly (without using the model classes S_k) and picking the empirically optimal curve in this cover. The use of the classes S_k has the advantage that they are directly related to the practical implementation of the algorithm given in the next section.
Note also that even though Theorem 1 is valid for any given length constraint L, the theoretical algorithm itself gives little guidance about how to choose L. This choice depends on the particular application and heuristic considerations are likely to enter here. One example is given in Section 3 where a practical implementation of the polygonal line algorithm is used to recover a "generating curve" from noisy observations.
Finally, we note that the proof of Theorem 1 also provides information on the distribution of the expected squared error of f_{k,n} given the training data X_1, ..., X_n. In particular, it is shown at the end of the proof that for all n and k, and δ such that 0 < δ < 1, with probability at least 1 − δ we have

E[ Δ(X, f_{k,n}) | X_1, ..., X_n ] − Δ(f*) ≤ √( (k C(L,D,d) − D⁴ log(δ/2)) / n ) + (DL + 2)/k    (7)

where log denotes the natural logarithm and C(L,D,d) is the same constant as in (6).
3 The Polygonal Line Algorithm

Given a set of data points X_n = {x_1, ..., x_n} ⊂ R^d, the task of finding a polygonal curve with k segments and length L which minimizes (1/n) Σ_{i=1}^n Δ(x_i, f) is computationally difficult. We propose a suboptimal method with reasonable complexity which also picks the length L of the principal curve automatically. The basic idea is to start with a straight line segment f_{0,n}, the shortest segment of the first principal component line which contains all of the projected data points, and in each iteration of the algorithm to increase the number of segments by one by adding a new vertex to the polygonal line f_{k,n} produced in the previous iteration. After adding a new vertex, the positions of all vertices are updated in an inner loop.
The inner loop consists of a projection step and an optimization step. In the projection step the data points are partitioned into "nearest neighbor regions" according to which segment or vertex they project. In the optimization step the new position of a vertex v_i is determined by minimizing an average squared distance criterion penalized by a measure of the local curvature, while all other
Figure 2: The curves f_{k,n} produced by the polygonal line algorithm for n = 100 data points. The data was generated by adding independent Gaussian errors to both coordinates of a point chosen randomly on a half circle. (a) f_{1,n}, (b) f_{2,n}, (c) f_{4,n}, (d) f_{15,n} (the output of the algorithm).
vertices are kept fixed. These two steps are iterated so that the optimization step is applied to each vertex v_i, i = 1, ..., k+1, in a cyclic fashion (so that after v_{k+1}, the procedure starts again with v_1), until convergence is achieved and f_{k,n} is produced. Then a new vertex is added.

The algorithm stops when k exceeds a threshold c(n, Δ). This stopping criterion is based on a heuristic complexity measure, determined by the number of segments k, the number of data points n, and the average squared distance Δ_n(f_{k,n}).

The flow chart of the algorithm is given in Figure 3. The evolution of the curve produced by the algorithm is illustrated in Figure 2. As with the HS algorithm, we have no formal proof that the practical algorithm will converge, but in practice, after extensive testing, it seems to consistently converge.
It should be noted that the two core components of the algorithm, the projection and the vertex optimization steps, are combined with more heuristic elements such as the stopping condition and the form of the penalty term (8) of the optimization step. The heuristic parts of the algorithm have been tailored to the task of recovering an underlying generating curve for a distribution based on a finite data set of randomly drawn points (see the experimental results in Section 4). When the algorithm is intended for an application with a different objective (such as robust signal compression),
[Flow chart: START → Initialization → Projection → Vertex optimization → Convergence? → k > c(n, Δ)? → either Add new vertex (and repeat from Projection) or END.]

Figure 3: The flow chart of the polygonal line algorithm.
the core components can be kept unchanged but the heuristic elements may be replaced according to the new objectives.
3.1 The Projection Step

Let f denote a polygonal line with vertices v_1, ..., v_{k+1} and line segments s_1, ..., s_k, such that s_i connects vertices v_i and v_{i+1}. In this step the data set X_n is partitioned into (at most) 2k+1 disjoint sets V_1, ..., V_{k+1} and S_1, ..., S_k, the nearest neighbor regions of the vertices and segments of f, respectively, in the following manner. For any x ∈ R^d let Δ(x, s_i) be the squared distance from x to s_i (see definition (3)), let Δ(x, v_i) = ||x − v_i||², and let

V_i = { x ∈ X_n : Δ(x, v_i) = Δ(x, f), Δ(x, v_i) < Δ(x, v_m), m = 1, ..., i−1 }.

Upon setting V = V_1 ∪ ... ∪ V_{k+1}, the S_i sets are defined by

S_i = { x ∈ X_n : x ∉ V, Δ(x, s_i) = Δ(x, f), Δ(x, s_i) < Δ(x, s_m), m = 1, ..., i−1 }.

The resulting partition is illustrated in Figure 4.
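The partition can be rendered directly in Python (a quadratic-time planar sketch with our own helper names; ties go to the smaller index, matching the definitions above):

```python
def d2_point(x, v):
    """Squared distance between two 2-D points."""
    return (x[0] - v[0]) ** 2 + (x[1] - v[1]) ** 2

def d2_segment(x, a, b):
    """Squared distance from x to the segment [a, b]."""
    sx, sy = b[0] - a[0], b[1] - a[1]
    L2 = sx * sx + sy * sy
    u = 0.0 if L2 == 0 else max(0.0, min(1.0, ((x[0] - a[0]) * sx + (x[1] - a[1]) * sy) / L2))
    return d2_point(x, (a[0] + u * sx, a[1] + u * sy))

def partition(data, verts):
    """Nearest neighbor regions V_1..V_{k+1} (vertices) and S_1..S_k (segments).
    A point joins the first vertex region whose distance equals Δ(x, f);
    otherwise it joins the nearest segment region."""
    k = len(verts) - 1
    V = [[] for _ in range(k + 1)]
    S = [[] for _ in range(k)]
    for x in data:
        dv = [d2_point(x, v) for v in verts]
        ds = [d2_segment(x, verts[i], verts[i + 1]) for i in range(k)]
        best = min(dv + ds)  # Δ(x, f)
        iv = min(range(k + 1), key=lambda i: (dv[i], i))
        if dv[iv] == best:
            V[iv].append(x)
        else:
            S[min(range(k), key=lambda i: (ds[i], i))].append(x)
    return V, S
```

Since a clipped projection onto a segment endpoint reproduces the vertex exactly, the equality test `dv[iv] == best` is safe in floating point here.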
As a result of introducing the nearest neighbor regions S_i and V_i, the polygonal line algorithm substantially differs from methods based on the self-organizing map. Although we optimize the positions of the vertices of the curve, the distances of the data points are measured from the segments and vertices of the curve onto which they project, while the self-organizing map measures
Figure 4: A nearest neighbor partition of R² induced by the vertices and segments of f. The nearest point of f to any point in the set V_i is the vertex v_i. The nearest point of f to any point in the set S_i is a point of the line segment s_i.
distances exclusively from the vertices. Our principle makes it possible to use a relatively small number of vertices and still obtain a good approximation to an underlying generating curve.
3.2 The Vertex Optimization Step

In this step the new position of a vertex v_i is determined. In the theoretical algorithm the average squared distance Δ_n(f) is minimized subject to the constraint that f is a polygonal line with k segments and length not exceeding L. One could use a Lagrangian formulation and attempt to find a new position for v_i (while all other vertices are fixed) such that the penalized squared error Δ_n(f) + λ l(f)² is minimum. Although this direct length penalty can work well in certain applications, it yields poor results in terms of recovering a smooth generating curve. In particular, this approach is very sensitive to the choice of λ and tends to produce curves which, similarly to the HS algorithm, exhibit a "flattening" estimation bias towards the center of the curvature.
To reduce the estimation bias, we penalize sharp angles between line segments. At inner vertices v_i, 3 ≤ i ≤ k−1, we penalize the sum of the cosines of the three angles at vertices v_{i−1}, v_i, and v_{i+1}. The cosine function is convex in the interval [π/2, π] and its derivative is zero at π, which makes it especially suitable for the steepest descent algorithm. To make the algorithm invariant under scaling, we multiply the cosines by the square of the "radius" of the data defined by r = max_{x ∈ X_n} ||x − (1/n) Σ_{y ∈ X_n} y||. Note that the chosen penalty formulation is related to the original principle of penalizing the length of the curve. At inner vertices, since only one vertex is moved at a time, penalizing sharp angles indirectly penalizes long segments. At the endpoints and at their immediate neighbors (v_i, i = 1, 2, k, k+1), where penalizing sharp angles does not translate to penalizing long line segments, the penalty on a nonexistent angle is replaced by a direct penalty on the squared length of the first (or last) segment.
Formally, let γ_i denote the angle at vertex v_i, let π(v_i) = r²(1 + cos γ_i), let μ⁺(v_i) = ||v_i − v_{i+1}||², and let μ⁻(v_i) = ||v_i − v_{i−1}||². Then the penalty P(v_i) at vertex v_i is given by

P(v_i) =
  μ⁺(v_i) + π(v_{i+1})                if i = 1,
  μ⁻(v_i) + π(v_i) + π(v_{i+1})       if i = 2,
  π(v_{i−1}) + π(v_i) + π(v_{i+1})    if 2 < i < k,
  π(v_{i−1}) + π(v_i) + μ⁺(v_i)       if i = k,
  π(v_{i−1}) + μ⁻(v_i)                if i = k + 1.    (8)
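A direct transcription of (8) for planar data, assuming k ≥ 4 so that every referenced angle exists (the 1-based vertex indexing and helper names are our own conventions):

```python
import math

def penalty(verts, i, r):
    """P(v_i) of equation (8) for a polygonal line with vertices verts[1..k+1]
    (verts[0] is an unused placeholder so indices match the text); r is the
    data 'radius'. Assumes k >= 4 segments."""
    k = len(verts) - 2  # number of segments

    def pi(j):
        # curvature term r^2 (1 + cos γ_j), with cos γ_j computed from the
        # two segment vectors incident to the inner vertex v_j
        ax, ay = verts[j - 1][0] - verts[j][0], verts[j - 1][1] - verts[j][1]
        bx, by = verts[j + 1][0] - verts[j][0], verts[j + 1][1] - verts[j][1]
        c = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        return r * r * (1.0 + c)

    def mu(j, m):
        # squared length of the segment from v_j toward v_{j+1} (m=+1) or v_{j-1} (m=-1)
        w = verts[j + m]
        return (verts[j][0] - w[0]) ** 2 + (verts[j][1] - w[1]) ** 2

    if i == 1:
        return mu(i, +1) + pi(i + 1)
    if i == 2:
        return mu(i, -1) + pi(i) + pi(i + 1)
    if i == k + 1:
        return pi(i - 1) + mu(i, -1)
    if i == k:
        return pi(i - 1) + pi(i) + mu(i, +1)
    return pi(i - 1) + pi(i) + pi(i + 1)
```

On a perfectly straight polygonal line every angle is π, so all curvature terms vanish and only the endpoint length penalties remain.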
The local measure of the average squared distance is calculated from the data points which project to v_i or to the line segment(s) starting at v_i (see the Projection Step). Accordingly, let

σ⁺(v_i) = Σ_{x ∈ S_i} Δ(x, s_i),
σ⁻(v_i) = Σ_{x ∈ S_{i−1}} Δ(x, s_{i−1}),
ν(v_i) = Σ_{x ∈ V_i} Δ(x, v_i).

Now define the local average squared distance as a function of v_i by

Δ_n(v_i) =
  ν(v_i) + σ⁺(v_i)             if i = 1,
  σ⁻(v_i) + ν(v_i) + σ⁺(v_i)   if 1 < i < k + 1,
  σ⁻(v_i) + ν(v_i)             if i = k + 1.    (9)
We use an iterative steepest descent method to minimize

G(v_i) = (1/n) Δ_n(v_i) + λ_p (1/(k+1)) P(v_i)

where λ_p > 0. The local squared distance term Δ_n(v_i) and the local penalty term P(v_i) are normalized by the number of data points n and the number of vertices (k+1), respectively, to keep the global objective function Σ_{i=1}^{k+1} G(v_i) approximately in the same magnitude if the number of data points drawn from the source distribution or the number of line segments of the polygonal curve are changed.
We search for a local minimum of G(v_i) in the direction of the negative gradient of G(v_i) by using a procedure similar to Newton's method. Then the gradient is recomputed and the line search is repeated. The iteration stops when the relative improvement of G(v_i) is less than a preset threshold. It should be noted that Δ_n(v_i) is not differentiable at any point v_i such that at least one data point falls on the boundary of a nearest neighbor region S_{i−1}, S_i, or V_i. Thus G(v_i) is only piecewise differentiable and the variant of Newton's method we use cannot guarantee that the global objective function Σ_{i=1}^{k+1} G(v_i) will always decrease in the optimization step. During extensive test runs, however, the algorithm was observed to always converge. Furthermore, we note that this part of the algorithm is modular, i.e., the procedure we are using can be substituted with a more sophisticated optimization routine at the expense of increased computational complexity.
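Since the optimization routine is modular, any local descent method can stand in for it; the following is a generic numerical-gradient sketch (ours, not the paper's exact Newton-like routine) of one vertex update:

```python
def minimize_local(G, v, tol=1e-4, max_iter=50):
    """Minimize a (possibly only piecewise differentiable) function G of a 2-D
    point by steepest descent with a backtracking (Armijo) line search,
    starting from v. Returns the improved point."""
    eps = 1e-6
    for _ in range(max_iter):
        g0 = G(v)
        # central-difference gradient; heuristic near nearest-neighbor-region
        # boundaries where G is not differentiable
        gx = (G((v[0] + eps, v[1])) - G((v[0] - eps, v[1]))) / (2 * eps)
        gy = (G((v[0], v[1] + eps)) - G((v[0], v[1] - eps))) / (2 * eps)
        gg = gx * gx + gy * gy
        step = 1.0
        cand = (v[0] - step * gx, v[1] - step * gy)
        # backtrack until a sufficient decrease is achieved
        while step > 1e-12 and G(cand) > g0 - 1e-4 * step * gg:
            step /= 2.0
            cand = (v[0] - step * gx, v[1] - step * gy)
        if G(cand) >= g0:  # no descent found: treat v as a local minimum
            return v
        if g0 - G(cand) < tol * max(abs(g0), eps):  # relative improvement small
            return cand
        v = cand
    return v
```

Applying `minimize_local` to each `G(v_i)` in the cyclic order described above reproduces the structure of the inner loop, at the cost of a cruder line search.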
One important issue is the amount of smoothing required for a given data set. In the HS algorithm one needs to determine the penalty coefficient of the spline smoother, or the span of the scatterplot smoother. In our algorithm, the corresponding parameter is the curvature penalty factor λ_p. If some a priori knowledge about the distribution is available, one can use it to determine the smoothing parameter. However, in the absence of such knowledge, the coefficient should be data-dependent. Based on heuristic considerations explained below, and after carrying out practical experiments, we set λ_p = λ′_p k n^(−1/3) Δ_n(f_{k,n})^(1/2) r^(−1), where λ′_p is an experimentally determined constant.
By setting the penalty to be proportional to the average distance of the data points from the curve we avoid the zig-zagging behavior of the curve resulting from overfitting when the noise is relatively large. At the same time, this penalty factor allows the principal curve to closely follow the generating curve when the generating curve itself is a polygonal line with sharp angles and the data is concentrated on this curve (the noise is very small). The penalty is set to be proportional to the number of segments k because in our experiments we have found that the algorithm is more likely to avoid local minima if a small penalty is imposed initially and the penalty is gradually increased as the number of segments grows. Since the stopping condition (Section 3.4) indicates that the final number of line segments is proportional to the cube root of the data size, we normalize k by n^(1/3) in the penalty term. The penalty factor is also normalized by the radius of the data to obtain scale independence. The value of the parameter λ′_p was determined by experiments, and was set to the constant 0.13.
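This data-dependent choice of λ_p is a one-liner (the function name is ours; λ′_p = 0.13 as given above):

```python
def curvature_penalty_coefficient(k, n, delta_n, r, lam_prime=0.13):
    """λ_p = λ'_p · k · n^(−1/3) · Δ_n(f_{k,n})^(1/2) · r^(−1), with
    k segments, n data points, average squared distance delta_n, and
    data radius r."""
    return lam_prime * k * n ** (-1.0 / 3.0) * delta_n ** 0.5 / r
```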
3.3 Adding a New Vertex

We start with the optimized f_{k,n} and choose the segment that has the largest number of data points projecting to it. If more than one such segment exists, we choose the longest one. The midpoint of this segment is selected as the new vertex. Formally, let I = { i : |S_i| ≥ |S_j|, j = 1, ..., k } and ℓ = argmax_{i ∈ I} ||v_i − v_{i+1}||. Then the new vertex is v_new = (v_ℓ + v_{ℓ+1})/2.
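This selection rule is simple to state in code (a planar sketch; `proj_counts[i]` stands for |S_{i+1}| in the 0-based indexing, and the names are ours):

```python
def new_vertex(verts, proj_counts):
    """Midpoint of the segment with the most projecting points; ties are
    broken in favor of the longer segment, as in the text."""
    def seg_len2(i):
        a, b = verts[i], verts[i + 1]
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    # lexicographic max: first by point count, then by squared segment length
    best = max(range(len(proj_counts)), key=lambda i: (proj_counts[i], seg_len2(i)))
    a, b = verts[best], verts[best + 1]
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
```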
3.4 Stopping Condition
According to the theoretical results of Section 2, the number of segments k is an important factor that controls the balance between the estimation and approximation errors, and it should be proportional to n^(1/3) to achieve the O(n^(-1/3)) convergence rate for the expected squared distance. Although the theoretical bounds are not tight enough to determine the optimal number of segments for a given data size, we found that k ∝ n^(1/3) works in practice. We also found that, similar to the penalty factor λ_p, the final value of k should also depend on the average squared distance to achieve robustness. If the variance of the noise is relatively small, we can keep the approximation error low by allowing a relatively large number of segments. On the other hand, when the variance of the noise is large (implying a high estimation error), a low approximation error does not improve the overall performance significantly, so in this case a smaller number of segments can be chosen.
The stopping condition blends these two considerations. The algorithm stops when k exceeds

c(n, Δ_n(f_{k,n})) = β · n^(1/3) · Δ_n(f_{k,n})^(-1/2) · r,   (10)

where β is a parameter of the algorithm which was determined by experiments and was set to the constant value 0.3.
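The stopping test (10) is a one-line check once Δ_n and the data radius r are available; the sketch below uses a hypothetical function name and signature of our choosing.

```python
def should_stop(k, n, delta_n, r, beta=0.3):
    # Stop adding vertices once k exceeds
    #   c(n, Delta_n) = beta * n^(1/3) * Delta_n^(-1/2) * r    (Eq. 10)
    # delta_n: average squared distance of the n data points from the curve,
    # r: radius of the data.
    return k > beta * n ** (1.0 / 3.0) * delta_n ** (-0.5) * r
```

Note how a larger Δ_n (noisier data) lowers the threshold, so fewer segments are used, matching the robustness argument above.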
Note that in a practical sense, the number of segments plays a more important role in determining the computational complexity of the algorithm than in measuring the quality of the approximation. Experiments showed that, due to the data-dependent curvature penalty and the constraint that only one vertex is moved at a time, the number of segments can increase even beyond the number of data points without any indication of overfitting. While increasing the number of segments beyond a certain limit offers only marginal improvement in the approximation, it causes the algorithm to slow down considerably. Therefore, in on-line applications where speed has priority over precision, it is reasonable to use a smaller number of segments than indicated by (10), and if "aesthetic" smoothness is an issue, to fit a spline through the vertices of the curve.
3.5 Computational Complexity
The complexity of the inner loop is dominated by the complexity of the projection step, which is O(nk). Increasing the number of segments one at a time (as described in Section 3.3), the complexity of the algorithm to obtain f_{k,n} is O(nk^2). Using the stopping condition of Section 3.4, the computational complexity of the algorithm becomes O(n^(5/3)). This is slightly better than the O(n^2) complexity of the HS algorithm.
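The O(nk) projection step can be sketched as a brute-force nearest-segment search, one point-to-segment distance evaluation per (point, segment) pair. The function name and return convention below are ours.

```python
import numpy as np

def project_points(points, verts):
    # For each of the n data points, find the nearest of the k segments of
    # the polygonal line: n*k point-to-segment evaluations, hence O(nk).
    seg_idx = np.empty(len(points), dtype=int)
    sq_dists = np.empty(len(points))
    for i, x in enumerate(points):
        best, arg = np.inf, -1
        for j in range(len(verts) - 1):
            a, d = verts[j], verts[j + 1] - verts[j]
            t = 0.0 if d @ d == 0 else np.clip((x - a) @ d / (d @ d), 0.0, 1.0)
            p = a + t * d           # orthogonal projection clipped to the segment
            dist2 = (x - p) @ (x - p)
            if dist2 < best:
                best, arg = dist2, j
        seg_idx[i], sq_dists[i] = arg, best
    return seg_idx, sq_dists
```

Running this once per vertex optimization, for k = 1, …, O(n^(1/3)) segments, gives the O(n^(5/3)) total mentioned above.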
The complexity can be dramatically decreased in certain situations. One possibility is to add more than one vertex at a time. For example, if instead of adding only one vertex, a new vertex is placed at the midpoint of every segment, then we can reduce the computational complexity for producing f_{k,n} to O(nk log k). One can also set k to be a constant if the data size is large, since increasing k beyond a certain threshold brings only diminishing returns. Also, k can be naturally set to a constant in certain applications, giving O(nk) computational complexity. These simplifications work well in certain situations, but the original algorithm is more robust.
4 Experimental Results
We have extensively tested the proposed algorithm on two-dimensional data sets. In most experiments the data was generated by a commonly used (see, e.g., [2], [10], [11]) additive model

X = Y + e,   (11)

where Y is uniformly distributed on a smooth planar curve (hereafter called the generating curve) and e is bivariate additive noise which is independent of Y.
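Model (11) is straightforward to simulate; the sketch below uses a unit half circle as the generating curve (mirroring Figure 5(b)) and an illustrative noise level, both our choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_data(n, sigma=0.2, rng=rng):
    # X = Y + e: Y uniform on the generating curve (here a unit half circle),
    # e independent zero-mean bivariate Gaussian noise with variance sigma^2
    # in each coordinate.
    t = rng.uniform(0.0, np.pi, size=n)          # uniform arc length on the circle
    y = np.column_stack((np.cos(t), np.sin(t)))  # points on the half circle
    e = rng.normal(0.0, sigma, size=(n, 2))
    return y + e
```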
In Section 4.1 we compare the polygonal line algorithm, the HS algorithm, and, for closed generating curves, the BR algorithm [5]. The various methods are compared subjectively based mainly on how closely the resulting curve follows the shape of the generating curve. We use varying generating shapes, noise parameters, and data sizes to demonstrate the robustness of the polygonal line algorithm. For the case of a circular generating curve we also evaluate in a quantitative manner how well the polygonal line algorithm approximates the generating curve as the data size grows and as the noise variance decreases.
In Section 4.2 we show two scenarios in which the polygonal line algorithm (along with the HS algorithm) fails to produce meaningful results. In the first, the high number of abrupt changes in the direction of the generating curve causes the algorithm to oversmooth the principal curve, even when the data is concentrated on the generating curve. This is a typical situation when the penalty parameter λ'_p should be decreased. In the second scenario, the generating curve is too complex (e.g., it contains loops, or it has the shape of a spiral), so the algorithm fails to find the global structure of the data if the process is started from the first principal component. To recover the generating curve, one must replace the initialization step by a more sophisticated routine that approximately captures the global structure of the data.
In Section 4.3 an application in feature extraction is briefly outlined. We depart from the synthetic data generating model in (11) and use an extended version of the polygonal line algorithm to find the medial axes ("skeletons") of pixel templates of hand-written characters. Such skeletons can be used in hand-written character recognition and compression of hand-written documents.
4.1 Experiments with the generating curve model
In general, in simulation examples considered by HS the performance of the new algorithm is comparable with the HS algorithm. Due to the data dependence of the curvature penalty factor and the stopping condition, our algorithm turns out to be more robust to alterations in the data generating model, as well as to changes in the parameters of the particular model.

We use model (11) with varying generating shapes, noise parameters, and data sizes to demonstrate the robustness of the polygonal line algorithm. All plots show the generating curve, the curve produced by our polygonal line algorithm (Polygonal principal curve), and the curve produced by the HS algorithm with spline smoothing (HS principal curve), which we have found to perform better than the HS algorithm using scatterplot smoothing. For closed generating curves we also include the curve produced by the BR algorithm [5] (BR principal curve), which extends the HS algorithm to closed curves. The two coefficients of the polygonal line algorithm are set in all experiments to the constant values β = 0.3 and λ'_p = 0.13.
In Figure 5(a) the generating curve is a circle of radius r = 1, and e = (e_1, e_2) is a zero mean bivariate uncorrelated Gaussian with variance E(e_i^2) = 0.04, for i = 1, 2. The performance of the three algorithms (HS, BR, and the polygonal line algorithm) is comparable, although the HS algorithm exhibits more bias than the other two. Note that the BR algorithm [5] has been tailored to fit closed curves and to reduce the estimation bias. In Figure 5(b), only half of the circle is used as a generating curve and the other parameters remain the same. Here, too, both the HS and our algorithm behave similarly.
When we depart from these usual settings the polygonal line algorithm exhibits better behavior than the HS algorithm. In Figure 6(a) the data was generated similarly to the data set of Figure 5, and then it was linearly transformed using the matrix [0.7, 0.4; -0.8, 1.0]. In Figure 6(b) the transformation [-1.0, -1.2; 1.0, -0.2] was used. The original data set was generated by an S-shaped generating curve, consisting of two half circles of unit radii, to which the same Gaussian noise was added as in Figure 5. In both cases the polygonal line algorithm produces curves that fit the generating curve more closely. This is especially noticeable in Figure 6(a) where the HS principal curve fails to follow the shape of the distorted half circle.
There are two situations when we expect our algorithm to perform particularly well. If the distribution is concentrated on a curve, then according to both the HS and our definitions the principal curve is the generating curve itself. Thus, if the noise variance is small, we expect both algorithms to very closely approximate the generating curve. The data in Figure 7(a) was generated using the same additive Gaussian model as in Figure 5, but the noise variance was reduced to E(e_i^2) = 0.0001 for i = 1, 2. In this case we found that the polygonal line algorithm outperformed both the HS and the BR algorithms.
(a) Circle, 100 data points. (b) Half circle, 100 data points.
Figure 5: (a) The BR and the polygonal line algorithms show less bias than the HS algorithm. (b) The HS and the polygonal line algorithms produce similar curves.
(a) Distorted half circle, 100 data points. (b) Distorted S-shape, 100 data points.
Figure 6: Transformed Data Sets. The polygonal line algorithm still follows fairly closely the "distorted" shapes.
The second case is when the sample size is large. Although the generating curve is not necessarily the principal curve of the distribution, it is natural to expect the algorithm to well approximate the generating curve as the sample size grows. Such a case is shown in Figure 7(b), where n = 10000 data points were generated (but only 2000 of these were actually plotted). Here the polygonal line algorithm approximates the generating curve with much better accuracy than the HS algorithm.
(a) Circle, 100 data points. (b) S-shape, 10000 data points.
Figure 7: Small Noise Variance (a) and Large Sample Size (b). The curves produced by the polygonal line algorithm are nearly indistinguishable from the generating curves.
Although in the model (11) the generating curve is in general not the principal curve in our definition (or in the HS definition), it is of interest to numerically evaluate how well the polygonal line algorithm approximates the generating curve. In these experiments the generating curve g(t) is a circle of unit radius centered at the origin and the noise is zero mean bivariate uncorrelated Gaussian. We chose 21 different data sizes ranging from 10 to 10000, and 7 different noise standard deviations ranging from σ = 0.01 to σ = 0.4. For the measure of approximation we chose the average distance defined by

δ = (1 / l(f)) ∫ min_s ||f(t) − g(s)|| dt,

where the polygonal line f is parametrized by its arc length. To eliminate the distortion occurring at the endpoints, we initialized the polygonal line algorithm by an equilateral triangle inscribed in the generating circle. For each particular data size and noise variance value, 100 random data sets were generated and the resulting δ values were averaged over these experiments. The dependence of the average distance δ on the data size and the noise variance is plotted on a logarithmic scale in Figure 8. The resulting curves justify our informal observation made earlier that the approximation substantially improves as the data size grows, and as the variance of the noise decreases.
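The average distance above can be evaluated numerically by sampling the polygonal line uniformly along its arc length; for the unit circle the inner minimization has the closed form | ||p|| − 1 |. The function name and discretization are our own choices.

```python
import numpy as np

def avg_distance_to_circle(verts, samples_per_unit=1000):
    # delta = (1 / l(f)) * integral of min_s ||f(t) - g(s)|| dt for a polygonal
    # line f (arc-length parametrized) and the unit circle g centered at the
    # origin, where min_s ||p - g(s)|| = | ||p|| - 1 |.
    seg = np.diff(verts, axis=0)
    lengths = np.linalg.norm(seg, axis=1)
    total, acc = lengths.sum(), 0.0
    for a, d, l in zip(verts[:-1], seg, lengths):
        m = max(2, int(samples_per_unit * l))
        ts = np.linspace(0.0, 1.0, m)
        vals = np.abs(np.linalg.norm(a + ts[:, None] * d, axis=1) - 1.0)
        # trapezoidal rule over this segment
        acc += np.sum((vals[1:] + vals[:-1]) / 2.0) * (l / (m - 1))
    return acc / total
```

Applied to the equilateral triangle inscribed in the unit circle (the initialization mentioned above), this gives δ ≈ 0.31, the starting value from which the algorithm then improves.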
(Figure 8 plot: average distance δ versus data size n on a log-log scale, one curve for each noise level σ = 0.05, 0.1, 0.15, 0.2, 0.3, 0.4.)
Figure 8: The approximation error δ decreases as n grows or σ decreases.
4.2 Failure modes
We describe two specific situations when the polygonal line algorithm fails to recover the generating curve. In the first scenario, we use zig-zagging generating curves f_i for i = 2, 3, 4, consisting of 2^i line segments of equal length, such that two consecutive segments join at a right angle (Figure 9). In these experiments, the number of the data points generated on a line segment is constant (it is set to 100), and the variance of the bivariate Gaussian noise is 0.0005 l^2, where l is the length of a line segment. Figure 9 shows the principal curves produced by the HS and the polygonal line algorithms in the three experiments. Although the polygonal principal curve follows the generating curve more closely than the HS principal curve in the first two experiments (Figures 9(a) and (b)), the two algorithms produce equally poor results if the number of line segments exceeds a certain limit (Figure 9(c)). The data-dependent penalty term explains this behavior of the polygonal line algorithm. Since the penalty factor λ_p is proportional to the number of line segments, the penalty relatively increases as the number of line segments of the generating curve grows. To achieve the same local smoothness in the four experiments, the penalty factor should be gradually decreased as the number of line segments of the generating curve grows. Indeed, if the constant of the penalty term is reset to λ'_p = 0.02 in the fourth experiment, the polygonal principal curve recovers the generating curve with high accuracy (Figure 11(a)).
The second scenario when the polygonal line algorithm fails to produce a meaningful result is when the generating curve is too complex, so the algorithm does not find the global structure of the data. To test the gradual degradation of the algorithm we used spiral-shaped generating curves of increasing length, i.e., we set g_i(t) = (t sin(iπt), t cos(iπt)) for t ∈ [0, 1] and i = 1, …, 6. The variance of the noise was set to 0.0001, and we generated 1000 data points in each experiment.
Figure 9: Abrupt changes in the direction of the generating curve. The polygonal line algorithm oversmoothes the principal curve as the number of direction changes increases.
Figure 10 shows the principal curves produced by the HS and the polygonal line algorithms in three experiments (i = 3, 4, 6). In the first two experiments (Figures 10(a) and (b)), the polygonal principal curve is almost indistinguishable from the generating curve while the HS algorithm either oversmoothes the principal curve (Figure 10(a)), or fails to recover the shape of the generating curve (Figure 10(b)). In the third experiment both algorithms fail to find the shape of the generating curve (Figure 10(c)). The failure here is due to the fact that the algorithm is stuck in a local minimum between the initial curve (the first principal component) and the desired solution (the generating curve). If this is likely to occur in an application, the initialization step must be replaced by a more sophisticated routine that approximately captures the global structure of the data. Figure 11(b) indicates that this indeed works. Here we manually initialize both algorithms by a polygonal line with eight vertices. Using this "hint", the polygonal line algorithm produces an almost perfect solution, while the HS algorithm still cannot recover the shape of the generating curve.
4.3 Recovering smooth character skeletons
In this section we use the polygonal line algorithm to find smooth skeletons of hand-written character templates. The results reported here are preliminary, and the full treatment of this application will be presented in a future publication. Principal curves have been applied by Singh et al. [6] for similar purposes. In [6] the initial trees are produced using a version of the SOM algorithm and then the HS algorithm is applied to extract skeletal structures of hand-written characters. The focus in [6] was on recovering the topological structure of noisy letters in faded documents. Our aim with the polygonal line algorithm is to produce smooth curves which can be used to recover the trajectory of the pen stroke.
Figure 10: Spiral-shaped generating curves. The polygonal line algorithm fails to find the generating curve as the length of the spiral is increased.
Figure 11: Improved performance of the polygonal line algorithm. (a) The penalty parameter is decreased. (b) The algorithms are initialized manually.
To transform black-and-white character templates into two-dimensional data sets, we place the midpoint of the bottom-most left-most pixel of the template at the center of a coordinate system. The unit length of the coordinate system is set to the width (and height) of a pixel, so the midpoint of each pixel has integer coordinates. Then we add the midpoint of each black pixel to the data set.

The polygonal line algorithm was tested on images of isolated handwritten digits from the NIST Special Database 19 [22]. We found that the polygonal line algorithm can be used effectively to find smooth medial axes of simple digits which contain no loops or crossings of strokes. Figure 12 shows some of these results.
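The template-to-points conversion can be sketched as follows; the array conventions are ours, and we resolve "bottom-most left-most" by taking the left-most among the bottom-most black pixels.

```python
import numpy as np

def template_to_points(image):
    # image: 2-D 0/1 array with row 0 at the top. Each black pixel contributes
    # the midpoint of its square; the bottom-most left-most black pixel is
    # moved to the origin and the unit is one pixel width, so all midpoints
    # have integer coordinates.
    rows, cols = np.nonzero(image)
    y = image.shape[0] - 1 - rows            # flip rows so y increases upward
    bottom = y.min()
    left = cols[y == bottom].min()           # left-most among the bottom-most
    return np.column_stack((cols - left, y - bottom)).astype(float)
```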
To find smooth skeletons of more complex characters we modified and extended the polygonal line algorithm. We introduced new types of vertices incident to more than two line segments to handle loops and crossings, and modified the vertex optimization step accordingly. The initialization step was also replaced by a more sophisticated routine based on a thinning method [23] to produce an initial graph that approximately captures the topology of the character. Figure 13 shows some of the results of the extended polygonal line algorithm on more complex characters. Details of the extended polygonal line algorithm and more complete testing results will be presented in the future.
Figure 12: Results produced by the polygonal line algorithm on characters not containing loops or crossings.
5 Conclusion
A new definition of principal curves has been offered. The new definition has significant theoretical appeal; the existence of principal curves with this definition can be proved under very general conditions, and a learning method for constructing principal curves for finite data sets lends itself to theoretical analysis.
Inspired by the new definition and the theoretical learning scheme, we have introduced a new practical polygonal line algorithm for designing principal curves.

Figure 13: Results produced by the extended polygonal line algorithm on characters containing loops or crossings.

Lacking theoretical results concerning both the HS and our polygonal line algorithm, we compared the two methods through
simulations.We have foundthat in generalour algorithmhaseithercomparableor betterperfor-
mancethantheoriginal HS algorithmandit exhibits better, morerobustbehavior whenthedata
generatingmodelis varied. We have alsoreportedpreliminaryresultsin applyingthepolygonal
line algorithmto the problemof handwrittencharacterskeletonization.We believe that the new
principalcurvealgorithmmayalsoproveusefulin otherapplicationssuchasdatacompressionand
featureextractionwherea compactandaccuratedescriptionof a patternor an imageis required.
Theseareissuesfor futurework.
Appendix A
Curves in R^d
Let f: [a,b] → R^d be a continuous mapping (curve). The length of f over an interval [α,β] ⊆ [a,b], denoted by l(f,α,β), is defined by

l(f,α,β) = sup Σ_{i=1}^{N} ||f(t_i) − f(t_{i−1})||   (A.1)

where the supremum is taken over all finite partitions of [α,β] with arbitrary subdivision points α = t_0 < t_1 < … < t_N = β for N ≥ 1. The length of f over its domain [a,b] is denoted by l(f). If l(f) < ∞, then f is said to be rectifiable. It is well known that f = (f_1, …, f_d) is rectifiable if and only if each coordinate function f_j: [a,b] → R is of bounded variation.
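Definition (A.1) can be checked numerically: lengths of inscribed polygons over finer uniform partitions increase toward l(f). A small sketch (the half-circle example and the uniform partitions are our choices):

```python
import numpy as np

def partition_length(f, a, b, N):
    # Length of the inscribed polygon over the uniform partition
    # a = t_0 < t_1 < ... < t_N = b; by (A.1), l(f, a, b) is the supremum
    # of such sums over all finite partitions.
    t = np.linspace(a, b, N + 1)
    p = np.array([f(ti) for ti in t])
    return np.linalg.norm(np.diff(p, axis=0), axis=1).sum()

half_circle = lambda t: np.array([np.cos(t), np.sin(t)])
```

For the half circle the sums approach π from below as the partition is refined.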
Two curves f: [a,b] → R^d and g: [a',b'] → R^d are said to be equivalent if there exist two nondecreasing continuous functions φ: [0,1] → [a,b] and η: [0,1] → [a',b'], both onto, such that

f(φ(t)) = g(η(t)) for t ∈ [0,1].

In this case we write f ~ g, and it is easy to see that ~ is an equivalence relation. If f ~ g, then l(f) = l(g). A curve g over [a,b] is said to be parametrized by its arc length if l(g,a,t) = t − a for any a ≤ t ≤ b. Let f be a curve over [a,b] with length L. It is not hard to see that there exists a unique arc length parametrized curve g over [0,L] such that f ~ g.
Let f be any curve with length L' ≤ L, and consider the arc length parametrized curve g̃ ~ f with parameter interval [0,L']. By definition (A.1), for all s_1, s_2 ∈ [0,L'] we have ||g̃(s_1) − g̃(s_2)|| ≤ |s_1 − s_2|. Define g(t) = g̃(L' t) for 0 ≤ t ≤ 1. Then f ~ g, and g satisfies the following Lipschitz condition: for all t_1, t_2 ∈ [0,1],

||g(t_1) − g(t_2)|| = ||g̃(L' t_1) − g̃(L' t_2)|| ≤ L' |t_1 − t_2| ≤ L |t_1 − t_2|.   (A.2)

On the other hand, note that if g is a curve over [0,1] which satisfies the Lipschitz condition (A.2), then its length is at most L.
Let f be a curve over [a,b] and denote the squared Euclidean distance from any x ∈ R^d to f by

Δ(x,f) = inf_{a ≤ t ≤ b} ||x − f(t)||^2.

Note that if l(f) < ∞, then by the continuity of f, its graph

G_f = { f(t) : a ≤ t ≤ b }

is a compact subset of R^d, and the infimum above is achieved for some t. Also, since G_f = G_g if f ~ g, we also have that Δ(x,f) = Δ(x,g) for all g ~ f.
Proof of Lemma 1. Define

Δ* = inf { Δ(f) : l(f) ≤ L }.

First we show that the above infimum does not change if we add the restriction that all f lie inside a closed sphere S(r) = { x : ||x|| ≤ r } of large enough radius r and centered at the origin. Indeed, without excluding nontrivial cases, we can assume that Δ* < E||X||^2. Denote the distribution of X by µ and choose r > 3L large enough such that

∫_{S(r/3)} ||x||^2 µ(dx) > Δ* + ε   (A.3)

for some ε > 0. If f is such that G_f is not entirely contained in S(r), then for all x ∈ S(r/3) we have Δ(x,f) ≥ ||x||^2 since the diameter of G_f is at most L. Then (A.3) implies that

Δ(f) ≥ ∫_{S(r/3)} Δ(x,f) µ(dx) > Δ* + ε

and thus

Δ* = inf { Δ(f) : l(f) ≤ L, G_f ⊆ S(r) }.   (A.4)
In view of (A.4) there exists a sequence of curves {f_n} such that l(f_n) ≤ L, G_{f_n} ⊆ S(r) for all n, and Δ(f_n) → Δ*. By the discussion preceding (A.2), we can assume without loss of generality that all f_n are defined over [0,1] and

||f_n(t_1) − f_n(t_2)|| ≤ L |t_1 − t_2|   (A.5)

for all t_1, t_2 ∈ [0,1]. Consider the set C of all curves over [0,1] such that f ∈ C if and only if ||f(t_1) − f(t_2)|| ≤ L |t_1 − t_2| for all t_1, t_2 ∈ [0,1] and G_f ⊆ S(r). It is easy to see that C is a closed set under the uniform metric d(f,g) = sup_{0 ≤ t ≤ 1} ||f(t) − g(t)||. Also, C is an equicontinuous family of functions and sup_t ||f(t)|| is uniformly bounded over C. Thus C is a compact metric space by the Arzela-Ascoli theorem (see, e.g., [24]). Since f_n ∈ C for all n, it follows that there exists a subsequence f_{n_k} converging uniformly to an f* ∈ C.
To simplify the notation let us rename {f_{n_k}} as {f_n}. Fix x ∈ R^d, assume Δ(x,f_n) ≥ Δ(x,f*), and let t_x be such that Δ(x,f*) = ||x − f*(t_x)||^2. Then by the triangle inequality,

|Δ(x,f*) − Δ(x,f_n)| = Δ(x,f_n) − Δ(x,f*)
  ≤ ||x − f_n(t_x)||^2 − ||x − f*(t_x)||^2
  ≤ ( ||x − f_n(t_x)|| + ||x − f*(t_x)|| ) ||f_n(t_x) − f*(t_x)||.

By symmetry, a similar inequality holds if Δ(x,f_n) < Δ(x,f*). Since G_{f*}, G_{f_n} ⊆ S(r), and E||X||^2 is finite, there exists A > 0 such that

E|Δ(X,f_n) − Δ(X,f*)| ≤ A sup_{0 ≤ t ≤ 1} ||f_n(t) − f*(t)||

and therefore

Δ* = lim_{n→∞} Δ(f_n) = Δ(f*).

Since the Lipschitz condition on f* guarantees that l(f*) ≤ L, the proof is complete. □
Appendix B
Proof of Theorem 1. Let f*_k denote the curve in S_k minimizing the squared loss, i.e.,

f*_k = argmin_{f ∈ S_k} Δ(f).

The existence of a minimizing f*_k can easily be shown using a simpler version of the proof of Lemma 1. Then J(f_{k,n}) can be decomposed as

J(f_{k,n}) = ( Δ(f_{k,n}) − Δ(f*_k) ) + ( Δ(f*_k) − Δ(f*) )

where, using standard terminology, Δ(f_{k,n}) − Δ(f*_k) is called the estimation error and Δ(f*_k) − Δ(f*) is called the approximation error. We consider these terms separately first, and then choose k as a function of the training data size n to balance the obtained upper bounds in an asymptotically optimal way.
Approximation Error

For any two curves f and g of finite length define their (nonsymmetric) distance by

ρ(f,g) = max_t min_s ||f(t) − g(s)||.

Note that ρ(f̃,g̃) = ρ(f,g) if f̃ ~ f and g̃ ~ g, i.e., ρ(f,g) is independent of the particular choice of the parametrization within equivalence classes. Next we observe that if the diameter of K is D, and G_f, G_g ⊆ K, then for all x ∈ K,

Δ(x,g) − Δ(x,f) ≤ 2D ρ(f,g)   (B.1)

and therefore

Δ(g) − Δ(f) ≤ 2D ρ(f,g).   (B.2)

To prove (B.1), let x ∈ K and choose t' and s' such that Δ(x,f) = ||x − f(t')||^2 and min_s ||g(s) − f(t')|| = ||g(s') − f(t')||. Then

Δ(x,g) − Δ(x,f) ≤ ||x − g(s')||^2 − ||x − f(t')||^2
  = ( ||x − g(s')|| + ||x − f(t')|| )( ||x − g(s')|| − ||x − f(t')|| )
  ≤ 2D ||g(s') − f(t')||
  ≤ 2D ρ(f,g).
∆ � x � g� � ∆ � x � f � ] � x � g � s_ � � 2 �j� x � f � t _ � � 2� N � x � g � s_ � �çV�� x � f � t _ � � RíN � x � g � s_ � �î�j� x � f � t _ � � R] 2D � g � s_ � � f � t _ � �] 2Dρ � f � g� �Let f �Hs beanarbitraryarc lengthparametrizedcurve over l 0 � L _ m , whereL _ï] L. Defineg as
a polygonalcurve with verticesf � 0� � f � L _ ~ k � ������ f �� k � 1� L _ ~ k � � f � L _ � . For any t �ðl 0 � L _ m we have
26
K t � iL _ ~ k KU] L ~ï� 2k � for somei �ñf 0 ������ k g . Sinceg � s� � f � iL _ ~ k � for somes, wehave
mins� f � t � � g � s� � ] � f � t � � f � iL _ ~ k � �] K t � iL _ ~ k K�] L
2k �Notethatl � g� ] L _ , by construction,andthusg �òs k. Thusfor every f �òs thereexistsag �òs k such
thatρ � f � g� ] L ~ï� 2k � . Now let g �As k besuchthatρ � f e � g� ] L ~ï� 2k � . Thenby (B.2) we conclude
thattheapproximationerroris upperboundedas
∆ � f ek � � ∆ � f e � ] ∆ � g� � ∆ � f e �] 2Dρ � f e � g�] DLk � (B.3)
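The bound ρ(f,g) ≤ L/(2k) in this construction can be checked numerically. The sketch below is our own example: an arc-length parametrized quarter circle of radius 2 (so L' = π), measuring the distance from f only to the vertices f(iL'/k) of the inscribed polygon g, which is exactly the quantity bounded in the displayed inequality.

```python
import numpy as np

L_PRIME = np.pi                                   # length of the quarter circle
f = lambda s: 2.0 * np.array([np.cos(s / 2.0), np.sin(s / 2.0)])  # |f'(s)| = 1

def max_min_dist_to_vertices(k, n_grid=2000):
    # max over t of min over i of ||f(t) - f(i L'/k)||, where f(i L'/k) are
    # the vertices of the inscribed polygonal curve g with k segments.
    verts = np.array([f(i * L_PRIME / k) for i in range(k + 1)])
    pts = np.array([f(t) for t in np.linspace(0.0, L_PRIME, n_grid)])
    d = np.linalg.norm(pts[:, None, :] - verts[None, :, :], axis=2)
    return d.min(axis=1).max()
```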
Estimation Error
For each ε > 0 and k ≥ 1 let S_{k,ε} be a finite set of curves in K which form an ε-cover of S_k in the following sense. For any f ∈ S_k there is an f' ∈ S_{k,ε} which satisfies

sup_{x ∈ K} |Δ(x,f) − Δ(x,f')| ≤ ε.   (B.4)

The explicit construction of S_{k,ε} is given in [19]. Since f_{k,n} ∈ S_k (see (5)), there exists an f'_{k,n} ∈ S_{k,ε} such that |Δ(x,f_{k,n}) − Δ(x,f'_{k,n})| ≤ ε for all x ∈ K. We introduce the compact notation X^n = (X_1, …, X_n) for the training data. Thus we can write

E[Δ(X,f_{k,n}) | X^n] − Δ(f*_k)
  = E[Δ(X,f_{k,n}) | X^n] − Δ_n(f_{k,n}) + Δ_n(f_{k,n}) − Δ(f*_k)
  ≤ 2ε + E[Δ(X,f'_{k,n}) | X^n] − Δ_n(f'_{k,n}) + Δ_n(f_{k,n}) − Δ(f*_k)   (B.5)
  ≤ 2ε + E[Δ(X,f'_{k,n}) | X^n] − Δ_n(f'_{k,n}) + Δ_n(f*_k) − Δ(f*_k)   (B.6)
  ≤ 2ε + 2 max_{f ∈ S_{k,ε} ∪ {f*_k}} |Δ(f) − Δ_n(f)|   (B.7)

where (B.5) follows from the approximating property of f'_{k,n} and the fact that the distribution of X is concentrated on K. (B.6) holds because f_{k,n} minimizes Δ_n(f) over all f ∈ S_k, and (B.7) follows because, given X^n = (X_1, …, X_n), E[Δ(X,f'_{k,n}) | X^n] is an ordinary expectation of the type E[Δ(X,f)], f ∈ S_{k,ε}. Thus for any t > 2ε the union bound implies

P{ E[Δ(X,f_{k,n}) | X^n] − Δ(f*_k) > t }
  ≤ P{ max_{f ∈ S_{k,ε} ∪ {f*_k}} |Δ(f) − Δ_n(f)| > t/2 − ε }
  ≤ ( |S_{k,ε}| + 1 ) max_{f ∈ S_{k,ε} ∪ {f*_k}} P{ |Δ(f) − Δ_n(f)| > t/2 − ε }   (B.8)

where |S_{k,ε}| denotes the cardinality of S_{k,ε}.
where K s k u ε K denotesthecardinalityof s k u ε.Recallnow Hoeffding’s inequality[25] which statesthat if Y1 � Y2 ����� Yn areindependentand
identicallydistributedrealrandomvariablessuchthat0 ] Yi ] A with probabilityone,thenfor all
u c 0,
P ù aaaaa 1n n
∑i b 1
Yi � EY1
aaaaa c u úã] 2eS 2nu2 W A2 ûSincethediameterof K is D, wehave ü x ý f þ t ÿ�ü 2 ] D2 for all x � K andf suchthatGf � K. Thus
0 ] ∆ þ X � f ÿ ] D2 with probabilityoneandby Hoeffding’sinequality, for all f ��s k u ε � f f e g wehave
P� K∆ þ f ÿçý ∆n þ f ÿ�K c t
2ý ε � ] 2eS 2n LÃL t W 2M`S ε M 2 W D4
which impliesby (B.8) that
P�E � ∆ þ X � fk � n ÿ�� n ý ∆ þ f �k ÿ � t ��� 2 ���Sk � ε ��� 1� e� 2n ��� t � 2� � ε � 2 � D4
(B.9)
for any t > 2ε. Using the fact that E[Y] = ∫_0^∞ P{Y > t} dt for any nonnegative random variable Y, we can write for any u > 0,

Δ(f_{k,n}) − Δ(f*_k) = ∫_0^∞ P{ E[Δ(X,f_{k,n}) | X^n] − Δ(f*_k) > t } dt
  ≤ u + 2ε + 2 ( |S_{k,ε}| + 1 ) ∫_{u+2ε}^∞ e^{−2n(t/2 − ε)^2 / D^4} dt   (B.10)
  ≤ u + 2ε + 2 ( |S_{k,ε}| + 1 ) D^4 e^{−nu^2/(2D^4)} / (nu)   (B.11)
  = √( 2D^4 log(|S_{k,ε}| + 1) / n ) + 2ε + O(n^{−1/2})   (B.12)

where (B.11) follows from the inequality ∫_x^∞ e^{−t^2/2} dt < (1/x) e^{−x^2/2} for x > 0, and (B.12) follows by setting u = √( 2D^4 log(|S_{k,ε}| + 1) / n ), where log denotes natural logarithm. The following lemma, which
is proved in [19], demonstrates the existence of a suitable covering set S_{k,ε}.

Lemma 2. For any ε > 0 there exists a finite collection of curves S_{k,ε} in K such that for any f ∈ S_k there is an f' ∈ S_{k,ε} satisfying

sup_{x ∈ K} |Δ(x,f) − Δ(x,f')| ≤ ε

and

|S_{k,ε}| ≤ 2^{LD/ε + 3k + 1} V_d^{k+1} ( 3D^2 √d / ε + √d )^d ( 3LD √d / (kε) + 3√d )^{kd}
It is not hard to see that setting ε = 1/k in Lemma 2 gives the upper bound

2D^4 log( |S_{k,ε}| + 1 ) ≤ k C(L,D,d)   (B.13)

where C(L,D,d) does not depend on k. Combining this with (B.12) and the approximation bound given by (B.3) results in

Δ(f_{k,n}) − Δ(f*) ≤ √( k C(L,D,d) / n ) + (DL + 2)/k + O(n^{−1/2}).

The rate at which Δ(f_{k,n}) approaches Δ(f*) is optimized by setting the number of segments k to be proportional to n^{1/3}. With this choice J(f_{k,n}) = Δ(f_{k,n}) − Δ(f*) has the asymptotic convergence rate

J(f_{k,n}) = O(n^{−1/3}),

and the proof of Theorem 1 is complete.
To show the bound (7), let δ ∈ (0,1) and observe that by (B.9) we have

P{ E[Δ(X,f_{k,n}) | X^n] − Δ(f*_k) < t } ≥ 1 − δ

whenever t > 2ε and

δ = 2 ( |S_{k,ε}| + 1 ) e^{−2n(t/2 − ε)^2 / D^4}.

Solving this equation for t and letting ε = 1/k as before, we obtain

t = √( ( 2D^4 log(|S_{k,1/k}| + 1) − 2D^4 log(δ/2) ) / n ) + 2/k
  ≤ √( ( k C(L,D,d) − 2D^4 log(δ/2) ) / n ) + 2/k.

Therefore, with probability at least 1 − δ, we have

E[Δ(X,f_{k,n}) | X^n] − Δ(f*_k) ≤ √( ( k C(L,D,d) − 2D^4 log(δ/2) ) / n ) + 2/k.

Combining this bound with the approximation bound Δ(f*_k) − Δ(f*) ≤ DL/k gives (7). □
References
[1] T. Hastie, Principal curves and surfaces. PhD thesis, Stanford University, 1984.

[2] T. Hastie and W. Stuetzle, "Principal curves," Journal of the American Statistical Association, vol. 84, pp. 502-516, 1989.

[3] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. COM-28, pp. 84-95, 1980.

[4] W. S. Cleveland, "Robust locally weighted regression and smoothing scatterplots," Journal of the American Statistical Association, vol. 74, pp. 829-835, 1979.

[5] J. D. Banfield and A. E. Raftery, "Ice floe identification in satellite images using mathematical morphology and clustering about principal curves," Journal of the American Statistical Association, vol. 87, pp. 7-16, 1992.

[6] R. Singh, M. C. Wade, and N. P. Papanikolopoulos, "Letter-level shape description by skeletonization in faded documents," in Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision, pp. 121-126, IEEE Comput. Soc. Press, 1998.

[7] K. Reinhard and M. Niranjan, "Subspace models for speech transitions using principal curves," Proceedings of Institute of Acoustics, vol. 20(6), pp. 53-60, 1998.

[8] K. Chang and J. Ghosh, "Principal curves for nonlinear feature extraction and classification," in Applications of Artificial Neural Networks in Image Processing III, vol. 3307, (San Jose, CA), pp. 120-129, SPIE Photonics West '98 Electronic Image Conference, Jan 24-30, 1998.

[9] K. Chang and J. Ghosh, "Principal curve classifier - a nonlinear approach to pattern classification," in IEEE International Joint Conference on Neural Networks, (Anchorage, AK), pp. 695-670, May 5-9, 1998.

[10] R. Tibshirani, "Principal curves revisited," Statistics and Computation, vol. 2, pp. 183-190, 1992.

[11] F. Mulier and V. Cherkassky, "Self-organization as an iterative kernel smoothing process," Neural Computation, vol. 7, pp. 1165-1177, 1995.

[12] P. Delicado, "Principal curves and principal oriented points," Tech. Rep. 309, Departament d'Economia i Empresa, Universitat Pompeu Fabra, 1998.

[13] T. Duchamp and W. Stuetzle, "Geometric properties of principal curves in the plane," in Robust statistics, data analysis, and computer intensive methods: in honor of Peter Huber's 60th birthday (H. Rieder, ed.), vol. 109 of Lecture notes in statistics, pp. 135-152, Springer-Verlag, 1996.

[14] A. N. Kolmogorov and S. V. Fomin, Introductory Real Analysis. New York: Dover, 1975.

[15] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Boston: Kluwer, 1992.

[16] J. A. Hartigan, Clustering Algorithms. New York: Wiley, 1975.

[17] T. Tarpey, L. Li, and B. D. Flury, "Principal points and self-consistent points of elliptical distributions," Annals of Statistics, vol. 23, no. 1, pp. 103-112, 1995.

[18] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[19] B. Kegl, Principal Curves: Learning, Design, and Applications. PhD thesis, Concordia University, Montreal, Canada, 1999.

[20] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer, 1996.

[21] A. N. Kolmogorov and V. M. Tikhomirov, "ε-entropy and ε-capacity of sets in function spaces," Translations of the American Mathematical Society, vol. 17, pp. 277-364, 1961.

[22] P. Grother, NIST Special Database 19. National Institute of Standards and Technology, Advanced Systems Division, 1995.

[23] S. Suzuki and K. Abe, "Sequential thinning of binary pictures using distance transformation," in Proceedings of the 8th International Conference on Pattern Recognition, pp. 289-292, 1986.

[24] R. B. Ash, Real Analysis and Probability. New York: Academic Press, 1972.

[25] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, pp. 13-30, 1963.