Autonomous Robots 12, 55–69, 2002. © 2002 Kluwer Academic Publishers. Manufactured in The Netherlands.

Statistical Learning for Humanoid Robots

SETHU VIJAYAKUMAR AND AARON D'SOUZA
Computer Science & Neuroscience and Kawato Dynamic Brain Project, University of Southern California, Los Angeles, CA 90089-2520, USA
[email protected]
[email protected]

TOMOHIRO SHIBATA
Kawato Dynamic Brain Project, ERATO, Japan Science & Technology Corp., Kyoto 619-0288, Japan
[email protected]

JÖRG CONRADT
University/ETH Zurich, Winterthurerstr. 190, CH-8057 Zurich, Switzerland
[email protected]

STEFAN SCHAAL
Computer Science & Neuroscience and Kawato Dynamic Brain Project, University of Southern California, Los Angeles, CA 90089-2520, USA
[email protected]

Abstract. The complexity of the kinematic and dynamic structure of humanoid robots makes conventional analytical approaches to control increasingly unsuitable for such systems. Learning techniques offer a possible way to aid controller design if insufficient analytical knowledge is available, and learning approaches seem mandatory when humanoid systems are supposed to become completely autonomous. While recent research in neural networks and statistical learning has focused mostly on learning from finite data sets without stringent constraints on computational efficiency, learning for humanoid robots requires a different setting, characterized by the need for real-time learning performance from an essentially infinite stream of incrementally arriving data. This paper demonstrates how even high-dimensional learning problems of this kind can successfully be dealt with by techniques from nonparametric regression and locally weighted learning. As an example, we describe the application of one of the most advanced of such algorithms, Locally Weighted Projection Regression (LWPR), to the on-line learning of three problems in humanoid motor control: the learning of inverse dynamics models for model-based control, the learning of inverse kinematics of redundant manipulators, and the learning of oculomotor reflexes. All these examples demonstrate fast, i.e., within seconds or minutes, learning convergence with highly accurate final performance. We conclude that real-time learning for complex motor systems like humanoid robots is possible with appropriately tailored algorithms, such that increasingly autonomous robots with massive learning abilities should be achievable in the near future.

Keywords: motor control, statistical learning, dimensionality reduction, inverse dynamics, inverse kinematics, oculomotor learning, nonparametric regression


1. Introduction

The necessity for adaptive control is becoming more apparent as control systems grow increasingly complex, for instance, as experienced in the fields of advanced robotics, factory automation, and autonomous vehicle control. Humanoid robots, the focus of this paper, are a typical example. Humanoid robots are high-dimensional movement systems for which classical system identification and control techniques are often insufficient due to unknown sources of nonlinearities inherent in these systems. Learning techniques are a possible way to overcome such limitations by aiding the design of appropriate control laws (Slotine, 1991), which often involve decisions based on a multitude of sensors and actuators. Learning also seems to be the only viable research approach towards the generation of flexible autonomous robots that can perform multiple tasks (Schaal, 1999), with the hope of creating a completely autonomous humanoid robot at some point.

Among the characteristics of the motor learning problems in humanoid robots are high-dimensional input spaces with potentially redundant and irrelevant dimensions, nonstationary input and output distributions, essentially infinite training data sets with no representative validation sets, and the need for continual learning. Most learning tasks fall into the domain of regression problems, e.g., as in learning dynamics models, or they at least involve regression problems, e.g., as in learning a policy with reinforcement learning. Interestingly, the class of on-line regression problems with the characteristics above has so far not been conquered by the new developments in statistical learning. Bayesian inference (Bishop, 1995) is usually computationally too expensive for real-time application as it requires representation of the complete joint probability densities of the data. The framework of structural risk minimization (Vapnik, 1995), the most advanced in the form of Support Vector Machines, excels in classification and finite batch learning problems, but has yet to show compelling performance in regression and incremental learning. However, techniques from nonparametric regression, in particular the methods of locally weighted learning (Atkeson et al., 1997), have recently advanced to meet all the requirements of real-time learning in high-dimensional spaces (Schaal et al., 2000).

In this paper, we will present one of the most advanced locally weighted learning algorithms, Locally Weighted Projection Regression (LWPR), and its application to three challenging problems of learning in humanoid robotics, i.e., (i) an inverse dynamics model of a 7 DOF anthropomorphic robot, (ii) an inverse kinematics map of a redundant dextrous arm, and (iii) the bio-mimetic gaze stabilization of a humanoid oculomotor system. In the following sections, we will first explain the LWPR algorithm and then introduce the various learning tasks and illustrate results from real-time learning on the actual robots for each of the tasks. To the best of our knowledge, this is the first time that real-time learning of such complex models has been accomplished in robot control.

2. Locally Weighted Projection Regression

The core concept of our learning approach is to approximate nonlinear functions by means of piecewise linear models (Atkeson et al., 1997). The learning system automatically determines the appropriate number of local models, the parameters of the hyperplane in each model, and also the region of validity, called the receptive field (RF), of each model, formalized as a Gaussian kernel:

w_k = exp(−(1/2)(x − c_k)^T D_k (x − c_k)),   (1)

Given a query point x, each linear model calculates a prediction ŷ_k = β_k x. The total output of the learning system is the weighted mean of all K linear models:

ŷ = Σ_{k=1}^{K} w_k ŷ_k / Σ_{k=1}^{K} w_k,

also illustrated in Fig. 1. Learning in the system involves determining the linear regression parameter β_k and the distance metric D_k. The center c_k of the RF remains fixed. Local models are created as and when needed, as described in Section 2.3.
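As a concrete sketch, the prediction just described (Gaussian activations from Eq. (1), followed by a weighted mean of the local linear predictions) can be written in a few lines. The parameter names below are illustrative, and each local model is assumed to predict linearly about its own center:

```python
import numpy as np

def lwpr_predict(x, centers, metrics, betas, intercepts):
    """Weighted-mean prediction over K local linear models (a sketch).

    centers[k], metrics[k] (D_k), betas[k], and intercepts[k] are
    hypothetical per-model parameters, not the paper's exact variables.
    """
    num, den = 0.0, 0.0
    for c, D, beta, b in zip(centers, metrics, betas, intercepts):
        diff = x - c
        w = np.exp(-0.5 * diff @ D @ diff)   # receptive-field activation, Eq. (1)
        y_k = b + beta @ diff                # local linear prediction about c
        num += w * y_k
        den += w
    return num / den                         # weighted mean of all K models
```

With two local models of the same linear function, the blended prediction coincides with that function at any query point between the centers.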

2.1. Local Dimensionality Reduction

Despite its appealing simplicity, the "piecewise linear modeling" approach becomes numerically brittle and computationally too expensive in high-dimensional input spaces. Given the empirical observation that high-dimensional data often lie on locally low-dimensional distributions, it is possible to develop an efficient approach that exploits this property.

Figure 1. Information processing unit of LWPR.

Instead of using ordinary linear regression to fit the local hyperplanes, we suggest employing Partial Least Squares (PLS) (Wold, 1975; Frank and Friedman, 1993). PLS recursively computes orthogonal projections of the input data and performs single-variable regressions along these projections on the residuals of the previous iteration step. Table 1 illustrates PLS in pseudocode for a global linear model, where the input data are in the rows of the matrix X and the corresponding one-dimensional output data are in the vector y. The key ingredient of PLS is to use the direction of maximal correlation between the residual error and the input data as the projection direction at every regression step. Additionally, PLS regresses the inputs of the previous step against the projected inputs s_r in order to ensure the orthogonality of all the projections u_r (Step 2b). Actually, this additional regression could be avoided by replacing p_r with u_r, similar to techniques used in principal component analysis (Sanger, 1989). However, using this regression step leads to better performance of the algorithm. This effect is due to the fact that PLS chooses the most effective projections if the input data has a spherical distribution: with only one projection, PLS will find the direction of the gradient and achieve optimal regression results. The regression step in 2b modifies the

Table 1. Pseudocode of PLS projection regression.

1. Initialize: X_res = X, y_res = y
2. For r = 1 to R (# projections):
   (a) u_r = X_res^T y_res; β_r = s_r^T y_res/(s_r^T s_r), where s_r = X_res u_r.
   (b) y_res = y_res − s_r β_r; X_res = X_res − s_r p_r^T, where p_r = X_res^T s_r/(s_r^T s_r).

input data X_res such that each resulting data vector has coefficients of minimal magnitude and, hence, pushes the distribution of X_res to become more spherical.
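Table 1 translates almost line for line into code. The sketch below assumes a global linear model through the origin, as in the table; `pls_fit` and `pls_predict` are hypothetical names:

```python
import numpy as np

def pls_fit(X, y, R):
    """Batch PLS regression as in Table 1 (variable names mirror the table)."""
    Xres, yres = X.copy(), y.copy()
    us, betas, ps = [], [], []
    for _ in range(R):
        u = Xres.T @ yres                 # direction of maximal correlation (Step 2a)
        s = Xres @ u                      # projected inputs s_r
        beta = (s @ yres) / (s @ s)       # univariate regression coefficient
        p = (Xres.T @ s) / (s @ s)        # deflation vector p_r (Step 2b)
        yres = yres - s * beta            # deflate residuals
        Xres = Xres - np.outer(s, p)      # deflate inputs to keep projections orthogonal
        us.append(u); betas.append(beta); ps.append(p)
    return us, betas, ps

def pls_predict(x, us, betas, ps):
    """Prediction for one input row, replaying the same projection/deflation steps."""
    y, z = 0.0, x.copy()
    for u, beta, p in zip(us, betas, ps):
        s = z @ u
        y += beta * s
        z = z - s * p
    return y
```

With R equal to the number of inputs, PLS reproduces the ordinary least squares solution, so noiseless linear data is fit exactly.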

An incremental locally weighted version of the PLS algorithm is derived in Table 2 (Vijayakumar and Schaal, 2000). Here, λ ∈ [0, 1] denotes a forgetting factor that determines how quickly older data will be forgotten in the various PLS parameters, similar to recursive system identification techniques (Ljung and Soderstrom, 1986). The variables SS_r, SR_r, and SZ_r are memory terms that enable us to perform the univariate regression in Step 5 in a recursive least squares fashion, i.e., a fast Newton-like method.

Since PLS selects the univariate projections very efficiently, it is even possible to run locally weighted PLS with only one projection direction (denoted as LWPR-1). The optimal projection is in the direction of the local gradient of the function to be approximated. If the locally weighted input data forms a spherical distribution in a local model, the single PLS projection will suffice to find the optimal direction. Otherwise, the distance metric (and hence, the weighting of the data) will need to be adjusted to make the local distribution more spherical. The learning rule of the distance metric can accomplish this adjustment, as explained below. It should be noted that Steps 8–10 in Table 2 become unnecessary for the uni-projection case.

2.2. Learning the Distance Metric

The distance metric D_k, and hence the locality of the receptive fields, can be learned for each local model individually by stochastic gradient descent in a leave-one-out cross validation cost function. Note that this update does not require competitive learning; only a completely local learning rule is needed, and leave-one-out cross validation can be performed without keeping data in memory (Schaal and Atkeson, 1998). The update rule can be written as:

M^{n+1} = M^n − α ∂J/∂M, where D = M^T M (for positive definiteness),   (2)

and the cost function to be minimized is:

J = (1/Σ_{i=1}^{M} w_i) Σ_{i=1}^{M} Σ_{r=1}^{R} w_i res_{r+1,i}^2 / (1 − w_i s_{r,i}^2/(s_r^T W s_r))^2 + (γ/N) Σ_{i,j=1}^{N} D_{ij}^2 = Σ_{r=1}^{R} (Σ_{i=1}^{M} J_{1,r}) + J_2.   (3)
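The regularization term J_2 of Eq. (3) can be differentiated in closed form using the identities listed in Table 3: since D = M^T M, one obtains ∂J_2/∂M = (4γ/N) M(M^T M). A small numerical check of this gradient (function names are illustrative):

```python
import numpy as np

def j2(M, gamma):
    """Regularization term of Eq. (3): J2 = (gamma/N) * sum_ij D_ij^2, D = M^T M."""
    N = M.shape[1]
    D = M.T @ M
    return gamma / N * np.sum(D ** 2)

def j2_grad(M, gamma):
    """Closed-form gradient dJ2/dM = (4*gamma/N) * M @ (M^T M)."""
    N = M.shape[1]
    return 4.0 * gamma / N * M @ (M.T @ M)

def j2_grad_fd(M, gamma, eps=1e-6):
    """Central finite-difference check of the gradient, entry by entry."""
    G = np.zeros_like(M)
    for k in range(M.shape[0]):
        for l in range(M.shape[1]):
            Mp = M.copy(); Mp[k, l] += eps
            Mm = M.copy(); Mm[k, l] -= eps
            G[k, l] = (j2(Mp, gamma) - j2(Mm, gamma)) / (2 * eps)
    return G
```

The parameterization D = M^T M guarantees a positive semi-definite distance metric for any M, which is why the gradient step in Eq. (2) is taken with respect to M rather than D.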


Table 2. Incremental locally weighted PLS for one RF centered at c.

Initialization: x_0^0 = 0, u_0 = 0, β_0^0 = 0, W^0 = 0

Given: training point (x, y)

w = exp(−(1/2)(x − c)^T D(x − c))

Update the means:
  W^{n+1} = λW^n + w
  x_0^{n+1} = (λW^n x_0^n + wx)/W^{n+1}
  β_0^{n+1} = (λW^n β_0^n + wy)/W^{n+1}

Update the local model:
  Initialize: z = x − x_0^{n+1}, res_1 = y − β_0^{n+1}
  For r = 1 : R (# projections):
    1. u_r^{n+1} = λu_r^n + wz res_r
    2. s_r = z^T u_r^{n+1}/(u_r^{n+1 T} u_r^{n+1})
    3. SS_r^{n+1} = λSS_r^n + w s_r^2
    4. SR_r^{n+1} = λSR_r^n + w s_r res_r
    5. β_r^{n+1} = SR_r^{n+1}/SS_r^{n+1}
    6. res_{r+1} = res_r − s_r β_r^{n+1}
    7. MSE_r^{n+1} = λMSE_r^n + w res_{r+1}^2
    8. SZ_r^{n+1} = λSZ_r^n + wz s_r
    9. p_r^{n+1} = SZ_r^{n+1}/SS_r^{n+1}
    10. z = z − s_r p_r^{n+1}

Predicting with novel data (x_q): Initialize: ŷ = β_0, z = x_q − x_0
  For r = 1 : R:
    ŷ = ŷ + β_r s_r, where s_r = u_r^T z
    z = z − s_r p_r^n

where M denotes the number of training data and N the number of input dimensions. A stochastic version of the gradient ∂J/∂M can be derived from the cost function by keeping track of several "memory terms," as shown in Table 3.
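A minimal sketch of the Table 2 update for a single receptive field is given below. To keep the projection scale stable under forgetting, this sketch projects onto the unit-normalized direction u_r/‖u_r‖ in both the update and the prediction (a simplification of the table's normalization), adds small constants to avoid division by zero before any statistics have accumulated, and omits Step 7 and the distance-metric update; the class and attribute names are illustrative:

```python
import numpy as np

class LocalPLS:
    """Illustrative incremental locally weighted PLS for one RF centered at c."""

    def __init__(self, c, D, R, lam=0.999):
        n = len(c)
        self.c, self.D, self.R, self.lam = c, D, R, lam
        self.W = 0.0                  # sum of weights, W^n
        self.x0 = np.zeros(n)         # weighted input mean, x_0
        self.beta0 = 0.0              # weighted output mean, beta_0
        self.u = np.zeros((R, n))     # projection directions
        self.p = np.zeros((R, n))     # deflation vectors
        self.beta = np.zeros(R)       # slopes along each projection
        self.SS = np.full(R, 1e-10)   # memory terms for the recursive regression
        self.SR = np.zeros(R)
        self.SZ = np.zeros((R, n))

    def _project(self, z, r):
        # project onto the unit-normalized direction (simplification, see above)
        u = self.u[r]
        return z @ u / (np.linalg.norm(u) + 1e-12)

    def update(self, x, y):
        d = x - self.c
        w = np.exp(-0.5 * d @ self.D @ d)               # Eq. (1)
        lam = self.lam
        Wn = lam * self.W + w                           # update the means
        self.x0 = (lam * self.W * self.x0 + w * x) / Wn
        self.beta0 = (lam * self.W * self.beta0 + w * y) / Wn
        self.W = Wn
        z, res = x - self.x0, y - self.beta0
        for r in range(self.R):
            self.u[r] = lam * self.u[r] + w * z * res    # Step 1
            s = self._project(z, r)                      # Step 2
            self.SS[r] = lam * self.SS[r] + w * s * s    # Step 3
            self.SR[r] = lam * self.SR[r] + w * s * res  # Step 4
            self.beta[r] = self.SR[r] / self.SS[r]       # Step 5
            res = res - s * self.beta[r]                 # Step 6
            self.SZ[r] = lam * self.SZ[r] + w * z * s    # Step 8
            self.p[r] = self.SZ[r] / self.SS[r]          # Step 9
            z = z - s * self.p[r]                        # Step 10
        return w

    def predict(self, xq):
        y, z = self.beta0, xq - self.x0
        for r in range(self.R):
            s = self._project(z, r)
            y += self.beta[r] * s
            z = z - s * self.p[r]
        return y
```

Streaming samples of a noiseless linear function through `update` drives the prediction toward that function within the receptive field, without ever storing the data.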

Table 3. Derivatives for distance metric update.

∂J/∂M ≈ Σ_{r=1}^{R} (Σ_{i=1}^{M} ∂J_{1,r}/∂w) ∂w/∂M + (w/W^{n+1}) ∂J_2/∂M   (stochastic update)

∂w/∂M_{kl} = −(1/2) w (x − c)^T (∂D/∂M_{kl}) (x − c);   ∂J_2/∂M_{kl} = 2(γ/N) Σ_{i,j=1}^{N} D_{ij} ∂D_{ij}/∂M_{kl}

∂D_{ij}/∂M_{kl} = M_{kj} δ_{il} + M_{ki} δ_{jl}, where δ_{ij} = 1 if i = j, else δ_{ij} = 0.

Compute the following for each projection direction r:

Σ_{i=1}^{M} ∂J_{1,r}/∂w = e_{cv,r}^2/W^{n+1} − 2(P_r^{n+1} s_r e_r/W^{n+1}) H_r^n − 2((P_r^{n+1} s_r)^2/W^{n+1}) R_r^n − E_r^{n+1}/(W^{n+1})^2 + [T_r^{n+1} − 2R_r^{n+1} P_r^{n+1} C_r^{n+1}] (I − u_r u_r^T/(u_r^T u_r)) z res_r/(W^{n+1} √(u_r^T u_r))

C_r^{n+1} = λC_r^n + w s_r z^T;   e_r = res_{r+1};   e_{cv,r} = e_r/(1 − w P_r^{n+1} s_r^2)

P_r^{n+1} = 1/SS_r^{n+1};   H_r^{n+1} = λH_r^n + w e_{cv,r} s_r/(1 − w h_r)

R_r^{n+1} = λR_r^n + w^2 s_r^2 e_{cv,r}^2/(1 − w h_r), where h_r = P_r^{n+1} s_r^2

E_r^{n+1} = λE_r^n + w e_{cv,r}^2;   T_r^{n+1} = λT_r^n + w(2w e_{cv,r}^2 s_r P_r^{n+1} − e_{cv,r} β_r^{n+1})/(1 − w h_r) z^T

2.3. The Complete LWPR Algorithm

All the ingredients above can be combined in an incremental learning scheme that automatically allocates new locally linear models as needed. The final learning


Table 4. Pseudocode of the complete LWPR algorithm.

– Initialize the LWPR with no receptive field (RF).
– For every new training sample (x, y):
  • For k = 1 to K (# of receptive fields):
    ∗ calculate the activation from Eq. (1)
    ∗ update projections & regression (Table 2) and distance metric (Section 2.2)
    ∗ check if the number of projections needs to be increased (cf. Section 2.3)
  • If no RF was activated by more than w_gen:
    ∗ create a new RF with r = 2, c = x, D = D_def

network is illustrated in Fig. 1, and an outline of the algorithm is shown in Table 4.

In this pseudocode, w_gen is a threshold that determines when to create a new receptive field, and D_def is the initial (usually diagonal) distance metric in Eq. (1). The initial number of projections is set to r = 2. The algorithm has a simple mechanism for determining whether r should be increased, by recursively keeping track of the mean-squared error (MSE) as a function of the number of projections included in a local model, i.e., Step 7 in the incremental PLS pseudocode. If the MSE at the next projection does not decrease by more than a certain percentage of the previous MSE, i.e., MSE_{i+1}/MSE_i > φ, where φ ∈ [0, 1], the algorithm will stop adding new projections locally. For a diagonal distance metric D, and under the assumption that the number of projections R remains small, the computational complexity of the update of all parameters of LWPR is linear in the number of input dimensions n. For the LWPR-1 variant, this O(n) computational complexity is always guaranteed.
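The loop of Table 4 can be sketched as follows. For brevity, the receptive fields here carry only a locally constant model updated by a weighted running mean, where the paper uses the locally weighted PLS update of Table 2; names such as `SimpleRF` and `lwpr_train` are illustrative:

```python
import numpy as np

class SimpleRF:
    """A deliberately simplified receptive field: locally constant model.
    (The paper's RFs instead run the incremental PLS update of Table 2.)"""
    def __init__(self, c, D):
        self.c, self.D = c, D
        self.W, self.b = 0.0, 0.0
    def activation(self, x):
        d = x - self.c
        return np.exp(-0.5 * d @ self.D @ d)      # Eq. (1)
    def update(self, x, y, lam=0.999):
        w = self.activation(x)
        self.W = lam * self.W + w                  # forgetting-factor weight sum
        self.b += w * (y - self.b) / self.W        # weighted running mean of y
        return w

def lwpr_train(stream, w_gen, D_def):
    """Top-level loop of Table 4: update all RFs on each sample and
    allocate a new RF when no existing one is activated above w_gen."""
    rfs = []
    for x, y in stream:
        w_max = 0.0
        for rf in rfs:
            w_max = max(w_max, rf.update(x, y))
        if w_max < w_gen:                          # no RF sufficiently activated
            rfs.append(SimpleRF(c=x.copy(), D=D_def.copy()))
            rfs[-1].update(x, y)
    return rfs

def lwpr_predict(x, rfs):
    ws = np.array([rf.activation(x) for rf in rfs])
    bs = np.array([rf.b for rf in rfs])
    return ws @ bs / ws.sum()
```

Run on a one-dimensional stream, the loop tiles the input range with receptive fields spaced by the w_gen threshold and blends their local estimates at query time.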

3. Real-Time Learning for Humanoid Robots

One of the main motivations for the development of LWPR was that model-based control of our humanoid robots with analytical models did not result in sufficient accuracy. The following sections describe how LWPR has allowed us to improve model-based control with models that were learned, i.e., they were acquired very rapidly in real-time while the system was trying to accomplish a task. Our results are among the first in the learning literature to demonstrate the feasibility of real-time statistical learning in high-dimensional systems such as humanoid robots.

3.1. Real-Time Learning of Inverse Dynamics

A common strategy in robotic and biological motor control is to convert kinematic trajectory plans into motor commands by means of an inverse dynamics model. The inverse dynamics model takes the desired positions, velocities, and accelerations of all DOFs of the robot and outputs the appropriate motor commands. In the case of learning with our seven DOF anthropomorphic robot arm (see Fig. 2(a)), the inverse dynamics model receives 21 inputs and outputs 7 torque commands. If derived analytically under a rigid body dynamics assumption (An et al., 1988), the most compact recursive formulation of the inverse dynamics of our robot results in about 20 pages of compact C-code, filled with nested sine and cosine terms. For on-line learning, motor commands need to be generated from the model at 480 Hz in our implementation. Updating the learning system can take place at a lower rate but should remain as high as possible to capture sufficient data in fast movements; we usually achieve an update rate of about 70 Hz.

Learning regression problems in such a high-dimensional input space is daunting from the viewpoint of the bias-variance trade-off. In learning control, training data is generated by the learning system itself, and it is impossible to assess a priori what structural complexity that data is going to have. Fortunately, actual movement systems do not fill the data space in a completely random way. Instead, when viewed locally, data distributions tend to be low-dimensional, e.g., about 4–6 dimensional for the inverse dynamics of our robot (Schaal et al., 1998) instead of the global 21 input dimensions. This property, which is exploited by the LWPR algorithm, is a key element in the excellent real-time performance of our learning scheme.

3.1.1. Performance Comparison on a Static Data Set. Before demonstrating the applicability of LWPR in real-time, a comparison with alternative learning methods will serve to demonstrate the complexity of the learning task. We collected 50,000 data points from various movement patterns of our 7 DOF anthropomorphic robot (Fig. 2(a)) at 50 Hz sampling frequency. 10 percent of this data was excluded as a test set. The training data was approximated by 4 different methods: i) parameter estimation based on an analytical rigid body dynamics model (An et al., 1988), ii) Support Vector Regression (Saunders et al., 1998),


Figure 2. (a) 7-DOF SARCOS dexterous arm. (b) 30-DOF humanoid robot.

iii) LWPR-1, and iv) full LWPR. It should be noted that neither i) nor ii) are incremental learning methods, i.e., they require batch learning. Using a parametric model as suggested in i) and just approximating its open parameters from data results in a global model of the inverse dynamics and is theoretically the most powerful method. However, given that our robot is actuated hydraulically and is rather lightweight and compliant, we know that the rigid body dynamics assumption is not fully justified. Method ii), Support Vector Regression, is a relatively new statistical learning approach that was derived from the theory of structural risk minimization. In many recent publications, support vector machines have demonstrated superior learning performance over previous algorithms, such that a comparison of this method with LWPR seemed to be an interesting benchmark. LWPR as used in iii) and iv) was exactly the algorithm described in the previous section. Methods ii)–iv) learned a separate model for each output of the inverse dynamics, i.e., all models had a univariate output and 21 inputs. LWPR employed a diagonal distance metric.

Figure 3 illustrates the function approximation results for the shoulder motor command graphed over the number of training iterations (one iteration corresponds to the update from one data point). Surprisingly,

Figure 3. Comparison of generalization error (nMSE) traces for different learning schemes.

rigid body parameter estimation achieved the worst results. LWPR-1 outperformed parameter estimation, but fell behind SVM regression. Full LWPR performed the best. The results for all other DOFs were analogous. For the final result, LWPR employed 260 local models, using an average of 3.2 local projections. LWPR-1 did not perform better because we used a diagonal distance metric. The ability of a diagonal distance metric to "carve out" a locally spherical distribution is too limited to accomplish better results; a full distance metric can remedy this problem, but would make the learning updates quadratic in the number of inputs. These results demonstrate that LWPR is a competitive function approximation technique.
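For reference, the normalized mean squared error (nMSE) plotted in Fig. 3 can be computed as below; the exact normalization used in the figure is an assumption (MSE divided by the variance of the targets):

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean squared error: MSE divided by the target variance,
    so predicting the mean of the targets yields nMSE = 1."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```

Under this normalization, values below 1 indicate the model explains some of the target variance, and 0 indicates a perfect fit.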

3.1.2. On-Line Learning. We implemented full LWPR on our robotic setup. Out of the four parallel processors of the system, one 366 MHz PowerPC processor was completely devoted to lookup and learning with LWPR. Each DOF had its own LWPR learning system, resulting in 7 parallel learning modules. In order to accelerate lookup and training times, we added a special data structure to LWPR. Each local model maintained a list of all other local models that overlapped sufficiently with it. Sufficient overlap between two local models i and j can be determined from the centers and distance metrics. The point x in input space that is closest to both centers in the sense of a Mahalanobis distance is x = (D_i + D_j)^{−1}(D_i c_i + D_j c_j). Inserting this point into Eq. (1) of one of the local models gives the activation w due to this point. Two local models are listed as sufficiently overlapping if w ≥ w_gen (cf. LWPR outline). For diagonal distance metrics, the overlap computation is linear in the number of inputs. Whenever a new data point is added to LWPR, one neighborhood relation is checked for the maximally activated RF. An appropriate counter for each local model ensures that overlap with all other local models is checked exhaustively. Given this "nearest neighbor" data structure and the fact that a movement system generates temporally highly correlated data, lookup and learning can be confined to only a few RFs. For every lookup (update), the identification number of the maximally activated RF is returned. The next lookup (update) will only consider the neighbors of this RF. It can be proved that this method performs as well as an exhaustive lookup (update) strategy that excludes RFs that are activated below a certain threshold w_cutoff.

Figure 4. (a) Robot end effector motion traces under different control schemes. (b) Progress of online learning with LWPR control.
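The overlap test of Section 3.1.2 is a few lines of linear algebra; the helper below (an illustrative name) evaluates the activation of model i at the point closest to both centers in the Mahalanobis sense:

```python
import numpy as np

def sufficiently_overlapping(c_i, D_i, c_j, D_j, w_gen):
    """Overlap criterion between two local models i and j.

    Solves (D_i + D_j) x = D_i c_i + D_j c_j for the point x closest to both
    centers, then checks model i's activation (Eq. (1)) at x against w_gen.
    """
    x = np.linalg.solve(D_i + D_j, D_i @ c_i + D_j @ c_j)
    d = x - c_i
    w = np.exp(-0.5 * d @ D_i @ d)
    return w >= w_gen
```

For equal spherical metrics, x is simply the midpoint of the two centers, so nearby models overlap and distant ones do not.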

The LWPR models were trained on-line while the robot performed a pseudo-randomly drifting figure-8 pattern in front of its body. Lookup proceeded at 480 Hz, while updating the learning model was achieved at about 70 Hz. At certain intervals, learning was stopped and the robot attempted to draw a planar figure-8 at 2 Hz frequency for the entire pattern. The quality of these drawing patterns is illustrated in Fig. 4(a) and (b). In Fig. 4(a), X_des denotes the desired figure-8 pattern, X_sim illustrates the figure-8 performed by our robot simulator that uses a perfect inverse dynamics model (but not necessarily a perfect tracking and numerical integration algorithm), X_param is the performance of the estimated rigid body dynamics model, and X_lwpr shows the results of LWPR. While the rigid body model has the worst performance, LWPR obtained results comparable to the simulator.


Figure 4(b) illustrates the speed of LWPR learning. The X_nouff trace demonstrates the figure-8 pattern performed without any inverse dynamics model, just using a low-gain PD controller. The other traces show how rapidly LWPR learned the figure-8 pattern during training: they denote performance after 10, 20, 30, and 60 seconds of training. After 60 seconds, the figure-8 is hardly distinguishable from the desired trace.

3.2. Inverse Kinematics Learning

Since most movement tasks are defined in coordinate systems that are different from the actuator space of the robot, coordinate transformations from task to actuator space must be performed before motor commands can be computed. On a system with redundant degrees-of-freedom (DOFs), this inverse kinematics transformation from external plans to internal coordinates is often ill-posed, as it is underconstrained. If we define the intrinsic coordinates of a manipulator as the n-dimensional vector of joint angles θ ∈ R^n, and the position and orientation of the manipulator's end effector as the m-dimensional vector x ∈ R^m, the forward kinematic function can generally be written as:

x = f(θ)   (4)

while what we need is the inverse relationship:

θ = f^{−1}(x)   (5)

For redundant systems, like our Sarcos robots (see Fig. 2), solutions to the above equation are non-unique. Traditional inverse kinematics algorithms address how to determine a particular solution in the face of multiple solutions by optimizing an additional cost criterion g = g(θ). Most approaches favor local optimizations that compute an optimal change in θ, Δθ, for a small change in x, Δx, and then integrate Δθ to generate the entire joint space path. Resolved Motion Rate Control (RMRC) is one such local method, which uses the Jacobian J of the forward kinematics to describe a change of the end effector's position as:

ẋ = J(θ)θ̇   (6)

This equation can be solved for θ̇ by taking the inverse of J if it is square, i.e., m = n, and non-singular, or by using pseudo-inverse computations that minimize g in the null space of J (Liegeois, 1977):

θ̇ = J^# ẋ − α(I − J^# J) ∂g/∂θ   (7)
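Equation (7) can be sketched directly with a pseudo-inverse. The example arm below is a hypothetical 3-link planar manipulator with unit link lengths, chosen only because its 2-D task space leaves a one-dimensional null space; it is not the Sarcos kinematics:

```python
import numpy as np

def rmrc_step(J, xdot, grad_g, alpha=1.0):
    """One Resolved Motion Rate Control step, Eq. (7):
    thetadot = J# xdot - alpha (I - J# J) dg/dtheta."""
    J_pinv = np.linalg.pinv(J)
    n = J.shape[1]
    return J_pinv @ xdot - alpha * (np.eye(n) - J_pinv @ J) @ grad_g

def jacobian_3link(theta):
    """Jacobian of a hypothetical 3-link planar arm with unit link lengths
    (illustration only)."""
    t1 = theta[0]
    t12 = t1 + theta[1]
    t123 = t12 + theta[2]
    s1, s12, s123 = np.sin(t1), np.sin(t12), np.sin(t123)
    c1, c12, c123 = np.cos(t1), np.cos(t12), np.cos(t123)
    return np.array([[-(s1 + s12 + s123), -(s12 + s123), -s123],
                     [  c1 + c12 + c123,   c12 + c123,   c123]])
```

By construction, the null-space term does not alter the achieved end-effector velocity: J θ̇ equals the commanded ẋ for any gradient of g, which is exactly why g can be optimized "for free" in the null space.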

3.2.1. Motivation for Learning. Learning of inverse kinematics is useful when the kinematic model of the robot is not known accurately, when Cartesian information is provided in uncalibrated camera coordinates, or when the computational complexity of analytical solutions becomes too high. For instance, in our humanoid robot we observed that offsets in sensor readings and inaccurate knowledge of the exact kinematics of the robot can lead to significant error accumulation in analytical inverse kinematics computations, and that it is hard to maintain an accurate calibration of the active vision system of the robot. Instead of re-calibrating the entire robot frequently, we would rather employ a self-calibrating, i.e., learning, approach. An additional appealing feature of learning inverse kinematics is that it avoids problems due to kinematic singularities: learning works out of experienced data, and such data is always physically correct and does not demand impossible postures as can result from an ill-conditioned matrix inversion.

A major obstacle in learning inverse kinematics is that the inverse kinematics of a redundant kinematic chain has infinitely many solutions. In the context of Eq. (6), this means that multiple θ̇_i are mapped to the same ẋ. Algorithms that learn the mapping θ̇ ← f^{−1}(ẋ) average over all the solutions θ̇_i, assuming that different θ̇_i for the same ẋ are due to noise. This may result in an invalid solution if the multiple θ̇_i lie in a non-convex set, as is frequently the case in robot kinematics (Jordan and Rumelhart, 1992).

This problem can be avoided by a specific input representation for the learning network (Bullock et al., 1993) which allows local averaging over the θ̇_i. This can be shown by averaging Eq. (6) over multiple θ̇_i that map to the same ẋ, for a fixed θ:

⟨ẋ⟩ = ⟨J(θ)θ̇_i⟩_i ⇒ ẋ = J(θ)⟨θ̇_i⟩   (8)

Since the Jacobian relates ẋ and the θ̇_i in linear form, even for redundant systems the average of the solutions will result in the desired ẋ, as long as the averaging is carried out in the vicinity of a particular θ.

Thus we propose to learn the inverse mapping function with our spatially localized LWPR learning system based on the input/output representation (ẋ, θ) → (θ̇). This approach will automatically resolve the


redundancy problem without resorting to any other optimization approach: the solution picked is simply the local average over the solutions that were experienced. The algorithm will also perform well near singular postures since, as mentioned before, it cannot generate joint movements that it has never experienced.

3.2.2. Applying LWPR to Inverse Kinematics Learning. In order to apply LWPR to inverse kinematics learning for our humanoid robot, we learn a separate model to generate each of the joint angles, such that each model performs a 29-to-1 mapping (26 degrees of freedom, neglecting the 4 degrees of freedom of the eyes, plus 3 Cartesian inputs): (ẋ, θ) → θ̇ₗ, and we have 26 such models (l = 1, …, 26).
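The bookkeeping of this decomposition can be sketched as follows. `DummyModel` is a hypothetical stand-in (not the paper's API): in the actual system each of the 26 models is an LWPR regressor.

```python
# One model per joint; each maps the shared 29-dimensional input
# z = (xdot, theta) to a single joint velocity. DummyModel is a placeholder.
N_JOINTS = 26   # degrees of freedom (eye DOFs excluded)
N_CART = 3      # Cartesian velocity inputs

class DummyModel:
    def predict(self, z):
        assert len(z) == N_CART + N_JOINTS   # 29 inputs per model
        return 0.0                           # placeholder prediction

models = [DummyModel() for _ in range(N_JOINTS)]

def inverse_kinematics(xdot, theta):
    """Predict all 26 joint velocities from the shared input z = (xdot, theta)."""
    z = list(xdot) + list(theta)
    return [models[l].predict(z) for l in range(N_JOINTS)]

theta_dot = inverse_kinematics([0.1, 0.0, -0.05], [0.0] * N_JOINTS)
assert len(theta_dot) == N_JOINTS
```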

The resolution of redundancy requires creating an optimization criterion that allows the system to choose a particular solution to the inverse kinematics problem. Given that our robot is a humanoid robot, we would like the system to assume a posture that is as "natural" as possible. Our definition of "natural" corresponds to the posture being as close as possible to some default posture θ_opt, as advocated by behavioral studies (Cruse and Bruwer, 1987). Hence the total cost function for training LWPR can be written as follows:

Q = \frac{1}{2}\left(\dot{\theta} - \hat{\dot{\theta}}\right)^T\left(\dot{\theta} - \hat{\dot{\theta}}\right) + \frac{1}{2}\left(\hat{\dot{\theta}} - \frac{\Delta\theta}{\Delta t}\right)^T W \left(\hat{\dot{\theta}} - \frac{\Delta\theta}{\Delta t}\right) \qquad (9)

where Δθ = θ_opt − θ represents the distance of the current posture from the optimal posture θ_opt, W is a diagonal weight matrix, and θ̂̇ is the current prediction of LWPR for the input z = (ẋ, θ). Minimizing Q can be achieved by presenting LWPR with the target values:

\dot{\theta}_{\mathrm{target}} = \dot{\theta} - \alpha W \left(\hat{\dot{\theta}} - \frac{\Delta\theta}{\Delta t}\right) \qquad (10)

These targets are composed of the self-supervised target θ̇, slightly modified by a component that enforces the optimization of the cost function within the null space of the Jacobian (cf. Eq. (7)).
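A minimal numerical sketch of the target computation in Eq. (10), per joint since W is diagonal; all the numbers (α, Δt, W, and the velocity/posture values) are illustrative, not from the paper:

```python
# Per-joint target computation of Eq. (10); all values are illustrative.
alpha, dt = 0.1, 0.01
theta_dot     = [0.50, -0.20]    # executed joint velocities (self-supervised target)
theta_dot_hat = [0.48, -0.25]    # current LWPR predictions
delta_theta   = [0.003, 0.001]   # theta_opt - theta, the posture error
W             = [1.0, 0.5]       # diagonal entries of the weight matrix

theta_dot_target = [td - alpha * w * (tdh - d / dt)
                    for td, tdh, d, w in zip(theta_dot, theta_dot_hat,
                                             delta_theta, W)]
print(theta_dot_target)
```

The correction term pulls the prediction toward velocities that also reduce the posture error, which is how the null-space optimization enters the training signal.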

As an exploration strategy, we initially bias the output of LWPR with a term that creates a motion towards θ_opt:

\tilde{\dot{\theta}} = \hat{\dot{\theta}} + \frac{1}{n_r}\,\Delta\theta \qquad (11)

The strength of the bias decays with the number of data points n_r seen by the largest contributing local model of LWPR. This additional term allows creating meaningful (and importantly, data-generating) motion even in regions of the joint space that have not yet been explored. This enables us to learn inverse kinematics "on the fly", i.e., while attempting to perform the required task itself.
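The decay of the exploration bias in Eq. (11) can be sketched in a few lines (the posture distance and the prediction value are made-up numbers):

```python
# Eq. (11): the bias term (1/n_r) * delta_theta vanishes as the most
# activated local model accumulates data points n_r (illustrative values).
delta_theta = 0.4        # distance to the default posture, one joint
theta_dot_hat = 0.0      # assume the untrained prediction is zero
biases = [theta_dot_hat + delta_theta / n_r for n_r in (1, 10, 100, 1000)]
print(biases)            # shrinks toward the unbiased prediction
```

Early on the bias dominates and drives data-generating motion toward θ_opt; once a local model has seen many points, the learned prediction takes over.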

An important aspect of our formulation of the inverse kinematics problem is that although the inputs to the learning problem comprise ẋ and θ, the locality of each local model is a function of only θ, while the linear projection directions (given this locality in θ) depend solely on ẋ (cf. Eq. (8)). We encode this prior knowledge into LWPR's learning process by setting to zero the initial values of the diagonal terms of the distance metric D in Eq. (1) that correspond to the ẋ variables. This bias ensures that the locality of the receptive fields in the model is based solely on θ.

LWPR has the ability to determine and ignore inputs that are locally irrelevant to the regression, but we also provide this information by normalizing the input dimensions such that the variance in the relevant dimensions is large. This scaling results in larger correlations of the relevant inputs with the output variables and hence biases the projection directions towards the relevant subspace. We use this feature to scale the dimensions corresponding to the ẋ variables so that the regression within a local model is based primarily on this subspace.

3.2.3. Experimental Evaluations. The goal task in each of the experiments was to track a figure-eight trajectory in Cartesian space, created by simulated visual input to the robot. In each of the figures in this section, the performance of the system is plotted along with that of an analytical pseudo-inverse solution (cf. Eq. (7)) that was available for our robot from previous work (Tevatia and Schaal, 2000).

The system was first trained on data generated from small sinusoidal motions of each degree of freedom about a randomly chosen mean in θ space. Every few seconds this mean was repositioned. The performance of the system after training on this "motor babbling" for 10 minutes is shown in Fig. 5(a).

In the second experiment, the robot executed the figure eight again, using the trained LWPR from the first experiment. In this case, however, the system was allowed to improve itself with the data collected while performing the task. As shown in Fig. 5(b), after merely 1 minute of additional learning, the system performed as well as the analytical pseudo-inverse solution.


Figure 5. Tracking a figure eight with learned inverse kinematics. (a) Performance after training with motor babbling. (b) Results after improving performance using the data seen on the task. (c) Performance during the first 3 minutes of learning from scratch on the task. (d) Phase plot of joint position and joint velocity.

The final experiment started with an untrained system and endeavored to learn the inverse kinematics from scratch, while performing the figure-eight task itself. Figure 5(c) shows the progression of the system's performance from the beginning of the task to about 3 minutes into the learning. The system initially starts out making slow, inaccurate movements. As it collects data, however, it rapidly converges towards the desired trajectory. Within a few more minutes of training, the performance approached that seen in Fig. 5(b).

It is important to note that for redundant manipulators, following a periodic trajectory in operational space does not imply consistency in joint space, i.e., the trajectory followed in joint space may not be cyclic, since there could be aperiodic null space motion that

does not affect tracking accuracy. Figure 5(d) shows a phase plot of one of the joints (elbow flexion and extension) over about 30 cycles of the figure-eight trajectory after learning had converged. The presence of a single loop over all cycles shows that the inverse kinematics solution found by our algorithm is indeed consistent.

3.3. Learning for Biomimetic Gaze Stabilization

Oculomotor control in a humanoid robot faces similar problems as biological oculomotor systems, i.e., the stabilization of gaze in the face of unknown perturbations of the body, selective attention, stereo vision, and dealing with large information-processing delays. Given the nonlinearities of the geometry of binocular vision


Figure 6. A control diagram of the VOR-OKR learning system. The lowest box corresponds to the OKR-like negative feedback circuit, the middle box corresponds to the linear feedforward model, and the top box corresponds to the continuously learned nonlinear feedforward circuitry.

as well as the possible nonlinearities of the oculomotor plant, it is desirable to accomplish accurate control of these behaviors through learning approaches. Here, we describe the application of LWPR to a learning control system for the phylogenetically oldest behaviors of oculomotor control, the stabilization reflexes of gaze.

In our recent work (Shibata and Schaal, 2001), we described how control-theoretically reasonable choices of control components result in an oculomotor control system that resembles the known functional anatomy of the primate oculomotor system. The resulting control circuitry for such a system is shown in Fig. 6. The core of the learning system is derived from the biologically inspired principle of feedback-error learning combined with the LWPR algorithm. There are essentially three blocks in the system (cf. Fig. 6): (1) the middle block, a linear feedforward controller based on vestibular (head velocity) input, with conservatively low gains; (2) the top block, the nonlinear feedforward controller (continuously adapted using LWPR), also with vestibular inputs; and (3) the lower block, the retinal-slip based negative feedback controller that generates a delayed error signal for both the linear (fixed) feedforward controller and the nonlinear (continuously learned) feedforward circuit.

Feedback Error Learning (FEL) is a principle of learning motor control. It employs an approximate way of mapping sensory errors into motor errors that, subsequently, can be used to train a neural network by supervised learning. From the viewpoint of adaptive control, FEL is a model-reference adaptive controller.

The controller is assumed to be equipped a priori with a stabilizing linear feedback controller whose performance, however, is not satisfactory due to nonlinearities in the plant and delays in the feedback signals. Therefore, the feedback motor command of this controller is employed as an error signal to train a neural network controller. Given that the neural network receives the correct inputs, i.e., usually the current and desired state of the plant, it can acquire a nonlinear control policy that includes both an inverse dynamics model of the plant and a nonlinear feedback controller. Kawato (1990) proved the convergence of this adaptive control scheme and advocated its architecture as an abstract model of learning in the cerebellum.

In order to employ LWPR for learning under the FEL scheme, we require the presence of a target output y (see Table 2). In motor learning, target values for motor commands rarely exist, since errors are usually generated in sensory space, not in motor command space. The FEL strategy can be interpreted as generating a pseudo-target for the motor command, ỹ(t − 1) = ŷ(t − 1) + τ_fb(t), where τ_fb denotes the feedback error signal and ŷ is the predicted output. Using these principles, and by employing the LWPR algorithm for on-line learning, we demonstrate that our humanoid robot is able to acquire high-performance visual stabilization reflexes after about 40 seconds of learning, despite significant nonlinearities and processing delays in the system.
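The pseudo-target construction is a one-liner; the numbers below are made up purely for illustration:

```python
# FEL pseudo-target: the delayed feedback motor command tau_fb is treated
# as the motor error and added to the previous prediction (made-up values).
y_hat_prev = 0.12   # network prediction of the motor command at t-1
tau_fb     = 0.03   # feedback controller's contribution at t
y_target   = y_hat_prev + tau_fb
# Training toward y_target shifts the network's output so that, over time,
# the feedback pathway contributes (almost) nothing.
print(y_target)
```

If the feedforward network were perfect, τ_fb would be zero and the pseudo-target would equal the prediction, so learning stops exactly when the feedback error vanishes.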

3.3.1. Experimental Setup. Figure 7 depicts our experimental system. Each DOF of the robot is actuated

Figure 7. The vision head subsystem of our humanoid experimental setup.


hydraulically out of a torque control loop. Each eye of the robot's oculomotor system consists of two cameras: a wide-angle (100 degrees view angle horizontally) color camera for peripheral vision, and a second camera for foveal vision, providing a narrow-view (24 degrees view angle horizontally) color image. This setup mimics the foveated retinal structure of primates, and it is also essential for an artificial vision system in order to obtain high-resolution vision of objects of interest while still being able to perceive events in the peripheral environment. Each eye has two independent degrees of freedom, a pan and a tilt motion.

The controllers are implemented in two subsystems, a learning control subsystem and a vision subsystem, each operated out of a VME rack running the real-time operating system VxWorks. Three CPU boards (Motorola MVME2700) are used for the learning control subsystem, and two CPU boards (Motorola MVME2604) are provided for the vision subsystem. In the learning control subsystem, the CPU boards are used, respectively, for: i) low-level motor control of the eyes and other joints of our robot (compute-torque mode), ii) visuomotor learning, and iii) receiving data from the vision subsystem. All communication between the CPU boards is carried out through VME shared memory which, since it is implemented in hardware, is very fast.

In the vision subsystem, each CPU board controls one Fujitsu tracking vision board in order to calculate retinal slip and retinal slip velocity information for each eye. NTSC video signals from the binocular cameras are synchronized to ensure simultaneous processing of both eyes' vision data. Vision data are sent via a serial port (115200 bps) to the learning control subsystem. For the experimental demonstrations of this paper, only one peripheral camera is used for VOR-OKR in its horizontal (pan) degree of freedom. Multiple degrees of freedom per camera, and multiple eyes, merely require a duplication of our control/learning circuits. If the image on a peripheral camera is stabilized, the image on the mechanically coupled foveal camera is also stabilized.

In order to mimic the semicircular canals of biological systems, we attached a three-axis gyro-sensor circuit to the head. From the sensors of this circuit, the head angular velocity signal is acquired through a 12-bit A/D board. The oculomotor and head control loop runs at 480 Hz, while the vision control loop runs at 30 Hz.

We use both visual tracking and optical-flow calculation in order to acquire the retinal slip and the retinal slip velocity, respectively. Spatial averaging over multiple optical-flow detectors was used to reduce the noise.

To maintain a 30 Hz vision-processing loop rate, pixels were sampled only every three dots. Due to this sampling, the effective angular resolution around the center of the image was about 0.03 rad.

3.3.2. Experimental Results of Online Gaze Stabilization. There are three sources of nonlinearities in both biological and artificial oculomotor systems: i) muscle nonlinearities, or nonlinearities added by the actuators and the usually heavy cable attached to the cameras; ii) perceptual distortion due to foveal vision; and iii) off-axis effects. Off-axis effects result from the non-coinciding axes of rotation of the eye-balls and the head, and require a nonlinear adjustment of the feedforward controller as a function of focal length, eye position, and head position. Note that this off-axis effect is the most significant nonlinearity in our oculomotor system.

In the learning experiment, we compare the learning performance of our LWPR nonlinear online learning algorithm against recursive least squares (RLS) regression, a linear learning system (Ljung and Soderstrom, 1986). For this purpose, a large board with texture appropriate for vision processing was placed in front of the robot. The distance between a camera and the board was around 50 cm, i.e., a distance that emphasized the off-axis nonlinearities. In this experiment, the head was moved horizontally according to a sinusoidal signal with frequency 0.8 Hz and amplitude 0.25 rad.

Figure 8 shows the time course of the rectified retinal slip, smoothed with a moving average over a one-second time window. The dashed line corresponds to RLS learning, while the solid line presents the learning performance of LWPR. The benefits of a nonlinear learning system are clearly demonstrated in this plot: both learning curves show rapid improvement over time, but the final retinal slip out of LWPR is almost half of the remaining slip from linear learning. Figure 8 (inset) shows the time course of the raw retinal slip signals at the end of learning. Since, as mentioned in Section 3.3.1, the effective angular resolution around the center of the image was 0.03 rad, the learning results shown in Fig. 8 are satisfactory, as their amplitude is also about 0.03 rad, i.e., the best result achievable with this visual sensing resolution.

The nonlinear component generated by the off-axis effect is around 0.05 rad when the head is rotated by 0.25 rad and the visual stimulus is at 0.5 m distance (based on analytical computations from the geometry of the off-axis vision head system). This difference is consistent with the average difference between the results


Figure 8. Time course of the mean retinal slip: The dashed line corresponds to the linear learning result and the solid line corresponds to nonlinear learning with LWPR; (inset) retinal slip during the last part of learning.

obtained by RLS and LWPR, suggesting that LWPR was able to learn the nonlinear component generated by the off-axis effect.
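A rough geometric sanity check of the ~0.05 rad off-axis figure quoted above: place the eye a distance r in front of the head's rotation axis and compute how much the required eye-in-head rotation deviates from a pure counter-rotation. The offset r = 0.1 m is an assumed value (it is not stated in the text); the target distance and head amplitude follow the experiment.

```python
import math

r = 0.10      # eye offset from the head rotation axis [m] (assumption)
L = 0.50      # camera-to-target distance [m]
head = 0.25   # head rotation [rad]

# Eye position after head rotation (head axis at the origin,
# target initially straight ahead of the eye).
ex, ey = r * math.sin(head), r * math.cos(head)
tx, ty = 0.0, r + L                              # target fixed in the world

world_angle = math.atan2(tx - ex, ty - ey)  # gaze direction in the world frame
eye_in_head = world_angle - head            # required eye angle in the head frame
extra = eye_in_head + head                  # deviation from the ideal on-axis VOR
print(f"off-axis correction: {extra:.3f} rad")
```

Under these assumptions the correction comes out close to 0.05 rad in magnitude, consistent with the analytical figure cited in the text.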

4. Conclusions

This paper introduced locally weighted projection regression (LWPR), a statistical learning algorithm, for applications of real-time learning in highly complex humanoid robots. The O(n) update complexity of LWPR in the number of inputs n, together with its statistically sound dimensionality reduction and learning rules, allowed a reliable and successful real-time implementation of various learning problems in humanoid robotics, including inverse dynamics learning, inverse kinematics learning, and oculomotor learning. These results mark one of the first times that complex internal models for model-based control could be learned autonomously in real-time on sophisticated robotic devices. We hope that algorithms like LWPR will allow us in the near future to equip robots with massive on-line learning abilities, such that we come one step closer to realizing the dream of completely autonomous humanoid robots.

References

An, C.H., Atkeson, C., and Hollerbach, J. 1988. Model Based Control of a Robot Manipulator, MIT Press: Cambridge, MA.

Atkeson, C., Moore, A., and Schaal, S. 1997. Locally weighted learning. Artificial Intelligence Review, 11:76–113.

Bishop, C. 1995. Neural Networks for Pattern Recognition, Oxford University Press: London.

Bullock, D., Grossberg, S., and Guenther, F.H. 1993. A self-organizing neural model of motor equivalent reaching and tool use by a multijoint arm. Journal of Cognitive Neuroscience, 5(4):408–435.

Cruse, H. and Bruwer, M. 1987. The human arm as a redundant manipulator: The control of path and joint angles. Biological Cybernetics, 57:137–144.

Frank, I.E. and Friedman, J.H. 1993. A statistical view of some chemometric regression tools. Technometrics, 35:109–135.

Jordan, M.I. and Rumelhart, D.E. 1992. Supervised learning with a distal teacher. Cognitive Science, 16(3):307–354.


Kawato, M. 1990. Feedback-error-learning neural network for supervised motor learning. In Advanced Neural Computers, R. Eckmiller (Ed.), North-Holland/Elsevier: Amsterdam, pp. 365–372.

Liegeois, A. 1977. Automatic supervisory control of the configuration and behavior of multibody mechanisms. IEEE Transactions on Systems, Man, and Cybernetics, 7(12):868–871.

Ljung, L. and Soderstrom, T. 1986. Theory and Practice of Recursive Identification, MIT Press: Cambridge, MA.

Sanger, T.D. 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2:459–473.

Saunders, C., Stitson, M.O., Weston, J., Bottou, L., Schoelkopf, B., and Smola, A. 1998. Support vector machine—Reference manual. TR CSD-TR-98-03, Department of Computer Science, Royal Holloway, University of London.

Schaal, S. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3:233–242.

Schaal, S. and Atkeson, C.G. 1998. Constructive incremental learning from only local information. Neural Computation, 10:2047–2084.

Schaal, S., Atkeson, C.G., and Vijayakumar, S. 2000. Real-time robot learning with locally weighted statistical learning. In Proc. International Conference on Robotics and Automation (ICRA2000), pp. 288–293.

Schaal, S., Vijayakumar, S., and Atkeson, C.G. 1998. Local dimensionality reduction. Proc. Neural Information Processing Systems, 10:633–639.

Shibata, T. and Schaal, S. 2001. Biomimetic gaze stabilization based on feedback-error-learning with nonparametric regression networks. Neural Networks, 14(2):201–216.

Slotine, J.E. and Li, W. 1991. Applied Nonlinear Control, Prentice Hall: Englewood Cliffs, NJ.

Tevatia, G. and Schaal, S. 2000. Inverse kinematics for humanoid robots. In Proceedings of the International Conference on Robotics and Automation (ICRA2000), San Francisco, CA.

Vapnik, V. 1995. The Nature of Statistical Learning Theory, Springer: New York.

Vijayakumar, S. and Schaal, S. 2000. Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space. In Proc. International Conference on Machine Learning (ICML2000), pp. 1079–1086.

Wold, H. 1975. Soft modeling with latent variables: The nonlinear iterative partial least squares approach. In Perspectives in Probability and Statistics: Papers in Honor of M.S. Bartlett, Academic Press: London, pp. 114–142.

Sethu Vijayakumar is a Research Assistant Professor in the Department of Computer Science and Neuroscience at the University of Southern California and holds a part-time affiliation with the RIKEN Brain Science Institute in Japan. His research interests include statistical machine learning, neural networks, motor control and computational neuroscience. He received the ICNN'95 Best Student Paper Award in 1995, the IEEE Vincent Bendix Award in 1991 and the IEEE R.K. Wilson RAB Award in 1996. Dr. Vijayakumar is also a member of the International Neural Network Society, and an associate of the IEEE.

Aaron D'Souza received his B.E. degree in Computer Engineering from Mumbai University, India, in 1998. He is presently a Ph.D. student at the USC Computational Learning & Motor Control Lab. His research interests include statistical machine learning, neural networks, and Bayesian learning, with applications in humanoid robot control.

Tomohiro Shibata received his Ph.D. in information engineering from the University of Tokyo in 1996. From 1996 to 1997, he was a postdoctoral fellow at the University of Tokyo. Since April 1997, he has been a researcher with the Japan Science and Technology Corporation's ERATO project. Dr. Shibata works on modeling and learning of biologically plausible oculomotor controllers and uses robots as testbeds for contributing to neuroscience research.

Jörg Conradt is a Ph.D. student at the Institute of Neuroinformatics in Zurich, Switzerland, working on spatial representations in hippocampal place fields. He holds a Masters in Computer Engineering from the Technische Universität Berlin and an M.S. in Computer Science from the University of Southern California, where he was a


Fulbright Scholar. Jörg's research interests include statistical learning, robotics and motor control.

Stefan Schaal is an Assistant Professor at the Department of Computer Science and the Neuroscience Program at the University of Southern California. He also holds additional appointments as Head of the Computational Learning Group of the Kawato Dynamic Brain Project (ERATO/JST) and as an Adjunct Assistant Professor at the Department of Kinesiology of the Pennsylvania State University. Dr. Schaal's research interests include topics of statistical and machine learning, neural networks, computational neuroscience, nonlinear dynamics, nonlinear control theory, and biomimetic robotics.

