Slow Feature Analysis

- Applications -

— Lecture Notes —

Laurenz Wiskott
Institut für Neuroinformatik

Ruhr-Universität Bochum, Germany, EU

14 December 2016

Contents

1 Slow feature analysis (SFA) (→ slides)

2 Applications of SFA in machine learning

2.1 Extracting driving forces (→ slides)

2.2 Nonlinear blind source separation (→ slides)

1 Slow feature analysis (SFA) (→ slides)

Slow feature analysis was first published in 1998 (Wiskott, 1998). This section is based on (Wiskott and Sejnowski, 2002).

© 2008, 2009, 2016 Laurenz Wiskott (ORCID http://orcid.org/0000-0001-6237-740X, homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License, see http://creativecommons.org/licenses/by-sa/4.0/. If figures are not included for copyright reasons, they are uni-colored, but the word 'Figure', 'Image', or the like in the reference is often linked to a freely available copy. Core text and formulas are set in dark red; one can repeat the lecture notes quickly by just reading these. Important formulas or items worth remembering and learning for an exam are specially marked; ♦ marks less important formulas or items that I would usually also present in a lecture; + marks sections that I would usually skip in a lecture. More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.


Slow feature analysis is an algorithm that has been developed in the context of modeling the primate visual system, but it has also been applied successfully in technical contexts. It is based on the slowness principle, which is introduced here from the view of learning invariances in visual perception.

Slow Feature Analysis

Image: (maxmann, 2016, pixabay, © CC0, URL [1])


Slowness as a Learning Principle

Figure: primary sensory signals x1(t), x2(t), x3(t) vary quickly over time t, while the high-level representation (object identity: monkey; object location: left) varies slowly over time t.

Földiák (1991), Mitchison (1991), Becker & Hinton (1992), O'Reilly & Johnson (1994), Stone & Bray (1995), Wallis & Rolls (1997), Peng et al. (1998), Körding & König (2001), Wiskott & Sejnowski (2002)

(Wiskott & Sejnowski, 2002, Neural Comp. 14(4):715-770)

Slowness as a learning principle is based on the observation that different representations of the visual sensory input vary on different time scales. Our visual environment itself is rather stable; it varies on a time scale of seconds.

The primary sensory signal, on the other hand, e.g. responses of single receptors in our retina or the gray value of a single pixel of a CCD camera, varies on a faster time scale of milliseconds, simply as a consequence of the very small receptive field sizes combined with gaze changes or moving objects. As an example, imagine you are looking at a quietly grazing zebra. As your eyes scan the zebra, single receptors rapidly change from black to white and back again because of the stripes of the zebra. But the scenery itself does not change much. Finally, your internal high-level representation of the environment changes on a similar time scale as the environment itself, namely on a slow time scale. The brain is somehow able to extract the slowly varying high-level representation from the quickly varying primary sensory input. The hypothesis of the slowness learning principle is that the time scale itself provides the cue for this extraction. The idea is that if the system manages to extract slowly varying features from the quickly varying sensory input, then there is a good chance that the features are a good representation of the visual environment.

A number of people have worked along these lines. Slow feature analysis is within this tradition but differs in some significant technical aspects from all previous approaches.

Figure: (Wiskott et al., 2011, Fig. 2, © CC BY 4.0, URL [2])


Optimization Problem

Figure: an input signal x(t) with components x1, ..., xN is mapped instantaneously by g(x) to an output signal y(t) = g(x(t)) with components y1, y2, y3, plotted over time t.

Given an input signal x(t).

Find an input-output function g(x) (e.g. a polynomial of degree 2).

The function generates the output signal y(t) = g(x(t)).

This is done instantaneously.

The output signal should vary slowly, i.e. minimize ⟨ẏi²⟩.

The output signal should carry much information, i.e. ⟨yi⟩ = 0, ⟨yi²⟩ = 1, and ⟨yj yi⟩ = 0 ∀ j < i.

(Wiskott & Sejnowski, 2002, Neural Comp. 14(4):715-770)

Slow feature analysis is based on a clear-cut optimization problem. The goal is to find input-output functions that extract the most slowly varying features from a quickly varying input signal.

It is important that the functions are instantaneous, i.e. one time slice of the output signal is based on just one time slice of the input signal (marked in yellow). Otherwise low-pass filtering would be a valid but not particularly useful method of extracting slow output signals. Instantaneous functions also make the system fast after training, which is important in visual processing, for instance. It is also possible to take a few input time slices into account, e.g. to make the system sensitive to motion or to process scalar input signals with fast dynamics on a short time scale. However, low-pass filtering should never be the main method by which slowness is achieved.

Without any constraints, the optimal but not very useful output signal would be constant. We thus impose the constraints of unit variance ⟨yi²⟩ = 1 and, for mathematical convenience, zero mean ⟨yi⟩ = 0. To make different output signal components represent different information, we impose the decorrelation constraint ⟨yj yi⟩ = 0. Without this constraint, all output components would typically be the same. Notice that the constraint is asymmetric: later components have to be uncorrelated to earlier ones but not the other way around. This induces an order. The first component is the slowest possible one, the second component is the next slowest one under the constraint of being uncorrelated to the first, the third component is the next slowest one under the constraint of being uncorrelated to the first two, etc.

Figure: (Wiskott et al., 2011, Fig. 1, © CC BY 4.0, URL [3])
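The slowness objective can be made concrete numerically. The following Python sketch (my own illustration; the function name and the finite-difference approximation of the derivative are not from the notes) estimates ⟨ẏ²⟩ for a signal that has been normalized to the zero-mean, unit-variance constraints:

```python
import numpy as np

def delta_value(y, dt=1.0):
    """Estimate the slowness objective <ydot^2> of a 1-D signal y(t).

    The derivative is approximated with finite differences; the signal is first
    normalized to zero mean and unit variance, mirroring the SFA constraints,
    so that Delta-values of different signals are comparable.
    """
    y = np.asarray(y, dtype=float)
    y = (y - y.mean()) / y.std()     # enforce <y> = 0 and <y^2> = 1
    ydot = np.diff(y) / dt           # finite-difference time derivative
    return np.mean(ydot ** 2)        # Delta(y) = <ydot^2>

t = np.linspace(0, 2 * np.pi, 1000)
dt = t[1] - t[0]
print(delta_value(np.sin(t), dt))       # slow signal -> small Delta (about 1)
print(delta_value(np.sin(11 * t), dt))  # faster signal -> much larger Delta
```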


Simple Example

x1(t) = sin(t) + cos(11t)²

x2(t) = cos(11t)

Figure panels: component x1(t), component x2(t), input signal x(t).

A slow feature of this signal is obviously

y(t) = x1(t) − x2(t)² = sin(t) .

(Wiskott & Sejnowski, 2002, Neural Comp. 14(4):715-770)

Consider a simple two-dimensional trajectory as an example. x1(t) and x2(t) are both quickly varying. Nonlinearly hidden in this signal is sin(t), which is relatively slow. It can be extracted with a polynomial of degree two, since x1(t) − x2(t)² = sin(t). In the trajectory plot (right) one can see the fast back and forth rocking on a parabola, which itself is slowly moving up and down.

Figure: (Wiskott and Sejnowski, 2002, Fig. 2, URL [4])
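The example is easy to reproduce; the short sketch below (my own, not from the notes) generates the two components and verifies that the quadratic function x1 − x2² recovers the hidden slow feature exactly:

```python
import numpy as np

# The simple example: sin(t) is hidden nonlinearly in a quickly varying signal.
t = np.linspace(0, 2 * np.pi, 4000)
x1 = np.sin(t) + np.cos(11 * t) ** 2
x2 = np.cos(11 * t)

y = x1 - x2 ** 2                  # the quadratic function g(x) = x1 - x2^2
print(np.allclose(y, np.sin(t)))  # True: the slow feature is recovered exactly
```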

Simple Example

Minimize ⟨ẏ²⟩ with y(t) = g(x(t))

under the constraints ⟨y⟩ = 0

and ⟨y²⟩ = 1 .

Figure panels: input signal x(t), input-output function g(x), output signal y(t).

(Wiskott & Sejnowski, 2002, Neural Comp. 14(4):715-770)

From this input signal we want to extract only a single output component, so the optimization problem simplifies somewhat. As seen above, the slowest output signal is y(t) = sin(t) and the function required to extract it is g(x) := x1 − x2². Its level lines are parabolas and follow the principal curve of the input trajectory.

Figure: (Wiskott and Sejnowski, 2002, Fig. 2, URL [5])


Slow Feature Analysis (SFA)

Figure panels: input signal x(t), expanded signal z(t), sphered signal z(t), time-derivative signal ż(t), output signal y(t), input-output function g(x).

(Wiskott & Sejnowski, 2002, Neural Comp. 14(4):715-770)

The SFA algorithm is relatively straightforward. The upper left panel shows the trajectory x1(t) = sin(t) + cos(11t)² and x2(t) = cos(11t). It is quickly rocking back and forth on a parabola, which itself is slowly moving up and down. SFA finds this slow feature as follows:

Step 1: The input signal is nonlinearly expanded into a high-dimensional feature space z (upper middle panel). We often use polynomials of degree two, which would result in z1 := x1, z2 := x2, z3 := x1², z4 := x1x2, and z5 := x2². Within this space the problem can be solved linearly, because any polynomial of degree two can be written as a linear combination of the zi.

Step 2: The expanded signal is whitened or sphered (upper right panel). This operation first removes the mean and then stretches the signal along its principal axes such that it has unit variance in all directions. This has the great advantage that the constraints are easy to fulfill. If we project the whitened signal onto any unit vector, the projected signal has zero mean and unit variance; if we project the whitened signal onto any set of orthogonal unit vectors, the projected signal components are uncorrelated. Thus, we only have to find the orthogonal unit vectors that produce the slowest signal components. The constraints are then taken care of automatically. In our simple example, we only need one unit vector to project onto.

Step 3: To find the direction in which the signal varies most slowly we calculate the derivative of the whitened signal (lower left panel). The variance of the derivative is small in directions of slow variation and large in directions of fast variation. The principal component with the smallest eigenvalue therefore gives us the unit vector that yields the slowest possible output signal component. If we want to extract more slow features, we simply take the principal components with the next larger eigenvalues. Thus, once we are here, it is easy to extract many slow output signal components.

If one concatenates the nonlinear expansion, the whitening, and the projection onto the unit vectors, one gets the nonlinear functions gi(x). The lower right panel shows the function found here for the simple example. Evaluating this function along the input trajectory yields the output signal y(t) (lower middle panel).

Figure: (Wiskott and Sejnowski, 2002, Fig. 2, URL [6])
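The three steps translate almost line by line into numpy. The sketch below is my own minimal implementation under the assumptions just described (monomial expansion up to degree two, whitening via the covariance eigendecomposition, finite-difference derivatives); it is an illustration, not the reference implementation from the paper, and all names are mine:

```python
import numpy as np

def quadratic_sfa(x, n_out=1):
    """Minimal quadratic SFA sketch.

    x      : array of shape (T, N), one input time slice per row
    n_out  : number of slow features to return
    returns: output signal of shape (T, n_out), slowest component first
    """
    T, N = x.shape

    # Step 1: nonlinear expansion with monomials of degree 1 and 2
    monomials = [x[:, i] * x[:, j] for i in range(N) for j in range(i, N)]
    z = np.column_stack([x] + monomials)

    # Step 2: whitening (sphering) of the expanded signal
    z = z - z.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(z, rowvar=False))
    keep = eigval > 1e-10                      # drop near-singular directions
    zw = z @ (eigvec[:, keep] / np.sqrt(eigval[keep]))

    # Step 3: PCA on the time derivative; smallest eigenvalues = slowest directions
    zdot = np.diff(zw, axis=0)
    dval, dvec = np.linalg.eigh(np.cov(zdot, rowvar=False))
    return zw @ dvec[:, :n_out]                # project onto the slowest directions

# The simple example: the hidden slow feature sin(t) is recovered (up to sign).
t = np.linspace(0, 2 * np.pi, 4000)
x = np.column_stack([np.sin(t) + np.cos(11 * t) ** 2, np.cos(11 * t)])
y = quadratic_sfa(x, n_out=1)[:, 0]
print(np.abs(np.corrcoef(y, np.sin(t))[0, 1]))   # close to 1
```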


Nonlinear Expansion

Assume we have an algorithm that solves a certain optimization problem for the class of linear functions gl(x), i.e. it finds the optimal coefficients a0 to a2 or b0 to b5 for the linear functions

gl2(x) := a0 + a1x1 + a2x2 or

gl5(z) := b0 + b1z1 + b2z2 + b3z3 + b4z4 + b5z5 ,

respectively, according to some optimization criterion.

Example: Linear regression finds the coefficients that minimize the mean squared distance between the output values yµ := gl(xµ) and some target values sµ.


Nonlinear expansion is a simple way of extending a linear method to nonlinear problems. It can generally be applied if the linear method is insensitive to any linear transformation of the input data.

Nonlinear Expansion

If we set

z1 := x1 , z2 := x2 , z3 := x1² , z4 := x1x2 , z5 := x2² ,

and solve the linear optimization problem for

gl5(z) := b0 + b1z1 + b2z2 + b3z3 + b4z4 + b5z5 ,

then we get the optimal quadratic function for the input data xµ:

gq2(x) := b0 + b1x1 + b2x2 + b3x1² + b4x1x2 + b5x2² .

This is a simple and general technique to generalize linear algorithms to nonlinear functions.

Problem 1: The dimensionality of the expanded space can get very large.

Problem 2: Parameterized nonlinearities, such as g(x) = a sin(kx + φ), cannot be optimized this way.


The idea of nonlinear expansion is to apply a large number of fixed nonlinear functions to the original data. This produces new input data of higher dimensionality to which the linear method can be applied. The linear combination of the expanded signal that solves the problem induces a nonlinear function that solves the problem on the original input data. One simply takes the same linear combination of the fixed nonlinear functions. Taking monomials like in the example shown here is just one possibility. Any set of nonlinear functions can be used.

A problem of this technique is that the dimensionality of the expanded space can quickly grow beyond manageable size. If that happens, kernel methods might be a way to go. Another limitation is that parameterized nonlinearities cannot be optimized that way.
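As an illustration of the slide's linear-regression example, the sketch below (my own; the target function and all variable names are made up) shows that an ordinary least-squares solver applied in the expanded space yields a quadratic fit in the original space:

```python
import numpy as np

def expand_quadratic(x):
    """Expand 2-D inputs x of shape (T, 2) into z = (x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = x[:, 0], x[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(500, 2))
s = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 0] * x[:, 1]   # made-up quadratic target values

# Ordinary linear regression in the expanded space (with a constant term b0)
z = np.column_stack([np.ones(len(x)), expand_quadratic(x)])
b, *_ = np.linalg.lstsq(z, s, rcond=None)
print(np.round(b, 3))   # recovers b0 = 1, b1 = 2, b4 = -3, other coefficients ~ 0
```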


Nonlinear Expansion

Figure: points of two classes (+ and −) on the x-axis cannot be separated by a linear function g(x) = a0 + a1x, but after expansion to (x, x²) they can be separated by g(x) = b0 + b1x + b2x².


This example illustrates the effect of nonlinear expansion. The blue and green dots on the x-axis are not linearly separable. However, if one expands the one-dimensional input space x to the two-dimensional space of x and x², the problem can be solved with a linear classifier. This linear function mapped back to the original input space becomes a quadratic function solving the problem.

© CC BY-SA 4.0
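A tiny numerical sketch of this picture (my own construction, not from the notes): points with |x| < 1 versus |x| > 1 cannot be separated by a threshold on x, but they are separated by a threshold on x², i.e. by a linear classifier in the expanded space (x, x²):

```python
import numpy as np

# Two classes on the x-axis: "inner" points near 0 and "outer" points beyond +-1.
x = np.array([-1.5, -1.2, -0.3, 0.0, 0.4, 1.3, 1.6])
label = np.abs(x) > 1.0                 # True for the outer class

# No single threshold on x separates the classes (outer points lie on both sides),
# but in the expanded space (x, x^2) the linear decision function x^2 - 1 does:
g = x**2 - 1.0                          # linear in (x, x^2): g = 0*x + 1*x^2 - 1
print((g > 0) == label)                 # all True -> perfectly separated
```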

Whitening or Sphering


Whitening data means to first remove the mean and then stretch the data along the principal axes such that it has unit variance in all directions. If the original data is an anisotropic Gaussian to begin with, e.g. having the shape of a flying sausage or a cigar, it has a spherical shape after whitening, which is the reason to also call it sphering.

Whitened data has the advantage that you can project it onto any unit vector and it has unit variance. Projected onto two orthogonal unit vectors, the two projected data sets are uncorrelated.

© CC BY-SA 4.0
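One common way to whiten is via the eigendecomposition of the covariance matrix, the same construction used in the SFA sketch above. This is my own minimal sketch, not the only possible whitening transform:

```python
import numpy as np

def whiten(x, eps=1e-10):
    """Whiten data x of shape (T, N): zero mean, (approximately) identity covariance.

    Uses the eigendecomposition of the covariance matrix; directions with
    near-zero variance are dropped to keep the transform well conditioned.
    """
    x = x - x.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
    keep = eigval > eps
    return x @ (eigvec[:, keep] / np.sqrt(eigval[keep]))

# An anisotropic ("cigar-shaped") Gaussian becomes spherical after whitening.
rng = np.random.default_rng(1)
x = rng.normal(size=(10000, 2)) @ np.array([[3.0, 0.0], [2.0, 0.5]])
print(np.round(np.cov(whiten(x), rowvar=False), 3))   # approximately the identity
```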


Derivative of a Trajectory

Trajectory: x(t) = (x1(t), x2(t), ..., xN(t))ᵀ ,

Derivative: ẋ(t) = (ẋ1(t), ẋ2(t), ..., ẋN(t))ᵀ .


Formally, the derivative of a trajectory is simply the vector of the derivatives of the components of the trajectory. If the trajectory is discretized in time with step size 1, its derivative is simply the sequence of difference vectors of two successive time points.

© CC BY 4.0
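For a discretized trajectory this is a one-liner with numpy (my own sketch; the example values are arbitrary):

```python
import numpy as np

# Trajectory discretized with step size 1: one row per time point, one column per component.
x = np.array([[0.0, 0.0],
              [1.0, 0.5],
              [3.0, 0.5]])
xdot = np.diff(x, axis=0)   # difference vectors of successive time points
print(xdot)                 # [[1.  0.5]
                            #  [2.  0. ]]
```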

Successive Application of SFA

(Wiskott & Sejnowski, 2002, Neural Comp. 14(4):715-770)

An interesting property of SFA is that it can be applied in a cascade. The input signal of the example shown here has a slow feature in it that cannot be extracted with a polynomial of degree two. Thus SFA2, i.e. quadratic SFA, alone cannot solve the problem. But if one applies SFA2 again to the first three output components of the first SFA2, and then a third time, the slow feature gets extracted; compare the contour plot of g3,1(x1, x2) with the trajectory plot of x2(t) versus x1(t).

Figure: (Wiskott and Sejnowski, 2002, Fig. 7, URL [7])


2 Applications of SFA in machine learning

2.1 Extracting driving forces (→ slides)

This section is based on (Wiskott, 2003).

Analyzing Non-stationary Time Series

Image: (Peggy Marco, 2012, pixabay, © CC0, URL [8])


Non-Stationary Tent-Map Time Series

Figure: the tent map f (black), plotted as wi+1 versus wi on the unit square, together with the cyclically shifted map (red), shifted by γ′.

An iterative map is a discrete dynamical system. For the tent map f (shown in black) one starts with an arbitrary value w0 between 0 and 1 and simply applies f to get the next value w1 = f(w0). Repeating this process leads to a time series wt like the one shown below. The tent map is peculiar in that the resulting time series has no obvious structure and looks like white noise. A similar time series results if one shifts the mapping function f cyclically within the interval [0, 1] by some value γ′ (shown in red). If γ′ itself depends on time, it is called a driving force of the system, and it changes the system dynamics over time. If γ′(t) changes on a slower time scale than the time series itself, SFA should be able to extract it.

Figure (top): (Wiskott et al., 2011, Fig. 11, © CC BY 4.0, URL [9])

Figure (bottom): (Wiskott, 2003, Fig. 1, © CC BY 4.0, URL [10])
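A minimal sketch of such a driven tent map follows (my own reconstruction; the exact way the cyclic shift enters in Wiskott (2003) may differ, here γ′ is simply added to the argument modulo 1, and all names are mine):

```python
import numpy as np

def tent(w):
    """Tent map on [0, 1]: rises linearly to 1 at w = 0.5, then falls back to 0."""
    return 1.0 - 2.0 * np.abs(w - 0.5)

def driven_series(gamma, w0=0.3):
    """Iterate a tent map whose argument is cyclically shifted by the driving force gamma(t)."""
    w = np.empty(len(gamma) + 1)
    w[0] = w0
    for t, g in enumerate(gamma):
        w[t + 1] = tent((w[t] + g) % 1.0)   # assumed form of the cyclic shift
    return w[1:]

T = 4000
gamma = 0.2 * np.sin(2 * np.pi * np.arange(T) / T)   # slowly varying driving force
w = driven_series(gamma)
print(w[:5])   # looks like noise, but its statistics drift slowly with gamma
```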


Analysis with SFA


The top graph shows a tent-map time series with a driving force γ(t) as shown in the bottom graph (solid line). In order to apply SFA, one has to consider several (in this case 10) successive values together as the input vector for SFA, a method called time embedding. In other words, SFA sees a sliding window of ten points of the time series and tries to extract some slow feature from it. Applying SFA with polynomials of degree 3 results in the slow feature shown in the bottom graph as points overlaid on the solid curve of the true driving force. Mean and variance, which can in principle not be extracted, are normalized to make the curves comparable. The correlation coefficient is r = 0.96.

Figure: (Wiskott, 2003, Fig. 1, © CC BY 4.0, URL [11])
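Time embedding itself is a small indexing exercise. The helper below is my own sketch (not the setup of the paper, which uses degree-3 polynomials); its output can be fed to any SFA implementation, e.g. the quadratic sketch given earlier:

```python
import numpy as np

def time_embed(w, dim=10):
    """Turn a scalar time series w into overlapping windows of length `dim`.

    Returns an array of shape (len(w) - dim + 1, dim); row t contains
    w[t], w[t+1], ..., w[t+dim-1]. These windows serve as the input
    vectors x(t) for SFA.
    """
    w = np.asarray(w, dtype=float)
    return np.column_stack([w[i:len(w) - dim + 1 + i] for i in range(dim)])

w = np.arange(8.0)
print(time_embed(w, dim=3))
# [[0. 1. 2.]
#  [1. 2. 3.]
#  ...
#  [5. 6. 7.]]
```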

Step-like Driving Force


SFA can not only extract continuous features but also features that switch discretely between different values. This graph is the same as above except that the driving force changes in a step-like fashion.

Figure: (Wiskott, 2003, Fig. 3, © CC BY 4.0, URL [12])


2.2 Nonlinear blind source separation (→ slides)

This section is based on (Sprekeler et al., 2014).

Nonlinear Blind Source Separation

Image: (“Blind mans bluff”, 1803, Wikimedia, © CC0, URL [13])

Henning Sprekeler, Tiziano Zito


Nonlinear Mixing

Figure panels: sources s1, s2 and mixtures x1, x2 (after mixing), plotted over time and as 2D scatter plots.

We consider here the problem of nonlinear blind source separation for two sources. The sources s1 and s2 are shown at the top left and are plotted together in the 2D scatter plot on the right. The mixing can be viewed as stretching the 2D space and winding it up in a spiral, see the scatter plot below. The first source runs along the arms of the spiral; the second source runs perpendicular to it and has a very small amplitude. The resulting mixed signals are shown as x1 and x2 on the left. The task of nonlinear blind source separation is to extract the two sources without knowing anything about the mixture or the sources, except that they are statistically independent and smoothly varying, i.e. not white noise.

The theory of SFA formalizes two properties that can explain why SFA might be suitable to perform nonlinear blind source separation. Roughly speaking: (i) Any nonlinearly transformed version of a signal varies faster than the signal itself. (ii) Any mixture of two signals varies faster than the slower of the two signals. (These statements are true only modulo an invertible transformation of the single sources. But that is all you can hope for in any case, because the problem of nonlinear blind source separation is defined only up to an invertible transformation of the single sources.)

Figure: (Wiskott et al., 2011, Fig. 13, © CC BY 4.0, URL [14])
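For intuition, here is one way to build such a spiral mixture (entirely my own construction; the sources and the mixing function in Sprekeler et al. (2014) are different): the first source sets the angle along the spiral arm, the second adds a small radial perturbation.

```python
import numpy as np

T = 5000
t = np.linspace(0, 1, T)

# Two smooth, independent sources (illustrative; not the sources from the paper).
s1 = np.sin(2 * np.pi * 3 * t)                 # slow source
s2 = 0.05 * np.sin(2 * np.pi * 17 * t + 1.0)   # faster, small-amplitude source

# Assumed spiral mixing: s1 runs along the spiral arm, s2 perpendicular to it.
angle = np.pi * (s1 + 1.5)                     # angle grows with s1
radius = 0.3 * angle + s2                      # Archimedean spiral plus perturbation
x1 = radius * np.cos(angle)
x2 = radius * np.sin(angle)

print(x1[:3], x2[:3])   # the nonlinear mixtures handed to the unmixing algorithm
```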


Harmonics of a Signal

Figure: the source s1 and its transforms h1m = g1m(s1) with Δ-values: s1: .063, h11: .060, h12: .138, h13: .264, h14: .376.

This graph illustrates the principle that a signal typically gets faster if you transform it. s1 is the original signal with a Δ-value of 0.063 (the Δ-value measures the 'fastness' of a signal). g11(s) is an optimal invertible transformation that makes the signal slightly slower, see h11 = g11(s1) with a Δ-value of 0.060. The other, noninvertible transformations shown all make the signal faster.

The functions g1m have been found by applying SFA with high-order polynomials to s1 directly, without time embedding. g11 is therefore optimal in yielding a slower version of s1. The resulting signals h1m are called harmonics and play an important role in the theory of SFA.

Figure: (Wiskott & Escalante, 2010, unpubl., © CC BY 4.0)

Products of Harmonics of Two Signals

Figure: Δ-values of the first harmonics and their products: h11: .060, h21: .140, h11·h21: .200, h13·h23: .794, h12·h24: .867.

This graph illustrates the principle that a mixture of two signals is typically faster than the slower of the two signals. Since any mixture can be expressed as a linear combination of products of harmonics of the two sources (similar to a 2D Taylor expansion), we only consider the first harmonics h11 and h21 of the two sources and various products of first and higher harmonics. All products are faster, i.e. have a higher Δ-value, than h11.

Figure: (Wiskott & Escalante, 2010, unpubl., © CC BY 4.0)


xSFA - Unmixing

Figure panels: sources s1, s2; mixtures x1, x2 (after mixing); estimated sources y1, y2 (after xSFA); plotted over time and as scatter plots.

In xSFA (extended SFA) the mixture is first expanded into a very high-dimensional space. The slowest signal within that space is then extracted with SFA and declared the first source. Next, all harmonics of that first source are projected out of the expanded signal. Then, SFA is applied again and the slowest signal is declared the second source. Next, all harmonics and all products of harmonics of the first and second source are projected out of the expanded signal. Iterating this scheme can in principle extract an arbitrary number of sources (if they are in the mixture), but noise reduces performance for later sources.

The graphs at the bottom show the extracted sources y1 and y2. As the scatter plots on the left between the extracted and the true sources show, the first source has a wrong sign (it can in principle not be recovered), but otherwise the extraction is very good.

Figure: (Wiskott et al., 2011, Fig. 13, © CC BY 4.0, URL [15])
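The "project out" step can be sketched with ordinary least squares (my own illustration, not the xSFA implementation): each component of the expanded signal is replaced by its residual after regressing on the harmonics of the already extracted source(s).

```python
import numpy as np

def project_out(z, h):
    """Remove from each column of z the part linearly explained by the columns of h.

    z : expanded signal, shape (T, D)
    h : harmonics of the already extracted source(s), shape (T, K)
    Returns the residual signal, orthogonal to the columns of h.
    """
    coeff, *_ = np.linalg.lstsq(h, z, rcond=None)   # least-squares fit h @ coeff ~ z
    return z - h @ coeff

# Toy check: after projecting out h, the residual is uncorrelated with h.
rng = np.random.default_rng(3)
h = rng.normal(size=(1000, 2))
z = rng.normal(size=(1000, 4)) + h @ rng.normal(size=(2, 4))
r = project_out(z, h)
print(np.round(h.T @ r / len(h), 3))   # approximately the zero matrix
```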


References

Sprekeler, H., Zito, T., and Wiskott, L. (2014). An extension of slow feature analysis for nonlinear blind source separation. Journal of Machine Learning Research, 15:921–947.

Wiskott, L. (1998). Learning invariance manifolds. In Niklasson, L., Boden, M., and Ziemke, T., editors, Proc. 8th Intl. Conf. on Artificial Neural Networks (ICANN'98), Skövde, Sweden, Perspectives in Neural Computing, pages 555–560, London. Springer.

Wiskott, L. (2003). Estimating driving forces of nonstationary time series with slow feature analysis. e-print arXiv:cond-mat/0312317.

Wiskott, L., Berkes, P., Franzius, M., Sprekeler, H., and Wilbert, N. (2011). Slow feature analysis. Scholarpedia, 6(4):5282.

Wiskott, L. and Sejnowski, T. (2002). Slow feature analysis: unsupervised learning of invariances. Neural Computation, 14(4):715–770.

Notes

[1] maxmann, 2016, pixabay, © CC0, https://pixabay.com/en/snail-shell-crawl-mollusk-1330766/

[2] Wiskott et al., 2011, Scholarpedia 5(2):1362, Fig. 2, © CC BY 4.0, http://scholarpedia.org/article/Slow_feature_analysis

[3] Wiskott et al., 2011, Scholarpedia 5(2):1362, Fig. 1, © CC BY 4.0, http://scholarpedia.org/article/Slow_feature_analysis

[4] Wiskott & Sejnowski, 2002, Neur. Comp. 14:715–770, Fig. 2, http://www.ini.rub.de/PEOPLE/wiskott/Reprints/WiskottSejnowski-2002-NeurComp-LearningInvariances.pdf

[5] Wiskott & Sejnowski, 2002, Neur. Comp. 14:715–770, Fig. 2, http://www.ini.rub.de/PEOPLE/wiskott/Reprints/WiskottSejnowski-2002-NeurComp-LearningInvariances.pdf

[6] Wiskott & Sejnowski, 2002, Neur. Comp. 14:715–770, Fig. 2, http://www.ini.rub.de/PEOPLE/wiskott/Reprints/WiskottSejnowski-2002-NeurComp-LearningInvariances.pdf

[7] Wiskott & Sejnowski, 2002, Neur. Comp. 14:715–770, Fig. 7, http://www.ini.rub.de/PEOPLE/wiskott/Reprints/WiskottSejnowski-2002-NeurComp-LearningInvariances.pdf

[8] Peggy Marco, 2012, pixabay, © CC0, https://pixabay.com/en/mikado-domino-stones-pay-steinchen-1013877/

[9] Wiskott et al., 2011, Scholarpedia 5(2):1362, Fig. 11, © CC BY 4.0, http://www.scholarpedia.org/article/Slow_feature_analysis

[10] Wiskott, 2003, arXiv.org 0312317, Fig. 1, © CC BY 4.0, http://arxiv.org/abs/cond-mat/0312317/

[11] Wiskott, 2003, arXiv.org 0312317, Fig. 1, © CC BY 4.0, http://arxiv.org/abs/cond-mat/0312317/

[12] Wiskott, 2003, arXiv.org 0312317, Fig. 3, © CC BY 4.0, http://arxiv.org/abs/cond-mat/0312317/

[13] “Blind mans bluff”, 1803, Wikimedia, © CC0, https://commons.wikimedia.org/wiki/File:Blind_mans_bluff_1803.PNG

[14] Wiskott et al., 2011, Scholarpedia 5(2):1362, Fig. 13, © CC BY 4.0, http://www.scholarpedia.org/article/Slow_feature_analysis

[15] Wiskott et al., 2011, Scholarpedia 5(2):1362, Fig. 13, © CC BY 4.0, http://www.scholarpedia.org/article/Slow_feature_analysis



