Outline
• Linear Prediction Filters
  – Uses
    • System Description
    • Time Evolution
  – Green's Functions
  – Construction
  – Filter Analysis
  – Examples
    • Dst-H
    • Dst-VBs
• Neural Networks
  – Uses
    • System Description
    • Time Evolution
  – Theory
  – Training
  – Network Analysis
  – Examples
    • Dst-H
    • Dst-VBs
Linear Prediction Filter Uses
• Map one Input (or many) to one Output
  – Convert Dst to Single Ground-station H
• Make a Forecast
  – Convert Solar Wind Measurements to Geomagnetic Indices
• Determine System Dynamics
  – Use Impulse Response to Determine Underlying Ordinary Differential Equation
What an LPF Looks Like
• An LPF can have an autoregressive (AR) part and a moving average (MA) part
• AR part describes internal dynamics
• MA part describes external dynamics
• An ambiguity between the AR and MA parts can occur when either is used separately
  O(t) = Σ_{i=1..N} a_i O(t − iΔt) + Σ_{j=0..M} b_j I(t − jΔt)
         [AR part]                   [MA part]
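As a concrete sketch (not from the original slides), the ARMA form above can be applied to a time series as follows; the helper name `arma_filter` and the sample coefficients are illustrative assumptions:

```python
import numpy as np

def arma_filter(I, a, b):
    """Apply O(t_k) = sum_i a_i O(t_{k-i}) + sum_j b_j I(t_{k-j}).

    a : AR coefficients a_1..a_N (internal dynamics)
    b : MA coefficients b_0..b_M (external driving)
    """
    O = np.zeros(len(I))
    for k in range(len(I)):
        # MA part: weighted sum of current and past inputs
        O[k] = sum(b[j] * I[k - j] for j in range(min(len(b) - 1, k) + 1))
        # AR part: weighted sum of past outputs
        O[k] += sum(a[i - 1] * O[k - i] for i in range(1, min(len(a), k) + 1))
    return O

# Pure MA example: a 3-point weighted average of a constant input
I = np.ones(10)
O = arma_filter(I, a=[], b=[0.5, 0.3, 0.2])
```

With an empty `a` this is a pure MA filter; with an empty `b` it would be pure AR.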
Other Names for LPFs
• Convolution Filter
• Moving Average
• Infinite Impulse Response Filter (IIR)
• Finite Impulse Response Filter (FIR)
• Recursive Filter
• ARMA Filter
• LPFs are a subset of Linear Filters which relate the Fourier spectra of two signals
MA Filters Are Green’s Functions
• Ordinary Differential Equations Can Be Solved with Green's Functions:

  dO/dt = −O/τ + I(t)

  dG/dt = −G/τ + δ(t),   G(t) = e^(−t/τ)

  O(t) = ∫_0^∞ I(t − t′) G(t′) dt′

• Discretizing the convolution integral gives the MA filter:

  O(t) = Σ_{j=0..M} b_j I(t − jΔt),   b_j = G(jΔt) Δt
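A small numerical sketch of this discretization (the values of τ, Δt, and M are arbitrary assumptions): sampling the Green's function gives the MA coefficients, and convolving a unit impulse through the filter recovers G itself.

```python
import numpy as np

tau, dt, M = 5.0, 1.0, 30
# MA coefficients sampled from the Green's function: b_j = G(j*dt)*dt, G(t) = exp(-t/tau)
b = np.exp(-np.arange(M + 1) * dt / tau) * dt

# Discrete convolution O(t) = sum_j b_j I(t - j*dt)
I = np.zeros(50)
I[0] = 1.0 / dt                       # unit impulse
O = np.convolve(I, b)[:len(I)]        # impulse response reproduces G(t)
```

The impulse response of an MA filter is (by construction) its own coefficient sequence, which is why MA coefficients can be read as a Green's function.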
AR Filters Are Differential Equations
• Ordinary Differential Equations Can Be Rewritten As AR Filters:

  dO/dt = −O/τ + I

  [O(t) − O(t − Δt)] / Δt = −O(t − Δt)/τ + I(t)

  O(t) = O(t − Δt)[1 − Δt/τ] + I(t) Δt
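The finite-difference step can be checked numerically; this sketch (with arbitrary τ and Δt) compares the AR(1) recursion against the exact exponential decay for zero input:

```python
import numpy as np

tau, dt, nsteps = 10.0, 0.01, 1000
a1 = 1.0 - dt / tau                  # AR coefficient from the discretized ODE
O = np.empty(nsteps)
O[0] = 1.0
for k in range(1, nsteps):           # zero input: O(t) = a1 * O(t - dt)
    O[k] = a1 * O[k - 1]

exact = np.exp(-(nsteps - 1) * dt / tau)   # analytic solution e^(-t/tau)
```

For Δt ≪ τ the recursion tracks the analytic solution closely; as Δt grows the discretization error grows.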
Determining LPF Coefficients
The a's and b's are found by solving an overdetermined matrix equation:

  O(t) = Σ_{i=1..N} a_i O(t − iΔt) + Σ_{j=0..M} b_j I(t − jΔt)

One row per sample time t_k:

  [ O(t_k − Δt) … O(t_k − NΔt)  I(t_k) … I(t_k − MΔt) ] [ a_1 … a_N  b_0 … b_M ]ᵀ = O(t_k)

Solved using Least Squared Error Optimization or Singular Value Decomposition
Often t_k = t_0 + kΔt
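A minimal sketch of this construction using NumPy's least-squares solver; the synthetic system and its coefficients (a_1 = 0.8, b_0 = 2.0) are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
a1_true, b0_true = 0.8, 2.0
I = rng.standard_normal(500)
O = np.zeros(500)
for k in range(1, 500):              # synthetic ARMA system with known coefficients
    O[k] = a1_true * O[k - 1] + b0_true * I[k]

# Each row k of the matrix: [O(t_k - dt), I(t_k)]; right-hand side: O(t_k)
A = np.column_stack([O[:-1], I[1:]])
coef, *_ = np.linalg.lstsq(A, O[1:], rcond=None)
a1_fit, b0_fit = coef                # least-squares estimate of a_1 and b_0
```

With noiseless synthetic data the coefficients are recovered essentially exactly; with real data the fit minimizes the squared residual over all rows.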
More on Linear Filters
• The Linear Filter formalism can be extended to multiple inputs
• There is an ambiguity between the MA and AR parts for certain kinds of differential equations. An ARMA filter can greatly simplify some MA filters
  O(t) = Σ_{j=0..M} b_j I(t − jΔt),   b_j ∝ e^(−jΔt/τ)

  O(t) = a_1 O(t − Δt) + I(t) Δt,   a_1 = e^(−Δt/τ)
Handling Data Gaps
• Missing Inputs
  – Omit intervals with data gaps, if possible
  – Interpolation over data gaps will smear out MA coefficients
• Missing Outputs
  – Omit intervals with data gaps
  – Interpolation over data gaps can ruin the AR coefficients
  – Less sensitive to interpolation in lag outputs
Linear Filter Analysis
• Once the filter coefficients are determined, we relate them to ODEs
• Exponential decay (b_j fall off exponentially with lag):

  dO/dt = −O/τ + I

• Recurrence (secondary peak in b_j at lag T):

  dO/dt = −O/τ + I(t) + c I(t − T)

• Derivative operator (adjacent b_j of opposite sign):

  dO/dt = −O/τ + dI/dt
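The first case can be inverted numerically: given MA coefficients that decay exponentially, the slope of log b_j vs lag recovers the decay time τ of the underlying ODE. The coefficient values below are assumed for illustration:

```python
import numpy as np

dt, tau_true = 1.0, 6.0
lags = np.arange(20)
b = np.exp(-lags * dt / tau_true)        # MA coefficients with exponential decay

# log(b_j) is linear in lag, with slope -dt/tau
slope, intercept = np.polyfit(lags, np.log(b), 1)
tau_fit = -dt / slope                    # recovered decay time
```

With noisy real coefficients the same fit would give an estimate of τ rather than an exact recovery.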
LPF Localization
• Localization adds Nonlinearity
  – Nonlocal LPFs cannot handle nonlinear dynamics
• Localize in Time
  – Continuously reconstruct LPF based only on most recent data
  – Can be very accurate but hard to interpret
• Localize in State Space
  – Construct different LPF for each region of state space
  – Can provide multivariate nonlinearity
LPF Example: Dst-H
• Dst is a weighted average of H measured at 4 stations around the globe.
• Depending on activity and the local time of a station, there is a different relationship between Dst and H
• We will localize this filter in 24 1-hour bins of local time
• The filter will be very simple: H(t) = b_0(lt) Dst(t)
  – lt is the local time of the station at time t
  – We solve this equation in each 1-hour bin of local time
• By plotting b_0 vs lt, we can infer the local current systems
  – This local current system is believed to be the partial ring current
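The binned one-point fit might look like the sketch below on synthetic data. The local-time dependence of b_0 here is an invented stand-in (constant within each 1-hour bin), not the measured partial-ring-current signature:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
lt = rng.uniform(0, 24, n)                    # station local time (hours)
Dst = -100 * rng.random(n) - 1.0              # synthetic Dst values
# Invented local-time dependence, constant within each 1-hour bin
b0_true = 1.0 + 0.2 * np.sin(2 * np.pi * np.floor(lt) / 24)
H = b0_true * Dst

# Solve H = b0(lt) * Dst separately in each 1-hour local-time bin
b0 = np.empty(24)
for h in range(24):
    m = (lt >= h) & (lt < h + 1)
    # one-point least-squares solution within the bin
    b0[h] = np.sum(H[m] * Dst[m]) / np.sum(Dst[m] ** 2)
```

Plotting `b0` against bin number is the analysis step described above.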
Partial Ring Current
• Localized Filter: H = b0(lt)Dst
• Dst is less intense than H near Dusk due to an enhanced Ring Current in the Dusk sector
[Figure: "HSJG-Dst 1-Point Filter by Local Time" — 1/b_0 (0.6 to 1.2) vs local time (0 to 24 h), with local dusk marked]
LPF Example: Dst-VBs
• The Ring Current (Dst) is largely driven by the solar wind
• One good “coupling function” for the solar wind driver is the interplanetary electric field (VBs)
• We construct a long MA filter:
  Dst(t) = Σ_{j=0..M} b_j VBs(t − jΔt)
Dst-VBs Filter Coefficients
• Note the roughly exponential decay
• The differential equation could be:

  dDst/dt = −Dst/τ + b VBs

• We can, therefore, build a trivial ARMA filter to do the same job:

  Dst(t) = b VBs(t − Δt) Δt + a_1 Dst(t − Δt)

[Figure: MA filter coefficients b_j vs lag (−5 to 45 hours), showing roughly exponential decay]
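The MA/ARMA equivalence can be verified numerically. The values of τ, b, and the synthetic VBs series below are assumptions for illustration; the ARMA version here injects VBs(t) rather than VBs(t − Δt), which only shifts the output by one step:

```python
import numpy as np

rng = np.random.default_rng(2)
dt, tau, b = 1.0, 7.0, -4.0
VBs = np.abs(rng.standard_normal(200))        # synthetic non-negative driver

# Long MA filter with exponentially decaying coefficients
M = 150
w = b * np.exp(-np.arange(M + 1) * dt / tau) * dt
Dst_ma = np.convolve(VBs, w)[:len(VBs)]

# Trivial ARMA filter doing the same job with two coefficients
a1 = np.exp(-dt / tau)
Dst_arma = np.zeros(len(VBs))
for k in range(len(VBs)):
    prev = Dst_arma[k - 1] if k > 0 else 0.0
    Dst_arma[k] = a1 * prev + b * VBs[k] * dt
```

The two outputs agree to within the truncation error of the finite MA filter, which is the sense in which an ARMA filter "greatly simplifies" a long MA filter.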
Neural Network Uses
• Map multiple Inputs to one Output (or many)
  – Excellent nonlinear interpolation but not reliable for extrapolation
• Make a forecast
  – Excellent for forecasting complex phenomena
• Determine System Dynamics
  – NN is a black box model of the system
  – Run NN on simulated data to isolate system response to individual inputs
• Many exotic NNs exist to perform other tasks
NN Theory
• Based on biological neural systems
  – Biological NN: composed of connections between individual neurons
  – Artificial NN: composed of weights between perceptrons
• We don't know exactly how biological neural systems learn, so we have made some approximate training schemes
• Artificial NNs excel at quantifying
• Biological NNs excel at qualifying
NN Topology
• A standard Feed Forward NN has no recursive connections
• Arrows represent weights (w,v) and biases (b,c)
• hi and Oi are perceptrons
• Typically only one hidden layer is necessary
• More hidden units allow for more complexity in fitting (not always good)
• Nonlinearity is achieved through an activation function – tanh(x) or [1 + e^(−x)]^(−1)
[Figure: feed-forward network with inputs I_1–I_3, hidden units h_1–h_4, and output O_1; weights v_ij connect inputs to hidden units, weights w_i connect hidden units to the output]

  O(t) = b + Σ_i w_i tanh(c_i + Σ_j v_ij I_j(t))
• This is equivalent to an MA Filter
• AR behavior can be achieved through recurrence
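A sketch of the forward pass for this topology (3 inputs, 4 hidden units, 1 output); the weight values are random placeholders, not a trained network:

```python
import numpy as np

def nn_forward(I, v, c, w, b):
    """O = b + sum_i w_i * tanh(c_i + sum_j v_ij * I_j)."""
    h = np.tanh(c + v @ I)        # hidden-layer perceptrons, tanh activation
    return b + w @ h              # linear output perceptron

rng = np.random.default_rng(3)
v = rng.standard_normal((4, 3))   # input-to-hidden weights v_ij
c = rng.standard_normal(4)        # hidden biases
w = rng.standard_normal(4)        # hidden-to-output weights w_i
b = 0.1                           # output bias
O = nn_forward(np.array([1.0, -0.5, 2.0]), v, c, w, b)
```

Because tanh is bounded, the output can never stray further than Σ|w_i| from the bias b, which is one way to see why NNs extrapolate poorly.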
NN Recurrence
• An Output Recurrent network is useful when O(t) depends on O(t − Δt)
• The recurrence is usually only implicit during training: the pseudo-input I*_3 is taken from actual data rather than from the previous O_1

[Figure: two network diagrams with inputs I_1, I_2 and pseudo-input I*_3 feeding hidden units h_1–h_4 and output O_1; the pseudo-input carries the recurrent signal]

• An Elman Network is useful when O(t) depends on the time history of the Inputs
• This makes training rather difficult
  – Continuous time series are needed
  – Batch optimization is impossible
NN Training Theory
• A NN is initialized with a random set of weights
• It is trained to find the optimal weights
  – Iteratively adjust the weights
  – Target least squared error for most data
  – Target relative error for data with rare large events
• There must be far more training samples than there are weights
  – At least a factor of 10
• The goal is to achieve a fit which will work well out of sample as well as in sample
  – Poor out of sample performance is a result of overfitting
Gradient Descent
• Simplest non-linear optimization
• Weights are corrected in steps down the error gradient
• A learning rate η is used to ensure smooth convergence to the error minimum
• Descent can be stabilized by adding momentum m, which recalls ΔW(s−1) from the last step
• m and η should be between 0 and 1, sometimes functions of s
• For recurrent networks, gradient descent should be done serially for each t_k
• This type of Gradient Descent replaces Backpropagation

  e_k = O(t_k) − Õ(t_k),   E = ½ Σ_k |e_k|²

  J = ∂Õ/∂W,   ∇_W E = −Jᵀe

  ΔW(s) = η Jᵀe + m ΔW(s−1)

W is a vector holding all the NN weights and biases; Õ is the NN output
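A sketch of the update rule on a deliberately simple two-parameter linear model, so that the Jacobian J is trivial; η, m, and the target slope/intercept are assumed values:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 100)
target = 3.0 * x + 1.0                        # O(t_k); model output is w1*x + w0

W = np.zeros(2)                               # weight vector [w1, w0]
dW = np.zeros(2)
eta, m = 0.005, 0.5                           # learning rate and momentum, both in (0, 1)
J = np.column_stack([x, np.ones_like(x)])     # Jacobian of model output w.r.t. W
for s in range(300):
    e = target - J @ W                        # e_k = O(t_k) - model(t_k)
    dW = eta * (J.T @ e) + m * dW             # dW(s) = eta*J^T e + m*dW(s-1)
    W += dW
```

Too large an η makes the iteration diverge; the momentum term damps oscillations across steep directions of the error surface.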
Levenberg-Marquardt Training
• LM training is much faster than gradient descent
• LM training is not very appropriate for recurrent networks
• Algorithm:
  1. Increase μ until a step can be taken without increasing the error
  2. Decrease μ while error decreases with each step
  3. Repeat 1 & 2 until μ exceeds a threshold or another training limit is met

  e_k = O(t_k) − Õ(t_k),   J = ∂Õ/∂W

  e(W_0 + ΔW) ≈ e(W_0) − J ΔW

  Newton: ΔW = (Jᵀ J)⁻¹ Jᵀ e

  LM:     ΔW = (Jᵀ J + μI)⁻¹ Jᵀ e

LM is based on Newton's Method, but, when μ is large, LM training becomes gradient descent
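The LM update can be sketched directly. Here a linear least-squares problem stands in for the NN (so J is constant), which is an assumption for illustration; with μ = 0 a single step lands exactly on the optimum:

```python
import numpy as np

def lm_step(J, e, mu):
    """dW = (J^T J + mu*I)^(-1) J^T e, with e = target minus model output."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + mu * np.eye(n), J.T @ e)

rng = np.random.default_rng(5)
J = np.column_stack([rng.uniform(-1, 1, 50), np.ones(50)])
W_true = np.array([2.0, -1.0])
W = np.zeros(2)
e = J @ W_true - J @ W           # error of the current weights
W = W + lm_step(J, e, mu=0.0)    # mu = 0: pure Newton step, exact for a linear model
```

Large μ shrinks the step toward a short move along the gradient direction; small μ approaches the full Newton step — the interpolation the slide describes.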
NN Generalization
• Overfitting can render NNs useless
• Always reserve some of your training data (20%-50%) for out-of-sample testing
• Identical network topologies can perform differently, depending on their initial weights (assigned randomly)
  – Train several networks (5) and keep the one with the best out-of-sample performance
• Starve the NN by reducing the number of hidden units until fit quality begins to plummet
NN Analysis
• The quality of an NN output depends on the training-set density of points near the associated inputs
  – Always plot histograms of the input & output parameters in the training set to determine the high training density region
  – Regions of input space which are sparsely populated are not well determined in the NN and may exhibit artificial behavior
• Try several different input combinations
  – Indicates what influences the system
• Analyzing weights directly is nearly impossible
• Instead, we run the trained NN on artificial inputs so that we can isolate the effects of a single variable on the multivariate system
  – Vary one input (or two) while holding all other inputs constant
  – Simulate a square-wave input for a time series (pseudo-impulse)
  – To identify real and artificial behavior, plot training points in the neighborhood of the simulated data to see what the NN is fitting
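The "vary one input" step might look like the sketch below, with a hypothetical stand-in function in place of a trained NN (the functional form is invented, not a trained model):

```python
import numpy as np

def trained_nn(Dst, VBs, lt):
    """Hypothetical stand-in for a trained NN (invented functional form)."""
    return Dst * (1.0 + 0.1 * np.sin(2 * np.pi * (lt - 18) / 24)) - 2.0 * VBs

# Vary Dst while holding VBs and local time fixed
Dst_grid = np.arange(-100, 11, 10)
H_sweep = trained_nn(Dst_grid, VBs=5.0, lt=6.0)
```

Repeating the sweep at several fixed values of the held inputs, and overplotting nearby training points, is the analysis procedure described above.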
NN Example: Dst-H
• We will repeat the Dst-H analysis nonlinearly
• Inputs: Dst, VBs, sin(lt), cos(lt); Output: H
  – Train with Levenberg-Marquardt
• The VBs input allows us to specify the intensity of (partial) ring current injection
• The lt inputs allow us to implicitly localize the NN
• By plotting H vs lt for fixed values of Dst and VBs, we can infer the local current systems and determine what role the VBs electric field plays
• By using an NN instead of an LPF, we add non-linear interpolation at the expense of linear extrapolation
Partial Ring Current (2)
• We ran the NN on simulated data
  – (Dst,VBs,lt) = ({-100:10},{0,5},{0:18})
  – Psw was constant at 3 nPa
• We plot the H-Dst relationship at constant lt, VBs, and Psw to see its characteristics
  – A localized current system is creating an asymmetry (Dawn-Dusk)
  – Otherwise, the relationship is linear
• Comparing the recovery phase (VBs = 0) to the main phase (VBs = 5 mV/m), we see that at larger Dst, the local-time asymmetry is weaker for the recovery phase than for the main phase
! It is generally impossible to make direct measurements of the local time H-Dst relationship at fixed VBs
[Figure: two panels of HSJG vs Dst at Psw = 3 — main phase (VBs = 5) and recovery phase (VBs = 0) — each with curves at lt = 0, 6, 12, 18]
NN Example: Dst-VBs
• There is some speculation that the dynamics of the ring current are nonlinear: try an NN!
• Inputs: Dst(t-1),VBs(t),Psw(t-1),Psw(t)
• Output: Dst(t)
• Dst(t-1) provides implicit output recurrence
– We can still train with Levenberg-Marquardt!
• VBs(t) provides the new injection
• Psw allows the NN to remove the magnetopause contamination
Dst-VBs NN Analysis
• We ran the network on simulated data
  – (Dst,VBs) = ({-200:0},{0:7})
  – Psw was constant at 3 nPa
• ΔDst = Dst(t+1) − Dst(t)
• The phase-space trajectories are very linear in the high training density (HTD) area
• Curvature outside of the HTD area may be artificial
• Note how VBs affects the trajectories
• The dynamic equation is:

  dDst/dt = −Dst/τ(VBs) + Q(VBs)

[Figure: "Neural Network Phase Space" — ΔDst vs Dst (−200 to 0) for VBs = 0 to 7 mV/m; the trajectories shift as VBs increases, and the high training density region is marked]
Summary
• Linear Prediction Filters
  – Uses
    • System Description
    • Time Evolution
  – Construction
    • Choose AR, MA or ARMA
    • Least Squares Solution of Overdetermined Matrix
  – Localize for Nonlinearity
  – Filter Analysis
    • Green's Function Analogue
    • Local Filters Reveal Local Processes
• Neural Networks
  – Uses
    • System Description
    • Time Evolution
  – Theory
    • Based on Biology
  – Training
    • Iterative Adjustment of Weights
  – Network Analysis
    • Consider Training Density
    • Run on Simulated Data
    • Examine Phase Space
Further Reading
All you ever wanted to know about LPFs in Space Physics: Solar Wind-Magnetosphere Coupling Proceedings, (Kamide and Slavin eds) Terra Scientific Publishing, 1986. Articles by Clauer (p. 39), McPherron et al. (p. 93), and Fay et al (p. 111)
An Excellent Source for all kinds of NN stuff is the Matlab Neural Network Toolbox User’s Guide or Neural Network Design by Beale, Hagan, & Demuth. PWS Publishers 1995
"Learning internal representations by error propagation" by Rumelhart, Hinton, Williams in Parallel Distributed Processing, Vol. 1, Ch. 8, p. 318. MIT Press (Rumelhart & McClelland eds)
Proceedings of the International Workshop on Artificial Intelligence Applications in Solar-Terrestrial Physics: 1993 and 1997 meetings