Advanced Information Retrieval
Chapter 02: Modeling - Neural Network Model

Date posted: 30-Dec-2015
Upload: jean-burns
Transcript
Page 1: Advanced Information Retrieval

Chapter 02: Modeling - Neural Network Model

Page 2: Neural Network Model

A neural network is an oversimplified representation of the neuron interconnections in the human brain:
– nodes are processing units
– edges are synaptic connections
– the strength of a propagating signal is modelled by a weight assigned to each edge
– the state of a node is defined by its activation level
– depending on its activation level, a node might issue an output signal

Page 3: Neural Networks

• Neural Networks
– Complex learning systems recognized in animal brains
– Single neuron has simple structure
– Interconnected sets of neurons perform complex learning tasks
– Human brain has 10^15 synaptic connections
– Artificial Neural Networks attempt to replicate non-linear learning found in nature

[Figure: biological neuron, showing Dendrites, Cell Body, Axon]

Page 4: Neural Networks (cont’d)

– Dendrites gather inputs from other neurons and combine information
– Then generate non-linear response when threshold reached
– Signal sent to other neurons via axon
– Artificial neuron model is similar
– Data inputs (x_i) are collected from upstream neurons and input to combination function (sigma)

[Figure: artificial neuron with inputs x_1, x_2, ..., x_n producing output y]

Page 5: Neural Networks (cont’d)

– Activation function reads combined input and produces non-linear response (y)
– Response channeled downstream to other neurons

• What problems are applicable to Neural Networks?
– Quite robust with respect to noisy data
– Can learn and work around erroneous data
– Results opaque to human interpretation
– Often require long training times

Page 6: Input and Output Encoding

– Neural Networks require attribute values encoded to [0, 1]

• Numeric
– Apply min-max normalization to continuous variables
– Works well when Min and Max known
– Also assumes new data values occur within Min-Max range
– Values outside range may be rejected or mapped to Min or Max

X* = (X - min(X)) / (max(X) - min(X)) = (X - min(X)) / range(X)
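The min-max step can be sketched as follows; the sample range and values are illustrative, not from the slides:

```python
def min_max_normalize(x, x_min, x_max):
    """X* = (X - min(X)) / range(X); out-of-range values are mapped to Min or Max."""
    x = min(max(x, x_min), x_max)  # clip, per the slide's "mapped to Min or Max"
    return (x - x_min) / (x_max - x_min)

# Hypothetical training range for an "age" attribute
ages_min, ages_max = 20, 60
print(min_max_normalize(30, ages_min, ages_max))  # 0.25
print(min_max_normalize(75, ages_min, ages_max))  # 1.0 (out of range, mapped to Max)
```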

Page 7: Input and Output Encoding (cont’d)

• Output
– Neural Networks always return continuous values in [0, 1]
– Many classification problems have two outcomes
– Solution uses threshold established a priori in single output node to separate classes
– For example, target variable is “leave” or “stay”
– Threshold value is “leave if output >= 0.67”
– Single output node value = 0.72 classifies record as “leave”
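The a priori threshold on the single output node amounts to a one-line rule; the 0.67 cutoff and class labels are the slide's example:

```python
def classify(output, threshold=0.67):
    """Map the network's continuous output in [0, 1] to one of two classes."""
    return "leave" if output >= threshold else "stay"

print(classify(0.72))  # leave
print(classify(0.40))  # stay
```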

Page 8: Simple Example of a Neural Network

– Neural Network consists of layered, feedforward, completely connected network of nodes
– Feedforward restricts network flow to single direction
– Flow does not loop or cycle
– Network composed of two or more layers

[Figure: Input Layer (Node 1, Node 2, Node 3), Hidden Layer (Node A, Node B), Output Layer (Node Z), with weights W0A, W1A, W2A, W3A, W0B, W1B, W2B, W3B, W0Z, WAZ, WBZ]

Page 9: Simple Example of a Neural Network (cont’d)

– Most networks have Input, Hidden, Output layers
– Network may contain more than one hidden layer
– Network is completely connected
– Each node in given layer connected to every node in next layer
– Every connection has weight (W_ij) associated with it
– Weight values randomly assigned 0 to 1 by algorithm
– Number of input nodes dependent on number of predictors
– Number of hidden and output nodes configurable

Page 10: Simple Example of a Neural Network (cont’d)

– Combination function produces linear combination of node inputs and connection weights as single scalar value
– For node j, x_ij is the ith input
– W_ij is weight associated with the ith input node
– I + 1 inputs to node j
– x_1, x_2, ..., x_I are inputs from upstream nodes
– x_0 is constant input value = 1.0
– Each node has extra input W_0j x_0j = W_0j

net_j = Σ_i W_ij x_ij = W_0j x_0j + W_1j x_1j + ... + W_Ij x_Ij

[Figure: same three-layer network as Page 8]
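The combination function is a plain weighted sum; a minimal sketch, with the constant x_0 = 1.0 carried as the first input (values are illustrative):

```python
def net(weights, inputs):
    """net_j = sum_i W_ij * x_ij, where inputs[0] is the constant x_0 = 1.0."""
    return sum(w * x for w, x in zip(weights, inputs))

# Hypothetical node j with two data inputs: weights W0j, W1j, W2j
print(net([0.5, 0.25, 0.25], [1.0, 0.5, 0.5]))  # 0.75
```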

Page 11: Simple Example of a Neural Network (cont’d)

– The scalar value computed for hidden layer Node A equals the sum below
– For Node A, net_A = 1.32 is input to activation function
– Neurons “fire” in biological organisms
– Signals sent between neurons when combination of inputs crosses threshold

x0 = 1.0   W0A = 0.5   W0B = 0.7   W0Z = 0.5
x1 = 0.4   W1A = 0.6   W1B = 0.9   WAZ = 0.9
x2 = 0.2   W2A = 0.8   W2B = 0.8   WBZ = 0.9
x3 = 0.7   W3A = 0.6   W3B = 0.4

net_A = Σ_i W_iA x_iA = W_0A(1.0) + W_1A x_1 + W_2A x_2 + W_3A x_3
      = 0.5 + 0.6(0.4) + 0.8(0.2) + 0.6(0.7) = 1.32

Page 12: Simple Example of a Neural Network (cont’d)

– Firing response not necessarily linearly related to increase in input stimulation
– Neural Networks model behavior using non-linear activation function
– Sigmoid function most commonly used
– In Node A, sigmoid function takes net_A = 1.32 as input and produces output

y = 1 / (1 + e^(-x))

y = 1 / (1 + e^(-1.32)) = 0.7892
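The sigmoid translates directly into code; this small sketch reproduces Node A's output from the slide:

```python
import math

def sigmoid(x):
    """y = 1 / (1 + e^(-x)): squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(1.32), 4))  # 0.7892, Node A's output
```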

Page 13: Simple Example of a Neural Network (cont’d)

– Node A outputs 0.7892 along connection to Node Z, and it becomes a component of net_Z
– Before net_Z is computed, contribution from Node B required
– Node Z combines outputs from Node A and Node B, through net_Z

net_B = Σ_i W_iB x_iB = W_0B(1.0) + W_1B x_1 + W_2B x_2 + W_3B x_3
      = 0.7 + 0.9(0.4) + 0.8(0.2) + 0.4(0.7) = 1.5

and f(net_B) = 1 / (1 + e^(-1.5)) = 0.8176

Page 14: Simple Example of a Neural Network (cont’d)

– Inputs to Node Z are not data attribute values
– Rather, they are outputs from sigmoid function in upstream nodes
– Value 0.8750 output from Neural Network on first pass
– Represents predicted value for target variable, given first observation

net_Z = Σ_i W_iZ x_iZ = W_0Z(1.0) + W_AZ(0.7892) + W_BZ(0.8176)
      = 0.5 + 0.9(0.7892) + 0.9(0.8176) = 1.9461

finally, f(net_Z) = 1 / (1 + e^(-1.9461)) = 0.8750
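The whole first pass (Pages 11 to 14) fits in a few lines; the inputs and weights are exactly those from the table on Page 11:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = [1.0, 0.4, 0.2, 0.7]        # x0 (constant), x1, x2, x3
w_A = [0.5, 0.6, 0.8, 0.6]      # W0A, W1A, W2A, W3A
w_B = [0.7, 0.9, 0.8, 0.4]      # W0B, W1B, W2B, W3B
w_Z = [0.5, 0.9, 0.9]           # W0Z, WAZ, WBZ

out_A = sigmoid(sum(w * v for w, v in zip(w_A, x)))  # sigmoid(1.32) ≈ 0.7892
out_B = sigmoid(sum(w * v for w, v in zip(w_B, x)))  # sigmoid(1.5)  ≈ 0.8176
out_Z = sigmoid(sum(w * v for w, v in zip(w_Z, [1.0, out_A, out_B])))
print(round(out_Z, 4))  # 0.875
```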

Page 15: Sigmoid Activation Function

– Sigmoid function combines nearly linear, curvilinear, and nearly constant behavior depending on input value
– Function nearly linear for domain values -1 < x < 1
– Becomes curvilinear as values move away from center
– At extreme values, f(x) is nearly constant
– Moderate increments in x produce variable increase in f(x), depending on location of x
– Sometimes called “Squashing Function”
– Takes real-valued input and returns values in [0, 1]

Page 16: Back-Propagation

– Neural Networks are a supervised learning method
– Require target variable
– Each observation passed through network results in output value
– Output value compared to actual value of target variable
– (Actual – Output) = Error
– Prediction error analogous to residuals in regression models
– Most networks use Sum of Squared Errors (SSE) to measure how well predictions fit target values

SSE = Σ_records Σ_output nodes (actual - output)^2
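The SSE measure can be sketched over (actual, output) pairs; the sample values below are illustrative:

```python
def sse(records):
    """SSE = sum over records and output nodes of (actual - output)^2.

    records: list of (actual, output) pairs, one per output node per record.
    """
    return sum((actual - output) ** 2 for actual, output in records)

print(round(sse([(0.8, 0.875), (1.0, 0.9)]), 6))  # 0.015625
```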

Page 17: Back-Propagation (cont’d)

– Squared prediction errors summed over all output nodes, and all records in data set
– Model weights constructed that minimize SSE
– Actual values that minimize SSE are unknown
– Weights estimated, given the data set

Page 18: Back-Propagation Rules

– Back-propagation percolates prediction error for record back through network
– Partitioned responsibility for prediction error assigned to various connections
– Back-propagation rules defined (Mitchell)

w_ij,NEW = w_ij,CURRENT + Δw_ij, where Δw_ij = η δ_j x_ij

where η = learning rate,
x_ij signifies the ith input to node j,
δ_j represents responsibility for a particular error belonging to node j

Page 19: Back-Propagation Rules (cont’d)

– Error responsibility computed using partial derivative of the sigmoid function with respect to net_j
– Values take one of two forms
– Rules show why input values require normalization
– Large input values x_ij would dominate weight adjustment
– Error propagation would be overwhelmed, and learning stifled

δ_j = output_j (1 - output_j)(actual_j - output_j)        for output layer nodes
δ_j = output_j (1 - output_j) Σ_DOWNSTREAM W_jk δ_k       for hidden layer nodes

where Σ_DOWNSTREAM W_jk δ_k refers to the weighted sum of error responsibilities for downstream nodes
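The two delta forms translate directly; function names are illustrative, and the printed values match the worked example on the following pages:

```python
def delta_output(output, actual):
    """Error responsibility for an output-layer node."""
    return output * (1 - output) * (actual - output)

def delta_hidden(output, downstream):
    """Error responsibility for a hidden-layer node.

    downstream: list of (W_jk, delta_k) pairs for the nodes this one feeds.
    """
    return output * (1 - output) * sum(w * d for w, d in downstream)

print(round(delta_output(0.875, 0.8), 4))                # -0.0082 (Node Z)
print(round(delta_hidden(0.7892, [(0.9, -0.0082)]), 5))  # -0.00123 (Node A)
```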

Page 20: Example of Back-Propagation

– Recall that first pass through network yielded output = 0.8750
– Assume actual target value = 0.8, and learning rate η = 0.1
– Prediction error = 0.8 - 0.8750 = -0.075
– Neural Networks use stochastic back-propagation
– Weights updated after each record processed by network
– Adjusting the weights using back-propagation shown next
– Error responsibility for Node Z, an output node, found first

δ_Z = output_Z (1 - output_Z)(actual - output_Z) = 0.875(1 - 0.875)(0.8 - 0.875) = -0.0082

[Figure: same three-layer network as Page 8]

Page 21: Example of Back-Propagation (cont’d)

– Now adjust “constant” weight w_0Z using rules

Δw_0Z = η δ_Z (1) = 0.1(-0.0082)(1) = -0.00082
w_0Z,NEW = w_0Z,CURRENT + Δw_0Z = 0.5 - 0.00082 = 0.49918

– Move upstream to Node A, a hidden layer node
– Only node downstream from Node A is Node Z

δ_A = output_A (1 - output_A) Σ_DOWNSTREAM W_jk δ_k = 0.7892(1 - 0.7892)(0.9)(-0.0082) = -0.00123

Page 22: Example of Back-Propagation (cont’d)

– Adjust weight w_AZ using back-propagation rules

Δw_AZ = η δ_Z (OUTPUT_A) = 0.1(-0.0082)(0.7892) = -0.000647
w_AZ,NEW = w_AZ,CURRENT + Δw_AZ = 0.9 - 0.000647 = 0.899353

– Connection weight between Node A and Node Z adjusted from 0.9 to 0.899353
– Next, Node B is hidden layer node
– Only node downstream from Node B is Node Z

δ_B = output_B (1 - output_B) Σ_DOWNSTREAM W_jk δ_k = 0.8176(1 - 0.8176)(0.9)(-0.0082) = -0.0011

Page 23: Example of Back-Propagation (cont’d)

– Adjust weight w_BZ using back-propagation rules

Δw_BZ = η δ_Z (OUTPUT_B) = 0.1(-0.0082)(0.8176) = -0.00067
w_BZ,NEW = w_BZ,CURRENT + Δw_BZ = 0.9 - 0.00067 = 0.89933

– Connection weight between Node B and Node Z adjusted from 0.9 to 0.89933
– Similarly, application of back-propagation rules continues to input layer nodes
– Weights {w_1A, w_2A, w_3A, w_0A} and {w_1B, w_2B, w_3B, w_0B} updated by process
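The forward pass plus the output-layer weight updates of Pages 20 to 23 can be reproduced end to end. Note the code uses η = 0.1, the rate actually applied in the slides' arithmetic; deltas are computed with the current weights before any update:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs and initial weights from the table on Page 11
x = [1.0, 0.4, 0.2, 0.7]            # x0 (constant), x1, x2, x3
w_A = [0.5, 0.6, 0.8, 0.6]          # W0A, W1A, W2A, W3A
w_B = [0.7, 0.9, 0.8, 0.4]          # W0B, W1B, W2B, W3B
w_0Z, w_AZ, w_BZ = 0.5, 0.9, 0.9
eta = 0.1                            # learning rate
actual = 0.8                         # target value for this record

# Forward pass (Pages 11-14)
out_A = sigmoid(sum(w * v for w, v in zip(w_A, x)))   # ≈ 0.7892
out_B = sigmoid(sum(w * v for w, v in zip(w_B, x)))   # ≈ 0.8176
out_Z = sigmoid(w_0Z + w_AZ * out_A + w_BZ * out_B)   # ≈ 0.8750

# Error responsibilities (delta rules, Page 19)
delta_Z = out_Z * (1 - out_Z) * (actual - out_Z)      # ≈ -0.0082
delta_A = out_A * (1 - out_A) * (w_AZ * delta_Z)      # ≈ -0.00123
delta_B = out_B * (1 - out_B) * (w_BZ * delta_Z)      # ≈ -0.0011

# Stochastic weight updates for the output-layer connections
w_0Z += eta * delta_Z * 1.0      # 0.5 -> ≈ 0.49918
w_AZ += eta * delta_Z * out_A    # 0.9 -> ≈ 0.899353
w_BZ += eta * delta_Z * out_B    # 0.9 -> ≈ 0.89933
```

Running this with full precision (no intermediate rounding) recovers the slides' adjusted weights to the digits shown.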

Page 24: Example of Back-Propagation (cont’d)

– Now, all network weights in model are updated
– Each iteration based on single record from data set

• Summary
– Network calculated predicted value for target variable
– Prediction error derived
– Prediction error percolated back through network
– Weights adjusted to generate smaller prediction error
– Process repeats record by record

Page 25: Termination Criteria

– Many passes through data set performed
– Constantly adjusting weights to reduce prediction error
– When to terminate?

– Stopping criterion may be computational “clock” time?
– Short training times likely result in poor model

– Terminate when SSE reaches threshold level?
– Neural Networks are prone to overfitting
– Memorizing patterns rather than generalizing

– And ...

Page 26: Learning Rate

– Recall Learning Rate η (Greek “eta”) is a constant

0 < η ≤ 1, where η = learning rate

– Helps adjust weights toward global minimum for SSE

• Small Learning Rate
– With small learning rate, weight adjustments small
– Network takes unacceptable time converging to solution

• Large Learning Rate
– Suppose algorithm close to optimal solution
– With large learning rate, network likely to “overshoot” optimal solution

Page 27: Neural Network for IR

From the work by Wilkinson & Hingston, SIGIR’91

[Figure: three-layer network with Query Terms (ka, kb, kc), Document Terms (k1, ..., ka, kb, kc, ..., kt), and Documents (d1, ..., dj, dj+1, ..., dN)]

Page 28: Neural Network for IR

– Three-layer network
– Signals propagate across the network

– First level of propagation:
– Query terms issue the first signals
– These signals propagate across the network to reach the document nodes

– Second level of propagation:
– Document nodes might themselves generate new signals which affect the document term nodes
– Document term nodes might respond with new signals of their own

Page 29: Quantifying Signal Propagation

– Normalize signal strength (MAX = 1)
– Query terms emit initial signal equal to 1
– Weight associated with an edge from a query term node ki to a document term node ki:

W_iq = w_iq / sqrt( Σ_i w_iq² )

– Weight associated with an edge from a document term node ki to a document node dj:

W_ij = w_ij / sqrt( Σ_i w_ij² )

Page 30: Quantifying Signal Propagation (cont’d)

– After the first level of signal propagation, the activation level of a document node dj is given by:

Σ_i W_iq W_ij = Σ_i w_iq w_ij / ( sqrt( Σ_i w_iq² ) * sqrt( Σ_i w_ij² ) )

– which is exactly the ranking of the Vector model
– New signals might be exchanged among document term nodes and document nodes in a process analogous to a feedback cycle
– A minimum threshold should be enforced to avoid spurious signal generation
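The first-level activation is the cosine of the raw weight vectors, i.e. the Vector model ranking. A minimal sketch, assuming the query and document weights are given over the same term vocabulary (the sample vectors are hypothetical):

```python
import math

def activation(query_w, doc_w):
    """Activation of document node dj after the first propagation level.

    query_w: raw query-term weights w_iq; doc_w: raw document-term weights w_ij.
    Equals sum_i W_iq * W_ij with the normalized edge weights from Page 29,
    which is the cosine ranking of the Vector model.
    """
    q_norm = math.sqrt(sum(w * w for w in query_w))
    d_norm = math.sqrt(sum(w * w for w in doc_w))
    dot = sum(wq * wd for wq, wd in zip(query_w, doc_w))
    return dot / (q_norm * d_norm)

print(activation([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical vectors)
print(activation([1.0, 0.0], [0.0, 1.0]))  # 0.0 (no shared terms)
```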

Page 31: Conclusions

– Model provides an interesting formulation of the IR problem
– Model has not been tested extensively
– It is not clear what improvements the model might provide