Faculty of Engineering
Electrical Engineering and Computer Engineering

Neural and anthropomorphic processing of the “cocktail party” effect with applications to sound source separation and pattern recognition

Doctoral thesis
Specialty: electrical engineering

Ramin PICHEVAR

Sherbrooke (Quebec), Canada, November 2004



SUMMARY

This thesis consists of two parts. The first part deals with sound source separation. Building on the findings of the first part, we propose in the second part a neural architecture for pattern recognition.

The proposed sound separation system is based on a bio-inspired neural architecture of spiking networks. Two different representations (Cochleotopic / AMtopic or Cochleotopic / Spectrotopic) are used as preprocessing. These two-dimensional auditory images try to partially mimic the behavior of the auditory pathway. The building blocks of the proposed neural network are oscillatory relaxation neurons. We show that the behavior of the more popular “integrate-and-fire” neuron is an approximation of the relaxation neuron. Separation is based on the synchronization of the second layer of neurons. Each neuron of the second layer is associated with a cochlear channel (256 channels in total). An enhanced version of the gammatone analysis/synthesis filterbank is used to generate the cochlear channels. The log-spectral distortion (LSD) criterion is used to compare performance. We also use other performance criteria such as PEL (percentage of energy loss), PNR (percentage of noise residue), SNR (signal-to-noise ratio), and PESQ (perceptual evaluation of speech quality).

The pattern recognition system is inspired by the first part of the thesis. The goal of this part is to draw an analogy between vision and audition in order to propose an architecture capable of pattern recognition. Our architecture, called “Oscillatory Dynamic Link Matching”, is an extension of the “Dynamic Link Matching” architecture proposed earlier by other researchers. The proposed architecture comprises two layers. If synchronization is achieved between the layers, the pattern exists in the scene. The behavior of the network is analyzed mathematically in the thesis.


ABSTRACT

This thesis consists of two parts. The first part deals with the sound source separation problem. Based on the findings of the first part, a neural architecture for visual pattern recognition is proposed in the second part.

The proposed sound source separation technique is based on a two-layered bio-inspired spiking neural network. Depending on the characteristics of the intruding sound, one of the two proposed bio-inspired spectral maps (Cochleotopic / AMtopic or Cochleotopic / Spectrotopic) is used as the front end. These two-dimensional maps try to partially mimic the auditory pathway. The building blocks of the neural network are oscillatory relaxation neurons. We show that the behavior of the more popular integrate-and-fire neuron is an approximation of that of the relaxation neuron. The separation of different sound sources is based on the synchronization of neurons in the second layer. Each neuron in the second layer is associated with a cochlear channel (a total of 256 channels in our experiments). An enhanced version of the gammatone analysis/synthesis filterbank is used to generate the cochlear channels. The Log-Spectral Distortion (LSD) criterion is used to compare performance. We also report other performance criteria: PEL (Percentage of Energy Loss), PNR (Percentage of Noise Residue), SNR (Signal-to-Noise Ratio), and PESQ (Perceptual Evaluation of Speech Quality).

The proposed visual pattern recognition system is inspired by the work done in the first part of the thesis. The goal of this part is to draw an analogy between audition and vision and to propose an architecture capable of visual pattern recognition. Our proposed ‘Oscillatory Dynamic Link Matcher’ is an extension of the already known ‘Dynamic Link Matcher’. The network consists of two layers. The pattern is applied to the first layer and the scene to the second layer. If synchronization is achieved between the layers, we conclude that the pattern exists in the scene. These facts are proven mathematically (along with other properties) in this thesis.


ACKNOWLEDGMENTS

I thank Jean Rouat, my thesis supervisor, who awakened my curiosity with his multidisciplinary approach. Thanks to his tenacity, we completed an innovative piece of research. I also thank him for his financial support.

I thank Juan-Manuel Torres, Francois Michaud, and Roch Lefebvre for agreeing to serve on my thesis committee.

I thank my colleagues and friends Stephane Loiselle, Rachid Moussaoui, Gregoire Mouly-aigrot, Gregory Farage, Stephane Ragot, Mohammed Bahoura, Hassan Ezzaidi, Steeve Larouche, Romain Balleraud, Mario Petitclerc, Le Tan Thanh Tai, and Guillaume Fuchs.

I thank DeLiang Wang and Alessandro Villa for hosting me as a visiting student in the United States and in France, respectively.

I thank Guy Benoît and Pierre-Yves Fortin of the Bureau de Liaison Entreprises Université (BLEU) of the Université de Sherbrooke for their help with the patent filing.

I thank Christian Feldbauer and Gernot Kubin for their technical collaboration within the European project COST 277.

Many thanks to all my friends and colleagues who kindly attended my thesis defense.

Finally, I thank my parents and my family for their continual support. I dedicate this work to them.


TABLE OF CONTENTS

1 INTRODUCTION
1.1 Auditory scene analysis for real scenes
1.2 Approaches for auditory scene analysis
1.3 Applications
1.4 Ideas to be investigated
1.5 Specific goals
1.6 Outline of this document

2 COMPUTATIONAL AUDITORY SCENE ANALYSIS (CASA)
2.1 Introduction
2.2 Cocktail party effect and human audition
2.3 History of CASA
2.4 Applications of CASA
2.5 Bases of auditory scene analysis
2.6 Data-driven CASA
2.7 Top-down or schema-driven CASA
2.8 Different implementations of CASA
2.9 ASA limitations
2.9.1 Sinusoidal speech
2.9.2 Limitations of pitch-based grouping
2.10 Comparison of CASA with other source separation techniques
2.10.1 Blind Sound Source Separation
2.11 Conclusion

3 NEUROCOGNITION
3.1 Introduction
3.2 Gestalt Psychology and Neurophysiology
3.3 Conventional Neural Networks
3.3.1 The binding problem
3.3.2 Are classical neural networks universal?
3.4 Solutions to the binding problem
3.4.1 Hierarchical coding
3.4.2 Attentional models
3.4.3 Assembly Coding and Temporal correlation
3.5 Conclusion

4 DYNAMICS OF BIO-INSPIRED NEURONS
4.0.1 Introduction
4.1 Different types of neuronal models
4.1.1 Class I and Class II neural excitability
4.2 Mathematical description of neurons
4.2.1 Four-dimensional neuronal models
4.2.2 Two-dimensional neural models
4.2.3 One-dimensional neural models
4.2.4 Fractal dimension neural models
4.3 Canonical Neuronal Model
4.4 Different modes of synchronization
4.5 Selection of the model
4.5.1 Pros and cons of relaxation oscillators
4.5.2 Pros and cons of ‘integrate-and-fire’ neurons

4.5.3 Pros and cons of chaotic neurons
4.6 Learning
4.6.1 Memoryless learning
4.6.2 Hebbian Learning
4.7 Implementational aspects of ‘Temporal Correlation’
4.8 Architectures for ‘temporal correlation’
4.8.1 LEGION: Locally Excitatory Globally Inhibitory Oscillatory Network
4.8.2 Attentional Oscillatory Neural Network (AONN)
4.9 Conclusion

5 SOURCE SEPARATION BY BIO-INSPIRED NEURAL NETWORKS
5.1 Introduction
5.2 Source separation
5.3 Proposed system strategy
5.4 Description of the source separation system
5.4.1 The choice of the cochlear filterbank
5.4.2 Signal analysis
5.4.3 Theoretical motivation behind the CAM/CSM generation
5.4.4 The Neural Network
5.4.5 Synthesis
5.5 Experiments
5.5.1 Database and comparison
5.5.2 Separation performance
5.6 Separation examples
5.6.1 Separation of speech from telephone trill
5.6.2 Separation of speech from 1 kHz tone

5.6.3 Double-vowel segregation case
5.6.4 Sentence plus siren
5.6.5 PESQ
5.6.6 Three-source case
5.7 Conclusion and Further Work

6 ODLM FOR PATTERN RECOGNITION
6.1 Introduction
6.2 Pattern Recognition
6.3 The Dynamic Link Matcher
6.4 The oscillatory dynamic link matcher
6.4.1 Introduction
6.4.2 Mathematical Description of the Network
6.5 Behavioral description of the network
6.6 Geometrical Interpretation of the ODLM
6.7 Results
6.8 Rate Coding vs. Phase coding
6.8.1 Rate Coding (Average over Time)
6.8.2 Phase coding
6.8.3 Dynamics of the Rate-coding DLM
6.8.4 Segmentation and Matching for Invariant Pattern Recognition
6.8.5 One-object scenes
6.9 Conclusion and Further Work

7 CONCLUSION
7.1 Summary
7.2 What has been presented
7.3 Future developments of the model
7.4 The future of Computational Auditory Scene Analysis

BIBLIOGRAPHY

LIST OF FIGURES

1.1 Bregman’s metaphoric description of audition
2.1 Human performance in the presence of multiple voices and mask
2.2 Similarity
2.3 Proximity
2.4 Old plus new
2.5 Good continuation
2.6 Closure
2.7 Mutual Exclusivity
2.8 Data-driven CASA
2.9 Description of the main ideas behind different CASA approaches
2.10 A top-down blackboard system
2.11 Wang and Brown’s oscillatory CASA system
2.12 The cochleogram for simple tones and street noise
2.13 Spectrogram for natural speech
3.1 Rosenblatt Experiment
3.2 Catastrophe scenario
3.3 The Illusory Conjunction experiment as described by Anna Treisman
3.4 Binding example no. 1




3.8 The hierarchical scene analyzer of Riesenhuber and Poggio
3.9 Hierarchical network for feature extraction with two types of attentional control
3.10 Schematic diagram of the SCAN (Signal Channelling Attentional Network)
3.11 The hierarchical approach (along with attention) used by the neocognitron to recognize ‘0’
3.12 Solution to the binding problem using the temporal correlation technique
4.1 The dependency of the spike rate on the applied input current in the Wilson-Cowan neural model
4.2 Schematic diagram for the Hodgkin-Huxley model
4.3 Different excitation modes seen in real biological neurons
4.4 Comparison of different neural models
4.5 A nullcline of the Wang-Terman equation
4.6 SIMULINK model of the “integrate-and-fire” neuron
4.7 Temporal correlation
4.8 The architecture of the LEGION
4.9 The architecture of the AONN network
5.1 The proposed source separation system
5.2 3-D plot of the output of the proposed neural network
5.3 CAM for the female /di/ and male /da/ mixture at SNR = 0 dB and t = 166 ms when the channel number is equal to 24. The separation of the two sources can be done based on ray distances.
5.4 Schematic representation of the signal processing steps required to compute the reassigned spectrum
5.5 CSM (24-channel) of the mixture of /di/ and the siren in Equation 5.23 at t = 50 ms. Segregation is based on the selection of energy bursts.
5.6 CAM (24-channel) for the /di/ /da/ mixture. Segregation is based on harmonic selection.
5.7 CSM (24-channel) for the speech plus tone mixture. Segregation is based on energy bursts.

5.8 The change in the stiffness of the hair cells due to a change of the stimulus
5.9 Idealized schematic of a 2-D spectral map (Cochleotopic/AMtopic) for a two-speaker signal
5.10 Architecture of the Two-Layer Bio-inspired Neural Network
5.11 Mixture of the utterance “Why were you all weary?” with a trill telephone noise
5.12 Separation results for the trill telephone noise
5.13 The synthesized “Why were you all weary?” by the approach proposed by Wang
5.14 Mixture of the utterance “I willingly marry Marilyn” with 1 kHz tone
5.15 Comparison between our approach and Wang’s approach for the ‘1 kHz’ tone
5.16 The spectrogram of the /di/ /da/ mixture
5.17 The spectrogram of the extracted /di/
5.18 The spectrogram of the extracted /da/
5.19 Mixture of a siren and the sentence “I willingly marry Marilyn”
5.20 Synthesis by an FIR implementation
5.21 Synthesis by an IIR implementation
5.22 Synthesis result for the siren plus sentence case, when the masking is applied before the masking
5.23 The synthesized “Why were you all weary?” proposed by Wang
6.1 Some examples of affine transforms
6.2 An industrial application of the ODLM
6.3 The architecture of the oscillatory dynamic link matcher
6.4 An affine transform T for a four-corner object
6.5 A snapshot of the activity of the first and second layers of the neural map. Colors represent relative phase of oscillations.
6.6 Neural activity pattern after segmentation

6.7 Neural activity pattern after matching
6.8 The evolution of the thresholded activity of the network through time in the segmentation phase
6.9 The evolution of the thresholded activity of the network through time in the dynamic matching phase
6.10 The synchronization index of a one-object scene when the segmentation step is bypassed
6.11 The synchronization pattern of a one-object scene when the segmentation phase precedes the matching phase
6.12 A scene segmentation done during the segmentation phase of the algorithm
6.13 Architecture of an integrated top-down and bottom-up processor (under investigation)
A-1 Bifurcation in a dynamical system
A-2 Saddle-node bifurcation in Wilson-Cowan oscillators
A-3 The transformation h
B-4 Architecture of the simplified chaotic neural network based sound source separator
B-5 Oscillatory behavior of the chaotic network for the two-speaker segregation problem
C-6 Comparison of multiplicative and additive synapses
E-7 Piecewise linear model of the state space of the Wang-Terman oscillator

LIST OF TABLES

2.1 Analogies between Vision and Audition
2.2 Gestalt principles and their applications in Auditory Scene Analysis
2.3 Grouping cues for ASA (adapted from [2])
5.1 The numerical values of the different parameters used in the first layer of the network
5.2 The numerical values of the different parameters used in the second layer of the network
5.3 The log spectral distortion (LSD) for three different methods
5.4 The PESQ of three different methods: P-R (our proposed approach), W-B ([3]), and H-W ([4]) (see caption of Table 5.3). Higher values mean better performance.
5.5 PESQ for two different methods: P-R (our proposed approach) and J-L ([5]). The mixture comprises a female voice with musical rock background.
D-1 Parameters for the Hodgkin-Huxley Equations
D-2 Parameters used in Equation 4.2.1

CHAPTER 1

INTRODUCTION

1.1 Auditory scene analysis for real scenes

We have all been confronted with situations in which many sound sources are present in our environment. This can happen at a “cocktail party” or when walking down a street. Amazingly, humans are able to separate the different sound sources and decode the underlying message, whatever the type and structure of the original sources. This phenomenon has been described very elegantly by Bregman’s metaphor: “Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” [6].

Figure 1.1: Bregman’s metaphoric description of audition: based on the movements in

the two narrow channels, the person should say how many boats are in the lake (adapted

from Ellis’s presentation at NSF Speech Separation Workshop, Montreal, 2003).

This human ability has been studied in psychoacoustics under the names of auditory perceptual organization and auditory scene analysis [6]. These studies construct experimental stimuli consisting of a few simple sounds, such as sine tones or noise bursts, and then record subjects’ interpretation of the combination. The work by Bregman has revealed much about the mechanisms by which structure is derived from sound, but it typically fails to address the question of scaling these results to more complex sounds, namely sounds coming from a real environment.

As discussed in detail in Chapter 2, designing a viable sound source separator that works for any mixture and any type of sound requires adapting or modifying Bregman’s simple rules for real-world scenarios. This is what we attempt in this thesis. It should be noted that the state of the art is far from handling this problem in the general case. Experts predict that it will take many years before a system is capable of outperforming humans.

1.2 Approaches for auditory scene analysis

A detailed description of the state of the art in Auditory Scene Analysis is given in Chapter 2, but for the time being let us say that there are three main approaches to solving this problem. The first relies on extracting the statistics of the underlying sounds in the mixture and on using statistical concepts. The second uses expert systems based on heuristics. The third is based on neural networks.

Personally, I think that since humans are good at auditory scene analysis, it is a good idea to mimic them. One can argue that this is not always the best way to solve engineering problems. First, not all human-made systems are inspired by their natural counterparts. For instance, an airplane does not fly as a bird does. Furthermore, not all the dynamics of the nervous system are known to us. So how can we reproduce something we do not know much about? I agree with these arguments, but as a counter-argument I would say that auditory scene analysis is a very new scientific field. If we do not try to mimic human behavior, what else can we do? Remember that the first attempts to build flying machines closely resembled the anatomy of birds (like the prototype made in 1870 by the French engineer Alphonse Penaud, among others). Hence, let us start by mimicking the nervous system and then try to adapt it to our technology and computers. If parts of our understanding of the nervous system are missing, let us replace them with more “engineering-inspired” models.

As presented in Chapters 5 and 6, I have adopted the approach that tries to mimic (at

least partially) the nervous system by using bio-inspired neural networks and auditory

representations, which I think can approximate some parts of the dynamics of the brain.

Once again, the reader should note that our understanding of the human brain is very

basic and what will be described in this thesis (or in any other similar work) is only a

“toy model” of what really happens in the brain.

1.3 Applications

One question the reader may ask is ‘What is the point of conducting research into this problem?’ The broadest motivation is intellectual curiosity, born of an increasing sense of awe as the full subtlety and sophistication of the auditory system is revealed. This answer may convince pure scientists but is surely not convincing enough for engineers. An engineering project is viable if it has industrial applications. In fact, a good sound separator can open the door to many interesting applications and tasks that are impossible to accomplish today. These applications are detailed more thoroughly in Chapter 2. One of the most interesting applications of sound source separation is in the hearing aid industry. There are 500 million hearing-impaired people in the world and 70 million North Americans with hearing disabilities (source: www.hear-it.org). Current hearing aids amplify all incoming sounds, which makes them useless in crowded places. A good sound separator with low computational complexity will surely help hearing-impaired people have a better life. Other applications of this technology, as detailed in Chapter 2, include multimedia sound file indexing, robot navigation, and speech and audio enhancement.


1.4 Ideas to be investigated

Beyond the general idea that this thesis is a useful collection of techniques for building

auditory models, there are in fact a couple of fairly strong and perhaps slightly unusual

positions behind this work.

The first idea is that some simple auditory representations, which we call Cochleotopic/AMtopic and Cochleotopic/Spectrotopic Maps (see Chapter 5) and which are based on very simple signal processing techniques, can tell us a lot about the structure and organization of sound in mixtures.

The second main idea is that, based on temporal correlation (see Chapters 3, 5, and 6), we can group regions of sound on the proposed frequency-time maps. Regions are grouped together when they belong to the same source. The grouping is performed by the bio-inspired neural networks proposed in this thesis.
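To make the idea of grouping by temporal correlation concrete, here is a minimal sketch in Python. It is not the oscillatory network of Chapter 5, where grouping emerges from the neuron dynamics. It only illustrates the read-out step under a strong simplifying assumption: each channel is summarized by a single oscillation phase, and channels whose phases are close on the circle are taken to belong to the same source. The function name `group_by_phase` and the tolerance `tol` are hypothetical and chosen for illustration.

```python
import numpy as np

def group_by_phase(phases, tol=0.3):
    """Group channel indices whose phases (radians) are synchronized.

    Illustrative assumption: two channels belong to the same source when
    their phases are linked by a chain of neighbours on the circle, each
    neighbour gap being at most `tol` radians.
    """
    phases = np.mod(np.asarray(phases, dtype=float), 2 * np.pi)
    n = phases.size
    if n == 0:
        return []
    order = np.argsort(phases)
    sorted_ph = phases[order]
    # Gap between each sorted phase and the next one, including the
    # wrap-around gap from the last phase back to the first.
    gaps = np.diff(np.append(sorted_ph, sorted_ph[0] + 2 * np.pi))
    breaks = np.where(gaps > tol)[0]
    if breaks.size == 0:
        # No gap exceeds the tolerance: all channels form one group.
        return [sorted(order.tolist())]
    groups, current = [], []
    start = (int(breaks[0]) + 1) % n  # begin right after a group boundary
    for k in range(n):
        i = (start + k) % n
        current.append(int(order[i]))
        if gaps[i] > tol:  # a large gap closes the current group
            groups.append(sorted(current))
            current = []
    return sorted(groups)

# Channels 0, 2, 4 oscillate near phase 0; channels 1, 3 near phase pi.
print(group_by_phase([0.05, 3.1, 0.1, 3.2, 6.25], tol=0.3))
```

With these phases, channel 4 (phase 6.25) joins channels 0 and 2 because phase is circular, so the sketch yields two groups of channels, one per putative source.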

The third contention is the analogy I have tried to draw between auditory and visual scenes. As pointed out in Chapter 2, Bregman’s pioneering work began with the adaptation of Gestalt principles of visual scene analysis to auditory scene analysis. In this work, I did roughly the opposite: I started by designing a system capable of sound source separation and then tried to ‘adapt’ the system to vision. These ideas are explained in Chapter 6.

1.5 Specific goals

A project in computational auditory scene analysis can go in many different directions.

In this work, the particular goals that were pursued, and to a greater or lesser extent

achieved are as follows:

• Computational auditory scene analysis. The broadest goal is to produce a computer system capable of processing real-world sound scenes of moderate complexity, independent of the structure of the sound sources or the way they have been mixed.

• Adequate sound representation and reconstruction. Adequate synthesis and resynthesis tools have been proposed to generate perceptually acceptable reproductions of the represented sounds.

• Assessment of scene-analysis systems. Adequate assessment metrics have been proposed and used to compare this work to other works.

• Computational visual scene analysis. Based on findings in audition, the architecture has been adapted to perform visual scene analysis on ‘toy objects’.
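As an illustration of the assessment goal above, the following Python sketch computes two of the criteria named in the abstract, SNR and log-spectral distortion (LSD). The formulas are the commonly used textbook definitions and are an assumption here; the exact definitions, frame sizes, and windows used in the thesis are given in Chapter 5 and may differ. The function names and the parameters `frame_len`, `hop`, and `eps` are hypothetical.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB, treating (estimate - reference) as the noise.

    Undefined (division by zero) when the estimate equals the reference.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def log_spectral_distortion(reference, estimate, frame_len=256, hop=128, eps=1e-10):
    """Mean over frames of the RMS difference between log power spectra (dB).

    Assumed definition: Hann-windowed frames, one-sided FFT, and a small
    `eps` added to both power spectra to avoid log(0).
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    n = min(len(reference), len(estimate))
    if n < frame_len:
        raise ValueError("signals are shorter than one analysis frame")
    window = np.hanning(frame_len)
    distances = []
    for start in range(0, n - frame_len + 1, hop):
        ref_frame = window * reference[start:start + frame_len]
        est_frame = window * estimate[start:start + frame_len]
        ref_power = np.abs(np.fft.rfft(ref_frame)) ** 2 + eps
        est_power = np.abs(np.fft.rfft(est_frame)) ** 2 + eps
        diff_db = 10.0 * np.log10(ref_power / est_power)
        distances.append(np.sqrt(np.mean(diff_db ** 2)))
    return float(np.mean(distances))

# Toy usage: the "separated" signal is the clean signal scaled by 1.1.
rng = np.random.default_rng(0)
clean = rng.standard_normal(1024)
print(snr_db(clean, 1.1 * clean))
print(log_spectral_distortion(clean, 1.1 * clean))
```

Lower LSD and higher SNR indicate that the estimate is closer to the reference. In this toy case the error is a pure gain change, so it mainly shows that the two criteria respond to the same degradation on different scales.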

1.6 Outline of this document

This dissertation has seven chapters. After this introduction, Chapter 2 presents an overview of the field of computational auditory scene analysis. Chapter 3 details the bases of neurocognition and, more specifically, temporal correlation. Chapter 4 deals with the mathematical modelling of ‘bio-inspired’ neurons. Chapter 5 explains the architecture of the system used to perform sound source separation. Chapter 6 describes the ‘Oscillatory Dynamic Link Matching’ proposed in this thesis to perform visual pattern matching. Finally, the conclusion in Chapter 7 summarizes the project and considers how well it has achieved its goals.


CHAPTER 2

COMPUTATIONAL AUDITORY SCENE

ANALYSIS (CASA)

2.1 Introduction

In everyday life we are confronted with situations in which a mixture of sound sources is present

in the environment, and we are able to extract one or more sources from among the others.

The acoustic mixture reaching the ears is processed to enable constituent sounds to be

heard and recognized as distinct entities. While the auditory system may not always

succeed in this goal, the range of situations in which recognition is possible in the presence

of competing sources (Figure 2.1) highlights the flexibility and robustness of

human speech perception. The background against which a conversation is carried out is

made up of acoustic intrusions which overlap in both frequency and time with the target

speech. Target and background may contain similar kinds of envelope modulations, and

can arrive from similar locations in space. The background may consist of utterances

whose fundamental frequency and formant contours occupy similar regions to those of the

target. Sometimes, the background will be characterized by high-intensity onsets which

completely mask the target conversation. There is strong evidence that even animals

are capable of sound separation. For instance, penguins use signal emissions to find

their mates and offspring amid the crowds of penguins huddled together for warmth in

the dark Antarctic winter [7] (see also [8] for auditory scene analysis in mustached bats).

On the other hand, computer systems are not robust in the presence of the “cocktail party” effect [9] (when a mixture of sounds is present in the environment, see Section 2.2), especially when the system has only one microphone (one sensor). Note that a human listener does not always need two ears to separate sounds (although results from two-ear separation may be slightly better). For instance, you can separate music from speech when you listen to a radio broadcast (there is no spatial cue in this case) or when you block one of your ears. Humans can use different cues to separate sounds: the pitch (or the harmonic structure of the sound), the onset and offset times (the times at which a sound begins or ends), and the spatial location of the sound. In addition, we can predict what comes next in a sentence and use this knowledge to enhance our recognition performance. For example, we need few cues to recognize our name in a very crowded environment, but it is impossible for us to recognize words that are uttered randomly (when we cannot predict them a priori). We can also use visual cues, such as lip-reading [10], to do segregation.

In order to make computers as robust as humans in the presence of background noise, two

different approaches can be used:

• Mathematical and Statistical Approaches: This approach tries to find a so-

lution to the “cocktail party” effect in the framework of standard signal processing

and information theory techniques.

• Computational Auditory Scene Analysis (CASA): In this framework, sound is processed by drawing an analogy between vision and audition (Table 2.1) using the Gestalt principles of common fate, similarity, continuity, etc. [11] (see Section 2.5). The separation cues are based on psychoacoustical and physiological evidence. Although the mathematical approach and the CASA techniques should in principle be equivalent, the lack of information about the surrounding environment and about the statistical behavior of the sound sources means that in many cases a rule-based approach (CASA) is much more powerful than mathematically based approaches [12].

In the remaining parts of this chapter, a detailed explanation of the cocktail party effect and of the experiments undertaken by Cherry is given first, along with some ASA history. The basics and psychoacoustical principles of ASA are then briefly discussed. The key concepts of implementing ASA in computer systems (CASA) are discussed and the limitations of top-down and bottom-up CASA are analyzed. In Chapter 3, the neurophysiological and cognitive aspects of the kinds of neural networks (spiking neural networks) that can be used to solve CASA will be laid down. Chapter 4 deals with the mathematical formulation of spiking neural networks. Chapter 5 discusses our results and findings about sound source separation (CASA) with our proposed neural network. Chapter 6 proposes another neural architecture suitable for visual and auditory pattern matching and recognition.

TABLE 2.1: Analogies between Vision and Audition

Vision: Marr [13] | Audition: Bregman [6]
Explicit naming: Compute properties of entities rather than parts | Primitive vs. schema-driven grouping
Least commitment: Never do anything that may later have to be undone | Fusion as the default state of perceptual organization
Graceful degradation: System should not be very sensitive to poor input quality | Exclusive allocation of parts to entities

2.2 Cocktail party effect and human audition

In 1953, Cherry [9], then an engineer at MIT, used for the first time ever the term “cocktail

party”. The name comes from the fact that humans are able to separate sound sources

in a “cocktail party” when other people speak simultaneously and there is music in the

background, etc. He conducted six different experiments, as follows (see footnote 1):

• The Basic “Mixed Message” Paradigm: In the first two series of experiments, Cherry investigated how we recognize what one person is saying when others are speaking simultaneously. Cherry described this situation as the ‘cocktail party problem’. Subjects were presented with two different spoken messages, recorded onto a single audiotape (i.e. ‘mixed’, in a tape editing sense) by the same speaker, and played back via headphones. Both messages were thus simultaneously and equally available to both ears, approximating real-life competitive conversation. Subjects were then instructed to repeat one of the messages word by word or phrase by phrase. Cherry’s observations were:

(a) Subjects reproduced at phrase level, rather than word level.

(b) There were extremely few transpositions of material from the to-be-rejected message (see footnote 2).

Subjects generally reported great difficulty with the task, but the task would have been eased considerably if they had been allowed to make written notes.

Footnote 1: Taken from http://www.smithsrisca.demon.co.uk/PSYcherry1953.html

• Predictability: In this series of experiments, Cherry arranged for the mixed ma-

terial to be full of cliches, that is to say, “highly probable phrases” such as “the

time has come to stop beating around the bush”. His observation was that output

tended to consist of whole cliches, and that recognition of just the first one or two

words of a stock phrase would typically prompt the entire phrase.

• The Basic “Unmixed Message” Paradigm: In the remaining sets of experi-

ments, subjects were presented with two different spoken messages, recorded onto

separate audiotapes (i.e., “unmixed” in a tape editing sense) by the same speaker,

and played back by headphones, one message to each earpiece. Unlike the mixed

message paradigm, each ear now only heard one message. Again, subjects were in-

structed to repeat one of the messages (always the right ear message) as accurately

as possible. Cherry’s general observations were:

(a) Subjects could switch between messages at will.

(b) They could repeat the selected message easily and accurately, but slightly delayed.

(c) Their speaking voice became monotonous with ‘little emotional content or stressing of the words’.

(d) They remained unaware of this.

(e) They ‘may have very little idea’ what the message was all about.

(f) They took in very little about the content of the rejected message.

Indeed, if the language of the unattended message was changed from English to German a few seconds into the trial, once shadowing of the target message had been successfully established, that change was not usually detected. This observation prompted further investigation of what sort of information, if any, was available from the rejected message.

Footnote 2: That is, few subjects put words from the competing sentence into the target utterance.

• Penetration of the rejected message. In the third series of experiments, Cherry looked at what information, if any, remained available to the listener from an otherwise unattended message. Cherry arranged for the unattended left-ear message to change from its normal form (male-spoken English) once the trial was under way. His observations were:

(a) A change from forward speech to backward speech (same sound profile, but

zero lexical or semantic content) was noticed as ‘something queer about it’ by some

subjects but not noticed at all by others.

(b) A change from male to female voice was ‘nearly always’ identified.

(c) A change to a 400 Hz tone was always noticed.

(d) Subjects could not say with certainty what language was being used.

• Same message, time delayed. In this series of experiments, Cherry wished to investigate the mechanisms by which the brain decides whether the messages arriving at the ears are from a single source. The point is that when two inputs are correlated, they need to be merged internally, despite naturally occurring ear-to-ear differences in intensity and arrival time, whereas when they are from different sources one of them needs to be rejected internally. He therefore presented an identical message to each ear, but with the left (to be rejected) delayed relative to the right (to be shadowed). This was achieved by running a single length of pre-recorded audiotape through two physically separated tape players. The second tape player was then gradually moved closer to the first, thus reducing the playback delay. Cherry’s observations were that ‘nearly all’ subjects eventually recognized words or phrases from the rejected message as matching those in the attended ear. Cherry remarks that this is actually quite surprising, given that when different messages are used nothing is perceived from the rejected ear. The delay at which such recognition took place varied considerably between subjects, but was typically 2-6 seconds.

• Same message, alternating ear. This series of experiments was prompted by the observation that it took a finite amount of time to switch attention from one ear to the other. Cherry recorded long samples of speech and switched them between his subjects’ ears either

(a) randomly, or

(b) periodically. When this switching was slow (say once a second), subjects could shadow with 100% accuracy. When it was fast (say 20-50 times a second), most subjects could shadow (see footnote 3) ‘the majority’ of the words, reporting that ‘they listened as though to both ears simultaneously’ (see footnote 4). However, as the switching rate decreased to around six or seven times a second, so too did accuracy. To investigate this critical speed in more detail, Cherry introduced short periods of silence into the message. When played to one ear this would mean hearing about 150 ms of message, followed by 10 ms of silence, followed by the next message block, followed by the next silence, and so on (a cycle of about 160 ms, equivalent to six or seven cycles per second). Accuracy in this condition was 95-100%. When each message block was switched to alternate ears, however, accuracy dropped to less than 20%. Cherry concluded that this particular switching rate coincided with the very short time interval required to transfer attention from one ear to the other, and that by the time attention had been switched it needed to be switched back again.

Footnote 3: By shadowing, Cherry meant that subjects were able to fuse the messages coming from different ears.

Footnote 4: In other words, subjects had the perception that the message had been applied simultaneously to both ears.

Figure 2.1 shows that the human auditory pathway can handle as many as eight simultaneous

sound sources. The figure shows the identification accuracy vs. the masker intensity

in dB SPL (Sound Pressure Level, in dB).

Figure 2.1: Human performance in the presence of multiple voices and mask. Up to eight

simultaneous voices can be distinguished by a human listener with high accuracy at low

masker intensities [14].
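For reference, the masker intensity in Figure 2.1 is expressed as a sound pressure level. The standard definition is recalled below as background; the reference pressure of 20 µPa is the usual convention for airborne sound and is not taken from [14].

```latex
L_p = 20 \log_{10}\!\left(\frac{p_{\mathrm{rms}}}{p_0}\right)\ \text{dB SPL},
\qquad p_0 = 20\ \mu\text{Pa}
```

Under this definition, doubling the RMS sound pressure raises the level by about 6 dB.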

2.3 History of CASA

Computational Auditory Scene Analysis (CASA) is the name for a field of research that

seeks to build computer models of the process of auditory organization, by which biological

listeners are able to understand dense sound mixtures as the superimposed result of many

independent sound-producing entities in the environment. CASA is in its early days, with

quite a number of different efforts, but no obvious winning strategies, and a large range

of perspectives on the problem. What follows lists some of the most

important contributions in the field.

• 1948: Jeffress model of interaural correlation for sound localization [15].

• 1951: Place mechanisms of auditory frequency analysis by Licklider [16].

• 1953: First usage of the term ‘Cocktail party’ by Cherry [9].

• 1976: Sound source separation using classical signal processing techniques by Par-

sons [17].

• 1982-83: Lyon’s auditory and binaural model [18].

• 1983: Scheffers’ harmonic-based double vowel separation [19].

• 1985: Voiced Speech separation by Weintraub [20].

• 1986: Temporal correlation based solution to the ’cocktail party’ problem by Von

der Malsburg and Schneider [21].

• 1987: Based on Amplitude Modulation, Berthommier proposed an F0-dependent

method of sound separation [22].

• 1988: Voice separation algorithms by Stubbs and Summerfield [23].

• 1990: Publication of the book: “Auditory Scene Analysis: The Perceptual Organi-

zation of Sound” by Bregman [6].

• 1991: Ph.D.s based on Bregman’s findings: Speech (Cooke [24]), Music (Mellinger

[25]).

• 1992: Evidence integration by Kashino and Tanaka [26].

• 1992: Auditory image for ASA: first usage of the term “CASA” (Brown’s Ph.D. [27]).

• 1992: Time-domain cancellation of harmonics proposed by de Cheveigné [28].

• 1994: First database for CASA: ShATR (University of Sheffield).

• 1995: Patterson’s auditory model [29].

• 1995: First CASA workshop (Montreal).

• 1996: Prediction-driven CASA (Ellis’s Ph.D. [30]).

• 1997: Second CASA Workshop (Nagoya).

• 1998: Publication of the book: “Computational Auditory Scene Analysis” by Okuno

and Rosenthal [31].

• 1999: Speech Communication’s special issue on Auditory Scene Analysis.

• 1999: Source separation by temporal correlation (Wang and Brown) [3].

• 2001: Probabilistic CASA of speech with missing and unreliable acoustic data by

Cooke et al. [32].

• 2003: Factorial HMM sound source separation by Roweis and Gomez et al. [33, 34].

• 2003: CASA based on pitch tracking by Hu et al. and Wu et al. [35, 36].

2.4 Applications of CASA

In what follows I enumerate some of the major applications of CASA in real-life problems.

• Speech processing. Statistical approaches to speech processing, such as the HMM (Hidden Markov Model), work well only in quiet environments. If there is “cocktail party” background noise, or if there are many speakers, these methods cannot be applied with success. Hence, a preprocessing technique like CASA should be used before the recognition phase.


• Hearing aids. Current hearing aids normally amplify all sounds without filtering them. Therefore, hearing-impaired persons are not able to understand a conversation in the presence of the ‘cocktail party’ effect. An intelligent filter based on CASA can help to alleviate this problem.

• Sound file indexing. One of the most challenging tasks is the indexing of sound files on the Internet. A sound file indexing system should be able to separate different sound sources and label each source with the appropriate tag (e.g., music, speech, the speaker ID, etc.). For more details see the MPEG-7 standard.

• Music industry. The recording of songs is a very expensive process. Suppose that

an unwanted door shutting noise corrupts the whole recording. Now imagine that

instead of going through the whole process of recording another time, you can use

a CASA technique to delete the unwanted noise. This can be a very interesting

application of CASA.

• Robot navigation. CASA can be used by a robot to find its way through a

crowded environment based on audio cues (in addition to visual anchors).

• Audio compression. In a futuristic view, one can imagine an audio codec that separates the sound sources in a given file, extracts the features of each source (pitch, timbre, duration, onset/offset times, etc.) and sends a text file to the receiver that contains all the mandatory information, so that the receiver can synthesize the sound file. In image processing terminology, the counterpart of this idea is known as “Semantic Image Compression” (or object-based compression) [37].

2.5 Bases of auditory scene analysis

Preliminary experiments conducted by Bregman [6] have shown a great degree of organization in audition. Bregman draws a distinction between an acoustic source (a single physical system giving rise to a particular pattern of sound waves) and an auditory stream, which denotes the abstract or conceptual effect it has in the mind of the listener. Listeners have to solve an auditory scene analysis (ASA) problem in order to extract one or more relevant auditory streams from the mixture of sources which contribute to their acoustic environment.

Sound sources may differ in all kinds of properties such as location, instantaneous fun-

damental frequency, or the patterns of energy envelope modulation in different frequency

bands. If it is possible to extract these potential cues with sufficient reliability, the au-

ditory system can group those parts of the mixture that have similar properties. This

affords listeners a basis for organizing into a coherent whole the sound fragments which

have common origin. This type of processing is often described as bottom-up or primitive.

In addition to primitive grouping processes, listeners can exploit prior familiarity with the patterns of spoken language or other sources. For speech, these regularities manifest themselves at a number of levels, from the sub-syllabic to the sentential. Such top-down processes have been termed schema-driven mechanisms by Bregman [6].

Early auditory signal processing involves at least two forms of decomposition. First, the

signal is subject to spectral decomposition into separate frequency bands by the cochlea.

Second it appears that different properties are extracted in distinct auditory maps [38, 39],

or distributions of specific signal features over an array of neural elements.

Bregman defined the processes of “auditory stream segregation” and “auditory stream integration”. The process whereby sound elements are separated into different auditory objects is known as “auditory stream segregation” and, conversely, the process whereby different sound elements are assigned to a single object is known as “auditory stream integration”. Auditory streaming is important in, for instance, assigning consecutive speech elements to the same speaker, or following a melodic line against a background of other musical sounds. In baroque music, stream segregation is often used to make one instrument play two melodic lines. If an instrument plays a rapid sequence of alternating low and high tones, the sequence will break into two melodic lines (one consisting of the low tones and the other consisting of the high tones) if the pitch difference between the low and high tones is large enough. In this example, the speed of the sequence and the frequencies are features that the auditory system uses to group sounds. These features are called cues by Bregman. Some of the most important cues used in ASA are shown in Table 2.3.
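The alternating-tone example above can be made concrete with a small stimulus generator. This is only a sketch using the Python standard library: the tone frequencies (400 Hz and 800 Hz), the tone duration (100 ms) and the sampling rate are hypothetical values chosen for illustration, not parameters taken from Bregman's experiments or from this thesis. It builds the low/high sequence and reports the two quantities discussed in the text, the presentation rate and the pitch separation.

```python
import math

def tone(freq_hz, dur_s, sr=8000):
    """Return one sine tone as a list of samples in [-1, 1]."""
    n = int(round(dur_s * sr))
    return [math.sin(2 * math.pi * freq_hz * i / sr) for i in range(n)]

def alternating_sequence(low_hz, high_hz, tone_dur_s, n_pairs, sr=8000):
    """Concatenate low/high tones: L H L H ... (n_pairs repetitions)."""
    samples = []
    for _ in range(n_pairs):
        samples += tone(low_hz, tone_dur_s, sr)
        samples += tone(high_hz, tone_dur_s, sr)
    return samples

def semitone_gap(low_hz, high_hz):
    """Pitch separation between the two tones, in semitones."""
    return 12 * math.log2(high_hz / low_hz)

# Hypothetical stimulus: fast alternation with a one-octave separation.
seq = alternating_sequence(low_hz=400.0, high_hz=800.0, tone_dur_s=0.1, n_pairs=5)
gap = semitone_gap(400.0, 800.0)   # pitch difference cue
rate = 1.0 / 0.1                   # speed cue, in tones per second
```

Increasing `rate` or `gap` in such a stimulus corresponds to strengthening the two cues that, according to the text, make the sequence split into two melodic lines.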

As mentioned earlier, Bregman’s theory is based on Gestalt psychology. Some of the basic rules for segregation and integration derived from Gestalt theory are stated below.

• Simplicity. Items will be organized into simple figures according to symmetry, regularity, and smoothness.

• Similarity. Objects that are more similar to one another tend to be grouped together. The similarity can be in terms of any psychological dimension: shape, size, color, or luminance (or motion) for visual scene analysis (Figure 2.2). In auditory scene analysis the psychological dimensions (cues) can be any of the cues defined in Table 2.3.

Figure 2.2: Similarity: objects that are similar tend to be grouped together. The similarity

criterion is color in this figure.

• Proximity. This rule states roughly that the closer the visual elements in a set are to one another, the more strongly we tend to group them perceptually. The closeness can be defined in terms of space or in terms of time (Figure 2.3).

Figure 2.3: Proximity: objects that are closer to one another tend to be grouped together. You see three different objects in this figure.

• Old-plus-new. This heuristic was not in the initial list of Gestalt principles but has

been added by Bregman. It states that a “new” organization appears in the residual

left after subtraction of “old” components, based on the assumption of continuity

(Figure 2.4).

[Figure 2.4 plot: frequency (kHz, 0 to 2) versus time (s, 0.0 to 1.2)]

Figure 2.4: Old plus new: a sequence of wide-band and narrow-band signals and its

perception in the human auditory pathway according to the old-plus-new heuristic: the

sound is perceived as the old part (0-1 kHz) plus the new part (1-2 kHz).

• Good continuation. Good continuation says that elements forming continuous

lines or curves are grouped (Figure 2.5).

• Closure. Objects that form closed units tend to be grouped together (Figure 2.6).


Figure 2.5: Good continuation: both figures contain one “T” oriented differently. Notice how much easier it is to find the T when it is not contained in the same “line” as all other elements. When the T is embedded in a line of elements, all of the elements in that line are grouped together, forming a large unit, and it is harder to see the “misoriented” T. In the other case, the T stands out on its own, which makes it easier to see.

• Common fate. Common fate states that those attributes (aspects) of perceptual

field that move or function in a similar manner will be perceived as a unit.

• Mutual exclusivity. The affirmative and the counter plan cannot be associated

to the same group at the same time (Figure 2.7). For an example in Speech see [40].

The different cues and heuristics on which a computer algorithm can rely to segregate

sound sources have been enumerated above. In the next three sections we will describe

how we can integrate these techniques in a computer algorithm by introducing data-driven

and schema-based CASA. Table 2.2 shows how Gestalt psychology is used in audition for

streaming and segregation.

2.6 Data-driven CASA

Figure 2.8 shows a unidirectional system, in which the information propagates only from inputs to outputs, with no feedback. Data-driven CASA includes techniques that use only bottom-up processing (there is no feedback from higher levels to lower levels), in contrast with prediction-driven CASA, which uses top-down processing. In Bregman’s terminology, bottom-up processing corresponds to primitive processing, and top-down means schema-based processing.

Figure 2.6: Closure: here you see two diamonds (each a closed unit), although the figure was actually drawn as an “M” on top and a “W” on the bottom.

Figure 2.7: Mutual Exclusivity: a) The contour belongs to the object F but not to the ellipsoidal form. b) The contour belongs to the ellipsoidal form but not to the object G. c) We can see either the face or the vase but not both of them at the same time because of the mutual exclusivity rule.

The auditory cues proposed by Bregman for simple tones are not applicable directly to

complex sounds. Therefore, one should develop more sophisticated cues based on different

auditory maps. For example, Ellis [30] uses sinusoidal tracks created by the interpolation

of the spectral peaks of the output of a cochlear filterbank. Mellinger’s model [25] uses


TABLE 2.2: Gestalt principles and their applications in Auditory Scene Analysis

Gestalt Principle | Stream Effect or Example
Proximity | Frequency, time or space proximity
Similarity | Harmonic relatedness (contiguity)
Connectedness | Pitch glides pass through noise
Good continuation | Gradual increase in loudness of approaching train
Common fate | Musical counterpoint, onset/offset-based grouping
Symmetry | Rising pitches tend to fall again
Closure | Masking, Co-modulation Masking Release (CMR)

[Figure 2.8 block diagram. Stage headings: Cue detectors, Representation, Algorithm, Output. Block labels: sound, cochlea model, peripheral channels, onset/offset, frequency transition, periodic modulation, common-period objects, object formation, grouping algorithm, mask, resynth]

Figure 2.8: Data-driven CASA (adapted from [30]). Note that there is no feedback in the system. See also Chapter 5 and [41, 42, 43] for an implementation of this block diagram. In Chapter 5, the cue detectors are replaced by the cochlear maps, and the object formation and the grouping are done via our proposed neural network.

partials (see Figure 2.9 for details on different approaches). A partial is formed if an

activity on the onset maps (the beginning of an energy burst) coincides with an energy

local minimum of the spectral maps. Using these assumptions Mellinger proposed a

CASA system in order to separate musical instruments. Cooke [24] has introduced

harmony strands, which are the counterpart of Mellinger’s cues in speech. The integration

and segregation of streams are done using Gestalt and Bregman’s heuristics. Berthommier

uses AM maps [22] (see also [38, 42]). Gaillard [44] uses a more conventional approach

by using the first zero crossing for the detection of pitch and harmonic structures in the

frequency-time map. Brown’s algorithm [27] is based on the mutual exclusivity Gestalt

TABLE 2.3: Grouping cues for ASA (adapted from [2])

Source property | Potential grouping cue | Illustrations | Notes
Starts and ends of events (common onset/offset) | Synchrony of transients across frequency regions | Effect of onset asynchrony on syllable identification and pitch perception | Offset generally weaker than onset
Temporal modulations: slow | Correlation among envelopes in different frequency channels | Comodulation masking release (CMR) | Common frequency modulation may lead to common amplitude modulation as energy shifts channels
Temporal modulations: fast, periodic (unresolved harmonics) | Channel envelopes with periodicity at F0 | Segregation of two-tone complex by AM phase difference | -
Temporal modulations: fast, periodic (resolved harmonics) | Harmonically-related peaks in the spectrum | Mistuning of resolved harmonics; effect on phonetic category | -
Temporal modulations: fast, periodic (resolved and unresolved harmonics) | Periodicity in fine structure | Perception of double vowels | Basis for autocorrelation models
Spatial location | Interaural time difference due to differing source-to-pinna path lengths | Vowel identification. Strongest effect if direction is previously cued | Evidence that suggests role of ITD is limited or absent
Spatial location | Interaural level difference due to head shadowing | Noise-band vowel identification | -
Spatial location | Monaural spectral cues due to pinna interactions | Localization in the sagittal plane | Has not been investigated for complex, dynamic signals such as speech
Event sequences | Across-time similarity of whole-event attributes such as pitch, timbre, etc. | Sequential grouping of tones; sequential cueing | -
Event sequences | Long-interval periodicity | Perception of rhythm | By-product of very-low-frequency 'spectral' analysis
Source specific | Conformance to learned patterns | Sine-wave speech | -


principle (Figure 2.7).

In the next section we will see how adding feedback from higher processing levels to lower

processing levels can boost segregation quality.

2.7 Top-down or schema-driven CASA

Each of the authors cited in Section 2.6 acknowledges that their system performs less

well than might be hoped (performance is worse than that of human listeners). The authors

argue that their proposed approach is based on only one of the multiple cues necessary for

correct sound segregation, and claim that integrating further cues into their system

is rather easy. On the other hand, even if the cues are well defined in psychoacoustics

(common onset, common location, harmonicity, etc.), their signal processing counterparts

are not precisely defined. For instance, both Mellinger and Brown implement onset de-

tector maps as rectified differentiators within each frequency channel, and both recognize

the importance of having a family of maps based on different time-constants to be able to

detect onsets at a variety of timescales. But there is no suggestion on how these different

maps must be merged to generate the ‘true’ onset cue. Mellinger uses information from

any map that indicates an onset, whereas Brown found that using only the very fastest

map was adequate. In [42], we use two different maps (CAM and CSM) depending on

the nature of the signal.
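The onset maps described above (rectified differentiators within a frequency channel, computed at several time constants) can be sketched as follows. This is an illustrative reading of the verbal description, not the implementation of Mellinger or Brown: the leaky-integrator smoothing, the three time constants and the synthetic envelope are assumptions made for the example. The last two lines loosely mirror the two merging strategies mentioned in the text, namely taking any map that signals an onset versus keeping only the fastest map.

```python
import math

def smooth(envelope, tau_s, sr):
    """First-order leaky integrator (low-pass) with time constant tau_s."""
    alpha = 1.0 - math.exp(-1.0 / (tau_s * sr))
    out, state = [], 0.0
    for x in envelope:
        state += alpha * (x - state)
        out.append(state)
    return out

def onset_map(envelope, tau_s, sr):
    """Half-wave rectified first difference of the smoothed envelope."""
    s = smooth(envelope, tau_s, sr)
    return [0.0] + [max(0.0, s[i] - s[i - 1]) for i in range(1, len(s))]

sr = 1000                                      # envelope sampling rate (Hz), assumed
env = [0.0] * 100 + [1.0] * 200 + [0.0] * 100  # one channel: burst from 100 ms to 300 ms
taus = (0.002, 0.010, 0.050)                   # hypothetical time constants (s)
maps = {tau: onset_map(env, tau, sr) for tau in taus}

any_map = [max(values) for values in zip(*maps.values())]  # "any map" strategy
fastest_only = maps[min(taus)]                             # "fastest map only" strategy
```

Because of the rectification, each map responds at the start of the burst and stays at zero at its end; shorter time constants give sharper, larger onset responses.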

In all the cases stated above, a feedback from higher levels to lower levels should select the

adequate representation based on the actual performance of the system and the nature of

the sound. This form of ‘top-down’ CASA is called schema-driven.

Now that we know the general frameworks (data-driven or schema-based) of CASA, we

will focus on the way these general frameworks can be implemented by using different

approaches borrowed from Artificial Intelligence. This will be done in the next section.


2.8 Different implementations of CASA

CASA can either be implemented based on expert systems, or it can be based on bio-

inspired neural networks or statistical approaches.

• Expert systems. In this approach, one tries to understand and extract all the

heuristic rules proposed by Gestalt scientists and Bregman and implements them

by defining rules (if-then cases). This approach has been used by Ellis [30], Brown

and Cooke [27], and others.

• Neural networks. This method consists of modelling the auditory pathway of humans and animals. Since Gestalt heuristics have been observed in humans, an organization of neurons that mimics the auditory neurons and cortex should by itself follow Gestalt psychology, and no explicit expert rule (if-then case) should need to be implemented in the system [46, 3, 41] (Figure 2.11). Unfortunately, the structure of the digital computer and its common programming languages are very far removed from the brain’s architecture; this gap (and its impact on models) might be reduced with a more brain-like (parallel, distributed) computational paradigm.

• Statistical learning. This paradigm is based on the fact that Gestalt heuristics
can be learned through statistical approaches. Therefore, one needs neither to study
Gestalt theory nor to implement it ‘biologically’ in the system: the rules are
implicitly acquired during the learning phase. For instance, in [47, 33, 34, 48],
Hidden Markov Models (HMM) are used to do schema-driven source separation. The
disadvantages of HMM-based source separation are its very long training time and
the constraint that the number of sources in the mixture must be known a priori.

In the previous sections we described ASA and introduced some of the most important
computer implementations of ASA. In the next section, we explain the limits of Auditory
Scene Analysis and point out that ASA as proposed by Bregman is only a part of the whole
auditory processing undertaken in the brain. To do so, we present two aspects of auditory
processing that may or may not be explained by Bregman’s theory, depending on the school
of thought one supports: pitch grouping and sine-wave speech.

2.9 ASA limitations

As said before, the Gestalt and Bregman rules, in their current form, are very simplistic
and can only be applied to simple sounds. Complex sounds have very complex behaviors
(Figure 2.12). Therefore, some signal processing front-end should map the complex sound
into its constituent simple objects, which can then be analyzed by Bregman’s rules. This
mapping has not been completely derived so far, and all the efforts in CASA are directed
toward it. The next two subsections describe some of the controversies about ASA and
Bregman’s rules: sine-wave speech and pitch grouping.

2.9.1 Sinusoidal speech

Some scientists claim, on the basis of psychological observations, that Bregman’s grouping
rules are wrong. For instance, Remez et al., in their famous work on sine-wave speech [49],
propose to represent a speech signal only by sinusoidal trajectories that track the first
three formants. In 1994, the group conducted other experiments based on their sinusoidal
speech representation to find a counter-example to Bregman’s findings [50] (Figure 2.13).
If speech is perceived by using Gestalt heuristics, then grouping and segregation should be
done not only for sound mixtures but also for the different entities present in a single
source (this behavior is observed many times in Bregman’s experiments). For instance, the
different formants of a speech signal should be grouped based on some similarity criterion
to give birth to a whole, which is speech. The only thing the formants have in common is
the comodulation frequency. Therefore, one could argue that if this comodulation of
formants is suppressed, then the auditory system will not perceive the formants as a
whole. That is exactly what is done in the experiments of Remez et al. In fact,


in the sinusoidal speech case, there is no modulation; therefore, no grouping should have
been observed according to Bregman’s findings. But all subjects reported the
three-sinusoid sound as a whole and unique entity. Based on these observations, Remez et
al. concluded that the organization of sound in the brain is not governed by Gestalt
psychology. In 1999, Cooke and Barker [51] performed other experiments to support
Bregman’s theory. They took the same sine-wave speech used by Remez et al. and modulated
it with a sawtooth signal whose frequency was equal to the pitch of the speech. They
reported that this modulation improved the recognition score of subjects. They concluded
that since the modulation cue helped the auditory system improve the grouping process,
Bregman’s theory holds.

2.9.2 Limitations of pitch-based grouping

The examples described below show that some well-known ASA-based techniques, such as
pitch-based grouping, are incomplete in some special cases [52].

• Example 1: Overtone singing. Overtone singing is a vocal technique found in Central
Asian cultures, by which one singer produces two pitches simultaneously. When
listening to the performance, a high pitch of nF0 can be perceived along with a low
drone pitch of F0, because the formant centered at nF0 has an extraordinarily small
bandwidth. Using a pitch model based on autocorrelation analysis to determine the
pitch strength of nF0, one can find that the peak height increases as the formant
bandwidth decreases. Autocorrelation functions of normal voices show peaks
corresponding to formants, but their heights are not comparable to the peak at 1F0.
When listening to overtone singing, the auditory system extracts ‘too many’ pitches
for grouping.

• Example 2: Natural periodic sounds with the predominance of upper odd harmonics.
A complex tone composed of three harmonics at 7F0, 9F0, and 11F0 could elicit three
pitches: a prominent pitch of F0 and two weak pitches of 9F0/4 and 9F0/5. Natural
periodic sounds with the predominance of upper odd harmonics can be produced by a
quasi-sinusoidally driven Duffing oscillator [53]. When listening to such sounds, the
auditory system extracts ‘too many’ pitches for grouping.

• Example 3: Natural periodic sounds with the predominance of lower even-numbered
components. The sound of an oscillator that has undergone a period doubling can
have weak odd-numbered components at lower frequencies. The pitch f0, which is
extracted on the basis of the lower even-numbered components (the harmonics), is
too high for grouping all components. The pitch sensation of f0/2 can accomplish
this task, but the auditory system fails to perceive this pitch when the lower
odd-numbered components (the subharmonics) are weak and masked by adjacent
harmonics.
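The pitch candidates quoted in Example 2 can be checked with a small autocorrelation sketch. It is only an illustration: the values of `F0` and `FS` and the use of a circular autocorrelation over one period are assumptions, and a real pitch model would work on cochlear channel outputs rather than on the raw waveform.

```python
import math

F0 = 100.0             # fundamental in Hz (illustrative value)
FS = 36000             # sampling rate chosen so that one period is 360 samples
PERIOD = int(FS / F0)  # samples per period of F0

# One period of a complex tone made of harmonics 7, 9 and 11 of F0.
tone = [sum(math.cos(2 * math.pi * h * F0 * n / FS) for h in (7, 9, 11))
        for n in range(PERIOD)]

def acf(lag):
    """Normalized circular autocorrelation of the periodic tone at a lag in samples."""
    num = sum(tone[n] * tone[(n + lag) % PERIOD] for n in range(PERIOD))
    den = sum(x * x for x in tone)
    return num / den

lag_f0 = PERIOD              # lag 1/F0      -> pitch F0
lag_a = 4 * PERIOD // 9      # lag 4/(9 F0)  -> pitch 9F0/4
lag_b = 5 * PERIOD // 9      # lag 5/(9 F0)  -> pitch 9F0/5
```

The autocorrelation equals 1 at the lag of F0 and about 0.84 at the two other lags. This agrees with one prominent pitch and two weaker ones.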

The above-mentioned findings and counter-examples, along with other experiments, show
that the cognition and organization of sound in the brain is an open issue and is not
completely understood.

The next section compares psychology-based approaches (like CASA) with more mathematical
and statistical approaches like Blind Sound Source Separation.

2.10 Comparison of CASA with other source separation techniques

CASA is not the only technique that can be used to separate sound sources in a mixture.
Non-bio-inspired techniques such as Blind Source Separation (BSS) can also be used. In
the next subsection, these techniques are compared to CASA.

2.10.1 Blind Sound Source Separation

BSS techniques use the statistical properties of signals to segregate sound sources without
taking into account any biological or psychological aspects. In fact, these techniques can


be used for any other type of signal (EEG [54], ECG, etc.). BSS is subject to some
constraints on the statistical behavior of the signals [55]. It is based on finding the
inverse of the mixing matrix by exploiting the statistical independence of the underlying
signals. Statistical independence implies that the cross-cumulants of the signals are zero
at every order. Principal Component Analysis (PCA) cancels only the second-order
cross-statistics of the signals, that is, it decorrelates them. Another technique based on
second-order statistics is SOBI (Second Order Blind Identification). Techniques that use
higher-order statistics are called ICA (Independent Component Analysis) [56]. For instance,
Comon’s ICA technique minimizes a cost function built from 4th-order cumulants [57] after
signal whitening (see footnote 5), as stated in Equation 2.1.

$$c_{\mathrm{ICA}}[y] = \sum_{ijkl \neq iiii} \left|\mathrm{cum}(y_i, y_j, y_k, y_l)\right|^2 \qquad (2.1)$$

On the other hand, the JADE (Joint Approximate Diagonalization of Eigen-matrices)

algorithm minimizes the cost function in Equation 2.2:

$$c_{\mathrm{JADE}}[y] = \sum_{ijkl \neq ijkl} \left|\mathrm{cum}(y_i, y_j, y_k, y_l)\right|^2 \qquad (2.2)$$

where $y_i$ is a signal sample at time $i$, and cum is the cumulant. The difference between
the procedure proposed by Comon and JADE is that for Comon the summation is done over all
indices for which i, j, k, l are not all equal, while for JADE it is done over all indices
for which i, j, k, l are not all four different. The Comon and JADE methods have similar
performances, but JADE is much faster (note that for Comon the summation is done over
$N^4 - N$ terms, $N$ being the signal length, while for JADE it is done over
$N^4 - N(N-1)(N-2)(N-3)$ terms).
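The two term counts quoted above can be verified by brute force for a small index range. The sketch below only counts index tuples and does not compute any cumulant. The function name `count_terms` and the value of `n` are illustrative assumptions.

```python
from itertools import product

def count_terms(n):
    """Count index tuples (i, j, k, l) kept by each summation rule, by brute force."""
    not_all_equal = 0       # rule described for Equation 2.1
    not_all_different = 0   # rule described for Equation 2.2
    for idx in product(range(n), repeat=4):
        distinct = len(set(idx))
        if distinct > 1:
            not_all_equal += 1
        if distinct < 4:
            not_all_different += 1
    return not_all_equal, not_all_different

n = 6
comon_terms, jade_terms = count_terms(n)
```

For `n = 6` the first rule keeps 1290 tuples and the second keeps 936. These match the closed-form expressions given in the text.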

There are three important factors in BSS: the order of the statistics used (covariance,
cumulants, etc.), the mixture model (linear, convolutive, etc.), and the optimization
method (batch or iterative).

Footnote 5: “Whitening” refers to the process that transforms a signal vector so that its
covariance matrix is the identity matrix.
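As a worked illustration of whitening, one common construction uses the eigendecomposition of the covariance matrix. The thesis does not state which whitening method is used, so this particular choice, the zero-mean assumption on $\mathbf{x}$, and the symbols $\mathbf{E}$, $\boldsymbol{\Lambda}$ and $\mathbf{z}$ are assumptions made for the example.

```latex
\begin{aligned}
\mathbf{C}_x &= E\{\mathbf{x}\mathbf{x}^T\} = \mathbf{E}\,\boldsymbol{\Lambda}\,\mathbf{E}^T, \\
\mathbf{z} &= \boldsymbol{\Lambda}^{-1/2}\,\mathbf{E}^T\,\mathbf{x}, \\
E\{\mathbf{z}\mathbf{z}^T\} &= \boldsymbol{\Lambda}^{-1/2}\,\mathbf{E}^T\,\mathbf{C}_x\,\mathbf{E}\,\boldsymbol{\Lambda}^{-1/2} = \mathbf{I}.
\end{aligned}
```

Here $\mathbf{E}$ holds the orthonormal eigenvectors of $\mathbf{C}_x$ and $\boldsymbol{\Lambda}$ is the diagonal matrix of its eigenvalues, which are assumed to be strictly positive.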


BSS can be under-determined (fewer microphones than sources) [58, 59, 60, 61] or
over-determined (at least as many microphones as sources) [55]. The over-determined case
works much better than the under-determined case.

The disadvantage of the BSS is that one should know a priori how many sound sources

are present in the mixture [12].

BSS is a very powerful technique if all the required conditions (statistical independence,
etc.) are met. Unfortunately, not all sound mixtures satisfy the conditions imposed by BSS.

Van der Kouwe et al. [12] have compared CASA-based techniques to BSS techniques using
the Cooke database [62]. They found that, for wideband signals, the improvement is greater
with BSS than with the CASA-based oscillatory network [3]. On the other hand, Wang’s
network [3] performs better when the noise is narrowband. They reported a greater
robustness for Wang’s network. They also pointed out that the SNR is not a criterion of
intelligibility and that more work should be done to define evaluation criteria for sound
separation techniques (see also Chapter 5 of this thesis).

2.11 Conclusion

In this chapter, we have introduced some fundamental concepts of Computational Auditory
Scene Analysis (CASA). It has been shown, based on the literature, that although
segregation rules for simple sounds seem to be known (at least partially), there is no
consensus or general framework for complex sounds. It has also been pointed out that other
separation techniques like Blind Source Separation (BSS) do not perform well either when
their underlying assumptions are not met. Therefore, the sound source separation problem,
particularly in the one-microphone case, is an open issue and is not solved in the general
case. In the next chapter, the parallel between Auditory Scene Analysis (ASA) and
neurophysiology is drawn. We demonstrate how cognitive observations let us define a unified
framework between scene analysis (either visual or auditory) and the neural pathway.


[Figure 2.9, panels (a) to (d): time-frequency plots with frequency (Hz) on the vertical
axis and time (ms or s) on the horizontal axis. Panel file labels: brn1h.aif, brn1h.fi.aif,
clacan.rs.aif, clacan.g1.ps-mini.]

Figure 2.9: Description of the main ideas behind different CASA approaches: a) Cooke’s
synchrony strands extracted for a voiced-speech utterance with the corresponding harmonic
sieves used for auditory streaming. b) The spectrogram of the McAdams-Reynolds
oboe-soprano sound, along with one of the sources extracted by Mellinger’s system [25].
Note that up until 100 ms the system fuses all the harmonics, but then it segregates the
even harmonics on the basis of their common modulation. This is in contradiction with the
least commitment heuristic (see Table 2.1). c) The spectrogram of voice mixtures used in
[27] before and after processing to extract one voice; the effect of the time-frequency
masking is clearly visible as the extensive ‘white’ regions where interference has been
removed. d) Sinusoidal tracks used to model a mixture of a harmonic sound (a clarinet) and
a transient (a dropped tin can) in [30]. The lower panel highlights the tracks
corresponding to the clarinet phrase, grouped on the basis of harmonicity (adapted from
[30]).


[Figure 2.10 block diagram. Labels: sound; front-end analysis; cue detectors; core sound
elements; engine; predict & combine; predicted features; observed features; prediction
error; reconciliation; higher-level abstractions; actions; world model; blackboard;
resynthesis; separated source.]

Figure 2.10: A top-down blackboard system. Blackboard systems address the issue of

developing and choosing hypotheses (“interpretations”) at different levels of abstraction

for a signal, using some rules or models [45].


Figure 2.11: Wang and Brown’s oscillatory CASA system [3]. The input sound is processed
by a cochlear filterbank followed by a hair-cell model. The correlogram is then

computed and the pitch is extracted. Sound frames are applied to the first layer. Lateral

connections on the second layer are established according to the pitch calculated from the

correlograms. Sound is separated by using a mask and is resynthesized.

[Figure 2.12, two panels: “Bregman alternating tone signal” and “City street ambience”.
Axes: frequency f/Hz (200 to 4000) versus time/s; level scale in dB (30 to 60).]

Figure 2.12: The cochleogram for simple tones and street noise (adapted from [30])


[Figure 2.13, four panels: Natural speech; Sine-wave speech; Modulated SWS (100 Hz);
Modulated SWS (200 Hz). Axes: frequency (Hz), 75 to 5000, versus time (1.75 s).]

Figure 2.13: Spectrogram for natural speech, synthesized sine-wave speech and modulated

sine-wave speech (from [51]). The utterance is still audible when speech is replaced by

sinusoidal trajectories that track the first three formants.

CHAPTER 3

NEUROCOGNITION

3.1 Introduction

This chapter deals with the cognitive aspects of neural networks. We first see how the
Gestalt psychology described in Chapter 2 can have its roots in the auditory and visual
neural pathways. We then examine whether the Gestalt rules introduced in Chapter 2, or
more generally the classification problems encountered in real life, can be implemented
by using conventional neural networks. We finally explain how newer techniques and
approaches, such as temporal correlation and attentional models, can help us find a better
solution to the aforementioned problems.

3.2 Gestalt Psychology and Neurophysiology

As stated earlier in Chapter 2, the Gestalt corollary states that visual (or auditory)
perception is not the linear sum of its constituent parts. In fact, the parts of an object
are integrated via the Gestalt principles to create a homogeneous entity. The principal
heuristics used by Gestalt psychology are (see Chapter 2 for details): proximity,
similarity, closure, good continuation, common fate, etc. [63, 6]. The visual and auditory
cortices have long-range synaptic connections at birth (homogeneity is a global aspect of
an image), but the learning that enables the brain to apply Gestalt heuristics takes place
after birth [63]. Observations reveal that most of the Gestalt principles are implemented
by horizontal synaptic connections in V1 (the primary visual cortex) [64]. Even if the
‘good continuation’ and ‘good form’ heuristics used to perceive an object as a whole are
easily applied by adults, an infant (less than one year old) cannot apply these principles
to perceive an object as a homogeneous entity [63]. An important question to answer is
whether it is


possible to use conventional neural networks, such as those found in engineering
textbooks (perceptron, etc.), to implement the psychological rules we went through in
Chapter 2. In the next section, some of the disadvantages of using conventional neural
networks are presented.

3.3 Conventional Neural Networks

Conventional (classical) neural networks were developed as models of brain function. In

developing these models, several questions needed to be addressed:

1. How are brain states to be interpreted as representations of actual situations? In

other words, how is neural activity interpreted as a neural code, or, in computer

parlance, as a data structure?

2. What is the nature of the mechanisms by which brain states are organized?

3. In what format is information laid down permanently in the brain?

4. How is memory laid down? In other words, what are the mechanisms of learning?

Answers to the preceding questions are given by the conventional neural network paradigm
attributed to Hebb (see [65] for a more detailed discussion of these answers).

1. The neural code: neurons are taken as concrete symbols, as semantic atoms. They

can be interpreted in relation to patterns and events external to the organism. For

instance, the symbolic meaning of a neuron can be ‘up/down’ or ‘black/white’ etc.

Neurophysiology has provided solid experimental basis for this statement, although

some extrapolation is needed to extend it to all neurons in the brain. A neuron has

only one degree of freedom at a given interval of time [t, t + ∆t]: it is either on or

off. Thus the brain state is described by a vector of on/off states. In order to know

what the brain is about at any instant of time, it is only necessary to know this


vector, along with a description of the symbolic meaning of all neurons. It must be
stated that the state vector is not constant over the interval of time [t, t + ∆t], no
matter how small ∆t is chosen.

2. The mechanism of organization of brain states is based on the fluxes of excitation

and inhibition, a neuron collecting incoming signals and firing when a threshold is

crossed. The dynamics of the system is regularized so that the activity is stable

within an interval of time [t, t + ∆t]. In associative memory models, this is, for

instance, achieved by requiring connections between any pair of neurons to be symmetric,
with the consequence that the system displays attractor dynamics. Without this
restriction, a McCulloch and Pitts system would be a general digital machine
(Turing machine) without any inherent tendency to organize.

3. Long-term memory is stored in terms of synaptic weights.

4. Long-term memory is laid down by mechanisms of synaptic plasticity, based on the

statistics of neural signals, especially their temporal correlations.
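The four answers above can be illustrated with a minimal associative memory. This is a sketch under stated assumptions: on/off states are coded as +1/-1, the weights follow a simple Hebbian rule, and the function names `train_hebbian` and `recall` are hypothetical. It shows threshold units, symmetric connections, memory held in the synaptic weights, and the resulting attractor dynamics.

```python
def train_hebbian(patterns):
    """Symmetric Hebbian weights with zero diagonal (w[i][j] == w[j][i])."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, sweeps=5):
    """Asynchronous threshold updates: a unit turns on (+1) when its summed input is >= 0."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            total = sum(w[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if total >= 0 else -1
    return state

stored = [1, -1, 1, -1, 1, 1, -1, -1]
weights = train_hebbian([stored])
noisy = list(stored)
noisy[0], noisy[3] = -noisy[0], -noisy[3]   # corrupt two units
recovered = recall(weights, noisy)
```

Starting from the corrupted state, the network falls back into the stored pattern, which acts as an attractor.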

As argued in the next subsection, this classical point of view about neurons and their
organization in the brain cannot answer all our questions about the functioning of the
brain.

3.3.1 The binding problem

The binding problem was first addressed by [66] and [21]. Milner and von der Malsburg argued

that the classical code of neural networks is very poor, too narrow in its possibilities to

serve as a basis for an expansion of the functional range of current brain models. The

underlying weakness is best illustrated by a classical example due to Frank Rosenblatt.

Rosenblatt proposed the following experiment to show the weakness of ‘conventional neu-

ral networks’. Suppose that you have designed a neural classifier that classifies objects in

a visual scene (Figure 3.1). The network is capable of telling us which geometrical form


(circle, triangle, etc.) is applied to the network and where the object is located
(up, down, right, left). In pattern recognition parlance these features are respectively
called the ‘what’ and ‘where’ information. The neural network has been designed in a
very simple way. In the output layer, one neuron is associated with each geometrical form
(one with the circle, another with the triangle, etc.). Another set of neurons encodes the
location: one neuron is associated with the ‘down’ position, another with the ‘up’
position, and so on. Now suppose that a triangle is present in the upper part of the image
applied to the network. The result of this experiment will be an activation of the neuron
associated with the triangle and of the neuron associated with the ‘up’ geometrical

position. There is no ambiguity in the result of this experiment and everything seems

correct. Now suppose that a triangle is applied at the top and a rectangle at the bottom

at the same time. In this case four neurons will be activated at the same time: up, down,

triangle, and rectangle. There is now an ambiguity (see also Figure 3.2). How can we

bind the information we have got? What is the correct combination: [(down, triangle),
(up, rectangle)] or [(down, rectangle), (up, triangle)]? This problem is referred to as the

binding problem. This is a fundamental problem with the classical neural network code:

it has no flexible means of constructing higher-level symbols by combining more elemen-

tary symbols. The difficulty is that, as seen in Rosenblatt’s experiment, coactivating the

elementary symbols leads to binding ambiguity when more than one composite symbol

is to be expressed. Figures 3.3, 3.4, 3.5, 3.6, and 3.7 show more practical situations of

the binding problem for vision. In each figure, the aim of the experimenters has been to
show that the human visual pathway uses some kind of binding to solve some difficult visual

problems. Experiments for the auditory pathway can also be found in the literature [6].

As we will see in chapter 5, this binding may also happen in the auditory scene analysis

problem, in which geometrical forms are replaced by speech-relevant features.
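Rosenblatt's thought experiment can be reproduced in a few lines. The sketch below is an illustration only: the function name `output_neurons` and the representation of a scene as a list of (shape, position) pairs are assumptions. It shows that two different two-object scenes activate exactly the same output neurons.

```python
def output_neurons(scene):
    """Classical code: one neuron per shape and one per position, with no link between them.

    scene: list of (shape, position) pairs present in the image.
    Returns the set of active output neurons.
    """
    active = set()
    for shape, position in scene:
        active.add(shape)
        active.add(position)
    return active

scene_a = [("triangle", "up"), ("rectangle", "down")]
scene_b = [("rectangle", "up"), ("triangle", "down")]

code_a = output_neurons(scene_a)
code_b = output_neurons(scene_b)
ambiguous = code_a == code_b   # both scenes activate the same four neurons
```

With a single object in the scene the output is unambiguous. The ambiguity appears only when several composite symbols must be expressed at the same time.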

In the following subsection we will see why conventional neural networks can’t be used to

solve binding problems like the one stated above.


Figure 3.1: Rosenblatt experiment: the static network identifies objects correctly when

they are applied separately. The triangle and rectangle and their respective position are

recognized correctly when applied separately. The ambiguity in recognition arises when

the two objects are applied simultaneously (adapted from [65]).

3.3.2 Are classical neural networks universal?

There is a widespread opinion that classical neural networks are a universal medium
with no limits to their abilities and that consequently they are not subject to the binding
problem [69]. This claim can be discussed from two different points of view. The questions
to be answered are: Does universality suffice as a solution to the brain’s problems? Are
classical neural networks universal media? The idea behind the universality of classical
neural networks is the Turing machine and the fact that there is no effective procedure that
cannot be realized as a program (algorithm), given enough storage space and time. From
this, it was extrapolated that mental processes, if only made concrete in terms of rules,
could be realized in machines. Under this view, the brain is a digital Turing machine


Figure 3.2: Catastrophe scenario: if two sets of active neurons (left and middle panel) are

simultaneously activated (right panel), information on their membership in the original

set is automatically lost (adapted from [65]).

[Figure 3.3 labels: Red, Green]

Figure 3.3: The Illusory Conjunction experiment as described by Anna Treisman: what is
the color of the vertical bars? Subjects bind the color information to the direction
(vertical or horizontal) information, so that they are unable to detect the only vertical
red bar in the scene at first glance [67].

and can perform any given task if an adequate number of neurons is available. McCulloch and
Pitts applied this idea to the modelling of the nervous system, proving that any logical
function can be implemented by perceptrons. But universality does not mean plausibility.
Is it realistic to say that the universality of classical neural networks can solve any
problem at hand even if it takes billions of neurons and centuries? Should we not look for
a simpler solution that can solve the problem with a couple of neurons and within a couple
of seconds or even milliseconds? And what does universality buy? How can you extract the
rules from which you will design your Turing-machine-like neural network? Over time,
the field of artificial intelligence discovered that it is not an easy task at all to write a



Figure 3.4: Binding example no. 1 (adapted from [68]), visual experiment: Different

objects (three arrows) are presented to a human. Some objects mask others. Contours of

visual receptive fields x and y belong to the same object but receptive fields z and y do

not belong to the same object even if they are collinear and have the same color (texture).

program that emulates the capabilities of the brain. It is becoming clear that the only
goal we can hope for is to establish a system that constitutes a basis for
self-organization and learning, as the equivalent of a newborn who learns from the
environment. Brain theorists realized this fact in the late ’50s and modified the
McCulloch and Pitts network to accommodate self-organization and learning. However, these
changes may have come

at a price: it is not clear whether neural networks are universal in any sense, although the

scientific community seems to have inherited the implicit belief that they are and that

any brain function can be modelled on the basis of those few abstractions from the real

nervous system that went into the formulation of neural networks.
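The claim that threshold units can realize logical functions can be made concrete with a small sketch. The weights and thresholds below are one possible choice, and the helper name `mp_unit` is hypothetical. The example also shows that XOR needs more than one layer of such units.

```python
def mp_unit(inputs, weights, threshold):
    """McCulloch-Pitts style unit: outputs 1 when the weighted sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def AND(a, b):
    return mp_unit((a, b), (1, 1), 2)

def OR(a, b):
    return mp_unit((a, b), (1, 1), 1)

def NOT(a):
    return mp_unit((a,), (-1,), 0)

def XOR(a, b):
    # Not realizable by a single threshold unit; two layers are needed.
    return AND(OR(a, b), NOT(AND(a, b)))

truth_table = {(a, b): XOR(a, b) for a in (0, 1) for b in (0, 1)}
```

Every unit here was wired by hand for a known rule. This is the design effort the text argues does not scale to brain-like tasks.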

3.4 Solutions to the binding problem

In this section some of the solutions to the ‘binding problem’ described in the previous

section are enumerated and explained.



Figure 3.5: Binding example no. 2 (adapted from [68]), visual experiment: two moving

bars are analyzed by a human subject. The cross (the combination of the two bars)

moves in the direction of the black arrow, but the visual receptive fields x and y cannot

detect the displacement in the direction of the black arrow. For the receptive field x, the

horizontal bar moves in the vertical direction, while for the receptive field y, the vertical

bar moves in the horizontal direction. The exact direction of motion cannot be detected

without binding.

3.4.1 Hierarchical coding

This approach is not really a ‘solution’ to the binding problem, but a means to circumvent
it. This technique is based on the belief that classical neural networks are universal and
that any brain problem at hand can be solved by Turing-machine-like neural networks.
For example, in Rosenblatt’s experiment, one simple and trivial solution is to put four
neurons, one for each possible combination, in the output of the network: (up, rectangle),
(down, rectangle), (up, triangle), (down, triangle). This type of coding is the hierarchical
coding of the two classes up/down and triangle/rectangle. But what happens if each of the
two classes has 10,000 possible values instead of two? In this case, 10,000 × 10,000 =
100,000,000 neurons would be needed in the output layer, which is a very unrealistic
solution to our problem.
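The four-neuron solution and its cost can be sketched as follows. The function names `conjunction_code` and `output_layer_size` are hypothetical. The sketch assumes one output unit per (position, shape) pair and two features with the same number of values each.

```python
from itertools import product

def conjunction_code(scene, shapes, positions):
    """One output neuron per (position, shape) combination."""
    neurons = list(product(positions, shapes))
    active = {(pos, shape) for shape, pos in scene}
    return neurons, active

shapes = ["triangle", "rectangle"]
positions = ["up", "down"]
neurons, active_a = conjunction_code(
    [("triangle", "up"), ("rectangle", "down")], shapes, positions)
_, active_b = conjunction_code(
    [("rectangle", "up"), ("triangle", "down")], shapes, positions)

def output_layer_size(values_per_feature, n_features=2):
    """Neurons needed when every combination of feature values gets its own unit."""
    return values_per_feature ** n_features
```

The two scenes that were ambiguous under the classical code now activate different units, but the size of the output layer grows as the product of the numbers of feature values.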

Riesenhuber and Poggio [69] have used the hierarchical coding scheme to perform visual

scene analysis (Figure 3.8). In their model the two types of operations, selection and
template matching, are combined in a hierarchical fashion to build up complex, invariant



Figure 3.6: Binding example no. 3 (adapted from [68]), visual experiment: even if x and

y have the same intensity, they belong to the same object. Without binding, this fact

wouldn’t have been trivial.

feature detectors from small, localized, simple cell-like receptive fields in the bottom layer.

In particular, patterns on the model “retina” are first filtered through a layer (S1) of
simple cell-like receptive fields (first derivative of Gaussians, zero-sum,
square-normalized to 1, oriented at 0°, 45°, 90°, 135°). Cells in the next layer (C1) each
pool S1 cells of the same orientation over a range of scales and positions. Filters were
grouped in four bands each spanning roughly. Different C1 cells are then combined in higher
layers. Each S2 cell receives input from 4 neighboring C1 units of arbitrary orientation,
yielding a total of $4^4 = 256$ different S2 cell types. S2 transfer functions are
Gaussian. C2 cells then pool inputs from all S2 cells of the same type, producing invariant
feature detectors tuned to complex shapes.
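Two points of this description can be illustrated briefly: the count of S2 cell types, and the way pooling over positions yields invariance. The sketch assumes a maximum as the pooling (selection) operation and uses toy one-dimensional responses. The function name `pool_max` and the response values are illustrative, not taken from [69].

```python
from itertools import product

ORIENTATIONS = (0, 45, 90, 135)   # degrees, as listed in the text

# Each S2 cell combines 4 neighbouring C1 units, each of arbitrary orientation.
s2_types = list(product(ORIENTATIONS, repeat=4))

def pool_max(responses, width):
    """C-style pooling sketch: maximum over non-overlapping windows of positions."""
    return [max(responses[i:i + width]) for i in range(0, len(responses), width)]

# The same feature at two nearby positions gives the same pooled output.
s1_at_pos1 = [0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
s1_at_pos2 = [0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0]
c1_a = pool_max(s1_at_pos1, 4)
c1_b = pool_max(s1_at_pos2, 4)
```

The enumeration gives 256 S2 types, and the pooled responses are identical for the two shifted inputs.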

Another problem with the hierarchical approach is the lack of autonomy or
self-organization in the network, as in all the other classical neural networks described
earlier. It means that for a hierarchical network the design process is as follows: “Give
me the brain problem you want to solve, I will give you the adequate architecture”. This is
in contradiction with the self-organization paradigm, which states: “Give me your problem
so that the network adapts itself to the problem it has to solve”.



Figure 3.7: Binding example no. 4 (adapted from [68]), visual experiment: the receptive

field z is associated to the gray object even if the orientation of this field is similar to the

orientation of the black object x. Binding lets us explain this phenomenon.

3.4.2 Attentional models

Another solution to Rosenblatt’s experiment is the attentional model paradigm [70]. If

somehow, we can eliminate the second object in the scene to focus on the first object, and

in a second phase eliminate the first object and keep the second, we will solve the binding

problem. The illusory conjunction is psychological evidence for the existence of attention
in human cognition (Figure 3.3) [71, 67]. In this paradigm, efferent receptive fields are

‘tuned’ according to the focus of attention. One of the first attentional models proposed

is Fukushima’s Neocognitron [72, 73]. In this network, a ‘winner-takes-all’ competition

between the objects in the output layer triggers the masking of objects in the input layer

through efferent (feedback) synapses in the gating layer.
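The attentional strategy described above, applied to Rosenblatt's scene, can be sketched as a serial loop. This is not the Neocognitron: the salience values, the dictionary layout of the scene, and the function name `attend_sequentially` are assumptions made only to show the select, read out, and mask cycle.

```python
def attend_sequentially(scene):
    """Serial attention sketch: select one object, read out its features, then inhibit it.

    scene: dict mapping object name -> (shape, position, salience).
    Salience values and the selection rule are illustrative assumptions.
    """
    remaining = dict(scene)
    bindings = []
    while remaining:
        # Winner-takes-all: the most salient object wins the competition.
        winner = max(remaining, key=lambda name: remaining[name][2])
        shape, position, _ = remaining[winner]
        # Only the attended object drives the output, so the pair is unambiguous.
        bindings.append((shape, position))
        del remaining[winner]   # mask the attended object before the next pass
    return bindings

scene = {"obj1": ("triangle", "up", 0.8), "obj2": ("rectangle", "down", 0.5)}
bindings = attend_sequentially(scene)
```

Because only one object is active at a time, each readout contains a single shape and a single position. The price is that the objects are processed one after the other.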

Another attentional model used in the literature is dynamic routing [74] (see Figure
3.10). The connectivity between two successive layers is controlled by routing control
units, which can turn on or off certain subsets of connections. If the appropriate
connections are activated, a region in the input layer, referred to as the window of
attention, is projected to the output in a standardized size. This provides a normalized
representation of the attended region, based on which recognition can be performed. This
architecture is closely related to the SCAN (Signal Channelling Attentional Network)
network by Postma et al. [75] (Figure 3.10). SCAN is a network based on ‘dynamic routing’.
The building block of SCAN is a gating lattice, a sparsely-connected neural network defined
as a special case of the Ising lattice from statistical mechanics (see footnote 1). The
process of spatial selection through covert attention is interpreted as a biological
solution to the problem of translation-invariant pattern processing.
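The idea of projecting a window of attention onto an output of standardized size can be sketched in one dimension. The nearest-neighbour resampling below merely stands in for the routing control units of [74]. The function name `route_window`, its parameters, and the toy patterns are assumptions.

```python
def route_window(input_layer, start, width, output_size):
    """Project the window input_layer[start:start+width] onto a fixed-size output.

    Nearest-neighbour resampling stands in for the routing control units,
    which switch on the connections from the window to the output layer.
    """
    if width <= 0 or start < 0 or start + width > len(input_layer):
        raise ValueError("window must lie inside the input layer")
    return [input_layer[start + (k * width) // output_size]
            for k in range(output_size)]

# The same pattern at two positions and two sizes in the input layer.
small = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0]
large = [0, 0, 0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
out_small = route_window(small, start=2, width=3, output_size=6)
out_large = route_window(large, start=4, width=6, output_size=6)
```

Both windows are mapped to the same normalized output, so a recognizer placed after the routing stage sees a translation- and scale-normalized pattern.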

Salinas and Abbott [76] have added a ‘gain field’ to the neocognitron to allow selecting a
local region and enabling feature-extracting units only there. One can also imagine
top-down attention to objects or features if the facilitation acts on different sets of
units sensitive to a common feature rather than location, as illustrated in Figure 3.9.
These attentional control mechanisms are similar to those in the routing circuit model
(Figure 3.10) in that they work top-down and require indirect feedback.

3.4.3 Assembly Coding and Temporal Correlation

Temporal correlation is a special case of the more general assembly coding approach.

In the assembly coding paradigm a particular constellation of features is represented by

the joint and coordinated activity of a dynamically associated ensemble of cells, each of

which represents explicitly only one of the more elementary features that characterize a

particular perceptual object. Different objects can then be represented by recombining

neurons tuned to more elementary features in various constellations (assemblies) [78]. For

assembly coding, two constraints need to be met. First, a selection mechanism is required

that permits dynamic, context dependent association of neurons into distinct, function-

ally coherent assemblies. Second, grouped responses must get labelled so that they can

be distinguished by subsequent processing stages as components of one coherent represen-

tation and do not get confounded with other unrelated responses. Tagging responses as

1 Some neural architectures are inspired by statistical and quantum mechanics. For example, Boltzmann
machines, mean-field machines, and Ising lattices are physical concepts. An Ising lattice is a square
connected lattice. Each lattice site (element) has a single spin variable s = ±1. Minimizing the energy
of such a lattice can solve optimization problems in artificial intelligence.


related is equivalent to raising their salience jointly and selectively, because this assures

that they are processed and evaluated together at the subsequent processing stage. This

can be achieved in three ways. First, nongrouped responses can be inhibited; second, the

amplitude of the selected responses can be enhanced; and third, the selected cells can

be made to discharge in precise temporal synchrony. All three mechanisms enhance the

relative impact of the grouped responses at the next higher processing level.

Based on the motivations and observations stated above, von der Malsburg proposed a
phase-coding (in contrast with rate-coding) assembly coding paradigm that he called ‘Temporal
Correlation’.

If synchronization serves as a selection and binding mechanism, neurons must be sensitive

to coincident input. Moreover, synchronization must occur rapidly and show a relation to

perceptual phenomena. Although the issue of coincidence detection is still controversial

[79, 68, 80], evidence is increasing that neurons can evaluate temporal relations with

precision among incoming activity.

As an example, reconsider Rosenblatt’s experiment. If the ‘up’ neuron is activated at the

same time as the ‘triangle’ output and the ‘down’ output at the same time as the ‘rectan-

gle’ output, so that the first event is dissociated from the second event, no ambiguity will

happen (Figure 3.12). In telecommunication terminology, this is equivalent
to time-division multiplexing (TDM).
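The time-slot analogy can be made concrete with a minimal sketch. The spike times, the neuron labels, and the 1 ms coincidence window below are illustrative assumptions, not values from this work. Neurons whose spikes fall close together in time are read out as one bound assembly; the function expects at least one neuron.

```python
# Minimal sketch of binding by temporal correlation (hypothetical spike times).
# Neurons that fire within the same coincidence window are grouped together,
# much like channels sharing one slot in time-division multiplexing.

def bind_by_coincidence(spike_times, window=1.0):
    """Group neuron labels whose consecutive spike times differ by at most `window` (ms)."""
    ordered = sorted(spike_times.items(), key=lambda item: item[1])
    assemblies = []
    current = [ordered[0][0]]
    last_time = ordered[0][1]
    for label, t in ordered[1:]:
        if t - last_time <= window:
            current.append(label)      # same time slot: same object
        else:
            assemblies.append(sorted(current))
            current = [label]          # a new time slot starts
        last_time = t
    assemblies.append(sorted(current))
    return assemblies

# Illustrative firing times in ms: the triangle is on top, the rectangle below.
spikes = {"triangle": 10.0, "up": 10.2, "rectangle": 15.0, "down": 15.3}
print(bind_by_coincidence(spikes))
```

With these assumed times, ‘triangle’ is grouped with ‘up’ and ‘rectangle’ with ‘down’, so the two objects occupy different time slots and are not confused.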

The great advantage of the temporal correlation approach is its autonomy and its capacity for
self-organization. It is, so far, the simplest and most plausible solution to Rosenblatt’s
experiment when the number of combinations is large.

The disadvantage of temporal correlation is its slowness compared with rival approaches
(especially hierarchical coding). As stated earlier, the detection of phase synchrony
by coincidence-detector neurons is another physiological and practical problem
that remains to be solved or studied further.

3.5. CONCLUSION 47

Another objection to the temporal correlation approach (which also applies, to some extent,
to other approaches) stems from the fact that not all recognition tasks are ‘stimulus-driven’
(based on the properties of the stimulus alone). In most cases they are ‘task-driven’ [81].
The hypothesis of stimulus-driven binding does not explain how neurons
know what they should bind. For instance, when you observe someone’s face, you
may wish to identify the person regardless of whether the eyes are open or closed or whether the
hair is short or long. In all these situations, only some parts of the visual

input should be bound and other parts discarded. Thus, if binding by synchronization

takes place, it cannot be stimulus-driven. External inputs are needed to control binding

in a task-dependent way. For example, if the task is to infer the person’s feelings and
mood, then whether the eyes are open or closed (or whether the person is smiling)
becomes important, but these features are irrelevant for a person identification

task. This task-driven binding approach raises another new and important question: how

can a high-level process know which parts of the input image to group before it knows

what is in the image itself? Some top-down processes must be involved in the ‘task-driven’

binding. Some solutions to this problem have been proposed in the literature, but this is

still an unsolved issue [81].

In order to implement temporal correlation, adequate dynamics should be used for
the neurons. The neurons used for implementing this approach are not classical neurons
but bio-inspired neurons (neurons that behave like the cells in our nervous system). As
shown and discussed in Chapter 4, different dynamics can be chosen for this task:
relaxation oscillatory neurons, integrate-and-fire neurons, chaotic neurons, and Izhikevich’s
model. In the case of chaotic neurons, phase synchrony is replaced by similarity
measures [82].

3.5 Conclusion

The cognitive aspects of neural networks have been mentioned in this chapter. We ob-

served that conventional neural networks do not cope with the situations encountered in


real life classification. New concepts have been introduced that would let us solve more

general problems. These concepts include but are not limited to: hierarchical coding,

attentional models, and ‘temporal correlation’. Hierarchical coding is the fastest but the

least flexible while ‘temporal correlation’ is a very flexible and autonomous approach. In

the next chapter we will see how such a ‘temporal correlation’ network can be constructed

using the available mathematical models of bio-inspired neurons.


[Figure 3.8 diagram labels: simple cells (S1), complex cells (C1), "composite feature" cells (S2), "complex composite" cells (C2), view-tuned cells; operations: weighted sum, MAX]

Figure 3.8: The hierarchical scene analyzer of Riesenhuber and Poggio. Each pixel of

the image is connected to four different cells in the S1 layer that are sensitive to one of

the four directions: horizontal, vertical, right, and left. The hierarchical organization is

such that features in the first layer are merged in the second layer to give the best match

for two adjacent pixels and so on. This is done hierarchically until the best match for

the whole image is found. The network consists of layers of linear units that perform a

template match over their afferents (dashed arrows), and non-linear units that perform

a ‘MAX’ operation over their inputs, where the output is determined by the strongest

afferent (solid arrows). While the former operation serves to increase feature complexity,

the latter increases invariance by effectively scanning over afferents tuned to the same

feature but at different positions (to increase translation invariance) or scale (to increase

scale invariance, not shown). An afferent nerve carries impulses toward the central nervous

system. The opposite of afferent is efferent. For more detail see section 3.4.1.
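The two operations named in the caption can be illustrated on a one-dimensional toy input. This is only a sketch under assumed values (the edge template and the two test images are hypothetical), not the network of Riesenhuber and Poggio: the weighted sum plays the role of the template match, and the MAX over positions yields a response that does not depend on where the feature appears.

```python
# Sketch of the two operations in the caption, on a 1-D "image":
# S units compute a weighted sum (template match) at each position,
# a C unit takes the MAX over S units tuned to the same feature.

def template_match(signal, template):
    """Weighted sum of the template with every window of the signal."""
    width = len(template)
    return [
        sum(w * s for w, s in zip(template, signal[i:i + width]))
        for i in range(len(signal) - width + 1)
    ]

def max_pool(responses):
    """The output is set by the strongest afferent."""
    return max(responses)

edge = [-1, 1]                       # hypothetical feature: a rising edge
image_a = [0, 0, 1, 1, 0, 0, 0, 0]   # edge near the left
image_b = [0, 0, 0, 0, 0, 1, 1, 0]   # same edge, shifted to the right

s_a = template_match(image_a, edge)
s_b = template_match(image_b, edge)
print(s_a, s_b)                      # position-dependent S responses
print(max_pool(s_a), max_pool(s_b))  # position-independent C response
```

The S responses differ between the two images, while the pooled C response is the same, which is the translation invariance described in the caption.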


Figure 3.9: Hierarchical network for feature extraction with two types of attentional

control. First, the control units located on the right can facilitate, connect (black lines)

or discard some units so that the network only processes information coming from a

single object. This is an attentional ‘top-down’ (task-driven) control influenced by the

task (when for example we know from higher levels that we are looking for something

specific, i.e. a triangle in Rosenblatt’s experiment). Second, a ‘winner-take-all’ mechanism

can select one object and discard the others. This is a saliency-based control (when for

example the triangle is greater or darker than the rectangle, the triangle will win the

competition over the rectangle) [77].


Figure 3.10: Schematic diagram of the SCAN (Signal Channelling Attentional Network).

The activity of units represents a feature value, such as local light intensity, and is in-

dicated by different gray values. The same type of feature is used in the whole network

(no feature hierarchy). Most of the existing connections between two successive layers are

disabled (gray lines) through inhibitory mechanisms by the routing control units. The

remaining active connections (black lines) establish a mapping between a region in the

input layer (bottom layer), referred to as window of attention, and the output layer (top

layer). This provides a normalized view of the attended object (adapted from [77]) [75].


[Figure 3.11 diagram labels: features, Stage 1, Stage 2, Stage 3, Stage 4, recognized pattern]

Figure 3.11: The hierarchical approach (along with attention) used by the neocognitron

to recognize ‘0’. In stage one, the existence of simple lines is detected by the network.

Stage two detects the combination of lines from stage one. Stage three analyzes the

combinations of features detected in stage two. In stage four, the whole number ‘0’ is

recognized.

[Figure 3.12 diagram labels: TR neuron, RC neuron, TOP neuron, DOWN neuron; output of each neuron versus time; binding]

Figure 3.12: Solution to the binding problem using the temporal correlation technique.

The neurons RC (rectangle) and DOWN are bound together in time as it is the case for

neurons TR (triangle) and TOP.

CHAPTER 4

DYNAMICS OF BIO-INSPIRED NEURONS

4.0.1 Introduction

Bio-inspired neural networks try to mimic the behavior of real neurons in animals and

humans. They let us process temporal sequences, in contrast with most of the classical

neural networks that are suitable for static data. One of the solutions used in classical

neural networks to process temporal data is to represent it spatially (as in Time-Delay
Neural Networks). This approach has a number of drawbacks. First, an

interface must buffer the input to the neural network so that the network has all inputs

available for processing. Also, some external agent should tell when the buffer is full

and the processing can begin. Further, this input buffer approach assumes that all input

patterns must be of the same length, which is not realistic in most applications. Thus,

the buffer must be made large enough to accommodate the longest possible sequence.

This results in unused buffers when shorter sequences are processed. Another problem is

that input vectors which are similar but displaced temporally will appear different when

represented spatially [83] (the network is unable to detect time delays in the input signal).
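The last drawback can be seen in a small numerical sketch. The pattern values, the buffer length, and the delays are illustrative assumptions: the same three-sample pattern is written into a fixed-length buffer at two different delays, and the two resulting input vectors are compared.

```python
# Sketch: a temporal pattern buffered into a fixed-length input vector.
# Delaying the same pattern by two samples yields a very different vector.

def buffer_sequence(pattern, delay, length=8):
    """Place `pattern` in a zero-padded buffer of `length` samples after `delay`."""
    buf = [0.0] * length
    for k, value in enumerate(pattern):
        buf[delay + k] = value
    return buf

def euclidean_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

pattern = [1.0, 2.0, 1.0]            # hypothetical short temporal pattern
early = buffer_sequence(pattern, delay=0)
late = buffer_sequence(pattern, delay=2)
print(early, late, euclidean_distance(early, late))
```

The distance between the two buffered vectors is larger than the norm of the pattern itself, so a network that compares inputs position by position treats the delayed copy as a different pattern.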

In the case of bio-inspired neural networks, temporal sequence processing is done naturally

because of the intrinsic dynamic behavior of the neurons. The pioneering work in the field
of bio-inspired neural networks was done by Hodgkin and Huxley, who carried out their
squid experiments at the Marine Biological Association laboratory in Plymouth. In the 1950s
they derived a mathematical description of the behavior of the squid giant axon. Although
this model is the most complete so far (it can predict most of the
behaviors seen in simple biological neurons), it is very complex and difficult to simulate
in an artificial neural network paradigm. In what follows, we first describe the
most important mathematical models of bio-inspired neurons, beginning
with the Hodgkin-Huxley model. We then show how well-known simpler models
can be derived from the Hodgkin-Huxley equations using some approximations. We then

introduce the canonical model that is a unified framework, in which a major part of

bio-inspired neurons can be expressed. Based on the literature, we then show how one

can derive synchronization criteria using this canonical model. Some aspects of learning

in neural networks are discussed. Finally, two different architectures that enable us to

implement ’temporal correlation’ are described.

4.1 Different types of neuronal models

4.1.1 Class I and Class II neural excitability

Neurons can behave in two different modes. A neuron is called class I excitable if the

spiking rate of the neuron is a quasi-linear function of the current applied to the input of

the neuron. A class I neuron becomes active via a ’saddle-node’ bifurcation [84] (Section

4.3). A neuron belongs to class II if the discharge (spiking) rate varies very little with

the increase or the decrease of the applied current. Class II neurons are activated via an

’Andronov-Hopf’ bifurcation [84].

Figure 4.1: The spike rate dependency to the applied input current in the Wilson-Cowan

neural model [85]

4.2. MATHEMATICAL DESCRIPTION OF NEURONS 55

4.2 Mathematical description of neurons

Many different dynamics are used to mimic class I and class II excitability (see Section
4.1.1). In the remainder of this chapter, the ‘dimension of a model’ is the dimension of

the state space (phase space) describing the model. The dimension of the state space is

the number of independent variables that must be used to describe a dynamical system

using first-order differential equations. The number of the aforementioned independent

variables is equal to the number of first-order equations, which is equal to the dimension

of the state space.

4.2.1 Four-dimensional neuronal models

• Hodgkin-Huxley neuronal model. The most general and complete model of a real
neuron so far is the Hodgkin-Huxley model. The Hodgkin-Huxley model can be

understood with the help of Figure 4.2. The semipermeable cell membrane separates

the interior of the cell from the extracellular liquid and acts as a capacitor. If an

input current I(t) is injected into the cell, it may add further charge on the capacitor,

or leak through the channels in the cell membrane. Because of active ion transport

through the cell membrane, the ion concentration inside the cell is different from

that in the extracellular liquid. The potential generated by the difference in ion

concentration is represented by a battery.

The Hodgkin-Huxley model is defined by the following Equations (see [86]):

$$C \frac{du}{dt} = -\sum_k I_k(t) + I(t) \tag{4.1}$$

Here $\sum_k I_k$ is the sum of the ionic currents that pass through the cell membrane (defined

in Equation 4.2), u(t) is the membrane potential, and I(t) is the external current

applied to the neuron.

$$\sum_k I_k = g_{Na} m^3 h (u - E_{Na}) + g_K n^4 (u - E_K) + g_L (u - E_L) \tag{4.2}$$


[Figure 4.2 diagram labels: Inside, Outside, membrane charges (+/−), K and Na channels, R, C, I]

Figure 4.2: Schematic diagram for the Hodgkin-Huxley model (adapted from [86]).

The parameters ENa, EK, and EL are the reversal potentials. The reversal potentials and the
conductances gNa, gK, and gL are empirical parameters (for details see D-2).

The three variables m,n, and h are called gating variables. They evolve according

to the following differential equations:

$$\dot{m} = \alpha_m(u)(1 - m) - \beta_m(u)\, m$$

$$\dot{n} = \alpha_n(u)(1 - n) - \beta_n(u)\, n \tag{4.3}$$

$$\dot{h} = \alpha_h(u)(1 - h) - \beta_h(u)\, h \tag{4.4}$$

Equations 4.1 to 4.4, along with Table 7.3 and Equation D-2 (Appendix D), define the
dynamics of the Hodgkin-Huxley model. The problem with the Hodgkin-Huxley
model is its computational complexity: it is a nonlinear system of four coupled first-order
differential equations with voltage-dependent coefficients. Therefore, the simplified
two-dimensional models described in the next subsection have been proposed.
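To give a feeling for the cost of the full model, the following is a minimal forward-Euler sketch of Equations 4.1 to 4.4. The rate functions and parameter values are the commonly quoted textbook ones, with the resting potential near −65 mV. They are an assumption here and are not taken from Appendix D, whose convention may differ. The helper names (`rates`, `simulate_hh`, `count_spikes`) are hypothetical.

```python
import math

# Assumed textbook parameters (not taken from Appendix D of this work):
# conductances in mS/cm^2, potentials in mV, capacitance in uF/cm^2.
C = 1.0
G_NA, G_K, G_L = 120.0, 36.0, 0.3
E_NA, E_K, E_L = 50.0, -77.0, -54.387

def _ratio(x, scale):
    """x / (1 - exp(-x / scale)), with its limit `scale` at x = 0."""
    if abs(x) < 1e-7:
        return scale
    return x / (1.0 - math.exp(-x / scale))

def rates(u):
    """Assumed voltage-dependent rate functions alpha_x(u), beta_x(u) in 1/ms."""
    a_m = 0.1 * _ratio(u + 40.0, 10.0)
    b_m = 4.0 * math.exp(-(u + 65.0) / 18.0)
    a_n = 0.01 * _ratio(u + 55.0, 10.0)
    b_n = 0.125 * math.exp(-(u + 65.0) / 80.0)
    a_h = 0.07 * math.exp(-(u + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(u + 35.0) / 10.0))
    return a_m, b_m, a_n, b_n, a_h, b_h

def simulate_hh(i_ext, t_max=50.0, dt=0.01):
    """Forward-Euler integration of Equations 4.1 to 4.4; returns the u(t) trace."""
    u = -65.0
    a_m, b_m, a_n, b_n, a_h, b_h = rates(u)
    # Start the gating variables at their steady-state values for u = -65 mV.
    m, n, h = a_m / (a_m + b_m), a_n / (a_n + b_n), a_h / (a_h + b_h)
    trace = []
    for _ in range(int(t_max / dt)):
        a_m, b_m, a_n, b_n, a_h, b_h = rates(u)
        i_ion = (G_NA * m**3 * h * (u - E_NA)
                 + G_K * n**4 * (u - E_K)
                 + G_L * (u - E_L))            # Equation 4.2
        du = (-i_ion + i_ext) / C              # Equation 4.1
        m += dt * (a_m * (1.0 - m) - b_m * m)
        n += dt * (a_n * (1.0 - n) - b_n * n)  # Equation 4.3
        h += dt * (a_h * (1.0 - h) - b_h * h)  # Equation 4.4
        u += dt * du
        trace.append(u)
    return trace

def count_spikes(trace, level=0.0):
    """Count upward crossings of `level` (mV)."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a < level <= b)

print(count_spikes(simulate_hh(0.0)), count_spikes(simulate_hh(10.0)))
```

Even this small example evaluates six exponential-type rate functions per neuron per time step and needs a step of about 0.01 ms, which illustrates why the reduced models of the next subsection are attractive for large networks.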

4.2.2 Two-dimensional neural models

Two-dimensional models aim to simplify the dynamics of the Hodgkin-Huxley model.

They stem from the fact that the time scale of the dynamics of the gating variable

m is much faster than that of the variables n, h, and u. This suggests that we may


treat m as an instantaneous variable. The variable m can be replaced by its steady-state
value, m(t) → m0[u(t)]. This approximation is called the quasi-steady-state
approximation. Another approximation consists of replacing the two variables n
and (1 − h) by a single effective variable w, because the two variables have rather
similar graphs.

In what follows, the Morris-Lecar, FitzHugh-Nagumo, Wang-Terman, and Izhikevich

models will be detailed, among which Wang-Terman oscillators are of great interest

for this work.

• Morris-Lecar Model. Morris and Lecar proposed a two-dimensional description of
neuronal spike dynamics. The first equation describes the evolution of the membrane
potential u, the second the evolution of a ‘slow recovery’ variable w. In
dimensionless variables, the Morris-Lecar equations read:

$$\frac{du}{dt} = -g_1 m_0(u)(u - 1) - g_2 w (u - V_2) - g_L (u - V_L) + I$$

$$\frac{dw}{dt} = -\frac{1}{\tau(u)} \left[ w - w_0(u) \right] \tag{4.5}$$

where τ(u) is a polynomial function of the α and β variables of the Hodgkin-Huxley

equations.

• FitzHugh-Nagumo Model. FitzHugh and Nagumo were probably the first to
propose a two-dimensional approximation of the Hodgkin-Huxley equations. They
obtained sharp pulses by defining the following state-space equations:

$$\varepsilon \frac{dv}{dt} = F(v) - w + I$$

$$\frac{dw}{dt} = v - \gamma w \tag{4.6}$$

where $F(v) = v(1 - v)(v + a)$, $\varepsilon \ll 1$, and $a$, $I$, and $\gamma$ are constants.

• Wang-Terman Model. The Wang-Terman model is based on the van der Pol equation
(see footnote 1). The state-space equations of this model are as follows:

$$\frac{dx}{dt} = 3x - x^3 + 2 - y + \rho + p + S$$

$$\frac{dy}{dt} = \varepsilon \left[ \gamma \left( 1 + \tanh(x / \beta) \right) - y \right] \tag{4.7}$$

1 The van der Pol equation is a model of an electronic circuit that appeared in very early radios. This

where x is the membrane potential (output) of the neuron and y is the state for
channel activation or inactivation. ρ denotes the amplitude of a Gaussian noise,
p is the external input to the neuron, and S is the coupling from other neurons
(connections through synaptic weights). ε, γ, and β are constants. This model is
used in Chapters 5 and 6, where the dynamical behavior of a single Wang-Terman
neuron and of an assembly of such neurons is analyzed further.

• Izhikevich Model. The Izhikevich model is a manifold reduction of the canonical
model of Ermentrout/Izhikevich. Izhikevich [87] has shown that this model can
reproduce all the modes shown in Figure 4.3. He has also shown in [88] that this
model is computationally efficient.

The original model of Izhikevich neuron follows the dynamics:

$$\frac{dv}{dt} = 0.04 v^2 + 5v + 140 - u + I$$

$$\frac{du}{dt} = a(bv - u) \tag{4.8}$$

with the additional condition:

If $v = +30$ mV, then $v \leftarrow c$ and $u \leftarrow u + d$.

u and v are variables and a, b, c, and d are parameters. v corresponds to the internal
potential and u represents the ionic currents K+ and Na+. When v reaches the
predefined threshold (here 30 mV), v is reset to c and u is incremented by d. Izhikevich used his
proposed equations with a step size equal to 1 ms. My simulations have shown that
a step size (integration step) equal to 1 ms gives unequal spikes at the output of
the neuron (different amplitudes). This is due to the fact that the dynamics of the

circuit arose back in the days of vacuum tubes. The tube acts like a normal resistor when current is

high, but acts like a negative resistor if the current is low. So this circuit pumps up small oscillations,

but drags down large oscillations. This behavior is known as relaxation oscillation.


system is stiff and, at the spiking instant, the variable v varies very rapidly. Therefore
the variable v can have a value equal to v = 30 − δ (δ ≪ 1) at t and a value equal
to v = 30 + θ (θ > 100) at t + 1 ms. One trivial way to circumvent this problem
is to decrease the step size. Another solution, which we proposed to Izhikevich, is
to modify the model above slightly (see footnote 2). Hence, in the final version of his paper, the
resetting condition has been changed as follows:

If v(t) > +30 mV, then v(t) = 30, v(t + 1) ← c, and u ← u + d.

If the first-order Euler integration method is used to solve Izhikevich’s state-space
equations, we obtain (step size equal to 1 ms):

$$v(t + 1) = \left( 0.04\, v(t) + 6 \right) v(t) + 140 - u(t) + I(t) \tag{4.9}$$

$$u(t + 1) = a b\, v(t) + (1 - a)\, u(t) \tag{4.10}$$

The above equations are simply obtained by replacing $\frac{dv}{dt}$ and $\frac{du}{dt}$ in Equation 4.8 by
$v(t + 1) - v(t)$ and $u(t + 1) - u(t)$, respectively.
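The Euler update and the modified reset rule can be sketched in a few lines. The parameter values a = 0.02, b = 0.2, c = −65, d = 8 and the input I = 10 are assumptions: they are the values commonly quoted for a regular-spiking cell and are not given in this section. The function name and the initial condition u = b·c are hypothetical choices.

```python
def simulate_izhikevich(i_ext=10.0, steps=1000, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Iterate the Euler equations (1 ms step) with the modified reset rule."""
    v, u = c, b * c                  # assumed initial state
    trace, spike_times = [], []
    for t in range(steps):
        if v > 30.0:
            trace.append(30.0)       # v(t) is clipped to 30 mV
            spike_times.append(t)
            v, u = c, u + d          # v(t + 1) <- c and u <- u + d
        else:
            trace.append(v)
            # Both right-hand sides use the old v(t) and u(t).
            v, u = ((0.04 * v + 6.0) * v + 140.0 - u + i_ext,
                    a * b * v + (1.0 - a) * u)
    return trace, spike_times

trace, spike_times = simulate_izhikevich()
print(len(spike_times), max(trace))
```

Because the recorded value is clipped to 30 mV whenever the threshold is exceeded, all spikes in the trace have the same amplitude even with the 1 ms step, which is the purpose of the modified reset condition.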

Even more simplified models can be derived from two-dimensional neurons by further
approximating the equations. In Appendix E (Figure E-7) we show how a Wang-Terman
oscillator can be reduced to a one-dimensional model. This approximation is
central to the analysis of the ODLM (Oscillatory Dynamic Link Matcher)
proposed in Chapter 6.
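Before any reduction, the behavior of a single Wang-Terman oscillator can be explored with a minimal forward-Euler sketch of Equation 4.7. The values ε = 0.02, γ = 6, β = 0.1 and the two inputs p = 0.8 and p = −0.5 are assumptions chosen for illustration, the noise term ρ is set to zero, and the function name and initial state are hypothetical.

```python
import math

def simulate_wang_terman(p, t_max=400.0, dt=0.01,
                         eps=0.02, gamma=6.0, beta=0.1, coupling=0.0):
    """Forward-Euler integration of Equation 4.7 without noise (rho = 0)."""
    x, y = -2.0, 0.0                 # assumed initial state
    xs = []
    for _ in range(int(t_max / dt)):
        dx = 3.0 * x - x**3 + 2.0 - y + p + coupling
        dy = eps * (gamma * (1.0 + math.tanh(x / beta)) - y)
        x, y = x + dt * dx, y + dt * dy
        xs.append(x)
    return xs

oscillating = simulate_wang_terman(p=0.8)    # stimulated neuron
silent = simulate_wang_terman(p=-0.5)        # unstimulated neuron
print(min(oscillating), max(oscillating), max(silent))
```

With a positive input the trajectory alternates between a silent phase near x = −2 and an active phase near x = +2 (a relaxation oscillation), whereas with a negative input it settles on the left branch of the cubic and stays silent.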

4.2.3 One-dimensional neural models

One of the most widely used models in computational neuroscience is the leaky
integrate-and-fire (I&F) [89] neuron, described as follows:

$$\frac{dv}{dt} = I + a - bv, \qquad \text{if } v \ge v_{\text{threshold}} \text{ then } v \leftarrow c \tag{4.11}$$

2Personal correspondence with Eugene Izhikevich.


where v is the membrane potential, I is the input current, and a, b, c, and vthreshold are

parameters. When the potential v reaches the threshold vthreshold the neuron is said to

fire a spike, and v is reset to c.
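For a constant input the model can be checked against a closed-form result. Between two spikes, v relaxes exponentially from c toward v∞ = (I + a)/b, so the interspike interval is T = (1/b) ln[(v∞ − c)/(v∞ − vthreshold)] whenever v∞ > vthreshold. The sketch below compares this expression with a forward-Euler simulation. The parameter values and function names are assumptions chosen for illustration.

```python
import math

def lif_spike_times(i_ext, a=0.0, b=1.0, c=0.0, v_threshold=1.0,
                    t_max=5.0, dt=1e-4):
    """Forward-Euler integration of the leaky integrate-and-fire neuron."""
    v, spikes = c, []
    for k in range(int(t_max / dt)):
        v += dt * (i_ext + a - b * v)
        if v >= v_threshold:
            spikes.append((k + 1) * dt)
            v = c                      # reset after the spike
    return spikes

def lif_period(i_ext, a=0.0, b=1.0, c=0.0, v_threshold=1.0):
    """Closed-form interspike interval for constant input (needs v_inf > threshold)."""
    v_inf = (i_ext + a) / b            # asymptote of v without the reset
    return math.log((v_inf - c) / (v_inf - v_threshold)) / b

spikes = lif_spike_times(2.0)
print(spikes[1] - spikes[0], lif_period(2.0))
```

If the input is too weak (v∞ below the threshold), the neuron never fires and the list of spike times is empty.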

Still another type of model that should be discussed is the chaotic neural model. Chaotic

neurons have fractal dimensions (since they are chaotic). It means that the dynamics of

such a system is governed by strange attractors (fractals) and its fractal dimension is less

than the dimension of its phase space.

4.2.4 Fractal dimension neural models

Chaotic Neural Model. The introduction of methods originating from nonlinear dy-

namics in the analysis of brain waves (electroencephalograms, EEG) goes back to the

pioneering work of Walter Freeman. Nonlinear time series analysis of the EEG of the ol-

factory system revealed that the dynamics of neuronal activity is low dimensional, though

unpredictable. This is a characteristic property of a deterministic chaotic system in con-

trast with the dynamics of a high-dimensional stochastic process. The former system

typically collapses after a transient to a low dimensional attractor whereas the dynam-

ics of the latter remains high dimensional (for further details on the difference between

a chaotic deterministic time series and a stochastic time series see [90]). Deterministic

chaos in neural networks has not only been observed at the network level but also at the

level of a single neuron. The Hodgkin-Huxley model already shows a parameter range

where chaotic dynamics appears (some other neural models also exhibit chaos (Figure

4.4)).

Since these early discoveries, much effort has been devoted to devising sophisticated methods
to establish the idea of chaos in the brain. However, the determination of chaotic

dynamics from time series analysis is a subtle task, mainly due to the presence of noise in

experimental systems. Thus, whether chaos is indeed present in the brain or if its detec-

tion is just an artifact, due to the applied methods, is still an open question. Moreover,

the significance of chaotic dynamics in neural systems has not yet been elucidated.


The chaotic map model used in [91] [82] is as follows:

$$x_i(t + 1) = x_i(t) + \frac{\varepsilon}{N} \sum_{j=1}^{N} f(x_j(t)) \tag{4.12}$$

where $f(x)$ is the logistic map, defined as follows:

$$f(x) = a x (1 - x) \tag{4.13}$$

Here a is a constant. The logistic map can be replaced by other maps, such as the Heron map
([91]; see footnote 3). The chaotic map model defined above is called the Locally to Globally Coupled
chaotic Map (LtGCM) [92, 93, 94].
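The role of the constant a in the logistic map can be illustrated with a short sketch of a single, uncoupled map. It is not the coupled LtGCM network, and the values a = 2.5, a = 4, the starting point, and the perturbation size are assumptions chosen for illustration. Two trajectories that start 10⁻¹⁰ apart are iterated and their largest separation is reported.

```python
def logistic(x, a):
    """Logistic map f(x) = a x (1 - x)."""
    return a * x * (1.0 - x)

def iterate(x0, a, steps):
    """Return the trajectory x(0), ..., x(steps) of the map."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic(xs[-1], a))
    return xs

def max_separation(a, x0=0.2, delta=1e-10, steps=100):
    """Largest distance reached between two initially close trajectories."""
    first = iterate(x0, a, steps)
    second = iterate(x0 + delta, a, steps)
    return max(abs(p - q) for p, q in zip(first, second))

print(max_separation(2.5), max_separation(4.0))
```

For a = 2.5 both trajectories converge to the fixed point 1 − 1/a = 0.6 and stay close to each other. For a = 4 the tiny initial difference is amplified until the trajectories are unrelated, which is the sensitivity to initial conditions that characterizes chaos.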

It must be pointed out that the dynamics explained above does not always exhibit chaotic
behavior. Roughly speaking, knowing that the xj(0) are random, when N is large the sum
follows the law of large numbers (but not the central limit theorem [92]). When the
variance of the variables xj(0) is small, the distribution of $\sum_{j=1}^{N} f(x_j(t))$ becomes close to
a delta function and the behavior of the system is no longer chaotic. For a detailed
derivation of the criteria for chaotic behavior of the system see [95] (footnote 4).

Different excitation modes are observed in real biological neurons. In what follows, we

will describe each mode briefly:

• Phasic Spiking. A neuron may fire only a single spike at the onset of the input, as

(Figure 4.3, B), and remain quiescent afterwards. Such a response is called phasic

spiking, and it is useful for detection of the beginning of stimulation.

• Tonic Bursting. Some neurons, such as the chattering neurons in cat cortex [87],

fire periodic bursts of spikes when stimulated, as in (Figure 4.3, C). The inter-burst
(i.e., between bursts) frequency may be as high as 50 Hz, and it is believed

that such neurons contribute to the gamma-frequency oscillations in the brain.

3 Personal correspondence with Zhao.

4 Personal correspondence with Pasemann.


• Phasic Bursting. Similarly to phasic spikers, some neurons are phasic bursters, as in

(Figure 4.3, D). Such neurons report the beginning of the stimulation by transmit-

ting a burst.

• Mixed Mode (Bursting Then Spiking). Intrinsically bursting (IB) excitatory

neurons in mammalian neocortex [87] can exhibit a mixed type of spiking activity

depicted in (Figure 4.3, E).

• Spike Frequency Adaptation. The most common type of excitatory neuron in

mammalian neocortex [87] , namely the regular spiking (RS) cell, fires tonic spikes

with decreasing frequency, as in (Figure 4.3, F).

• Spike Latency. Most cortical neurons fire spikes with a delay that depends on

the strength of the input signal. For a relatively weak but superthreshold input the

delay, also called spike latency, can be quite large, as in (Figure 4.3, I).

• Subthreshold Oscillations. Practically every brain structure has neurons capable

of exhibiting oscillatory potentials [87], as in (Figure 4.3, J). The frequency of such

oscillations plays an important role and such neurons act as band-pass filters.

• Frequency Preference and Resonance. Due to the resonance phenomenon, neurons
having oscillatory potentials can respond selectively to inputs whose frequency
content is similar to the frequency of their subthreshold oscillations. Such neurons can
implement frequency-modulated (FM) interactions and multiplexing of signals [87].

• Integration and Coincidence Detection. Neurons without oscillatory potentials

act as integrators: they prefer high-frequency input; the higher the frequency the

more likely they fire, as in (Figure 4.3, L). This can be useful for detecting coincident

or nearly coincident spikes.

• Rebound Spike. When a neuron receives and then is released from an inhibitory

input, it may fire a post-inhibitory (rebound) spike, as in (Figure 4.3, M). This

phenomenon is related to anodal break excitation in excitable membranes.


• Rebound Burst. Some neurons, including the thalamo-cortical cell, may fire post

inhibitory bursts [87], as in (Figure 4.3, N). It is believed that such bursts contribute

to the sleep oscillations in the thalamo-cortical system.

• Threshold Variability. A common misconception in the artificial neural network

community is the belief that spiking neurons have a fixed voltage threshold. It is

well-known that biological neurons have a variable threshold that depends on the

prior activity of the neurons, as in (Figure 4.3, O).

• Bistability of Resting and Spiking States. Some neurons can exhibit two stable

modes of operation: resting and tonic spiking (or even bursting). An excitatory or

inhibitory pulse can switch between the modes as in (Figure 4.3, P).

• Depolarization After-Potentials. After firing a spike, the membrane potential

of a neuron may exhibit a prolonged after-hyperpolarization (called AHP) as e.g. in

(Figure 4.3, B, I, or M) or a prolonged depolarized after-potential (called DAP) as

in (Figure 4.3, Q).

• Accommodation. Neurons are extremely sensitive to brief coincident inputs but

may not fire in response to a strong but slowly increasing input as illustrated in

(Figure 4.3, R).

• Inhibition-Induced Spiking. A bizarre feature of many thalamo-cortical neurons

is that they are quiescent when there is no input, but fire when hyperpolarized by

an inhibitory input or an injected current (Figure 4.3, S).

• Inhibition-Induced Bursting. Instead of spiking, a thalamo-cortical neuron can

fire tonic bursts when an inhibitory input is applied to it (Figure 4.3, T).

Not all the models (integrate-and-fire, relaxation, etc.) described in Section 4.2 can
reproduce all the modes. In our work only the RS (Regular Spiking) mode is used.

In order to analyze synchronization among neurons, which will be used further in Chapters
5 and 6, we present in Section 4.3 a general framework that allows us to do so.


4.3 Canonical Neuronal Model

It is shown (see Appendix A) that it is possible to find a global mathematical framework

in which all class I neuronal models (see Section 4.1.1) can be represented by a single

state variable φ. The synchronization conditions can be derived by using an invariant

manifold5 reduction (chapter 5, [97], and Appendix A). The advantage of the canonical

model for neuroscience applications is that it can model all types of neurons, even those

that have not yet been invented.

Many scientists believe that all pulse-coupled neural networks are toy models that are far

away from the biological reality. In what follows, we will show that a huge class of biophys-

ically detailed and biologically plausible neural network models can be transformed into

a canonical pulse-coupled form by a piece-wise continuous, possibly noninvertible, change

of variable. Such transformations exist when a network satisfies a number of conditions,

e.g., it is weakly connected; the neurons are Class 1 excitable; the synapses between neurons
are conventional (i.e., axo-dendritic and axo-somatic). This generalization will let us

analyze network properties (such as synchronization, etc.) independently of the model

used (because all models will reduce to the canonical model). Using this approach, we

will find some general conditions that can be applied to all models seen before.

As shown in 7.3, the in-phase synchronized solution of two identical Class 1 neurons exists,

but it is not exponentially stable 6. Small perturbations can make it disappear or stabilize

5 A manifold is a topological space which is locally Euclidean (i.e., around every point, there is a
neighborhood which is topologically the same as the open unit ball in ℝⁿ). To illustrate this idea, consider
the ancient belief that the Earth was flat as contrasted with the modern evidence that it is round. This
discrepancy arises essentially from the fact that on the small scales that we see, the Earth does indeed
look flat (although the Greeks did notice that the last part of a ship to disappear over the horizon was
the mast). In general, any object which is nearly “flat” on small scales is a manifold, and so manifolds
constitute a generalization of objects we could live on in which we would encounter the round/flat Earth
problem, as first codified by Poincaré. More formally, any object that can be “charted” is a manifold.
For a detailed discussion see [96].

6 Exponential stability of a system is defined in terms of Lyapunov coefficients [98]. Exponentially

4.4. DIFFERENT MODES OF SYNCHRONIZATION 65

with a small phase shift. The result is valid for any arbitrary synaptic organization [84].

We have so far observed that synchronization can be achieved between two neurons under

certain conditions. In the next section we will focus on the behavior of many neurons

(assembly of neurons) and will classify different modes.

4.4 Different modes of synchronization

As stated in Chapter 3, synchronization is a key element to the ’temporal correlation’

theory. In what follows, I enumerate different mode of synchronization, in which a cell

assembly can operate [99]:

• In-phase synchronization. All neurons in the assembly have the same frequency

and phase.

• Antiphase synchronization. Neurons oscillate at the same frequency but have a

phase difference of π.

• Out-of-phase synchronization. Neurons spike at the same frequency but their

phase difference may range somewhere between 0 et π.

• Frequency synchronization. Neurons may have similar spiking frequency but

variable phase difference.

• Frequency-ratio synchronization. One neuron has an oscillation angular fre-

quency that is equal to ω1 and the other one has an angular frequency equal to ω2

so that ω1

ω2= n and n ∈ N (N is an integer).

• Low-frequency-modulated synchronization. Neurons’ spikes are quasi-periodic

(sum of a high-frequency oscillation pattern and a low-frequency oscillation pattern).

Neurons are synchronized in respect with the low frequency oscillation pattern.

stable systems tolerate modest implementational inaccuracies; mere stable systems, in general, do not.


• Partial synchronization. A subpopulation of neurons in an assembly has syn-

chronized behavior while the remaining neurons are not synchronized, disturbing

the synchronized region. But the perturbation is not strong enough to destroy the

partial synchronization.
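The static modes listed above can be told apart from two numbers per pair of neurons: the frequency ratio and the phase difference. The following Python sketch illustrates this for two periodic oscillators. The function name `classify_pair`, the tolerance `tol` and the order of the tests are assumptions made for illustration and are not taken from [99]; the time-varying modes (frequency, low-frequency-modulated and partial synchronization) are not covered.

```python
import math

def classify_pair(omega1, omega2, phase_diff, tol=1e-6):
    """Label the synchronization mode of two oscillators (illustrative only)."""
    # Wrap the phase difference into [0, pi].
    d = abs(math.remainder(phase_diff, 2 * math.pi))
    if abs(omega1 - omega2) <= tol:
        if d <= tol:
            return "in-phase"
        if abs(d - math.pi) <= tol:
            return "antiphase"
        return "out-of-phase"
    # Different frequencies: check for an integer frequency ratio.
    ratio = max(omega1, omega2) / min(omega1, omega2)
    if abs(ratio - round(ratio)) <= tol:
        return "frequency-ratio"
    return "unsynchronized"
```

For example, `classify_pair(10.0, 10.0, 1.0)` falls in the out-of-phase case, while `classify_pair(20.0, 10.0, 0.3)` is reported as frequency-ratio synchronization with n = 2.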

Since we will focus on in-phase synchronization in the remainder of this work, we will
present in the next section different models that can mimic this kind of dynamics.

4.5 Selection of the model

As stated earlier, not all the models proposed in the literature can mimic all the modes
seen in neurophysiology. Most of the models described earlier have been implemented in our
SIMULINK library (SIMULINK is a graphical-interface extension to MATLAB). The choice of
the model for our architecture should be based on our needs, that is, synchronization
with the same frequency and different phases. We have also seen in the previous
section that neurons with complex dynamics synchronize with a phase lag or with a
varying frequency, which is not what we want from our simple model. Even some simple models,
like the Izhikevich neuron, cannot ensure in-phase synchronization. We are also looking
for a model that is not computationally very expensive, because complex models cannot
be implemented with our limited computational resources. Therefore, a model like the
Hodgkin-Huxley equations, which should be solved with finite element analysis techniques,
cannot be considered in this work. Taking these two criteria into account, three models
can be selected in a first stage: the relaxation oscillator, the integrate-and-fire neuron,
and the chaotic neuron.

4.5.1 Pros and cons of relaxation oscillators

Advantages:

• The spiking frequency of the neuron is independent of the input over a range of
values. This makes the synchronization simpler, even though it is not biologically
very motivated (actually this behavior is somewhere between class 1 and class 2
neural excitability).

• The behavioral dynamics of the system is mathematically tractable and, at the same
time, the equations are not as complicated as the Hodgkin-Huxley model.

• The model can produce some of the spiking modes. For some suitable parameters it
spikes with a very low duty cycle (single spikes), while with another set of parameters
it oscillates with high duty cycles that can be seen as the envelope of bursts.

Disadvantages:

• The Van der Pol equation used in the Wang-Terman relaxation neuron is a ’stiff’
equation. Hence, a small step size should be used. Seen as a single neuron, this can
be a disadvantage for the Wang-Terman model compared with the integrate-and-fire
model. But if many neurons are used, a small step size should also be used in the
integrate-and-fire network to break the symmetry of initial values⁷.

• The number of synchronized regions is limited to 4-5 in the original version of the
Wang-Terman model. The algorithmic version of the model does not have this
disadvantage but cannot be implemented in parallel [100].

4.5.2 Pros and cons of ’integrate-and-fire’ neurons

Advantages:

• In the case of the Van der Pol equation, the trajectory of the neuron in the state
space is one-way. Therefore an external input or a synapse cannot advance or lag
the spiking time of a neuron. This is in contrast with the integrate-and-fire neuron
in which the internal potential can increase or decrease depending on the external
influence. Hence, the frequency of oscillations can decrease so that more regions
with different phases can be created.

⁷ It means that in each cluster of neurons, there must be at least one neuron that has different initial
values from at least one neuron from other clusters.

• For small networks, the required integration step size is much bigger in the case of
the integrate-and-fire network. This is because we require great precision in order

to break the initial symmetry in each region.

Disadvantages:

• The spiking frequency is a function of the applied input. Therefore, to have in-phase

synchronization, binary images should be used. Techniques have been proposed in

the literature to convert gray-level images to binary images for the segmentation

application [102].

• The integrate-and-fire network is very sensitive to weight normalization. For each
neuron, the number of active neighboring neurons should be found and normalization
should be applied. Although weight normalization increases the synchronization
speed for Wang-Terman oscillatory neurons, it is not mandatory for them.

• Breaking of the initial symmetry in the network (regions that have at least one neu-

ron with different initial value) requires a high-precision random number generator.

4.5.3 Pros and cons of chaotic neurons

Advantages:

• The model is very advantageous in terms of computational complexity: there is no

differential equation to be solved and no threshold crossing detection is necessary.

• The model uses only multiplications and additions, therefore it is very suitable for

FPGA (Field Programmable Gate Array) implementation.


Disadvantages:

• The chaotic behavior of this model corresponds to a special parameter tuning of the

Hodgkin-Huxley equations, for which the dynamics becomes chaotic.

• The in-phase synchronization in this model is replaced with trajectory similarity.

The algorithms that detect trajectory similarities are much more complicated than

in-phase synchronization detection.

The overall behavior of a neural network does not only depend on the dynamics of the neu-

rons themselves, but also on the behavior of synapses (i.e., the way neurons are connected

to each other). In the next section, learning of neural networks will be discussed.

4.6 Learning

The learning method in a neural network defines the way in which the synaptic weights

change during the functioning of the network depending on the nature of the input signals.

We focus here only on unsupervised learning, which is a way of learning in which there
is no “tutor” to teach the network. This is in contrast with supervised learning, in
which the network has the correct classification result for a set of training data in advance.
In what follows, two general unsupervised learning frameworks will be briefly described:
memoryless learning and Hebbian learning.

4.6.1 Memoryless learning

In memoryless learning, the synaptic weights are adjusted according to the actual value

of the input signals. The weights neither depend on the past values of inputs nor on the

past state of neurons [103, 82, 104, 105, 106]. Suppose that neurons $i$ and $j$ are connected
by synaptic weights $w(i, j; t) = w(j, i; t)$ (symmetric). Let us suppose that $I_i(t)$ and $I_j(t)$
are external inputs to neurons $i$ and $j$ respectively. The synaptic weights are defined as
follows:

$$w(i, j; t) = f\left(|I_i(t) - I_j(t)|\right) \tag{4.14}$$

f can be an exponential or fractional (bilinear) function. As stated earlier the function

doesn’t remember the past history of the network. The advantage of this type of learning

is its simplicity: synchronization is achieved more easily. The disadvantage as stated

above is the lack of memory. It is widely believed that the dynamics of synaptic weights

are as important as the dynamics of neurons themselves and that weights with memory

can convey huge amounts of information.
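As an illustration of Equation 4.14, the sketch below evaluates the weight for the two families of functions mentioned above. The exact forms, the parameters `w_max` and `scale`, and the choice of a function that decreases as the inputs diverge are assumptions made for this example; the text only states that f can be exponential or fractional.

```python
import math

def weight_exponential(i_a, i_b, w_max=1.0, scale=1.0):
    """Eq. 4.14 with f chosen as a decaying exponential (assumed form)."""
    return w_max * math.exp(-abs(i_a - i_b) / scale)

def weight_fractional(i_a, i_b, w_max=1.0, scale=1.0):
    """Eq. 4.14 with f chosen as a fractional function (assumed form)."""
    return w_max / (1.0 + abs(i_a - i_b) / scale)
```

Both functions depend only on the current inputs, so the weight is symmetric in i and j and carries no memory of earlier values, which is the defining property of this learning scheme.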

4.6.2 Hebbian Learning

In 1949, Donald Hebb predicted a form of synaptic plasticity driven by temporal contiguity

of pre- and postsynaptic activity. This prediction was verified decades later with the

discovery of long-term potentiation, securing Hebb’s place in the scientific pantheon. The

Hebb postulate is as follows: “When an axon of cell A is near enough to excite a cell B and
repeatedly or persistently takes part in firing it, some growth process or metabolic change
takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is
increased.” In mathematical formalism the postulate stated above can be formulated as:

$$\Delta w(i, j; t) = \alpha\, x_i(t)\, x_j(t) \tag{4.15}$$

where $\alpha$ is the learning factor, and $x_i$ and $x_j$ are the outputs of neurons $i$ and $j$ respectively. More
recent research has shown that the order in which the pre- and post-synaptic spikes are
generated affects the evolution of the synaptic weights. This can be seen as an enhanced
Hebbian rule. This more precise learning rule is called STDP (Spike-Timing-Dependent
Plasticity). Depending on the relative time of arrival of spikes, a synapse undergoes
LTD (Long-Term Depression) or LTP (Long-Term Potentiation). More precisely,
if the postsynaptic action potential is produced in a 10-ms interval after the pre-synaptic
spike, LTP is generated, while LTD is produced if the order of arrival is

reversed [107].
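The two rules described above can be sketched in a few lines of Python. The function names, the default learning factor and the symmetric 10-ms window used for depression are assumptions made for illustration; the text only fixes the 10-ms window for potentiation and the sign change when the spike order is reversed.

```python
def hebbian_update(w, x_i, x_j, alpha=0.01):
    """One step of Eq. 4.15: w <- w + alpha * x_i * x_j."""
    return w + alpha * x_i * x_j

def stdp_sign(t_pre, t_post, window=0.010):
    """Return +1 (LTP), -1 (LTD) or 0 from the spike order (times in seconds)."""
    dt = t_post - t_pre
    if 0.0 < dt <= window:
        return 1    # post-synaptic spike follows the pre-synaptic one
    if -window <= dt < 0.0:
        return -1   # order reversed (symmetric window is an assumption)
    return 0        # spikes too far apart to interact
```

With plain Hebbian learning the weight grows whenever both outputs are active together, whatever their order; `stdp_sign` adds the order sensitivity that distinguishes STDP.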


In general, a ’local’ learning rule governing the modification of the synapses has to be

evaluated according to several global measures [108]:

• All possible stimuli should specifically activate some neurons in the network, i.e.

the union of all receptive fields should cover the stimulus space.

• Rules of synaptic plasticity should allow quick learning. Biological systems
learn extremely fast, reaching one-shot learning in extreme
cases [109].

• The system should allow ongoing learning and be stable simultaneously.

• A learning rule should be compatible with known physiological properties of cortical

neurons [108].

Based on the facts stated above and the neurophysiological observations on LTP and LTD,
Körding and König proposed an enhanced Hebbian rule [108]. They suppose that
the post-synaptic action potential propagates from the output to the input, giving rise
to a backpropagating action potential [110].

• If the output potential coincides with the input potential and there is no inhibitory
input in a 3 ms interval following the action potential, the synaptic weight is increased
using the following formula:

$$\Delta\omega_{LTP} = \alpha_{LTP}\,\frac{\tau}{|\tau + \Delta t|} \tag{4.16}$$

where $\Delta t$ is the time difference between the input spike and the output spike, and
$\tau = 5$ ms.

• If the output action potential coincides with the input action potential and there
is an inhibitory input in a 3 ms interval following the spike, the synaptic weight is
decreased using the following formula:

$$\Delta\omega_{LTD} = \alpha_{LTD}\,\frac{\tau}{|\tau + \Delta t|} \tag{4.17}$$


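• For stability purposes a damping term is added to the learning rule:

$$\Delta\omega_{Norm} = -\alpha_{Norm}\,\omega - \alpha_{Decay} \tag{4.18}$$

where $\alpha_{Norm}$ and $\alpha_{Decay}$ are constants.

The synaptic weight $\omega$ is given by:

$$\omega(t + \Delta t) = \omega(t) + \Delta\omega_{LTD/LTP} + \Delta\omega_{Norm} \tag{4.19}$$

$\Delta\omega_{LTD/LTP}$ means either $\Delta\omega_{LTD}$ or $\Delta\omega_{LTP}$ depending on the situation.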

• The synaptic weights remain positive:

ω = max(ω + ∆ωNorm + ∆ωLTP/LTD, 0) (4.20)
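Equations 4.16 to 4.20 combine into a single update per coincidence event, which the following Python sketch makes explicit. Only τ = 5 ms comes from the text; the function name, the numerical values of the α constants and the convention that α_LTD is negative (so that Equation 4.17 decreases the weight) are assumptions for illustration. The sketch also assumes Δt > −τ so that the denominator does not vanish.

```python
def weight_step(w, dt, inhibited, tau=5.0,
                alpha_ltp=0.05, alpha_ltd=-0.05,
                alpha_norm=0.001, alpha_decay=0.0001):
    """One update of Eqs. 4.16-4.20 (times in ms, hypothetical constants)."""
    # Inhibition within the 3 ms window selects depression, otherwise potentiation.
    alpha = alpha_ltd if inhibited else alpha_ltp
    dw_plastic = alpha * tau / abs(tau + dt)      # Eq. 4.16 or Eq. 4.17
    dw_norm = -alpha_norm * w - alpha_decay       # Eq. 4.18 (damping term)
    return max(w + dw_norm + dw_plastic, 0.0)     # Eqs. 4.19 and 4.20
```

The final `max` keeps the weight non-negative, so a weak synapse that receives a depression event is clipped to zero rather than changing sign.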

Although we described above some learning rules with memory, for the time being it is

difficult to implement such algorithms in our simple architecture of ’temporal correlation’.

The above-mentioned Hebbian-like algorithms have been implemented in our library but

have not been used because of computational complexity and synchronization problems.

It is much easier to attain synchronization and to normalize weights in the memoryless

learning paradigm.

So far, the dynamics of single neurons and cell assemblies, along with learning algorithms
for our specific task of ’temporal correlation’, have been studied. In the next section,
we will compare two architectures that let us achieve ’temporal correlation’
through synchronization. The selected framework will be adapted to the auditory and
visual problems of Chapters 5 and 6.

4.7 Implementational aspects of ’Temporal Correlation’

In this section, some implementational aspects of architectures that use ’temporal corre-

lation’ will be enumerated.


• As will be seen in Section 4.8, most architectures proposed for ’temporal
correlation’ use a local connection strategy, in which each neuron is connected only to a
couple of neighboring neurons. This means that a neuron does not need the value of
all outputs at instant t to update its state and output at instant t + 1. Hence, the
network can be implemented on a parallel architecture.

• As stated in chapter 3, the ’temporal correlation’ enables us to represent different

objects simultaneously, using the synchronization and desynchronization between

regions.

• Each single neuron in the network corresponds to a pixel of the image (either visual
scene or auditory scene). Therefore any change in the input image will have an
online and automatic impact on the behavior of the network. This is in contrast
with some classical neural networks (e.g., the Hopfield network⁸ [111]), in which a

change in the input does not mean an instantaneous and automatic change in the

dynamics of the network.
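The locality argument in the first item above can be illustrated with a small Python sketch in which every neuron of a two-dimensional map reads only its four nearest neighbors. The function names and the unweighted sum are assumptions made for this example; in the actual network each connection would carry its own synaptic weight.

```python
def neighbor_sum(outputs, row, col):
    """Sum the outputs of the 4-connected neighbors of neuron (row, col)."""
    rows, cols = len(outputs), len(outputs[0])
    total = 0.0
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < rows and 0 <= c < cols:   # positions outside the map are ignored
            total += outputs[r][c]
    return total

def coupling_map(outputs):
    """Local input of every neuron, computed only from the outputs at time t."""
    return [[neighbor_sum(outputs, r, c) for c in range(len(outputs[0]))]
            for r in range(len(outputs))]
```

Because each entry of `coupling_map` depends on at most four values of the previous state, the entries can be computed independently of one another, which is what makes a parallel implementation possible.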

In order to design an architecture that enables us to implement ’temporal correlation’,

we must implement neural synchronization and neural desynchronization.

4.8 Architectures for ’temporal correlation’

Neural synchrony is a local aspect, while neural desynchrony is a global aspect of a network.
Any proposed architecture should be able to handle both of these modes at the same
time. Some networks use long-range fully connected synapses. In this approach each neuron
is connected to all other neurons [21, 86]. A fully connected network cannot extract
local behaviors, because of long-range connections. In this case, it is difficult to achieve
desynchrony with only excitatory neurons (for details on obtaining desynchrony using an
architecture of mixed inhibitory and excitatory neurons see [86, 112]). Some more complicated
architectures use modulated synapses and two types of connections: long-range
and short-range [113]. Although the latter approach is biologically motivated,
it is computationally very expensive. A third approach consists of using only short-range
synapses and letting a global controller perform the desynchronization. In this paradigm,
each neuron on the map is only connected to a couple of neighboring neurons. The LEGION
(Locally Excitatory Globally Inhibitory Oscillatory Network) and the Attentional
Oscillatory Neural Network (AONN) explained below follow this last approach.

⁸ A kind of neural network investigated by John Hopfield in the early 1980s. The Hopfield network
has no special input or output neurons, but all are both input and output, and all are connected to all
others in both directions (with equal weights in the two directions). Input is applied simultaneously to
all neurons which then output to each other and the process continues until a stable state is reached,
which represents the network output.

4.8.1 LEGION: Locally Excitatory Globally Inhibitory Oscillatory Network

The underlying dynamics of a LEGION can be integrate-and-fire neurons [102], Van

der Pol relaxation oscillators [3, 103, 114, 115, 116, 117, 118], or chaotic oscillators [91]

[82, 119] but the general framework remains the same in all cases: neuronal elements

are connected in a neighborhood of 4 or 8. Desynchrony is guaranteed via a global

inhibitor neuron that is connected to all other excitatory neurons. Depending on the

dynamics used (integrate-and-fire, van der Pol, etc.) a mapping between the real image

and the inputs to the neuronal map adjusts the dynamic range, maximum/minimum input

values, etc. For integrate-and-fire neurons, synchronization means ’same-time impulses’

and desynchronization means ’different-in-time impulses’. In the van der Pol oscillator

case, the outputs are analog (in contrast with the integrate-and-fire dynamics in which
the output of a neuron is a discrete impulse train). Therefore, synchronization means a
phase difference equal to zero and desynchronization means a phase difference different

from zero. For the chaotic neuron, the output is non-stationary and non-ergodic, therefore

mathematical criteria of synchrony should be defined [91] [82]. These criteria are based

on the trajectory similarity between two neurons. Figure 4.8 shows the architecture of a


LEGION network.

• Integrate-and-fire LEGION. The building block of the integrate-and-fire LEGION
is the I&F neuron described in Equation 4.11. The global inhibitor, $G(t)$,
sends an instantaneous inhibitory pulse to the entire network when any oscillator
in the network fires. It is defined as:

$$G(t) = \Gamma\,\delta(t - t_j^m) \quad \forall j, m \tag{4.21}$$

where $t_j^m$ represents the $m$-th firing time of the $j$-th neuron. The constant $\Gamma$ is less than
the smallest coupling strength between neighboring oscillators.

• Van der Pol LEGION. The building blocks of the Van der Pol LEGION are the
Wang-Terman oscillators defined in Equations 4.7. The global controller is defined
as:

$$G(t) = \alpha H(z - \theta) \tag{4.22}$$

$$\frac{dz}{dt} = \sigma - \xi z \tag{4.23}$$

where $\sigma$ is equal to 1 if the global activity of the network is greater than a predefined $\zeta$
and is zero otherwise.

• Chaotic LEGION. The building blocks of the chaotic LEGION are the chaotic
maps defined in Equations 4.12 and 4.13. No global controller is used in this case,
since chaotic behavior means that a very small difference in the initial values of
neurons creates big differences in the long run. Desynchrony is therefore achieved implicitly in
the network without any global controller (or, to be consistent with the other types
of LEGION, with an implicit global controller).
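To make the global controller of the Van der Pol LEGION more concrete, the following Python sketch integrates Equation 4.23 with a forward Euler step and applies Equation 4.22 at each step. It assumes that H is the Heaviside step function, which the text does not state explicitly, and the function name, step size and parameter values are hypothetical.

```python
def simulate_global_controller(activity, zeta, dt=0.01, xi=1.0,
                               alpha=1.0, theta=0.1):
    """Integrate dz/dt = sigma - xi*z (Eq. 4.23) with forward Euler and
    return G = alpha * H(z - theta) (Eq. 4.22) for every time step."""
    z = 0.0
    inhibition = []
    for a in activity:
        # sigma switches on when the global activity exceeds zeta
        sigma = 1.0 if a > zeta else 0.0
        z += dt * (sigma - xi * z)
        # H is taken to be the Heaviside step (assumption)
        inhibition.append(alpha if z > theta else 0.0)
    return inhibition
```

With these values the inhibitor switches on a few steps after the network becomes active and switches off again once z has decayed below θ, so the inhibition outlasts the activity that triggered it.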

4.8.2 Attentional Oscillatory Neural Network (AONN)

The schematic of this architecture is shown in Figure 4.9 [1].
The Primary Layer (PL) receives the information about the input image and performs

an early stage of information processing. At this stage, the primary features of the objects


composing the image such as color, brightness, contrast, orientation, local shape, etc. are

extracted. Also the attention focus is formed in the PL which means that for some time one

object is selected from the image to be transmitted to the higher layers of processing for

further analysis (recognition, memorization, novelty detection, etc.). Object selection is

made on the basis of the features and the additional information about the image context.

The context is important because it allows the system to determine which features and
which objects are more salient at the current moment. For example, a controlling context

signal such as “black vertical bar” biases the equilibrium of the attention system in such

a way that a black vertical bar has the highest priority to be selected in the attention

focus. In this approach, the attentional system is represented by an ONN with a central

oscillator (CO). The CO plays the role of the central executive of the attention system as

it is suggested in [120]. The ONN with a CO has a star-like architecture of connections
in which global interactions between the so-called peripheral oscillators (POs) (the
elements of the ONN other than the CO) are implemented through forward and
backward connections with the CO. The model of attention focus formation and control
consists of a CO and many POs. There are forward and backward connections between
the CO and POs, which are characterized by both the connection strength and phase shift.

The dynamics of the system is described by the equations:

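$$\frac{d\theta_0}{dt} = \omega_0 + \frac{A}{n}\sum_{i=1}^{n}\sin(\theta_i - \theta_0 + \gamma) \tag{4.24}$$

$$\frac{d\theta_i}{dt} = \omega_i + B\sin(\theta_0 - \theta_i), \quad i = 1, 2, \ldots, n \tag{4.25}$$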

where $\theta_i$ are oscillator phases, $\omega_i$ are the natural frequencies of the oscillators, $A$ and
$B$ are coupling strengths, $\gamma$ is a phase shift, and $d\theta_i/dt$ describes the current frequencies
of the oscillators. Equation 4.24 describes the dynamics of the CO and Equation 4.25
describes the dynamics of the POs. The focus of attention is formed by those POs which work
synchronously with the CO. Three types of synchronization can appear:

• Full synchronization: all POs work synchronously with the CO (that is with the

same current frequency for all oscillators).


• Partial synchronization: there are some POs which work nearly synchronously with
the CO but others are out of synchronization.

• No synchronization: all oscillators have different current frequencies.
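Which of the three regimes appears depends on the coupling strengths and on the spread of the natural frequencies. The following Python sketch integrates Equations 4.24 and 4.25 with a forward Euler scheme and returns the current frequencies, so that the regimes can be compared numerically. The function name, the step size, the number of steps (at least one) and the initial phases are assumptions made for illustration.

```python
import math

def simulate_aonn(omega0, omegas, A, B, gamma=0.0, dt=0.001, steps=20000):
    """Forward-Euler integration of Eqs. 4.24 and 4.25.
    Returns the current frequencies (d theta / dt) of the CO and of the
    POs, evaluated at the last integration step."""
    n = len(omegas)
    theta0 = 0.0
    thetas = [0.1 * i for i in range(n)]   # arbitrary initial phases
    for _ in range(steps):
        # Eq. 4.24: central oscillator driven by the mean PO influence
        d0 = omega0 + (A / n) * sum(math.sin(t - theta0 + gamma) for t in thetas)
        # Eq. 4.25: each peripheral oscillator is driven by the CO only
        di = [w + B * math.sin(theta0 - t) for w, t in zip(omegas, thetas)]
        theta0 += dt * d0
        thetas = [t + dt * d for t, d in zip(thetas, di)]
    return d0, di
```

For instance, with `omega0 = 5.0`, `omegas = [4.8, 5.0, 5.2]` and `A = B = 2.0` all current frequencies converge to a common value (full synchronization), whereas with `A = B = 0.0` every oscillator keeps its natural frequency.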

In the remainder of this thesis, only the LEGION network is considered, leaving the use
of the AONN network for future work.

4.9 Conclusion

The mathematical fundamentals of bio-inspired neurons have been laid down in this chap-

ter. We pointed out that some more precise models like the Hodgkin-Huxley model are

inadequate for fast implementation. Therefore, more simplified models have been intro-

duced, among which the Wang-Terman oscillator is further used and explained in chapters

5 and 6. In addition, a general framework of synchronization has been described, which

can be used to do ’temporal correlation’. In the next two chapters, we see how all the

theoretical concepts introduced in the first three chapters can be used to solve real-life

problems like sound source separation and object recognition.


[Figure 4.3 image not reproduced. Panels (A) to (T), in order: tonic spiking, phasic spiking, tonic bursting, phasic bursting, mixed mode, spike frequency adaptation, class 1 excitable, class 2 excitable, spike latency, subthreshold oscillations, resonator, integrator, rebound spike, rebound burst, threshold variability, bistability, depolarization after-potential (DAP), accommodation, inhibition-induced spiking, inhibition-induced bursting. Each panel shows the response to an input dc-current; scale bar: 20 ms.]

Figure 4.3: Different excitation modes seen in real biological neurons (adapted from [88]).

[Figure 4.4 image not reproduced. It tabulates, for each model (integrate-and-fire, integrate-and-fire with adaptation, integrate-and-fire-or-burst, resonate-and-fire, quadratic integrate-and-fire, Izhikevich (2003), FitzHugh-Nagumo, Hindmarsh-Rose, Morris-Lecar, Wilson, Hodgkin-Huxley), whether it is biophysically meaningful and which features it reproduces (tonic spiking, phasic spiking, tonic bursting, phasic bursting, mixed mode, spike frequency adaptation, class 1 excitable, class 2 excitable, spike latency, subthreshold oscillations, resonator, integrator, rebound spike, rebound burst, threshold variability, bistability, DAP, accommodation, inhibition-induced spiking, inhibition-induced bursting, chaos). A companion plot places the models on two axes: implementation cost, from efficient to prohibitive, and biological plausibility (# of features), from poor to good.]

Figure 4.4: Comparison of different neural models (adapted from [87]). “Biological plausibility”
is the number of characteristics (e.g., tonic bursting, phasic bursting, etc.) that
can be implemented by the model. The number of flops is an approximate number of
floating point operations (addition, multiplication, etc.) needed to simulate the model
during a 1 ms time span. The author of [87] left the field blank when the verification of a
characteristic had been impossible. Some of the models not described here can be found
in [87].


[Figure 4.5 image not reproduced. Panels A and B are drawn in the (x, y) plane; panel B marks the points p, q, r and s.]

Figure 4.5: A) A nullcline of the Wang-Terman equation. A neuron with initial values
outside the nullcline tends to converge to the nullcline and continue on that curve. B) The
trajectory of a spiking relaxation oscillatory neuron in the state space. C) The output of
a relaxation oscillator (spikes) (adapted from [101]).

Figure 4.6: SIMULINK model of the “integrate-and-fire” neuron.


Figure 4.7: Temporal correlation. The initial binary image is applied to a LEGION
network. After some iterations, each letter pops up with a different synchronization phase.
a) the initial image; b) the initial states of neurons; c-f) synchronized regions. Each of
the disconnected regions (letters) synchronizes with a different synchronization phase. The
activity of the inhibitor is shown at the bottom [103].


[Figure 4.8 image not reproduced. The diagram labels the global controller.]

Figure 4.8: The architecture of the LEGION. Each of the circles represents a neuron. The
dynamics of such neurons can be either integrate-and-fire, Wang-Terman, or chaotic. The
global inhibitor is indicated by the black circle. The global controller (inhibitor) is used
in the integrate-and-fire and Wang-Terman cases but not in the chaotic case.

[Figure 4.9 image not reproduced. Diagram labels: Input image, Primary Layer, Focus of attention, Central Oscillator, CONTEXT, object, Higher layers processing.]

Figure 4.9: The architecture of the AONN network. The input image contains three

objects. In the Primary Layer an object in the focus of attention is painted in black,

other activated regions are painted in gray (adapted from [99]). The dynamics of the

Central Oscillator (CO) and the Peripheral Oscillators (PO) are given in Equations 4.24

and 4.25 respectively.


CHAPTER 5

SOURCE SEPARATION BY BIO-INSPIRED

NEURAL NETWORKS

5.1 Introduction

In this chapter we propose a spiking-neural-network approach to monaural sound source

separation. We also compare our approach to other approaches found in the literature

and discuss the pros and cons of each of them.

5.2 Source separation

Source separation of mixed signals is an important problem with many applications in

the context of audio processing. It can be used to assist a robot in segregating multiple

speakers, to ease the automatic transcription of video via the audio tracks, to separate

musical instruments before automatic transcription, to clean up the signals before
performing speech recognition, etc. (see Chapter 2). When several channels (microphones)
are available, very good separation can be obtained [121] [122] [123] [124] [125]. Very often,
however, only one channel is available to the audio engineer, who still has to solve the
separation problem. The problem of monaural (one-microphone) sound source separation is
nowadays a very challenging problem in the speech processing field.

Most monophonic source separation systems are based on expert systems [2] (explicit
knowledge), on statistical approaches [5] [33] (implicit knowledge),
or on bio-inspired approaches [32] [3]. Jang and Lee [5] and Roweis [33] have proposed
extensions of data-driven methods to the problem of monophonic source separation. Wang
and Brown [3] have proposed an original approach that uses features obtained from
correlograms and F0 (pitch frequency) in combination with an oscillatory neural network
(Chapter 4) (for more details on these different approaches see Chapter 2).

In the sound separation technique proposed in this thesis, we integrate physiology,
psychoacoustics, and signal processing to design an intelligent system in order to perform the

separation of multiple sources when only one recorded channel is available. The presented

approach is a first step towards the realization of a robust speech recognizer.

Compared with conventional approaches, our system does not require any knowledge of
the underlying signal. It neither needs a priori knowledge of the underlying sources, nor
does it estimate F0 or compute the computationally expensive correlograms. Computing
F0 is not always as simple as it may appear, especially in noisy environments, and it limits the
system to speech sound separation, ignoring the general case of audio sound separation
(note that pitch exists only in parts of speech and not in other types of speech and
sounds). Correlograms are three-dimensional plots, as shown in Figure 2.8 (page 33), of
correlations computed at all time delays. Note that this is a computationally very demanding task.

We compare the performance of our system with that of the systems from [3] and [35]. In

the latter work, Hu and Wang have shown strong improvements in comparison to [3] when

the separation uses more conventional cues such as pitch. We believe that the integration

of conventional cues should in fact improve performance, but for this thesis, our goal is

to push the neural solution to its limits.

Our proposed architecture does not perform any segmentation of the sound file into frames.

This is in contrast with other approaches like the one proposed in [3]. It is based on the

availability of simultaneous auditory representations of signals. It is fully autonomous

and does not require any training (in contrast with other statistical approaches in [34, 33,

47, 126]). There is no training or recognition phase in the proposed neural network. To
our knowledge, it is one of the first architectures that make use of fully dynamic synapses.

The approach used in this thesis uses many of the psychological (Gestalt) rules introduced


in chapter 2. For example:

• The mutual exclusivity is used to assign time-frequency bins to sources. In fact, each

time-frequency bin is assigned to one of the sources and as soon as it is assigned to

a source it cannot belong to any other one. This way of thinking gives birth to the

generation of binary masks as we will see later.

• Proximity is guaranteed through the connectedness of the neural architecture. Since
our network has local connections between neighboring neurons, it implicitly
integrates the Gestalt proximity rule.

• Good continuation is, to some extent, guaranteed by the dynamics of the neurons.
Neurons and synapses have memory in our architecture. This means that they do
not allow abrupt changes in the output of the sound source separation algorithm through time.

• Closure is implemented through local connectivity. This phenomenon can be seen

very easily in oscillatory neural networks used for image segmentation [103].

• Common fate in oscillatory neural networks can be seen in motion detection tasks

like the one proposed in [115].

The sound source separation technique proposed in this chapter is based on temporal

binding (Chapter 3). In the present work, we implement the temporal correlation as

introduced by Malsburg [21] and Milner [66] to bind auditory image objects (see section

5.3).

We are aware that association between patterns in the auditory image could be based on

direct computation of cross-correlations. This solution would lead us to include delay lines

in our network, which we are not interested in for now. The advantage of the proposed

implementation of temporal binding and temporal correlation resides in its autonomy: no delay lines have to be built into the network of neurons [127]. In the next section

some of the auditory-based feature recognizers found in the literature.


5.3 Proposed system strategy

Figure 5.1 shows the block diagram of the proposed sound separation algorithm. The

sound that may contain many sources goes through the analysis filterbank and many

outputs (channels) are generated. The CAM/CSM representation is generated. Our

proposed neural network generates a binary mask that is applied to the channels of the

filterbank. Our proposed neural network performs ‘temporal correlation’ (see chapters 3 and 4). In other words, neurons associated with filterbank channels that belong to the same sound source synchronize, while the other neurons synchronize at a different phase. The schematic behavior of such a network is shown in Figure 5.2. The figure is a 3-D plot, in which time, channels, and spike height are placed on the three axes. As can be seen, neurons belonging to different sound sources synchronize at different phases.

The synthesis filterbank synthesizes masked sound, which is an approximation of one of

the original sound sources. In the next section each of the building blocks is studied

separately.

5.4 Description of the source separation system

5.4.1 The choice of the cochlear filterbank

The proposed method, discussed further in the following sections, makes it possible to resynthesize the audio signal of a single sound source from a mixture of sources. Generally

speaking, this is achieved using a time-varying filter. The pathway of the audio signal

consists of a non-decimated, static analysis filterbank, the time-varying mask, and a static

synthesis filterbank.

We use an FIR implementation of the well-known gammatone filterbank¹ by Patterson et al. [128] (see Appendix F) as the analysis filterbank [129]. The number of channels is 256, with center frequencies from 100 Hz to 3600 Hz uniformly spaced on an ERB rate scale. The sampling rate is 8 kHz.

¹The adaptation of the analysis/synthesis filterbank has been done jointly with C. Feldbauer and G. Kubin from Graz University of Technology.

[Figure 5.1 block diagram: Sound Mixture; FIR Gammatone Analysis Filterbank (256 channels); Envelope Detection; CAM Generation; CSM Generation; Spiking Neural Network; Neural Synchrony; Mask Generation; FIR Gammatone Synthesis Filterbank (256 channels); Output]

Figure 5.1: The proposed source separation system

The actual time-varying filtering is done by the mask. Once this mask is obtained by

grouping synchronous oscillators of the neural net (see section 5.4.4), it is multiplied with

the output of the analysis filterbank. Thus, auditory channels belonging to interfering

sound sources are muted and channels belonging to the sound source of interest remain

unaffected.

Figure 5.2: 3-D plot of the output of the proposed neural network. The evolution of the output of our proposed neural network is shown through time. Neurons associated with channels belonging to the same source synchronize.

Before the signals of the masked auditory channels are added to form the synthesized signal, they are passed through the synthesis filters, whose impulse responses are time-reversed versions of the impulse responses of the corresponding analysis filters. Consequently, the magnitude of the frequency response of a synthesis filter is the same as that of the analysis filter in the same channel. The convolution with the time-reversed impulse responses linearizes the phase responses and, if the impulse responses of all filters have the same length² and, therefore, the same total group delay in all channels, summation yields a phase-distortion-free result. For a low number of channels, the only distortion of the pair of analysis and synthesis filterbanks would be a minor magnitude ripple in the overall frequency response. For the high number of channels used in our system, this ripple is negligible.
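The linear-phase property of the analysis/synthesis cascade can be checked numerically. The sketch below is only an illustration: the gammatone-like impulse response, its 1 kHz center frequency, the 125 Hz decay constant, and the 128-tap length are assumptions chosen for the demonstration, not the filters of Appendix F.

```python
import numpy as np

fs = 8000                      # sampling rate used in this chapter (Hz)
n_taps = 128                   # assumed FIR length (illustrative)
t = np.arange(n_taps) / fs

# Assumed 4th-order gammatone-like impulse response centered at 1 kHz.
h_analysis = t**3 * np.exp(-2 * np.pi * 125.0 * t) * np.cos(2 * np.pi * 1000.0 * t)
h_analysis /= np.max(np.abs(h_analysis))   # normalize so the check is not vacuous
h_synthesis = h_analysis[::-1]             # time-reversed analysis filter

# Cascade of the analysis and synthesis filters of one channel.
h_cascade = np.convolve(h_analysis, h_synthesis)

# A symmetric impulse response means linear phase: the cascade only delays
# the signal by (n_taps - 1) samples, independently of the channel.
is_symmetric = np.allclose(h_cascade, h_cascade[::-1])
delay = int(np.argmax(h_cascade))          # peak of the autocorrelation-like response
print(is_symmetric, delay)
```

Because every channel is padded to the same length, the same delay applies to all channels, which is why the channel signals can simply be summed.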

This non-decimated FIR analysis/synthesis filterbank was proposed by Irino and Unoki

[130] and also used in the perceptual speech coder in [131] (in the latter with 20 channels

only).

We had also used the IIR gammatone filterbank proposed in [132] for signal analysis and, for synthesis, we had simply summed up all modified channel signals after applying the mask. This use of IIR filters had resulted in phase distortions and an overall reduced signal reconstruction quality. In addition, as stated earlier, since the CAM/CSM takes into account only magnitude information, it cannot guarantee a good separation when nonlinear phase IIR filterbanks are used. The use of the FIR implementation of the gammatone cochlear filterbank in the present work has allowed us to overcome this problem with a significant increase of reconstruction quality. More quantitative and qualitative comparisons between results obtained by the IIR implementation and the FIR implementation will be given in the following sections.

²Shorter gammatones of higher-frequency channels need zero padding.

In the next section, the signal analysis part of the algorithm (from the raw data from the

sound mixture to the input of the neural network) is detailed.

5.4.2 Signal analysis

Our CAM/CSM generation algorithm is as follows.

1. Down-sample to 8000 samples/s.

2. Filter the sound source using a 256-filter bark-scaled cochlear filterbank ranging from 100 Hz to 3.6 kHz.

3. • For CAM: Extract the envelope (AM demodulation) for channels 30-256; for

other low frequency channels (1–29) use raw outputs [133].

• For CSM: Nothing is done in this step.

4. Compute the STFT using a Hamming window (4ms to 32ms depending on the

nature of the sound).

5. In order to increase the spectro-temporal resolution of the STFT, find the reassigned

spectrum of the STFT [134] (this consists of applying an affine transform to the

points in order to relocate the spectrum, see Figure 5.4). The reassigned spectrum

as proposed in [134] is for continuous-time signals. For our purpose the values of

reassigned ω and t are rounded to the nearest values.


6. Compute the logarithm of the magnitude of the STFT³. The logarithm enhances the presence of the stronger source in a given 2D frequency bin of the CAM/CSM⁴.
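The steps above can be sketched for a single analysis frame as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the ERB-rate formula, the truncated 4th-order gammatone design, the 256-tap length, and all function names are hypothetical choices; down-sampling (step 1) is assumed to have been done already, and the spectral reassignment of step 5 is omitted.

```python
import numpy as np

def erb_rate_centers(f_lo, f_hi, n):
    """Center frequencies uniformly spaced on an ERB-rate scale (assumed formula)."""
    to_erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    e = np.linspace(to_erb(f_lo), to_erb(f_hi), n)
    return (10.0 ** (e / 21.4) - 1.0) / 4.37e-3

def gammatone_fir(fc, fs, n_taps=256):
    """Truncated 4th-order gammatone impulse response, unit energy (assumed design)."""
    t = np.arange(n_taps) / fs
    b = 1.019 * 24.7 * (4.37e-3 * fc + 1.0)
    h = t**3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return h / np.sqrt(np.sum(h**2))

def envelope(x):
    """Magnitude of the analytic signal along the last axis (AM demodulation)."""
    n = x.shape[-1]
    gain = np.zeros(n)
    gain[0] = 1.0
    gain[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    return np.abs(np.fft.ifft(np.fft.fft(x, axis=-1) * gain, axis=-1))

def auditory_image(x, t0, fs=8000, n_ch=256, win_ms=32.0, cam=True, first_env_ch=30):
    """Log-magnitude CAM (cam=True) or CSM (cam=False) for the frame starting at t0 s."""
    fcs = erb_rate_centers(100.0, 3600.0, n_ch)
    # Step 2: cochlear filterbank, one row per channel.
    chans = np.stack([np.convolve(x, gammatone_fir(fc, fs), mode="same") for fc in fcs])
    if cam:
        # Step 3: envelope for channels first_env_ch..n_ch (1-based), raw output below.
        chans[first_env_ch - 1:] = envelope(chans[first_env_ch - 1:])
    # Step 4: one Hamming-windowed frame of the STFT.
    n_win = int(round(win_ms * 1e-3 * fs))
    start = int(round(t0 * fs))
    frame = chans[:, start:start + n_win] * np.hamming(n_win)
    spectrum = np.abs(np.fft.rfft(frame, axis=-1))
    # Step 6: logarithm of the magnitude (small offset avoids log(0)).
    return np.log(spectrum + 1e-12).T      # rows: frequency bins, columns: channels
```

Calling `auditory_image(x, t0=0.1)` on an 8 kHz signal returns one 2-D image whose columns are cochlear channels; setting `cam=False` skips the envelope step, as in the CSM.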

[Figure 5.3 plot: horizontal axis Channel Number (0 to 20), vertical axis Frequency (up to 700 Hz); regions labeled S1 Source and S2 Source]

Figure 5.3: CAM for the female /di/ and male /da/ mixture at SNR = 0 dB and t = 166 ms when the channel number is equal to 24. The separation of the two sources can be done based on ray distances.

We suppose here that envelope detection and selection between the CAM and the CSM, in

the auditory pathway, could be associated to the change of stiffness of hair cells combined

with cochlear nucleus processing [135, 136] (see Figure 5.8). For now, in the present

experimental setup, selection between the two auditory images is done manually.

The question the reader may ask is what is the theoretical proof that such representations

may work in sound source separation. The answer to this question is given in the next

subsection.

³A moving window is applied to the signal and the Fourier transform is applied to the signal within the window as the window is moved.

⁴$\log(e_1 + e_2) \simeq \max(\log e_1, \log e_2)$ (unless $e_1$ and $e_2$ are both large and almost equal) [33].
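The log-max approximation used in step 6 can be made precise with a short bound. This is a standard identity, added here as a worked step; it assumes $e_1, e_2 > 0$ and the natural logarithm:

```latex
\log(e_1 + e_2)
  = \max(\log e_1, \log e_2)
  + \log\!\left(1 + \frac{\min(e_1, e_2)}{\max(e_1, e_2)}\right),
\qquad
0 < \log\!\left(1 + \frac{\min(e_1, e_2)}{\max(e_1, e_2)}\right) \le \log 2 .
```

The correction term reaches its maximum, $\log 2$, only when $e_1 = e_2$, and it vanishes as one source dominates the bin, which is the case the representation relies on.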


Figure 5.4: Schematic representation of the signal processing steps required to compute

the reassigned spectrum. Each analysis frame is windowed and Fourier transformed three

times: using the original window (h), the derivative of the window (dh), and the product

of the window and time (th). “mult” means multiplication. “cplx mutl” is the complex

multiplication. “div” corresponds to division and FT computes the Fourier Transform.

(adapted from [134]).

5.4.3 Theoretical motivation behind the CAM/CSM generation

Let us suppose that two persons are uttering sentences at different pitches (Figure 5.9, page 108). In a simplified scheme, we can assume that the harmonics of the fundamental frequency (pitch) are convolved with the impulse response of the vocal tract (whose resonance frequencies we call formants). The effect of these formants is to amplify some of the harmonics and to attenuate others.

We suppose that each cochlear channel is dominated by one of the sources. Let us therefore say that in channel n there are a few harmonic components at multiples of $F_{01}$ and that in channel m there are harmonic components centered around multiples of $F_{02}$. There are also resonance frequencies (formants) denoted $F_{r_{i,j}}$, where $F_{r_{i,j}}$ is the jth resonance frequency (formant) of source i.


[Figure 5.5 plot: horizontal axis with ticks 5, 10, 15, 20; region labeled Siren (Noise)]

Figure 5.5: CSM (24-channel) of the mixture of /di/ and the siren in Equation 5.23 at t=50 ms. Segregation is based on the selection of energy bursts.

After AM demodulation, the frequencies $F_{01}$, $F_{02}$, and $F_{r_{i,j}}$ are translated to new frequencies according to the following formulae:

$$f_{r_{i,j}} = F_{r_{i,j}} - f_c(n) \tag{5.1}$$

$$f_{01} = F_{01} - f_c(n) \tag{5.2}$$

$$f_{02} = F_{02} - f_c(m) \tag{5.3}$$

where $f_c(n)$ is the central frequency of cochlear channel n. For each sound source (in its respective channel of dominance, m or n), the simplified utterance signal representation (after AM demodulation) is given by:

$$S_1(f) = W(f_{r_{1,1}}, f_{r_{1,2}}, f_{r_{1,3}}, \ldots) \sum_n \delta(n f_{01}) \tag{5.4}$$

$$S_2(f) = W(f_{r_{2,1}}, f_{r_{2,2}}, f_{r_{2,3}}, \ldots) \sum_n \delta(n f_{02}) \tag{5.5}$$


[Figure 5.6 plot: horizontal axis Channel Number (0 to 20), vertical axis Frequency (up to 700 Hz); regions labeled S1 Source and S2 Source]

Figure 5.6: CAM (24-channel) for the /di/ /da/ mixture. Segregation is based on harmonic selection.

W(.) is the windowing function caused by the resonances $f_{r_{i,j}}$, defined as follows:

$$W(f_{r_{i,1}}, f_{r_{i,2}}, f_{r_{i,3}}, \ldots) = A \sum_j \Pi(f_{r_{i,j}}, \Delta) \tag{5.6}$$

where $\Pi(f_{r_{i,j}}, \Delta)$ is a rectangular window of width $\Delta$ centered on frequency $f_{r_{i,j}}$, and A is a constant with $A \gg 1$.

The effect of the windowing function W (.) is that some of the harmonic multiples nf01

or nf02 will have greater amplitudes and some others will have smaller amplitudes. In

order to minimize the effect of the resonance frequencies the logarithms of S1 and S2 are

computed.

Now suppose that our 2-D map is generated for our simplified two-source speech. Spectral

rays appear on the map for that channel and the distance between rays is equal to f0i

(see Figure 5.9). What if one of the channels is not dominated by only one source, as supposed at the beginning of this section? Although this can be a practical issue, it is not a theoretical one, since the channel width of the filterbank can be chosen as small as needed (by increasing the number of channels). Figure 5.9 (page 108) shows an idealized two-speaker case.

[Figure 5.7 plot: horizontal axis Channel Number (ticks 64, 128, 192, 256), vertical axis Frequency (0 to 3600 Hz)]

Figure 5.7: CSM (24-channel) for the speech plus tone mixture. Segregation is based on energy bursts.

Now that the front-end processing is complete, its result is applied to our proposed neural network. In the next subsection, the architecture of the neural network is discussed.

5.4.4 The Neural Network

The neural network proposed in this work is based on relaxation oscillatory neurons. In another work we have used chaotic neurons (see appendix B). Chaotic networks have simpler dynamics, but they are harder to analyze and their synchronization is harder to detect (for details see appendix B).

• First layer: Auditory image segmentation. The dynamics of the neurons we use is governed by a modified version of the Van der Pol relaxation oscillator⁵ (Wang-Terman oscillators [3]). The state-space equations for this dynamics are as follows:

$$\frac{dx}{dt} = 3x - x^3 + 2 - y + \rho + p + S \tag{5.7}$$

$$\frac{dy}{dt} = \varepsilon\left[\gamma\left(1 + \tanh(x/\beta)\right) - y\right] \tag{5.8}$$

Where x is the membrane potential (output) of the neuron and y is the state for

channel activation or inactivation. ρ denotes the amplitude of a Gaussian noise, p is

the external input to the neuron, and S is the coupling from other neurons (connec-

tions through synaptic weights). ε, γ, and β are constants. The Euler integration

method is used to solve the equations. The first layer is a partially connected net-

work of relaxation oscillators [3]. Each neuron is connected to its four neighbors.

The CAM (or the CSM) is applied to the input of the neurons. Our observations

have shown that the geometric interpretation of pitch (ray distance criterion) is less

clear for the first 29 channels. For this reason, we have also established long-range

connections from clear (high frequency) zones to confusion (low frequency) zones.

These connections exist only across the cochlear channel number axis of the CAM.

This architecture can help the network to better extract harmonic patterns.

⁵Relaxation oscillators comprise a large class of nonlinear dynamical systems, and arise naturally in many fields such as mechanics, biology, chemistry, and engineering. Such periodic phenomena are characterized by intervals of time during which little happens, interleaved with intervals of time during which considerable changes take place. In other words, relaxation oscillators exhibit more than one time scale [137].

The weight between neuron(i, j) and neuron(k, m) of the first layer is computed via the following formula:

$$w_{i,j,k,m}(t) = \frac{1}{\mathrm{Card}\{N(i,j)\}} \, \frac{0.25}{e^{\lambda |p(i,j;t) - p(k,m;t)|}} \tag{5.9}$$

where p(i, j) and p(k, m) are respectively the external inputs to neuron(i, j) and neuron(k, m) ∈ N(i, j). Card{N(i, j)} is a normalization factor and is equal to the cardinal number (number of elements) of the set N(i, j) containing the neighbors connected to neuron(i, j) (it can be equal to 4, 3 or 2 depending on the location of the neuron on the map, i.e. center, corner, etc.). The external input values are normalized. The value of λ depends on the dynamic range of the inputs and is set to λ = 1 in our case. This same weight adaptation is used for the long-range clear to confusion zone connections (Equation 5.13) in the CAM processing case. The coupling $S_{i,j}$ appearing in Equation 5.7 is:

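$$S_{i,j}(t) = \sum_{k,m \in N(i,j)} w_{i,j,k,m}(t)\, H(x(k,m;t)) - \eta G(t) + \kappa L_{i,j}(t) \tag{5.10}$$

H(.) is the Heaviside function. The dynamics of G(t) (the global controller) is as follows:

$$G(t) = \alpha H(z - \theta) \tag{5.11}$$

$$\frac{dz}{dt} = \sigma - \xi z \tag{5.12}$$

σ is equal to 1 if the global activity of the network (the sum shown in Figure 5.10) is greater than ζ and is zero otherwise. α and ξ are constants.

$L_{i,j}(t)$ is the long-range coupling, defined as follows:

$$L_{i,j}(t) = \begin{cases} 0 & j \ge 30 \\ \sum_{k=225\ldots256} w_{i,j,i,k}(t)\, H(x(i,k;t)) & j < 30 \end{cases} \tag{5.13}$$

κ is a binary variable defined as follows:

$$\kappa = \begin{cases} 1 & \text{for CAM} \\ 0 & \text{for CSM} \end{cases} \tag{5.14}$$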

TABLE 5.1: The numerical values of the different parameters used in the first layer of the network.

Constant’s name    Value
λ                  1
θ                  0.9
α                  -0.1
ξ                  0.4
ζ                  0.2
η                  0.05
γ                  4.0
ε                  0.02
ρ                  0.02
β                  0.1
κ                  0.2

• Second layer: temporal correlation and multiplicative synapses [138]. The second layer is an array of 256 neurons (one for each channel). Each neuron receives the weighted product of the outputs of the first-layer neurons along the frequency axis of the CAM/CSM. The weights between layer one and layer two are defined as wll(i) = αi, where i can be related to the frequency bins of the STFT and α is a constant for the CAM case, since we are looking for structured patterns. For the CSM, wll(i) = α is constant along the frequency bins as we are looking for energy bursts. Therefore, the input stimulus to neuron(j) in the second layer is defined as follows [138]:

$$\theta(j;t) = \prod_i w_{ll}(i)\, \Xi\{\overline{x(i,j;t)}\} \tag{5.15}$$

The operator Ξ is defined as:

$$\Xi\{\overline{x(i,j;t)}\} = \begin{cases} 1 & \text{for } \overline{x(i,j;t)} = 0 \\ \overline{x(i,j;t)} & \text{elsewhere} \end{cases} \tag{5.16}$$

where $\overline{(\cdot)}$ is the operator that averages over a time window (the duration of the window is of the order of the discharge period). The multiplication is done only for non-zero outputs (in which a spike is present) [139, 140]. This behavior has been observed in the integration of ITD (Interaural Time Difference) and ILD (Interaural Level Difference) information in the barn owl’s auditory system [139] or in the monkey’s posterior parietal lobe neurons, which show receptive fields that can be explained by a multiplication of retinal and eye or head position signals [141]. The theoretical motivations behind using a multiplicative synapse instead of an additive one are explained in Section 7.3.

TABLE 5.2: The numerical values of the different parameters used in the second layer of the network.

Constant’s name    Value
α                  1
µ                  2

The synaptic weights inside the second layer are adjusted through the following rule:

$$w'_{ij}(t) = \frac{0.2}{e^{\mu |p(j;t) - p(k;t)|}} \tag{5.17}$$

µ is chosen to be equal to 2. The “binding” of these features is done via this second layer. In fact, the second layer is an array of fully connected neurons along with a global controller. The global controller desynchronizes the synchronized neurons for the first and second sources by emitting inhibitory activities whenever there is activity (spiking) in the network [3]. Note also that the Heaviside function H(.) of the input values is applied to the neurons because of synchronization considerations. Regions with different first-layer activity will dissociate through very weak synaptic connections, producing desynchronization (similar frequencies but different phases), and similar regions will synchronize (similar frequency and phase) through strong synaptic connections.
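The building blocks of the two layers can be sketched in a few lines. This is a hedged illustration of Equations 5.7 to 5.9 and 5.15 to 5.16 only, not the full network: the constants come from Table 5.1, while the step size `dt`, the input `p = 0.5`, the initial state, the noise handling (an already-sampled value passed as `noise`), and all function names are assumptions made for the example.

```python
import numpy as np

# Constants taken from Table 5.1 (first layer).
EPS, GAMMA, BETA, LAM = 0.02, 4.0, 0.1, 1.0

def oscillator_step(x, y, p, s, dt=0.05, noise=0.0):
    """One Euler step of Equations 5.7 and 5.8 for a single neuron."""
    dx = 3.0 * x - x**3 + 2.0 - y + noise + p + s
    dy = EPS * (GAMMA * (1.0 + np.tanh(x / BETA)) - y)
    return x + dt * dx, y + dt * dy

def first_layer_weight(p_a, p_b, n_neighbors):
    """Dynamic synaptic weight of Equation 5.9 between two connected neurons."""
    return 0.25 / (n_neighbors * np.exp(LAM * abs(p_a - p_b)))

def second_layer_input(x_mean, w_ll):
    """Equations 5.15 and 5.16: weighted product along the frequency axis.

    x_mean : array (n_freq_bins, n_channels) of time-averaged first-layer outputs
    w_ll   : array (n_freq_bins,) of weights between layer one and layer two
    Zero averages are replaced by 1 so that they do not cancel the product.
    """
    xi = np.where(x_mean == 0.0, 1.0, x_mean)
    return np.prod(w_ll[:, None] * xi, axis=0)

# An isolated neuron (S = 0) driven by a positive input oscillates between
# a silent phase (x near -2) and an active phase (x near +2).
x, y = -1.5, 2.0
trace = []
for _ in range(8000):                      # 400 time units with dt = 0.05
    x, y = oscillator_step(x, y, p=0.5, s=0.0)
    trace.append(x)
trace = np.array(trace)
print(trace.min(), trace.max())
```

Identical inputs give the largest weight in Equation 5.9, so neighboring neurons that see similar CAM/CSM values are the most strongly coupled and the most likely to synchronize.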

Now that the neural network has separated the different sound sources based on the neural synchrony of its outputs, the extracted information is used to generate a binary mask. This mask is used to synthesize the sources, as explained in the next subsection.


5.4.5 Synthesis

Our system assumes that different sources segregate in the auditory image representation

space and that masking of the undesired sources is feasible. In fact, speech has a specific

(characteristic) structure that is different from that of most noises and perturbations [142].

Also, when dealing with simultaneous speakers, separation is possible when preserving the

time structure (the probability at a given instant t to observe overlap in pitch and timbre

is relatively low), therefore, a mask can be used to suppress the interference (or separate

all sources with adaptive masks). Here is how the synthesis is done in our system:

• The time-reversed signal is passed through the synthesis filterbank, yielding $z_i(t)$.

• The mask is applied to the channels and the extracted signal is computed. The energy of each frame of the signal is normalized before synthesis.

$$s(t) = \sum_{i=1}^{256} m_i(t)\, z_i^{norm}(t) \tag{5.18}$$

where s(N − t) is the recovered signal (N is the length of the signal in discrete mode), $z_i^{norm}(t)$ is the normalized filtered output of the original corrupted signal for channel i, and $m_i(t)$ is the mask value. The mask has equal values for all channels whose associated neurons are synchronized, e.g. $m_i(t)$ = 0 or 1, depending on the source to be enhanced. Another approach is to apply the mask before the synthesis filterbank, instead of applying it after the synthesis filterbank. This approach has been tested. Figure 5.22 shows the result when the mask is applied before the synthesis. There are pros and cons for each approach (masking before or after synthesis). In the case of masking after the synthesis filterbank, there is musical noise and some rays appear on the spectrum. In the case of masking before synthesis, pink noise is present in the result.
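Equation 5.18 amounts to a masked sum over channels followed by a time reversal. The sketch below assumes that the normalized synthesis-filter outputs and the mask are already available as arrays on the same (time-reversed) time axis; the function names, the array layout, and the frame-wise group labels are hypothetical.

```python
import numpy as np

def mask_from_groups(groups, target, frame_len, n_samples):
    """Binary mask m_i(t) from per-frame synchronization groups.

    groups : int array (n_channels, n_frames); groups[i, f] is the label of the
             set of synchronized neurons that channel i belongs to in frame f
    target : label of the source to keep
    """
    m = (groups == target).astype(float)
    # Hold each frame decision for frame_len samples, then trim to the signal length.
    return np.repeat(m, frame_len, axis=1)[:, :n_samples]

def synthesize(z_norm, mask):
    """Equation 5.18: s(t) = sum_i m_i(t) z_i^norm(t).

    z_norm and mask have shape (n_channels, n_samples). Since z_norm comes from
    the time-reversed signal, the sum is flipped to return the recovered signal.
    """
    s = np.sum(mask * z_norm, axis=0)
    return s[::-1]
```

Swapping the target label extracts the other source from the same filterbank outputs, which is how both the sentence and the interference can be resynthesized from one mixture.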

In the next section results obtained from different experimental setups will be given.

These results will be compared to results obtained by other techniques proposed in the


literature.

5.5 Experiments

5.5.1 Database and comparison

Martin Cooke’s database [62] is used for evaluation purposes. The following noises have

been tested: 1 kHz tone, FM siren, white noise, trill telephone noise, and speech. The

aforementioned noises have been added to the target utterance. The audio results (files)

can be found at [143]. Each mixture is applied to the neural system and the mixed sound

sources are separated. The LSD (Log Spectral Distortion) and the PEL (Percentage of

Energy Loss) are used as performance criteria [144, 145, 43, 41]. The LSD is defined

below:

$$LSD = \frac{1}{L} \sum_{l=0}^{L-1} \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \left( 20 \log_{10} \frac{|I(k,l)| + \epsilon}{|O(k,l)| + \epsilon} \right)^2} \tag{5.19}$$

Where I(k, l) and O(k, l) are the FFT of I(t) (ideal source signal) and O(t) (separated

source) respectively. L is the number of frames, K is the number of frequency bins and ε

is meant to prevent extreme values (equal to .001 in our case).
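Equation 5.19 can be evaluated with a short routine. The sketch below is an assumption-laden illustration: the frame length, hop size, Hamming window, and use of a one-sided FFT are not specified in the text and are chosen here only for the example; `eps` is the 0.001 value quoted above.

```python
import numpy as np

def log_spectral_distortion(ideal, separated, frame_len=256, hop=128, eps=1e-3):
    """Log spectral distortion (Equation 5.19) between two equal-length signals."""
    n_frames = 1 + (len(ideal) - frame_len) // hop
    window = np.hamming(frame_len)
    total = 0.0
    for l in range(n_frames):
        seg = slice(l * hop, l * hop + frame_len)
        I = np.abs(np.fft.rfft(window * ideal[seg]))      # |I(k, l)|
        O = np.abs(np.fft.rfft(window * separated[seg]))  # |O(k, l)|
        ratio_db = 20.0 * np.log10((I + eps) / (O + eps))
        total += np.sqrt(np.mean(ratio_db**2))            # RMS over frequency bins
    return total / n_frames                               # average over frames
```

A perfect reconstruction gives an LSD of zero, and larger values indicate a larger spectral mismatch between the ideal and the separated source.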

The PEL is defined as follows:

$$PEL = \frac{\sum_t e_2^2(t)}{\sum_t I^2(t)} \tag{5.20}$$

The PNR (Percentage of Noise Residual) is defined as follows:

$$PNR = \frac{\sum_t e_1^2(t)}{\sum_t O^2(t)} \tag{5.21}$$

PEL indicates the percentage of target speech excluded from the segregated speech, and PNR the percentage of intrusion included in the synthesized speech. O(t) is the speech produced by our system. The speech waveform resynthesized from the ideal binary mask is denoted by I(t). To obtain $e_2(t)$, a mask is constructed as follows. A T-F unit is assigned 1 if and only if it is 1 in the ideal binary mask but 0 in the segregated target stream. $e_2(t)$ is then obtained by resynthesizing the input mixture from the obtained mask. $e_1(t)$ is obtained in a similar way⁶.

Although this criterion is used in [35, 146, 4, 42, 12], it is difficult to determine the ideal mask; therefore this criterion is not used for all experiments.
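The construction of the two error masks and the energy ratios of Equations 5.20 and 5.21 can be sketched as follows. The mask for $e_1(t)$ is assumed here to be the mirror image of the one described for $e_2(t)$ (units that are 0 in the ideal mask but 1 in the segregated stream); the text only says it is obtained "in a similar way". The function names are hypothetical, the resynthesis of $e_1(t)$ and $e_2(t)$ from the masks is left to the synthesis stage, and the factor 100 is an assumption that expresses the ratio as a percentage.

```python
import numpy as np

def error_masks(ideal_mask, estimated_mask):
    """Return the T-F masks used to resynthesize e2(t) and e1(t)."""
    ideal = np.asarray(ideal_mask).astype(bool)
    est = np.asarray(estimated_mask).astype(bool)
    lost = ideal & ~est          # 1 in the ideal mask, 0 in the segregated stream -> e2(t)
    intruding = ~ideal & est     # assumed mirror case -> e1(t)
    return lost, intruding

def energy_percentage(num, den):
    """100 * sum(num^2) / sum(den^2); PEL uses (e2, I) and PNR uses (e1, O)."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return 100.0 * np.sum(num**2) / np.sum(den**2)
```

With these helpers, PEL would be `energy_percentage(e2, I)` and PNR would be `energy_percentage(e1, O)` once the four waveforms have been resynthesized.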

5.5.2 Separation performance

Table 5.3 gives the LSD. The SNR of the initial signal is calculated by

$$SNR = 10 \log_{10} \frac{\sum s^2(t)}{\sum n^2(t)} \tag{5.22}$$

where s(t) represents the original target signal and n(t) the noise. In all cases, the system performs better than [3]. It gives the best result of the three methods when the interference is a tone. For the siren, it is comparable to [146]. For telephone and white noise, [146] is the best. For the double-vowel, the LSD is the highest, showing that separation is more difficult when the interference is speech. In what follows, spectrograms for different sounds and different approaches are given for visual comparison purposes.
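Equation 5.22 and the construction of a mixture at a prescribed SNR can be sketched as follows. The helper names are hypothetical, and scaling the noise (rather than the target) to reach the requested SNR is an assumption; the text does not say how the mixtures of Table 5.3 were scaled.

```python
import numpy as np

def snr_db(s, n):
    """SNR of Equation 5.22 in dB, for target s(t) and noise n(t)."""
    return 10.0 * np.log10(np.sum(s**2) / np.sum(n**2))

def mix_at_snr(s, n, target_db):
    """Scale the noise so that the mixture has the requested SNR.

    Returns the mixture and the scaled noise.
    """
    gain = np.sqrt(np.sum(s**2) / (np.sum(n**2) * 10.0 ** (target_db / 10.0)))
    return s + gain * n, gain * n
```

For example, `mix_at_snr(s, n, -5.0)` produces a mixture in which the noise carries about 3.2 times the energy of the target, as in the siren and white noise rows of Table 5.3.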

We have so far compared our proposed technique to other approaches quantitatively by

using LSD, PEL, and PNR criteria. In the next section qualitative comparison will be

made available to the reader.

5.6 Separation examples

5.6.1 Separation of speech from telephone trill

Figure 5.11 shows the mixture of the utterance “Why were you all weary?” with the

telephone trill noise (from Martin Cooke’s database). The trill telephone noise (ring) is

wideband, interrupted, and structured. Figure 5.12 shows the spectrograms of the separated utterance and of the telephone trill obtained with our approach. It is interesting to note that the low-frequency range of the telephone trill has been preserved. Figure 5.13 shows the utterance extracted by using [3]. As can be seen, our approach performs better at higher frequencies.

⁶Personal communication with Goning Hu.

TABLE 5.3: The log spectral distortion (LSD) for three different methods: P-R (our proposed approach), W-B (the method proposed by [3]), and H-W (the method proposed by [146]). The intrusion noises are as follows: a) 1 kHz pure tone, b) FM siren, c) telephone ring, d) white noise, e) male-speaker intrusion (/di/) for the French /di//da/ mixture, f) female-speaker intrusion (/da/) for the French /di//da/ mixture. Except for the last two tests, the intrusions are mixed with a sentence taken from Martin Cooke’s database.

Intrusion      SNR of the initial mixture (dB)   P-R LSD   W-B LSD   H-W LSD
Tone           -2                                7.07      23.15     16.45
Siren          -5                                8.68      17.26     8.52
Tel. ring      3                                 15.43     16.56     10.11
White noise    -5                                15.29     18.41     12.77
Male (da)      0                                 23.70     N/A       N/A
Female (di)    0                                 17.95     N/A       N/A

5.6.2 Separation of speech from 1 kHz tone

In this experiment the utterance “I willingly marry Marilyn” with a 1 kHz pure tone is

used. The tone is narrowband, continuous, and structured. Figure 5.14 shows the original

utterance plus 1 kHz tone. Figure 5.15 shows the separation results for our approach and

the approach proposed by [3]. The method proposed in [3] removes speech in middle

and high frequencies, while these frequencies remain unaffected by our approach. When

listening to the signal and according to the LSD (equal to 7.07), the tone has been removed

(even though a gray bar is visible in Figure 5.15, left).

5.6.3 Double-vowel segregation case

Two speakers have simultaneously pronounced a /di/ and a /da/, respectively (spectrogram in Figure 5.16). We observe that the CSM representation does not generate a very discriminative representation while, from the CAM, the two speakers are well separable. After binding, two sets of synchronized neurons are obtained: one for each speaker.

Separation is performed by using Equation 5.18, where mi(t) = 0 for one speaker and

mi(t) = 1 for the other speaker (target speaker). For the /di/+/da/ mixture, we used

the PEL (Percentage of Energy Loss) as an evaluation criterion.

The PEL for the synthesized /da/ is 15.01% at SNR = 0dB and is equal to 16.67% for

the /di/. Perceptual tests have shown that although we lose some sound quality after the

process, the vowels are separated and are clearly recognizable.


5.6.4 Sentence plus siren

The siren used in Cooke’s database [62] [3] (Equation 5.23) is mixed with the sentence “I

willingly marry Marilyn”.

The spectrogram of the mixed sound is shown in Figure 5.19. The noise is represented by the following equation and can be generated by a VCO (voltage-controlled oscillator):

$$n(t) = \sum_i \cos\left[\omega_i t + \frac{\Delta\omega}{\omega_m} \cos(\omega_m t + \varphi_i)\right] \tag{5.23}$$

Where ωi is the central angular frequency, ωm is the angular frequency of the modulating

signal, ∆ω is the angular frequency deviation, and ϕi is the phase of the modulating

signal (equal to 0 in Figure 5.16). We are looking for short but high energy bursts. We

observe that the CSM representation generates a very discriminative representation of the

speech and siren signals, while, on the other hand, the CAM fades the image because of

the envelopes. After binding, two sets of synchronized neurons are obtained: one for each

source. Separation is performed by using Equation 5.18, where mi(t) = 0 for the siren and

mi(t) = 1 for the speech sentence and vice-versa. The CSM is presented to the spiking

neural network. The weighted product of the outputs of first layer along the frequency

axis is different when the siren is present. The binding of channels on the two sides of

the “noise intruding zone” is done via the long-range synaptic connections of the second

layer. A CSM is extracted every 10 ms and the selection is made in 10 ms intervals. In future work, we plan to use much smaller selection intervals and shorter STFT windows to prevent discontinuities, as observed in Figures 5.20 and 5.21. Furthermore, overlapping cochlear filters are not suitable for the synthesis of the processed speech.
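A signal of the form of Equation 5.23 can be generated directly. The sketch below is illustrative only: the function name and every numerical value in the usage note (center frequency, deviation, modulation rate) are hypothetical, since the parameters of the siren in Cooke's database are not given in the text.

```python
import numpy as np

def fm_siren(duration, fs, centers_hz, dev_hz, mod_hz, phases=None):
    """Sum of frequency-modulated cosines following Equation 5.23.

    centers_hz : central frequencies omega_i / (2 pi) of the components
    dev_hz     : frequency deviation Delta omega / (2 pi)
    mod_hz     : modulation frequency omega_m / (2 pi)
    phases     : phases phi_i of the modulating signal (default 0)
    """
    t = np.arange(int(duration * fs)) / fs
    phases = np.zeros(len(centers_hz)) if phases is None else np.asarray(phases)
    w_m = 2 * np.pi * mod_hz
    d_w = 2 * np.pi * dev_hz
    n = np.zeros_like(t)
    for f_i, phi in zip(centers_hz, phases):
        n += np.cos(2 * np.pi * f_i * t + (d_w / w_m) * np.cos(w_m * t + phi))
    return n
```

For instance, `fm_siren(0.5, 8000, [1000.0], 200.0, 2.0)` yields half a second of a single component sweeping around 1 kHz; several entries in `centers_hz` add further components.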


Figure 5.8: The change in the stiffness of the hair cells due to a change of the stimulus.

Πl(τ) represents the terminal contribution of the Outer Hair Cells (OHC) efferent system.

The efferent gain Π1(τ) in the upper frequency band of the cochleogram partly compen-

sates for the loss. The low-frequency efferent gain Π4(τ) is primarily sensitive to voicing

and shows large temporal fluctuations. Kst is the stiffness of the terminal contribution of

the acoustic reflex (AR) on the middle ear. Maximal stapedial muscle contraction Kmax

incurs a loss in middle ear transmission below 1000 Hz of up to 15 dB (scanned from [31],

chapter 25). See section 5.4.2 for details.


[Figure 5.9 schematic: horizontal axis Cochlear Channels, vertical axis Frequencies; rays for Source 1 and Source 2, with the ray spacing f01 and the resonance frequencies fr1,1, fr1,2, fr2,1, fr2,2, fr2,3 marked]

Figure 5.9: Idealized schematic of a 2-D spectral map (Cochleotopic/AMtopic) for a two-speaker signal. The distance between rays corresponds to the pitch of the source. Note that the amplitudes of the rays are not equal because of the effect of the formants. Some resonance frequencies fri,j are shown by dotted boxes. Note that the resonance frequencies do not always match the harmonic frequencies nf0.


[Figure 5.10 diagram: first-layer neurons Neuron(i,j) and Neuron(k,m) coupled through the weight w(i,j,k,m) applied to H(.) x(k,m;t); one long-range connection L(i,j); the global controller G, with dz/dt = σ − ξz, σ = 1 when sum > ζ and σ = 0 when sum < ζ, acting through −η; the CAM/CSM as input; the synchronization layer]

Figure 5.10: Architecture of the Two-Layer Bio-inspired Neural Network. G stands for global controller (the global controller for the first layer is not shown in the figure). One long-range connection is shown in the figure. The CAM/CSM is applied to the first layer. The synchronization on the second layer is based on the similarity of cochlear channels. Neurons associated with channels belonging to the same source synchronize.


[Figure 5.11 spectrograms (two panels): horizontal axis Time (s), 0 to 1.51; vertical axis Frequency (Hz), 0 to 4000]

Figure 5.11: Mixture of the utterance “Why were you all weary?” with a trill telephone noise.

[Figure 5.12 spectrograms (two panels): horizontal axis Time (s), 0 to 1.51; vertical axis Frequency (Hz), 0 to 4000]

Figure 5.12: Separation results for the trill telephone noise. Left: The synthesised “Why were you all weary?” after the separation by the approach proposed in this article. Right: The synthesised trill phone after the separation by the approach proposed in this article.


[Figure 5.13 spectrogram: horizontal axis Time (s), 0 to 1.75; vertical axis Frequency (Hz), 0 to 4000]

Figure 5.13: The synthesized “Why were you all weary?” by the approach proposed by [3]. The high-frequency information is missing.

[Figure 5.14 spectrogram: horizontal axis Time (s), 0 to 1.75; vertical axis Frequency (Hz), 0 to 4000]

Figure 5.14: Mixture of the utterance “I willingly marry Marilyn” with 1 kHz tone.


[Figure 5.15 spectrograms (two panels): horizontal axis Time (s), 0 to 1.75 (left) and 0 to 1.46 (right); vertical axis Frequency (Hz), 0 to 4000]

Figure 5.15: Comparison between our approach and Wang’s approach for the ’1 kHz’ tone. Left: The separation result for the 1 kHz plus utterance mixture using the approach described in this thesis. The dynamic range between the darkest gray level and the brightest level is 50 dB. Right: The synthesised “Why were you all weary?” by the approach proposed by [3]. The high-frequency information is missing.

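[Figure 5.16 spectrogram: vertical axis Frequency (Hz), ticks at 1000, 2000, 3000, 4000]

Figure 5.16: The spectrogram of the /di/ /da/ mixture.

[Figure 5.17 spectrogram: vertical axis Frequency (Hz), ticks at 1000, 2000, 3000, 4000]

Figure 5.17: The spectrogram of the extracted /di/.

[Figure 5.18 spectrogram: vertical axis Frequency (Hz), ticks at 1000, 2000, 3000, 4000]

Figure 5.18: The spectrogram of the extracted /da/.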


[Figure 5.19 spectrogram: horizontal axis Time (s), 0 to 1.75; vertical axis Frequency (Hz), 0 to 4000]

Figure 5.19: Mixture of a siren and the sentence “I willingly marry Marilyn”.


[Figure 5.20 spectrograms (two panels): horizontal axis Time (s), 0 to 1.75; vertical axis Frequency (Hz), 0 to 4000]

Figure 5.20: Synthesis by an FIR implementation. Left: Results with the proposed 256-channel FIR gammatone filterbank: the spectrogram of the extracted siren. Right: Results with the proposed 256-channel FIR gammatone filterbank: the spectrogram of the utterance (the siren is removed).

5.6.5 PESQ

Another quantitative performance criterion, used in speech coding, is the PESQ (Perceptual Evaluation of Speech Quality). We propose here to use this criterion for sound source separation⁷.

The PESQ is an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs, which is applicable to any end-to-end measurement. This evaluation method has been proposed by the ITU (International Telecommunication Union) under Recommendation P.862. The code and documentation for PESQ can be downloaded at [147] (see also [148]).

Our technique gives better results than [3] and [4] based on the PESQ for all mixtures tested (except for the telephone ring, in which case performance is comparable). According to the LSD, however, our technique performs better than Wang & Brown [3] and performs either better or worse (depending on the mixture used) compared with Hu & Wang [4]. Compared with the statistical approach proposed in [5], our approach performs better for the

7 Many thanks to Vijay Parsa from the University of Western Ontario for fruitful discussions on PESQ.


Figure 5.21: Synthesis by an IIR implementation, using a 256-channel IIR implementation of the gammatone filter. Left: the spectrogram of the extracted siren. Right: the spectrogram of the utterance (the siren is removed). (Axes in both panels: Time (s), 0 to 1.75; Frequency (Hz), 0 to 4000.)

extraction of music and performs slightly worse for the extraction of speech. However, the technique proposed in [5] was statistically trained on speech before the separation phase.

5.6.6 Three-source case

Results on three-source cases (utterance plus siren plus telephone trill) have shown that the technique proposed in this thesis can be easily generalized to multiple-source separation. In addition, music plus utterance sound source separation has been tested with the proposed technique. The sound files for these results can be found at [143]. No quantitative comparison is made, since data for other approaches are not available [148].

In the next section, some preliminary findings on source separation based on chaotic oscillators will be given. The advantage of the technique described below is its computational simplicity compared with the technique proposed above.


Figure 5.22: Synthesis result for the siren plus sentence case, when the masking is applied before the masking. Musical noise is decreased but pink noise is increased. (Axes: Time (s), 0 to 1.75; Frequency (Hz), 0 to 4000.)

Figure 5.23: The synthesized “Why were you all weary?” by the approach proposed by [3] in the siren plus utterance mixture case. The high-frequency information is missing. (Axes: Time (s), 0 to 1.46; Frequency (Hz), 0 to 4000.)

5.7 Conclusion and Further Work

Based on evidence regarding the dynamics of the efferent loop [135] and on the richness of the representations observed in the Cochlear Nucleus, we proposed a technique to explore the monophonic source separation problem using a multi-representation (CAM/CSM) bio-inspired pre-processing stage and a bio-inspired neural network that does not require any a priori knowledge of the signal. We saw how this technique helped separate target sound sources from interfering noises such as a siren, a telephone trill, music, a tone, other talkers, etc. We also compared our technique to other techniques proposed in the literature and saw that


TABLE 5.4: The PESQ of three different methods: P-R (our proposed approach), W-B ([3]), and H-W ([4]) (see caption of Table 5.3). Higher values mean better performance.

Intrusion (noise)   Initial SNR of mixture   P-R (PESQ)   W-B (PESQ)   H-W (PESQ)
Tone                -2 dB                    0.4          0.2          0.4
Siren               -5 dB                    2.1          1.6          1.2
Tel. ring            3 dB                    0.9          0.7          0.9
White               -5 dB                    0.9          0.2          0.3
Male (da)            0 dB                    2.1          N/A          N/A
Female (di)          0 dB                    0.7          N/A          N/A

TABLE 5.5: PESQ for two different methods: P-R (our proposed approach) and J-L ([5]). The mixture comprises a female voice with musical rock background.

Mixture               Separated source   P-R (PESQ)   J-L (PESQ)
Music & female (AF)   music              1.70         0.35
Music & female (AF)   voice              0.55         0.63

ours performs better in almost all cases than a well-known bio-inspired sound separation technique [3]. We also compared our technique to a pitch-based technique [4]. We saw that our technique sometimes performs better and sometimes worse, but the results remain comparable. On the other hand, the aim of this thesis was to design a bio-inspired sound separator, whereas the pitch-based technique is more expert-system oriented. We believe that our system is more flexible than some other techniques found in the literature. For example, a technique that relies on pitch, like the one used in [4], cannot be used for musical sound source separation, since the harmonic structure is different in music. We believe that our approach can be applied to the separation of musical instruments since, for example, we have made no assumption on the formant structure of speech in developing our algorithm. In addition, preliminary experiments show that our technique works for three-source sound separation. The authors of [3] and [4] have not reported separation of more than two sources.

For the time being, the CSM/CAM selection is done manually. In further work, one can

include a top–down module based on the SNR gain between inputs and the extracted

signals to selectively find the suitable auditory image representation, depending on the

neural network synchronization. Other maps like ones that are based on instantaneous

frequencies (FM) can be added to the multi-representation [149].

As stated earlier, the CSM is computed at 10 ms intervals with a 64 ms STFT window (see footnote 8). In future work, smaller intervals and shorter STFT windows should be chosen to diminish the spectral discontinuities. In addition, other speech analysis techniques that do not require stationary signals (such as wavelets) could be used. More thorough comparisons with other techniques, like those proposed in [47] and [5] (among others), are also planned.
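The framing stated above (a 64 ms window advanced in 10 ms steps) can be sketched as follows. This is only an illustration: the 8 kHz sampling rate is an assumption suggested by the 4000 Hz upper limit of the spectrograms, and the Hann window and the helper name `frame_signal` are hypothetical choices, not taken from the thesis.

```python
import numpy as np

def frame_signal(signal, fs, window_ms=64.0, hop_ms=10.0):
    """Cut a signal into overlapping, Hann-windowed frames (window choice is an assumption)."""
    win = int(round(fs * window_ms / 1000.0))   # samples per analysis window
    hop = int(round(fs * hop_ms / 1000.0))      # samples between window starts
    n_frames = 1 + (len(signal) - win) // hop
    # Row n holds the sample indices of frame n.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hanning(win)

fs = 8000                                        # assumed sampling rate (Hz)
signal = np.random.default_rng(0).standard_normal(fs)  # one second of test noise
frames = frame_signal(signal, fs)
spectra = np.abs(np.fft.rfft(frames, axis=1))    # magnitude spectrum per frame
overlap = 1.0 - (0.010 * fs) / (0.064 * fs)      # fraction shared by consecutive frames
```

Under these assumptions each window holds 512 samples and consecutive windows share about 84 % of their samples; shortening the window improves time resolution at the cost of coarser frequency resolution.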

Musical noise is inherent to any technique based on binary masks. To fix this problem, non-binary masks with smooth transitions must be used to reduce different types of noise. Non-binary masks contradict the Gestalt rule of mutual exclusivity (recall that under this rule an object cannot belong to two different entities at the same time). Therefore, a psychological interpretation of such non-binary masks should be found.
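The difference between a binary mask and a mask with smooth transitions can be illustrated with a minimal sketch. The thesis does not specify a non-binary mask; the logistic function of the log energy ratio used here, and the names `binary_mask`, `soft_mask`, and `slope`, are assumptions chosen only for illustration.

```python
import numpy as np

def binary_mask(target_energy, interference_energy):
    """Hard decision: a channel belongs entirely to the target or not at all."""
    return (target_energy > interference_energy).astype(float)

def soft_mask(target_energy, interference_energy, slope=1.0, floor=1e-12):
    """Smooth decision: logistic function of the log energy ratio (illustrative choice)."""
    log_ratio = np.log((target_energy + floor) / (interference_energy + floor))
    return 1.0 / (1.0 + np.exp(-slope * log_ratio))

target = np.array([4.0, 1.0, 0.25])        # hypothetical per-channel target energies
interference = np.array([1.0, 1.0, 1.0])   # hypothetical per-channel interference energies
hard = binary_mask(target, interference)
soft = soft_mask(target, interference)
```

With `slope=1.0` the soft mask reduces to the energy ratio target / (target + interference), so a channel shared equally by both sources receives a weight of 0.5 instead of being switched fully on or off.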

Qualitative results obtained from signal synthesis are encouraging and we believe that

spiking neural networks in combination with suitable signal representations have a strong

potential in speech and audio processing.

The segregation results are not very good for a sound file mixed with white noise. Other

types of auditory maps should be developed for the white noise intrusions. In fact, we

observed that the leakage between different filters of the filterbank somehow amplifies

the noise. Therefore, more work should be done on the design of more suitable analy-

sis/synthesis filterbanks.

8 The window length is equal to 4 ms for the telephone trill.


In [150, 151, 152], it has been shown that, in spiking neural networks, a circular chain of neurons has a faster synchronization time than a linear chain of neurons. Based on this fact, one can imagine modifying the linear chain of the second layer of our proposed network into a circular chain in order to increase synchronization speed.

The parameters in Tables 5.1 and 5.2 were chosen empirically in this work. A mathematical analysis or a statistical study should be done to find the optimal values for these parameters.

In this chapter, a method has been proposed to do “bottom-up” sound source separation. A “top-down” processor should be integrated into this technique to further enhance performance. Top-down processing uses higher-level information (at the word level, etc.) to match the pattern obtained by the “bottom-up” processor to an a priori known pattern. In the next chapter of this thesis, we propose an architecture that can potentially do this kind of “top-down” processing, although for the time being it has only been tested on visual “toy objects” (see [153]).

CHAPTER 6

ODLM FOR PATTERN RECOGNITION

6.1 Introduction

In this chapter we propose the Oscillatory Dynamic Link Matching (ODLM), which is an extension of the Dynamic Link Matching (DLM). We present how this technique can help match objects to predefined patterns. This technique can be used in future work to match auditory patterns in a “top-down” processor that can be coupled to the approach proposed in Chapter 5.

6.2 Pattern Recognition

Pattern recognition is a branch of artificial intelligence concerned with the classification

or description of observations.

Pattern recognition aims to classify data (patterns) based on either a priori knowledge

or on statistical information extracted from the patterns. The patterns to be classified

are usually groups of measurements or observations, defining points in an appropriate

multidimensional space.

A complete pattern recognition system consists of a sensor that gathers the observations

to be classified or described; a feature extraction mechanism that computes numeric or

symbolic information from the observations; and a classification or description scheme

that does the actual job of classifying or describing observations, relying on the extracted

features.

The classification or description scheme is usually based on the availability of a set of


patterns that have already been classified or described. This set of patterns is termed the training set and the resulting learning strategy is characterized as supervised. Learning can also be unsupervised, in the sense that the system is not given an a priori labelling of patterns; instead, it establishes the classes itself based on the statistical regularities of the patterns.

The classification or description scheme usually uses one of the following approaches:

statistical (or decision theoretic), syntactic (or structural), or neural. Statistical pattern

recognition is based on statistical characterisations of patterns, assuming that the patterns

are generated by a probabilistic system. Structural pattern recognition is based on the

structural interrelationships of features. Neural pattern recognition employs the neural

computing paradigm that has emerged with neural networks.

Pattern recognition (or more specifically template matching) that is robust to noise, symmetry, homothety (size change with angle preservation), etc. (Figure 6.1) has long been a challenging problem in artificial intelligence. In addition to its direct application in artificial intelligence and image processing, this task can be seen as complementary to the source separation described in Chapter 5, with recognition being done on the CAM/CSM representations described there. Many solutions or partial solutions to this problem have been proposed using expert systems or neural networks. In general, three different approaches are used to perform invariant pattern recognition:

• Normalization. In this approach the analyzed object is normalized to a standard position and size by an internal transformation. The advantages of this approach are that the coordinate information (the “where” information) is retrievable at any stage of the processing and that there is a minimum loss of information. The disadvantage is that the network must first find the object in the scene and then normalize it. This task is not as obvious as it may appear [75] [154].

• Invariant Features. In this approach some features that are invariant to the location and the size of an object are extracted. The disadvantages of this approach are that the position of the object may be difficult to extract after recognition and that information is lost during the process. The advantage is that the technique does not need to know where the object is. Moreover, unlike normalization, after which other techniques must be used to recognize patterns, the invariant features approach already does some pattern recognition by finding important features [73].

• Invariance Learning from temporal input sequences. The assumption is that primary sensory signals, which in general code for local properties, vary quickly while the perceived environment changes slowly. If it is possible to extract slow features from the quickly varying sensory signal, one is likely to obtain an invariant representation of the environment [155] [156].

Based on the Normalization approach, “dynamic link matching” (DLM) was first proposed by Konen et al. [154, 157]. This approach consists of two layers of neurons connected to each other through synaptic connections constrained by some normalization. The saved pattern is applied to one of the layers and the pattern to be recognized to the other. The dynamics of the neurons are chosen in such a way that “blobs” are formed randomly in the layers. If the features in these two blobs are similar enough, some weight strengthening and activity similarity will be observed between the two layers, which can be detected by correlation computation [154, 158]. These blobs may or may not correspond to a segmented region of the visual scene, since their size is fixed during the whole simulation period and is set by some parameters in the dynamics of the network [154]. The appearance of blobs in the network has been linked by the developers of the architecture to the attention process present in the brain. The dynamics of the neurons used in the original DLM network are not the well-known spiking neuron dynamics. In fact, the network’s behavior is based on rate coding (average neuron activity over time; for details see Section 6.8) and can be shown to be equivalent to an enhanced dynamic Kohonen map in its Fast Dynamic Link Matching (FDLM) form [154]. The DLM technique developed by Von der Malsburg’s group has been applied with simplifications to real images. In fact, the first layer of their network contains 10x10 neurons while the second layer contains

Figure 6.1: Some examples of affine transforms. Some transforms are simple, like those in a, b, c, d, and e, and some are combinations of simple transforms, like the one presented in f. (Panel titles: Rotation; Scaling (Homothety); Shearing; Reflection; Reflection + Shearing; Translation.)


16x7 neurons (most probably because the computational complexity grows exponentially with the number of neurons). They used jets (grey-value distributions based on the Gabor transform) to simplify the image and extract some features (or simple objects).

Here, we propose the Oscillatory Dynamic Link Matching algorithm (ODLM) [106] [159] [160], which uses third-generation spiking neurons and is based on phase (place) coding. The network is capable of doing motion analysis but, unlike in [161], it neither computes optical flow nor performs additional signal processing between the layers. More generally, our proposed network can solve the correspondence problem and, at the same time, perform the segmentation of the scene, which is in accordance with the Gestalt theory of perception [162] and is very useful when pattern recognition must be done in multiple-object scenes. In other words, the network does normalization, segmentation, and pattern recognition at the same time. It is also self-organized. In addition, if only one object is present in the scene, the segmentation phase can be bypassed if the speed of convergence is the only concern (Section 6.7). The application of this network is not limited to visual scene analysis: it can be used in the sound source segregation problem and may act as a top-down (schema-driven) processor in Computational Auditory Scene Analysis (CASA) [42]. In the following two sections, we first describe the Dynamic Link Matcher as proposed by Konen et al. We then propose our improved Oscillatory Dynamic Link Matcher. Finally, we show why our proposed technique works and what its advantages are.

6.3 The Dynamic Link Matcher

The Dynamic Link Matcher was first proposed by Konen et al. [154]. The architecture of the DLM consists of two layers of neurons. The dynamics of the neurons are detailed below. There are synaptic couplings between neurons in different layers and between neurons in the same layer. In DLM, finding a match between patterns means finding a set of mutually corresponding cells $a \in X$ ($X$ being the neurons of the first layer) and $b \in Y$ ($Y$ being the neurons of the second layer). A cell $a$ may be considered as a neuron or neuronal group capable of two functions: 1) coding a local feature imposed by the actual pattern; 2) representing an activity state which can be transmitted to other cells. Correspondence in the DLM is expressed by the binding of cells $a$ and $b$ through a dynamic link $w_{ba} \geq 0$. The links converging on a given cell $b$ are subject to the normalization condition $\sum_a w^{ext}_{ba} = 1$. Thus they can be interpreted as the probability that a cell $a$ is the correct correspondence for $b$. The link matching is done in a self-organized manner. In addition to the inter-layer links $w_{ba}$, there are static and homogeneous intra-layer connections in both $X$ and $Y$ ($w^{int}_{ba}$), which are described through an interaction kernel $k(d) = \gamma \exp(-d^2/2s^2) - \beta$ consisting of short-range excitatory connections with range $s$ and global inhibitory connections of relative strength $\beta$. It can be shown that this type of interaction kernel has a connected active region or “blob” as equilibrium solution [154].

Each iteration step consists of a simultaneous blob formation process in both layers $X$ and $Y$. This is achieved through a set of coupled differential equations starting from the initial conditions $x(0) = y(0) = 0$:

\[ \frac{dx_a}{dt} = -\alpha x_a + (k * \sigma(x_a)) + I_{x_a} \tag{6.1} \]

\[ \frac{dx_b}{dt} = -\alpha x_b + (k * \sigma(x_b)) + I_{x_b} \tag{6.2} \]

$\sigma(.)$ is the Gaussian Mexican Hat. The above equations differ only in their input terms: the layer $X$ receives its input $I_{x_a}$, which is slowly varying compared to the dynamics of $X$ and $Y$. On the other hand, the activity of layer $Y$ is coupled to $X$ through the input term $I_{x_b} = \varepsilon \sum_a w_{ba} T_{ba} \sigma(x_a)$ with coupling strength $\varepsilon$. $T_{ba}$ is the similarity matrix. It has high entries for all candidate matches with similar features.

When the activity in both layers $X$ and $Y$ has converged to its equilibrium blob solution, the dynamic links between active cells are strengthened according to:

\[ \Delta w_{ba} = \varepsilon w_{ba} \sigma(x_a) \sigma(x_b) \tag{6.3} \]
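The interaction kernel and the link update of Equation 6.3 can be sketched as below. This is a minimal illustration, not the implementation of Konen et al.: the parameter values, the function names, and the choice to renormalize each row right after the update (to keep the stated condition that the links converging on a cell $b$ sum to 1) are assumptions.

```python
import numpy as np

def interaction_kernel(d, gamma=1.0, s=1.0, beta=0.2):
    """k(d) = gamma * exp(-d^2 / (2 s^2)) - beta: local excitation, global inhibition."""
    return gamma * np.exp(-d**2 / (2.0 * s**2)) - beta

def update_links(w, act_x, act_y, eps=0.1):
    """Equation 6.3 followed by renormalization (the renormalization step is an assumption).

    w[b, a] is the link from cell a (layer X) to cell b (layer Y);
    act_x[a] and act_y[b] are the activities sigma(x_a) and sigma(x_b).
    """
    w = w + eps * w * np.outer(act_y, act_x)      # delta w_ba = eps * w_ba * sigma(x_a) * sigma(x_b)
    return w / w.sum(axis=1, keepdims=True)       # links converging on each b sum to 1

w = np.full((3, 4), 0.25)                         # uniform initial links, 4 cells in X, 3 in Y
act_x = np.array([1.0, 0.0, 0.0, 0.0])            # only cell a = 0 is inside the blob in X
act_y = np.array([1.0, 0.0, 0.0])                 # only cell b = 0 is inside the blob in Y
w_new = update_links(w, act_x, act_y)
```

Only the link between the two co-active cells grows; links onto inactive cells of $Y$ are left unchanged.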

Based on these assumptions, Konen, von der Malsburg and others proposed a dynamic link matcher [163, 154]. Although this approach is partially bio-inspired, it is not entirely biologically plausible: it does not use bio-inspired neural building blocks such as integrate-and-fire neurons or relaxation oscillators. In the next section, we propose a technique based on the DLM that uses relaxation oscillators. We will further show why our proposed network can do pattern matching and why the original DLM is an approximation of our ODLM.

6.4 The oscillatory dynamic link matcher

6.4.1 Introduction

In this section we propose our ODLM (Oscillatory Dynamic Link Matcher). Like the DLM, the ODLM aims to match patterns that have been applied to the two layers of the network. In other words, a visual scene is applied to the first layer of the network and an object to the other layer. If the object exists in the scene, synchronization is achieved between the two layers; if it does not, no synchronization is achieved. This behavior is shown schematically in Figure 6.2 (page 139). The mathematical description of the network is given in the following subsection.

6.4.2 Mathematical Description of the Network

The building blocks of this network are oscillatory neurons [100] (see Chapters 4 and 5 for further detail). The dynamics of this kind of neuron are governed by a modified version of the Van der Pol relaxation oscillator, called the Wang-Terman oscillator (for a similar approach with different dynamics see [1]). There is an active phase when the neuron spikes and a relaxation phase when the neuron is silent. The dynamics of the neurons follow the state-space equations below, where $x_{i,j}$ is the membrane potential (output) of the neuron and $y_{i,j}$ is the state for channel activation or inactivation.

\[ \frac{dx_{i,j}}{dt} = 3x_{i,j} - x_{i,j}^3 + 2 - y_{i,j} + \rho + H(p^{input}_{i,j}) + S_{i,j} \tag{6.4} \]

\[ \frac{dy_{i,j}}{dt} = \varepsilon \left[ \gamma \left( 1 + \tanh(x_{i,j}/\beta) \right) - y_{i,j} \right] \tag{6.5} \]

$\rho$ denotes the amplitude of a Gaussian noise, $p^{input}_{i,j}$ the external input to the neuron (its value is equal to the gray-level value of the corresponding pixel in the picture), and $S_{i,j}$ the coupling from other neurons (connections through synaptic weights). $\varepsilon$, $\gamma$, $\beta$ are constants (defined in Table 5.1, page 99), and $H(.)$ is the Heaviside function defined below:

\[ H(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \tag{6.6} \]

Initial values are drawn from a uniform distribution over the interval [-2; 2] for $x_{i,j}$ and over [0; 8] for $y_{i,j}$ (these values correspond to the whole dynamic range of the equations) (for more details on the dynamics of oscillatory neurons see Chapters 4 and 5).
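The behavior of a single, uncoupled oscillator governed by Equations 6.4 and 6.5 can be explored with a forward-Euler sketch. The constants of Table 5.1 are not reproduced in this chapter, so the values of `eps`, `gamma`, and `beta` below are placeholder assumptions, as are the step size, the noise-free setting ($\rho = 0$), and the name `simulate_oscillator`; `total_input` stands for $\rho + H(p^{input}_{i,j}) + S_{i,j}$.

```python
import numpy as np

def simulate_oscillator(total_input=1.0, eps=0.02, gamma=6.0, beta=0.1,
                        x0=-1.5, y0=2.0, dt=0.01, steps=50_000):
    """Forward-Euler integration of Equations 6.4 and 6.5 for one uncoupled neuron.

    eps, gamma, beta are assumed values (Table 5.1 is not available here).
    """
    x = np.empty(steps)
    y = np.empty(steps)
    x[0], y[0] = x0, y0
    for n in range(steps - 1):
        dx = 3.0 * x[n] - x[n] ** 3 + 2.0 - y[n] + total_input
        dy = eps * (gamma * (1.0 + np.tanh(x[n] / beta)) - y[n])
        x[n + 1] = x[n] + dt * dx
        y[n + 1] = y[n] + dt * dy
    return x, y

x, y = simulate_oscillator()
# Each silent-to-active jump crosses x = 0 upward once, so this counts spikes.
spike_count = int(np.sum((x[:-1] < 0.0) & (x[1:] >= 0.0)))
```

With a positive input the trajectory alternates between a silent phase (x near -2, y slowly decaying) and an active phase (x near +2, y rising), which is the relaxation behavior described above.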

A neighborhood of four is chosen in each layer for the connections. Each neuron in the first layer is connected to all neurons in the second layer and vice-versa. A global controller is connected to all neurons in the first and second layers as in [115]. In a first stage, segmentation is done in the two layers independently (with no extra-layer connections) as explained in Section 6.5, while dynamic matching is done with both intra-layer and extra-layer couplings. The intra-layer and extra-layer connections are defined as follows:

\[ w^{int}_{i,j,k,m}(t) = \frac{w^{int}_{max}}{\mathrm{Card}\{N^{int}(i,j) \cup N^{ext}(i,j)\}} \cdot \frac{1}{e^{\lambda |p(i,j;t) - p(k,m;t)|}} \tag{6.7} \]

\[ w^{ext}_{i,j,k,m}(t) = \frac{w^{ext}_{max}}{\mathrm{Card}\{N^{ext}(i,j) \cup N^{int}(i,j)\}} \cdot \frac{1}{e^{\lambda |p(i,j;t) - p(k,m;t)|}} \tag{6.8} \]

where $w^{int}_{i,j,k,m}(t)$ are intra-layer connections and $w^{ext}_{i,j,k,m}(t)$ are extra-layer connections (between the two layers), and $w^{int}_{max} = 0.2$ and $w^{ext}_{max} = 0.2$ are constants equal to the maximum value of the synaptic weights. $\mathrm{Card}\{N^{int}(i,j)\}$ is a normalization factor equal to the cardinal number (number of elements) of the set $N^{int}(i,j)$ containing the neighbors connected to neuron$_{i,j}$; it can be equal to 4, 3 or 2 depending on the location of the neuron on the map (center, corner, etc.) and on the number of active connections. A connection is active when $H(p(i,j) - p(k,m) - 0.01) = 1$, where $p(i,j)$ and $p(k,m)$ are input values and $H(.)$ is the Heaviside function described in Equation 6.6. This condition is tested for both intra-layer and extra-layer connections. $\mathrm{Card}\{N^{ext}(i,j)\}$ is the cardinal number for extra-layer connections and is equal to the number of neurons in the second layer with an active connection to neuron$_{i,j}$ in the first layer. Note that the normalization in Equation 6.8 is mandatory if one wants to match similar pictures with different sizes. If the aim is to match objects with exactly the same size, the normalization factor should be set to a constant for all neurons. The reason is that, with normalization, even if the object in the second layer were twice the size of the same object in the first layer, the total influence on neuron$_{i,j}$ would be the same as if the pattern were of the same size.
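The effect of the normalization factor can be checked numerically with a short sketch of Equations 6.7 and 6.8. The value of $\lambda$ is not given in this section, so `lam=1.0` is an assumption, and the helper names are hypothetical.

```python
import math

def synaptic_weight(p_a, p_b, card, w_max=0.2, lam=1.0):
    """Equations 6.7 / 6.8: w_max / Card * exp(-lam * |p_a - p_b|); lam is an assumed value."""
    return (w_max / card) * math.exp(-lam * abs(p_a - p_b))

def total_influence(n_active, gray_level=0.5):
    """Summed weight received from n_active connected neurons sharing one gray level."""
    return sum(synaptic_weight(gray_level, gray_level, n_active)
               for _ in range(n_active))

small_object = total_influence(10)   # object covering 10 neurons in the other layer
large_object = total_influence(40)   # same object, four times larger
```

Because each weight is divided by the number of active connections, the summed influence stays at `w_max` whatever the object size, which is the size-independence argued above; a larger gray-level difference lowers the individual weight exponentially.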

The schematic of the network is shown in Figure 6.3 (page 140).

6.5 Behavioral description of the network

The network has two different behavioral modes: segmentation and matching.

• Segmentation: In the segmentation stage, there is no connection between the two layers. The two layers act independently (except for the influence of the global controller) and segment the two images applied to the two layers respectively. The global controller forces the segments on the two layers to have different phases. At the end of this stage, the two images are segmented but no two objects have the same synchronization phase (Figure 6.6, page 143). The results from segmentation are used to create binary masks that select one object in each layer in multi-object scenes. In fact, a snapshot like the one shown in Figure 6.12 is used to create the binary mask $m(i,j)$ for one of the objects as follows:

\[ m(i,j) = \begin{cases} 1 & \text{for } x_{i,j}(t_{sync}) = x_{sync} \\ 0 & \text{otherwise} \end{cases} \tag{6.9} \]

$t_{sync}$ is a given instant of time after synchronization is reached. Since the neurons are noiseless, all the neurons synchronized with each other at a given $t_{sync}$ will have exactly the same output $x_{sync}$.

$x_{sync}$ can be the synchronized value that corresponds to either the cross or the rectangle in Figure 6.12 at time $t_{sync}$.

The coupling strength $S_{i,j}$ for each layer, as used in Equation 6.4, is computed by:

\[ S_{i,j}(t) = \sum_{k,m \in N^{int}(i,j)} w^{int}_{i,j,k,m}(t) H(x^{int}(k,m;t)) - \eta G(t) \tag{6.10} \]

$H(.)$ is the Heaviside function and $G(t)$ is the influence of the global controller, defined by the following equations. $\eta$ should be set to a value smaller than the maximum value of the synaptic weights, i.e., 0.25 in our case.

\[ G(t) = \alpha H(z - \theta) \tag{6.11} \]

\[ \frac{dz}{dt} = \sigma - \xi z \tag{6.12} \]


and is zero otherwise.

The reason why we use the global controller is that the initial values $x_{i,j}(0)$ of the leading neurons (the neuron to which all other neurons in a segment will synchronize) in two different segments may happen to be similar, which means that without a global controller these two segments would have similar phases. Note that, in contrast with integrate-and-fire neurons, the phase trajectory of relaxation neurons progresses in one direction and cannot jump back. Thus, it is impossible for non-leader neurons to delay the spiking of the leading neuron. On the other hand, the probability that two leaders have the same initial value is equal to $p(x)\Delta(x)$, where $p(x)$ is the probability distribution used to pick the initial values in Equation 6.4. Since $p(x)$ is bounded by 1, the aforementioned probability is upper-bounded by $\Delta(x)$, which is related to the numerical resolution of the integration and to the accuracy of the random number generator. If we assume that small phase discrepancies between regions are acceptable, it is very unlikely that two different segments synchronize for small networks (i.e., $N \sim 100$). Hence, the global controller becomes truly mandatory only for bigger networks.

• Dynamic Matching: In the matching phase, the external inputs to the layers are defined by the binary masks generated in the segmentation phase:

\[ p^{matching}_{i,j} = m(i,j)\, p^{input}_{i,j} \tag{6.13} \]

Extra-layer connections (Equation 6.8) are established. If there are similar objects in the two layers, these extra-layer connections will help them synchronize. In other words, these two segments are bound together through these extra-layer connections [65]. In order to detect synchronization, double-thresholding can be used [164]. This stage may be seen as a folded oscillatory texture segmentation device like the one proposed in [100]. The coupling strength $S_{i,j}$ for each layer in the matching phase is defined as follows:

\[ S_{i,j}(t) = \sum_{k,m \in N^{ext}(i,j)} \left\{ w^{ext}_{i,j,k,m}(t) H(x^{ext}(k,m;t)) + w^{int}_{i,j,k,m}(t) H(x^{int}(k,m;t)) \right\} - \eta G(t) \tag{6.14} \]

$x^{ext}$ is the output of extra-layer neurons (neurons belonging to the layer other than that of neuron$_{i,j}$) and $x^{int}$ is the output of intra-layer neurons (neurons belonging to the same layer as neuron$_{i,j}$).
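The hand-over from segmentation to matching (Equations 6.9 and 6.13) can be sketched on a toy snapshot. The array values are invented for illustration, and the small comparison tolerance `tol` is an assumption added for floating-point safety; Equation 6.9 itself uses exact equality for noiseless neurons.

```python
import numpy as np

def binary_mask(x_snapshot, x_sync, tol=1e-9):
    """Equation 6.9: 1 where the output at t_sync equals x_sync (within tol), else 0."""
    return np.isclose(x_snapshot, x_sync, atol=tol, rtol=0.0).astype(float)

def matching_input(p_input, mask):
    """Equation 6.13: keep the gray levels of the selected object, zero elsewhere."""
    return mask * p_input

# Hypothetical snapshot of neuron outputs at t_sync: two segments with different phases.
x_snapshot = np.array([[1.8, 1.8, -1.2],
                       [1.8, -1.2, -1.2]])
# Hypothetical gray-level image applied to the same layer.
p_input = np.array([[0.9, 0.9, 0.4],
                    [0.9, 0.4, 0.4]])

mask = binary_mask(x_snapshot, x_sync=1.8)     # select the segment whose output is 1.8
p_matching = matching_input(p_input, mask)
```

Choosing the other synchronized value (here -1.2) as `x_sync` would select the second object instead, so the objects of a scene can be presented to the matching phase one at a time.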

6.6 Geometrical Interpretation of the ODLM

We know that an object can be represented by a set of points corresponding to its corners, and any affine transform is a map $T : \mathbb{R}^2 \to \mathbb{R}^2$ of these points defined by the following matrix operation:

\[ p' = A\,p + t \tag{6.15} \]

where $A$ is a 2x2 non-singular matrix, $p \in \mathbb{R}^2$ is a point in the plane, and $p'$ is its affine transform. $t$ is the translation vector. The transform is linear if $t = 0$. For example, for a rotation with angle $\theta$, the matrix $A$ is:

\[ A = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \]
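Equation 6.15 and the rotation matrix can be tried on a few corner points with the sketch below; the function names and the sample points are illustrative assumptions.

```python
import numpy as np

def rotation_matrix(theta):
    """2x2 rotation matrix for an angle theta in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s, c]])

def affine_transform(points, A, t):
    """Apply p' = A p + t to every row of an (n, 2) array of points."""
    return points @ A.T + t

corners = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
# Pure rotation by 90 degrees (t = 0, so the transform is linear).
rotated = affine_transform(corners, rotation_matrix(np.pi / 2), np.zeros(2))
# Pure translation (A is the identity).
shifted = affine_transform(corners, np.eye(2), np.array([2.0, -1.0]))
```

The rotation sends (1, 0) to (0, 1) and (0, 1) to (-1, 0), up to floating-point rounding.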

An affine transformation is a combination of several simple mappings such as rotation, scaling, translation, and shearing. The similarity transformation is a special case of affine transformation: it preserves length ratios and angles, while the affine transformation in general does not. In this section we show that the coupling $S_{i,j}$ is independent of the affine transform used. We know that any object can be decomposed into its constituent triangles (three corners per triangle). Now suppose that the set $\{a, b, c, d\}$ is mapped to the set $\{T(a), T(b), T(c), T(d)\}$, and that the objects formed by these two sets of points are applied to the two layers of our neural network. Suppose also that points inside the triangle $\{a, b, c\}$ (resp. $\{T(a), T(b), T(c)\}$) have values equal to $A$ (corresponding to the gray-level value of the image at those points) and points inside $\{a, b, d\}$ (resp. $\{T(a), T(b), T(d)\}$) have values equal to $B$.

There are $\Delta_{T(abc)}$ connections from the region with gray-level value $A$ (triangle $\{T(a), T(b), T(c)\}$) and $\Delta_{T(abd)}$ connections from the region with gray-level value $B$ (triangle $\{T(a), T(b), T(d)\}$) to neuron$_{i,j}$, which belongs to the triangle $\{a, b, c\}$ with gray-level value $A$.

We know that an affine transform conserves surface ratios (Figure 6.4):

\[ \frac{\Delta_{abc}}{\Delta_{abd}} = \frac{\Delta_{T(abc)}}{\Delta_{T(abd)}} \tag{6.16} \]

where $\Delta_{abc}$ is the area of the triangle $\{a, b, c\}$ (expressed in number of neurons). For neuron$_{i,j}$ belonging to $\{a, b, c\}$ and neuron$_{k,m}$ belonging to $\{T(a), T(b), T(c)\}$, the normalization factor in Equation 6.8 becomes (neglecting the effect of intra-layer connections, since $N^{ext} \gg N^{int}$):

\[ N^{ext} = \Delta_{T(abc)} + \Delta_{T(abd)} \tag{6.17} \]

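Hence,

\[ w^{ext}_{i,j,k,m}(t) = \frac{f(p(i,j;t) - p(k,m;t))}{\Delta_{T(abc)} + \Delta_{T(abd)}}, \quad \text{with } f(x - y) = \frac{w^{ext}_{max}}{e^{\lambda |x - y|}} \quad \forall x, y \tag{6.18} \]

Therefore, the external coupling for neuron$_{i,j}$ from all neuron$_{k,m}$ becomes:

\[ S_{i,j}(t) = \frac{\Delta_{T(abc)}\, f(A - A)\, \psi(t, \phi_1)}{\Delta_{T(abc)} + \Delta_{T(abd)}} + \frac{\Delta_{T(abd)}\, f(A - B)\, \psi(t, \phi_2)}{\Delta_{T(abc)} + \Delta_{T(abd)}}, \quad \text{with } \psi(t, \phi) = H(x^{ext}_{k,m}(t)) \tag{6.19} \]

where $\psi(t, \phi_2)$ and $\psi(t, \phi_1)$ (as seen in Figure 6.6, page 143) are respectively associated with spikes with phases $\phi_2$ and $\phi_1$ that appear after segmentation. After factorization and using Equation 6.16 we obtain:

\[ S_{i,j}(t) = \frac{f(0)\, \psi(t, \phi_1)}{1 + \frac{\Delta_{abd}}{\Delta_{abc}}} + \frac{f(A - B)\, \psi(t, \phi_2)}{1 + \frac{\Delta_{abc}}{\Delta_{abd}}} \tag{6.20} \]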

The geometrical interpretation outlined here can be extended to more than four points and can be applied to any complex object. This means that the extra-layer connections are independent of the affine transform that maps the model to the scene (first and second layer objects). Therefore, our template matching technique is independent of the affine transform that has reshaped the image relative to the template, which shows that the technique works, in theory, with any affine reshaping of the objects.
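The key step of the argument, the conservation of area ratios under an affine transform (Equation 6.16) and the resulting invariance of the weighting factors in Equation 6.20, can be verified numerically. The points, the matrix, and the translation below are arbitrary values chosen for illustration; areas are computed geometrically rather than counted in neurons.

```python
import numpy as np

def triangle_area(p, q, r):
    """Area of the triangle (p, q, r) from the 2-D cross product."""
    return 0.5 * abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))

def coupling_share(area_same, area_other):
    """Weighting factor 1 / (1 + area_other / area_same) appearing in Equation 6.20."""
    return 1.0 / (1.0 + area_other / area_same)

def transform(p, A, t):
    """Affine map p' = A p + t (Equation 6.15)."""
    return A @ p + t

a, b, c, d = (np.array(v) for v in ([0.0, 0.0], [4.0, 0.0], [1.0, 3.0], [2.0, -2.0]))
A = np.array([[2.0, 1.0],
              [0.5, 1.5]])            # arbitrary non-singular matrix (rotation, scale, shear mixed)
t = np.array([5.0, -3.0])             # arbitrary translation
Ta, Tb, Tc, Td = (transform(p, A, t) for p in (a, b, c, d))

ratio_model = triangle_area(a, b, c) / triangle_area(a, b, d)
ratio_scene = triangle_area(Ta, Tb, Tc) / triangle_area(Ta, Tb, Td)
share_model = coupling_share(triangle_area(a, b, c), triangle_area(a, b, d))
share_scene = coupling_share(triangle_area(Ta, Tb, Tc), triangle_area(Ta, Tb, Td))
```

Both areas are scaled by the same factor (the absolute determinant of the matrix), so their ratio, and therefore each weighting factor of Equation 6.20, is unchanged by the transform.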

Note that if there are several objects in the scene and we want to match patterns, we can

use the results from the segmentation phase to break the scene into its constituent parts

(each synchronized region corresponds to one of the objects in the scene) and apply the

objects one by one to the network, until all combinations are tested. This is not possible

in the averaged Dynamic Link Matching case of Konen et al. where no segmentation

occurs.

6.7 Results

As stated earlier, this network can be used to solve the correspondence problem. For example, suppose that on a factory production line, someone wants to check for the presence of a component on an electronic circuit board. All that needs to be done is to put an image of the component on the first layer and check for synchronization between the layers. Ideally, any change in the angle or the location of the camera, or even in the zoom factor, should not influence the result. One of the signal processing counterparts of our proposed technique is morphological processing [165]. Other partial solutions such as the Fourier (resp. Mellin) transform could be used to perform matching robust to translation (resp. scaling) [165].

There is no need to train or configure our architecture for the stimulus we want to apply. The network is autonomous and can handle previously unseen stimuli. This is in contrast with associative-memory-based architectures, in which a stimulus must be applied and saved into memory before retrieval [156]. Nor does the network require any pre-configured architecture adapted to the stimulus, as in the hierarchical coding paradigm [69]. DLM can play an important role in structuring memory, e.g. finding structural similarities between stored information during sleep [163].

In this manuscript, we show the aforementioned capacities of the network using a prototype that will help us study the dynamics of the network.

6.8 Rate Coding vs. Phase coding

The aim of this section is to show that the original DLM is a rate coding approximation
of the ODLM. First of all, we define what “rate coding” and “phase coding” mean. We
will then show how the DLM and the ODLM are related to each other.

6.8.1 Rate Coding (Average over Time)

The first and most commonly used definition of firing rate refers to temporal average.

This is essentially the spike count in an interval T divided by T . The length of the time

window is set by the experimenter and depends on the type of neuron recorded from and


the stimulus. In practice, to get sensible averages, several spikes should occur within the
time window. Values of T = 100 ms or T = 50 ms are typical, but the duration may also

be longer or shorter.
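The temporal-average definition above can be illustrated with a minimal sketch. The function name `firing_rate`, the half-open window convention, and the spike times are assumptions made for this example only; they do not come from the simulations reported in this work.

```python
def firing_rate(spike_times_ms, t_start_ms, window_ms):
    """Temporal-average firing rate in spikes per second (Hz)."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    t_end_ms = t_start_ms + window_ms
    # Count the spikes inside the half-open window [t_start, t_end).
    count = sum(1 for t in spike_times_ms if t_start_ms <= t < t_end_ms)
    # Spike count divided by T, with T converted from ms to s.
    return count * 1000.0 / window_ms


# Hypothetical spike times (ms); the last spike falls outside the window.
spikes = [3.0, 21.5, 40.2, 61.0, 79.9, 98.4, 120.3]
rate = firing_rate(spikes, t_start_ms=0.0, window_ms=100.0)
print(rate)
```

With T = 100 ms, six spikes fall inside the window, which corresponds to a rate of 60 Hz. The precise timing of the spikes inside the window plays no role in this measure.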

6.8.2 Phase coding

We can also use spikes from other neurons as the reference signal for a pulse code. In

this scheme, times at which neurons spike convey the information (and not the averaged

rate). For example, synchrony between a pair or a group of neurons could signify special

events and convey information which is not contained in the firing rate of neurons (for

more details on synchrony and temporal correlation see chapters 3,4, and 5).

More generally, not only synchrony but any precise spatio-temporal pulse pattern could

be a meaningful event. For example, a spike pattern of three neurons, where neuron 1 fires
at some arbitrary time $t_1$, followed by neuron 2 at time $t_1 + \delta_1$ and by neuron 3 at $t_1 + \delta_2$,
might represent a certain stimulus condition (Rank Order Coding [109, 166]).

6.8.3 Dynamics of the Rate-coding DLM

Aonishi et al. [158] have shown that a canonical form of the rate-coding dynamic equations
solves the matching problem in the mathematical sense. The dynamics of a neuron in one
of the layers of the original Dynamic Link Matcher proposed in [154] are as follows (see
section 6.3 for more details):

$$\frac{dx_r}{dt} = -\alpha x_r + (k * \sigma(x_r)) + I_{x_r} \qquad (6.21)$$

where k(.) is a neighborhood function, $I_{x_r}$ is the summed value of extra-layer couplings,
σ(.) is the sigmoidal function, x is the output of the rate-coded neuron, and ∗ is the
convolution.
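To make the roles of the terms in Equation 6.21 concrete, the following is a minimal sketch of one forward-Euler integration step on a one-dimensional layer. The kernel values, the logistic form of σ, the step size, and the names `dlm_step` and `ext_input` are assumptions chosen for illustration; the layers used in this work are two-dimensional and the actual neighborhood function is the one defined in section 6.3.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid, used here as an assumed form of sigma(.)."""
    return 1.0 / (1.0 + np.exp(-x))


def dlm_step(x, kernel, ext_input, alpha=1.0, dt=0.01):
    """One forward-Euler step of dx/dt = -alpha*x + (k * sigma(x)) + I."""
    # Intra-layer coupling: convolution of the kernel with the outputs.
    lateral = np.convolve(sigmoid(x), kernel, mode="same")
    return x + dt * (-alpha * x + lateral + ext_input)


x = np.zeros(11)                        # 11 rate-coded neurons, initially at rest
kernel = np.array([0.25, 0.5, 0.25])    # hypothetical neighborhood function k
ext_input = np.zeros(11)
ext_input[5] = 1.0                      # extra-layer coupling drives the center neuron

for _ in range(1000):
    x = dlm_step(x, kernel, ext_input)

print(np.argmax(x))
```

In this sketch the leak term pulls each neuron back toward zero, the convolution spreads activity to neighbors, and the external input selects the neuron that ends up most active.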

In order to prove that our system is a generalization of previous works by [154], we need to

perform a fixed-point approximation of the Van der Pol equation (the way we have derived


this fixed-point approximation is described in Appendix E). This approximation proves

that we can replace the Wang-Terman (Van der Pol) oscillators by the simpler “integrate-

and-fire” neurons. We will use this property in appendix G to prove that the original

DLM can be derived from our ODLM by a time average approximation.

6.8.4 Segmentation and Matching for Invariant Pattern Recognition

A rectangular neuron map is chosen. There are 5x5 neurons in each layer. A vertical

bar in a background is presented in the first layer. The second layer receives the same

object transformed by an affine transformation (rotation, translation, etc.). Here are

some examples: Figure 6.5 shows an activity snapshot (instantaneous values of x(i, j))
in the two layers after segmentation (first phase). Note that same-colored neurons have
similar phases in the figure. On the other hand, different segments on different layers are
desynchronized (see Figures 6.6 and 6.7). In the dynamic matching stage, similar objects
among different layers are synchronized (Figure 6.9). The thresholded sum (synchronization
index) of the activity of all neurons, $\sum_{i,j} H(x(i,j) - 0.5)$, is shown in Figure 6.8
for the segmentation phase and in Figure 6.9 for the dynamic matching phase. Since

there are four different regions in the two layers with different phases at the end of the

segmentation phase, four different synchronization regions can be seen in Figure 6.8. In

the dynamic matching phase, the similar objects (and the backgrounds) merge with each

other producing only two distinct regions. In addition, when zero-mean Gaussian noise
with variance $\sigma^2 = 0.1$ is added to both stimuli (SNR = 10 dB), the matching results
remain unchanged.
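The synchronization index used in this section, the sum over all neurons of H(x(i, j) − 0.5), can be sketched as follows. The array layout (time, rows, columns), the convention H(0) = 0, and the toy activity values are assumptions made for this example; they are not the data behind Figures 6.8 and 6.9.

```python
import numpy as np


def synchronization_index(x, threshold=0.5):
    """Number of neurons whose activity exceeds the threshold at each time step.

    x has shape (time, rows, cols); the result has shape (time,).
    """
    # H(x - threshold): 1 where the argument is positive, 0 elsewhere.
    above = (x > threshold).astype(int)
    return above.sum(axis=(1, 2))


# Hypothetical activity of a 5x5 layer over 4 time steps.
x = np.full((4, 5, 5), -1.0)
x[1, :, 2] = 1.5     # the vertical bar (5 neurons) fires at t = 1
x[3] = 1.5           # the background fires at t = 3 ...
x[3, :, 2] = -1.0    # ... while the bar stays silent

index = synchronization_index(x)
print(index.tolist())
```

Each nonzero value of the index marks a group of neurons firing together, and its height is the size of that group, which is how the vertical rods in the synchronization plots are read.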

6.8.5 One-object scenes

Note that if only one object is present in each layer of the scene, then the segmentation

phase can be bypassed and the network could function directly in the matching mode.

This strategy will help us speed up the pattern recognition process. Figure 6.10 and


Figure 6.11 show the behavior of a 13x5 network when only one object is present in each

layer. The synchronization time for the matching-only network is shorter. Note that the
matching-only approach cannot be used if there are multiple objects in the scene. In that
case, the segmentation plus matching approach should be used.

6.9 Conclusion and Further Work

We proposed the oscillatory dynamic link matching as a means to segment images and
solve the correspondence problem, as a whole system, using a two-layered oscillatory
neural network. Our work is an extension of the dynamic link matcher proposed by
Konen et al. [154]. In fact, we showed that Konen’s model is a time-averaged version of
our proposed technique. We showed theoretically and with “toy-object” experiments that
our network is capable of establishing correspondence between images and is robust to
translation, rotation, noise and homothetic transforms. More experiments with complex
objects and more general transforms, such as shearing, are under investigation. Pattern
recognition of occluded objects is another challenge for the proposed architecture and
will be presented in further work. A more detailed study of the robustness to noise of
the proposed architecture remains to be done.

Van Hemmen has shown that the maximum number of segmented objects in a network
that uses temporal correlation is 6-7 objects [112]. Wang and Terman [100] have proposed
an algorithmic version of the Wang-Terman oscillator that can circumvent this limitation.
In further work, the integration of this algorithmic version into the approach should be
considered. The problem with the algorithmic version is that it needs global information
from all the neurons (or at least from the neurons of a synchronized region) to be implemented.
This is in contradiction with the “modular” property of neural networks.

The possibility of the insertion of this architecture in our bottom-up sound segregator [42]

[104] (and chapter 5) as a top-down processor can be investigated in a further work. In

fact, in this application, visual images will be replaced by CAM (Cochleotopic/AMtopic)


and CSM (Cochleotopic/Spectrotopic) Maps proposed in [138]. The approach could also

be used as a separate discrete-word recognizer (see Figure 6.13, Page 150).


Figure 6.2: An industrial application of the ODLM. Top: A resized version of an object
is applied to the matching layer. Synchronization is achieved, hence the object exists in
the visual scene. Bottom: A totally different object (which is not part of the scene)
is applied to the matching layer. Synchronization is not achieved. The object is not
matched.


[Figure 6.3 diagram labels: G; Neuron i,j; Neuron k,m; Neuron k',m'; x_ext; x_int; W_ext; W_int; H(.)]

Figure 6.3: The architecture of the oscillatory dynamic link matcher. The number of

neurons in the figure does not correspond to the real number of neurons. The global

controller has bidirectional connections to all neurons in the two layers. Synchronization

between neurons of the two layers is achieved when there is an affine similarity between

the pattern and the scene.


[Figure 6.4 diagram labels: corners a, b, c, d and their images T(a), T(b), T(c), T(d)]

Figure 6.4: An affine transform T for a four-corner object.


Figure 6.5: A snapshot of the activity of the first and second layers of the neural map.

Colors represent relative phase of oscillations.


[Figure 6.6 plots: two panels of neural activity versus simulation time (0 to 12000); annotations: t sync, x sync]

Figure 6.6: Neural activity pattern after segmentation. Left: Activity of one of the neurons
associated with the vertical bar in the first layer after segmentation. Right: Activity of
one of the neurons associated with the background in the same layer.


[Figure 6.7 plots: two panels of neural activity versus simulation time (0 to 10000); annotations: t sync, x sync]

Figure 6.7: Neural activity pattern after matching. Left: Activity of one of the neurons

associated with the horizontal bar in the first layer after dynamic matching. Right:

Activity of one of the neurons associated with the vertical bar after dynamic matching in

the second layer. The two neurons are in full synchronization.


Figure 6.8: The evolution of the thresholded activity of the network through time in the

segmentation phase. Each vertical rod represents a synchronized ensemble of neurons

and the number of neurons in that synchronized region is represented on the vertical axis.


Figure 6.9: The evolution of the thresholded activity of the network through time in the

dynamic matching phase.


[Figure 6.10 plot: synchronization index (0 to 120) versus simulation time (0 to 18000)]

Figure 6.10: The synchronization index of a one-object scene when the segmentation step
is bypassed. The synchronization takes 85 oscillations (spikes).


[Figure 6.11 plot: synchronization index (0 to 140) versus simulation time (0 to 3 x 10^4)]

Figure 6.11: The synchronization pattern of a one-object scene when the segmentation

phase precedes the matching phase. The synchronization takes 155 oscillations (spikes).



Figure 6.12: A scene segmentation done during the segmentation phase of the algorithm.

Colors represent synchronization phase. Binary masks are generated by assigning binary

values to different oscillation phases (Equation 6.9).


Figure 6.13: Architecture of an integrated top-down and bottom-up processor (under

investigation): In the segregation level the desired sound source is segregated using har-

monicity and localized energy cues by the CAM/CSM maps (Bottom-Up segregation).

The bottom-up segregation generates a mask from which cochlear channels are selected.

The result of this stage is compared with pre-stored patterns via the Dynamic Link

Matcher and the best match (e.g., vowel) is found.

CHAPTER 7

CONCLUSION

7.1 Summary

The previous six chapters formed the complete presentation of the architectures I have

proposed for auditory and visual scene analysis. In this final chapter I will take a slightly

broader perspective to look at the things I might have added to the system (in an ideal

world without time constraints), and some of the aspects of audition I still do not know

how to incorporate into this approach.

7.2 What has been presented

Before drawing conclusions, let us briefly review what this thesis has contained. I started

by presenting the state of the art in Computational Auditory Scene Analysis (CASA) and

by pointing out the lack of a general system that can handle all types of mixtures and

sounds. I then proposed to partially mimic the behavior of the nervous system to come

up with a system that can separate different sound sources. To do so, I laid down in
Chapter 3 the neurocognitive aspects I used later, and in Chapter 4 I tried to find the ‘best’
mathematical model of bio-inspired neurons, one that gives roughly the same behavior as
real neurons at a relatively low computational complexity. I proposed two
different representations (the Cochleotopic/AMtopic and the Cochleotopic/Spectrotopic
Maps) that I used as the front end to my neural architecture. I then proposed an architecture
for the sound source separation problem based on temporal correlation. The next step was
the synthesis of the sound. The synthesis quality of the conventional synthesis
filterbank used in other works was not satisfactory. Therefore, jointly with other colleagues,
the FIR gammatone filters were adapted to this work.


In the visual scene analysis domain, the ‘Dynamic Link Matching’ has been extended

to bio-inspired neurons. I called this extension ‘Oscillatory Dynamic Link Matching’

(ODLM). I studied some theoretical properties of this new architecture and proved some

fundamental concepts about this technique. I then applied ‘toy-objects’ to the system
and showed that what I had derived mathematically holds for these simple objects.

7.3 Future developments of the model

As stated earlier, none of the proposed systems in the literature is a complete system that

can function in any condition. The system proposed in this thesis is not an exception to

this general rule. Many things are missing in this work. This incompleteness is due either
to the general lack of know-how in the scientific community for the emerging
field of Computational Auditory Scene Analysis or to the time constraints of this work. The
following list consists of the system features that I would most like to add in the future:

• Automatic CAM/CSM selection. In this work the selection between the two

representations (maps) is done manually. An algorithm should be designed that
decides, based on the features of the mixture, which representation is adequate
for the task.

• ‘Top-down’ processing of information. The processing of information in this

work is done in a bottom-up manner. It means that no lexical or any higher level

information has been used for separation. We know that this kind of information

is very important for better and more robust source separation. In my opinion, the

integration of such a ‘top-down’ processor would be an asset to the system.

• Auditory maps issues. I have limited myself to two different representations

(the CAM and the CSM). From neurophysiological observations, it is well known

that more than two maps are generated in the brain. Therefore, a system can be

designed in the future that extends the two-map strategy to multi-map strategy.


Other candidate maps could be those detecting onsets, offsets, etc. In addition, the

maps I have proposed are only approximations to real maps used in the brain. No
one can guarantee their optimality. All I have proved in this work is that these

two maps can solve certain auditory scene analysis problems. I have also speculated

that there might be some correlations between these maps and the ‘real-world’ maps

of the nervous system. A further investigation should prove whether it is possible

to use other signal processing techniques (such as wavelets) to further enhance the

representations.

• Neural architecture. Based on our observations and findings, I tried to find a

trade-off between performance and computational complexity for the neural net-

works I have proposed. There are so many empirical and ad-hoc parameters in the

network (maybe like any other neural network). Future work should further enhance

the performance of my proposed network.

• Visual pattern recognition. Since this part was an extension of my initial goals
for this thesis, I have not applied real-world images to this network. The following
questions should be answered by future work: What would happen if
visual objects with occlusions were applied to this network? What is the ultimate
performance of the technique?

• Implementational issues. Throughout this work I did not focus on the optimization
of the computer code, leaving this for further work. An ‘event-driven’^1
simulator can help us decrease the simulation time and attain real-time operation.

^1 An event-driven simulator updates the state of a simulation block only if there is a change in the
external inputs to that block.


7.4 The future of Computational Auditory Scene Analysis

The initial goals of this project were too ambitious, and while it is fun to think in

grandiose terms about the ‘whole’ of audition, it is not always so obvious to find a

general solution to the auditory scene analysis problem.

In my opinion, today’s most sophisticated theories will appear naive and almost

willfully simplistic in the quite near future. This is inevitable: the challenge is to

make the discoveries that will permit a more realistically complex model of auditory
perception. The technical gap between the first attempt at sound source
separation in the ’70s (i.e., by Parson [17]) and current sound separators is great.
Very little work was done in the ’70s and ’80s, and the real explosion began
in the late ’90s.

According to Ellis [30], “the comparison of machine vision and machine listening is

sobering. There are similarities between our field and the computer models vision of

fifteen or twenty years ago, and vision is still very far from being a ‘solved problem’”.

However there are reasons to be optimistic: Firstly, many of the lessons gained rather

painfully in vision research have been incorporated directly into theories of hearing

rather than being rediscovered. Secondly, hearing is simpler than vision, in terms of

sensory bandwidth, and at the same time the inexorable advance of computational

power makes possible models that would previously have been unthinkable. There

are also other reasons to think that machine listening will be tougher to achieve than

machine vision. Our physiological knowledge about the auditory nervous pathway

is much less than what we know about vision. Furthermore, there is a lag between

our fundamental understanding of audition compared to vision. The first book that

somehow outlined the bases of Gestalt principles for vision was published in 1936 by

Koffka [11] (these principles were elaborated later by Marr in 1982 [13] for computer

vision purposes). The first book ever published about auditory scene analysis was in


1990 by Bregman [6]. These historical facts show that there is a lot to do at the pure-

science level (psychology, neurophysiology, etc.) in parallel with technical research

in engineering.

Perception is the right biological mystery to be studying at the moment, given

our experimental and computational tools. Its solution will lead naturally into the

deeper cognitive secrets of the brain and will let us design more robust engineering

devices.


APPENDIX A: CANONICAL NEURONAL MODEL

The purpose of this appendix is to show that all neural models described in this thesis

(i.e., integrate-and-fire, relaxation oscillators, etc.) can be reduced to the canonical model

described here. The canonical model is a unified framework for the analysis of bio-inspired

neural networks.

As stated before, all class I neurons become active by means of a saddle-node bifurcation.

In nonlinear dynamical systems, a bifurcation is a period doubling, quadrupling, etc.
(Figure A-1, page 158) that accompanies the onset of chaos. It represents the sudden appearance
of a qualitatively different solution for a nonlinear system as some of the parameters

of the system’s differential equations are varied. Saddle-node bifurcations on limit cycles

are ubiquitous in two-dimensional systems:

$$\dot{x} = f(x, y), \qquad \dot{y} = g(x, y) \qquad (A-1)$$

where f and g are continuous functions. Let us plot the nullclines $\dot{x} = 0$ and $\dot{y} = 0$ in
the xy-plane^2. Each intersection of the nullclines corresponds to an equilibrium of the
model. When the nullclines intersect as in Figure A-2, the bifurcation occurs. Note that
the neurons introduced in Section 4.2.2 can be put in this canonical format. Roughly
speaking, a saddle-node bifurcation occurs when there are two intersections of nullclines,
one stable and the other unstable. Saddle-node bifurcation on a limit cycle leading to
Class 1 neural excitability can be observed in many multidimensional neural models such

^2 In a two-dimensional system of differential equations the nullclines are the curves where the vector
field is either horizontal or vertical. The horizontal nullcline is found by setting $\dot{y} = 0$ since this says that
there is no vertical component of the vector field along this curve. Similarly, to find the vertical nullcline
we set $\dot{x} = 0$.


Figure A-1: Bifurcation in a two-dimensional space. In a dynamical system, a bifurcation

is a period doubling, quadrupling, etc., that accompanies the onset of chaos. It represents

the sudden appearance of a qualitatively different solution for a nonlinear system as some

parameter is varied. The illustration above shows bifurcations (occurring at the location

of vertical lines) of the logistic map as the parameter r is varied. Bifurcations come in four

basic varieties: flip bifurcation, fold bifurcation, pitchfork bifurcation, and transcritical

bifurcation (adapted from http://www.mathworld.com)

as the Hodgkin-Huxley, Morris-Lecar, etc. Although the Hodgkin-Huxley model exhibits

class 2 excitability for the original values of parameters, it exhibits class 1 excitability

when a transient potassium A-current is taken into account.

The characteristic feature of an Andronov-Hopf bifurcation is that the equilibrium point

loses its stability and a limit cycle appears. If the initial value is on the limit cycle, then

the point moves along the curve, periodically returning to the initial point (oscillatory

activity).

Figure A-2: Saddle-node bifurcation in Wilson-Cowan oscillators

Canonical model for saddle-node bifurcation

The state-space of a neuron follows the dynamics [167]:

$$\dot{X} = F(X, \lambda) \qquad (A-2)$$

λ is a vector of parameters and X is the state vector of the system containing the membrane
potential, the ions, the channels, etc. Now suppose that $\lambda_0$ is the vector value for which
there is a ‘saddle-node’ bifurcation. For all λ close to $\lambda_0$ we can find a map h(X, λ) that
transforms every system of the form of Equation A-2 to the Ermentrout canonical model:

$$\varphi' = (1 - \cos\varphi) + (1 + \cos\varphi)\, r \qquad (A-3)$$

where $\varphi \in S^1$ is a phase variable (state variable) that describes the activity of the neuron
along the limit cycle, $S^1 = \{e^{j\varphi} \in \mathbb{C}\}$ is the unit circle in the complex plane, and $r \in \mathbb{R}$
is a new bifurcation parameter. The transformation h that maps solutions of A-2 to those
of A-3 blows up a small neighborhood of the saddle-node bifurcation point and compresses
the entire limit cycle to an open set around $\pi \in S^1$ (Figure A-3).

The canonical model has the following interesting behavior: if r > 0 the neuron spikes
periodically with period $\pi/\sqrt{r}$ (i.e., at a frequency of $\sqrt{r}/\pi$); if r < 0 the spiking
threshold $\varphi_+$ and the equilibrium point $\varphi_-$ are given by:

$$\varphi_\pm = \pm\cos^{-1}\frac{1 + r}{1 - r} \qquad (A-4)$$
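The behavior of the canonical model can be checked numerically. The substitution x = tan(ϕ/2) reduces Equation A-3 to x′ = x² + r, whose sweep from −∞ to +∞ takes a time π/√r when r > 0. The sketch below integrates A-3 over one cycle with a fourth-order Runge-Kutta scheme and also evaluates Equation A-4 for a negative r. The step size, the test values of r, and the function names are assumptions made for this illustration.

```python
import math


def theta_rhs(phi, r):
    """Right-hand side of the canonical model (Equation A-3)."""
    return (1.0 - math.cos(phi)) + (1.0 + math.cos(phi)) * r


def time_per_cycle(r, dt=1e-4):
    """Integrate from phi = -pi to phi = +pi (RK4) and return the elapsed time."""
    phi, t = -math.pi, 0.0
    while phi < math.pi:
        k1 = theta_rhs(phi, r)
        k2 = theta_rhs(phi + 0.5 * dt * k1, r)
        k3 = theta_rhs(phi + 0.5 * dt * k2, r)
        k4 = theta_rhs(phi + dt * k3, r)
        phi += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += dt
    return t


def equilibria(r):
    """Threshold and rest phases of Equation A-4 (requires r < 0)."""
    phi = math.acos((1.0 + r) / (1.0 - r))
    return phi, -phi


r = 0.25
measured = time_per_cycle(r)
predicted = math.pi / math.sqrt(r)
print(measured, predicted)

phi_plus, phi_minus = equilibria(-0.5)
print(theta_rhs(phi_plus, -0.5), theta_rhs(phi_minus, -0.5))
```

The measured time per cycle agrees with π/√r up to the integration step, and the right-hand side of A-3 vanishes at both phases returned by `equilibria`, as expected for fixed points.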

Weakly Connected Neural Networks in the Canonical Form

Figure A-3: The transformation h maps solutions of Equation A-2 to those of Equation
A-3

The dynamics of a weakly coupled neural network can be written in the following canonical

form:

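$$\dot{X}_i = F_i(X_i, \lambda) + \varepsilon\, G_i(X_1, X_2, \ldots, X_N, \lambda, \varepsilon) \qquad (A-5)$$

$G_i(.)$ describes how the ith neuron is affected by the other neurons, and $X_i$ describes the
activity of the ith neuron. For weakly coupled networks, $\varepsilon \ll 1$. After linearization and
some approximations [84] the weakly-connected neural network becomes:

$$\varphi_i' = (1 - \cos\varphi_i) + (1 + \cos\varphi_i)\, r_i + \sum_{j=1}^{n} w_{ij}(\varphi_i)\,\delta(\varphi_j - \pi) + O(\sqrt{\varepsilon}\,\ln\varepsilon) \qquad (A-6)$$

$$w_{ij}(\varphi_i) = 2\,\mathrm{atan}\left(\tan\frac{\varphi_i}{2} + s_{ij}\right) \qquad (A-7)$$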

Coupling of two neurons

• Unidirectional coupling Here we consider the situation in which a neuron

is connected to another neuron in one direction only (neuron 2 receives input from

neuron 1, but neuron 1 does not receive input from neuron 2). The dynamics of this

system is:

$$\varphi_1' = (1 - \cos\varphi_1) + (1 + \cos\varphi_1)\, r \qquad (A-8)$$

$$\varphi_2' = (1 - \cos\varphi_2) + (1 + \cos\varphi_2)\, r + w(\varphi_2)\,\delta(\varphi_1 - \pi) \qquad (A-9)$$

Let us perturb the stable solution $\varphi_2(t) = \varphi_1(t)$, supposing that $\varphi_2 > \varphi_1$. Since
$w \geq 0$, the spikes of $\varphi_1$ advance $\varphi_2$ even further. This is due to the fact that
$w(\varphi_2)\,\delta(\varphi_1 - \pi)$ is always positive. Therefore, the in-phase solution is unstable.
After a while, $\varphi_2(t) \to \varphi_1(t) + 2\pi$ and each firing of $\varphi_1$ advances $\varphi_2$ even closer to
$\varphi_1(t) + 2\pi \equiv \varphi_1(t)$ (note that $\varphi_1(t)$ is periodic with period 2π). We see that the
in-phase synchronized solution for the synaptic organization is stable in one direction
and unstable in the other [97]. This is because we have supposed that the two neurons
are identical. If the neurons are different, say $r_1 < r_2$, then no synchronized
solution exists; when $r_1 > r_2$ there is a synchronized solution, whose phase
shift increases when $r_1 - r_2$ increases.

• Bidirectional coupling Consider now the case of bidirectional coupling:

$$\varphi_1' = (1 - \cos\varphi_1) + (1 + \cos\varphi_1)\, r + w_{12}(\varphi_1)\,\delta(\varphi_2 - \pi) \qquad (A-10)$$

$$\varphi_2' = (1 - \cos\varphi_2) + (1 + \cos\varphi_2)\, r + w_{21}(\varphi_2)\,\delta(\varphi_1 - \pi) \qquad (A-11)$$

We can show that for this configuration, the difference $\varphi_2 - \varphi_1$ may take different
values during an oscillation, but returns to its initial value at the end of the oscillation,
i.e. $\varphi_2(0) - \varphi_1(0) = \varphi_2(T) - \varphi_1(T)$, where T is the oscillation period.

This framework enables us to analyze the coupling of bio-inspired neurons in a
standard manner.


APPENDIX B: CHAOTIC-BASED SOUND SEPARATIONS

The computational burden of the technique proposed in Chapter 5 is high; therefore
some simplifications/optimizations should be made so that the technique can be implemented
in real-time. For instance, the use of chaotic neural networks (see Chapter 4)
instead of Wang-Terman oscillators can help us speed up the separation process. Wang-Terman
oscillators are described by stiff equations that must be solved by numerical integration
techniques with a very small step, whereas the chaotic neurons use only additions and
multiplications (for details see chapter 4 and [82]). The disadvantage of chaotic neurons is that while
their dynamics are simple, the analysis of the synchronization is complex. Since the outputs
of chaotic neurons are not ergodic^3, two outputs may be synchronized for a time interval
but not synchronized for the entire process. In addition, in this preliminary work correlograms
have been used: they are computationally expensive and should be replaced by
CAM/CSM. Although, as stated in Chapter 4, there are some reports on chaotic behavior
of neurons, the chaotic model is less biologically plausible than spiking neural networks.

A very simplified one-dimensional version of the neural separator proposed in chapter 5 has
been tested and a very preliminary version has been designed. We applied correlograms
of AM envelopes of cochlear filterbank outputs to a network of oscillatory neurons, in
order to separate two speakers (or a speaker from a tone). In this approach, synchronized
regions belong to the same speaker, while regions that are desynchronized with respect to the
first speaker’s clusters correspond to other speakers (or noise). Our proposed network is
composed of chaotic neuronal elements like in [91] but is one-dimensional. Our learning
algorithm is a modified version of the rules proposed in the work by Zhao et al. We

^3 Ergodicity: an attribute of stochastic systems; generally, a system that tends in probability to a
limiting form that is independent of the initial conditions. This is due to the fact that the statistical
average of the stochastic system is equal to the time average of the variable.


achieved synchronization patterns that are different from those in Zhao et al. [91]. In fact,
we think that the periodic and quasi-periodic patterns we obtained in our work are biologically
more plausible. In contrast with other works, we did not use any global controller. Our
tests on pilot and real data showed that the symmetry breaking is done automatically in
this network. Although a more detailed analysis is needed to prove this statement,
we think that this behavior stems from the fact that the behavior of the network is
chaotic at the beginning (before synchronization), which leaves the network enough time
to desynchronize. To our knowledge, this is the first time that real speech data has been
applied to a one-dimensional chaotic neural network. In addition, our network and its
associated learning algorithm are well suited to multilevel inputs and not just to binary
ones.

Our preprocessing stage consists of a 24-channel cochlear filterbank that mimics in part
the behavior of the human cochlea. The feature extraction algorithm described in [133]
has been used and the normalized correlogram is computed for the delays corresponding
to the pitch of one of the speakers. In order to find the pitch of the signal we used
the pooled correlogram technique [3]. Then the correlograms are quantized to a limited
number of levels (4 levels) and are applied to our network of chaotic neurons.
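The quantization step can be sketched as follows. The uniform bin edges, the assumption that the normalized correlogram values lie in [0, 1], and the function name `quantize` are choices made for this illustration only; the exact quantizer used in the experiments is not specified here.

```python
import numpy as np


def quantize(values, n_levels=4):
    """Map values in [0, 1] onto integer levels 0 .. n_levels - 1."""
    values = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    # Interior bin edges; for 4 levels these are 0.25, 0.5 and 0.75.
    edges = np.linspace(0.0, 1.0, n_levels + 1)[1:-1]
    return np.digitize(values, edges)


# Hypothetical normalized correlogram values for six channels.
correlogram = [0.0, 0.1, 0.3, 0.6, 0.9, 1.0]
levels = quantize(correlogram)
print(levels.tolist())
```

The multilevel output, rather than a binary one, is what is then presented to the network as input.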

An array of chaotic neurons is used to segregate speech. The dynamics of each neuron i are
governed by a Chaotic Map (Zhao et al. [91]):

$$x_i(t + 1) = x_i(t) + \frac{\varepsilon}{N}\sum_{j=1}^{N} f(x_j(t)) \qquad (B-1)$$

$f(x) = ax(1 - x)$ is the logistic map and N is the number of neurons. We used a modified
version of the dynamic neighborhood algorithm described in [91], since we are using
a one-dimensional network in contrast to the two-dimensional network used in Zhao
et al. for image segmentation purposes. In addition, our proposed modified weight
adaptation rule is able to process non-binary data. The algorithm is implemented
as follows: each neuron in the network is connected to other
neurons of the network through discrete-time delays (the maximum neighboring distance
of connections is set to 10 neurons). In the beginning, each neuron runs freely,
that is, no synaptic connection is established between neurons. Later, connections are
established according to an exponential rule $e^{-(x_i - x_{i-1})}$, where $x_i$ and $x_{i-1}$ are the
inputs applied to neurons i and i − 1 respectively. The farther a neuron is from another
one, the longer the update delay time is. For instance, for neuron i, updating
delays are defined as $d_{i-1}, d_{i+1}, d_{i-2}, d_{i+2}, \ldots, d_{i-10}, d_{i+10}$ (minus and plus signs correspond
to lower and upper neurons respectively) with $d_{i-1} < d_{i-2} < d_{i-3} < \ldots < d_{i-10}$ and
$d_{i-1} = d_{i+1}, \ldots, d_{i-10} = d_{i+10}$. The update equations are as follows:

$$w_{ij}(t) = \begin{cases} e^{-5.5\,|x_i(t - d_{i-j}) - x_{i-1}(t - d_{i-j})|} & \text{for } t - d_{i-j} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (B-2)$$
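A minimal sketch of the update rule of Equation B-2 is given below. The function name `link_weight`, the list-based storage of the input histories, and the numerical inputs are assumptions made for this illustration; only the exponential form, the factor 5.5, and the delay condition come from the equation.

```python
import math


def link_weight(inputs_i, inputs_prev, t, delay, gain=5.5):
    """Weight of Equation B-2 at discrete time t for a given update delay.

    inputs_i and inputs_prev hold the inputs of the two neurons, indexed by time.
    """
    if t - delay <= 0:
        return 0.0          # no connection before the delay has elapsed
    diff = inputs_i[t - delay] - inputs_prev[t - delay]
    return math.exp(-gain * abs(diff))


# Hypothetical multilevel inputs of two neighboring neurons over four time steps.
inputs_i = [0.0, 0.25, 0.5, 0.75]
inputs_prev = [0.0, 0.25, 0.0, 0.75]

w1 = link_weight(inputs_i, inputs_prev, t=1, delay=1)   # t - d = 0: not yet connected
w2 = link_weight(inputs_i, inputs_prev, t=2, delay=1)   # identical inputs
w3 = link_weight(inputs_i, inputs_prev, t=3, delay=1)   # inputs differ by 0.5
print(w1, w2, w3)
```

Identical inputs give the maximum weight of 1, and the weight decays exponentially as the inputs of the two neurons move apart, so similar channels become strongly linked.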

At time instant $t = d_{i-1}$, the network computes the difference between the inputs to
neurons i − 1 and i + 1, the closest neurons (the 1-neighbors) to the neuron i, using the


[Figure B-4 diagram labels: Cochlear Output 1 to Cochlear Output 24; Chaotic Neuron; Weight Adaptation; delay z^-N; DMM: Decision Making Module; Network Output 1]

Figure B-4: Architecture of the simplified chaotic neural network based sound source

separator. The Decision Making Module (DMM) defines the neighborhood for which

connections are established for each neuron in the network. The neighborhood grows

with time (as described in Equation B-2).

Figure B-5: Oscillatory behavior of the chaotic network for the two-speaker segregation
problem: the X-axis represents discrete time while the Y-axis represents channels. Synchronization
can be roughly associated with similarly changing gray levels in the figure. Gray levels
show the level of activity: dark regions are zones of mutually synchronized neurons, while bright
regions are zones of neurons that are synchronized among themselves but desynchronized from the
neurons of the dark regions.

exponential rule defined earlier, at t = di−2 it updates the connections to neurons i + 2

and i − 2. Since in our case delays are all exponents of 2, at the same time it updates

the weight connections between the 1-neighbors and neuron i. In this way, the region of

synchrony around a neuron shrinks or grows at fixed time delays according to the defined

learning rule.
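The delayed weight update of Equation B-2 can be sketched as follows. This is only an illustrative sketch, not the implementation used in the thesis: the function name `connection_weight`, the representation of the inputs as callables of time, and the example delay value are assumptions made for the illustration.

```python
import math


def connection_weight(x_i, x_neighbor, t, delay, sharpness=5.5):
    """Weight of Equation B-2 for one neighbor of neuron i.

    x_i and x_neighbor are callables returning the input applied to each
    neuron at a given time (hypothetical interface); delay is the update
    delay d assigned to that neighbor.
    """
    if t - delay <= 0:
        return 0.0  # the connection is not established yet
    t_delayed = t - delay
    difference = abs(x_i(t_delayed) - x_neighbor(t_delayed))
    return math.exp(-sharpness * difference)


def constant_input(t):
    """Hypothetical input that does not change over time."""
    return 0.3


# Before the delay has elapsed the weight is zero; afterwards, identical
# inputs give the maximum weight of 1.
weight_before = connection_weight(constant_input, constant_input, t=1, delay=2)
weight_after = connection_weight(constant_input, constant_input, t=3, delay=2)
```

Neighbors with similar inputs therefore end up strongly connected, while the delay controls when each neighbor joins the region of synchrony.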

The mask is generated from the output of the network. Speech is then synthesized by weighting the filterbank outputs with that mask, in a way similar to what is described in Section 5.4.5. The oscillatory neural network that we use has the advantage of creating a mask that takes into account the mutual information from the cochlear channels and that does not require any training.

This technique is not used further in this thesis, and chaotic oscillators are replaced by relaxation oscillators, because it is more difficult to detect synchronicity in chaotic networks. Furthermore, the biological plausibility of chaotic networks is not fully established.

APPENDIX C: MULTIPLICATIVE SYNAPSES

In this appendix, we demonstrate mathematically why additive synapses may fail to

separate sound sources. We base our derivation on what has been shown in Figure C-6.

It must be pointed out that this appendix does not prove that multiplicative synapses are

optimal in all senses. It simply shows that the second-layer integration can be done more

powerfully by multiplicative synapses.

In Figure C-6 (at top), second-layer integration with additive synapses is analyzed. A snapshot of the first layer’s activity is shown in the rectangle (Figure C-6, (A)). The first layer’s activity emphasizes the underlying CSM/CAM. In this specific example, the first layer activity for the CAM of a single speaker is depicted. Two regions are marked in that rectangle (circled in green and red). The distance between red dots corresponds to the pitch of the signal. The region circled in green corresponds to channels where no neural activity has been detected (the background). The background (in white) and the red dots have different spiking phases, as shown by the red arrows. That means that all white neurons have the same phase while red pixels have a different phase. In the following, the phase of a neuron is described by its associated color. Since the synapses are additive, all the activity along a channel is added. The sum of all activities along a channel in the red region is given by (Figure C-6, (B)) (note that all weights connecting the two layers are set to unity for the sake of simplicity):

$$\Phi_1 = \sum_{n=1}^{h_1} \delta(n - T_1) + \sum_{n=1}^{h_2} \delta(n - T_2) \tag{C-1}$$

The sum of all activities along a channel in the green region is given by (Figure C-6, (C)):

$$\Phi_2 = \sum_{n=1}^{h_1+h_2} \delta(n - T_2) \tag{C-2}$$

Comparing (Figure C-6, (C)) with (Figure C-6, (B)), the averaging result over the chosen window is the same for the green and red regions and is equal to $h_1 + h_2$. A good separator should have separated these two regions as two different sources. Hence, the additive synapses (Equation 5.15) as described here do not separate the regions correctly. Note that even if we had chosen weights different from unity, nothing would have changed for the additive case. In fact, in this case we would have had:

$$\langle \Phi_1 \rangle = \sum_{i=1}^{h_1} w_i + \sum_{i=1}^{h_2} w_i = \sum_{i=1}^{h_1+h_2} w_i \tag{C-3}$$

$$\langle \Phi_2 \rangle = \sum_{i=1}^{h_1+h_2} w_i \tag{C-4}$$

Therefore even if the weights are different, we still have < Φ1 >=< Φ2 >.
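The argument above can be checked numerically with a small sketch. The neuron counts, spike times, window and weight values below are hypothetical and only serve the illustration; spikes are modeled as unit impulses, and the same list of weights is applied to both regions, as in Equations C-3 and C-4.

```python
def additive_input(spike_times, weights, window):
    """Sum the weighted unit spikes that fall inside the averaging window."""
    start, end = window
    return sum(w for t, w in zip(spike_times, weights) if start <= t < end)


h1, h2 = 3, 5        # hypothetical numbers of neurons in each phase group
T1, T2 = 2, 6        # hypothetical spike times of the two phases
window = (0, 10)     # averaging window containing both spike times

red_region = [T1] * h1 + [T2] * h2    # two phases, as in Equation C-1
green_region = [T2] * (h1 + h2)       # a single phase, as in Equation C-2

unit_weights = [1.0] * (h1 + h2)
other_weights = [0.2, 1.5, 0.7, 0.9, 1.1, 0.4, 2.0, 0.3]

red_unit = additive_input(red_region, unit_weights, window)
green_unit = additive_input(green_region, unit_weights, window)
red_other = additive_input(red_region, other_weights, window)
green_other = additive_input(green_region, other_weights, window)
```

In both cases the two regions yield the same sum, because an additive synapse discards the spike timing inside the window.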

The bottom part of Figure C-6 shows an even more complicated situation. We will show that although the additive synapses were unable to separate the figure from the background, multiplicative synapses can do much more by separating the two-speaker plus background case. The CAM for a two-speaker case is considered in the rectangle associated with the first layer’s activity, showing the underlying behavior of the CAM/CSM (Figure C-6, (G)). The region circled in red corresponds to the channels belonging to the first speaker, the purple region to the second speaker, and the green region to the background. The spike activity is shown in the three averaging windows. For the red-circled region, by applying the operator Ξ described in (Equation 5.16, chapter 4), the overall multiplicative contribution is given by (Figure C-6, (D)):

$$\theta\Big(\bigcup_i x_i\Big) = \prod_i w_{ll}(i)\,\Xi\{x_i\} \tag{C-5}$$

Note that all we have done so far is to introduce a new notation in the equation we had already defined in chapter 5 (Equation 5.15). Note also that $\delta(n - T)$ is either 1 or 0; therefore the multiplication as defined by $\theta()$ is either 0 or 1.

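$$\theta_1 = \theta\Big(\bigcup_{n=1}^{h_1} \delta(n - T_1) \bigcup_{n=1}^{h_2} \delta(n - T_2)\Big) = h_3\,\delta(n - T_1) + h_3\,\delta(n - T_2) \tag{C-6}$$

where $h_3$ is a scaling factor. For the purple-circled region, by applying the same multiplicative operator we obtain (Figure C-6, (E)):

$$\theta_2 = \theta\Big(\bigcup_{n=1}^{h_4} \delta(n - T_1) \bigcup_{n=1}^{h_5} \delta(n - T_2)\Big) = h_3\,\delta(n - T_1) + h_3\,\delta(n - T_3) \tag{C-7}$$

in which $h_5$ is the number of neurons in purple and $h_4$ is the number of neurons in white. For the green-circled region we have:

$$\theta_3 = \theta\Big(\bigcup_{n=1}^{h_6} \delta(n - T_1)\Big) = h_3\,\delta(n - T_1) \tag{C-8}$$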

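Averaging in the three cases, with weights different from unity, gives:

$$\langle \theta_1 \rangle = \prod_{i=1}^{h_1} w_i + \prod_{i=h_1}^{h_2} w_i \tag{C-9}$$

$$\langle \theta_2 \rangle = \prod_{i=1}^{h_4} w_i + \prod_{i=h_4}^{h_5} w_i \tag{C-10}$$

$$\langle \theta_3 \rangle = \prod_{i=1}^{h_6} w_i \tag{C-11}$$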

The above set of equations shows that the three results are different for multiplicative synapses. This achieves the goal of this appendix: multiplicative synapses can separate the sources while additive synapses cannot.

[Figure C-6, panels (A) to (G). Top: first layer’s activity (frequency channels versus time, temporal pattern of spiking neurons) and averaging with additive synapses (sum of neural activity over the averaging window; heights h1, h2 and h1 + h2; same averaging results). Bottom: first layer’s activity with spike times T1, T2 and T3, and averaging with multiplicative synapses (product of neural activity; height h3; weights w1 and w2; same result if w1 = w2, different result if w1 ≠ w2; different averaging results).]

Figure C-6: Comparison of multiplicative and additive synapses. Top: additive synapses

are unable to separate the ground from the source. Bottom: multiplicative synapses are

able to separate two speakers and the background (refer to the text for more details).

APPENDIX D: PARAMETERS OF THE

HODGKIN-HUXLEY NEURAL MODEL

Here are the numerical values used for parameters in equations defined in Section (4.2.1,

Chapter 4).

x     E_x        g_x
Na    115 mV     120 mS/cm^2
K     −12 mV     36 mS/cm^2
L     10.6 mV    0.3 mS/cm^2

TABLE D-1: Parameters for the Hodgkin-Huxley Equations.
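As an illustration of how the values of Table D-1 enter the model, the sketch below evaluates the total ionic current in the standard Hodgkin-Huxley form. It is assumed here that this form matches the equations of Section 4.2.1, which are not reproduced in this appendix; the names `HH_PARAMS` and `ionic_current` are hypothetical, and the gating variables m, h and n are passed in directly rather than integrated.

```python
# Values from Table D-1 (potentials in mV, conductances in mS/cm^2).
HH_PARAMS = {
    "Na": {"E": 115.0, "g": 120.0},
    "K": {"E": -12.0, "g": 36.0},
    "L": {"E": 10.6, "g": 0.3},
}


def ionic_current(v, m, h, n, params=HH_PARAMS):
    """Total ionic current (uA/cm^2) in the standard Hodgkin-Huxley form.

    v is the membrane potential in mV; m, h and n are the gating variables,
    each between 0 and 1.
    """
    i_na = params["Na"]["g"] * m ** 3 * h * (v - params["Na"]["E"])
    i_k = params["K"]["g"] * n ** 4 * (v - params["K"]["E"])
    i_leak = params["L"]["g"] * (v - params["L"]["E"])
    return i_na + i_k + i_leak


# With all gates closed only the leak term remains, and it vanishes at E_L.
leak_only_at_zero = ionic_current(0.0, 0.0, 0.0, 0.0)
leak_only_at_reversal = ionic_current(10.6, 0.0, 0.0, 0.0)
```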

TABLE D-2: Parameters used in Equation 4.2.1, page 56



APPENDIX E: FIXED-POINT

APPROXIMATION

The Van der Pol oscillator used in this appendix is a two-dimensional approximation of the Hodgkin-Huxley equations (as seen in chapter 4)⁴. We will show here that a further approximation to a one-dimensional state space reduces the oscillators to “integrate-and-fire” neurons. The derivation presented here is similar to the one described in [86], but it has been adapted to relaxation oscillators. In fact, a pseudo-linear approximation of the Wang-Terman state-space equations gives the following two-variable “integrate-and-fire” model. The model is obtained by linearization (Figure E-7) of each branch of the state-space trajectory. This means that the nullclines of the Wang-Terman oscillators (Figure 4.5, chapter 4) are linearized. The coupling strength $S_{i,j}$ is not considered below, since it does not intervene in the analysis (the $i, j$ subscripts are omitted for the sake of simplicity). The linearization in Figure E-7 gives the following equations:

$$\frac{dx}{dt} = f(x) - y + I \tag{E-1}$$

$$\frac{dy}{dt} = \varepsilon\,[bx - d(H(x) - 0.5)] \tag{E-2}$$

Typical values for this piecewise linearization are: $f(x) = ax$ for $x < 0.5$, $f(x) = a(1 - x)$ for $0.5 < x < 1.5$ and $f(x) = c_0 + c_1 x$ for $x > 1.5$, where $a$, $c_1$ are parameters and $c_0 = -0.5 - 1.5c_1$. Furthermore, $b > 0$, $d > 0$ and $0 < \varepsilon \ll 1$. $H(\cdot)$ is the Heaviside function as usual. $I$ is the input current. Note that these are typical values and the following reasoning remains the same with different values and even different approximating functions.
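The piecewise-linear system of Equations E-1 and E-2 can be explored with a short forward-Euler sketch. All numerical values below (a = −1, c1 = −1, b = d = 1, ε = 0.01, the step size and the initial states) are illustrative assumptions, not values taken from the thesis. In particular, c0 is computed here so that f is continuous at x = 1.5, which is an assumption of this sketch, and H(0) is taken as 0.

```python
def f(x, a=-1.0, c1=-1.0):
    """Piecewise-linear approximation of the x-nullcline."""
    c0 = -0.5 * a - 1.5 * c1  # assumption: makes f continuous at x = 1.5
    if x < 0.5:
        return a * x
    if x < 1.5:
        return a * (1.0 - x)
    return c0 + c1 * x


def heaviside(x):
    """Heaviside step function, with H(0) taken as 0 in this sketch."""
    return 1.0 if x > 0.0 else 0.0


def simulate(x0, y0=0.0, current=0.0, b=1.0, d=1.0, eps=0.01,
             dt=0.001, steps=2000):
    """Forward-Euler integration of Equations E-1 and E-2 from (x0, y0)."""
    x, y = x0, y0
    trajectory = [x]
    for _ in range(steps):
        dx = f(x) - y + current
        dy = eps * (b * x - d * (heaviside(x) - 0.5))
        x, y = x + dt * dx, y + dt * dy
        trajectory.append(x)
    return trajectory


below = simulate(0.3)  # perturbation below the threshold x = 1
above = simulate(1.2)  # perturbation above the threshold x = 1
```

Under these assumptions, the trajectory started below the threshold relaxes back towards rest, while the one started above the threshold keeps growing and moves onto the right branch.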

The rest state is $x = y = 0$. Suppose that the system is stimulated by a short current pulse that shifts the state of the system horizontally. As long as $x < 1$, we have $f(x) < 0$. According to Equation E-1, $dx/dt < 0$ and $x$ returns to the rest state. For $x < 0.5$ the relaxation to rest is exponential with $x(t) = \exp(at)$ in the limit of $\varepsilon \to 0$. Thus, the return to rest after a small perturbation is governed by the fast time scale. If the current pulse moves $x$ to a value larger than a predefined threshold, which is equal to one for this choice of parameters, we have $dx/dt = f(x) > 0$. Hence the voltage $x$ increases and a pulse is emitted.

⁴ A review of Chapter 3 (Section 3.1) of [86] is strongly advised before reading this appendix.

[Figure E-7 (plot in the (x, y) plane). Curves shown: the nullclines dy/dt = bx − d(H(x) − 0.5) = 0 and dx/dt = f(x) − y = 0.]

Figure E-7: Piecewise linear model of the state space of the Wang-Terman oscillator presented in Chapter 4. The open curve of (Figure 4.5, page 80) has been approximated by lines and the closed curve has been approximated by a rectangle. The inset shows the trajectory (arrows) which follows the x nullcline at a distance of order ε.

Using the above reasoning, we have simplified the two-dimensional Wang-Terman oscillator to a one-dimensional “integrate-and-fire” neuron with threshold, as written below. The coupling strength $S_{i,j}$ is re-added to the equations.

$$\frac{dx_{i,j}}{dt} = -x_{i,j} + S_{i,j} + H(p^{input}_{i,j}), \qquad x_{i,j} = 0 \ \text{ if } \ x_{i,j} > \text{threshold} \tag{E-3}$$

$H(\cdot)$ is again the Heaviside function. In addition, Campbell and Wang have shown by simulation (but not theoretically) that Van der Pol oscillators and integrate-and-fire neurons behave equivalently for temporal correlation purposes [102].
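A minimal sketch of the integrate-and-fire neuron with threshold of Equation E-3 is given below. The argument `drive` stands for the sum $S_{i,j} + H(p^{input}_{i,j})$, assumed constant during the simulation; the threshold value, the step size and the forward-Euler scheme are assumptions made for the illustration.

```python
def integrate_and_fire(drive, threshold=0.8, dt=0.001, steps=5000):
    """Euler-integrate dx/dt = -x + drive and reset x to 0 above threshold.

    Returns the list of step indices at which the neuron fired.
    """
    x = 0.0
    spike_steps = []
    for k in range(steps):
        x += dt * (-x + drive)
        if x > threshold:
            spike_steps.append(k)  # a spike is emitted
            x = 0.0                # reset, as in Equation E-3
    return spike_steps


weak = integrate_and_fire(0.5)    # x saturates below the threshold
medium = integrate_and_fire(1.0)
strong = integrate_and_fire(2.0)
```

With a drive below the threshold the state saturates and no spike is emitted, while a larger drive produces a higher firing rate.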


APPENDIX F: GAMMACHIRP/GAMMATONE

FILTERBANKS

In what follows in this appendix, we will detail some of the most important properties of

the Gammachirp/Gammatone filterbanks [128]. Although the gammatone filter is used

in this work, the gammachirp filter which is a more generalized form will be explained.

We will show that the gammatone filter is a simplification of the gammachirp filter.

The gammachirp filters are designed in such a way that the time-frequency uncertainty

is minimized [130]. The impulse response of the gammachirp filter is similar to the Gam-

matone filter except for a “chirp factor” c which is used as a modulation carrier.

$$g_c(t) = a\,t^{n-1}\,e^{-2\pi B(f_c)t}\,e^{j(2\pi f_c t + c \log t)} \tag{F-1}$$

$$B(f) = 0.1039f + 24.7 \tag{F-2}$$

This filterbank has an asymmetrical frequency response.

The spectrum of the Gammachirp filterbank can be factorized in the following way:

$$|G_c(f)| = a_\Gamma(c)\,|G_T(f)|\,e^{c\theta} \tag{F-3}$$

$G_c(f)$ is the spectrum of the Gammachirp filterbank, $G_T(f)$ is the spectrum of the Gammatone filterbank, $c$ is the modulation parameter, $a_\Gamma(c)$ is a gain which depends on $c$, and $\theta$ is given by:

$$\theta = \tan^{-1}\left(\frac{f - f_c}{B(f_c)}\right) \tag{F-4}$$

This decomposition proposed by Irino is interesting, because it represents the Gammachirp filterbank as the cascade of a Gammatone filterbank and a compensation filter $e^{c\theta}$. In this work only the gammatone filterbank is used, which is a special gammachirp filter with the parameter $c = 0$ in Equation F-1.
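The impulse response of Equation F-1 can be evaluated directly, as in the sketch below. The filter order n = 4, the unit gain a = 1 and the sample values are assumptions made for the illustration; setting c = 0 yields the gammatone special case mentioned above.

```python
import cmath
import math


def bandwidth(f):
    """Bandwidth B(f) of Equation F-2 (f in Hz)."""
    return 0.1039 * f + 24.7


def gammachirp(t, fc, c=0.0, n=4, a=1.0):
    """Complex gammachirp impulse response of Equation F-1 at time t (s).

    n = 4 and a = 1 are illustrative defaults; c = 0 gives the gammatone.
    """
    if t <= 0.0:
        return 0j  # the response is causal and log(t) is undefined at t <= 0
    envelope = a * t ** (n - 1) * math.exp(-2.0 * math.pi * bandwidth(fc) * t)
    phase = 2.0 * math.pi * fc * t + c * math.log(t)
    return envelope * cmath.exp(1j * phase)


tone = gammachirp(0.002, 1000.0)          # gammatone (c = 0)
chirp = gammachirp(0.002, 1000.0, c=2.0)  # gammachirp with a chirp term
```

In the time domain the chirp term only changes the phase of the response, so both filters share the same envelope; the asymmetry appears in the frequency response (Equation F-3).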


Other, more computationally efficient approaches and filterbanks exist; for details see [168].

APPENDIX G: RATE-CODING EQUIVALENCE

BETWEEN THE DLM AND THE ODLM

In this appendix we will prove that the network we proposed in chapter 5 (ODLM) is

rate-code equivalent to the original Dynamic Link Matcher (DLM). In order to do so, we

use the fixed-point approximation of the relaxation oscillator derived in appendix E.

We rewrite the dynamics of the neuron in the dynamic link matching phase (recall from chapter 5 that our network has a segmentation phase and a matching phase) in the simplified “integrate-and-fire with threshold” form (see appendix E for details) for our ODLM network, using the coupling strength in Equation 6.5 without the global controller influence:

$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{(k,m) \neq (i,j)} w^{int}_{i,j,k,m} H(x^{two}_{k,m}) + \sum_{k,m} w^{ext}_{i,j,k,m} H(x^{one}_{k,m}) + H(p^{input}), \qquad x = 0 \ \text{ if } \ x > \text{threshold} \tag{G-1}$$

Where $x^{two}$ stands for neurons in layer two and $x^{one}$ stands for neurons in layer one. Note that, as explained in chapter 5, there are synaptic connections ($w^{int}$) in layer 2 (the 4 neighbors in our proposed architecture in chapter 5) and synaptic connections from layer 1 to layer 2 ($w^{ext}$). The neighborhood $N(i, j)$ has been replaced by $(k, m) \neq (i, j)$. We use exactly the same approximation as in chapter 5 (see section 6.6), that is, we neglect the influence of intra-layer connections; therefore Equation G-1 becomes:

$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{k,m} w^{ext}_{i,j,k,m} H(x^{one}_{k,m}) + H(p^{input}) \tag{G-2}$$

Note that for an integrate-and-fire neuron the approximation $H(x) = x$ holds, since the output of an integrate-and-fire neuron is either 0 or 1 (it emits spikes or delta functions); therefore Equation G-2 can be further simplified to:

$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{k,m} w^{ext}_{i,j,k,m}\,x^{one}_{k,m} + H(p^{input}) \tag{G-3}$$



By averaging the two sides of Equation G-3 we get ($H(p^{input})$ is considered constant over $T$):

$$\frac{dx^{two}_a}{dt} = -x^{two}_a + \sum w^{ext} x^{one}_a + H(p^{input}) \tag{G-4}$$

$$x_a = \langle x \rangle_T = \frac{1}{T} \int_0^T x(t)\,dt \tag{G-5}$$

where $\langle x \rangle_T$ is the averaged version of $x$ over a time window of length $T$. For the sake of simplicity, the indices are omitted in Equation G-4. Note that $H(p^{input})$ is constant over time; therefore its time average is equal to $H(p^{input})$.
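The time average of Equation G-5 can be approximated by a Riemann sum, as in the following sketch. The sampling step, the window length and the spike positions are hypothetical; each spike is represented as an impulse of unit area (height 1/dt in a single bin), so that the average of a spike train is its firing rate.

```python
def time_average(x, dt):
    """Riemann-sum approximation of (1/T) * integral of x(t) over [0, T].

    x is a list of samples taken every dt, so that T = len(x) * dt.
    """
    window_length = len(x) * dt
    return sum(value * dt for value in x) / window_length


dt = 0.001
constant_input = [1.0] * 1000        # a constant input, like H(p_input)

spikes = [0.0] * 1000
for k in (100, 300, 500, 700, 900):  # hypothetical spike positions
    spikes[k] = 1.0 / dt             # unit-area impulse

average_constant = time_average(constant_input, dt)
rate = time_average(spikes, dt)      # spikes per unit time over the window
```

The constant input is left unchanged by the averaging, which is the property used for $H(p^{input})$ above, while the spike train is reduced to a rate.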

From Maass (chapter 2) [112], we know that the averaged output $x^{two}_a$ of an integrate-and-fire neuron is related to the averaged-over-time inputs of the neuron ($\sum w^{ext} x^{one}_a$) by a continuous function (sigmoidal, etc.). Let us name this function $\varphi$ (note that $\beta$ is a proportionality constant):

$$\langle x^{two}_{i,j} \rangle = \beta\,\varphi\Big(\sum w^{ext} \langle x^{one}_{k,m} \rangle\Big) \tag{G-6}$$

Note that in Equation G-4 we need $\langle x^{one}_{i,j} \rangle$ as a function of $\langle x^{two}_{k,m} \rangle$. Note further that Equation G-6 is a set of linear equations in $w^{ext}$ and we can deduce $x^{one}_{i,j}$ from that set of equations:

$$x^{one}_{i,j} = \sum_{k,m} \sigma(x^{two}_{k,m}) \tag{G-7}$$

Where $\sigma(x) = \varphi^{-1}(x)$. Replacing the above result in Equation G-4 gives (note that for the sake of simplicity we omitted the indices again):

$$\frac{dx^{two}_a}{dt} = -x^{two}_a + \sum\sum w^{ext} \sigma(x^{two}_a) + H(p^{input}) \tag{G-8}$$

On the other hand:

$$\sum\sum w^{int} \sigma(x^{two}_a) = k(x^{two}_a) * \sigma(x^{two}_a) \tag{G-9}$$

Where $*$ is a 2-D convolution. In our case $k(\cdot)$ is a 2-D rectangular window (in the original DLM $k(\cdot)$ was chosen to be a Mexican hat).
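The 2-D convolution with a rectangular window in Equation G-9 can be sketched in plain Python as follows. The zero padding at the borders, the 3 × 3 window size and the function name `conv2d_same` are assumptions made for the illustration; the input field is assumed to already contain the values $\sigma(x^{two}_a)$.

```python
def conv2d_same(field, kernel):
    """2-D convolution with zero padding; output has the shape of field."""
    rows, cols = len(field), len(field[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    center_r, center_c = k_rows // 2, k_cols // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for u in range(k_rows):
                for v in range(k_cols):
                    # Convolution flips the kernel relative to the input.
                    r = i - (u - center_r)
                    c = j - (v - center_c)
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += kernel[u][v] * field[r][c]
            out[i][j] = acc
    return out


rectangular_window = [[1.0] * 3 for _ in range(3)]  # hypothetical 3 x 3 window

single_active_unit = [[0.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.0]]
spread = conv2d_same(single_active_unit, rectangular_window)
```

A single active unit is spread uniformly over its neighborhood by the rectangular window, whereas a Mexican hat kernel would add an inhibitory surround.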

Ixr in Equation 6.21 is the input signal that can be replaced by H(pinput) in the nota-

tions of chapter 5. Therefore, we have proved that the DLM is an averaged-over-time

approximation of the ODLM.

As stated above, the influence of the global controller has been ignored in the derivation of these results. The reader may ask: “What would happen if we had the global controller in the equations?” The answer is that in steady state the average influence of the global controller does not change in time (see the activity of the “Inhibitor” in Figure 4.7, chapter 3). Therefore the equations derived above hold in steady state, up to a constant. The transient-state analysis seems much more complicated and has not been included in this appendix. It has been left for future work.

From the above discussion and mathematical derivation, we conclude that the dynam-

ics of the original Dynamic Link Matcher proposed by Konen et al. is the rate-coding

approximation of our place-coding network.


BIBLIOGRAPHY

[1] R. M. Borisyuk and Y. Kazanovich. Oscillatory neural network model of attention

focus formation and control. Biosystems, 71:29–36, 2003.

[2] M. Cooke and D. Ellis. The auditory organization of speech and other sources in

listeners and computational models. Speech Comm., pages 141–177, 2001.

[3] D. Wang and G. J. Brown. Separation of speech from interfering sounds based on

oscillatory correlation. IEEE Transactions on Neural Networks, 10(3):684–697, May

1999.

[4] G. Hu and D.L. Wang. Monaural speech segregation based on pitch tracking and

amplitude modulation. IEEE Trans. On Neural Networks, pages 1135– 1150, Sept.

2004.

[5] G. Jang and T. Lee. Single-channel signal separation using time-domain basis func-

tions. Signal Processing Letters, pages 168–171, June 2003.

[6] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990.

[7] T. Lengagne, T. Aubin, and J. Lauga. How do king penguins (aptenodytes patag-

onicus) apply the mathematical theory of information to communicate in windy

conditions? Proc. R. Soc. (London) B Biology, 266:1623–1628, 1999.

[8] J. Kanwal, A. Medvev, and C. Micheyl. Neurodynamics for auditory stream seg-

regation: tracking sounds in the mustached bat’s natural environment. Network:

Computation in Neural Systems, 14(13), 2003.

[9] E. C. Cherry. Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25:975–979, 1953.

[10] J. Driver. Enhancement of selective listening by illusory mislocation of speech sounds

due to lip-reading. Nature, 381:66–68, 1996.


[11] K. Koffka. Principles of Gestalt Psychology. Lund Humphries (London), 1935.

[12] A.J.W. Van der Kouwe, D.L. Wang, and G. J. Brown. A comparison of auditory

and blind separation techniques for speech segregation. IEEE Trans. on Speech and

Audio Processing, 9:189–195, 2001.

[13] D. Marr. Vision. Freeman Publishers, 1982.

[14] W. Ainsworth and S. Greenberg. Springer Handbook of Auditory Research. Springer,

2003.

[15] L. A. Jeffress. A place theory of sound localization. J. Comp. Physiol. Psychol.,

41:35–39, 1948.

[16] J.C.R. Licklider and W.H. Huggins. Place mechanisms of auditory frequency anal-

ysis. JASA, 23:290–299, 1951.

[17] T. W. Parsons. Separation of speech from interfering speech by means of harmonic

selection. JASA, 60:911–918, 1976.

[18] R.F. Lyon. A computational model of filtering, detection and compression in the

cochlea. In ICASSP, 1982.

[19] M.T. Scheffers. Sifting Vowels: Auditory Pitch Analysis and Sound Segregation.

PhD thesis, Groningen University, The Netherlands, 1983.

[20] M. Weintraub. A computational model for separating two simultaneous talkers. In

ICASSP, 1986.

[21] C. Von der Malsburg and W. Schneider. A neural cocktail-party processor. Biol. Cybernetics, pages 29–40, 1986.

[22] F. Berthommier and G. Meyer. Improving of amplitude modulation maps for f0-

dependent segregation of harmonic sounds. In Eurospeech’97, 1997.

[23] R.J. Stubbs and A.Q. Summerfield. Evaluation of 2 voice-separation algorithms

using normal-hearing and hearing-impaired listeners. JASA, 84:1236–1249, 1988.

[24] M. Cooke. Modelling Auditory Processing and Organisation. PhD thesis, University

of Sheffield, 1991.

[25] K. Mellinger. Event Formation and Separation in Musical Sound. PhD thesis,

Stanford University, 1991.

[26] K. Kashino and H. Tanaka. A sound source separation system using spectral features

integrated by the Dempster’s law of combination. Annual Report of the Engineering

Research Institute, University of Tokyo, 51:67–72, 1992.

[27] G. Brown and M. Cooke. Computational auditory scene analysis. Computer Speech and Language, pages 297–336, 1994.

[28] A. de Cheveigne. Separation of concurrent harmonic sounds: Fundamental fre-

quency estimation and a time-domain cancellation model of auditory processing.

Journal of Acoustical Society of America, pages 3271–3290, 1993.

[29] R.D. Patterson, M. H. Allerhand, and C. Giguere. Time-domain modelling of

peripheral auditory processing: A modular architecture and a software platform.

JASA, 98:1890–1894, 1995.

[30] D. Ellis. Prediction-Driven Computational Auditory Scene Analysis. PhD thesis,

MIT, 1996.

[31] D.F. Rosenthal and H. G. Okuno. Computational Auditory Scene Analysis.

Lawrence Erlbaum Assoc, 1998.

[32] M. Cooke, P. Green, L. Josifovski, and A. Vizinho. Robust automatic speech recogni-

tion with missing and unreliable acoustic data. Speech Communication, 34:267–285,

2001.

[33] S. T. Roweis. One microphone source separation. In NIPS, Denver, USA, 2000.

[34] M. J. Reyes-Gomez, B. Raj, and D. Ellis. Multi-channel source separation by

factorial HMMs. In ICASSP 2003, 2003.

[35] G. Hu and D. L. Wang. Monaural speech segregation based on pitch tracking and

amplitude modulation. Technical report, Ohio State University, 2002.

[36] M. Wu, D.L. Wang, and G.J. Brown. A multipitch tracking algorithm for noisy

speech. IEEE Trans. on Speech and Audio Processing, 2003.

[37] D.P. Gibson, N. W. Campbell, and B.T. Thomas. Very low bit rate semantic

compression of natural outdoor images. In Picture Coding Symposium, Oregon,

USA, 1999.

[38] N. Todd. An auditory cortical theory of auditory stream segregation. Network :

Computation in Neural Systems, 7:349–356, 1996.

[39] G. Langner. Temporal processing of pitch in the auditory system. J. New Music

Res, pages 116–132, 1997.

[40] S. Cunningham and M. Cooke. The role of evidence and counter-evidence in speech

perception. In International Congress of Phonetic Sciences 1999, 1999.

[41] J. Rouat and R. Pichevar. Source separation with one ear: Proposition for an

anthropomorphic approach. EURASIP Journal on Applied Signal Processing (sub-

mitted, invited paper), 2004.

[42] R. Pichevar and J. Rouat. Cochleotopic/AMtopic (CAM) and

Cochleotopic/Spectrotopic (CSM) map based sound source separation using

relaxation oscillatory neurons. In IEEE Neural Networks for Signal Processing

Workshop, Toulouse, France, 2003.

[43] R. Pichevar and J. Rouat. Monophonic source separation with an unsupervised network of spiking neurons. Speech Communication (Elsevier), submitted, 2004.

[44] F. Gaillard. Analyse de Scenes Auditives Computationnelle (CASA): Un Nouvel

Outil de Marquage Du Plan Temps-Frequence Par Detection D’harmonicite Ex-

ploitant Une Statistique de Passage Par Zero. PhD thesis, INPG, 1999.

[45] F. Klessner, V. Lesser, and S.H. Nawab. The IPUS Blackboard Architecture as a Framework for Computational Auditory Scene Analysis. In Computational Auditory Scene Analysis, D.F. Rosenthal and H.G. Okuno, 1998.

[46] S. Grossberg, K. K. Govindarajan, L.L. Wyse, and M.A. Cohen. ARTSTREAM:

A neural network model of auditory scene analysis and source segregation. Neural

Networks, 2003.

[47] S. T. Roweis. Factorial models and refiltering for speech separation and denoising.

In Eurospeech 2003, 2003.

[48] H. Sameti, H. Sheikhzadeh, L. Deng, and R.L. Brennan. HMM-based strategies for

enhancement of speech signals embedded in nonstationary noise. IEEE Trans. on

Speech and Audio Processing, pages 445–455, 1998.

[49] R. Remez and P. E. Rubin. Speech perception without traditional speech cues.

Science, pages 947–949, May 1981.

[50] R. E. Remez and P.E. Rubin. On the perceptual organization of speech. Psycho-

logical Review, pages 129–148, 1994.

[51] J. Barker and M. Cooke. Is the sine-wave speech cocktail party worth attending?

Speech communication, 27:159–174, 1999.

[52] C.G. Tsai. Auditory grouping in the perception of roughness induced by subhar-

monics: Empirical findings and a qualitative model. In International Symposium

on Musical Acoustics, Japan, 2004.

[53] T.S. Parker and L.O. Chua. Practical Numerical Algorithms for Chaotic Systems.

Springer-Verlag, 1989.

[54] F. Vrins, Lee J. A, M. Verleysen, V. Vigneron, and C. Jutten. Improving inde-

pendent component analysis performances by variable selection. In IEEE NNSP,

2003.

[55] J-F. Cardoso. Blind signal separation: Statistical principles. Proc. IEEE, 86:2009–

2025, 1998.

[56] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John

Wiley and Sons, 2001.

[57] P. Comon. Independent component analysis: A new concept? Signal Processing, 36:287–314, 1994.

[58] M. Casey. Separation of mixed audio sources by independent subspace analysis. In

Int’l Computer Music Conference, Berlin, Germany, 2000.

[59] L.Q. Zhang, C. Amari, and C. Cichoki. Natural gradient approach to blind separation of over- and under-complete mixtures. In Proc. Int. Workshop on Independent Component Analysis and Blind Source Separation, pages 455–460, 1999.

[60] P. Comon. Blind identification and source separation in 2x3 under-determined

mixtures. IEEE Trans. on signal processing, pages 11–22, 2004.

[61] L. Albera, P. Comon, P. Chevalier, and A. Ferreol. Blind identification of underdetermined mixtures based on the hexacovariance. In International Conference on Audio Speech and Signal Processing, 2004.

[62] M. Cooke. http://www.dcs.shef.ac.uk/~martin/.

[63] C. Prodohl, R. Wurtz, and C. Von der Malsburg. Learning the gestalt rule of

collinearity from object motion. Neural Computation, pages 1865–1896, 2003.

[64] W. Ross, S. Grossberg, and E. Mingolla. Visual cortical mechanisms of perceptual

grouping: Interacting layers, networks, columns, and maps. Neural Networks, pages

571–588, 2000.

[65] C. Von der Malsburg. The what and why of binding: The modeler’s perspective.

Neuron, pages 95–104, 1999.

[66] P. Milner. A model for visual shape recognition. Psychological Review, pages 521–

535, 1974.

[67] A. Kristjansson, D.L. Wang, and K. Nakayama. The role of priming in conjunctive

visual search. Cognition, 85:37–52, 2002.

[68] M. Shadlen and A. Movshon. Synchrony unbound: A critical evaluation of the

temporal binding hypothesis. Neuron, 24:67–77, 1999.

[69] M. Riesenhuber and T. Poggio. Are cortical models really bound by the binding

problem? Neuron, 24:87–93, 1999.

[70] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 1254–1259, 1998.

[71] J. Reynolds and R. Desimone. The role of neural mechanisms of attention in solving the binding problem. Neuron, 24:19–29, 1999.

[72] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.

[73] K. Fukushima. A neural network model for selective attention in visual pattern

recognition. Biol. Cybernetics, pages 5–15, 1986.

[74] B.A. Olshausen, C.H. Anderson, and D.C. Van Essen. A neurobiological model

of visual attention and invariant pattern recognition based on dynamic routing of

information. J. Neuroscience, pages 4700–4719, 1993.

[75] E.O. Postma, H.J. Van der Herik, and P.T. W. Hudson. SCAN: A scalable neural

model of covert attention. Neural Networks, 10:993–1015, 1997.

[76] E. Salinas and L.F. Abbott. Invariant visual responses from attentional gain fields. Journal of Neurophysiology, pages 3267–3272, 1997.

[77] L. Wiskott. How Does our Visual System Achieve Shift and Size Invariance. In J.L.

Van Hemmen and T.J. Sejnowski (Eds.), Oxford University Press, 2003.

[78] MIT Encyclopedia of Cognitive Sciences. MIT press, online.

[79] W. Singer. Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24:49–65, 1999.

[80] J. Wolfe and K. Cave. The psychological evidence for a binding problem. Neuron,

24:11–17, 1999.

[81] G. Bugmman. Binding by synchronisation: A task dependence hypothesis. Brain

and Behaviour Sciences, pages 685–688, 1997.

[82] J. Rouat and R. Pichevar. Nonlinear speech processing techniques for source segre-

gation. In EUSIPCO, Toulouse, France, 2002.

[83] V.I. Nenov. Neural network for learning, recognition, and recall of pattern sequences.

US Patent, No. 5,222,348, 1993.

[84] E. M. Izhikevich. Class 1 neural excitability, conventional synapses, weakly con-

nected networks, and mathematical foundations of pulse-coupled models. IEEE

Trans. on Neural Networks, 10(3):499–507, 1999.

[85] H.R. Wilson and J.D. Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysics Journal, 12:1–24, 1972.

[86] W. Gerstner. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cam-

bridge University Press, 2002.

[87] E. Izhikevich. Which model to use for cortical spiking neurons? IEEE Trans. on

Neural Networks, 2004.

[88] E. Izhikevich. Simple model of spiking neurons. IEEE Trans. on Neural Networks,

2003.

[89] L. Lapique. Recherches quantitatives sur l’excitation electrique des nerfs traitee

comme une polarisation. J. Physiol. Patho., pages 620–635, 1907.

[90] H. Kantz and T. Schreiber. Nonlinear time series. Cambridge University Press,

1997.

[91] L. Zhao and E. Macau. A network of dynamically coupled chaotic map for scene

segmentation. IEEE Trans. on Neural Networks, pages 1375–1385, 2001.

[92] K. Kaneko. Globally coupled chaos violates the law of large numbers but not the central-limit theorem. Physical Review Letters, pages 1391–1394, 1990.

[93] K. Kaneko. Chaotic but regular posi-nega switch among coded attractors by cluster-

size variation. Physical Review Letters, pages 219–223, 1989.

[94] J. Ito and K. Kaneko. Self-organized hierarchical structure in a plastic network of

chaotic units.

[95] F. Pasemann. Complex dynamics and the structure of small neural networks. Network: Computation in Neural Systems, pages 195–216, 2002.

[96] E. Izhikevich. Dynamical Systems in Neuroscience: The geometry of excitability

and bursting. Springer-Verlag (to appear), 2005.

[97] F.C. Hoppensteadt and E. Izhikevich. Weakly Connected Neural Networks. Springer-

Verlag, New York, 1997.

[98] R. Hilborn. Chaos and Nonlinear Dynamics: An Introduction for Scientists and

Engineers. Oxford University Press, 2000.

[99] R. Borisyuk. Synchronization of neural activity and information coding. In NCWS

2003, 2003.

[100] D.L. Wang and D. Terman. Image segmentation based on oscillatory correlation.

Neural Computation, pages 805–836, 1997.

[101] D. Wang. Relaxation oscillators and networks. In Wiley Encyclopedia of Electrical

and Electronics Engineering, pages 396–405. Wiley & Sons, 1999.

[102] S. R. Campbell, D. L. Wang, and C. Jayaprakash. Synchrony and desynchrony in

integrate-and-fire oscillators. Neural Computation, pages 1595–1619, 1999.

[103] D. L. Wang and D. Terman. Image segmentation based on oscillatory correlation. Neural Computation, pages II 521–II 525, 1995.

[104] R. Pichevar and J. Rouat. Binding of audio elements in the sound source segregation

problem via a two-layered bio-inspired neural network. In IEEE CCECE’2003.

[105] R. Pichevar and J. Rouat. Double-vowel segregation through temporal correlation:

A bio-inspired neural network paradigm. In NOLISP’2003, 2003.

[106] R. Pichevar and J. Rouat. Oscillatory dynamic link matching for pattern recogni-

tion. In International Workshop on Neural Coding (NCWS), Aulla, Italy, 2003.

[107] H.X. Wang and G.Q. Bi. Temporal asymmetry in spike timing-dependent synaptic plasticity. Psychology and Behavior, pages 551–555, 2002.

[108] K.P. Kording and P. Konig. Neurons with two sites of synaptic integration learn

invariant representations. Neural Computation, pages 2823–2849, 2001.

[109] R. Van Rullen and S. J. Thorpe. Rate coding versus temporal order coding: What

the retinal ganglion cells tell the visual cortex. Neural Computation, 13:1255–1283,

2001.

[110] C. Panchev, S. Wermter, and H. Chen. Spike-timing dependent competitive learning

of integrate-and-fire neurons with active dendrites. In ICANN, Spain, 2002.

[111] Simon Haykin. Neural Networks : A Comprehensive Foundation. Macmillan, 1994.

[112] W. Maass and C. M. Bishop. Pulsed Neural Networks. MIT Press, 1998.

[113] R. Eckhorn. Neural mechanisms of scene segmentation: Recordings from the visual cortex suggest basic circuits for linking field models. IEEE Trans. on Neural Networks, 10(3):464–479, 1999.

[114] X. Liu and D.L. Wang. Range image segmentation using a relaxation oscillator network. IEEE Trans. on Neural Networks, pages 564–574, May 1999.

[115] E. Cesmeli and D. Wang. Motion segmentation based on motion/brightness integration and oscillatory correlation. IEEE Trans. on Neural Networks, 11(4):935–947, 2000.

[116] D. Wang and D. Terman. Locally excitatory globally inhibitory oscillator networks. IEEE Trans. on Neural Networks, pages 283–286, 1995.

[117] D. L. Wang. On connectedness: A solution based on oscillatory correlation. Neural Computation, pages 131–139, 2000.

[118] S. N. Wrigley and G. J. Brown. A neural oscillator model of auditory attention. Lecture Notes in Computer Science, pages 1163–1170, 2001.

[119] H. Nakano and T. Saito. Synchronization in a pulse-coupled network of chaotic spiking oscillators. In 45th Midwest Symposium on Circuits and Systems, 2002.

[120] N. Cowan. Evolving conceptions of memory storage, selective attention and their mutual constraints within the human information processing system. Psychol. Bull., 104:163–191, 1988.

[121] B. Widrow. Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE, 63(12), 1975.

[122] Y. Kaneda and J. Ohga. Adaptive microphone-array system for noise reduction. TrASSP, pages 1391–1400, 1986.

[123] J.-M. Valin, J. Rouat, and F. Michaud. Microphone array post-filter for separation of simultaneous non-stationary sources. In ICASSP, Montreal, Canada, 2004.

[124] M.S. Brandstein and D.B. Ward (Eds.). Microphone Arrays: Signal Processing Techniques and Applications. Springer Verlag, 2001.

[125] J. Sanchez-Bote, J. Gonzalez-Rodriguez, and J. Ortega-Garcia. A real-time auditory-based microphone array assessed with E-RASTI evaluation proposal. In ICASSP, Hong-Kong, 2003.

[126] M.R. Gomez, D. Ellis, and N. Jojic. Multiband audio modeling for single-channel acoustic source separation. In ICASSP 2004, 2004.

[127] P.A. Cariani and B. Delgutte. Neural correlates of the pitch of complex tones. I. Pitch and pitch salience. II. Pitch shift, pitch ambiguity, phase invariance, pitch circularity, and the dominance region for pitch. J. Neurophysiology, 1996.

[128] R.D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand. Complex sounds and auditory images. In Y. Cazals, L. Demany, and K. Horner, editors, Auditory Physiology and Perception, pages 429–446. Pergamon Press, Oxford, 1992.

[129] R. Pichevar, J. Rouat, C. Feldbauer, and G. Kubin. A bio-inspired sound source separation technique in combination with an enhanced FIR gammatone Analysis/Synthesis filterbank. In EUSIPCO Vienna, 2004.

[130] T. Irino and M. Unoki. A time-varying, analysis/synthesis auditory filterbank using the gammachirp. In 98, volume 6, pages 3653–3656, Seattle, Washington, May 1998.

[131] Gernot Kubin and W. Bastiaan Kleijn. On speech coding in a perceptual domain. In 99, volume 1, pages 205–208, Phoenix, Arizona, March 1999.

[132] Malcolm Slaney. An efficient implementation of the Patterson-Holdsworth auditory filter bank. Technical Report 35, Apple Computer, Inc, 1993.

[133] J. Rouat, Y. C. Liu, and D. Morissette. A pitch determination and voiced/unvoiced decision algorithm for noisy speech. Speech Comm., 21:191–207, 1997.

[134] F. Plante, G. Meyer, and W. Ainsworth. Improvement of speech spectrogram accuracy by the method of reassignment. IEEE Trans. on Speech and Audio Processing, pages 282–287, 1998.

[135] C. Giguere and Philip C. Woodland. A computational model of the auditory periphery for speech and hearing research. JASA, pages 331–349, 1994.

[136] M.C. Liberman, S. Puria, and J.J. Guinan Jr. The ipsilaterally evoked olivocochlear reflex causes rapid adaptation of the 2f1-f2 distortion product otoacoustic emission. JASA, 99:3572–3584, 1996.

[137] D. L. Wang. Relaxation Oscillators and Networks, pages 396–405. John Wiley & Sons, 1999.

[138] R. Pichevar and J. Rouat. Streaming of audio objects on 2D spectral maps through multiplicative synaptic connection neurons. In Auditory Perception, Cognition, and Action Meeting, Vancouver, Canada, 2003.

[139] F. Gabbiani, H. Krapp, C. Koch, and G. Laurent. Multiplicative computation in a visual neuron sensitive to looming. Nature, 420:320–324, 2002.

[140] J.L. Pena and M. Konishi. Auditory spatial receptive fields created by multiplication. Science, 292:249–252, 2001.

[141] R.A. Andersen, L.H. Snyder, D.C. Bradley, and J. Xing. Multimodal representation of space in the posterior parietal cortex and its use in planning movements. Ann. Rev. Neurosci., 20:303, 1997.

[142] J. Rouat. Spatio-temporal pattern recognition with neural networks: Application to speech. In Artificial Neural Networks-ICANN’97, Lect. Notes in Comp. Sc. 1327, pages 43–48. Springer, October 1997.

[143] http://www-edu.gel.usherbrooke.ca/pichevar/.

[144] J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau. Robust sound source localization using a microphone array on a mobile robot. In IEEE/RSJ-Int. Conference on Intelligent Robots and Systems, 2003.

[145] J.-M. Valin, J. Rouat, and F. Michaud. Enhanced robot audition based on microphone array source separation with post-filter. In IROS, 2004.

[146] G. Hu and D.L. Wang. Separation of stop consonants. In ICASSP 2003, 2003.

[147] http://www.itu.int/home/.

[148] R. Pichevar and J. Rouat. Bio-inspired sound source separation technique based on a spiking neural network: Application to three-source sounds. Lecture Notes in Computer Science (Springer-Verlag), to appear, 2004.

[149] B. Boashash and M. Mesbah. Signal enhancement by time-frequency peak filtering. IEEE Trans. on Signal Processing, pages 929–938, 2004.

[150] S.C. Yen, E. D. Menschik, and L.H. Finkel. Cortical synchronization and perceptual salience. Computational Neuroscience: Trends in Research, pages 125–130, 1993.

[151] D. Somers and N. Kopell. Rapid synchronization through fast threshold modulation. Biological Cybernetics, pages 393–407, 1993.

[152] N. Kopell and G.B. Ermentrout. Symmetry and phaselocking in chains of weakly coupled oscillators. Communications on Pure and Applied Mathematics, pages 623–660, 1986.

[153] R. Pichevar and J. Rouat. RN-spike process for spatio-temporal pattern recognition. Canadian Provisional Patent, 2004.

[154] W. Konen, T. Maurer, and C. Von der Malsburg. A fast dynamic link matching algorithm for invariant pattern recognition. Neural Networks, pages 1019–1030, 1994.

[155] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, pages 715–770, 2002.

[156] T. Vinh Ho and J. Rouat. Novelty detection based on relaxation time of a network of integrate-and-fire neurons. In Proc. IEEE Int’l Joint Conference on Neural Networks, Alaska, USA, 1998.

[157] R. P. Wurtz. Multilayer Dynamic Link Networks for Establishing Image Point Correspondences and Visual Object Recognition. PhD thesis, Ruhr-Universitat Bochum, Germany, 1994.

[158] T. Aonishi, K. Kurata, and T. Mito. A phase locking theory for matching common parts of two images by dynamic link matching. Biological Cybernetics, 78(4):253–264, 1998.

[159] R. Pichevar and J. Rouat. Oscillatory dynamic link matching for invariant pattern recognition. Biosystems Journal (submitted), 2004.

[160] R. Pichevar and J. Rouat. Oscillatory dynamic link matcher: A bio-inspired neural network for pattern recognition. In Brain Inspired Cognitive Systems 2004, Stirling, Scotland (Invited Paper), 2004.

[161] X. Zhang and A. Minai. Detecting corresponding segments across images using synchronizable pulse-coupled neural networks. In IJCNN2001, 2001.

[162] L.E. Gordon. Theories of Visual Perception. John Wiley & Sons, 1997.

[163] L. Wiskott, C. Von der Malsburg, and A. Weitzenfeld. The Neural Simulation Language: A System for Brain Modeling, chapter 18, pages 343–372. MIT Press, 2002.

[164] H. Ando, N. Takashi Morie, M. Nagata, and A. Iwata. A nonlinear oscillator network circuit for image segmentation with double-threshold phase detection. In ICANN 99, 1999.

[165] A.K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, 1989.

[166] R. VanRullen and S. J. Thorpe. Surfing a spike wave down the ventral stream. Vision Research, pages 2593–2615, 2002.

[167] G.B. Ermentrout and N. Kopell. Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., pages 233–253, 1986.

[168] C. Feldbauer and G. Kubin. Critically sampled frequency-warped perfect reconstruction filterbank. In ECCTD’03, 2003.

BIBLIOGRAPHY

[1] R. M. Borisyuk and Y. Kazanovich. Oscillatory neural network model of attention focus formation and control. Biosystems, 71:29–36, 2003.

[2] M. Cooke and D. Ellis. The auditory organization of speech and other sources in listeners and computational models. Speech Comm., pages 141–177, 2001.

[3] D. Wang and G. J. Brown. Separation of speech from interfering sounds based on oscillatory correlation. IEEE Transactions on Neural Networks, 10(3):684–697, May 1999.

[4] G. Hu and D.L. Wang. Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. on Neural Networks, pages 1135–1150, Sept. 2004.

[5] G. Jang and T. Lee. Single-channel signal separation using time-domain basis functions. Signal Processing Letters, pages 168–171, June 2003.

[6] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990.

[7] T. Lengagne, T. Aubin, and J. Lauga. How do king penguins (Aptenodytes patagonicus) apply the mathematical theory of information to communicate in windy conditions? Proc. R. Soc. (London) B Biology, 266:1623–1628, 1999.

[8] J. Kanwal, A. Medvedev, and C. Micheyl. Neurodynamics for auditory stream segregation: tracking sounds in the mustached bat’s natural environment. Network: Computation in Neural Systems, 14(13), 2003.

[9] E. C. Cherry. Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25:975–979, 1953.

[10] J. Driver. Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading. Nature, 381:66–68, 1996.

[11] K. Koffka. Principles of Gestalt Psychology. Lund Humphries (London), 1935.

[12] A.J.W. Van der Kouwe, D.L. Wang, and G. J. Brown. A comparison of auditory and blind separation techniques for speech segregation. IEEE Trans. on Speech and Audio Processing, 9:189–195, 2001.

[13] D. Marr. Vision. Freeman Publishers, 1982.

[14] W. Ainsworth and S. Greenberg. Springer Handbook of Auditory Research. Springer, 2003.

[15] L. A. Jeffress. A place theory of sound localization. J. Comp. Physiol. Psychol., 41:35–39, 1948.

[16] J.C.R. Licklider and W.H. Huggins. Place mechanisms of auditory frequency analysis. JASA, 23:290–299, 1951.

[17] T. W. Parsons. Separation of speech from interfering speech by means of harmonic selection. JASA, 60:911–918, 1976.

[18] R.F. Lyon. A computational model of filtering, detection and compression in the cochlea. In ICASSP, 1982.

[19] M.T. Scheffers. Sifting Vowels: Auditory Pitch Analysis and Sound Segregation. PhD thesis, Groningen University, The Netherlands, 1983.

[20] M. Weintraub. A computational model for separating two simultaneous talkers. In ICASSP, 1986.

[21] C. Von der Malsburg and W. Schneider. A neural cocktail-party processor. Biol. Cybernetics, pages 29–40, 1986.

[22] F. Berthommier and G. Meyer. Improving of amplitude modulation maps for f0-dependent segregation of harmonic sounds. In Eurospeech’97, 1997.

[23] R.J. Stubbs and A.Q. Summerfield. Evaluation of 2 voice-separation algorithms using normal-hearing and hearing-impaired listeners. JASA, 84:1236–1249, 1988.

[24] M. Cooke. Modelling Auditory Processing and Organisation. PhD thesis, University of Sheffield, 1991.

[25] K. Mellinger. Event Formation and Separation in Musical Sound. PhD thesis, Stanford University, 1991.

[26] K. Kashino and H. Tanaka. A sound source separation system using spectral features integrated by the Dempster’s law of combination. Annual Report of the Engineering Research Institute, University of Tokyo, 51:67–72, 1992.

[27] G. Brown and M. Cooke. Computational auditory scene analysis. Computer Speech and Language, pages 297–336, 1994.

[28] A. de Cheveigne. Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing. Journal of Acoustical Society of America, pages 3271–3290, 1993.

[29] R.D. Patterson, M. H. Allerhand, and C. Giguere. Time-domain modelling of peripheral auditory processing: A modular architecture and a software platform. JASA, 98:1890–1894, 1995.

[30] D. Ellis. Prediction-Driven Computational Auditory Scene Analysis. PhD thesis, MIT, 1996.

[31] D.F. Rosenthal and H. G. Okuno. Computational Auditory Scene Analysis. Lawrence Erlbaum Assoc, 1998.

[32] M. Cooke, P. Green, L. Josifovski, and A. Vizinho. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Communication, 34:267–285, 2001.

[33] S. T. Roweis. One microphone source separation. In NIPS, Denver, USA, 2000.

[34] M. J. Reyes-Gomez, B. Raj, and D. Ellis. Multi-channel source separation by factorial HMMs. In ICASSP 2003, 2003.

[35] G. Hu and D. L. Wang. Monaural speech segregation based on pitch tracking and amplitude modulation. Technical report, Ohio State University, 2002.

[36] M. Wu, D.L. Wang, and G.J. Brown. A multipitch tracking algorithm for noisy speech. IEEE Trans. on Speech and Audio Processing, 2003.

[37] D.P. Gibson, N. W. Campbell, and B.T. Thomas. Very low bit rate semantic compression of natural outdoor images. In Picture Coding Symposium, Oregon, USA, 1999.

[38] N. Todd. An auditory cortical theory of auditory stream segregation. Network: Computation in Neural Systems, 7:349–356, 1996.

[39] G. Langner. Temporal processing of pitch in the auditory system. J. New Music Res., pages 116–132, 1997.

[40] S. Cunningham and M. Cooke. The role of evidence and counter-evidence in speech perception. In International Congress of Phonetic Sciences 1999, 1999.

[41] J. Rouat and R. Pichevar. Source separation with one ear: Proposition for an anthropomorphic approach. EURASIP Journal on Applied Signal Processing (submitted, invited paper), 2004.

[42] R. Pichevar and J. Rouat. Cochleotopic/AMtopic (CAM) and Cochleotopic/Spectrotopic (CSM) map based sound source separation using relaxation oscillatory neurons. In IEEE Neural Networks for Signal Processing Workshop, Toulouse, France, 2003.

[43] R. Pichevar and J. Rouat. Monophonic source separation with an unsupervised network of spiking neurons. Speech Communication (Elsevier), submitted, 2004.

[44] F. Gaillard. Analyse de Scenes Auditives Computationnelle (CASA): Un Nouvel Outil de Marquage Du Plan Temps-Frequence Par Detection D’harmonicite Exploitant Une Statistique de Passage Par Zero. PhD thesis, INPG, 1999.

[45] F. Klassner, V. Lesser, and S.H. Nawab. The IPUS Blackboard Architecture as a Framework for Computational Auditory Scene Analysis. In Computational Auditory Scene Analysis, D.F. Rosenthal and H.G. Okuno, 1998.

[46] S. Grossberg, K. K. Govindarajan, L.L. Wyse, and M.A. Cohen. ARTSTREAM: A neural network model of auditory scene analysis and source segregation. Neural Networks, 2003.

[47] S. T. Roweis. Factorial models and refiltering for speech separation and denoising. In Eurospeech 2003, 2003.

[48] H. Sameti, H. Sheikhzadeh, L. Deng, and R.L. Brennan. HMM-based strategies for enhancement of speech signals embedded in nonstationary noise. IEEE Trans. on Speech and Audio Processing, pages 445–455, 1998.

[49] R. Remez and P. E. Rubin. Speech perception without traditional speech cues. Science, pages 947–949, May 1981.

[50] R. E. Remez and P.E. Rubin. On the perceptual organization of speech. Psychological Review, pages 129–148, 1994.

[51] J. Barker and M. Cooke. Is the sine-wave speech cocktail party worth attending? Speech Communication, 27:159–174, 1999.

[52] C.G. Tsai. Auditory grouping in the perception of roughness induced by subharmonics: Empirical findings and a qualitative model. In International Symposium on Musical Acoustics, Japan, 2004.

[53] T.S. Parker and L.O. Chua. Practical Numerical Algorithms for Chaotic Systems. Springer-Verlag, 1989.

[54] F. Vrins, J. A. Lee, M. Verleysen, V. Vigneron, and C. Jutten. Improving independent component analysis performances by variable selection. In IEEE NNSP, 2003.

[55] J-F. Cardoso. Blind signal separation: Statistical principles. Proc. IEEE, 86:2009–2025, 1998.

[56] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley and Sons, 2001.

[57] P. Comon. Independent component analysis: A new concept? Signal Processing, 36:287–314, 1994.

[58] M. Casey. Separation of mixed audio sources by independent subspace analysis. In Int’l Computer Music Conference, Berlin, Germany, 2000.

[59] L.Q. Zhang, S. Amari, and A. Cichocki. Natural gradient approach to blind separation of over- and under-complete mixtures. In Proc. Int. Workshop on Independent Component Analysis and Blind Source Separation, pages 455–460, 1999.

[60] P. Comon. Blind identification and source separation in 2x3 under-determined mixtures. IEEE Trans. on Signal Processing, pages 11–22, 2004.

[61] L. Albera, P. Comon, P. Chevalier, and A. Ferreol. Blind identification of underdetermined mixtures based on the hexacovariance. In International Conference on Audio Speech and Signal Processing, 2004.

[62] M. Cooke. http://www.dcs.shef.ac.uk/~martin/.

[63] C. Prodohl, R. Wurtz, and C. Von der Malsburg. Learning the gestalt rule of collinearity from object motion. Neural Computation, pages 1865–1896, 2003.

[64] W. Ross, S. Grossberg, and E. Mingolla. Visual cortical mechanisms of perceptual grouping: Interacting layers, networks, columns, and maps. Neural Networks, pages 571–588, 2000.

[65] C. Von der Malsburg. The what and why of binding: The modeler’s perspective. Neuron, pages 95–104, 1999.

[66] P. Milner. A model for visual shape recognition. Psychological Review, pages 521–535, 1974.

[67] A. Kristjansson, D.L. Wang, and K. Nakayama. The role of priming in conjunctive visual search. Cognition, 85:37–52, 2002.

[68] M. Shadlen and A. Movshon. Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24:67–77, 1999.

[69] M. Riesenhuber and T. Poggio. Are cortical models really bound by the binding problem? Neuron, 24:87–93, 1999.

[70] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 1254–1259, 1998.

[71] J. Reynolds and R. Desimone. The role of neural mechanisms of attention in solving the binding problem. Neuron, 24:19–29, 1999.

[72] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.

[73] K. Fukushima. A neural network model for selective attention in visual pattern recognition. Biol. Cybernetics, pages 5–15, 1986.

[74] B.A. Olshausen, C.H. Anderson, and D.C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neuroscience, pages 4700–4719, 1993.

[75] E.O. Postma, H.J. Van den Herik, and P.T. W. Hudson. SCAN: A scalable neural model of covert attention. Neural Networks, 10:993–1015, 1997.

[76] E. Salinas and L.F. Abbott. Invariant visual responses from attentional gain fields. Journal of Neurophysiology, pages 3267–3272, 1997.

[77] L. Wiskott. How Does our Visual System Achieve Shift and Size Invariance. In J.L. Van Hemmen and T.J. Sejnowski (Eds.), Oxford University Press, 2003.

[78] MIT Encyclopedia of Cognitive Sciences. MIT Press, online.

[79] W. Singer. Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24:49–65, 1999.

[80] J. Wolfe and K. Cave. The psychological evidence for a binding problem. Neuron, 24:11–17, 1999.

[81] G. Bugmann. Binding by synchronisation: A task dependence hypothesis. Brain and Behaviour Sciences, pages 685–688, 1997.

[82] J. Rouat and R. Pichevar. Nonlinear speech processing techniques for source segregation. In EUSIPCO, Toulouse, France, 2002.

[83] V.I. Nenov. Neural network for learning, recognition, and recall of pattern sequences. US Patent, No. 5,222,348, 1993.

[84] E. M. Izhikevich. Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Trans. on Neural Networks, 10(3):499–507, 1999.

[85] H.R. Wilson and J.D. Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12:1–24, 1972.

[86] W. Gerstner. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.

[87] E. Izhikevich. Which model to use for cortical spiking neurons? IEEE Trans. on Neural Networks, 2004.

[88] E. Izhikevich. Simple model of spiking neurons. IEEE Trans. on Neural Networks, 2003.

[89] L. Lapicque. Recherches quantitatives sur l’excitation electrique des nerfs traitee comme une polarisation. J. Physiol. Patho., pages 620–635, 1907.

[90] H. Kantz and T. Schreiber. Nonlinear time series. Cambridge University Press, 1997.

[91] L. Zhao and E. Macau. A network of dynamically coupled chaotic map for scene segmentation. IEEE Trans. on Neural Networks, pages 1375–1385, 2001.

[92] K. Kaneko. Globally coupled chaos violates the law of large numbers but not the central-limit theorem. Physical Review Letters, pages 1391–1394, 1990.

[93] K. Kaneko. Chaotic but regular posi-nega switch among coded attractors by cluster-size variation. Physical Review Letters, pages 219–223, 1989.

[94] J. Ito and K. Kaneko. Self-organized hierarchical structure in a plastic network of chaotic units.

[95] F. Pasemann. Complex dynamics and the structure of small neural networks. Network: Computation in Neural Systems, pages 195–216, 2002.

[96] E. Izhikevich. Dynamical Systems in Neuroscience: The geometry of excitability and bursting. Springer-Verlag (to appear), 2005.

[97] F.C. Hoppensteadt and E. Izhikevich. Weakly Connected Neural Networks. Springer-Verlag, New York, 1997.

[98] R. Hilborn. Chaos and Nonlinear Dynamics: An Introduction for Scientists and Engineers. Oxford University Press, 2000.

[99] R. Borisyuk. Synchronization of neural activity and information coding. In NCWS2003, 2003.

[100] D.L. Wang and D. Terman. Image segmentation based on oscillatory correlation.Neural Computation, pages 805–836, 1997.

[101] D. Wang. Relaxation oscillators and networks. In Wiley Encyclopedia of Electricaland Electronics Engineering, pages 396–405. Wiley & Sons, 1999.

[102] S. R. Campbell, D. L. Wang, and C. Jayaprakash. Synchrony and desynchrony inintegrate-and-fire oscillators. Neural Computation, pages 1595–1619, 1999.

[103] D. L. Wang and D. Terman. Image segmentation based on oscillatory correlataion.Neural Computation, pages II 521– II 525, 1995.

[104] R. Pichevar and J. Rouat. Binding of audio elements in the sound source segregationproblem via a two-layered bio-inspired neural network. In IEEE CCECE’2003.

[105] R. Pichevar and J. Rouat. Double-vowel segregation through temporal correlation:A bio-inspired neural network paradigm. In NOLISP’2003, 2003.

[106] R. Pichevar and J. Rouat. Oscillatory dynamic link matching for pattern recogni-tion. In International Workshop on Neural Coding (NCWS), Aulla, Italy, 2003.

[107] H.X. Wang G.Q. Bi. Temporal asymmetry in spike timing-dependent synapticplasticity. Psychology and Behavior, pages 551–555, 2002.

[108] K.P. Kording and P. Konig. Neurons with two sites of synaptic integration learninvariant representations. Neural Computation, pages 2823–2849, 2001.

[109] R. Van Rullen and S. J. Thorpe. Rate coding versus temporal order coding: Whatthe retinal ganglion cells tell the visual cortex. Neural Computation, 13:1255–1283,2001.

[110] C. Panchev, S. Wermter, and H. Chen. Spike-timing dependent competitive learningof integrate-and-fire neurons with active dendrites. In ICANN, Spain, 2002.

[111] Simon Haykin. Neural Networks : A Comprehensive Foundation. Macmillan, 1994.

[112] W. Maass and C. M. Bishop. Pulsed Neural Networks. MIT Press, 1998.

[113] R. Eckhorn. Neural mechanisms of scene segmentation: Recordings from the vi-sual cortex suggest basic circuits for linking field models. IEEE Trans. on NeuralNetworks, 10(3):464–479, 1999.

[114] X. Liu and D.L. Wang. Range image segmentation using a relaxation oscillatornetwork. IEEE Trans. On Neural Networks, pages 564–574, May 99.

[115] E. Cesmeli and D. Wang. Motion segmentation based on motion/brightness integra-tion and oscillatory correlation. IEEE Trans. on Neural Networks, 11(4):935–947,2000.

[116] D. Wang and D. Terman. Locally excitatory globally inhibitory oscillator networks.IEEE Trans. on Neural Networks, pages 283–286, 1995.

[117] D. L. Wang. On connectedness: A solution based on oscillatory correlation. NeuralComputation, pages 131–139, 2000.

[118] S. N. Wrigley and G. J. Brown. A neural oscillator model of auditory attention.Lecture Notes in Computer Science, pages 1163–1170, 2001.

[119] H. Nakano and T. Saito. Synchronization in a pulse-coupled network of chaoticspiking oscillators. In 45th Midwest Symposium on Circuits and Systems, 2002.

[120] N. Cowan. Evolving conceptions of memory storage, selective attention and theirmutual constraints within the human information processing system. Psychol. Bull.,104:163–191, 1988.

[121] B. Widrow. Adaptive noise cancelling: Principles and applications. Proceedings ofthe IEEE, 63(12), 1975.

[122] Y. Kaneda and J. Ohga. Adaptive microphone-array system for noise reduction.TrASSP, pages 1391–1400, 1986.

[123] J.-M. Valin, J. Rouat, and F. Michaud. Microphone array post-filter for separationof simultaneous non-stationary sources. In ICASSP, Montreal, Canada, 2004.

[124] M.S. Brandstein and D.B. (Eds.). Microphoe Arrays: Signal Processing Techniquesand Applications. Springer Verlag, 2001.

[125] J. Sanchez-Bote, J. Gonzales-Rodriguez, and J. Ortega-Garcian. A real-timeauditory-based microphone array assessedwith e-rasti evaluation proposal. InICASSP, Hong-Kong, 2003.

[126] M.R. Gomez, D. Ellis, and N. Jojic. Multiband audio modeling for single-channelacoustic source separation. In IC ASSP 2004, 2004.

[127] P.A. Cariani and B. Delgutte. Neural correlates of the pitch complex tones. i. pitchand pitch salience. ii. pitch shift, pitch ambiguity, phase invariance, pitch circularity,and the dominance region for pitch. J. Neurophysiology, 1996.

[128] R.D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Aller-hand. Complex sounds and auditory images. In Y. Cazals, L. Demany, andK. Horner, editors, Auditory Physiology and Perception, pages 429–446. PergamonPress, Oxford, 1992.

[129] R. Pichevar, J. Rouat, C. Feldbauer, and G. Kubin. A bio-inspired sound sourceseparation technique in combination with an enhanced FIR gammatone Analy-sis/Synthesis filterbank. In EUSIPCO Vienna, 2004.

[130] T. Irino and M. Unoki. A time-varying, analysis/synthesis auditory filterbank usingthe gammachirp. In 98, volume 6, pages 3653–3656, Seattle, Washington, May 1998.

[131] Gernot Kubin and W. Bastiaan Kleijn. On speech coding in a perceptual domain.In 99, volume 1, pages 205–208, Phoenix, Arizona, March 1999.

[132] Malcolm Slaney. An efficient implementation of the Patterson-Holdsworth auditoryfilter bank. Technical Report 35, Apple Computer, Inc, 1993.

[133] J. Rouat, Y. C. Liu, and D. Morissette. A pitch determination and voiced/unvoiceddecision algorithm for noisy speech. Speech Comm., 21:191–207, 1997.

[134] F. Plante, G. Meyer, and W. Ainsworth. Improvement of speech spectrogram accu-racy by the method of reassignment. IEEE Trans. on Speech and Audio Processing,pages 282–287, 1998.

[135] C. Giguere and Philip C. Woodland. A computational model of the auditory pe-riphery for speech and hearing research. JASA, pages 331–349, 1994.

[136] M.C. Liberman, S. Puria, and J.J. Jr. Guinan. The ipsilaterally evoked olivo-cochlearreflex causes rapid adaptation of the 2f1-f2 distortion product otoacousticemission. JASA, 99:2572–3584, 1996.

[137] D. L. Wang. Relaxation Oscillators and Networks, pages 396–405. John Wiley Sons,1999.

[138] R. Pichevar and J. Rouat. Streaming of audio objects on 2D spectral maps throughmultiplicative synaptic connection neurons. In Auditory Perception, Cognition, andAction Meeting , Vancouver, Canada, 2003.

[139] F. Gabbiani, H. Krapp, C. Koch, and G. Laurent. Multiplicative computation in avisual neuron sensitive to looming. Nature, 420:320–324, 2002.

[140] JL. Pena and M. Konishi. Auditory spatial receptive fields created by multiplication.Science, 292:294–252, 2001.

[141] R.A. Andersen, L.H. Snyder, D.C. Bradley, and J. Xing. Multimodal representationof space in the posterior parietal cortex and its use in planning movements. Ann.Rev. Neurosci., page 20:303, 1997.

[142] J. Rouat. Spatio-temporal pattern recognition with neural networks: Applicationto speech. In Artificial Neural Networks-ICANN’97, Lect. Notes in Comp. Sc. 1327,pages 43–48. Springer, 10 1997.

[143] http://www-edu.gel.usherbrooke.ca/pichevar/.

[144] J.-M. Valin, F. Michaud, J. Rouat, and D. LUtourneau. Robust sound source local-ization using a microphone array on a mobile robot. In IEEE/RSJ-Int. Conferenceon Intelligent Robots and Systems., 2003.

[145] J.-M. Valin, J. Rouat, and F. Michaud. Enhanced robot audition based on micro-phone array source separation with post-filter. In IROS, 2004.

[146] G. Hu and D.L. Wang. Separation of stop consonants. In ICASSP 2003, 2003.

[147] http://www.itu.int/home/.

[148] R. Pichevar and J. Rouat. Bio-inspired sound source separation technique basedon a spiking neural network: Application to three-source sounds. Lecture Notes inComputer Science (Springer-Verlag), to appear, 2004.

[149] B. Boashash and M. Mesbah. Signal enhancement by time-frequency peak filtering.IEEE Trans. On Signal Processing, pages 929–938, 2004.

[150] S.C. Yen, E. D. Meschik, and L.H. Finkel. Cortical synchronization and perceptualsalience. Computational Neuroscience: Trends in Research, pages 125–130, 1993.

[151] D. Somers and N. Kopell. Rapid synchronization through fast threshold modulation.Biological cybernetics, pages 393–407, 1993.

[152] N. Koppel and G.B. Ermentrout. Symmetry and phaselocking in chains of weaklycoupled oscillators. Communications on Pure and Applied Mathematics, pages 623–660, 1986.

[153] R. Pichevar and J. Rouat. RN-spike process for spatio-temporal pattern recognition.Canadian Provisional Patent, 2004.

[154] W. Konen, T. Maurer, and C. Von der Malsburg. A fast dynamic link matchingalgorithm for invariant pattern recognition. Neural Networks, pages 1019–1030,1994.

[155] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of in-variances. Neural Computation, pages 715–770, 2002.

[156] T. Vinh Ho and J. Rouat. Novelty detection based on relaxation time of a net-work of integrate-and-fire neurons. In Proc. IEEE Int’l Joint Conference on NeuralNetworks, Alaska, USA, 1998.

[157] R. P. Wurtz. Multilayer Dynamic Link Networks for Establishing Image Point Cor-respondences and Visual Object Recognition. PhD thesis, Ruhr-Universitat Bochum,Germany, 1994.

[158] T. Aoinishi, K. Kurata, and T. Mito. A phase locking theory for matching commonparts of two images by dynamic link matching. Biological Cybernetics, 78(4):253–264, 1998.

[159] R. Pichevar and J. Rouat. Oscillatory dynamic link matching for invariant patternrecognition. Biosystems Journal (submitted), 2004.

[160] R. Pichevar and J. Rouat. Oscillatory dynamic link matcher: A bio-inspired neuralnetwork for pattern recognition. In Brain Inspired Cognitive Systems 2004, Stirling,Scotland (Invited Paper), 2004.

[161] X. Zhang and A. Minai. Detecting corresponding segments across images usingsynchronizable pulse-coupled nerual networks. In IJCNN2001, 2001.

[162] L.E. Gordon. Theories of Visual Perception. John Wiley Sons, 1997.

[163] L. Wiskott, C. Von der Malsburg, and A. Weitzenfeld. The Neural SimulationLanguage: A System for Brain Modeling, chapter 18, pages 343–372. MIT Press,2002.

[164] H. Ando, N. Takashi Morie, M. Nagata, and A. Iwata. A nonlinear oscillator networkcircuit for image segmentation with double-threshold phase detection. In ICANN99, 1999.

[165] A.K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, 1989.

[166] R. VanRullen and S. J. Thorpe. Surfing a spike wave down the ventral stream.Vision Research, pages 2593–2615, 2002.

[167] G.B. Ermentrout and N. Kopell. Parabolic bursting in an excitable system coupledwith a slow oscillation. SIAM J. Appl. Math., pages 233–253, 1986.

[168] C. Feldbauer and G. Kubin. Critically sampled frequency-warped perfect recon-struction filterbank. In ECCTD‘03, 2003.
