Faculty of Engineering
Electrical Engineering and Computer Engineering
Neural and anthropomorphic processing of the “cocktail party” effect for applications
in sound source separation and pattern recognition
Doctoral thesis
Specialty: electrical engineering
Ramin PICHEVAR
Sherbrooke (Quebec) Canada, November 2004
SUMMARY
This thesis consists of two parts. The first part deals with sound source separation.
Building on the findings of the first part, we propose in the second part a neural
architecture for pattern recognition.
The proposed sound separation system is based on a bio-inspired neural architecture
of spiking networks. Two different representations (Cochleotopic/AMtopic or
Cochleotopic/Spectrotopic) are used as preprocessing. These two-dimensional auditory
images try to partially mimic the behavior of the auditory pathway. The building blocks
of the proposed neural network are relaxation oscillatory neurons.
We show that the behavior of the more popular “integrate-and-fire” neuron is an
approximation of the relaxation neuron. Separation is based on the synchronization
of the second layer of neurons. Each neuron of the second layer is associated with
a cochlear channel (256 channels in total). An enhanced version of the gammatone
analysis/synthesis filterbank is used to generate the cochlear channels. The spectral
distortion criterion (LSD) is used to compare performance. We also use other
performance criteria such as PEL (percentage of energy loss), PNR (percentage of
noise residue), SNR (signal-to-noise ratio) and PESQ (perceptual evaluation of
speech quality).
The pattern recognition system is inspired by the first part of the thesis. The goal
of this part is to draw an analogy between vision and audition in order to propose
an architecture capable of pattern recognition. Our architecture, called “Oscillatory
Dynamic Link Matching”, is an extension of the “Dynamic Link Matching” architecture
proposed earlier by other researchers. The proposed architecture comprises two layers.
If synchronization is reached between the layers, the pattern exists in the scene.
The behavior of the network is analyzed mathematically in the thesis.
ABSTRACT
This thesis consists of two parts. The first part deals with the sound source separation
problem. Based on the findings of the first part, a neural architecture for visual pattern
recognition is proposed in the second part.
The proposed sound source separation technique is based on a two-layered bio-inspired
spiking neural network. Depending on the characteristics of the intruding sound, one of
the two proposed bio-inspired spectral maps (Cochleotopic/AMtopic or
Cochleotopic/Spectrotopic) is used as a front-end. These two-dimensional maps try to
partially mimic the auditory pathway. The building blocks of the neural network are
oscillatory relaxation neurons. We show that the behavior of the more popular
integrate-and-fire neuron is an approximation of that of the relaxation neuron. The
separation of different sound sources is based on the synchronization of neurons in the
second layer. Each neuron in the second layer is associated with a cochlear channel
(a total of 256 channels in our experiments). An enhanced version of the gammatone
analysis/synthesis filterbank is used to generate the cochlear channels. The
Log-Spectral Distortion (LSD) criterion is used to compare performance. We also use
other performance criteria such as PEL (Percentage of Energy Loss), PNR (Percentage
of Noise Residue), SNR (Signal-to-Noise Ratio), and PESQ (Perceptual Evaluation of
Speech Quality).
The proposed visual pattern recognition system is inspired by the work done in the first
part of the thesis. The goal of this part is to draw an analogy between audition and
vision and to propose an architecture capable of visual pattern recognition. Our proposed
‘Oscillatory Dynamic Link Matcher’ is an extension of the already known ‘Dynamic Link
Matcher’. The network consists of two layers. The pattern is applied to the first layer and
the scene to the second layer. If synchronization is achieved between the layers, we
conclude that the pattern exists in the scene. These facts, along with other properties,
are proven mathematically in this thesis.
ACKNOWLEDGMENTS
I thank Jean Rouat, my thesis supervisor, who sparked my curiosity with his
multidisciplinary approach. Thanks to his tenacity, we completed an innovative piece
of research. I also thank him for his financial support.
I thank Juan-Manuel Torres, Francois Michaud and Roch Lefebvre for agreeing to serve
on my thesis committee.
I thank my colleagues and friends, Stephane Loiselle, Rachid Moussaoui, Gregoire
Mouly-aigrot, Gregory Farage, Stephane Ragot, Mohammed Bahoura, Hassan Ezzaidi,
Steeve Larouche, Romain Balleraud, Mario Petitclerc, Le Tan Thanh Tai and Guillaume
Fuchs.
I thank DeLiang Wang and Alessandro Villa for hosting me as a visiting student in the
United States and in France, respectively.
I thank Guy Benoît and Pierre-Yves Fortin of the Bureau de Liaison Entreprises Universite
(BLEU) of the Universite de Sherbrooke for their help with the patent filing.
I thank Christian Feldbauer and Gernot Kubin for their technical collaboration within
the European COST 277 project.
Many thanks to all my friends and colleagues who kindly attended my thesis defense.
Finally, thanks to my parents and my family for their continual support. I dedicate this
work to them.
TABLE OF CONTENTS
1 INTRODUCTION 1
1.1 Auditory scene analysis for real scenes . . . . . . . . . . . . . . . . . . . . 1
1.2 Approaches for auditory scene analysis . . . . . . . . . . . . . . . . . . . . 2
1.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Ideas to be investigated . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Specific goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.6 Outline of this document . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 COMPUTATIONAL AUDITORY SCENE ANALYSIS (CASA) 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Cocktail party effect and human audition . . . . . . . . . . . . . . . . . . . 9
2.3 History of CASA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Applications of CASA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Bases of auditory scene analysis . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Data-driven CASA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Top-down or schema-driven CASA . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Different implementations of CASA . . . . . . . . . . . . . . . . . . . . . . 25
2.9 ASA limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9.1 Sinusoidal speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.9.2 Limitations of pitch-based grouping . . . . . . . . . . . . . . . . . . 27
2.10 Comparison of CASA with other source separation techniques . . . . . . . 28
2.10.1 Blind Sound Source Separation . . . . . . . . . . . . . . . . . . . . 28
2.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3 NEUROCOGNITION 35
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Gestalt Psychology and Neurophysiology . . . . . . . . . . . . . . . . . . . 35
3.3 Conventional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 The binding problem . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.2 Are classical neural networks universal? . . . . . . . . . . . . . . . . 39
3.4 Solutions to the binding problem . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.1 Hierarchical coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.2 Attentional models . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.3 Assembly Coding and Temporal correlation . . . . . . . . . . . . . 45
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4 DYNAMICS OF BIO-INSPIRED NEURONS 53
4.0.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Different types of neuronal models . . . . . . . . . . . . . . . . . . . . . . . 54
4.1.1 Class I and Class II neural excitability . . . . . . . . . . . . . . . . 54
4.2 Mathematical description of neurons . . . . . . . . . . . . . . . . . . . . . 55
4.2.1 Four-dimensional neuronal models . . . . . . . . . . . . . . . . . . . 55
4.2.2 Two-dimensional neural models . . . . . . . . . . . . . . . . . . . . 56
4.2.3 One-dimensional neural models . . . . . . . . . . . . . . . . . . . . 59
4.2.4 Fractal dimension neural models . . . . . . . . . . . . . . . . . . . . 60
4.3 Canonical Neuronal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Different modes of synchronization . . . . . . . . . . . . . . . . . . . . . . 65
4.5 Selection of the model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.1 Pros and cons of relaxation oscillators . . . . . . . . . . . . . . . . . 66
4.5.2 Pros and cons of ’integrate-and-fire’ neurons . . . . . . . . . . . . . 67
4.5.3 Pros and cons of chaotic neurons . . . . . . . . . . . . . . . . . . . 68
4.6 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.6.1 Memoryless learning . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.6.2 Hebbian Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7 Implementational aspects of ’Temporal Correlation’ . . . . . . . . . . . . . 72
4.8 Architectures for ’temporal correlation’ . . . . . . . . . . . . . . . . . . . . 73
4.8.1 LEGION: Locally Excitatory Globally Inhibitory Oscillatory Network 74
4.8.2 Attentional Oscillatory Neural Network (AONN) The schematic of this architecture is shown in Figure 4.8.2 [1] . . . . . . . . . . . . . 75
4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5 SOURCE SEPARATION BY BIO-INSPIRED NEURAL NETWORKS 85
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Source separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3 Proposed system strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Description of the source separation system . . . . . . . . . . . . . . . . . 88
5.4.1 The choice of the cochlear filterbank . . . . . . . . . . . . . . . . . 88
5.4.2 Signal analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.3 Theoretical motivation behind the CAM/CSM generation . . . . . . 93
5.4.4 The Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.5 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.1 Database and comparison . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.2 Separation performance . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6 Separation examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6.1 Separation of speech from telephone trill . . . . . . . . . . . . . . . 103
5.6.2 Separation of speech from 1 kHz tone . . . . . . . . . . . . . . . . . 105
5.6.3 Double-vowel segregation case . . . . . . . . . . . . . . . . . . . . . 105
5.6.4 Sentence plus siren . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6.5 PESQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.6.6 Three-source case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.7 Conclusion and Further Work . . . . . . . . . . . . . . . . . . . . . . . . . 117
6 ODLM FOR PATTERN RECOGNITION 121
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3 The Dynamic Link Matcher . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4 The oscillatory dynamic link matcher . . . . . . . . . . . . . . . . . . . . 127
6.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.4.2 Mathematical Description of the Network . . . . . . . . . . . . . . . 127
6.5 Behavioral description of the network . . . . . . . . . . . . . . . . . . . . . 129
6.6 Geometrical Interpretation of the ODLM . . . . . . . . . . . . . . . . . . . 131
6.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.8 Rate Coding vs. Phase coding . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.8.1 Rate Coding (Average over Time) . . . . . . . . . . . . . . . . . . . 134
6.8.2 Phase coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.8.3 Dynamics of the Rate-coding DLM . . . . . . . . . . . . . . . . . . 135
6.8.4 Segmentation and Matching for Invariant Pattern Recognition . . . 136
6.8.5 One-object scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.9 Conclusion and Further Work . . . . . . . . . . . . . . . . . . . . . . . . . 137
7 CONCLUSION 151
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2 What has been presented . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.3 Future developments of the model . . . . . . . . . . . . . . . . . . . . . . . 152
7.4 The future of Computational Auditory Scene Analysis . . . . . . . . . . . . 154
BIBLIOGRAPHY 200
LIST OF FIGURES
1.1 Bregman’s metaphoric description of audition . . . . . . . . . . . . . . . . 1
2.1 Human performance in the presence of multiple voices and mask . . . . . . 13
2.2 Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Proximity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Old plus new . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Good continuation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Mutual Exclusivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.8 Data-driven CASA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.9 Description of the main ideas behind different CASA approaches . . . . . . 31
2.10 A top-down blackboard system . . . . . . . . . . . . . . . . . . . . . . . . 32
2.11 Wang and Brown’s oscillatory CASA system . . . . . . . . . . . . . . . . . 33
2.12 The cochleogram for simple tones and street noise . . . . . . . . . . . . . . 33
2.13 Spectrogram for natural speech . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 Rosenblatt Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 Catastrophe scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 The Illusory Conjunction experiment as described by Anna Treisman . . . 40
3.4 Binding example no. 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5 Binding example no. 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.6 Binding example no. 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 Binding example no. 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.8 The hierarchical scene analyzer of Riesenhuber and Poggio . . . . . . . . . 49
3.9 Hierarchical network for feature extraction with two types of attentional control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.10 Schematic diagram of the SCAN (Signal Channelling Attentional Network) 51
3.11 The hierarchical approach (along with attention) used by the neocognitron to recognize ‘0’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.12 Solution to the binding problem using the temporal correlation technique . 52
4.1 The spike rate dependency to the applied input current in the Wilson-Cowan neural model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.2 Schematic diagram for the Hodgkin-Huxley model . . . . . . . . . . . . . . 56
4.3 Different excitation modes seen in real biological neurons . . . . . . . . . . 78
4.4 Comparison of different neural models . . . . . . . . . . . . . . . . . . . . 79
4.5 A nullcline of the Wang-Terman equation . . . . . . . . . . . . . . . . . . . 80
4.6 SIMULINK model of the “integrate-and-fire” neuron. . . . . . . . . . . . 80
4.7 Temporal correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.8 The architecture of the LEGION . . . . . . . . . . . . . . . . . . . . . . . 82
4.9 The architecture of the AONN network . . . . . . . . . . . . . . . . . . . . 83
5.1 The proposed source separation system . . . . . . . . . . . . . . . . . . . . 89
5.2 3-D plot of the output of the proposed neural network . . . . . . . . . . . . 90
5.3 CAM for the female /di/ and male /da/ mixture at SNR = 0 dB and t = 166 ms when the channel number is equal to 24. The separation of the two sources can be done based on ray distances. . . . . . . . . . . . . . . . 92
5.4 Schematic representation of the signal processing steps required to compute the reassigned spectrum . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5 CSM (24-channel) of the mixture of /di/ and the siren in Equation 5.23 at t = 50 ms. Segregation is based on the selection of energy bursts. . . . . . 94
5.6 CAM (24-channel) for the /di/ /da/ mixture. Segregation is based on harmonic selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.7 CSM (24-channel) for the speech plus tone mixture. Segregation is based on energy bursts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.8 The change in the stiffness of the hair cells due to a change of the stimulus 107
5.9 Idealized schematic of a 2-D spectral map (Cochleotopic/AMtopic) for a two-speaker signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.10 Architecture of the Two-Layer Bio-inspired Neural Network . . . . . . . . 109
5.11 Mixture of the utterance “Why were you all weary?” with a trill telephone noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.12 Separation results for the trill telephone noise . . . . . . . . . . . . . . . . 110
5.13 The synthesized “Why were you all weary?” by the approach proposed by Wang . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.14 Mixture of the utterance “I willingly marry Marilyn” with 1 kHz tone. . . 111
5.15 Comparison between our approach and Wang’s approach for the ‘1 kHz’ tone. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.16 The spectrogram of the /di/ /da/ mixture. . . . . . . . . . . . . . . . . . . 112
5.17 The spectrogram of the extracted /di/. . . . . . . . . . . . . . . . . . . . . 113
5.18 The spectrogram of the extracted /da/. . . . . . . . . . . . . . . . . . . . . 113
5.19 Mixture of a siren and the sentence “I willingly marry Marilyn”. . . . . . . 114
5.20 Synthesis by an FIR implementation . . . . . . . . . . . . . . . . . . . . . 115
5.21 Synthesis by an IIR implementation . . . . . . . . . . . . . . . . . . . . . 116
5.22 Synthesis result for the siren plus sentence case, when the masking is applied before the masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.23 The synthesized “Why were you all weary?” proposed by Wang . . . . . . 117
6.1 Some examples of affine transforms . . . . . . . . . . . . . . . . . . . . . . 124
6.2 An industrial application of the ODLM . . . . . . . . . . . . . . . . . . . . 139
6.3 The architecture of the oscillatory dynamic link matcher . . . . . . . . . . 140
6.4 An affine transform T for a four-corner object. . . . . . . . . . . . . . . . 141
6.5 A snapshot of the activity of the first and second layers of the neural map. Colors represent relative phase of oscillations. . . . . . . . . . . . . . . . . 142
6.6 Neural activity pattern after segmentation . . . . . . . . . . . . . . . . . . 143
6.7 Neural activity pattern after matching . . . . . . . . . . . . . . . . . . . . 144
6.8 The evolution of the thresholded activity of the network through time in the segmentation phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.9 The evolution of the thresholded activity of the network through time in the dynamic matching phase. . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.10 The synchronization index of a one-object scene when the segmentation step is bypassed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.11 The synchronization pattern of a one-object scene when the segmentation phase precedes the matching phase . . . . . . . . . . . . . . . . . . . . . . 148
6.12 A scene segmentation done during the segmentation phase of the algorithm 149
6.13 Architecture of an integrated top-down and bottom-up processor (under investigation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
A-1 Bifurcation in a dynamical system . . . . . . . . . . . . . . . . . . . . . . . 158
A-2 Saddle-node bifurcation in Wilson-Cowan oscillators . . . . . . . . . . . . . 159
A-3 The transformation h . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
B-4 Architecture of the simplified chaotic neural network based sound source separator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
B-5 Oscillatory behavior of the chaotic network for the two speaker segregation problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
C-6 Comparison of multiplicative and additive synapses . . . . . . . . . . . . . 172
E-7 Piecewise linear model of the state space of the Wang-Terman oscillator . . 176
LIST OF TABLES
2.1 Analogies between Vision and Audition . . . . . . . . . . . . . . . . . . . . 9
2.2 Gestalt principles and their applications in Auditory Scene Analysis . . . . 22
2.3 Grouping cues for ASA (adapted from [2]) . . . . . . . . . . . . . . . . . . 23
5.1 The numerical values of the different parameters used in the first layer of the network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 The numerical values of the different parameters used in the second layer of the network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.3 The log spectral distortion (LSD) for three different methods . . . . . . . . 104
5.4 The PESQ of three different methods: P-R (our proposed approach), W-B ([3]), and H-W ([4]) (see caption of Table 5.3). Higher values mean better performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5 PESQ for two different methods: P-R (our proposed approach) and J-L ([5]). The mixture comprises a female voice with musical rock background. 118
D-1 Parameters for the Hodgkin-Huxley Equations. . . . . . . . . . . . . . . . . 173
D-2 Parameters used in Equation 4.2.1 . . . . . . . . . . . . . . . . . . . . . . . 173
CHAPTER 1
INTRODUCTION
1.1 Auditory scene analysis for real scenes
We have all been in situations where many sound sources are present in our environment,
for instance at a “cocktail party” or while walking down a street. Remarkably, humans
are able to separate the different sound sources and decode the underlying message,
whatever the type and structure of the original sources. Bregman described this
phenomenon elegantly with the following metaphor: “Imagine two narrow channels dug up
from the edge of a
lake, with handkerchiefs stretched across each one. Looking only at the motion of the
handkerchiefs, you are to answer questions such as: How many boats are there on the
lake and where are they?” [6].
Figure 1.1: Bregman’s metaphoric description of audition: based on the movements in
the two narrow channels, the person should say how many boats are in the lake (adapted
from Ellis’s presentation at NSF Speech Separation Workshop, Montreal, 2003).
In psychoacoustics, this human ability has been studied under the names of auditory
perceptual organization and auditory scene analysis [6]. These studies construct
experimental stimuli consisting of a few simple sounds, such as sine tones or noise bursts,
and then record subjects’ interpretation of the combination. Bregman’s work has
revealed much about the mechanisms by which structure is derived from sound, but
it typically does not address how these results scale to more complex sounds, that is,
sounds arriving from a real environment.
As discussed in detail in Chapter 2, designing a viable sound source separator that
works for any mixture and any type of sound requires adapting or modifying
Bregman’s simple rules for real-world scenarios. This is what we attempt in this thesis.
It should be kept in mind that the state of the art is far from handling this problem in
the general case. Experts predict that it will take many years before a system is
capable of outperforming humans.
1.2 Approaches for auditory scene analysis
A detailed description of the state of the art in Auditory Scene Analysis is given in
Chapter 2. For the time being, let us say that there are three main approaches to this
problem. The first approach relies on extracting the statistics of the underlying sounds
in the mixture and on using statistical concepts. The second approach uses expert
systems that are based on heuristics. The third approach is based on neural networks.
Personally, I think that since humans are good at auditory scene analysis, it is a
good idea to mimic them. One can argue that this is not always the best way to solve
engineering problems. First of all, not all human-made systems are inspired by their
natural counterparts. For instance, an airplane does not fly as a bird does. Furthermore,
not all the dynamics of the nervous system are known to us. So how can we reproduce
something we do not know much about? I agree with these arguments, but as a
counter-argument I would say that auditory scene analysis is a very new scientific field.
If we do not try to mimic human behavior, what else can we do? Remember that
the first attempts to build flying machines closely resembled birds (like the
prototype made in 1870 by the French engineer Alphonse Penaud, among others). Hence,
let us start by mimicking the nervous system and then try to adapt it to our technology
and computers. If there are missing parts in our understanding of the nervous system,
let us replace them with more “engineering-inspired” models.
As presented in Chapters 5 and 6, I have adopted the approach that tries to mimic (at
least partially) the nervous system by using bio-inspired neural networks and auditory
representations, which I think can approximate some parts of the dynamics of the brain.
Once again, the reader should note that our understanding of the human brain is very
basic and what will be described in this thesis (or in any other similar work) is only a
“toy model” of what really happens in the brain.
1.3 Applications
One question the reader may ask is ‘What is the point of conducting research into this
problem?’ The broadest motivation is intellectual curiosity, born of an increasing sense
of awe as the full subtlety and sophistication of the auditory system is revealed. This
answer may convince pure scientists, but it is surely not convincing enough for
engineers. An engineering project is viable if it has industrial applications. As a matter
of fact, a good sound separator can open the door to many interesting applications and
tasks that are impossible to accomplish today. These applications are described in more
detail in Chapter 2. One of the most interesting applications of sound source separation
is in the hearing aid industry. There are 500 million hearing-impaired people in the
world and 70 million North Americans with hearing disabilities (source:
www.hear-it.org). Current hearing aids amplify all incoming sounds, which renders them
useless in crowded places. A good sound separator with low computational complexity
would surely help hearing-impaired people have a better life. Other applications of this
technology, as detailed in Chapter 2, include multimedia sound file indexing, robot
navigation, and speech and audio enhancement.
1.4 Ideas to be investigated
Beyond the general idea that this thesis is a useful collection of techniques for building
auditory models, there are in fact a couple of fairly strong and perhaps slightly unusual
positions behind this work.
The first idea is that some simple auditory representations, which we call
Cochleotopic/AMtopic and Cochleotopic/Spectrotopic Maps (see Chapter 5) and which are
based on very simple signal processing techniques, can tell us a lot about the structure
and organization of sound in mixtures.
The second main idea is that, based on temporal correlation (see Chapters 3, 5, and 6),
we can group regions of sound on the frequency-time maps I have proposed. Regions are
grouped when they belong to the same source. The grouping is performed by the
bio-inspired neural networks proposed in this thesis.
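To make the idea of grouping by temporal correlation more concrete, the following is a minimal sketch and not the network proposed in this thesis. It assumes that each region (or channel) is summarized by a binary spike train, and it groups regions whose spikes coincide. The function names `spike_correlation` and `group_by_synchrony`, the coincidence score, and the threshold of 0.8 are all illustrative assumptions.

```python
from itertools import combinations


def spike_correlation(a, b):
    """Share of coincident spikes between two binary spike trains (assumed score)."""
    coincident = sum(1 for x, y in zip(a, b) if x and y)
    total = max(sum(a), sum(b))
    return coincident / total if total else 0.0


def group_by_synchrony(trains, threshold=0.8):
    """Group channel indices whose spike trains are synchronized.

    Two channels are merged (union-find) whenever their coincidence score
    reaches `threshold`; each resulting group stands for one putative source.
    """
    parent = list(range(len(trains)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(trains)), 2):
        if spike_correlation(trains[i], trains[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(trains)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical data: channels 0 and 2 fire in one phase, channels 1 and 3 in another.
trains = [
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
]
groups = group_by_synchrony(trains)
```

In this toy case the channels that fire in phase end up in the same group, which mirrors the principle that synchronized regions are attributed to the same source. In the thesis, the synchronization itself emerges from the dynamics of the oscillatory neurons rather than from an explicit pairwise comparison.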
The third contention is the analogy I have tried to draw between auditory and visual
scenes. As pointed out in Chapter 2, Bregman’s pioneering work began with the
adaptation of Gestalt principles of visual scene analysis to auditory scene analysis. In
this work, I did roughly the opposite: I start by designing a system capable of sound
source separation and then try to ‘adapt’ the system to vision. These ideas are explained
in Chapter 6.
1.5 Specific goals
A project in computational auditory scene analysis can go in many different directions.
In this work, the particular goals that were pursued, and to a greater or lesser extent
achieved, are as follows:
• Computational auditory scene analysis. The broadest goal is to produce a
computer system capable of processing real-world sound scenes of moderate
complexity, independent of the structure of the sound sources or the way they have
been mixed.
• Adequate sound representation and reconstruction. Adequate synthesis and
resynthesis tools have been proposed to generate perceptually acceptable
reproductions of the represented sounds.
• Assessment of scene-analysis systems. Adequate assessment metrics have been
proposed and used to compare this work to other works.
• Computational visual scene analysis. Based on findings in audition, the
architecture has been adapted to perform visual scene analysis on ‘toy objects’.
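As an illustration of the kind of assessment metric mentioned above, here is a minimal sketch of two of the criteria named in the abstract, SNR and log-spectral distortion (LSD). The formulas are commonly used textbook definitions and are an assumption here; the exact definitions used in this thesis are those given in Chapter 5. The function names, the `eps` guard, and the representation of spectra as lists of per-frame power values are hypothetical.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a reference and an estimated signal."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, estimate))
    if noise == 0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / noise)


def log_spectral_distortion(ref_frames, est_frames, eps=1e-12):
    """Mean over frames of the RMS difference between log power spectra (in dB).

    `ref_frames` and `est_frames` are lists of frames, each frame being a list
    of power-spectrum values; `eps` avoids taking the logarithm of zero.
    """
    per_frame = []
    for ref, est in zip(ref_frames, est_frames):
        squared = [
            (10.0 * math.log10((p + eps) / (q + eps))) ** 2
            for p, q in zip(ref, est)
        ]
        per_frame.append(math.sqrt(sum(squared) / len(squared)))
    return sum(per_frame) / len(per_frame)


# Hypothetical toy values, only to show how the functions are called.
snr = snr_db([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 0.0])
lsd_same = log_spectral_distortion([[1.0, 2.0]], [[1.0, 2.0]])
lsd_gain = log_spectral_distortion([[1.0, 1.0]], [[10.0, 10.0]])
```

With these definitions, a higher SNR and a lower LSD both indicate a separated signal that is closer to the reference. A uniform factor of 10 in power, as in the last call, corresponds to an LSD of about 10 dB.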
1.6 Outline of this document
This dissertation has seven chapters. After this introduction, Chapter 2 presents an
overview of the field of computational auditory scene analysis. Chapter 3 details the
bases of neurocognition and, more specifically, temporal correlation. Chapter 4 deals
with the mathematical modelling of ‘bio-inspired’ neurons. Chapter 5 explains the
architecture of the system used to perform sound source separation. Chapter 6 describes
the ‘Oscillatory Dynamic Link Matching’ proposed in this thesis to perform visual pattern
matching. Finally, the conclusion in Chapter 7 summarizes the project and considers how
well it has achieved its goals.
CHAPTER 2
COMPUTATIONAL AUDITORY SCENE
ANALYSIS (CASA)
2.1 Introduction
In everyday life we are confronted with situations in which a mixture of sound sources is
present in the environment, and we are able to extract one or more sources from the others.
The acoustic mixture reaching the ears is processed to enable constituent sounds to be
heard and recognized as distinct entities. While the auditory system may not always
succeed in this goal, the range of situations in which recognition is possible in the presence
of competing sources (Figure 2.1) highlights the flexibility and robustness of
human speech perception. The background against which a conversation is carried out is
made up of acoustic intrusions which overlap in both frequency and time with the target
speech. Target and background may contain similar kinds of envelope modulations, and
can arrive from similar locations in space. The background may consist of utterances
whose fundamental frequency and formant contours occupy similar regions to those of the
target. Sometimes, the background will be characterized by high-intensity onsets which
completely mask the target conversation. There is strong evidence that even animals
are capable of sound separation. For instance, penguins use signal emissions to find
their mates and offspring amid the crowds of penguins huddled together for warmth in
the dark Antarctic winter [7] (see also [8] for auditory scene analysis in mustached bats).
On the other hand, computer systems are not robust in the presence of the
“cocktail party” effect [9] (when a mixture of sounds is present in the environment, see
section 2.2), especially when the computer system at hand has only one microphone (one
sensor). Note that a human listener does not always need two ears to perform sound
separation (although results from two-ear separation may be slightly better). For instance,
you can separate music from speech when you listen to a radio broadcast (there is no
spatial cue in this case) or when you block one of your ears. We humans can
use different cues to do sound separation. We can use the pitch (or the harmonic structure
of the sound), the onset-offset times (the times at which a sound begins or ends), and the
spatial location of the sound to segregate sources. In addition, we can predict what comes
next in a sentence and use this knowledge to enhance our recognition performance. For example,
we need few cues to recognize our name in a very crowded environment, but it will be
impossible for us to recognize words that are uttered randomly (when we cannot predict
them a priori). We can also use visual cues to do segregation (like lip-reading [10]).
In order to make computers as robust as humans in the presence of background noise, two
different approaches can be used:
• Mathematical and Statistical Approaches: This approach tries to find a solution
to the “cocktail party” effect in the framework of standard signal processing
and information theory techniques.
• Computational Auditory Scene Analysis (CASA): In this framework, sound
is processed by drawing an analogy between vision and audition (Table 2.1) using
the Gestalt principles of common fate, similarity, continuity, etc. [11] (see section
2.5). The separation cues are based on psychoacoustical and physiological evidence.
Although the mathematical approach and the CASA techniques should in principle be
equivalent, the lack of information about the surrounding
environment and the statistical behavior of the sound sources means that in
many cases the rule-based approach (CASA) is much more powerful than
mathematical approaches [12].
In the remaining parts of this chapter, a detailed explanation of the cocktail party
effect and the experiments undertaken by Cherry is given first, along with some ASA history.
The basics and psychoacoustical principles of ASA are then briefly discussed. The key concepts
of implementing ASA in computer systems (CASA) are discussed and the limitations of
top-down and bottom-up CASA are analyzed. In Chapter 3, the neurophysiological and
cognitive aspects of those kinds of neural networks (spiking neural networks) that can be
used to solve CASA will be laid down. Chapter 4 deals with the mathematical formulation
of spiking neural networks. Chapter 5 discusses our results and findings about sound
source separation (CASA) with our proposed neural network. Chapter 6 proposes another
neural architecture suitable for visual and auditory pattern matching and recognition.
TABLE 2.1: Analogies between Vision and Audition
Vision: Marr [13] | Audition: Bregman [6]
Explicit naming: Compute properties of entities rather than parts | Primitive vs. schema-driven grouping
Least commitment: Never do anything that may later have to be undone | Fusion as the default state of perceptual organization
Graceful degradation: System should not be very sensitive to poor input quality | Exclusive allocation of parts to entities
2.2 Cocktail party effect and human audition
In 1953, Cherry [9], then an engineer at MIT, used the term “cocktail
party” for the first time. The name comes from the fact that humans are able to separate sound sources
in a “cocktail party” when other people speak simultaneously and there is music in the
background, etc. He conducted six different experiments, described as follows (taken from
http://www.smithsrisca.demon.co.uk/PSYcherry1953.html):
• The Basic “Mixed Message” Paradigm: In the first two series of experiments,
Cherry investigated how we recognize what one person is saying when others are
speaking simultaneously. Cherry described this situation as the ‘cocktail party
problem’. Subjects were presented with two different spoken messages, recorded onto
a single audiotape (i.e. ‘mixed’, in a tape editing sense) by the same speaker, and
played back via headphones. Both messages were thus simultaneously and equally
available to both ears, approximating real-life competitive conversation.
Subjects were then instructed to repeat one of the messages word by word or phrase
by phrase. Cherry’s observations were:
(a) Subjects reproduced at phrase level, rather than word level.
(b) There were extremely few transpositions of material from the to-be-rejected
message (that is, few subjects put words from the competing sentence into the target
utterance). Subjects generally reported great difficulty with the task, but the task
would have been eased considerably if they had been allowed to make written notes.
• Predictability: In this series of experiments, Cherry arranged for the mixed
material to be full of cliches, that is to say, “highly probable phrases” such as “the
time has come to stop beating around the bush”. His observation was that output
tended to consist of whole cliches, and that recognition of just the first one or two
words of a stock phrase would typically prompt the entire phrase.
• The Basic “Unmixed Message” Paradigm: In the remaining sets of experiments,
subjects were presented with two different spoken messages, recorded onto
separate audiotapes (i.e., “unmixed” in a tape editing sense) by the same speaker,
and played back by headphones, one message to each earpiece. Unlike the mixed
message paradigm, each ear now only heard one message. Again, subjects were
instructed to repeat one of the messages (always the right ear message) as accurately
as possible. Cherry’s general observations were:
(a) Subjects could switch between messages at will.
(b) They could repeat the selected message easily and accurately, but slightly
delayed.
(c) Their speaking voice became monotonous with ‘little emotional content or
stressing of the words’.
(d) They remained unaware of this.
(e) They ‘may have very little idea’ what the message was all about.
(f) They took in very little about the content of the rejected message.
Indeed, if the language of the unattended message was changed from English to
German a few seconds into the trial, once shadowing of the target message had been
successfully established, that change was not usually detected. This observation
prompted further investigation of what sort of information, if any, was available
from the rejected message.
• Penetration of the rejected message. In the third series of experiments, Cherry
looked at what information, if any, remained available to the listener from an otherwise
unattended message. Cherry arranged for the unattended left ear message
to change from its normal form (male spoken English) once the trial was under way. His
observations were:
(a) A change from forward speech to backward speech (same sound profile, but
zero lexical or semantic content) was noticed as ‘something queer about it’ by some
subjects but not noticed at all by others.
(b) A change from male to female voice was ‘nearly always’ identified.
(c) A change to a 400 Hz tone was always noticed.
(d) Subjects could not say with certainty what language was being used.
• Same message, time delayed. In this series of experiments, Cherry wished
to investigate the mechanisms by which the brain decides whether the messages
arriving at the ears are from a single source. The point is that when two inputs are
correlated, they need to be merged internally, despite naturally occurring ear-to-ear
differences in intensity and arrival time, whilst when they are from different sources
one of them needs to be rejected internally. He therefore presented an identical
message to each ear, but with the left (to be rejected) delayed relative to the right
(to be shadowed). This was achieved by running a single length of pre-recorded
audiotape through two physically separated tape players. The second tape player
was then gradually moved closer to the first, thus reducing the playback delay.
Cherry’s observations were that ‘nearly all’ subjects eventually recognized words or
phrases from the rejected message as matching those in the attended ear. Cherry
remarks that this is actually quite surprising, given that when different messages are
used nothing is perceived from the rejected ear. The delay at which such recognition
took place varied considerably between subjects, but was typically 2-6 seconds.
• Same message, alternating ear. This series of experiments was prompted by
the observation that it took a finite amount of time to switch attention from one
ear to the other. Cherry recorded long samples of speech and switched it between
his subjects’ ears either
(a) randomly
(b) periodically. When this switching was slow (say once a second), subjects
could shadow with 100% accuracy. When it was fast (say 20-50 times a second),
most subjects could shadow ‘the majority’ of the words (by shadowing, Cherry meant
that subjects were able to fuse the messages coming from the different ears), reporting
that ‘they listened as though to both ears simultaneously’ (in other words, subjects had
the perception that the message had been applied simultaneously to both ears). However,
as the switching rate decreased to around six or seven times a second, so too did
accuracy. To investigate
this critical speed in more detail, Cherry introduced short periods of silence into
the message. When played to one ear this would mean hearing about 150 ms
of message, followed by 10 ms of silence, followed by the next message block,
followed by the next silence, and so on (equivalent to six or seven cycles per second).
Accuracy in this condition was 95-100%. When each message block was switched to
alternate ears, however, accuracy reduced to less than 20%. Cherry concluded that
this particular switching rate coincided with the very short time interval required
to transfer attention from one ear to the other, and that by the time attention had
been switched it needed to be switched back again.
Figure 2.1 shows that the human auditory pathway can handle as many as eight simultaneous
sound sources. The figure shows the identification accuracy vs. the masker intensity
in dB SPL (Sound Pressure Level).
Figure 2.1: Human performance in the presence of multiple voices and mask. Up to eight
simultaneous voices can be distinguished by a human listener with high accuracy at low
masker intensities [14].
2.3 History of CASA
Computational Auditory Scene Analysis (CASA) is the name for a field of research that
seeks to build computer models of the process of auditory organization, by which biological
listeners are able to understand dense sound mixtures as the superimposed result of many
independent sound-producing entities in the environment. CASA is in its early days, with
quite a number of different efforts, but no obvious winning strategies, and a large range
of perspectives on the problem. In what follows, the reader will find some of the most
important contributions to the field.
• 1948: Jeffress model of interaural correlation for sound localization [15].
• 1951: Place mechanisms of auditory frequency analysis by Licklider [16].
• 1953: First usage of the term ‘Cocktail party’ by Cherry [9].
• 1976: Sound source separation using classical signal processing techniques by Par-
sons [17].
• 1982-83: Lyon’s auditory and binaural model [18].
• 1983: Scheffers’ harmonic-based double vowel separation [19].
• 1985: Voiced Speech separation by Weintraub [20].
• 1986: Temporal correlation based solution to the ’cocktail party’ problem by Von
der Malsburg and Schneider [21].
• 1987: Based on Amplitude Modulation, Berthommier proposed an F0-dependent
method of sound separation [22].
• 1988: Voice separation algorithms by Stubbs and Summerfield [23].
• 1990: Publication of the book: “Auditory Scene Analysis: The Perceptual Organi-
zation of Sound” by Bregman [6].
• 1991: Ph.D.s based on Bregman’s findings: Speech (Cooke [24]), Music (Mellinger
[25]).
• 1992: Evidence integration by Kashino and Tanaka [26].
• 1992: Auditory image for ASA: first usage of the term “CASA” (Brown’s Ph.D.
[27]).
• 1992: Time-domain cancellation of harmonics proposed by de Cheveigne [28].
• 1994: First database for CASA: ShATR (University of Sheffield).
• 1995: Patterson’s auditory model [29].
• 1995: First CASA workshop (Montreal).
• 1996: Prediction-driven CASA (Ellis’s Ph.D. [30]).
• 1997: Second CASA Workshop (Nagoya).
• 1998: Publication of the book: “Computational Auditory Scene Analysis” by Okuno
and Rosenthal [31].
• 1999: Speech Communication’s special issue on Auditory Scene Analysis.
• 1999: Source separation by temporal correlation (Wang and Brown) [3].
• 2001: Probabilistic CASA of speech with missing and unreliable acoustic data by
Cooke et al. [32].
• 2003: Factorial HMM sound source separation by Roweis and Gomez et al. [33, 34].
• 2003: CASA based on pitch tracking by Hu et al. and Wu et al. [35, 36].
2.4 Applications of CASA
In what follows I enumerate some of the major applications of CASA in real-life problems.
• Speech processing. Statistical approaches to speech processing, like the
HMM (Hidden Markov Model), work only in quiet environments. If there is “cocktail
party” background noise or many speakers, then the aforementioned methods
cannot be applied with success. Hence, a preprocessing technique like CASA should
be used before the recognition phase.
• Hearing aids. Current hearing aids normally amplify all sounds without
filtering them. Therefore, hearing-impaired persons are not able to understand
a conversation in the presence of the ‘cocktail party’ effect. An intelligent filter
based on CASA can help to alleviate this problem.
• Sound file indexing. One of the most challenging tasks is the indexing of sound
files on the Internet. A sound file indexing system should be able to separate different
sound sources and label each source with the adequate tag (e.g., music, speech, the
speaker ID, etc.). For more details see the MPEG-7 standard.
• Music industry. The recording of songs is a very expensive process. Suppose that
an unwanted door shutting noise corrupts the whole recording. Now imagine that
instead of going through the whole process of recording another time, you can use
a CASA technique to delete the unwanted noise. This can be a very interesting
application of CASA.
• Robot navigation. CASA can be used by a robot to find its way through a
crowded environment based on audio cues (in addition to visual anchors).
• Audio compression. In a futuristic view, one can imagine an audio codec that separates
sound sources in a given file, extracts features of each source (pitch, timbre,
duration, onset/offset times, etc.), and sends a text
file to the receiver that contains all the mandatory information, so that the receiver
can synthesize the sound file. In
image processing terminology, this is known as “Semantic Image Compression” (or
object-based compression) [37].
2.5 Bases of auditory scene analysis
Preliminary experiments led by Bregman [6] have shown a great degree of organization in
audition. Bregman draws a distinction between an acoustic source (a single physical
system giving rise to a particular pattern of sound waves) and an auditory stream, which
denotes the abstract or conceptual effect it has in the mind of the listener. Listeners
have to solve an auditory scene analysis (ASA) problem in order to extract one or more
relevant auditory streams from the mixture of sources which contribute to their acoustic
environment.
Sound sources may differ in all kinds of properties such as location, instantaneous
fundamental frequency, or the patterns of energy envelope modulation in different frequency
bands. If it is possible to extract these potential cues with sufficient reliability, the
auditory system can group those parts of the mixture that have similar properties. This
affords listeners a basis for organizing into a coherent whole the sound fragments which
have a common origin. This type of processing is often described as bottom-up or primitive.
In addition to primitive grouping processes, listeners can exploit prior familiarity with
the patterns of spoken language or other sources. For speech, these regularities manifest
themselves at a number of levels, from the sub-syllabic to the sentential. Such top-down
processes have been termed schema-driven mechanisms by Bregman [6].
Early auditory signal processing involves at least two forms of decomposition. First, the
signal is subject to spectral decomposition into separate frequency bands by the cochlea.
Second, it appears that different properties are extracted in distinct auditory maps [38, 39],
or distributions of specific signal features over an array of neural elements.
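The spectral decomposition step can be illustrated with a small gammatone-style filterbank. This is only a minimal sketch under stated assumptions, not the improved analysis/synthesis filterbank used later in this thesis: the function names (erb, gammatone_ir, filterbank), the fourth-order filter, the 25 ms impulse response length and the Glasberg-Moore bandwidth approximation with the 1.019 scaling factor are choices made for the example.

```python
import math

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore approximation, assumed here)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz, fs_hz, duration_s=0.025, order=4):
    """Sampled gammatone impulse response t^(n-1) exp(-2 pi b t) cos(2 pi fc t), peak-normalized."""
    b = 1.019 * erb(fc_hz)  # bandwidth parameter (assumed scaling)
    n_samples = int(duration_s * fs_hz)
    ir = []
    for k in range(n_samples):
        t = k / fs_hz
        ir.append(t ** (order - 1)
                  * math.exp(-2.0 * math.pi * b * t)
                  * math.cos(2.0 * math.pi * fc_hz * t))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]

def filterbank(signal, fs_hz, centre_freqs_hz):
    """Split `signal` into one band per centre frequency by direct (causal) convolution."""
    bands = []
    for fc in centre_freqs_hz:
        ir = gammatone_ir(fc, fs_hz)
        out = []
        for n in range(len(signal)):
            acc = 0.0
            for k in range(min(n + 1, len(ir))):
                acc += ir[k] * signal[n - k]
            out.append(acc)
        bands.append(out)
    return bands
```

Feeding a pure tone into such a bank concentrates the output energy in the channel whose centre frequency is closest to the tone, which is the property the auditory maps discussed in this chapter build on.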
Bregman defined the processes of “auditory stream segregation” and “auditory stream
integration”. The process whereby sound elements are separated into different auditory
objects is known as “auditory stream segregation”, and, conversely, the process whereby
different sound elements are assigned to a single object is known as “auditory stream
integration”. Auditory streaming is important in, for instance, assigning consecutive
speech elements to the same speaker, or following a melodic line in a background of
other musical sounds. In baroque music, stream segregation is often used to make one
instrument play two melodic lines. If an instrument plays a rapid sequence of alternating
low and high tones, the sequence will break into two melodic lines (one consisting of the
low tones and the other consisting of the high tones) if the pitch difference between the
low and high tones is large enough. In the aforementioned example, the speed of the
sequence and the frequencies are features that the auditory system uses to group sounds. These
features are called cues by Bregman. Some of the most important cues used in ASA are
shown in Table 2.3.
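The baroque example above can be mimicked with a toy grouping rule based on frequency proximity alone. The sketch below is an illustration and not a perceptual model: the function names and the 5-semitone threshold (max_jump_semitones) are hypothetical choices, and the influence of presentation rate mentioned in the text is deliberately ignored.

```python
import math

def semitones(f1_hz, f2_hz):
    """Absolute pitch distance between two frequencies, in semitones."""
    return abs(12.0 * math.log2(f2_hz / f1_hz))

def group_into_streams(tone_freqs_hz, max_jump_semitones=5.0):
    """Assign each tone to the stream whose last tone is closest in pitch.

    A new stream is opened when no existing stream ends within
    `max_jump_semitones` of the incoming tone (hypothetical threshold).
    """
    streams = []
    for f in tone_freqs_hz:
        best = None  # (distance, stream) of the closest admissible stream
        for s in streams:
            d = semitones(s[-1], f)
            if d <= max_jump_semitones and (best is None or d < best[0]):
                best = (d, s)
        if best is None:
            streams.append([f])
        else:
            best[1].append(f)
    return streams
```

With an alternating sequence such as 400, 1600, 420, 1650 Hz the rule opens two streams (low tones and high tones), whereas a sequence alternating between 400 and 450 Hz stays in a single stream.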
As mentioned earlier, Bregman’s theory is based on Gestalt psychology. Some of the basic
rules based on the Gestalt theory for segregation and integration are (among others) stated
below.
• Simplicity. Items will be organized into simple figures according to symmetry,
regularity, and smoothness.
• Similarity. Objects that are more similar to one another tend to be grouped
together. The similarity can be in terms of any psychological dimension: shape,
size, color, or luminance (or motion) for visual scene analysis (Figure 2.2). In
auditory scene analysis, the psychological dimensions (cues) can be any of the cues
defined in Table 2.3.
Figure 2.2: Similarity: objects that are similar tend to be grouped together. The similarity
criterion is color in this figure.
• Proximity. This rule states roughly that the closer the visual elements in a set are
to one another, the stronger we tend to group them perceptually. The closeness can
be defined in terms of space or in terms of time (Figure 2.3).
Figure 2.3: Proximity: objects that are closer to one another tend to be grouped together.
You see three different objects in this figure.
• Old-plus-new. This heuristic was not in the initial list of Gestalt principles but has
been added by Bregman. It states that a “new” organization appears in the residual
left after subtraction of “old” components, based on the assumption of continuity
(Figure 2.4).
[Figure 2.4 plot: horizontal axis time/s (0.0 to 1.2), vertical axis freq/kHz (0 to 2).]
Figure 2.4: Old plus new: a sequence of wide-band and narrow-band signals and its
perception in the human auditory pathway according to the old-plus-new heuristic: the
sound is perceived as the old part (0-1 kHz) plus the new part (1-2 kHz).
• Good continuation. Good continuation says that elements forming continuous
lines or curves are grouped (Figure 2.5).
• Closure. Objects that form closed units tend to be grouped together (Figure 2.6).
Figure 2.5: Good continuation: both figures contain one “T” oriented differently. Notice
how much easier it is to find the T when it is not contained in the same “line” as all other
elements. When the T is embedded in a line of elements, all of the elements in that line
are grouped together, forming a large unit, and it is harder to
see the “misoriented” T. In the other case, the T stands out on its own, which makes it
easier to see.
• Common fate. Common fate states that those attributes (aspects) of the perceptual
field that move or function in a similar manner will be perceived as a unit.
• Mutual exclusivity. The affirmative and the counter plan cannot be associated
with the same group at the same time (Figure 2.7). For an example in speech see [40].
The different cues and heuristics on which a computer algorithm can rely to segregate
sound sources have been enumerated above. In the next three sections we will describe
how we can integrate these techniques in a computer algorithm by introducing data-driven
and schema-based CASA. Table 2.2 shows how Gestalt psychology is used in audition for
streaming and segregation.
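The old-plus-new heuristic described above lends itself to a very small numerical illustration. The sketch below assumes that the “old” sound and the mixture are each given as a list of band energies on the same frequency grid; the function name old_plus_new and the flooring of the residual at zero are assumptions made for the example, not part of Bregman’s formulation.

```python
def old_plus_new(old_spectrum, mixture_spectrum):
    """Return the 'new' component: what remains of the mixture once the
    continuing 'old' component has been subtracted (floored at zero)."""
    return [max(m - o, 0.0) for o, m in zip(old_spectrum, mixture_spectrum)]

# Band energies for four bands covering 0-2 kHz (hypothetical values):
# the old narrow-band sound occupies 0-1 kHz, the later wide-band sound 0-2 kHz.
old = [1.0, 1.0, 0.0, 0.0]
mixture = [1.0, 1.0, 1.0, 1.0]
new = old_plus_new(old, mixture)  # energy remains only in the 1-2 kHz bands
```

This mirrors the situation of Figure 2.4, where the wide-band segment is heard as the continuing 0-1 kHz sound plus a new 1-2 kHz sound.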
2.6 Data-driven CASA
Figure 2.8 shows a unidirectional system, in which the information propagates only from
inputs to outputs, with no feedback. Data-driven CASA includes techniques that use only
bottom-up processing (there is no feedback from higher levels to lower levels), in contrast
with prediction-driven CASA, which uses top-down processing. In Bregman’s terminology,
bottom-up processing corresponds to primitive processing, and top-down processing to
schema-based processing.
Figure 2.6: Closure: here you see two diamonds (each a closed unit), although when the
figure has been drawn, an “M” on top and a “W” on the bottom have been drawn.
Figure 2.7: Mutual Exclusivity: a) The contour belongs to the object F but not to the
ellipsoidal form. b) The contour belongs to the ellipsoidal form but not to the object G.
c) We can see either the face or the vase but not both of them at the same time because
of the mutual exclusivity rule.
The auditory cues proposed by Bregman for simple tones are not directly applicable to
complex sounds. Therefore, one should develop more sophisticated cues based on different
auditory maps. For example, Ellis [30] uses sinusoidal tracks created by the interpolation
of the spectral peaks of the output of a cochlear filterbank. Mellinger’s model [25] uses
partials (see Figure 2.9 for details on different approaches). A partial is formed if
activity on the onset maps (the beginning of an energy burst) coincides with a local energy
minimum of the spectral maps. Using these assumptions, Mellinger proposed a
CASA system in order to separate musical instruments. Cooke [24] has introduced the
harmony strands, which are the counterpart of Mellinger’s cues in speech. The integration
and segregation of streams are done using Gestalt and Bregman’s heuristics. Berthommier
uses AM maps [22] (see also [38, 42]). Gaillard [44] uses a more conventional approach,
using the first zero crossing for the detection of pitch and harmonic structures in the
frequency-time map. Brown’s algorithm [27] is based on the mutual exclusivity Gestalt
principle (Figure 2.7).
TABLE 2.2: Gestalt principles and their applications in Auditory Scene Analysis
Gestalt Principle | Stream Effect or Example
Proximity | Frequency, time or space proximity
Similarity | Harmonic relatedness (contiguity)
Connectedness | Pitch glides pass through noise
Good continuation | Gradual increase in loudness of approaching train
Common fate | Musical counterpoint, onset/offset-based grouping
Symmetry | Rising pitches tend to fall again
Closure | Masking, Co-modulation Masking Release (CMR)
[Figure 2.8 block diagram; labels: sound, cochlea model, peripheral channels, cue detectors (onset/offset, frequency transition, periodic modulation), object formation, common-period objects, grouping algorithm, mask, resynthesis. Column titles: Cue detectors, Representation, Algorithm, Output.]
Figure 2.8: Data-driven CASA (adapted from [30]). Note that there is no feedback in
the system. See also chapter 5 and [41, 42, 43] for an implementation of this block
diagram. In chapter 5, the cue detectors are replaced by the cochlear maps, and the object
formation and the grouping are done via our proposed neural network.
TABLE 2.3: Grouping cues for ASA (adapted from [2])
Source property | Potential grouping cue | Illustrations | Notes
Starts and ends of events | Synchrony of transients; common onset/offset across frequency regions | Effect of onset asynchrony on syllable identification and pitch perception | Offset generally weaker than onset
Temporal modulations: slow | Correlation among envelopes in different frequency channels | Comodulation masking release (CMR) | Common frequency modulation may lead to common amplitude as energy shifts channels
Temporal modulations: fast, periodic (unresolved harmonics) | Channel envelopes with periodicity at [F0] | Segregation of two-tone complex by AM phase difference | -
Temporal modulations: fast, periodic (resolved harmonics) | Harmonically-related peaks in the [spectrum] | Mistuning of resolved harmonics; effect on phonetic category | -
Temporal modulations: fast, periodic (resolved and unresolved harmonics) | Periodicity in fine structure | Perception of double vowels | Basis for autocorrelation models
Spatial location | Interaural time difference due to differing source-to-pinna path lengths | Vowel identification. Strongest effect if direction is previously cued | Evidence that suggests role of ITD is limited or absent
Spatial location | Interaural level difference due to head shadowing | Noise-band vowel identification | -
Spatial location | Monaural spectral cues due to pinna interactions | Localization in the sagittal plane | Has not been investigated for complex, dynamic signals such as speech
Event sequences | Across-time similarity of whole-event attributes such as pitch, timbre, etc. | Sequential grouping of tones; sequential cueing | -
Event sequences | Long-interval periodicity | Perception of rhythm | By-product of very-low-frequency 'spectral' analysis
Source specific | Conformance to learned patterns | Sine-wave speech | -
In the next section we will see how adding feedback from higher processing levels to lower
processing levels can boost segregation quality.
2.7 Top-down or schema-driven CASA
Each of the authors cited in Section 2.6 acknowledges that their system functions less
well than might be hoped (performance is worse than in human listeners). The authors
argue that their proposed approach is based on only one of the multiple cues necessary for
correct sound segregation and claim that integrating additional cues into their system
would be rather easy. On the other hand, even if the cues are well defined in psychoacoustics
(common onset, common location, harmonicity, etc.), their signal processing counterparts
are not precisely defined. For instance, both Mellinger and Brown implement onset detector
maps as rectified differentiators within each frequency channel, and both recognize
the importance of having a family of maps based on different time-constants to be able to
detect onsets at a variety of timescales. But there is no suggestion on how these different
maps must be merged to generate the ‘true’ onset cue. Mellinger uses information from
any map that indicates an onset, whereas Brown found that using only the very fastest
map was adequate. In [42], we use two different maps (CAM and CSM) depending on
the nature of the signal.
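The notion of an onset map as a rectified differentiator with a given time constant can be sketched as follows. This is a minimal illustration and not the implementation of Mellinger or Brown: it assumes that each frequency channel is already available as a sampled envelope, uses a one-pole smoother to set the time constant, and the function names and the three time constants (2, 10 and 50 ms) are hypothetical.

```python
import math

def smooth(envelope, fs_hz, tau_s):
    """One-pole low-pass filter with time constant tau_s (seconds)."""
    alpha = math.exp(-1.0 / (tau_s * fs_hz))
    out, state = [], 0.0
    for x in envelope:
        state = alpha * state + (1.0 - alpha) * x
        out.append(state)
    return out

def onset_map(envelope, fs_hz, tau_s):
    """Half-wave rectified first difference of the smoothed envelope."""
    s = smooth(envelope, fs_hz, tau_s)
    # Only increases in energy are kept; decreases (offsets) are set to zero.
    return [0.0] + [max(s[n] - s[n - 1], 0.0) * fs_hz for n in range(1, len(s))]

def onset_maps(channel_envelopes, fs_hz, taus_s=(0.002, 0.01, 0.05)):
    """Family of onset maps: one map (list of channels) per time constant."""
    return {tau: [onset_map(env, fs_hz, tau) for env in channel_envelopes]
            for tau in taus_s}
```

For a step in the envelope, the fast map produces a sharp, large response at the step, while the slow map produces a lower and more spread-out response. How such maps should be merged into a single onset cue is exactly the open question raised above.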
In all the cases stated above, feedback from higher levels to lower levels should select the
adequate representation based on the actual performance of the system and the nature of
the sound. This form of ‘top-down’ CASA is called schema-driven.
Now that we know the general frameworks (data-driven or schema-based) of CASA, we
will focus on the way these general frameworks can be implemented by using different
approaches borrowed from Artificial Intelligence. This will be done in the next section.
2.8 Different implementations of CASA
CASA can be implemented with expert systems, bio-inspired neural networks, or
statistical approaches.
• Expert systems. In this approach, one tries to understand and extract all the
heuristic rules proposed by Gestalt scientists and Bregman and implement them
by defining rules (if-then cases). This approach has been used by Ellis [30], Brown
and Cooke [27], and others.
• Neural networks. This method consists of modelling the auditory pathway of
humans and animals. Since Gestalt heuristics have been observed in humans, an
organization of neurons that mimics the auditory neurons and cortex should by
itself follow Gestalt psychology, and no explicit expert rule (if-then case)
needs to be implemented in the system [46, 3, 41] (Figure 2.11). Unfortunately, the
structure of the digital computer and its common programming languages are very
far removed from the brain’s architecture; this gap (and its impact on models) might
be reduced with a more brain-like (parallel, distributed) computational paradigm.
• Statistical learning. This paradigm is based on the idea that Gestalt heuristics
can be learned through statistical approaches. Therefore, one needs neither to
study Gestalt theory nor to implement it ‘biologically’ in the system. The rules
are implicitly implemented during the learning phase. For instance, in [47, 33, 34, 48],
Hidden Markov Models (HMM) are used to do schema-driven source separation. The
disadvantages of HMM-based source separation are its very long training time and
the constraint that the number of sources in the mixture must be known a priori.
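As a purely illustrative sketch of the expert-system style, the following Python fragment encodes a single hypothetical if-then rule (grouping by harmonicity). The function name, the tolerance, and the candidate fundamentals are assumptions made for this example; they are not taken from the systems cited above.

```python
def group_by_harmonicity(partials_hz, f0_candidates_hz, tol=0.03):
    """Assign each partial to the first candidate F0 it is a harmonic of.

    Hypothetical rule: IF partial ~= n * F0 for some integer n >= 1
    (within a tolerance relative to F0) THEN it joins the stream of that F0.
    """
    streams = {f0: [] for f0 in f0_candidates_hz}
    unassigned = []
    for f in partials_hz:
        for f0 in f0_candidates_hz:
            n = round(f / f0)
            if n >= 1 and abs(f - n * f0) <= tol * f0:
                streams[f0].append(f)
                break
        else:
            # No rule fired: the partial stays ungrouped.
            unassigned.append(f)
    return streams, unassigned


streams, rest = group_by_harmonicity(
    [200, 330, 400, 600, 660, 990, 1234], [200, 330]
)
```

Every grouping decision here is an explicit rule written by the designer, which is precisely what the neural and statistical approaches try to avoid.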
In the previous sections we described ASA and introduced some of the most
important computer implementations of ASA. In the next section, we explain
the limits of Auditory Scene Analysis and point out that ASA as proposed
by Bregman is only a part of the whole auditory processing undertaken in the brain. To
do so, we present two aspects of auditory processing that may or may not be
explained by Bregman’s theory (depending on the theory or school of thought one
supports): pitch grouping and sine-wave speech.
2.9 ASA limitations
As said before, Gestalt and Bregman’s rules, in their current form, are very simplistic
and can only be applied to simple sounds. Complex sounds have very complex behaviors
(Figure 2.12, Page 33). Therefore some signal processing front-end should map the complex
sound into its constituent simple objects, which can then be analyzed by Bregman’s
rules. This mapping has not been completely derived so far, and all the efforts in CASA
are directed toward it. The next two subsections describe some of the controversies
about ASA and Bregman’s rules: sine-wave speech and pitch grouping.
2.9.1 Sinusoidal speech
Some scientists claim, on the basis of psychological observations, that Bregman’s grouping
rules are wrong. For instance, Remez et al., in their famous work on sine-wave
speech [49], propose to represent a speech signal only by sinusoidal trajectories that track
the first three formants. In 1994, the group conducted other experiments based on their
sinusoidal speech representation to find a counter-example to Bregman’s findings [50]
(Figure 2.13). The argument runs as follows. If speech is perceived by using Gestalt heuristics,
then grouping and segregation should be done not only for sound mixtures but also for the different
entities present in a single given source (this behavior is observed many times in Bregman’s
experiments). For instance, the different formants of a speech signal should be grouped
based on some similarity criterion to give birth to a whole, which is speech. The only
thing the formants have in common is the comodulation frequency. Therefore, one can argue
that if this comodulation of formants is suppressed, then the auditory system will not perceive
the formants as a whole. That is exactly what is done in the experiments of Remez et al.
In the sinusoidal speech case, there is no modulation; therefore, no grouping should have
been observed according to Bregman’s findings. Yet all subjects reported the three-sinusoid
sound as a whole and unique entity. Based on these observations, Remez et al. concluded
that the organization of sound in the brain is not governed by Gestalt psychology. In
1999, Cooke and Barker [51] performed other experiments to support Bregman’s theory.
They took the same sine-wave speech used by Remez et al. and modulated it with a
sawtooth signal with a frequency equal to the pitch of the speech. They reported that
this modulation improved the recognition score of subjects. They finally concluded that
since the modulation cue helped the auditory system improve the grouping process, Bregman’s
theory holds.
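The two kinds of stimuli discussed above can be sketched in a few lines of Python. This is only an illustration: the formant trajectories below are invented linear glides (not the tracks used in [49] or [51]), amplitude modulation by the sawtooth is an assumption, and the function names are hypothetical.

```python
import numpy as np


def sine_wave_speech(formant_tracks_hz, fs):
    """Sum of sinusoids whose instantaneous frequencies follow the given tracks."""
    out = np.zeros(formant_tracks_hz.shape[1])
    for track in formant_tracks_hz:
        # Integrate the instantaneous frequency to obtain the phase.
        phase = 2.0 * np.pi * np.cumsum(track) / fs
        out += np.sin(phase)
    return out / len(formant_tracks_hz)


def sawtooth_modulate(signal, pitch_hz, fs):
    """Amplitude-modulate the signal with a sawtooth in [0, 1) at the pitch rate."""
    t = np.arange(len(signal)) / fs
    saw = (t * pitch_hz) % 1.0
    return signal * saw


fs = 16000
n = fs // 2  # 0.5 s of signal
# Hypothetical formant trajectories (Hz): three slow linear glides.
tracks = np.vstack([
    np.linspace(500, 700, n),
    np.linspace(1500, 1200, n),
    np.linspace(2500, 2600, n),
])
sws = sine_wave_speech(tracks, fs)                          # no common modulation
modulated = sawtooth_modulate(sws, pitch_hz=100.0, fs=fs)   # common modulation added
```

In `sws` the three components share no common modulation, whereas in `modulated` they all carry the same 100 Hz envelope, which is the grouping cue at stake in the controversy.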
2.9.2 Limitations of pitch-based grouping
The examples described below show that some well-known ASA-based techniques, such as
pitch-based grouping, are incomplete in some special cases [52].
• Example 1: Overtone singing. Overtone singing is a vocal technique found in Central
Asian cultures, by which one singer produces two pitches simultaneously. When
listening to the performance, a high pitch of nF0 can be perceived along with a low
drone pitch of F0, because the formant centered at nF0 has an extraordinarily small
bandwidth. Using a pitch model based on autocorrelation analysis to determine the
pitch strength of nF0, one can find that the peak height increases as the formant
bandwidth decreases. Autocorrelation functions of normal voices show peaks corresponding
to formants, but their heights are not comparable to the peak at 1F0.
When listening to overtone singing, the auditory system extracts ‘too many’ pitches
for grouping.
• Example 2: Natural periodic sounds with the predominance of upper odd harmonics.
A complex tone composed of three harmonics at 7F0, 9F0, and 11F0 could
elicit three pitches: a prominent pitch of F0 and two weak pitches of 9F0/4 and 9F0/5.
Natural periodic sounds with the predominance of upper odd harmonics can be produced
by a quasi-sinusoidally driven Duffing oscillator [53]. When listening to such
sounds, the auditory system extracts ‘too many’ pitches for grouping.
• Example 3: Natural periodic sounds with the predominance of lower even-numbered
components. The sound of an oscillator that has undergone a period doubling can
have weak odd-numbered components at lower frequencies. The pitch f0, which
is extracted on the basis of the lower even-numbered components (the harmonics), is
too high for grouping all components. The pitch sensation of f0/2 can accomplish
this task, but the auditory system fails to perceive this pitch when the lower
odd-numbered components (the subharmonics) are weak and masked by adjacent
harmonics.
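Example 2 can be checked numerically with a simple autocorrelation pitch model. The sketch below is an assumption-laden illustration (plain circular autocorrelation of the waveform, no cochlear filtering, hypothetical function name and search range); it is not the pitch model used in [52].

```python
import numpy as np


def acf_pitch_candidates(x, fs, fmin, fmax, n_best=3):
    """Return the n_best pitch candidates (Hz) from peaks of the circular ACF."""
    spectrum = np.fft.rfft(x)
    acf = np.fft.irfft(np.abs(spectrum) ** 2, n=len(x))
    acf /= acf[0]
    # Lags corresponding to the pitch search range [fmin, fmax].
    lags = np.arange(int(fs / fmax), int(fs / fmin) + 1)
    is_peak = (acf[lags] > acf[lags - 1]) & (acf[lags] > acf[lags + 1])
    peak_lags = lags[is_peak]
    # Keep the lags of the n_best highest local maxima.
    best = peak_lags[np.argsort(acf[peak_lags])[::-1][:n_best]]
    return sorted(fs / best)


fs, f0 = 18000, 100.0
t = np.arange(fs) / fs  # 1 s of signal
tone = sum(np.cos(2 * np.pi * h * f0 * t) for h in (7, 9, 11))
candidates = acf_pitch_candidates(tone, fs, fmin=80.0, fmax=500.0)
```

With these settings the three strongest autocorrelation peaks fall at F0 = 100 Hz (the highest peak), 9F0/5 = 180 Hz, and 9F0/4 = 225 Hz, in line with the three pitches listed in Example 2.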
The above-mentioned findings and counter-examples, along with other experiments, show
that the cognition and organization of sound in the brain is an open issue that is not
completely understood.
The next section compares psychologically based approaches (like CASA)
with more mathematical and statistical approaches like Blind Sound Source Separation.
2.10 Comparison of CASA with other source separation techniques
CASA is not the only technique that can be used to separate sound sources in a mixture.
Other, non-bio-inspired techniques such as Blind Source Separation (BSS) can also be used.
In the next subsection, these techniques are compared with CASA.
2.10.1 Blind Sound Source Separation
BSS techniques use the statistical properties of signals to segregate sound sources without
taking into account any biological or psychological aspects. In fact, these techniques can
be used for any other type of signal (EEG [54], ECG, etc.). BSS is subject to some
constraints on the statistical behavior of the signals [55]. It is based on finding the
inverse of the mixing matrix based on the statistical independence of the underlying signals.
Statistical independence means that all the mutual (cross) cumulants of the signals must be zero.
One of the methods that minimizes the second-order mutual moment of the signals is PCA
(Principal Component Analysis). Another technique based on second-order statistics is
SOBI (Second Order Blind Identification). Techniques that use higher-order statistics
are called ICA (Independent Component Analysis) [56]. For instance, Comon’s ICA
technique minimizes the fourth-order cumulants [57] after signal whitening (see footnote 5), as
stated in Equation 2.1.
$$c_{\mathrm{ICA}}[y] = \sum_{i,j,k,l \,\neq\, iiii} \left|\mathrm{cum}(y_i, y_j, y_k, y_l)\right|^2 \qquad (2.1)$$
On the other hand, the JADE (Joint Approximate Diagonalization of Eigen-matrices)
algorithm minimizes the cost function in Equation 2.2:
$$c_{\mathrm{JADE}}[y] = \sum_{i,j,k,l \,\neq\, ijkl} \left|\mathrm{cum}(y_i, y_j, y_k, y_l)\right|^2 \qquad (2.2)$$
where $y_i$ is a signal sample at time $i$, and cum is the cumulant. The difference between
the procedure proposed by Comon and JADE is that for Comon, the summation is done
over all indices for which i, j, k, l are not all equal, while for JADE it is done over all
indices for which i, j, k, l are not all four different. The Comon and JADE methods have
similar performance, but JADE is much faster (note that for Comon, the summation
is done over $N^4 - N$ terms, N being the signal length, while for JADE it is done over
$N^4 - N(N-1)(N-2)(N-3)$ terms).
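The two term counts quoted above can be verified by brute-force enumeration for a small N. The sketch below only counts index tuples; it does not compute any cumulant, and the function name is hypothetical.

```python
from itertools import product


def count_terms(n):
    """Count index tuples (i, j, k, l) in {0..n-1}^4 kept by each summation."""
    not_all_equal = 0      # summation set described for Comon's criterion
    not_all_distinct = 0   # summation set described for JADE
    for idx in product(range(n), repeat=4):
        distinct = len(set(idx))
        if distinct > 1:
            not_all_equal += 1
        if distinct < 4:
            not_all_distinct += 1
    return not_all_equal, not_all_distinct


n = 6
comon_terms, jade_terms = count_terms(n)
```

For n = 6 this gives 1290 terms for the first set and 936 for the second, matching $N^4 - N$ and $N^4 - N(N-1)(N-2)(N-3)$; the gap between the two counts widens quickly as N grows.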
There are three important factors in BSS: the moment order used (covariance, cumulants,
etc.), the type of mixture (linear, convolutive, etc.), and the optimization method
(batch or iterative).
5 “Whitening” refers to the process that transforms a signal vector so that its covariance matrix is
the identity matrix.
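The whitening step defined in the footnote can be sketched as follows. This is a minimal illustration using an eigendecomposition of the sample covariance (one of several possible whitening transforms); the mixing matrix and the uniform sources are invented for the example.

```python
import numpy as np


def whiten(x):
    """Whiten x (shape: channels x samples) so that its covariance is the identity."""
    x = x - x.mean(axis=1, keepdims=True)
    cov = np.cov(x)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Symmetric whitening matrix: cov^(-1/2).
    w = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return w @ x


rng = np.random.default_rng(0)
sources = rng.uniform(-1.0, 1.0, size=(2, 5000))
mixing = np.array([[1.0, 0.6], [0.4, 1.0]])  # hypothetical mixing matrix
mixtures = mixing @ sources
z = whiten(mixtures)
```

After whitening, the signals in `z` are uncorrelated with unit variance, but they are in general still mixed by an unknown rotation; removing that rotation is the job of the higher-order criteria of Equations 2.1 and 2.2.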
BSS can be under-determined (fewer microphones than sources) [58, 59, 60, 61] or
over-determined (as many microphones as sources, or more) [55]. The over-determined case
works much better than the under-determined case.
The disadvantage of BSS is that one must know a priori how many sound sources
are present in the mixture [12].
BSS is a very powerful technique if all the required conditions (i.e., statistical independence,
etc.) are met. Unfortunately, not all sound mixtures satisfy the
conditions imposed by BSS.
Van der Kouwe et al. [12] have compared CASA-based techniques with BSS techniques using
the Cooke database [62]. They found that, for wideband signals, the improvement is greater
with BSS than with the CASA-based oscillatory network [3]. On
the other hand, Wang’s network [3] performs better when the noise is narrowband. They
reported a greater robustness in the case of Wang’s network. They also pointed out
that the SNR is not a criterion of intelligibility and that more work should be done to define
evaluation criteria for sound separation techniques (see also chapter 5 of this thesis).
2.11 Conclusion
In this chapter, we have introduced some fundamental concepts of
Computational Auditory Scene Analysis (CASA). It has been shown, based on the literature, that
although segregation rules for simple sounds seem to be known (at least partially), there
is no consensus or general framework for complex sounds. It has also been pointed out
that other separation techniques like Blind Source Separation (BSS) do not perform
well either. Therefore, the sound source separation problem, particularly in the
one-microphone case, is an open issue and is far from solved in the general case. In the
next chapter, the parallel between Auditory Scene Analysis (ASA) and neurophysiology
is drawn. We show how cognitive observations let us define a unified framework
between scene analysis (either visual or auditory) and the neural pathway.
[Figure 2.9 image: four panels, (a) to (d), each showing frequency (Hz) versus time; panel source files brn1h.aif, brn1h.fi.aif, clacan.rs.aif, and clacan.g1.ps-mini.]
Figure 2.9: Description of the main ideas behind different CASA approaches: a)
Cooke’s synchrony strands extracted for a voiced-speech utterance with the corresponding harmonic
sieves used for auditory streaming. b) The spectrogram of the McAdams-Reynolds oboe-soprano
sound, along with one of the sources extracted by Mellinger’s system [25]. Note that
up until 100 ms the system fuses all the harmonics, but then it segregates the even harmonics
on the basis of their common modulation. This is in contradiction with the least commitment
heuristic (see Table 2.1). c) The spectrogram of voice mixtures used in [27] before and after
processing to extract one voice; the effect of the time-frequency masking is clearly visible as
the extensive ‘white’ regions where interference has been removed. d) Sinusoidal tracks used to
model a mixture of a harmonic sound (a clarinet) and a transient (a dropped tin can) in [30].
The lower panel highlights the tracks corresponding to the clarinet phrase, grouped on the basis of
harmonicity (adapted from [30]).
[Figure 2.10 image: block diagram with the blocks Front-end analysis, Cue detectors, Core sound elements, Predict & combine, Reconciliation, Engine, World model (Blackboard), Higher-level abstractions, Actions, and Resynthesis; signals labelled sound, observed features, predicted features, prediction error, and separated source.]
Figure 2.10: A top-down blackboard system. Blackboard systems address the issue of
developing and choosing hypotheses (“interpretations”) at different levels of abstraction
for a signal, using some rules or models [45].
Figure 2.11: Wang and Brown’s oscillatory CASA system [3]. The input sound is processed
by a cochlear filterbank followed by a hair-cell model. The correlogram is then
computed and the pitch is extracted. Sound frames are applied to the first layer. Lateral
connections on the second layer are established according to the pitch calculated from the
correlograms. Sound is separated by using a mask and is resynthesized.
[Figure 2.12 image: two cochleogram panels, ‘Bregman alternating tone signal’ and ‘City street ambience’; axes f/Hz versus time/s, with level in dB.]
Figure 2.12: The cochleogram for simple tones and street noise (adapted from [30])
[Figure 2.13 image: four spectrogram panels: Natural Speech, Sine-wave speech, Modulated SWS (100 Hz), and Modulated SWS (200 Hz); frequency axis from 75 to 5000 Hz, duration 1.75 s.]
Figure 2.13: Spectrogram for natural speech, synthesized sine-wave speech and modulated
sine-wave speech (from [51]). The utterance is still audible when speech is replaced by
sinusoidal trajectories that track the first three formants.
CHAPTER 3
NEUROCOGNITION
3.1 Introduction
This chapter deals with the cognitive aspects of neural networks. We first see how the
Gestalt psychology described in Chapter 2 can have its roots in the auditory and visual
neural pathways. We then examine whether the Gestalt rules introduced in Chapter 2, or more
generally the classification problems encountered in real life, can be implemented by
using conventional neural networks. We finally explain how newer techniques and
approaches, such as temporal correlation and attentional models, can help us find
better solutions to the aforementioned problems.
3.2 Gestalt Psychology and Neurophysiology
As stated earlier in Chapter 2, the Gestalt corollary states that visual (auditory) perception
is not the linear sum of its constituent parts. In fact, the parts of an object are
integrated via the Gestalt principles to create a homogeneous entity. The principal heuristics
used by Gestalt psychology are (see Chapter 2 for details): proximity, similarity,
closure, good continuation, common fate, etc. [63, 6]. The visual and auditory cortices have
long-range synaptic connections at birth (homogeneity is a global aspect of an image), but
the learning that enables the brain to apply Gestalt heuristics takes place after birth [63].
Observations reveal that most of the Gestalt principles are implemented by horizontal
synaptic connections in V1 (the primary visual cortex) [64]. Even though adults easily apply
the ‘good continuation’ and ‘good form’ heuristics to perceive an object as a whole,
an infant (less than one year old) cannot apply these principles to perceive
an object as a homogeneous entity [63]. An important question is whether it is
possible to use conventional neural networks, such as those found in engineering
textbooks (perceptron, etc.), to implement the psychological rules described
in Chapter 2. In the next section, some of the disadvantages of using conventional
neural networks are presented.
3.3 Conventional Neural Networks
Conventional (classical) neural networks were developed as models of brain function. In
developing these models, several questions needed to be addressed:
1. How are brain states to be interpreted as representations of actual situations? In
other words, how is neural activity interpreted as a neural code, or, in computer
parlance, as a data structure?
2. What is the nature of the mechanisms by which brain states are organized?
3. In what format is information laid down permanently in the brain?
4. How is memory laid down? In other words, what are the mechanisms of learning?
Answers to these questions are given by the conventional neural network paradigm
attributed to Hebb (see [65] for a more detailed discussion of the following answers).
1. The neural code: neurons are taken as concrete symbols, as semantic atoms. They
can be interpreted in relation to patterns and events external to the organism. For
instance, the symbolic meaning of a neuron can be ‘up/down’ or ‘black/white’, etc.
Neurophysiology has provided a solid experimental basis for this statement, although
some extrapolation is needed to extend it to all neurons in the brain. A neuron has
only one degree of freedom in a given interval of time [t, t + ∆t]: it is either on or
off. Thus the brain state is described by a vector of on/off states. In order to know
what the brain is about at any instant of time, it is only necessary to know this
vector, along with a description of the symbolic meaning of all neurons. It must be
stated that the state vector is not constant over the interval of time [t, t + ∆t],
however small ∆t is chosen.
2. The mechanism of organization of brain states is based on the fluxes of excitation
and inhibition, a neuron collecting incoming signals and firing when a threshold is
crossed. The dynamics of the system are regularized so that the activity is stable
within an interval of time [t, t + ∆t]. In associative memory models, this is, for
instance, achieved by requiring connections between any pair of neurons to be symmetric,
with the consequence that the system displays attractor dynamics. Without
this restriction, a McCulloch and Pitts system would be a general digital machine
(Turing Machine) without any inherent tendency to organize.
3. Long-term memory is stored in terms of synaptic weights.
4. Long-term memory is laid down by mechanisms of synaptic plasticity, based on the
statistics of neural signals, especially their temporal correlations.
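Answers 2 to 4 can be illustrated with a minimal associative memory. The sketch below is a Hopfield-style network chosen here for illustration (it is not described in the text above): units take the values +1/-1 instead of on/off, the weights are Hebbian and symmetric, and the update is a simple threshold rule. All names are hypothetical.

```python
import numpy as np


def hebbian_weights(patterns):
    """Symmetric Hebbian weights (correlations of unit activities), no self-connections."""
    n = patterns.shape[1]
    w = patterns.T @ patterns / n
    np.fill_diagonal(w, 0.0)
    return w


def recall(w, state, steps=5):
    """Synchronous threshold updates: a unit is +1 when its net input is >= 0."""
    for _ in range(steps):
        state = np.where(w @ state >= 0, 1, -1)
    return state


stored = np.array([[1, -1, 1, -1, 1, 1, -1, -1]])
w = hebbian_weights(stored)   # long-term memory lives in the synaptic weights
probe = stored[0].copy()
probe[0] *= -1                # corrupt one unit of the stored state
recovered = recall(w, probe)  # the dynamics fall back into the attractor
```

Because the weight matrix is symmetric, the corrupted state relaxes to the stored pattern, which acts as an attractor of the dynamics.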
As argued in the next subsection, this classical point of view about neurons and their
organization in the brain cannot answer all our questions about how the brain
functions.
3.3.1 The binding problem
The binding problem was first addressed in [66] and [21]. Milner and von der Malsburg argued
that the classical code of neural networks is very poor, too narrow in its possibilities to
serve as a basis for an expansion of the functional range of current brain models. The
underlying weakness is best illustrated by a classical example due to Frank Rosenblatt.
Rosenblatt proposed the following experiment to show the weakness of ‘conventional neural
networks’. Suppose that you have designed a neural classifier that classifies objects in
a visual scene (Figure 3.1). The network is capable of telling us which geometrical form
(circle, triangle, etc.) is applied to the network and where the object is located
(up, down, right, left). In pattern recognition parlance these features are respectively
called the ‘what’ and ‘where’ information. The neural network has been designed in a
very simple way. In the output layer, one neuron is associated with one of the forms (for
instance, the circle), another with another geometrical form (the triangle), etc. Another set of
neurons encodes the location: one neuron is associated with the ‘down’ position, another
with the ‘up’ position, and so on. Now suppose that a triangle is present in the upper part of
the image applied to the network. The result of this experiment will be an activation of
the neuron associated with the triangle and of the neuron associated with the ‘up’
position. There is no ambiguity in the result of this experiment and everything seems
correct. Now suppose that a triangle is applied at the top and a rectangle at the bottom
at the same time. In this case four neurons will be activated at the same time: up, down,
triangle, and rectangle. There is now an ambiguity (see also Figure 3.2). How can we
bind the information we have obtained? Which is the correct combination: [(down, triangle),
(up, rectangle)] or [(down, rectangle), (up, triangle)]? This problem is referred to as the
binding problem. It is a fundamental problem with the classical neural network code:
it has no flexible means of constructing higher-level symbols by combining more elementary
symbols. The difficulty is that, as seen in Rosenblatt’s experiment, coactivating the
elementary symbols leads to binding ambiguity when more than one composite symbol
is to be expressed. Figures 3.3, 3.4, 3.5, 3.6, and 3.7 show more practical situations of
the binding problem for vision. In each figure, the experimenters’ aim has been to
show that the human visual pathway uses some kind of binding to solve some difficult visual
problems. Experiments for the auditory pathway can also be found in the literature [6].
As we will see in chapter 5, this binding may also happen in the auditory scene analysis
problem, in which geometrical forms are replaced by speech-relevant features.
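Rosenblatt’s thought experiment can be reproduced with a toy read-out. The sketch below is an illustration only: the four output neurons and the `classify` function are hypothetical stand-ins for the classifier described above, with each neuron simply signalling that its feature is present somewhere in the scene.

```python
OUTPUT_NEURONS = ("triangle", "rectangle", "up", "down")


def classify(scene):
    """Return the activity (0 or 1) of each output neuron for a scene.

    `scene` is a list of (shape, position) pairs. A neuron is active when
    its feature occurs in at least one object of the scene.
    """
    active = set()
    for shape, position in scene:
        active.add(shape)
        active.add(position)
    return tuple(int(name in active) for name in OUTPUT_NEURONS)


single = classify([("triangle", "up")])
scene_a = classify([("triangle", "up"), ("rectangle", "down")])
scene_b = classify([("triangle", "down"), ("rectangle", "up")])
```

A single object yields an unambiguous code, but the two different two-object scenes produce exactly the same output vector, so the read-out cannot tell which shape goes with which position.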
In the following subsection, we will see why conventional neural networks cannot be used to
solve binding problems like the one stated above.
Figure 3.1: Rosenblatt experiment: the static network identifies objects correctly when
they are applied separately. The triangle and the rectangle and their respective positions are
recognized correctly when applied separately. The ambiguity in recognition arises when
the two objects are applied simultaneously (adapted from [65]).
3.3.2 Are classical neural networks universal?
There is a widespread opinion that classical neural networks are a universal medium
with no limits to their abilities and that consequently they are not subject to the binding
problem [69]. This claim can be discussed from two different points of view. The questions
to be answered are: Does universality suffice as a solution to the brain’s problems? Are
classical neural networks universal media? The idea behind the universality of classical
neural networks is the Turing machine and the fact that there is no effective procedure that
cannot be realized as a program (algorithm), given enough storage space and time. From
this, it was extrapolated that mental processes, if only made concrete in terms of rules,
could be realized in machines. Under this view, the brain is a digital Turing machine
Figure 3.2: Catastrophe scenario: if two sets of active neurons (left and middle panel) are
simultaneously activated (right panel), information on their membership in the original
set is automatically lost (adapted from [65]).
[Figure 3.3 image: bars labelled Red and Green.]
Figure 3.3: The Illusory Conjunction experiment as described by Anne Treisman: what is
the color of the vertical bars? Subjects bind the color information to the orientation (vertical
or horizontal) information, so that they are unable to detect the only vertical red bar in
the scene at first glance [67].
and can perform any given task if an adequate number of neurons is available. McCulloch and
Pitts applied this idea to the modelling of the nervous system, proving that any logical
function can be implemented by perceptrons. But universality does not mean plausibility.
Is it realistic to say that the universality of classical neural networks can solve any problem
at hand even if it takes billions of neurons and centuries? Should we not look for a simpler
solution that can solve the problem with a couple of neurons and within a couple of
seconds or even milliseconds? And what does universality buy? How can one extract the
rules from which to design a Turing-machine-like neural network? Over time,
the field of artificial intelligence discovered that it is not an easy task at all to write a
[Figure 3.4 image: three arrows with receptive fields labelled x, y, and z.]
Figure 3.4: Binding example no. 1 (adapted from [68]), visual experiment: Different
objects (three arrows) are presented to a human. Some objects mask others. Contours of
visual receptive fields x and y belong to the same object but receptive fields z and y do
not belong to the same object even if they are collinear and have the same color (texture).
program that emulates the capabilities of the brain. It is becoming clear that the only
goal we can hope for is to establish a system that constitutes a basis for self-organization
and learning, the equivalent of a newborn who learns from the environment. Brain
theorists realized this fact in the late ’50s and modified the McCulloch and Pitts network
to accommodate self-organization and learning. However, these changes may have come
at a price: it is not clear whether neural networks are universal in any sense, although the
scientific community seems to have inherited the implicit belief that they are and that
any brain function can be modelled on the basis of those few abstractions from the real
nervous system that went into the formulation of neural networks.
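The universality argument rests on the fact that threshold units can realize logical functions. The following sketch shows this for a few gates; the weights and thresholds are one possible hand-picked choice, and the helper names are hypothetical.

```python
def threshold_unit(weights, threshold):
    """Return a binary unit that fires (1) when the weighted sum reaches the threshold."""
    def unit(*inputs):
        return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)
    return unit


AND = threshold_unit((1, 1), 2)
OR = threshold_unit((1, 1), 1)
NOT = threshold_unit((-1,), 0)


def xor(a, b):
    """XOR built from two layers of threshold units: (a OR b) AND NOT (a AND b)."""
    return AND(OR(a, b), NOT(AND(a, b)))


truth_table = [(a, b, xor(a, b)) for a in (0, 1) for b in (0, 1)]
```

Note that every weight and threshold here was chosen by hand for a known rule, which is exactly the limitation raised above: universality says such a network exists, not how to obtain it when the rules are unknown.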
3.4 Solutions to the binding problem
In this section some of the solutions to the ‘binding problem’ described in the previous
section are enumerated and explained.
[Figure 3.5 image: two crossed moving bars with receptive fields labelled x and y.]
Figure 3.5: Binding example no. 2 (adapted from [68]), visual experiment: two moving
bars are analyzed by a human subject. The cross (the combination of the two bars)
moves in the direction of the black arrow, but the visual receptive fields x and y cannot
detect the displacement in the direction of the black arrow. For the receptive field x, the
horizontal bar moves in the vertical direction, while for the receptive field y, the vertical
bar moves in the horizontal direction. The exact direction of motion cannot be detected
without binding.
3.4.1 Hierarchical coding
This approach is not really a ‘solution’ to the binding problem, but a means to circumvent
it. This technique is based on the belief that classical neural networks are universal and
that any brain problem at hand can be solved by Turing-machine-like neural networks.
For example, in Rosenblatt’s experiment, one simple and trivial solution is to put four
neurons, one for each possible combination, in the output of the network: (up, rectangle),
(down, rectangle), (up, triangle), (down, triangle). This type of coding is the hierarchical
coding of the two classes up/down and triangle/rectangle. But what happens if each of
the two classes has 10,000 members instead of two? In this case, 100,000,000 neurons should
be put in the output layer, which seems to be a very unrealistic solution to our problem.
Riesenhuber and Poggio [69] have used the hierarchical coding scheme to perform visual
scene analysis (Figure 3.8). In their model the two types of operations, selection and
template matching, are combined in a hierarchical fashion to build up complex, invariant
[Figure 3.6 image: objects with receptive fields labelled x, y, and z.]
Figure 3.6: Binding example no. 3 (adapted from [68]), visual experiment: even if x and
y have the same intensity, they belong to the same object. Without binding, this fact
wouldn’t have been trivial.
feature detectors from small, localized, simple cell-like receptive fields in the bottom layer.
In particular, patterns on the model “retina” are first filtered through a layer (S1) of simple
cell-like receptive fields (first derivative of Gaussians, zero-sum, square-normalized to 1,
oriented at 0°, 45°, 90°, 135°). Cells in the next layer (C1) each pool S1 cells of the same
orientation over a range of scales and positions. Filters were grouped in four bands each
spanning roughly. Different C1 cells are then combined in higher layers. Each S2 cell
receives input from 4 neighboring C1 units of arbitrary orientation, yielding a total of
4^4 = 256 different S2 cell types. S2 transfer functions are Gaussian. C2 cells then pool
inputs from all S2 cells of the same type, producing invariant feature detectors tuned to
complex shapes.
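The combinatorial cost of this kind of coding is easy to make explicit. The sketch below simply enumerates or counts conjunction units; the function name is hypothetical, and the counts are those quoted in this subsection.

```python
from itertools import product
from math import prod


def conjunction_units(feature_values):
    """One output unit per combination of values, taking one value per feature class."""
    return list(product(*feature_values))


# Rosenblatt's example: two positions and two shapes need four units.
units = conjunction_units([("up", "down"), ("triangle", "rectangle")])

# Two feature classes with 10,000 members each (counted, not enumerated).
n_large = prod([10_000, 10_000])

# S2 layer: four C1 inputs, each with one of four orientations.
n_s2_types = 4 ** 4
```

The number of units is the product of the class sizes, so it grows multiplicatively with every feature class that has to be bound.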
Another problem with the hierarchical approach is the lack of autonomy or self-organization
in the network, as in all the other classical neural networks described earlier. It means that for
a hierarchical network the design process is as follows: “Give me the brain problem you
want to solve, I will give you the adequate architecture”. This is in contradiction with the
self-organization paradigm, which states: “Give your problem so that the network adapts
itself to the problem it has to solve”.
[Figure 3.7 image: a gray object and a black object with receptive fields labelled x, y, and z.]
Figure 3.7: Binding example no. 4 (adapted from [68]), visual experiment: the receptive
field z is associated to the gray object even if the orientation of this field is similar to the
orientation of the black object x. Binding lets us explain this phenomenon.
3.4.2 Attentional models
Another solution to Rosenblatt’s experiment is the attentional model paradigm [70]. If
we can somehow eliminate the second object in the scene to focus on the first object, and
in a second phase eliminate the first object and keep the second, we will solve the binding
problem. The illusory conjunction is psychological evidence for the existence of attention in
human cognition (Figure 3.3) [71, 67]. In this paradigm, efferent receptive fields are
‘tuned’ according to the focus of attention. One of the first attentional models proposed
is Fukushima’s Neocognitron [72, 73]. In this network, a ‘winner-takes-all’ competition
between the objects in the output layer triggers the masking of objects in the input layer
through efferent (feedback) synapses in the gating layer.
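The principle of sequential attention can be sketched on Rosenblatt’s scene. The fragment below is a deliberately simplified illustration and not a model of the Neocognitron: it assumes that the scene is already segmented into objects, so that masking amounts to selecting one list element at a time, and all names are hypothetical.

```python
def classify(scene):
    """Feature-presence read-out: which shapes and which positions are active."""
    shapes = {shape for shape, _ in scene}
    positions = {position for _, position in scene}
    return shapes, positions


def attend_sequentially(scene):
    """Mask all objects but one at a time and read out the bound (shape, position) pair."""
    bound = []
    for focus in range(len(scene)):
        attended = [obj for i, obj in enumerate(scene) if i == focus]
        shapes, positions = classify(attended)
        # With a single attended object, each set holds exactly one element.
        bound.append((shapes.pop(), positions.pop()))
    return bound


scene = [("triangle", "up"), ("rectangle", "down")]
pairs = attend_sequentially(scene)
```

The ambiguity disappears because only one object reaches the read-out at any time; the price is that the objects must be processed one after the other.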
Another attentional model used in the literature is dynamic routing [74] (see Figure
3.10, Page 51). The connectivity between two successive layers is controlled by routing
control units, which can turn on or off certain subsets of connections. If the appropriate
connections are activated, a region in the input layer, referred to as the window of
attention, is projected to the output in a standardized size. This provides a normalized
representation of the attended region, based on which recognition can be performed. This
architecture is closely related to the SCAN (Signal Channelling Attentional
Network) network by Postma et al. [75] (Figure 3.10, Page 51). SCAN is a network
based on ‘dynamic routing’. The building block of SCAN is a gating lattice, a sparsely
connected neural network defined as a special case of the Ising lattice from statistical
mechanics (see footnote 1). The process of spatial selection through covert attention is
interpreted as a biological solution to the problem of translation-invariant pattern processing.
Salinas and Abbott [76] have added the ‘gain field’ to the neocognitron to allow selecting a
local region and enabling feature-extracting units only there. One can also imagine top-down
attention to objects or features if the facilitation acts on different sets of units sensitive
to a common feature rather than to a location, as illustrated in Figure 3.9 (Page 50). These
attentional control mechanisms are similar to those in the routing circuit model (Figure
3.10, Page 51) in that they work top-down and require indirect feedback.
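The effect of dynamic routing, namely projecting a window of attention of arbitrary size onto an output of standardized size, can be sketched as an index-selection operation. This is an illustration only: nearest-neighbour resampling stands in for the routing control units of [74], and the function name and parameters are hypothetical.

```python
import numpy as np


def route_window(image, top, left, height, width, out_size):
    """Project the attended window onto a fixed-size output (nearest neighbour)."""
    # Each output row/column is connected to one input row/column of the window.
    rows = top + (np.arange(out_size) * height) // out_size
    cols = left + (np.arange(out_size) * width) // out_size
    return image[np.ix_(rows, cols)]


image = np.arange(64).reshape(8, 8)
small = route_window(image, top=1, left=2, height=2, width=2, out_size=4)
large = route_window(image, top=0, left=0, height=8, width=8, out_size=4)
```

Whatever the size and position of the window, the stage that performs recognition always receives a 4 x 4 array, which is the translation- and scale-normalized representation mentioned above.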
3.4.3 Assembly Coding and Temporal correlation
Temporal correlation is a special case of the more general assembly coding approach.
In the assembly coding paradigm, a particular constellation of features is represented by
the joint and coordinated activity of a dynamically associated ensemble of cells, each of
which explicitly represents only one of the more elementary features that characterize a
particular perceptual object. Different objects can then be represented by recombining
neurons tuned to more elementary features in various constellations (assemblies) [78]. For
assembly coding, two constraints need to be met. First, a selection mechanism is required
that permits dynamic, context-dependent association of neurons into distinct, functionally
coherent assemblies. Second, grouped responses must be labelled so that subsequent
processing stages can distinguish them as components of one coherent representation and
do not confound them with other, unrelated responses. Tagging responses as related is
equivalent to raising their salience jointly and selectively, because this ensures that they
are processed and evaluated together at the subsequent processing stage. This can be
achieved in three ways. First, nongrouped responses can be inhibited; second, the
amplitude of the selected responses can be enhanced; and third, the selected cells can
be made to discharge in precise temporal synchrony. All three mechanisms enhance the
relative impact of the grouped responses at the next higher processing level.
¹ Some neural architectures are inspired by statistical and quantum mechanics. For example,
Boltzmann machines, mean-field machines, and Ising lattices are rooted in physical concepts.
An Ising lattice is a square connected lattice. Each lattice site (element) has a single spin
variable s = ±1. Minimizing the energy of such a lattice can solve optimization problems in
artificial intelligence.
Based on the motivations and observations stated above, Von der Malsburg proposed a
phase-coding (in contrast with rate-coding) assembly coding paradigm that he called
‘Temporal Correlation’.
If synchronization serves as a selection and binding mechanism, neurons must be sensitive
to coincident input. Moreover, synchronization must occur rapidly and show a relation to
perceptual phenomena. Although the issue of coincidence detection is still controversial
[79, 68, 80], evidence is increasing that neurons can evaluate temporal relations among
incoming activity with precision.
As an example, reconsider Rosenblatt’s experiment. If the ‘up’ neuron is activated at the
same time as the ‘triangle’ output and the ‘down’ neuron at the same time as the
‘rectangle’ output, so that the first event is dissociated in time from the second, no
ambiguity arises (Figure 3.12). In telecommunication terminology, this is equivalent
to time-division multiplexing (TDM).
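The time-multiplexing idea can be illustrated with a minimal sketch. The spike times and the helper `bind_by_coincidence` below are hypothetical and only serve the illustration: each object gets its own time slots, and outputs that fire in the same slots are grouped into one assembly.

```python
from collections import defaultdict

# Hypothetical spike times (arbitrary time slots) for the four output neurons.
# Even slots carry the first object, odd slots the second one (TDM-like scheme).
spike_times = {
    "triangle": [0, 2, 4, 6],
    "up": [0, 2, 4, 6],
    "rectangle": [1, 3, 5, 7],
    "down": [1, 3, 5, 7],
}

def bind_by_coincidence(spikes):
    """Group neurons that fire in exactly the same time slots."""
    groups = defaultdict(set)
    for name, times in spikes.items():
        groups[frozenset(times)].add(name)
    return sorted(sorted(group) for group in groups.values())

assemblies = bind_by_coincidence(spike_times)
print(assemblies)
```

Exact coincidence is used here only for simplicity; a real coincidence detector would tolerate some jitter in the spike times.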
The great advantage of the temporal correlation approach is its autonomy and
self-organization capability. It is, so far, the simplest and most plausible solution to
Rosenblatt’s experiment when the number of combinations is large.
The disadvantage of temporal correlation is its slowness compared with rival approaches
(especially hierarchical coding). As stated earlier, the detection of phase synchrony
by coincidence-detector neurons is another physiological and practical problem that
remains to be solved or studied further.
Another objection to the temporal correlation approach (which also applies to some extent
to other approaches) stems from the fact that not all recognition tasks are ‘stimulus-driven’
(based on the properties of the stimulus alone). In most cases they are ‘task-driven’ [81].
The hypothesis of stimulus-driven binding does not explain how neurons know what they
should bind. For instance, when you are observing someone’s face, you may wish to identify
the person regardless of whether his/her eyes are open or closed or whether his/her hair
is short or long. In all these situations, only some parts of the visual input should be
bound and the other parts discarded. Thus, if binding by synchronization takes place, it
cannot be purely stimulus-driven. External inputs are needed to control binding in a
task-dependent way. For example, if the task is to infer the person’s feelings and mood,
then whether the eyes are open or closed (or whether he/she smiles or not) can become
important, but these features are irrelevant for a person identification task. This
task-driven binding approach raises another important question: how can a high-level
process know which parts of the input image to group before it knows what is in the
image itself? Some top-down processes must be involved in ‘task-driven’ binding. Some
solutions to this problem have been proposed in the literature, but it is still an
unsolved issue [81].
To implement temporal correlation, neurons with adequate dynamics must be used.
The neurons used for implementing this approach are not classical neurons
but bio-inspired neurons (neurons that behave like the cells in our nervous system). As
shown and discussed in Chapter 4, different dynamics can be chosen for this task:
relaxation oscillatory neurons, integrate-and-fire neurons, chaotic neurons, and
Izhikevich’s model. In the case of chaotic neurons, phase synchrony is replaced by
similarity measures [82].
3.5 Conclusion
The cognitive aspects of neural networks have been reviewed in this chapter. We observed
that conventional neural networks do not cope with the situations encountered in
real-life classification. New concepts have been introduced that let us solve more
general problems. These concepts include, but are not limited to, hierarchical coding,
attentional models, and ‘temporal correlation’. Hierarchical coding is the fastest but the
least flexible, while ‘temporal correlation’ is a very flexible and autonomous approach. In
the next chapter we will see how such a ‘temporal correlation’ network can be constructed
using the available mathematical models of bio-inspired neurons.
[Figure 3.8 diagram. Labels: simple cells (S1), complex cells (C1), "composite feature" cells (S2), "complex composite" cells (C2), view-tuned cells. Legend: weighted sum, MAX.]
Figure 3.8: The hierarchical scene analyzer of Riesenhuber and Poggio. Each pixel of
the image is connected to four different cells in the S1 layer that are sensitive to one of
the four directions: horizontal, vertical, right, and left. The hierarchical organization is
such that features in the first layer are merged in the second layer to give the best match
for two adjacent pixels and so on. This is done hierarchically until the best match for
the whole image is found. The network consists of layers of linear units that perform a
template match over their afferents (dashed arrows), and non-linear units that perform
a ‘MAX’ operation over their inputs, where the output is determined by the strongest
afferent (solid arrows). While the former operation serves to increase feature complexity,
the latter increases invariance by effectively scanning over afferents tuned to the same
feature but at different positions (to increase translation invariance) or scale (to increase
scale invariance, not shown). An afferent nerve carries impulses toward the central nervous
system. The opposite of afferent is efferent. For more detail see section 3.4.1.
Figure 3.9: Hierarchical network for feature extraction with two types of attentional
control. First, the control units located on the right can facilitate, connect (black lines)
or discard some units so that the network only processes information coming from a
single object. This is an attentional ‘top-down’ (task-driven) control influenced by the
task (for example, when we know from higher levels that we are looking for something
specific, e.g. a triangle in Rosenblatt’s experiment). Second, a ‘winner-take-all’ mechanism
can select one object and discard the others. This is a saliency-based control (for
example, when the triangle is larger or darker than the rectangle, the triangle will win
the competition over the rectangle) [77].
Figure 3.10: Schematic diagram of the SCAN (Signal Channelling Attentional Network).
The activity of units represents a feature value, such as local light intensity, and is in-
dicated by different gray values. The same type of feature is used in the whole network
(no feature hierarchy). Most of the existing connections between two successive layers are
disabled (gray lines) through inhibitory mechanisms by the routing control units. The
remaining active connections (black lines) establish a mapping between a region in the
input layer (bottom layer), referred to as window of attention, and the output layer (top
layer). This provides a normalized view of the attended object (adapted from [77]) [75].
[Figure 3.11 diagram. Labels: features, Stage 1, Stage 2, Stage 3, Stage 4, recognized pattern.]
Figure 3.11: The hierarchical approach (along with attention) used by the neocognitron
to recognize ‘0’. In stage one, the existence of simple lines is detected by the network.
Stage two detects the combination of lines from stage one. Stage three analyzes the
combinations of features detected in stage two. In stage four, the whole number ‘0’ is
recognized.
[Figure 3.12 diagram. Labels: TR neuron, RC neuron, TOP neuron, DOWN neuron, their outputs plotted against time, and binding.]
Figure 3.12: Solution to the binding problem using the temporal correlation technique.
The neurons RC (rectangle) and DOWN are bound together in time, as are the
neurons TR (triangle) and TOP.
CHAPTER 4
DYNAMICS OF BIO-INSPIRED NEURONS
4.0.1 Introduction
Bio-inspired neural networks try to mimic the behavior of real neurons in animals and
humans. They let us process temporal sequences, in contrast with most classical
neural networks, which are suited to static data. One of the solutions used in classical
neural networks to process temporal data is to represent it spatially (as in Time-Delay
Neural Networks). This approach has a number of problems. First, an
interface must buffer the input to the neural network so that the network has all inputs
available for processing. Also, some external agent must signal when the buffer is full
and processing can begin. Further, this input buffer approach assumes that all input
patterns are of the same length, which is not realistic in most applications. Thus,
the buffer must be made large enough to accommodate the longest possible sequence,
which leaves part of the buffer unused when shorter sequences are processed. Another
problem is that input vectors that are similar but displaced temporally will appear
different when represented spatially [83] (the network is unable to detect time delays in
the input signal).
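The last problem can be made concrete with a small sketch. The pattern, the buffer length, and the helper names below are hypothetical choices made only for illustration: the same pattern placed at two positions in a fixed buffer looks like two unrelated vectors to a distance measure, although a cross-correlation recovers the delay.

```python
def euclidean_distance(a, b):
    """Distance between two equally long buffers."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def best_shift(a, b, max_shift):
    """Return the delay of b relative to a that maximizes their correlation."""
    def corr(shift):
        return sum(a[i] * b[i + shift]
                   for i in range(len(a)) if 0 <= i + shift < len(b))
    return max(range(-max_shift, max_shift + 1), key=corr)

pattern = [0, 1, 3, 1, 0]
buffer_len = 12
delay = 4

# The same pattern, stored at the start of the buffer and 4 samples later.
early = pattern + [0] * (buffer_len - len(pattern))
late = [0] * delay + pattern + [0] * (buffer_len - len(pattern) - delay)

distance = euclidean_distance(early, late)
shift = best_shift(early, late, max_shift=6)
print(distance, shift)
```

Here the squared distance equals twice the squared norm of the pattern, which is what two orthogonal vectors of that norm would give, so the spatial representation carries no trace of the similarity.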
In the case of bio-inspired neural networks, temporal sequence processing is done naturally
because of the intrinsic dynamic behavior of the neurons. The pioneering work in the field
of bio-inspired neural networks was done by Hodgkin and Huxley, whose experiments were
carried out at the Marine Biological Association laboratory in Plymouth. In the 1950s they
derived a mathematical description of the behavior of the squid giant axon. Although this
model is the most complete so far (it can predict most of the behaviors seen in simple
biological neurons), it is very complex and difficult to simulate in an artificial neural
network paradigm. In what follows, we first describe the most important mathematical
models of bio-inspired neurons, beginning with the Hodgkin-Huxley model. We then show
how well-known simpler models can be derived from the Hodgkin-Huxley equations using
some approximations. We then introduce the canonical model, a unified framework in which
a major part of bio-inspired neurons can be expressed. Based on the literature, we then
show how synchronization criteria can be derived using this canonical model. Some aspects
of learning in neural networks are discussed. Finally, two different architectures that
enable us to implement ‘temporal correlation’ are described.
4.1 Different types of neuronal models
4.1.1 Class I and Class II neural excitability
Neurons can behave in two different modes. A neuron is called class I excitable if the
spiking rate of the neuron is a quasi-linear function of the current applied to the input of
the neuron. A class I neuron becomes active via a ‘saddle-node’ bifurcation [84] (Section
4.3). A neuron belongs to class II if the discharge (spiking) rate varies very little with
the increase or the decrease of the applied current. Class II neurons are activated via an
‘Andronov-Hopf’ bifurcation [84].
Figure 4.1: Dependence of the spike rate on the applied input current in the Wilson-Cowan
neural model [85]
4.2 Mathematical description of neurons
Many different dynamics are used to mimic class I and class II excitability (see Section
4.1.1). In the remainder of this chapter, the ‘dimension of a model’ is the dimension of
the state space (phase space) describing the model, that is, the number of independent
variables needed to describe the dynamical system using first-order differential
equations. This number is also the number of first-order equations.
4.2.1 Four-dimensional neuronal models
• Hodgkin-Huxley neuronal model. The most general and complete model of a real
neuron so far is the Hodgkin-Huxley model. The Hodgkin-Huxley model can be
understood with the help of Figure 4.2. The semipermeable cell membrane separates
the interior of the cell from the extracellular liquid and acts as a capacitor. If an
input current I(t) is injected into the cell, it may add further charge on the capacitor,
or leak through the channels in the cell membrane. Because of active ion transport
through the cell membrane, the ion concentration inside the cell is different from
that in the extracellular liquid. The potential generated by the difference in ion
concentration is represented by a battery.
The Hodgkin-Huxley model is defined by the following equations (see [86]):
$$C \frac{du}{dt} = -\sum_k I_k + I(t) \tag{4.1}$$
where C is the membrane capacitance, $\sum_k I_k$ is the sum of the ionic currents that
pass through the cell membrane (defined in Equation 4.2), u(t) is the membrane potential,
and I(t) is the external current applied to the neuron.
$$\sum_k I_k = g_{Na} m^3 h (u - E_{Na}) + g_K n^4 (u - E_K) + g_L (u - E_L) \tag{4.2}$$
[Figure 4.2 diagram. Labels: inside, outside, K, Na, R, C, I.]
Figure 4.2: Schematic diagram for the Hodgkin-Huxley model (adapted from [86]).
The parameters $E_{Na}$, $E_K$, and $E_L$ are the reversal potentials. The reversal
potentials and the conductances $g_{Na}$, $g_K$, and $g_L$ are empirical parameters (for
details see D-2).
The three variables m, n, and h are called gating variables. They evolve according
to the following differential equations:
$$\dot{m} = \alpha_m(u)(1 - m) - \beta_m(u)\, m$$
$$\dot{n} = \alpha_n(u)(1 - n) - \beta_n(u)\, n \tag{4.3}$$
$$\dot{h} = \alpha_h(u)(1 - h) - \beta_h(u)\, h \tag{4.4}$$
Eqs. 4.2.1, 4.2, 4.1 along with tables 7.3 and Equation D-2 (appendix D) define the
dynamics of the Hodgkin-Huxley model. The problem with the Hodgkin-Huxley
model is its computational complexity: it is a nonlinear fourth-order system of
differential equations with variable parameters. Therefore, the simplified
two-dimensional models described in the next subsection have been proposed.
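To give a feeling for the computational cost, the following sketch integrates Equations 4.1 to 4.4 with the forward Euler method. The parameter values and rate functions are the commonly quoted textbook ones for the squid axon and are an assumption here (the values used in this work are those of Appendix D); the function names and the 0 mV spike-detection level are also illustrative choices.

```python
import math

# Textbook squid-axon parameters (assumed; see Appendix D for the thesis values).
C = 1.0                                # membrane capacitance, uF/cm^2
G_NA, G_K, G_L = 120.0, 36.0, 0.3      # maximal conductances, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.387  # reversal potentials, mV

def _ratio(x, y):
    """x / (1 - exp(-x / y)), replaced by its limit y near x = 0."""
    return y if abs(x) < 1e-7 else x / (1.0 - math.exp(-x / y))

def rates(u):
    """Opening (alpha) and closing (beta) rates of m, h, n in 1/ms."""
    a_m = 0.1 * _ratio(u + 40.0, 10.0)
    b_m = 4.0 * math.exp(-(u + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(u + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(u + 35.0) / 10.0))
    a_n = 0.01 * _ratio(u + 55.0, 10.0)
    b_n = 0.125 * math.exp(-(u + 65.0) / 80.0)
    return (a_m, b_m), (a_h, b_h), (a_n, b_n)

def simulate_hh(current, t_max=100.0, dt=0.01):
    """Forward-Euler integration of Eqs. 4.1-4.4; returns the spike count."""
    u = -65.0
    (a_m, b_m), (a_h, b_h), (a_n, b_n) = rates(u)
    # Start the gating variables at their steady-state values for u = -65 mV.
    m, h, n = a_m / (a_m + b_m), a_h / (a_h + b_h), a_n / (a_n + b_n)
    spikes = 0
    for _ in range(int(t_max / dt)):
        (a_m, b_m), (a_h, b_h), (a_n, b_n) = rates(u)
        ionic = (G_NA * m**3 * h * (u - E_NA)
                 + G_K * n**4 * (u - E_K)
                 + G_L * (u - E_L))                 # Eq. 4.2
        u_new = u + dt * (-ionic + current) / C    # Eq. 4.1
        m += dt * (a_m * (1.0 - m) - b_m * m)
        n += dt * (a_n * (1.0 - n) - b_n * n)      # Eq. 4.3
        h += dt * (a_h * (1.0 - h) - b_h * h)      # Eq. 4.4
        if u < 0.0 <= u_new:                       # upward crossing of 0 mV
            spikes += 1
        u = u_new
    return spikes

print(simulate_hh(10.0), simulate_hh(0.0))
```

Each time step evaluates six exponentials per neuron and needs a step of about 0.01 ms, which illustrates why the model becomes expensive for large networks.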
4.2.2 Two-dimensional neural models
Two-dimensional models aim to simplify the dynamics of the Hodgkin-Huxley model.
They stem from the fact that the dynamics of the gating variable m is much faster
than that of the variables n, h, and u. This suggests that we may treat m as an
instantaneous variable and replace it by its steady-state value,
$m(t) \to m_0[u(t)]$. This is called the quasi-steady-state approximation. Another
approximation consists of replacing the two variables n and (1 − h) by a single
effective variable w, because the two variables have rather similar graphs.
In what follows, the Morris-Lecar, FitzHugh-Nagumo, Wang-Terman, and Izhikevich
models are detailed, among which Wang-Terman oscillators are of great interest
for this work.
• Morris-Lecar Model. Morris and Lecar proposed a two-dimensional description of
neuronal spike dynamics. The first equation describes the evolution of the membrane
potential u, the second the evolution of a ‘slow recovery’ variable w. In
dimensionless variables, the Morris-Lecar equations read:
$$\frac{du}{dt} = -g_1 m_0(u)(u - 1) - g_2 w (u - V_2) - g_L (u - V_L) + I$$
$$\frac{dw}{dt} = -\frac{1}{\tau(u)} [w - w_0(u)] \tag{4.5}$$
where τ(u) is a polynomial function of the α and β variables of the Hodgkin-Huxley
equations.
• FitzHugh-Nagumo Model. FitzHugh and Nagumo were probably the first to
propose a two-dimensional approximation of the Hodgkin-Huxley equations. They
obtained sharp pulses by defining the following state-space equations:
$$\varepsilon \frac{dv}{dt} = F(v) - w + I$$
$$\frac{dw}{dt} = v - \gamma w \tag{4.6}$$
where $F(v) = v(1 - v)(v + a)$, $\varepsilon \ll 1$, and a, I, and γ are constants.
• Wang-Terman Model. The Wang-Terman model is based on the Van der Pol equations¹.
The state-space equations for these dynamics are as follows:
$$\frac{dx}{dt} = 3x - x^3 + 2 - y + \rho + p + S$$
$$\frac{dy}{dt} = \varepsilon \left[ \gamma \left( 1 + \tanh(x/\beta) \right) - y \right] \tag{4.7}$$
where x is the membrane potential (output) of the neuron and y is the state for
channel activation or inactivation. ρ denotes the amplitude of a Gaussian noise,
p is the external input to the neuron, and S is the coupling from other neurons
(connections through synaptic weights). ε, γ, and β are constants. This model is
used in chapters 5 and 6, where the dynamical behavior of a single Wang-Terman
neuron and of an assembly of neurons of this type is analyzed further.
¹ The van der Pol equation is a model of an electronic circuit that appeared in very early
radios. This circuit arose back in the days of vacuum tubes. The tube acts like a normal
resistor when the current is high, but like a negative resistor when the current is low. So
this circuit pumps up small oscillations, but drags down large oscillations. This behavior
is known as relaxation oscillation.
• Izhikevich Model. The Izhikevich model is a manifold reduction of the canonical
model of Ermentrout/Izhikevich. Izhikevich [87] has shown that this model can
reproduce all the modes shown in Figure 4.3. He has also shown in [88] that this
model is computationally attractive.
The original model of the Izhikevich neuron follows the dynamics:
$$\frac{dv}{dt} = 0.04 v^2 + 5v + 140 - u + I$$
$$\frac{du}{dt} = a(bv - u) \tag{4.8}$$
with the additional condition:
If v = +30 mV then v ← c and u ← u + d
u and v are variables and a, b, c, and d are parameters. v corresponds to the internal
potential and u represents the K+ and Na+ ionic currents. When v crosses some
predefined threshold (say, 30 mV), v is reset to c and u is incremented by d.
Izhikevich used his proposed equations with a step size equal to 1 ms. My simulations
have shown that a step size (integration step) equal to 1 ms gives unequal spikes
(different amplitudes) at the output of the neuron. This is because the dynamics of the
system is stiff, and at the spiking instant the variable v varies very rapidly. Therefore
the variable v can have a value equal to v = 30 − δ (δ ≪ 1) at t and a value equal
to v = 30 + θ (θ > 100) at t + 1 ms. One trivial way to circumvent this problem
is to decrease the step size. Another solution, which we proposed to Izhikevich, was
to modify the model above slightly². Hence, in the final version of his paper, the
resetting condition was changed as follows:
If v(t) > +30 mV then v(t) = 30 and v(t + 1) ← c and u ← u + d.
If the first-order Euler integration method is used to solve Izhikevich’s state-space
equations, we obtain (step size equal to 1 ms):
$$v(t + 1) = (0.04\, v(t) + 6)\, v(t) + 140 - u(t) + I(t) \tag{4.9}$$
$$u(t + 1) = ab\, v(t) + (1 - a)\, u(t) \tag{4.10}$$
The above equations are obtained simply by replacing $\frac{dv}{dt}$ and $\frac{du}{dt}$
in Equation 4.8 by v(t + 1) − v(t) and u(t + 1) − u(t), respectively.
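The effect of the modified resetting condition can be checked with a short sketch of Equations 4.9 and 4.10. The parameter values (a = 0.02, b = 0.2, c = −65, d = 8, I = 10) are the regular-spiking values commonly quoted for this model and are an assumption here, as are the function name and the `clamp` switch.

```python
def simulate_izhikevich(a=0.02, b=0.2, c=-65.0, d=8.0, current=10.0,
                        steps=1000, clamp=True):
    """Iterate Eqs. 4.9 and 4.10 with a 1 ms step.

    With clamp=True the recorded potential at a spike is set to 30 mV before
    the reset, as in the modified resetting condition; with clamp=False the
    raw overshoot is recorded instead.
    """
    v = c
    u = b * v
    trace, spike_times = [], []
    for t in range(steps):
        if v > 30.0:
            trace.append(30.0 if clamp else v)   # v(t) = 30
            spike_times.append(t)
            v, u = c, u + d                      # v(t + 1) <- c, u <- u + d
            continue
        trace.append(v)
        v_next = (0.04 * v + 6.0) * v + 140.0 - u + current   # Eq. 4.9
        u = a * b * v + (1.0 - a) * u                         # Eq. 4.10
        v = v_next
    return trace, spike_times

clamped, spikes = simulate_izhikevich()
raw, _ = simulate_izhikevich(clamp=False)
print(len(spikes), max(clamped), max(raw))
```

With the clamp, every spike in the recorded trace peaks at exactly 30 mV, whereas the raw trace overshoots by an amount that depends on where the previous sample happened to fall.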
Even simpler models can be derived from two-dimensional neurons by further
approximating the equations. In Appendix E (Figure E-7) we show how a Wang-Terman
oscillator can be reduced to a one-dimensional model. This approximation will be
very important in the analysis of the ODLM (Oscillatory Dynamic Link Matcher)
proposed in Chapter 6.
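As a point of comparison for such a reduction, the full two-dimensional Wang-Terman dynamics of Equation 4.7 can be sketched numerically. The constants below (ε = 0.02, γ = 6, β = 0.1), the noise-free and uncoupled setting (ρ = 0, S = 0), and the function name are assumptions made for illustration only.

```python
import math

def simulate_wang_terman(p, eps=0.02, gamma=6.0, beta=0.1,
                         t_max=1000.0, dt=0.01):
    """Forward-Euler integration of Eq. 4.7 with rho = 0 and S = 0.

    Returns the number of upward zero crossings of x (one per oscillation).
    """
    x, y = -2.0, 0.0
    crossings = 0
    for _ in range(int(t_max / dt)):
        dx = 3.0 * x - x**3 + 2.0 - y + p
        dy = eps * (gamma * (1.0 + math.tanh(x / beta)) - y)
        x_new = x + dt * dx
        y += dt * dy
        if x < 0.0 <= x_new:
            crossings += 1
        x = x_new
    return crossings

stimulated = simulate_wang_terman(p=0.5)     # positive input: oscillates
unstimulated = simulate_wang_terman(p=-0.5)  # negative input: stays at rest
print(stimulated, unstimulated)
```

With these constants a positive input places the fixed point on the middle branch of the cubic nullcline, so the neuron alternates between a silent and an active phase, while a negative input leaves a stable fixed point on the left branch.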
4.2.3 One-dimensional neural models
One of the most widely used models in computational neuroscience is the leaky
integrate-and-fire (I&F) neuron [89], described as follows:
$$\frac{dv}{dt} = I + a - bv, \qquad \text{if } v \ge v_{threshold} \text{ then } v \leftarrow c \tag{4.11}$$
where v is the membrane potential, I is the input current, and a, b, c, and $v_{threshold}$
are parameters. When the potential v reaches the threshold $v_{threshold}$, the neuron is
said to fire a spike, and v is reset to c.
² Personal correspondence with Eugene Izhikevich.
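A minimal sketch of Equation 4.11 follows. The parameter values and function names are hypothetical; the closed-form period used for comparison comes from solving the linear equation between two resets, which gives T = (1/b) ln[(I + a − bc) / (I + a − b v_threshold)] whenever I + a > b v_threshold.

```python
import math

def simulate_lif(current, a=0.0, b=1.0, c=0.0, v_threshold=1.0,
                 t_max=10.0, dt=1e-4):
    """Forward-Euler integration of Eq. 4.11; returns the spike times."""
    v = c
    spike_times = []
    for step in range(int(t_max / dt)):
        v += dt * (current + a - b * v)
        if v >= v_threshold:
            spike_times.append((step + 1) * dt)
            v = c                       # reset after the spike
    return spike_times

def analytic_period(current, a=0.0, b=1.0, c=0.0, v_threshold=1.0):
    """Time needed to charge from c to v_threshold under a constant drive."""
    drive = current + a
    return math.log((drive - b * c) / (drive - b * v_threshold)) / b

spikes = simulate_lif(2.0)
intervals = [t2 - t1 for t1, t2 in zip(spikes, spikes[1:])]
mean_interval = sum(intervals) / len(intervals)
silent = simulate_lif(0.5)              # drive too weak to reach threshold
print(mean_interval, analytic_period(2.0), len(silent))
```

The single state variable and the explicit reset are what make this model so much cheaper than the two- and four-dimensional models above.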
Another type of model that should be discussed is the chaotic neural model. Chaotic
neurons have fractal dimensions: the dynamics of such a system is governed by strange
attractors (fractals), and its fractal dimension is less than the dimension of its
phase space.
4.2.4 Fractal dimension neural models
Chaotic Neural Model. The introduction of methods originating from nonlinear dy-
namics in the analysis of brain waves (electroencephalograms, EEG) goes back to the
pioneering work of Walter Freeman. Nonlinear time series analysis of the EEG of the ol-
factory system revealed that the dynamics of neuronal activity is low dimensional, though
unpredictable. This is a characteristic property of a deterministic chaotic system in con-
trast with the dynamics of a high-dimensional stochastic process. The former system
typically collapses after a transient to a low dimensional attractor whereas the dynam-
ics of the latter remains high dimensional (for further details on the difference between
a chaotic deterministic time series and a stochastic time series see [90]). Deterministic
chaos in neural networks has been observed not only at the network level but also at the
level of a single neuron. The Hodgkin-Huxley model already shows a parameter range
in which chaotic dynamics appears (some other neural models also exhibit chaos, see
Figure 4.4).
Since these early discoveries, much effort has been devoted to devising sophisticated
methods to establish the idea of chaos in the brain. However, the determination of chaotic
dynamics from time series analysis is a subtle task, mainly due to the presence of noise in
experimental systems. Thus, whether chaos is indeed present in the brain or if its detec-
tion is just an artifact, due to the applied methods, is still an open question. Moreover,
the significance of chaotic dynamics in neural systems has not yet been elucidated.
The chaotic map model used in [91] [82] is as follows:
$$x_i(t + 1) = x_i(t) + \frac{\varepsilon}{N} \sum_{j=1}^{N} f(x_j(t)) \tag{4.12}$$
where f(x) is the logistic map, defined as follows:
$$f(x) = a x (1 - x) \tag{4.13}$$
and a is a constant. The logistic map can be replaced by other maps such as the Hénon
map ([91]³). The chaotic map model defined above is called the Locally to Globally Coupled
chaotic Map (LtGCM) [92, 93, 94].
It must be pointed out that the dynamics explained above does not always exhibit chaotic
behavior. Roughly speaking, given that the $x_j(0)$ are random, when N is large the sum
follows the law of large numbers (but not the central limit theorem [92]). When the
variance of the variables $x_j(0)$ is small, the distribution of
$\sum_{j=1}^{N} f(x_j(t))$ becomes close to a delta function and the behavior of the
system is no longer chaotic. For a detailed derivation of the criteria for chaotic
behavior of the system see [95]⁴.
Different excitation modes are observed in real biological neurons. In what follows we
describe each mode briefly:
• Phasic Spiking. A neuron may fire only a single spike at the onset of the input, as in
(Figure 4.3, B), and remain quiescent afterwards. Such a response is called phasic
spiking, and it is useful for detecting the beginning of stimulation.
• Tonic Bursting. Some neurons, such as the chattering neurons in cat cortex [87],
fire periodic bursts of spikes when stimulated, as in (Figure 4.3, C). The inter-burst
(i.e., between bursts) frequency may be as high as 50 Hz, and it is believed
that such neurons contribute to the gamma-frequency oscillations in the brain.
³ Personal correspondence with Zhao.
⁴ Personal correspondence with Pasemann.
• Phasic Bursting. Similarly to phasic spiking neurons, some neurons are phasic
bursters, as in (Figure 4.3, D). Such neurons report the beginning of the stimulation
by transmitting a burst.
• Mixed Mode (Bursting Then Spiking). Intrinsically bursting (IB) excitatory
neurons in mammalian neocortex [87] can exhibit a mixed type of spiking activity,
depicted in (Figure 4.3, E).
• Spike Frequency Adaptation. The most common type of excitatory neuron in
mammalian neocortex [87], namely the regular spiking (RS) cell, fires tonic spikes
with decreasing frequency, as in (Figure 4.3, F).
• Spike Latency. Most cortical neurons fire spikes with a delay that depends on
the strength of the input signal. For a relatively weak but superthreshold input the
delay, also called spike latency, can be quite large, as in (Figure 4.3, I).
• Subthreshold Oscillations. Practically every brain structure has neurons capable
of exhibiting oscillatory potentials [87], as in (Figure 4.3, J). The frequency of such
oscillations plays an important role, and such neurons act as band-pass filters.
• Frequency Preference and Resonance. Due to the resonance phenomenon, neurons
having oscillatory potentials can respond selectively to inputs whose frequency
content is similar to the frequency of their subthreshold oscillations. Such neurons can
implement frequency-modulated (FM) interactions and multiplexing of signals [87].
• Integration and Coincidence Detection. Neurons without oscillatory potentials
act as integrators: they prefer high-frequency input; the higher the frequency the
more likely they fire, as in (Figure 4.3, L). This can be useful for detecting coincident
or nearly coincident spikes.
• Rebound Spike. When a neuron receives and then is released from an inhibitory
input, it may fire a post-inhibitory (rebound) spike, as in (Figure 4.3, M). This
phenomenon is related to anodal break excitation in excitable membranes.
• Rebound Burst. Some neurons, including the thalamo-cortical cell, may fire post
inhibitory bursts [87], as in (Figure 4.3, N). It is believed that such bursts contribute
to the sleep oscillations in the thalamo-cortical system.
• Threshold Variability. A common misconception in the artificial neural network
community is the belief that spiking neurons have a fixed voltage threshold. It is
well known that biological neurons have a variable threshold that depends on the
prior activity of the neurons, as in (Figure 4.3, O).
• Bistability of Resting and Spiking States. Some neurons can exhibit two stable
modes of operation: resting and tonic spiking (or even bursting). An excitatory or
inhibitory pulse can switch between the modes as in (Figure 4.3, P).
• Depolarizing After-Potentials. After firing a spike, the membrane potential
of a neuron may exhibit a prolonged after-hyperpolarization (called AHP), as e.g. in
(Figure 4.3, B, I, or M), or a prolonged depolarized after-potential (called DAP), as
in (Figure 4.3, Q).
• Accommodation. Neurons are extremely sensitive to brief coincident inputs but
may not fire in response to a strong but slowly increasing input, as illustrated in
(Figure 4.3, R).
• Inhibition-Induced Spiking. A bizarre feature of many thalamo-cortical neurons
is that they are quiescent when there is no input, but fire when hyperpolarized by
an inhibitory input or an injected current (Figure 4.3, S).
• Inhibition-Induced Bursting. Instead of spiking, a thalamo-cortical neuron can
fire tonic bursts when an inhibitory input is applied to it (Figure 4.3, T).
Not all the models (integrate-and-fire, relaxation, etc.) described in Section 4.2 can
reproduce all the modes. In our work only the regular spiking (RS) mode is used.
In Section 4.3, we present a general framework for analyzing synchronization among
neurons, which will be used further in chapters 5 and 6.
4.3 Canonical Neuronal Model
It is shown (see Appendix A) that it is possible to find a global mathematical framework
in which all class I neuronal models (see Section 4.1.1) can be represented by a single
state variable φ. The synchronization conditions can be derived by using an invariant
manifold⁵ reduction (chapter 5, [97], and Appendix A). The advantage of the canonical
model for neuroscience applications is that it can model all such neurons, even those
that have not yet been invented.
Many scientists believe that all pulse-coupled neural networks are toy models that are far
from biological reality. In what follows, we will show that a huge class of biophysically
detailed and biologically plausible neural network models can be transformed into
a canonical pulse-coupled form by a piecewise continuous, possibly noninvertible, change
of variable. Such transformations exist when a network satisfies a number of conditions,
e.g., it is weakly connected, the neurons are Class 1 excitable, and the synapses between
neurons are conventional (i.e., axo-dendritic and axo-somatic). This generalization will
let us analyze network properties (such as synchronization) independently of the model
used (because all such models reduce to the canonical model). Using this approach, we
will find some general conditions that can be applied to all the models seen before.
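As an illustration of what a one-variable description of a Class 1 neuron looks like, the sketch below uses the Ermentrout-Kopell "theta neuron", dφ/dt = (1 − cos φ) + (1 + cos φ) r. This particular form is an assumption made for the example (the canonical model actually used in this work is the one derived in Appendix A), and the function name is hypothetical. For a constant input r > 0 the phase rotates with period π/√r, so the firing rate grows continuously from zero with the input.

```python
import math

def theta_period(r, dt=1e-4):
    """Time for phi to travel from -pi to +pi under
    dphi/dt = (1 - cos(phi)) + (1 + cos(phi)) * r, using forward Euler.

    A spike is associated with phi crossing pi; r must be positive.
    """
    phi, t = -math.pi, 0.0
    while phi < math.pi:
        phi += dt * ((1.0 - math.cos(phi)) + (1.0 + math.cos(phi)) * r)
        t += dt
    return t

for r in (1.0, 0.25):
    print(r, theta_period(r), math.pi / math.sqrt(r))
```

The substitution x = tan(φ/2) turns this equation into dx/dt = x² + r, which is where the closed-form period comes from.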
As shown in 7.3, the in-phase synchronized solution of two identical Class 1 neurons exists,
but it is not exponentially stable 6. Small perturbations can make it disappear or stabilize
5A manifold is a topological space which is locally Euclidean (i.e., around every point, there is a
neighborhood which is topologically the same as the open unit ball in ). To illustrate this idea, consider
the ancient belief that the Earth was flat as contrasted with the modern evidence that it is round. This
discrepancy arises essentially from the fact that on the small scales that we see, the Earth does indeed
look flat (although the Greeks did notice that the last part of a ship to disappear over the horizon was
the mast). In general, any object which is nearly “flat” on small scales is a manifold, and so manifolds
constitute a generalization of objects we could live on in which we would encounter the round/flat Earth
problem, as first codified by Poincare. More formally, any object that can be “charted” is a manifold.
For a detailed discussion see [96].6Exponential stability of a system is defined in terms of Lyapunov coefficients [98]. Exponentially
4.4. DIFFERENT MODES OF SYNCHRONIZATION 65
with a small phase shift. The result is valid for any arbitrary synaptic organization [84].
We have so far observed that synchronization can be achieved between two neurons under
certain conditions. In the next section we will focus on the behavior of many neurons
(assembly of neurons) and will classify different modes.
4.4 Different modes of synchronization
As stated in Chapter 3, synchronization is a key element to the ’temporal correlation’
theory. In what follows, I enumerate different mode of synchronization, in which a cell
assembly can operate [99]:
• In-phase synchronization. All neurons in the assembly have the same frequency
and phase.
• Antiphase synchronization. Neurons oscillate at the same frequency but have a
phase difference of π.
• Out-of-phase synchronization. Neurons spike at the same frequency but their
phase difference may range somewhere between 0 et π.
• Frequency synchronization. Neurons may have similar spiking frequency but
variable phase difference.
• Frequency-ratio synchronization. One neuron oscillates at an angular frequency ω1
and the other at an angular frequency ω2, such that ω1/ω2 = n with n ∈ N (n is an integer).
• Low-frequency-modulated synchronization. The neurons’ spike trains are quasi-periodic
(the sum of a high-frequency oscillation pattern and a low-frequency oscillation pattern).
The neurons are synchronized with respect to the low-frequency oscillation pattern.
stable systems tolerate modest implementational inaccuracies; mere stable systems, in general, do not.
• Partial synchronization. A subpopulation of neurons in an assembly has syn-
chronized behavior while the remaining neurons are not synchronized, disturbing
the synchronized region. But the perturbation is not strong enough to destroy the
partial synchronization.
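As a small illustration of the first three modes in the list above, the following sketch
classifies the phase relation of two oscillators that share the same frequency. The function
name, the tolerance tol and the choice of Python are assumptions made for illustration only;
they are not taken from [99].

```python
import math

def classify_phase_mode(phase_a, phase_b, tol=1e-2):
    """Classify the phase relation of two oscillators with equal frequency.

    Phases are in radians. `tol` is a hypothetical tolerance deciding when a
    phase difference counts as exactly 0 or exactly pi.
    """
    # Wrap the phase difference into the interval [0, pi].
    diff = abs((phase_a - phase_b + math.pi) % (2 * math.pi) - math.pi)
    if diff < tol:
        return "in-phase"
    if abs(diff - math.pi) < tol:
        return "antiphase"
    return "out-of-phase"
```

The other modes (frequency, frequency-ratio, low-frequency-modulated and partial
synchronization) involve frequencies or whole assemblies and would need spike trains rather
than a single pair of phases.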
Since we will focus on in-phase synchronization in the remainder of this work, the next
section presents different models that can mimic this kind of dynamics.
4.5 Selection of the model
As stated earlier, not all the models proposed in the literature can mimic all the modes
seen in neurophysiology. Most of the models described earlier have been implemented in our
SIMULINK library (SIMULINK is a graphical-interface extension to MATLAB). The choice of
the model for our architecture should therefore be based on our needs, that is,
synchronization with the same frequency and different phase. We have also seen in the previous
section that neurons with complex dynamics synchronize with a phase lag or with a
varying frequency, which is not what we want for our simple model. Even some simple models,
like the Izhikevich neuron, cannot ensure in-phase synchronization. We are also looking
for a model that is not computationally very expensive, because complex models cannot
be implemented with our limited computational resources. Therefore, a model like the
Hodgkin-Huxley equations, which should be solved with finite element analysis techniques,
cannot be considered in this work. Taking these two criteria into account, three models
can be selected in a first stage: the relaxation oscillator, the integrate-and-fire neuron,
and the chaotic neuron.
4.5.1 Pros and cons of relaxation oscillators
Advantages:
• The spiking frequency of the neuron is independent of the input over a range of
values. This makes synchronization simpler, even though it is not very well motivated
biologically (this behavior actually lies somewhere between class 1 and class 2
neural excitability).
• The behavioral dynamics of the system is mathematically tractable and, on the other
hand, the equations are not as complicated as those of the Hodgkin-Huxley model.
• The model can produce some of the spiking modes. For some suitable parameters it
spikes with a very low duty cycle (single spikes), while with another set of parameters
it oscillates with high duty cycles that can be seen as the envelope of bursts.
Disadvantages:
• The Van der Pol equation used in the Wang-Terman relaxation neuron is a ’stiff’
equation. Hence, a small step size should be used. For a single neuron, this can
be a disadvantage of the Wang-Terman model compared with the integrate-and-fire
model. But if many neurons are used, a small step size should also be used in the
integrate-and-fire network to break the symmetry of initial values 7.
• The number of synchronized regions is limited to 4-5 in the original version of the
Wang-Terman model. The algorithmic version of the model does not have this
disadvantage but cannot be implemented in parallel [100].
4.5.2 Pros and cons of ’integrate-and-fire’ neurons
Advantages:
• In the case of the Van der Pol equation, the trajectory of the neuron in the state
space is one-way. Therefore an external input or a synapse cannot advance or delay
the spiking time of a neuron. This is in contrast with the integrate-and-fire neuron,
in which the internal potential can increase or decrease depending on the external
influence. Hence, the frequency of oscillations can decrease so that more regions
with different phases can be created.
• For small networks, the required integration step size is much bigger in the case of
the integrate-and-fire network. This is because we require a great precision in order
to break the initial symmetry in each region.
7 It means that in each cluster of neurons, there must be at least one neuron that has different initial
values from at least one neuron from other clusters.
Disadvantages:
• The spiking frequency is a function of the applied input. Therefore, to have in-phase
synchronization, binary images should be used. Techniques have been proposed in
the literature to convert gray-level images to binary images for the segmentation
application [102].
• The integrate-and-fire network is very sensitive to weight normalization. For each
neuron, the number of active neighboring neurons should be found and normalization
should be applied. Weight normalization also increases the synchronization speed of
Wang-Terman oscillatory neurons, but it is not mandatory for them.
• Breaking the initial symmetry in the network (regions that have at least one neuron
with a different initial value) requires a high-precision random number generator.
4.5.3 Pros and cons of chaotic neurons
Advantages:
• The model is very advantageous in terms of computational complexity: there is no
differential equation to be solved and no threshold crossing detection is necessary.
• The model uses only multiplications and additions, therefore it is very suitable for
FPGA (Field Programmable Gate Array) implementation.
Disadvantages:
• The chaotic behavior of this model corresponds to a special parameter tuning of the
Hodgkin-Huxley equations, for which the dynamics becomes chaotic.
• The in-phase synchronization in this model is replaced with trajectory similarity.
The algorithms that detect trajectory similarities are much more complicated than
in-phase synchronization detection.
The overall behavior of a neural network depends not only on the dynamics of the neurons
themselves, but also on the behavior of the synapses (i.e., the way neurons are connected
to each other). In the next section, learning in neural networks is discussed.
4.6 Learning
The learning method in a neural network defines the way in which the synaptic weights
change during the functioning of the network, depending on the nature of the input signals.
We focus here only on unsupervised learning, in which there is no “tutor” to teach the
network. This is in contrast with supervised learning, in which the network is given the
correct classification result for a set of test data in advance.
In what follows, two general unsupervised learning frameworks are briefly described:
memoryless learning and Hebbian learning.
4.6.1 Memoryless learning
In memoryless learning, the synaptic weights are adjusted according to the current value
of the input signals. The weights depend neither on the past values of the inputs nor on the
past states of the neurons [103, 82, 104, 105, 106]. Suppose that neurons i and j are connected
by symmetric synaptic weights w(i, j; t) = w(j, i; t), and that I_i(t) and I_j(t) are the
external inputs to neurons i and j respectively. The synaptic weights are defined as
follows:
$$w(i, j; t) = f\left(|I_i(t) - I_j(t)|\right) \qquad (4.14)$$
f can be an exponential or fractional (bilinear) function. As stated earlier, the function
does not remember the past history of the network. The advantage of this type of learning
is its simplicity: synchronization is achieved more easily. The disadvantage, as stated
above, is the lack of memory. It is widely believed that the dynamics of synaptic weights
are as important as the dynamics of the neurons themselves and that weights with memory
can convey huge amounts of information.
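The memoryless rule of Equation 4.14 can be sketched as follows for the exponential choice
of f. The exact form w_max * exp(-lam * |I_i - I_j|) and the parameter names w_max and lam
are assumptions made for this illustration; the text only states that f can be exponential
or fractional.

```python
import math

def memoryless_weights(inputs, w_max=1.0, lam=1.0):
    """Return the symmetric weight matrix w[i][j] = f(|I_i - I_j|).

    f is assumed here to be w_max * exp(-lam * d); both parameters are
    hypothetical. The weights are recomputed from the current inputs only,
    so no past input or past neuron state is used.
    """
    n = len(inputs)
    return [
        [w_max * math.exp(-lam * abs(inputs[i] - inputs[j])) for j in range(n)]
        for i in range(n)
    ]

# Two neurons with equal inputs are strongly coupled; a third one with a
# different input is coupled more weakly to both.
weights = memoryless_weights([0.2, 0.2, 0.9])
```

Since the matrix is rebuilt from the inputs at every instant, a change in the input is
reflected in the weights immediately, which is the simplicity referred to above.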
4.6.2 Hebbian Learning
In 1949, Donald Hebb predicted a form of synaptic plasticity driven by temporal contiguity
of pre- and postsynaptic activity. This prediction was verified decades later with the
discovery of long-term potentiation, securing Hebb’s place in the scientific pantheon. The
Hebb postulate is as follows: “When an axon of cell A is near enough to excite a cell B and
repeatedly or persistently takes part in firing it, some growth process or metabolic change
takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is
increased.” In mathematical formalism, the postulate can be formulated as:
$$\Delta w(i, j; t) = \alpha\, x_i(t)\, x_j(t) \qquad (4.15)$$
where α is the learning factor, and x_i and x_j are the outputs of neurons i and j respectively.
More recent research has shown that the order in which the pre- and postsynaptic spikes are
generated affects the evolution of the synaptic weights. This can be seen as an enhanced
Hebbian rule. This more precise learning rule is called STDP (Spike-Timing-Dependent
Synaptic Plasticity). Depending on the relative time of arrival of the spikes, the synapse
undergoes LTD (Long-Term Depression) or LTP (Long-Term Potentiation). More precisely,
if the postsynaptic action potential is produced within a 10-ms interval after the presynaptic
spike, LTP is generated, while LTD is produced if the order of arrival is
reversed [107].
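A minimal sketch of Equation 4.15 and of the spike-order criterion described above is given
below. The function names are hypothetical, the diagonal (self-connections) is left untouched
by assumption, and the 10-ms window is also applied to the reversed order, which the text
does not state explicitly.

```python
def hebbian_step(w, x, alpha=0.1):
    """Apply one Hebbian update, w[i][j] += alpha * x[i] * x[j] (Eq. 4.15).

    Self-connections (i == j) are left unchanged; this is an assumption.
    A new matrix is returned and the input matrix is not modified.
    """
    n = len(x)
    return [
        [w[i][j] + (alpha * x[i] * x[j] if i != j else 0.0) for j in range(n)]
        for i in range(n)
    ]

def stdp_kind(t_pre, t_post, window=0.010):
    """Return "LTP", "LTD" or None from the spike order (times in seconds).

    LTP: the postsynaptic spike follows the presynaptic one within `window`.
    LTD: the order is reversed (the same window is assumed here).
    """
    delay = t_post - t_pre
    if 0.0 < delay <= window:
        return "LTP"
    if -window <= delay < 0.0:
        return "LTD"
    return None
```

For example, stdp_kind(0.100, 0.105) corresponds to a postsynaptic spike 5 ms after the
presynaptic one and therefore to potentiation.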
In general, a ’local’ learning rule governing the modification of the synapses has to be
evaluated according to several global measures [108]:
• All possible stimuli should specifically activate some neurons in the network, i.e.
the union of all receptive fields should cover the stimulus space.
• Rules of synaptic plasticity should allow quick learning. Performance of biological
systems indicates extremely fast performance, reaching one-shot learning in extreme
cases [109].
• The system should allow ongoing learning and be stable simultaneously.
• A learning rule should be compatible with known physiological properties of cortical
neurons [108].
Based on the facts stated above and on the neurophysiological observations of LTP and LTD,
Körding and König proposed an enhanced Hebbian rule [108]. They suppose that
the postsynaptic action potential propagates back from the output to the input, giving rise
to the backpropagating action potential [110].
• If the output potential coincides with the input potential and there is no inhibitory
input in a 3 ms interval following the action potential, the synaptic weight is increased
using the following formula:
$$\Delta\omega_{LTP} = \frac{\alpha_{LTP}\,\tau}{|\tau + \Delta t|} \qquad (4.16)$$
where ∆t is the time difference between the input spike and the output spike, and
τ = 5 ms.
• If the output action potential coincides with the input action potential and there
is an inhibitory input in a 3 ms interval following the spike, the synaptic weight is
decreased using the following formula:
$$\Delta\omega_{LTD} = \frac{\alpha_{LTD}\,\tau}{|\tau + \Delta t|} \qquad (4.17)$$
• For stability purposes a damping term is added to the learning rule:
$$\Delta\omega_{Norm} = -\alpha_{Norm}\,\omega - \alpha_{Decay} \qquad (4.18)$$
where α_Norm and α_Decay are constants.
The synaptic weight ω is given by:
$$\omega(t + \Delta t) = \omega(t) + \Delta\omega_{LTD/LTP} + \Delta\omega_{Norm} \qquad (4.19)$$
where ∆ω_LTD/LTP means either ∆ω_LTD or ∆ω_LTP depending on the situation.
• The synaptic weights remain non-negative:
$$\omega = \max(\omega + \Delta\omega_{Norm} + \Delta\omega_{LTP/LTD},\, 0) \qquad (4.20)$$
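Equations 4.16 to 4.20 can be combined into a single update step, sketched below. All
numerical values of the constants are placeholders chosen for illustration, and α_LTD is
assumed to be negative so that Equation 4.17 decreases the weight, as the text requires;
neither choice comes from [108].

```python
def synaptic_update(w, dt_spike, inhibited,
                    a_ltp=0.01, a_ltd=-0.01,
                    a_norm=0.001, a_decay=0.0001, tau=5e-3):
    """Return the new weight after one coincidence of input and output spikes.

    dt_spike  : time difference between input and output spike, in seconds
    inhibited : True if an inhibitory input arrived in the following 3 ms
    a_ltd     : assumed negative, so that Eq. 4.17 lowers the weight
    Note: dt_spike == -tau makes the denominator zero and raises an error.
    """
    a = a_ltd if inhibited else a_ltp
    dw_plastic = a * tau / abs(tau + dt_spike)   # Eq. 4.16 or Eq. 4.17
    dw_norm = -a_norm * w - a_decay              # Eq. 4.18 (damping term)
    return max(w + dw_norm + dw_plastic, 0.0)    # Eq. 4.19 with Eq. 4.20
```

With these placeholder constants, a weight of 0.5 and coincident spikes (dt_spike = 0), the
weight rises slightly without inhibition and falls slightly with inhibition, while a weight
of 0 stays clamped at 0 in the inhibited case.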
Although we have described some learning rules with memory above, for the time being it is
difficult to implement such algorithms in our simple ’temporal correlation’ architecture.
The above-mentioned Hebbian-like algorithms have been implemented in our library but
have not been used, because of computational complexity and synchronization problems.
It is much easier to attain synchronization and to normalize weights in the memoryless
learning paradigm.
So far, the dynamics of single neurons and cell assemblies, along with learning algorithms
for our specific task of ’temporal correlation’, have been studied. In the next sections,
we will compare two architectures that let us achieve ’temporal correlation’
through synchronization. The selected framework will be adapted to our auditory and
visual problems in Chapters 5 and 6.
4.7 Implementational aspects of ’Temporal Correlation’
In this section, some implementational aspects of architectures that use ’temporal
correlation’ are enumerated.
• As will be seen in Section 4.8, most architectures proposed for ’temporal correlation’
use a local connection strategy, in which each neuron is connected only to a
few neighboring neurons. This means that a neuron does not need the values of
all outputs at instant t to update its state and output at instant t + 1. Hence, the
network can be implemented on a parallel architecture.
• As stated in Chapter 3, ’temporal correlation’ enables us to represent different
objects simultaneously, using the synchronization and desynchronization between
regions.
• Each single neuron in the network corresponds to a pixel of the image (either visual
scene or auditory scene). Therefore any change in the input image has an
online and automatic impact on the behavior of the network. This is in contrast
with some classical neural networks (e.g., the Hopfield network 8 [111]), in which a
change in the input does not mean an instantaneous and automatic change in the
dynamics of the network.
In order to design an architecture that enables us to implement ’temporal correlation’,
we must implement neural synchronization and neural desynchronization.
4.8 Architectures for ’temporal correlation’
Neural synchrony is a local aspect, while neural desynchrony is a global aspect of a network.
Any proposed architecture should be able to handle both of these modes at the same
time. Some networks use long-range, fully connected synapses. In this approach each neuron
is connected to all other neurons [21, 86]. A fully connected network cannot extract
local behaviors, because of the long-range connections. In this case, it is difficult to achieve
desynchrony with only excitatory neurons (for details on obtaining desynchrony using an
architecture of mixed inhibitory and excitatory neurons see [86, 112]). Some more complicated
architectures use modulated synapses and two types of connections: long-range
and short-range [113]. Although this latter approach is biologically motivated,
it is computationally very expensive. A third approach consists of using only short-range
synapses and letting a global controller perform the desynchronization. In this paradigm,
each neuron on the map is connected only to a few neighboring neurons. The LEGION
(Locally Excitatory Globally Inhibitory Oscillatory Network) and the Attentional
Oscillatory Neural Network (AONN) explained below follow this last approach.
8 A kind of neural network investigated by John Hopfield in the early 1980s. The Hopfield network
has no special input or output neurons, but all are both input and output, and all are connected to all
others in both directions (with equal weights in the two directions). Input is applied simultaneously to
all neurons which then output to each other and the process continues until a stable state is reached,
which represents the network output.
4.8.1 LEGION: Locally Excitatory Globally Inhibitory Oscillatory Network
The underlying dynamics of a LEGION can be integrate-and-fire neurons [102], Van
der Pol relaxation oscillators [3, 103, 114, 115, 116, 117, 118], or chaotic oscillators [91]
[82, 119], but the general framework remains the same in all cases: neuronal elements
are connected in a neighborhood of 4 or 8. Desynchrony is guaranteed via a global
inhibitor neuron that is connected to all the excitatory neurons. Depending on the
dynamics used (integrate-and-fire, Van der Pol, etc.), a mapping between the real image
and the inputs to the neuronal map adjusts the dynamic range, maximum/minimum input
values, etc. For integrate-and-fire neurons, synchronization means ’same-time impulses’
and desynchronization means ’different-in-time impulses’. In the Van der Pol oscillator
case, the outputs are analog (in contrast with the integrate-and-fire dynamics, in which
the output of a neuron is a discrete impulse train). Therefore, synchronization means a
phase difference equal to zero and desynchronization means a nonzero phase difference.
For the chaotic neuron, the output is non-stationary and non-ergodic; therefore
mathematical criteria of synchrony should be defined [91] [82]. These criteria are based
on the trajectory similarity between two neurons. Figure 4.8 shows the architecture of a
LEGION network.
• Integrate-and-fire LEGION. The building block of the integrate-and-fire LEGION
is the I&F neuron described in Equation 4.11. The global inhibitor, G(t),
sends an instantaneous inhibitory pulse to the entire network when any oscillator
in the network fires. It is defined as:
$$G(t) = \Gamma\,\delta(t - t_j^m) \quad \forall j, m \qquad (4.21)$$
where t_j^m represents the m-th firing time of the j-th neuron. The constant Γ is less than
the smallest coupling strength between neighboring oscillators.
• Van der Pol LEGION. The building blocks of the Van der Pol LEGION are the
Wang-Terman oscillators defined in Equations 4.7. The global controller is defined
as:
$$G(t) = \alpha\,H(z - \theta) \qquad (4.22)$$
$$\frac{dz}{dt} = \sigma - \xi z \qquad (4.23)$$
where σ is equal to 1 if the global activity of the network is greater than a predefined
threshold ζ and is zero otherwise.
• Chaotic LEGION. The building block of the chaotic LEGION is the chaotic
map defined in Equations 4.12 and 4.13. No global controller is used in this case,
since chaotic behavior means that a very small difference in the initial values of
the neurons creates big differences at infinity. Desynchrony is thus achieved implicitly in
the network, without any explicit global controller (or, let us say, with an implicit global
controller, in order to conform with the other types of LEGION).
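The global controller of the Van der Pol LEGION (Equations 4.22 and 4.23) can be sketched
with a forward-Euler integration, as below. H is assumed to be the Heaviside step function,
the global activity is assumed to be supplied as one number per time step, and all parameter
values are placeholders; none of these choices is taken from the cited references.

```python
def simulate_global_inhibitor(activity, zeta, alpha=1.0, theta=0.5,
                              xi=1.0, dt=0.01):
    """Integrate dz/dt = sigma - xi * z and return G(t) = alpha * H(z - theta).

    activity : sequence with the global activity of the network at each step
               (how this activity is measured is left open here)
    zeta     : threshold on the global activity that switches sigma to 1
    """
    z = 0.0
    g_trace = []
    for a in activity:
        sigma = 1.0 if a > zeta else 0.0
        z += dt * (sigma - xi * z)                    # Euler step of Eq. 4.23
        g_trace.append(alpha if z > theta else 0.0)   # Eq. 4.22, H as a step
    return g_trace

# The network is active for 200 steps and silent for the next 200 steps.
g = simulate_global_inhibitor([1.0] * 200 + [0.0] * 200, zeta=0.5)
```

In this example the inhibitor switches on some time after the activity starts and switches
off again once z has decayed below θ.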
4.8.2 Attentional Oscillatory Neural Network (AONN)
The schematic of this architecture is shown in Figure 4.9 [1]. The Primary Layer (PL)
receives the information about the input image and performs
an early stage of information processing. At this stage, the primary features of the objects
composing the image, such as color, brightness, contrast, orientation, local shape, etc., are
extracted. The attention focus is also formed in the PL, which means that for some time one
object is selected from the image to be transmitted to the higher layers of processing for
further analysis (recognition, memorization, novelty detection, etc.). Object selection is
made on the basis of the features and of additional information about the image context.
The context is important because it allows the system to determine which features and
which object are more salient at the current moment. For example, a controlling context
signal such as “black vertical bar” biases the equilibrium of the attention system in such
a way that a black vertical bar has the highest priority to be selected in the attention
focus. In this approach, the attentional system is represented by an oscillatory neural
network (ONN) with a central oscillator (CO). The CO plays the role of the central
executive of the attention system, as suggested in [120]. The ONN with a CO has a star-like
architecture of connections, where global interactions between the so-called peripheral
oscillators (POs) (the elements of the ONN other than the CO) are implemented through
forward and backward connections with the CO. The model of attention focus formation and
control consists of a CO and many POs. There are forward and backward connections between
the CO and the POs, which are characterized by both a connection strength and a phase shift.
The dynamics of the system is described by the equations:
$$\frac{d\theta_0}{dt} = \omega_0 + \frac{A}{n}\sum_{i=1}^{n} \sin(\theta_i - \theta_0 + \gamma) \qquad (4.24)$$
$$\frac{d\theta_i}{dt} = \omega_i + B\sin(\theta_0 - \theta_i), \quad i = 1, 2, \ldots, n \qquad (4.25)$$
where θ_i are the oscillator phases, ω_i are the natural frequencies of the oscillators, A and
B are coupling strengths, γ is a phase shift, and dθ_i/dt describes the current frequencies
of the oscillators. Equation 4.24 describes the dynamics of the CO and Equation 4.25
describes the dynamics of the POs. The focus of attention is formed by those POs which work
synchronously with the CO. Three types of synchronization can appear:
• Full synchronization: all POs work synchronously with the CO (that is with the
same current frequency for all oscillators).
• Partial synchronization: some POs work nearly synchronously with
the CO but the others are out of synchronization.
• No synchronization: all oscillators have different current frequencies.
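The three regimes listed above can be explored numerically with a forward-Euler integration
of Equations 4.24 and 4.25, sketched below. The initial phases, the step size, the number of
steps and the tolerance used to decide that a PO is in the focus of attention are all
assumptions made for illustration.

```python
import math

def aonn_frequencies(theta0, thetas, omega0, omegas, A, B, gamma):
    """Current frequencies of the CO (Eq. 4.24) and of the POs (Eq. 4.25)."""
    n = len(thetas)
    f0 = omega0 + (A / n) * sum(math.sin(t - theta0 + gamma) for t in thetas)
    fs = [w + B * math.sin(theta0 - t) for w, t in zip(omegas, thetas)]
    return f0, fs

def simulate_aonn(omega0, omegas, A=1.0, B=1.0, gamma=0.0,
                  dt=0.001, steps=20000):
    """Integrate the phases with Euler steps; return the final frequencies."""
    theta0 = 0.0
    thetas = [0.1 * i for i in range(len(omegas))]  # assumed initial phases
    for _ in range(steps):
        f0, fs = aonn_frequencies(theta0, thetas, omega0, omegas, A, B, gamma)
        theta0 += dt * f0
        thetas = [t + dt * f for t, f in zip(thetas, fs)]
    return aonn_frequencies(theta0, thetas, omega0, omegas, A, B, gamma)

def focus_of_attention(f0, fs, tol=1e-3):
    """Flag the POs whose current frequency matches that of the CO."""
    return [abs(f - f0) < tol for f in fs]
```

With identical natural frequencies, for instance simulate_aonn(5.0, [5.0, 5.0, 5.0]), all POs
lock to the CO (full synchronization). A PO whose natural frequency differs from that of the
CO by more than the coupling strengths can compensate, as in simulate_aonn(5.0, [5.0, 5.0, 20.0]),
stays outside the focus of attention.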
In the remainder of this thesis, only the LEGION network is considered, leaving the use
of the AONN network for future work.
4.9 Conclusion
The mathematical fundamentals of bio-inspired neurons have been laid down in this chapter.
We pointed out that some more precise models, like the Hodgkin-Huxley model, are
inadequate for fast implementation. Therefore, simpler models have been introduced,
among which the Wang-Terman oscillator is used and explained further in Chapters
5 and 6. In addition, a general framework of synchronization has been described, which
can be used to perform ’temporal correlation’. In the next two chapters, we will see how the
theoretical concepts introduced in the first three chapters can be used to solve real-life
problems like sound source separation and object recognition.
[Figure 4.3, image not reproduced. Panels: (A) tonic spiking, (B) phasic spiking, (C) tonic
bursting, (D) phasic bursting, (E) mixed mode, (F) spike frequency adaptation, (G) class 1
excitable, (H) class 2 excitable, (I) spike latency, (J) subthreshold oscillations,
(K) resonator, (L) integrator, (M) rebound spike, (N) rebound burst, (O) threshold
variability, (P) bistability, (Q) depolarization after-potential (DAP), (R) accommodation,
(S) inhibition-induced spiking, (T) inhibition-induced bursting. Each panel shows the response
to an input dc-current; scale bar: 20 ms.]
Figure 4.3: Different excitation modes seen in real biological neurons (adapted from [88]).
[Figure 4.4, image not reproduced. A table marks, for each model (integrate-and-fire,
integrate-and-fire with adaptation, integrate-and-fire-or-burst, resonate-and-fire, quadratic
integrate-and-fire, Izhikevich (2003), FitzHugh-Nagumo, Hindmarsh-Rose, Morris-Lecar, Wilson,
Hodgkin-Huxley), whether it is biophysically meaningful and which of the following features
it can reproduce: tonic spiking, phasic spiking, tonic bursting, phasic bursting, mixed mode,
spike frequency adaptation, class 1 excitable, class 2 excitable, spike latency, subthreshold
oscillations, resonator, integrator, rebound spike, rebound burst, threshold variability,
bistability, DAP, accommodation, inhibition-induced spiking, inhibition-induced bursting and
chaos. A companion plot places the same models on two axes: biological plausibility (# of
features, from poor to good) versus implementation cost (# of steps, from efficient to
prohibitive).]
Figure 4.4: Comparison of different neural models (adapted from [87]). “Biological
plausibility” is the number of characteristics (e.g., tonic bursting, phasic bursting, etc.) that
can be reproduced by the model. The number of flops is an approximate number of
floating point operations (addition, multiplication, etc.) needed to simulate the model
during a 1 ms time span. The author of [87] left the field blank when the verification of a
characteristic had been impossible. Some of the models not described here can be found
in [87].
[Figure 4.5, image not reproduced. Panels A and B are drawn in the (x, y) state space; panel B
marks the points p, q, r and s on the trajectory.]
Figure 4.5: A) A nullcline of the Wang-Terman equation. A neuron with initial values
outside the nullcline tends to converge to the nullcline and continue on that curve. B) The
trajectory of a spiking relaxation oscillatory neuron in the state-space. C) The output of
a relaxation oscillator (spikes) (adapted from [101]).
Figure 4.6: SIMULINK model of the “integrate-and-fire” neuron.
Figure 4.7: Temporal correlation. The initial binary image is applied to a LEGION
network. After some iterations, each letter pops up with a different synchronization phase.
a) the initial image; b) the initial states of the neurons; c-f) synchronized regions. Each of
the disconnected regions (letters) synchronizes with a different synchronization phase. The
activity of the inhibitor is shown at the bottom [103].
[Figure 4.8, image not reproduced. Labeled element: Global Controller.]
Figure 4.8: The architecture of the LEGION. Each of the circles represents a neuron. The
dynamics of such neurons can be either integrate-and-fire, Wang-Terman, or chaotic. The
global inhibitor is indicated by the black circle. The global controller (inhibitor) is used
in the integrate-and-fire and Wang-Terman cases but not in the chaotic case.
[Figure 4.9, image not reproduced. Labeled elements: Input image, Primary Layer, Central
Oscillator, Focus of attention, CONTEXT, object, Higher layers processing.]
Figure 4.9: The architecture of the AONN network. The input image contains three
objects. In the Primary Layer an object in the focus of attention is painted in black,
other activated regions are painted in gray (adapted from [99]). The dynamics of the
Central Oscillator (CO) and the Peripheral Oscillators (PO) are given in Equations 4.24
and 4.25 respectively.
CHAPTER 5
SOURCE SEPARATION BY BIO-INSPIRED
NEURAL NETWORKS
5.1 Introduction
In this chapter we propose a spiking-neural-network approach to monaural sound source
separation. We also compare our approach to other approaches found in the literature
and discuss the pros and cons of each of them.
5.2 Source separation
Source separation of mixed signals is an important problem with many applications in
the context of audio processing. It can be used to assist a robot in segregating multiple
speakers, to ease the automatic transcription of video via the audio tracks, to separate
musical instruments before automatic transcription, to clean up the signals before
performing speech recognition, etc. (see Chapter 2). When several channels are available,
very good separation can be obtained [121] [122] [123] [124] [125]. But, very often, only one
channel is available to the audio engineer, who still has to solve the separation problem.
Monaural (one-microphone) sound source separation is nowadays a very
challenging problem in the speech processing field.
Most monophonic source separation systems are based on expert systems [2] (explicit
knowledge), on statistical approaches [5] [33] (implicit knowledge),
or on bio-inspired approaches [32] [3]. Jang and Lee [5], and Roweis [33] have proposed
extensions of data-driven methods to the problem of monophonic source separation. Wang
and Brown [3] have proposed an original approach that uses features obtained from
correlograms and F0 (pitch frequency) in combination with an oscillatory neural network
(Chapter 4) (for more details on these different approaches see Chapter 2).
In the sound separation technique proposed in this thesis, we integrate physiology,
psychoacoustics and signal processing to design an intelligent system that performs the
separation of multiple sources when only one recorded channel is available. The presented
approach is a first step towards the realization of a robust speech recognizer.
Compared with conventional approaches, our system does not require any knowledge of
the underlying signal. It neither needs a priori knowledge of the underlying sources, nor
does it estimate F0 or compute the computationally expensive correlograms. Computing
F0 is not always as simple as it may appear, especially in noisy environments, and it limits
the system to speech sound separation, ignoring the general case of audio sound separation
(note that pitch exists only in some parts of speech and not in other types of speech and
sounds). Correlograms are three-dimensional plots, as shown in Figure 2.8 (page 33), that
contain correlations computed at all time delays. Note that computing them is a very
computationally intensive task.
We compare the performance of our system with that of the systems from [3] and [35]. In
the latter work, Hu and Wang have shown strong improvements in comparison to [3] when
the separation uses more conventional cues such as pitch. We believe that the integration
of conventional cues should in fact improve performance, but for this thesis, our goal is
to push the neural solution to its limits.
Our proposed architecture does not perform any segmentation of the sound file into frames.
This is in contrast with other approaches like the one proposed in [3]. It is based on the
availability of simultaneous auditory representations of signals. It is fully autonomous
and does not require any training (in contrast with the statistical approaches in [34, 33,
47, 126]). There is no training or recognition phase in the proposed neural network. To
our knowledge, it is one of the first architectures that make use of fully dynamic synapses.
The approach used in this thesis uses many of the psychological (Gestalt) rules introduced
in Chapter 2. For example:
• Mutual exclusivity is used to assign time-frequency bins to sources. Each
time-frequency bin is assigned to one of the sources, and as soon as it is assigned to
a source it cannot belong to any other one. This way of thinking leads to the
generation of binary masks, as we will see later.
• Proximity is guaranteed through the connectedness of the neural architecture. Since
our network has local connections between neighboring neurons, it implicitly
integrates the Gestalt proximity rule.
• Good continuation is to some extent guaranteed by the dynamics of the neurons.
Neurons and synapses have memory in our architecture. This means that they do
not allow abrupt changes in the output of the sound source separation algorithm through time.
• Closure is implemented through local connectivity. This phenomenon can be seen
very easily in oscillatory neural networks used for image segmentation [103].
• Common fate in oscillatory neural networks can be seen in motion detection tasks
like the one proposed in [115].
The sound source separation technique proposed in this chapter is based on temporal
binding (Chapter 3). In the present work, we implement the temporal correlation as
introduced by Malsburg [21] and Milner [66] to bind auditory image objects (see Section
5.3).
We are aware that association between patterns in the auditory image could be based on
direct computation of cross-correlations. This solution would lead us to include delay lines
in our network, which we are not interested in for now. The advantage of the proposed
implementation of temporal binding and temporal correlation resides in its autonomy and
in the fact that no delay lines have to be created in the network of neurons [127]. In the
next section, some of the auditory-based feature recognizers found in the literature are
discussed.
5.3 Proposed system strategy
Figure 5.1 shows the block diagram of the proposed sound separation algorithm. The
sound, which may contain many sources, goes through the analysis filterbank and many
outputs (channels) are generated. The CAM/CSM representation is then generated. Our
proposed neural network performs ‘temporal correlation’ (see Chapters 3 and 4) and
generates a binary mask that is applied to the channels of the filterbank. In other words,
neurons associated with filterbank channels that belong to the
same sound source synchronize, while the other neurons synchronize at a different phase.
The schematic behavior of such a network is shown in Figure 5.2. The figure is a 3-D
plot, in which time, channels, and spike height are put on the three axes. As
can be seen, neurons belonging to different sound sources synchronize at different phases.
The synthesis filterbank synthesizes the masked sound, which is an approximation of one of
the original sound sources. In the next section each of the building blocks is studied
separately.
5.4 Description of the source separation system
5.4.1 The choice of the cochlear filterbank
The proposed method, discussed further in the following sections, makes it possible to
resynthesize the audio signal of a single sound source from a mixture of sources.
Generally speaking, this is achieved using a time-varying filter. The pathway of the
audio signal consists of a non-decimated, static analysis filterbank, the time-varying
mask, and a static synthesis filterbank.
We use an FIR implementation of the well-known gammatone filterbank (see footnote 1) by
Patterson et al. [128] (see Appendix F) as the analysis filterbank [129]. The number of
channels is 256, with center frequencies from 100 Hz to 3600 Hz uniformly spaced on an
ERB rate scale. The sampling rate is 8 kHz.

Footnote 1: The adaptation of the analysis/synthesis filterbank has been done jointly
with C. Feldbauer and G. Kubin from Graz University of Technology.

[Figure 5.1 block diagram. Blocks: Sound Mixture; FIR Gammatone Analysis Filterbank
(256 channels); Envelope Detection; CAM Generation; CSM Generation; Spiking Neural
Network; Neural Synchrony; Mask Generation; FIR Gammatone Synthesis Filterbank
(256 channels); Output.]
Figure 5.1: The proposed source separation system
The actual time-varying filtering is done by the mask. Once this mask is obtained by
grouping synchronous oscillators of the neural net (see section 5.4.4), it is multiplied with
the output of the analysis filterbank. Thus, auditory channels belonging to interfering
sound sources are muted and channels belonging to the sound source of interest remain
unaffected.
Figure 5.2: 3-D plot of the output of the proposed neural network. The evolution of the
output of our proposed neural network is shown through time. Neurons associated to
channels belonging to the same source synchronize.

Before the signals of the masked auditory channels are added to form the synthesized
signal, they are passed through the synthesis filters, whose impulse responses are
time-reversed versions of the impulse responses of the corresponding analysis filters.
This means that the magnitude of the frequency response of a synthesis filter is the
same as that of the analysis filter in the same channel. The convolution with the
time-reversed impulse responses linearizes the phase responses and, if the impulse
responses of all filters have the same length (see footnote 2) and, therefore, the same
total group delay in all channels, summation yields a phase-distortion-free result. For
a low number of channels, the only distortion of the pair of analysis and synthesis
filterbanks would be a minor magnitude ripple in the overall frequency response. For the
high number of channels used in our system, this ripple is negligible.
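The linear-phase property of the analysis/synthesis pair can be checked numerically. The sketch below is only an illustration under stated assumptions, not the filterbank used in the thesis: `gammatone_fir` and `analysis_synthesis_cascade` are hypothetical helpers, the filter order (4), the ERB approximation, and the impulse response length (256 taps) are assumed values.

```python
import numpy as np

def gammatone_fir(fc, fs=8000.0, length=256, order=4):
    """Truncated gammatone impulse response (hypothetical helper, assumed parameters)."""
    t = np.arange(length) / fs
    erb = 24.7 + 0.108 * fc   # assumed ERB approximation (Glasberg and Moore)
    b = 1.019 * erb           # assumed bandwidth scaling
    h = t ** (order - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
    return h / np.max(np.abs(h))

def analysis_synthesis_cascade(h):
    """Impulse response of an analysis filter followed by its time-reversed copy."""
    return np.convolve(h, h[::-1])

h = gammatone_fir(1000.0)
g = analysis_synthesis_cascade(h)
# g is the autocorrelation of h: it is symmetric about its center tap, so the
# cascade has linear phase with a delay of len(h) - 1 samples. If all channels
# share the same length, the delay is identical in every channel.
center = len(h) - 1
```

Because every channel is delayed by the same number of samples, summing the channels does not introduce phase distortion, which is the argument made in the paragraph above.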
This non-decimated FIR analysis/synthesis filterbank was proposed by Irino and Unoki
[130] and also used in the perceptual speech coder in [131] (in the latter with 20 channels
only).
We had also used the IIR gammatone filterbank proposed in [132] for signal analysis and,
for synthesis, we had simply summed up all modified channel signals after applying the
mask. This use of IIR filters had resulted in phase distortions and an overall reduced
signal reconstruction quality. In addition, as stated earlier, since the CAM/CSM takes
into account only magnitude information, it cannot guarantee a good separation when
nonlinear-phase IIR filterbanks are used. The use of the FIR implementation of the
gammatone cochlear filterbank in the present work has allowed us to overcome this
problem, with a significant increase in reconstruction quality. More quantitative and
qualitative comparisons between results obtained by the IIR implementation and the FIR
implementation will be given in the following sections.

Footnote 2: Shorter gammatones of higher-frequency channels need zero padding.
In the next section, the signal analysis part of the algorithm (from the raw data from the
sound mixture to the input of the neural network) is detailed.
5.4.2 Signal analysis
Our CAM/CSM generation algorithm is as follows.
1. Down-sample the signal to 8000 samples/s.
2. Filter the sound source using a 256-filter Bark-scaled cochlear filterbank ranging
from 100 Hz to 3.6 kHz.
3. • For CAM: Extract the envelope (AM demodulation) for channels 30–256; for
the other, low-frequency channels (1–29), use the raw outputs [133].
• For CSM: Nothing is done in this step.
4. Compute the STFT using a Hamming window (4 ms to 32 ms depending on the
nature of the sound).
5. In order to increase the spectro-temporal resolution of the STFT, find the reassigned
spectrum of the STFT [134] (this consists of applying an affine transform to the
points in order to relocate the spectrum, see Figure 5.4). The reassigned spectrum
as proposed in [134] is for continuous-time signals. For our purpose, the values of
the reassigned ω and t are rounded to the nearest discrete values.
6. Compute the logarithm of the magnitude of the STFT (see footnote 3). The logarithm
enhances the presence of the stronger source in a given 2D frequency bin of the
CAM/CSM (see footnote 4).
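The steps above can be sketched for a single cochlear channel as follows. This is a simplified illustration under stated assumptions rather than the thesis implementation: the spectral reassignment of step 5 is omitted, the envelope is taken from the analytic signal, and the function names, the 10 ms hop, and the small constant `eps` that avoids log(0) are hypothetical choices.

```python
import numpy as np

def envelope(x):
    """AM demodulation via the analytic signal (FFT-based Hilbert transform)."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))

def log_stft(x, fs=8000, win_ms=32.0, hop_ms=10.0, eps=1e-6):
    """Log magnitude of a Hamming-windowed STFT (steps 4 and 6, no reassignment)."""
    win_len = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + eps)

def channel_map(channel_signal, use_envelope, **kwargs):
    """Slice of the CAM (use_envelope=True) or CSM (use_envelope=False) for one channel."""
    x = envelope(channel_signal) if use_envelope else channel_signal
    return log_stft(x, **kwargs)

# Example: a 2 kHz carrier modulated at 100 Hz, as a stand-in for one channel output.
fs = 8000
t = np.arange(fs) / fs
x = (1.0 + 0.5 * np.cos(2 * np.pi * 100 * t)) * np.cos(2 * np.pi * 2000 * t)
cam = channel_map(x, use_envelope=True)    # energy near the 100 Hz modulation rate
csm = channel_map(x, use_envelope=False)   # energy near the 2 kHz carrier
```

In the example, the CAM slice exposes the modulation rate while the CSM slice exposes the carrier, which mirrors the distinction between the two representations.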
[Figure 5.3 plot; axes: Channel Number (0 to 20) and Frequency (700 Hz mark); labeled
regions: S1 Source, S2 Source.]
Figure 5.3: CAM for the female /di/ and male /da/ mixture at SNR = 0 dB and t = 166
ms when the channel number is equal to 24. The separation of the two sources can be
done based on ray distances.
We suppose here that envelope detection and selection between the CAM and the CSM, in
the auditory pathway, could be associated to the change of stiffness of hair cells combined
with cochlear nucleus processing [135] [136] (see Figure 5.8) . For now, in the present
experimental setup, selection between the two auditory images is done manually.
The reader may ask what theoretical justification there is for using such
representations in sound source separation. This question is addressed in the next
subsection.

Footnote 3: A moving window is applied to the signal, and the Fourier transform is
applied to the portion of the signal within the window as the window is moved.
Footnote 4: $\log(e_1 + e_2) \simeq \max(\log e_1, \log e_2)$ (unless $e_1$ and $e_2$
are both large and almost equal) [33].
Figure 5.4: Schematic representation of the signal processing steps required to compute
the reassigned spectrum. Each analysis frame is windowed and Fourier transformed three
times: using the original window (h), the derivative of the window (dh), and the product
of the window and time (th). “mult” means multiplication. “cplx mutl” is the complex
multiplication. “div” corresponds to division and FT computes the Fourier Transform.
(adapted from [134]).
5.4.3 Theoretical motivation behind the CAM/CSM generation
Let us suppose that two persons are uttering sentences at different pitches (Figure 5.9,
page 108). In a simplified scheme, we can assume that the harmonics of the fundamental
frequency (pitch) are convolved with the impulse response of the vocal tract (whose
resonance frequencies we call formants). The effect of these formants is to amplify some
of the harmonics and to attenuate others.
We suppose that each cochlear channel is dominated by one of the sources. Let us
therefore say that in channel $n$ there are a few harmonic components at multiples of
$F_{01}$ and that in channel $m$ there are harmonic components centered around multiples
of $F_{02}$. There are also resonance frequencies (formants), denoted $Fr_{i,j}$: each
$Fr_{i,j}$ is the $j$th resonance frequency (formant) of source $i$.
[Figure 5.5 plot; channel axis ticks 5 to 20; labeled region: Siren (Noise).]
Figure 5.5: CSM (24-channel) of the mixture of /di/ and the siren in Equation 5.23 at
t=50 ms. Segregation is based on the selection of energy bursts.
After AM demodulation, the frequencies $F_{01}$, $F_{02}$, and $Fr_{i,j}$ are translated
to new frequencies according to the following formulae:
$$fr_{i,j} = Fr_{i,j} - f_c(n) \tag{5.1}$$
$$f_{01} = F_{01} - f_c(n) \tag{5.2}$$
$$f_{02} = F_{02} - f_c(m) \tag{5.3}$$
where $f_c(n)$ is the central frequency of cochlear channel $n$. For each sound source
(in their respective channels of dominance $n$ and $m$), the simplified representation
of the utterance signal (after AM demodulation) is given by:
$$S_1(f) = W(fr_{1,1}, fr_{1,2}, fr_{1,3}, \ldots) \sum_n \delta(n f_{01}) \tag{5.4}$$
$$S_2(f) = W(fr_{2,1}, fr_{2,2}, fr_{2,3}, \ldots) \sum_n \delta(n f_{02}) \tag{5.5}$$
[Figure 5.6 plot; axes: Channel Number (0 to 20) and Frequency (700 Hz mark); labeled
regions: S1 Source, S2 Source.]
Figure 5.6: CAM (24-channel) for the /di/ /da/ mixture. Segregation is based on
harmonic selection.
$W(\cdot)$ is the windowing function caused by the resonances $fr_{i,j}$, defined as
follows:
$$W(fr_{i,1}, fr_{i,2}, fr_{i,3}, \ldots) = A \sum_j \Pi(fr_{i,j}, \Delta) \tag{5.6}$$
where $\Pi(fr_{i,j}, \Delta)$ is a rectangular window of width $\Delta$ centered on
frequency $fr_{i,j}$, and $A$ is a constant with $A \gg 1$.
The effect of the windowing function W (.) is that some of the harmonic multiples nf01
or nf02 will have greater amplitudes and some others will have smaller amplitudes. In
order to minimize the effect of the resonance frequencies the logarithms of S1 and S2 are
computed.
Now suppose that our 2-D map is generated for our simplified two-source speech. Spectral
rays appear on the map for that channel, and the distance between rays is equal to f0i
(see Figure 5.9). What if one of the channels is not dominated by only one source, as
supposed at the beginning of this section? Although this can be a practical issue, it is
not a theoretical one, since the channel width of the filterbank can be chosen as small
as needed (by increasing the number of channels). Figure 5.9 (page 108) shows an
idealized two-speaker case.
Now that the front-end processing is completed, its result is applied to our proposed
neural network. In the next subsection, the architecture of the neural network is
discussed.

[Figure 5.7 plot; axes: Channel Number (ticks 64, 128, 192, 256) and Frequency (0 to
3600 Hz).]
Figure 5.7: CSM (24-channel) for the speech plus tone mixture. Segregation is based on
energy bursts.
5.4.4 The Neural Network
The neural network proposed in this work is based on relaxation oscillatory neurons. In
another work we have used chaotic neurons (see appendix B). Chaotic networks have
simpler dynamics, but they are more complex to analyze and their synchronization is
harder to detect (for details see appendix B).
• First layer: Auditory image segmentation. The dynamics of the neurons we use
are governed by a modified version of the Van der Pol relaxation oscillator (see
footnote 5), the Wang-Terman oscillator [3]. The state-space equations for these
dynamics are as follows:
$$\frac{dx}{dt} = 3x - x^3 + 2 - y + \rho + p + S \tag{5.7}$$
$$\frac{dy}{dt} = \varepsilon\left[\gamma\left(1 + \tanh(x/\beta)\right) - y\right] \tag{5.8}$$
Where x is the membrane potential (output) of the neuron and y is the state for
channel activation or inactivation. ρ denotes the amplitude of a Gaussian noise, p is
the external input to the neuron, and S is the coupling from other neurons (connec-
tions through synaptic weights). ε, γ, and β are constants. The Euler integration
method is used to solve the equations. The first layer is a partially connected net-
work of relaxation oscillators [3]. Each neuron is connected to its four neighbors.
The CAM (or the CSM) is applied to the input of the neurons. Our observations
have shown that the geometric interpretation of pitch (ray distance criterion) is less
clear for the first 29 channels. For this reason, we have also established long-range
connections from clear (high frequency) zones to confusion (low frequency) zones.
These connections exist only across the cochlear channel number axis of the CAM.
This architecture can help the network to better extract harmonic patterns.
The weight between neuron(i, j) and neuron(k, m) of the first layer is computed
via the following formula:
$$w_{i,j,k,m}(t) = \frac{1}{\mathrm{Card}\{N(i,j)\}} \, \frac{0.25}{e^{\lambda |p(i,j;t) - p(k,m;t)|}} \tag{5.9}$$
where p(i, j) and p(k, m) are the external inputs to neuron(i, j) and
neuron(k, m) ∈ N(i, j), respectively. Card{N(i, j)} is a normalization factor equal to
the cardinality (number of elements) of the set N(i, j) containing the neighbors
connected to neuron(i, j) (it can be equal to 4, 3 or 2 depending on the location
of the neuron on the map, i.e. center, corner, etc.). The external input values are
normalized. The value of λ depends on the dynamic range of the inputs and is set
to λ = 1 in our case. The same weight adaptation is used for the long-range
clear-to-confusion-zone connections (Equation 5.13) in the CAM processing case. The
coupling $S_{i,j}$ used in Equation 5.7 is:
$$S_{i,j}(t) = \sum_{k,m \in N(i,j)} w_{i,j,k,m}(t)\, H(x(k,m;t)) - \eta G(t) + \kappa L_{i,j}(t) \tag{5.10}$$
H(.) is the Heaviside function. The dynamics of G(t) (the global controller) are as
follows:
$$G(t) = \alpha H(z - \theta) \tag{5.11}$$
$$\frac{dz}{dt} = \sigma - \xi z \tag{5.12}$$
σ is equal to 1 if the global activity of the network is greater than a predefined ζ
and is zero otherwise. α and ξ are constants.
$L_{i,j}(t)$ is the long-range coupling, defined as follows:
$$L_{i,j}(t) = \begin{cases} 0 & j \ge 30 \\ \sum_{k=225}^{256} w_{i,j,i,k}(t)\, H(x(i,k;t)) & j < 30 \end{cases} \tag{5.13}$$
κ is a binary variable defined as follows:
$$\kappa = \begin{cases} 1 & \text{for CAM} \\ 0 & \text{for CSM} \end{cases} \tag{5.14}$$

Footnote 5: Relaxation oscillators comprise a large class of nonlinear dynamical
systems, and arise naturally from many physical systems such as mechanics, biology,
chemistry, and engineering. Such periodic phenomena are characterized by intervals of
time during which little happens, interleaved with intervals of time during which
considerable changes take place. In other words, relaxation oscillators exhibit more
than one time scale [137].
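The oscillator of Equations 5.7 and 5.8 and the weight of Equation 5.9 can be sketched as follows. This is a minimal illustration, not the thesis code: it simulates a single uncoupled neuron (S = 0), the step size `dt`, the initial state, and the function names are assumptions, and the noise term is assumed to be ρ times a standard Gaussian sample. The constants ε, γ, β, and λ are taken from Table 5.1.

```python
import numpy as np

# Constants from Table 5.1 of the source.
EPSILON, GAMMA, BETA, LAMBDA = 0.02, 4.0, 0.1, 1.0

def euler_step(x, y, p, s, dt, rho=0.0, rng=None):
    """One Euler step of Equations 5.7 and 5.8 (scalars or arrays)."""
    # Assumption: the noise term is rho times a standard Gaussian sample.
    noise = rho * rng.standard_normal(np.shape(x)) if rng is not None else 0.0
    dx = 3.0 * x - x ** 3 + 2.0 - y + noise + p + s
    dy = EPSILON * (GAMMA * (1.0 + np.tanh(x / BETA)) - y)
    return x + dt * dx, y + dt * dy

def weight(p_a, p_b, n_neighbors, lam=LAMBDA):
    """Equation 5.9: weight between two connected first-layer neurons."""
    return 0.25 / (n_neighbors * np.exp(lam * abs(p_a - p_b)))

def simulate(p, steps=20000, dt=0.05, x0=-2.0, y0=0.0):
    """Membrane potential of one uncoupled, noise-free oscillator (S = 0)."""
    xs = np.empty(steps)
    x, y = x0, y0
    for n in range(steps):
        x, y = euler_step(x, y, p, 0.0, dt)
        xs[n] = x
    return xs

xs = simulate(p=0.5)                              # a stimulated neuron oscillates
w_same = weight(0.3, 0.3, n_neighbors=4)          # identical inputs: strongest weight
w_diff = weight(0.0, 1.0, n_neighbors=4)          # dissimilar inputs: weaker weight
```

With a positive external input the trajectory alternates between a silent and an active phase on two time scales, which is the relaxation behavior described in footnote 5. Neurons with similar inputs are coupled more strongly than neurons with dissimilar inputs.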
TABLE 5.1: The numerical values of the different parameters used in the first layer of
the network.
Constant’s name   Value
λ                 1
θ                 0.9
α                 -0.1
ξ                 0.4
ζ                 0.2
η                 0.05
γ                 4.0
ε                 0.02
ρ                 0.02
β                 0.1
κ                 0.2

• Second layer: temporal correlation and multiplicative synapses [138]. The
second layer is an array of 256 neurons (one for each channel). Each neuron receives
the weighted product of the outputs of the first-layer neurons along the frequency
axis of the CAM/CSM. The weights between layer one and layer two are defined
as wll(i) = αi, where i can be related to the frequency bins of the STFT and α is
a constant, for the CAM case, since we are looking for structured patterns. For the
CSM, wll(i) = α is constant along the frequency bins, as we are looking for energy
bursts. Therefore, the input stimulus to neuron(j) in the second layer is defined as
follows [138]:
$$\theta(j;t) = \prod_i w_{ll}(i)\, \Xi\{\overline{x(i,j;t)}\} \tag{5.15}$$
The operator Ξ is defined as:
$$\Xi\{\overline{x(i,j;t)}\} = \begin{cases} 1 & \text{for } \overline{x(i,j;t)} = 0 \\ \overline{x(i,j;t)} & \text{elsewhere} \end{cases} \tag{5.16}$$
where the overline denotes averaging over a time window (the duration of the window
is of the order of the discharge period). The multiplication is done only for
non-zero outputs (in which a spike is present) [139, 140]. This behavior has been
observed in the integration of ITD (Interaural Time Difference) and ILD (Interaural
Level Difference) information in the barn owl’s auditory system [139], and in the
monkey’s posterior parietal lobe neurons, which show receptive fields that can be
explained by a multiplication of retinal and eye or head position signals [141]. The
theoretical motivations behind using a multiplicative synapse instead of an additive
one are explained in Section 7.3.
The synaptic weights inside the second layer are adjusted through the following rule:
$$w'_{ij}(t) = \frac{0.2}{e^{\mu |p(j;t) - p(k;t)|}} \tag{5.17}$$
µ is chosen to be equal to 2. The “binding” of these features is done via this second
layer. In fact, the second layer is an array of fully connected neurons along with
a global controller. The global controller desynchronizes the synchronized neurons
for the first and second sources by emitting inhibitory activities whenever there
is activity (spiking) in the network [3]. Note also that the Heaviside function H(.)
of the input values is applied to the neurons because of synchronization
considerations. Regions with different first-layer activity will dissociate through
very weak synaptic connections, producing desynchronization (similar frequencies
but different phases), and similar regions will synchronize (similar frequency and
phase) through strong synaptic connections.

TABLE 5.2: The numerical values of the different parameters used in the second layer of
the network.
Constant’s name   Value
α                 1
µ                 2
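The multiplicative input of Equations 5.15 and 5.16 and the lateral weights of Equation 5.17 can be sketched as follows. This is an illustrative sketch only: the array layout (frequency bins along the first axis, channels along the second) and the function names are assumptions, and the time averaging of the first-layer outputs is assumed to have been done beforehand.

```python
import numpy as np

def xi(x_avg):
    """Equation 5.16: entries without spikes (zero) become the neutral factor 1."""
    return np.where(x_avg == 0.0, 1.0, x_avg)

def second_layer_input(x_avg, w_ll):
    """Equation 5.15.

    x_avg: (n_bins, n_channels) time-averaged first-layer outputs (assumed layout).
    w_ll:  (n_bins,) weights between layer one and layer two.
    Returns one stimulus value per channel.
    """
    return np.prod(w_ll[:, None] * xi(x_avg), axis=0)

def second_layer_weight(p_j, p_k, mu=2.0):
    """Equation 5.17: lateral weight between two second-layer neurons."""
    return 0.2 / np.exp(mu * np.abs(p_j - p_k))

x_avg = np.array([[0.0, 2.0],
                  [3.0, 0.0]])
stimulus = second_layer_input(x_avg, np.ones(2))

# Full lateral weight matrix for three channels via broadcasting.
p = np.array([0.0, 0.0, 1.0])
W = second_layer_weight(p[:, None], p[None, :])
```

Channels with equal stimuli receive the maximum weight of 0.2 and tend to synchronize, while the weight decays exponentially with the difference between stimuli.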
Now that the neural network has separated the different sound sources based on the
neural synchrony of its outputs, the extracted information is used to generate
a binary mask. This mask is then used to synthesize the sources, as explained in the
next subsection.
5.4.5 Synthesis
Our system assumes that different sources segregate in the auditory image representation
space and that masking of the undesired sources is feasible. In fact, speech has a specific
(characteristic) structure that is different from that of most noises and perturbations [142].
Also, when dealing with simultaneous speakers, separation is possible when preserving the
time structure (the probability at a given instant t to observe overlap in pitch and timbre
is relatively low), therefore, a mask can be used to suppress the interference (or separate
all sources with adaptive masks). Here is how the synthesis is done in our system:
• The time-reversed signal is passed through the synthesis filterbank, giving zi(t).
• The mask is applied to the channels and the extracted signal is computed. The
energy of each frame of the signal is normalized before synthesis.
$$s(t) = \sum_{i=1}^{256} m_i(t)\, z_i^{norm}(t) \tag{5.18}$$
where s(N − t) is the recovered signal (N is the length of the signal in discrete
mode), $z_i^{norm}(t)$ is the normalized filtered output of the original corrupted
signal for channel i, and $m_i(t)$ is the mask value. The mask has equal values for all
channels whose associated neurons are synchronized, e.g. mi(t) = 0 or 1, depending on
the source to be enhanced. Another approach is to apply the mask before the synthesis
filterbank, instead of applying it after the synthesis filterbank. This approach has
also been tested, and Figure 5.22 shows the result when the mask is applied before the
synthesis. Each approach (masking before or after synthesis) has pros and cons. Masking
after the synthesis filterbank produces musical noise and some rays in the spectrum,
whereas masking before synthesis leaves pink noise in the result.
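Equation 5.18 amounts to a mask-weighted sum over channels followed by a time reversal. The sketch below illustrates this under stated assumptions: the normalized channel signals are taken as given, the mask is held constant over the whole excerpt, and `binary_mask` and `synthesize` are hypothetical names.

```python
import numpy as np

def binary_mask(group_of_channel, target_group, n_samples):
    """Mask equal to 1 for channels whose neurons belong to the target group."""
    m = (np.asarray(group_of_channel) == target_group).astype(float)
    return np.repeat(m[:, None], n_samples, axis=1)

def synthesize(z_norm, mask):
    """Equation 5.18, then time reversal (the recovered signal is s(N - t))."""
    s = np.sum(mask * z_norm, axis=0)
    return s[::-1]

# Three channels; channels 0 and 2 are synchronized with the target source.
z_norm = np.array([[1.0, 2.0, 3.0],
                   [10.0, 20.0, 30.0],
                   [100.0, 200.0, 300.0]])
mask = binary_mask([0, 1, 0], target_group=0, n_samples=3)
recovered = synthesize(z_norm, mask)
```

Swapping `target_group` selects the other source, which corresponds to exchanging the values 0 and 1 in the mask.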
In the next section, results obtained from different experimental setups are given.
These results are compared with results obtained by other techniques proposed in the
literature.
5.5 Experiments
5.5.1 Database and comparison
Martin Cooke’s database [62] is used for evaluation purposes. The following noises have
been tested: 1 kHz tone, FM siren, white noise, trill telephone noise, and speech. The
aforementioned noises have been added to the target utterance. The audio results (files)
can be found at [143]. Each mixture is applied to the neural system and the mixed sound
sources are separated. The LSD (Log Spectral Distortion) and the PEL (Percentage of
Energy Loss) are used as performance criteria [144, 145, 43, 41]. The LSD is defined
below:
$$\mathrm{LSD} = \frac{1}{L} \sum_{l=0}^{L-1} \sqrt{\frac{1}{K} \sum_{k=0}^{K-1} \left( 20 \log_{10} \frac{|I(k,l)| + \epsilon}{|O(k,l)| + \epsilon} \right)^2} \tag{5.19}$$
where I(k, l) and O(k, l) are the FFTs of I(t) (ideal source signal) and O(t) (separated
source), respectively. L is the number of frames, K is the number of frequency bins, and
ε is meant to prevent extreme values (equal to 0.001 in our case).
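Equation 5.19 can be computed along the following lines. This is a sketch under stated assumptions: the frame length, hop size, and Hann analysis window are not given in the text and are chosen here only for illustration, and the function names are hypothetical. The value of ε (0.001) is the one stated above.

```python
import numpy as np

def frame_spectra(x, frame_len=256, hop=128):
    """Magnitude spectra of successive frames (assumed framing and Hann window)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))

def log_spectral_distortion(ideal, separated, eps=1e-3, frame_len=256, hop=128):
    """Equation 5.19: mean over frames of the RMS log-spectral difference (dB)."""
    i_mag = frame_spectra(ideal, frame_len, hop)
    o_mag = frame_spectra(separated, frame_len, hop)
    ratio_db = 20.0 * np.log10((i_mag + eps) / (o_mag + eps))
    return float(np.mean(np.sqrt(np.mean(ratio_db ** 2, axis=1))))

rng = np.random.default_rng(0)
x = rng.standard_normal(8000)
lsd_identical = log_spectral_distortion(x, x)
lsd_half = log_spectral_distortion(x, 0.5 * x)   # close to 20*log10(2) dB
```

A perfectly recovered signal gives an LSD of zero, and a uniform halving of the amplitude gives roughly 6 dB, which helps to read the values reported in Table 5.3.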
The PEL is defined as follows:
$$\mathrm{PEL} = \frac{\sum_t e_2^2(t)}{\sum_t I^2(t)} \tag{5.20}$$
The PNR (Percentage of Noise Residual) is defined as follows:
$$\mathrm{PNR} = \frac{\sum_t e_1^2(t)}{\sum_t O^2(t)} \tag{5.21}$$
PEL indicates the percentage of target speech excluded from the segregated speech, and
PNR the percentage of intrusion included in the synthesized speech. O(t) is the speech
produced by our system. The speech waveform resynthesized from the ideal binary mask
is denoted by I(t). To obtain e2(t), a mask is constructed as follows: a T-F unit is
assigned 1 if and only if it is 1 in the ideal binary mask but 0 in the segregated
target stream. e2(t) is then obtained by resynthesizing the input mixture from the
obtained mask. e1(t) is obtained in a similar way (see footnote 6).
Although this criterion is used in [35, 146, 4, 42, 12], it is difficult to determine
the ideal mask; therefore this criterion is not used for all experiments.
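The mask construction behind PEL and PNR can be sketched as follows. The sketch makes two assumptions that go beyond the text: the ratios are taken between signal energies (sums of squared samples), and the mask for e1(t), which the text only describes as obtained "in a similar way", is assumed to select the units that are 0 in the ideal mask but 1 in the segregated stream. All function names are hypothetical, and resynthesis from the masks is left out.

```python
import numpy as np

def loss_mask(ideal_mask, estimated_mask):
    """T-F units of the target that were lost (used to resynthesize e2)."""
    return ((ideal_mask == 1) & (estimated_mask == 0)).astype(float)

def residual_mask(ideal_mask, estimated_mask):
    """T-F units of intrusion that were kept (assumed construction for e1)."""
    return ((ideal_mask == 0) & (estimated_mask == 1)).astype(float)

def energy_ratio(numerator_signal, denominator_signal):
    """Ratio of signal energies; multiply by 100 to express it as a percentage."""
    return float(np.sum(numerator_signal ** 2) / np.sum(denominator_signal ** 2))

ideal = np.array([[1, 1],
                  [0, 0]])
estimated = np.array([[1, 0],
                      [1, 0]])
m_loss = loss_mask(ideal, estimated)          # [[0, 1], [0, 0]]
m_residual = residual_mask(ideal, estimated)  # [[0, 0], [1, 0]]

# With e2(t) and I(t) resynthesized elsewhere, PEL = energy_ratio(e2, I).
example_ratio = energy_ratio(np.array([1.0, 1.0]), np.array([2.0, 2.0]))
```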
5.5.2 Separation performance
Table 5.3 gives the LSD. The SNR of the initial signal is calculated by
$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t s^2(t)}{\sum_t n^2(t)} \tag{5.22}$$
where s(t) represents the original target signal and n(t) the noise. In all cases, the system
performs better than [3]. It is the best when the interference is a tone. For the siren,
it is comparable to [146]. For telephone and white noise, [146] is the best. For the
double-vowel, the LSD is the highest – showing that separation is more difficult when the
interference is speech. In what follows, spectrograms for different sounds and different
approaches are given for visual comparison purposes.
We have so far compared our proposed technique to other approaches quantitatively by
using LSD, PEL, and PNR criteria. In the next section qualitative comparison will be
made available to the reader.
5.6 Separation examples
5.6.1 Separation of speech from telephone trill
Figure 5.11 shows the mixture of the utterance “Why were you all weary?” with the
telephone trill noise (from Martin Cooke’s database). The trill telephone noise (ring) is
wideband, interrupted, and structured. Figure 5.12 shows the spectrograms of the
separated utterance and of the separated telephone trill obtained with our approach. It
is interesting to note that the low-frequency range of the telephone trill has been
preserved. Figure 5.13 shows the utterance extracted by using [3]. As can be seen, our
approach performs better at higher frequencies.

Footnote 6: Personal communication with Goning Hu.

TABLE 5.3: The log spectral distortion (LSD) for three different methods: P-R (our
proposed approach), W-B (the method proposed by [3]), and H-W (the method proposed
by [146]). The intrusion noises are as follows: a) 1 kHz pure tone, b) FM siren, c)
telephone ring, d) white noise, e) male-speaker intrusion (/di/) for the French
/di//da/ mixture, f) female-speaker intrusion (/da/) for the French /di//da/ mixture.
Except for the last two tests, the intrusions are mixed with a sentence taken from
Martin Cooke’s database.
Intrusion     SNR of the initial mixture (dB)   LSD (P-R)   LSD (W-B)   LSD (H-W)
Tone          -2                                7.07        23.15       16.45
Siren         -5                                8.68        17.26       8.52
Tel. ring     3                                 15.43       16.56       10.11
White noise   -5                                15.29       18.41       12.77
Male (da)     0                                 23.70       N/A         N/A
Female (di)   0                                 17.95       N/A         N/A
5.6.2 Separation of speech from 1 kHz tone
In this experiment, the utterance “I willingly marry Marilyn” mixed with a 1 kHz pure
tone is used. The tone is narrowband, continuous, and structured. Figure 5.14 shows the
original utterance plus the 1 kHz tone. Figure 5.15 shows the separation results for our
approach and for the approach proposed by [3]. The method proposed in [3] removes speech
in the middle and high frequencies, while these frequencies remain unaffected by our
approach. When listening to the signal, and according to the LSD (equal to 7.07), the
tone has been removed (even though a gray bar is visible in Figure 5.15, left).
5.6.3 Double-vowel segregation case
Two speakers have simultaneously pronounced a /di/ and a /da/, respectively
(spectrogram in Figure 5.16). We observe that the CSM does not generate a very
discriminative representation, whereas in the CAM the two speakers are well separable.
After binding, two sets of synchronized neurons are obtained: one for each speaker.
Separation is performed by using Equation 5.18, where mi(t) = 0 for one speaker and
mi(t) = 1 for the other speaker (target speaker). For the /di/+/da/ mixture, we used
the PEL (Percentage of Energy Loss) as an evaluation criterion.
The PEL for the synthesized /da/ is 15.01% at SNR = 0 dB and 16.67% for the /di/.
Perceptual tests have shown that although we lose some sound quality after the
process, the vowels are separated and are clearly recognizable.
5.6.4 Sentence plus siren
The siren used in Cooke’s database [62] [3] (Equation 5.23) is mixed with the sentence “I
willingly marry Marilyn”.
The spectrogram of the mixed sound is shown in Figure 5.19. The noise is represented
by the following equation and can be generated by a VCO (voltage-controlled oscillator):
$$n(t) = \sum_i \cos\left[\omega_i t + \frac{\Delta\omega}{\omega_m} \cos(\omega_m t + \varphi_i)\right] \tag{5.23}$$
where ωi is the central angular frequency, ωm is the angular frequency of the modulating
signal, ∆ω is the angular frequency deviation, and ϕi is the phase of the modulating
signal (equal to 0 in Figure 5.16). We are looking for short but high-energy bursts. We
observe that the CSM generates a very discriminative representation of the speech and
siren signals, whereas the CAM fades the image because of the envelopes. After binding,
two sets of synchronized neurons are obtained: one for each source. Separation is
performed by using Equation 5.18, where mi(t) = 0 for the siren and mi(t) = 1 for the
speech sentence, or vice versa. The CSM is presented to the spiking neural network. The
weighted product of the outputs of the first layer along the frequency axis is different
when the siren is present. The binding of channels on the two sides of the “noise
intruding zone” is done via the long-range synaptic connections of the second layer. A
CSM is extracted every 10 ms and the selection is made over 10 ms intervals. In future
work, we plan to use much smaller selection intervals and shorter STFT windows to
prevent discontinuities such as those observed in Figures 5.20 and 5.21. Furthermore,
overlapping cochlear filters are not suitable for the synthesis of the processed speech.
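The frequency-modulated noise of Equation 5.23 can be generated as in the sketch below. The function name and all numerical values in the example (center frequencies, deviation, and modulation rate) are hypothetical and are not the parameters of the siren in Cooke's database.

```python
import numpy as np

def fm_siren(t, center_hz, deviation_hz, mod_hz, phases=None):
    """Equation 5.23: sum of frequency-modulated cosines.

    center_hz: iterable of center frequencies (omega_i / 2*pi).
    deviation_hz: frequency deviation (delta omega / 2*pi).
    mod_hz: modulation frequency (omega_m / 2*pi).
    phases: modulation phases phi_i (defaults to zero for every component).
    """
    t = np.asarray(t, dtype=float)
    centers = list(center_hz)
    if phases is None:
        phases = [0.0] * len(centers)
    w_m = 2.0 * np.pi * mod_hz
    index = (2.0 * np.pi * deviation_hz) / w_m   # modulation index
    n = np.zeros_like(t)
    for f_i, phi in zip(centers, phases):
        n += np.cos(2.0 * np.pi * f_i * t + index * np.cos(w_m * t + phi))
    return n

fs = 8000
t = np.arange(fs) / fs
siren = fm_siren(t, center_hz=[500.0, 1000.0], deviation_hz=200.0, mod_hz=2.0)
pure = fm_siren(t, center_hz=[1000.0], deviation_hz=0.0, mod_hz=2.0)
```

With a zero deviation the expression reduces to a plain cosine at the center frequency, which is a convenient sanity check of the reconstruction.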
Figure 5.8: The change in the stiffness of the hair cells due to a change of the stimulus.
Πl(τ) represents the terminal contribution of the Outer Hair Cells (OHC) efferent system.
The efferent gain Π1(τ) in the upper frequency band of the cochleogram partly compen-
sates for the loss. The low-frequency efferent gain Π4(τ) is primarily sensitive to voicing
and shows large temporal fluctuations. Kst is the stiffness of the terminal contribution of
the acoustic reflex (AR) on the middle ear. Maximal stapedial muscle contraction Kmax
incurs a loss in middle ear transmission below 1000 Hz of up to 15 dB (scanned from [31],
chapter 25). See section 5.4.2 for details.
[Figure 5.9 schematic; axes: Cochlear Channels and Frequencies; rays of Source 1 and
Source 2 with spacings f01 and f0; formant boxes fr1,1, fr1,2, fr2,1, fr2,2, fr2,3.]
Figure 5.9: Idealized schematic of a 2-D spectral map (Cochleotopic/AMtopic) for a
two-speaker signal. The distance between rays corresponds to the pitch of the source.
Note that the amplitudes of the rays are not equal because of the effect of the
formants. Some resonance frequencies fri,j are shown by dotted boxes. Note that the
resonance frequencies do not always match the harmonic frequencies nf0.
[Figure 5.10 diagram. Labels: Neuron i,j; Neuron k,m; H(.) applied to x(k,m;t); weight
wi,j,k,m; long-range coupling Li,j; Global Controller G with dz/dt = σ − ξz, where σ = 1
if sum > ζ and σ = 0 if sum < ζ, acting through −η; Synchronization; CAM/CSM input.]
Figure 5.10: Architecture of the Two-Layer Bio-inspired Neural Network. G stands for
global controller (the global controller for the first layer is not shown in the
figure). One long-range connection is shown in the figure. The CAM/CSM is applied to the
first layer. The synchronization in the second layer is based on the similarity of
cochlear channels. Neurons associated to channels belonging to the same source
synchronize.
[Figure 5.11 spectrograms; axes: Time (s), 0 to 1.51; Frequency (Hz), 0 to 4000.]
Figure 5.11: Mixture of the utterance “Why were you all weary?” with a trill telephone
noise.
[Figure 5.12 spectrograms, two panels; axes: Time (s), 0 to 1.51; Frequency (Hz), 0 to
4000.]
Figure 5.12: Separation results for the trill telephone noise. Left: The synthesised “Why
were you all weary?” after the separation by the approach proposed in this article. Right:
The synthesised trill phone after the separation by the approach proposed in this article.
[Figure 5.13 spectrogram; axes: Time (s), 0 to 1.75; Frequency (Hz), 0 to 4000.]
Figure 5.13: The synthesized “Why were you all weary?” by the approach proposed by
[3]. The high-frequency information is missing.
[Figure 5.14 spectrogram; axes: Time (s), 0 to 1.75; Frequency (Hz), 0 to 4000.]
Figure 5.14: Mixture of the utterance “I willingly marry Marilyn” with 1 kHz tone.
[Figure 5.15 spectrograms, two panels; axes: Time (s), 0 to 1.75 (left) and 0 to 1.46
(right); Frequency (Hz), 0 to 4000.]
Figure 5.15: Comparison between our approach and Wang’s approach for the ’1 kHz’ tone.
Left: The separation result for the 1 kHz plus utterance mixture using the approach
described in this thesis. The dynamic range between the darkest gray level and the
brightest level is 50 dB. Right: The synthesised “Why were you all weary?” by the
approach proposed by [3]. The high-frequency information is missing.
[Figure 5.16 spectrogram; axis: Frequency (Hz), ticks 1000 to 4000.]
Figure 5.16: The spectrogram of the /di/ /da/ mixture.
[Figure 5.17 spectrogram; axis: Frequency (Hz), ticks 1000 to 4000.]
Figure 5.17: The spectrogram of the extracted /di/.
[Figure 5.18 spectrogram; axis: Frequency (Hz), ticks 1000 to 4000.]
Figure 5.18: The spectrogram of the extracted /da/.
[Figure 5.19 spectrogram; axes: Time (s), 0 to 1.75; Frequency (Hz), 0 to 4000.]
Figure 5.19: Mixture of a siren and the sentence “I willingly marry Marilyn”.
[Figure 5.20 spectrograms, two panels; axes: Time (s), 0 to 1.75; Frequency (Hz), 0 to
4000.]
Figure 5.20: Synthesis by an FIR implementation. Left: Results with the proposed
256-channel FIR gammatone filterbank: the spectrogram of the extracted siren. Right:
Results with the proposed 256-channel FIR gammatone filterbank: the spectrogram of
the utterance (the siren is removed).
5.6.5 PESQ
Another quantitative performance criterion used in speech coding is the PESQ (Perceptual
Evaluation of Speech Quality). We propose here to use this criterion for sound source
separation (see footnote 7).
The PESQ is an objective method for end-to-end speech quality assessment of narrow-band
telephone networks and speech codecs, which is applicable to any end-to-end
measurement. This evaluation method has been proposed by the ITU (International
Telecommunication Union) under recommendation P.862. The code and documentation for
PESQ can be downloaded at [147] (see also [148]).
Our technique gives better results than [4] and [3] according to the PESQ for all
mixtures tested (except for the telephone ring, in which case performance is
comparable). According to the LSD, however, our technique performs better than Wang &
Brown [3] and performs either better or worse (depending on the mixture used) than Hu &
Wang [4]. Compared with the statistical approach proposed in [5], our approach performs
better for the extraction of music and slightly worse for the extraction of speech.
However, the technique proposed in [5] had been statistically trained with speech before
the separation phase.

Footnote 7: Many thanks to Vijay Parsa from the University of Western Ontario for
fruitful discussions on PESQ.

[Figure 5.21 spectrograms, two panels; axes: Time (s), 0 to 1.75; Frequency (Hz), 0 to
4000.]
Figure 5.21: Synthesis by an IIR implementation. Left: Results with a 256-channel IIR
implementation of the gammatone filter: the spectrogram of the extracted siren. Right:
Results with a 256-channel IIR implementation of the gammatone filter: the spectrogram
of the utterance (the siren is removed).
5.6.6 Three-source case
Results on three-source cases (utterance plus siren plus telephone trill) have shown that the technique proposed in this thesis can be easily generalized to multiple-source separation. In addition, music plus utterance sound source separation has been tested with the proposed technique. The sound files for these results can be found at [143]. No quantitative comparison is made, since data for other approaches are not available [148].
In the next section, some preliminary findings on source separation based on chaotic oscillators will be given. The advantage of the technique described below is its computational simplicity compared with the technique proposed above.
Figure 5.22: Synthesis result for the siren plus sentence case, when the masking is applied before the masking. Musical noise is decreased but pink noise is increased. Axes: time, 0 to 1.75 s; frequency, 0 to 4000 Hz.
Figure 5.23: The synthesized “Why were you all weary?” by the approach proposed by [3] in the siren plus utterance mixture case. The high-frequency information is missing. Axes: time, 0 to 1.46 s; frequency, 0 to 4000 Hz.
5.7 Conclusion and Further Work
Based on evidence regarding the dynamics of the efferent loop [135] and on the richness of the representations observed in the Cochlear Nucleus, we proposed a technique to explore the monophonic source separation problem using a multirepresentation (CAM/CSM) bio-inspired pre-processing stage and a bio-inspired neural network that does not require any a priori knowledge of the signal. We saw how this technique helped separate target sound sources from interfering noises such as a siren, a telephone trill, music, a tone, and other talkers. We also compared our technique to other techniques proposed in the literature and saw that ours does better in almost all cases than a well-known bio-inspired sound separation technique [3]. We also compared our technique to a pitch-based technique [4]. We saw that our technique sometimes does better and sometimes worse, but the results remain comparable. On the other hand, the aim of this thesis was to design a bio-inspired sound separator, whereas the pitch-based technique is more expert-system oriented. We believe that our system is more flexible than some other techniques found in the literature. For example, a technique that uses pitch, like the one used in [4], cannot be used for musical sound source separation, since the harmonic structure is different in music. We believe that our approach can be applied to the separation of musical instruments since, for example, we have made no assumption about the formant structure of speech in developing our algorithm. In addition, preliminary experiments show that our technique works for three-source sound separation. The authors of [3] and [4] have not reported sound separation for more than two sources.

TABLE 5.4: The PESQ of three different methods: P-R (our proposed approach), W-B ([3]), and H-W ([4]) (see caption of Table 5.3). Higher values mean better performance.

Intrusion (noise)   Initial SNR of mixture   P-R (PESQ)   W-B (PESQ)   H-W (PESQ)
Tone                -2 dB                    0.4          0.2          0.4
Siren               -5 dB                    2.1          1.6          1.2
Tel. ring            3 dB                    0.9          0.7          0.9
White               -5 dB                    0.9          0.2          0.3
Male (da)            0 dB                    2.1          N/A          N/A
Female (di)          0 dB                    0.7          N/A          N/A

TABLE 5.5: PESQ for two different methods: P-R (our proposed approach) and J-L ([5]). The mixture comprises a female voice with musical rock background.

Mixture               Separated source   P-R (PESQ)   J-L (PESQ)
Music & female (AF)   music              1.70         0.35
Music & female (AF)   voice              0.55         0.63

For the time being, the CSM/CAM selection is done manually. In future work, one can include a top-down module based on the SNR gain between the inputs and the extracted signals to selectively find the suitable auditory image representation, depending on the neural network synchronization. Other maps, such as ones based on instantaneous frequencies (FM), can be added to the multi-representation [149].
As stated earlier, the CSM is computed at 10 ms intervals with a 64 ms STFT window (see footnote 8). In future work, smaller intervals and shorter STFT windows should be chosen to diminish the spectral discontinuities. In addition, other speech analysis techniques that do not require stationary signals (such as wavelets) could be used. More thorough comparisons with other techniques like the ones proposed in [47] [5] (among others) are also planned.
Musical noise is inherent to any technique based on binary masks. In order to fix this problem, non-binary masks with smooth transitions must be used to reduce different types of noise. Non-binary masks are in contradiction with the Gestalt rule of mutual exclusivity (recall that, according to this rule, an object cannot belong to two different entities at the same time). Therefore, a psychological interpretation of this non-binary mask should be found.
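The contrast between a binary mask and a mask with smooth transitions can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the per-channel ratios in dB, the sigmoid shape, the `slope` parameter and the function names are hypothetical and are not taken from the system described in this chapter.

```python
import math

def binary_mask(ratios_db, threshold_db=0.0):
    """Hard 0/1 mask: a channel is kept only if its ratio exceeds the threshold."""
    return [1.0 if r > threshold_db else 0.0 for r in ratios_db]

def soft_mask(ratios_db, threshold_db=0.0, slope=0.5):
    """Smooth mask with values in (0, 1). The sigmoid shape is an assumption."""
    return [1.0 / (1.0 + math.exp(-slope * (r - threshold_db))) for r in ratios_db]

# Hypothetical per-channel target-to-interference ratios (dB).
ratios_db = [-12.0, -1.0, 0.5, 9.0]
hard = binary_mask(ratios_db)   # abrupt switch between the two middle channels
soft = soft_mask(ratios_db)     # gradual transition across the same channels
```

With the binary mask, two channels whose ratios differ only slightly can receive opposite decisions, whereas the smooth mask assigns them neighbouring weights; each channel then contributes partially to both sources, which is the point of conflict with mutual exclusivity.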
Qualitative results obtained from signal synthesis are encouraging and we believe that
spiking neural networks in combination with suitable signal representations have a strong
potential in speech and audio processing.
The segregation results are not very good for a sound file mixed with white noise. Other types of auditory maps should be developed for white noise intrusions. In fact, we observed that the leakage between different filters of the filterbank somehow amplifies the noise. Therefore, more work should be done on the design of more suitable analysis/synthesis filterbanks.
Footnote 8: The window length is equal to 4 ms for the telephone trill.
In [150, 151, 152], it has been shown that a circular chain of spiking neurons has a faster synchronization time than a linear chain of neurons. Based on this fact, one can imagine modifying the linear chain of the second layer of our proposed network into a circular chain in order to increase synchronization speed.
The parameters in Tables 5.1 and 5.2 are chosen empirically in this work. A mathematical analysis or a statistical study should be done to find the optimal values for these parameters.
In this chapter, a method has been proposed to do “bottom-up” sound source separation. A “top-down” processor should be integrated into this technique to further enhance performance. Top-down processing uses higher-level information (at the word level, etc.) to match the pattern obtained by the “bottom-up” processor to an a priori known pattern. In the next chapter of this thesis, we propose an architecture that can potentially do this kind of “top-down” processing, although for the time being it has only been tested on visual “toy objects” (see [153]).
CHAPTER 6
ODLM FOR PATTERN RECOGNITION
6.1 Introduction
In this chapter we propose the Oscillatory Dynamic Link Matching (ODLM), which is an extension of the Dynamic Link Matching (DLM). We present how this technique can help match objects to predefined patterns. This technique can be used in future work to match auditory patterns in a “top-down” processor that can be coupled to the approach proposed in Chapter 5.
6.2 Pattern Recognition
Pattern recognition is a branch of artificial intelligence concerned with the classification
or description of observations.
Pattern recognition aims to classify data (patterns) based on either a priori knowledge
or on statistical information extracted from the patterns. The patterns to be classified
are usually groups of measurements or observations, defining points in an appropriate
multidimensional space.
A complete pattern recognition system consists of a sensor that gathers the observations
to be classified or described; a feature extraction mechanism that computes numeric or
symbolic information from the observations; and a classification or description scheme
that does the actual job of classifying or describing observations, relying on the extracted
features.
The classification or description scheme is usually based on the availability of a set of
patterns that have already been classified or described. This set of patterns is termed the
training set and the resulting learning strategy is characterized as supervised. Learning
can also be unsupervised, in the sense that the system is not given an a priori labelling
of patterns; instead, it establishes the classes itself based on the statistical regularities of
the patterns.
The classification or description scheme usually uses one of the following approaches:
statistical (or decision theoretic), syntactic (or structural), or neural. Statistical pattern
recognition is based on statistical characterisations of patterns, assuming that the patterns
are generated by a probabilistic system. Structural pattern recognition is based on the
structural interrelationships of features. Neural pattern recognition employs the neural
computing paradigm that has emerged with neural networks.
Pattern recognition (or more specifically template matching) robust to noise, symmetry, homothety (size change with angle preservation), etc. (Figure 6.1) has long been a challenging problem in artificial intelligence. In addition to its direct application in artificial intelligence and image processing, this task can be seen as complementary to the source separation described in Chapter 5, with recognition performed on the CAM/CSM representations described there. Many solutions or partial solutions to this problem have been proposed using expert systems or neural networks. In general, three different approaches are used to perform invariant pattern recognition:
• Normalization. In this approach the analyzed object is normalized to a standard position and size by an internal transformation. One advantage of this approach is that the coordinate information (the “where” information) is retrievable at any stage of the processing and there is a minimum loss of information. The disadvantage of this approach is that the network must find the object in the scene and then normalize it. This task is not as obvious as it may appear [75] [154].
• Invariant Features. In this approach some features that are invariant to the location and the size of an object are extracted. The disadvantages of this approach are that the position of the object may be difficult to extract after recognition and that information is lost during the process. The advantage is that the technique does not need to know where the object is. Moreover, unlike normalization, after which other techniques must be used to recognize patterns, the invariant features approach already does some pattern recognition by finding important features [73].
• Invariance Learning from temporal input sequences. The assumption is that primary sensory signals, which in general code for local properties, vary quickly while the perceived environment changes slowly. If it is possible to extract slow features from the quickly varying sensory signal, an invariant representation of the environment is likely to be obtained [155] [156].
Based on the Normalization approach, “dynamic link matching” (DLM) was first proposed by Konen et al. [154, 157]. This approach consists of two layers of neurons connected to each other through synaptic connections constrained by some normalization. The saved pattern is applied to one of the layers and the pattern to be recognized to the other. The dynamics of the neurons are chosen in such a way that “blobs” are formed randomly in the layers. If the features in these two blobs are similar enough, some weight strengthening and activity similarity will be observed between the two layers, which can be detected by correlation computation [154, 158]. These blobs may or may not correspond to a segmented region of the visual scene, since their size is fixed over the whole simulation period and is set by some parameters in the dynamics of the network [154]. The appearance of blobs in the network has been linked by the developers of the architecture to the attention process present in the brain. The dynamics of the neurons used in the original DLM network are not the well-known spiking neuron dynamics. In fact, their behavior is based on rate coding (average neuron activity over time; for details see Section 6.8) and can be shown to be equivalent to an enhanced dynamic Kohonen Map in its Fast Dynamic Link Matching (FDLM) form [154]. The DLM technique developed by Von der Malsburg’s group has been applied with simplifications to real images. In fact, the first layer of their network contains 10x10 neurons while the second layer contains 16x7 neurons (most probably because the computational complexity grows exponentially with the number of neurons). They have used jets (grey-value distributions based on the Gabor transform) to simplify the image and extract some features (or simple objects).
Figure 6.1: Some examples of affine transforms. Some transforms are simple like in a,b,c,d,e and some are combinations of simple transforms like the one presented in f. (Panel titles: Rotation, Scaling (Homothety), Shearing, Reflection, Translation, Reflection + Shearing.)
Here, we propose the Oscillatory Dynamic Link Matching algorithm (ODLM) [106] [159] [160], which uses third generation spiking neurons and is based on phase (place) coding. The network is capable of doing motion analysis, but it neither computes optical flow nor performs additional signal processing between the layers, unlike in [161]. More generally, our proposed network can solve the correspondence problem and, at the same time, perform the segmentation of the scene, which is in accordance with the Gestalt theory of perception [162] and is very useful when pattern recognition must be done in multiple-object scenes. In other words, the network does normalization, segmentation, and pattern recognition at the same time. It is also self-organized. In addition, if only one object is present in the scene, the segmentation phase can be bypassed if the speed of convergence is the only concern (Section 6.7). The application of this network is not limited to visual scene analysis: it can be used in the sound source segregation problem and may act as a top-down (schema-driven) processor in Computational Auditory Scene Analysis (CASA) [42]. In the following two sections, we first describe the Dynamic Link Matcher as proposed by Konen et al. and then propose our improved Oscillatory Dynamic Link Matcher. We then show why our proposed technique works and what its advantages are.
6.3 The Dynamic Link Matcher
The Dynamic Link Matcher was first proposed by Konen et al. [154]. The architecture of the DLM consists of two layers of neurons. The dynamics of the neurons are detailed below. There are synaptic couplings between neurons in different layers and between neurons in the same layer. In the DLM, finding a match between patterns means finding a set of mutually corresponding cells $a \in X$ (X being the neurons of the first layer) and $b \in Y$ (Y being the neurons of the second layer). A cell a may be considered as a neuron or neuronal group capable of two functions: 1) coding a local feature imposed by the actual pattern; 2) representing an activity state which can be transmitted to other cells. Correspondence in the DLM is expressed by the binding of cells a and b through a dynamic link $w_{ba} \geq 0$. The links converging on a given cell b are subject to the normalization condition $\sum_a w^{ext}_{ba} = 1$. Thus they can be interpreted as the probability that a cell a is the correct correspondence for b. The link matching is done in a self-organized manner. In addition to the inter-layer links $w_{ba}$, there are static and homogeneous intra-layer connections in both X and Y ($w^{int}_{ba}$), which are described through an interaction kernel $k(d) = \gamma \exp(-d^2/2s^2) - \beta$ consisting of short-range excitatory connections with range s and global inhibitory connections of relative strength $\beta$. It can be shown that this type of interaction kernel has a connected active region or “blob” as its equilibrium solution [154].
Each iteration step consists of a simultaneous blob formation process in both layers X and Y. This is achieved through a set of coupled differential equations starting from the initial conditions $x(0) = y(0) = 0$:

$$\frac{dx_a}{dt} = -\alpha x_a + (k * \sigma(x_a)) + I_{x_a} \tag{6.1}$$

$$\frac{dx_b}{dt} = -\alpha x_b + (k * \sigma(x_b)) + I_{x_b} \tag{6.2}$$

$\sigma(.)$ is the Gaussian Mexican Hat. The above equations differ only in their input terms: layer X receives its input $I_{x_a}$, which is slowly varying compared to the dynamics of X and Y. On the other hand, the activity of layer Y is coupled to X through the input term $I_{x_b} = \epsilon \sum_a w_{ba} T_{ba} \sigma(x_a)$ with coupling strength $\epsilon$. $T_{ba}$ is the similarity matrix. It has high entries for all candidate matches with similar features.
When the activity in both layers X and Y has converged to its equilibrium blob solution, the dynamic links between active cells are strengthened according to:

$$\Delta w_{ba} = \epsilon w_{ba} \sigma(x_a) \sigma(x_b) \tag{6.3}$$
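The interaction kernel and the link-strengthening rule of Equation 6.3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the parameter values (`gamma`, `s`, `beta`, `eps`) are hypothetical, the activities passed in stand for the values $\sigma(x_a)$ and $\sigma(x_b)$ at equilibrium, and enforcing the normalization $\sum_a w_{ba} = 1$ by rescaling right after each update is an assumption about how the constraint is applied.

```python
import math

def kernel(d, gamma=1.0, s=1.0, beta=0.3):
    """Interaction kernel k(d) = gamma * exp(-d^2 / (2 s^2)) - beta (values assumed)."""
    return gamma * math.exp(-d * d / (2.0 * s * s)) - beta

def update_links(w, act_x, act_y, eps=0.5):
    """w[b][a] is the link from cell a (layer X) to cell b (layer Y).

    Applies delta_w_ba = eps * w_ba * act_x[a] * act_y[b] (Eq. 6.3), then rescales
    each row so that sum_a w_ba = 1 (assumed way of enforcing the normalization).
    """
    new_w = []
    for b, row in enumerate(w):
        grown = [w_ba + eps * w_ba * act_x[a] * act_y[b] for a, w_ba in enumerate(row)]
        total = sum(grown)
        new_w.append([g / total for g in grown])
    return new_w

# Toy example: 3 cells in X, 2 cells in Y, uniform initial links.
w0 = [[1.0 / 3.0] * 3 for _ in range(2)]
act_x = [1.0, 0.0, 0.0]   # stands for sigma(x_a): only cell a = 0 is inside the blob
act_y = [1.0, 0.0]        # stands for sigma(x_b): only cell b = 0 is inside the blob
w1 = update_links(w0, act_x, act_y)
```

Only the link between the two co-active cells grows; after rescaling, the other links converging on the same cell b shrink accordingly, while the links of the inactive cell b = 1 stay uniform.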
Based on these assumptions, Konen, Malsburg and others proposed a dynamic link matcher [163, 154]. Although this approach is partially bio-inspired, it is not entirely biologically plausible. It does not use the neural building blocks used in bio-inspired networks, such as integrate-and-fire neurons or relaxation oscillators. In the next section, we propose a technique based on the DLM but using relaxation oscillators. We will further show why our proposed network can do pattern matching and why the original DLM is an approximation of our ODLM.
6.4 The oscillatory dynamic link matcher
6.4.1 Introduction
In this section we propose our ODLM (Oscillatory Dynamic Link Matcher). Like the DLM, the ODLM aims to match patterns that have been applied to the two layers of the network. In other words, a visual scene is applied to the first layer of the network and an object to the other layer. If the object exists in the scene, synchronization is achieved between the two layers. If it does not exist, no synchronization is achieved between the two layers. This behavior is illustrated in Figure 6.2 (page 139). The mathematical description of the network is given in the following subsection.
6.4.2 Mathematical Description of the Network
The building blocks of this network are oscillatory neurons [100] (see Chapters 4 and 5 for further detail). The dynamics of this kind of neuron are governed by a modified version of the Van der Pol relaxation oscillator (called the Wang-Terman oscillator) (for a similar approach with different dynamics see [1]). There is an active phase when the neuron spikes and a relaxation phase when the neuron is silent. The dynamics of the neurons follow the state-space equations below, where $x_{i,j}$ is the membrane potential (output) of the neuron and $y_{i,j}$ is the state for channel activation or inactivation.

$$\frac{dx_{i,j}}{dt} = 3x_{i,j} - x_{i,j}^3 + 2 - y_{i,j} + \rho + H(p^{input}_{i,j}) + S_{i,j} \tag{6.4}$$

$$\frac{dy_{i,j}}{dt} = \epsilon[\gamma(1 + \tanh(x_{i,j}/\beta)) - y_{i,j}] \tag{6.5}$$

$\rho$ denotes the amplitude of a Gaussian noise, $p^{input}_{i,j}$ the external input to the neuron (its value is equal to the gray-level value of the corresponding pixel in the picture), and $S_{i,j}$ the coupling from other neurons (connections through synaptic weights). $\epsilon$, $\gamma$, $\beta$ are constants (defined in Table 5.1, page 99), and H(.) is the Heaviside function defined below:

$$H(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \tag{6.6}$$

Initial values are drawn from a uniform distribution over the interval [-2; 2] for $x_{i,j}$ and over [0; 8] for $y_{i,j}$ (these values correspond to the whole dynamic range of the equations) (for more details on the dynamics of oscillatory neurons see Chapters 4 and 5).
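To make the behavior of Equations 6.4 and 6.5 concrete, the following Python sketch integrates a single uncoupled, noiseless neuron ($\rho = 0$, $S_{i,j} = 0$) with a forward-Euler scheme. The constants `eps`, `gamma` and `beta` used here are assumed values chosen for illustration (the values actually used are those of Table 5.1, which is not reproduced here), and the step size, duration and spike-counting rule are likewise hypothetical choices.

```python
import math

def count_spikes(drive=1.0, t_end=500.0, dt=0.01,
                 eps=0.02, gamma=6.0, beta=0.1, x0=-2.0, y0=0.0):
    """Forward-Euler integration of Eqs. 6.4-6.5 for one isolated neuron.

    drive stands for H(p_input): 1.0 for a stimulated neuron.
    eps, gamma, beta are ASSUMED values, not those of Table 5.1.
    Returns the number of upward zero crossings of x (counted as spikes).
    """
    x, y = x0, y0
    spikes = 0
    for _ in range(int(t_end / dt)):
        dx = 3.0 * x - x ** 3 + 2.0 - y + drive
        dy = eps * (gamma * (1.0 + math.tanh(x / beta)) - y)
        x_new = x + dt * dx
        y = y + dt * dy
        if x <= 0.0 < x_new:      # x jumps from the silent to the active branch
            spikes += 1
        x = x_new
    return spikes

spike_count = count_spikes()      # stimulated neuron: repeated relaxation cycles
```

With these assumed constants, the stimulated neuron alternates between a slow silent phase and a shorter active phase, which is the relaxation-oscillation behavior described above.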
A neighborhood of four is chosen in each layer for the connections. Each neuron in the first layer is connected to all neurons in the second layer and vice versa. A global controller is connected to all neurons in the first and second layers as in [115]. In a first stage, segmentation is done in the two layers independently (with no extra-layer connections) as explained in Section 6.5, while dynamic matching is done with both intra-layer and extra-layer couplings. The intra-layer and extra-layer connections are defined as follows:

$$w^{int}_{i,j,k,m}(t) = \frac{w^{int}_{max}}{\mathrm{Card}\{N^{int}(i,j) \cup N^{ext}(i,j)\}} \cdot \frac{1}{e^{\lambda |p(i,j;t) - p(k,m;t)|}} \tag{6.7}$$

$$w^{ext}_{i,j,k,m}(t) = \frac{w^{ext}_{max}}{\mathrm{Card}\{N^{ext}(i,j) \cup N^{int}(i,j)\}} \cdot \frac{1}{e^{\lambda |p(i,j;t) - p(k,m;t)|}} \tag{6.8}$$

where $w^{int}_{i,j,k,m}(t)$ are intra-layer connections and $w^{ext}_{i,j,k,m}(t)$ are extra-layer connections (between the two layers), and $w^{int}_{max} = 0.2$ and $w^{ext}_{max} = 0.2$ are constants equal to the maximum value of the synaptic weights. $\mathrm{Card}\{N^{int}(i,j)\}$ is a normalization factor equal to the cardinal number (number of elements) of the set $N^{int}(i,j)$ containing the neighbors connected to neuron$_{i,j}$; it can be equal to 4, 3 or 2 depending on the location of the neuron on the map (center, corner, etc.) and on the number of active connections. A connection is active when $H(p(i,j) - p(k,m) - 0.01) = 1$, where p(i, j) and p(k, m) are input values and H(.) is the Heaviside function described in Equation 6.6. This condition is tested for both intra-layer and extra-layer connections. $\mathrm{Card}\{N^{ext}(i,j)\}$ is the cardinal number for extra-layer connections and is equal to the number of neurons in the second layer with an active connection to neuron$_{i,j}$ in the first layer. Note that the normalization in Equation 6.8 is mandatory if one wants to match similar pictures of different sizes. If the aim is to match objects of exactly the same size, the normalization factor should be set to a constant for all neurons. The reason is that, with normalization, even if the picture in the second layer were twice the size of the same object in the first layer, the total influence on neuron$_{i,j}$ would be the same as if the pattern were of the same size.
The schematic of the network is shown in Figure 6.3 (page 140).
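The size-invariance argument attached to Equation 6.8 can be checked numerically with a short Python sketch. It is a simplified illustration, not the thesis implementation: the value of `lam` ($\lambda$) is an assumption, the normalization factor is reduced to the number of connected neurons passed in as `card`, and every neuron of the object is assumed to have an active connection to the target neuron.

```python
import math

def synaptic_weight(p_a, p_b, card, w_max=0.2, lam=1.0):
    """Weight of Eqs. 6.7/6.8: (w_max / card) / exp(lam * |p_a - p_b|).

    lam is an assumed value; card is the normalization factor (cardinal number).
    """
    return (w_max / card) / math.exp(lam * abs(p_a - p_b))

def total_influence(p_target, object_inputs):
    """Sum of the weights from all neurons of an object to one target neuron,
    with card set to the number of connected neurons (assumed all active)."""
    card = len(object_inputs)
    return sum(synaptic_weight(p_target, p, card) for p in object_inputs)

small_object = [0.5] * 10    # 10 neurons with gray level 0.5
large_object = [0.5] * 20    # the same object, twice as large
influence_small = total_influence(0.5, small_object)
influence_large = total_influence(0.5, large_object)
```

Both totals equal `w_max`, so doubling the size of the object does not change its total influence on the target neuron; without the division by `card`, the larger object would contribute twice as much.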
6.5 Behavioral description of the network
The network has two different behavioral modes: segmentation and matching.
• Segmentation: In the segmentation stage, there is no connection between the two layers. The two layers act independently (except for the influence of the global controller) and segment the two images applied to the two layers respectively. The global controller forces the segments on the two layers to have different phases. At the end of this stage, the two images are segmented but no two objects have the same synchronization phase (Figure 6.6, page 143). The results from segmentation are used to create binary masks that select one object in each layer in multi-object scenes. In fact, a snapshot like the one shown in Figure 6.12 is used to create the binary mask m(i, j) for one of the objects as follows:

$$m(i,j) = \begin{cases} 1 & \text{for } x_{i,j}(t_{sync}) = x_{sync} \\ 0 & \text{otherwise} \end{cases} \tag{6.9}$$

$t_{sync}$ is a given instant of time after synchronization is reached. Since the neurons are noiseless, all the neurons synchronized with each other at a given $t_{sync}$ will have exactly the same output $x_{sync}$. $x_{sync}$ can be the synchronized value that corresponds to either the cross or the rectangle in Figure 6.12 at time $t_{sync}$.
The coupling strength $S_{i,j}$ for each layer, as used in Equation 6.4, is computed by:

$$S_{i,j}(t) = \sum_{k,m \in N^{int}(i,j)} w^{int}_{i,j,k,m}(t) H(x^{int}(k,m;t)) - \eta G(t) \tag{6.10}$$

H(.) is the Heaviside function and G(t) is the influence of the global controller, defined by Equations 6.11 and 6.12. $\eta$ should be set to a value smaller than the maximum value of synaptic weights, i.e., 0.25 in our case.

$$G(t) = \alpha H(z - \theta) \tag{6.11}$$

$$\frac{dz}{dt} = \sigma - \xi z \tag{6.12}$$

$\sigma$ is equal to 1 if the global activity of the network is greater than a predefined $\zeta$ and is zero otherwise.
The reason why we use the global controller is that the initial values $x_{i,j}(0)$ of the leading neurons (the leading neuron being the one to which all other neurons in the segment will synchronize) in two different segments may happen to be similar, which means that without a global controller these two segments would have similar phases. Note that, in contrast with integrate-and-fire neurons, the phase trajectory of relaxation neurons progresses in one direction and cannot jump back. Thus, it is impossible for non-leader neurons to delay the spiking of the leading neuron. On the other hand, the probability that two leaders have the same initial value is equal to $p(x)\Delta(x)$, where p(x) is the probability distribution used to pick the initial values in Equation 6.4. Since p(x) is bounded by 1, the aforementioned probability is upper-bounded by $\Delta(x)$, which is related to the numerical resolution of the integration and to the accuracy of the random number generator. If we assume that small phase discrepancies between regions are acceptable, it is very unlikely that two different segments synchronize in small networks (i.e., N ∼ 100). Hence, the global controller becomes truly necessary only for bigger networks.
• Dynamic Matching: In the matching phase, the external inputs to the layers are defined by the binary masks generated in the segmentation phase:

$$p^{matching}_{i,j} = m(i,j)\, p^{input}_{i,j} \tag{6.13}$$

Extra-layer connections (Equation 6.8) are established. If there are similar objects in the two layers, these extra-layer connections will help them synchronize. In other words, these two segments are bound together through these extra-layer connections [65]. In order to detect synchronization, double-thresholding can be used [164]. This stage may be seen as a folded oscillatory texture segmentation device like the one proposed in [100]. The coupling strength $S_{i,j}$ for each layer in the matching phase is defined as follows:

$$S_{i,j}(t) = \sum_{k,m \in N^{ext}(i,j)} \{ w^{ext}_{i,j,k,m}(t) H(x^{ext}(k,m;t)) + w^{int}_{i,j,k,m}(t) H(x^{int}(k,m;t)) \} - \eta G(t) \tag{6.14}$$

$x^{ext}$ is the output of extra-layer neurons (neurons belonging to the layer other than that of neuron$_{i,j}$) and $x^{int}$ is the output of intra-layer neurons (neurons belonging to the same layer as neuron$_{i,j}$).
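The hand-off between the two modes, i.e. building the binary mask of Equation 6.9 from a snapshot and using it to gate the input as in Equation 6.13, can be sketched as follows in Python. The snapshot values, the image and the optional tolerance `tol` are hypothetical; the text above relies on exact equality because the neurons are noiseless, so `tol` defaults to zero.

```python
def binary_mask(x_snapshot, x_sync, tol=0.0):
    """Eq. 6.9: m(i, j) = 1 where x_ij(t_sync) equals x_sync, 0 otherwise.

    tol is an illustrative extension (assumption); tol = 0 gives exact equality.
    """
    return [[1 if abs(x - x_sync) <= tol else 0 for x in row] for row in x_snapshot]

def matching_input(mask, p_input):
    """Eq. 6.13: p_matching(i, j) = m(i, j) * p_input(i, j)."""
    return [[m * p for m, p in zip(m_row, p_row)]
            for m_row, p_row in zip(mask, p_input)]

# Hypothetical 2x3 snapshot at t_sync: one segment outputs 1.8, the other -1.2.
x_snapshot = [[1.8, 1.8, -1.2],
              [1.8, -1.2, -1.2]]
p_input = [[0.9, 0.9, 0.4],
           [0.9, 0.4, 0.4]]

mask = binary_mask(x_snapshot, x_sync=1.8)
p_matching = matching_input(mask, p_input)
```

Choosing `x_sync = -1.2` instead would select the other segment, which is how the objects of a multi-object scene can be presented to the matching phase one at a time.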
6.6 Geometrical Interpretation of the ODLM
We know that an object can be represented by a set of points corresponding to its corners, and any affine transform is a map $T: \mathbb{R}^2 \to \mathbb{R}^2$ of these points defined by the following matrix operation:

$$\mathbf{p}' = A\mathbf{p} + \mathbf{t} \tag{6.15}$$

where A is a 2x2 non-singular matrix, $\mathbf{p} \in \mathbb{R}^2$ is a point in the plane, and $\mathbf{p}'$ is its affine transform. $\mathbf{t}$ is the translation vector. The transform is linear if $\mathbf{t} = 0$. For example, for a rotation with angle $\theta$, the matrix A is:

$$A = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

An affine transformation is a combination of several simple mappings such as rotation, scaling, translation, and shearing. The similarity transformation is a special case of affine transformation. It preserves length ratios and angles, while the affine transformation in general does not. In this section we show that the coupling $S_{i,j}$ is independent of the affine transform used. We know that any object can be decomposed into its constituent triangles (three corners per triangle). Now suppose that the set {a, b, c, d} is mapped to the set {T(a), T(b), T(c), T(d)}, and that the objects formed by these two sets of points are applied to the two layers of our neural network. Suppose also that points inside the triangle {a, b, c} (resp. {T(a), T(b), T(c)}) have values equal to A (corresponding to the gray-level value of the image at those points) and points inside {a, b, d} (resp. {T(a), T(b), T(d)}) have values equal to B.
There are $\Delta_{T(abc)}$ connections from the region with gray-level value A (triangle {T(a), T(b), T(c)}) and $\Delta_{T(abd)}$ connections from the region with gray-level value B (triangle {T(a), T(b), T(d)}) to the neuron$_{i,j}$ belonging to the triangle {a, b, c} with gray-level value A. We know that an affine transform conserves surface ratios (Figure 6.4):

$$\frac{\Delta_{abc}}{\Delta_{abd}} = \frac{\Delta_{T(abc)}}{\Delta_{T(abd)}} \tag{6.16}$$

where $\Delta_{abc}$ is the area of the triangle {a, b, c} (expressed in number of neurons). For neuron$_{i,j}$ belonging to {a, b, c} and neuron$_{k,m}$ belonging to {T(a), T(b), T(c)}, Equation 6.8 is equivalent to (neglecting the effect of intra-layer connections, since $N^{ext} \gg N^{int}$):

$$N^{ext} = \Delta_{T(abc)} + \Delta_{T(abd)} \tag{6.17}$$

Hence,

$$w^{ext}_{i,j,k,m}(t) = \frac{f(p(i,j;t) - p(k,m;t))}{\Delta_{T(abc)} + \Delta_{T(abd)}}, \quad \text{with } f(x - y) = \frac{w^{ext}_{max}}{e^{\lambda |x - y|}} \;\; \forall x, y \tag{6.18}$$

Therefore, the external coupling for neuron$_{i,j}$ from all neuron$_{k,m}$ becomes:

$$S_{i,j}(t) = \frac{\Delta_{T(abc)} f(A - A) \psi(t, \phi_1)}{\Delta_{T(abc)} + \Delta_{T(abd)}} + \frac{\Delta_{T(abd)} f(A - B) \psi(t, \phi_2)}{\Delta_{T(abc)} + \Delta_{T(abd)}}, \quad \text{with } \psi(t, \phi) = H(x^{ext}_{k,m}(t)) \tag{6.19}$$

where $\psi(t, \phi_1)$ and $\psi(t, \phi_2)$ (as seen in Figure 6.6, page 143) are respectively associated with spikes with phases $\phi_1$ and $\phi_2$ that appear after segmentation. After factorization and using Equation 6.16, we obtain:

$$S_{i,j}(t) = \frac{f(0) \psi(t, \phi_1)}{1 + \frac{\Delta_{abd}}{\Delta_{abc}}} + \frac{f(A - B) \psi(t, \phi_2)}{1 + \frac{\Delta_{abc}}{\Delta_{abd}}} \tag{6.20}$$
The geometrical interpretation outlined here can be extended to more than four points and can be applied to any complex object. This means that the extra-layer connections are independent of the affine transform that maps the model to the scene (first and second layer objects). Therefore, our template matching technique is independent of the affine transform that has reshaped the image in comparison with the template, which shows that the technique theoretically works with any affine reshaping of the objects.
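The key step of this argument, the conservation of area ratios under an affine map (Equation 6.16) and the resulting invariance of the weighting factors in Equation 6.20, can be verified numerically. The Python sketch below uses an arbitrary non-singular matrix, translation vector and set of four points chosen purely for illustration; continuous triangle areas stand in for the neuron counts used in the text.

```python
def affine(point, A, t):
    """Apply p' = A p + t (Eq. 6.15) to a 2-D point."""
    x, y = point
    return (A[0][0] * x + A[0][1] * y + t[0],
            A[1][0] * x + A[1][1] * y + t[1])

def triangle_area(p, q, r):
    """Area of the triangle (p, q, r) from the cross product of two edges."""
    return abs((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])) / 2.0

# Hypothetical corners: triangles {a, b, c} and {a, b, d} share the edge ab.
a, b, c, d = (0.0, 0.0), (4.0, 0.0), (1.0, 3.0), (2.0, -5.0)
A = [[2.0, 1.0],
     [0.5, 3.0]]            # determinant 5.5, hence non-singular
t = (7.0, -2.0)

Ta, Tb, Tc, Td = (affine(p, A, t) for p in (a, b, c, d))

ratio_before = triangle_area(a, b, c) / triangle_area(a, b, d)
ratio_after = triangle_area(Ta, Tb, Tc) / triangle_area(Ta, Tb, Td)

# Weighting factor of the first term of Eq. 6.19 (transformed areas)
# and of Eq. 6.20 (original areas).
factor_transformed = triangle_area(Ta, Tb, Tc) / (
    triangle_area(Ta, Tb, Tc) + triangle_area(Ta, Tb, Td))
factor_original = 1.0 / (1.0 + triangle_area(a, b, d) / triangle_area(a, b, c))
```

Both areas are scaled by the same factor (the absolute value of the determinant of A), so the ratio and the weighting factors are unchanged, as the derivation requires.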
Note that if there are several objects in the scene and we want to match patterns, we can
use the results from the segmentation phase to break the scene into its constituent parts
(each synchronized region corresponds to one of the objects in the scene) and apply the
objects one by one to the network, until all combinations are tested. This is not possible
in the averaged Dynamic Link Matching case of Konen et al. where no segmentation
occurs.
6.7 Results
As stated earlier, this network can be used to solve the correspondence problem. For example, suppose that on a factory production line, someone wants to check for the presence of a component on an electronic circuit board. All he or she has to do is to put an image of the component on the first layer and check for synchronization between the layers. Ideally, any change in the angle or the location of the camera, or even in the zoom factor, should not influence the result. One of the signal processing counterparts of our proposed technique is morphological processing [165]. Other partial solutions such as the Fourier (resp. Mellin) transform could be used to perform matching robust to translation (resp. scaling) [165].
There is no need to train or configure our architecture for the stimulus we want to apply. The network is autonomous and can handle previously unseen stimuli. This is in contrast with associative-memory-based architectures, in which a stimulus must be applied and saved into memory before retrieval [156]. It does not require any pre-configured architecture adapted to the stimulus, as in the hierarchical coding paradigm [69]. DLM can play an important role in structuring memory, e.g. finding structural similarities between stored information during sleep [163].
In this manuscript, we show the aforementioned capacities of the network using a prototype that will help us study the dynamics of the network.
6.8 Rate Coding vs. Phase coding
The aim of this section is to show that the original DLM is a rate-coding approximation
of the ODLM. First, we define what “rate coding” and “phase coding” mean. We
then show how the DLM and the ODLM are related to each other.
6.8.1 Rate Coding (Average over Time)
The first and most commonly used definition of firing rate refers to temporal average.
This is essentially the spike count in an interval T divided by T . The length of the time
window is set by the experimenter and depends on the type of neuron recorded from and
the stimulus. In practice, to get sensible averages, several spikes should occur within the
time window. Values of T = 100 ms or T = 50 ms are typical, but the duration may also
be longer or shorter.
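As a minimal sketch of the temporal-average definition above, the following Python snippet counts the spikes that fall inside a window of length T and divides the count by T. The function name `firing_rate` and the spike times are illustrative assumptions, not data from this work.

```python
def firing_rate(spike_times, t_start, window):
    """Temporal-average rate: spike count in [t_start, t_start + window) divided by window."""
    if window <= 0:
        raise ValueError("window must be positive")
    count = sum(1 for t in spike_times if t_start <= t < t_start + window)
    return count / window


# Hypothetical spike train, times in seconds (illustrative values only).
spikes = [0.012, 0.031, 0.048, 0.075, 0.093, 0.140]

# T = 100 ms: five of the six spikes fall inside the window.
rate = firing_rate(spikes, 0.0, 0.100)
```

With a shorter window the same spike train yields a coarser estimate, which is why several spikes per window are needed for a meaningful average.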
6.8.2 Phase coding
We can also use spikes from other neurons as the reference signal for a pulse code. In
this scheme, times at which neurons spike convey the information (and not the averaged
rate). For example, synchrony between a pair or a group of neurons could signify special
events and convey information which is not contained in the firing rate of neurons (for
more details on synchrony and temporal correlation see Chapters 3, 4, and 5).
More generally, not only synchrony but any precise spatio-temporal pulse pattern could
be a meaningful event. For example, a spike pattern of three neurons, where neuron 1 fires
at some arbitrary time t1 followed by neuron 2 at time t1 + δ1 and by neuron 3 at t1 + δ2,
might represent a certain stimulus condition (Rank Order Coding [109, 166]).
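The difference from rate coding can be illustrated with a small Python sketch. Two spike trains with the same spike count (hence the same rate) are compared with a reference train: only the relative spike times distinguish them. The coincidence measure, the tolerance, and all numerical values are assumptions chosen for illustration; they are not the synchrony measure used elsewhere in this work.

```python
def coincidence_fraction(train_a, train_b, tolerance):
    """Fraction of spikes in train_a that have a spike of train_b within +/- tolerance."""
    if not train_a:
        return 0.0
    hits = sum(1 for ta in train_a if any(abs(ta - tb) <= tolerance for tb in train_b))
    return hits / len(train_a)


def rank_order(first_spike_times):
    """Neuron labels sorted by the time of their first spike (earliest first)."""
    return [label for label, _ in sorted(first_spike_times.items(), key=lambda kv: kv[1])]


# Three trains with four spikes each: identical rates, different timing.
reference = [10.0, 30.0, 50.0, 70.0]
synchronous = [10.5, 29.8, 50.2, 70.1]
shifted = [20.0, 40.0, 60.0, 80.0]

sync_score = coincidence_fraction(reference, synchronous, 1.0)
async_score = coincidence_fraction(reference, shifted, 1.0)

# Rank order pattern of the three-neuron example: t1, t1 + delta1, t1 + delta2.
t1, delta1, delta2 = 5.0, 2.0, 3.5
order = rank_order({"neuron 1": t1, "neuron 2": t1 + delta1, "neuron 3": t1 + delta2})
```

A rate code would treat the three trains as equivalent, whereas the coincidence score separates the synchronous train from the shifted one.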
6.8.3 Dynamics of the Rate-coding DLM
Aonishi et al. [158] have shown that a canonical form of rate-coding dynamic equations
solves the matching problem in the mathematical sense. The dynamics of a neuron in one
of the layers of the original Dynamic Link Matcher proposed in [154] are as follows (see
Section 6.3 for more details):
$$\frac{dx_r}{dt} = -\alpha x_r + (k * \sigma(x_r)) + I_{x_r} \qquad (6.21)$$
where k(.) is a neighborhood function, $I_{x_r}$ is the summed value of extra-layer couplings,
σ(.) is the sigmoidal function, $x_r$ is the output of the rate-coded neuron, and ∗ denotes
convolution.
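To make the structure of Equation 6.21 concrete, the sketch below integrates it with a forward-Euler step on a one-dimensional layer. The layer size, the circular boundary, the three-tap neighborhood kernel, the values of α and of the time step, and the function names are all assumptions made for this illustration; the original DLM uses two-dimensional layers and its own parameter values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def convolve_circular(kernel, values):
    """Circular convolution of a centered, odd-length kernel with a 1-D layer."""
    n, half = len(values), len(kernel) // 2
    return [
        sum(kernel[m + half] * values[(i - m) % n] for m in range(-half, half + 1))
        for i in range(n)
    ]


def dlm_step(x, external, kernel, alpha, dt):
    """One forward-Euler step of dx/dt = -alpha * x + (k * sigma(x)) + I."""
    lateral = convolve_circular(kernel, [sigmoid(v) for v in x])
    return [xi + dt * (-alpha * xi + li + ii) for xi, li, ii in zip(x, lateral, external)]


# Hypothetical setup: 8 neurons, extra-layer input applied to neurons 3 and 4 only.
x = [0.0] * 8
external = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
kernel = [0.2, 0.5, 0.2]

for _ in range(2000):
    x = dlm_step(x, external, kernel, alpha=1.0, dt=0.01)
```

After the transient, the neurons that receive extra-layer input settle at a higher activity than the others; the decay term, the lateral term, and the input term of Equation 6.21 each appear once in `dlm_step`.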
In order to prove that our system is a generalization of previous works by [154], we need to
perform a fixed-point approximation of the Van der Pol equation (the way we have derived
this fixed-point approximation is described in Appendix E). This approximation proves
that we can replace the Wang-Terman (Van der Pol) oscillators by the simpler “integrate-
and-fire” neurons. We will use this property in appendix G to prove that the original
DLM can be derived from our ODLM by a time average approximation.
6.8.4 Segmentation and Matching for Invariant Pattern Recognition
A rectangular neuron map is chosen, with 5x5 neurons in each layer. A vertical
bar on a background is presented to the first layer. The second layer receives the same
object transformed by an affine transformation (rotation, translation, etc.).
Figure 6.5 shows an activity snapshot (instantaneous values of x(i, j))
in the two layers after segmentation (first phase). Note that same-colored neurons have
similar phases in the figure. On the other hand, different segments on different layers are
desynchronized (see Figures 6.6 and 6.7). In the dynamic matching stage, similar objects
among different layers are synchronized (Figure 6.9). The thresholded sum (synchronization
index) of the activity of all neurons, $\sum_{i,j} H(x(i,j) - 0.5)$, is shown in Figure 6.8
for the segmentation phase and in Figure 6.9 for the dynamic matching phase. Since
there are four different regions in the two layers with different phases at the end of the
segmentation phase, four different synchronization regions can be seen in Figure 6.8. In
the dynamic matching phase, the similar objects (and the backgrounds) merge with each
other, producing only two distinct regions. In addition, when zero-mean Gaussian noise
with variance σ² = 0.1 is added to both stimuli (SNR = 10 dB), the matching results
remain unchanged.
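The synchronization index used above is simple to compute from a snapshot of the network. The following sketch evaluates the thresholded sum over two 5x5 layers; the snapshot values (1.0 for a firing neuron, -1.0 for a silent one) and the function names are assumptions for illustration, with H taken as the Heaviside step function.

```python
def heaviside(value):
    """Heaviside step: 1 for positive arguments, 0 otherwise."""
    return 1 if value > 0 else 0


def synchronization_index(layers, threshold=0.5):
    """Sum of H(x(i, j) - threshold) over every neuron of every layer."""
    return sum(heaviside(x - threshold) for layer in layers for row in layer for x in row)


# Hypothetical snapshot: the bar neurons of both layers fire at the same instant.
layer1 = [[1.0 if j == 2 else -1.0 for j in range(5)] for i in range(5)]  # vertical bar
layer2 = [[1.0 if i == 2 else -1.0 for j in range(5)] for i in range(5)]  # horizontal bar

index = synchronization_index([layer1, layer2])
```

Evaluated at every time step, the index produces one vertical rod per synchronized ensemble, and the height of each rod is the number of neurons firing together.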
6.8.5 One-object scenes
Note that if only one object is present in each layer of the scene, then the segmentation
phase can be bypassed and the network could function directly in the matching mode.
This strategy will help us speed up the pattern recognition process. Figure 6.10 and
Figure 6.11 show the behavior of a 13x5 network when only one object is present in each
layer. The synchronization time for the matching-only network is shorter: 85 oscillations
(Figure 6.10) versus 155 when segmentation precedes matching (Figure 6.11). Note that the
matching-only approach cannot be used if there are multiple objects in the scene. In
that case, the segmentation-plus-matching approach should be used.
6.9 Conclusion and Further Work
We proposed oscillatory dynamic link matching as a means to segment images and
solve the correspondence problem, as a whole system, using a two-layered oscillatory
neural network. Our work is an extension of the dynamic link matcher proposed by
Konen et al. [154]. In fact, we showed that Konen’s model is a time-averaged version of
our proposed technique. We showed theoretically and with “toy-object” experiments that
our network is capable of establishing correspondence between images and is robust to
translation, rotation, noise and homothetic transforms. More experiments with complex
objects and more general transforms such as shearing are under investigation. Pattern
recognition of occluded objects is another challenge for the proposed architecture and
will be presented in future work. A more detailed study of robustness to noise should
also be carried out for the proposed architecture.
Van Hemmen has shown that the maximum number of segmented objects in a network
that uses temporal correlation is 6-7 objects [112]. Wang and Terman [100] have proposed
an algorithmic version of the Wang-Terman oscillator that can circumvent this limitation.
In future work, the integration of this algorithmic version into the approach should be
considered. The problem with the algorithmic version is that implementing it requires global
information from all the neurons (or at least from the neurons of a synchronized region).
This contradicts the “modular” property of neural networks.
The possibility of the insertion of this architecture in our bottom-up sound segregator [42]
[104] (and chapter 5) as a top-down processor can be investigated in a further work. In
fact, in this application, visual images will be replaced by CAM (Cochleotopic/AMtopic)
and CSM (Cochleotopic/Spectrotopic) Maps proposed in [138]. The approach could also
be used as a separate discrete-word recognizer (see Figure 6.13, Page 150).
Figure 6.2: An industrial application of the ODLM. Top: A resized version of an object
is applied to the matching layer. Synchronization is achieved, hence the object exists in
the visual scene. Bottom: A totally different object (which is not part of the scene)
is applied to the matching layer. Synchronization is not achieved. The object is not
matched.
[Figure 6.3 diagram labels: G; Neuron i,j; Neuron k,m; Neuron k',m'; xext; xint; H(.); Wext; Wint]
Figure 6.3: The architecture of the oscillatory dynamic link matcher. The number of
neurons in the figure does not correspond to the real number of neurons. The global
controller has bidirectional connections to all neurons in the two layers. Synchronization
between neurons of the two layers is achieved when there is an affine similarity between
the pattern and the scene.
[Figure 6.4 diagram: an object with corners a, b, c, d and its image with corners T(a), T(b), T(c), T(d)]
Figure 6.4: An affine transform T for a four-corner object.
Figure 6.5: A snapshot of the activity of the first and second layers of the neural map.
Colors represent relative phase of oscillations.
[Figure 6.6 plots: two panels of neuron activity versus simulation time (0 to 12000), annotated with t sync and x sync]
Figure 6.6: Neural activity pattern after segmentation. Left: Activity of one of the neurons
associated with the vertical bar in the first layer after segmentation. Right: Activity of
one of the neurons associated with the background in the same layer.
[Figure 6.7 plots: two panels of neuron activity versus simulation time (0 to 10000), each annotated with t sync and x sync]
Figure 6.7: Neural activity pattern after matching. Left: Activity of one of the neurons
associated with the horizontal bar in the first layer after dynamic matching. Right:
Activity of one of the neurons associated with the vertical bar after dynamic matching in
the second layer. The two neurons are in full synchronization.
Figure 6.8: The evolution of the thresholded activity of the network through time in the
segmentation phase. Each vertical rod represents a synchronized ensemble of neurons
and the number of neurons in that synchronized region is represented on the vertical axis.
Figure 6.9: The evolution of the thresholded activity of the network through time in the
dynamic matching phase.
[Figure 6.10 plot, titled “Synchronization”: synchronization index (0 to 120) versus simulation time (0 to 18000)]
Figure 6.10: The Synchronization index of a one-object scene when the segmentation step
is bypassed. The synchronization takes 85 oscillations (spikes).
[Figure 6.11 plot, titled “Synchronization”: synchronization index (0 to 140) versus simulation time (0 to 3 x 10^4)]
Figure 6.11: The synchronization pattern of a one-object scene when the segmentation
phase precedes the matching phase. The synchronization takes 155 oscillations (spikes).
[Figure 6.12 image: segmented scene on a grid, horizontal axis from 0.5 to 5.5 and vertical axis from 2 to 12]
Figure 6.12: A scene segmentation done during the segmentation phase of the algorithm.
Colors represent synchronization phase. Binary masks are generated by assigning binary
values to different oscillation phases (Equation 6.9).
Figure 6.13: Architecture of an integrated top-down and bottom-up processor (under
investigation): In the segregation level the desired sound source is segregated using har-
monicity and localized energy cues by the CAM/CSM maps (Bottom-Up segregation).
The bottom-up segregation generates a mask from which cochlear channels are selected.
The result of this stage is compared with pre-stored patterns via the Dynamic Link
Matcher and the best match (e.g., vowel) is found.
CHAPTER 7
CONCLUSION
7.1 Summary
The previous six chapters formed the complete presentation of the architectures I have
proposed for auditory and visual scene analysis. In this final chapter I will take a slightly
broader perspective to look at the things I might have added to the system (in an ideal
world without time constraints), and some of the aspects of audition I still do not know
how to incorporate into this approach.
7.2 What has been presented
Before drawing conclusions, let us briefly review what this thesis has contained. I started
by presenting the state of the art in Computational Auditory Scene Analysis (CASA) and
by pointing out the lack of a general system that can handle all types of mixtures and
sounds. I then proposed to partially mimic the behavior of the nervous system to come
up with a system that can separate different sound sources. To do so, I laid out in
Chapter 3 the neurocognitive aspects I used later, and in Chapter 4 I tried to find the ‘best’
mathematical model of bio-inspired neurons, one that gives roughly the same behavior as
real neurons at a relatively low computational complexity. I proposed two
different representations (the Cochleotopic/AMtopic and the Cochleotopic/Spectrotopic
Maps) that I used as the front end to my neural architecture. I then proposed an architecture
for the sound source separation problem based on temporal correlation. Next came the
synthesis of the sound. The synthesis quality of the conventional synthesis
filterbank used in other works was not satisfactory. Therefore, jointly with colleagues,
I adapted the FIR gammatone filters to this work.
In the visual scene analysis domain, the ‘Dynamic Link Matching’ has been extended
to bio-inspired neurons. I called this extension ‘Oscillatory Dynamic Link Matching’
(ODLM). I studied some theoretical properties of this new architecture and proved some
fundamental concepts about this technique. I then applied ‘toy objects’ to the system
and showed that what I had derived mathematically holds for these simple objects.
7.3 Future developments of the model
As stated earlier, none of the proposed systems in the literature is a complete system that
can function in any condition. The system proposed in this thesis is not an exception to
this general rule. Many things are missing from this work. This incompleteness is due either
to the general lack of know-how in the scientific community in the emerging
field of Computational Auditory Scene Analysis or to the time constraints of this work. The
following list consists of the system features that I would most like to add in the future:
• Automatic CAM/CSM selection. In this work the selection between the two
representations (maps) is done manually. An algorithm should be devised that
decides, based on the features of the mixture, which representation is adequate
for the task.
• ‘Top-down’ processing of information. The processing of information in this
work is done in a bottom-up manner. It means that no lexical or any higher level
information has been used for separation. We know that this kind of information
is very important for better and more robust source separation. In my opinion, the
integration of such a ‘top-down’ processor would be an asset to the system.
• Auditory maps issues. I have limited myself to two different representations
(the CAM and the CSM). From neurophysiological observations, it is well known
that more than two maps are generated in the brain. Therefore, a system can be
designed in the future that extends the two-map strategy to multi-map strategy.
Other candidate maps could be those detecting onsets, offsets, etc. In addition, the
maps I have proposed are only approximations to real maps used in the brain. Their
optimality cannot be guaranteed. All I have shown in this work is that these
two maps can solve certain auditory scene analysis problems. I have also speculated
that there might be some correlations between these maps and the ‘real-world’ maps
of the nervous system. A further investigation should prove whether it is possible
to use other signal processing techniques (such as wavelets) to further enhance the
representations.
• Neural architecture. Based on our observations and findings, I tried to find a
trade-off between performance and computational complexity for the neural networks
I have proposed. There are many empirical and ad-hoc parameters in the
network (as in many other neural networks). Future work should further enhance
the performance of my proposed network.
• Visual pattern recognition. Since this part was an extension of my initial goals
for this thesis, I have not applied real-world images to this network. The following
questions should be answered by future work: What happens when visual objects
with occlusions are applied to this network? What is the ultimate
performance of the technique?
• Implementational issues. Throughout this work I did not focus on the optimization
of the computer code, leaving this for future work. An ‘event-driven’¹
simulator can help decrease the simulation time and attain real-time operation.
¹ An event-driven simulator updates the state of a simulation block only if there is a change in the
external inputs to that block.
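The event-driven principle can be sketched in a few lines of Python. In this illustration, which is not the simulator used in this work, each unit accumulates its inputs and is touched only when an input event reaches it; a priority queue orders the events in time. The unit names, weights, threshold, and delay are hypothetical.

```python
import heapq


def run_event_driven(initial_events, weights, threshold, delay, t_end):
    """Process input events in time order; a unit is updated only when an event reaches it."""
    potentials = {unit: 0.0 for unit in weights}
    updates = {unit: 0 for unit in weights}
    spikes = []
    queue = list(initial_events)  # entries: (time, target unit, input value)
    heapq.heapify(queue)
    while queue:
        time, unit, value = heapq.heappop(queue)
        if time > t_end:
            break
        potentials[unit] += value
        updates[unit] += 1
        if potentials[unit] >= threshold:
            potentials[unit] = 0.0  # reset after firing
            spikes.append((time, unit))
            for target, weight in weights[unit].items():
                heapq.heappush(queue, (time + delay, target, weight))
    return spikes, updates


# Hypothetical four-unit network: a drives b, b weakly drives c, d is isolated.
weights = {"a": {"b": 1.0}, "b": {"c": 0.4}, "c": {}, "d": {}}
spikes, updates = run_event_driven([(0.0, "a", 1.0)], weights, threshold=1.0, delay=1.0, t_end=10.0)
```

Unit d receives no event and is therefore never updated, whereas a clock-driven simulator would recompute its state at every time step.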
7.4 The future of Computational Auditory Scene Analysis
The initial goals of this project were too ambitious, and while it is fun to think in
grandiose terms about the ‘whole’ of audition, it is not always so obvious to find a
general solution to the auditory scene analysis problem.
In my opinion, today’s most sophisticated theories will appear naive and almost
willfully simplistic in the quite near future. This is inevitable: the challenge is to
make the discoveries that will permit a more realistically complex model of auditory
perception. The technical gap between the first attempts at sound source
separation in the ’70s (i.e., by Parsons [17]) and current sound separators is great.
Very little work was done in the ’70s and ’80s, and the real explosion began
in the late ’90s.
According to Ellis [30], “the comparison of machine vision and machine listening is
sobering. There are similarities between our field and the computer models vision of
fifteen or twenty years ago, and vision is still very far from being a ‘solved problem’”.
However there are reasons to be optimistic: Firstly, many of the lessons gained rather
painfully in vision research have been incorporated directly into theories of hearing
rather than being rediscovered. Secondly, hearing is simpler than vision, in terms of
sensory bandwidth, and at the same time the inexorable advance of computational
power makes possible models that would previously have been unthinkable. There
are also other reasons to think that machine listening will be tougher to achieve than
machine vision. Our physiological knowledge about the auditory nervous pathway
is much less than what we know about vision. Furthermore, there is a lag between
our fundamental understanding of audition compared to vision. The first book that
somehow outlined the bases of Gestalt principles for vision was published in 1936 by
Koffka [11] (these principles were elaborated later by Marr in 1982 [13] for computer
vision purposes). The first book ever published about auditory scene analysis was in
1990 by Bregman [6]. These historical facts show that there is a lot to do at the
pure-science level (psychology, neurophysiology, etc.) in parallel with technical research
in engineering.
Perception is the right biological mystery to be studying at the moment, given
our experimental and computational tools. Its solution will lead naturally into the
deeper cognitive secrets of the brain and will let us design more robust engineering
devices.
APPENDIX A: CANONICAL NEURONAL MODEL
The purpose of this appendix is to show that all neural models described in this thesis
(i.e., integrate-and-fire, relaxation oscillators, etc.) can be reduced to the canonical model
described here. The canonical model is a unified framework for the analysis of bio-inspired
neural networks.
As stated before, all class I neurons become active by means of a saddle-node bifurcation.
In nonlinear dynamical systems, a bifurcation is a period doubling, quadrupling, etc.
(Figure A-1, page 158) that accompanies the onset of chaos. It represents the sudden
appearance of a qualitatively different solution for a nonlinear system as some of the parameters
of the system’s differential equations are varied. Saddle-node bifurcations on limit cycles
are ubiquitous in two-dimensional systems:
$$\dot{x} = f(x, y), \qquad \dot{y} = g(x, y) \qquad \text{(A-1)}$$
where f and g are continuous functions. Let us plot the nullclines $\dot{x} = 0$ and $\dot{y} = 0$ in
the xy-plane². Each intersection of the nullclines corresponds to an equilibrium of the
model. When the nullclines intersect as in Figure A-2, the bifurcation occurs. Note that
the neurons introduced in Section 4.2.2 can be put in this canonical format. Roughly
speaking, a saddle-node bifurcation occurs when there are two intersections of nullclines,
one stable and the other unstable. Saddle-node bifurcation on a limit cycle leading to
Class 1 neural excitability can be observed in many multidimensional neural models such
as the Hodgkin-Huxley, Morris-Lecar, etc. Although the Hodgkin-Huxley model exhibits
class 2 excitability for the original values of parameters, it exhibits class 1 excitability
when a transient potassium A-current is taken into account.
² In a two-dimensional system of differential equations the nullclines are the curves where the vector
field is either horizontal or vertical. The horizontal nullcline is found by setting $\dot{y} = 0$ since this says that
there is no vertical component of the vector field along this curve. Similarly, to find the vertical nullcline
we set $\dot{x} = 0$.
Figure A-1: Bifurcation in a two-dimensional space. In a dynamical system, a bifurcation
is a period doubling, quadrupling, etc., that accompanies the onset of chaos. It represents
the sudden appearance of a qualitatively different solution for a nonlinear system as some
parameter is varied. The illustration above shows bifurcations (occurring at the location
of vertical lines) of the logistic map as the parameter r is varied. Bifurcations come in four
basic varieties: flip bifurcation, fold bifurcation, pitchfork bifurcation, and transcritical
bifurcation (adapted from http://www.mathworld.com).
The characteristic feature of an Andronov-Hopf bifurcation is that the equilibrium point
loses its stability and a limit cycle appears. If the initial value is on the limit cycle, then
the point moves along the curve, periodically returning to the initial point (oscillatory
activity).
Figure A-2: Saddle-node bifurcation in Wilson-Cowan oscillators
Canonical model for saddle-node bifurcation
The state of a neuron follows the dynamics [167]:
$$\dot{X} = F(X, \lambda) \qquad \text{(A-2)}$$
λ is a vector of parameters and X is the state vector of the system containing the membrane
potential, the ions, the channels, etc. Now suppose that λ0 is the vector value for which
there is a ‘saddle-node’ bifurcation. For all λ close to λ0 we can find a map h(X, λ) that
transforms every system of the form of Equation A-2 to the Ermentrout canonical model:
$$\varphi' = (1 - \cos\varphi) + (1 + \cos\varphi)\, r \qquad \text{(A-3)}$$
where $\varphi \in S^1$ is a phase variable (state variable) that describes the activity of the neuron
along the limit cycle, $S^1 = \{e^{j\varphi} \in \mathbb{C}\}$ is the unit circle in the complex plane, and $r \in \mathbb{R}$
is a new bifurcation parameter. The transformation h that maps solutions of A-2 to those
of A-3 blows up a small neighborhood of the saddle-node bifurcation point and compresses
the entire limit cycle to an open set around $\pi \in S^1$ (Figure A-3).
The canonical model has the following interesting behavior: if r > 0 the neuron spikes
periodically with period $\pi/\sqrt{r}$; if r < 0 the spiking threshold $\varphi_+$ and the equilibrium point $\varphi_-$
are given by:
$$\varphi_\pm = \pm\cos^{-1}\frac{1 + r}{1 - r} \qquad \text{(A-4)}$$
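The two regimes of the canonical model can be checked numerically. The sketch below integrates Equation A-3 with a forward-Euler scheme and measures the time the phase needs to travel once around the circle for r > 0; for r < 0 it evaluates Equation A-4. The substitution x = tan(ϕ/2) turns A-3 into x′ = x² + r, whose passage time from −∞ to +∞ is π/√r, which is the value the simulation should reproduce. The integration scheme, the step size, and the function names are assumptions of this illustration.

```python
import math


def canonical_rhs(phi, r):
    """Right-hand side of Equation A-3."""
    return (1.0 - math.cos(phi)) + (1.0 + math.cos(phi)) * r


def spiking_period(r, dt=1e-4):
    """Time for phi to travel from -pi to pi when r > 0 (forward Euler)."""
    if r <= 0:
        raise ValueError("the neuron spikes periodically only for r > 0")
    phi, t = -math.pi, 0.0
    while phi < math.pi:
        phi += dt * canonical_rhs(phi, r)
        t += dt
    return t


def rest_and_threshold(r):
    """Equilibrium phi_minus and spiking threshold phi_plus of Equation A-4 (r < 0)."""
    if r >= 0:
        raise ValueError("equilibria exist only for r < 0")
    phi_plus = math.acos((1.0 + r) / (1.0 - r))
    return -phi_plus, phi_plus


period = spiking_period(0.25)                     # expected close to pi / sqrt(0.25)
phi_minus, phi_plus = rest_and_threshold(-0.5)    # both are zeros of canonical_rhs
```

For r < 0 the derivative of the right-hand side, (1 − r) sin ϕ, is negative at ϕ₋ and positive at ϕ₊, which is why ϕ₋ is the resting point and ϕ₊ acts as the threshold.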
Weakly Connected Neural Networks in the Canonical Form
Figure A-3: The transformation h maps solutions of Equation A-2 to those of Equation
A-3
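The dynamics of a weakly coupled neural network can be written in the following canonical
form:
$$\dot{X}_i = F_i(X_i, \lambda) + \varepsilon\, G_i(X_1, X_2, \ldots, X_N, \lambda, \varepsilon) \qquad \text{(A-5)}$$
$G_i(\cdot)$ describes how the ith neuron is affected by the other neurons, and $X_i$ describes the
activity of the ith neuron. For weakly coupled networks $\varepsilon \ll 1$. After linearization and
some approximations [84] the weakly connected neural network becomes:
$$\varphi_i' = (1 - \cos\varphi_i) + (1 + \cos\varphi_i)\, r_i + \sum_{j=1}^{n} w_{ij}(\varphi_i)\,\delta(\varphi_j - \pi) + O(\sqrt{\varepsilon}\,\ln\varepsilon) \qquad \text{(A-6)}$$
$$w_{ij}(\varphi_i) = 2\,\mathrm{atan}\!\left(\tan\frac{\varphi_i}{2} + s_{ij}\right) \qquad \text{(A-7)}$$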
Coupling of two neurons
• Unidirectional coupling. Here we consider the situation in which a neuron
is connected to another neuron in one direction only (neuron 2 receives input from
neuron 1, but neuron 1 does not receive input from neuron 2). The dynamics of this
system are:
$$\varphi_1' = (1 - \cos\varphi_1) + (1 + \cos\varphi_1)\, r \qquad \text{(A-8)}$$
$$\varphi_2' = (1 - \cos\varphi_2) + (1 + \cos\varphi_2)\, r + w(\varphi_2)\,\delta(\varphi_1 - \pi) \qquad \text{(A-9)}$$
Let us perturb the in-phase solution ϕ2(t) = ϕ1(t), supposing that ϕ2 > ϕ1. Since
w ≥ 0, the spikes of ϕ1 advance ϕ2 even further. This is due to the fact that
w(ϕ2)δ(ϕ1 − π) is always positive. Therefore, the in-phase solution is unstable in this direction.
After a while, ϕ2(t) → ϕ1(t) + 2π and each firing of ϕ1 advances ϕ2 even closer to
ϕ1(t) + 2π ≡ ϕ1(t) (note that ϕ1(t) is periodic with period 2π). We see that the in-phase
synchronized solution for this synaptic organization is stable in one direction
and unstable in the other [97]. This is because we have supposed that the two neurons
are identical. If the neurons are different, say r1 < r2, then no synchronized
solution exists; there is an in-phase synchronized solution when r1 > r2, and the
shift increases when r1 − r2 increases.
• Bidirectional coupling. Consider now the case of bidirectional coupling:
$$\varphi_1' = (1 - \cos\varphi_1) + (1 + \cos\varphi_1)\, r + w_{12}(\varphi_1)\,\delta(\varphi_2 - \pi) \qquad \text{(A-10)}$$
$$\varphi_2' = (1 - \cos\varphi_2) + (1 + \cos\varphi_2)\, r + w_{21}(\varphi_2)\,\delta(\varphi_1 - \pi) \qquad \text{(A-11)}$$
We can show that for this configuration, the difference ϕ2 − ϕ1 may take different
values during an oscillation, but returns to its initial value at the end of the oscillation,
i.e. ϕ2(0) − ϕ1(0) = ϕ2(T) − ϕ1(T), where T is the oscillation period.
This framework enables us to analyze the coupling of bio-inspired neurons in a
standard manner.
APPENDIX B: CHAOTIC-BASED SOUND SEPARATIONS
The computational burden of the technique proposed in Chapter 5 is high; therefore
some simplifications/optimizations should be made so that the technique can be implemented
in real time. For instance, the use of chaotic neural networks (see Chapter 4)
instead of Wang-Terman oscillators can help us speed up the separation process. Wang-Terman
oscillators are described by stiff equations that must be solved by numerical integration
techniques with a very small step, whereas the chaotic neurons use only additions and
multiplications (for details see Chapter 4 and [82]). The disadvantage of chaotic neurons is that while
their dynamics are simple, the analysis of the synchronization is complex. Since the outputs
of chaotic neurons are not ergodic³, two outputs may be synchronized for a time interval
but not synchronized for the entire process. In addition, in this preliminary work correlograms
have been used: they are computationally expensive and should be replaced by
CAM/CSM. Although, as stated in Chapter 4, there are some reports on chaotic behavior
of neurons, the chaotic model is less biologically plausible than spiking neural networks.
A very simplified one-dimensional version of the neural separator proposed in Chapter 5 has
been tested and a very preliminary version has been designed. We applied correlograms
of AM envelopes of cochlear filterbank outputs to a network of oscillatory neurons, in
order to separate two speakers (or a speaker from a tone). In this approach synchronized
regions belong to the same speaker while regions desynchronized with respect to the
first speaker’s clusters correspond to other speakers (or noise). Our proposed network is
composed of chaotic neuronal elements as in [91] but is one-dimensional. Our learning
algorithm is a modified version of the rules proposed in the work by Zhao et al. We
achieved synchronization patterns that are different from those in Zhao et al. [91]. In fact,
we think that the periodic and quasi-periodic patterns we obtained in our work are biologically
more plausible. In contrast with other works, we did not use any global controller. Our
tests on pilot and real data showed that the symmetry breaking is done automatically in
this network. Although a more detailed analysis should be done to prove this statement,
we think that this behavior stems from the fact that the behavior of the network is
chaotic at the beginning (before synchronization), which gives the network enough time
to desynchronize. To our knowledge, this is the first time that real speech data has been
applied to a one-dimensional chaotic neural network. In addition, our network and its
associated learning algorithm are well suited to multilevel inputs and not just to binary
ones.
³ Ergodicity: an attribute of stochastic systems; generally, a system that tends in probability to a
limiting form that is independent of the initial conditions. This is due to the fact that the statistical
average of the stochastic system is equal to the time average of the variable.
Our preprocessing stage consists of a 24-channel cochlear filterbank that mimics in part
the behavior of the human cochlea. The feature extraction algorithm described in [133]
has been used and the normalized correlogram is computed for the delays corresponding
to the pitch of one of the speakers. In order to find the pitch of the signal we used
the pooled correlogram technique [3]. Then the correlograms are quantized to a limited
number of levels (4 levels) and are applied to our network of chaotic neurons.
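The quantization step can be illustrated as follows. The text does not specify the quantizer, so this sketch assumes a uniform quantizer applied to correlogram values already scaled to the interval [0, 1]; the function name and the sample values are hypothetical.

```python
def quantize(values, levels=4):
    """Map values in [0, 1] to the nearest of `levels` equally spaced values (assumed uniform)."""
    if levels < 2:
        raise ValueError("at least two levels are required")
    steps = levels - 1
    clipped = (min(max(v, 0.0), 1.0) for v in values)
    return [round(v * steps) / steps for v in clipped]


# Hypothetical normalized correlogram values for four cochlear channels.
correlogram = [0.05, 0.30, 0.62, 0.97]
quantized = quantize(correlogram, levels=4)
```

With four levels, every channel is represented by one of the values 0, 1/3, 2/3 or 1, which is the kind of multilevel (non-binary) input the network is meant to process.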
An array of chaotic neurons is used to segregate speech. The dynamics of each neuron i are
governed by a chaotic map (Zhao et al. [91]):
$$x_i(t+1) = x_i(t) + \frac{\varepsilon}{N}\sum_{j=1}^{N} f(x_j(t)) \qquad \text{(B-1)}$$
where f(x) = ax(1 − x) is the logistic map and N is the number of neurons. We used a modified
version of the dynamic neighborhood algorithm described in [91] since we are using
a one-dimensional network in contrast to the two-dimensional network used in Zhao
et al. for image segmentation purposes. In addition, our proposed modified weight
adaptation rule is able to process non-binary data. The aforementioned proposed algorithm
is implemented as follows: each neuron in the network is connected to other
neurons of the network through discrete-time delays (the maximum neighboring distance
of connections is set to 10 neurons). In the beginning, each neuron runs freely,
that is, no synaptic connection is established between neurons. Later, connections are
established according to an exponential rule $e^{-(x_i - x_{i-1})}$, where $x_i$ and $x_{i-1}$ are the inputs
applied to neurons i and i − 1 respectively. The farther a neuron is from another
one, the longer the update delay time is. For instance, for neuron i, updating
delays are defined as $d_{i-1}, d_{i+1}, d_{i-2}, d_{i+2}, \ldots, d_{i-10}, d_{i+10}$ (minus and plus signs correspond
to bottom and up neurons respectively) with $d_{i-1} < d_{i-2} < d_{i-3} < \ldots < d_{i-10}$ and
$d_{i-1} = d_{i+1}, \ldots, d_{i-10} = d_{i+10}$. The update equations are as follows:
$$w_{ij}(t) = \begin{cases} e^{-5.5\,|x_i(t - d_{i-j}) - x_{i-1}(t - d_{i-j})|} & \text{for } t - d_{i-j} > 0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(B-2)}$$
At time instant $t = d_{i-1}$, the network computes the difference between the inputs to
neurons $i-1$ and $i+1$, the closest neurons (the 1-neighbors) to neuron $i$, using the
exponential rule defined earlier. At $t = d_{i-2}$ it updates the connections to neurons
$i+2$ and $i-2$. Since in our case the delays are all powers of 2, it updates the weights
of the connections between the 1-neighbors and neuron $i$ at the same time. In this way,
the region of synchrony around a neuron shrinks or grows at fixed time delays according to
the defined learning rule.

[Figure B-4: block diagram. Labels: Cochlear Output 1, ..., Cochlear Output 24; Chaotic
Neuron; Weight Adaptation; DMM (Decision Making Module); $z^{-N}$; Network Output 1.]

Figure B-4: Architecture of the simplified chaotic neural network based sound source
separator. The Decision Making Module (DMM) defines the neighborhood for which
connections are established for each neuron in the network. The neighborhood grows
with time (as described in Equation B-2).

Figure B-5: Oscillatory behavior of the chaotic network for the two-speaker segregation
problem: the X-axis represents discrete time while the Y-axis represents channels.
Synchronization can be roughly associated with similarly changing gray levels in the figure.
Gray levels show the level of activity: dark regions are zones of synchronized neurons, and
bright regions are zones of neurons that are synchronized among themselves but
desynchronized from the neurons of the dark regions.
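The update rule of Equation B-2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the list-based input format and the delay schedule `update_delays` (delays doubling with distance) are hypothetical; only the exponential rule and the constant 5.5 are taken from the equation.

```python
import math

def connection_weight(x_a, x_b, t, delay, gain=5.5):
    """Weight of Equation B-2 at discrete time t.

    x_a, x_b: sequences of input samples applied to the two neurons compared.
    delay:    update delay (in samples) associated with their distance.
    The weight stays at 0 until the delay has elapsed (t - delay > 0).
    """
    if t - delay <= 0:
        return 0.0
    k = t - delay
    return math.exp(-gain * abs(x_a[k] - x_b[k]))

def update_delays(max_distance=10):
    """Hypothetical delay schedule: delays that double with neuron distance."""
    return {dist: 2 ** dist for dist in range(1, max_distance + 1)}
```

Similar delayed inputs give a weight close to 1, while dissimilar inputs give a weight that decays exponentially with their difference.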
The mask is generated from the output of the network. Speech is then synthesized by
weighting the filterbank outputs with that mask, in a way similar to what is described in
Section 5.4.5. The oscillatory neural network that we use has the advantage of creating a
mask that takes into account the mutual information from the cochlear channels and that
does not require any training.
This technique is not used further in this thesis, and chaotic oscillators are replaced by
relaxation oscillators, because synchrony is more difficult to detect in chaotic networks.
Furthermore, the biological plausibility of chaotic networks is not fully established.
APPENDIX C: MULTIPLICATIVE SYNAPSES
In this appendix, we demonstrate mathematically why additive synapses may fail to
separate sound sources. We base our derivation on Figure C-6. It must be pointed out that
this appendix does not prove that multiplicative synapses are optimal in all senses. It
simply shows that the second-layer integration can be done more effectively by
multiplicative synapses.
In Figure C-6 (top), second-layer integration with additive synapses is analyzed. A
snapshot of the first layer’s activity is shown in the rectangle (Figure C-6, (A)). The first
layer’s activity emphasizes the underlying CSM/CAM. In this specific example, the first-layer
activity for the CAM of a single speaker is depicted. Two different regions of that
rectangle are circled, one in green and one in red. The distance between red dots
corresponds to the pitch of the signal. The region circled in green corresponds to channels
where no neural activity has been detected (the background). The background (in white)
and the red dots have different spiking phases, as shown by the red arrows: all white
neurons have the same phase, while red pixels have a different phase. In the following, the
phase of a neuron is described by its associated color. Since the synapses are additive,
all the activity along a channel is added. The sum of all activities along a channel in the
red region is given by (Figure C-6, (B)) (note that all weights connecting the two layers
are set to unity for the sake of simplicity):
$$\Phi_1 = \sum_{n=1}^{h_1} \delta(n - T_1) + \sum_{n=1}^{h_2} \delta(n - T_2) \qquad \text{(C-1)}$$
The sum of all activities along a channel in the green region is given by (Figure C-6, (C))
$$\Phi_2 = \sum_{n=1}^{h_1+h_2} \delta(n - T_2) \qquad \text{(C-2)}$$
By comparing (Figure C-6, (C)) and (Figure C-6, (B)), we see that the averaging result over
the chosen window is the same for the green and red regions and is equal to $h_1 + h_2$. A
good separator should have separated these two different regions as two different sources.
Hence, the additive synapses (Equation 5.15) as described here do not separate the regions
correctly. Note that even if we had chosen weights different from unity, nothing would
have changed for the additive case. In fact, in this case we would have had:
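$$\langle \Phi_1 \rangle = \sum_{i=1}^{h_1} w_i + \sum_{i=1}^{h_2} w_i = \sum_{i=1}^{h_1+h_2} w_i \qquad \text{(C-3)}$$
$$\langle \Phi_2 \rangle = \sum_{i=1}^{h_1+h_2} w_i \qquad \text{(C-4)}$$
Therefore, even if the weights are different, we still have $\langle \Phi_1 \rangle = \langle \Phi_2 \rangle$.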
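The additive case can be checked numerically with a minimal Python sketch. The values of `h1`, `h2`, `T1`, `T2` and the window length are hypothetical, and each first-layer neuron is assumed to emit one unit spike per averaging window.

```python
def additive_input(phases, weights, window):
    """Summed synaptic input at each time step of the averaging window.

    phases[i] is the firing time of first-layer neuron i (one spike per window),
    weights[i] is its connection weight to the second-layer neuron.
    """
    return [sum(w for p, w in zip(phases, weights) if p == t) for t in range(window)]

def window_total(signal):
    """Activity accumulated over the averaging window."""
    return sum(signal)

# Hypothetical example: h1 = 3 neurons fire at T1 = 2 and h2 = 5 neurons at T2 = 6.
h1, h2, T1, T2, window = 3, 5, 2, 6, 10
unit_weights = [1.0] * (h1 + h2)
red = additive_input([T1] * h1 + [T2] * h2, unit_weights, window)  # Equation C-1
green = additive_input([T2] * (h1 + h2), unit_weights, window)     # Equation C-2
```

The two temporal patterns `red` and `green` differ, but `window_total` returns the same value for both, which is the ambiguity described by Equations C-1 to C-4.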
The bottom part of Figure C-6 shows an even more complicated situation. We will show
that, although the additive synapses were unable to separate the figure from the
background, multiplicative synapses can do much more by separating the two-speaker plus
background case. The CAM for a two-speaker case is considered in the rectangle associated
with the first layer’s activity, showing the underlying behavior of the CAM/CSM
(Figure C-6, (G)). The region circled in red corresponds to the channels belonging to the
first speaker, the purple region to the second speaker, and the green region to the
background. The spike activity is shown in the three averaging windows. For the red-circled
region, applying the operator $\Xi$ described in (Equation 5.16, chapter 4), the overall
multiplicative result is given by (Figure C-6, (D)):
$$\theta\Big(\bigcup_i x_i\Big) = \prod_i w_{ll}(i)\,\Xi\{x_i\} \qquad \text{(C-5)}$$
Note that all we have done so far is to introduce a new notation for the equation we had
already defined in chapter 5 (Equation 5.15). Note also that $\delta(n - T)$ is either 1 or 0;
therefore the multiplication as defined by $\theta()$ is either 0 or 1.
$$\theta_1 = \theta\Big(\bigcup_{n=1}^{h_1} \delta(n - T_1) \bigcup_{n=1}^{h_2} \delta(n - T_2)\Big) = h_3\,\delta(n - T_1) + h_3\,\delta(n - T_2) \qquad \text{(C-6)}$$
where $h_3$ is a scaling factor. For the purple-circled region, applying the same
multiplicative operator gives (Figure C-6, (E)):
$$\theta_2 = \theta\Big(\bigcup_{n=1}^{h_4} \delta(n - T_1) \bigcup_{n=1}^{h_5} \delta(n - T_2)\Big) = h_3\,\delta(n - T_1) + h_3\,\delta(n - T_3) \qquad \text{(C-7)}$$
in which $h_5$ is the number of neurons in purple and $h_4$ is the number of neurons in white.
For the green-circled region we have:
$$\theta_3 = \theta\Big(\bigcup_{n=1}^{h_6} \delta(n - T_1)\Big) = h_3\,\delta(n - T_1) \qquad \text{(C-8)}$$
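Averaging in the three cases, with weights different from unity, gives:
$$\langle \theta_1 \rangle = \prod_{i=1}^{h_1} w_i + \prod_{i=h_1}^{h_2} w_i \qquad \text{(C-9)}$$
$$\langle \theta_2 \rangle = \prod_{i=1}^{h_4} w_i + \prod_{i=h_4}^{h_5} w_i \qquad \text{(C-10)}$$
$$\langle \theta_3 \rangle = \prod_{i=1}^{h_6} w_i \qquad \text{(C-11)}$$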
The above set of equations shows that the three results are, in general, different for
multiplicative synapses. Multiplicative synapses can therefore separate the sources in this
example, while additive synapses cannot.
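As a rough numerical illustration of Equations C-9 to C-11, the sketch below evaluates the multiplicative averages for hypothetical weights. The grouping of weights by spiking phase is an interpretation of the index ranges in those equations, and the weight values are made up for the example.

```python
from math import prod

def additive_average(groups):
    """Additive case: every weight is summed, whatever its group."""
    return sum(sum(group) for group in groups)

def multiplicative_average(groups):
    """Multiplicative case: product of the weights within each group of neurons
    sharing a spiking phase, then a sum over the groups."""
    return sum(prod(group) for group in groups)

# Hypothetical weights (different from unity) for five neurons per region.
red = [[0.9, 0.8], [0.7, 0.6, 0.5]]     # two phases, as in Equation C-9
purple = [[0.9, 0.8, 0.7], [0.6, 0.5]]  # two phases, as in Equation C-10
green = [[0.9, 0.8, 0.7, 0.6, 0.5]]     # one phase, as in Equation C-11
```

With these weights the additive averages coincide for the three regions, whereas the multiplicative averages take three different values.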
[Figure C-6: schematic with panels (A) to (G). Top, averaging with additive synapses: the
first layer's activity (frequency channels versus time, spikes at t = T1 and t = T2) and the
synaptic input to the second-layer neuron associated with the red (right) and green (left)
regions, with heights h1, h2 and h1 + h2; the averaging results are the same. Bottom,
averaging with multiplicative synapses: the first layer's activity (weights w1 and w2,
spikes at t = T1, T2 and T3) and the synaptic input to the second layer associated with the
red (right), purple (middle) and green regions, each of height h3; same result if w1 = w2,
different results if w1 <> w2.]
Figure C-6: Comparison of multiplicative and additive synapses. Top: additive synapses
are unable to separate the ground from the source. Bottom: multiplicative synapses are
able to separate two speakers and the background (refer to the text for more details).
APPENDIX D: PARAMETERS OF THE
HODGKIN-HUXLEY NEURAL MODEL
Here are the numerical values used for parameters in equations defined in Section (4.2.1,
Chapter 4).
x     $E_x$      $g_x$
Na    115 mV     120 mS/cm^2
K     −12 mV     36 mS/cm^2
L     10.6 mV    0.3 mS/cm^2
TABLE D-1: Parameters for the Hodgkin-Huxley Equations.
TABLE D-2: Parameters used in Equation 4.2.1, page 56
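For convenience, the values of Table D-1 can be collected in a small Python structure. The function `ionic_current` assumes the standard Hodgkin-Huxley form of the membrane currents, which is not restated in this appendix; the names used here are illustrative only.

```python
# Values from Table D-1: reversal potentials in mV, maximal conductances in mS/cm^2.
HH_PARAMS = {
    "Na": {"E": 115.0, "g": 120.0},
    "K": {"E": -12.0, "g": 36.0},
    "L": {"E": 10.6, "g": 0.3},
}

def ionic_current(v, m, h, n, params=HH_PARAMS):
    """Total ionic current (uA/cm^2), assuming the standard Hodgkin-Huxley form.

    v is the membrane potential in mV; m, h and n are the gating variables in [0, 1].
    """
    i_na = params["Na"]["g"] * m ** 3 * h * (v - params["Na"]["E"])
    i_k = params["K"]["g"] * n ** 4 * (v - params["K"]["E"])
    i_l = params["L"]["g"] * (v - params["L"]["E"])
    return i_na + i_k + i_l
```

With all gates closed (m = h = n = 0) only the leak term remains, so the current vanishes at the leak reversal potential of 10.6 mV.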
APPENDIX E: FIXED-POINT
APPROXIMATION
The Van der Pol oscillator used in this appendix is a two-dimensional approximation of
the Hodgkin-Huxley equations (as seen in chapter 4; see footnote 4). We will show here that a
further approximation to a one-dimensional state space reduces the oscillators to
“integrate-and-fire” neurons. The derivation presented here is similar to the one described
in [86], but it has been adapted to relaxation oscillators. In fact, a pseudo-linear
approximation of the Wang-Terman state-space equations gives the following two-variable
“integrate-and-fire” model. The model is obtained by linearization (Figure E-7) of each
branch of the state-space trajectory. This means that the nullclines of the Wang-Terman
oscillators (Figure 4.5, chapter 4) are linearized. The coupling strength $S_{i,j}$ is not
considered below, since it does not intervene in the analysis ($i, j$ subscripts are omitted
for the sake of simplicity). The linearization in Figure E-7 gives the following equations:
$$\frac{dx}{dt} = f(x) - y + I \qquad \text{(E-1)}$$
$$\frac{dy}{dt} = \varepsilon\,[bx - d(H(x) - 0.5)] \qquad \text{(E-2)}$$
Typical values for this piecewise linearization are: $f(x) = ax$ for $x < 0.5$,
$f(x) = a(1 - x)$ for $0.5 < x < 1.5$ and $f(x) = c_0 + c_1 x$ for $x > 1.5$, where $a < 0$ and
$c_1$ are parameters and $c_0 = -0.5a - 1.5c_1$, so that $f$ is continuous at $x = 1.5$.
Furthermore, $b > 0$, $d > 0$ and $0 < \varepsilon \ll 1$. $H(\cdot)$ is the Heaviside function
as usual, and $I$ is the input current. Note that these are typical values and that the
following reasoning remains the same with different values and even with different
approximating functions.
The rest state is $x = y = 0$. Suppose that the system is stimulated by a short current
pulse that shifts the state of the system horizontally. As long as $x < 1$, we have
$f(x) < 0$. According to Equation E-1, $\frac{dx}{dt} < 0$ and $x$ returns to the rest state.
For $x < 0.5$ the relaxation to rest is exponential with $x(t) = \exp(at)$ in the limit
$\varepsilon \to 0$. Thus, the return to rest after a small perturbation is governed by the
fast time scale. If the current pulse moves $x$ to a value larger than a predefined
threshold, which is equal to one for this choice of parameters, we have
$\frac{dx}{dt} = f(x) > 0$. Hence the voltage $x$ increases and a pulse is emitted.

[Figure E-7: phase-plane sketch with axes x and y, showing the nullclines
dy/dt = bx - d(H(x) - 0.5) = 0 and dx/dt = f(x) - y = 0.]

Figure E-7: Piecewise linear model of the state space of the Wang-Terman oscillator
presented in Chapter 4. The open curve of (Figure 4.5, page 80) has been approximated
by lines and the closed curve has been approximated by a rectangle. The inset shows the
trajectory (arrows) which follows the x nullcline at a distance of order $\varepsilon$.

Footnote 4: A review of Chapter 3 (Section 3.1) of [86] is strongly advised before reading
this appendix.
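The threshold behavior described above can be illustrated with a short Python sketch that integrates $dx/dt = f(x)$, that is, Equation E-1 with $y$ and $I$ held at zero (the fast-time-scale limit). The parameter values `a = -1` and `c1 = -1`, the Euler step, and the choice of `c0` from the continuity of `f` at `x = 1.5` are assumptions made for this illustration.

```python
def f(x, a=-1.0, c1=-1.0):
    """Piecewise-linear function f of the linearized oscillator."""
    c0 = -0.5 * a - 1.5 * c1  # assumed: makes f continuous at x = 1.5
    if x < 0.5:
        return a * x
    if x < 1.5:
        return a * (1.0 - x)
    return c0 + c1 * x

def relax(x0, steps=2000, dt=0.01):
    """Forward-Euler integration of dx/dt = f(x) starting from x0."""
    x = x0
    for _ in range(steps):
        x += dt * f(x)
    return x
```

A perturbation below the threshold, such as `relax(0.9)`, decays back to the rest state, while a perturbation above it, such as `relax(1.1)`, grows away from rest, which corresponds to the emission of a pulse.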
Using the above reasoning, we have simplified the two-dimensional Wang-Terman oscil-
lator to a one dimensional “integrate-and-fire” neuron with threshold, as written below.
The coupling strength Si,j is re-added to the equations.
$$\frac{dx_{i,j}}{dt} = -x_{i,j} + S_{i,j} + H(p^{input}_{i,j})$$
$$x_{i,j} = 0 \quad \text{if } x_{i,j} > \text{threshold} \qquad \text{(E-3)}$$
$H(\cdot)$ is again the Heaviside function. In addition, Campbell and Wang have shown by
simulation (but not theoretically) that Van der Pol oscillators and integrate-and-fire
neurons behave equivalently for temporal correlation purposes [102].
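A minimal simulation of the integrate-and-fire neuron of Equation E-3 is sketched below. The threshold value, the time step and the forward-Euler scheme are hypothetical choices for the illustration and are not taken from the thesis.

```python
def heaviside(p):
    """Heaviside step: 1 for p > 0, 0 otherwise."""
    return 1.0 if p > 0 else 0.0

def integrate_and_fire(p_input, coupling=0.0, threshold=0.5, steps=1000, dt=0.01):
    """Forward-Euler simulation of Equation E-3; returns spike times in steps."""
    x, spikes = 0.0, []
    for k in range(steps):
        x += dt * (-x + coupling + heaviside(p_input))
        if x > threshold:
            spikes.append(k)
            x = 0.0  # reset after the pulse
    return spikes
```

With a positive input the neuron charges toward its drive and fires periodically once the threshold is crossed; with no input and no coupling it stays at rest.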
APPENDIX F: GAMMACHIRP/GAMMATONE
FILTERBANKS
In this appendix, we detail some of the most important properties of the
Gammachirp/Gammatone filterbanks [128]. Although the gammatone filter is used in this work,
the gammachirp filter, which is a more general form, is explained. We will show that the
gammatone filter is a simplification of the gammachirp filter.
The gammachirp filters are designed in such a way that the time-frequency uncertainty
is minimized [130]. The impulse response of the gammachirp filter is similar to that of the
gammatone filter, except for a “chirp factor” $c$ which is used as a modulation carrier.
$$g_c(t) = a\,t^{n-1} e^{-2\pi B(f_c) t}\, e^{j(2\pi f_c t + c \log t)} \qquad \text{(F-1)}$$
$$B(f) = 0.1039 f + 24.7 \qquad \text{(F-2)}$$
This filterbank has an asymmetrical frequency response.
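Equations F-1 and F-2 translate directly into code. The following Python sketch evaluates the complex impulse response at a single time instant; the filter order `n = 4` and the gain `a = 1` are assumed default values chosen for the example, not values stated in this appendix.

```python
import cmath
import math

def erb_bandwidth(f):
    """Bandwidth B(f) of Equation F-2, in Hz."""
    return 0.1039 * f + 24.7

def gammachirp(t, fc, c, n=4, a=1.0):
    """Complex impulse response of Equation F-1 at time t > 0 (in seconds)."""
    envelope = a * t ** (n - 1) * math.exp(-2.0 * math.pi * erb_bandwidth(fc) * t)
    return envelope * cmath.exp(1j * (2.0 * math.pi * fc * t + c * math.log(t)))

def gammatone(t, fc, n=4, a=1.0):
    """Gammatone response, obtained by setting the chirp factor c to 0."""
    return gammachirp(t, fc, 0.0, n=n, a=a)
```

In the time domain the chirp term only modifies the phase of the carrier, so both responses share the same gamma-shaped envelope.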
The spectrum of the Gammachirp filterbank can be factorized in the following way:
$$|G_c(f)| = a_\Gamma(c)\,|G_T(f)|\,e^{c\theta} \qquad \text{(F-3)}$$
$G_c(f)$ is the spectrum of the Gammachirp filterbank, $G_T(f)$ is the spectrum of the
Gammatone filterbank, $c$ is the modulation parameter, $a_\Gamma(c)$ is a gain which depends
on $c$, and $\theta$ is given by:
$$\theta = \tan^{-1}\left(\frac{f - f_c}{B(f_c)}\right) \qquad \text{(F-4)}$$
This decomposition, proposed by Irino, is interesting because it represents the Gammachirp
filterbank as the cascade of a Gammatone filterbank and a compensation filter
$e^{c\theta}$. In this work only the gammatone filterbank is used, which is a special
gammachirp filter with the parameter $c = 0$ in Equation F-1.
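The compensation factor of Equations F-3 and F-4 can be sketched as follows. The function names and the example value `c = -2` used in the note below are illustrative assumptions.

```python
import math

def erb_bandwidth(f):
    """Bandwidth B(f) of Equation F-2, in Hz."""
    return 0.1039 * f + 24.7

def chirp_angle(f, fc):
    """Angle theta of Equation F-4."""
    return math.atan((f - fc) / erb_bandwidth(fc))

def compensation_gain(f, fc, c):
    """Asymmetric factor exp(c * theta) appearing in Equation F-3."""
    return math.exp(c * chirp_angle(f, fc))
```

For `c = 0` the factor equals 1 at every frequency, which recovers the gammatone spectrum up to the gain term. For a negative value such as `c = -2`, frequencies below `fc` are boosted and frequencies above `fc` are attenuated, which produces the asymmetry of the gammachirp response.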
There are other, more computationally efficient approaches and filterbanks; for details
see [168].
APPENDIX G: RATE-CODING EQUIVALENCE
BETWEEN THE DLM AND THE ODLM
In this appendix we will prove that the network we proposed in chapter 5 (ODLM) is
rate-code equivalent to the original Dynamic Link Matcher (DLM). In order to do so, we
use the fixed-point approximation of the relaxation oscillator derived in appendix E.
We rewrite the dynamics of the neuron in the dynamic link matching phase (recall from
chapter 5 that our network has a segmentation phase and a matching phase) in the
simplified “integrate-and-fire with threshold” form (see appendix E for details) for our
ODLM network, using the coupling strength in Equation 6.5 without the global controller
influence:
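$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{k,m \neq i,j} w^{int}_{i,j,k,m} H(x^{two}_{k,m}) + \sum_{k,m} w^{ext}_{i,j,k,m} H(x^{one}_{k,m}) + H(p^{input})$$
$$x = 0 \quad \text{if } x > \text{threshold} \qquad \text{(G-1)}$$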
where $x^{two}$ stands for neurons in layer two and $x^{one}$ stands for neurons in layer one.
Note that, as explained in chapter 5, there are synaptic connections ($w^{int}$) within layer
2 (the 4 neighbors in our proposed architecture in chapter 5) and synaptic connections from
layer 1 to layer 2 ($w^{ext}$). The neighborhood $N(i, j)$ has been replaced by
$(k, m) \neq (i, j)$. We use exactly the same approximation as in chapter 5 (see section
6.6), that is, we neglect the influence of intra-layer connections; therefore Equation G-1
becomes:
$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{k,m} w^{ext}_{i,j,k,m} H(x^{one}_{k,m}) + H(p^{input})$$
$$x = 0 \quad \text{if } x > \text{threshold} \qquad \text{(G-2)}$$
Note that for an integrate-and-fire neuron the approximation H(x) = x holds, since the
output of an integrate-and-fire neuron is either 0 or 1 (it emits spikes or delta functions),
therefore Equation G-2 can be further simplified to :
$$\frac{dx^{two}}{dt} = -x^{two} + \sum_{k,m} w^{ext}_{i,j,k,m}\, x^{one}_{k,m} + H(p^{input})$$
$$x = 0 \quad \text{if } x > \text{threshold} \qquad \text{(G-3)}$$
By averaging the two sides of Equation G-3 we get ($H(p^{input})$ is considered constant
over $T$):
$$\frac{dx^{two}_a}{dt} = -x^{two}_a + \Sigma w^{ext} x^{one}_a + H(p^{input}) \qquad \text{(G-4)}$$
$$x_a = \langle x \rangle_T = \frac{1}{T}\int_0^T x(t)\,dt \qquad \text{(G-5)}$$
where $\langle x \rangle_T$ is the averaged version of $x$ over a time window of length $T$.
For the sake of simplicity, the indices are omitted in Equation G-4. Note that
$H(p^{input})$ is constant over time; therefore its time average is equal to $H(p^{input})$.
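The averaging operator of Equation G-5 can be approximated on sampled signals. The sketch below uses a rectangle-rule estimate; the sampling step `dt` and the representation of a spike as a sample of height `1/dt` (unit area) are assumptions of this illustration.

```python
def window_average(samples, dt):
    """Rectangle-rule estimate of (1/T) * integral of x(t) dt, with T = len(samples) * dt."""
    total_time = len(samples) * dt
    return sum(s * dt for s in samples) / total_time

def spike_train(spike_steps, n_steps, dt):
    """Sampled train of unit-area pulses located at the given step indices."""
    return [1.0 / dt if k in spike_steps else 0.0 for k in range(n_steps)]
```

A constant signal is left unchanged by the average, as stated above for $H(p^{input})$, and the average of a spike train equals its firing rate over the window.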
From Maass (chapter 2) [112], we know that the averaged output $x^{two}_a$ of an
integrate-and-fire neuron is related to the time-averaged inputs of the neuron
($\Sigma w^{ext} x^{one}_a$) by a continuous function (sigmoidal, etc.). Let us name this
function $\varphi$ (note that $\beta$ is a proportionality constant):
$$\langle x^{two}_{i,j} \rangle = \beta\,\varphi\big(\Sigma w^{ext} \langle x^{one}_{k,m} \rangle\big) \qquad \text{(G-6)}$$
Note that in Equation G-4 we need $\langle x^{one}_{i,j} \rangle$ as a function of
$\langle x^{two}_{k,m} \rangle$. Note further that Equation G-6 is a set of linear equations in
$w^{ext}$, and we can deduce $x^{one}_{i,j}$ from that set of equations:
$$x^{one}_{i,j} = \sum_{k,m} \sigma(x^{two}_{k,m}) \qquad \text{(G-7)}$$
where $\sigma(x) = \varphi^{-1}(x)$. Replacing the above result in Equation G-4 gives (note that,
for the sake of simplicity, we again omitted the indices):
$$\frac{dx^{two}_a}{dt} = -x^{two}_a + \Sigma\Sigma w^{ext} \sigma(x^{two}_a) + H(p^{input}) \qquad \text{(G-8)}$$
On the other hand:
$$\Sigma\Sigma w^{int} \sigma(x^{two}_a) = k(x^{two}_a) * \sigma(x^{two}_a) \qquad \text{(G-9)}$$
where $*$ is a 2-D convolution. In our case $k(\cdot)$ is a 2-D rectangular window (in the
original DLM, $k(\cdot)$ was chosen to be a Mexican hat).
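The 2-D convolution with a rectangular window in Equation G-9 can be sketched in plain Python. The 3 x 3 window size and the zero padding at the borders are hypothetical choices made for the illustration.

```python
def convolve2d_same(image, kernel):
    """Zero-padded 2-D convolution returning an array of the same size as `image`."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    r0, c0 = k_rows // 2, k_cols // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for u in range(k_rows):
                for v in range(k_cols):
                    y, x = i + r0 - u, j + c0 - v
                    if 0 <= y < rows and 0 <= x < cols:
                        acc += kernel[u][v] * image[y][x]
            out[i][j] = acc
    return out

def rectangular_window(size=3):
    """Rectangular (all-ones) kernel standing in for k(.)."""
    return [[1.0] * size for _ in range(size)]
```

Applied to a map of averaged activities, the rectangular window simply sums the activity of each neuron and of its neighbors inside the window.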
Ixr in Equation 6.21 is the input signal, which can be replaced by $H(p^{input})$ in the
notation of chapter 5. Therefore, we have proved that the DLM is an averaged-over-time
approximation of the ODLM.
As stated above, the influence of the global controller has been ignored in the derivation
of these results. The reader may ask what would happen if the global controller were
included in the equations. The answer is that, in steady state, the average influence of
the global controller does not change in time (see the activity of the “Inhibitor” in
Figure 4.7, chapter 3). Therefore the above-derived equations hold in steady state, up to a
constant. The transient-state analysis seems much more complicated and has not been
included in this appendix. It has been left for future work.
From the above discussion and mathematical derivation, we conclude that the dynamics
of the original Dynamic Link Matcher proposed by Konen et al. are the rate-coding
approximation of our place-coding network.
BIBLIOGRAPHY
[1] R. M. Borisyuk and Y. Kazanovich. Oscillatory neural network model of attention
focus formation and control. Biosystems, 71:29–36, 2003.
[2] M. Cooke and D. Ellis. The auditory organization of speech and other sources in
listeners and computational models. Speech Comm., pages 141–177, 2001.
[3] D. Wang and G. J. Brown. Separation of speech from interfering sounds based on
oscillatory correlation. IEEE Transactions on Neural Networks, 10(3):684–697, May
1999.
[4] G. Hu and D.L. Wang. Monaural speech segregation based on pitch tracking and
amplitude modulation. IEEE Trans. On Neural Networks, pages 1135– 1150, Sept.
2004.
[5] G. Jang and T. Lee. Single-channel signal separation using time-domain basis func-
tions. Signal Processing Letters, pages 168–171, June 2003.
[6] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990.
[7] T. Lengagne, T. Aubin, and J. Lauga. How do king penguins (Aptenodytes
patagonicus) apply the mathematical theory of information to communicate in windy
conditions? Proc. R. Soc. (London) B Biology, 266:1623–1628, 1999.
[8] J. Kanwal, A. Medvev, and C. Micheyl. Neurodynamics for auditory stream seg-
regation: tracking sounds in the mustached bat’s natural environment. Network:
Computation in Neural Systems, 14(13), 2003.
[9] E. C. Cherry. Some experiments on the recognition of speech, with one and with
two ears. Journal of the Acoustical Society of America, 25:975–979, 1953.
[10] J. Driver. Enhancement of selective listening by illusory mislocation of speech sounds
due to lip-reading. Nature, 381:66–68, 1996.
[11] K. Koffka. Principles of Gestalt Psychology. Lund Humphries (London), 1935.
[12] A.J.W. Van der Kouwe, D.L. Wang, and G. J. Brown. A comparison of auditory
and blind separation techniques for speech segregation. IEEE Trans. on Speech and
Audio Processing, 9:189–195, 2001.
[13] D. Marr. Vision. Freeman Publishers, 1982.
[14] W. Ainsworth and S. Greenberg. Springer Handbook of Auditory Research. Springer,
2003.
[15] L. A. Jeffress. A place theory of sound localization. J. Comp. Physiol. Psychol.,
41:35–39, 1948.
[16] J.C.R. Licklider and W.H. Huggins. Place mechanisms of auditory frequency anal-
ysis. JASA, 23:290–299, 1951.
[17] T. W. Parsons. Separation of speech from interfering speech by means of harmonic
selection. JASA, 60:911–918, 1976.
[18] R.F. Lyon. A computational model of filtering, detection and compression in the
cochlea. In ICASSP, 1982.
[19] M.T. Scheffers. Sifting Vowels: Auditory Pitch Analysis and Sound Segregation.
PhD thesis, Groningen University, The Netherlands, 1983.
[20] M. Weintraub. A computational model for separating two simultaneous talkers. In
ICASSP, 1986.
[21] C. Von der Malsburg and W. Schneider. A neural cocktail-party processor. Biol.
Cybernetics, pages 29–40, 1986.
[22] F. Berthommier and G. Meyer. Improving of amplitude modulation maps for f0-
dependent segregation of harmonic sounds. In Eurospeech’97, 1997.
[23] R.J. Stubbs and A.Q. Summerfield. Evaluation of 2 voice-separation algorithms
using normal-hearing and hearing-impaired listeners. JASA, 84:1236–1249, 1988.
[24] M. Cooke. Modelling Auditory Processing and Organisation. PhD thesis, University
of Sheffield, 1991.
[25] K. Mellinger. Event Formation and Separation in Musical Sound. PhD thesis,
Stanford University, 1991.
[26] K. Kashino and H. Tanaka. A sound source separation system using spectral features
integrated by the Dempster’s law of combination. Annual Report of the Engineering
Research Institute, University of Tokyo, 51:67–72, 1992.
[27] G. Brown and M. Cooke. Computational auditory scene analysis. Computer Speech
and Language, pages 297–336, 1994.
[28] A. de Cheveigne. Separation of concurrent harmonic sounds: Fundamental fre-
quency estimation and a time-domain cancellation model of auditory processing.
Journal of Acoustical Society of America, pages 3271–3290, 1993.
[29] R.D. Patterson, M. H. Allerhand, and C. Giguere. Time-domain modelling of
peripheral auditory processing: A modular architecture and a software platform.
JASA, 98:1890–1894, 1995.
[30] D. Ellis. Prediction-Driven Computational Auditory Scene Analysis. PhD thesis,
MIT, 1996.
[31] D.F. Rosenthal and H. G. Okuno. Computational Auditory Scene Analysis.
Lawrence Erlbaum Assoc, 1998.
[32] M. Cooke, P. Green, L. Josifovski, and A. Vizinho. Robust automatic speech recogni-
tion with missing and unreliable acoustic data. Speech Communication, 34:267–285,
2001.
[33] S. T. Roweis. One microphone source separation. In NIPS, Denver, USA, 2000.
[34] M. J. Reyes-Gomez, B. Raj, and D. Ellis. Multi-channel source separation by
factorial HMMs. In ICASSP 2003, 2003.
[35] G. Hu and D. L. Wang. Monaural speech segregation based on pitch tracking and
amplitude modulation. Technical report, Ohio State University, 2002.
[36] M. Wu, D.L. Wang, and G.J. Brown. A multipitch tracking algorithm for noisy
speech. IEEE Trans. on Speech and Audio Processing, 2003.
[37] D.P. Gibson, N. W. Campbell, and B.T. Thomas. Very low bit rate semantic
compression of natural outdoor images. In Picture Coding Symposium, Oregon,
USA, 1999.
[38] N. Todd. An auditory cortical theory of auditory stream segregation. Network :
Computation in Neural Systems, 7:349–356, 1996.
[39] G. Langner. Temporal processing of pitch in the auditory system. J. New Music
Res, pages 116–132, 1997.
[40] S. Cunningham and M. Cooke. The role of evidence and counter-evidence in speech
perception. In International Congress of Phonetic Sciences 1999, 1999.
[41] J. Rouat and R. Pichevar. Source separation with one ear: Proposition for an
anthropomorphic approach. EURASIP Journal on Applied Signal Processing (sub-
mitted, invited paper), 2004.
[42] R. Pichevar and J. Rouat. Cochleotopic/AMtopic (CAM) and
Cochleotopic/Spectrotopic (CSM) map based sound source separation using
relaxation oscillatory neurons. In IEEE Neural Networks for Signal Processing
Workshop, Toulouse, France, 2003.
[43] R. Pichevar and J. Rouat. Monophonic source separation with an unsupervised
network of spiking neurons. Speech Communication (Elsevier), submitted, 2004.
[44] F. Gaillard. Analyse de Scenes Auditives Computationnelle (CASA): Un Nouvel
Outil de Marquage Du Plan Temps-Frequence Par Detection D’harmonicite Ex-
ploitant Une Statistique de Passage Par Zero. PhD thesis, INPG, 1999.
[45] F. Klessner, V. Lesser, and S.H. Nawab. The IPUS Blackboard Architecture as a
Framework for Computational Auditory Scene Analysis. In Computational Auditory
Scene Analysis, D.F. Rosenthal and H.G, Okuno, 1998.
[46] S. Grossberg, K. K. Govindarajan, L.L. Wyse, and M.A. Cohen. ARTSTREAM:
A neural network model of auditory scene analysis and source segregation. Neural
Networks, 2003.
[47] S. T. Roweis. Factorial models and refiltering for speech separation and denoising.
In Eurospeech 2003, 2003.
[48] H. Sameti, H. Sheikhzadeh, L. Deng, and R.L. Brennan. HMM-based strategies for
enhancement of speech signals embedded in nonstationary noise. IEEE Trans. on
Speech and Audio Processing, pages 445–455, 1998.
[49] R. Remez and P. E. Rubin. Speech perception without traditional speech cues.
Science, pages 947–949, May 1981.
[50] R. E. Remez and P.E. Rubin. On the perceptual organization of speech. Psycho-
logical Review, pages 129–148, 1994.
[51] J. Barker and M. Cooke. Is the sine-wave speech cocktail party worth attending?
Speech communication, 27:159–174, 1999.
[52] C.G. Tsai. Auditory grouping in the perception of roughness induced by subhar-
monics: Empirical findings and a qualitative model. In International Symposium
on Musical Acoustics, Japan, 2004.
[53] T.S. Parker and L.O. Chua. Practical Numerical Algorithms for Chaotic Systems.
Springer-Verlag, 1989.
[54] F. Vrins, Lee J. A, M. Verleysen, V. Vigneron, and C. Jutten. Improving inde-
pendent component analysis performances by variable selection. In IEEE NNSP,
2003.
[55] J-F. Cardoso. Blind signal separation: Statistical principles. Proc. IEEE, 86:2009–
2025, 1998.
[56] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John
Wiley and Sons, 2001.
[57] P.Comon. Independent component analysis: A new concept? Signal Processing,
36:287–314, 1994.
[58] M. Casey. Separation of mixed audio sources by independent subspace analysis. In
Int’l Computer Music Conference, Berlin, Germany, 2000.
[59] L.Q. Zhang, C. Amari, and C. Cichoki. Natural gradient approach to blind separation
of over- and under-complete mixtures. In Proc. Int. Workshop on Independent
Component Analysis and Blind Source Separation, pages 455–460, 1999.
[60] P. Comon. Blind identification and source separation in 2x3 under-determined
mixtures. IEEE Trans. on signal processing, pages 11–22, 2004.
[61] L. Albera, P. Comon, P. Chevalier, and A. Ferreol. Blind identification of
underdetermined mixtures based on the hexacovariance. In International Conference on
Audio Speech and Signal Processing, 2004.
[62] M. Cooke. http://www.dcs.shef.ac.uk/˜martin/.
[63] C. Prodohl, R. Wurtz, and C. Von der Malsburg. Learning the gestalt rule of
collinearity from object motion. Neural Computation, pages 1865–1896, 2003.
[64] W. Ross, S. Grossberg, and E. Mingolla. Visual cortical mechanisms of perceptual
grouping: Interacting layers, networks, columns, and maps. Neural Networks, pages
571–588, 2000.
[65] C. Von der Malsburg. The what and why of binding: The modeler’s perspective.
Neuron, pages 95–104, 1999.
[66] P. Milner. A model for visual shape recognition. Psychological Review, pages 521–
535, 1974.
[67] A. Kristjansson, D.L. Wang, and K. Nakayama. The role of priming in conjunctive
visual search. Cognition, 85:37–52, 2002.
[68] M. Shadlen and A. Movshon. Synchrony unbound: A critical evaluation of the
temporal binding hypothesis. Neuron, 24:67–77, 1999.
[69] M. Riesenhuber and T. Poggio. Are cortical models really bound by the binding
problem? Neuron, 24:87–93, 1999.
[70] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid
scene analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages
1254–1259, 1998.
[71] J. Reynolds and R. Desimone. The role of neural mechanisms of attention in solving
the binding problem. Neuron, 24:19–29, 1999.
[72] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.
[73] K. Fukushima. A neural network model for selective attention in visual pattern
recognition. Biol. Cybernetics, pages 5–15, 1986.
[74] B.A. Olshausen, C.H. Anderson, and D.C. Van Essen. A neurobiological model
of visual attention and invariant pattern recognition based on dynamic routing of
information. J. Neuroscience, pages 4700–4719, 1993.
[75] E.O. Postma, H.J. Van der Herik, and P.T. W. Hudson. SCAN: A scalable neural
model of covert attention. Neural Networks, 10:993–1015, 1997.
[76] E. Salinas and L.F. Abbott. Invariant visual responses from attentional gain fields.
Journal of Neurophysiology, pages 3267–3272, 1997.
[77] L. Wiskott. How Does our Visual System Achieve Shift and Size Invariance. In J.L.
Van Hemmen and T.J. Sejnowski (Eds.), Oxford University Press, 2003.
[78] MIT Encyclopedia of Cognitive Sciences. MIT press, online.
[79] W. Singer. Neuronal synchrony: A versatile code for the definition of relations?
Neuron, 24:49–65, 1999.
[80] J. Wolfe and K. Cave. The psychological evidence for a binding problem. Neuron,
24:11–17, 1999.
[81] G. Bugmman. Binding by synchronisation: A task dependence hypothesis. Brain
and Behaviour Sciences, pages 685–688, 1997.
[82] J. Rouat and R. Pichevar. Nonlinear speech processing techniques for source segre-
gation. In EUSIPCO, Toulouse, France, 2002.
[83] V.I. Nenov. Neural network for learning, recognition, and recall of pattern sequences.
US Patent, No. 5,222,348, 1993.
[84] E. M. Izhikevich. Class 1 neural excitability, conventional synapses, weakly con-
nected networks, and mathematical foundations of pulse-coupled models. IEEE
Trans. on Neural Networks, 10(3):499–507, 1999.
[85] H.R. Wilson and J.D. Cowan. Excitatory and inhibitory interactions in localized
populations of model neurons. Biophysics Journal, pages 12:1–24, 1972.
[86] W. Gerstner. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cam-
bridge University Press, 2002.
[87] E. Izhikevich. Which model to use for cortical spiking neurons? IEEE Trans. on
Neural Networks, 2004.
[88] E. Izhikevich. Simple model of spiking neurons. IEEE Trans. on Neural Networks,
2003.
[89] L. Lapique. Recherches quantitatives sur l’excitation electrique des nerfs traitee
comme une polarisation. J. Physiol. Patho., pages 620–635, 1907.
[90] H. Kantz and T. Schreiber. Nonlinear time series. Cambridge University Press,
1997.
[91] L. Zhao and E. Macau. A network of dynamically coupled chaotic map for scene
segmentation. IEEE Trans. on Neural Networks, pages 1375–1385, 2001.
[92] K. Kaneko. Globally coupled chaos violates the law of large numbers but not the
central-limit theorem. Physical Review Letters, pages 1391–1394, 1990.
[93] K. Kaneko. Chaotic but regular posi-nega switch among coded attractors by cluster-
size variation. Physical Review Letters, pages 219–223, 1989.
[94] J. Ito and K. Kaneko. Self-organized hierarchical structure in a plastic network of
chaotic units.
[95] F. Pasemann. Complex dynamics and the structure of small neural networks.
Network: Computation in Neural Systems, pages 195–216, 2002.
[96] E. Izhikevich. Dynamical Systems in Neuroscience: The geometry of excitability
and bursting. Springer-Verlag (to appear), 2005.
[97] F.C. Hoppensteadt and E. Izhikevich. Weakly Connected Neural Networks. Springer-
Verlag, New York, 1997.
[98] R. Hilborn. Chaos and Nonlinear Dynamics: An Introduction for Scientists and
Engineers. Oxford University Press, 2000.
[99] R. Borisyuk. Synchronization of neural activity and information coding. In NCWS
2003, 2003.
[100] D.L. Wang and D. Terman. Image segmentation based on oscillatory correlation.
Neural Computation, pages 805–836, 1997.
[101] D. Wang. Relaxation oscillators and networks. In Wiley Encyclopedia of Electrical
and Electronics Engineering, pages 396–405. Wiley & Sons, 1999.
[102] S. R. Campbell, D. L. Wang, and C. Jayaprakash. Synchrony and desynchrony in
integrate-and-fire oscillators. Neural Computation, pages 1595–1619, 1999.
[103] D. L. Wang and D. Terman. Image segmentation based on oscillatory correlation.
Neural Computation, pages II 521– II 525, 1995.
[104] R. Pichevar and J. Rouat. Binding of audio elements in the sound source segregation
problem via a two-layered bio-inspired neural network. In IEEE CCECE’2003.
[105] R. Pichevar and J. Rouat. Double-vowel segregation through temporal correlation:
A bio-inspired neural network paradigm. In NOLISP’2003, 2003.
[106] R. Pichevar and J. Rouat. Oscillatory dynamic link matching for pattern recogni-
tion. In International Workshop on Neural Coding (NCWS), Aulla, Italy, 2003.
[107] H.X. Wang and G.Q. Bi. Temporal asymmetry in spike timing-dependent synaptic
plasticity. Psychology and Behavior, pages 551–555, 2002.
[108] K.P. Kording and P. Konig. Neurons with two sites of synaptic integration learn
invariant representations. Neural Computation, pages 2823–2849, 2001.
[109] R. Van Rullen and S. J. Thorpe. Rate coding versus temporal order coding: What
the retinal ganglion cells tell the visual cortex. Neural Computation, 13:1255–1283,
2001.
[110] C. Panchev, S. Wermter, and H. Chen. Spike-timing dependent competitive learning
of integrate-and-fire neurons with active dendrites. In ICANN, Spain, 2002.
[111] Simon Haykin. Neural Networks : A Comprehensive Foundation. Macmillan, 1994.
[112] W. Maass and C. M. Bishop. Pulsed Neural Networks. MIT Press, 1998.
[113] R. Eckhorn. Neural mechanisms of scene segmentation: Recordings from the vi-
sual cortex suggest basic circuits for linking field models. IEEE Trans. on Neural
Networks, 10(3):464–479, 1999.
[114] X. Liu and D.L. Wang. Range image segmentation using a relaxation oscillator
network. IEEE Trans. On Neural Networks, pages 564–574, May 1999.
[115] E. Cesmeli and D. Wang. Motion segmentation based on motion/brightness integra-
tion and oscillatory correlation. IEEE Trans. on Neural Networks, 11(4):935–947,
2000.
[116] D. Wang and D. Terman. Locally excitatory globally inhibitory oscillator networks.
IEEE Trans. on Neural Networks, pages 283–286, 1995.
[117] D. L. Wang. On connectedness: A solution based on oscillatory correlation. Neural
Computation, pages 131–139, 2000.
[118] S. N. Wrigley and G. J. Brown. A neural oscillator model of auditory attention.
Lecture Notes in Computer Science, pages 1163–1170, 2001.
[119] H. Nakano and T. Saito. Synchronization in a pulse-coupled network of chaotic
spiking oscillators. In 45th Midwest Symposium on Circuits and Systems, 2002.
[120] N. Cowan. Evolving conceptions of memory storage, selective attention and their
mutual constraints within the human information processing system. Psychol. Bull.,
104:163–191, 1988.
[121] B. Widrow. Adaptive noise cancelling: Principles and applications. Proceedings of
the IEEE, 63(12), 1975.
[122] Y. Kaneda and J. Ohga. Adaptive microphone-array system for noise reduction.
TrASSP, pages 1391–1400, 1986.
[123] J.-M. Valin, J. Rouat, and F. Michaud. Microphone array post-filter for separation
of simultaneous non-stationary sources. In ICASSP, Montreal, Canada, 2004.
[124] M.S. Brandstein and D.B. (Eds.). Microphoe Arrays: Signal Processing Techniques
and Applications. Springer Verlag, 2001.
[125] J. Sanchez-Bote, J. Gonzales-Rodriguez, and J. Ortega-Garcian. A real-time
auditory-based microphone array assessedwith e-rasti evaluation proposal. In
ICASSP, Hong-Kong, 2003.
[126] M.R. Gomez, D. Ellis, and N. Jojic. Multiband audio modeling for single-channel
acoustic source separation. In IC ASSP 2004, 2004.
[127] P.A. Cariani and B. Delgutte. Neural correlates of the pitch complex tones. i. pitch
and pitch salience. ii. pitch shift, pitch ambiguity, phase invariance, pitch circularity,
and the dominance region for pitch. J. Neurophysiology, 1996.
[128] R.D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Aller-
hand. Complex sounds and auditory images. In Y. Cazals, L. Demany, and
K. Horner, editors, Auditory Physiology and Perception, pages 429–446. Pergamon
Press, Oxford, 1992.
[129] R. Pichevar, J. Rouat, C. Feldbauer, and G. Kubin. A bio-inspired sound source
separation technique in combination with an enhanced FIR gammatone Analy-
sis/Synthesis filterbank. In EUSIPCO Vienna, 2004.
[130] T. Irino and M. Unoki. A time-varying, analysis/synthesis auditory filterbank using
the gammachirp. In 98, volume 6, pages 3653–3656, Seattle, Washington, May 1998.
[131] Gernot Kubin and W. Bastiaan Kleijn. On speech coding in a perceptual domain.
In 99, volume 1, pages 205–208, Phoenix, Arizona, March 1999.
[132] Malcolm Slaney. An efficient implementation of the Patterson-Holdsworth auditory
filter bank. Technical Report 35, Apple Computer, Inc, 1993.
[133] J. Rouat, Y. C. Liu, and D. Morissette. A pitch determination and voiced/unvoiced
decision algorithm for noisy speech. Speech Comm., 21:191–207, 1997.
[134] F. Plante, G. Meyer, and W. Ainsworth. Improvement of speech spectrogram accu-
racy by the method of reassignment. IEEE Trans. on Speech and Audio Processing,
pages 282–287, 1998.
[135] C. Giguere and Philip C. Woodland. A computational model of the auditory pe-
riphery for speech and hearing research. JASA, pages 331–349, 1994.
[136] M.C. Liberman, S. Puria, and J.J. Jr. Guinan. The ipsilaterally evoked olivo-
cochlearreflex causes rapid adaptation of the 2f1-f2 distortion product otoacoustic
emission. JASA, 99:2572–3584, 1996.
[137] D. L. Wang. Relaxation Oscillators and Networks, pages 396–405. John Wiley Sons,
1999.
[138] R. Pichevar and J. Rouat. Streaming of audio objects on 2D spectral maps through
multiplicative synaptic connection neurons. In Auditory Perception, Cognition, and
Action Meeting , Vancouver, Canada, 2003.
[139] F. Gabbiani, H. Krapp, C. Koch, and G. Laurent. Multiplicative computation in a
visual neuron sensitive to looming. Nature, 420:320–324, 2002.
[140] JL. Pena and M. Konishi. Auditory spatial receptive fields created by multiplication.
Science, 292:294–252, 2001.
[141] R.A. Andersen, L.H. Snyder, D.C. Bradley, and J. Xing. Multimodal representation
of space in the posterior parietal cortex and its use in planning movements. Ann.
Rev. Neurosci., page 20:303, 1997.
[142] J. Rouat. Spatio-temporal pattern recognition with neural networks: Application
to speech. In Artificial Neural Networks-ICANN’97, Lect. Notes in Comp. Sc. 1327,
pages 43–48. Springer, 10 1997.
[143] http://www-edu.gel.usherbrooke.ca/pichevar/.
[144] J.-M. Valin, F. Michaud, J. Rouat, and D. LUtourneau. Robust sound source local-
ization using a microphone array on a mobile robot. In IEEE/RSJ-Int. Conference
on Intelligent Robots and Systems., 2003.
[145] J.-M. Valin, J. Rouat, and F. Michaud. Enhanced robot audition based on micro-
phone array source separation with post-filter. In IROS, 2004.
[146] G. Hu and D.L. Wang. Separation of stop consonants. In ICASSP 2003, 2003.
[147] http://www.itu.int/home/.
[148] R. Pichevar and J. Rouat. Bio-inspired sound source separation technique based
on a spiking neural network: Application to three-source sounds. Lecture Notes in
Computer Science (Springer-Verlag), to appear, 2004.
[149] B. Boashash and M. Mesbah. Signal enhancement by time-frequency peak filtering.
IEEE Trans. On Signal Processing, pages 929–938, 2004.
[150] S.C. Yen, E. D. Meschik, and L.H. Finkel. Cortical synchronization and perceptual
salience. Computational Neuroscience: Trends in Research, pages 125–130, 1993.
[151] D. Somers and N. Kopell. Rapid synchronization through fast threshold modulation.
Biological cybernetics, pages 393–407, 1993.
[152] N. Koppel and G.B. Ermentrout. Symmetry and phaselocking in chains of weakly
coupled oscillators. Communications on Pure and Applied Mathematics, pages 623–
660, 1986.
[153] R. Pichevar and J. Rouat. RN-spike process for spatio-temporal pattern recognition.
Canadian Provisional Patent, 2004.
[154] W. Konen, T. Maurer, and C. Von der Malsburg. A fast dynamic link matching
algorithm for invariant pattern recognition. Neural Networks, pages 1019–1030,
1994.
[155] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of in-
variances. Neural Computation, pages 715–770, 2002.
[156] T. Vinh Ho and J. Rouat. Novelty detection based on relaxation time of a net-
work of integrate-and-fire neurons. In Proc. IEEE Int’l Joint Conference on Neural
Networks, Alaska, USA, 1998.
[157] R. P. Wurtz. Multilayer Dynamic Link Networks for Establishing Image Point Cor-
respondences and Visual Object Recognition. PhD thesis, Ruhr-Universitat Bochum,
Germany, 1994.
[158] T. Aoinishi, K. Kurata, and T. Mito. A phase locking theory for matching common
parts of two images by dynamic link matching. Biological Cybernetics, 78(4):253–
264, 1998.
[159] R. Pichevar and J. Rouat. Oscillatory dynamic link matching for invariant pattern
recognition. Biosystems Journal (submitted), 2004.
[160] R. Pichevar and J. Rouat. Oscillatory dynamic link matcher: A bio-inspired neural
network for pattern recognition. In Brain Inspired Cognitive Systems 2004, Stirling,
Scotland (Invited Paper), 2004.
[161] X. Zhang and A. Minai. Detecting corresponding segments across images using
synchronizable pulse-coupled nerual networks. In IJCNN2001, 2001.
[162] L.E. Gordon. Theories of Visual Perception. John Wiley Sons, 1997.
[163] L. Wiskott, C. Von der Malsburg, and A. Weitzenfeld. The Neural Simulation
Language: A System for Brain Modeling, chapter 18, pages 343–372. MIT Press,
2002.
[164] H. Ando, N. Takashi Morie, M. Nagata, and A. Iwata. A nonlinear oscillator network
circuit for image segmentation with double-threshold phase detection. In ICANN
99, 1999.
[165] A.K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, 1989.
[166] R. VanRullen and S. J. Thorpe. Surfing a spike wave down the ventral stream.
Vision Research, pages 2593–2615, 2002.
[167] G.B. Ermentrout and N. Kopell. Parabolic bursting in an excitable system coupled
with a slow oscillation. SIAM J. Appl. Math., pages 233–253, 1986.
[168] C. Feldbauer and G. Kubin. Critically sampled frequency-warped perfect recon-
struction filterbank. In ECCTD‘03, 2003.
BIBLIOGRAPHY
[1] R. M. Borisyuk and Y. Kazanovich. Oscillatory neural network model of attention focus formation and control. Biosystems, 71:29–36, 2003.
[2] M. Cooke and D. Ellis. The auditory organization of speech and other sources in listeners and computational models. Speech Comm., pages 141–177, 2001.
[3] D. Wang and G. J. Brown. Separation of speech from interfering sounds based on oscillatory correlation. IEEE Transactions on Neural Networks, 10(3):684–697, May 1999.
[4] G. Hu and D.L. Wang. Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. On Neural Networks, pages 1135–1150, Sept. 2004.
[5] G. Jang and T. Lee. Single-channel signal separation using time-domain basis functions. Signal Processing Letters, pages 168–171, June 2003.
[6] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990.
[7] T. Lengagne, T. Aubin, and J. Lauga. How do king penguins (Aptenodytes patagonicus) apply the mathematical theory of information to communicate in windy conditions? Proc. R. Soc. (London) B Biology, 266:1623–1628, 1999.
[8] J. Kanwal, A. Medvedev, and C. Micheyl. Neurodynamics for auditory stream segregation: tracking sounds in the mustached bat’s natural environment. Network: Computation in Neural Systems, 14(13), 2003.
[9] E. C. Cherry. Some experiments on the recognition of speech, with one and with two ears. Journal of Acoustical Society of America, 25:975–979, 1953.
[10] J. Driver. Enhancement of selective listening by illusory mislocation of speech sounds due to lip-reading. Nature, 381:66–68, 1996.
[11] K. Koffka. Principles of Gestalt Psychology. Lund Humphries (London), 1935.
[12] A.J.W. Van der Kouwe, D.L. Wang, and G. J. Brown. A comparison of auditory and blind separation techniques for speech segregation. IEEE Trans. on Speech and Audio Processing, 9:189–195, 2001.
[13] D. Marr. Vision. Freeman Publishers, 1982.
[14] W. Ainsworth and S. Greenberg. Springer Handbook of Auditory Research. Springer, 2003.
[15] L. A. Jeffress. A place theory of sound localization. J. Comp. Physiol. Psychol., 41:35–39, 1948.
[16] J.C.R. Licklider and W.H. Huggins. Place mechanisms of auditory frequency analysis. JASA, 23:290–299, 1951.
[17] T. W. Parsons. Separation of speech from interfering speech by means of harmonic selection. JASA, 60:911–918, 1976.
[18] R.F. Lyon. A computational model of filtering, detection and compression in the cochlea. In ICASSP, 1982.
[19] M.T. Scheffers. Sifting Vowels: Auditory Pitch Analysis and Sound Segregation. PhD thesis, Groningen University, The Netherlands, 1983.
[20] M. Weintraub. A computational model for separating two simultaneous talkers. In ICASSP, 1986.
[21] C. Von der Malsburg and W. Schneider. A neural cocktail-party processor. Biol. Cybernetics, pages 29–40, 1986.
[22] F. Berthommier and G. Meyer. Improving of amplitude modulation maps for f0-dependent segregation of harmonic sounds. In Eurospeech’97, 1997.
[23] R.J. Stubbs and A.Q. Summerfield. Evaluation of 2 voice-separation algorithms using normal-hearing and hearing-impaired listeners. JASA, 84:1236–1249, 1988.
[24] M. Cooke. Modelling Auditory Processing and Organisation. PhD thesis, University of Sheffield, 1991.
[25] K. Mellinger. Event Formation and Separation in Musical Sound. PhD thesis, Stanford University, 1991.
[26] K. Kashino and H. Tanaka. A sound source separation system using spectral features integrated by the Dempster’s law of combination. Annual Report of the Engineering Research Institute, University of Tokyo, 51:67–72, 1992.
[27] G. Brown and M. Cooke. Computational auditory scene analysis. Computer Speech and Language, pages 297–336, 1994.
[28] A. de Cheveigne. Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing. Journal of Acoustical Society of America, pages 3271–3290, 1993.
[29] R.D. Patterson, M. H. Allerhand, and C. Giguere. Time-domain modelling of peripheral auditory processing: A modular architecture and a software platform. JASA, 98:1890–1894, 1995.
[30] D. Ellis. Prediction-Driven Computational Auditory Scene Analysis. PhD thesis, MIT, 1996.
[31] D.F. Rosenthal and H. G. Okuno. Computational Auditory Scene Analysis. Lawrence Erlbaum Assoc, 1998.
[32] M. Cooke, P. Green, L. Josifovski, and A. Vizinho. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Communication, 34:267–285, 2001.
[33] S. T. Roweis. One microphone source separation. In NIPS, Denver, USA, 2000.
[34] M. J. Reyes-Gomez, B. Raj, and D. Ellis. Multi-channel source separation by factorial HMMs. In ICASSP 2003, 2003.
[35] G. Hu and D. L. Wang. Monaural speech segregation based on pitch tracking and amplitude modulation. Technical report, Ohio State University, 2002.
[36] M. Wu, D.L. Wang, and G.J. Brown. A multipitch tracking algorithm for noisy speech. IEEE Trans. on Speech and Audio Processing, 2003.
[37] D.P. Gibson, N. W. Campbell, and B.T. Thomas. Very low bit rate semantic compression of natural outdoor images. In Picture Coding Symposium, Oregon, USA, 1999.
[38] N. Todd. An auditory cortical theory of auditory stream segregation. Network: Computation in Neural Systems, 7:349–356, 1996.
[39] G. Langner. Temporal processing of pitch in the auditory system. J. New Music Res, pages 116–132, 1997.
[40] S. Cunningham and M. Cooke. The role of evidence and counter-evidence in speech perception. In International Congress of Phonetic Sciences 1999, 1999.
[41] J. Rouat and R. Pichevar. Source separation with one ear: Proposition for an anthropomorphic approach. EURASIP Journal on Applied Signal Processing (submitted, invited paper), 2004.
[42] R. Pichevar and J. Rouat. Cochleotopic/AMtopic (CAM) and Cochleotopic/Spectrotopic (CSM) map based sound source separation using relaxation oscillatory neurons. In IEEE Neural Networks for Signal Processing Workshop, Toulouse, France, 2003.
[43] R. Pichevar and J. Rouat. Monophonic source separation with an unsupervised network of spiking neurons. Speech Communication (Elsevier), submitted, 2004.
[44] F. Gaillard. Analyse de Scenes Auditives Computationnelle (CASA): Un Nouvel Outil de Marquage Du Plan Temps-Frequence Par Detection D’harmonicite Exploitant Une Statistique de Passage Par Zero. PhD thesis, INPG, 1999.
[45] F. Klassner, V. Lesser, and S.H. Nawab. The IPUS Blackboard Architecture as a Framework for Computational Auditory Scene Analysis. In Computational Auditory Scene Analysis, D.F. Rosenthal and H.G. Okuno (Eds.), 1998.
[46] S. Grossberg, K. K. Govindarajan, L.L. Wyse, and M.A. Cohen. ARTSTREAM: A neural network model of auditory scene analysis and source segregation. Neural Networks, 2003.
[47] S. T. Roweis. Factorial models and refiltering for speech separation and denoising. In Eurospeech 2003, 2003.
[48] H. Sameti, H. Sheikhzadeh, L. Deng, and R.L. Brennan. HMM-based strategies for enhancement of speech signals embedded in nonstationary noise. IEEE Trans. on Speech and Audio Processing, pages 445–455, 1998.
[49] R. Remez and P. E. Rubin. Speech perception without traditional speech cues. Science, pages 947–949, May 1981.
[50] R. E. Remez and P.E. Rubin. On the perceptual organization of speech. Psychological Review, pages 129–148, 1994.
[51] J. Barker and M. Cooke. Is the sine-wave speech cocktail party worth attending? Speech Communication, 27:159–174, 1999.
[52] C.G. Tsai. Auditory grouping in the perception of roughness induced by subharmonics: Empirical findings and a qualitative model. In International Symposium on Musical Acoustics, Japan, 2004.
[53] T.S. Parker and L.O. Chua. Practical Numerical Algorithms for Chaotic Systems. Springer-Verlag, 1989.
[54] F. Vrins, J. A. Lee, M. Verleysen, V. Vigneron, and C. Jutten. Improving independent component analysis performances by variable selection. In IEEE NNSP, 2003.
[55] J-F. Cardoso. Blind signal separation: Statistical principles. Proc. IEEE, 86:2009–2025, 1998.
[56] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley and Sons, 2001.
[57] P. Comon. Independent component analysis: A new concept? Signal Processing, 36:287–314, 1994.
[58] M. Casey. Separation of mixed audio sources by independent subspace analysis. In Int’l Computer Music Conference, Berlin, Germany, 2000.
[59] L.Q. Zhang, S. Amari, and A. Cichocki. Natural gradient approach to blind separation of over- and under-complete mixtures. In Proc. Int. Workshop on Independent Component Analysis and Blind Source Separation, pages 455–460, 1999.
[60] P. Comon. Blind identification and source separation in 2x3 under-determined mixtures. IEEE Trans. on Signal Processing, pages 11–22, 2004.
[61] L. Albera, P. Comon, P. Chevalier, and A. Ferreol. Blind identification of underdetermined mixtures based on the hexacovariance. In International Conference on Audio Speech and Signal Processing, 2004.
[62] M. Cooke. http://www.dcs.shef.ac.uk/~martin/.
[63] C. Prodohl, R. Wurtz, and C. Von der Malsburg. Learning the gestalt rule of collinearity from object motion. Neural Computation, pages 1865–1896, 2003.
[64] W. Ross, S. Grossberg, and E. Mingolla. Visual cortical mechanisms of perceptual grouping: Interacting layers, networks, columns, and maps. Neural Networks, pages 571–588, 2000.
[65] C. Von der Malsburg. The what and why of binding: The modeler’s perspective. Neuron, pages 95–104, 1999.
[66] P. Milner. A model for visual shape recognition. Psychological Review, pages 521–535, 1974.
[67] A. Kristjansson, D.L. Wang, and K. Nakayama. The role of priming in conjunctive visual search. Cognition, 85:37–52, 2002.
[68] M. Shadlen and A. Movshon. Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24:67–77, 1999.
[69] M. Riesenhuber and T. Poggio. Are cortical models really bound by the binding problem? Neuron, 24:87–93, 1999.
[70] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, pages 1254–1259, 1998.
[71] J. Reynolds and R. Desimone. The role of neural mechanisms of attention in solving the binding problem. Neuron, 24:19–29, 1999.
[72] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.
[73] K. Fukushima. A neural network model for selective attention in visual pattern recognition. Biol. Cybernetics, pages 5–15, 1986.
[74] B.A. Olshausen, C.H. Anderson, and D.C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neuroscience, pages 4700–4719, 1993.
[75] E.O. Postma, H.J. Van der Herik, and P.T. W. Hudson. SCAN: A scalable neural model of covert attention. Neural Networks, 10:993–1015, 1997.
[76] E. Salinas and L.F. Abbott. Invariant visual responses from attentional gain fields. Journal of Neurophysiology, pages 3267–3272, 1997.
[77] L. Wiskott. How Does our Visual System Achieve Shift and Size Invariance. In J.L. Van Hemmen and T.J. Sejnowski (Eds.), Oxford University Press, 2003.
[78] MIT Encyclopedia of Cognitive Sciences. MIT Press, online.
[79] W. Singer. Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24:49–65, 1999.
[80] J. Wolfe and K. Cave. The psychological evidence for a binding problem. Neuron, 24:11–17, 1999.
[81] G. Bugmann. Binding by synchronisation: A task dependence hypothesis. Brain and Behaviour Sciences, pages 685–688, 1997.
[82] J. Rouat and R. Pichevar. Nonlinear speech processing techniques for source segregation. In EUSIPCO, Toulouse, France, 2002.
[83] V.I. Nenov. Neural network for learning, recognition, and recall of pattern sequences. US Patent, No. 5,222,348, 1993.
[84] E. M. Izhikevich. Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Trans. on Neural Networks, 10(3):499–507, 1999.
[85] H.R. Wilson and J.D. Cowan. Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12:1–24, 1972.
[86] W. Gerstner. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[87] E. Izhikevich. Which model to use for cortical spiking neurons? IEEE Trans. on Neural Networks, 2004.
[88] E. Izhikevich. Simple model of spiking neurons. IEEE Trans. on Neural Networks, 2003.
[89] L. Lapicque. Recherches quantitatives sur l’excitation electrique des nerfs traitee comme une polarisation. J. Physiol. Patho., pages 620–635, 1907.
[90] H. Kantz and T. Schreiber. Nonlinear time series. Cambridge University Press, 1997.
[91] L. Zhao and E. Macau. A network of dynamically coupled chaotic map for scene segmentation. IEEE Trans. on Neural Networks, pages 1375–1385, 2001.
[92] K. Kaneko. Globally coupled chaos violates the law of large numbers but not the central-limit theorem. Physical Review Letters, pages 1391–1394, 1990.
[93] K. Kaneko. Chaotic but regular posi-nega switch among coded attractors by cluster-size variation. Physical Review Letters, pages 219–223, 1989.
[94] J. Ito and K. Kaneko. Self-organized hierarchical structure in a plastic network of chaotic units.
[95] F. Pasemann. Complex dynamics and the structure of small neural networks. Network: Computation in Neural Systems, pages 195–216, 2002.
[96] E. Izhikevich. Dynamical Systems in Neuroscience: The geometry of excitability and bursting. Springer-Verlag (to appear), 2005.
[97] F.C. Hoppensteadt and E. Izhikevich. Weakly Connected Neural Networks. Springer-Verlag, New York, 1997.
[98] R. Hilborn. Chaos and Nonlinear Dynamics: An Introduction for Scientists and Engineers. Oxford University Press, 2000.
[99] R. Borisyuk. Synchronization of neural activity and information coding. In NCWS 2003, 2003.
[100] D.L. Wang and D. Terman. Image segmentation based on oscillatory correlation. Neural Computation, pages 805–836, 1997.
[101] D. Wang. Relaxation oscillators and networks. In Wiley Encyclopedia of Electrical and Electronics Engineering, pages 396–405. Wiley & Sons, 1999.
[102] S. R. Campbell, D. L. Wang, and C. Jayaprakash. Synchrony and desynchrony in integrate-and-fire oscillators. Neural Computation, pages 1595–1619, 1999.
[103] D. L. Wang and D. Terman. Image segmentation based on oscillatory correlation. Neural Computation, pages II 521–II 525, 1995.
[104] R. Pichevar and J. Rouat. Binding of audio elements in the sound source segregation problem via a two-layered bio-inspired neural network. In IEEE CCECE’2003.
[105] R. Pichevar and J. Rouat. Double-vowel segregation through temporal correlation: A bio-inspired neural network paradigm. In NOLISP’2003, 2003.
[106] R. Pichevar and J. Rouat. Oscillatory dynamic link matching for pattern recognition. In International Workshop on Neural Coding (NCWS), Aulla, Italy, 2003.
[107] H.X. Wang and G.Q. Bi. Temporal asymmetry in spike timing-dependent synaptic plasticity. Physiology and Behavior, pages 551–555, 2002.
[108] K.P. Kording and P. Konig. Neurons with two sites of synaptic integration learn invariant representations. Neural Computation, pages 2823–2849, 2001.
[109] R. Van Rullen and S. J. Thorpe. Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Computation, 13:1255–1283, 2001.
[110] C. Panchev, S. Wermter, and H. Chen. Spike-timing dependent competitive learning of integrate-and-fire neurons with active dendrites. In ICANN, Spain, 2002.
[111] Simon Haykin. Neural Networks: A Comprehensive Foundation. Macmillan, 1994.
[112] W. Maass and C. M. Bishop. Pulsed Neural Networks. MIT Press, 1998.
[113] R. Eckhorn. Neural mechanisms of scene segmentation: Recordings from the visual cortex suggest basic circuits for linking field models. IEEE Trans. on Neural Networks, 10(3):464–479, 1999.
[114] X. Liu and D.L. Wang. Range image segmentation using a relaxation oscillator network. IEEE Trans. On Neural Networks, pages 564–574, May 1999.
[115] E. Cesmeli and D. Wang. Motion segmentation based on motion/brightness integration and oscillatory correlation. IEEE Trans. on Neural Networks, 11(4):935–947, 2000.
[116] D. Wang and D. Terman. Locally excitatory globally inhibitory oscillator networks. IEEE Trans. on Neural Networks, pages 283–286, 1995.
[117] D. L. Wang. On connectedness: A solution based on oscillatory correlation. Neural Computation, pages 131–139, 2000.
[118] S. N. Wrigley and G. J. Brown. A neural oscillator model of auditory attention. Lecture Notes in Computer Science, pages 1163–1170, 2001.
[119] H. Nakano and T. Saito. Synchronization in a pulse-coupled network of chaotic spiking oscillators. In 45th Midwest Symposium on Circuits and Systems, 2002.
[120] N. Cowan. Evolving conceptions of memory storage, selective attention and their mutual constraints within the human information processing system. Psychol. Bull., 104:163–191, 1988.
[121] B. Widrow. Adaptive noise cancelling: Principles and applications. Proceedings of the IEEE, 63(12), 1975.
[122] Y. Kaneda and J. Ohga. Adaptive microphone-array system for noise reduction. TrASSP, pages 1391–1400, 1986.
[123] J.-M. Valin, J. Rouat, and F. Michaud. Microphone array post-filter for separation of simultaneous non-stationary sources. In ICASSP, Montreal, Canada, 2004.
[124] M.S. Brandstein and D.B. Ward (Eds.). Microphone Arrays: Signal Processing Techniques and Applications. Springer Verlag, 2001.
[125] J. Sanchez-Bote, J. Gonzalez-Rodriguez, and J. Ortega-Garcia. A real-time auditory-based microphone array assessed with E-RASTI evaluation proposal. In ICASSP, Hong-Kong, 2003.
[126] M.R. Gomez, D. Ellis, and N. Jojic. Multiband audio modeling for single-channel acoustic source separation. In ICASSP 2004, 2004.
[127] P.A. Cariani and B. Delgutte. Neural correlates of the pitch of complex tones. I. Pitch and pitch salience. II. Pitch shift, pitch ambiguity, phase invariance, pitch circularity, and the dominance region for pitch. J. Neurophysiology, 1996.
[128] R.D. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand. Complex sounds and auditory images. In Y. Cazals, L. Demany, and K. Horner, editors, Auditory Physiology and Perception, pages 429–446. Pergamon Press, Oxford, 1992.
[129] R. Pichevar, J. Rouat, C. Feldbauer, and G. Kubin. A bio-inspired sound source separation technique in combination with an enhanced FIR gammatone Analysis/Synthesis filterbank. In EUSIPCO, Vienna, 2004.
[130] T. Irino and M. Unoki. A time-varying, analysis/synthesis auditory filterbank using the gammachirp. In 98, volume 6, pages 3653–3656, Seattle, Washington, May 1998.
[131] Gernot Kubin and W. Bastiaan Kleijn. On speech coding in a perceptual domain. In 99, volume 1, pages 205–208, Phoenix, Arizona, March 1999.
[132] Malcolm Slaney. An efficient implementation of the Patterson-Holdsworth auditory filter bank. Technical Report 35, Apple Computer, Inc, 1993.
[133] J. Rouat, Y. C. Liu, and D. Morissette. A pitch determination and voiced/unvoiced decision algorithm for noisy speech. Speech Comm., 21:191–207, 1997.
[134] F. Plante, G. Meyer, and W. Ainsworth. Improvement of speech spectrogram accuracy by the method of reassignment. IEEE Trans. on Speech and Audio Processing, pages 282–287, 1998.
[135] C. Giguere and Philip C. Woodland. A computational model of the auditory periphery for speech and hearing research. JASA, pages 331–349, 1994.
[136] M.C. Liberman, S. Puria, and J.J. Guinan Jr. The ipsilaterally evoked olivocochlear reflex causes rapid adaptation of the 2f1-f2 distortion product otoacoustic emission. JASA, 99:3572–3584, 1996.
[137] D. L. Wang. Relaxation Oscillators and Networks, pages 396–405. John Wiley & Sons, 1999.
[138] R. Pichevar and J. Rouat. Streaming of audio objects on 2D spectral maps through multiplicative synaptic connection neurons. In Auditory Perception, Cognition, and Action Meeting, Vancouver, Canada, 2003.
[139] F. Gabbiani, H. Krapp, C. Koch, and G. Laurent. Multiplicative computation in a visual neuron sensitive to looming. Nature, 420:320–324, 2002.
[140] J.L. Pena and M. Konishi. Auditory spatial receptive fields created by multiplication. Science, 292:249–252, 2001.
[141] R.A. Andersen, L.H. Snyder, D.C. Bradley, and J. Xing. Multimodal representation of space in the posterior parietal cortex and its use in planning movements. Ann. Rev. Neurosci., 20:303, 1997.
[142] J. Rouat. Spatio-temporal pattern recognition with neural networks: Application to speech. In Artificial Neural Networks-ICANN’97, Lect. Notes in Comp. Sc. 1327, pages 43–48. Springer, October 1997.
[143] http://www-edu.gel.usherbrooke.ca/pichevar/.
[144] J.-M. Valin, F. Michaud, J. Rouat, and D. Létourneau. Robust sound source localization using a microphone array on a mobile robot. In IEEE/RSJ-Int. Conference on Intelligent Robots and Systems, 2003.
[145] J.-M. Valin, J. Rouat, and F. Michaud. Enhanced robot audition based on microphone array source separation with post-filter. In IROS, 2004.
[146] G. Hu and D.L. Wang. Separation of stop consonants. In ICASSP 2003, 2003.
[147] http://www.itu.int/home/.
[148] R. Pichevar and J. Rouat. Bio-inspired sound source separation technique based on a spiking neural network: Application to three-source sounds. Lecture Notes in Computer Science (Springer-Verlag), to appear, 2004.
[149] B. Boashash and M. Mesbah. Signal enhancement by time-frequency peak filtering. IEEE Trans. On Signal Processing, pages 929–938, 2004.
[150] S.C. Yen, E. D. Meschik, and L.H. Finkel. Cortical synchronization and perceptual salience. Computational Neuroscience: Trends in Research, pages 125–130, 1993.
[151] D. Somers and N. Kopell. Rapid synchronization through fast threshold modulation. Biological Cybernetics, pages 393–407, 1993.
[152] N. Kopell and G.B. Ermentrout. Symmetry and phaselocking in chains of weakly coupled oscillators. Communications on Pure and Applied Mathematics, pages 623–660, 1986.
[153] R. Pichevar and J. Rouat. RN-spike process for spatio-temporal pattern recognition. Canadian Provisional Patent, 2004.
[154] W. Konen, T. Maurer, and C. Von der Malsburg. A fast dynamic link matching algorithm for invariant pattern recognition. Neural Networks, pages 1019–1030, 1994.
[155] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, pages 715–770, 2002.
[156] T. Vinh Ho and J. Rouat. Novelty detection based on relaxation time of a network of integrate-and-fire neurons. In Proc. IEEE Int’l Joint Conference on Neural Networks, Alaska, USA, 1998.
[157] R. P. Wurtz. Multilayer Dynamic Link Networks for Establishing Image Point Correspondences and Visual Object Recognition. PhD thesis, Ruhr-Universitat Bochum, Germany, 1994.
[158] T. Aonishi, K. Kurata, and T. Mito. A phase locking theory for matching common parts of two images by dynamic link matching. Biological Cybernetics, 78(4):253–264, 1998.
[159] R. Pichevar and J. Rouat. Oscillatory dynamic link matching for invariant pattern recognition. Biosystems Journal (submitted), 2004.
[160] R. Pichevar and J. Rouat. Oscillatory dynamic link matcher: A bio-inspired neural network for pattern recognition. In Brain Inspired Cognitive Systems 2004, Stirling, Scotland (Invited Paper), 2004.
[161] X. Zhang and A. Minai. Detecting corresponding segments across images using synchronizable pulse-coupled neural networks. In IJCNN 2001, 2001.
[162] L.E. Gordon. Theories of Visual Perception. John Wiley & Sons, 1997.
[163] L. Wiskott, C. Von der Malsburg, and A. Weitzenfeld. The Neural Simulation Language: A System for Brain Modeling, chapter 18, pages 343–372. MIT Press, 2002.
[164] H. Ando, N. Takashi Morie, M. Nagata, and A. Iwata. A nonlinear oscillator network circuit for image segmentation with double-threshold phase detection. In ICANN 99, 1999.
[165] A.K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, 1989.
[166] R. VanRullen and S. J. Thorpe. Surfing a spike wave down the ventral stream. Vision Research, pages 2593–2615, 2002.
[167] G.B. Ermentrout and N. Kopell. Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., pages 233–253, 1986.
[168] C. Feldbauer and G. Kubin. Critically sampled frequency-warped perfect reconstruction filterbank. In ECCTD’03, 2003.