UNIVERSIDAD DE SEVILLA Departamento de Teoría de la Señal y Comunicaciones
Simulation tool for building and analyzing complex,
hierarchically structured AER-based systems that
implement visual information processing
Dissertation submitted by
José Antonio Pérez Carrasco
in candidacy for the degree of Doctor from the Universidad de Sevilla
Seville, January 2011
Advisors:
Dra. Begoña Acha Piñero
Dra. Carmen Serrano Gotarredona
Dr. Bernabé Linares Barranco
Dra. Teresa Serrano Gotarredona
Departamento de Teoría de la Señal y Comunicaciones
UNIVERSIDAD DE SEVILLA
To my grandfather, my grandmother, and my mother.
Acknowledgments
It has been a long road to get here, and truthfully not one free of difficulties. But enthusiasm and the will to work can overcome anything and clear a path despite the complications. Looking back to the beginning, I still remember the spark of excitement about doing a final-year project on medical imaging. And that spark became a flame when, thanks to Carmen, Bego, Bernabé and Teresa, the world of research “opened its doors” to me. A kind of feeling of belonging came over me, making me feel that this is my work and this is what I want to do. Sometimes life seems like a television series, in the sense that new characters keep appearing, or that through life's circumstances (or through misfortune) others are no longer companions on the road, and the plot keeps adding new situations, new complications, moments of tension, but also moments of excitement and joy, yes, the kind in which one has to admit to having cried from sheer happiness. In my life I am clear about who my main characters are: Thank you Carmen and Bego, though not just thank you, but infinite thanks. There would be so many things that not even by writing an entire book of gratitude could I tell everything you mean to me. Above all, thank you for opening the doors of your lives to me, for your affection and support at every moment, and for your advice at every step I have taken in everything. I am lucky to have two thesis advisors whom, besides loving them as my “teachers”, I love even more as my family. Only I know personally how much you have helped me: in personal matters, infinitely, and in work, infinitely too. Thank you, I owe it all to the two of you. Thank you Bernabé and Teresa, and again, not just thank you, but infinite thanks. With your example, your work and all your advice and constructive criticism you have made me progress enormously, and you have taught me to be more critical of myself and of my work.
Thank you for watering every day the seed of enthusiasm sown within me, but above all, thank you for being the wonderful people you are and for having supported me at every moment. You too are responsible for my having come this far. Thank you very much. Thank you Aurora and Carlos, infinite thanks as well, because in the end one spends almost more time at work than at home, and you have been with me in the good moments and the bad, in those of laughter and those of tears. Thank you for filling me with life every day with your friendship and company, and for sharing with me the sparks of your own hopes and worries. Thanks to José Ignacio Acha, for his advice and his attitude toward me. Thank you so very much.
Thanks to all my colleagues in the department and to the professors. Many, many thanks to Pablo Aguilera and Pablo Olmos, to Michelle and to Luis, for bringing our room to life and for being such wonderful people to me. Thanks to Eugenio, S. Thorpe and S. Furber, for having welcomed me and for devoting part of their extremely busy time to me. Thank you for having taught me so much. Of course, if there is anyone to whom I must be grateful, it is my parents and my siblings. They have lived through all my work very closely, getting more than one headache from it. Many thanks to them, because they are the pillars of my life and without them I could not have achieved anything. Thanks to my mother in particular, because she is the greatest in the whole world, because how much she has had to work and fight so that I (and my siblings) are where we are cannot be put into words. It must be through her example and struggle that I have learned to value infinitely the effort that every little thing costs. Thank you, Mum. Finally, thanks to the stellar person in my life, who surely to this day still carries my photo in his wallet, showing off his grandson up there. Surely he will be the one who feels proudest on the day I present my thesis and, even though he understands neither English nor anything of what I say, he will sit down, watch me the whole time with an amazing look of interest, and will not miss a single syllable of the words that leave my mouth. And if anyone talks, my grandfather will surely send them a “ssshhhhh” so he can keep following. Thanks to my grandfather for being the source of my motivation, the reason I always want to be a better person and to improve in my work. Thank you for being my companion now, every day. Many thanks to all of you, I love you.
Contents
List of Figures
List of Tables
1 INTRODUCTION
1.1 Antecedents
1.2 Objectives
1.3 Structure
2 EVENT-BASED PROCESSING SYSTEMS
2.1 Introduction
2.2 Event-Based vs Frame-Based Processing Systems
2.3 Coding Schemes for Event-Based Systems
2.3.1 Rate Coding
2.3.2 Rank Order Coding
2.3.3 Time-to-First-Spike
2.3.4 Spike-Count Rate
2.3.5 Population coding
2.3.6 Phase-of-firing code
2.3.7 Intensity Variation
2.4 AER Protocol for Event-Based Systems
3 IMPLEMENTATION OF AN AER SIMULATION TOOL
3.1 Requirement of a simulation tool
3.1.1 Synchronous or Clock-driven Algorithms
3.1.2 Asynchronous or Event-driven Algorithms
3.2 Description of the AER Simulation Tool
3.2.1 Configuration File
3.2.2 Event Description
3.2.3 Instance Description
3.2.4 Description of Program Flow
3.2.5 Algorithm Optimizations for Efficient Computation Speed
3.2.6 C++ implementation
3.3 AER Modules
3.3.1 AER switch Module
3.3.1.1 AER Splitter
3.3.1.2 AER Merger
3.3.2 Subsampling Module
3.3.3 Mapper Module
3.3.3.1 AER Scanner
3.3.3.2 AER Rotator
3.3.4 AER Convolution Chip
3.3.4.1 System Level Architecture of the Convolution Chip
3.3.4.2 AERST Convolution Module
3.3.5 Projection Module
3.3.6 Integrate and Fire Module
3.3.7 Rate-Reducer Module
3.3.8 Self-Exciting Modules
3.4 Validation of the AER Tool
3.4.1 Detection and tracking of moving circles of given radius
3.4.2 Recognition of high speed Rotating Propellers
4 MULTI-CHIP MULTI-LAYER CONVOLUTION PROCESSING FOR CHARACTER RECOGNITION
4.1 Fukushima’s Neocognitron
4.2 AER-based system for Character Recognition
4.3 Experimental Results
4.4 Discussion
5 IMPLEMENTATION OF TEXTURE RETRIEVAL USING AER-BASED SYSTEMS
5.1 Introduction
5.2 State of the art in texture recognition
5.3 AER implementation for texture retrieval
5.3.1 Frame-based implementation for texture retrieval
5.3.2 AER-based implementation for texture retrieval
5.4 Experimental Results
5.4.1 Comparison with the State-of-the-Art
5.5 Discussion
5.6 Conclusions and Future Work
6 EVENT-DRIVEN CONVOLUTIONAL NETWORKS FOR FAST VISION POSTURE RECOGNITION
6.1 Motivation
6.2 Frame-Based Convolutional Network
6.3 Justification of the Architecture Used
6.4 Frame-Free Convolutional Network
6.5 Results
6.5.1 AER ConvNet with 32x32 pixel inputs
6.5.2 AER ConvNet with 64x64 pixel inputs
6.6 Appendix
6.6.1 Learning in Convolutional Networks
6.6.2 Computations in the Frame-Based System
6.6.2.1 Filtering layers
6.6.2.2 Subsampling Layers
6.6.2.3 Full-Connection layer F6
6.6.3 Computations in the Frame-free system
6.6.3.1 Filtering Layers
6.6.3.2 Subsampling Layers
6.6.3.3 Sixth Layer F6
6.6.4 Implementation of non-linearities and equivalences between the frame-based and the AER-based implementation
7 CONCLUSIONS
Appendices
Appendix A AERST Tool User Guide
A.1 Introduction
A.2 Description of an AER System
A.3 MATLAB Initialization of Parameters and States
A.3.1 Initialization of Parameters
A.3.2 Initialization of States
A.4 RUNNING AERST in MATLAB
A.4.1 Building Modules
A.5 C++ Initialization of Parameters and States
A.5.1 Initialization of Parameters
A.5.2 Initialization of States
A.6 RUNNING AERST in C++
A.6.1 Building C++ Modules
A.7 Matlab Auxiliary Functions
A.7.1 Generation of AER events from a standard image
A.7.2 Reconstruction of images from channels
A.7.3 Reconstruction of channels from the text output file
A.8 MATLAB Step-by-Step Example
A.8.1 Preparing the Stimulus Events
A.8.2 Setting Up the Configuration File
A.8.3 Initializing Parameters
A.8.3.1 Splitter
A.8.3.2 Chip1
A.8.4 Editing the Modules
A.8.4.1 Splitter Module
A.8.4.2 Chip1 Module
A.8.4.3 Merger Module
A.8.5 Editing the AERST.m file
A.8.6 Simulating the System
A.8.7 Viewing Results
A.9 C++ Step-by-Step Example
A.9.1 Converting a Matrix of Events to a source text file
A.9.2 Setting Up the Configuration File
A.9.3 Initializing Parameters
A.9.3.1 Splitter
A.9.3.2 Chip1
A.9.4 Editing the C++ Modules
A.9.4.1 Splitter Module
A.9.4.2 Chip1 C++ Module
A.9.4.3 MERGER C++ Module
A.9.5 Simulating the System in C++
Appendix B SUMMARY
B.1 INTRODUCTION
B.2 Description of the AERST Simulator
B.3 IMPLEMENTATIONS
B.4 AER-based Character Recognition System
B.5 Results
B.6 Image Classification based on texture information
B.7 Convolutional neural network for people recognition
B.7.1 FRAME-BASED NEURAL NETWORK FOR PEOPLE DETECTION
B.7.2 AER-BASED NEURAL NETWORK FOR PEOPLE DETECTION
B.7.3 Results
Bibliography
List of Figures
2.1 Conceptual illustration of frame-based (top) versus event-based (bottom) vision sensing and processing system.
2.2 Comparison of timing issues between (top) a frame-based and (bottom) an event-based sensing and processing system.
2.3 Rate-based vs Rank-Order based scheme
2.4 Representation of the Time-to-First Spike coding scheme
2.5 Concept of point-to-point interchip AER communication.
3.1 Example AER system and its ASCII file netlist description
3.2 Basic Algorithm implemented by the AER tool
3.3 Time Optimizations in the Simulation tool
3.4 CAVIAR AER vision system
3.5 AER-Switch hardware interface
3.6 AER-switch acting as Splitter or Merger
3.7 Scanner and Rotator AER Modules
3.8 Comparison between (a) classical frame-based and (b) AER event-based convolution processing.
3.9 Convolution Chip 32x32
3.10 AER convolution module implemented in the tool
3.11 AER Integrate and Fire module implemented in the tool
3.12 AER Self-Exciting Module
3.13 Block diagram of the AER system developed to simulate the hardware implementation
3.14 a) Kernel to detect a circumference of a certain radius. b) Kernel used in the WTA module.
3.15 Winner-Takes-All module
3.16 On the left, input and output obtained with the hardware implementation. On the right, input and output obtained with the simulated implementation
3.17 a) Kernel used to detect the propeller, b) and c) input and output when we collect events during 50µs, d) and e) input and output when we collect events during 200ms.
4.1 A typical architecture of the neocognitron network.
4.2 The process of pattern recognition in the neocognitron. The lower half of the figure is an enlarged illustration of a part of the network.
4.3 Character recognition system based on AER
4.4 Kernels used in the first layer for feature detection. The red cross indicates the origin of coordinates of the kernel when it is projected onto the pixel array.
4.5 Kernels used in the second layer for spatial weighting. The kernel C at the bottom end is a single convolution chip for detecting whether the events coming from the previous layer are more or less clustered together.
4.6 Letters used for testing the system based on AER for character recognition.
4.7 The output events generated at the different convolution outputs {c1, c2, c3, c5, c11, d1, d2, d3, d5, d11, dA, dB, dC, dH, dL, dM, dT, fA, fB, fC, fH, fL, fM, fT} for the case of input stimulus ‘A’.
4.8 Events obtained in the system at outputs {c1, c2, c3, c5, c11, d1, d2, d3, d5, d11, dA, fA} when the input is letter ‘A’. Time is expressed in µs.
4.9 Events obtained in the system at input and output channels for the first version of each of the letters.
5.1 Scheme of the AER-based system implemented for texture-based retrieval of images
5.2 Scheme of a generic FEM used in Fig. 5.1.
5.3 Texture retrieval accuracy obtained for images D1-D2-D3-D8-D9-D10 as a function of Tcount (in milliseconds)
5.4 Comparison between frame-based and AER-based systems
6.1 Frame-based ConvNet to detect people in up, up-side-down or horizontal positions
6.2 Real scenarios where AER recordings with the motion retina were obtained
6.3 Images obtained by collecting input spikes from the retina every 30ms. The second and third rows were obtained by previously rotating the input events 90 and 180 degrees, respectively.
6.4 Comparison of recognition rates when we use a trainable set of filters in the first layer or a fixed Gabor filter bank.
6.5 Comparison of recognition rates when we use different Gabor filter banks with different numbers of scales and orientations.
6.6 Different accuracies obtained when varying the number of feature maps in the third layer while fixing the number of feature maps in the fifth layer.
6.7 Different accuracies obtained when varying the number of feature maps in the fifth layer while fixing the number of feature maps in the third layer.
6.8 Maximum absolute value of the weights during the training stage and at the end of the training stage.
6.9 AER-based implementation of the ConvNet system.
6.10 Convolution Structure at layers C3, C5. Each incoming spike causes a convolution map to be added to a pixel array.
6.11 Neuron in the pixel array. Each time a spike is received a certain weight is added to the neuron state.
6.12 Algorithm used to configure the system. First, the system was trained with the frame-based version. Then all the obtained weights were used in the frame-free system.
6.13 a) Images corresponding to downsampling the 128x128 input stimulus to 32x32. b) Images obtained by cropping the input stimulus to a central square of size 64x64 and downsampling the cropped stimulus to 32x32.
6.14 Input events used to test the system. The x axis represents time in seconds. The y axis represents the input event coordinates in a 32x32 pixel array, numbered from 0 to 1023.
6.15 Output events corresponding to each one of the input flows, a) output when input is up position, b) output when input is horizontal position, c) output when input is up-side-down position.
6.16 Recognition rate and number of output events per second obtained by varying the refractory times in layers C3 and C5.
6.17 a) Input and output activity when input is alternated between up, horizontal and up-side-down positions. No refractory periods have been considered. Values ‘5’, ‘6’ and ‘7’ correspond to up, horizontal and up-side-down respectively. Absolute values ‘1’, ‘2’, ‘3’ and ‘4’ correspond to the output channels identifying up-side-down, horizontal, up positions and noise respectively. b) Input and output activity when input is alternated and a refractory period of 9ms is used in layer C5. Input event correct orientation is shown by the blue line. Values ‘1’, ‘2’ and ‘3’ correspond to up, horizontal and up-side-down positions, respectively. Output events corresponding to the up category are represented with blue circles, with red crosses for the horizontal category, with green stars for the up-side-down category and black dots for the noise category. c) Input and output activity when a refractory time of 18ms is used in layer C5. d) Input and output activity when the simulated annealing algorithm is employed to obtain optimum parameters
6.18 Recognition Rate and Number of Output Events obtained when varying forgetting rates F1, F3, F5, and F6. a) Results when varying F1 and F3. b) Results when varying F3 and F6. c) Results when varying F3 and F5. d) Results when varying F1 and F5.
6.19 Zoomed version of the simulation results of Fig. 6.17 between 5760ms and 5830ms.
6.20 a) Input and output activity when input is alternated and a refractory period of 18ms is used in layer C5. Input events are shown by grey circles. Values ‘1’, ‘2’ and ‘3’ correspond to up, horizontal and up-side-down positions respectively. Output events corresponding to the up category are represented with blue circles, output events corresponding to the horizontal position are represented by red crosses, output events corresponding to the up-side-down positions are represented by green stars and the noise category by black points. b) Input and output activity when the simulated annealing algorithm is employed to obtain optimum parameters
6.21 Computation of the saturation point in the hyperbolic tangent function. The function saturates when the absolute value of the argument is higher than 1.5283.
A.1 System Simulated in the Step-by-Step Example
B.1 Concept of AER-based point-to-point communication.
B.2 Example AER system and its description by means of an ASCII file
B.3 AER-based Character Recognition System
B.4 Characters used to evaluate the AER System.
B.5 Scheme of the AER-based system for texture-based image classification
B.6 Frame-based convolutional neural system to detect people in up, horizontal or up-side-down positions.
B.7 AER implementation of the convolutional neural network for people recognition.
B.8 a) Input and output of the system when the input is alternated between the ‘up’, ‘horizontal’ and ‘up-side-down’ positions. Values ‘5’, ‘6’ and ‘7’ correspond to the ‘horizontal’, ‘up-side-down’ and ‘up’ positions respectively. Absolute values ‘1’, ‘2’, ‘3’ and ‘4’ correspond to the activity in the output channels identifying the ‘horizontal’, ‘up-side-down’, ‘up’ and ‘noise’ categories.
List of Tables
4.1 Origin of Coordinates for Kernels in Layer 2
4.2 Timing and accuracy obtained for each of the letters
5.1 Retrieval Performance for Each of the 112 Brodatz Images. Comparison Between Manjunath’s Frame-Based Method and the AER-Based Proposed Approach.
5.2 Comparison of Average Retrieval Rate Between Different Methods Using the Brodatz Database
5.3 Comparison of Computational Times Between Different Methods Using the Brodatz Database
6.1 Parameters used in the frame-based and AER-based implementations
6.2 Maximum kernel weights, threshold values, refractory times, layer times and events per second in the system
6.3 Parameter Vector obtained with the Simulated Annealing Algorithm
6.4 Time-to-first output event after transitions of the input between up, horizontal and up-side-down positions.
6.5 Parameter Vector obtained with the Simulated Annealing Algorithm
6.6 Refractory periods, layer times and maximum number of events per second computed for each layer in the system
Chapter 1
INTRODUCTION
1.1 Antecedents
Even today, it seems inconceivable that the most powerful computers cannot perform, within milliseconds, a “simple” detection, recognition and tracking of an object or person in an environment with multiple distractors. Computers are good at certain kinds of tasks (mathematical operations, storage, data retrieval, etc.). However, they are still not able to carry out other tasks that a person would consider simple, such as object recognition or decision making.
These operations, which are straightforward for the human brain, require multiple sequential operations on computers, which are still mostly based on the classical von Neumann architecture, where a stored program is executed instruction by instruction. For vision and image processing the situation is even worse, as all the pixels in the image have to be processed one by one with complex operations. Only when all the pixels have been processed can features and results be extracted and sent to the next layer for further processing. Thus, in addition to the individual processing of every pixel, several layers are required to complete a computation that, in the best case, will only come close to that of a human brain in terms of both speed and efficiency.
These efficiency and speed limitations of classical processing systems have motivated many researchers to study the biological processing performed in brains, in order to propose and analyze new techniques and architectures for implementing those “so simple” tasks. This is not a simple issue, as the processing power of the human brain is mainly due to the high number of neurons and the massive interconnections (synapses) between them. This massive interconnection allows all the input information to be processed in parallel, providing real-time operation. Moreover, in the case of vision processing, and in contrast to classical computers, information is not processed frame by frame; instead, independent, individual pulses are transmitted and integrated, processed in parallel and passed on layer by layer without waiting to collect all the input visual information.
Many researchers are working on emulating (both in software and in hardware) the human visual system and its processing (consider, for example, the works of T. Serre [1], Fukushima [2], T. Masquelier [3], Y. LeCun [4], S. Furber [5], etc.). Other researchers study where and how the information is coded (rank-order coding, rate-based coding, population coding, etc.), and other analyses focus on determining the best model for neurons and their learning strategy (integrate-and-fire (I&F), Izhikevich [6], Hodgkin and Huxley [7], etc.).
The main barrier these researchers face when trying to mimic biological systems is the massive connectivity present in them. With today's technologies it is feasible to fabricate many thousands (even millions) of artificial neurons or simple processing cells on a single chip. However, it is not viable to physically connect each of them to even a few hundred other neurons. The problem is greater for multi-chip, multi-layer, hierarchically structured bio-inspired systems. Address-event representation (AER) is a promising emerging hardware technology that shows potential for meeting the computing requirements of large frameless projection-field-based multilayer systems, providing a hardware solution to the massive connectivity problem. AER was first proposed in 1991 in one of the California Institute of Technology (Caltech) research labs [30], and has since been used by a wide community of neuromorphic hardware engineers.
The main conclusion to be drawn from these ideas is that it would be desirable to have an easy-to-use simulation tool for modeling and simulating cortical-like processing systems. This tool should take into account not only the existing biological models of neurons and their learning strategies, but also the limitations (timing, space, connectivity, etc.) and performance figures of the existing hardware elements that also try to emulate these complex bio-inspired processing strategies.
1.2 Objectives
The first main objective of this thesis is to propose and evaluate an event-driven
software tool, AERST, for simulating systems based on the AER (address-event
representation) protocol. The second objective is to build and analyze complex
multilayer multichip AER vision processing systems using the developed simulator tool.
The systems considered should be composed of existing AER hardware elements in order
to provide real timing and performance figures. To achieve these two objectives, the
work has been carried out in several stages:
1. Implementation of the AER software simulator tool AERST. To achieve this it
is necessary to study the different programming environment options and the
characteristics of the AER elements and systems that are going to be used (if
already existing) or implemented.
2. Validation of the AERST tool. To do this, several experiments already implemented
in hardware will be repeated with the tool for comparison purposes. AERST should
provide the same results (timing and performance) as those obtained with the
hardware systems.
3. Proposal of new complex AER vision processing systems using AERST. These systems
should be composed only of existing AER elements or of elements that can be
physically implemented in the near future.
The work developed in this thesis has been carried out within the Andalusian project
P06TIC01417 (Brain System) at the National Microelectronics Center (IMSE-CNM-CSIC)
and the University of Seville. The aim of this project was to build a complete vision
processing system using bio-inspired AER convolution chips to emulate the processing
implemented in human brains.
1.3 Structure
The current thesis is structured as follows:
1. The First Chapter is an introduction comparing event-based and frame-based
systems. It also explains the AER hardware protocol as a solution for implementing,
with multiple AER chips, the massive interconnection existing in brains.
2. The Second Chapter explains the AERST simulation tool and the validations carried
out to assess it.
3. The Third Chapter shows one multi-layer feed-forward neural processing system
for character recognition in AER.
4. The Fourth Chapter explains a four-layer system implementing texture retrieval
for image classification.
5. In the Fifth Chapter a neural network consisting of six layers trained using back-
propagation is implemented to detect people in three different positions: up,
horizontal and up-side-down.
6. The Sixth Chapter presents the conclusions and future work.
7. The Seventh Chapter is an appendix that provides a tutorial and user manual for
the AERST simulation tool.
8. Finally, a list of publications is provided.
Chapter 2
EVENT-BASED PROCESSING
SYSTEMS
2.1 Introduction
Man-made machine vision systems operate in a quite different way from biological
brains. Machine vision systems usually capture and process sequences of frames. For
example, a video camera captures images at about 25-30 frames per second, which are
then processed frame by frame, pixel by pixel, usually with convolution operations,
to extract, enhance, and combine features, and to perform operations in feature
spaces, until a desired recognition is achieved. This frame convolution processing is
slow, especially if many convolutions need to be computed in sequence for each input
image or frame. Biological brains do not seem to operate on a frame-by-frame basis.
In the retina, each pixel sends spikes (also called events) to the cortex when its
activity level reaches a threshold. Pixels are not read by an external scanner;
each pixel decides when to send an event. All these spikes are transmitted as they
are produced, without waiting for an artificial “frame time” before being sent to the
next processing layer.
Besides this frameless nature, brains are structured hierarchically in cortical layers [8].
Neurons (pixels) in one layer connect to a projection field of neurons (pixels) in the
next layer. This processing based on projection fields is similar to convolution-based
processing [9], at least for the earlier cortical layers. For example, it is widely accepted
that the first layer of visual cortex V1 performs an operation similar to a bank of
2-D Gabor-like filters at different scales and orientations [1] whose actual parameters
have been measured [10][11][12]. This fact has been exploited by many researchers to
propose powerful convolution-based image processing algorithms [1][13][14]. However,
convolutions are computationally expensive. It seems unlikely that the high number of
convolutions that might be performed by the brain could be emulated fast enough by
software programs running on the fastest of today’s computers.
It seems that the solution for powerful, frame-free, biological-like real-time vision
processing systems is to develop hardware that combines event-based multineuron
modules to compute projective fields. In these systems, relevant image features are
communicated and processed first, resulting in extremely high-speed processing
throughput. This way, the processing delay depends mainly on the number of layers,
and not on the complexity of the objects and shapes to be recognized. Their latency
and throughput are not limited by a conventional sampling rate.
2.2 Event-Based vs Frame-Based Processing Systems
To show the high-speed processing capabilities of event-based systems when compared
to frame-based systems, consider Fig. 2.1, which illustrates the conceptual
difference between a frame-based and an event-based sensing and processing system.
Each uses a camera sensor to capture reality. In the top row, a frame-based camera
captures a sequence of frames, each of which is transmitted to the computing system.
Each frame is processed by sophisticated image processing algorithms to achieve some
recognition. The computing system needs to have all the pixel values of a frame before
starting any computation. In the bottom row, an event-based vision sensor operates
without frames. Each pixel sends an event (usually its own (x, y) coordinate) when it
senses something (change in intensity [15], contrast with respect to neighboring
pixels [16], etc.). Events are sent out to the computing system as they are produced,
without waiting for a frame time. The computing system updates its state after each
event. Fig. 2.2 illustrates the inherent difference in timing between both concepts.
In the top part (frame-based), reality is binned into compartments of duration
Tframe. During the first frame T1, an event happens (such as a flashing shape), but
the information produced by this event does not reach the computing system until the
full frame is captured (at T1) and transmitted (with an additional delay ∆). Then, the
computing system has to process the full frame, handling a large amount of data and
requiring a long frame
Figure 2.1: Conceptual illustration of frame-based (top) versus event-based (bottom) vision sensing and processing system.
computation time TFC before the “recognition” information is available. In the
bottom part of Fig. 2.2, pixels “see” the event in reality directly and send out
their own events to the computing system with a delay ∆. Events are processed as they
flow, with an event computation delay Tev (some nanoseconds [17]). Not all events are
necessary to perform recognition. Actually, the more relevant events usually come out
first or with higher frequency. Consequently, the recognition time Trcg can be smaller
than the total duration of the events produced. Note that recognition is possible
before frame time T1, resulting in a negative T′FC when compared to the recognition
delay of a frame-based system.
2.3 Coding Schemes for Event-Based Systems
Neuroscience, computational, and engineering application researchers have reported
and used several schemes for information coding with spikes or events. Some of them
are summarized next.
2.3.1 Rate Coding
Rate coding is a traditional coding scheme, assuming that most, if not all, information
about the stimulus is contained in the firing rate of the neuron. In most sensory systems,
Figure 2.2: Comparison of timing issues between (top) a frame- and (bottom) an event-based sensing and processing system.
the firing rate increases, generally non-linearly, with increasing stimulus intensity [18].
Any information possibly encoded in the temporal structure of the spike train is ignored.
The concept of firing rates has been successfully applied during the last 80 years. It
dates back to the pioneering work of Adrian who showed that the firing rate of stretch
receptor neurons in the muscles is related to the force applied to the muscle [19]. In the
following decades, measurement of firing rates became a standard tool for describing
the properties of all types of sensory or cortical neurons, partly due to the relative ease
of measuring rates experimentally. However, this approach neglects all the information
possibly contained in the exact timing of the spikes. In recent years, a growing body
of experimental evidence has suggested that a straightforward firing rate concept
based on temporal averaging may be too simplistic to describe brain activity [20].
2.3.2 Rank Order Coding
While the idea of coding an image using a rate code may seem plausible, recent ex-
perimental work has actually ruled it out in the mouse retina, since the amount of
information available by counting spikes within a given amount of time was insufficient
to explain the animal’s behavioral performance [21]. In rank-order schemes (originally
proposed by Thorpe [22]), the information is encoded in the relative order of firing
across the population of neurons that is used (see also [23]). The idea follows natu-
rally from the fact that an integrate-and-fire neuron can be thought of as a capacitance
with a threshold. In response to a visual stimulus, retinal ganglion cells will charge up
progressively until they reach a threshold for generating a spike, and the time taken
to reach threshold will depend on how well the stimulus matches the cell’s receptive
field. Simulation studies have demonstrated that by using the order in which the cells
fire, it is possible to reconstruct an image sufficiently well to allow the key objects to
be identified even when less than 1% of the cells in the retina have had time to emit a
spike [24]. Furthermore, the idea that relative spike timing can be used as an efficient
code has recently been demonstrated experimentally in the salamander retina [25]. The
discovery of spike-timing-dependent plasticity, where synaptic efficacy is modulated
by the precise timing of spikes, is strong evidence that temporal codes (such as the
rank-order scheme) are used for cortical information transmission. The rate-based and
rank-order-based coding schemes are compared in Fig 2.3. In the upper part of the
figure, the stimuli received by three different neurons (labeled nA, nB and nC) are
shown. As can be seen, in the rate-based implementation (at the bottom of the
figure), the output activities of the three neurons vary with the intensity of the
stimuli. In this scheme no reference signal is needed. In contrast, in the rank-order
scheme (central part of the figure), the information is coded in the relative order in
which the three neurons fire. This time a reference signal is needed to mark the start
of each new time window. However, fewer spikes are required, and therefore less
communication bandwidth.
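The difference between the two readings can be sketched in a few lines of Python. This is only an illustrative sketch: the spike times, the window length and the function names are hypothetical and are not taken from Fig 2.3 or from the simulator described in this thesis.

```python
def rate_code(spike_times_ms, window_ms):
    """Rate-based reading: count the spikes of each neuron inside the window."""
    return {name: sum(1 for t in times if 0.0 <= t < window_ms)
            for name, times in spike_times_ms.items()}


def rank_order_code(spike_times_ms):
    """Rank-order reading: neuron labels sorted by the time of their first spike."""
    first = {name: min(times) for name, times in spike_times_ms.items() if times}
    return sorted(first, key=first.get)


# Hypothetical spike trains (ms after the reference signal) for three neurons.
spikes = {
    "nA": [4.0, 9.0, 14.0],
    "nB": [1.5, 4.0, 6.5, 9.0, 11.5, 14.0],
    "nC": [9.0],
}

counts = rate_code(spikes, window_ms=15.0)   # needs the whole time window
order = rank_order_code(spikes)              # needs only one spike per neuron
```

In this toy example the neuron with the highest count is also the first to fire, but the ordering is already available after the first spike of each neuron, which is why fewer events need to be transmitted.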
2.3.3 Time-to-First-Spike
This is a type of rank order coding. Each neuron sends only one spike (event) after
a periodic reset signal. The time difference between a pixel’s spike and the reference
signal codes the state of the neuron. Consequently, not only does the ordering of
spikes code the information (as in rank order coding), but the timing of the spikes
is also used to code the precise amplitude of neural states (see Fig 2.4).
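As a minimal sketch of this idea, the following Python fragment maps a normalized amplitude to a spike latency after the reset signal and back. The linear mapping, the 10 ms window and the function names are assumptions made for illustration only; real sensors may use a different (often non-linear) relation.

```python
def encode_ttfs(amplitude, t_window_ms=10.0):
    """Map an amplitude in [0, 1] to a latency after the reset signal.

    Assumed linear law: larger amplitudes fire earlier.
    """
    if not 0.0 <= amplitude <= 1.0:
        raise ValueError("amplitude must lie in [0, 1]")
    return (1.0 - amplitude) * t_window_ms


def decode_ttfs(latency_ms, t_window_ms=10.0):
    """Recover the amplitude from the latency measured from the reset signal."""
    return 1.0 - latency_ms / t_window_ms


latency_bright = encode_ttfs(0.9)   # strong input, early spike
latency_dark = encode_ttfs(0.2)     # weak input, late spike
```

Sorting the latencies gives the rank order, while the latency values themselves carry the amplitudes.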
Figure 2.3: Rate-based vs Rank-Order based scheme
2.3.4 Spike-Count Rate
The spike-count rate, also referred to as the temporal average, is obtained by
counting the number of spikes that appear during a trial and dividing by the duration
of the trial. In practice, to get sensible averages, several spikes should occur
within the time window. Typical values are T = 100 ms or T = 500 ms, but the duration
may also be longer
or shorter. The spike-count rate can be determined from a single trial, but at the
expense of losing all temporal resolution about variations in neural response during the
course of the trial. Temporal averaging can work well in cases where the stimulus is
constant or slowly varying. Real-world input, however, is hardly stationary, but often
changing on a fast time scale. For example, even when viewing a static image, humans
perform saccades, rapid changes of the direction of gaze. The image projected onto
the retinal photoreceptors changes therefore every few hundred milliseconds. Despite
its shortcomings, the concept of a spike-count rate code is widely used not only in
experiments, but also in models of neural networks. It has led to the idea that a
neuron transforms information about a single input variable (the stimulus strength)
into a single continuous output variable (the firing rate).
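The definition above amounts to one division. A small Python sketch follows, in which the spike times and the function name are hypothetical:

```python
def spike_count_rate(spike_times_ms, window_ms):
    """Return the spike-count rate in Hz over a single trial of length window_ms."""
    n_spikes = sum(1 for t in spike_times_ms if 0.0 <= t < window_ms)
    return n_spikes * 1000.0 / window_ms


# Hypothetical spike times in ms; the last one falls outside the 100 ms window.
spikes = [12.0, 40.5, 77.0, 93.2, 130.0]
rate_hz = spike_count_rate(spikes, window_ms=100.0)
```

The result is a single number per trial, so any variation of the response inside the window is lost, as noted above.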
Figure 2.4: Representation of the Time-to-First Spike coding scheme
2.3.5 Population coding
Population coding is a method to represent stimuli by using the joint activities of a
number of neurons. In population coding, each neuron has a distribution of responses
over some set of inputs, and the responses of many neurons may be combined to
determine some value about the inputs. From the theoretical point of view, population
coding is one of a few mathematically well-formulated problems in neuroscience. It
captures the essential features of neural coding and yet is simple enough for
theoretical analysis [26]. Experimental studies have revealed that this coding
paradigm is widely used in the sensory and motor areas of the brain. For example, in
the middle temporal (MT) visual area, neurons are tuned to the direction of motion
[27]. In response to an object moving in a particular direction, many neurons in MT
fire, with a noise-corrupted, bell-shaped activity pattern across the population. The
direction of motion of the object is retrieved from the population activity, so that
it is immune to the fluctuations present in a single neuron’s signal. Population
coding has a number of advantages, including reduction of uncertainty due to neuronal
variability and the ability to represent a number of different stimulus attributes
simultaneously. Population coding is also much faster than rate coding and can
reflect changes in the stimulus conditions nearly
instantaneously [28]. Individual neurons in such a population typically have different
but overlapping selectivities, so that many neurons, but not necessarily all, respond to
a given stimulus.
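One possible read-out of such a population can be sketched as follows. The sketch assumes a population-vector decoder with cosine tuning curves and eight evenly spaced preferred directions; these choices, the noise-free responses and all names are illustrative assumptions and are not taken from [26] or [27].

```python
import math


def tuning_response(stimulus_deg, preferred_deg, baseline=1.0, gain=1.0):
    """Assumed cosine tuning curve: broad, overlapping selectivity around preferred_deg."""
    return baseline + gain * math.cos(math.radians(stimulus_deg - preferred_deg))


def decode_direction(responses, preferred_degs):
    """Population-vector estimate: sum of preferred directions weighted by response."""
    x = sum(r * math.cos(math.radians(p)) for r, p in zip(responses, preferred_degs))
    y = sum(r * math.sin(math.radians(p)) for r, p in zip(responses, preferred_degs))
    return math.degrees(math.atan2(y, x)) % 360.0


preferred = [45.0 * k for k in range(8)]                    # 0, 45, ..., 315 degrees
responses = [tuning_response(70.0, p) for p in preferred]   # every neuron responds a bit
estimate = decode_direction(responses, preferred)
```

No single neuron is tuned to 70 degrees, yet the combined activity recovers that direction.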
2.3.6 Phase-of-firing code
Phase-of-firing code is a neural coding scheme that combines the spike count code
with a time reference based on slow oscillations. It has been shown that neurons in
some cortical sensory areas encode rich naturalistic stimuli in terms of their spike times
relative to the phase of ongoing network fluctuations, rather than only in terms of their
spike count [29]. The oscillations reflect local field potential signals. This code
is often categorized as a temporal code, although the time label used for spikes is
coarse-grained. That is, four discrete values of the phase are enough to represent
all the information content in this kind of code with respect to the phase of
low-frequency oscillations. The phase-of-firing code is loosely based on the phase
precession phenomenon observed in place cells of the hippocampus.
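The coarse phase labelling can be sketched as follows. The 125 ms period (an 8 Hz oscillation), the spike times and the function names are hypothetical; the only element taken from the text is the use of four discrete phase values.

```python
def phase_bin(spike_time_ms, period_ms, n_bins=4):
    """Label a spike with the quarter of the oscillation cycle in which it occurs."""
    fraction = (spike_time_ms % period_ms) / period_ms   # position in the cycle, [0, 1)
    return min(int(fraction * n_bins), n_bins - 1)


def phase_of_firing_code(spike_times_ms, period_ms, n_bins=4):
    """Spike count split by phase label: one counter per phase bin."""
    counts = [0] * n_bins
    for t in spike_times_ms:
        counts[phase_bin(t, period_ms, n_bins)] += 1
    return counts


spikes = [15.0, 120.0, 200.0, 265.0]   # hypothetical spike times in ms
code = phase_of_firing_code(spikes, period_ms=125.0)
```

The sum of the counters is the plain spike count, while their distribution over the four bins carries the additional phase information.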
2.3.7 Intensity Variation
In this coding scheme a neuron (pixel) generates a spike if its intensity has changed
by a certain amount since the previously generated spike. This type of information
coding computes the temporal derivative of the signal. It is used, for example, in
Dynamic Vision Sensors (DVS), where a pixel produces an address event every time it
detects a temporal contrast above a pre-tuned threshold [15].
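A single pixel following this scheme can be sketched as shown below. The sampled input, the integer intensity units, the threshold value and the use of a signed polarity for each event are assumptions made for illustration; the sketch does not model the circuit of the sensor reported in [15].

```python
def intensity_change_events(samples, threshold):
    """Emit an event each time the intensity moves by `threshold` since the last event.

    samples: list of (time, intensity) pairs for one pixel, in time order.
    Returns a list of (time, polarity) pairs with polarity +1 or -1 (assumed convention).
    """
    events = []
    reference = samples[0][1]          # intensity memorized at the last event
    for t, value in samples[1:]:
        while value - reference >= threshold:
            reference += threshold
            events.append((t, +1))
        while reference - value >= threshold:
            reference -= threshold
            events.append((t, -1))
    return events


# Hypothetical pixel intensity sampled at six instants (arbitrary integer units).
samples = [(0, 100), (1, 104), (2, 112), (3, 131), (4, 120), (5, 99)]
events = intensity_change_events(samples, threshold=10)
```

A constant input produces no events at all, and a fast change produces several events in a row, which is how the event stream approximates the temporal derivative of the signal.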
2.4 AER Protocol for Event-Based Systems
As shown above, latency and throughput in event-based systems are not limited by a
sampling rate. However, in real-time hardware implementations, hardware engineers face
a very strong barrier when trying to mimic the bio-inspired hierarchically layered
structure: the massive connectivity. In present-day state-of-the-art very large-scale
integrated (VLSI) circuit technologies it is plausible to fabricate many thousands
(even millions) of artificial neurons or simple processing cells on a single chip.
However, it is not viable to physically connect each of them to even a few hundred
other neurons.
Figure 2.5: Concept of point-to-point interchip AER communication.
The problem is greater for multi-chip multi-layer hierarchically structured bio-inspired
systems.
Address-event representation (AER) is a promising emergent hardware technology that
shows potential for meeting the computing requirements of large frameless
projection-field-based multilayer systems, because it provides a hardware solution to
the interchip massive connectivity problem. AER was first proposed in 1991 in one of
the California Institute of Technology (Caltech) research labs [30], and has been used
since then by a wide community of neuromorphic hardware engineers.
Fig. 2.5 illustrates event communication in a point-to-point rate-coded AER link
[31], where pixel intensity is coded directly as pixel event frequency. The continuous-
time states of pixels in an emitter chip are transformed into sequences of fast digital
pulses (spikes or events) of minimal width (in the order of nanoseconds) but with
much longer inter-spike intervals (typically in the order of milliseconds). Each time a
pixel generates a spike, its address is written on the interchip digital bus, after proper
arbitration [30]. This is called an address event. The receiver chip reads and decodes
the addresses of the incoming events and sends spikes to the corresponding receiving
pixels for reconstruction or further processing. The point-to-point communication in
Fig. 2.5 can be extended to a multireceiver scheme [31]. Also, multiple emitters can
merge their outputs into a smaller set of receiver chips [36]. Moreover, AER visual
information can easily be translated or rotated by remapping the addresses during
interchip transmission [37][38]. Complex processing such as convolutions can also be
implemented [17][39][36].
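The address handling described above can be sketched in Python. The bit layout of the address word (row in the upper bits, column in the lower five bits), the 32 × 32 array size and the function names are assumptions chosen for illustration; actual AER chips define their own address formats.

```python
def encode_address(x, y, x_bits=5):
    """Emitter side: pack the pixel coordinates into one address word."""
    return (y << x_bits) | x


def decode_address(address, x_bits=5):
    """Receiver side: recover the (x, y) coordinates of the pixel that fired."""
    return address & ((1 << x_bits) - 1), address >> x_bits


def translate(address, dx, dy, size=32, x_bits=5):
    """Mapper between chips: shift the event, dropping it if it leaves the array."""
    x, y = decode_address(address, x_bits)
    nx, ny = x + dx, y + dy
    if 0 <= nx < size and 0 <= ny < size:
        return encode_address(nx, ny, x_bits)
    return None


event = encode_address(3, 2)            # pixel (3, 2) fires
shifted = translate(event, dx=1, dy=1)  # the same event, seen at pixel (4, 3)
```

Because the image exists only as a stream of addresses, a translation is just an arithmetic operation applied to each event while it travels between chips.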
AER has been used mainly in image sensors, for simple light intensity to
frequency transformations [32], time-to-first-spike coding [33][34], foveated sensors [40],
contrast [41][16], more elaborate transient detectors [15], and motion sensing and com-
putation systems [42]. But AER has also been used for auditory systems [43][44],
competition and winner-takes-all networks [45][46], and even for systems distributed
over wireless networks [47]. However, the high potential of AER has become even
more apparent since the availability of AER convolution chips [17][39]. These chips,
which can perform large arbitrary kernel convolutions (32 × 32 in [17]) at speeds of about
3 × 10^9 connections/s/chip, can be used as building blocks for larger cortical-like
multi-layer hierarchical structures, because of the modular and scalable nature of
AER-based systems.
There is a growing community of AER protocol users for bio-inspired applications in
vision and audition systems. The goal of this community is to build large multi-chip and
multi-layer hierarchically structured systems capable of performing complicated array
data processing in real time. Currently, only a small number of AER-based chips have
been used simultaneously [36]. The largest AER system reported so far is the CAVIAR
system [36], which uses four custom made AER chips (motion retina, convolution chip,
winner-take-all chip, and learning chip) plus a set of FPGA based AER interfacing and
mapping modules. The CAVIAR system includes 45k neurons, emulates up to 5 million
synapses, performs an equivalent of 9 giga-connects-per-second, and can sense, identify
and track objects with a 3 ms delay. This system has only 4 convolution modules, but
it is expected that hundreds of such modular AER convolution units could be
integrated in a compact volume, such as a miniature printed circuit board (PCB), or
into chips of the type known as networks-on-chip (NoC) [49]. This would eventually
allow the assembly of large cortical-like convolutional neural networks and
event-based frameless vision processing systems operating at very high speeds. The
success of such systems will strongly depend on the availability of robust and
efficient AER development and debugging tools, as well as on theoretical know-how
about assembling and programming multi-layer multi-chip AER systems for specific
applications. The objective of the present thesis is to provide a simulation tool for
complex AER systems, and to study some particular application examples.
Chapter 3
IMPLEMENTATION OF AN
AER SIMULATION TOOL
3.1 Requirement of a simulation tool
With the growing popularity and real-time capability of AER systems, it has become
highly desirable to have a powerful tool to simulate efficiently the behaviour and
operation of such systems prior to their physical development in hardware. In the
present thesis we have developed an event-driven AER simulator. There are two
versions, one in Matlab and one in C++. The simulator allows us to describe
behaviorally any AER module (including timing and non-ideal characteristics), and to
assemble large systems composed of many different modules, thus building complex
event-driven processing systems.
The performance characteristics of the simulated AER modules (convolution chips,
mergers, splitters, and mappers) are obtained from already manufactured, tested and
reported AER modules. Many of the AER modules modeled are only available as
experimental prototype chips, so it is not possible to physically assemble large AER
systems. However, by modeling the performance characteristics of the available AER
hardware modules (chips) with the AER behavioral simulator, we can obtain a very good
estimate of the overall system performance. Furthermore, the AER behavioral simulator
can be used to propose and test new AER processing modules to be used in larger
systems, and thus guide hardware developers on what kind of AER hardware modules may
be useful and what performance characteristics they should possess.
At present, no AER simulation tool exists. So far, developers and researchers of
AER-based systems have been proposing new systems mainly based on hardware
implementations. Currently, AER modules implement simple processing tasks such as
edge detection and other simple kinds of filtering [36][50][51][45].
In a software simulator there are two main strategies for simulating the behaviour of
a system: synchronous or clock-driven algorithms, and asynchronous or event-driven
algorithms.
3.1.1 Synchronous or Clock-driven Algorithms
In a synchronous or clock-driven algorithm the state variables of all neurons (and
possibly synapses) are updated at every tick of a clock: X(t) → X(t + dt). Then, after
updating all variables, the threshold condition is checked for every neuron. Each
neuron that satisfies this condition produces a spike which is transmitted to its
target neurons, updating the corresponding variables.
The obvious drawback of clock-driven algorithms is that spike timings are aligned to a
grid (ticks of the clock), thus the simulation is approximate even when the differential
equations are computed exactly. Other specific errors come from the fact that threshold
conditions are checked only at the ticks of the clock, implying that some spikes might
be missed.
For realistic large-scale networks, the algorithmic complexity and, hence, the
computational load scale linearly with the number of neurons, but also linearly with
the temporal resolution. Increasing the temporal resolution (that is, reducing dt)
leads to a marked increase in the time needed to simulate neural activity in a
corresponding time window and, as stated above, it determines the accuracy of the
numerical simulation (note that dt introduces an artificial cutoff for the time
scales captured by the simulation).
The main argued advantage of clock-driven algorithms is that they can be coded and
applied to any neuron model. However, when the timing of the spikes is important (as
occurs, for instance, in STDP learning algorithms), the choice of a proper dt value
is crucial, and a poor choice can lead to severe misbehaviour.
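The alignment of spike times to the clock grid can be seen in a very small Python sketch. It assumes a single non-leaky integrate-and-fire neuron driven by a constant input, for which the exact threshold crossing time is known in closed form; the neuron model, the parameter values and the function name are illustrative assumptions.

```python
def clock_driven_first_spike(current, threshold, dt, t_max):
    """Clock-driven update X(t) -> X(t + dt) of a non-leaky i&f neuron.

    Returns the time of the first spike, which is always a multiple of dt.
    """
    v = 0.0
    n_steps = int(round(t_max / dt))
    for k in range(1, n_steps + 1):
        v += current * dt          # state update at every tick
        if v >= threshold:         # threshold is checked only at the ticks
            return k * dt
    return None


exact_time = 1.0 / 0.3             # true crossing time: threshold / current
coarse = clock_driven_first_spike(0.3, 1.0, dt=1.0, t_max=10.0)
fine = clock_driven_first_spike(0.3, 1.0, dt=0.5, t_max=10.0)
```

Halving dt moves the reported spike time closer to the exact value, at the price of twice as many update steps.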
3.1.2 Asynchronous or Event-driven Algorithms
The growing experimental evidence that spike timing may be important to explain
neural computations has motivated the use of event-based simulation techniques, rather
than the traditional clock-driven models.
In event-driven algorithms, the simulation advances from one event to the next event.
In contrast with standard clock-driven simulations, state variables need to be updated
only at the time of every incoming spike, rather than at every tick of the clock, in
order to simulate the network.
The key advantages in event-driven algorithms are:
1. A potential gain in speed due to the avoidance of update steps in neurons where
no events have arrived. Systems with a small number of events require very short
processing times to complete the simulations.
2. Spike timings are computed exactly.
3. The event-driven approaches are free from the dependence on the temporal res-
olution by using the exact times of events. This gain in accuracy comes at the
cost that, now, the computational load scales with the number of events in the
network, which rises linearly with the number of neurons in realistic large-scale
neuronal networks [52].
The main drawback argued against event-driven simulators is that not all neuron
models can be implemented (for example, the Hodgkin and Huxley model [7]). So far,
the simple i&f (integrate and fire) model has been preferred with this kind of
algorithm. However, there are efforts to design suitable algorithms for complex
models (for example the two-variable i&f models of Izhikevich [6] and Brette and
Gerstner [53]), or to develop more realistic models that are suitable for
event-driven simulation. Besides, as will be shown in the next sections, some tricks
can be implemented to emulate clock-driven elements, such as dummy connections
operating as clocks to activate the computation of some differential equations.
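The event-driven strategy can be sketched with a time-ordered queue. The sketch below is not the AERST algorithm: it assumes a simple chain of non-leaky i&f neurons with a fixed weight, threshold and transmission delay, and all names and values are hypothetical.

```python
import heapq


def run_event_driven(input_times, n_neurons, weight, threshold, delay):
    """Simulate a chain of non-leaky i&f neurons by jumping from event to event.

    input_times: arrival times of the external events at neuron 0.
    Returns the output spikes as (time, neuron) pairs in time order.
    """
    queue = [(t, 0) for t in input_times]     # (arrival time, target neuron)
    heapq.heapify(queue)
    state = [0.0] * n_neurons
    spikes = []
    while queue:
        t, target = heapq.heappop(queue)      # next event, whatever its time
        state[target] += weight               # only the target neuron is updated
        if state[target] >= threshold:
            state[target] = 0.0               # reset after firing
            spikes.append((t, target))
            if target + 1 < n_neurons:
                heapq.heappush(queue, (t + delay, target + 1))
    return spikes


spikes = run_event_driven([1.0, 2.5, 4.0, 7.25], n_neurons=2,
                          weight=1.0, threshold=2.0, delay=0.5)
```

The loop runs once per event, so its cost depends on the number of events and not on any time step, and the spike times are exact sums of input times and delays.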
Comprehensive comparisons between clock-driven and event-driven simulators can
be found in [54][52] together with the description of some well-known already imple-
mented simulators (NEURON, GENESIS, NEST, Mvaspike, etc.).
In the present thesis, an event-driven simulator tool called AERST (Address Event
Representation Simulator Tool) has been implemented. As it is event-driven, it is
aimed at simulating systems where event times can be computed efficiently. Among all
programming languages, we initially chose MATLAB because of the following
properties:
1. Power in matrix manipulation. MATLAB provides many convenient ways of creating
(and operating with) vectors, matrices, and multi-dimensional arrays. This property
is crucial when simulating AER systems, as neurons are usually arranged in arrays.
2. Modularity. It is very easy to design new functions in MATLAB implementing the
same functionalities as the physical devices. Besides, it is easy to update these
functions, adding new complexities or modifying their parameters.
3. Powerful interface with programs written in other languages, including C, C++,
Java, ActiveX, .NET, and Fortran. MATLAB can call functions and subroutines written
in the C programming language or Fortran. A wrapper function is created allowing
MATLAB data types to be passed and returned. The dynamically loadable object files
created by compiling such functions are termed ‘MEX-files’ (for MATLAB executable).
Libraries written in Java, ActiveX or .NET can be directly called from MATLAB, and
many MATLAB libraries (for example XML or SQL support) are implemented as wrappers
around Java or ActiveX libraries.
4. Easy to translate. Although MATLAB is not a very fast processing language, many
translation tools have recently been added, so that it is easy to translate MATLAB
applications to other languages, such as Simulink, Java, C++ or even hardware
description languages such as VHDL.
These four powerful properties, together with the growing popularity of MATLAB, make
this development environment a very attractive and appropriate tool to simulate
AER-based systems.
Although it achieves an acceptable processing speed for small systems, the Matlab
implementation turned out not to be fast enough for simulating large and complex
systems with a huge total number of events (more than 1 Mevents). This
fact motivated us to implement the tool in C++ as well, while trying to keep the
format of all the events and modules used in the Matlab implementation.
In the following sections we describe the algorithm implemented, the library of
modules developed and how to analyze the information obtained after a simulation.
3.2 Description of the AER Simulation Tool
In this simulator a generic AER system is described by a netlist that uses only two
types of elements: instances and channels. An instance is a block that generates
and/or processes AER streams. For example, a retina chip would be a source that
provides an input AER stream to the AER system. A convolution chip [36] would be
an AER processing instance with an input AER stream and an output AER stream.
A splitter [55] would be an instance which replicates the events from one input AER
stream onto several output AER streams. Similarly, a merger [55] is another instance
which would receive as input several AER streams and merge them into a single output
AER stream. AER streams constitute the nodes of the netlist in an AER system, and
are called channels. The simulator imposes the restriction that a channel connects a
single AER output from an instance to a single AER input of another (or the same)
instance. This way, channels represent point-to-point connections. To split and/or
merge channels, splitter and/or merger instances must be included in the netlist.
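The roles of splitter and merger instances can be sketched in Python. The sketch assumes that a channel is a time-ordered list of (time, address) events and that both instances are ideal (no delay); the event format and the function names are assumptions and do not reproduce the Matlab or C++ modules of the tool.

```python
import heapq


def splitter(events, n_outputs=2):
    """Replicate one input AER stream onto several output channels."""
    return [list(events) for _ in range(n_outputs)]


def merger(*streams):
    """Merge several time-ordered AER streams into one time-ordered stream."""
    return list(heapq.merge(*streams))


# Hypothetical channels: lists of (time, address) events sorted by time.
channel_a = [(1.0, 5), (3.0, 7)]
channel_b = [(2.0, 1)]

copies = splitter(channel_a)               # two channels carrying the same events
merged = merger(channel_a, channel_b)      # one channel, events interleaved in time
```

Since each channel joins exactly one output to one input, fan-out and fan-in are expressed by placing such instances explicitly in the netlist.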
3.2.1 Configuration File
An AER system will be composed of several instances interconnected in some way and
a set of channels carrying the information between the instances. The system to be
simulated is described by a configuration file that includes a netlist with the instances
and channels. Fig 3.1 shows an example netlist and its ASCII file netlist description.
The netlist contains 7 instances and 8 channels. The netlist description is provided to
the simulator through a text file, which is shown at the bottom of Fig 3.1. Channel 1 is
a source channel. All its events are available a priori as an input file to the simulator.
There can be an arbitrary number of source channels in the system. Source channels
need a line in the netlist file, starting with the keyword sources, followed by the channel
numbers and the files containing their events.
Figure 3.1: Example AER system and its ASCII file netlist description
If we want to use more than one source in the system we only have to enumerate
the different sources using commas in the following way:
sources {1, 2, ..., N}{datasource1, datasource2, ..., datasourceN} (3.1)
The following lines describe each of the instances, one line per instance in the
network. Generally, a descriptive line for one instance will have the following format:
instance {ch_in1, ..., ch_inN}{ch_out1, ..., ch_outN}{file_params}{file_state} (3.2)
The first field in the line is the instance name, followed by its input channels, output
channels, the name of the structure containing its parameters, and the name of the
structure containing its state. Each instance is described by a MATLAB function whose
name is the name of the instance.
The simulator imposes no restriction on the format of the parameter and state
structures. This is left open to the user writing the code of the function of each instance.
The simulator only needs to know the names of the parameter and state files where
these structures are stored. If an instance does not need parameters or state, these
fields are left empty. For example, consider an instance called receiver that needs
neither output channels nor any kind of parameters or state. Then its description will
be something like:
receiver {channel N}{}{}{} (3.3)
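The line formats of eqs. 3.1 to 3.3 can be read with a few lines of code. The following Python sketch is only an illustration of the format, not the parser used by the tool; the function names and the file names in the usage lines are hypothetical.

```python
import re

def parse_netlist_line(line):
    """Split 'name {a, b}{c}{...}' into the name and its brace-delimited fields."""
    name = line.split("{", 1)[0].strip()
    fields = [
        [item.strip() for item in body.split(",") if item.strip()]
        for body in re.findall(r"\{([^}]*)\}", line)
    ]
    return name, fields

def parse_sources(line):
    """Parse a 'sources {channels}{files}' line into {channel: file}."""
    name, fields = parse_netlist_line(line)
    if name != "sources" or len(fields) != 2 or len(fields[0]) != len(fields[1]):
        raise ValueError(f"malformed sources line: {line!r}")
    channels, files = fields
    return {int(ch): f for ch, f in zip(channels, files)}

def parse_instance(line):
    """Parse an instance line with its four fields (inputs, outputs, params, state)."""
    name, fields = parse_netlist_line(line)
    if len(fields) != 4:
        raise ValueError(f"expected 4 brace fields, got {len(fields)}: {line!r}")
    ch_in, ch_out, params, state = fields
    return {
        "name": name,
        "inputs": [int(ch) for ch in ch_in],
        "outputs": [int(ch) for ch in ch_out],
        "params": params[0] if params else None,   # empty field -> no parameters
        "state": state[0] if state else None,      # empty field -> no state
    }

# Hypothetical file and structure names, used only for illustration.
print(parse_sources("sources {1, 2}{retina_a.txt, retina_b.txt}"))
print(parse_instance("splitter {1}{2, 4}{params_splitter}{}"))
print(parse_instance("receiver {8}{}{}{}"))
```

Empty braces map to empty lists or to None, which mirrors the convention that unused fields appear as empty.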
The module File in Fig 3.1 acts as a source in the system, providing the events
that travel through channel ‘1’ and enter the module splitter. The module splitter
receives the events on channel ‘1’ and copies them, without any kind of processing,
to its two output ports (corresponding to the channels labelled ‘2’ and ‘4’ in the
figure). These two channels feed the processing modules labelled HorizontalEdge
and ‘−90’, respectively. The events generated by this first HorizontalEdge instance
travel through channel ‘3’ to the upper input port of the module named merger. The
events arriving at the instance ‘−90’ get their (x, y) coordinates rotated -90 degrees
and travel through channel ‘5’ to a second HorizontalEdge instance. Channel ‘6’
connects this second HorizontalEdge instance with the processing element ‘+90’,
where events are rotated, this time by +90 degrees. Events leaving the instance ‘+90’
travel through channel ‘7’ to the second (bottom) input port of the merger module.
This module copies all its input events onto its single output port, labelled
‘8’. The information in all the channels is available to be used as a source in other
systems or to analyze the behaviour and delays of the simulated system.
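The topology just described, together with the point-to-point restriction on channels, can be checked mechanically. The Python sketch below rebuilds the connections from the prose above rather than from the actual netlist file of Fig 3.1; the names rot_minus90 and rot_plus90 and the file name events.txt are placeholders.

```python
# Connections reconstructed from the description of Fig 3.1.
# "rot_minus90"/"rot_plus90" stand for the modules labelled '-90' and '+90';
# "events.txt" is a hypothetical source file (the File module of the figure).
SOURCES = {1: "events.txt"}
INSTANCES = [
    ("splitter",       [1],    [2, 4]),
    ("HorizontalEdge", [2],    [3]),
    ("rot_minus90",    [4],    [5]),
    ("HorizontalEdge", [5],    [6]),
    ("rot_plus90",     [6],    [7]),
    ("merger",         [3, 7], [8]),
]

def check_point_to_point(sources, instances):
    """Verify that every channel has exactly one writer and at most one reader."""
    writers = {ch: "source" for ch in sources}
    readers = {}
    for idx, (name, ch_in, ch_out) in enumerate(instances):
        label = f"{name}#{idx}"
        for ch in ch_out:
            if ch in writers:
                raise ValueError(f"channel {ch} is written by {writers[ch]} and {label}")
            writers[ch] = label
        for ch in ch_in:
            if ch in readers:
                raise ValueError(f"channel {ch} is read by {readers[ch]} and {label}")
            readers[ch] = label
    undriven = sorted(set(readers) - set(writers))
    if undriven:
        raise ValueError(f"channels without a writer: {undriven}")
    return writers, readers

writers, readers = check_point_to_point(SOURCES, INSTANCES)
print(sorted(writers))   # the eight channels of the example
```

Channel 8 has a writer but no reader, which is consistent with its role as the output of the system.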
3.2.2 Event Description
Every event in the system carries information about the neuron that originated it and
the time at which the event was created, so that any receiver module can decode the
information and decide how to process it. The reception of an event by a neuron can
cause changes in the state variables and the generation of new output events. For
instance, in a convolution chip [56], the reception of one event implies that charge
packets will contribute to the state voltage integral of a group of receiving neurons in
the chip. If the state of any of these neurons reaches a certain threshold, the neurons
will fire new output events and reset themselves.
AER systems generally use asynchronous communication. Therefore, there will
be two corresponding request and acknowledge signals between the emitter and receiver
elements to handle the event communication. In our implementation each event
contains six fields. The first three correspond to timing information of the event, while the
last three correspond to data transmitted by the event:
[Tprereqst Treqst Tack x y sign] (3.4)
The data fields are irrelevant for the simulator, and only need to be interpreted
properly by the modules (instances) receiving and generating them. For the particular
cases we describe in this thesis we have always used the same three fields: ‘x’ and ‘y’
represent the coordinates or addresses of the pixel originating the event and ‘sign’ its
sign. The three timing fields are as follows: ‘Tprereqst’ represents the time at which
the event is created at the emitter instance, ‘Treqst’ represents the time at which the
event can start to be processed by the receiver instance, and ‘Tack’ represents the time
at which the event is finally acknowledged by the receiver instance. We distinguish
between a pre-Request time and an effective Request time. The first one is only depen-
dent on the emitter instance, while the second one requires that the receiver instance is
ready to read and process an event request. Thus, we can provide as source a full list
of events which are described only by their data fields and pre-Request times. Events
travelling through one channel in the system will be stored sequentially inside a matrix.
Each row in this matrix will correspond to the information of one single event as in
eq. B.1. In this way, for each channel in the system there will be a matrix storing
all events that travelled through it, with as many rows as events and with six columns
to provide the six fields. Once the events are processed by the simulator, their final
effective request and acknowledge times are established. As the simulator is handling
the events in the different channels, it keeps track of the ‘actual time’ Tact, and uses it
to generate the final Treqst and Tack times for each terminated event transfer.
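As an illustration of this event format, the following Python sketch stores a channel as a list of six-field rows. The constant and function names are hypothetical; the values 0 and -1 used to mark an unprocessed event follow the convention stated in Section 3.2.3.

```python
# Column layout of one event row, following eq. 3.4.
T_PREREQST, T_REQST, T_ACK, X, Y, SIGN = range(6)

def make_source_event(t_prereqst, x, y, sign):
    """Unprocessed event: only Tprereqst and the data fields are known."""
    return [t_prereqst, 0, -1, x, y, sign]

def is_unprocessed(event):
    """An event is pending while its acknowledge time has not been set."""
    return event[T_ACK] < 0

# One matrix (list of rows) per channel, ordered by Tprereqst.
channel_1 = [make_source_event(t, x, y, 1)
             for t, x, y in [(0, 3, 3), (10, 2, 3), (20, 3, 2)]]
print(channel_1[1])                    # [10, 0, -1, 2, 3, 1]
print(is_unprocessed(channel_1[1]))    # True
```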
3.2.3 Instance Description
Each module in the system receives events and can implement some kind of processing.
When a module receives a request signal from one of its input ports (meaning the
presence of a new event) it reads the event (in case the module is not busy) and
returns an acknowledge signal to the emitter. The module will process the event, and
according to the parameters defining the module and its previous state variables it
might produce new output events. Its internal state variables will be updated, as will
the internal actual time Tact for the block, which is the last time this block was
“visited” by the simulator. The new output events will be written to the output ports
with the corresponding pre-Request times for each of the generated events.
The operation of an instance must consider all these aspects and is described by an
independent MATLAB function whose name is identical to the instance name (module
name) in the netlist. A user can add and write new instances as desired. The only
restriction is to respect the calling format of the function. The calling format of a
function will be the following:
[new_event_in, events_out, new_state, new_time, port_out] =
module_f(event_in, pars, old_state, old_time, port_in) (3.5)
event_in corresponds to the present event information (as in eq. B.1) sent through
the channel. The event_in information passed to the function as input parameter
contains the x and y coordinates of the event being processed and its Tprereqst time.
The updated new_event_in returned by the function also contains the established Treqst
and Tack times. old_state and new_state represent the instance state before and after
processing the event. old_time and new_time are the global system times before and
after processing the event. events_out is a list of output events produced by the instance
at its different output channels. port_in is the port number through which the event
entered the module and port_out is a list of numbers identifying the output ports where
each of the output events created will be written. These new output events (which
are still unprocessed) are included by the simulator in their respective channel
matrices with Tprereqst set to the present actual time, and will be processed later
by their respective destination instances.
The basic operation of a general module in our tool is described next. Each time
an event has to be processed, the simulator provides the corresponding instance with
its parameters pars, the state variables old_state describing the previous state of the
instance, the instance actual time old_time and the input port through which the event
is coming in. According to the actual time and the parameters defining the internal
delays and processing times, the instance updates the values Treqst and Tack of
the incoming event. Then the event is processed by the instance using the functional
parameters pars and its internal behaviour. After this, the internal state is changed and
some new events might be created. All these new events will have their Tprereqst value
set to the present creation time, and their Treqst and Tack values initialized to ‘0’ and
‘-1’ respectively (meaning such events are waiting to be processed). For each new event,
the corresponding output port is specified. After the execution, the actual time of the
instance is also updated and the main program writes the created events to the
corresponding channel list.
How the tool is run and how the netlist, parameters and sources are
initialized is described in detail in Appendix 1.
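To make the calling format of eq. 3.5 concrete, here is a minimal Python sketch of an instance that simply forwards each incoming event to its first output port. It is an illustration written for this text, not one of the MATLAB modules of the tool; the function name passthrough, the count entry in the state and the dictionary keys timedelay and timetoprocess are assumptions (the last two borrow the parameter names used by the modules of Section 3.3).

```python
def passthrough(event_in, pars, old_state, old_time, port_in):
    """Hypothetical instance following the calling format of eq. 3.5."""
    t_prereqst, _, _, x, y, sign = event_in

    # Establish the request and acknowledge times of the incoming event.
    t_reqst = old_time
    t_ack = old_time + pars["timedelay"]
    new_event_in = [t_prereqst, t_reqst, t_ack, x, y, sign]

    # Advance the actual time by the handshake delay plus the processing time.
    new_time = old_time + pars["timedelay"] + pars["timetoprocess"]

    # Update the state (here just a counter of processed events).
    new_state = dict(old_state, count=old_state.get("count", 0) + 1)

    # New events are unprocessed: Treqst = 0 and Tack = -1.
    events_out = [[new_time, 0, -1, x, y, sign]]
    port_out = [1]
    return new_event_in, events_out, new_state, new_time, port_out

result = passthrough([10, 0, -1, 3, 4, 1],
                     {"timedelay": 2, "timetoprocess": 5}, {}, 10, 1)
print(result)
```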
3.2.4 Description of Program Flow
The execution of the simulator is described in Fig 3.2. Initially the netlist file is read,
together with the parameter and state files of all instances. Each instance is initialized
with its initial state. Then, the program enters a loop that repeats the
following steps:
1. All channels are examined. The simulator selects the channel with the earliest
unprocessed event. An event is unprocessed when only its pre-Request time Tprereqst
has been established, but not its final request time Treqst nor its acknowledge time
Tack. The system is able to handle the situation in which several channels have
unprocessed events with the same Tprereqst time. This is transparent to the
user. Note that events in different channels can occur simultaneously, since
channels are physically independent.
2. Once a channel is selected for processing, its earliest unprocessed event
is provided as input to the destination instance. By default, this event is
provided with updated values of its Treqst and Tack times (to cover the case
in which an instance built by a user does not specify this operation). The state
(old_state) and parameter (pars) variables, which were created and stored by the
user in Matlab files before running the simulation, are loaded and provided to
the destination instance.
3. The instance is called and it updates and corrects the event time information
(Treqst and Tack) in case some instance-specific delay times are considered.
The instance updates its internal state according to the information carried by
the event. In case this event triggers new output events, a list of new unprocessed
events is provided as output. This list provides the event information (such as
address and sign) and the Tprereqst time only.
4. The simulator updates all channels with the new events, and stores the new state
of the processing instance.
5. If there are more unprocessed events in the channels, the simulator goes back to
step 1; otherwise the simulation finishes.
Figure 3.2: Basic Algorithm implemented by the AER tool
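The five steps can be condensed into a short loop. The Python sketch below is a simplified illustration, not the code of the tool: the dictionaries channels, readers and instances are assumed data structures, and taking the actual time passed to the instance as the later of the instance time and the event Tprereqst is also an assumption.

```python
def simulate(channels, readers, instances):
    """channels:  {ch: [event rows ordered by Tprereqst]}
    readers:   {ch: (instance name, input port)} for channels with a destination
    instances: {name: {"func", "pars", "state", "time", "outputs": [ch, ...]}}"""
    while True:
        # Step 1: earliest unprocessed event (Tack < 0) over all read channels.
        best = None
        for ch, events in channels.items():
            if ch not in readers:
                continue
            for idx, ev in enumerate(events):
                if ev[2] < 0:
                    if best is None or ev[0] < best[0]:
                        best = (ev[0], ch, idx)
                    break  # rows are time ordered, later ones cannot be earlier
        if best is None:
            return channels            # Step 5: nothing left to process

        # Step 2: hand the event to its destination instance.
        _, ch, idx = best
        name, port_in = readers[ch]
        inst = instances[name]
        now = max(inst["time"], channels[ch][idx][0])   # assumed rule

        # Step 3: the instance fixes the event timing and may emit new events.
        new_ev, events_out, inst["state"], inst["time"], port_out = inst["func"](
            channels[ch][idx], inst["pars"], inst["state"], now, port_in)
        channels[ch][idx] = new_ev

        # Step 4: append the new events to their channels, keeping time order.
        for ev, port in zip(events_out, port_out):
            out_ch = inst["outputs"][port - 1]
            channels[out_ch].append(ev)
            channels[out_ch].sort(key=lambda e: e[0])
```

Events written to channels without a reader stay unprocessed, so such channels act as the outputs of the simulated system. Because the earliest event is found by scanning, this version illustrates the algorithm only, without the optimizations of Section 3.2.5.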
3.2.5 Algorithm Optimizations for Efficient Computation Speed
One important aim during the development of the tool was to achieve a minimum
simulation time with efficient memory management. This aim led us to design our
algorithms carefully, optimizing the selection of the next event to be processed,
the way in which unprocessed events are allocated, the way in which the already
processed events are saved, etc.
In the first designs, the search for the next event to be processed at each step of
the simulation was implemented by looking for the event with the lowest
Tprereqst among all the events in every channel. Searching among all the events in
all the channels was not feasible, as the number of events in every channel grew
fast at each step of the simulation, so this exhaustive search resulted in very slow
simulations. Furthermore, the growth of the number of events in
some channels could be exponential. Another limiting factor was that, for a particular
simulation, a growing number of processed events was stored in memory. The channel
matrices therefore kept growing, making the simulation progressively slower until
simulation times became unsustainable.
The first optimization implemented was the use of a temporary matrix composed of
cells to save all the events already processed. After a certain number of events had
been processed, all the channels were examined and their processed events were taken
out and stored in the temporary cell matrix. Operating like this avoids searching
for the earliest event among all the processed and unprocessed events, since only
the still unprocessed ones are considered. This solution does not solve the growth of
the temporary matrix, which was still a problem: it can grow enormously, and accessing
it to save new processed events can be very expensive in terms of computation time.
The second optimization implemented is the use of an auxiliary matrix of indexes.
The first row in this matrix stores a pointer to the next event to be processed in
every channel. The second row stores the Tprereqst values of these events. Finally, the
third row stores a pointer to the last event. Each time an event belonging to
one channel is processed, the pointer corresponding to the channel is incremented by
one and the new Tprereqst value is loaded into the matrix. This way of operating
imposes the restriction that all the events in a channel must be in increasing order
of their Tprereqst values. To guarantee this, a sorting function is used to order
new events in case they are not in time order with respect to the existing ones.
The main advantage of working with indexes is that it avoids searching for the
earliest unprocessed event through all events in a channel. With the matrix of indexes
the algorithm only has to compare the Tprereqst times stored in the matrix to select the
channel and event to process. With this optimization we can simulate large systems
composed of a considerable number of modules, processing events at speeds of around
500 eps (events per second).
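The index matrix can be pictured with the following Python sketch, which keeps one entry per channel holding the three rows described above. The dictionary layout and the function names are assumptions made for illustration; an exhausted channel is marked with an infinite Tprereqst.

```python
import math

def build_index(channels):
    """Per channel: pointer to the next event, its Tprereqst, pointer to the last event."""
    return {
        ch: {
            "next": 0,
            "t_next": events[0][0] if events else math.inf,
            "last": len(events) - 1,
        }
        for ch, events in channels.items()
    }

def select_channel(index):
    """Pick the channel whose next event is earliest; None if all are exhausted."""
    if not index:
        return None
    ch = min(index, key=lambda c: index[c]["t_next"])
    return None if math.isinf(index[ch]["t_next"]) else ch

def advance(index, channels, ch):
    """Move the pointer of channel ch by one event and reload its Tprereqst."""
    entry = index[ch]
    entry["next"] += 1
    if entry["next"] <= entry["last"]:
        entry["t_next"] = channels[ch][entry["next"]][0]
    else:
        entry["t_next"] = math.inf
```

Selecting the next event then costs one comparison per channel instead of one per stored event, which is the saving described above.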
Despite the use of these two optimizations, the problem of the growth of the
temporary matrix storing the processed events was still unsolved. The simulation speed
still decreased as the temporary matrix grew, which made it almost
impossible to simulate systems with a large number of events. To solve this, we decided
not to keep the processed events in a matrix in main memory, but in a text file outside
main memory that stays open until the end of the simulation. Thus,
after a certain time or number of events, all the already processed events are saved in
this file and cleared from main memory, freeing the memory resources whose use
made the simulations too slow.
Thanks to all these optimizations, the tool in its current form is able to simulate
large systems at constant speed, independently of the system size (number of channels
and instances) and the number of events. The average per-event speed depends only on
the complexity of the instances and their parameters (such as size). At the end of
a simulation a text file is available containing all the events that have been transferred
through all channels, with all the corresponding timing information. At this point
the user can use functions provided by the simulator to read the text file and recover
all the events in the simulated system, either to analyze the behaviour of the system
or to use some of the channels as sources in different systems.
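A minimal Python sketch of this third optimization is given below. The line format of the text file (channel number followed by the six event fields) and the function names are assumptions, since the actual file layout of the tool is not specified here.

```python
import io

def flush_processed(channels, out):
    """Write processed events (Tack >= 0) to `out` and drop them from memory."""
    flushed = 0
    for ch, events in channels.items():
        keep = []
        for ev in events:
            if ev[2] >= 0:
                out.write(" ".join(str(v) for v in [ch, *ev]) + "\n")
                flushed += 1
            else:
                keep.append(ev)
        events[:] = keep          # only unprocessed events stay in memory
    return flushed

def read_events(text):
    """Recover the per-channel event lists from the saved text."""
    channels = {}
    for line in text.splitlines():
        ch, *fields = line.split()
        channels.setdefault(int(ch), []).append([float(v) for v in fields])
    return channels

# Usage with an in-memory buffer standing in for the open text file.
buffer = io.StringIO()
live = {1: [[0, 0, 1, 2, 3, 1], [5, 0, -1, 2, 3, 1]]}
print(flush_processed(live, buffer), live)
print(read_events(buffer.getvalue()))
```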
To better appreciate the effect of the optimizations implemented in the system,
Fig 3.3 shows the simulation times with and without them. In particular, we
have simulated the system shown in Fig 3.1 with a total of 1.9 Mevents.
Note that the peaks in the simulation times in the second and third versions are due
to the processing time required by the simulator to store the already processed events.
Note how these times were longer when the tool used matrices of cells to store the
events.
Figure 3.3: Time Optimizations in the Simulation tool
Fig 3.3 plots the simulation times required under three different situations. The x axis
represents the number of blocks, where each block contains 100 events. The upper
curve (plotted in red) corresponds to the simulation time when neither pointers nor
storing elements were used to save the processed events. In the central curve (plotted
in blue) the tool made use of pointers (matrix of indexes) to find the next events to be
processed and an auxiliary matrix of cells to store the already processed events. Finally,
the bottom curve (plotted in green) makes use of pointers and also of an external file to
store the processed events. As can be appreciated, in the first and second cases the
simulation time depends on the size of the sources and follows a linear law in
the number of events already processed. In particular, for the system of Fig 3.1
the time in the first two implementations follows a law with a slope of
approximately $4 \times 10^{-6}$, so that the simulation time can be written as
$t_{final} = 4 \times 10^{-6} \cdot n_{events} + time_{100events}$ (3.6)
where $time_{100events}$ is the time to process the first 100 events and $n_{events}$ is
the number of events. In the first case $time_{100events}$ was approximately 3.9 s. In the
second case the time required to store the processed events in the auxiliary array (every
10 kevents) was proportional to the number of events already processed and
$time_{100events}$ was 0.8 s.
Finally, in the third case note that the simulation time required for each 100-event
block is constant (and less than 0.10 s) throughout the simulation for the particular
system simulated. This constant value implies a simulation speed higher than 1 keps (1000
events per second). It must be pointed out that for systems in general, this time depends
not only on the number of events but also on the complexity of the modules,
especially the size and number of variables used inside them.
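As a worked example of eq. 3.6, and reading $t_{final}$ as the time needed for one 100-event block once $n_{events}$ events have been processed (an interpretation of the text, and ignoring the storage peaks of the second case), the figures quoted above give the following values at the end of the 1.9 Mevents simulation:

```latex
\begin{align*}
t_{final}^{(1)} &= 4 \times 10^{-6} \cdot 1.9 \times 10^{6} + 3.9\ \mathrm{s}
                 = 7.6\ \mathrm{s} + 3.9\ \mathrm{s} = 11.5\ \mathrm{s} \\
t_{final}^{(2)} &= 4 \times 10^{-6} \cdot 1.9 \times 10^{6} + 0.8\ \mathrm{s}
                 = 7.6\ \mathrm{s} + 0.8\ \mathrm{s} = 8.4\ \mathrm{s} \\
\mathrm{speed}^{(3)} &> \frac{100\ \mathrm{events}}{0.10\ \mathrm{s}} = 1000\ \mathrm{eps}
\end{align*}
```

Under this reading, the last block of the run is roughly two orders of magnitude slower in the first two versions than in the third, where the per-block time stays below 0.10 s.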
3.2.6 C++ implementation
In spite of achieving a high processing speed, the Matlab implementation soon turned
out to be not fast enough when trying to simulate large and complex systems with a
huge number of total events (higher than 1 Mevents). This fact motivated us to implement
the tool in C++. Matlab includes utilities to translate Matlab code into C++.
However, when we tested the resulting C++ code, the speed improvement was not re-
ally significant (times were comparable). Thus, the tool was rewritten entirely in C++
trying to keep the format of all the events and modules in the Matlab implementation.
However, some changes were implemented to optimize the execution time and the use
of memory by the C++ application. The main differences (and optimizations) with
respect to the Matlab implementation were:
1. Dynamic memory management. The memory is managed dynamically so that
channels grow when new events are added and shrink when events are completed.
State variables and parameters are also created dynamically when the application
is started.
2. Use of lists. Events are not stored in matrices, as in the Matlab implementation,
but in lists, where each list corresponds to one channel. This way, each element
in the list stores the information corresponding to one event and also a pointer to
the following event to be processed in the channel list. The format of each event
is the same as in the Matlab implementation.
The main advantage of using lists and pointers is that memory is used more ef-
ficiently, as events are stored in different and small memory positions instead of
in matrices occupying sequential blocks of memory.
Figure 3.4: CAVIAR AER vision system
These two features allowed faster simulations and a more efficient use of memory. With
the C++ implementation the tool was able to process events at speeds higher than
30keps, more than 30 times the speed achieved in the Matlab implementation.
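The list-based organization of the channels can be pictured with a short sketch. The thesis implementation is written in C++ with explicit pointers; the Python class below is only an analogue that uses collections.deque as the per-channel list, and the class and method names are hypothetical.

```python
from collections import deque

class Channel:
    """Per-channel list of pending events: grows at the tail, shrinks at the head."""

    def __init__(self):
        self.pending = deque()

    def push(self, event):
        # New events are appended, so the channel grows dynamically.
        self.pending.append(event)

    def peek_time(self):
        # Tprereqst of the next event to process, or None if the channel is empty.
        return self.pending[0][0] if self.pending else None

    def pop_concluded(self):
        # A concluded event leaves the list, so the channel shrinks again.
        return self.pending.popleft()

channel = Channel()
channel.push([3, 0, -1, 1, 1, 1])
channel.push([8, 0, -1, 2, 2, 1])
print(channel.peek_time(), len(channel.pending))
```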
3.3 AER Modules
When describing modules, we have to distinguish between interconnection modules
and processing modules. In the complex systems developed by neuromorphic engineers
different interfaces are required to implement interconnections between them, and to
connect them to PCs for development, debugging, or other purposes. There are some
AER tools developed under the European CAVIAR project to facilitate these intercon-
nections [36]. To date, the biggest AER chain has been built in the CAVIAR project
(see Fig. 3.4). In this system, the front of the signal chain is composed of a 128x128
retina that spikes with temporal contrast changes, four convolution chips that can be
programmed with arbitrary kernels of up to 32x32 pixels, a Winner-takes-all chip and
a two-chip spike-learning stage comprised of a delay line and a learning chip.
In the next subsections, different hardware AER modules are briefly described together
with their corresponding implementation in the AER tool. Extra modules are
then proposed to allow complex processing in larger and more sophisticated systems.
3.3.1 AER switch Module
Figure 3.5: AER-Switch hardware interface
To connect many-to-one and one-to-many AER chips inside a system, CAVIAR provided
an interconnection module called AER-Switch [55] that can perform two different
operations:
3.3.1.1 AER Splitter
In this configuration, one AER input is replicated to up to four AER outputs.
3.3.1.2 AER Merger
Up to four inputs are joined to one output. It can add bits to identify the input channel
if necessary.
The AER switch is based on a Xilinx 9500 complex programmable logic device
(CPLD). It has five AER ports: one input, one output, and three bidirectional ports
(Fig. 3.5). It provides delays in the order of tens of nanoseconds.
Fig. 3.6 shows the representation of the AER-switch module implemented in the
tool, together with its internal parameters.
The workflow of this module is as follows:
1. Use current_time and the timedelay parameter (delay configured for the asynchronous
communication) to update the timing information (Treqst and Tack) of the incoming
event (event_in):
new_event_in(Treqst) ← current_time
new_event_in(Tack) ← current_time + timedelay
2. Update New_current_time using timetoprocess (time considered to process the
input event and generate the output ones):
New_current_time ← current_time + timedelay + timetoprocess
3. Create out (set of output events) with as many output events as output ports
(numb_ports), setting their Tprereqst values to current_time.
Figure 3.6: AER-switch acting as Splitter or Merger
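The three steps translate almost directly into code. The following Python sketch follows the workflow as written (including the use of current_time for the Tprereqst of the output events); the function name and the dictionary keys are assumptions, and the event row layout is the six-field format of eq. 3.4.

```python
def aer_switch(event_in, pars, old_state, current_time, port_in):
    """Sketch of the splitter/merger workflow; a merger is the case numb_ports = 1."""
    t_prereqst, _, _, x, y, sign = event_in

    # Step 1: request and acknowledge times of the incoming event.
    new_event_in = [t_prereqst, current_time,
                    current_time + pars["timedelay"], x, y, sign]

    # Step 2: actual time after the handshake and the processing of the event.
    new_current_time = current_time + pars["timedelay"] + pars["timetoprocess"]

    # Step 3: one unprocessed copy of the event per output port.
    numb_ports = pars["numb_ports"]
    events_out = [[current_time, 0, -1, x, y, sign] for _ in range(numb_ports)]
    port_out = list(range(1, numb_ports + 1))

    return new_event_in, events_out, old_state, new_current_time, port_out

pars = {"timedelay": 2, "timetoprocess": 3, "numb_ports": 2}
print(aer_switch([10, 0, -1, 4, 5, 1], pars, None, 10, 1))
```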
3.3.2 Subsampling Module
A subsampling module is important because it reduces the resolution of the input visual
flow. This module reduces the input event address space by a factor coeff so that the
address of each input event (xin, yin) is modified and turns to (xout, yout), which is
computed as:
$x_{out} = \lfloor x_{in} / coeff \rfloor$ (3.7)
$y_{out} = \lfloor y_{in} / coeff \rfloor$ (3.8)
Note that the floor operation makes the output coordinates integer numbers. This
module can be easily created from a merger module to which the parameter coeff is
added to reduce the input event address coordinates.
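Eqs. 3.7 and 3.8 amount to an integer division of each coordinate. A minimal Python sketch, assuming non-negative integer addresses and a hypothetical function name:

```python
def subsample_address(x_in, y_in, coeff):
    """Reduce the address space by `coeff` using floor division (eqs. 3.7 and 3.8)."""
    return x_in // coeff, y_in // coeff

# With coeff = 4, a 128x128 address space maps onto 32x32.
print(subsample_address(127, 127, 4))
print(subsample_address(5, 7, 2))
```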
3.3.3 Mapper Module
An AER mapper implements spatial transformations of the address space. It
communicates events between two AER chips by applying a transformation to the event
data during the transmission. Each event from the sender is used to address a LUT
(look-up table). The event sent to the receiver is the one stored in the LUT. Through the
mapper, one can transform the address space through a translation, rotation, shifting,
compression, etc., or filter the events.
Figure 3.7: Scanner and Rotator AER Modules
Most of the current AER mappers have the following functionalities:
1. Map each address event (AE) from an emitter module into a different address for
the receiver module, (1 to 1 mapper).
2. Map each event from an emitter to several address events for the receiver (1 to n
mapper).
3. Send a mapped event following a probabilistic model (stochastic mapper).
4. Repeat a mapped event several times in order to make the effect stronger in the
receiver module (repetition mapper).
5. Manipulate the time information of the events so that multiple copies of an event
can be transmitted with different delays (delay mapper).
An example of an AER mapper module that is able to apply not only a spatial
address transformation, but also a time transformation in the AER bus traffic can be
found in [57]. In order to implement the example systems presented in this thesis, two
types of mapper modules were developed: an AER scanner and an AER rotator.
Fig. 3.7 shows the representation of the two modules with their respective parameters.
3.3.3.1 AER Scanner
This module has one AER input port and one AER output port. The module has
an internal look up table which stores consecutive pixel addresses of an (x,y) array,
scanned row by row. For each incoming event, regardless of its address, the module
sends out events with consecutive addresses. This way, each incoming event simply
increments the pointer to the following position in the LUT in a circular way. The
implemented module uses the following parameters:
-size1, X dimension of the array address space,
-size2, Y dimension of the array address space,
-timedelay, delay of the asynchronous communication,
-timetoprocess, time required by the module to process each incoming event.
The module has one state vector called prev, storing the x and y coordinates
of the last event sent out by the module. It is initialized with the value
[0, 0].
The workflow of this module is as follows:
1. Use current_time, timedelay and timetoprocess parameters to update the timing
information (Treqst and Tack) of the incoming event (event_in):
new_event_in(Treqst) ← current_time
new_event_in(Tack) ← current_time + timedelay
2. Update New_current_time:
New_current_time ← current_time + timedelay + timetoprocess
3. Ignore the event coordinates (x, y) and use parameters size1, size2 and the state
vector prev to compute the coordinates (xo, yo) of the output event:
xo ← prev(1); yo ← prev(2)
yo ← yo + 1
if yo >= size2
    yo ← 0; xo ← xo + 1
    if xo >= size1
        xo ← 0
4. Create out with one event, setting its Tprereqst value to New_current_time
and (xo, yo) as its coordinates:
out = [New_current_time 0 -1 xo yo 1]
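The scanner workflow can be sketched in Python as follows. The function name and the dictionary keys are assumptions; storing the new output coordinates back into prev is implied by the definition of the state vector rather than spelled out in the steps above.

```python
def aer_scanner(event_in, pars, prev, current_time, port_in=1):
    """Sketch of the scanner: every input event emits the next address of the scan."""
    t_prereqst, _, _, x, y, sign = event_in

    # Steps 1 and 2: timing of the incoming event and new actual time.
    new_event_in = [t_prereqst, current_time,
                    current_time + pars["timedelay"], x, y, sign]
    new_current_time = current_time + pars["timedelay"] + pars["timetoprocess"]

    # Step 3: ignore (x, y) and advance the stored address in a circular way.
    xo, yo = prev
    yo += 1
    if yo >= pars["size2"]:
        yo = 0
        xo += 1
        if xo >= pars["size1"]:
            xo = 0

    # Step 4: one unprocessed output event with positive sign.
    out = [new_current_time, 0, -1, xo, yo, 1]
    new_prev = [xo, yo]          # assumed state update: last address sent out
    return new_event_in, [out], new_prev, new_current_time, [1]

pars = {"size1": 2, "size2": 2, "timedelay": 1, "timetoprocess": 1}
state, now = [0, 0], 0
for _ in range(4):
    _, outs, state, now, _ = aer_scanner([now, 0, -1, 9, 9, 1], pars, state, now)
    print(outs[0][3:5])
```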
3.3.3.2 AER Rotator
This module has one AER input port and one AER output port. For each incoming
event it generates an output event whose address is rotated by 0, 90, 180 or 270 degrees,
depending on the value specified in the direction parameter. The parameters in this module are:
-size1, X dimension of the input address space,
-size2, Y dimension of the input address space,
-direction, parameter (integer value from 0 to 3) to set the value of the rotation that
is going to be applied to the incoming event: 0, 90, 180, 270,
-timedelay, delay of the asynchronous communication,
-timetoprocess, time required by the module to process each incoming event.
This module does not have internal state. The operation of this module is as follows:
1. Use current_time, timedelay and timetoprocess parameters to update the timing
information (Treqst and Tack) of the incoming event (event_in):
new_event_in(Treqst) ← current_time
new_event_in(Tack) ← current_time + timedelay
2. Update New_current_time:
New_current_time ← current_time + timedelay + timetoprocess
3. Use the event_in information to get the event coordinates (x, y):
x ← event_in(3); y ← event_in(4)
4. Use parameters size1, size2 and direction to rotate the (x, y) coordinates:
switch direction
case 0: xnew ← x; ynew ← y;
case 90: ynew ← x; xnew ← size2 − y;
case 180: xnew ← size1 − x; ynew ← size2 − y;
case 270: xnew ← y; ynew ← size1 − x;
5. Create an output event out, setting its Tprereqst value to New_current_time and
(xnew, ynew) as its new coordinates:
out = [New_current_time 0 -1 xnew ynew 1]
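A Python sketch of the rotator is given below. It uses the expressions of step 4 for the 0, 90 and 180 degree cases; the 270 degree case is written as the inverse of the 90 degree case, which is an assumption made here. The function names and the six-field unpacking of the event are also assumptions, and direction is given in degrees as in the switch statement.

```python
def rotate_address(x, y, size1, size2, direction):
    """Rotate (x, y) by `direction` degrees, following the expressions of step 4."""
    if direction == 0:
        return x, y
    if direction == 90:
        return size2 - y, x
    if direction == 180:
        return size1 - x, size2 - y
    if direction == 270:
        return y, size1 - x      # assumed: inverse of the 90 degree case
    raise ValueError(f"unsupported direction: {direction}")

def aer_rotator(event_in, pars, old_state, current_time, port_in=1):
    t_prereqst, _, _, x, y, sign = event_in
    new_event_in = [t_prereqst, current_time,
                    current_time + pars["timedelay"], x, y, sign]
    new_current_time = current_time + pars["timedelay"] + pars["timetoprocess"]
    xnew, ynew = rotate_address(x, y, pars["size1"], pars["size2"], pars["direction"])
    out = [new_current_time, 0, -1, xnew, ynew, 1]
    return new_event_in, [out], old_state, new_current_time, [1]

# Rotating by 90 and then by 270 degrees recovers the original address.
x90, y90 = rotate_address(3, 5, 32, 32, 90)
print((x90, y90), rotate_address(x90, y90, 32, 32, 270))
```

For a square address space, two 90 degree rotations give the same address as one 180 degree rotation, which is a quick consistency check of the expressions.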
3.3.4 AER Convolution Chip
An important processing module in complex AER-based systems is the 2D convolution
module. This module is meant for frame-free visual information processing. To illus-
trate how event-driven convolution is performed consider the example in Fig. 3.8. Fig.
3.8(a) corresponds to a conventional frame-based convolution, where a 5x5 input static
image f(i, j) is convolved with a 3x3 kernel h(m,n), producing a 5x5 output image
g(i, j). Mathematically, this corresponds to the convolution operation:
$g(i, j) = \sum_{m} \sum_{n} f(m, n)\, h(i - m, j - n)$ (3.9)
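Eq. 3.9 can be evaluated directly with nested loops. In the Python sketch below the kernel is stored as a list of rows whose central element corresponds to h(0, 0), and contributions falling outside the kernel are taken as zero; both conventions, and the function name, are assumptions made for illustration.

```python
def convolve2d(f, h):
    """Direct evaluation of eq. 3.9 with a kernel centred on h(0, 0)."""
    rows, cols = len(f), len(f[0])
    kr, kc = len(h) // 2, len(h[0]) // 2
    g = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for m in range(rows):
                for n in range(cols):
                    di, dj = i - m, j - n
                    if -kr <= di <= kr and -kc <= dj <= kc:
                        g[i][j] += f[m][n] * h[di + kr][dj + kc]
    return g

# A single active pixel reproduces the kernel around its position.
f = [[0] * 5 for _ in range(5)]
f[2][2] = 1
h = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for row in convolve2d(f, h):
    print(row)
```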
In an AER system, shown in Fig. 3.8(b), a luminance retina sensing the same visual
stimulus would produce events for some pixels only (those sensing a non-zero light
intensity). In the figure, the pixel at coordinate (3,3) senses twice as much intensity as
pixels (2,3) and (3,2). Thus the retina will produce output events with address (3,3)
at twice the frequency of those for pixels (3,2) and (2,3). These events are sent to a
convolution module.
The convolution module has an internal pixel array, with a fixed threshold level
defined for all pixels, and a convolution mask (kernel). Every time an event is received
by the convolution chip, the kernel is added to the array of pixels (which operate as
adders and accumulators) around the pixel having the same coordinate as the event. This
is actually a projection-field operation. Whenever a pixel exceeds the fixed threshold
level, it generates an output event and is reset. As a consequence,
pixels also act as AER sender pixels, so that an AER convolution module operates
as an AER transceiver (receiver and emitter) module. Note that, in the example in
Fig. 3.8(b), after the four retina events have been received and processed, the result
accumulated in the array of pixels in Fig. 3.8(b) is equal to that in Fig. 3.8(a).
Figure 3.8: Comparison between (a) classical frame-based and (b) AER event-based convolution processing.
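The projection-field operation can be sketched as follows in Python. This is a simplified illustration, not the module of the tool: resetting a firing pixel to zero, handling only the positive threshold, and the function name are assumptions.

```python
def convolution_event(pixels, kernel, x, y, sign, threshold):
    """Add the kernel around (x, y); pixels above threshold fire and are reset."""
    rows, cols = len(pixels), len(pixels[0])
    kr, kc = len(kernel) // 2, len(kernel[0]) // 2
    events_out = []
    for di in range(-kr, kr + 1):
        for dj in range(-kc, kc + 1):
            i, j = x + di, y + dj
            if 0 <= i < rows and 0 <= j < cols:      # clip the projection field
                pixels[i][j] += sign * kernel[di + kr][dj + kc]
                if pixels[i][j] > threshold:
                    events_out.append((i, j, 1))     # positive output event
                    pixels[i][j] = 0                 # assumed reset value
    return events_out

kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
pixels = [[0] * 5 for _ in range(5)]
print(convolution_event(pixels, kernel, 2, 2, 1, threshold=8))
for row in pixels:
    print(row)
```

With a threshold high enough that no pixel fires, feeding one event per unit of input intensity leaves in the pixel array the same values as the frame-based convolution, which is the equivalence illustrated in Fig. 3.8.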
The convolution module implemented in this work emulates a fully digital convolution
chip with programmable arbitrary-shape kernels, such as the one published in [58]. The
convolution chip is shown in Fig. 3.9. It receives input AER events, which represent
visual information from a previous sensing or processing stage, and generates output
AER events, which represent the result of the convolution operation. The chip includes
a periodic forgetting mechanism which needs to be kept active during absence of input
events.
3.3.4.1 System Level Architecture of the Convolution Chip
The system level architecture of the chip [58] is illustrated in Fig. 3.9, where the
following blocks are shown:
1. Array of 32 x 32 digital pixels.
2. Static RAM that stores the kernel in two’s complement representation.
3. Synchronous controller, which performs the sequencing of all operations for each
input event and the global forgetting mechanism.
4. High-speed clock generator, used by the synchronous controller.
5. Configuration registers that store configuration parameters loaded serially at
startup.
6. A two’s complement block that changes the sign of the kernel data before being
added to the pixels, if the input event is negative.
7. Left/right column shifter, to properly align the stored kernel with the incoming
event coordinates.
8. AER-out, asynchronous circuitry for arbitrating and sending out the output
events.
The operation of the chip is as follows: when the synchronous controller detects a
falling edge on the input Rqst_in line, the event address (x, y) and sign at Address_in are
latched and the asynchronous handshaking is completed. Then the controller, using the
available kernel size information, computes the limits of the projection field with three
different possible results: 1) the projection field fits fully inside the array of pixels, 2)
it can be partially inside the array, or 3) it can be completely outside the array. If the
projection field is outside the array, the controller discards the event and waits for the
next one. However, in any of the other possible situations, the controller calculates the
left/right shift between the RAM columns holding the kernel and the projection field
columns in the pixel array. After this, it enables the addition, row after row, of the
kernel values onto the pixels. Hence, after receiving an input event, the pixels inside the
projection field change their state. If any of them reaches the programmed threshold
Figure 3.9: Convolution Chip 32x32
it resets itself and generates an output event that will be handled by the asynchronous
AER-out block and sent off chip with its corresponding handshaking signals. Parallel
to this per-event processing, there is a global forgetting mechanism common for all the
pixels.
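The three possible outcomes of the projection-field limit computation can be sketched in Python as follows. This is an illustrative sketch only: the function name, the argument names and the assumption that the pixel array spans coordinates 0 to 31 are ours, not taken from the chip description in [58].

```python
def projection_field_case(x, y, kernel_w, kernel_h, origin_x, origin_y, size=32):
    """Classify where the projection field of event (x, y) falls.

    origin_x and origin_y give the kernel element aligned with the event
    address; the pixel array is assumed to span 0..size-1 in both directions.
    """
    left, top = x - origin_x, y - origin_y
    right, bottom = left + kernel_w - 1, top + kernel_h - 1
    if right < 0 or bottom < 0 or left >= size or top >= size:
        return "outside"    # the controller discards the event
    if left >= 0 and top >= 0 and right < size and bottom < size:
        return "inside"     # the whole kernel is added
    return "partial"        # only the overlapping rows and columns are added


print(projection_field_case(16, 16, 5, 5, 2, 2))   # inside
print(projection_field_case(0, 0, 5, 5, 2, 2))     # partial
print(projection_field_case(40, 16, 5, 5, 2, 2))   # outside
```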
In the asynchronous AER-out block, events are arbitrated by rows (all request signals
of the same row are wired-OR). Once the row arbiter answers, all the events generated
in this row are latched on the top periphery, freeing the row arbiter. This way, the row
arbiter can acknowledge the request of another row while the events of the previous
row are sent out in a burst. The size of the array is 32x32 pixels, but the input address
space it can “see” is larger (128x128). This allows arrays of convolution chips to be built to
process larger pixel arrays, programming each one of them to see a part of the address
space by setting some configuration registers. The size of the RAM is 32x32 words of
6 bits in two’s complement representation. In general, since convolution kernels can
have positive or negative values, output events generated by a convolution chip can also
be either positive or negative. In a multilayer system convolution operations can be
cascaded, which implies that a generic convolution chip must be able to handle signed
input events, and produce signed output events. For this reason, the chip includes a
sign bit both for the input and output address events, and also for the values stored in
the kernel RAM (in two’s complement representation). The pixels are able to compute
signed addition and produce positive and negative events. When processing a negative
input event, the controller enables the two’s complement block to change the sign of the
kernel values before they are added to the pixels. In the chip the forgetting mechanism is also
handled by the synchronous controller. The aim of this mechanism is that the absolute
state values stored in the pixels are decremented at a programmable rate, so that
they can “forget” their previous state after some controlled time. This functionality is
implemented by a 20-bit counter in the controller which generates a periodic forgetting
pulse for all the pixels every time it reaches the programmed limit. Each forgetting
pulse will decrement by ‘1’ the state of all the pixels with positive state, and increment
by ‘1’ those with negative state. Consequently, the chip implements a (programmable)
constant-rate (or linear) forgetting mechanism.
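A minimal Python sketch of the constant-rate forgetting described above is given next. The function name and the list-of-lists representation of the pixel array are illustrative assumptions; the 20-bit counter that triggers the pulse is not modelled.

```python
def forgetting_pulse(states):
    """Apply one forgetting pulse: every non-zero state moves one unit towards zero."""
    return [[s - 1 if s > 0 else s + 1 if s < 0 else 0 for s in row]
            for row in states]


pixels = [[3, -2, 0],
          [1, -1, 5]]
after_one = forgetting_pulse(pixels)
print(after_one)   # [[2, -1, 0], [0, 0, 4]]
```

Because each pulse changes the magnitude of a state by a fixed amount, the state decays linearly with the number of pulses and stays at zero once it gets there.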
3.3.4.2 AERST Convolution Module
The convolution module implemented in AERST emulates the convolution chip described
above [58] and uses the following input parameters:
-size1, X dimension of the input address space,
-size2, Y dimension of the input address space,
-s, matrix containing the kernel values,
-cteloss, forgetting factor. Its value specifies the number of charge units discharged per
second in the pixels belonging to the pixel array,
-threshold, threshold value for the pixels,
-timedelay, delay of the asynchronous communication,
-zs, vector storing the origin of coordinates of the convolution kernel,
-timetoprocess, time required to process each event,
-offset, reset value for the firing pixels,
-trefract, time that the pixels that have fired have to wait to fire again,
-option, parameter that chooses the way the output events are created.
Besides the parameters described, the module also has the following state variables:
-J, matrix storing the state values of the pixels,
-time, matrix storing the previous time of modification for every pixel,
-time2, matrix storing the previous time at which each pixel fired an event,
-flags, matrix indicating which pixels have to wait for a trefract period.
The convolution module implemented in the tool, with its internal parameters and
state variables, is shown in Fig. 3.10.
The operation of this module is as follows:
1. Use the current_time, timedelay and timetoprocess parameters to update the timing
information (T_reqst and T_ack) of the incoming event (event_in):
event_in(T_reqst) ← current_time
event_in(T_ack) ← current_time + timedelay
2. Update the new current time:
new_current_time ← current_time + timedelay + timetoprocess
3. Use event_in to get the event coordinates (x, y) and sign information (sign):
x ← event_in(3); y ← event_in(4); sign ← event_in(5)
4. Use parameter vector zs (origin of coordinates of the projection field) and
parameters size1 and size2 (dimensions X and Y of J) to compute the part of the
projection field s that fits inside the array of pixels J, which is called kern_eff:
kern_eff ← part of projection field s overlapping with J
5. Use parameter vector zs, parameters size1 and size2 and the (x, y) coordinates of the
incoming event to compute the neurons in J that will be affected by kern_eff.
These neurons will be called neur_aff:
neur_aff ← neurons in J affected by kern_eff
Figure 3.10: AER convolution module implemented in the tool
6. Compute the time elapsed since the last change for the neurons to be affected,
using current_time and the state array time, and apply the forgetting loss cteloss.
Positive and negative states will discharge towards ‘0’:
abs(J(neur_aff)) ← abs(J(neur_aff)) − cteloss ∗ [current_time − time(neur_aff)]
time(neur_aff) ← current_time
7. For those neurons in J which fired more than trefract ago, set their
corresponding value in matrix flags to ‘0’. For this, use matrix time2 and
current_time:
positions ← find((current_time − time2) > trefract)
flags(positions) ← 0
8. Apply projection field kern_eff to neurons neur_aff considering the sign (sign)
of the incoming event:
J(neur_aff) ← J(neur_aff) + kern_eff ∗ sign
9. Due to the change in flags and J, some neurons may have reached the parameter
threshold value and will fire events. These neurons will be called neur_firing.
Locate these neurons, set their corresponding values in flags to ‘1’ and update
their firing time in time2. Finally, reset the firing positions in J with parameter
offset:
neur_firing ← find((abs(J) >= threshold) & (flags == 0))
flags(neur_firing) ← 1
time2(neur_firing) ← current_time
J(neur_firing) ← offset
10. Use parameter option to select the sign of the newly created events. If parameter
option is ‘0’, their sign will be that of the threshold reached in each neuron
(positive or negative). If option is ‘1’, their sign will always be 1 (full-wave
rectification).
11. Create the output matrix of events with as many events as neurons firing (neur_firing).
Compute their T_prereqst values using current_time and timetoprocess to consider
delays between them. Update current_time accordingly.
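The state update in steps 4 to 10 can be sketched in Python as follows. This is a simplified illustration, not the AERST code: the handshake timing of steps 1, 2 and 11 is left out, arrays are stored as lists indexed [y][x] with 0-based coordinates, the refractory test follows the textual description of step 7 (comparison against trefract), and the class and method names are hypothetical.

```python
class ConvModule:
    """Event-driven sketch of steps 4-10 (handshake timing is left out)."""

    def __init__(self, size1, size2, s, zs, threshold, cteloss=0.0,
                 offset=0.0, trefract=0.0, option=0):
        self.size1, self.size2 = size1, size2      # X and Y dimensions of J
        self.s, self.zs = s, zs                    # kernel and its origin (x0, y0)
        self.threshold, self.cteloss = threshold, cteloss
        self.offset, self.trefract, self.option = offset, trefract, option
        self.J = [[0.0] * size1 for _ in range(size2)]       # pixel states
        self.time = [[0.0] * size1 for _ in range(size2)]    # last update time
        self.time2 = [[0.0] * size1 for _ in range(size2)]   # last firing time
        self.flags = [[0] * size1 for _ in range(size2)]     # refractory flags

    def process(self, current_time, x, y, sign):
        """Process one input event and return the list of (x, y, sign) outputs."""
        x0, y0 = self.zs
        for ky, kernel_row in enumerate(self.s):
            for kx, weight in enumerate(kernel_row):
                px, py = x + kx - x0, y + ky - y0
                if not (0 <= px < self.size1 and 0 <= py < self.size2):
                    continue                   # steps 4-5: keep only kern_eff
                # Step 6: linear forgetting towards zero since the last update.
                loss = self.cteloss * (current_time - self.time[py][px])
                state = self.J[py][px]
                magnitude = max(abs(state) - loss, 0.0)
                state = magnitude if state >= 0 else -magnitude
                self.time[py][px] = current_time
                # Step 8: add the signed kernel value.
                self.J[py][px] = state + weight * sign
        events = []
        for py in range(self.size2):
            for px in range(self.size1):
                # Step 7: release pixels whose refractory period has expired.
                if current_time - self.time2[py][px] > self.trefract:
                    self.flags[py][px] = 0
                # Step 9: fire, mark as refractory and reset to offset.
                if abs(self.J[py][px]) >= self.threshold and not self.flags[py][px]:
                    positive = self.option == 1 or self.J[py][px] > 0
                    events.append((px, py, 1 if positive else -1))   # step 10
                    self.flags[py][px] = 1
                    self.time2[py][px] = current_time
                    self.J[py][px] = self.offset
        return events


module = ConvModule(5, 5, [[1, 1, 1], [1, 2, 1], [1, 1, 1]], (1, 1), threshold=4)
first = module.process(0.0, 2, 2, 1)     # centre pixel reaches 2, nothing fires
second = module.process(1.0, 2, 2, 1)    # centre pixel reaches 4 and fires
print(first, second)
```

In this usage example the kernel values and the threshold are arbitrary; the second event makes only the centre pixel reach the threshold, so a single positive output event is produced at (2, 2).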
3.3.5 Projection Module
Each time an event is received, this module creates a set of output events with the shape
of a projection field received as a parameter. The module can use different input ports
to provide different shapes depending on the input port of the incoming event. This
module can be implemented using a multikernel convolution module similar to the one
described in Section 3.3.4, with a variable kernel s of the desired shape. The threshold
is set to ‘0’, so that each incoming event always generates
output events coding the shape.
Figure 3.11: AER Integrate and Fire module implemented in the tool
3.3.6 Integrate and Fire Module
This module has several AER input ports, one AER output port and one stored array.
The module adds (or subtracts) a fixed quantity (specified by a parameter called
value) in the array position coded by the incoming event. The addition or subtraction
depends on the input port from which the event is received. The sign of the operation
is coded by a parameter vector (called oper) with values -1, +1 for every input port.
Each time a certain value (specified by the parameter threshold) is reached by one
pixel of the array, a new output event is produced and the pixel is reset.
The integrate and fire module implemented in the tool, with its internal parameters
and state variables, is shown in Fig. 3.11.
The operation of this module is as follows:
1. Use the current_time, timedelay and timetoprocess parameters to update the timing
information (T_reqst and T_ack) of the incoming event (event_in):
event_in(T_reqst) ← current_time
event_in(T_ack) ← current_time + timedelay
2. Update the new current time:
new_current_time ← current_time + timedelay + timetoprocess
3. Use event_in to get the event coordinates (x, y) and sign information (sign):
x ← event_in(3); y ← event_in(4); sign ← event_in(5)
4. Use parameters port_in, oper, value and state array J to update the neuron
addressed by the incoming event:
J(x, y) ← J(x, y) + value ∗ oper(port_in)
5. If the absolute value of the neuron state is higher than threshold, the neuron
will fire an event and will be reset to ‘0’:
if (abs(J(x, y)) > threshold)
xnew ← x; ynew ← y;
J(x, y) ← 0
6. Create the new output event with coordinates (xnew, ynew). Use the sign of the
threshold reached (positive or negative). Use current_time and timetoprocess
to compute its T_prereqst value. Update current_time accordingly.
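Steps 4 to 6 of the integrate and fire module can be sketched in Python as shown below. The sketch is illustrative only: timing is omitted, ports and coordinates are 0-based, and the class name is hypothetical. The strict comparison against threshold follows the condition given in step 5.

```python
class IntegrateAndFire:
    """Sketch of the integrate and fire update (steps 4-6, timing omitted)."""

    def __init__(self, size_x, size_y, oper, value, threshold):
        self.oper = oper              # one entry (+1 or -1) per input port
        self.value = value            # fixed quantity added or subtracted
        self.threshold = threshold
        self.J = [[0] * size_x for _ in range(size_y)]   # stored array, J[y][x]

    def process(self, port_in, x, y):
        """Return an output event (x, y, sign), or None if the pixel stays silent."""
        self.J[y][x] += self.value * self.oper[port_in]      # step 4
        state = self.J[y][x]
        if abs(state) > self.threshold:                      # step 5
            self.J[y][x] = 0
            return (x, y, 1 if state > 0 else -1)            # step 6
        return None


iaf = IntegrateAndFire(4, 4, oper=[+1, -1], value=1, threshold=2)
outputs = [iaf.process(0, 1, 1) for _ in range(3)]   # three events on port 0
print(outputs)   # [None, None, (1, 1, 1)]
```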
3.3.7 Rate-Reducer Module
This module has only one input port and one output port. If we configure the integrate
and fire (i&f) module to have only one input port with sign ‘+1’ and we fix the parameters value and
threshold (with threshold > value), the input rate will be reduced at the output by
a factor value/threshold. This means that several events coding one address will be
needed to fire only one output event with the same address. As in the i&f module, an
internal stored array is needed.
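The rate reduction can be illustrated with a short Python sketch. It assumes that a pixel fires as soon as its accumulated value reaches threshold, so that the reduction factor is exactly value/threshold; the function name and the dictionary used as stored array are hypothetical.

```python
def rate_reducer(addresses, value, threshold):
    """Forward one event per address each time its accumulator reaches threshold."""
    state = {}      # stored array, kept sparse: address -> accumulated value
    out = []
    for address in addresses:
        state[address] = state.get(address, 0) + value
        if state[address] >= threshold:
            state[address] = 0
            out.append(address)
    return out


inputs = [(1, 1)] * 12
outputs = rate_reducer(inputs, value=1, threshold=4)
print(len(inputs), len(outputs))   # 12 3
```

With value = 1 and threshold = 4, twelve input events with the same address produce three output events, that is, the rate is reduced by a factor 1/4.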
3.3.8 Self-Exciting Modules
The present simulator is event-driven and consequently updates the states of modules only
when events are processed. In between events, it neither updates any state nor checks
for new output events. However, one can think of AER modules which include some
Figure 3.12: AER Self-Exciting Module
type of internal self-exciting capability, such that they could eventually generate output
spikes while no input events have been received for some time. In a clock-driven simulator,
one would simply describe such modules through differential equations updated after
a given time step. This time step can be either fixed or made to change dynamically
according to the actual effective time constant. The modules considered so far are
not described, in general, by a set of clock-driven differential equations, because the
module states are only updated when they receive input events. However, it is possible
to describe a module by a set of differential equations. With the AERST event-driven
simulator, the idea is to add a “dummy channel” as shown in Fig. 3.12, whose input
and output connect only to the AER module described internally with differential
equations. Then, the module should be described in such a way that it will put a
future event in the dummy channel at the time at which the differential equations need
to be updated. This will depend on the time step used by the differential equations
algorithm.
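The dummy-channel idea can be sketched in Python with a priority queue of pending events. This is a toy illustration under stated assumptions: the internal differential equation is taken to be dJ/dt = drive with a constant drive, the module fires and resets when its state reaches threshold, and all names are hypothetical rather than part of AERST.

```python
import heapq


def simulate_self_exciting(input_events, drive, step, threshold, t_end):
    """Event-driven loop with a dummy channel that re-triggers the module.

    input_events: list of (time, charge) pairs arriving on the real channel.
    drive: constant dJ/dt of the (hypothetical) internal differential equation.
    step: time step at which the module asks to be updated again.
    """
    queue = [(t, "input", charge) for t, charge in input_events]
    queue.append((step, "dummy", 0.0))      # first self-update request
    heapq.heapify(queue)
    state, last_t, spikes = 0.0, 0.0, []
    while queue:
        t, channel, charge = heapq.heappop(queue)
        if t > t_end:
            break
        state += drive * (t - last_t)       # integrate dJ/dt = drive up to time t
        last_t = t
        state += charge                     # charge is zero for dummy events
        if channel == "dummy":              # schedule the next internal update
            heapq.heappush(queue, (t + step, "dummy", 0.0))
        if state >= threshold:
            spikes.append(t)
            state = 0.0
    return spikes


# No input events at all: the module still fires because of its internal drive.
print(simulate_self_exciting([], drive=1, step=1, threshold=5, t_end=20))
```

The simulator itself remains purely event-driven: the module is only visited when an event is popped from the queue, and the dummy events guarantee that such visits happen at least once per time step.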
3.4 Validation of the AER Tool
To validate the simulator, we have implemented two simulations of two AER systems
that had been previously built in hardware [50][17]. All the parameters describing
the modules such as thresholds, forgetting ratio, kernel values, delays, array sizes, etc.
Figure 3.13: Block diagram of the AER system developed to simulate the hardware implementation
have been chosen according to the specifications of the corresponding AER hardware
devices. These two example systems are described next.
3.4.1 Detection and tracking of moving circles of given radius
The first example simulates part of the demonstration system in the CAVIAR project
[50]. This system could track a circular object of a given size. A block diagram of the
complete system is shown in Fig. 3.4.
The complete chain consisted of 17 AER modules. The AER block diagram that we
have used to emulate the CAVIAR system is shown in Fig. 3.13. The system receives as
input events recorded previously by the electronic retina when watching the movement
of a rotating disc with two solid circles of different radii. These events are sent to a
convolution module with a kernel tuned to detect a circumference of a certain radius
(Fig. 3.14 (a)).
The positive output events of the convolution module follow the center of the target
circumference. These events are sent to a winner-takes-all (WTA) module. We
implement the WTA module by using a two-input merger module together with a
convolution module. The convolution module is programmed with a kernel which is
positive in the center and negative in the rest of the positions (Fig. 3.14 (b)). We use
the output activity of the convolution chip as feedback to the second input port of the
merger. Due to the feedback in the winner-takes-all and the convolution kernel being positive
only at the central point, the output activity of the WTA module responds only to the
Figure 3.14: a) Kernel to detect a circumference of a certain radius. b) Kernel used in the WTA module.
Figure 3.15: Winner-Takes-All module
incoming addresses having the highest activity.
The winner-takes-all module created this way is shown in Fig. 3.15.
The disc with the two circles rotates at a speed of approximately 0.28 rev/s. In Fig.
3.16 the four images on the left represent the images reconstructed with the hardware
implementation (images were obtained with the jAER tool [59]). Each 2D image is
obtained by collecting events during 33 ms. The gray values correspond to non-activity.
Black values correspond to changes in intensity due to the motion sensing retina at
the input and white levels in the bottom images correspond to the pixels detecting the
center of the moving ball at the output. The four images on the right correspond to
the images obtained using the C++ version of the simulator AERST with the same
input stimulus.
Figure 3.16: On the left, input and output obtained with the hardware implementation. On the right, input and output obtained with the simulated implementation
3.4.2 Recognition of high speed Rotating Propellers
The second experiment demonstrates the high-speed processing capabilities of AER-based
systems. It consists of the recognition and tracking of an S-shaped propeller
rotating at 5000 rev/s [17] while moving across the screen. At this speed, a human
observer would not be able to discriminate the propeller shape and would only see a
moving circle across the screen. The propeller has a diameter of 16 pixels. The AER
simulated system is again the one shown in Fig. 3.13. This time, the convolution chip
was programmed with a kernel to detect the center of the S-shaped propeller when it
is in the horizontal position. Fig. 3.17 (a) shows the kernel. Fig. 3.17 (b) and (c)
show the 2-D input (propeller) and output images reconstructed by collecting events
during a 50 µs interval (1/4 of a rotation). Fig. 3.17 (d) and (e) show
the 2-D input and output images reconstructed by collecting events during a 200 ms
interval (corresponding to one complete back-and-forth screen crossing). As can be
seen, only those pixels detecting the center of the propeller produce output activity.
The propeller is properly detected and tracked at any instant in real time. Note that
using conventional frame-based image processing methods to discriminate the propeller
is a complicated task, which requires a high computational load. First, images must
be acquired with an exposure time of no more than 20 µs (one tenth of a rotation) and,
second, recognition must also be achieved in less than one rotation, 200 µs.
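The time figures quoted above follow directly from the rotation speed, as the following short Python check shows. The variable names are illustrative; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

rotation_rate = 5000                              # rev/s, from the text
period_us = Fraction(1_000_000, rotation_rate)    # one rotation, in microseconds

quarter_turn_us = period_us / 4     # event-collection window of Fig. 3.17 (b), (c)
max_exposure_us = period_us / 10    # exposure bound quoted for frame-based methods

print(period_us, quarter_turn_us, max_exposure_us)   # 200 50 20
```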
Figure 3.17: a) Kernel used to detect the propeller, b) and c) input and output when we collect events during 50 µs, d) and e) input and output when we collect events during 200 ms.
Chapter 4
MULTI-CHIP MULTI-LAYER CONVOLUTION PROCESSING FOR CHARACTER RECOGNITION
The system reported in this chapter is a simplification of Fukushima’s Neocognitron [2].
First we will briefly describe the network originally proposed by Fukushima and then
we will explain how a simplification of this system was implemented using AER-based
modules.
4.1 Fukushima’s Neocognitron
Fukushima’s Neocognitron is a hierarchical network consisting of several layers of
neuron-like cells. There are forward connections between cells in adjoining layers.
Some of these connections are variable, and can be modified by learning. The neocognitron
can acquire the ability to recognize patterns by learning, and can be trained to
recognize any set of patterns. Since it has a large power of generalization, presentation
of only a few typical examples of deformed patterns (or features) is enough for the
learning. It is not necessary to present all the deformed versions of the patterns which
might appear in the future. After learning, it can recognize input patterns robustly,
with little effect from deformation, changes in size, or shifts in position. In contrast to
Figure 4.1: A typical architecture of the neocognitron network.
most conventional pattern recognition systems, it does not require any preprocessing
such as normalizing the position, size, or deformation of input patterns.
Fig 4.1 shows a typical architecture of the neocognitron network. The lowest stage
is the input layer consisting of a two-dimensional array of cells, which corresponds to
photoreceptors of the retina. There are retinotopically ordered connections between
cells of adjoining layers. Each cell receives input connections that lead from cells
situated in a limited area on the preceding layer. Layers of “S-cells” and “C-cells” are
arranged alternately in the hierarchical network. In the network shown in Fig 4.1, a
contrast-extracting layer is inserted between the input layer and the S-cell layer of the
first stage. S-cells work as feature-extracting cells. They resemble simple cells of the
primary visual cortex in their response. Their input connections are variable and can
be modified through learning. Each S-cell responds selectively to a particular feature
presented in its receptive field (the receptive field is a portion of sensory space that
can elicit neuronal responses when stimulated). The features extracted by S-cells are
determined during the learning process. Generally speaking, local features, such as
edges or lines in particular orientations, are extracted in lower stages. More global
features, such as parts of learning patterns, are extracted in higher stages.
C-cells, which resemble complex cells in the visual cortex, are inserted in the network
to allow for positional errors in the features of the stimulus. The input connections
of C-cells, which come from S-cells of the preceding layer, are fixed and invariable.
Each C-cell receives excitatory input connections from a group of S-cells that extract
the same feature, but from slightly different positions. The C-cell responds if at least
one of these S-cells yields an output. Even if the stimulus feature shifts in position
and another S-cell comes to respond instead of the first one, the same C-cell keeps
responding. Thus, the C-cell’s response is less sensitive to shift in position of the
input pattern. We can also say that C-cells perform a blurring operation, because the
response of a layer of S-cells is spatially blurred in the response of the succeeding layer
of C-cells.
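The tolerance to positional shift provided by C-cells can be illustrated with a small Python sketch. It is a deliberately simplified, binary model: responses are 0 or 1, each C-cell pools a square neighbourhood of S-cells of the same cell-plane, and the neighbourhood radius and function name are hypothetical choices rather than part of Fukushima's formulation.

```python
def c_cell_layer(s_plane, radius=1):
    """Each C-cell responds if at least one S-cell in its neighbourhood responds."""
    rows, cols = len(s_plane), len(s_plane[0])
    c_plane = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            for ny in range(max(0, y - radius), min(rows, y + radius + 1)):
                for nx in range(max(0, x - radius), min(cols, x + radius + 1)):
                    if s_plane[ny][nx]:
                        c_plane[y][x] = 1
    return c_plane


feature = [[0] * 5 for _ in range(5)]
feature[2][2] = 1                      # S-cell response at row 2, column 2
shifted = [[0] * 5 for _ in range(5)]
shifted[2][3] = 1                      # same feature shifted by one column

# The C-cell at row 2, column 2 keeps responding after the shift.
print(c_cell_layer(feature)[2][2], c_cell_layer(shifted)[2][2])   # 1 1
```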
Each layer of S-cells or C-cells is divided into sub-layers, called “cell-planes” (feature
maps), according to the features to which the cells respond. The cells in each cell-plane
are arranged in a two-dimensional array. A cell-plane is a group of cells that share the
same set of connections. As a result, all the cells in a cell-plane have receptive fields
of an identical characteristic, but the locations of the receptive fields differ from cell
to cell. During learning, the modification of variable connections also proceeds
under the restriction of shared connections. In the whole network, with its alternate
layers of S-cells and C-cells, the process of feature-extraction by S-cells and toleration
of positional shift by C-cells is repeated. During this process, local features extracted in
lower stages are gradually integrated into more global features, as illustrated in Fig 4.2.
Since small amounts of positional errors of local features are absorbed by the blurring
operation by C-cells, an S-cell in a higher stage comes to respond robustly to a specific
feature even if the feature is slightly deformed or shifted.
Thus, tolerating slight positional errors a little at a time at each stage, rather than a
considerable positional error in one step, plays an important role in endowing the network
with the ability to recognize even distorted patterns.
C-cells in the highest stage work as recognition cells, which indicate the result of the
pattern recognition. Each C-cell of the recognition layer at the highest stage integrates
all the information of the input pattern, and responds only to one specific pattern.
Since errors in the relative position of local features are tolerated in the process of
extracting and integrating features, the same C-cell responds in the recognition layer
at the highest stage, even if the input pattern is deformed, changed in size, or shifted in
position. In other words, after having finished learning, the neocognitron can recognize
Figure 4.2: The process of pattern recognition in the neocognitron. The lower half of the figure is an enlarged illustration of a part of the network.
input patterns robustly, with little effect from deformation, change in size, or shift in
position.
4.2 AER-based system for Character Recognition
We have adapted the original structure of the Neocognitron so that it can distinguish
between characters ‘A’, ‘B’, ‘C’, ‘H’, ‘L’, ‘M’ and ‘T’. It is based on AER and makes use
of the programmable kernel AER convolution chip proposed by Serrano-Gotarredona
et al. [17][50][39]. As shown in Fig 4.3, it receives an input visual stimulus (of 16
Figure 4.3: Character recognition system based on AER
x 16 pixels), which can be one of the previous characters, and it can tolerate slight
deformations. Each active pixel of the 16 x 16 input stimulus will fire ten events, and
the rest of the pixels will not fire. Consecutive input events will be separated by 50 ns. In this way, the
complete input stimulus, which has around 30 active pixels, will be transmitted in
about 15 µs.
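The transmission time can be checked with a short Python sketch that builds such an input event stream. The interleaved ordering of the events, the placement of the 30 active pixels and the function name are assumptions made only for this example.

```python
def stimulus_events(image, events_per_pixel=10, spacing_ns=50):
    """Return (time_ns, x, y) events for every active pixel of a binary image."""
    active = [(x, y) for y, row in enumerate(image)
              for x, pixel in enumerate(row) if pixel]
    events = []
    for _ in range(events_per_pixel):
        for x, y in active:
            events.append((len(events) * spacing_ns, x, y))
    return events


# Hypothetical 16 x 16 stimulus with 30 active pixels (first 30 in raster order).
image = [[1 if 16 * y + x < 30 else 0 for x in range(16)] for y in range(16)]
events = stimulus_events(image)
duration_us = len(events) * 50 / 1000
print(len(events), duration_us)   # 300 15.0
```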
The first processing layer can be considered as a layer of “S-cells” and performs
17 convolutions in parallel for feature extraction with convolution masks (also called
kernels) ki (i = 1, ..., 17). Kernels have positive and negative values. Therefore, convolution
outputs would include both positive and negative events. The kernels are shown
in Fig 4.4 normalised from ‘-1’ to ‘1’. Black pixels (value ‘-1’) correspond to the most
negative value in each kernel and white pixels (value ‘1’) correspond to the highest
positive value. The cross in each kernel of Fig 4.4 shows its origin of coordinates. As explained
in the previous chapter, the convolution modules work by adding a convolution mask to a
matrix of pixels around the address coordinate specified by the incoming events. When
a pixel in the matrix of neurons belonging to one of the convolution modules reaches
a configurable threshold value, it will reset itself and generate an output event, which
will be sent out of the convolution module. In the system of Fig 4.3, each convolution
module is configured not to send out any negative event¹. Only positive events
will be transmitted. Consequently, each convolution module will compute a half-wave
rectification after the convolution operation.
¹ Events in AER-based systems can have positive or negative sign. However, our convolution modules are configured so that if a pixel produces a negative output event, the pixel is reset, but it does not transmit the event out of the chip.
Figure 4.4: Kernels used in the first layer for feature detection. The red cross indicates the origin of coordinates of the kernel when it is projected onto the pixel array.
Each kernel in layer ‘1’ is intended to detect discriminatory features that help to
identify the characters. Kernel k1 is intended to detect the presence and position of
the upper peak in letter ‘A’. Kernel k2 detects a horizontal segment ending on the left
and touching a vertical segment. Kernel k3 detects a horizontal segment ending on
the right and touching a vertical segment. Kernel k4 detects a vertical segment ending
on the top and touching a horizontal segment. Kernel k5 detects the bottom end of
a vertical segment. Kernel k6 detects the top end of a vertical segment. Kernel k7
detects the left end of a horizontal segment, and kernel k8 does the same for the right end.
Kernel k9 is intended to detect the upper curvature of letter ‘C’ and kernel k10 the
same but for the lower one. Kernel k11 detects a horizontal segment and kernel k12 the
same but for a vertical one. Kernel k13 is intended to detect the central crossing point
between the two inclined segments of letter ‘M’, and kernel k14 detects the crossing point
between the two right curves in letter ‘B’. Kernel k15 detects the upper left peak in letter ‘M’,
kernel k16 the same, but on the right, and finally, kernel k17 detects the crossing point
between the horizontal and vertical segments in letter ‘L’. Consequently, the first layer
of convolutions is intended to detect a set of 17 geometrical features which can be used
to detect and discriminate between the different letters. As shown in Fig. 4.3, each
kernel ki (i = 1, ..., 17) produces activity on channel ci (i = 1, ..., 17). Consequently,
letter ‘A’ should produce activity at outputs {c1, c2, c3, c5, c11}, letter ‘B’ at {c2,
c11, c12, c14, c17}, letter ‘C’ at {c8, c9, c10, c11, c12}, letter ‘H’ at {c2, c3, c5, c6,
c11, c12}, letter ‘L’ at {c6, c8, c11, c12, c17}, letter ‘M’ at {c5, c12, c13, c15, c16}
and letter ‘T’ at {c4, c5, c7, c8, c11, c12}.
The second layer of convolution processing can be considered as a layer of “C-cells”
and performs 17 convolutions in parallel. Each of these convolution chips uses one of
the six different convolution masks shown on the right of Fig 4.5, adjusted to the range ‘-1’
to ‘1’. The kernel that each filter pi uses and its origin of coordinates in the pixel array
are shown in Table 4.1. As shown in Fig. 4.3, each filter pi (i = 1, ..., 17) produces
activity on channel di (i = 1, ..., 17). This layer is intended to evaluate whether the
spatial distribution of features detected in the first layer is meaningful for the character
to be detected. For example, for letter ‘A’, the top peak (detected by k1 in the first
layer) should be in the upper part above all other features. Consequently, filter p1 will
produce a positive contribution in the region below the peak, because this would be the
place in output d1 where the center of letter ‘A’ would be if all its features are detected
simultaneously. In a similar manner, if there is output activity at c2, the center of ‘A’
should be to the right. Therefore, filter p2 will add contribution to the pixels in d2
which are to the right of those that fired in c2. The output at c3 has to be treated
symmetrically to the one at c2. Kernel k5 places events at c5 if the bottom end of a
vertical segment is detected. This means that the center of letter ‘A’ is above, either
to the right or to the left. This spatial weighting is performed by filter p5. Finally, if
there is output at c11, the center of ‘A’ should be in the same position, as kernel k11
is intended to detect horizontal segments. Therefore, filter p11 will add contribution to
the pixels in d11 which are at the same position as those that fired in c11. In this way,
when the input is letter ‘A’ the activity at {c1, c2, c3, c5, c11} will be on different
pixels. However, the activity at {d1, d2, d3, d5, d11} will be around the center of
letter ‘A’. In this way, layer ‘2’ joins the meaningful features for each character in the
central region of the corresponding character. Something similar will occur with the
rest of the letters, whose centers will also be identified with the respective outputs obtained
FILTER   KERNEL   CENTRAL X-COORD   CENTRAL Y-COORD
p1       f1       10                9
p2       f4       8                 7
p3       f5       8                 2
p4       f1       9                 9
p5       f3       -3                9
p6       f3       10                8
p7       f6       10                7
p8       f6       10                0
p9       f1       7                 11
p10      f1       0                 11
p11      f1       -2                8
p12      f1       4                 12
p13      f1       5                 9
p14      f1       3                 7
p15      f2       10                13
p16      f2       10                5
p17      f1       -2                12
Table 4.1: Origin of Coordinates for Kernels in Layer 2
from filters in layer ‘2’, which will perform the spatial weighting needed for each of the
different features extracted in the first layer.
This layer is intended to implement in some way the blurring operation of C-cells,
as the response of the previous layer of S-cells (first layer) is spatially blurred and the
detected features corresponding to one character are copied near the center of
the character at this stage.
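The way in which the second layer brings the features of one character together can be sketched in Python as a voting scheme. This is a strong simplification made only for illustration: each filter is reduced to a single displacement from the feature position to the expected character center, whereas the actual filters use the kernels of Fig 4.5 and the origins of Table 4.1. The displacements, the feature positions and the convention that y grows downwards are all hypothetical.

```python
def vote_for_center(feature_events, offsets, size=16):
    """Accumulate, per pixel, the votes that detected features cast for a center.

    feature_events: list of (channel, x, y) first-layer output events.
    offsets: channel -> (dx, dy) displacement from the feature to the center.
    """
    votes = [[0] * size for _ in range(size)]
    for channel, x, y in feature_events:
        dx, dy = offsets[channel]
        cx, cy = x + dx, y + dy
        if 0 <= cx < size and 0 <= cy < size:
            votes[cy][cx] += 1
    return votes


# Hypothetical displacements for some features of letter 'A' (y grows downwards).
offsets = {1: (0, 6),     # top peak: the center lies below
           2: (3, 0),     # left junction: the center lies to the right
           3: (-3, 0),    # right junction: the center lies to the left
           11: (0, 0)}    # horizontal segment: the center is at the same position
events = [(1, 8, 2), (2, 5, 8), (3, 11, 8), (11, 8, 8)]
votes = vote_for_center(events, offsets)
print(votes[8][8])   # 4
```

Although the four features are detected at different pixels, all their votes fall on the same position, which is the behaviour described for outputs {d1, d2, d3, d5, d11}.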
The purpose of the third layer is to combine the outputs of the second layer with
positive or negative weights. For example, for letter ‘A’ outputs {d1, d2, d3, d5, d11}
should contribute positively, while outputs {d4, d6, d7, d8, d9, d10, d12, d13, d14,
d15, d16, d17} should inhibit. The same occurs with the rest of the letters. In this
layer, all outputs d1 − d17 from the second layer are split (blocks Sp in Fig 4.3)
into seven separate pathways with seven independent 17-input merger blocks (block
M in Fig 4.3), each one to detect one of the characters. Only positive events come
out at outputs d1 − d17. However, the sign bits are hardwired at the inputs of the
merger blocks, with a positive sign if the events contribute positively or a negative sign if
the events contribute negatively.
To implement the operation of addition or subtraction between the input channels,
the merger blocks sequence the events coming from their seventeen input channels, and
feed them to a convolution chip programmed with a 1x1 kernel with weight ‘1’ (block
U in Fig 4.3). The convolution chip parameters are set so that 3 input events are
necessary to produce an output event for one pixel. A threshold of 3 events is considered
high enough to implement the addition between the channels efficiently, and
low enough to keep the response of the system fast. Finally, the
fourth layer consists of one single convolution chip for each character path (block C in
Fig 4.3), which detects whether the events coming from the previous layer are more
or less clustered together, rather than spread over the pixel array. If they are clustered
(in the center of the character), the character has been detected. The kernel,
normalised to ‘1’, is shown at the bottom of Fig 4.5. The ‘x’ in the kernel C of Fig 4.5
shows the origin of coordinates of the kernel.

Figure 4.5: Kernels used in the second layer for spatial weighting. The kernel C at the bottom end is a single convolution chip for detecting whether the events coming from the previous layer are more or less clustered together.
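
The signed merging and thresholding performed by blocks M and U can be pictured with a minimal Python sketch. It assumes a simple per-pixel accumulator that fires after three net positive events and then resets; the function name `merge_and_threshold`, the event tuple layout and the reset behaviour are illustrative assumptions, not a description of the actual convolution chip.

```python
from collections import defaultdict

# Channels that contribute positively to the 'A' pathway (from the text).
POSITIVE_FOR_A = {"d1", "d2", "d3", "d5", "d11"}

def merge_and_threshold(events, positive_channels, threshold=3):
    """Merge signed events and apply a 1x1 kernel of weight 1.

    events: iterable of (time, channel, pixel) tuples sorted by time.
    Returns the list of (time, pixel) output events.
    """
    state = defaultdict(int)  # per-pixel accumulator (assumed model)
    out = []
    for t, channel, pixel in events:
        # The sign is hardwired at the merger input, not carried by the event.
        sign = 1 if channel in positive_channels else -1
        state[pixel] += sign
        if state[pixel] >= threshold:
            out.append((t, pixel))
            state[pixel] = 0  # reset after firing (assumption)
    return out

# Hypothetical burst: four excitatory events and one inhibitory event on pixel 120.
events = [(0, "d1", 120), (1, "d2", 120), (2, "d4", 120),
          (3, "d3", 120), (4, "d5", 120)]
print(merge_and_threshold(events, POSITIVE_FOR_A))
```

In this toy burst the inhibitory event from d4 delays the output by one excitatory event, which is the intended effect of the negative sign.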
Note that if this system is built with AER hardware modules, all this processing
is done in parallel and in real time, with the events being sent from layer to layer with
nanosecond delays. On the other hand, in a frame-based system, we would have to 1) receive the
complete set of pixel values belonging to the letter, 2) process the letter image
sequentially with all the convolution masks of the first layer, 3) process the resulting
filtered images sequentially with all the convolution masks of the second layer, 4) add
or subtract the corresponding filtered images, and 5) process the seven resulting images
from the third layer (each one corresponding to one letter) with the convolution mask
described in the fourth layer.

Figure 4.6: Letters used for testing the system based on AER for character recognition.
4.3 Experimental Results
The multi-chip (68 convolution modules) multi-layer (4 layers) system described above
has been tested using three slightly modified versions for each one of the seven charac-
ters proposed. The twenty-one characters are shown together in Fig 4.6. The results
obtained after the simulations of the system described in the previous Section are shown
in Table 4.2.
In Table 4.2, we show the duration of the input stimulus (stimulus time, T1), the
time when the first event corresponding to each character is obtained at the system
output (time first output, T2), the difference between these times (T2 − T1), the time
when the last event corresponding to each character is obtained at the system output
(time last output) and the retrieval accuracy achieved for each of the letters (retrieval
accuracy). Finally, we also show the number of events obtained at each of the seven
output channels for each of the characters.
Table 4.2: Timing and accuracy obtained for each of the letters

The output events generated at the different convolution outputs {c1, c2, c3, c5,
c11, d1, d2, d3, d5, d11, dA, dB, dC, dH, dL, dM, dT, fA, fB, fC, fH, fL, fM, fT} are shown in Fig. 4.7, for the case of input stimulus ‘A’. The specific timing of the
events can be seen in Fig 4.8. The vertical axes indicate pixel numbers (from 0 to 255)
in the 16x16 pixel arrays, while the horizontal axes represent time in µs (from 0 to
40 µs). The specific timing of the input and output event bursts for the first version of
each character can be seen in Fig 4.9. As Table 4.2 indicates, in all cases, the system
is capable of detecting which letter is present in less than 9.3 µs after the first input
stimulus event is received by the system. This delay is even smaller than the average
duration of the input stimulus spike burst (12.4 µs). Consequently, on average the
system is able to recognize the letter before processing all the input spikes. In any case,
the recognition performance rate for such a system (it should be noted that it tolerates
a certain degree of letter deformation and scaling) is unprecedented (as can be seen in
Table 4.2, the average time for detecting each letter is only 9.31 µs, which is equivalent to
processing over 100,000 images per second). In a frame-based system, we would
have to wait for the frame time to recover all the pixel values of the character under
recognition, and after that, we would have to process the entire image sequentially with
the 68 convolution modules described. If we suppose that a scheme using 25 frames per
second is used, each letter could not be processed in less than 40 ms
(note that this value is computed without considering the post-processing time due to
the convolution modules). We believe that this technique for bio-inspired AER-based
vision processing is very promising.
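
The throughput comparison quoted above is simple arithmetic, and can be checked with a short Python sketch. The constant and function names are illustrative; the only inputs are the 9.31 µs average detection time and the 25 frames per second assumed in the text.

```python
def images_per_second(detection_time_s):
    """Recognitions per second if one letter is detected every detection_time_s."""
    return 1.0 / detection_time_s

AVG_DETECTION_TIME_S = 9.31e-6  # average detection time quoted from Table 4.2
FRAME_PERIOD_S = 1.0 / 25       # 25 frames per second gives 40 ms per frame

rate = images_per_second(AVG_DETECTION_TIME_S)
ratio = FRAME_PERIOD_S / AVG_DETECTION_TIME_S

print(f"event-based rate: {rate:.0f} letters per second")
print(f"one frame period lasts {ratio:.0f} average detection times")
```

The rate comes out slightly above 107,000 letters per second, consistent with the "over 100,000 images per second" figure, and a single 40 ms frame period spans several thousand average detection times.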
Figure 4.7: The output events generated at the different convolution outputs {c1, c2, c3, c5, c11, d1, d2, d3, d5, d11, dA, dB, dC, dH, dL, dM, dT, fA, fB, fC, fH, fL, fM, fT} for the case of input stimulus ‘A’.
One limitation that can be argued for this vision processing technique is that for real
size images many more events need to be processed. This is certainly true. However,
our experience is that if one uses appropriate input sensors, like retinae that directly
sense motion [15] or contrast [41][60][16] instead of image intensity [32], then the flow
of events is kept at a reasonable event rate (below 1 Meps for arrays of 128x128 pixels
[15]).
4.4 Discussion
In this Chapter, we have implemented a system with four layers for character recog-
nition. It can distinguish between characters ‘A’, ‘B’, ‘C’, ‘H’, ‘L’, ‘M’ and ‘T’. The
system is based on AER and makes use of the programmable kernel AER convolution
chip proposed by Serrano-Gotarredona et al. [17][50][39].
Figure 4.8: Events obtained in the system at outputs {c1, c2, c3, c5, c11, d1, d2, d3, d5, d11, dA, fA} when the input is letter ‘A’. Time is expressed in µs.
The character catalog used in the application could be expanded by simply sending
the outputs of layer ‘3’ to new merger modules. In a hardware implementation, besides
the convolution modules, one also requires splitters and/or merger blocks, and possibly
some extra mappers. In the future, as AER processors become more sophisticated,
we expect to be able to fit in a single chip several convolution arrays together with
splitters/mergers and mappers. In the example of Fig 4.3, all convolution chips are of
size 16x16.
The system could be conceived to include learning, so that an external supervisor
trains it and updates the weights dynamically, optimizing final performance [61].
Figure 4.9: Events obtained in the system at input and output channels for the first version of each of the letters.

The results shown in Table 4.2 indicate some of the clear advantages of using AER
rather than the classical frame-based image processing methods. Some of
these advantages regarding this application can be summarized as follows:
1. We do not need to wait for the complete image frame to start processing in any
of the layers. On the contrary, as soon as the first event is received in the first layer
of the system, new output events are produced and can be processed by the
following layers. With the classical frame-based methods, however, we would
have to process the complete input image frame sequentially with each one of
the convolution modules. Moreover, we could not start processing in one module
until the image has been completely processed by the previous one. Note that
in the AER implementation we require less than 9.31 µs to detect the character
under recognition.
2. We do not need to collect all the output events in layer ‘3’ to identify the character
under analysis in the system. With only the first output events from layer ‘3’, we
are able to finish the recognition task.
3. It is possible to add new modules in parallel in each of the layers. This
allows us to analyze more features in the characters under recognition, and thus
to improve the results without increasing the computational cost.
Chapter 5
IMPLEMENTATION OF TEXTURE RETRIEVAL USING AER-BASED SYSTEMS
5.1 Introduction
To illustrate the processing power of AER-based systems, we have developed an example
AER architecture for a sophisticated image processing application: content-based
image retrieval. Content-based image retrieval is emerging as an important research
area with applications in digital libraries and multimedia databases. An image can be
considered as a mosaic of different texture regions, and the image features associated
with these regions can be exploited for search and retrieval.
5.2 State of the art in texture recognition
Texture analysis has a long history, and a very large number of algorithms for texture
characterization have been developed in recent decades. The commonly used methods
for texture characterization can be divided into three categories: statistical, model-
based, and filtering approaches [62]. Statistical methods, such as cooccurrence features
[63][64], analyze the spatial distribution of gray values, by computing local features
at each point in the image, and deriving a set of statistics from the distributions of
the local features. Model-based methods such as Markov random field (MRF) [65]
and simultaneous autoregressive (SAR) models [66] provide a description of texture in
terms of spatial interaction. Most of the statistical and model-based approaches for
texture classification consider spatial interactions over relatively small neighborhoods.
Therefore, these approaches are suitable only for microtextures [67], [68]. Filtering
approaches, including wavelets [69], [70], Gabor filters [67], [71], the steerable pyramid
[72], and the directional filter bank (DFB) [73], [74], characterize textures in the frequency
domain. Among the three categories, MPEG-7 has adopted Gabor-like filtering for texture
description [75]. The rationale is that the visual cortex is sensitive to localized
frequency components [76]. It has been shown that direction together with scale
information is important for texture perception. In the last decade, researchers have been
combining different methods in order to provide a better classification and retrieval of
images. Fusion of different types of texture features can be found in the literature
[77]-[80]. A comprehensive performance evaluation of filtering (i.e., spectral-based)
methods for texture classification is presented in [62], which suggests that no single
set of features derived from filtering approaches performs consistently better
on all textures. Other comparative studies about all these methods can be found in
[81]-[83]. In [84], two fast algorithms for multiscale directional filter banks (MDFB) are
proposed. These two algorithms are compared with the previous algorithm for MDFB
proposed in [85] and with the contourlet transform [86], [87] in terms of time of feature
extraction (FE) and total computational time. In [88], a texture representation suitable
for recognizing images of textured surfaces under a wide range of transformations,
including viewpoint changes and nonrigid deformations, is presented. At the feature
extraction stage, a sparse set of affine Harris and Laplacian regions is found in the image.
Each of these regions can be thought of as a texture element having a characteristic
elliptic shape and a distinctive appearance pattern. Using the Brodatz database [120],
the approach achieves a maximum average retrieval rate (ARR, see definition in the
Experimental Results Section) of 76.26% when combined Harris and Laplacian descriptor
channels are used. In [89], a linear family of filters is introduced, which provides
a certain scale invariance, resulting in a texture description invariant to local changes in
orientation, contrast and scale, and robust to local skew. Then, a texture discrimination
method based on the χ2 similarity measure is applied to the histograms derived
from the filter responses. This approach achieves a maximum average retrieval rate of
78.5% when it is tested using the Brodatz database [120]. In [90], the authors propose
an approach for rotation-invariant texture image retrieval by using a set of dual-tree
rotated complex wavelet filters (DT-RCWF) and the DT complex wavelet transform (DT-CWT)
jointly. They compare the average retrieval accuracy obtained using the standard
real DWT (discrete wavelet transform), DT-CWT, and a combination of DT-CWT and
DT-RCWF. In [74], rotation-invariant and scale-invariant Gabor representations are
proposed, where each representation requires only a few summations on the conventional
Gabor filter impulse responses. The results show that the new implementations behave
better than the conventional Gabor-based scheme when rotated or scaled images are
considered. However, the conventional Gabor-based scheme provides better results when
no rotation or scaling is considered.
In [85], an MDFB is first proposed and compared with Gabor filters in polar
form [71] and the steerable pyramid [91] in terms of retrieval accuracy. In [92], fractal-code
signatures are proposed for texture-based retrieval of images. Fractal image coding
is a block-based scheme that exploits the self-similarity hidden within an image. By
combining fractal parameters and collage error, a set of statistical fractal signatures
is proposed. In [93], image signatures constructed from the bit planes of wavelet subbands
are presented: the bit plane (BP) signature and the three-pass layer probability (TPLP)
signature. Of these methods, the one that provides the highest ARR is filter
based and is the combination of DT-CWT and DT-RCWF implemented by Kokare et
al. [90].
5.3 AER implementation for texture retrieval
To illustrate the potential of the AER technique, we have implemented a multireso-
lution representation based on Gabor filters using the filter-based method proposed
by Manjunath [94]. This example adapts a known frame-based image processing al-
gorithm to the AER frame-less vision processing philosophy. The use of Gabor filters
in extracting textured image features is motivated by various factors. The Gabor
representation has been shown to be optimal in the sense of minimizing the joint
two-dimensional uncertainty in space and frequency [95]. These filters can be considered as
orientation and scale tunable edge and line (bar) detectors, and the statistics of these
microfeatures in a given region are often used to characterize the underlying texture
information. Gabor features have been used in several image analysis applications in-
cluding texture classification and segmentation [96]-[97], image recognition [98]-[100],
object recognition [1], image registration, medical applications [101][102] and motion
tracking [1], and it has been demonstrated that using the Brodatz texture database,
the Gabor features provide a very good pattern retrieval accuracy. Furthermore, since
Hubel and Wiesel’s [103] discovery of the crystalline organization of the primary visual
cortex in mammalian brains some thirty years ago, an enormous amount of experimen-
tal and theoretical research has greatly advanced our understanding of this area and
the response properties of cortex cells. On the theoretical side, an important insight
has been advanced by Marcelja [104] and Daugman [105][106], who suggest that simple
cells in the visual cortex can be modeled by Gabor functions. The 2D Gabor functions
proposed by Daugman are local spatial bandpass filters that achieve the theoretical
limit for conjoint resolution of information in the 2D spatial and 2D Fourier domains.
For all these reasons, we will exploit Gabor wavelets for the texture based retrieval of
image data. The focus of this AER convolution processing application is on the image
processing aspects of the texture based retrieval processes. We have developed an AER
architecture to obtain Manjunath’s Gabor wavelet features for texture analysis [94] and
provide a comprehensive experimental evaluation. These features are still widely
used today in many applications [107]-[116]. By performing texture analysis using Gabor
filters at different scales and orientations, these patterns can be efficiently described
in the frequency domain and localized in the spatial domain. Next we summarize the
sequence of computations performed in Manjunath’s method, and indicate how we have
adapted them for an AER hardware system.
5.3.1 Frame-based implementation for texture retrieval
In the method proposed by Manjunath [94], texture is analysed by applying a bank
of scale and orientation Gabor filters to an image. A two-dimensional (2-D) Gabor
function g(x,y) can be constructed as:
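\[ g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[ -\frac{1}{2}\left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) + 2\pi j W x \right] \qquad (5.1) \]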
where σx , σy, and W are its characteristic geometrical parameters. A class of self-
similar functions referred to as Gabor wavelets is now considered. Let g(x,y) be the
mother wavelet. A Gabor filter bank can be obtained by appropriate dilations and
rotations of g(x,y) through the generating function
\[ g_{s,k}(x, y) = a^{-s}\, g(x', y') \qquad (5.2) \]
\[ x' = a^{-s}(x\cos\theta + y\sin\theta) \qquad (5.3) \]
\[ y' = a^{-s}(-x\sin\theta + y\cos\theta) \qquad (5.4) \]
where θ = kπ/K is the orientation of the filter with respect to the vertical, k ∈ {0, ..., K − 1}
is the orientation index, and s ∈ {0, ..., S − 1} is the scale index. K is the total number
of orientations, and S is the total number of scales in the filter bank. The filter bank
parameters σx, σy, a, θ,W are computed by Manjunath’s method [94], given the input
specifications S, K, and the upper and lower center frequencies of the filters Uh and
Ul. Given an image I (x,y), its Gabor wavelet transform is then defined as
\[ W_{mn}(x, y) = \int I(x_1, y_1)\, g_{mn}^{*}(x - x_1, y - y_1)\, dx_1\, dy_1 \qquad (5.5) \]
where * indicates the complex conjugate. It is assumed that the local texture regions
are spatially homogeneous, and the mean µmn and the standard deviation σmn of the
magnitude of the transform coefficients are used to represent the region for classification
and retrieval purposes:
\[ \mu_{mn} = \iint |W_{mn}(x, y)| \, dx\, dy \qquad (5.6) \]
\[ \sigma_{mn} = \sqrt{\iint \left( |W_{mn}(x, y)| - \mu_{mn} \right)^2 dx\, dy} \qquad (5.7) \]
As we will see below, in our AER implementation we will not compute σmn as given in
eq. 5.7, but
\[ S_{mn} = \sqrt{\iint \big|\, |W_{mn}(x, y)| - \mu_{mn} \big| \, dx\, dy} \qquad (5.8) \]
without any degradation in performance. A feature vector is now constructed using
µmn and σmn as feature components. In the experiments, we use four scales S = 4 and
six orientations K = 6, resulting in a feature vector:
\[ FE = [\mu_{11}\,\sigma_{11}\,\mu_{12}\,\sigma_{12}, \ldots, \mu_{46}\,\sigma_{46}] = [\mu_{mn}\,\sigma_{mn}]_{m=1,\ldots,4;\; n=1,\ldots,6} \qquad (5.9) \]
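
As a rough illustration of the frame-based computation, the following Python sketch samples a Gabor wavelet following Eqs. (5.1)-(5.4) and computes the two statistics that make up each pair of feature components. It is a sketch under stated assumptions: the default parameter values (`a`, `sigma_x`, `sigma_y`, `W`) are placeholders rather than the values given by Manjunath's design procedure, the helper names are hypothetical, and the mean and standard deviation are normalised by the number of samples.

```python
import cmath
import math

def gabor(x, y, sigma_x, sigma_y, W):
    """Mother Gabor function g(x, y) of Eq. (5.1)."""
    norm = 1.0 / (2.0 * math.pi * sigma_x * sigma_y)
    envelope = -0.5 * (x * x / sigma_x ** 2 + y * y / sigma_y ** 2)
    return norm * cmath.exp(envelope + 2j * math.pi * W * x)

def gabor_wavelet(x, y, s, k, K, a, sigma_x, sigma_y, W):
    """Dilated and rotated wavelet, following Eqs. (5.2)-(5.4)."""
    theta = k * math.pi / K
    scale = a ** (-s)
    xr = scale * (x * math.cos(theta) + y * math.sin(theta))
    yr = scale * (-x * math.sin(theta) + y * math.cos(theta))
    return scale * gabor(xr, yr, sigma_x, sigma_y, W)

def sample_kernel(size, s, k, K=6, a=2.0, sigma_x=2.0, sigma_y=2.0, W=0.25):
    """Real part of the wavelet on a size x size grid (placeholder parameters)."""
    half = size // 2
    return [[gabor_wavelet(x, y, s, k, K, a, sigma_x, sigma_y, W).real
             for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]

def mean_and_std(magnitudes):
    """Discrete counterparts of Eqs. (5.6) and (5.7), normalised by sample count."""
    n = len(magnitudes)
    mu = sum(magnitudes) / n
    sigma = math.sqrt(sum((m - mu) ** 2 for m in magnitudes) / n)
    return mu, sigma

kernel = sample_kernel(5, s=0, k=0)
print(len(kernel), len(kernel[0]))
print(mean_and_std([1.0, 1.0, 3.0, 3.0]))
```

A full feature vector would repeat this for the 4 scales and 6 orientations, convolving the image with each kernel and applying `mean_and_std` to the magnitude of each response.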
Consider two image patterns i and j, and let FEi and FEj represent the corresponding
feature vectors. The distance between the two patterns in the feature space is then
defined as
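\[ d(i, j) = \sum_{m} \sum_{n} d_{mn}(i, j) \qquad (5.10) \]
where
\[ d_{mn}(i, j) = \left| \frac{\mu_{mn}^{(i)} - \mu_{mn}^{(j)}}{\alpha(\mu_{mn})} \right| + \left| \frac{\sigma_{mn}^{(i)} - \sigma_{mn}^{(j)}}{\alpha(\sigma_{mn})} \right| \qquad (5.11) \]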
with α(µmn) and α(σmn) being the standard deviations of the respective features over
the entire database. They are used to normalize the individual feature components.
For database texture retrieval, the feature vector FEi of a new input image is compared
with a precomputed database of feature vectors FEj. Computation of d(i, j) is fast
and can be done using simple computations on conventional FPGA- or DSP-like
circuits. However, computing the feature vector is a slow process in conventional
computers. Other authors consider other distance measures, such as the Mahalanobis,
Bhattacharyya or Euclidean distances [117][118]. We have tested all of them, but the
distance measure providing the best result was that described in eq. (5.11).
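
The normalised distance of Eqs. (5.10) and (5.11) can be sketched in a few lines of Python. Feature vectors are treated here as flat lists of components, and the helper names `feature_std` and `distance` are illustrative. The sketch assumes that no feature component is constant over the database, since a zero standard deviation would cause a division by zero.

```python
import math

def feature_std(database):
    """Standard deviation of each feature component over the whole database."""
    n = len(database)
    dims = len(database[0])
    stds = []
    for c in range(dims):
        column = [fv[c] for fv in database]
        mean = sum(column) / n
        stds.append(math.sqrt(sum((v - mean) ** 2 for v in column) / n))
    return stds

def distance(fe_i, fe_j, alpha):
    """Sum of absolute component differences, each normalised by alpha."""
    return sum(abs((a - b) / s) for a, b, s in zip(fe_i, fe_j, alpha))

# Hypothetical two-component feature vectors.
database = [[0.0, 2.0], [2.0, 6.0]]
alpha = feature_std(database)
print(alpha)
print(distance(database[0], database[1], alpha))
```

Because the sum runs over all components, the same function covers both terms of Eq. (5.11) once the means and deviations are interleaved in the vector as in Eq. (5.9).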
In the next subsection we show how FEi can be computed very quickly
using AER hardware convolution modules.
5.3.2 AER-based implementation for texture retrieval
Our resulting AER system implements a slightly modified version of the system originally
proposed by Manjunath for texture retrieval. The AER system is shown in Fig.
5.1. It has three layers. The first one is composed of a splitter module and 24 AER
convolution chips in parallel. It implements a Gabor filter bank with 4 scales and 6
orientations. In [119] this configuration of filters was demonstrated to provide the best
results. In Fig. 5.1, a texture image is coded by events at intervals of 50 ns. These
events are fed to a splitter module that replicates them on the 24 output channels.
Each output channel is connected to a convolution module gmn that uses as kernel the
real part of a Gabor wavelet with scale m and orientation n.

Figure 5.1: Scheme of the AER-based system implemented for texture-based retrieval of images
The sign of the output events from the convolution modules is changed to positive
(this is a full-wave rectification). This way, the output at each convolution module is
|Wmn(x, y)| (represented as cmn in Fig. 5.1).
Note that adding more chips to layer ‘1’ increases the number of scales and orientations
in the bank of Gabor filters, which helps to improve classification performance.
Note, however, that adding more chips to layer ‘1’ does not increase the processing delay
of the hardware.
Layer ‘2’ consists of 24 Feature Extraction Modules (FEM in Fig. 5.1). A FEM
is shown in Fig. 5.2. The first block in the FEM is a splitter with three
output channels. The top channel (labelled ‘2’ in Fig. 5.2) goes directly to layer ‘3’,
thus providing an AER representation for |Wmn(x, y)|. The bottom channel (labelled
‘5’) goes to an internal merger module with a hardwired positive sign. The central
channel (labelled ‘3’) goes to an internal mapper. This mapper ignores the address of
the incoming event, and generates a new address by sequentially sweeping all addresses.
Consequently, at the mapper output a uniform AER image is represented with the same
number of events as |Wmn(x, y)|. Thus, this output is an estimate of the mean µmn
in eq. (5.6). This mean is fed to the internal merger with a hardwired negative sign.
Consequently, at the merger output we have all |Wmn(x, y)| events with a positive sign
and all µmn events with a negative sign. After convolving them with a unitary kernel
C and changing the negative output event signs to positive, the output will represent:
\[ S_{mn}(x, y) = \big|\, |W_{mn}(x, y)| - \mu_{mn} \big| \qquad (5.12) \]

Figure 5.2: Scheme of a generic FEM used in Fig. 5.1.
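
The behaviour of a FEM can be approximated at the level of per-pixel event counts with the following Python sketch. It is a deliberate simplification: it accumulates counts rather than modelling the integrate-and-fire dynamics of the unitary-kernel convolution chip, and the function name `fem_counts` and the flat address layout are illustrative assumptions.

```python
def fem_counts(addresses, n_pixels):
    """Count-level model of one FEM.

    addresses: pixel addresses of the rectified input events |Wmn(x, y)|.
    Returns (w, uniform, s), each a list with one entry per pixel.
    """
    w = [0] * n_pixels        # events per pixel in |Wmn(x, y)|
    uniform = [0] * n_pixels  # events per pixel at the mapper output
    for i, addr in enumerate(addresses):
        w[addr] += 1
        # The mapper ignores the incoming address and sweeps all addresses in turn.
        uniform[i % n_pixels] += 1
    # Positive |Wmn| events minus negative mean events, then rectified.
    s = [abs(p - q) for p, q in zip(w, uniform)]
    return w, uniform, s

# Hypothetical burst of four events on a four-pixel array.
w, uniform, s = fem_counts([0, 0, 0, 1], n_pixels=4)
print(w, uniform, s)
print(sum(w), sum(s))
```

The two totals printed at the end are the quantities that a downstream event counter would see for this channel pair.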
Finally, for each of its inputs |Wmn(x, y)| and Smn(x, y), layer ‘3’ will count the
total number of events (regardless of their addresses) per unit time. We will use these
numbers to create our feature vector described as:
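\[ FE = [W_{11}\,S_{11}\,W_{12}\,S_{12}, \ldots, W_{46}\,S_{46}] \qquad (5.13) \]
where
\[ W_{mn} = \sum_{x} \sum_{y} W_{mn}(x, y); \qquad S_{mn} = \sum_{x} \sum_{y} S_{mn}(x, y) \qquad (5.14) \]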
Wmn and Smn are an estimate of µmn and an approximation to |σmn| in Eqs.
(5.6) and (5.8), respectively, and consequently they can be used as features for the input texture.
Although this feature vector is slightly different from Manjunath’s vector described in eq.
(5.9), this does not introduce any significant deviations, as will become apparent in the
Section on experimental results. This feature vector, obtained without a frame-based
scheme, will then be compared with Manjunath’s frame-based method and other state-of-the-art
frame-based methods on the Brodatz database [120] using Eqs. (5.10) and (5.11).
We have computed distances dmn in Eq. (5.10) between two feature vectors FEi
and FEj as:
\[ d_{mn}(i, j) = \left| \frac{W_{mn}^{(i)} - W_{mn}^{(j)}}{\alpha(W_{mn})} \right| + \left| \frac{S_{mn}^{(i)} - S_{mn}^{(j)}}{\alpha(S_{mn})} \right| \qquad (5.15) \]
There is an abundance of literature in the field of texture based image retrieval.
Some use different filters [121][62], different features [111], [62], [122]-[130], different
distance measures [117][131] or different numbers of scales or orientations [132]-[134].
But all of them can be mapped to an AER architecture similar to the one we have de-
scribed. The AER hardware implementation technique is therefore not restricted to the
particular example we have picked for illustration purposes. In the next Section we
provide realistic performance characteristics for a prospective AER system implementing
the texture classification system described above.
5.4 Experimental Results
In this Section we provide a realistic performance evaluation of a prospective hardware
implementation. To this end, we have used our AER C++ behavioral simulator tool.
As in the rest of the applications presented in this thesis, the performance character-
istics of the AER modules employed here (convolution chips, mergers, splitters, and
mappers) are obtained from already manufactured, tested and reported AER modules
[50]-[55]. Using the module performance characteristics together with the AER behav-
ioral simulator, we can obtain a very good estimate of the overall system performance.
Below, we illustrate the performance obtained for the AER-based analysis system with
the 48 convolution modules described previously.
When researchers in texture retrieval and classification test their proposed
approaches, they mainly use two different databases, the Brodatz database [120] and
the VisTex database [135]. Compared to Brodatz textures, VisTex images are less
suitable for texture classification. Unlike Brodatz textures, the images in VisTex do
not conform to rigid frontal plane perspectives and studio lighting conditions¹, which
brings in a large variation of scale, rotation, contrast and perspective. For these reasons,
we have chosen the Brodatz database as the benchmark for texture recognition. Some
researchers use a part of the Brodatz database to show their results (normally forty
classes), but this does not always prove the validity of their algorithms, because
a method may be good for some classes of textures but provide bad results for
the rest. For this reason, we have preferred to compare our approach with the state-of-the-art
methods that make use of the entire Brodatz database [84][88][89][90][74][85][94]
and two methods that use a part of the Brodatz database [92][93]. We have made our
comparison in terms of average retrieval rate (ARR), feature extraction time (FET)
and total computational time.
For this purpose we have used the entire Brodatz database [120], which consists
of 112 images. Each image has been divided into sixteen 90x90 nonoverlapping
subimages, thus creating a database of 1792 texture images. These images have been
rate-coded into events separated by 50 ns, creating stimulus bursts of 30 ms on average².
We used our C++ behavioral simulation tool to estimate the performance of a prospective
hardware implementation. The 48 channel outputs of Layer 2 (see Fig. 5.1) obtained
for each of the images in the database were collected during 30 ms (the duration of the input
burst) to create the feature vector database. In what follows, a query pattern is any
one of the 1792 patterns in the database. This pattern is processed to compute
the feature vector as in Eq. 5.13. The distances d(i, j), where i is the query pattern
index and j is the index of a pattern from the database (with i ≠ j), are computed
and sorted in increasing order. Only the closest set of patterns is retrieved. Ideally,
all top 15 retrievals are from the same large image.
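
The text does not detail how the images are rate-coded, so the following Python sketch shows only one possible deterministic scheme, given as an assumption: each pixel accumulates its intensity on every sweep and emits an event whenever the accumulator reaches the peak intensity, with consecutive events spaced 50 ns apart. The function name `rate_code` and the accumulator scheme are hypothetical. Note that a 30 ms burst at 50 ns spacing corresponds to about 600,000 events.

```python
def rate_code(pixels, n_events, dt_ns=50):
    """Turn non-negative pixel intensities into a list of (time_ns, address) events.

    Brighter pixels fire proportionally more often (assumed scheme).
    """
    peak = max(pixels)
    if peak <= 0:
        return []  # a blank image produces no events
    acc = [0] * len(pixels)
    events = []
    while len(events) < n_events:
        for addr, value in enumerate(pixels):
            acc[addr] += value
            if acc[addr] >= peak:
                acc[addr] -= peak
                events.append((len(events) * dt_ns, addr))
                if len(events) == n_events:
                    return events
    return events

# Hypothetical three-pixel image: pixel 0 is twice as bright as pixel 1.
events = rate_code([4, 2, 0], n_events=6)
print(events)
```

In this toy example pixel 0 produces four events and pixel 1 produces two, so the event counts preserve the 2:1 intensity ratio.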
The performance is measured in terms of the average retrieval rate (ARR), which
is defined as the average percentage of patterns belonging to the same image as
the query pattern in the top 15 matches.
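
The ARR definition can be written as a short Python sketch. It assumes a precomputed matrix of distances d(i, j) and one class label per pattern (the large image each subimage was cut from); the function name and argument layout are illustrative.

```python
def average_retrieval_rate(labels, distance_matrix, top=15):
    """Average percentage of same-class patterns among the `top` closest matches."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        # Every pattern except the query itself, sorted by increasing distance.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: distance_matrix[i][j])
        hits = sum(1 for j in others[:top] if labels[j] == labels[i])
        total += hits / top
    return 100.0 * total / n

# Hypothetical toy database: two classes on a line, distances are absolute differences.
points = [0.0, 1.0, 10.0, 11.0]
labels = ["a", "a", "b", "b"]
dist = [[abs(p - q) for q in points] for p in points]
print(average_retrieval_rate(labels, dist, top=1))
print(average_retrieval_rate(labels, dist, top=2))
```

With sixteen subimages per class, the ideal case of 15 correct retrievals out of 15 gives a rate of 100 for that query.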
Table 5.1 summarizes the results. It shows the retrieval accuracy for each of the
112 texture classes in the database when we compare our AER-based method with the
original Manjunath frame-based results. As can be seen, the retrieval accuracies are
approximately equal.
¹ Quoted from the VisTex Web Site.
² This burst time is conceptually comparable to the frame time in a frame-based system.
IMAGE    FRAME-BASED  AER-BASED  IMAGE    FRAME-BASED  AER-BASED  IMAGE    FRAME-BASED  AER-BASED
D1       100          100        D39      47.67        40.49      D77      100          100
D2       67.39        64.58      D40      25.75        38.44      D78      88.21        87.64
D3       100          100        D41      79.45        41         D79      100          100
D4       100          100        D42      19.72        22.04      D80      100          100
D5       39.45        65.09      D43      37.81        42.54      D81      100          100
D6       100          100        D44      40.55        37.41      D82      100          100
D7       19.18        22.55      D45      10.41        13.32      D83      100          100
D8       88.21        100        D46      86.02        66.62      D84      100          100
D9       93.69        96.35      D47      100          100        D85      100          100
D10      72.87        69.7       D48      75.06        57.91      D86      46.03        64.06
D11      100          100        D49      100          100        D87      100          100
D12      100          100        D50      82.74        89.69      D88      24.11        26.65
D13      19.18        23.06      D51      100          99.43      D89      19.18        33.31
D14      27.4         29.21      D52      89.31        58.94      D90      52.6         37.41
D15      78.9         99.43      D53      100          100        D91      15.89        16.4
D16      100          100        D54      80           87.64      D92      100          99.42
D17      100          100        D55      100          100        D93      100          98.91
D18      52.6         66.62      D56      100          100        D94      100          100
D19      100          90.71      D57      100          100        D95      100          97.37
D20      100          100        D58      14.25        15.37      D96      66.85        72.26
D21      100          100        D59      37.81        45.1       D97      36.16        59.96
D22      100          100        D60      39.45        51.25      D98      33.97        43.56
D23      21.92        33.31      D61      33.42        44.07      D99      19.72        24.09
D24      100          100        D62      32.33        33.83      D100     45.48        49.2
D25      56.98        55.86      D63      38.35        43.56      D101     100          99.43
D26      96.43        71.24      D64      100          100        D102     100          100
D27      31.23        38.44      D65      100          100        D103     98.08        100
D28      64.65        75.85      D66      76.71        87.12      D104     77.26        87.64
D29      100          100        D67      49.86        48.18      D105     100          94.3
D30      24.66        36.9       D68      100          100        D106     100          100
D31      21.37        15.89      D69      80           83.54      D107     32.87        11.79
D32      100          100        D70      95.89        97.89      D108     13.15        13.84
D33      94.79        100        D71      98.08        100        D109     72.32        66.11
D34      100          100        D72      35.07        33.82      D110     100          86.1
D35      100          100        D73      20.27        27.68      D111     71.78        78.41
D36      95.89        84.05      D74      56.44        37.93      D112     52.6         57.4
D37      100          100        D75      84.38        92.76
D38      100          93.79      D76      100          100        AVERAGE  73.21        73.89
Table 5.1: Retrieval Performance for Each of the 112 Brodatz Images. Comparison Between Manjunath’s Frame-Based Method and the AER-Based Proposed Approach. Values are retrieval accuracies in percent.
To estimate the minimum time for correct texture retrieval, we proceeded as follows.
Input stimuli lasted about 30 ms. Layer 3 counts the events coming from the 48 Layer 2
output channels during a time Tcount, which was increased in steps of 15 µs from 0
to 30 ms. We found that for Tcount approximately equal to 10 ms the results were similar
to those shown in Table 5.1. Consequently, an AER hardware implementation would
be able to achieve correct texture retrieval in about Trcg = 10 ms. As an illustration, Fig.
5.3 shows the retrieval accuracy as a function of Tcount for six of the texture images in
[120]. As can be seen, the retrieval accuracy has stabilized after 10 ms, that is, 20 ms
before the input stimulus has finished.
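The sweep over Tcount described above can be sketched as follows. This is only a minimal illustration, not the simulator used in this work: the event format (timestamp in seconds, channel index) and the function names are assumptions, and the Layer 3 feature is reduced here to a plain per-channel event count.

```python
from typing import Iterator, List, Tuple

Event = Tuple[float, int]  # (timestamp in seconds, Layer 2 channel index); assumed format


def count_events(events: List[Event], t_count: float, n_channels: int = 48) -> List[int]:
    """Count, per channel, the events whose timestamp is below t_count."""
    counts = [0] * n_channels
    for t, ch in events:
        if t < t_count:
            counts[ch] += 1
    return counts


def sweep_t_count(events: List[Event], t_max: float = 30e-3, step: float = 15e-6,
                  n_channels: int = 48) -> Iterator[Tuple[float, List[int]]]:
    """Yield (t_count, counts) for t_count = 0, step, 2*step, ..., t_max."""
    n_steps = int(round(t_max / step))
    for k in range(n_steps + 1):
        t_count = k * step
        yield t_count, count_events(events, t_count, n_channels)
```

Each count vector produced by the sweep would then be fed to the retrieval stage, so that the retrieval accuracy can be plotted against Tcount as in Fig. 5.3.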
Figure 5.3: Texture retrieval accuracy obtained for images D1-D2-D3-D8-D9-D10 as a function of Tcount (in milliseconds)
5.4.1 Comparison with the State-of-the-Art
In Table 5.2 we compare our AER event-based method with those reported in [74],
[84], [85], [88]-[90], [92], [93] and with Manjunath's approach [94] in terms of average retrieval
rate (ARR) using the entire Brodatz database. In Table 5.3, we compare our method
with those published in [84], [90], [92], and [93] and also with Manjunath’s method [94],
in terms of computation times. We distinguish between a FE time (time required to
obtain a feature vector) and a searching and sorting time (additional time to classify
the texture: computation of terms dij, sorting them, and selecting the best match).
The sum of both is the total computation time. Note that, because of the conceptual
difference between a frame- and an event-based approach, total computation time for
a frame-based system is TFC (as defined in Fig. 5.4), while for an event-based system
it is Trcg (as defined in Fig. 5.4). Consequently, comparing the computational delay of
the two approaches by simply comparing times TFC and Trcg is not a fair comparison.
Figure 5.4: Comparison between frame-based and AER-based systems
It is more realistic to compare either Tframe + TFC against Trcg, or the time elapsed between
the instant a frame is fully available (T1 + ∆ in Fig. 5.4) and the instant the computing system
provides a recognition result: TFC for a frame-based system against T'FC = Trcg + td − Tframe (see
Fig. 5.4) for an event-based system. Note that the latter ends up being negative.
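As a rough worked example of the second comparison, assume that Tframe equals the stimulus duration of about 30 ms and take Trcg ≈ 10 ms from the retrieval experiment. Both identifications are assumptions made here only for illustration, and td is left symbolic because its value depends on the hardware:

$$T'_{FC} = T_{rcg} + t_d - T_{frame} \approx 10\,\mathrm{ms} + t_d - 30\,\mathrm{ms} = t_d - 20\,\mathrm{ms}$$

Under these assumptions, T'FC is negative whenever td is below 20 ms.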
5.5 Discussion
AER is an emerging hardware technology with great potential for providing complex
cortical-like sensory-processing systems. Of special interest is its potential for provid-
ing very fast spike-processing convolutional neural networks with complex hierarchical
structures, similar to those found in biological cortex. Recent work on individual AER
convolutional chips reveals the outstanding capabilities of such components as “bricks”
for larger highly sophisticated and hierarchically structured cortical-like sensory pro-
cessing systems. To date, the largest AER multimodule system reported uses only
four processing stages, one of which is a convolution [36]. We believe that we are not
far from seeing systems made out of several hundreds (or thousands) of AER convolutional
modules in the near future.

METHOD | NUMBER OF CLASSES CONSIDERED (%) | ARR (%)
MDFB [84] | 100 | 73,00
FAST MDFB1 [84] | 100 | 73,00
FAST MDFB2 [84] | 100 | 73,00
CONTOURLET TRANSFORM [84] | 100 | 71,00
LOCAL AFFINE REGIONS [88] | 100 | 76,26
LOCALLY INVARIANT DESCRIPTORS [89] | 100 | 78,50
STANDARD REAL DWT [90] | 100 | 64,17
DT-CWT [90] | 100 | 76,83
COMBINATION OF DT-CWT AND DT-RCWF [90] | 100 | 78,93
ROTATION INVARIANT GABOR FEATURE [74] | 100 | 59,00
SCALE INVARIANT GABOR FEATURE [74] | 100 | 57,00
MDFB [85] | 100 | 72,10
STEERABLE PYRAMID [85] | 100 | 69,60
FRACTAL-CODE SIGNATURES [92] | 36 | 53,2 - 85,3
TPLP signature [93] | 36 | 82,10
MANJUNATH APPROACH [94] | 100 | 73,2
AER-BASED APPROACH | 100 | 73,89
Table 5.2: Comparison of Average Retrieval Rate Between Different Methods Using the Brodatz Database

NoC (Network-On-Chip1) technology [136] could
host around 100 individual convolutional modules on a single chip, and about 100 such
chips could be put on one single PCB (printed circuit board). Consequently, a small
physical volume like a desktop computer could easily hold 20-40 such PCBs, providing
between 200,000 and 400,000 convolution modules, that is, almost half a million. However, currently, it is not
obvious what architectural structures should be used to assemble these emulated AER
convolutional “bricks” and how to set their parameters for a desired (recognition) appli-
cation. In this chapter, we have concentrated on one such possible application, texture
recognition, emulated it with a behavioral AER simulator, and used it as an exercise
to see how to set up such a system, its parameters, and estimate the performance of
multilayer AER convolutional systems. Some software computational works are start-
ing to appear in the literature that use massive convolutions for vision processing.
For example, in texture recognition, experiments in recent years have demonstrated
that filter-based schemes provide excellent results [62], [81]-[83].

1 In a NoC system, modules such as processor cores, memories and specialized IP blocks exchange
data using a network as a “public transportation” sub-system for the information traffic.

METHOD | Feature Extraction (FE) time (s) | Searching and Sorting time (s) | Total time (s), FRAME-based TFC | Total time (s), AER-based Trcg | SOFTWARE | HARDWARE
MDFB [84] | - | - | 2,59 | | Matlab 6.5 | CPU of Intel Pentium 4, 2.4 GHz
FMDFB1 [84] | - | - | 1,69 | | Matlab 6.5 | CPU of Intel Pentium 4, 2.4 GHz
FMDFB2 [84] | - | - | 1,62 | | Matlab 6.5 | CPU of Intel Pentium 4, 2.4 GHz
CONTOURLET TRANSFORM [84] | - | - | 1,38 | | Matlab 6.5 | CPU of Intel Pentium 4, 2.4 GHz
STANDARD REAL DWT [90] | 0,47 | 0,06 | 0,53 | | Matlab 5.3 | CPU of Intel Pentium III, 866 MHz
DT-CWT [90] | 0,56 | 0,06 | 0,62 | | Matlab 5.3 | CPU of Intel Pentium III, 866 MHz
DT-CWT AND DT-RCWF [90] | 1,05 | 0,09 | 1,14 | | Matlab 5.3 | CPU of Intel Pentium III, 866 MHz
FRACTAL-CODE SIGNATURES [92] | - | - | 0,42 - 18 | | - | CPU of Intel Pentium 4 (2 GHz)
TPLP [93] | 3,3 | 4,78 | - | | Visual C++ 6.0 | CPU of Intel Pentium 4 (2 GHz)
MANJUNATH APPROACH [94] | 9,3 | 1,02 | 10,32 | | Matlab 5.0 | CPU of SUN Sparc20
AER-BASED APPROACH | 0,01 | 0,01 | - | 0,02 | - | AER-based DEVICES
Table 5.3: Comparison of Computational Times Between Different Methods Using the Brodatz Database

However, massive convolutions on conventional computers result in excessive computational times, making
such approaches impractical for real-world applications. In general, vision processing
researchers tend to avoid the use of convolutional processing because of its excessive
computational load. For example, quoting Serre et al. [1] who use a first stage with
64 Gabor filters (for an input image of 128x128 pixels), the main limitation of their
powerful recognition system is the delay of this first stage, which requires several tens
of seconds. An AER-based spiking hardware could perform this processing with delays
of a few milliseconds, or fractions of milliseconds, while the visual input is being sensed.
In all reported approaches for texture recognition, there is a relationship between the
length of the feature vector and the computational time. The longer the feature vector,
the longer the feature extraction time. In AER convolutional hardware, this is not the
case, because all the elements of the feature vector are computed in parallel. Conse-
quently, it is possible to increase the feature vector length or elements [74] to improve
retrieval rate, without increasing feature extraction time, although at the cost of using
more hardware “bricks”. Actually, novel approaches for texture retrieval are based on
the use of filters that take into account more frequencies or scales [84], [90] and produce
less redundant features as compared to other wavelets (Gabor wavelet in our case). In
AER convolutional hardware, increasing the number of convolutional filters does not
degrade speed response of the overall system. This is because the filters receive the
same input events simultaneously and process them in parallel. There will be some
delay the hardware will add to distribute the events to a larger number of receivers,
but this extra delay will be in the order of nanoseconds, and consequently not perceived
by the overall system. For present-day reported AER links, a typical bandwidth is on
the order of 10-30 Meps (mega events per second). The output event rate of retina sensors
is usually below 1 Meps. However, when merging several AER module outputs into
one single AER channel, especially if we are thinking of several hundreds for the near
future, it is realistic to expect that the limited AER link bandwidth could easily end
up being the main delay bottleneck for such systems. Solutions for this problem could
be to do a hierarchical merging of outputs combined with replicating the number of
AER links to increase bandwidth.
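The bandwidth argument above can be made concrete with a short back-of-the-envelope sketch. The per-module event rate used in the usage note is a hypothetical figure chosen only for illustration; the link bandwidths are the 10-30 Meps range quoted in the text, and the function names are not part of any AER tool.

```python
import math


def is_bottleneck(n_modules: int, rate_per_module_meps: float,
                  link_bandwidth_meps: float) -> bool:
    """True when a single AER link cannot carry the merged event traffic."""
    return n_modules * rate_per_module_meps > link_bandwidth_meps


def links_needed(n_modules: int, rate_per_module_meps: float,
                 link_bandwidth_meps: float) -> int:
    """Number of parallel AER links needed to carry the merged traffic."""
    total_rate = n_modules * rate_per_module_meps
    return max(1, math.ceil(total_rate / link_bandwidth_meps))
```

For instance, merging 300 modules that each emit a hypothetical 0.5 Meps gives 150 Meps, which exceeds even a 30 Meps link and would require 15 parallel links of 10 Meps each.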
Also, we have observed that event traffic is higher for the first stages and is gradually
reduced as convolutional processing compresses and extracts relevant information.
Perhaps the most interesting observation is that in AER sensory processing hardware,
processing is performed as events flow between modules. As a retina is sending out its
events they are sent directly to the processing structure and are processed as they flow
in. In the same way, each “brick” processes its input events as they flow in and gener-
ates new ones. This way the whole system operates as if a wave of (visual) information
(in the form of flow of events) travels through the convolutional structure while it is
processed. Since processing is on a per event basis, stages do not wait for transmitting
full “images” before processing them, thus reducing drastically the latency between
input and output information flow.
What we have found with the specific example in this chapter is that when mapping
a known convolutional processing (frame-based) algorithm to AER hardware: 1) the
recognition performance remains similar, and also comparable to state-of-the-art
computational methods not based on convolutions (or filters), and 2) if some day we are
able to physically build this hardware, it will be capable of providing output recognition
while the input stimulus is being produced by the sensors.
5.6 Conclusions and Future Work
The application presented in this chapter shows performance results for a relatively
large multimodule, multilayer convolutional neural network frameless AER processing
system, estimated through behavioral simulations but using performance figures of real
individual AER hardware modules already available. A texture classification system
based on Manjunath’s method has been analyzed. This scheme uses 48 AER convolutional
modules plus a similar number of interfacing modules, such as splitters, mergers,
and mappers. We have shown that the recognition performance of the AER system is
equivalent to its original frame-based reference. However, if built with realistic AER
hardware, recognition is achieved while the sensory stimulus is being generated. This
would be equivalent to stating that an AER system has a negative processing delay
when compared to a frame-based system, where each frame has to be fully available be-
fore starting any recognition computation. Thus, AER systems reveal some interesting
properties. First, they are not constrained to frames and the output is often available
even before the input stimulus has finished. Processing delay is given mainly by the
number of layers and the number of events needed to represent the input stimulus.
The processing capability of such systems is increased by adding more modules per
layer, but without increasing the number of layers. Consequently, processing capability
can be increased without penalizing delays, although at the cost of adding hardware.
Currently, the available AER hardware modules are quite preliminary, although their
performance figures provide very promising system level performance estimations.
Chapter 6
EVENT-DRIVEN CONVOLUTIONAL NETWORKS FOR FAST VISION POSTURE RECOGNITION
In this chapter a bio-inspired six-layer frame-free event-driven convolutional network
for people recognition is proposed. The system consists of six feed-forward layers and
22 AER convolution modules. Its corresponding frame-based version was trained using
32x32 images reconstructed by collecting output spikes from a 128x128 AER motion (temporal
contrast) electronic retina. The weights obtained during the training
stage were then used in the frame-free version of the system. This frame-free implementation
was tested with output spikes obtained from the same retina chip. We provide
simulation results of the system trained for people recognition, showing recognition
delays of a few milliseconds from stimulus onset.
6.1 Motivation
Nowadays, power and speed requirements for sophisticated tasks such as people or
object tracking and recognition, fabrication and quality of components, vision processing,
etc., impose strong real-time restrictions. Frame-based systems have difficulty
meeting such restrictions, as they are not able to work in real time, mainly when
several parallel processing layers are involved. In the last decades, a series of multi-
layer frame-based systems with short time responses have been proposed to solve and
accelerate complicated tasks emulating brain behavior [137]-[150], but the most suc-
cessful systems have been those based on convolutional networks trained using some
kind of continuous-time gradient-based learning algorithms [151]-[162]. Note that the
high number of connections present between neurons in perceptron neural networks is
reduced considerably in convolutional networks (ConvNets) since the connections are
shared (weight-sharing) and the connection weights are those stored in the convolu-
tion masks. Application examples of Convolutional Networks are object recognition
and scene analysis [1], image segmentation and biological image analysis (brain circuit
reconstruction) [151], natural language processing and understanding [152], biological
image analysis, object recognition and visual navigation for robots [153] and many
others [154]-[162]. In Convolutional Networks, early stages extract elementary visual
features such as oriented edges, endpoints, corners which are then combined by sub-
sequent layers in order to detect higher order features. Early stages usually operate
with small but dense convolution masks, while later stages use longer range but sparser
masks [1]. Example ConvNet systems for face and character recognition applications
may have several tens to hundreds of filters per layer. There are many reasons that
motivate the choice of ConvNets to implement bio-inspired tasks in object recognition.
One of them is that, compared to other neural networks, ConvNets have a graceful
scaling capability. To increase knowledge one simply has to increase the number of
filters in a layer. Thus, the number of neurons (pixels) scales linearly with the number of
filters, and as there is a fixed number of synapses per filter (the convolutional kernel
weights), the number of synapses also scales linearly with the number of filters. On
the other hand, the latency of the computing structure (if implemented as parallel
hardware) is determined mainly by the number of sequential layers, which is a reduced
number and does not change for a given application. Therefore, speed does not degrade
by adding more filters per layer (more knowledge). Consequently, ConvNets seem very
appealing for configurable, modular and scalable spiking hardware implementations.
Other important reasons motivating the use of ConvNets are that they combine three
architectural ideas to ensure some degree of shift, scale, and distortion invariance: 1)
Local receptive fields. If the input image is shifted, the filter output will be shifted
by the same amount. This property is at the basis of the robustness of convolutional
networks to shifts and distortions of the input. Once a feature has been detected, its
exact location becomes less important. Only its approximate position relative to other
features is relevant; 2) Shared weights (or weight replication). This leads to a great
reduction of the number of trainable weights and provides shifting independence. 3)
Spatial or temporal subsampling. Subsampling at upper layers also provides shift, scale
and distortion invariance.
The main disadvantage of the convolutional networks developed so far is that they are
mainly frame-based systems, and consequently they are not truly bio-inspired, as they
lack the continuous real-time spike processing implemented in the brain. They
have to wait to collect frames (or sections of them) in every layer to start processing.
A second drawback is that frame-based implementations are not good at handling
the massive interconnections usually present in ConvNets. In spite of the weight sharing
technique employed in ConvNets, each neuron in one layer is connected to a
number of neurons in the following layer and sometimes also to neurons in the same
layer. An example of a state-of-the-art frame-based successful ConvNet can be found
in [163].
All these drawbacks motivated our interest to design an AER system as an efficient
alternative which provides good results in speed and recognition at the same time.
Moreover, the massive interconnections present in ConvNets can be perfectly handled
by the AER protocol [30][43][31]. In this chapter we present a six-layer frame-free
ConvNet similar to the LeNet-5 convolutional network implemented by Y. LeCun [4] for
online handwriting recognition, but which is fully intended for an AER implementation
in a frame-free scheme. The implemented frame-free ConvNet detects people in vertical,
upside-down and horizontal positions captured with a real temporal contrast (motion)
128x128 AER retina [15].
In the next section, the implemented frame-based system is described. Then we
explain why this particular structure has been chosen over others with different
numbers of filters and parameters. Finally, the full AER-based version is described.
Figure 6.1: Frame-based ConvNet to detect people in up, upside-down or horizontal positions
Figure 6.2: Real scenarios where AER recordings with the motion retina were obtained
6.2 Frame-Based Convolutional Network
The frame-based version of our AER six-layer ConvNet is shown in Fig. 6.1. This
frame-based version of the system has been implemented to obtain the trained weights
to be used in the corresponding frame-free (AER-based) implementation. It has six
layers and it receives as inputs 32x32 pixel images obtained after collecting spikes from
an electronic 128x128 AER retina during 30ms. The AER silicon retina chip used in
our implementation [157] generates events corresponding to relative changes in image
intensity. The retina recordings were obtained in scenarios like those shown in Fig. 6.2,
where static distractor objects are mixed with walking people. Some images obtained
after collecting spikes during 30ms are shown in Fig. 6.3. Note that due to the retina
dynamic nature, only motion information is captured. This way, all the static objects
present in the scene are removed implying a first stage of processing implemented
directly at the sensor.
Figure 6.3: Images obtained by collecting input spikes from the retina every 30 ms. The second and third rows were obtained by previously rotating the input events by 90 and 180 degrees, respectively.
The retina 128x128 address space was downsampled to 32x32 pixels. The first
column in Table 6.1 shows the different parameters that are used in the frame-based
implementation of the system.
Table 6.1: Parameters used in the frame-based and AER-based implementations
The output of each of the six layers in the system consists of a set of output images
or planes called “feature maps”, which are composed of arrays of neurons. Neurons
belonging to a feature map in one layer are only connected to neurons in feature maps
in the following layer through projection fields (convolution masks). There are no
connections between neurons inside one layer. A unit (pixel or neuron) located at
position $(i, j)$ inside a feature map $q$ (of size $K \times L$, $q = 1, \ldots, Q$) belonging to layer $l$ will
have a value $y_l^q(i, j)$ computed as:

$$x_l^q(i, j) = \sum_{p \in P} \sum_{m \in M} \sum_{n \in N} \left( y_{l-1}^p(m, n) \cdot W_l^{p,q}(i - m, j - n) \right) + b_l^q \tag{6.1}$$

$$y_l^q(i, j) = A \tanh\left( S \cdot x_l^q(i, j) \right) \tag{6.2}$$

where $P$ is the total number of feature maps in the preceding layer $l - 1$, $Q$ is the
total number of feature maps in the present layer $l$, $x_l^q(i, j)$ is the state of pixel $(i, j)$
at feature map $q$, $W_l^{p,q}$ is the convolution mask connecting input feature map $y_{l-1}^p$ to
output feature map $y_l^q$, $b_l^q$ is a bias, and $A$ and $S$ are constants. For our simulations,
we use $A = 1.7159$ and $S = 2/3$, as these particular values improve the convergence
towards the end of the learning session [4]. For simplification purposes, in the AER-based
hardware implementation we have not considered trainable biases.
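A minimal sketch of the frame-based forward pass of Eqs. (6.1) and (6.2) is given below in plain Python. It is meant only to make the indexing explicit and is not the training code used in this work: the function name, the nested-list data layout and the border handling (only positions where the mask fits entirely inside the input are kept) are assumptions.

```python
import math


def conv_layer(prev_maps, kernels, biases, A=1.7159, S=2.0 / 3.0):
    """Forward pass of one convolutional layer following Eqs. (6.1)-(6.2).

    prev_maps: list of P input feature maps (2-D lists of equal size)
    kernels:   kernels[p][q] is the 2-D convolution mask W^{p,q}
    biases:    list of Q biases b^q
    """
    P, Q = len(prev_maps), len(biases)
    H, W = len(prev_maps[0]), len(prev_maps[0][0])
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    out = []
    for q in range(Q):
        fmap = []
        for i in range(kh - 1, H):
            row = []
            for j in range(kw - 1, W):
                x = biases[q]
                for p in range(P):
                    for a in range(kh):
                        for b in range(kw):
                            # a = i - m and b = j - n in Eq. (6.1)
                            x += prev_maps[p][i - a][j - b] * kernels[p][q][a][b]
                row.append(A * math.tanh(S * x))  # Eq. (6.2)
            fmap.append(row)
        out.append(fmap)
    return out
```

With these conventions, a 3x3 input map and a 2x2 mask produce a 2x2 output map; the thesis layers use larger sizes and their own cropping of the filter output.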
The first layer C1 of the system in Fig. 6.1 is a Gabor filter bank [94] with six 10x10
filters (convolution masks) and six 28x28 feature maps. As done by LeCun [4], the size
of each feature map is 28x28 because we have only considered a square of 28x28 pixels
from each filter output. Each pixel in a feature map in C1 is connected to a square of
100 (10x10) pixels of the input, called the receptive field of the pixel. All the pixels
in a particular feature map in layer C1 share the same set of 100 weights, which are
the values of the corresponding filter (convolution mask). As each unit (neuron) has
100 inputs and the weights are shared for each feature map in C1, we would need
100 coefficients for each convolution mask (there are six feature maps each with its
corresponding convolution mask). In generic ConvNets, filtering layers are trainable.
However, unlike conventional ConvNets, we have chosen a fixed (non-trainable)
bank of six 10x10 Gabor filters with two scales and three orientations in this layer. Due
to the fixed weights, this layer has 0 trainable coefficients and 470400 connections (6 x 28 x 28 x 100).
The second hidden layer S2 is a subsampling layer with six feature maps of size
14x14 pixels (each feature map is connected to its corresponding feature map in layer C1).
A subsampling layer in generic ConvNets performs a smoothing of the input followed by
a subsampling operation by two (in rows and columns), thereby reducing the resolution
of the feature map and the sensitivity of the outputs to shifts and distortions. The
receptive field of each pixel at this layer is a 2x2 area in the corresponding feature map
of the previous layer. In generic subsampling layers, each pixel computes the average of its
inputs multiplied by a trainable coefficient and adds a trainable bias to the sum. Then
the result is passed to a sigmoid function. As contiguous pixels have nonoverlapping
contiguous receptive fields, these subsampling layer feature maps have half the number
of rows and columns as the feature maps in layer C1.
For simplification purposes (especially for the eventual hardware implementation)
we have implemented subsampling layers as addition layers with a multiplying factor
of value 1. In other words, each pixel computes the sum of its four corresponding input
pixels in the previous layer. This simplification did not affect the recognition results
in our experiments but led to different weights obtained after training in the filtering
layers. With this simplification, subsampling layers have neither trainable coefficients
nor sigmoid functions. This implies that this layer has 0 trainable coefficients and 4704
connections.
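The simplified subsampling layer described above (a plain sum over non-overlapping 2x2 blocks, with no trainable coefficient, bias or sigmoid) can be sketched as follows. The function name and the nested-list representation of a feature map are assumptions made for illustration.

```python
def sum_subsample_2x2(fmap):
    """Sum each non-overlapping 2x2 block of a feature map (multiplying factor 1)."""
    rows, cols = len(fmap) // 2, len(fmap[0]) // 2
    return [
        [
            fmap[2 * r][2 * c] + fmap[2 * r][2 * c + 1]
            + fmap[2 * r + 1][2 * c] + fmap[2 * r + 1][2 * c + 1]
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

Applied to a 28x28 feature map of layer C1, this yields a 14x14 map in which each output pixel has four input connections, which gives the 6 x 14 x 14 x 4 = 4704 connections of layer S2.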
As done by Y. LeCun [4], the third layer C3 is a convolutional layer with four
10x10 pixel feature maps. Each pixel in the third layer has input connections from all
the six feature maps in the previous layer. This way each feature map p in layer S2
is connected to a feature map q in layer C3 through an independent projection field
(filter or convolution mask) $W_3^{p,q}$. Thus, there are 24 different trainable 5x5 filters (6
input feature maps and 4 output feature maps). At this stage each pixel in layer C3
is connected to a square of 5x5 input pixels in each of the six feature maps of S2, and the
result for each output pixel in C3 is passed through a sigmoid function. This layer has
600 trainable coefficients (24 x 25) and 60000 connections (4 x 10 x 10 x 6 x 25).
Layer S4 is a subsampling block again with four 5x5 output feature maps. The
number of trainable parameters is zero again and the number of connections is 400.
The fifth layer C5 is a convolutional layer with eight 1x1 feature maps. Each unit
(pixel) is connected to all the 5x5 pixels on all the feature maps in S4 through different
projection fields. This implies that there is a full connection between S4 and C5. In
this layer there are 32 trainable 5x5 filters (projection fields) connecting each of the four
5x5 feature maps in S4 with the eight 1x1 feature maps in C5. Thus, layer C5 has 800
trainable connections (25 weights in each one of the 32 convolution masks).
Finally, layer F6 contains 4 units and is fully connected to C5. It has 32 trainable
connections.
The mean-squared error for the output units can be computed as:
$$E = \sum_i (y_i - d_i)^2 \tag{6.3}$$

where $y_i$ and $d_i$ are the computed output and desired output at unit $i$, respectively.
This frame-based version of the categorizing system has 536336 connections and only
1432 trainable parameters. All the trainable parameters have been computed using the
backpropagation algorithm [4] (see Appendix Section 6.6).
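The connection and parameter totals quoted above can be cross-checked with a few lines of arithmetic. The table of layer sizes below simply restates the figures given in this section; the tuple layout and variable names are chosen here for illustration.

```python
# (layer, output maps, map height, map width, inputs per output pixel, trainable weights)
LAYERS = [
    ("C1", 6, 28, 28, 10 * 10, 0),          # fixed 10x10 Gabor filters
    ("S2", 6, 14, 14, 2 * 2, 0),            # 2x2 sums, no trainable coefficients
    ("C3", 4, 10, 10, 6 * 5 * 5, 24 * 25),  # 6 input maps, 24 masks of 5x5
    ("S4", 4, 5, 5, 2 * 2, 0),
    ("C5", 8, 1, 1, 4 * 5 * 5, 32 * 25),    # 4 input maps, 32 masks of 5x5
    ("F6", 4, 1, 1, 8, 4 * 8),              # fully connected to the 8 units of C5
]


def connections(layer):
    """Connections of a layer: output pixels times inputs per output pixel."""
    _, maps, height, width, fan_in, _ = layer
    return maps * height * width * fan_in


per_layer = {layer[0]: connections(layer) for layer in LAYERS}
total_connections = sum(per_layer.values())
total_trainable = sum(layer[5] for layer in LAYERS)
```

The per-layer values reproduce the numbers given in the text (470400, 4704, 60000, 400, 800 and 32), and the totals are 536336 connections and 1432 trainable parameters.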
6.3 Justification of the Architecture Used
Since the ultimate long term goal of our work is the eventual implementation in AER
hardware of the systems this thesis analyzes, we seek to find the simplest architectures
for a given task. Our aim is to obtain a recognition rate over 98% on the training
set using the minimum number of filters and trainable parameters. As done by Y.
LeCun [4], we have preferred a six-layer system (alternating filtering and
subsampling layers) because in this way the number of trainable parameters is low
and we obtain a high recognition rate [155].
As opposed to LeCun [4], we have replaced the trainable first layer with a bank
of fixed-weight 10x10 Gabor filters with two scales and three orientations, because a
Gabor filter bank is often the first stage of visual processing in many systems and
in the human brain [8][1]. In addition, Gabor filters are selective to different scales and
orientations and they remove noise due to sparse spikes produced by the retina. We
chose a bank of Gabor filters of size 10x10 because smaller sizes
did not provide adequate filters at the upper scales and larger sizes did not improve
the recognition results.

Figure 6.4: Comparison of recognition rates when we use a trainable set of filters in the first layer or a fixed Gabor filter bank.
For comparison purposes, we evaluated the recognition rate using fixed-weight
Gabor filters and using trainable filters at this stage, with 250 training images
and 100 epochs (repetitions of the training set), and we obtained approximately the
same recognition rate of 98%. Fig. 6.4 shows the recognition rate for both
implementations.
To determine the simplest architecture that provides the best recognition perfor-
mance, several tests have been implemented varying the number of filters and feature
maps at each layer. First, the performance of the system was compared when different
numbers of Gabor filters were used in the first layer C1. Thus, different Gabor filter
banks with different scales and orientations in the first layer were tested, for a fixed
number of feature maps in the rest of the layers. In Fig. 6.5 we show the recognition rate
obtained for these combinations. It is evident that the choice that provides the best results
is 2 scales and 3 orientations.
Once the optimum number of Gabor filters was adopted (2 scales and 3 orientations),
the optimum number of feature maps in the rest of layers should be computed. Note
that by fixing the number of filters in layer C1 we only have to choose the proper
number of feature maps in layers C3 and C5. This is because the number of feature
Figure 6.5: Comparison of recognition rates when we use different Gabor filter banks with different numbers of scales and orientations.
maps in layers S2 and S4 is fixed (since they implement subsampling of the feature maps
in their previous layers) and layer F6 is the final 4-output layer, fully connected to
layer C5.
In Fig. 6.6 we show the recognition rates obtained when varying the number of
feature maps in the third layer for different numbers of feature maps in the fifth layer.
The recognition rate for a fixed number of feature maps in the fifth layer remained
almost constant regardless of the number of filters in the third layer. This shows
that the number of feature maps in the fifth layer is more critical than the number of
feature maps in the third layer. In Fig. 6.7 we show the recognition rates obtained when
varying the number of feature maps in the fifth layer for different numbers of feature
maps in the third layer. It is evident that beyond 8 feature maps in the fifth layer the
recognition rate stabilizes at a high value (about 98%) almost independently of the number of
feature maps in the third layer. These results motivated us to select four feature maps
for the third layer and eight for the fifth layer. A higher number of feature maps did
not provide significant improvement in the recognition rate and it would increase the
number of trainable weights. Note that in both experiments (Fig. 6.6 and Fig. 6.7)
the number of filters in layers C3 and C5 varies correspondingly to the variations in
the number of feature maps.
For hardware design purposes it is also interesting to analyze the possible ranges
Figure 6.6: Different accuracies obtained when varying the number of feature maps in the third layer while fixing the number of feature maps in the fifth layer.
of weight values at each layer. To do this, the values of the weights were computed
for different combinations of the number of feature maps in layers 3 and 5. In Fig.
6.8(a) and (b) the maximum absolute values that any of the weights achieved during
the training stage and at the end of the training stage are shown.
The structure we selected (as shown in Fig. 6.1) provides the lowest number of
filters and trainable parameters while maintaining a high recognition rate.
6.4 Frame-Free Convolutional Network
In Fig. 6.9 the frame-free AER structure corresponding to the frame-based scheme in
Fig. 6.1 is shown. The second column in Table 6.1 shows the different parameters that
are used in this frame-free implementation of the system.
In the frame-free architecture, we use as input a flow of events captured with an
AER motion sensing retina coding a 128x128 address space, downsampled to an address
space of 32x32. This is feasible using an AER subsampling module that transforms each
input event address coordinate in the range [0-127] to a new event address coordinate
in the range [0-31]. The subsampling module simply assigns to each input event with
address coordinates (xin, yin) the following new coordinate values:
xnew = bxin/4c; ynew = byin/4c; (6.4)
95
6. EVENT-DRIVEN CONVOLUTIONAL NETWORKS FOR FASTVISION POSTURE RECOGNITION
Figure 6.7: Different accuracies obtained when varying the number of feature maps inthe fifth layer fixing the number of feature maps in the third layer.
where operand bc indicates down rounding to the nearest integer. Note that modifying
the event address coordinates in this way is equivalent to an averaging and downsam-
pling operation. This new event flow with modified coordinates is used as input to the
system in Fig. 6.9. As is shown in the figure, each input event is replicated using a
1-to-6 splitter module [36][50]. These six replicas are connected to layer C1, composed
of six AER convolution modules [165], each programmed with a 10x10 Gabor
filter (belonging to the Gabor filter bank with two scales and three orientations). Each
convolution module has internally an array (feature map) of 28x28 pixels.
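As an illustration, the address remapping of Eq. 6.4 and the 1-to-6 replication can be sketched in a few lines of Python. The event tuple layout (timestamp, x, y, polarity) and the function name are assumptions made for this example only; they are not taken from the AERST tool.

```python
def subsample_event(x_in, y_in, factor=4):
    """Map an event address to a coarser grid by integer division (Eq. 6.4)."""
    return x_in // factor, y_in // factor

# Hypothetical event stream: (timestamp_us, x, y, polarity) tuples in a 128x128 space.
events = [(10, 0, 127, +1), (25, 63, 64, -1), (40, 127, 3, +1)]

# Remap every address from [0, 127] to [0, 31]; timestamps and polarities are unchanged.
subsampled = [(t, *subsample_event(x, y), p) for t, x, y, p in events]

# A 1-to-6 splitter simply hands the same event list to six consumers.
replicas = [list(subsampled) for _ in range(6)]
```

Since sixteen neighbouring input pixels share one output address, the events of a 4x4 block accumulate on a single pixel, which is why the remapping acts as averaging plus downsampling.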
Each neuron at position $(i, j)$ in feature map $q$ ($q \in \{1, \ldots, 6\}$) belonging to layer C1
($l = 1$) will have a state $s^q_1(i, j)$ given by:
$$s^q_1(i, j) = \sum_{m \in M} \sum_{n \in N} e_{in}(m, n) \cdot W^q_1(i - m, j - n) \tag{6.5}$$
where $e_{in}(m, n)$ is the number of input events per second coding activity at address
$(m, n)$ (representing the input visual stimulus after being replicated with the splitter)
and $W^q_1$ is the Gabor filter connecting the input stimulus with the 28x28 output feature
map $q$ (output of filtering the input stimulus with Gabor filter $q$).
The events obtained at the output of the six AER 28x28 internal arrays in the
convolution modules, thus coding a 28x28 address space, are sent to the six subsampling
Figure 6.8: Maximum absolute value of the weights during the training stage and at the end of the training stage.
modules in layer S2. With the simplifications considered for subsampling layers (no
trainable weights and no non-linearities), these modules can be easily implemented
using AER subsampling modules again: the address of each input event $(x_{in,2}, y_{in,2})$ is
modified so that address $(i, j)$, for $i, j = 1, \ldots, 28$, turns into $(k, l)$, for $k, l = 1, \ldots, 14$:
$$x_{new,2} = \lfloor x_{in,2}/2 \rfloor; \quad y_{new,2} = \lfloor y_{in,2}/2 \rfloor \tag{6.6}$$
The output of each of the six subsampling modules is sent to a splitter again to
replicate the output onto four channels. Each of the four channels is connected to one
input port of one of the four convolution structures in the third layer, each of which has six input ports.
Figure 6.9: AER-based implementation of the ConvNet system.
A detailed description of one of these convolution structures is shown in Fig. 6.10. In
this structure, each time an event is received, the convolution map (projection field)
corresponding to the event input port is added around the event address in the pixel
array (feature map). These kinds of structures can be implemented with the recently
developed multikernel AER-convolution chips [56], which have multikernel capability
(up to 32). This way, a neuron at position (i, j) in feature map q belonging to the third
layer (l = 3) will have a state represented by the mathematical operation:
$$s^q_l(i, j) = \sum_{p \in P} \sum_{m \in M} \sum_{n \in N} e^p_{in,l}(m, n) \cdot W^{p,q}_l(i - m, j - n) \tag{6.7}$$
where $e^p_{in,l}(m, n)$ is the number of events per second coming into input port $p$ (feature
map input of size 14x14) of layer $l$ ($l = 3$) coding the address $(m, n)$. Matrix $W^{p,q}_l$
is the convolution mask connecting input feature map $p$ with feature map $q$ (of size
10x10). Fig. 6.11 illustrates with an example what actually happens inside one of the
Figure 6.10: Convolution structure at layers C3 and C5. Each incoming spike causes a convolution map to be added to the pixel array.
pixels in the pixel array of the feature map. In the figure, three events coming from
different input ports are received by the convolution module. The first event splashes
the convolution mask corresponding to that input port around the address coded by
the event in the pixel array and adds it to its state. If the neuron under study is inside
this area, only one of the weights (w1 in Fig. 6.11) of the first filter
will affect it. As shown in the example in this figure, this weight w1 increments the state of
the neuron (the weight is added or subtracted according to the event and weight
signs). The second event, arriving later from a different (or the same) input port, will
cause a different (or the same) weight to be added to the state value in the neuron.
The third event produces the same result but this time a threshold is reached and a
new output event is produced coding the address of the firing neuron.
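The three-event example above can be sketched as follows. This is a minimal illustration under stated assumptions, not the behaviour of the AER convolution chips: the function name, the centring of the projection field on the event address, the positive-only threshold and the reset value are all choices made for the example.

```python
def process_event(state, kernels, port, x, y, sign, threshold, reset=0.0):
    """Add the kernel of input `port` around (x, y); return addresses that fire.

    state   : 2D list of neuron states (the pixel array of one feature map)
    kernels : dict mapping input port -> 2D kernel (one mask per input port)
    """
    kernel = kernels[port]
    kh, kw = len(kernel), len(kernel[0])
    fired = []
    for di in range(kh):
        for dj in range(kw):
            # Assumption: the projection field is centred on the event address.
            i = y + di - kh // 2
            j = x + dj - kw // 2
            if 0 <= i < len(state) and 0 <= j < len(state[0]):
                state[i][j] += sign * kernel[di][dj]
                if state[i][j] >= threshold:
                    fired.append((j, i))   # output event codes the neuron address
                    state[i][j] = reset    # neuron is reset after firing
    return fired

# Toy setup: 5x5 pixel array, two input ports with constant 3x3 kernels.
state = [[0.0] * 5 for _ in range(5)]
kernels = {0: [[0.4] * 3 for _ in range(3)], 1: [[0.3] * 3 for _ in range(3)]}

# Three events at the same address, coming from ports 0, 1 and 0.
outputs = [process_event(state, kernels, port, 2, 2, +1, threshold=1.0)
           for port in (0, 1, 0)]
```

With these toy values the first two events only charge the neurons (0.4, then 0.7), and the third one pushes them past the threshold of 1.0, so output events are produced only on the third call.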
Each time a neuron (unit or pixel) in a feature map reaches a threshold and the
time since the last output event (Toutput) is higher than an established refractory time
(Trefractory), a new output event is sent to the following layers coding the neuron
address, and the neuron is reset to the bias value of the feature map. Using this
refractory time to limit the neurons' maximum firing rate [15], we emulate a rectifying
non-linearity. This is one of the most important factors in improving the performance of
a recognition system [156]. If we do not consider refractory times, the number of events
Figure 6.11: Neuron in the pixel array. Each time a spike is received, a certain weight is added to the neuron state.
that a neuron would fire in layer C3 at position $(i, j)$ in feature map $q$ ($e^q_{out,3,lin}(i, j)$) is
$$e^q_{out,3,lin}(i, j) = \frac{\sum_{p \in P} \sum_{m \in M} \sum_{n \in N} e^p_{in,3}(m, n) \cdot W^{p,q}_3(i - m, j - n)}{Threshold_3} \cdot (1 - F^q_3) + n^q_3 \tag{6.8}$$
where $e^p_{in,3}(m, n)$ is the number of events per second coming to input port $p$ of layer
C3 coding address $(m, n)$, $W^{p,q}_3$ is the convolution map connecting feature map $p$ with
output feature map $q$ and $Threshold_3$ is the threshold selected for layer C3. In this
expression we have incorporated two new variables to model leakage due to a forgetting
factor $F^q_3$ (events per second lost by one neuron belonging to feature map $q$) and a
quantization noise factor $n^q_3$ (quantization caused by thresholding). The forgetting
mechanism is important because it “empties” the state stored in neurons (forgets) so
that no old information is relevant for the computations. In a certain way, the forgetting
mechanism has an effect similar to that of the refractory time, as it limits the neurons' firing
output activity.
When using refractory times (to emulate the sigmoid functions in frame-based sys-
tems), the number of events fired by a neuron in this layer is limited as:
$$e^q_{out,3}(i, j) = \min\left(e^q_{out,3,sat},\; e^q_{out,3,lin}(i, j)\right) \tag{6.9}$$
where $e^q_{out,3,lin}$ is computed as described in Eq. 6.8 and $e^q_{out,3,sat}$ is the maximum
number of output events allowed by the imposed refractory time $T_{ref3}$.
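Eqs. 6.8 and 6.9 can be read as a simple rate model, sketched below. The function names are hypothetical, and the saturation rate is taken as the inverse of the refractory time, which is an assumption of this example (the text only states that the saturation value is fixed by the refractory time).

```python
def linear_rate(weighted_input_rate, threshold, forgetting=0.0, noise=0.0):
    """Eq. 6.8: output events per second when no refractory time is applied.

    weighted_input_rate is the triple sum of input rates times kernel weights.
    """
    return weighted_input_rate / threshold * (1.0 - forgetting) + noise

def saturated_rate(lin_rate, t_refractory):
    """Eq. 6.9, assuming the saturation rate equals 1 / T_ref (in events/s)."""
    if t_refractory <= 0:
        return lin_rate          # no refractory time: no saturation
    return min(1.0 / t_refractory, lin_rate)

# Example with the theoretical C3 refractory time of 2.3 ms quoted in the text.
low = saturated_rate(linear_rate(150.0, 1.5), 2.3e-3)      # below saturation
high = saturated_rate(linear_rate(3000.0, 1.5), 2.3e-3)    # clipped at 1 / 2.3 ms
```

The clipping plays the role that the saturating sigmoid plays in the frame-based system: small rates pass unchanged while large rates are limited.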
Layer S4 implements subsampling in the same way as in layer S2. Output events
from layer S4 are connected to neurons in layer C5 in the same way as in layer C3 (see
eq. 6.8). Each event produced in layer C5 is replicated in four different outputs using
splitter modules. These output neurons are fully connected to the four output neurons
in layer F6. Neurons in layer F6 fire positive or negative events indicating whether or not
the input has been categorized as the class coded by the firing neuron. In the
system, there are two stages where refractory periods have been incorporated. In the
frame-based system we considered sigmoid functions at layers C3, C5 and F6. However,
in the AER implementation, only layers C3 and C5 needed refractory times, because
the activity at layer F6 does not saturate but will be high and positive only for the
desired output neuron. Besides, using too many or too long refractory times has the
negative effect of saturating the firing frequency, thereby increasing the time-to-first
event and the separation between output events.
6.5 Results
Because the large number of filters needed in our implementation is not yet
available in hardware, the system was simulated again with our AERST simulator
tool, using real input stimuli from the AER electronic retina and the performance
figures of the physically available AER hardware [39][36][165]. The combination of
real sensory event-format data with performance figures of available AER devices allows
us to estimate reasonably well the performance of our AER-based ConvNet proposed
in this thesis work. The AER-based system was first trained and tested using a frame-
based version of our AER (frame-free) ConvNet, as indicated in the algorithm depicted
in Fig. 6.12.
Figure 6.12: Algorithm used to configure the system. First, the system was trained with the frame-based version. Then all the obtained weights were used in the frame-free system.
Two sets of experiments were carried out. In the first set, the 128x128 input space
obtained from the retina was downsampled to a 32x32 input space. In the second set,
a central 64x64 square of the 128x128 input space was selected and downsampled
to 32x32. The first set of experiments is called AER ConvNet with 32x32 pixel
inputs and the second set is called AER ConvNet with 64x64 pixel inputs.
6.5.1 AER ConvNet with 32x32 pixel inputs
In this first experiment, we used the 128x128 pixel retina to collect events during
intervals of 30ms. These events were histogrammed into images and then downsampled
to 32x32 pixel images. This way we created a total of 262 images of people walking.
We rotated these images 90 and 180 degrees to create the corresponding images in
horizontal and up-side-down positions. Some of these images are shown in the top row
of Fig. 6.13.
Figure 6.13: a) Images corresponding to downsampling the 128x128 input stimulus to 32x32. b) Images obtained by cropping the input stimulus to a central square of size 64x64 and downsampling the cropped stimulus to 32x32.
Finally, to add some distractors, we used another set of 262 images of moving objects
recorded with the retina. Thus, we generated a database composed of 1048 images
representing a total of four different categories. 250 of these images were used to train
the frame-based version of the system and the rest were used for testing purposes. With
these images we obtained a 98% recognition rate with the training set and 93.2% with
the testing set on the frame-based version.
To map the trained frame-based version to an event-based version we proceed as
follows. In the AER-based system there are mainly three sets of parameters that have
to be set before using the system. These are:
a) The weights of the convolution masks of each AER convolution module. In the
frame-free implementation, all the weights that had been obtained during the training
of the frame-based version were used as weights in the frame-free implementation.
b) Thresholds. The threshold values to be used inside the convolution modules in the
AER system were chosen considering two limiting factors:
1. Threshold values should be low to provide a high output rate, thus speeding up
the system.
Table 6.2: Maximum kernel weights, threshold values, refractory times, layer times and events per second in the system
2. Threshold values inside the convolution modules must be higher than the maximum
weights of the convolution masks in order to avoid high quantization noise. Otherwise,
each neuron inside the convolution module affected by an incoming
event would always reach the threshold value, thus generating
undesirable output events.
These two considerations lead us to choose threshold values between 1.5 and 2 times
the maximum weight existing inside each convolution module.
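The threshold rule above can be expressed as a small helper. This is only a sketch: the function name and the nested-list kernel representation are assumptions of the example.

```python
def choose_threshold(kernels, factor=1.5):
    """Return a threshold equal to `factor` times the largest absolute weight.

    kernels : list of 2D kernels (nested lists) programmed in one module.
    factor  : between 1.5 and 2, the range selected in the text.
    """
    if not 1.5 <= factor <= 2.0:
        raise ValueError("factor is expected to lie between 1.5 and 2")
    max_weight = max(abs(w) for kernel in kernels for row in kernel for w in row)
    return factor * max_weight

# Toy kernel set for one convolution module.
toy_kernels = [[[0.2, -0.8], [0.1, 0.4]], [[0.3, 0.3], [-0.5, 0.0]]]
threshold = choose_threshold(toy_kernels, factor=1.5)   # 1.5 * 0.8
```

Keeping the threshold above the largest weight guarantees that a single event cannot make a resting neuron fire, while the upper limit of 2 keeps the output rate, and therefore the speed, high.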
c) Refractory times. To compute the refractory times in layers C3 and C5, we established
relations (see Appendix Section 6.6) between the sigmoid saturation values in
the frame-based version and the event rate in the AER-based version. Operating this
way we obtained the values 2.3ms and 17.5ms for the refractory times in layers C3 and
C5 respectively.
Table 6.2 shows the maximum kernel weights, the threshold values, and the refrac-
tory times used in each layer.
Once the main parameters have been set, the AER version can be tested using
different experiments. In the tests we are interested in obtaining positive activity at
a high rate, thus minimizing the system response time and the time-to-first event. In our
system, the time-to-first event is defined as the delay between the first input
event of a given person position and the first positive output event in the
output channel corresponding to that position.
The AER-based system shown in Fig. 6.9 was tested with three different flows of
spikes of visual information. The first flow, corresponding to up position, was obtained
composing several data files of people walking, recorded with the AER motion sensing
retina. This flow had a duration of 9s. Note that if we assume a rate of 33 frames
per second in a frame-based version, 9s in the frame-free version would correspond to
approximately 297 frames. In the recording, several people appear at different times.
As the flow had 102572 spikes, this results in a total equivalent input firing rate of
11.4keps (kilo-events per second). Then, this flow of spikes (corresponding to the up
position) was rotated 90 and 180 degrees to create the other two flows for horizontal
and up-side-down positions. Note that rotating a flow of spikes in AER is very simple
as we only need to change the pixel address of each spike through simple operations
(as implemented in the rotator module). The three flows of activity created this way
were used as inputs in the frame-free system configured with the theoretical parameters
explained above.
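The figures quoted in this paragraph, and the address rotation, can be checked with a short sketch. The rotation direction for 90 degrees and the function name are assumptions of this example; the text does not specify how the rotator module maps addresses.

```python
# Equivalent input rate and frame count for the 9 s recording.
n_spikes = 102572
duration_s = 9
rate_keps = n_spikes / duration_s / 1000.0   # about 11.4 keps
equivalent_frames = 33 * duration_s          # 297 frames at 33 frames per second

def rotate_event(x, y, degrees, size=32):
    """Rotate an event address inside a size x size array (hypothetical helper)."""
    if degrees == 90:
        return size - 1 - y, x               # assumed rotation direction
    if degrees == 180:
        return size - 1 - x, size - 1 - y
    raise ValueError("only 90 and 180 degrees are handled in this sketch")

# Rotating a flow only touches the addresses; timestamps are kept as they are.
flow = [(0.001, 5, 7), (0.002, 0, 0)]        # (time_s, x, y), toy values
flow_180 = [(t, *rotate_event(x, y, 180)) for t, x, y in flow]
```

No pixel data is resampled: each spike is processed independently with a constant-time address computation, which is why rotation is cheap in AER.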
Fig. 6.14 shows the spike sequence of the three input flows. Fig. 6.15 shows the
activities of the four output channels of the system when each of the three input flows is
used. Positive events in a particular output channel indicate that the system recognizes
input events as belonging to the category represented by that output channel. Negative
events indicate the opposite. As can be observed, the system was able to recognize
the people's position in the three cases (up, horizontal and up-side-down).
Despite having theoretical values for the refractory times, we verified the performance
of the system when sweeping them systematically. We varied the refractory
times for layers C3 and C5 from 0 to 50ms. Note that 50ms is a time considered to be
excessively long. Defining the recognition rate as the ratio between the number of pos-
itive events in the target output channel, and the total positive output events collected
in the four output channels, the combination of refractory values in layers C3 and C5
that provided the highest recognition rate was 1.3ms and 16ms (note that these exper-
imental values are close to the theoretical values shown in Table 6.2). Higher refractory
periods provided better recognition but also a reduced number of output events and
consequently a slower response. Lower values for the refractory period in layer C5 (but
close to 16ms) also provided good recognition rates (and close to 98%). However, we
discarded these lower values for the refractory time as small variations provided highly
fluctuating recognition rates, thus increasing delay times and the time-to-first output
event. The selected combination of refractory times provided a recognition rate close
to 98% (97.92%) and a high number of output events per second (close to 100eps)
Figure 6.14: Input events used to test the system. The x axis represents time in seconds. The y axis represents the input event coordinates in a 32x32 pixel array, numbered from 0 to 1023.
(speeding up the system). In Fig. 6.16 we show the number of output events and the
recognition rate obtained when we vary refractory periods in layers C3 and C5.
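The recognition rate defined above (positive events in the target channel divided by all positive output events) can be computed as in the following sketch. The (channel, sign) representation of output events is an assumption of the example.

```python
def recognition_rate(output_events, target_channel):
    """Ratio of positive events in the target channel to all positive events.

    output_events : iterable of (channel, sign) pairs, sign being +1 or -1.
    """
    positive_channels = [ch for ch, sign in output_events if sign > 0]
    if not positive_channels:
        return 0.0               # no positive activity at all
    hits = sum(1 for ch in positive_channels if ch == target_channel)
    return hits / len(positive_channels)

# Toy output: channel 1 is the target; negative events are ignored by the metric.
toy_output = [(1, +1), (1, +1), (2, -1), (3, +1), (1, +1), (4, -1)]
rate = recognition_rate(toy_output, target_channel=1)   # 3 of 4 positive events
```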
An alternative measure of the recognition rate is the
percentage of time in which activity is positive on the correct output channel. However,
this measure is not realistic, as there are periods with very low activity at the input,
resulting also in low activity at the output. In any case, looking at the output flows in Fig.
6.15, it is clear that the system performed correctly almost all the time. The system
misclassified mainly up positions (classifying them as up-side-down). In most
cases, this occurred when there were people moving in or out of the field of view. This is
reasonable, as vertical and up-side-down positions are very similar, especially when only
a few events are received during transitions. The minimum time-to-first event was
approximately 16ms. Considering that the input flow was approximately 11.5keps,
16ms corresponds to about 184 input events. This value corresponds to a percentage of
effective average firing pixels of 5.66%. This percentage was computed using images
created by collecting about 184 events from the input flow and computing the number of
Figure 6.15: Output events corresponding to each one of the input flows: a) output when the input is the up position, b) output when the input is the horizontal position, c) output when the input is the up-side-down position.
active pixels in each image. The 184 events and the 5.66% value of effective firing pixels
lead to an approximate average of 3.17 events per pixel when considering a 32x32 input
array. Note that the retina output has an address space of 128x128, so that the 3.17
events per pixel computed here correspond on average to 0.4 events per pixel in the
128x128 array (less than one event per pixel). The maximum firing rate in each output
channel led to minimum delays between events in the order of 15ms. These delays can
be considered low, but note that even so they are not determined by the computation
process within the ConvNet (in the order of microseconds [165]), but by the reduced
input firing frequency provided by the retina. With the development of new faster
Figure 6.16: Recognition rate and number of output events per second obtained by varying the refractory times in layers C3 and C5.
electronic retinae it will be possible to achieve shorter delays, as recent convolution
chips have delays between input and output flows in the order of microseconds (or
fraction) [165].
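The chain of figures used in this discussion (input rate, delay, number of events and events per active pixel) can be reproduced with the arithmetic below. The variable names are illustrative; only the numerical values come from the text.

```python
input_rate_eps = 11.5e3          # approximate input rate, events per second
delay_s = 16e-3                  # minimum time-to-first event

# Number of input events received before the first output event.
n_events = input_rate_eps * delay_s                  # about 184 events

# 5.66% of the 32x32 array is active on average in such a slice of events.
active_pixels = 0.0566 * 32 * 32                     # about 58 pixels

# Average number of events per active pixel in the 32x32 array.
events_per_active_pixel = n_events / active_pixels   # about 3.17
```

In other words, the first correct output event appears after only about three events per active pixel have been received.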
A second experiment analyzed the system response when the input flow
alternated between the three different people positions. The input (up)
flow was rotated after a random time (on average 0.5s) to create the other two
positions (horizontal and up-side-down). The new flow had a duration of 7.3s. To show how
critical refractory times are, Fig. 6.17(a) shows the input and output flows obtained
when they are not used in the simulations. Note that the output channels respond
by producing activity of both signs (positive and negative), which makes the system unable
to recognize the input pattern. For clarity, we have represented input
events corresponding to up position as ‘5’. Events corresponding to horizontal position
are represented as ‘6’ and up-side-down position is represented as ‘7’. The different
outputs are shown in the same figure. Positive and negative output events for the
output channel identifying up position are represented with ‘1’ and ‘-1’ respectively.
Output events identifying horizontal position are represented with ‘2’ and ‘-2’. Output
events corresponding to up-side-down position are represented with ‘3’ and ‘-3’. Finally,
the events corresponding to the noise output are represented as ‘4’ and ‘-4’.
In Fig. 6.17(b) the system has been tested using a fixed refractory time (Tref3)
of 1.3ms for layer C3 and 9ms for layer C5 (Tref5). Here we are representing only
the positive activity of the output channels. Categories of input events are shown
by the blue line. Values ‘1’, ‘2’ and ‘3’ correspond to up, horizontal and up-side-down
positions, respectively. Output events corresponding to the up category are represented
with blue circles, horizontal category output events are represented by red crosses, up-
side-down category events are represented by green stars, and noise category events by
black points. Note that this time the system is able to track the input stimulus at any
time. Only a few wrong events are produced. Besides, the system never identifies the
people as noise, which is correct.
In Fig. 6.17(c) the system has been tested using a fixed refractory period (Tref3)
of 1.3ms for layer C3 and 18ms for layer C5 (Tref5). Note that this time the system
improves in accuracy, producing a lower number of wrong events. However, the total
number of events produced is lower, which means a degradation of the system speed.
So far, all the experiments have been implemented without considering forgetting
inside neurons. To add the forgetting mechanism to our system we have included
forgetting rates in layers C1, C3, C5 and F6. A forgetting rate Fl means that a state
quantity of value Fl will be discharged each second in all neurons belonging to layer l.
The stored state (positive or negative) will always leak towards ‘0’. In our system, the
forgetting rates are denoted as F1, F3, F5 and F6. To evaluate how important these
forgetting rates are, we have varied the four variables between 0 and 35 nups (nano
units per second). This last value has been chosen empirically. It corresponds to null
output activity without refractory times.
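A minimal sketch of the forgetting mechanism described above is given below: the state leaks linearly towards zero at a fixed rate, whatever its sign. The function name and the idea of applying the leak over the time elapsed between events are assumptions of this example, and the units of the rate are left to the caller.

```python
def apply_forgetting(state, rate, dt):
    """Leak `state` towards zero by `rate * dt`, never crossing zero.

    state : current neuron state (positive or negative)
    rate  : forgetting rate F_l, in state units per second
    dt    : time elapsed since the last update, in seconds
    """
    leak = rate * dt
    if state > 0:
        return max(0.0, state - leak)
    if state < 0:
        return min(0.0, state + leak)
    return 0.0

# A positive and a negative state both move towards zero.
after_positive = apply_forgetting(1.0, rate=0.5, dt=1.0)    # 0.5
after_negative = apply_forgetting(-0.2, rate=0.5, dt=1.0)   # clamps at 0.0
```

In an event-driven simulation the leak would typically be applied lazily, that is, only when a new event reaches the neuron, using the time since that neuron was last updated.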
In Fig. 6.18 we show the recognition rates and number of output events obtained
when varying forgetting rates F1, F3, F5, and F6 (in pairs). As can be seen, compared
to the other forgetting rates, the sensitivity to parameter F5 is very high, as it causes the
recognition rate and number of events to change abruptly when F5 is close to 26 nups.
The system is also very sensitive to F3 variations. The system performance degrades
Figure 6.17: a) Input and output activity when input is alternated between up, lay and up-side-down positions. No refractory periods have been considered. Values ‘5’, ‘6’ and ‘7’ correspond to up, horizontal and up-side-down respectively. Absolute values ‘1’, ‘2’, ‘3’ and ‘4’ correspond to the output channels identifying up-side-down, horizontal, up positions and noise respectively. b) Input and output activity when input is alternated and a refractory period of 9ms is used in layer C5. Input event correct orientation is shown by the blue line. Values ‘1’, ‘2’ and ‘3’ correspond to up, horizontal and up-side-down positions, respectively. Output events corresponding to the up category are represented with blue circles, with red crosses for the horizontal category, with green stars for the up-side-down category and black dots for the noise category. c) Input and output activity when a refractory time of 18ms is used in layer C5. d) Input and output activity when the simulated annealing algorithm is employed to obtain optimum parameters
when F3 is close to 7nups. Finally F1 and F6 do not degrade the recognition in the
system very much, but their values affect considerably the number of output events,
which implies a slower system response. Thus we chose a 5nups value for F1 and F6.
For these experiments we fixed the refractory times and thresholds to the values shown
Figure 6.18: Recognition rate and number of output events obtained when varying forgetting rates F1, F3, F5, and F6. a) Results when varying F1 and F3. b) Results when varying F3 and F6. c) Results when varying F3 and F5. d) Results when varying F1 and F5.
in Table 6.2.
As the thresholds, forgetting rates and refractory times all influence the system
performance, both in terms of recognition rate and speed, they
were varied jointly to obtain the optimum practical parameters maximizing accuracy
and speed. To find optimum parameters we used the simulated annealing algorithm
[170]. Note that there are 10 free parameters: Threshold1, Threshold3, Threshold5,
Threshold6, F1, F3, F5, F6, Tref3, Tref5. The simulated annealing algorithm minimizes
a cost function subject to lower and upper bounds on the parameters.
We used a cost function which penalizes a reduced number of positive output events
(under 90) and a reduced recognition rate (under 95%). For the refractory times we
imposed an upper bound of 5ms. Larger values are not necessary as the pixels include
the forgetting mechanism. The algorithm found several parameter vectors providing
good results. The best vector after 2000 iterations is shown in Table 6.3. Note that
in this set of parameters, the refractory times were driven to low values (almost ‘0’)
PARAMETER VALUE
Threshold1 0.62
Threshold3 3.12
Threshold5 6.4
Threshold6 1.5
F1 (nups) 17.46
F3 (nups) 0.21
F5 (nups) 21.17
F6 (nups) 32.6
Tref3 (ms) 2.23
Tref5 (ms) 0.95
Table 6.3: Parameter Vector obtained with the Simulated Annealing Algorithm
increasing the forgetting parameters, which is a very desirable effect.
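The optimization procedure can be sketched as follows. This is a generic bounded simulated annealing loop written for illustration; it is not the implementation of [170]. The hinge-style form of the cost, the linear cooling schedule, the Gaussian proposal and all function names are assumptions. In the real setting the objective would run the simulator with a candidate parameter vector and pass the measured number of positive output events and recognition rate to `cost`; here a toy objective stands in for that step.

```python
import math
import random

def cost(n_positive_events, recognition_rate, min_events=90, min_rate=0.95):
    """Penalize fewer than 90 positive events and a recognition rate under 95%."""
    event_penalty = max(0.0, (min_events - n_positive_events) / min_events)
    rate_penalty = max(0.0, (min_rate - recognition_rate) / min_rate)
    return event_penalty + rate_penalty

def simulated_annealing(objective, lower, upper, iterations=2000, t0=1.0, seed=0):
    """Minimize `objective` over the box [lower, upper] (illustrative version)."""
    rng = random.Random(seed)
    current = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    current_cost = objective(current)
    best, best_cost = list(current), current_cost
    for k in range(iterations):
        temperature = t0 * (1.0 - k / iterations) + 1e-9   # linear cooling
        # Propose a Gaussian perturbation and clip it to the bounds.
        candidate = [min(hi, max(lo, v + rng.gauss(0.0, 0.1 * (hi - lo))))
                     for v, lo, hi in zip(current, lower, upper)]
        candidate_cost = objective(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost

# Toy objective standing in for a simulator run (two parameters only).
toy_objective = lambda v: sum((x - 1.0) ** 2 for x in v)
best, best_cost = simulated_annealing(toy_objective, [0.0, 0.0], [5.0, 5.0])
```

The bounds play the role described in the text, for example an upper bound of 5ms on the two refractory times.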
The resulting activity obtained with these parameters can be seen in Fig. 6.17(d).
Note that again only a few wrong recognition events are produced. The main advantage
of this last version is that we have included the forgetting mechanisms, making the
system more biologically plausible. To a certain extent the forgetting mechanism complements the
effect of the refractory time. For illustration purposes negative events are also shown in this
figure. Note how the different output channels respond strongly with negative activity
when they are not the target outputs.
Table 6.4 shows the time-to-first event at the target output when the input flow is
changed between positions. First column corresponds to the times at which the input
events represent a new position. Second column shows the delay to obtain a positive
output event at the target output when the first input event corresponding to a new
position has been fed to the system. This delay was computed considering only the
positive output events in the channel identifying the correct input.
As Table 6.4 shows, the minimum time-to-first event correctly identifying the
position after changing the people position in the input flow was 15.5ms (see Fig.
6.19, where a zoomed version of the simulation results between 5760ms and 5830ms is
shown).
INPUT TRANSITION TIME (ms)    TIME-TO-FIRST OUTPUT EVENT (Delay) (ms)
0 82.3
512 39.3
957 141.8
1310 28
1633 32
1909 13.8
2152 59.2
2453 87.7
2819 21
3029 31.5
3297 69.6
3623 48
3934 51.4
4349 26.8
4820 37.4
5574 46.3
5794 15.5
6019 112.3
6339 26
6673 37.2
Table 6.4: Time-to-first output event after transitions of the input between up, horizontaland up-side-down positions.
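The delays listed in Table 6.4 follow the definition given earlier: only positive events in the channel of the correct category are considered. A minimal sketch of that computation is shown below; the event representation and the toy event times are assumptions made for the example and are not simulation data.

```python
def time_to_first_event(transition_time, target_channel, output_events):
    """Delay from an input transition to the first positive target-channel event.

    output_events : iterable of (time_ms, channel, sign) tuples.
    Returns None when no matching output event follows the transition.
    """
    times = [t for t, ch, sign in output_events
             if ch == target_channel and sign > 0 and t >= transition_time]
    return min(times) - transition_time if times else None

# Toy output events around a transition at 5794 ms (illustrative values only).
toy_events = [(5800.0, 2, +1), (5805.0, 1, -1), (5809.5, 1, +1), (5820.0, 1, +1)]
delay = time_to_first_event(5794.0, target_channel=1, output_events=toy_events)
```

Events in other channels and negative events in the target channel are ignored, so a wrong early event does not shorten the measured delay.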
6.5.2 AER ConvNet with 64x64 pixel inputs
In this set of experiments, the 128x128 input flow of events obtained from the retina
was not downsampled directly to 32x32. This time, only the events in a 64x64 window
containing the region of interest of the 128x128 stimulus address space (coding addresses
inside a 64x64 square) were collected to create the set of training images and the flow
of events for testing. Then, the coordinates of these new events coding a 64x64 address
space were modified to a downsampled space of 32x32. Note that, in contrast to the
Figure 6.19: Zoomed version of the simulation results of Fig. 6.17 between 5760ms and 5830ms.
previous set of experiments, images obtained collecting these new events have more
resolution (see Fig. 6.13(b)).
The first experiment was implemented again alternating the input between up,
horizontal and up-side-down positions. This time the input flow had a duration of 6.5s.
This first experiment was carried out without considering forgetting mechanisms. The
refractory period for layer C3 was 1.3ms and 18ms for layer C5. The results can be
seen in Fig. 6.20(a). As in the previous implementation, the system is able to track the input
people position with delays lower than 15ms. Note that only a few wrong events are
produced.
The second experiment was carried out using forgetting parameters. The simulated
annealing algorithm was used again. The best set of parameters after 2000 iterations
is shown in Table 6.5.
The input and output events obtained using this set of parameters are shown in Fig.
6.20(b). Note that the accuracy of the system has been improved with this set of
parameters and that the system is able to provide correct output even when there
are fast transitions at the input (see the abrupt change between 4.82s and 4.88s).
Again, note that the refractory periods have been reduced and part of their operation
is implemented by the forgetting mechanisms.
PARAMETER VALUE
Threshold1 0.6
Threshold3 1.11
Threshold5 6.17
Threshold6 1.75
F1 (nups) 18.68
F3 (nups) 4.08
F5 (nups) 0.015
F6 (nups) 15.58
Tref3 (ms) 0.72
Tref5 (ms) 5.96
Table 6.5: Parameter Vector obtained with the Simulated Annealing Algorithm
6.6 Appendix
6.6.1 Learning in Convolutional Networks
There are several approaches to automatic machine learning, but one of the most successful
is gradient-based learning. The learning machine computes
a loss function that measures the discrepancy between the “correct” or desired output
for a pattern and the output produced by the system. The simplest output loss function
that can be used in convolutional networks is the minimum mean squared error (MSE)
[4]. The loss function can then be computed as:
$$E = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{2} \sum_{q} (y_q - d_q)^2 \tag{6.10}$$
where yq is the output obtained at output port q and dq is the desired output for port
q. P is the number of the training samples considered.
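As a quick numerical illustration of Eq. 6.10, the loss can be computed as follows. The function name and the nested-list layout (one list of output values per training sample) are assumptions of this sketch.

```python
def mse_loss(outputs, targets):
    """Eq. 6.10: average over P samples of half the squared output error.

    outputs, targets : lists of P lists, one value per output port q.
    """
    n_samples = len(outputs)
    total = 0.0
    for y_sample, d_sample in zip(outputs, targets):
        total += 0.5 * sum((y - d) ** 2 for y, d in zip(y_sample, d_sample))
    return total / n_samples

# Two samples with two output ports each; only the first sample has an error.
loss = mse_loss([[1.0, -1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
```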
The loss function is minimized by computing the gradient with respect to all the
parameters in all the layers, and a simple and efficient procedure to compute it in a
nonlinear system composed of several layers of processing is to use the back-propagation
algorithm [4]. The standard algorithm must be slightly modified to take the weight
sharing into account. The weight sharing technique has the interesting side effect of
reducing the number of free parameters. An easy way to implement it is to first compute
Figure 6.20: a) Input and output activity when input is alternated and a refractory period of 18ms is used in layer C5. Input events are shown by grey circles. Values ‘1’, ‘2’ and ‘3’ correspond to up, horizontal and up-side-down positions respectively. Output events corresponding to the up category are represented with blue circles, output events corresponding to the horizontal position are represented by red crosses, output events corresponding to the up-side-down positions are represented by green stars and the noise category by black points. b) Input and output activity when the simulated annealing algorithm is employed to obtain optimum parameters
the partial derivatives of the loss function with respect to each connection. Then the
partial derivatives of all the connections that share the same parameter are added to
form the derivative with respect to that parameter. Before training, the weights are
initialized with random values using a uniform distribution between $-2.4/F_i$ and $2.4/F_i$,
where $F_i$ is the number of inputs (fan-in) of the unit to which the connection belongs
[4]. To train the system, the patterns are presented in a constant random order, and the
training set is typically repeated a certain number of times (epochs). At each learning
iteration, a particular parameter is updated according to the following update rule:
$$w_k = w_k - \varepsilon_k \frac{\partial E^p}{\partial w_k} \tag{6.11}$$
where $E^p$ is the error obtained for class $p$. The step sizes are not constant and are
computed as
$$\varepsilon_k = \frac{\eta}{\mu + h_{kk}} \tag{6.12}$$
where $\eta$ and $\mu$ are hand-picked constants and $h_{kk}$ is an estimate of the second derivative
of the loss function $E$ with respect to $w_k$. The larger $h_{kk}$ is, the smaller the weight
update. Once the system has been trained, the performance is estimated by measuring
the accuracy on a set of samples disjoint from the training set, which is called the test
set.
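The initialization range and the update rule of Eqs. 6.11 and 6.12 can be sketched as below. The function names are hypothetical and the values of eta and mu used in the example are arbitrary placeholders, since the hand-picked constants are not given in the text.

```python
import random

def init_weights(fan_in, n_weights, seed=0):
    """Draw weights uniformly from [-2.4 / fan_in, 2.4 / fan_in]."""
    rng = random.Random(seed)
    bound = 2.4 / fan_in
    return [rng.uniform(-bound, bound) for _ in range(n_weights)]

def update_weight(w_k, grad_k, h_kk, eta, mu):
    """Eqs. 6.11 and 6.12: gradient step with a per-weight step size."""
    epsilon_k = eta / (mu + h_kk)    # larger curvature estimate -> smaller step
    return w_k - epsilon_k * grad_k

# A unit with a 5x5 receptive field on one input map has a fan-in of 25.
weights = init_weights(fan_in=25, n_weights=100)

# Placeholder constants (eta, mu) chosen only to make the arithmetic easy to follow.
new_w = update_weight(1.0, grad_k=2.0, h_kk=0.98, eta=0.1, mu=0.02)
```

With weight sharing, `grad_k` would be the sum of the partial derivatives of all the connections that share the parameter, as described above.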
6.6.2 Computations in the Frame-Based System
6.6.2.1 Filtering layers
In filtering layers C1, C3 and C5, the state of a neuron $y_l^q(i,j)$ located at position $(i,j)$
in feature map $q$ ($q = 1, ..., Q$) belonging to layer $l$ ($l = 1, 3, 5$) is computed as
$$x_l^q(i,j) = \sum_{p \in P} \sum_{m \in M} \sum_{n \in N} \left( y_{l-1}^p(m,n) \cdot W_l^{p,q}(i-m, j-n) \right) + b_l^q \qquad (6.13)$$
$$y_l^q(i,j) = A \tanh\left( S \cdot x_l^q(i,j) \right) \qquad (6.14)$$
where $P$ is the number of input feature maps, $Q$ is the number of output feature maps,
$y_{l-1}^p(i,j)$ is the analog value of the input feature map $p$ at position $(i,j)$, $W_l^{p,q}$ is the
convolution map connecting input feature map $p$ with output feature map $q$, $b_l^q$ is the
offset for the output feature map $q$, and $A$ and $S$ are constants. To simplify the
hardware implementation, we have used zero values for all the biases
$b_l^q$. For the first layer C1, $P = 1$ (the input image), $Q = 6$, $W_1^{p,q}$ is a set of 6
Gabor filters (coding 2 scales and 3 orientations) and $y_{l-1}^p$ is the input image. We have
not used the hyperbolic tangent sigmoid function in the neurons of this layer. In layer C3,
$P = 6$, $Q = 4$ and $W_3^{p,q}$ is a set of 24 5x5 trainable filters. In layer C5, $P = 4$, $Q = 8$
and $W_5^{p,q}$ is a set of 32 5x5 trainable filters.
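The computation of Eqs. (6.13) and (6.14) for one neuron can be sketched in C++ as shown below. It is only an illustration: the `Map` type, the function names and the border convention (kernel entries whose index falls outside the kernel are treated as zero) are assumptions that this section does not specify.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Map = std::vector<std::vector<double>>;  // one feature map or one kernel

// State x_l^q(i,j) of Eq. (6.13) for a single output feature map q.
// inputs[p] is y_{l-1}^p and kernels[p] is W_l^{p,q}.
// Assumption: kernel indices (i-m, j-n) outside the kernel contribute zero.
double neuron_state(const std::vector<Map>& inputs,
                    const std::vector<Map>& kernels,
                    double bias, int i, int j) {
    assert(inputs.size() == kernels.size());
    double x = bias;
    for (std::size_t p = 0; p < inputs.size(); ++p) {
        const Map& y = inputs[p];
        const Map& w = kernels[p];
        for (int m = 0; m < static_cast<int>(y.size()); ++m) {
            for (int n = 0; n < static_cast<int>(y[m].size()); ++n) {
                const int u = i - m;  // kernel row index
                const int v = j - n;  // kernel column index
                if (u < 0 || v < 0) continue;
                if (u >= static_cast<int>(w.size())) continue;
                if (v >= static_cast<int>(w[u].size())) continue;
                x += y[m][n] * w[u][v];
            }
        }
    }
    return x;
}

// Output y_l^q(i,j) of Eq. (6.14).
double neuron_output(double A, double S, double x) {
    return A * std::tanh(S * x);
}
```

For layer C1, where no sigmoid is applied, only `neuron_state` would be used.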
6.6.2.2 Subsampling Layers
For the second and fourth layers (S2, S4), the analog output values are computed as:
$$y_l^q(i/2, j/2) = W_l^q \cdot \sum_{m=0}^{1} \sum_{n=0}^{1} y_{l-1}^q(i+m, j+n), \qquad i, j = 2 \cdot n,\ n \in \mathbb{N} \qquad (6.15)$$
In subsampling layers, $W_l^q$ are 1x1 fixed averaging factors (of value 1 in our implementation)
instead of convolution matrices.
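A minimal C++ sketch of the subsampling of Eq. (6.15) is given below. The `Map` type and the function name are hypothetical, and the input map is assumed to be rectangular; an odd trailing row or column is simply ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Map = std::vector<std::vector<double>>;

// Frame-based subsampling of Eq. (6.15): each output pixel is the sum of a
// non-overlapping 2x2 block of the input map, scaled by the factor w.
Map subsample(const Map& in, double w) {
    const std::size_t rows = in.size() / 2;
    const std::size_t cols = in.empty() ? 0 : in[0].size() / 2;
    Map out(rows, std::vector<double>(cols, 0.0));
    for (std::size_t r = 0; r < rows; ++r) {
        // Both input rows of the block must be long enough.
        assert(in[2 * r].size() >= 2 * cols);
        assert(in[2 * r + 1].size() >= 2 * cols);
        for (std::size_t c = 0; c < cols; ++c) {
            double sum = 0.0;
            for (std::size_t m = 0; m < 2; ++m) {
                for (std::size_t n = 0; n < 2; ++n) {
                    sum += in[2 * r + m][2 * c + n];
                }
            }
            out[r][c] = w * sum;
        }
    }
    return out;
}
```

With `w = 1`, as in our implementation, the output is the plain sum of the four neighbours.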
6.6.2.3 Full-Connection layer F6
In layer F6, the state of a neuron $y_6^q$ ($q = 1, ..., 4$) is computed as
$$y_6^q = A \tanh\left( S \cdot y_5^p \cdot W_6^{p,q} + b_6^q \right) \qquad (6.16)$$
This time, $W_6^{p,q}$ are trainable 1x1 weights connecting each one of the 8 output
neurons in layer C5 with each one of the 4 neurons in layer F6. Again, we have used
zero values for the biases $b_6^q$. The error function can then be computed as:
$$E = \frac{1}{2} \sum_q \left( y_6^q - d_6^q \right)^2 \qquad (6.17)$$
where $y_6^q$ is the output obtained at output port $q$ and $d_6^q$ is the desired output for
port $q$.
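The two expressions above can be sketched in C++ as follows. The sketch assumes that the product in Eq. (6.16) is summed over the 8 inputs $p$ coming from layer C5, which the equation leaves implicit; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Output of one F6 neuron q, following Eq. (6.16).
// y5[p] are the C5 outputs and w_q[p] the weights W_6^{p,q}.
// Assumption: the term y_5^p * W_6^{p,q} is summed over p.
double f6_output(const std::vector<double>& y5,
                 const std::vector<double>& w_q,
                 double bias, double A, double S) {
    assert(y5.size() == w_q.size());
    double acc = 0.0;
    for (std::size_t p = 0; p < y5.size(); ++p) {
        acc += y5[p] * w_q[p];
    }
    return A * std::tanh(S * acc + bias);
}

// Error function of Eq. (6.17): half the sum of squared differences
// between the obtained outputs y6 and the desired outputs d.
double squared_error(const std::vector<double>& y6,
                     const std::vector<double>& d) {
    assert(y6.size() == d.size());
    double e = 0.0;
    for (std::size_t q = 0; q < y6.size(); ++q) {
        const double diff = y6[q] - d[q];
        e += diff * diff;
    }
    return 0.5 * e;
}
```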
6.6.3 Computations in the Frame-free system
To simplify the hardware implementation of the AER-based system, we decided to consider
non-linearities only in layers C3 and C5. In the
AER-based system, the output events per second in feature map $q$ from neuron $(i,j)$
at layer $l-1$ (denoted by $\mathit{eout}_{l-1}^q(i,j)$) are used as input events in the following layer
(and are denoted by $\mathit{ein}_l^q(i,j)$). This gives the following equivalence:
$$\mathit{ein}_l^q(i,j) = \mathit{eout}_{l-1}^q(i,j) \qquad (6.18)$$
6.6.3.1 Filtering Layers
For filtering layers, the number of events per second that a neuron at the output
feature map $q$ in layer $l$ ($l = 1, 3, 5$) and position $(i,j)$ fires if the refractory period is
not considered is denoted as $\mathit{eout}_{l,\mathrm{lin}}^q(i,j)$ and is computed as:
$$\mathit{eout}_{l,\mathrm{lin}}^q(i,j) = \frac{\sum_{p \in P} \sum_{m \in M} \sum_{n \in N} \left( \mathit{ein}_l^p(m,n) \cdot W_l^{p,q}(i-m, j-n) \right)}{\mathit{Threshold}_l} \cdot (1 - F_l^q) + n_l^q \qquad (6.19)$$
where $\mathit{ein}_l^p(m,n)$ is the number of events per second coming to input port $p$ (feature map
$p$ of the previous layer) in layer $l$ coding the address $(m,n)$, $W_l^{p,q}$ is the convolution map
connecting feature map $p$ with output feature map $q$, and $\mathit{Threshold}_l$ is the threshold
selected for layer $l$. In the expression we have incorporated two new variables to model
the loss of events due to a forgetting factor $F_l^q$ (events per second forgotten by one
neuron belonging to feature map $q$) and a quantization noise factor $n_l^q$ (caused by the
use of a certain threshold).
6.6.3.2 Subsampling Layers
In subsampling layers S2 and S4, each input event with address $(x_{IN}, y_{IN})$ is replicated
to the corresponding output port but coding a new address $(x_{NEW}, y_{NEW})$ computed
by
$$x_{NEW} = \lfloor x_{IN}/2 \rfloor; \quad y_{NEW} = \lfloor y_{IN}/2 \rfloor \qquad (6.20)$$
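The address mapping of Eq. (6.20) amounts to an integer division, as the following C++ sketch shows. The `Address` struct and the function name are hypothetical and addresses are assumed to be non-negative.

```cpp
#include <cassert>

// Pixel address carried by an AER event (hypothetical helper type).
struct Address {
    int x;
    int y;
};

// Event-driven subsampling of Eq. (6.20): the floor of half of each
// coordinate, so that every 2x2 block of input addresses is merged
// into a single output address.
Address subsample_address(Address in) {
    assert(in.x >= 0 && in.y >= 0);  // integer division equals floor here
    return Address{in.x / 2, in.y / 2};
}
```

The four input addresses (4,8), (5,8), (4,9) and (5,9) are thus all mapped to (2,4).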
6.6.3.3 Sixth Layer F6
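In this layer, each connection between output unit $q$ and input unit $p$ is done through
a trained weight $W_6^{p,q}$:
$$\mathit{eout}_{6,\mathrm{lin}}^q = \frac{\sum_{p \in P} \left( \mathit{ein}_6^p \cdot W_6^{p,q} \right)}{\mathit{Threshold}_6} \cdot (1 - F_6^q) + n_6^q \qquad (6.21)$$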
6.6.4 Implementation of non-linearities and equivalences between the
frame-based and the AER-based implementation
After implementing non-linearities in the frame-free system, the real number of output
events per second in layer $l$ and feature map $q$ with position $(i,j)$ is:
$$\mathit{eout}_l^q(i,j) = \min\left( \mathit{eout}_{l,\mathrm{sat}},\ \mathit{eout}_{l,\mathrm{lin}}^q(i,j) \right) \qquad (6.22)$$
where $\mathit{eout}_{l,\mathrm{lin}}^q(i,j)$ is computed as described in Eqs. (6.19) and (6.21), and $\mathit{eout}_{l,\mathrm{sat}}$ is
the maximum number of events per second allowed by the refractory period in layer $l$.
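The rate model of Eqs. (6.19) and (6.22) can be sketched in C++ as follows. This is an illustration only: the function names are hypothetical, the weighted sum of input rates is assumed to be computed beforehand, and the saturation rate is taken as the inverse of the refractory period.

```cpp
#include <algorithm>
#include <cassert>

// Linear output rate of Eq. (6.19) for one neuron.
// weighted_input_rate is the sum over p, m, n of ein * W (events/s).
double linear_rate(double weighted_input_rate, double threshold,
                   double forgetting, double noise) {
    assert(threshold > 0.0);
    return weighted_input_rate / threshold * (1.0 - forgetting) + noise;
}

// Real output rate of Eq. (6.22): the linear rate clipped at the maximum
// rate allowed by the refractory period (in seconds).
double output_rate(double lin_rate, double refractory_period) {
    assert(refractory_period > 0.0);
    const double sat_rate = 1.0 / refractory_period;
    return std::min(sat_rate, lin_rate);
}
```

The clipping plays the role that the saturating sigmoid plays in the frame-based system.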
We can consider that each layer l is characterized by a layer time constant τl that
corresponds to the minimum time between events in one of the channels belonging to
the layer. This layer time constant τl is specified by the refractory period limiting the
firing activity at the outputs of layers C3 and C5, or by the maximum number of events
per second travelling in one channel belonging to one layer without refractory periods
(C1, S2, S4 and F6). Taking these ideas into consideration, in layers C3 and C5, the
layer time constants τ3 and τ5 are equal to the refractory periods (limiting the firing
activities) computed for these layers. In layers C1, S2, S4 and F6, as we do not use
refractory periods, time constants τ1, τ2, τ4 and τ6 are determined by the maximum
number of events per second travelling in these layers.
To compute the layer time constants τl for each layer in the AER-based system, we
have to compute the corresponding refractory periods (so that non-linearities equivalent
to sigmoid functions can be implemented). To do this, we have established mathematical
relations with the frame-based computations.
In the first layer, as we do not use non-linearities, we have no saturation in the
output units. Thus, we have computed τ1 experimentally. We have analyzed the output
obtained in all the units of layer C1 in the frame-free system and have computed the
minimum time separation between events corresponding to each output unit. Then
these values have been averaged between all the output units and the mean value
obtained has been chosen as τ1. With these computations we obtained a layer time
constant τ1 of 1.8 ms, which corresponds to approximately 560 eps.
In layer S2 (and likewise in S4), events coming from the previous layer are simply
replicated with a modified address, as specified in eq. (6.20), in order to
reduce the input address space by a factor of four. For these layers, we do not use refractory
periods. However, the layer time constants τ2 and τ4 are determined by τ1 and τ3 and
are computed as:
$$\tau_2 = \tau_1/4; \quad \tau_4 = \tau_3/4 \qquad (6.23)$$
Using eq. (6.23) we obtain a value for τ2 of 0.45 ms.
In layers C3 and C5, we can use Eqs. (6.13) and (6.19) to relate the analog output
for one neuron belonging to a feature map $q$ in position $(i,j)$ in the frame-based version
with the number of output events fired by a neuron located in the same position, layer
and feature map in the AER-based version with the following expressions:
$$\mathit{eout}_{3,\mathrm{lin}}^q(i,j) = \frac{x_3^q(i,j)}{4 \cdot \mathit{Threshold}_3 \cdot \tau_2} \cdot (1 - F_3^q) + n_3^q, \qquad \tau_2 = \frac{1}{\mathit{ein}_{2,\max}} \qquad (6.24)$$
$$\mathit{eout}_{5,\mathrm{lin}}^q(i,j) = \frac{x_5^q(i,j)}{4 \cdot A \cdot S \cdot \mathit{Threshold}_5 \cdot \tau_4} \cdot (1 - F_5^q) + n_5^q, \qquad \tau_4 = \frac{1}{\mathit{ein}_{4,\max}} \qquad (6.25)$$
where $\tau_2$ and $\tau_4$ indicate the layer time constants of the previous layers, determined
by the maximum firing rates $\mathit{ein}_{2,\max}$ and $\mathit{ein}_{4,\max}$.
Note that the factors of 4 are due to the 4-neighbour merging operation implemented
in the previous subsampling layers. The factor $A \cdot S$ accounts for the fact that the
output activity of units belonging to layer C3 is modulated by this constant (see eq.
(6.14)) in the frame-based implementation.
To implement the non-linearities in the frame-free system we need first to compute
the saturation point in the AER-based version. To do this we can relate the analog
state for one neuron belonging to a feature map q in position (i, j) in the frame-based
version with the number of output events fired by a neuron in the AER-based version.
For this we can use eqs. (6.24) and (6.25), making $x_l^q(i,j)$ equal to the saturation point
$x_{\mathrm{sat}}$ in the frame-based version. For simplification purposes, the simple case without
forgetting ratio and quantization noise will be considered.
Taking these ideas into consideration, in layer C3 the maximum number of events
per second in the AER version can be computed as:
$$\mathit{eout}_{3,\mathrm{sat}} = \frac{x_{3,\mathrm{sat}}}{4 \cdot \mathit{Threshold}_3 \cdot \tau_2} \qquad (6.26)$$
Figure 6.21: Computation of the saturation point in the hyperbolic tangent function. The function saturates when the absolute value of the argument is higher than 1.5283.
The hyperbolic tangent function computed as described in eq. (6.14) in the frame-based
version saturates when the argument is greater than 1.528 (saturation point
$x_{\mathrm{sat}}$) as shown in Fig. 6.21. Thus, considering that the $\tau_2$ value is 0.45 ms and that
$x_{\mathrm{sat}}$ is approximately 1.528, and using a threshold value of 2 for the third layer, we obtain
a firing rate at the saturation point in layer C3 of approximately 427.8 eps.
This value leads to a refractory period (third layer time constant
$\tau_3 = 1/\mathit{eout}_{3,\mathrm{sat}}$) of approximately 2.3 ms.
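The figures quoted for layer C3 can be checked with the short calculation below. It is a worked check rather than part of the original derivation, and it assumes that the unrounded value $\tau_1 = 1/560$ s is carried through; with the rounded $\tau_2 = 0.45$ ms, Eq. (6.26) would give about 424.4 eps instead.

```latex
\begin{aligned}
\tau_2 &= \frac{\tau_1}{4} = \frac{1}{4 \cdot 560\ \mathrm{eps}} = \frac{1}{2240\ \mathrm{eps}} \approx 0.446\ \mathrm{ms} \approx 0.45\ \mathrm{ms} \\
\mathit{eout}_{3,\mathrm{sat}} &= \frac{x_{\mathrm{sat}}}{4 \cdot \mathit{Threshold}_3 \cdot \tau_2}
  = \frac{1.528 \cdot 2240}{4 \cdot 2} \approx 427.8\ \mathrm{eps} \\
\tau_3 &= \frac{1}{\mathit{eout}_{3,\mathrm{sat}}} \approx 2.34\ \mathrm{ms}
\end{aligned}
```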
Using eq. (6.23) we get a value for $\tau_4$ of 0.58 ms. As done in layer C3, $\tau_5$ in the
AER-based implementation can be computed making $x_5^q(i,j)$ equal to the saturation
point $x_{\mathrm{sat}}$:
$$\tau_5 = \frac{\mathit{Threshold}_5 \cdot \tau_4}{x_{\mathrm{sat}}/(4 \cdot S \cdot A)} \qquad (6.27)$$
Again, using the saturation point $x_{\mathrm{sat}}$ equal to 1.528, and a threshold value of 10 for
the fifth layer, we obtain an approximate refractory period in this layer $\tau_5$ of 17.5 ms,
corresponding to 57.15 eps at saturation.
Note that the selected experimental refractory period was 16 ms, which is very close to
the theoretically computed value.
In layer F6, we do not use refractory periods. However, considering the results
obtained, we can compute the minimum achievable time between events at the output
using the weights computed for this layer. The minimum time between events will
occur when events from the fifth layer make all the weights interfere constructively
Table 6.6: Refractory periods, layer times and maximum number of events per second computed for each layer in the system
in the output neurons. Then, we can compute the maximum number of events per
second as:
$$\mathit{eout}_{6,\max}^q = \frac{\sum_{p \in P} \| W_6^{p,q} \|}{\mathit{Threshold}_6 \cdot \tau_5} \qquad (6.28)$$
With the weights computed for this layer, using a threshold of value 1.5 and $\tau_5$
equal to 16 ms, we obtain a maximum firing rate of 111.1 eps, which corresponds to a
minimum time between events of 9 ms.
In Table 6.6 we show the refractory periods, layer times and events per second
obtained at each layer. Using the expressions described in eqs. (6.26), (6.27) and
(6.28) we can configure our system in order to obtain certain desired system times
(and corresponding events per second at each layer) according to the input firing rate
provided by the AER retina, the thresholds selected and the weight values computed
at each layer.
Chapter 7
CONCLUSIONS
Throughout this thesis work we have presented the event-driven software simulator tool
AERST, implemented in Matlab and C++, to simulate systems based on AER.
The AERST tool allows the user to easily simulate any system described by a netlist
where the modules together with their parameters are listed. AERST is able to process
around 20 Keps. It allows us to simulate complex and hierarchically structured systems
before the available hardware technology becomes mature. AERST can be used to test
new AER processing modules within large systems, and thus orient hardware developers
on what kind of AER hardware modules may be useful and what performance
characteristics they should possess.
Throughout this thesis work we have presented three large multi-module multi-
layer convolutional neural networks. They have been simulated considering available
AER hardware modules modeled with their performance figures. The first application
is intended to recognize characters even when they present slight deformations. The
second system implements texture classification for image retrieval and the third one
is a large neural network trained using backpropagation to implement people posture
recognition. In the three systems, the results clearly show the high speed that AER
provides and the possibility of implementing complex processing systems with it. Besides, in all the
systems recognition was achieved while the sensory stimulus was still being generated.
In general, processing delay in feed-forward event-based systems is given mainly by
the number of layers and the number of events needed to represent the input stimulus.
In some of the systems implemented, the processing capability could be increased by
adding more modules per layer working in parallel and without increasing the number
of layers. This means that the processing capability is increased without penalizing
delays, although at the cost of adding hardware.
Currently, the available AER hardware modules are quite preliminary, although
their performance figures provide very promising system level performance estimations.
In the future, AER hardware design is aimed at miniaturizing present AER modules
so that a large number of them (several hundred) could fit on a single PCB or in
a large NoC die. Also, such multi-module elements should allow a large degree of
reconfigurability and reprogrammability, so that many different applications can easily
be set up.
Appendices
Appendix A
AERST Tool User Guide
A.1 Introduction
The purpose of AERST is to simulate systems composed of several interconnected AER
modules. It is a simple and open MATLAB and C++ based simulator. A basic library
of AER modules has been developed. The user can easily add new AER modules to
the library or elaborate the available modules with different levels of complexity. This
Appendix provides a guide on how to use the tool together with a step-by-step
example.
A.2 Description of an AER System
An AER system is described in AERST by a netlist given in a configuration file. In
this netlist the AER system is composed of sources, channels and modules.
In the netlist, channels are identified by integer numbers {1,2,3,...,N}. All input
sources to the system are defined in a single line beginning with the word sources.
Then, between braces, the user indicates the channel numbers the input sources are
connected to. Finally, the user has to indicate, also between braces, the names of the
MATLAB files (mat files) containing the lists of source events.
The sources command has the following structure:
sources {1, 2, ..., N}{datasource1, datasource2, ..., datasourceN} (A.1)
In the Matlab implementation, the list of events belonging to a source or channel
has the structure of a two-dimensional matrix. Each row of the matrix corresponds
to one event. In our implementation each event contains six fields. The first three
correspond to timing information of the event, while the last three correspond to data
transmitted by the event:
[Tprereqst Treqst Tack x y sign] (A.2)
The data fields are irrelevant for the simulator, and only need to be interpreted
properly by the modules (instances) receiving and generating them. For the particular
cases we describe in this thesis we have always used the same three fields: ‘x’ and ‘y’
represent the coordinates or addresses of the pixel originating the event and ‘sign’ its
sign. The three timing fields are as follows: ‘Tprereqst’ represents the time at which
the event is created at the emitter instance, ‘Treqst’ represents the time at which the
event is processed by the receiver instance, and ‘Tack’ represents the time at which
the event is finally acknowledged by the receiver instance. We distinguish between a
pre-Request time and an effective Request time. The first one depends only on
the emitter instance, while the second one requires that the receiver instance is ready
to read and process an event request. This way, we can provide as source a full list of
events which are described only by their data fields and pre-Request times.
Times in AERST are double numbers without units. This means that the unit
used in all the times appearing in parameters and event information is the user's
choice. It is the user who determines the meaning of such numbers. The tool only uses
these numbers and processes them according to the parameters and the event timing
information provided. The user has to check that the parameters and the event timing
information are expressed in consistent units.
In the C++ implementation, the sources are provided to the tool as matrices (as in the
MATLAB implementation). A Matlab function called write2file is provided to
convert the sources to text files. However, the internal representation of events in the
C++ implementation is a bit different. The events are stored in lists where each
element has two fields. The first field contains the information of the event, which is
stored in a vector with the six components previously described. The second field is a
pointer to the following event to be processed in the list. The use of lists and pointers
is totally transparent to the user. The number of fields inside an event can be increased
using more fields in the row vector. This is an open issue left to the user.
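The list representation described above can be pictured with the following C++ sketch. It is not the AERST source code: the type and function names are hypothetical, and keeping the list ordered by pre-Request time is an assumption made here for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Number of fields per event: [Tprereqst Treqst Tack x y sign].
constexpr std::size_t kEventFields = 6;

// One list element: the event vector plus a pointer to the next event.
struct EventNode {
    std::array<double, kEventFields> data;
    EventNode* next;
};

// Builds an unprocessed event: Treqst = 0 and Tack = -1 until it is handled.
EventNode make_event(double t_prereq, double x, double y, double sign) {
    EventNode e;
    e.data = {{t_prereq, 0.0, -1.0, x, y, sign}};
    e.next = nullptr;
    return e;
}

// Inserts node into a list kept in ascending pre-Request time (assumed
// ordering) and returns the new head of the list.
EventNode* insert_sorted(EventNode* head, EventNode* node) {
    assert(node != nullptr);
    if (head == nullptr || node->data[0] < head->data[0]) {
        node->next = head;
        return node;
    }
    EventNode* cur = head;
    while (cur->next != nullptr && cur->next->data[0] <= node->data[0]) {
        cur = cur->next;
    }
    node->next = cur->next;
    cur->next = node;
    return head;
}
```

Adding more fields to an event would only require changing `kEventFields` in this sketch.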
After declaring the sources, the instances are listed. The syntax to declare a system
module is:
module_name {input_channels}{output_channels}{parameters_file}{state_file} (A.3)
In the Matlab version, file extensions are omitted as all source, parameter and state
files have the .mat extension (Matlab format). However, in the C++ implementation
files can have different extensions. Therefore, it is necessary to include their extensions
in the file names.
An instance is described by an independent function whose name is identical to the
instance name (module_name) that will appear in the netlist file describing the system.
The declaration format of a MATLAB instance (first line of a Matlab function) is
the following:
[new_event_in, events_out, new_state, new_time, port_out] = module_f(event_in, pars, old_state, old_time, port_in) (A.4)
event_in corresponds to the present event information (as in eq. A.2) sent through
the channel. The event_in information passed to the function as input parameter
contains the x and y coordinates of the event being processed and its Tprereqst time.
The updated new_event_in returned by the function also contains the established Treqst
and Tack times. old_state and new_state represent the instance state before and after
processing the event. old_time and new_time are the global system times before and
after processing the event. events_out is a list of output events produced by the instance
at its different output channels. port_in is the number of the port through which the event has
entered the module and port_out is a list of numbers identifying the output ports where
each of the output events created will be written. These output events (which are still
unprocessed events) are included by the simulator in their respective channel matrices
with Tprereqst set to the present time, and will be processed at a later time by
their respective destination instances.
In the C++ implementation, the syntax is similar but with some differences:
int instance_name(double *event_in, double ***events_out, params2 params, params2
*state, double *timeact, int port_input, int **port_output, int tam_vect)
event_in is a pointer to the input event vector. It is received as a pointer so it can be
modified and updated inside the module. The input event is given back by the function
with its Trqst and Tack fields updated.
params are the module parameters. They are stored in a struct where each field
corresponds to one parameter. The structure of params has been defined in an initialization
routine.
state is the internal state of the AER module before processing the incoming event.
The variables are stored in a struct where each field corresponds to one state variable.
The structure of the state variable has been defined in an initialization routine.
timeact is a pointer to the current simulation time before and after processing the event.
port_input is the port through which the event is received.
events_out is a pointer to an empty two-dimensional array. It returns the events
generated (if any) in the module output ports during the execution of the function. The
module has to reserve dynamically the memory space needed for this array. When the
events are created and stored in the array they have their Trqst set to ‘0’, their Tack set
to ‘-1’ and their Tprerqst set to the simulation time. The generated events are added to
the event lists in the corresponding channels.
port_output is a pointer to a vector of port numbers indicating the port each event in
events_out belongs to.
tam_vect is the number of fields in an event. In the format we have used this field was
set to ‘6’. However, the user can increase this number.
The C++ function returns an integer value corresponding to the number of output
events that have been generated by the processing module.
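To make the calling convention concrete, the sketch below shows a pass-through module written against the signature given above. It is a hypothetical illustration: the real definition of params2 is not reproduced in this guide, so a stand-in struct with a single field is declared here, and the use of `new[]` for the dynamically reserved arrays is an assumption about how the memory is later released.

```cpp
#include <cassert>

// Hypothetical stand-in for the tool's params2 struct (not the real one).
struct params2 {
    double delay;  // processing delay introduced by the module
};

// Pass-through module: stamps the incoming event and forwards a copy of
// its data fields to output port 1. Returns the number of output events.
int delay_module(double* event_in, double*** events_out, params2 params,
                 params2* state, double* timeact, int port_input,
                 int** port_output, int tam_vect) {
    assert(tam_vect >= 6);
    (void)state;       // this simple module keeps no internal state
    (void)port_input;  // and treats all input ports alike

    event_in[1] = *timeact;    // Trqst: time at which processing starts
    *timeact += params.delay;  // advance the simulation time
    event_in[2] = *timeact;    // Tack: time at which processing ends

    const int n_out = 1;
    *events_out = new double*[n_out];  // memory reserved by the module
    (*events_out)[0] = new double[tam_vect];
    *port_output = new int[n_out];

    double* out = (*events_out)[0];
    for (int k = 0; k < tam_vect; ++k) {
        out[k] = event_in[k];  // copy the data fields x, y, sign
    }
    out[0] = *timeact;  // Tprerqst set to the simulation time
    out[1] = 0.0;       // Trqst of a new event
    out[2] = -1.0;      // Tack of a new event
    (*port_output)[0] = 1;
    return n_out;
}
```

In this sketch the caller is responsible for releasing the arrays with `delete[]` once the output events have been copied into the channel lists.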
A.3 MATLAB Initialization of Parameters and States
As will be further explained in the next sections, the files containing the parameters of
each AER instance (parameters_file) and their initial internal states (state_file) are
created each time the simulator AERST is invoked. The user has to create and/or
update the initialization routines beforehand. These initialization routines are invoked
at the initial step of the simulation and have to be customized for the particular AER
modules listed in the configuration file. To illustrate how to create the parameter and
state files for a certain application, we show below some example initialization routines.
A.3.1 Initialization of Parameters
The example code to create a parameter file for an AER module that rotates an input
visual flow in a counterclockwise direction is:
function []=pars_file_rotate()
%SAMPLE PARAMETERS INITIALIZATION ROUTINE
direction=1; %SET THE OPTION FOR A COUNTERCLOCKWISE ROTATION DIRECTION
size1=128; % X DIMENSION IMAGE SIZE
size2=128; % Y DIMENSION IMAGE SIZE
save pars_rotate direction size1 size2
A.3.2 Initialization of States
Below, we show a sample initialization routine to create an initial internal state file for
an AER module. The example initialization routine creates a file (state1 in this case)
containing two all-zero matrices (one 128x128 called J and one 32x32 called times)
and a scalar state variable (potential_value). The state file created can be shared by
several AER modules with matched sizes.
function []=initstate1()
%INITIALIZATION OF THE INTERNAL STATE PARAMETERS OF AN AER MODULE
size1_1=128;
size1_2=128;
size2_1=32;
size2_2=32;
potential_value=0;%FIRST SCALAR STATE VARIABLE
J=zeros(size1_1,size1_2);%FIRST MATRIX STATE VARIABLE
times=zeros(size2_1,size2_2);%SECOND MATRIX STATE VARIABLE
save state1 J times potential_value%CREATE STATE FILE
A.4 RUNNING AERST in MATLAB
AERST in MATLAB is called by the user in the command prompt to execute a simu-
lation as:
[VARS, CHANNELS] = AERST()
Before running a simulation the user has to customize the AERST initialization
contents. The user has to:
1. Store the name of the configuration file in the variable CONFIG_FILE. The configuration file contains the netlist of the AER system to be simulated.
2. Store the name of the output file in the variable OUTPUT_FILE. The output file is a text file that will be created during the simulation. The simulator stores in it the data of the events generated in the system channels.
3. List the initialization routines. These routines create the parameter and state files
(parameters files and state files) that contain the parameters and internal states used
inside the AER modules used in the system. The proper initialization routines for the
current system netlist have to be created and/or updated by the user before each
simulation.
At the end of the AERST file, the main function (main_aerst) is called.
Below is an example showing a system described by a configuration file conf_file.txt
and an output file outfile.txt. The system receives a source called myevents.mat and
it has 9 channels. The initialization routines are pars_file1, pars_file2, pars_file3,
pars_file4, initstate1, initstate2 and initstate3:
function [VARS,CHANNELS] = AERST()
%USER CONFIGURATION FILE:
CONFIG_FILE='conf_file.txt';
%USER OUTPUT FILE:
OUTPUT_FILE=('outfile.txt');
MAT_SOURCE='myevents.mat';
NUMB_CHANNELS=9;
%GO TO THE PARAMETER FOLDER AND CREATE PARAMETER AND STATE FILES
cd ./CREATE_PARAMETERS;
%USER PARAMETER AND STATE FILES:
pars_file1;%USER PARAMETER FILE 1
pars_file2;%USER PARAMETER FILE 2
pars_file3;%USER PARAMETER FILE 3
pars_file4;%USER PARAMETER FILE 4
initstate1;%USER STATE FILE 1
initstate2;%USER STATE FILE 2
initstate3;%USER STATE FILE 3
copyfile('*.mat','../AERST_MAIN');
cd ../CONFIG_FILES
copyfile(CONFIG_FILE,'../AERST_MAIN');
cd ../SOURCES
copyfile(MAT_SOURCE,'../AERST_MAIN');
cd ../AERST_MAIN
%THE MAIN SIMULATOR FUNCTION IS INVOKED
VARS = main_aerst(CONFIG_FILE,OUTPUT_FILE);
delete('*.mat');
copyfile(OUTPUT_FILE,'../ALG_REC');
delete('*.txt');
cd ../ALG_REC
%FINALLY, THE EVENTS IN ALL CHANNELS ARE RETRIEVED
[CHANNELS]=disktocell2(NUMB_CHANNELS,0,0,OUTPUT_FILE);
delete('*.txt');
cd ..
The AERST function returns two variables: CHANNELS and VARS. CHANNELS
is a matrix of cells where each element stores all the events that have travelled in one
channel.
VARS is a matrix of cells where each row contains the information regarding one
particular AER module in the system. Each row in VARS has five columns:
1. First column contains the name of the corresponding AER module.
2. Second column contains the number of the input channels connected to that
module.
3. Third column is a structure containing the parameters of the AER module.
4. Fourth column is a structure containing the final internal states of the AER
module.
5. Fifth column contains the number of the output channels connected to that mod-
ule.
A.4.1 Building Modules
To build a MATLAB user defined module the user has to respect the format for the
declaration of functions in the simulator. The user also has to provide the outputs to
the module. An example for one simple function is the following:
function [new_event_in,events_out,new_state,new_time,port_out]=myfunction(event_in,pars,old_state,old_time,port_in)
new_event_in=event_in;
new_state=old_state;
new_time=old_time;
% USE THE INCOMING EVENT TO GET (x,y) COORDINATES AND sign
x = event_in(1,4);
y = event_in(1,5);
sign = event_in(1,6);
size1 = pars.size1;% EXAMPLE OF HOW TO USE PARAMETER 1
size2 = pars.size2;% EXAMPLE OF HOW TO USE PARAMETER 2
new_state.J(x,y) = old_state.J(x,y)+1*sign; % WE UPDATE THE INTERNAL STATE
new_event_in(1,2) = new_time; % WE UPDATE THE REQUEST TIME OF THE INCOMING EVENT
new_time = new_time+10e-9; % A TEN NANOSECONDS DELAY IS INTRODUCED
new_event_in(1,3) = new_time; % THE FINAL TIME IS USED TO SET THE ACKNOWLEDGE TIME
% AN OUTPUT EVENT PER OUTPUT CHANNEL WITH THE SAME X,Y,SIGN IS GENERATED
events_out = [new_time 0 -1 x y sign; new_time 0 -1 x y sign];
port_out=[1 2]; %THE TWO EVENTS ARE WRITTEN IN PORTS ONE AND TWO.
In the above function, the incoming event is updated with a Trqst value equal to
the initial time and a Tack equal to the final time. Updating Trqst and Tack
inside the AER module is not mandatory. By default, the simulator sets Trqst
and Tack to the simulator time before the execution of the module function. If one or
more events are created, it is necessary to indicate the output port for each event in
port_out.
A.5 C++ Initialization of Parameters and States
In the C++ implementation, similar Matlab functions can be used to create the
parameter and state text files. The steps needed to build an initialization file for C++
are the following:
1. Create parameters as described in the Matlab initialization function.
2. If there are matrices, indicate in rows the number of rows and columns of each matrix. Use the first row in rows to indicate the number of rows of each matrix. Use the second row in rows to indicate the number of columns.
3. Create and open the parameter (or state) file.
4. Write key expression #doubles if there are scalar parameters.
5. Write the number of scalar parameters (num_doubles) in the file.
6. Write sequentially each of the scalar parameters in the file.
7. Write key expression #matrices if there are matrices used as parameters.
8. Write the number of matrices (num_matrices) in the file.
9. Write rows in the file to indicate rows and columns of each matrix.
10. Write sequentially each of the matrices in the file.
11. Close the parameters (or state) file.
Note that creating the initialization files this way we manage the memory space
used in the application in an efficient way, as the tool reserves only the number and
size of the parameters indicated.
A.5.1 Initialization of Parameters
The C++ parameter initialization routine corresponding to the MATLAB routine described
above is the following:
function []=pars_file_rotate()
%SAMPLE PARAMETERS INITIALIZATION ROUTINE FOR A ROTATION AER MODULE
direction=1; %SET A COUNTERCLOCKWISE ROTATION DIRECTION
size1=128; % X DIMENSION IMAGE SIZE
size2=128; % Y DIMENSION IMAGE SIZE
%START THE C++ INITIALIZATION:
s2='pars_rotate.txt'; %name of the file with the parameters
num_doubles=3; %number of double parameters
%WRITE TO THE C++ txt PARAMETERS FILE:
fid=fopen(s2,'w'); %OPEN PARAMETERS OUTPUT FILE
fprintf(fid,'#doubles\n'); %WRITE KEYWORD #doubles
fprintf(fid,'%d\n',num_doubles);
fprintf(fid,'%f ',direction);
fprintf(fid,'%f ',size1);
fprintf(fid,'%f\n',size2);
fclose(fid);
Note that the parameters file contains only 3 scalar parameters and no matrices.
This implies that we have to indicate neither the number of matrices nor their
numbers of rows and columns (stored in the rows matrix).
A.5.2 Initialization of States
For the state file, the C++ initialization function is:
function []=initstate1()
%INITIALIZATION OF THE INTERNAL STATE PARAMETERS OF AN AER MODULE
size1_1=128;
size1_2=128;
size2_1=32;
size2_2=32;
potential_value=0;
J=zeros(size1_1,size1_2);
times=zeros(size2_1,size2_2);
%START THE C++ INITIALIZATION
num_doubles=1;
num_matrices=2;
%NOW, WE STORE THE NUMBER OF ROWS AND COLUMNS OF EACH MATRIX IN ROWS:
rows=[size1_1 size2_1;size1_2 size2_2];
%WRITE TO THE C++ STATE FILE:
s2='state1.txt'; fid=fopen(s2,'w'); %OPEN STATE OUTPUT FILE
fprintf(fid,'#doubles\n'); %WRITE KEYWORD #doubles
fprintf(fid,'%d\n',num_doubles);
fprintf(fid,'%f\n',potential_value);
fprintf(fid,'#matrices\n'); %WRITE KEYWORD #matrices
fprintf(fid,'%d\n',num_matrices);
fprintf(fid,'%d\n',rows);
fprintf(fid,'%f\n',J);
fprintf(fid,'%f\n',times);
fclose(fid);
Note that this time we have a scalar parameter (potential_value) and two matrices (J
and times). For each matrix, rows stores the number of rows and columns.
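The text file produced by the routine above could be read back on the C++ side along the lines of the following sketch. It is not the reader used by AERST: the struct and function names are hypothetical. It relies on the fact that MATLAB's fprintf walks matrices column by column, so the file lists the rows and columns of the first matrix, then those of the second, and each matrix is stored in column-major order.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One matrix read from the file; values are kept in column-major order.
struct Matrix {
    int rows = 0;
    int cols = 0;
    std::vector<double> values;
};

// Contents of a parameter or state file.
struct ParamFile {
    std::vector<double> doubles;
    std::vector<Matrix> matrices;
};

// Parses the "#doubles" and "#matrices" sections of a parameter/state file.
ParamFile read_param_file(std::istream& in) {
    ParamFile out;
    std::string key;
    while (in >> key) {
        if (key == "#doubles") {
            std::size_t n = 0;
            in >> n;
            out.doubles.resize(n);
            for (double& d : out.doubles) in >> d;
        } else if (key == "#matrices") {
            std::size_t n = 0;
            in >> n;
            out.matrices.resize(n);
            // Sizes come first: rows and columns of each matrix in turn.
            for (Matrix& m : out.matrices) in >> m.rows >> m.cols;
            // Then the matrix contents, one matrix after another.
            for (Matrix& m : out.matrices) {
                m.values.resize(static_cast<std::size_t>(m.rows) * m.cols);
                for (double& v : m.values) in >> v;
            }
        }
    }
    return out;
}
```

Because only the announced number of scalars and matrix entries is read, memory is reserved for exactly what the file declares.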
A.6 RUNNING AERST in C++
In the C++ implementation we should provide the source txt files and the initialization
files for the parameters and state variables. A function for creating the sources in txt
format from the Matlab sources stored as matrices is provided, called write2file. The
calling format is the following:
write2file(input_matrix,text_file);
input_matrix is a two-dimensional matrix with the source events and text_file is the
name of the destination txt file.
AERST in C++ can be called from MATLAB by the user in the command prompt
to execute a simulation as:
[CHANNELS] = AERST()
Before running a simulation the user has to customize the AERST initialization
contents. The user has to:
1. Provide the sources to the system (stored in txt files).
2. Store the name of the configuration file in the variable CONFIG_FILE. The configuration file contains the netlist of the AER system to be simulated.
3. Store the name of the output file in the variable OUTPUT_FILE. The output file is a text file that will be created during the simulation. The simulator stores in it the data of the events generated in the system channels.
4. List the initialization routines. These routines create the parameter and state files that contain the parameters and internal states used inside the AER modules of the system. The proper initialization routines for the current system netlist have to be created and/or updated by the user before each simulation.
Below is an example showing a system with 9 channels described by the configuration file conf_file.txt and an output file outfile.txt. There is one source called myevents in MATLAB format, which is converted to txt format using write2file. The initialization routines are called pars_file1, pars_file2, pars_file3, pars_file4, initstate1, initstate2 and initstate3:
function [CHANNELS]=AERST()
%MATLAB ARRAY OF SOURCE EVENTS:
MAT_SOURCE='myevents.mat';
%CPP TXT SOURCE FILE:
TXT_SOURCE='myevents.txt';
%USER CONFIGURATION FILE:
CONFIG_FILE='conf_file.txt';
%USER OUTPUT FILE:
OUTPUT_FILE='outfile.txt';
NUMB_CHANNELS=9;

%CODE TRANSPARENT TO THE USER:
%LOAD SOURCE AS A MATLAB EVENT ARRAY AND CONVERT IT TO TXT FORMAT:
cd ./SOURCES
load(MAT_SOURCE)
write2file(MAT_SOURCE, TXT_SOURCE); %converts from mat to txt
copyfile(TXT_SOURCE,'../AERST_MAIN/');
cd ..
delete(TXT_SOURCE);

%GO TO THE CONFIGURATION FILES FOLDER AND COPY THE CONFIG FILE TO THE MAIN FOLDER
cd ./CONFIG_FILES
copyfile(CONFIG_FILE, '../AERST_MAIN/config_file.txt');

%GO TO THE PARAMETER FOLDER AND CREATE PARAMETER AND STATE FILES
cd ../CREATE_PARAMETERS
%USER PARAMETER AND STATE FILES:
pars_file1; %USER PARAMETER FILE 1
pars_file2; %USER PARAMETER FILE 2
pars_file3; %USER PARAMETER FILE 3
pars_file4; %USER PARAMETER FILE 4
initstate1; %USER STATE FILE 1
initstate2; %USER STATE FILE 2
initstate3; %USER STATE FILE 3
copyfile('*.txt','../AERST_MAIN'); %IN THE C++ SIMULATOR PARAMETERS AND STATES ARE STORED IN TXT FILES
delete('*.txt');

%CALL THE C++ MAIN PROGRAM:
cd ../AERST_MAIN
system('./AERST.exe config_file.txt out1.txt');
copyfile('out1.txt','../ALG_REC');
delete('*.txt');
cd ../ALG_REC
copyfile('out1.txt',OUTPUT_FILE);

%FINALLY, THE EVENTS IN ALL CHANNELS ARE RETRIEVED:
[CHANNELS]=disktocell3(NUMB_CHANNELS,0,0,OUTPUT_FILE);
delete('out1.txt');
delete(OUTPUT_FILE)
cd ..
Note that the parameter and state files are now stored as text files. Note also that the main program (AERST.exe) is invoked at the end of the AERST file. The AERST function returns CHANNELS, a cell array in which each element stores all the events that have travelled through one channel.
A.6.1 Building C++ Modules
To build a user-defined module the user has to respect the format for the declaration of functions in the simulator. The user has to allocate dynamically the memory needed for the output events and, if any output events exist, for the output port vector. An example of a simple function that replicates each input event in two different output ports and accumulates the event on a state array J is the following:
int example_function (double *event_in, double ***events_out, params2 params, params2 *state,
                      double *timeact, int port_input, int **port_output, int tam_vec)
{
    int i, x, y, numb_ports, numb_events;
    double timedelay, timetoprocess, **tt, **J;
    // Get parameters
    timedelay = params.par_doub[0];       // First parameter: delay of the asynchronous communication
    timetoprocess = params.par_doub[1];   // Second parameter: time to process one event
    numb_ports = (int)params.par_doub[2]; // Number of output ports
    // Update incoming event timing information
    event_in[1] = *timeact;               // We set the request time for the input event
    *timeact = *timeact + timedelay;      // We compute the present time for the module
    event_in[2] = *timeact;               // We acknowledge the input event
    *timeact = *timeact + timetoprocess;  // We compute again the new present time in the module
    // Update neuron state addressed by the incoming event
    J = state->p[0];                      // We load the state array J
    x = (int)event_in[3];
    y = (int)event_in[4];
    J[x][y] = J[x][y] + event_in[5];      // We accumulate the event in the array
    // Start dynamic memory management to create output events
    tt = new double *[numb_ports];        // We create dynamically the set of events
    for (i = 0; i < numb_ports; i++) {
        tt[i] = new double[tam_vec];      // We create dynamically each event of size specified by tam_vec
    }
    *events_out = tt;                     // events_out is the pointer to the array provided as output
    // We reserve space for the output port vector. Each element indicates in
    // which output port each event must be written.
    *port_output = new int[numb_ports];
    for (i = 0; i < numb_ports; i++) {    // We initialize each event
        (*events_out)[i][0] = *timeact;
        (*events_out)[i][1] = 0;
        (*events_out)[i][2] = -1;
        (*events_out)[i][3] = event_in[3];
        (*events_out)[i][4] = event_in[4];
        (*events_out)[i][5] = event_in[5];
        (*port_output)[i] = i + 1;
    }
    numb_events = numb_ports;
    return numb_events;                   // We return the number of events
}
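Since every module allocates its output events with new, some part of the program has to release them later. The guide does not state who is responsible for this, so the following is only a sketch under the assumption that the caller frees the arrays once the events have been consumed. Both replicate_event (a toy module with the same allocation pattern, without parameters or state) and free_events are hypothetical names.

```cpp
#include <cassert>

// Hypothetical toy module: it copies the input event to numb_ports output
// events, allocating the memory in the same way as the example above.
int replicate_event(const double* event_in, double*** events_out,
                    int** port_output, int numb_ports, int tam_vec) {
    assert(numb_ports > 0 && tam_vec > 0);
    double** tt = new double*[numb_ports];       // one pointer per output event
    for (int i = 0; i < numb_ports; ++i) {
        tt[i] = new double[tam_vec];             // one event of tam_vec fields
        for (int k = 0; k < tam_vec; ++k) tt[i][k] = event_in[k];
    }
    *events_out = tt;
    *port_output = new int[numb_ports];          // output port of each event
    for (int i = 0; i < numb_ports; ++i) (*port_output)[i] = i + 1;
    return numb_ports;
}

// Hypothetical caller-side cleanup (assumption: the caller owns the arrays).
// Arrays created with new[] must be released with delete[].
void free_events(double** events, int* ports, int numb_events) {
    for (int i = 0; i < numb_events; ++i) delete[] events[i];
    delete[] events;
    delete[] ports;
}
```

A caller would invoke replicate_event, route each returned event to the port given in the port vector, and then call free_events with the returned number of events.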
A.7 Matlab Auxiliary Functions
Auxiliary functions have been developed to help the user to create AER matrix sources
to be provided to the system from a specified standard image and to recover a standard
image from the AER events obtained after a simulation.
A.7.1 Generation of AER events from a standard image
To create an AER matrix from a specified standard image there are three methods
provided. These methods implement the random, scan and uniform methods proposed
by A. Linares-Barranco in [169]. The auxiliary functions implementing the algorithmic
approximations to generate the events are located in the directory './ALG_GEN' and are the following:
[CIN] = randimage(I, prectemp, max_events)
[CIN] = scanimage(I, prectemp, max_events)
[CIN] = unifimage(I, prectemp, max_events)
Their input parameters are I, which is the image coded in gray scale; prectemp, the minimum time interval between consecutive events; and max_events, the maximum number of events per frame. The output CIN is a matrix containing the whole sequence of events representing image I, in the format required by AERST.
A.7.2 Reconstruction of images from channels
To reconstruct an image from channel events, the auxiliary function is reconstaer_back, located in the directory './ALG_REC'.
The function is invoked with the following format:
[J]=reconstaer_back(CIN, size1, size2),
where CIN is the AER events matrix with the appropriate format, size1 is the image X dimension and size2 is the image Y dimension. J is the reconstructed image coded in gray scale.
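The idea behind this reconstruction, accumulating the signed events of a channel at their addresses, can be sketched in C++ as follows. This is a minimal illustration and not the code of reconstaer_back: the name accumulate_events is hypothetical, coordinates are assumed to be 0-based, the six event fields are assumed to be ordered as [time, request time, acknowledge time, x, y, sign], and any scaling to a gray-level range is left out.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Assumed six-field event layout: [time, Treqst, Tack, x, y, sign].
using Event = std::array<double, 6>;
using Image = std::vector<std::vector<double>>;

// Accumulates the sign of every event at its (x, y) address.
Image accumulate_events(const std::vector<Event>& events, int size1, int size2) {
    assert(size1 > 0 && size2 > 0);
    Image img(size1, std::vector<double>(size2, 0.0));
    for (const Event& e : events) {
        const int x = static_cast<int>(e[3]);
        const int y = static_cast<int>(e[4]);
        // Events addressed outside the image are skipped.
        if (x < 0 || x >= size1 || y < 0 || y >= size2) continue;
        img[x][y] += e[5];
    }
    return img;
}
```

Each pixel of the result holds the net number of positive minus negative events received at that address.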
A.7.3 Reconstruction of channels from the text output file
To reconstruct channel events from the text output file, the auxiliary function is
disktocell2, which is located in directory ‘./ALG REC’. This function recovers the
events at one channel or at all the channels stored in the text output file.
The function is invoked as:
[AC]=disktocell2 (N, numb, flag, output_file),
where N is the number of channels in the AER system, numb is the number of the channel to be recovered, flag is a logic variable and output_file is the text output file. If flag is '0', the information in all the system channels is reconstructed and numb is ignored. If flag is '1', only the information of the channel specified in numb is recovered. AC is a cell array containing the recovered events. It has as many cells as channels in the AER system. The cells for the non-reconstructed channels are left empty. As an example, if we want to recover the information from channel 3 in a system with a total of 5 channels we should invoke the function as follows:
[AC] = disktocell2 (5, 3, 1, 'output_file.txt');
channel3 = AC{3};
Afterwards, to visualize the image reconstructed from channel 3 we can use the previously described function reconstaer_back.
A.8 MATLAB Step-by-Step Example
In AERST, all the auxiliary functions, modules, sources, and remaining files are organized in folders. The tool looks for all the needed files in these locations. The folders are:
1. AERST_MAIN. This folder contains the main function main_aerst.m called by the AERST tool (AERST.m), which is located in the root directory.
2. ALG_GEN. This folder contains the functions to convert images to events: unifimage.m, scanimage.m and randimage.m. The functions mat2dat (to generate dat files from events) and dat2mat (to generate matrices of events from dat files) are also stored in this location.
3. ALG_REC. This folder stores the function to recover the events from the text output file, disktocell2.m, and the function to reconstruct images from events, reconstaer_back.
4. CONFIG_FILES. This folder contains the user configuration files.
5. CREATE_PARAMETERS. This folder stores the initialization files for parameters and state variables.
6. FUNCS. This folder contains the auxiliary functions needed by the tool internal processing.
7. SOURCES. This folder contains the sources used as input to the system.
8. MODULES. This folder contains the library of modules and the user-defined modules.
A.8.1 Preparing the Stimulus Events
The example proposed is for visual processing. We will use two different input stimuli.
The first one is generated by converting a static 64x64 pixel image into a sequence
of events. The second stimulus is directly obtained from a physical motion sensitive
64x64 pixel AER retina recording [15]. The static image is stored in a matlab file called
myimage.mat. The events from the retina are stored in a file called myevents.dat. The
.dat file format is provided by the jAER software [59] when recording real life scenes.
The first step is to convert the selected source to the proper format required by AERST.
If we choose the static image, we have to convert it to a sequence of events. In this
example, we have chosen the uniform method [169] to code the image into events:
[myevents] = unifimage(myimage, 100, 400000);
With this function, we obtain a list of events (myevents) as output. The input parameters are:
1. I is the input image coded in gray scale (here, myimage).
2. prectemp is the minimum time interval between consecutive events.
3. max_events is the maximum number of events per frame.
In the example, the interval between consecutive events will be at least 100 (prectemp) and the maximum number of events (max_events) will be 400000. The output events are stored in the matrix myevents, which has as many rows as events and six columns containing the parameters of each event. The uniform method distributes all the events uniformly over 400k slots.
If we use the .dat retina recording as source, we can do the format conversion using the dat2mat function:
[myevents]=dat2mat('myevents.dat');
Once the events have been created, they have to be saved in the Matlab working
directory inside the SOURCES directory:
cd ./SOURCES
save myevents myevents
cd ..
If we want to use more sources, we should repeat the above steps for each source.
A.8.2 Setting Up the Configuration File
The next step is writing the text configuration file which describes the netlist of the
system to be simulated.
In this example we consider the system shown in Fig. A.1. It is composed of two parallel processing modules (chip1 replicated twice) which receive the same input visual stimulus (from a two-output splitter) and merge the output events into one port using a merger module. The configuration file describing this system has the following lines:
sources {1} {myevents}
splitter {1} {2,3} {file1}
chip1 {2} {4} {file2} {state2}
chip2 {3} {5} {file3} {state3}
merger {4,5} {6} {} {}
The file is saved as config_example.txt in the subdirectory CONFIG_FILES.
If we want to use more sources (for example two), the first line in the configuration
file would be:
sources {1,2} {myevents1, myevents2}
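To make the structure of a netlist line explicit, the following sketch splits one line into the module name and its brace-delimited groups (for a module line: input channels, output channels, parameter file and state file). It is an illustration only and not the parser used by AERST. The names NetlistLine and parse_netlist_line are hypothetical, and the sketch assumes that names contain no blanks and that braces are not nested.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical representation of one configuration line.
struct NetlistLine {
    std::string name;                              // e.g. "chip1"
    std::vector<std::vector<std::string>> groups;  // contents of each {...}
};

NetlistLine parse_netlist_line(const std::string& line) {
    NetlistLine out;
    std::size_t pos = line.find('{');
    // Everything before the first '{' is the module name (blanks removed).
    for (char ch : line.substr(0, pos))
        if (!std::isspace(static_cast<unsigned char>(ch))) out.name += ch;
    while (pos != std::string::npos) {
        const std::size_t close = line.find('}', pos);
        if (close == std::string::npos) break;     // unterminated group: stop
        assert(close > pos);
        // Split the group body at commas, dropping blanks.
        std::vector<std::string> items;
        std::string cur;
        for (char ch : line.substr(pos + 1, close - pos - 1)) {
            if (ch == ',') {
                if (!cur.empty()) items.push_back(cur);
                cur.clear();
            } else if (!std::isspace(static_cast<unsigned char>(ch))) {
                cur += ch;
            }
        }
        if (!cur.empty()) items.push_back(cur);
        out.groups.push_back(items);
        pos = line.find('{', close);
    }
    return out;
}
```

Applied to the line "merger {4,5} {6} {} {}", the sketch yields the name merger, the input channels 4 and 5, the output channel 6, and two empty groups for the missing parameter and state files.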
A.8.3 Initializing Parameters
Figure A.1: System Simulated in the Step-by-Step Example
The following step is to create the parameters and state variables that each block will use. In our example there are four blocks, each one with its parameters and state variables. Consequently, we must create the initialization files that will initialize and save the parameters and state variables. In the example we need five initialization files, two for each processing chip (one for parameters and one for state variables), and one for the splitter module. The merger block has neither parameters nor state variables. The initialization files are stored in subdirectory CREATE_PARAMETERS. They are as follows:
A.8.3.1 Splitter
The splitter module in the system has only one initialization file, called initparams1.m (it has no state variables), with the following lines:
timedelay=0.5; %parameter delay time.
numb_ports=2;
save file1 timedelay numb_ports
A.8.3.2 Chip1
Module Chip1 is used twice in the system but with different parameters and state variables. Note that modules can be used more than once in the same system and that the user can specify different parameters and state variables for them. For the upper Chip1 in Fig. A.1 there are two initialization files: initparams2 and initstate2.
a)initparams2 :
timedelay=0.6; %parameter delay time.
size1=64; %size 1 of the image.
size2=64; %size 2 of the image.
s=[0 0 0;1 2 1;0 0 0]; %convolution kernel
sizekernel1= floor(size(s,1)/2);
threshold=60;
save file2 timedelay size1 size2 s sizekernel1 threshold
b)initstate2 :
imagestate2=zeros(64); %temporal bidimensional array used in block chip1.
timestate=zeros(64); %auxiliary state variable used in chip1.m.
save state2 imagestate2 timestate
For Chip1 at the bottom of Fig. A.1, there are two initialization files: initparams3 and initstate2. Note that both modules (upper Chip1 and bottom Chip1) use the same initialization file for state variables (initstate2) as they use state variables with the same characteristics.
a)initparams3 :
timedelay=0.3; %parameter delay time.
size1=64; %size 1 of the image.
size2=64; %size 2 of the image.
s=[0 1 0; 0 2 0; 0 1 0]; %convolution mask
sizekernel1= floor(size(s,1)/2);
threshold=60;
save file3 timedelay size1 size2 s sizekernel1 threshold
A.8.4 Editing the Modules
Once the initialization files have been created, the next step is to write the code of the different modules. Each module must be declared according to the syntax shown below. In the example, the modules have the following code:
A.8.4.1 Splitter Module
function [new_event_in,events_out,new_state,new_time,port_out]=...
splitter(event_in,pars,old_state,old_time,port_in)
new_event_in=event_in;
new_state=old_state;
% Get the parameter variables timedelay, numb_ports
tdel=pars.timedelay;
numb_ports=pars.numb_ports;
% USE THE INCOMING EVENT TO GET (x,y) COORDINATES AND sign
x=event_in(4);
y=event_in(5);
sgn=event_in(6);
new_event_in(2)=old_time; % Update incoming event Treqst
new_time=old_time+tdel; % Update current time
new_event_in(3)=new_time; % Update incoming event Tack
events_out=[new_time 0 -1 x y sgn];
events_out=repmat(events_out,numb_ports,1); %ONE EVENT FOR EACH PORT
port_out=1:numb_ports;
According to the initialization file initfile1.m, this module creates two events, one in channel 2 and one in channel 3, with a delay. It also updates the current time and the Treqst and Tack fields of the incoming event. If we want more than one replicated event in each channel, for example two, we only have to modify the instructions that create the output events. A possible way could be as follows:
events_out=[new_time 0 -1 x y sgn];
events_out=repmat(events_out,2*numb_ports,1);
port_out=1:numb_ports;
port_out=repmat(port_out,1,2);
A.8.4.2 Chip1 Module
function [new_event_in,events_out,new_state,new_time,port_out]=...
chip1(event_in,pars,old_state,old_time,port_in)
% Get the parameter variables timedelay, size1 and size2, threshold and convolution mask s
new_event_in=event_in; new_state=old_state; new_time=old_time;
tdel=pars.timedelay;
size1=pars.size1;
size2=pars.size2;
threshold=pars.threshold;
s2=pars.s;
% USE THE INCOMING EVENT TO GET (x,y) COORDINATES AND sign
x=event_in(4);
y=event_in(5);
sgn=event_in(6); %named sgn so that the built-in function sign is not shadowed
new_event_in(2)=new_time; % Update incoming event Treqst
new_time=new_time+tdel; % Update current time
new_event_in(3)=new_time; % Update incoming event Tack
J=old_state.imagestate2; % Get bidimensional state array J
%APPLY CONVOLUTION MASK s TO STATE ARRAY J AT (x, y) POSITIONS
c=pars.sizekernel1;
J(max(1,(x-c)):min(size1,(x+c)),max(1,(y-c)):min(size2,(y+c)))= ...
J(max(1,(x-c)):min(size1,(x+c)),max(1,(y-c)):min(size2,(y+c)))+...
sgn*s2(max(1,c-x+2):min(size(s2,1),size1-x+c+1),max(1,c-y+2):min(size(s2,2),size2-y+c+1));
%FIND NEURONS REACHING threshold
[a, b]=find(abs(J)>threshold);
e=find(abs(J)>threshold);
new_state.timestate(e)=new_time; %We update the time of those pixels reaching threshold
x2=a';
y2=b';
signo=sign(J(e))';
time_re=zeros(1,length(a));
Ack=(-1)*ones(1,length(a));
treqini=new_time*ones(1,length(a));
%RESET NEURONS REACHING threshold
if ~isempty(e)
J(e)=0;
end
new_state.imagestate2=J; %save new state
%CREATE NEW OUTPUT EVENTS
events_out=[treqini' time_re' Ack' x2' y2' signo'];
port_out=ones(1,length(a)); %set the output port number for each created event
In this block, a square convolution kernel s is applied to the present 2D state at position (x, y). Pixels whose state, positive or negative, exceeds the threshold in absolute value (in this case '60') produce new events in the output channel and are reset.
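The same event-driven convolution can be written compactly with standard containers. The sketch below is an illustration under stated assumptions, not the implementation distributed with AERST: convolve_event, Grid and Spike are hypothetical names, coordinates are 0-based, the kernel is assumed to have odd dimensions, and the whole state array is scanned after each event, as the MATLAB listing does with find.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Grid = std::vector<std::vector<double>>;
struct Spike { int x; int y; int sign; };

// Adds sgn * s, centred on (x, y), to the state J, clipping the kernel at the
// array borders. Cells whose absolute state exceeds the threshold emit a
// spike with the sign of the state and are reset to zero.
std::vector<Spike> convolve_event(Grid& J, const Grid& s, int x, int y,
                                  int sgn, double threshold) {
    assert(!J.empty() && !s.empty());
    const int size1 = static_cast<int>(J.size());
    const int size2 = static_cast<int>(J[0].size());
    const int c1 = static_cast<int>(s.size()) / 2;      // kernel half-height
    const int c2 = static_cast<int>(s[0].size()) / 2;   // kernel half-width
    for (int i = std::max(0, x - c1); i <= std::min(size1 - 1, x + c1); ++i)
        for (int j = std::max(0, y - c2); j <= std::min(size2 - 1, y + c2); ++j)
            J[i][j] += sgn * s[i - x + c1][j - y + c2];
    std::vector<Spike> spikes;
    for (int i = 0; i < size1; ++i)
        for (int j = 0; j < size2; ++j)
            if (std::fabs(J[i][j]) > threshold) {
                spikes.push_back({i, j, J[i][j] > 0 ? 1 : -1});
                J[i][j] = 0.0;                           // reset the neuron
            }
    return spikes;
}
```

With the kernel [0 0 0; 1 2 1; 0 0 0] and a threshold of 3, two consecutive positive events at the same address drive the central cell to 4, so the second call returns one positive spike at that address and resets the cell.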
A.8.4.3 Merger Module
function [new_event_in,events_out,new_state,new_time,port_out]=...
merger(event_in,pars,old_state,old_time,port_input)
% USE THE INCOMING EVENT TO GET (x,y) COORDINATES AND sign
new_event_in=event_in; new_state=old_state; new_time=old_time;
x=event_in(4);
y=event_in(5);
sgn=event_in(6);
new_event_in(2)=new_time; % Update incoming event Treqst
new_event_in(3)=new_time; % Update incoming event Tack
events_out=[new_time 0 -1 x y sgn]; %ONE EVENT COPIED TO THE OUTPUT PORT
port_out=[1];
In this case, the input events coming from channels four and five are transferred to
channel ‘6’.
A.8.5 Editing the AERST.m file
Once the modules and initialization files are available, we must include the names of
the configuration file and the initialization files in the AERST.m file before starting the
simulation. AERST.m is located in the root directory. In this step-by-step example,
AERST.m has the following lines:
function [VARS,CHANNELS] = AERST()
%USER CONFIGURATION FILE:
CONFIG_FILE='config_example.txt';
%USER OUTPUT FILE:
OUTPUT_FILE=('outfile.txt');
NUMB_CHANNELS=6;
%GO TO THE PARAMETER FOLDER AND CREATE PARAMETER AND STATE FILES
cd ./CREATE_PARAMETERS;
%USER PARAMETER AND STATE FILES:
initfile1; %USER PARAMETER FILE 1
initfile2; %USER PARAMETER FILE 2
initfile3; %USER PARAMETER FILE 3
initstate1; %USER STATE FILE 1
initstate2; %USER STATE FILE 2
copyfile('*.mat','../AERST_MAIN');
cd ../CONFIG_FILES
copyfile(CONFIG_FILE,'../AERST_MAIN');
cd ../AERST_MAIN
%THE MAIN SIMULATOR FUNCTION IS INVOKED
VARS = main_aerst(CONFIG_FILE,OUTPUT_FILE);
delete('*.mat');
copyfile(OUTPUT_FILE,'../ALG_REC');
delete('*.txt');
cd ../ALG_REC
%FINALLY, THE EVENTS IN ALL CHANNELS ARE RETRIEVED
[CHANNELS]=disktocell2(NUMB_CHANNELS,0,0,OUTPUT_FILE);
delete('*.txt');
cd ..
A.8.6 Simulating the System
Finally, we must invoke the AERST simulator by typing the following command at the Matlab prompt:
[VARS,CHANNELS] = AERST()
If all steps have been done correctly, no errors will appear on the screen. We strongly recommend checking that all required files (sources, configuration file, initfiles, initstates, new modules and AERST.m) have been created or modified before starting the simulation.
A.8.7 Viewing Results
To analyze the resulting events of all channels, we can use the disktocell2 function as mentioned previously:
[A4]=disktocell2(6, 0, 0, 'outfile.txt');
Here we use the value '6' for the first argument because there are six channels in our system. If we want to analyze only one channel, for example channel '2', we must type the following:
[A4]=disktocell2(6, 2, 1, 'outfile.txt');
In both cases, we can access the information of one channel (in this case channel '2') with the following instruction:
channel=A4{2};
Now channel holds the full list of events at channel '2', each in the six-field format. If we want to view the image that results from accumulating all the events in the recovered channel, we can type:
[J2]=reconstaer_back(channel, 64, 64);
There is also a function to convert events in Matlab format to data files (.dat), in case the user wants to use the jAER software tools [59] to visualize the events as video streams. This function is called mat2dat and the calling format is the following:
mat2dat(CIN,s,size)
CIN is a matrix of events with the format required by the AERST tool. s is the name of the dat file where the events are going to be stored. size is the size of the address space coded by the events.
mat2dat requires the times of the events in CIN to be in nanoseconds. For instance, if the user wants to convert a matrix of events called myevents coding a 128x128 address space to a data file called myevents.dat, the calling format is:
mat2dat(myevents, 'myevents.dat',128);
A.9 C++ Step-by-Step Example
To maintain a correspondence with the Matlab implementation, the C++ implementation of AERST has been organized with the same folder structure. Again the folders are:
1. AERST_MAIN. This folder contains the main executable file AERST.exe called by the AERST tool (AERST_C.m, located in the root directory).
2. ALG_GEN. This folder contains the functions to convert images to events: unifimage.m, scanimage.m and randimage.m. The functions mat2dat (to generate dat files from events) and dat2mat (to generate matrices of events from dat files) are also stored in this location.
3. ALG_REC. This folder stores the function to recover the events from the text output file, disktocell3.m, and the function to reconstruct images from events, reconstaer_back.
4. CONFIG_FILES. This folder contains the user configuration files.
5. CREATE_PARAMETERS. This folder stores the initialization files for parameters and state variables.
6. FUNCS. This folder contains the auxiliary functions needed by the tool internal processing.
7. SOURCES. This folder contains the sources used as input to the system.
A.9.1 Converting a Matrix of Events to a source text file
Given an event matrix stored as a mat file (in this example MATRIX.mat) with the format required by the Matlab version of AERST (six fields of information per event), it can be converted into the format required by the C++ version using the function write2file:
write2file('MATRIX.mat', 'myevents.txt');
myevents.txt stores the events with the format required by the C++ version of AERST.
A.9.2 Setting Up the Configuration File
The next step is to write the text configuration file describing the netlist of the system to be simulated. For the example system shown in Fig. A.1, the configuration file is:
sources {1} {myevents.txt}
splitter {1} {2,3} {file1.txt}
chip1 {2} {4} {file2.txt} {state2.txt}
chip2 {3} {5} {file3.txt} {state3.txt}
merger {4,5} {6} {} {}
Note that the configuration file is the same as the one used in the Matlab version, except that the file extensions of the source, parameter and state files have to be given.
A.9.3 Initializing Parameters
The next step is to create the parameters and state variables that each module will use. In our example there are four modules, each one with its associated parameters and state variables. Consequently, we must create the files that will initialize and save the parameters and state variables. In the example we need five initialization files, two for each processing chip (one for parameters and one for state variables), and one for the splitter module. The merger module has neither parameters nor state variables. The initialization files are stored in './CREATE_PARAMETERS'. In this example, we can reuse the Matlab initialization files described in the Matlab step-by-step section, adding only some lines to save the variables in a txt file. The initialization files are described next:
A.9.3.1 Splitter
The splitter module has only one initialization file, called initparams1.m (it has no state variables), with the following lines:
timedelay=0.5; %parameter delay time.
numb_ports=2;
%THE FOLLOWING CODE CREATES THE file1.txt INITIALIZATION FILE
num_doubles=2;
s2='file1.txt';
fid=fopen(s2,'w');
fprintf(fid,'#doubles\n');
fprintf(fid,'%d\n',num_doubles);
fprintf(fid,'%f ',timedelay);
fprintf(fid,'%f\n',numb_ports);
fclose(fid);
A.9.3.2 Chip1
For the upper module Chip1 in Fig. A.1 there are two initialization files: initparams2 and initstate2.
a)initparams2 :
timedelay=0.6; %parameter delay time.
size1=64; %size 1 of the input visual flow.
size2=64; %size 2 of the input visual flow.
s=[0 0 0;1 2 1;0 0 0]; %CONVOLUTION MASK
shift= floor(size(s,1)/2);
threshold=60;
%THE FOLLOWING CODE CREATES THE TXT INITIALIZATION FILE
num_doubles=3;
num_matrices=1;
rows=[3;3]; %THERE IS ONLY ONE MATRIX, THE 3x3 CONVOLUTION MASK s
s2='file2.txt';
fid=fopen(s2,'w');
fprintf(fid,'#doubles\n'); %WRITE KEYWORD #doubles
fprintf(fid,'%d\n',num_doubles);
fprintf(fid,'%f ',timedelay);
fprintf(fid,'%f ',shift);
fprintf(fid,'%f\n',threshold);
fprintf(fid,'#matrices\n'); %WRITE KEYWORD #matrices
fprintf(fid,'%d\n',num_matrices);
fprintf(fid,'%d\n',rows);
fprintf(fid,'%f\n',s');
fclose(fid);
b)initstate2 :
size1=64; %size 1 of the input visual flow.
size2=64; %size 2 of the input visual flow.
imagestate2=zeros(64); %temporal bidimensional array used in block chip1.
timestate=zeros(64); %auxiliary state variable used in chip1.m.
%THE FOLLOWING CODE CREATES THE TXT INITIALIZATION FILE
num_doubles=2;
num_matrices=2;
rows=[64 64;64 64]; %THIS TIME THERE ARE TWO MATRICES, THE 64x64 imagestate2 AND THE 64x64 timestate
s2='state2.txt';
fid=fopen(s2,'w');
fprintf(fid,'#doubles\n'); %WRITE KEYWORD #doubles
fprintf(fid,'%d\n',num_doubles);
fprintf(fid,'%f ',size1);
fprintf(fid,'%f\n',size2);
fprintf(fid,'#matrices\n'); %WRITE KEYWORD #matrices
fprintf(fid,'%d\n',num_matrices);
fprintf(fid,'%d\n',rows);
fprintf(fid,'%f\n',imagestate2');
fprintf(fid,'%f\n',timestate');
fclose(fid);
For the module Chip1 at the bottom of Fig. A.1, there are two initialization files: initparams3 and initstate2. Note that both modules (upper Chip1 and bottom Chip1) use the same initialization file for state variables (initstate2) as they use state variables with the same characteristics.
a)initparams3 :
timedelay=0.3; %parameter delay time.
size1=64; %size 1 of the input visual flow.
size2=64; %size 2 of the input visual flow.
s=[0 1 0; 0 2 0; 0 1 0]; %CONVOLUTION MASK
shift= floor(size(s,1)/2);
threshold=60;
%THE FOLLOWING CODE CREATES THE TXT INITIALIZATION FILE
num_doubles=3;
num_matrices=1;
rows=[3;3]; %THERE IS ONLY ONE MATRIX, THE CONVOLUTION MASK s
s2='file3.txt';
fid=fopen(s2,'w');
fprintf(fid,'#doubles\n'); %WRITE KEYWORD #doubles
fprintf(fid,'%d\n',num_doubles);
fprintf(fid,'%f ',timedelay);
fprintf(fid,'%f ',shift);
fprintf(fid,'%f\n',threshold);
fprintf(fid,'#matrices\n'); %WRITE KEYWORD #matrices
fprintf(fid,'%d\n',num_matrices);
fprintf(fid,'%d\n',rows);
fprintf(fid,'%5.20f\n',s');
fclose(fid);
A.9.4 Editing the C++ Modules
Once the initialization files have been created, the next step is to write the different
modules (in case new ones have been created and are not available in the modules
library).
In the example system, the modules will have the code given below:
A.9.4.1 Splitter Module
int splitter (double *event_in, double ***events_out, params2 params, params2 *state,
              double *timeact, int port_input, int **port_output, int tam_vec)
{
    int i, c, numb_ports, numb_events;
    double timedelay, **tt;
    // Get parameters
    timedelay = params.par_doub[0];       // First parameter: delay of the asynchronous communication
    numb_ports = (int)params.par_doub[1]; // Number of output ports
    // Update incoming event timing information:
    event_in[1] = *timeact;               // Request time for input event
    *timeact = *timeact + timedelay;      // We compute the present time for the module
    event_in[2] = *timeact;               // We acknowledge the input event
    tt = new double *[numb_ports];        // We create dynamically the set of events. In this case only two events (two ports)
    if (tt == NULL) {
        printf("ERROR IN MEMORY ALLOCATION\n");
    } else {
        for (c = 0; c < numb_ports; c++) {
            tt[c] = new double[tam_vec];  // We create dynamically each event of size specified by tam_vec
            if (tt[c] == NULL) {
                printf("ERROR IN MEMORY ALLOCATION\n");
            }
        }
        *events_out = tt;                 // events_out is the pointer to the array provided as output
    }
    // We reserve space for the output port vector. Each element indicates in
    // which output port each event must be written.
    *port_output = new int[numb_ports];
    if (*port_output == NULL) {
        printf("ERROR IN MEMORY ALLOCATION\n");
    } else {
        for (i = 0; i < numb_ports; i++) { // We initialize each event
            (*events_out)[i][0] = *timeact;
            (*events_out)[i][1] = 0;
            (*events_out)[i][2] = -1;
            (*events_out)[i][3] = event_in[3];
            (*events_out)[i][4] = event_in[4];
            (*events_out)[i][5] = event_in[5];
            (*port_output)[i] = i + 1;
        }
    }
    numb_events = numb_ports;
    return numb_events;                   // We return the number of events
}
In this block, a new event is created in channels '2' and '3' with a delay, and timeact is updated (this module keeps no state). The effective request and acknowledge times of the input event are set using the delay specified in params.
A.9.4.2 Chip1 C++ Module
int chip1(double *event_in, double ***events_out, params2 params, params2 *state,
          double *timeact, int port_input, int **port_output, int tam_vec)
{
    int i, j, numb_events = 0, cont = 0, cont2 = 0, c = 0;
    int ind11, ind12, ind21, ind22, sg11, sg21, sg12, sg22, shift;
    int *a, *b;
    double x, y, sgn_event;
    double **J, **tb, **s2, **tt;
    double *signo;
    double timedelay, treqinic, size1, size2, threshold;
    // GET PARAMETERS AND STATE VARIABLES
    timedelay = params.par_doub[0];
    size1 = (int)state->par_doub[0];
    size2 = (int)state->par_doub[1];
    s2 = params.p[0];
    shift = (int)params.par_doub[1];
    threshold = params.par_doub[2];
    // GET EVENT INFORMATION AND UPDATE IT
    x = event_in[3];
    y = event_in[4];
    sgn_event = event_in[5];
    treqinic = *timeact;
    event_in[1] = *timeact;
    *timeact = *timeact + timedelay;
    event_in[2] = *timeact;
    c = shift;
    J = state->p[0];
    // COMPUTE WHICH PART OF CONVOLUTION MASK s FITS IN THE STATE ARRAY
    ind11 = maxim2(0, -x + c);
    ind21 = minim2((params.rows2_m[0][0] - 1), (state->rows2_m[0][0] +
            params.rows2_m[0][0] - (x + c + 2)));
    ind12 = maxim2(0, -y + c);
    ind22 = minim2((params.rows2_m[1][0] - 1), (state->rows2_m[1][0] +
            params.rows2_m[1][0] - (y + c + 2)));
    // Equivalent to "if length(s2)>0": part of s2 overlaps the state array
    if (((ind21 - ind11) >= 0) && ((ind22 - ind12) >= 0)) {
        a = new int[(ind21 - ind11 + 1) * (ind22 - ind12 + 1)];
        b = new int[(ind21 - ind11 + 1) * (ind22 - ind12 + 1)];
        signo = new double[(ind21 - ind11 + 1) * (ind22 - ind12 + 1)];
        if ((a != NULL) && (b != NULL) && (signo != NULL)) {
            cont = 1;
            sg11 = maxim2(0, x - c);
            sg21 = minim2(size1 - 1, x + c);
            sg12 = maxim2(0, y - c);
            sg22 = minim2(size2 - 1, y + c);
            tb = state->p[1];
            s2 = params.p[0];
            for (j = sg12; j <= sg22; j++) {
                for (i = sg11; i <= sg21; i++) {
                    // APPLY CONVOLUTION MASK TO STATE ARRAY J
                    J[i][j] = J[i][j] + sgn_event * s2[ind11 + (i - sg11)][ind12 + (j - sg12)];
                    // LOCATE THOSE NEURONS WITH A STATE HIGHER THAN threshold
                    if (fabs(J[i][j]) >= threshold) {
                        a[cont2] = i;
                        b[cont2] = j;
                        tb[i][j] = *timeact;
                        if (J[i][j] >= 0) {
                            signo[cont2] = 1;
                        } else {
                            signo[cont2] = -1;
                        }
                        // RESET THOSE NEURONS WITH A STATE HIGHER THAN threshold
                        J[i][j] = 0;
                        cont2++;
                    }
                }
            }
        }
    }
    if (cont > 0) {
        // CREATE OUTPUT EVENTS
        numb_events = cont2;
        tt = new double *[numb_events];   // We create dynamically the set of events
        if (tt == NULL) {
            printf("ERROR IN MEMORY ALLOCATION\n");
        } else {
            for (c = 0; c < numb_events; c++) {
                tt[c] = new double[tam_vec]; // We create dynamically each event of size specified by tam_vec
                if (tt[c] == NULL) {
                    printf("ERROR IN MEMORY ALLOCATION\n");
                }
            }
            *events_out = tt;             // events_out is the pointer to the array provided as output
        }
        // We reserve space for the output port vector. Each element indicates in
        // which output port each event must be written.
        *port_output = new int[numb_events];
        if (*port_output == NULL) {
            printf("ERROR IN MEMORY ALLOCATION\n");
        } else {
            for (i = 0; i < numb_events; i++) { // We initialize each event
                (*events_out)[i][0] = treqinic;
                (*events_out)[i][1] = 0;
                (*events_out)[i][2] = -1;
                (*events_out)[i][3] = a[i];
                (*events_out)[i][4] = b[i];
                (*events_out)[i][5] = signo[i];
                (*port_output)[i] = 1;
                *timeact = (*events_out)[i][0];
            }
        }
        delete[] a;
        delete[] b;
        delete[] signo;
    }
    return numb_events;
}
In this block, the convolution kernel s is applied to the current 2D state around position (x, y).
Pixels whose positive or negative state reaches the threshold (in this case 0) produce
new events in the output channels.
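The update performed by chip1 can be illustrated with a reduced, self-contained sketch. It is not the AERST code: the names ConvState, OutEvent and applyEvent are hypothetical, std::vector replaces the tool's params2 structures, and the kernel is assumed to be square with an odd side and centred on the event address.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration; AERST itself uses params2 structures.
struct OutEvent { int x; int y; int sign; };

struct ConvState {
    int rows, cols;
    std::vector<double> v;  // row-major neuron states
    ConvState(int r, int c) : rows(r), cols(c), v(static_cast<std::size_t>(r) * c, 0.0) {}
    double &at(int i, int j) { return v[static_cast<std::size_t>(i) * cols + j]; }
};

// Adds sign * kernel, centred on (x, y), to the state array. Every neuron whose
// absolute state reaches the threshold emits a signed event and is reset to 0.
// The kernel is assumed square with odd side k (assumption of this sketch).
std::vector<OutEvent> applyEvent(ConvState &s,
                                 const std::vector<std::vector<double>> &kernel,
                                 int x, int y, int sign, double threshold) {
    std::vector<OutEvent> out;
    const int k = static_cast<int>(kernel.size());
    const int c = k / 2;  // half-width of the kernel
    for (int di = 0; di < k; ++di) {
        for (int dj = 0; dj < k; ++dj) {
            const int i = x - c + di;
            const int j = y - c + dj;
            if (i < 0 || i >= s.rows || j < 0 || j >= s.cols) continue;  // clip at the borders
            double &n = s.at(i, j);
            n += sign * kernel[di][dj];
            if (std::fabs(n) >= threshold) {
                out.push_back({i, j, n >= 0 ? 1 : -1});
                n = 0.0;  // reset the neuron that fired
            }
        }
    }
    return out;
}
```

Clipping each kernel position against the array borders plays the same role as the ind11/ind21/ind12/ind22 and sg11/sg21/sg12/sg22 index computations in chip1, at the cost of one bounds test per kernel element.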
A.9.4.3 MERGER C++ Module
int merger(double *event_in, double ***events_out, params2 params, params2 *state,
           double *timeact, int port_input, int **port_output, int tam_vec)
{
    int numb_ports = 1, numb_events = 0;
    double **tt;

    // Get parameters
    // Update incoming event timing information:
    event_in[1] = *timeact; // Request time for input event
    event_in[2] = *timeact; // We acknowledge the input event

    tt = new double *[numb_ports]; // Dynamically create the set of events. In this case only one event
    if (tt == NULL) {
        printf("ERROR IN MEMORY ALLOCATION\n");
    } else {
        tt[0] = new double[tam_vec]; // Dynamically create each event, of the size given by tam_vec
        if (tt[0] == NULL) {
            printf("ERROR IN MEMORY ALLOCATION\n");
        }
        *events_out = tt; // events_out is the pointer to the array provided as output
    }
    // Reserve space for the output port vector. Each element indicates
    // which output port each event must be written to.
    *port_output = new int[numb_ports];
    if (*port_output == NULL) {
        printf("ERROR IN MEMORY ALLOCATION\n");
    } else {
        // Initialize the event
        (*events_out)[0][0] = *timeact;
        (*events_out)[0][1] = 0;
        (*events_out)[0][2] = -1;
        (*events_out)[0][3] = event_in[3];
        (*events_out)[0][4] = event_in[4];
        (*events_out)[0][5] = event_in[5];
        (*port_output)[0] = 1;
    }
    numb_events = numb_ports;
    return numb_events; // Return the number of events
}
Once the modules have been written, the user has to:
1. Include the modules in the hpp file called lib_modules.hpp.
2. Include the module names in the file AERST.cpp. The module names are included
as independent lines in the C++ structure called intern_func_type. This structure
stores the library modules and the user-defined modules. It has the following lines:
struct intern_func_type {
    char *f_name;
    int (*p)(double *, double ***, params2, params2 *, double *, int, int **, int);
} intern_func[] = {
    "aerswitch", aerswitch, // INCLUDE THE NAME OF NEW MODULES AS DONE NEXT:
    "aerscanner", aerscanner,
    "convolution", convolution,
    "projection", projection,
    "integandfire", integandfire,
    "ratereducer", ratereducer,
    "rotator", rotator,
    "subsampling", subsampling,
    "splitter", splitter,
    "merger", merger,
    "chip1", chip1,
    "chip2", chip2,
    "", 0
};
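A table of this kind is typically searched by name when the netlist is parsed. The following sketch shows one way such a lookup could work. It is an illustration only: ModuleFn, ModuleEntry, demo_table and find_module are hypothetical names, and the module signature is reduced to a single argument instead of the eight arguments used by the real AERST modules.

```cpp
#include <cstring>

// Simplified, hypothetical module signature; real AERST modules take the
// eight arguments listed in intern_func_type.
typedef int (*ModuleFn)(double *event_in);

struct ModuleEntry {
    const char *f_name;
    ModuleFn p;
};

static int demo_merger(double *) { return 1; }    // placeholder module
static int demo_splitter(double *) { return 2; }  // placeholder module

// Table terminated by an entry with an empty name, as in intern_func[].
static ModuleEntry demo_table[] = {
    {"merger", demo_merger},
    {"splitter", demo_splitter},
    {"", 0}
};

// Returns the function registered under name, or a null pointer if absent.
ModuleFn find_module(const ModuleEntry *table, const char *name) {
    for (int i = 0; table[i].f_name[0] != '\0'; ++i) {
        if (std::strcmp(table[i].f_name, name) == 0) return table[i].p;
    }
    return 0;
}
```

With this layout, registering a new module only requires adding one line to the table before the empty terminator entry, which matches the procedure described above.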
A.9.5 Simulating the System in C++
Once the modules and initfiles are available, the simulator can be run. A MATLAB function (AERST.m) that calls the initialization functions and the main program (AERST.exe) is provided. (The user can also run the application outside MATLAB by creating the txt files as described in the tool and calling AERST with the name of the configuration file.) AERST.m has the following lines:
function [CHANNELS]=AERST()
%MATLAB ARRAY OF SOURCE EVENTS:
MAT_SOURCE='myevents.mat';
%CPP TXT SOURCE FILE:
TXT_SOURCE='myevents.txt';
%USER CONFIGURATION FILE:
CONFIG_FILE='config_example_c.txt';
%USER OUTPUT FILE:
OUTPUT_FILE='output.txt';
NUMB_CHANNELS=6;

%CODE TRANSPARENT TO THE USER:
%LOAD SOURCE AS A MATLAB EVENT ARRAY AND CONVERT IT TO TXT FORMAT:
cd ./SOURCES
load(MAT_SOURCE)
write2file(MAT_SOURCE, TXT_SOURCE); %converts from mat to txt
copyfile(TXT_SOURCE,'../AERST_MAIN/');
cd ..
delete(TXT_SOURCE);

%GO TO THE CONFIGURATION FILES FOLDER AND COPY THE CONFIG FILE IN THE MAIN FOLDER
cd ./CONFIG_FILES
copyfile(CONFIG_FILE, '../AERST_MAIN/config_file.txt');

%GO TO THE PARAMETER FOLDER AND CREATE PARAMETER AND STATE FILES
cd ../CREATE_PARAMETERS
%USER PARAMETER AND STATE FILES:
initfile1_C;  %USER PARAMETER FILE 1
initfile2_C;  %USER PARAMETER FILE 2
initfile3_C;  %USER PARAMETER FILE 3
initstate2_C; %USER STATE FILE 2

%IN THE C++ SIMULATOR PARAMETERS AND STATES ARE STORED IN TXT FILES
copyfile('*.txt','../AERST_MAIN');
delete('*.txt');

%CALL THE C++ MAIN PROGRAM:
cd ../AERST_MAIN
system('./AERST.exe config_file.txt out1.txt');

copyfile('out1.txt','../ALG_REC');
delete('*.txt');
cd ../ALG_REC
copyfile('out1.txt',OUTPUT_FILE);

%FINALLY, THE EVENTS IN ALL CHANNELS ARE RETRIEVED:
[CHANNELS]=disktocell3(NUMB_CHANNELS,0,0,OUTPUT_FILE);
delete('out1.txt');
delete(OUTPUT_FILE)
cd ..
CHANNELS now stores all the events that have travelled through the system. The user can access the information of one channel (for example, channel '2') with the MATLAB command:
channel=CHANNELS{2};
The variable channel now holds the information of channel two in the format explained previously. If the user wants to view the image obtained by reconstructing the events in channel, one way to do it would be:
[J2]=reconstaer_back(channel, 64, 64);
imshow(J2,[])
Appendix B
SUMMARY
B.1 INTRODUCTION
Current technology allows complex applications to be implemented at high speed and with fairly efficient results. However, for tasks that are immediate for the brain, such as object recognition, tracking or motion analysis, current electronic systems are inefficient. In vision applications, most current systems are frame based: they capture and process sequences of frames, which are processed pixel by pixel until a given task is accomplished. This frame-based processing is slow, especially if several convolutions must be computed in sequence for each input image. The brain does not work with a frame-based scheme. In the retina, each pixel sends pulses (also called events) to the cerebral cortex when its activity level reaches a certain threshold. Very active pixels send more pulses than less active ones. All these pulses are transmitted as they are produced, without waiting for an artificial "frame time" before being sent to the next processing stage [8]. The extracted features are propagated and processed stage by stage as soon as they are produced, without waiting for the data of complete frames to be collected and processed. An important problem faced by engineers who try to implement bio-inspired vision processing systems is how to achieve the massive number of feed-forward and feedback interconnections that exist between the neural stages of the human visual processing system. Address Event Representation (AER) [30][31] is a possible solution. Fig. B.1 illustrates communication over a traditional point-to-point AER link.

Figure B.1: Concept of point-to-point communication based on AER.
The continuous-time state of the emitter neurons in a chip is transformed into a sequence of very fast digital pulses (events) of minimum width (on the order of ns) but with inter-pulse intervals on the order of ms (similar to brain neurons). This long inter-pulse interval allows powerful multiplexing: the pulses generated by the emitter neurons can be time-multiplexed onto a common high-speed output bus. Each time a neuron emits a pulse or event, the address of that neuron appears on the digital bus together with its request and acknowledge signals. This is known as an address event. The receiver chip reads and decodes the addresses of the incoming events and sends pulses to the corresponding receiver neurons, which integrate those pulses and are able to reproduce the state of the emitter neurons. This is the simplest form of AER-based inter-chip communication. However, this point-to-point communication can be extended to a multi-emitter or multi-receiver scheme [50], where rotations, translations or more complicated processing such as convolutions can be implemented by processing chips that receive these events [37]. Moreover, the information can easily be translated or rotated simply by changing the addresses of the events as they travel from one chip to the next. There is a growing community of users of the AER protocol for the design of bio-inspired vision and audition applications, robotics, object tracking and recognition, etc., as demonstrated by the success in recent years of the participants in the 'Neuromorphic Engineering Workshop series' [48]. The goal of this community is to design large, hierarchically structured, multi-chip, multi-stage systems capable of implementing complex array processing in real time. The success of such systems will depend greatly on the availability of robust and efficient tools for designing and debugging AER systems [55][50]. A simulator of complex AER-based processing systems is therefore essential: it allows such systems to be analysed, and new ones to be proposed, before their physical implementation. This work presents a simulator implemented in MATLAB, which has also been implemented in C++.
B.2 Description of the AERST Simulator
In this simulator, a generic AER system is described by a netlist, or connection file, that uses only two types of elements: modules and channels. A module is a block that generates and/or processes AER event streams. For example, a retina chip would be a source that provides AER events to an AER system. A convolution chip [39] would be an AER processing module that receives an AER event stream as input and produces a new AER event stream at its output. AER streams constitute the nodes of the netlist in an AER system and are called channels. Channels thus represent point-to-point connections. To replicate or multiplex channels, splitter or merger modules must be included in the netlist. Fig. B.2 shows an example netlist and its description as an ASCII file.
The netlist in the figure contains 7 modules and 8 channels. As can be seen, the netlist must specify the source channels of the system (together with the name of the text file that holds the events of each source), as well as the processing modules together with their parameter and state structures. For each module, its input and output channels are specified. The netlist description is supplied to the simulation tool as a text file, shown at the bottom of Fig. B.2. Each processing module is described by a function whose name is the name of the module itself. The simulator imposes no restriction on the format of the parameter and state structures. These formats are left open to the user who writes the code of each module's function. The simulator only needs to know the names of the files where these structures are stored.

Figure B.2: Example AER system and its description as an ASCII file.

The information associated with an event has six components:
$[\,T_{prereqst} \quad T_{reqst} \quad T_{ack} \quad x \quad y \quad sign\,]$ (B.1)
Here $x$ and $y$ are the coordinates or address of the event, and $sign$ is its sign. $T_{prereqst}$ is the time at which the event was created by the emitter module, $T_{reqst}$ is the time at which the event is processed by the receiver module, and $T_{ack}$ is the time at which the event is finally acknowledged by the receiver module. In our application we distinguish between the pre-Request time $T_{prereqst}$ and the effective Request time $T_{reqst}$. The former depends only on the emitter module, whereas the latter requires the receiver module to be ready to process the request signal of an event. In this way, the system can be given as source a complete list of events described only by their addresses, sign and $T_{prereqst}$ times. As the events are processed by the simulator, their $T_{reqst}$ and $T_{ack}$ times are established.
The events of the source channels may come directly from AER devices such as electronic retinas, or from converting images into a list of events. The simulator runs as follows.
1. Initially, the netlist file is read together with all the parameter and state-variable files belonging to the different modules.
2. All channels are examined. The simulator selects the one holding the unprocessed event with the smallest $T_{prereqst}$.
3. The event information is supplied as input to the module to which the channel is connected. At that point, the module updates its internal state from the event information received and from the rest of its configuration parameters. This may cause the module to generate new unprocessed events, which are delivered to its corresponding output ports. These events have no values assigned to $T_{reqst}$ and $T_{ack}$ (they have been created, not processed). The simulator then updates all channels, stores the new state of the module just processed, and returns to step 2. Processed events are stored in a text file that the user can consult and visualise at the end of the simulation to analyse the evolution of the system.
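Steps 2 and 3 amount to an event-driven scheduling loop. The sketch below illustrates that loop under simplifying assumptions and is not the AERST implementation: Event, Channel, Emitted, Module and simulate are hypothetical names, each channel feeds exactly one module, the handshake has zero delay, and each pending queue is assumed to stay ordered by its pre-request time.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Field layout follows (B.1); the struct itself is a hypothetical container.
struct Event { double t_prereq, t_req, t_ack; int x, y, sign; };

struct Channel {
    std::deque<Event> pending;     // unprocessed events, ordered by t_prereq
    std::vector<Event> processed;  // events already consumed by the destination module
};

// A module consumes one event and returns (output channel index, new event) pairs.
struct Emitted { std::size_t channel; Event ev; };
using Module = std::function<std::vector<Emitted>(Event &)>;

// channels[i] feeds modules[i]; an empty Module marks a channel with no reader.
void simulate(std::vector<Channel> &channels, const std::vector<Module> &modules) {
    for (;;) {
        // Step 2: find the channel whose next unprocessed event is the earliest.
        std::size_t best = channels.size();
        for (std::size_t i = 0; i < channels.size(); ++i) {
            if (channels[i].pending.empty()) continue;
            if (best == channels.size() ||
                channels[i].pending.front().t_prereq <
                    channels[best].pending.front().t_prereq)
                best = i;
        }
        if (best == channels.size()) return;  // nothing left to process

        // Step 3: hand the event to the destination module and collect its outputs.
        Event ev = channels[best].pending.front();
        channels[best].pending.pop_front();
        ev.t_req = ev.t_ack = ev.t_prereq;  // zero-delay handshake in this sketch
        if (modules[best]) {
            for (const Emitted &e : modules[best](ev))
                channels[e.channel].pending.push_back(e.ev);
        }
        channels[best].processed.push_back(ev);
    }
}
```

In this sketch the processed vectors stand in for the output text file, and module state lives inside whatever the std::function captures rather than in separate state files.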
The tool provides a basic library of modules that can easily be extended by the user. The main modules are:
1. The aerswitch module, which replicates the event on its input port (or ports) onto its output port (or ports).
2. The mapper module, which transforms the addresses encoded by the input events according to a LUT (look-up table) or some kind of processing. Two variants of this module have been implemented: rotator (which changes the addresses of the input events to apply a rotation to the visual stimulus encoded by those events), and aerscanner (which ignores the addresses encoded by the input events and produces, for each incoming event, an output event whose address is consecutive to the one encoded by the previously sent event).
3. The subsampling module, which reduces the address space encoded by the events.
4. The convolution module, which implements the convolution of the visual stimulus encoded by the input events with a convolution mask stored inside the module.
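As an illustration of how a mapper variant such as rotator can act purely on addresses, the sketch below applies a quarter turn to an event address. The function name rotateAddress90 and the fixed 90-degree angle are assumptions made for this example; the text does not detail how the rotator module computes its output addresses.

```cpp
#include <utility>

// Quarter-turn rotation of an address on an n x n grid: (x, y) -> (n-1-y, x).
// Hypothetical helper; only the address changes, the event timing and sign
// would be passed through untouched.
std::pair<int, int> rotateAddress90(int x, int y, int n) {
    return std::make_pair(n - 1 - y, x);
}
```

Applying the mapping four times returns every address to its starting point, which is a quick sanity check for any address-rotation rule.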
B.3 IMPLEMENTATIONS
Several multi-stage, multi-processing AER systems have been designed with the tool described. The main ones are an AER character recognition system, an image classification system based on texture information, and a multi-stage convolutional neural network for recognising people in several positions.
B.4 AER-Based Character Recognition System
One of the multi-stage, multi-processing AER systems implemented is a handwritten character recognition system. In particular, a simplified version of the Neocognitron character recognition architecture [2] has been implemented. The implemented structure recognises several characters ('A', 'B', 'C', 'H', 'L', 'M', 'T'), which may also present slight deformations. The system is shown in Fig. B.3. The input to the system is a 16x16-pixel visual stimulus encoded as events, which may represent one of the seven characters. Each pixel produces 10 events and the separation between events is 50 ns. Since the approximate number of active pixels is 30, the complete stimulus lasts approximately 15 µs.
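The quoted duration can be checked from the figures above, assuming (as a simplification) that the roughly 30 active pixels send their 10 events each one after another on a single channel with a constant 50 ns spacing:

```latex
T_{\mathrm{stim}} \approx 30\ \text{pixels} \times 10\ \tfrac{\text{events}}{\text{pixel}} \times 50\ \text{ns}
                  = 15\,000\ \text{ns} = 15\ \mu\text{s}
```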
The first processing stage implements 17 convolutions in parallel (with convolution masks $k_i$, $i = 1, ..., 17$) to extract features from the visual stimulus. Each convolution mask (kernel) in stage 1 aims to detect discriminative features that help to identify the characters. For example, convolution mask $k_1$ detects the presence and position of the upper peak of the letter 'A'. Mask $k_2$ detects a horizontal segment that ends on the left and touches a vertical segment. Mask $k_3$ does the same, but with the segment ending on the right, and so on. The first stage therefore aims to detect a set of 17 geometric features that can be used to detect and discriminate between the different characters.

Figure B.3: AER-based character recognition system.
The second stage implements 17 convolutions in parallel (with masks $p_i$, $i = 1, ..., 17$). This stage evaluates whether the spatial distribution of the features detected in the first stage is meaningful for the character being analysed. For example, for the character 'A', the upper peak (detected by mask $k_1$ in the first stage) should lie above the rest of the features. Mask $p_1$ therefore produces a positive contribution in the region just below the peak detected by $k_1$, because that is where the centre of the character 'A' should be. The same applies to the remaining convolution masks.
The purpose of the third stage is to combine the outputs of the second stage with positive or negative weights. To do so, each of the 17 outputs obtained in stage 2 is replicated onto 7 independent channels. Each of these channels feeds a 17-input merger module (module M in Fig. B.3). Since the events coming from the second stage all have positive sign, the merger modules have the signs hard-wired at their inputs, so that a positive sign is assigned to events that contribute positively to the recognition of the character and a negative sign to those that contribute negatively. The re-signed events from each merger module are sent to a convolution module with a 1x1 convolution mask of value 1 (module U in Fig. B.3). The parameters of these convolution modules are set so that 3 input events encoding the same address are needed to generate an output event. In this way, the third stage implements a sum of the activities received at its input.

Figure B.4: Characters used to evaluate the AER system.
Finally, the fourth stage consists of one convolution module for each output channel of stage 3 (modules C in Fig. B.3). There is one module C per character (seven in total), and the purpose of these modules is to analyse whether the events coming from the previous stages are more or less clustered rather than scattered. If they are clustered (at the centre of the character), the character has been detected.
B.5 Results
Since the whole system is implemented with AER modules, all processing is parallel and in real time, with events being sent from stage to stage in a matter of ns. The multi-chip (68 convolution modules), multi-stage (4 stages) system has been tested using three slightly modified versions of each of the 7 characters. The 21 characters are shown in Fig. B.4.
In all cases, the system is able to detect which letter is present in less than 9.3 µs (equivalent to processing approximately 100000 images per second) from the moment the first input event is received by the system. This delay is even shorter than the duration of the visual input stimulus (12.4 µs). The system is therefore able to recognise the character at its input before having received and processed all the events. In a frame-based implementation, one would have to wait for a frame time to collect all the pixel values of a character, and after that, all the pixels of the image would have to be processed sequentially by the 68 convolution modules of the system. Assuming a coding scheme of 25 frames per second, there would always be a limit of 40 ms to process each character (without considering the processing times of the convolution modules).

Figure B.5: Scheme of the AER-based system for texture-based image classification.
B.6 Image Classification Based on Texture Information
In this implementation, a system for texture-based image classification using Gabor filters has been built. The proposed AER system is a slightly modified version of an earlier frame-based implementation proposed by Manjunath [94].
The proposed AER system is shown in Fig. B.5.
In the system, a texture image encoded as events arrives at the first stage (layer '1'), which consists of a splitter module that replicates each event onto each of its 24 output ports, and 24 AER-based convolution modules working in parallel [39]. This first stage therefore implements a bank of 24 Gabor filters (4 scales and 6 orientations). Each convolution module $G_{mn}$ uses as convolution map the real part of a Gabor wavelet at a given scale and orientation. When a pixel in the pixel array inside a convolution module reaches its firing threshold, it resets itself and generates an output event, which is sent out of the convolution module. Each event obtained at the output of a stage-1 module arrives at a processing module in stage 2. Stage 2 consists of 24 feature extraction modules (FEM in Fig. B.5) working in parallel. Each FEM module computes an estimate of the mean $\mu_{mn}$ (called $W_{mn}$ in Fig. B.5) and of the variance $\sigma_{mn}$ (labelled $S_{mn}$ in Fig. B.5) of the convolution result obtained in the previous stage. These parameters, encoded by the events travelling from stage 2 to stage 3, are received in stage '3' by an FPGA. The FPGA scans the events of the 48 input channels during a specified, programmed time and thus builds a feature vector of the following form:
$f = [\,W_{11}\,S_{11}\,W_{12}\,S_{12}, \ldots, W_{46}\,S_{46}\,]$ (B.2)
Once the feature vector has been built, this stage computes the distance between the newly created feature vector and the feature vectors of the rest of the textures in the database. The texture under analysis is classified as texture k if the distance to the feature vector of texture k in the database is the minimum.
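The minimum-distance decision can be sketched as follows. This is an illustration under stated assumptions rather than the code run on the FPGA: classifyTexture is a hypothetical name, and the squared Euclidean distance is used only because the text does not specify the metric.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Returns the index k of the database vector closest to f, or -1 if the
// database is empty. Squared Euclidean distance is an assumption of this
// sketch; any other distance could be substituted in the inner loop.
int classifyTexture(const std::vector<double> &f,
                    const std::vector<std::vector<double>> &db) {
    int best = -1;
    double bestDist = std::numeric_limits<double>::infinity();
    for (std::size_t k = 0; k < db.size(); ++k) {
        double d = 0.0;
        for (std::size_t i = 0; i < f.size() && i < db[k].size(); ++i) {
            const double diff = f[i] - db[k][i];
            d += diff * diff;
        }
        if (d < bestDist) {
            bestDist = d;
            best = static_cast<int>(k);
        }
    }
    return best;
}
```

For the system described here, f would hold the 48 entries of (B.2) and db one 48-entry vector per texture in the database.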
The implemented AER scheme shows that, in a time (approximately 10 ms) shorter than a frame time (for example 33 ms), the system is able to recognise the texture encoded by the events of the incoming visual stimulus in real time and before receiving the complete stimulus.
B.7 Convolutional Neural Network for People Recognition
The aim of this section is to implement a multi-stage convolutional neural network similar to the LeNet-5 network implemented by Y. LeCun in [4]. The proposed system has been implemented entirely in AER and applied to the recognition and identification of people's postures. The implemented scheme has 6 stages and is shown in Fig. B.6.

Figure B.6: Frame-based convolutional neural system to detect people standing, lying horizontally or upside down.

Stages 1, 3, 5 and 6 are convolution stages, where each weight of each convolution mask is obtained with the backpropagation training algorithm [4]. Stages 2 and 4 are subsampling stages. The implemented system has 4 outputs, indicating one of the following options: person standing, horizontal, upside down, or noise. The system has been implemented in 2 versions. The first implementation is frame based and its main purpose is to obtain the values of the trainable parameters (the weights of the convolution masks) with the backpropagation algorithm. The second implementation has been carried out entirely in AER using the AERST simulation tool.
B.7.1 FRAME-BASED NEURAL NETWORK FOR PEOPLE DETECTION
The implemented frame-based system is shown in Fig. B.6. In this system the input data came directly from real AER events obtained with a motion-sensitive AER electronic retina, which therefore only detects changes in the scene [15]. In this way, the static background typical of a scene is completely removed. Since the retina provides only information about moving objects/people, the system was trained to identify people (in 3 possible positions) or objects (the latter categorised as noise by the application). The information provided by the retina (events in AER format) was collected to form images of size 128x128. These images were reduced to size 32x32 and a percentage of them was used to train the network. The system consists of the following stages:
1. Stage C1: first stage. It consists of a bank of 6 Gabor filters of size 10x10 with 2 scales and 3 orientations.
2. Stages S2 and S4: second and fourth stages. These are subsampling stages.
3. Stage C3: third stage. It consists of 24 trainable 5x5 filters and 4 output channels.
4. Stage C5: fifth stage. It consists of 32 trainable 5x5 filters and 8 output channels.
5. Stage F6: sixth stage. A trainable fully connected stage with 32 trainable parameters.
As a result of training the network with the frame-based system, all the weights (trainable parameters) obtained were stored to be used in the AER convolution modules of the non-frame-based structure.
B.7.2 AER-BASED NEURAL NETWORK FOR PEOPLE DETECTION
In the implemented AER architecture, the input to the system is directly the event flow captured with the AER retina (without collecting the events to produce images). The input encodes a 128x128 address space that is first subsampled to provide an event flow with a 32x32 address space. This is possible in AER using mapper modules, which modify the address of each input event with very simple operations so that the new output events encode the desired 32x32 space. This mapper module assigns to each input event with address $(x_{in}, y_{in})$ the output address $(x_{new}, y_{new})$ as follows:
$x_{new} = \lfloor x_{in}/4 \rfloor; \quad y_{new} = \lfloor y_{in}/4 \rfloor$ (B.3)
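The address mapping in (B.3) reduces to an integer division per coordinate. A minimal sketch follows; subsampleAddress is a hypothetical name, and addresses are assumed to be non-negative so that C++ integer division coincides with the floor operation.

```cpp
#include <utility>

// Maps an input address to the reduced address space, following (B.3).
// With factor = 4, a 128x128 address space becomes 32x32. Addresses are
// assumed non-negative, so integer division equals floor().
std::pair<int, int> subsampleAddress(int x_in, int y_in, int factor = 4) {
    return std::make_pair(x_in / factor, y_in / factor);
}
```

For example, the input addresses (124, 124) through (127, 127) all map to the output address (31, 31), so each output address collects the events of a 4x4 block of input pixels.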
This new event flow with new coordinates is used as the effective input to the system shown in Fig. B.7.

Figure B.7: AER implementation of the convolutional neural network for people recognition.
To ease the hardware design of the network and adapt it to the specific problem of our application, the following two simplifications were adopted:
1. The filters of stage C1, which are usually trainable in convolutional neural networks, were replaced in this implementation by a bank of Gabor filters with 2 scales and 3 orientations (the 3 orientations of the person under analysis). In the implementation, each of these Gabor filters is an AER convolution module [165]. Each convolution module has an internal array of 28x28 neurons and a 10x10 convolution mask implementing one of the 6 Gabor filters.
2. The subsampling modules were replaced by mapper modules that simply map every 4 neighbouring input addresses to a single output, changing their address to encode the reduced address space. In this way, stage S2 transforms the incoming 28x28 address space into a 14x14 one, and stage S4 transforms the incoming 10x10 space into a 5x5 output space. Using these modules simplifies the hardware implementation and removes the trainable parameters that are usual in subsampling stages of multi-stage neural systems.
The output events of stage S2 are sent to four convolution structures with six input ports. These structures use a shared internal array of 10x10 neurons and a different convolution mask (of size 5x5) for each input channel. They emulate the behaviour of multi-kernel AER convolution chips [56]. When, after the arrival of several events, a neuron of the 10x10 array exceeds a firing threshold encoded as a parameter, the neuron resets and fires an output event. To implement the saturation of the neuron state that frame-based neural systems usually obtain with sigmoid functions, the following solution was adopted in the AER system: before a neuron can fire another output event, it has to wait for a refractory time $T_{ref}$, which limits (saturates) the firing activity of the neuron. These refractory times were used in stages C3 and C5 and were obtained analytically by establishing equivalences between the frame-based structures (with sigmoid functions) and the AER-based ones.
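The effect of the refractory time on a single neuron can be sketched as below. This is a hypothetical model written for illustration, not the AERST convolution module: the name RefractoryNeuron is invented, only positive firing is modelled, and the choice to keep integrating while the neuron is refractory is one possible convention that the text does not fix.

```cpp
// Hypothetical integrate-and-fire neuron with a refractory period.
struct RefractoryNeuron {
    double state = 0.0;
    double threshold;
    double t_ref;      // minimum time between two output events
    double last_fire;  // time of the last output event
    bool has_fired = false;

    RefractoryNeuron(double thr, double tref)
        : threshold(thr), t_ref(tref), last_fire(0.0) {}

    // Integrates one weighted input event at time t and returns true if an
    // output event is emitted. While refractory, the neuron keeps integrating
    // but cannot fire, so its output rate never exceeds 1 / t_ref.
    bool receive(double t, double weight) {
        state += weight;
        if (state < threshold) return false;
        if (has_fired && (t - last_fire) < t_ref) return false;
        state = 0.0;
        last_fire = t;
        has_fired = true;
        return true;
    }
};
```

Because the output rate is capped at one event per refractory period, strong inputs saturate the firing activity, which is the role the sigmoid plays in the frame-based network.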
The events obtained in stage 3 are again sent to mapper modules in stage S4, which reduce the encoded address space by a factor of 2.
Stage C5 is fully equivalent to stage C3 and is implemented in the same way. This time each output neuron (8 in total) produces output events encoding a 1x1 address space. Each event coming from one of these 8 output channels is replicated by splitter modules to be used as input to stage F6, which implements full connectivity between the 8 inputs and the 4 outputs using trainable connection weights (32 in total). No refractory times were used for stage F6, since channel saturation is not relevant for this stage: the (unsaturated) activity is positive only for the output channel of interest and negative for the rest.
B.7.3 Resultados
The proposed system was first tested using the frame-based implementation. For this
purpose, images obtained by collecting events from the electronic AER retina were
used. The images of people in the horizontal and upside-down positions were obtained
by rotating the images of people (usually standing and walking in real scenes) by 90
and 180 degrees respectively. The frame-based implementation produced very good
results, with a recognition rate above 93% (using 250 images for training out of a
total of 1048).
The AER-based implementation also produced excellent results, with a success rate
above 96%. The result obtained depended on the refractory time values used in the
third (C3) and fifth (C5) stages.
Fig. B.8 is quite representative of the success rate achieved. To obtain it, an event
stream was created in which people in the 3 different positions alternated, and the
response of the system was analyzed. For this experiment, refractory times Tref of
0.5 ms and 18 ms were used for stages C3 and C5 respectively. The figure shows the
input and output events. Input events corresponding to the standing position are
plotted with value 7, those corresponding to the horizontal position with value 5, and
those corresponding to the upside-down position with value 6. The ‘UP’ output channel,
which codes the ‘standing’ posture, is plotted with values 3 and -3 (the positive value
means that the system interprets the input as ‘STANDING’, and the negative value that
the system does not recognize it as such). The events of the ‘HORIZONTAL’ channel
(person in horizontal position) are plotted with values 1 and -1, and those of the
‘UPSIDE-DOWN’ channel (person upside down) with values 2 and -2. Finally, the ‘NOISE’
channel (other kinds of objects or noise) is coded with values 4 and -4. As can be seen,
the system recognizes at all times and in real time that there is a person in the scene
and the position the person is in, with delays below 15 ms from the moment the input
visual stimulus is received by the system.
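The value coding used in Fig. B.8 can be written down as a small lookup, which may help when reading the plot. The dictionary and function names below are hypothetical, chosen only for illustration. The numeric codes are the ones stated in the text.

```python
# Plot magnitudes of the output channels in Fig. B.8 (sign carries the decision).
OUTPUT_CODE = {"HORIZONTAL": 1, "UPSIDE_DOWN": 2, "UP": 3, "NOISE": 4}
# Plot values of the input postures in Fig. B.8.
INPUT_CODE = {"horizontal": 5, "upside_down": 6, "standing": 7}


def plot_value(channel: str, sign: int) -> int:
    """Value plotted for an output event on `channel` with the given sign (+1 or -1)."""
    return sign * OUTPUT_CODE[channel]


def decode(value: int):
    """Return (channel, recognized) for a plotted output value such as 3 or -3."""
    channel = next(name for name, code in OUTPUT_CODE.items() if code == abs(value))
    return channel, value > 0
```

For example, a plotted value of -3 decodes to the ‘UP’ channel reporting that the input is not recognized as a standing person.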
Figure B.8: a) Input and output of the system when the input alternates between the ‘standing’, ‘horizontal’ and ‘upside-down’ positions. The values ‘5’, ‘6’ and ‘7’ correspond to the ‘horizontal’, ‘upside-down’ and ‘standing’ positions respectively. The absolute values ‘1’, ‘2’, ‘3’ and ‘4’ correspond to the activity in the output channels identifying the ‘horizontal’, ‘upside-down’, ‘standing’ and ‘noise’ positions.
Bibliography
[1] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust object
recognition with cortex-like mechanisms,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 29(3), pp. 411-426, Mar. 2007. 2, 5, 6, 70, 81, 86, 92
[2] K. Fukushima and N. Wake, “Handwritten alphanumeric character recognition
by the neocognitron,” IEEE Trans. Neural Netw., vol. 2(3), pp. 355-365, May
1991. 2, 51, 176
[3] T. Masquelier, R. Guyonneau, and S. J. Thorpe, “Competitive STDP-based
spike pattern learning,” Neural Comp., 21, 1-18, 2008. 2
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning ap-
plied to document recognition,” Proc. IEEE, vol. 86(11), pp. 2278-2324, Nov.
1998. 2, 87, 90, 91, 92, 115, 116, 180, 181
[5] S. Furber, “The Future of Computer Technology and its Implications for the
Computer Industry”, J. Comput. vol. 51(6), pp. 735-740, 2008. 2
[6] E. M. Izhikevich, “Simple model of spiking neurons”, IEEE Transactions on
Neural Networks, vol. 14, pp. 1569-1572, 2003. 2, 17
[7] A. L. Hodgkin, A. F. Huxley, “A quantitative description of membrane current
and its application to conduction and excitation in nerve”, Journal of Physiology,
vol. 117(4), pp. 500-544, 1952. 2, 17
[8] G. M. Shepherd, The Synaptic Organization of the Brain, 3rd ed. New York:
Oxford Univ. Press, 1990. 5, 92, 171
[9] E. T. Rolls and G. Deco, Computational Neuroscience of Vision. New York:
Oxford Univ. Press, 2002. 5
[10] R. DeValois, D. Albrecht, and L. Thorell, “Spatial frequency selectivity of cells
in macaque visual cortex,” Vis. Res., vol. 22, pp. 545-559, 1982. 6
[11] R. DeValois, E. Yund, and N. Hepler, “The orientation and direction selectivity
of cells in macaque visual cortex,” Vis. Res., vol. 22, pp. 531-544, 1982. 6
[12] P. H. Schiller, B. L. Finlay, and S. F. Volman, “Quantitative studies of single-cell
properties in monkey striate cortex. Spatial frequency,” J. Neurophysiol., vol.
39(6), pp. 1334-1351, 1976. 6
[13] S. Grossberg, E. Mingolla, and J. Williamson, “Synthetic aperture radar processing
by a multiple scale neural system for boundary and surface representation,”
Neural Netw., vol. 8(7/8), pp. 1005-1028, 1995. 6
[14] S. Lawrence, C. L. Giles, A. Tsoi, and A. Back, “Face recognition: A convo-
lutional neural network approach,” IEEE Trans. Neural Netw., vol. 8(1), pp.
98-113, Jan. 1997. 6
[15] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128x128 120 dB 30 mW asyn-
chronous vision sensor that responds to relative intensity change,” IEEE J. Solid-
State Circuits, vol. 43(2), pp. 566-576, Feb. 2008. 6, 12, 14, 62, 87, 99, 146, 181
[16] J. Costas-Santos, T. Serrano-Gotarredona, R. Serrano-Gotarredona, and B.
Linares-Barranco, “A contrast retina with on-chip calibration for neuromorphic
spike-based AER vision systems,” IEEE Trans. Circuits Syst. I, Reg. Papers,
vol. 54(7), pp. 1444-1458, Jul. 2007. 6, 14, 62
[17] R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jimenez, and B.
Linares-Barranco, “A neuromorphic cortical layer microchip for spike based
event processing vision systems,” IEEE Trans. Circuits Syst. I, Reg. Papers,
vol. 53(12), pp. 2548-2566, Dec. 2006. 7, 13, 14, 46, 49, 54, 62
[18] E. Kandel, J. Schwartz, and T. M. Jessell, Principles of Neural Science. Elsevier,
New York, 1991. 8
[19] E. D. Adrian, Y. Zotterman, “The impulses produced by sensory nerve endings:
Part II: The response of a single end organ”, Journal of Physiology, vol. 61, pp.
151-71, 1926. 8
[20] R. Stein, E. Gossen, K. Jones, “Neuronal variability: noise or part of the sig-
nal?”, Nature Reviews Neuroscience vol. 6, pp. 389-397, 2005. 8
[21] A. L. Jacobs, et al., “Ruling out and ruling in neural codes,” Proc Natl Acad
Sci U S A, vol. 106, pp. 5936-41, 2009. 9
[22] S. J. Thorpe, “Spike arrival times: A highly efficient coding scheme for neural
networks,” Parallel processing in neural systems and computers, R. Eckmiller,
G. Hartmann, and G. Hauske, Editors. 1990, Elsevier: North-Holland. p. 91-94.
9
[23] R. VanRullen, R. Guyonneau, and S. J. Thorpe, “Spike times make sense,”
Trends Neurosci, vol 28, pp. 1-4, 2005. 9
[24] R. VanRullen and S. J. Thorpe, “Rate coding versus temporal order coding:
what the retinal ganglion cells tell the visual cortex,” Neural Comput, vol 13,
pp. 1255-83, 2001. 9
[25] T. Gollisch and M. Meister, “Rapid neural coding in the retina with relative
spike latencies,” Science, vol 319, pp. 1108-11, 2008. 9
[26] S Wu, S Amari, and H Nakahara, “Population Coding and Decoding in a Neural
Field: A Computational Study,” Neural Computation vol. 14, pp. 999-1026,
2002. 11
[27] J. H. R. Maunsell, D. C. Van Essen, “Functional properties of neurons in middle
temporal visual area of the Macaque monkey. I. Selectivity for stimulus direction,
speed, and orientation,” Journal of Neurophysiology, vol. 49, pp. 1127-1147,
1983. 11
[28] D. H. Hubel, T. N. Wiesel, “Receptive fields of single neurons in the cat’s striate
cortex,” Journal of Physiology, vol. 148, pp. 574-591, 1959. 12
[29] M. A. Montemurro, M. J. Rasch, Y. Murayama, N. K. Logothetis, S. Panzeri,
“Phase-of-Firing Coding of Natural Visual Stimuli in Primary Visual Cortex,”
Current Biology, vol. 18(5), pp. 375-380, 2008. 12
[30] M. Sivilotti, “Wiring considerations in analog VLSI systems with application to
field-programmable networks,” Ph.D. dissertation, Comput. Sci. Div., California
Inst. Technol., Pasadena, CA, 1991. 2, 13, 87, 172
[31] K. Boahen, “Point-to-point connectivity between neuromorphic chips using ad-
dress events,” IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol.
47(5), pp. 416-434, May 2000. 13, 87, 172
[32] E. Culurciello, R. Etienne-Cummings, and K. A. Boahen, “A biomorphic digital
image sensor,” IEEE J. Solid-State Circuits, vol. 38(2), pp. 281-294, Feb. 2003.
13, 62
[33] P. F. Ruedi, P. Heim, F. Kaess, E. Grenet, F. Heitger, P.-Y. Burgi, S. Gyger,
and P. Nussbaum, “A 128x128, pixel 120-dB dynamic-range vision-sensor chip
for image contrast and orientation extraction,” IEEE J. Solid-State Circuits,
vol. 38(12), pp. 2325-2333, Dec. 2003. 13
[34] C. Shoushun and A. Bermak, “A low power CMOS imager based on time-to-
first-spike encoding and fair AER,” in Proc. IEEE Int. Symp. Circuits Syst.,
2005, pp. 5306-5309. 13
[35] A. Delorme, L. Perrinet, and S. J. Thorpe, “Networks of integrate-and-fire neurons
using rank order coding B: Spike timing dependent plasticity and emergence
of orientation selectivity,” Neurocomputing, vol. 38-40, pp. 539-45, 2001.
[36] R. Serrano-Gotarredona, M. Oster, P. Lichtsteiner, A. Linares-Barranco, R.
Paz-Vicente, F. Gomez-Rodriguez, L. Camunas-Mesa, R. Berner, M. Rivas,
T. Delbruck, S. C. Liu, R. Douglas, P. Hafliger, G. Jimenez-Moreno, A.
Civit, T. Serrano-Gotarredona, A. Acosta-Jimenez, and B. Linares-Barranco,
“CAVIAR: A 45 k-Neuron, 5 M-Synapse, 12 G-connects/sec AER hardware
sensory-processing-learning-actuating system for high speed visual object recog-
nition and tracking,” IEEE Trans. Neural Netw., vol. 20(9), pp. 1417-1438, Sep.
2009. 13, 14, 16, 19, 30, 79, 96, 101
[37] T. Serrano-Gotarredona, A. G. Andreou, and B. Linares-Barranco, “AER image
filtering architecture for vision-processing systems,” IEEE Trans. Circuits Syst.
I, Fundam. Theory Appl., vol. 46(9), pp. 1064-1071, Sep. 1999. 13, 172
[38] D. H. Goldberg, G. Cauwenberghs, and A. G. Andreou, “Probabilistic synap-
tic weighting in a reconfigurable network of VLSI integrate-and-fire neurons,”
Neural Netw., vol. 14, pp. 781-793, 2001. 13
[39] R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jimenez, C.
Serrano-Gotarredona, J. A. Perez-Carrasco, A. Linares-Barranco, G. Jimenez-Moreno,
A. Civit-Balcells, and B. Linares-Barranco, “On real-time AER 2D
convolutions hardware for neuromorphic spike based cortical processing,” IEEE
Trans. Neural Netw., vol. 19(7), pp. 1196-1219, Jul. 2008. 13, 14, 54, 62, 101,
173, 180
[40] M. Azadmehr, J. Abrahamsen, and P. Hafliger, “A foveated AER imager chip,”
in Proc. IEEE Int. Symp. Circuits Syst., Kobe, Japan, 2005, pp. 2751-2754. 13
[41] K. A. Zaghloul and K. Boahen, “Optic nerve signals in a neuromorphic chip:
Part I and II,” IEEE Trans. Biomed. Eng., vol. 51(4), pp. 657-675, Apr. 2004.
14, 62
[42] K. Boahen, “Retinomorphic chips that see quadruple images,” in Proc. Int.
Conf. Microelectron. Neural Fuzzy Bio-Inspired Syst., Granada, Spain, 1999,
pp. 12-20. 14
[43] J. Lazzaro, J. Wawrzynek, M. Mahowald, M. Sivilotti, and D. Gillespie, “Silicon
auditory processors as computer peripherals,” IEEE Trans. Neural Netw., vol.
4(3), pp. 523-528, May 1993. 14, 87
[44] G. Cauwenberghs, N. Kumar, W. Himmelbauer, and A. G. Andreou, “An analog
VLSI chip with asynchronous interface for auditory feature extraction,” IEEE
Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 45(5), pp. 600-606,
May 1998. 14
[45] E. Chicca, A. M. Whatley, P. Lichtsteiner, V. Dante, T. Delbruck, P. Del Giu-
dice, R. J. Douglas, and G. Indiveri, “A multichip pulse-based neuromorphic
infrastructure and its application to a model of orientation selectivity,” IEEE
Trans. Circuits Syst. I, Reg. Papers, vol. 54(5), pp. 981-993, May 2007.
[46] M. Oster, Y. Wang, R. Douglas, and S.-C. Liu, “Quantification of a spike-based
winner-take-all VLSI network,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol.
55(10), pp. 3160-3169, Nov. 2008. 14, 16
14
[47] T. Teixeira, A. G. Andreou, and E. Culurciello, “Event-based imaging with
active illumination in sensor networks,” in Proc. IEEE Int. Symp. Circuits Syst.,
Kobe, Japan, 2005, pp. 644-647. 14
[48] A. Cohen, R. Etienne-Cummings, T. Horiuchi, G. Indiveri, S. Shamma, R. Douglas,
C. Koch, and T. Sejnowski, “Rep. 2004 Workshop on Neuromorphic Eng.,”
Telluride, CO, Jun. 27 to Jul. 17 2004. 172
[49] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P.
Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, and N. Borkar,
“An 80-Tile 1.28 TFLOPS network-on-chip in 65 nm CMOS,” in Proc. IEEE
Int. Solid-State Circ. Conf., Feb. 2007, pp. 98-99. 14
[50] R. Serrano-Gotarredona, et al., “AER Building Blocks for Multi-Layers Multi-Chips
Neuromorphic Vision Systems,” in Advances in Neural Information Processing
Systems, Vol. 18, Y. Weiss, B. Scholkopf, and J. Platt (Eds.), (NIPS’06),
MIT Press, Cambridge, MA, 1217-1224, (2006). 16, 46, 47, 54, 62, 75, 96, 172,
173
[51] T. Teixeira, E. Culurciello and A.G. Andreou, “An address-event image sensor
network,” Proceedings of the 2006 IEEE International Symposium on Circuits
and Systems, (ISCAS 2006), Kos, Greece, pp. 4467-4470, May 2006. 16
[52] R. Brette, M. Rudolph, T. Carnevale, et al. “Simulation of networks of spiking
neurons: A review of tools and strategies”, Journal of Computational Neuroscience,
Vol. 23(3), pp. 349-398, Dec 2007. 17
[53] R. Brette, W. Gerstner, “Adaptive exponential integrate-and-fire model as an
effective description of neuronal activity”, Journal of Neurophysiology, vol. 94, pp.
3637-3642, 2005. 17
[54] T. Clayton, “How much can we trust neural simulation strategies?”, Neurocom-
puting, vol. 70(10-12), pp. 1966-1969, June 2007. 17
[55] F. Gomez-Rodriguez, R. Paz, A. Linares-Barranco, M. Rivas, L. Miro, G.
Jimenez, A. Civit. “AER tools for Communications and Debugging”. Proc. IEEE
ISCAS06. Kos, Greece, May 2006. 19, 30, 75, 173
[56] L. A. Camunas-Mesa, A. Linares-Barranco, A. J. Acosta-Jimenez, T. Serrano-Gotarredona,
B. Linares-Barranco, “Improved AER Convolution Chip for Vision
Processing With Higher Resolution and New Functionalities,” Conference on
Design of Circuits and Integrated Systems 2009. Num. 21. Barcelona. DCIS.
2009. pp. 1-6. 21, 98, 184
[57] A. Linares-Barranco, F. Gomez-Rodriguez, G. Jimenez-Moreno, T. Delbruck, R.
Berner, et al., “Implementation of a Time-Warping AER Mapper,” Proc. IEEE
International Symposium on Circuits and Systems. IEEE Circuits and Systems
Society. pp. 2886-2889, Taiwan, 2009. 33
[58] L. Camunas-Mesa, A. Acosta-Jimenez, C. Zamarreno-Ramos, T. Serrano-
Gotarredona, and B. Linares-Barranco, “A 32x32 Pixel Convolution Processor
Chip for Address Event Vision Sensors with 155ns Event Latency and 20Meps
Throughput,” accepted for publication in IEEE Transactions On Circuits And
Systems, 2010. 37, 38, 40
[59] http://sourceforge.net/apps/trac/jaer/wiki. 48, 146, 154
[60] K. A. Zaghloul and K. Boahen, “Optic nerve signals in a neuromorphic chip:
Part 2,” IEEE Trans. Biomed. Eng., vol. 51, pp. 667-675, 2004. 62
[61] K. Fukushima: “Visual feature extraction by a multilayered network of analog
threshold elements”, IEEE Transactions on Systems Science and Cybernetics,
SSC-5 (4), pp. 322-333, Oct. 1969. 63
[62] T. Randen and J. H. Husoy, “Filtering for texture classification: A comparative
study,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 4, pp. 291-310,
Apr. 1999. 67, 68, 75, 80
[63] R. M. Haralick, “Statistical and structural approaches to texture,” Proc. IEEE,
vol. 67, no. 5, pp. 786-804, May 1979. 67
[64] G. V. D. Wouwer, P. Scheunders, and D. V. Dyck, “Statistical texture charac-
terization from discrete wavelet representations,” IEEE Trans. Image Process.,
vol. 8, no. 4, pp. 592-598, Apr. 1999. 67
[65] G. R. Cross and A. K. Jain, “Markov random field texture models,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 5, no.1, pp. 25-39, Jan. 1983. 67
[66] R. L. Kashyap and R. Chellappa, “Estimation and choice of neighbors in spatial-
interaction models of images,” IEEE Trans. Inf. Theory, vol. 29, no. 1, pp. 60-72,
Jan. 1983. 67
[67] R. M. Haralick, K. Shanmugan, and I. Dinstein, “Texture features for image
classification,” IEEE Trans. Syst. Man Cybern., vol. 3, no. 6, pp. 610-621, Nov.
1973. 68
[68] A. Speis and G. Healey, “Feature extraction for texture discrimination via ran-
dom field models with random spatial interaction,” IEEE Trans. Image Process.,
vol. 5, no. 4, pp. 635-645, Apr. 1996. 68
[69] T. Chang and C.-C. J. Kuo, “Texture analysis and classification with tree-structured
wavelet transform,” IEEE Trans. Image Process., vol. 2, no. 4, pp.
429-441, Apr. 1993. 68
[70] M. Unser, “Texture classification and segmentation using wavelet frames,” IEEE
Trans. Image Process., vol. 4, no. 11, pp. 1549-1560, Nov. 1995. 68
[71] G. M. Haley and B. S. Manjunath, “Rotation-invariant texture classification
using a complete space-frequency model,” IEEE Trans. Image Process., vol. 8,
no. 2, pp. 255-269, Feb. 1999. 68, 69
[72] W. T. Freeman and E. H. Adelson, “The design and use of steerable filters,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 9, pp. 891-906, Sep. 1991.
68
[73] J. G. Rosiles and M. J. T. Smith, “Texture classification with a biorthogonal
directional filter bank,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process.,
May 2001, pp. 1549-1552. 68
[74] J. Han and K. K. Ma, “Rotation-invariant and scale-invariant Gabor features
for texture image retrieval,” Image Vis. Comput., vol. 25, no. 9, pp. 1474-1481,
Sep. 2007. 68, 69, 76, 78, 80, 81
[75] T. Sikora, “The MPEG-7 visual standard for content description. An overview,”
IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, pp. 696-702, Jun. 2001.
68
[76] J. J. Kulikowski and P. O. Bishop, “Fourier analysis and spatial representation
in the visual cortex,” Experientia, vol. 37, pp. 160-163, 1981. 68
[77] D. A. Clausi and H. Deng, “Design-based texture feature fusion using Gabor
filters and co-occurrence probabilities,” IEEE Trans. Image Process., vol. 14,
no. 7, pp. 925-936, Jul. 2005. 68
[78] D. A. Clausi, “Comparison and fusion of co-occurrence, Gabor and MRF texture
features for classification of SAR sea-ice imagery,” Atmos. Oceans, vol. 39, no.
4, pp. 183-194, 2001.
[79] S. Li and J. Shawe-Taylor, “Comparison and fusion of multiresolution features
for texture classification,” Pattern Recognit. Lett., vol. 26, pp. 633-638, 2005.
[80] N. Qaiser, M. Hussain, A. Hussain, “Texture recognition by fusion of optimized
moment based and Gabor energy features,” Int. J. Comput. Sci. Network Secu-
rity, vol. 8, no. 2, pp. 264-270, Feb. 2008. 68
[81] C.C. Chen and C.C. Chen, “Filtering methods for texture discrimination,” Pat-
tern Recognit. Lett., vol. 20, pp. 783-790, 1999. 68, 80
[82] R. Picard, T. Kabir, and F. Liu, “Real-time recognition with the entire Brodatz
texture database,” in Proc. Comput. Vis. Pattern Recognit., 1993, pp. 638-639.
[83] P. P. Ohanian and R. C. Dubes, “Performance evaluation for four classes of
textural features,” Pattern Recognit., vol. 25, no. 8, pp. 819-833, 1992. 68, 80
[84] K. O. Cheng, N. F. Law, and W. C. Siu, “A novel fast and reduced redundancy
structure for multiscale directional filter banks,” IEEE Trans. Image Process.,
vol. 16, no. 8, pp. 2058-68, Aug. 2007. 68, 76, 78, 80, 81
[85] K. O. Cheng, N. F. Law, and W. C. Siu, “Multiscale directional filter bank with
applications to structured and random texture retrieval,” Pattern Recognit., vol.
40, no. 4, pp. 1182-1194, 2007. 68, 69, 76, 78, 80
[86] M. N. Do and M. Vetterli, “Pyramidal directional filter banks and curvelets,”
in Proc. IEEE Int. Conf. Image Process., Oct. 2001, vol. 3, pp. 158-161. 68
[87] M. N. Do and M. Vetterli, “The contourlet transform: An efficient directional
multiresolution image representation,” IEEE Trans. Image Process., vol. 14, no.
12, pp. 2091-2106, Dec. 2005. 68
[88] S. Lazebnik, C. Schmid, and J. Ponce, “A sparse texture representation using
local affine regions,” Proc. Comput. Vis. Pattern Recognit., vol. 2, pp. 319-324,
2003. 68, 76, 78, 80
[89] M. Mellor, B. W. Hong, and M. Brady, “Locally rotation, contrast, and scale
invariant descriptors for texture analysis,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 30, no. 1, pp. 52-61, Jan. 2008. 68, 76, 80
[90] M. Kokare, P. K. Biswas, and B. N. Chatterji, “Rotation-invariant texture image
retrieval using rotated complex wavelet filters,” IEEE Trans. Syst. Man Cybern.
B, Cybern., vol. 36, no. 6, pp. 1273-1282, Dec. 2006. 68, 69, 76, 78, 80, 81
[91] E. P. Simoncelli and W. T. Freeman, “The steerable pyramid: A flexible ar-
chitecture for multi-scale derivative computation,” in Proc. Int. Conf. Image
Process., Oct. 1995, pp. 444-447. 69
[92] M. Pi and H. Li, “Fractal indexing with the joint statistical properties and its
application in texture image retrieval,” IET Image Process., vol. 2, no. 4, pp.
218-230, 2008. 69, 76, 78, 80, 81
[93] M. H. Pi, C. S. Tong, S. K. Choy, and H. Zhang, “A fast and effective model
for wavelet subband histograms and its application in texture image retrieval,”
IEEE Trans. Image Process., vol. 15, no. 10, pp. 3078-3088, Oct. 2006. 69, 76,
78, 80, 81
[94] B. S. Manjunath and W. Y. Ma, “Texture features for browsing and retrieval
of image data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 18, no. 8, pp.
837-842, Aug. 1996. 69, 70, 71, 76, 78, 80, 81, 90, 179
[95] J.G. Daugman, “Complete Discrete 2D Gabor Transforms by Neural Networks
for Image Analysis and Compression,” IEEE Trans. ASSP, vol. 36, pp. 1169-1179,
July 1988. 69
[96] A.C. Bovik, M. Clark, and W.S. Geisler, “Multichannel Texture Analysis Using
Localized Spatial Filters,” IEEE Trans. Pattern Analysis and Machine Intelli-
gence, vol. 12, no. 1, pp. 55-73, Jan. 1990.
[97] B.S. Manjunath and R. Chellappa, “A Unified Approach to Boundary Detec-
tion,” IEEE Trans. Neural Networks, vol. 4, no. 1, pp. 96-108, Jan. 1993. 69
69
[98] J.G. Daugman, “High Confidence Visual Recognition of Persons by a Test of
Statistical Independence,” IEEE Trans. Pattern Analysis and Machine Intelli-
gence, vol. 15, no. 11, pp. 1148-1161, Nov. 1993. 69
[99] M. Lades et al., “Distortion Invariant Object Recognition in the Dynamic Link
Architecture,” IEEE Trans. Computers, vol. 42, no. 3, pp. 300-311, Mar. 1993.
[100] B.S. Manjunath and R. Chellappa, “A Feature Based Approach to Face Recog-
nition,” Proc. IEEE Conf. CVPR ’92, pp. 373-378, Champaign, Ill., June 1992.
69
[101] R.J. Ferrari, R.M. Rangayyan, J.E.L. Desautels, R.A. Borges, A.F. Frere,
“Analysis of asymmetry in mammograms via directional filtering with Gabor
wavelets,” IEEE Trans. Med. Imag. 20(9), pp. 953-964, 2001. 70
[102] R.J. Ferrari, R.M. Rangayyan, J.E.L. Desautels, R.A. Borges, and A.F. Frere,
“Automatic identification of the pectoral muscle in mammograms”, IEEE Trans.
on Medical Imaging, 23(2), pp. 232-245, 2004. 70
[103] D.H. Hubel and T.N. Wiesel, “Functional Architecture of Macaque Monkey
Visual Cortex,” Proc. Royal Soc. B (London), vol. 198, pp. 1-59, 1978. 70
[104] S. Marcelja, “Mathematical Description of the Responses of Simple Cortical
Cells,” J. Optical Soc. Am., vol. 70, pp. 1297-1300, 1980. 70
[105] J.G. Daugman, “Two-Dimensional Spectral Analysis of Cortical Receptive Field
Profile,” Vision Research, vol. 20, pp. 847-856, 1980. 70
[106] J.G. Daugman, “Uncertainty Relation for Resolution in Space, Spatial Frequency,
and Orientation Optimized by Two-Dimensional Visual Cortical Filters,”
J. Optical Soc. Amer., vol. 2, no. 7, pp. 1160-1169, 1985. 70
[107] B. S. Manjunath, Jens-Rainer Ohm, Vinod V. Vasudevan, and Akio Yamada,
“Color and Texture Descriptors,” IEEE Trans. on Circuits and Systems for
Video Technology, vol. 11, no. 6, June 2001. 70
[108] M. Jian, J. Dong, R. Tang, “Combining Color, Texture and Region with Objects
of User’s Interest for Content-based Image Retrieval”, in Proceedings of Eighth
ACIS International Conference on Software Engineering, Artificial Intelligence,
Networking, and Parallel/Distributed Computing, pp. 764-769, IEEE Computer
Society, 2007.
[109] S. Bhagavathy and B. S. Manjunath, “Modeling and Detection of Geospatial
Objects Using Texture Motifs,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 44, no. 12, December 2006.
[110] S. Newsam and B. S. Manjunath, “Normalized texture motifs and their appli-
cation to statistical object modeling,” IEEE International Conference on Com-
puter Vision and Pattern Recognition: Workshop on Perceptual Organization
in Computer Vision, Washington, D. C., June 2004.
[111] J. Dong, M. Jian, D. Gao, S. Wang. “Reducing the Dimensionality of Feature
Vectors for Texture Image Retrieval Based on Wavelet Decomposition”, in Proc.
of Eighth ACIS International Conference on Software Engineering, Artificial
Intelligence, Networking, and Parallel/Distributed Computing, IEEE Computer
Society, 2007. 75
[112] B. S. Manjunath, B. Sumengen, Z. Bi, J. Byun, M. El-Saban, D. Fedorov, and
N. Vu, “Towards automated bioimage analysis: From features to semantics,” in
IEEE Int. Symposium on Biomedical Imaging (ISBI), 2006.
[113] Text of ISO/IEC 15 938-3 Multimedia Content Description Interface-Part 3:
Visual. Final Committee Draft, ISO/IEC/JTC1/SC29/ WG11, Doc. N4062,
Mar. 2001.
[114] MPEG-7 Visual Experimentation Model (XM), Version 10.0,
ISO/IEC/JTC1/SC29/WG11, Doc. N4063, Mar. 2001.
[115] B.S. Manjunath, P. Salembier, and T. Sikora, Eds., Introduction to MPEG-7:
Multimedia Content Description Interface, John Wiley and Sons, first edition,
2002.
[116] Z. Sun, G. Bebis, and R. Miller, “On-Road Vehicle Detection Using Evolu-
tionary Gabor Filter Optimization”, IEEE Trans. On Intelligent Transportation
Systems, 6(2), 2005, 125-137. 70
[117] K. Xu, B. Georgescu, D. Comaniciu, P. Meer, “Performance Analysis in Content-
based Retrieval with Textures”. ICPR 2000, pp. 4275-4278. 72, 75
[118] M. Kokare, B.N. Chatterji and P.K. Biswas, “Comparison of Similarity Metrics
for Texture Image Retrieval”, IEEE Digital index No. 0-7803-7651-X, 2003. 72
[119] L. Chen, G. Lu, D. Zhang, “Effects of Different Gabor Filter Parameters on Im-
age Retrieval by Texture,” pp.273, Proceedings of the 10th International Multi-
media Modelling Conference, 2004 (MMM’04). 72
[120] P. Brodatz, Textures: A Photographic Album for Artists and Designers. New
York: Dover, 1966. 68, 75, 76, 77
[121] A.P.N. Vo, T. T. Nguyen, S. Oraintara, “Texture Image Retrieval Using Com-
plex Directional Filter Bank,” 2006 IEEE International Symposium on Circuits
and Systems, Island of Kos, Greece, May 2006. 75
[122] B.S. Manjunath, C. Shekhar, and R. Chellappa, “A New Approach to Image
Feature Detection with Applications,” Pattern Recognition, vol. 29(4), pp. 627-
640, April 1996. 75
[123] M. Jian, J. Dong, D. Gao, Z. Liang, “New Texture Features Based on Wavelet
Transform Coinciding with Human Visual Perception,” in Proc. of Eighth ACIS
International Conference on Software Engineering, Artificial Intelligence, Net-
working, and Parallel/Distributed Computing, IEEE Computer Society, 2007.
[124] M. Kokare, P. K. Biswas, B. N. Chatterji, “Rotation invariant texture features
using rotated complex wavelet for content based image retrieval,” ICIP 2004,
pp. 393-396.
[125] Z. Liu and S. Wada, “Robust Feature Extraction Technique for Texture Image
Retrieval,” ICIP 2005, pp. 525-528.
[126] G. Guo; H. J. Zhang; S.Z. Li, “Distance-From-Boundary As A Metric for Tex-
ture Image Retrieval,” IEEE International Conference on Acoustics, Speech and
Signal Processing, vol. 3, pp. 1629-1632, 2001.
[127] Y. Liu, and X. Zhou, “A Simple Texture Descriptor for Texture Retrieval”,
Proceedings of ICTT, pp. 1662-1665, 2003.
[128] X. Fu, Y. Li, R. Harrison, S. Belkasim, “Content-based Image Retrieval Using
Gabor-Zernike Features,” ICPR 2006, pp. 417-420.
[129] B.S. Manjunath, P. Wu, S. Newsam, H.D. Shin, “A texture descriptor for
browsing and similarity retrieval,” SP:IC(16), No. 1-2, September 2000, pp. 33-
43.
[130] A. Ahmadiad , E. Faramarz, Sayadian, “Image Indexing and Retrieval Using
Gabor Wavelet and Legendre Moments,” Proceedings of the 25th Annual Inter-
national Conference of the IEEE EMBS, Cancun, Mexico, pp.17-21, September
2003. 75
[131] M.N. Do, and M. Vetterli, “Wavelet-based texture retrieval using generalized
Gaussian density and Kullback-Leibler distance,” IEEE Trans. Image Process.
v11(2), pp. 146-158. February 2002. 75
[132] A. Ahmadian, A. Mostafa, “An efficient texture classification algorithm using
Gabor Wavelet”, Proceedings of the 25th Annual International Conference of
the IEEE EMBS, Cancun, Mexico, September 17-21, 2003, pp. 930-933. 75
[133] Y. Liu, D. S. Zhang, G. Lu and W-Y. Ma, “Study on Texture Feature Extraction
in Region-based Image Retrieval System”, In Proc. of IEEE International Conf.
on Multimedia Modeling (MMM06), pp.264-271, Beijing, Jan. 2006.
[134] P. Howarth., S. Ruger., “Robust texture features for still-image retrieval,”
VISP(152), No. 6, pp. 868-874, December 2005. 75
[135] R. Picard, C. Graczyk, S. Mann, et al., “Vision texture 1.0,” tech. rep.,
Media Laboratory, MIT, 1995.
http://www.white.media.mit.edu/vismod/imagery/VisionTexture/vistex.html.
75
[136] R. Marculescu, P. Bogdan, “The chip is the network: Toward a science of
network-on-chip design”, Foundations and Trends in Electronic Design Automation,
pp. 371-461, March 2009. 80
[137] T. Serre, “Learning a dictionary of shape-components in visual cortex: Com-
parison with neurons, humans and machines,” MIT. Comput. Sci. & AI Lab,
Cambridge, MA, Tech. Rep. MIT-CSAIL-TR-2006-028 CBCL-260, 2006. 86
[138] H. Fujii, H. Ito, K. Aihara, N. Ichinose, and M. Tsukada, “Dynamical cell assem-
bly hypothesis - Theoretical possibility of spatio-temporal coding in the cortex,”
Neural Netw., vol. 9, pp. 1303-1350, 1996.
[139] Y. Le Cun and Y. Bengio, “Convolutional networks for images, speech, and time
series,” in Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed.
Cambridge, MA: MIT Press, 1995, pp. 255-258.
[140] M. Matsugu, K. Mori, M. Ishi, and Y. Mitarai, “Convolutional spiking neural
network model for robust face detection,” in Proc. 9th Int. Conf. Neural Inf.
Process., 2002, vol. 2, pp. 660-664.
[141] B. Fasel, “Robust face analysis using convolutional neural networks,” in Proc.
Int. Conf. Pattern Recognit., pp. 40-43, 2002.
[142] M. Browne and S. S. Ghidary, “Convolutional neural networks for image pro-
cessing: An application in robot vision,” in Advances in Artificial Intelligence.
Cambridge, MA: MIT Press, 2003, pp. 641-652.
[143] C. Neubauer, “Evaluation of convolution neural networks for visual recognition,”
IEEE Trans. Neural Netw., vol. 9, no. 4, pp. 685-696, Jul. 1998.
[144] S. Thorpe, D. Fize, and C. Marlot, “Speed of processing in the human visual
system,” Nature, vol. 381, pp. 520-522, Jun. 1996.
[145] K. Fukushima, “Analysis of the process of visual pattern recognition by the
neocognitron,” Neural Netw., vol. 2, pp. 413-420, 1989.
[146] L. Itti, “Quantitative modeling of perceptual salience at human eye position,”
Vis. Cogn., vol. 14, no. 4, pp. 959-984, 2006.
[147] L. Itti, “Quantifying the contribution of low-level saliency to human eye move-
ments in dynamic scenes,” Vis. Cogn., vol. 12, no. 6, pp. 1093-1123, Aug. 2005.
[148] T. Crimmins, “Geometric filter for speckle reduction,” Appl. Opt., vol. 24, pp.
1438-1443, 1985.
[149] S. Grossberg, E. Mingolla, and W. D. Ross, “Visual brain and visual perception:
How does the cortex do perceptual grouping?,” Trends Neurosci., vol. 20, pp.
106-111, 1997.
[150] E. Mingolla, W. Ross, and S. Grossberg, “A neural network for enhancing bound-
aries and surfaces in synthetic aperture radar images,” Neural Netw., vol. 12,
no. 3, pp. 499-511, 1999. 86
[151] V. Jain, et al., “Supervised learning of image restoration with convolutional
networks,” IEEE Int. Conf. Comput. Vis. (ICCV), 2007. 86
[152] R. Collobert and J. Weston, “A unified architecture for natural language pro-
cessing. Deep neural networks with multitask learning,” Proc. Int. Conf. on
Machine Learning (ICML 08), pp. 160-167, 2008. 86
[153] R. Hadsell, P. Sermanet, M. Scoffier, A. Erkan, K. Kavukcuoglu, U. Muller and
Y. LeCun, “Learning Long-Range Vision for Autonomous Off-Road Driving,”
J. Field Robotics, vol. 26(2), pp. 120-144, February 2009. 86
[154] Y. Bengio and Y. LeCun, “Scaling learning algorithms towards AI,” In L. Bot-
tou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Ma-
chines. MIT Press, 2007. 86
[155] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,”
Neural Comput., vol. 1, pp. 541-551, 1989. 92
[156] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the Best
Multi-Stage Architecture for Object Recognition?,” in Proc. International Con-
ference on Computer Vision (ICCV’09), 2009. 99
[157] P. Lichtsteiner, J. Kramer, T. Delbruck, “Improved ON/OFF temporally differ-
entiating address-event imager,” 11th IEEE International Conference on Elec-
tronics, Circuits and Systems (ICECS 2004) Tel Aviv, Israel, pp. 211-214. 88
[158] A. Gentile and D. S. Wills, “Portable video supercomputing,” IEEE Trans.
Comput., vol. 53(8), pp. 960-972, Aug. 2004.
[159] V. Wall, M. Torkelson, and P. Egelberg, “A custom image convolution DSP
with a sustained calculation capacity of >1 GMAC/s and low I/O bandwidth,”
J. VLSI Signal Process., vol. 23, pp. 355-349, 1999.
[160] H. Kwon, “A low-power image convolution algorithm for variable voltage pro-
cessors,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2003, vol. 2,
pp. 677-680.
[161] H. H. Cat, A. Gentile, J. C. Eble, M. Lee, O. Vendier, Y. J. Joo, D. S. Wills,
M. Brooke, N. M. Jokerst, and A. S. Brown, “SIMPil: An OE integrated SIMD
architecture for focal plane processing applications,” in Proc. 3rd Int. Conf.
Massively Parallel Process. Using Opt. Interconnects, 1996, pp. 44-52.
[162] F. Paillet, D. Mercier, and T. M. Bernard, “Making the most of 15k silicon
area for a digital retina PE,” in Proc. SPIE Adv. Focal Plane Arrays Electron.
Cameras II, Zurich, Switzerland, May 1998, vol. 3410, pp. 158-167. 86
[163] C. Farabet, C. Poulet and Y. LeCun, “An FPGA-Based Stream Processor for
Embedded Real-Time Vision with Convolutional Networks”, in Proc. of the Fifth
IEEE Workshop on Embedded Computer Vision (ECV’09 ICCV’09), IEEE,
Kyoto, 2009. 87
[164] A. Torralba, “How many pixels make an image?,” Visual Neuroscience, vol.
26(1), pp. 123-131, 2009.
[165] L. Camunas-Mesa, et al., “Fully digital AER convolution chip for vision processing,”
Proc. 2008 IEEE Int. Symp. Circ. & Syst. (ISCAS08), pp. 652-655, May
2008. 96, 101, 107, 108, 183
[166] R. Etienne-Cummings, Z. K. Kalayjian, and D. Cai, “A programmable focal-
plane MIMD image processor chip,” IEEE J. Solid-State Circuits, vol. 36(1),
pp. 64-73, Jan. 2001.
[167] D. B. Strukov and K. K. Likharev, “CMOL FPGA: a configurable architecture
for hybrid digital circuits with two-terminal nanodevices,” Nanotechnology 16,
pp. 888-900, 2005.
[168] B. Linares-Barranco and T. Serrano-Gotarredona, “Memristance
can explain Spike-Time-Dependent-Plasticity in Neural Synapses”,
http://hdl.handle.net/10101/npre.2009.3010.1, 2009.
[169] A. Linares-Barranco, G. Jimenez-Moreno, B. Linares-Barranco and A. Civit-Balcells,
“On Algorithmic Rate-Coded AER Generation,” IEEE Trans. on Neural
Networks, vol. 17, No. 3, pp. 771-788, May 2006.
[170] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, “Optimization by simulated
annealing,” Science, vol. 220, pp. 671-689, 1983. 143, 146
Publications
Journal Papers
1. J. A. Perez-Carrasco, B. Acha, C. Serrano, L. Camunas-Mesa, T. Serrano-Gotarredona,
and B. Linares-Barranco, “Fast Vision through Frame-less Event-based Sensing
and Convolutional Processing. Application to Texture Recognition,” IEEE Trans.
Neural Networks, vol. 21, No. 4, pp. 609-620, April 2010.
2. R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jimenez, C. Serrano-Gotarredona,
J. A. Perez-Carrasco, A. Linares-Barranco, G. Jimenez-Moreno, A. Civit-Balcells,
and B. Linares-Barranco, “On Real-Time AER 2D Convolutions
Hardware for Neuromorphic Spike Based Cortical Processing,” IEEE Trans. Neural
Networks, vol. 19, No. 7, pp. 1196-1219, July 2008.
Under Review or in Preparation
1. S. Chen, P. Akselrod, E. Culurciello, J. A. Perez Carrasco, B. Linares-Barranco,
“Efficient feedforward categorization of objects and human postures with address-
event image sensors”, submitted to IEEE Pattern Analysis and Machine Intelli-
gence (PAMI).
2. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, and B.
Linares-Barranco, “Event-Driven convolutional networks for fast vision posture
recognition”, in preparation for IEEE Pattern Analysis and Machine Intelligence
(PAMI).
Conference Proceedings
1. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-
Barranco, “Spike-Based Convolutional Network for real-time processing”, 20th
International Conference on Pattern Recognition (ICPR 2010), pp. 3085-3088,
Istanbul, Turkey, 2010.
2. J. A. Perez-Carrasco, C. Zamarreno-Ramos, L. Camunas-Mesa, T. Serrano-Gotarredona,
and B. Linares-Barranco, “On Neuromorphic Spiking Architectures for Asynchronous
STDP Memristive Systems”, IEEE International Symposium on Circuits
and Systems (ISCAS 2010), Paris, France, 2010.
3. L. Camunas-Mesa, J. A. Perez-Carrasco, C. Zamarreno-Ramos, T. Serrano-Gotarredona,
and B. Linares-Barranco, “On Scalable Spiking ConvNet Hardware for Cortex-
Like Visual Sensory Processing Systems”, IEEE International Symposium on Cir-
cuits and Systems (ISCAS 2010), Paris, France, 2010.
4. S. Thorpe, A. Brilhault, J. A. Perez-Carrasco, “Suggestions for a Biologically
Inspired Spiking Retina Using Order-Based Coding”, IEEE International Symposium
on Circuits and Systems (ISCAS 2010), Paris, France, 2010.
5. J. A. Perez-Carrasco, C. Zamarreno-Ramos, T. Serrano-Gotarredona, and B.
Linares-Barranco, “Neocortical Frame-free Vision Sensing and Processing through
Scalable Spiking ConvNet Hardware”, Luis Camunas-Mesa, 2010 IEEE World
Congress on Computational Intelligence, Barcelona, Spain, 2010.
6. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco,
“Advanced Vision Processing Systems: Spike-Based Simulation and
Processing”, Advanced Concepts for Intelligent Vision Systems (ACIVS 2009),
pp. 640-651, Bordeaux, France, 2009.
7. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco,
“Simulación de Sistemas Basados en Eventos”, XXIV Simposium Nacional
de la Unión Científica Internacional de Radio (URSI 2009), Santander,
Spain, September 2009.
8. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco,
“Procesamiento rápido de visión basado en AER”, XXIV Simposium
Nacional de la Unión Científica Internacional de Radio (URSI 2009), Santander,
Spain, September 2009.
9. J. A. Perez-Carrasco, B. Acha, C. Serrano, “Calibración colorimétrica para el
diagnóstico automático de quemaduras”, XXIV Simposium Nacional de la Unión
Científica Internacional de Radio (URSI 2009), Santander, Spain, September 2009.
10. J. A. Perez-Carrasco, C. Serrano, B. Acha, “Clasificación de Lesiones de Piel
Basada en Filtros de Gabor y Color”, Congreso Anual de la Sociedad Española de
Ingeniería Biomédica (CASEIB 2009), Num. 27, pp. 125-128, Cádiz, Spain, 2009.
11. J. A. Perez-Carrasco, T. Serrano-Gotarredona, C. Serrano-Gotarredona, B. Acha,
B. Linares-Barranco, “High-Speed Character Recognition System Based on a
Complex Hierarchical AER Architecture”, IEEE International Symposium
on Circuits and Systems (ISCAS 2008), pp. 2150-2153, Seattle, Washington,
USA, 18-21 May 2008.
12. J.A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-
Barranco, “Event Based Vision Sensing and Processing”, Proceedings of the 15th
IEEE International Conference on Image Processing (ICIP 2008), pp. 1392-1395,
San Diego, California, USA, 12-15 October 2008.
13. J. A. Perez-Carrasco, C. Serrano, B. Acha, T. Serrano-Gotarredona, B. Linares-Barranco,
“Simulador de Sistemas AER Basado en Eventos”, XXIII Simposium
Nacional de la Unión Científica Internacional de Radio (URSI 2008), Madrid,
Spain, September 2008.
14. J. A. Perez-Carrasco, T. Serrano-Gotarredona, C. Serrano, B. Acha, B.
Linares-Barranco, “On the Computational Power of Address-Event Representation
(AER) Vision Processing Hardware”, DCIS 2007, pp. 21-23, Sevilla, Spain,
2007.