# Open the box of digital neuromorphic processor: Towards effective algorithm-hardware co-design

Guangzhi Tang\*, Ali Safa<sup>†‡</sup>, Kevin Shidqi\*, Paul Detterer\*, Stefano Traferro\*, Mario Konijnenburg\*, Manolis Sifalakis\*, Gert-Jan van Schaik\*, Amirreza Yousefzadeh\* \*imec Netherlands, Eindhoven, Netherlands, †imec, Leuven, Belgium, ‡KU Leuven, Leuven, Belgium

Abstract: Sparse and event-driven spiking neural network (SNN) algorithms are the ideal candidate solution for energy-efficient edge computing. Yet, with the growing complexity of SNN algorithms, it is not easy to properly benchmark and optimize their computational cost without hardware in the loop. Although digital neuromorphic processors have been widely adopted to benchmark SNN algorithms, their black-box nature is problematic for algorithm-hardware co-optimization. In this work, we open the black box of the digital neuromorphic processor for algorithm designers by presenting the neuron processing instruction set and detailed energy consumption of the SENeCA neuromorphic architecture. For convenient benchmarking and optimization, we provide the energy cost of the essential neuromorphic components in SENeCA, including neuron models and learning rules. Moreover, we exploit SENeCA's hierarchical memory and show an advantage over existing neuromorphic processors. We show the energy efficiency of SNN algorithms for video processing and online learning, and demonstrate the potential of our work for optimizing algorithm designs. Overall, we present a practical approach to enable algorithm designers to accurately benchmark SNN algorithms and pave the way towards effective algorithm-hardware co-design.

## I. INTRODUCTION

Energy-efficient computations are essential for edge applications that operate with limited energy resources. Brain-inspired spiking neural networks (SNNs) have the potential to reduce energy costs by introducing sparse and event-driven computation [1], making them ideal candidate solutions for the edge. However, the low-power assumption of SNN algorithms is not always valid if computational costs are not properly benchmarked. Many works use the sparsity of synaptic operations to demonstrate efficiency [2]-[4], disregarding additional expenses introduced by hardware primitives like memory access or instruction operation. Since SNN algorithms require dedicated hardware, namely the neuromorphic processor, algorithm designs based on inaccurate hardware assumptions can fail to realize potential advantages. Therefore, there is a need for effective algorithm-hardware co-design to truly realize the promised benefits of neuromorphic computing.

Digital neuromorphic processors provide the opportunity to benchmark the energy efficiency of SNNs [5]–[9]. However, these processors behave like a black box for algorithm designers. First, their bottom-up designs support restricted predefined computational elements and leave limited space for co-optimizing new algorithms with the hardware. Second, the coarse benchmarking results from the hardware do not provide precise insight into the design of the SNN algorithm to locate potential optimizations. Although there are neuromorphic processors developed using co-design approaches [10]–[13], they are mainly confined to a specific SNN algorithm and are hard to use by algorithm designers without a sufficient hardware background. Therefore, algorithm designers need a flexible neuromorphic processor design with transparent and customizable internal operations.



Fig. 1: The pipeline of a SENeCA neuromorphic core (left), the interconnected mesh architecture via NoC (right), and the hierarchical memory consisting of register files (orange), local SRAM memories (green) and large shared memories (gray).

In this work, we precisely detail the neuron processing instruction set of SENeCA [14], our scalable and flexible digital neuromorphic architecture, to help algorithm designers conveniently benchmark and optimize the cost of their novel SNN algorithms. To demonstrate the potential of SENeCA on algorithm-hardware co-design, we show three levels of abstraction to benchmark costs for SNN algorithms. The main contributions of this paper are the following:

- 1) We conduct circuit-level energy measurements on neuron processing instructions in SENeCA (Section II). This will enable the algorithm design community to accurately estimate the energy cost of their novel SNN algorithms without running them on the actual hardware.
- 2) We provide essential neuromorphic components (neuron models, learning rules, and hierarchical memory exploitation) constructed using SENeCA instructions, together with their energy costs (Section III). This layer of abstraction will enable algorithm designers to quickly estimate hardware overheads of typical SNN topologies without resorting to low-level instruction.
- 3) To clearly verify the usefulness of our contributions, we illustrate how our framework can be utilized to compute the energy efficiency of different SNN algorithms targeting video processing and online learning (Section IV), based on the energy costs provided in this work.

## II. NEURON PROCESSING ON NEUROMORPHIC PROCESSOR

The SENeCA neuromorphic architecture performs event-driven computation with time-multiplexing Neuron Processing Elements (NPEs) emulating numerous neurons per core (Figure 1). To provide sufficient flexibility, SENeCA embeds a RISC-V controller that enables customizable processing pipelines, rich NPE instructions for versatile computations, and hierarchical memories to optimize the deployment and processing of networks. When a new event enters the core, the RISC-V is interrupted from sleep, preprocesses the event, writes information into the NPEs, and activates the neuron processing before returning to sleep. After output events are captured from the NPEs, they wake the RISC-V from sleep again and are communicated to other cores via the NoC.

### A. NPE and Neuron Processing Instruction Set

The NPE is the central neuron processing unit in the SENeCA core, which accelerates a rich neuron processing instruction set (Table I). Each instruction is executed in one cycle (pipelined, 2ns per cycle) and operates in BrainFloat 16 (BF16) format [15]. SNN algorithms can be built from different sequential executions of the instructions, namely micro-kernels. These micro-kernels are stored in the register files of the loop buffer and sent to the NPEs during runtime. For efficient time-multiplexing, the loop buffer executes micro-kernels in a "for-loop" fashion on the NPEs and incrementally calculates Data-Memory addresses. This design gives a much lower cost than using the more flexible instruction memory (Table II). Depending on the event type, the RISC-V controller selects which micro-kernel to process on the NPEs. Neuron processing operates with hierarchical memory, including register files, local data memory, and external shared memory if the model cannot fit locally. To introduce intra-core parallelism, the NPEs in the SENeCA core form a SIMD (single instruction multiple data) type architecture [16] that accesses data through a wide data memory port in parallel. The NPE also supports quantized integer data types (Int4 and Int8) to reduce energy costs (see Section III-E). When events are generated, the event capture unit converts them to the Address Event Representation (AER) form [17] before sending them to the RISC-V and NoC. The present version of the SENeCA core has 8 NPEs and 64 registers per NPE. These numbers are parameterized and can be fine-tuned before synthesis.
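As a rough illustration of the time-multiplexed, SIMD-style execution described above, the following sketch estimates an idealized latency for running one micro-kernel over many neurons. The 2 ns cycle and the 8 NPEs come from the text; the function name, the assumption that every loop iteration serves 8 neurons in lockstep, and the neglect of pipeline fill, memory stalls and RISC-V overhead are simplifications made for this example only.

```python
import math

CYCLE_NS = 2.0  # from the text: one pipelined instruction per 2 ns cycle
NUM_NPES = 8    # from the text: NPEs per core in the present version


def microkernel_latency_ns(num_instructions, num_neurons,
                           num_npes=NUM_NPES, cycle_ns=CYCLE_NS):
    """Idealized latency of one micro-kernel over `num_neurons` neurons.

    Assumes (hypothetically) that each loop-buffer iteration processes
    `num_npes` neurons in parallel and that no cycles are lost to stalls.
    """
    iterations = math.ceil(num_neurons / num_npes)
    return iterations * num_instructions * cycle_ns


# A 4-instruction micro-kernel over 1024 neurons: 128 iterations x 4 cycles.
latency = microkernel_latency_ns(num_instructions=4, num_neurons=1024)
print(latency)  # 1024.0 (ns)
```

The estimate is a lower bound under these assumptions; it is meant to show how the NPE count and micro-kernel length interact, not to predict measured latency.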

### B. Circuit-level Energy Measurements

We report the average energy consumption of the NPE instructions in Table I. The pre-silicon energy numbers include the power consumption of all the modules needed to execute the instruction (e.g., address calculations in the loop buffer, access to instruction memory, etc.). The results are measured by running each instruction 8k times with random data using Cadence JOULES (time-based mode), an RTL-level power measurement tool (within 15% of signoff power) [18], with the GF-22nm FDX technology node<sup>1</sup>. The leakage power for the core is around $30\mu W$ (0.06pJ in a 2ns clock cycle). For clarity, we report the memory and NoC information in Table II. Since a typical SNN has significantly more synaptic operations than events, the computational cost of synaptic operations (done in NPEs) largely dominates that of event preprocessing (RISC-V) and communication (NoC). Therefore, for simplicity and due to limited space, we ignore the RISC-V and NoC costs in this paper.
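The leakage figure quoted above follows directly from the leakage power and the cycle time. As a short worked check, using only the numbers given in the text:

$$E_{\text{leak}} = P_{\text{leak}} \times t_{\text{cycle}} = 30\,\mu\text{W} \times 2\,\text{ns} = 60\,\text{fJ} = 0.06\,\text{pJ}$$

For comparison, a single arithmetic instruction at 1.4 pJ (Table I) costs roughly $1.4 / 0.06 \approx 23$ times the per-cycle leakage energy of the core.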

## III. ESSENTIAL NEUROMORPHIC COMPONENTS

Direct optimization of complex algorithms at the instruction level is difficult. A level of abstraction for essential components of the SNN algorithm can significantly simplify benchmarking and optimization. Here, we present neuron models and learning rules constructed from NPE instructions and compute their cost using circuit-level power measurements (see Table III). Furthermore, we exploit the hierarchical memory in SENeCA and compare the costs of synaptic operations when using quantized integer weights and multi-event processing.

TABLE I: Neuron Processing Energy Consumption

| Instruction             | Description             | Energy (pJ) |
|-------------------------|-------------------------|-------------|
| ADD/SUB/MUL/DIV         | Arithmetic ops.         | 1.4         |
|                         | 2xINT8b Arithmetic ops. | 1.2         |
| GTH/MAX/MIN             | Compare ops.            | 1.2         |
| EQL/ABS                 | Compare ops.            | 1.1         |
| AND/ORR                 | Bit-wise ops.           | 1.1         |
| SHL/SHR                 | Bit-wise ops.           | 1.2         |
| I2F                     | Data type cnv.          | 1.1         |
| RND                     | Data type cnv.          | 1.4         |
| EVC                     | Event capture           | 0.5         |
|                         | + if generates event    | +1.1        |
| MLD                     | Data mem load           | 3.7         |
| MST                     | Data mem store          | 3.9         |
| RISC-V pre/post process | Per instruction         | 11.6        |
|                         | + Data mem access       | +10.0       |

TABLE II: Memory Size and Energy Consumption

| Memory                | Size                     | Energy (fJ/b) |
|-----------------------|--------------------------|---------------|
| Register-File (NPE)   | $64W \times 16b$         | 12.0          |
| SRAM (Inst/Data Mem)  | $8KW \times 32b \ (2Mb)$ | 200           |
| NoC event             | 32b                      | 65.62         |
| HBM (Shared Mem) [19] | 32b (multi Gb)           | 7000          |

### A. Integrate and Fire Neuron

Integrate and Fire (IF) neurons are widely used for SNN processing [20]–[22]. Here, we define an IF neuron as:

$$v_i[k] \leftarrow v_i[k-1] \times (1 - s_{out,i}[k-1]) + \sum_j w_{ij} \times s_{in,j}[k]$$

$$s_{out,i}[k] \leftarrow H(v_i[k] - v_{th}) \tag{1}$$

where k is the time step, $v_i$ is the state of neuron i, $s_{in,j}$ is the input spike from neuron j, $s_{out,i}$ is the output spike of neuron i, $w_{ij}$ is the weight, $v_{th}$ is the voltage threshold and H is the Heaviside function. The first micro-kernel in Component 1 (Listing 1) integrates spikes as soon as they arrive, and the second micro-kernel generates spikes at the end of each time step.
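To make the IF update rule concrete, here is a minimal Python sketch of Eq. (1) for a single neuron. The function name and the example weights, spikes and threshold are illustrative only. The sketch treats H as a strict "greater than" comparison, which is an assumption about the behaviour at exactly $v_i = v_{th}$.

```python
def if_neuron_step(v_prev, s_prev, weights, spikes_in, v_th):
    """One time step of Eq. (1) for a single IF neuron.

    v_prev, s_prev: state and output spike from the previous step.
    weights, spikes_in: per-input weights w_ij and binary spikes s_in,j.
    """
    # Reset the state if the neuron fired at the previous step, then integrate.
    v = v_prev * (1 - s_prev) + sum(w * s for w, s in zip(weights, spikes_in))
    # Heaviside threshold (assumed strict: spike only if v exceeds v_th).
    s_out = 1 if v > v_th else 0
    return v, s_out


# Illustrative run: two inputs, threshold 1.0.
weights = [0.6, 0.5]
v, s = 0.0, 0
spike_history = []
for spikes_in in ([1, 0], [0, 1], [1, 0]):
    v, s = if_neuron_step(v, s, weights, spikes_in, v_th=1.0)
    spike_history.append(s)
print(spike_history)  # [0, 1, 0]
```

The neuron accumulates 0.6, crosses the threshold at 1.1 and fires, and then restarts from a reset state at the third step.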

### B. Sigma Delta Neuron

Sigma Delta (SD) neurons sparsify deep neural networks (DNNs) by communicating temporal activation differences through events [23]. First, the sigma integrates events as:

$$z_i[k] \leftarrow z_i[k-1] + \sum_j w_{ij} \times o_{in,j}[k] \tag{2}$$

where  $z_i$  is the sigma state of neuron i and  $o_{in,j}$  is the input event from neuron j. Then, the delta generates events as:

$$o_{out,i}[k] \leftarrow \operatorname{round}\left(\frac{f(z_i[k])}{q}\right) \times q - \operatorname{round}\left(\frac{f(z_i[k-1])}{q}\right) \times q \tag{3}$$

where $o_{out,i}$ is the output event of neuron i, f is a non-linear activation function (e.g., ReLU), the round function rounds a number to the nearest integer and q is the scaling factor. The quantization can significantly increase the sparsity of the events [24]. The first micro-kernel in Component 2 (Listing 2) integrates events as soon as they arrive, and the second operates at a flexible frequency while maintaining equivalence to the trained DNN [25].
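The sigma and delta stages of Eqs. (2) and (3) can be sketched in a few lines of Python. The class and method names are hypothetical, ReLU is used for f as in the example given in the text, and Python's built-in `round` (which rounds exact ties to the nearest even integer) stands in for the hardware rounding, whose tie-breaking behaviour is not specified here.

```python
def relu(x):
    return max(x, 0.0)


class SigmaDeltaNeuron:
    """Single sigma-delta neuron following Eqs. (2) and (3)."""

    def __init__(self, q):
        self.q = q               # scaling factor q
        self.z = 0.0             # sigma state z_i
        self.prev_quant = 0.0    # quantized activation from the last evaluation

    def integrate(self, weight, event):
        """Sigma stage, Eq. (2): accumulate one weighted input event."""
        self.z += weight * event

    def fire(self):
        """Delta stage, Eq. (3): emit the change in quantized activation."""
        quant = round(relu(self.z) / self.q) * self.q
        out = quant - self.prev_quant
        self.prev_quant = quant
        return out


neuron = SigmaDeltaNeuron(q=0.5)
outputs = []
for weight, event in [(1.0, 0.3), (1.0, 0.1), (2.0, 0.4)]:
    neuron.integrate(weight, event)
    outputs.append(neuron.fire())
print(outputs)  # [0.5, 0.0, 0.5]
```

The second evaluation produces a zero output because the quantized activation did not change, which is the source of event sparsity. The running sum of the outputs always equals the current quantized activation.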

<sup>1</sup>In the typical corner (0.8 V and 25 °C, no back-biasing).

#### Micro-kernel 1: Spike Integration. See Eq. (1).

```
MLD(R0, ADD1, 1) //load weight w_{ij}
MLD(R1, ADD2, 0) //load state v_i
ADD(R1, R0, R1)  //v_i = v_i + w_{ij}
MST(ADD2, R1, 1) //store R1 in v_i
```

#### Micro-kernel 2: Spike Generation. See Eq. (1).

```
MLD(R0, ADD1, 0) //load state v_i
GTH(R2, R0, R1)  //generate spike H(v_i-R1)
MUL(R3, R2, R0)  //v_i \times s_{out,i}
SUB(R0, R0, R3)  //reset state if spike
MST(ADD1, R0, 1) //store R0 in v_i
EVC(R2)          //capture event
```

Listing 1: Integrate and Fire Neuron

#### Micro-kernel 1: Sigma Integration. See Eq. (2).

```
MLD(R0, ADD1, 1) //load weight w_{ij}
MLD(R1, ADD2, 0) //load sigma state z_i
MUL(R3, R0, R2)  //w_{ij} \times o_{in,j}, R2 \leftarrow o_{in,j}
ADD(R1, R1, R3)  //z_i = z_i + w_{ij} \times o_{in,j}
MST(ADD2, R1, 1) //store R1 in z_i
```

#### Micro-kernel 2: Delta Difference. See Eq. (3).

```
MLD(R0, ADD1, 1) //load sigma state z_i
MLD(R1, ADD2, 0) //load quantized f(z_i[k-1])
MAX(R0, R0, R2)  //ReLU f(z_i[k]) = max(z_i, 0), R2 \leftarrow 0
DIV(R0, R0, R3)  //f(z_i[k]) / q, R3 \leftarrow q
RND(R0, R0)      //round to closest integer
MUL(R0, R0, R3)  //rescale quantization
SUB(R3, R0, R1)  //delta f(z_i[k]) - f(z_i[k-1])
MST(ADD2, R0, 1) //store R0 in quantized f(z_i[k])
EVC(R3)          //capture event
```

Listing 2: Sigma Delta Neuron

### C. Hebbian Learning

Hebbian learning and its variants are bio-inspired unsupervised learning rules that have been extensively used to train shallow SNNs [26]. In contrast to backprop-based learning, Hebbian learning schemes do not suffer from update locking and weight transport problems [10], making them better suited for low-complexity on-chip learning [27]. Given a layer of spiking neurons with fully-connected connections, the Hebbian learning rule modifies the weight as follows [28]:

$$w_{ij}[k] \leftarrow w_{ij}[k-1] + \eta \times \operatorname{trace}\{s_{out,i}\}[k] \times \operatorname{trace}\{s_{in,j}\}[k] \quad (4)$$

where  $\eta$  is the learning rate and trace $\{.\}$  is an estimator of the local spiking rate via low-pass filtering:

$$\operatorname{trace}\{s\}[k] \leftarrow \beta \times \operatorname{trace}\{s\}[k-1] + (1-\beta) \times s[k] \tag{5}$$

where $\beta$ is the decay constant. The micro-kernels in Component 3 (Listing 3) update the SNN weights at the end of each time step.
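Eqs. (4) and (5) translate almost line by line into code. The following Python sketch applies one trace update and one weight update to a small fully-connected layer. The function names, layer size and the values of $\eta$ and $\beta$ are illustrative assumptions.

```python
def update_trace(trace, spike, beta):
    """Eq. (5): low-pass filter of a spike train."""
    return beta * trace + (1.0 - beta) * spike


def hebbian_step(weights, trace_out, trace_in, eta):
    """Eq. (4): w_ij <- w_ij + eta * trace{s_out,i} * trace{s_in,j}."""
    return [[w + eta * t_out * t_in for w, t_in in zip(row, trace_in)]
            for row, t_out in zip(weights, trace_out)]


beta, eta = 0.9, 0.1
trace_in = [0.0, 0.0]      # two input neurons
trace_out = [0.0]          # one output neuron
weights = [[0.5, 0.5]]     # weights[i][j] = w_ij

# Input 0 and the output neuron spike together for two time steps.
for _ in range(2):
    trace_in = [update_trace(t, s, beta) for t, s in zip(trace_in, [1, 0])]
    trace_out = [update_trace(t, s, beta) for t, s in zip(trace_out, [1])]
    weights = hebbian_step(weights, trace_out, trace_in, eta)

print(trace_in, weights)
```

Only the synapse from the co-active input is strengthened; the synapse from the silent input keeps its initial weight because its trace stays at zero.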

### D. Gradient-based Online Learning (e-prop)

Gradient-based online learning performs end-to-end learning in SNN by estimating gradients using only local information [29]–[32]. Here, we show the e-prop learning [29] in SENeCA as an example. First, the eligibility trace  $e_{ij}$  combines pre- and post-synaptic activities:

$$e_{ij}[k] \leftarrow e_{ij}[k-1] + h(v_i[k]) \times \operatorname{trace}\{s_{in,j}\}[k] \tag{6}$$

where h is the surrogate gradient function. The weight updates when there are error events from the supervised signal:

$$\Delta w_{ij} = -\eta \times e_{ij} \times \sum_k b_{ik} \times y_k \tag{7}$$

#### Micro-kernel 1: Weight Update. See Eq. (4).

```
MLD (R0, ADD1, 0) //load weight w_{ij}
MLD (R1, ADD2, 1) //load trace\{s_{in,j}\}
MUL (R1, R1, R2) //trace\{s_{out,i}\} × trace\{s_{in,j}\}
MUL (R1, R1, R3) //\eta × R1, R3 \leftarrow \eta
ADD (R0, R0, R1) //update weight
MST (ADD1, R0, 1) //store R0 in w_{ij}
```

#### Micro-kernel 2: Spike Trace Update. See Eq. (5).

```
MLD (R0, ADD1, 0) //load trace
MLD (R1, ADD2, 1) //load input s
MUL (R0, R0, R2) //\beta × trace, R2 \leftarrow \beta
MUL (R1, R1, R3) //(1-\beta) × s, R3 \leftarrow (1-\beta)
ADD (R0, R0, R1) //update trace
MST (ADD1, R0, 1) //store R0 in trace
```

Listing 3: Hebbian Learning

#### Micro-kernel 1: Eligibility Trace Update. See Eq. (6).

#### Micro-kernel 2: Weight Update. See Eq. (7).

```
MLD(R0, ADD1, 0) //load weight w_{ij}
MLD(R1, ADD2, 1) //load e_{ij}
MLD(R2, ADD3, 1) //load feedback error
MUL(R1, R3, R1)  //\eta \times e_{ij}, R3 \leftarrow \eta
MUL(R2, R2, R1)  //\eta \times e_{ij} \times \Sigma_k b_{ik} \times y_k
SUB(R0, R0, R2)  //update weight
MST(ADD1, R0, 1) //store R0 in w_{ij}
```

Listing 4: Gradient-based Online Learning (e-prop)

In Eq. (7), $b_{ik}$ is the feedback weight and $y_k$ is the error event from the output layer. We implemented the learning rule using four SENeCA micro-kernels. The first micro-kernel in Component 4 (Listing 4) updates the eligibility trace every time step, using a rectangular function for h as introduced in [33], and the second micro-kernel in Component 4 updates the weights when a supervised signal is available. Additionally, we use micro-kernel 2 in Component 3 to compute $\operatorname{trace}\{s_{in,j}\}$ and micro-kernel 1 in Component 2 to compute $\sum_k b_{ik} \times y_k$.
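The two e-prop updates in Eqs. (6) and (7) can be sketched for a single synapse as follows. The function names and all numeric values are illustrative. The rectangular surrogate is written here with unit amplitude and a hypothetical `width` parameter centred on the threshold; the exact scaling used in [33] or on SENeCA may differ.

```python
def rect_surrogate(v, v_th, width):
    """Rectangular surrogate gradient h(v): 1 inside a window around v_th.

    Unit amplitude and the window width are assumptions for this sketch.
    """
    return 1.0 if abs(v - v_th) < width / 2 else 0.0


def update_eligibility(e, v, trace_in, v_th, width):
    """Eq. (6): e_ij <- e_ij + h(v_i) * trace{s_in,j}."""
    return e + rect_surrogate(v, v_th, width) * trace_in


def eprop_weight_update(w, e, feedback_weights, errors, eta):
    """Eq. (7): w_ij <- w_ij - eta * e_ij * sum_k b_ik * y_k."""
    feedback_error = sum(b * y for b, y in zip(feedback_weights, errors))
    return w - eta * e * feedback_error


# Every time step: accumulate the eligibility trace.
e = update_eligibility(e=0.0, v=0.9, trace_in=0.2, v_th=1.0, width=0.5)

# When error events arrive: apply the weight update.
w = eprop_weight_update(w=0.5, e=e, feedback_weights=[1.0, -1.0],
                        errors=[0.3, 0.1], eta=0.1)
print(e, w)
```

The split mirrors the update frequencies described above: the eligibility trace is refreshed every time step, while the weight changes only when a supervised error signal is present.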

### E. Efficient Synaptic Operation with Hierarchical Memory

The measurement results show that memory accesses dominate the total energy consumption of neuron processing. The hierarchical memory architecture in SENeCA allows for data reuse in the NPE register files, thereby reducing the more expensive SRAM accesses. This reduction is achieved using quantized weights and multi-event processing. Using quantized weights (4-bit or 8-bit) reduces the number of SRAM reads per weight. However, there is an overhead, as each weight needs to be converted into BF16 using the I2F instruction before computation. As another example of data reuse in the NPEs, processing multiple events in one iteration also reduces the SRAM accesses. The neuron state becomes stationary on the NPEs, which avoids frequent accesses to the states in the SRAM. Using fully integer operations on INT4 weights and INT8 states further reduces memory accesses and thereby decreases the energy cost.

TABLE III: Neuromorphic Components Energy Consumption

| Component | Micro-kernel | Energy (pJ) | Frequency |
|-----------|--------------|-------------|-----------|
| IF Neuron | 1            | 12.7        | event     |
|           | 2            | 13.2        | time step |
| SD Neuron | 1            | 14.1        | event     |
|           | 2            | 19.7        | flexible  |
| Hebbian   | 1            | 15.5        | time step |
| Learning  | 2            | 15.5        | time step |
| Gradient  | 1            | 22.9        | time step |
| Learning  | 2            | 19.2        | flexible  |
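Most entries of Table III can be reproduced by summing the per-instruction energies of Table I over the instructions in Listings 1 to 4. The sketch below does this in Python. The dictionary and function names are ours, and matching the values for the two spike-generating micro-kernels requires including the event-generation surcharge of the EVC instruction, which we assume is how the table was computed. Micro-kernel 1 of the gradient-based rule is omitted because its instruction listing is not reproduced in this text.

```python
# Per-instruction energy in pJ, copied from Table I.
ENERGY_PJ = {
    "ADD": 1.4, "SUB": 1.4, "MUL": 1.4, "DIV": 1.4,
    "GTH": 1.2, "MAX": 1.2, "MIN": 1.2,
    "EQL": 1.1, "ABS": 1.1, "AND": 1.1, "ORR": 1.1,
    "SHL": 1.2, "SHR": 1.2, "I2F": 1.1, "RND": 1.4,
    "EVC": 0.5, "MLD": 3.7, "MST": 3.9,
}
EVENT_GENERATION_PJ = 1.1  # extra EVC cost when an event is generated


def microkernel_energy(instructions, generates_event=False):
    """Sum Table I energies over a micro-kernel's instruction sequence."""
    total = sum(ENERGY_PJ[op] for op in instructions)
    if generates_event:
        total += EVENT_GENERATION_PJ
    return round(total, 2)


# Instruction sequences taken from Listings 1-4.
MICRO_KERNELS = {
    "IF 1": (["MLD", "MLD", "ADD", "MST"], False),
    "IF 2": (["MLD", "GTH", "MUL", "SUB", "MST", "EVC"], True),
    "SD 1": (["MLD", "MLD", "MUL", "ADD", "MST"], False),
    "SD 2": (["MLD", "MLD", "MAX", "DIV", "RND", "MUL", "SUB", "MST", "EVC"], True),
    "Hebbian 1": (["MLD", "MLD", "MUL", "MUL", "ADD", "MST"], False),
    "Hebbian 2": (["MLD", "MLD", "MUL", "MUL", "ADD", "MST"], False),
    "Gradient 2": (["MLD", "MLD", "MLD", "MUL", "MUL", "SUB", "MST"], False),
}

for name, (ops, fires) in MICRO_KERNELS.items():
    print(f"{name}: {microkernel_energy(ops, fires)} pJ")
```

The same function can be applied to any new micro-kernel written with the instructions of Table I, which is the intended use of the instruction-level measurements.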

TABLE IV: Energy per Synaptic Operation in SENeCA

| Weight, 1 Event          | BF16      | Int8          | Int4   | Int4 <sup>2</sup> |
|--------------------------|-----------|---------------|--------|-------------------|
| Energy (pJ)              | 12.7      | 11.95         | 11.03  | 5.63              |
| Weight, 4 Events         | BF16      | Int8          | Int4   | Int4 <sup>2</sup> |
| Energy (pJ)              | 7.0       | 6.25          | 5.33   | 2.78              |
| Hardware                 | Loihi [6] | TrueNorth [5] | Neuron | Flow [7]          |
| Energy (pJ) <sup>3</sup> | 23        | 2.5           | 2      | 20                |

Table IV shows the average energy cost per IF neuron synaptic operation (i.e., spike integration) when using integer weights with one-event and four-event processing. By exploiting hierarchical memory, SNN algorithms in SENeCA can potentially achieve lower energy costs compared to existing digital neuromorphic processors without hierarchical memory [5]–[7] (see Table IV bottom row).
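The BF16, Int8 and Int4 columns of Table IV can be recovered from Table I with a simple accounting, sketched below in Python. The accounting is our own reading rather than a stated formula: it assumes that one 16-bit load fetches 16/b packed b-bit weights, that each quantized weight costs one I2F conversion, and that the state load and store are shared by all events processed in one iteration. The integer-arithmetic column (Int4 <sup>2</sup>) is not modelled.

```python
# Instruction energies in pJ from Table I.
MLD_PJ, MST_PJ, ADD_PJ, I2F_PJ = 3.7, 3.9, 1.4, 1.1
WORD_BITS = 16  # assumed width of one data-memory load per NPE


def synop_energy(weight_bits, events_per_iteration):
    """Estimated energy (pJ) per IF synaptic operation.

    weight_bits: 16 (BF16), 8 (Int8) or 4 (Int4).
    events_per_iteration: events sharing one state load and one state store.
    """
    weights_per_load = WORD_BITS // weight_bits
    weight_load = MLD_PJ / weights_per_load            # amortized weight fetch
    convert = I2F_PJ if weight_bits < WORD_BITS else 0.0  # integer -> BF16
    state_access = (MLD_PJ + MST_PJ) / events_per_iteration
    return weight_load + convert + ADD_PJ + state_access


for events in (1, 4):
    row = [round(synop_energy(bits, events), 3) for bits in (16, 8, 4)]
    print(f"{events} event(s): BF16, Int8, Int4 = {row}")
```

Under these assumptions the estimates agree with the tabulated values to within rounding, and they make the two savings explicit: quantization amortizes the weight fetch, and multi-event processing amortizes the state load and store.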

## IV. APPLICATION-LEVEL BENCHMARKING

To illustrate how the cost of neuromorphic components reported in Table III can be used to reliably estimate the energy cost of solving downstream tasks and optimize the algorithm based on application needs, we perform an application-level benchmarking of SNN algorithms for video processing and online Hebbian learning.

### A. Sigma Delta Network for Video Processing

Sigma Delta networks can achieve more than 90% synaptic operation sparsity when performing video-based human action recognition without sacrificing accuracy [24]. Here, we compute the energy cost of employing SD neurons in ResNet-50 [34] and MobileNet [35] on SENeCA, for efficient video processing using the UCF-101 human action recognition dataset [36]. Using Table III, we calculate the average energy cost of the networks per frame, as shown in Table V, by counting the number of synaptic operations (sigma) and neuron output evaluations (delta). Compared to the estimated energy cost in [24], the precise instruction-level results given here reflect the actual energy cost of hardware processing more accurately. Although a single delta operation requires more instructions than a sigma operation, Table V shows that sigma operations cost much more energy than delta operations, due to the difference in execution dimensionality. Therefore, an algorithm designer can reduce the number of events using a more complex delta unit with only negligible energy overhead.
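The per-frame calculation described above amounts to multiplying operation counts by the micro-kernel energies of Table III. A minimal Python sketch follows. The per-operation energies come from Table III, while the function name and the operation counts in the usage example are hypothetical placeholders, not the counts measured for ResNet-50 or MobileNet.

```python
SIGMA_PJ = 14.1  # Table III, SD neuron micro-kernel 1 (per synaptic operation)
DELTA_PJ = 19.7  # Table III, SD neuron micro-kernel 2 (per neuron evaluation)
PJ_TO_UJ = 1e-6


def sd_frame_energy_uj(num_sigma_ops, num_delta_ops):
    """Return (sigma energy, delta energy) per frame in microjoules."""
    sigma_uj = num_sigma_ops * SIGMA_PJ * PJ_TO_UJ
    delta_uj = num_delta_ops * DELTA_PJ * PJ_TO_UJ
    return sigma_uj, delta_uj


# Hypothetical counts for one frame: 100M synaptic operations, 5M evaluations.
sigma_uj, delta_uj = sd_frame_energy_uj(100e6, 5e6)
print(f"sigma: {sigma_uj:.1f} uJ, delta: {delta_uj:.1f} uJ")
```

Because each neuron evaluation fans out to many synaptic operations, the sigma term dominates even though the delta micro-kernel is more expensive per call.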

### B. Unsupervised Hebbian Learning for Digit Classification

TABLE V: Energy consumption per frame for SD networks

| Network Model | Sigma Energy (μJ) | Delta Energy (μJ) |
|---------------|-------------------|-------------------|
| ResNet-50     | 3504.0            | 181.2             |
| MobileNet     | 1792.3            | 104.4             |

Fig. 2: Accuracy vs. energy consumption of SNN-Hebbian learning execution (one time-step), with M output neurons.

To demonstrate the cost of online learning, we consider a canonical digit classification task [37] with a dataset composed of 1797 instances of $8 \times 8$ grayscale images normalized between 0 and 1 [38]. We flatten each image to a 64-dimension grayscale vector and encode each entry of the vector into a 100 time-step spike train using Poisson spike encoding.
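The Poisson spike encoding of the normalized pixel values can be sketched as follows. This is a common rate-coding approximation in which each time step emits a spike with probability equal to the pixel intensity; the function names and the seeded generator are choices made for this example, and the exact encoder used for the experiments may differ.

```python
import random


def poisson_encode(intensity, num_steps=100, rng=None):
    """Encode an intensity in [0, 1] as a binary spike train.

    At each time step a spike is emitted with probability `intensity`
    (a Bernoulli approximation of a Poisson spike process).
    """
    rng = rng or random.Random()
    return [1 if rng.random() < intensity else 0 for _ in range(num_steps)]


def encode_image(pixels, num_steps=100, seed=0):
    """Encode a flattened image (list of intensities) into spike trains."""
    rng = random.Random(seed)
    return [poisson_encode(p, num_steps, rng) for p in pixels]


# A flattened 8x8 image gives 64 spike trains of 100 time steps each.
image = [i / 63 for i in range(64)]
spike_trains = encode_image(image)
print(len(spike_trains), len(spike_trains[0]))  # 64 100
```

Brighter pixels produce denser spike trains, so the number of input events, and with it the synaptic energy, scales with image intensity.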

In SENeCA, we implement a modified version of the SNN architecture proposed in [26], where we use the IF neuron model of Section III-A and the Hebbian learning model of Section III-C to replace the leaky IF neurons and the Hebbian-like STDP rule used in [26]. We can then estimate the energy consumption of the network as a function of the number of output neurons M and the input dimension N, by counting the number of instructions executed for each completed SNN execution step (i.e., forward propagation of the input spikes, recurrent propagation of the output spikes, feedback propagation of the output spikes and all Hebbian learning mechanisms in [26]). The energy consumption is then found by combining the number of occurrences of each instruction with the energy measurements provided in Tables I and III. In the illustrative case of the canonical digit classification dataset [37], the input data dimension is N = 64 and the output dimension M can be arbitrarily chosen, leading to a trade-off between energy consumption and classification accuracy, as shown in Figure 2. The ability to accurately generate this trade-off gives algorithm designers the opportunity to optimize the network size based on the needs of the application without the overhead of hardware deployment.
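To illustrate how such an estimate scales with M, the sketch below builds a deliberately simplified per-time-step energy model from the Table III values. It counts only the forward path, the neuron updates and the Hebbian updates of one fully-connected layer; the recurrent and feedback paths of the architecture in [26] are left out, so the function and its outputs are an assumption-laden illustration rather than the model behind Figure 2.

```python
# Micro-kernel energies in pJ from Table III.
IF_INTEGRATE_PJ = 12.7  # IF micro-kernel 1, per input spike and synapse
IF_FIRE_PJ = 13.2       # IF micro-kernel 2, per neuron and time step
HEBB_WEIGHT_PJ = 15.5   # Hebbian micro-kernel 1, per synapse and time step
HEBB_TRACE_PJ = 15.5    # Hebbian micro-kernel 2, per trace and time step


def step_energy_pj(n_inputs, m_outputs, input_spikes):
    """Simplified energy (pJ) of one time step for an N-to-M layer.

    input_spikes: number of input neurons that spike in this time step.
    Recurrent and feedback propagation are not modelled (simplification).
    """
    forward = input_spikes * m_outputs * IF_INTEGRATE_PJ
    fire = m_outputs * IF_FIRE_PJ
    weight_updates = n_inputs * m_outputs * HEBB_WEIGHT_PJ
    trace_updates = (n_inputs + m_outputs) * HEBB_TRACE_PJ
    return forward + fire + weight_updates + trace_updates


# N = 64 inputs as in the digit task; 16 spiking inputs is a made-up value.
for m in (10, 20, 40):
    print(m, round(step_energy_pj(64, m, input_spikes=16), 1))
```

Even in this reduced form the cost grows roughly linearly with M, and the per-synapse weight updates dominate, which is the kind of insight the trade-off curve is meant to expose.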

## V. CONCLUSION

This paper presents a practical approach to properly benchmark SNN algorithms using the neuron processing instruction set of the SENeCA neuromorphic architecture. We strongly believe that the instructions and micro-kernels provided here, together with their precise energy measurements, will allow a reliable estimate of energy consumption at algorithm design time. We hope that this work will greatly help algorithm designers to conveniently benchmark the hardware costs of their various SNN algorithms, and will enable further optimization of these costs via effective algorithm-hardware co-design.

## ACKNOWLEDGMENT

This work is partially funded by research and innovation projects ANDANTE (ECSEL JU under grant agreement No. 876925), DAIS (KDT JU under grant agreement No. 101007273) and MemScale (Horizon EU under grant agreement 871371). The JU receives support from the European Union's Horizon 2020 research and innovation programme and Sweden, Spain, Portugal, Belgium, Germany, Slovenia, Czech Republic, Netherlands, Denmark, Norway and Turkey.

<sup>2</sup>Using INT8 state and integer arithmetic operations.

<sup>3</sup>Results without synaptic operation details; may use a different neuron model.

## REFERENCES

- [1] W. Maass, "Networks of spiking neurons: the third generation of neural network models," *Neural networks*, vol. 10, no. 9, pp. 1659–1671, 1997.
- [2] A. Sengupta, Y. Ye, R. Wang, C. Liu, and K. Roy, "Going deeper in spiking neural networks: Vgg and residual architectures," Frontiers in neuroscience, vol. 13, p. 95, 2019.
- [3] S. Kim, S. Park, B. Na, and S. Yoon, "Spiking-yolo: spiking neural network for energy-efficient object detection," in *Proceedings of the AAAI conference on artificial intelligence*, vol. 34, no. 07, 2020, pp. 11270–11277.
- [4] L. Cordone, B. Miramond, and S. Ferrante, "Learning from event cameras with sparse spiking convolutional neural networks," in 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8.
- [5] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura *et al.*, "A million spiking-neuron integrated circuit with a scalable communication network and interface," *Science*, vol. 345, no. 6197, pp. 668–673, 2014.
- [6] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., "Loihi: A neuromorphic manycore processor with on-chip learning," *Ieee Micro*, vol. 38, no. 1, pp. 82–99, 2018.
- [7] O. Moreira, A. Yousefzadeh, F. Chersi, G. Cinserin, R.-J. Zwartenkot, A. Kapoor, P. Qiao, P. Kievits, M. Khoei, L. Rouillard et al., "Neuronflow: a neuromorphic processor architecture for live ai applications," in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2020, pp. 840–845.
- [8] N. Kumar, G. Tang, R. Yoo, and K. P. Michmizos, "Decoding eeg with spiking neural networks on neuromorphic hardware," *Transactions on Machine Learning Research*, 2022.
- [9] G. Tang, N. Kumar, and K. P. Michmizos, "Reinforcement co-learning of deep and spiking neural networks for energy-efficient mapless navigation with neuromorphic hardware," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 6090–6097.
- [10] C. Frenkel, J.-D. Legat, and D. Bol, "A 28-nm convolutional neuromorphic processor enabling online learning with spike-based retinas," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2020, pp. 1–5.
- [11] H. Fang, B. Taylor, Z. Li, Z. Mei, H. H. Li, and Q. Qiu, "Neuromorphic algorithm-hardware codesign for temporal pattern learning," in 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021, pp. 361–366.
- [12] Y. Zhong, X. Cui, Y. Kuang, K. Liu, Y. Wang, and R. Huang, "A spike-event-based neuromorphic processor with enhanced on-chip stdp learning in 28nm cmos," in 2021 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2021, pp. 1–5.
- [13] G. Datta, S. Kundu, A. Jaiswal, and P. A. Beerel, "Ace-snn: Algorithm-hardware co-design of energy-efficient & low-latency deep spiking neural networks for 3d image recognition," Frontiers in neuroscience, p. 400, 2022.
- [14] A. Yousefzadeh, G.-J. van Schaik, M. Tahghighi, P. Detterer, S. Traferro, M. Hijdra, J. Stuijt, F. Corradi, M. Sifalakis, and M. Konijnenburg, "Seneca: Scalable energy-efficient neuromorphic computer architecture," in 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2022, pp. 371–374.
- [15] D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee, S. Avancha, D. T. Vooturi, N. Jammalamadaka, J. Huang, H. Yuen et al., "A study of bfloat16 for deep learning training," arXiv preprint arXiv:1905.12322, 2019.
- [16] M. J. Flynn, "Some computer organizations and their effectiveness," IEEE transactions on computers, vol. 100, no. 9, pp. 948–960, 1972.
- [17] T. Iakymchuk, A. Rosado, T. Serrano-Gotarredona, B. Linares-Barranco, A. Jiménez-Fernández, A. Linares-Barranco, and G. Jiménez-Moreno, "An aer handshake-less modular infrastructure pcb with x8 2.5gbps lvds serial links," in 2014 IEEE International Symposium on Circuits and Systems (ISCAS), 2014, pp. 1556–1559.
- [18] Joules rtl power solution. [Online]. Available: https://www.cadence.com/content/dam/cadence-www/global/en\_US/documents/tools/digital-design-signoff/joules-rtl-power-solution-ds.pdf
- [19] Power consumption of hbm memory technology. [Online]. Available: https://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus-hbm.html
- [20] J. P. Abrahamsen, P. Hafliger, and T. S. Lande, "A time domain winner-take-all network of integrate-and-fire neurons," in *Proceedings of 2004 IEEE International Symposium on Circuits and Systems*, vol. 5. IEEE, 2004, pp. V–V.

- [21] G. Indiveri, F. Stefanini, and E. Chicca, "Spike-based learning with a generalized integrate and fire silicon neuron," in *Proceedings of 2010 IEEE International Symposium on Circuits and Systems*. IEEE, 2010, pp. 1951–1954.
- [22] J. Stuijt, M. Sifalakis, A. Yousefzadeh, and F. Corradi, "µbrain: An event-driven and fully synthesizable architecture for spiking neural networks," *Frontiers in neuroscience*, vol. 15, p. 538, 2021.
- [23] P. O'Connor and M. Welling, "Sigma delta quantized networks," in 5th International Conference on Learning Representations, ICLR 2017, 2017. [Online]. Available: https://openreview.net/forum?id=HkNRsU5ge
- [24] A. Yousefzadeh and M. Sifalakis, "Delta activation layer exploits temporal sparsity for efficient embedded video processing," in 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022, pp. 01–10.
- [25] A. Yousefzadeh, M. A. Khoei, S. Hosseini, P. Holanda, S. Leroux, O. Moreira, J. Tapson, B. Dhoedt, P. Simoens, T. Serrano-Gotarredona et al., "Asynchronous spiking neurons, the natural key to exploit temporal sparsity," *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, vol. 9, no. 4, pp. 668–678, 2019.
- [26] A. Safa, I. Ocket, A. Bourdoux, H. Sahli, F. Catthoor, and G. G. Gielen, "Event camera data classification using spiking networks with spike-timing-dependent plasticity," in 2022 International Joint Conference on Neural Networks (IJCNN), 2022, pp. 1–8.
- [27] A. Safa, J. Van Assche, M. D. Alea, F. Catthoor, and G. G. Gielen, "Neuromorphic near-sensor computing: From event-based sensing to edge learning," *IEEE Micro*, pp. 1–8, 2022.
- [28] M. Payvand, Y. Demirag, T. Dalgaty, E. Vianello, and G. Indiveri, "Analog weight updates with compliance current modulation of binary rerams for on-chip learning," in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), 2020, pp. 1–5.
- [29] G. Bellec, F. Scherr, A. Subramoney, E. Hajek, D. Salaj, R. Legenstein, and W. Maass, "A solution to the learning dilemma for recurrent networks of spiking neurons," *Nature communications*, vol. 11, no. 1, pp. 1–15, 2020.
- [30] G. Tang, N. Kumar, I. Polykretis, and K. P. Michmizos, "Biograd: Biologically plausible gradient-based learning for spiking neural networks," arXiv preprint arXiv:2110.14092, 2021.
- [31] T. Bohnstingl, S. Woźniak, A. Pantazi, and E. Eleftheriou, "Online spatio-temporal learning in deep neural networks," *IEEE Transactions* on Neural Networks and Learning Systems, 2022.
- [32] E. O. Neftci, C. Augustine, S. Paul, and G. Detorakis, "Event-driven random back-propagation: Enabling neuromorphic deep learning machines," *Frontiers in neuroscience*, vol. 11, p. 324, 2017.
- [33] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, "Spatio-temporal backpropagation for training high-performance spiking neural networks," Frontiers in neuroscience, vol. 12, p. 331, 2018.
- [34] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2016, pp. 770–778.
- [35] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
- [36] K. Soomro, A. R. Zamir, and M. Shah, "Ucf101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
- [37] F. Alimoglu and E. Alpaydin, "Combining multiple representations and classifiers for pen-based handwritten digit recognition," in *Proceedings* of the Fourth International Conference on Document Analysis and Recognition, vol. 2, 1997, pp. 637–640 vol.2.
- [38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in python," *Journal of machine learning research*, vol. 12, no. Oct, pp. 2825–2830, 2011.