On the suitability of SIMD extensions for neural network simulation

https://doi.org/10.1016/S0141-9331(03)00062-0

Abstract

Current microprocessors contain SIMD execution units (also called multimedia or vector extensions) that allow the data-parallel execution of operations on several subwords packed in 64-bit or 128-bit registers. They can accelerate not only typical multimedia applications but also many other algorithms based on vector and matrix operations. This paper presents the results of a detailed experimental study of the suitability of such units for the fast simulation of neural networks. It is shown that a speedup in the range from 2.0 to 8.6 compared to sequential implementations can be achieved. A performance counter analysis explains several of the observed effects in terms of processor architecture features.

Section snippets

Motivation

Today, fast SIMD execution units are available in all modern microprocessors. They are variously called multimedia, streaming SIMD, visual instruction or vector units. The most popular examples are Intel's MMX [1] and Motorola's AltiVec [2], but AMD, Sun, HP and MIPS also offer similar enhancements to their processor series. All units operate according to the SIMD operation principle: each arithmetic operation is applied in parallel to several 8-, 16- or 32-bit subwords packed in 64- or 128-bit registers.
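
To make this operation principle concrete, here is a minimal sketch using SSE2 compiler intrinsics (a direct descendant of the MMX/SSE units discussed in this paper, chosen because it compiles on current toolchains): a single instruction adds eight 16-bit subwords packed in 128-bit registers.

```c
/* Minimal sketch of the SIMD operation principle, with SSE2
 * intrinsics as a stand-in for the units studied in the paper:
 * one instruction performs eight 16-bit additions in parallel. */
#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    short c[8];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi16(va, vb);   /* one op, eight additions */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);              /* prints 11 22 33 ... 88 */
    printf("\n");
    return 0;
}
```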

The RBF network and its training algorithm

The RBF network represents a typical artificial neural network model suitable for many approximation or classification tasks [9]. It is selected for this study because it is composed of two neuron layers with different functionalities, so that its simulation involves several distinct basic operations. The RBF neurons in the first layer (also called hidden layer, see Fig. 1) are fully connected to all input nodes. Here neuron j computes the squared Euclidean distance xj between an input vector u and its center vector cj.
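
Since Eqs. (1)-(7) appear only in the full text, the scalar sketch below reconstructs the described first-layer computation under standard RBF assumptions: neuron j computes the squared Euclidean distance xj between the input vector u and its center vector cj, and the customary Gaussian activation with width sj is assumed for the output. All identifiers (rbf_layer, N_IN, N_RBF) are illustrative.

```c
/* Scalar sketch of the first (RBF) layer computation described
 * above. The Gaussian activation is the standard RBF choice and
 * is assumed here; the paper's exact Eqs. (1)-(7) are in the
 * full text. Array sizes and names are illustrative. */
#include <math.h>

#define N_IN  16   /* input nodes */
#define N_RBF 32   /* RBF neurons */

void rbf_layer(const float u[N_IN],
               const float c[N_RBF][N_IN],  /* centers c_ij */
               const float s[N_RBF],        /* widths  s_j  */
               float y[N_RBF])              /* activations  */
{
    for (int j = 0; j < N_RBF; j++) {
        float xj = 0.0f;                    /* squared Euclidean distance */
        for (int i = 0; i < N_IN; i++) {
            float d = u[i] - c[j][i];
            xj += d * d;
        }
        y[j] = expf(-xj / (s[j] * s[j]));   /* assumed Gaussian output */
    }
}
```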

Instruction sets of SIMD units

Five SIMD units of popular microprocessors have been selected for this experimental study. Table 1 shows the main differences: Intel's MMX [1] (also available in current AMD processors) and Sun's VIS [10] allow only operations on integer data, whereas Intel's SSE [11] and AMD's 3DNow! [12] support only the 32-bit IEEE single precision floating point data format. SSE and 3DNow! provide a few additional integer instructions to enhance the MMX capabilities. Motorola's AltiVec [2] is the only SIMD unit in this study that supports both integer and floating point data formats.
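
The practical consequence of these data formats can be sketched as follows, again with SSE/SSE2 intrinsics standing in for the five units: in a 128-bit register, a 16-bit integer format yields a parallelism degree of p = 8, while the 32-bit floating point format yields only p = 4 (on the 64-bit MMX/VIS/3DNow! registers both degrees halve).

```c
/* Sketch contrasting the two data formats of Table 1: an integer
 * unit (here SSE2, 8 x 16-bit subwords) versus a floating point
 * unit (here SSE, 4 x 32-bit floats). Same register width, half
 * the parallelism degree p for the wider data format. */
#include <emmintrin.h>   /* SSE2 integer operations */
#include <xmmintrin.h>   /* SSE float operations    */

void add_int16(const short *a, const short *b, short *c) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_add_epi16(va, vb)); /* p = 8 */
}

void add_fp32(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));                  /* p = 4 */
}
```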

Precision analysis

All neural operations required for the RBF network simulation are based on vector/matrix operations (compare Eqs. (1)–(7) in Fig. 1) that read their elements from contiguous memory locations. Thus, the first prerequisite for neural network simulation on SIMD units is fulfilled. But what about the second prerequisite, the tolerance of low precision, which matters especially when integer SIMD units are used?

For training a MLP (the most popular artificial neural network model) by the error back-propagation algorithm, earlier finite precision analyses (e.g. by Holt et al. and Asanovic et al.) suggest that a 16-bit fixed point representation is usually sufficient.
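
A minimal sketch of such a 16-bit fixed point representation follows; the Q3.12 scaling (12 fraction bits) and the helper names are assumptions for illustration, not the scaling used in the paper.

```c
/* Sketch of 16-bit fixed point arithmetic of the kind an integer
 * SIMD unit performs. The Q3.12 format (FRAC_BITS = 12) is an
 * assumed scaling chosen only for illustration. */
#include <stdint.h>

#define FRAC_BITS 12                        /* assumed Q3.12 format */

static inline int16_t fx_mul(int16_t a, int16_t b) {
    /* widen to 32 bits, multiply, rescale, then saturate to 16 bits */
    int32_t p = ((int32_t)a * (int32_t)b) >> FRAC_BITS;
    if (p > INT16_MAX) p = INT16_MAX;
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}

static inline int16_t fx_from_float(float x) {
    return (int16_t)(x * (1 << FRAC_BITS)); /* e.g. 1.0f -> 4096 */
}

static inline float fx_to_float(int16_t x) {
    return (float)x / (1 << FRAC_BITS);
}
```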

Methodology

All neural operations according to Eqs. (1)–(7) were implemented on the five selected SIMD units, based either on 16-bit fixed point or on 32-bit single precision floating point representations of all variables. The parallelism degree p varies from 2 to 8 depending on the available register width. The SIMD extensions of Sun and Motorola were programmed in the C language enhanced by compiler intrinsics and library routines. However, for the SIMD units of Intel and AMD processors, the neural operations were coded directly in assembly language.
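
To give a flavor of the intrinsics-based style, the sketch below computes a 16-bit fixed point dot product, the core operation behind the weighted sums of Eqs. (1)–(7). It uses SSE2 intrinsics rather than the original units' instruction sets (the Intel/AMD versions in the study were assembly), and the function name dot16 is illustrative; n is assumed to be a multiple of 8.

```c
/* 16-bit fixed point dot product with SSE2 compiler intrinsics.
 * _mm_madd_epi16 (pmaddwd) forms 8 x 16-bit products and adds
 * adjacent pairs into 4 x 32-bit partial sums, a natural
 * primitive for neural weighted sums. */
#include <emmintrin.h>
#include <stdint.h>

int32_t dot16(const int16_t *a, const int16_t *b, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {        /* n assumed multiple of 8 */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    /* horizontal reduction of the four 32-bit partial sums */
    int32_t tmp[4];
    _mm_storeu_si128((__m128i *)tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```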

Performance results

Fig. 6 shows the simulation time for both a recognition step (calculating the RBF network outputs z after the presentation of a new input vector u) and a training step (calculating the RBF network outputs z and adapting all parameters cij, wjk and sj). The RBF simulation times on all SIMD units can thus be compared with one another and with the reference implementations on the processor cores.

Performance counter analysis for MMX and SSE

To analyze some of the surprising effects described in the previous section, the performance counters available in most modern microprocessors can be programmed and read. To keep the effort within reasonable limits, only one processor, the Intel Pentium III, was selected for this purpose because it contains both an integer and a floating point SIMD unit. Two internal counters can be programmed here to count simultaneously two out of 77 different events (for details see Appendix A of Ref.
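
On the Pentium III these counters were configured through model-specific registers; the sketch below shows the same idea using the modern Linux perf_event_open interface instead. A Linux host is assumed, and the two chosen events (retired instructions and cache misses) are illustrative, not the paper's 77-event set; error handling is omitted.

```c
/* Sketch of counting two hardware events around a workload via
 * Linux perf_event_open, as a modern stand-in for programming
 * the Pentium III counters directly. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;          /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;    /* count user-mode events only */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd_ins  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_miss = open_counter(PERF_COUNT_HW_CACHE_MISSES);

    ioctl(fd_ins,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                 /* stand-in workload */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(fd_ins,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t ins = 0, miss = 0;
    read(fd_ins,  &ins,  sizeof(ins));
    read(fd_miss, &miss, sizeof(miss));
    printf("instructions: %llu, cache misses: %llu\n",
           (unsigned long long)ins, (unsigned long long)miss);
    return 0;
}
```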

Summary and conclusions

This case study demonstrates that the SIMD units of modern microprocessors are fairly well suited to accelerating neural network simulations. A high speedup in the range from 1.9 to 6.6 can be achieved for the simulation of a complete RBF training step. Furthermore, applications that require only the fast recognition of new patterns by an already trained RBF network can be accelerated by a factor of up to 8.6. If the main focus is on high performance, the integer SIMD units are the better choice, since the 16-bit fixed point format allows a higher parallelism degree.

Alfred Strey received a PhD in Computer Science in 1991 from the University of Erlangen. Currently, he is a lecturer at the Department of Neural Information Processing at the University of Ulm (Germany). Here he is working on the parallel implementation of artificial neural networks. His research interests include parallel computer architectures, performance evaluation and neural networks.

References (16)

  • A. Peleg et al., Intel MMX for multimedia PCs, Communications of the ACM (1997)
  • K. Diefendorff et al., AltiVec extension to PowerPC accelerates media processing, IEEE Micro (2000)
  • R. Bhargava, L.K. John, B.L. Evans, R. Radhakrishnan, Evaluating MMX technology using DSP and multimedia applications, ...
  • H. Nguyen, L. John, Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology, in: ...
  • O. Hammami, Neural network classifiers execution on superscalar microprocessors, in: Proceedings of the Second ...
  • J. Holt et al., Finite precision error analysis of neural networks hardware implementations, IEEE Transactions on Computers (1993)
  • K. Asanovic et al., Using simulations of reduced precision arithmetic to design a neuro-microprocessor, Journal of VLSI Signal Processing (1993)
  • L. Gaborit, B. Granado, P. Garda, Evaluating micro-processors' multimedia extensions for the real time simulation of ...
There are more references available in the full text version of this article.
