On the suitability of SIMD extensions for neural network simulation

https://doi.org/10.1016/S0141-9331(03)00062-0

Abstract

Current microprocessors contain SIMD execution units (also called multimedia or vector extensions) that allow the data-parallel execution of operations on several subwords packed in 64-bit or 128-bit registers. They can accelerate not only typical multimedia applications but also many other algorithms based on vector and matrix operations. This paper presents the results of a detailed experimental study of the suitability of such units for the fast simulation of neural networks. It is shown that a speedup in the range from 2.0 to 8.6 compared to sequential implementations can be achieved. A performance counter analysis explains several of the observed effects in terms of processor architecture features.

Section snippets

Motivation

Today, fast SIMD execution units are available in all modern microprocessors. They are variously called multimedia, streaming SIMD, visual instruction or vector units. The most popular examples are Intel's MMX [1] and Motorola's AltiVec [2], but AMD, Sun, HP and MIPS also offer similar enhancements to their processor series. All units operate according to the SIMD operation principle: each arithmetic operation is applied in parallel to several 8-, 16- or 32-bit subwords packed in 64- or 128-bit registers.
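
To make this operation principle concrete, here is a minimal sketch using SSE2 compiler intrinsics (a direct descendant of the MMX/SSE units discussed in this paper, chosen because it compiles on current toolchains): a single instruction adds eight 16-bit subwords packed in 128-bit registers.

```c
/* Minimal sketch of the SIMD operation principle, with SSE2
 * intrinsics as a stand-in for the units studied in the paper:
 * one instruction performs eight 16-bit additions in parallel. */
#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    short c[8];

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi16(va, vb);   /* one op, eight additions */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);              /* prints 11 22 33 ... 88 */
    printf("\n");
    return 0;
}
```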

The RBF network and its training algorithm

The RBF network represents a typical artificial neural network model suitable for many approximation or classification tasks [9]. It is selected for this study because it is composed of two neuron layers with different functionalities, so that its simulation involves several distinct basic operations. The RBF neurons in the first layer (also called hidden layer, see Fig. 1) are fully connected to all input nodes. Here neuron j computes the squared Euclidean distance xj between an input vector u and its center vector cj.
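
Since Eqs. (1)-(7) appear only in the full text, the scalar sketch below reconstructs the described first-layer computation under standard RBF assumptions: neuron j computes the squared Euclidean distance xj between the input vector u and its center vector cj, and the customary Gaussian activation with width sj is assumed for the output. All identifiers (rbf_layer, N_IN, N_RBF) are illustrative.

```c
/* Scalar sketch of the first (RBF) layer computation described
 * above. The Gaussian activation is the standard RBF choice and
 * is assumed here; the paper's exact Eqs. (1)-(7) are in the
 * full text. Array sizes and names are illustrative. */
#include <math.h>

#define N_IN  16   /* input nodes */
#define N_RBF 32   /* RBF neurons */

void rbf_layer(const float u[N_IN],
               const float c[N_RBF][N_IN],  /* centers c_ij */
               const float s[N_RBF],        /* widths  s_j  */
               float y[N_RBF])              /* activations  */
{
    for (int j = 0; j < N_RBF; j++) {
        float xj = 0.0f;                    /* squared Euclidean distance */
        for (int i = 0; i < N_IN; i++) {
            float d = u[i] - c[j][i];
            xj += d * d;
        }
        y[j] = expf(-xj / (s[j] * s[j]));   /* assumed Gaussian output */
    }
}
```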

Instruction sets of SIMD units

Five SIMD units of popular microprocessors have been selected for this experimental study. Table 1 shows the main differences: Intel's MMX [1] (also available in current AMD processors) and Sun's VIS [10] allow only operations on integer data, whereas Intel's SSE [11] and AMD's 3DNow! [12] support only the 32-bit IEEE single precision floating point data format. SSE and 3DNow! provide a few additional integer instructions to enhance the MMX capabilities. Motorola's AltiVec [2] is the only SIMD unit in this study that supports both integer and floating point data formats.
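
The practical consequence of these data formats can be sketched as follows, again with SSE/SSE2 intrinsics standing in for the five units: in a 128-bit register, a 16-bit integer format yields a parallelism degree of p = 8, while the 32-bit floating point format yields only p = 4 (on the 64-bit MMX/VIS/3DNow! registers both degrees halve).

```c
/* Sketch contrasting the two data formats of Table 1: an integer
 * unit (here SSE2, 8 x 16-bit subwords) versus a floating point
 * unit (here SSE, 4 x 32-bit floats). Same register width, half
 * the parallelism degree p for the wider data format. */
#include <emmintrin.h>   /* SSE2 integer operations */
#include <xmmintrin.h>   /* SSE float operations    */

void add_int16(const short *a, const short *b, short *c) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, _mm_add_epi16(va, vb)); /* p = 8 */
}

void add_fp32(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));                  /* p = 4 */
}
```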

Precision analysis

All neural operations required for the RBF network simulation are based on vector/matrix operations (compare Eqs. (1)–(7) in Fig. 1) that read their elements from contiguous memory locations. Thus, the first prerequisite for neural network simulation on SIMD units is fulfilled. But what about the second prerequisite, the tolerance of low precision, which matters especially when integer SIMD units are used?

For training a MLP (the most popular artificial neural network model) by the error back-propagation algorithm, earlier finite precision analyses (e.g. by Holt et al. and Asanovic et al.) suggest that a 16-bit fixed point representation is usually sufficient.
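
A minimal sketch of such a 16-bit fixed point representation follows; the Q3.12 scaling (12 fraction bits) and the helper names are assumptions for illustration, not the scaling used in the paper.

```c
/* Sketch of 16-bit fixed point arithmetic of the kind an integer
 * SIMD unit performs. The Q3.12 format (FRAC_BITS = 12) is an
 * assumed scaling chosen only for illustration. */
#include <stdint.h>

#define FRAC_BITS 12                        /* assumed Q3.12 format */

static inline int16_t fx_mul(int16_t a, int16_t b) {
    /* widen to 32 bits, multiply, rescale, then saturate to 16 bits */
    int32_t p = ((int32_t)a * (int32_t)b) >> FRAC_BITS;
    if (p > INT16_MAX) p = INT16_MAX;
    if (p < INT16_MIN) p = INT16_MIN;
    return (int16_t)p;
}

static inline int16_t fx_from_float(float x) {
    return (int16_t)(x * (1 << FRAC_BITS)); /* e.g. 1.0f -> 4096 */
}

static inline float fx_to_float(int16_t x) {
    return (float)x / (1 << FRAC_BITS);
}
```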

Methodology

All neural operations according to Eqs. (1)–(7) were implemented on the five selected SIMD units, based either on 16-bit fixed point or on 32-bit single precision floating point representations of all variables. The parallelism degree p varies from 2 to 8 depending on the available register width. The SIMD extensions of Sun and Motorola were programmed in the C language enhanced by compiler intrinsics and library routines. However, for the SIMD units of Intel and AMD processors, the neural operations were coded directly in assembly language.
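
To give a flavor of the intrinsics-based style, the sketch below computes a 16-bit fixed point dot product, the core operation behind the weighted sums of Eqs. (1)–(7). It uses SSE2 intrinsics rather than the original units' instruction sets (the Intel/AMD versions in the study were assembly), and the function name dot16 is illustrative; n is assumed to be a multiple of 8.

```c
/* 16-bit fixed point dot product with SSE2 compiler intrinsics.
 * _mm_madd_epi16 (pmaddwd) forms 8 x 16-bit products and adds
 * adjacent pairs into 4 x 32-bit partial sums, a natural
 * primitive for neural weighted sums. */
#include <emmintrin.h>
#include <stdint.h>

int32_t dot16(const int16_t *a, const int16_t *b, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 8) {        /* n assumed multiple of 8 */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }
    /* horizontal reduction of the four 32-bit partial sums */
    int32_t tmp[4];
    _mm_storeu_si128((__m128i *)tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```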

Performance results

Fig. 6 shows the simulation time for both a recognition step (calculating the RBF network outputs z after the presentation of a new input vector u) and a training step (calculating the RBF network outputs z and adapting all parameters cij, wjk and sj). The RBF simulation times on all SIMD units can thus be compared with one another and with the reference implementations on the processor cores.

Performance counter analysis for MMX and SSE

To analyze some of the surprising effects described in the previous section, the performance counters available in most modern microprocessors can be programmed and read. To keep the effort within reasonable limits, only one processor, the Intel Pentium III, was selected for this purpose because it contains both an integer and a floating point SIMD unit. Two internal counters can be programmed here to count simultaneously two out of 77 different events (for details see Appendix A of Ref.
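
On the Pentium III these counters were configured through model-specific registers; the sketch below shows the same idea using the modern Linux perf_event_open interface instead. A Linux host is assumed, and the two chosen events (retired instructions and cache misses) are illustrative, not the paper's 77-event set; error handling is omitted.

```c
/* Sketch of counting two hardware events around a workload via
 * Linux perf_event_open, as a modern stand-in for programming
 * the Pentium III counters directly. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;          /* start stopped, enable explicitly */
    attr.exclude_kernel = 1;    /* count user-mode events only */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int fd_ins  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_miss = open_counter(PERF_COUNT_HW_CACHE_MISSES);

    ioctl(fd_ins,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                 /* stand-in workload */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(fd_ins,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_miss, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t ins = 0, miss = 0;
    read(fd_ins,  &ins,  sizeof(ins));
    read(fd_miss, &miss, sizeof(miss));
    printf("instructions: %llu, cache misses: %llu\n",
           (unsigned long long)ins, (unsigned long long)miss);
    return 0;
}
```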

Summary and conclusions

This case study demonstrates that the SIMD units of modern microprocessors are fairly well suited to accelerating neural network simulations. A high speedup in the range from 1.9 to 6.6 can be achieved for the simulation of a complete RBF training step. Furthermore, applications that require only the fast recognition of new patterns by an already trained RBF network can be accelerated by a factor of up to 8.6. If the main focus is on high performance, the integer SIMD units are the better choice, since the 16-bit fixed point format allows a higher parallelism degree.

Alfred Strey received a PhD in Computer Science in 1991 from the University of Erlangen. Currently, he is a lecturer at the Department of Neural Information Processing at the University of Ulm (Germany). Here he is working on the parallel implementation of artificial neural networks. His research interests include parallel computer architectures, performance evaluation and neural networks.

References (16)

  • A. Peleg et al., Intel MMX for multimedia PCs, Communications of the ACM (1997)
  • K. Diefendorff et al., AltiVec extension to PowerPC accelerates media processing, IEEE Micro (2000)
  • R. Bhargava, L.K. John, B.L. Evans, R. Radhakrishnan, Evaluating MMX technology using DSP and multimedia applications, ...
  • H. Nguyen, L. John, Exploiting SIMD parallelism in DSP and multimedia algorithms using the AltiVec technology, in: ...
  • O. Hammami, Neural network classifiers execution on superscalar microprocessors, in: Proceedings of the Second ...
  • J. Holt et al., Finite precision error analysis of neural networks hardware implementations, IEEE Transactions on Computers (1993)
  • K. Asanovic et al., Using simulations of reduced precision arithmetic to design a neuro-microprocessor, Journal of VLSI Signal Processing (1993)
  • L. Gaborit, B. Granado, P. Garda, Evaluating micro-processors' multimedia extensions for the real time simulation of ...
There are more references available in the full text version of this article.
