Compiler-Based Performance Evaluation of an SIMD Processor with a Multi-Bank Memory Unit

Journal of Signal Processing Systems

Abstract

The single instruction multiple data (SIMD) architecture is very efficient for executing arithmetic-intensive programs, but it frequently suffers from data-alignment problems. These problems not only incur extra time overhead but also hinder automatic vectorization by the SIMD compiler. In this paper, we compare three on-chip memory systems for the SIMD architecture, namely single-bank, multi-bank, and multi-port, as means of resolving the data-alignment problems. The single-bank memory is the simplest, but supports only aligned accesses. The multi-bank memory is slightly more complex, but enables unaligned accesses and stride accesses, subject to a bank-conflict limitation. The multi-port memory supports both unaligned and stride accesses without any restriction, but requires considerably more expensive hardware. We also developed a vectorizing compiler that performs dynamic memory allocation and SIMD code generation. The performance of the three memory systems with our SIMD compiler is evaluated using several digital signal processing kernels and the MPEG2 encoder. The experimental results show that, in a multimedia system with a 2-issue host processor and an 8-way SIMD coprocessor, the multi-bank memory carries out MPEG2 encoding 5.8 times faster, whereas the single-bank memory achieves only a 2.9 times speed-up. The multi-port memory shows the best performance, but its improvement over the multi-bank memory is impractical once the hardware cost is considered.
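
The access restrictions of the three memory systems can be illustrated with a small model. The sketch below is not taken from the paper: it assumes an 8-lane vector, one bank per lane, and low-order interleaving (bank index = element address modulo the number of banks), and all function names are hypothetical. Under those assumptions a strided access is conflict-free in the multi-bank memory exactly when the stride shares no common factor with the bank count.

```python
from math import gcd

LANES = 8  # assumed: one bank per lane of the 8-way SIMD coprocessor


def banks_touched(base, stride, lanes=LANES, banks=LANES):
    """Bank hit by each lane, assuming low-order interleaving (hypothetical mapping)."""
    return [(base + i * stride) % banks for i in range(lanes)]


def single_bank_ok(base, stride, lanes=LANES):
    """A single wide bank serves only aligned, unit-stride vectors."""
    return stride == 1 and base % lanes == 0


def multi_bank_ok(base, stride, lanes=LANES, banks=LANES):
    """Served in one access only if every lane hits a different bank."""
    return len(set(banks_touched(base, stride, lanes, banks))) == lanes


def stride_is_conflict_free(stride, banks=LANES):
    """Closed form of the same check: stride and bank count must be coprime."""
    return gcd(stride, banks) == 1


def multi_port_ok(base, stride, lanes=LANES):
    """A multi-port memory accepts any base address and any stride."""
    return True


if __name__ == "__main__":
    for base, stride in [(0, 1), (3, 1), (0, 2), (5, 3)]:
        print(base, stride,
              single_bank_ok(base, stride),
              multi_bank_ok(base, stride),
              multi_port_ok(base, stride))
```

In this model an unaligned unit-stride access such as base 3 is rejected by the single-bank memory but accepted by the multi-bank memory, while a stride of 2 maps pairs of lanes onto the same bank and therefore conflicts.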



Author information

Corresponding author

Correspondence to Hoseok Chang.

About this article

Cite this article

Chang, H., Cho, J. & Sung, W. Compiler-Based Performance Evaluation of an SIMD Processor with a Multi-Bank Memory Unit. J Sign Process Syst Sign Image Video Technol 56, 249–260 (2009). https://doi.org/10.1007/s11265-008-0229-z

  • DOI: https://doi.org/10.1007/s11265-008-0229-z
