Compiler-Based Performance Evaluation of an SIMD Processor with a Multi-Bank Memory Unit

Chang, Hoseok; Cho, Junho; Sung, Wonyong

doi:10.1007/s11265-008-0229-z

Published: 03 June 2008

Volume 56, pages 249–260 (2009)

Journal of Signal Processing Systems

Hoseok Chang¹,
Junho Cho¹ &
Wonyong Sung¹

Abstract

The single instruction multiple data (SIMD) architecture is very efficient for executing arithmetic-intensive programs, but it frequently suffers from data-alignment problems. These problems not only add execution-time overhead but also hinder automatic vectorization by the SIMD compiler. In this paper, we compare three on-chip memory systems for the SIMD architecture (single-bank, multi-bank, and multi-port) as ways to resolve the data-alignment problems. The single-bank memory is the simplest, but supports only aligned accesses. The multi-bank memory is slightly more complex, but enables unaligned accesses and stride accesses, subject to a bank-conflict limitation. The multi-port memory supports both unaligned and stride accesses without any restriction, but requires considerably more expensive hardware. We also developed a vectorizing compiler that performs dynamic memory allocation and SIMD code generation. The performance of the three memory systems with our SIMD compiler is evaluated using several digital signal processing kernels and the MPEG2 encoder. The experimental results show that, in a multimedia system with a 2-issue host processor and an 8-way SIMD coprocessor, the multi-bank memory carries out MPEG2 encoding 5.8 times faster, whereas the single-bank memory achieves only a 2.9 times speed-up. The multi-port memory shows the best performance, but its improvement over the multi-bank memory is impractical once the hardware cost is considered.
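
The access restrictions of the three memory systems can be illustrated with a small sketch. It is not the authors' hardware model: it assumes low-order interleaving (word address modulo the number of banks), one bank per SIMD lane, and an idealized multi-port memory. The names `lane_banks`, `single_bank_ok`, `multi_bank_ok`, and `multi_port_ok` are hypothetical and chosen only for this example.

```python
NUM_LANES = 8  # 8-way SIMD, as in the abstract
NUM_BANKS = 8  # assumption: one memory bank per SIMD lane


def lane_banks(start, stride):
    """Bank touched by each lane, assuming low-order interleaving."""
    return [(start + lane * stride) % NUM_BANKS for lane in range(NUM_LANES)]


def single_bank_ok(start, stride):
    """A single wide bank serves only aligned, unit-stride vectors."""
    return stride == 1 and start % NUM_LANES == 0


def multi_bank_ok(start, stride):
    """Served in one access only if every lane hits a different bank."""
    return len(set(lane_banks(start, stride))) == NUM_LANES


def multi_port_ok(start, stride):
    """Idealized multi-port memory: every access pattern is served."""
    return True


if __name__ == "__main__":
    # (start address, stride) pairs: aligned, unaligned, even stride, odd stride
    for start, stride in [(0, 1), (3, 1), (0, 2), (5, 3)]:
        print(
            f"start={start} stride={stride} "
            f"single-bank={single_bank_ok(start, stride)} "
            f"multi-bank={multi_bank_ok(start, stride)} "
            f"multi-port={multi_port_ok(start, stride)}"
        )
```

Under these assumptions, an unaligned unit-stride access such as start 3 is rejected by the single-bank model but is conflict-free in the multi-bank model, while a stride of 2 maps two lanes to each of four banks and therefore conflicts.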

Author information

Authors and Affiliations

School of Electronic Engineering, Seoul National University, Seoul, Korea
Hoseok Chang, Junho Cho & Wonyong Sung

Corresponding author

Correspondence to Hoseok Chang.

About this article

Cite this article

Chang, H., Cho, J. & Sung, W. Compiler-Based Performance Evaluation of an SIMD Processor with a Multi-Bank Memory Unit. J Sign Process Syst Sign Image Video Technol 56, 249–260 (2009). https://doi.org/10.1007/s11265-008-0229-z

Received: 30 March 2008
Accepted: 23 April 2008
Published: 03 June 2008
Issue Date: September 2009
DOI: https://doi.org/10.1007/s11265-008-0229-z
