Abstract
The single-instruction, multiple-data (SIMD) architecture executes arithmetic-intensive programs very efficiently, but it frequently suffers from data-alignment problems. These problems not only add execution-time overhead but also hinder automatic vectorization by the SIMD compiler. In this paper, we compare three on-chip memory systems for the SIMD architecture (single-bank, multi-bank, and multi-port) as ways to resolve the data-alignment problems. The single-bank memory is the simplest, but supports only aligned accesses. The multi-bank memory is slightly more complex, but enables unaligned accesses and stride accesses, subject to a bank-conflict limitation. The multi-port memory supports both unaligned and stride accesses without any restriction, but requires considerably more expensive hardware. We also developed a vectorizing compiler that performs dynamic memory allocation and SIMD code generation. The performance of the three memory systems with our SIMD compiler is evaluated using several digital signal processing kernels and the MPEG2 encoder. The experimental results show that, in a multimedia system with a 2-issue host processor and an 8-way SIMD coprocessor, the multi-bank memory speeds up MPEG2 encoding by a factor of 5.8, whereas the single-bank memory achieves a speed-up of only 2.9. The multi-port memory shows the best performance, but its improvement over the multi-bank memory is impractical once the hardware cost is considered.
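The access restrictions of the three memory systems can be illustrated with a small model. The sketch below is not taken from the paper: it assumes an 8-lane vector access, eight banks, element-granular addresses, and low-order interleaving (bank = address mod number of banks), and the function names are hypothetical. Under those assumptions it checks whether a vector access with a given base address and stride could be served in a single access by each memory type.

```python
def lane_banks(base, stride, lanes=8, num_banks=8):
    """Bank hit by each SIMD lane, ASSUMING low-order interleaving
    (bank = element address mod num_banks); the paper's mapping may differ."""
    return [(base + i * stride) % num_banks for i in range(lanes)]


def single_bank_ok(base, stride, lanes=8):
    # Single-bank memory: only aligned, unit-stride vector accesses.
    return stride == 1 and base % lanes == 0


def multi_bank_ok(base, stride, lanes=8, num_banks=8):
    # Multi-bank memory: unaligned and stride accesses are allowed
    # as long as no two lanes fall into the same bank.
    banks = lane_banks(base, stride, lanes, num_banks)
    return len(set(banks)) == lanes


def multi_port_ok(base, stride, lanes=8):
    # Multi-port memory: no alignment or stride restriction.
    return True


if __name__ == "__main__":
    for base, stride in [(16, 1), (3, 1), (0, 2), (5, 3)]:
        print(base, stride,
              single_bank_ok(base, stride),
              multi_bank_ok(base, stride),
              multi_port_ok(base, stride))
```

In this model an unaligned unit-stride access such as base 3 is rejected by the single-bank memory but accepted by the multi-bank memory, while a stride of 2 with eight banks maps two lanes to each of four banks and therefore conflicts.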
Cite this article
Chang, H., Cho, J. & Sung, W. Compiler-Based Performance Evaluation of an SIMD Processor with a Multi-Bank Memory Unit. J Sign Process Syst Sign Image Video Technol 56, 249–260 (2009). https://doi.org/10.1007/s11265-008-0229-z
DOI: https://doi.org/10.1007/s11265-008-0229-z