Avoiding Conversion and Rearrangement Overhead in SIMD Architectures

International Journal of Parallel Programming

Single-Instruction Multiple-Data (SIMD) instructions provide an inexpensive way to exploit the data-level parallelism in multimedia applications. However, the performance improvement obtained with SIMD instructions is often limited because many overhead instructions are needed to bring data into a form amenable to SIMD processing. In this paper, we employ two techniques to overcome this limitation. The first technique, extended subwords, uses four extra bits for every byte in a media register. This allows many SIMD operations to be performed without overflow and avoids packing/unpacking conversion overhead. The second technique, the Matrix Register File (MRF), allows flexible row-wise as well as column-wise access to the register file. It is useful for many two-dimensional multimedia algorithms such as the (Inverse) Discrete Cosine Transform, the 2 × 2 Haar transform, and pixel padding. In addition, we propose a few new media instructions. Experimental results obtained by extending the SimpleScalar toolset show that these techniques improve performance by up to a factor of 4.5 compared to a conventional SIMD instruction set extension.
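
The two ideas in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design evaluated in the paper: the 12-bit lane width follows from "four extra bits for every byte", but the signed wrap-around interpretation, the list-based register model, and every name used here (`wrap_signed`, `simd_add`, `MatrixRegisterFile`) are hypothetical and chosen for illustration.

```python
# Toy software model; all names and the signed interpretation are assumptions.

SUBWORD_BITS = 8   # conventional byte-wide subword
EXT_BITS = 12      # a byte plus the four extra bits mentioned in the abstract


def wrap_signed(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    # If the top bit is set, the pattern represents a negative number.
    return value - (1 << bits) if value >> (bits - 1) else value


def simd_add(a, b, bits):
    """Lane-wise add of two registers, each modelled as a list of subwords."""
    return [wrap_signed(x + y, bits) for x, y in zip(a, b)]


class MatrixRegisterFile:
    """Toy square register file that can be read row-wise or column-wise."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def read_row(self, i):
        return list(self.rows[i])

    def read_column(self, j):
        # Column access returns a transposed view without explicit
        # rearrangement (shuffle/unpack) instructions.
        return [row[j] for row in self.rows]


# 100 + 100 overflows a signed byte but fits in a 12-bit lane.
narrow = simd_add([100, 10], [100, 20], SUBWORD_BITS)
wide = simd_add([100, 10], [100, 20], EXT_BITS)

mrf = MatrixRegisterFile([[1, 2], [3, 4]])
first_column = mrf.read_column(0)
```

In this model the byte-wide add wraps around in the first lane, which is why conventional SIMD code typically unpacks bytes to 16-bit lanes before computing and packs them again afterwards. The 12-bit lane holds the intermediate result directly. The column read likewise stands in for the transpose step that two-dimensional kernels would otherwise perform with rearrangement instructions.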

Author information

Corresponding author

Correspondence to Asadollah Shahbahrami.

About this article

Cite this article

Shahbahrami, A., Juurlink, B., Borodin, D. et al. Avoiding Conversion and Rearrangement Overhead in SIMD Architectures. Int J Parallel Prog 34, 237–260 (2006). https://doi.org/10.1007/s10766-006-0015-0

  • DOI: https://doi.org/10.1007/s10766-006-0015-0
