Avoiding Conversion and Rearrangement Overhead in SIMD Architectures

Shahbahrami, Asadollah; Juurlink, Ben; Borodin, Demid; Vassiliadis, Stamatis

doi:10.1007/s10766-006-0015-0

Published: 24 June 2006

Volume 34, pages 237–260 (2006)

International Journal of Parallel Programming

Asadollah Shahbahrami^1,2, Ben Juurlink^1, Demid Borodin^1 & Stamatis Vassiliadis^1

Single-Instruction Multiple-Data (SIMD) instructions provide an inexpensive way to exploit the Data-Level Parallelism in multimedia applications. However, the performance improvement obtained by employing SIMD instructions is often limited because many overhead instructions are required to bring data into a form amenable to SIMD processing. In this paper, we employ two techniques to overcome this limitation. The first technique, extended subwords, uses four extra bits for every byte in a media register. This allows many SIMD operations to be performed without overflow and avoids packing/unpacking conversion overhead. The second technique, the Matrix Register File (MRF), allows flexible row-wise as well as column-wise access to the register file. It is useful for many two-dimensional multimedia algorithms such as the (Inverse) Discrete Cosine Transform, the 2 × 2 Haar Transform, and pixel padding. In addition, we propose a few new media instructions. Experimental results obtained by extending the SimpleScalar toolset show that these techniques improve performance by up to a factor of 4.5 compared to a conventional SIMD instruction set extension.
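
The two techniques named in the abstract can be illustrated with a small software model. The sketch below is not the authors' implementation. It assumes eight byte-sized subwords per media register, treats each extended subword as a 12-bit signed value (8 data bits plus the 4 extra bits), and models the register file as an 8 × 8 grid. All names (`add_bytes_wrapping`, `add_extended`, `read_row`, `read_column`, `LANES`, `EXT_BITS`) are hypothetical and chosen only for illustration.

```python
# Illustrative model only: lanes are Python ints, not hardware registers.
# LANES, EXT_BITS, and every function name here are assumptions of this sketch.

LANES = 8       # assumed: eight byte-sized subwords per media register
EXT_BITS = 12   # 8 data bits + 4 extra bits per subword, as in the abstract


def add_bytes_wrapping(a, b):
    """Conventional packed-byte add: each 8-bit lane wraps on overflow."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement number."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def add_extended(a, b):
    """Packed add on assumed 12-bit signed lanes; no unpacking step needed."""
    out = [x + y for x, y in zip(a, b)]
    if not all(fits_signed(v, EXT_BITS) for v in out):
        raise OverflowError("result does not fit in an extended subword")
    return out


def read_row(regfile, i):
    """Row-wise access: return register i as a list of subwords."""
    return list(regfile[i])


def read_column(regfile, j):
    """Column-wise access: gather subword j from every register."""
    return [row[j] for row in regfile]


if __name__ == "__main__":
    a = [200, 250, 10, 0, 255, 128, 1, 77]
    b = [100, 250, 20, 0, 255, 128, 2, 23]
    print(add_bytes_wrapping(a, b))  # [44, 244, 30, 0, 254, 0, 3, 100]
    print(add_extended(a, b))        # [300, 500, 30, 0, 510, 256, 3, 100]

    # An 8 x 8 block stored one row per register.
    regfile = [[8 * i + j for j in range(LANES)] for i in range(LANES)]
    # Reading every column yields the transpose without shuffle instructions.
    transposed = [read_column(regfile, j) for j in range(LANES)]
    print(transposed[2])             # [2, 10, 18, 26, 34, 42, 50, 58]
```

In this model the byte-wide add silently loses the carry (200 + 100 becomes 44), which is why a conventional SIMD extension would first unpack the bytes to a wider format. With 12-bit signed lanes, sums of up to eight unsigned bytes (at most 8 × 255 = 2040) and differences of bytes remain representable. The column read stands in for the rearrangement (transpose) step that 2D transforms otherwise perform with extra instructions.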

Author information

Authors and Affiliations

Computer Engineering Laboratory, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Delft, The Netherlands
Asadollah Shahbahrami, Ben Juurlink, Demid Borodin & Stamatis Vassiliadis
Department of Electrical and Computer Engineering, Faculty of Engineering, Guilan University, Rasht, Iran
Asadollah Shahbahrami

Corresponding author

Correspondence to Asadollah Shahbahrami.

About this article

Cite this article

Shahbahrami, A., Juurlink, B., Borodin, D. et al. Avoiding Conversion and Rearrangement Overhead in SIMD Architectures. Int J Parallel Prog 34, 237–260 (2006). https://doi.org/10.1007/s10766-006-0015-0

Received: 10 May 2006
Accepted: 02 June 2006
Published: 24 June 2006
Issue Date: June 2006
DOI: https://doi.org/10.1007/s10766-006-0015-0
