Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

Wang, Yi; Pan, Linfeng; Shao, Zili; Guan, Yong; Guo, Minyi

doi:10.1007/s11265-013-0754-2

Published: 11 May 2013

Journal of Signal Processing Systems, Volume 74, pages 137–150 (2014)

Abstract

Multimedia SIMD extensions are now commonly used to speed up media processing. A major issue when vectorizing for SIMD architectures is memory alignment. Prior studies focused either on vectorizing loops in which all memory references are properly aligned, or on introducing extra operations to handle misaligned memory references. Multi-core SIMD architectures, in addition, require coarse-grain parallelism. It is therefore important to study how to parallelize and vectorize loop nests with awareness of data misalignment. This paper presents a loop transformation scheme that maximizes the parallelism of the outermost loops while reducing the misaligned memory references in the innermost loops. The basic idea is to align each level of loops in the nest, subject to the constraints of the dependence relations. To reduce data misalignment, we establish a mathematical model based on the concept of an offset-collection and propose an effective heuristic algorithm. For coarse-grain parallelism, we propose rules for analyzing the outermost loop. When transformations are applied, the inner loops are involved in order to maximize parallelism. To avoid introducing more data misalignment, the involved innermost loop is handled differently from the other loop levels. Experimental results show that 7% to 37% (18.4% on average) of misaligned memory references can be eliminated. Simulations on CELL show that a 1.1x speedup can be reached by reducing misaligned data, while a 6.14x speedup can be achieved by enhancing parallelism across multiple cores.
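To make the notion of a misaligned memory reference concrete, the following is a minimal sketch and not the paper's model or heuristic. It assumes 16-byte vectors holding four 4-byte elements, arrays whose element 0 sits on a vector boundary, and a hypothetical loop body `a[i+1] = b[i+1] + c[i+2]`. The function names and the majority-vote rule for choosing a shift are illustrative assumptions only.

```python
from collections import Counter

VECTOR_LANES = 4  # assumption: 16-byte vectors of four 4-byte elements


def misaligned_refs(offsets, shift=0, lanes=VECTOR_LANES):
    """Count references that do not start on a vector boundary
    after the loop index is shifted by `shift` iterations."""
    return sum(1 for off in offsets if (off + shift) % lanes != 0)


def best_shift(offsets, lanes=VECTOR_LANES):
    """Choose the shift that aligns the most common offset residue.
    This is a simple majority vote, not the heuristic from the paper."""
    residues = Counter(off % lanes for off in offsets)
    common_residue, _ = residues.most_common(1)[0]
    return (-common_residue) % lanes


# Hypothetical loop body: a[i+1] = b[i+1] + c[i+2]
offsets = [1, 1, 2]
shift = best_shift(offsets)
print(misaligned_refs(offsets), misaligned_refs(offsets, shift))
```

In this toy case all three references are misaligned in the original loop. Peeling three iterations, so that the vectorized loop starts at `i = 3`, aligns the accesses to `a` and `b` and leaves only the access to `c` misaligned. A real transformation must also respect the loop's dependence relations, which this sketch ignores.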

Acknowledgments

The work described in this paper is partially supported by grants from the Innovation and Technology Support Programme of the Innovation and Technology Fund of the Hong Kong Special Administrative Region, China (ITS/082/10), the Germany/Hong Kong Joint Research Scheme sponsored by the Research Grants Council of Hong Kong and the German Academic Exchange Service (Reference No. G_HK021/12), the National Natural Science Foundation of China (Projects 61070002 and 61272103), the National 863 Program (No. 2013AA013202 and No. 2011AA01A202), Changjiang Scholars and Innovative Research Team in University (IRT1158, PCSIRT), and the Hong Kong Polytechnic University (4-ZZD7, G-YK24 and G-YM10).

Author information

Authors and Affiliations

Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Yi Wang & Zili Shao
Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
Linfeng Pan & Minyi Guo
College of Information Engineering, Capital Normal University, Beijing, China
Yong Guan

Corresponding author

Correspondence to Zili Shao.

About this article

Cite this article

Wang, Y., Pan, L., Shao, Z. et al. Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors. J Sign Process Syst 74, 137–150 (2014). https://doi.org/10.1007/s11265-013-0754-2

Received: 01 May 2012
Revised: 14 March 2013
Accepted: 18 April 2013
Published: 11 May 2013
Issue Date: February 2014
DOI: https://doi.org/10.1007/s11265-013-0754-2
