Abstract
Multimedia SIMD extensions are commonly employed today to speed up media processing. When vectorizing for SIMD architectures, one of the major issues is handling memory alignment. Prior studies focused either on vectorizing loops in which all memory references are properly aligned, or on introducing extra operations to deal with misaligned memory references. Multi-core SIMD architectures, on the other hand, also require coarse-grain parallelism. It is therefore important to study how to parallelize and vectorize loop nests with awareness of data misalignment. This paper presents a loop transformation scheme that maximizes the parallelism of the outermost loops while reducing the misaligned memory references in the innermost loops. The basic idea of our technique is to align each level of loops in the nest, subject to the constraints imposed by dependence relations. To reduce data misalignment, we establish a mathematical model based on the concept of an offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose rules for analyzing the outermost loop. When transformations are applied, the inner loops are also involved in order to maximize parallelism. To avoid introducing additional data misalignment, the innermost loop involved is handled from other levels of the loop nest. Experimental results show that 7% to 37% (18.4% on average) of misaligned memory references can be eliminated. Simulations on CELL show that a 1.1x speedup can be reached by reducing misaligned data, while a 6.14x speedup can be achieved by enhancing parallelism for multi-core.
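To make the notion of a misaligned memory reference concrete, the following minimal sketch counts the misaligned references in a single loop statement and shows how shifting the statement's iteration space changes that count. It is an illustration only and not the offset-collection model or the heuristic algorithm of the paper. The 16-byte vector width, the 4-byte element size, the vector-aligned array bases, the example statement, and the helper names `alignment_offset` and `count_misaligned` are all assumptions introduced here.

```python
VECTOR_BYTES = 16  # assumed SIMD register width (128-bit vectors)
ELEM_BYTES = 4     # assumed element size (32-bit values)


def alignment_offset(index_offset, base_addr=0):
    """Byte offset, within a vector-sized block, of the reference
    array[i + index_offset] when i is a multiple of the vector length.
    base_addr is the (assumed vector-aligned) start address of the array."""
    return (base_addr + index_offset * ELEM_BYTES) % VECTOR_BYTES


def count_misaligned(index_offsets, shift=0):
    """Count references whose alignment offset is non-zero after every
    index offset of the statement is changed by `shift` iterations."""
    return sum(1 for k in index_offsets if alignment_offset(k + shift) != 0)


# Hypothetical statement: a[i+1] = b[i+1] + c[i+2]
refs = [1, 1, 2]  # index offsets of a, b, and c

print(count_misaligned(refs))            # prints 3: all references misaligned
print(count_misaligned(refs, shift=-1))  # prints 1: only c stays misaligned
```

The shift of -1 corresponds to rewriting the statement with i' = i + 1, which gives a[i'] = b[i'] + c[i'+1]. Such a shift is only legal when the dependence relations of the loop nest permit it, which is the constraint the abstract refers to.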













Acknowledgments
The work described in this paper is partially supported by grants from the Innovation and Technology Support Programme of the Innovation and Technology Fund of the Hong Kong Special Administrative Region, China (ITS/082/10), the Germany/Hong Kong Joint Research Scheme sponsored by the Research Grants Council of Hong Kong and the German Academic Exchange Service (Reference No. G_HK021/12), the National Natural Science Foundation of China (Projects 61070002 and 61272103), the National 863 Program (No. 2013AA013202 and No. 2011AA01A202), Changjiang Scholars and Innovative Research Team in University (IRT1158, PCSIRT), and the Hong Kong Polytechnic University (4-ZZD7, G-YK24 and G-YM10).
Cite this article
Wang, Y., Pan, L., Shao, Z. et al. Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors. J Sign Process Syst 74, 137–150 (2014). https://doi.org/10.1007/s11265-013-0754-2
DOI: https://doi.org/10.1007/s11265-013-0754-2