Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors

Journal of Signal Processing Systems

Abstract

Multimedia SIMD extensions are widely used today to speed up media processing. When vectorizing for SIMD architectures, one of the major issues is memory alignment. Prior studies have focused either on vectorizing loops in which all memory references are properly aligned, or on introducing extra operations to handle misaligned memory references. Multi-core SIMD architectures, in addition, require coarse-grain parallelism. It is therefore important to study how to parallelize and vectorize loop nests with awareness of data misalignment. This paper presents a loop transformation scheme that maximizes the parallelism of the outermost loops while reducing the misaligned memory references in the innermost loops. The basic idea of our technique is to align each loop level in the nest, subject to the constraints of the dependence relations. To reduce data misalignment, we establish a mathematical model based on the concept of an offset-collection and propose an effective heuristic algorithm. For coarser-grain parallelism, we propose rules for analyzing the outermost loop. When transformations are applied, the inner loops are involved in order to maximize parallelism. To avoid introducing more data misalignment, an innermost loop involved in this way is handled separately from the other loop levels. Experimental results show that 7% to 37% (18.4% on average) of misaligned memory references can be removed. Simulations on CELL show that a 1.1x speedup can be reached by reducing misaligned data, and a 6.14x speedup can be achieved by enhancing parallelism across multiple cores.
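To make the notion of a misaligned memory reference concrete, the following is a minimal toy sketch, not the paper's offset-collection model or its heuristic algorithm. It assumes 16-byte vector registers, 4-byte elements, and array bases that start on a vector boundary, so a reference `x[i + c]` is treated as aligned when the starting index is a multiple of the lane count. The names `misaligned_count` and `best_shift` are hypothetical, and dependence constraints are ignored.

```python
VECTOR_BYTES = 16                     # assumed SIMD register width
ELEM_BYTES = 4                        # assumed element size (e.g. 32-bit float)
LANES = VECTOR_BYTES // ELEM_BYTES    # elements per vector: 4


def misaligned_count(offsets, shift=0):
    """Count references x[i + c] whose first vector access is misaligned.

    `offsets` holds the constant index offsets c; `shift` is the first
    iteration of the loop. Array bases are assumed to be vector-aligned.
    """
    return sum(1 for c in offsets if (c + shift) % LANES != 0)


def best_shift(offsets):
    """Pick the loop start in [0, LANES) that leaves the fewest misaligned
    references; ties go to the smaller shift."""
    return min(range(LANES), key=lambda s: (misaligned_count(offsets, s), s))


# Offsets of the three references in the statement: a[i+1] = b[i+1] + c[i+2]
offsets = [1, 1, 2]
shift = best_shift(offsets)
before = misaligned_count(offsets)          # loop starts at i = 0
after = misaligned_count(offsets, shift)    # loop starts at i = shift
print(shift, before, after)                 # prints "3 3 1"
```

In this toy setting, starting the loop at `i = 3` aligns two of the three references. A real transformation would also have to respect the dependence relations across all loop levels, which this sketch does not model.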

Acknowledgments

The work described in this paper is partially supported by grants from the Innovation and Technology Support Programme of the Innovation and Technology Fund of the Hong Kong Special Administrative Region, China (ITS/082/10), the Germany/Hong Kong Joint Research Scheme sponsored by the Research Grants Council of Hong Kong and the German Academic Exchange Service (Reference No. G_HK021/12), the National Natural Science Foundation of China (Projects 61070002 and 61272103), the National 863 Program (No. 2013AA013202 and No. 2011AA01A202), Changjiang Scholars and Innovative Research Team in University (IRT1158, PCSIRT), and the Hong Kong Polytechnic University (4-ZZD7, G-YK24 and G-YM10).

Author information

Corresponding author

Correspondence to Zili Shao.

About this article

Cite this article

Wang, Y., Pan, L., Shao, Z. et al. Loop Transforming for Reducing Data Alignment on Multi-Core SIMD Processors. J Sign Process Syst 74, 137–150 (2014). https://doi.org/10.1007/s11265-013-0754-2

  • DOI: https://doi.org/10.1007/s11265-013-0754-2
