ABSTRACT
Short vector (SIMD) instructions are useful in signal processing, multimedia, and scientific applications. They offer higher performance, lower energy consumption, and better resource utilization. However, compiler support for SIMD instructions is still poor, and the code often has to be written manually in assembly language or with compiler built-in functions. In addition, some applications could achieve higher parallelism if compilers inserted permutation instructions that reorder the data in registers. In this paper we describe how we create SIMD instructions from regular code and determine the ordering of individual operations within the SIMD instructions so as to minimize the number of permutation instructions. Individual memory operations are grouped into SIMD operations based on their effective addresses. The SIMD data flow graph is then constructed by following data dependences from the SIMD memory operations, and the orderings of operations are propagated from the SIMD memory operations into the graph. We also describe our approach to computing a decomposition of a given permutation into the permutation instructions of the target architecture. Experiments with our prototype compiler show that this approach scales well with the number of operations in SIMD instructions (SIMD width) and can be used to compile a number of important kernels, achieving up to 35% speedup.
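As a rough illustration of two steps named in the abstract, the Python sketch below groups scalar memory operations by effective address and then decomposes a required lane permutation into a sequence of target permutation instructions. It is not the paper's algorithm: the `(name, base, offset)` representation, the function names, the two primitive permutations, and the breadth-first search used for the decomposition are all assumptions made for this example, and alignment constraints are ignored.

```python
from collections import defaultdict, deque


def group_by_address(mem_ops, width):
    """Pack scalar memory ops with consecutive offsets into SIMD groups.

    mem_ops: list of (name, base, offset) tuples (hypothetical format),
    with offsets counted in elements. Alignment is ignored here.
    """
    by_base = defaultdict(list)
    for name, base, offset in mem_ops:
        by_base[base].append((offset, name))
    groups = []
    for ops in by_base.values():
        ops.sort()
        i = 0
        while i + width <= len(ops):
            window = ops[i:i + width]
            # Accept the window only if its offsets are consecutive.
            if all(window[k + 1][0] - window[k][0] == 1
                   for k in range(width - 1)):
                groups.append([name for _, name in window])
                i += width
            else:
                i += 1
    return groups


def required_permutation(produced, wanted):
    """perm[i] is the lane of `produced` that must end up in lane i."""
    return tuple(produced.index(x) for x in wanted)


def apply_perm(perm, lanes):
    """Reorder `lanes` so that output lane i holds lanes[perm[i]]."""
    return tuple(lanes[p] for p in perm)


def decompose(target, primitives):
    """Shortest list of primitive names that realizes `target`.

    Breadth-first search over lane arrangements; returns None if the
    primitives cannot produce the target permutation.
    """
    identity = tuple(range(len(target)))
    seen = {identity: []}
    queue = deque([identity])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return seen[cur]
        for name, prim in primitives.items():
            nxt = apply_perm(prim, cur)
            if nxt not in seen:
                seen[nxt] = seen[cur] + [name]
                queue.append(nxt)
    return None


# Hypothetical 4-lane target with two permutation instructions.
PRIMITIVES = {
    "swap_pairs": (1, 0, 3, 2),
    "rotate1": (1, 2, 3, 0),
}

if __name__ == "__main__":
    loads = [("ld_a%d" % i, "a", i) for i in range(4)] + [("ld_b0", "b", 0)]
    print(group_by_address(loads, 4))
    perm = required_permutation(["x0", "x1", "x2", "x3"],
                                ["x3", "x2", "x1", "x0"])
    print(perm, decompose(perm, PRIMITIVES))
```

The search space grows factorially with the SIMD width, so exhaustive search of this kind is only plausible for small widths; it is shown here purely to make the notion of "decomposing a permutation into target instructions" concrete.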
Generation of permutations for SIMD processors
Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems