Efficient array slicing on the Intel Xeon Phi coprocessor

Published: 18 June 2017

Abstract

Array slicing is an operation that selects a subset of elements from a source array and copies them into a destination array. In this article, we present an algorithm for generating code for a subset of Fortran slicing expressions, targeting the first-generation Intel Xeon Phi coprocessor. The resulting code outperforms the code produced by Intel's Fortran compiler by 2.40× on average for a set of slicing expressions, and by 2.23× and 1.13× on average for two slicing expressions relevant to border exchange code.
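
To make the definition of slicing concrete, the following is a minimal sketch of the semantics of a one-dimensional Fortran-style slice `src(lower:upper:stride)`, written in Python for illustration only. The function name `slice_copy` and its parameters are hypothetical and do not come from the article; the sketch shows what a slice computes, not the code generation algorithm the article presents, and it assumes a positive stride.

```python
def slice_copy(src, lower, upper, stride=1):
    """Copy the elements src(lower:upper:stride) into a new list.

    Bounds follow the Fortran convention: 1-based and inclusive.
    Assumption: only positive strides are handled in this sketch.
    """
    if stride <= 0:
        raise ValueError("this sketch only handles positive strides")
    dst = []
    i = lower
    while i <= upper:
        dst.append(src[i - 1])  # shift the 1-based index to Python's 0-based
        i += stride
    return dst


if __name__ == "__main__":
    source = [10, 20, 30, 40, 50, 60]
    # Corresponds to the Fortran expression source(2:6:2)
    print(slice_copy(source, 2, 6, 2))
```

A scalar loop like this is only a reference for the result; it says nothing about how the selected elements should be loaded and stored efficiently on a vector architecture, which is the problem the article addresses.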

Published In

ARRAY 2017: Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming
June 2017
62 pages
ISBN: 9781450350693
DOI: 10.1145/3091966
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Code generation
  2. Compiler
  3. Fortran
  4. Slicing
  5. Xeon Phi

Qualifiers

  • Research-article

Conference

PLDI '17
Acceptance Rates

Overall Acceptance Rate 17 of 25 submissions, 68%
