Efficient array slicing on the Intel Xeon Phi coprocessor

Authors:

Benjamin Andreassen Bjørnseth,

Jan Christian Meyer,

Lasse Natvig

ARRAY 2017: Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming

Pages 40 - 47

https://doi.org/10.1145/3091966.3091975

Published: 18 June 2017

Abstract

Array slicing is an operation that selects a subset of elements from a source array and copies them into a destination array. In this article, we present an algorithm that generates code for a subset of Fortran slicing expressions, targeting the first-generation Intel Xeon Phi coprocessor. The resulting code outperforms the code produced by Intel's Fortran compiler by 2.40x on average for a set of slicing expressions, and by 2.23x and 1.13x on average for two slicing expressions relevant to border exchange code.
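To make the slicing operation concrete, the following minimal sketch copies a strided slice from a source list into a new destination list, mimicking a one-dimensional Fortran subscript triplet such as `src(lower:upper:stride)`. It is only an illustration of what the operation computes, not the code-generation algorithm presented in the article; the function name `slice_copy` and its parameters are hypothetical, and only positive strides are handled.

```python
def slice_copy(src, lower, upper, stride=1):
    """Return a copy of src(lower:upper:stride), Fortran-style.

    Bounds are 1-based and inclusive, as in a Fortran subscript triplet.
    Hypothetical helper for illustration; only positive strides are handled.
    """
    if stride <= 0:
        raise ValueError("this sketch only handles positive strides")
    if lower < 1 or upper > len(src):
        raise IndexError("slice bounds fall outside the source array")
    dst = []
    i = lower
    while i <= upper:
        dst.append(src[i - 1])  # convert the 1-based index to Python's 0-based one
        i += stride
    return dst


source = [10, 20, 30, 40, 50, 60, 70, 80]
destination = slice_copy(source, 2, 7, 2)  # elements 2, 4 and 6 of the source
print(destination)
```

An element-by-element loop like this is the naive baseline; the article concerns generating considerably faster code for the same kind of copy on the Xeon Phi.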

Index Terms

Efficient array slicing on the Intel Xeon Phi coprocessor
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation
    2. General programming languages
      1. Language types
        Data flow languages
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Vector / streaming algorithms

Published In

ARRAY 2017: Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming

June 2017

62 pages

ISBN: 9781450350693

DOI: 10.1145/3091966

General Chairs:
Martin Elsman (University of Copenhagen, Denmark),
Clemens Grelck (University of Amsterdam, Netherlands),
Andreas Kloeckner,
David Padua (University of Illinois at Urbana-Champaign, USA),
Edgar Solomonik (University of Illinois at Urbana-Champaign, USA)

Copyright © 2017 ACM.

© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '17

Sponsor:

SIGPLAN

PLDI '17: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 18, 2017

Barcelona, Spain

Acceptance Rates

Overall Acceptance Rate 17 of 25 submissions, 68%
