AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

Authors:

Qing Yi

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Article No.: 25, Pages 1 - 12

https://doi.org/10.1145/2503210.2503219

Published: 17 November 2013

Abstract

Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present AUGEM, a template-based optimization framework that automatically generates fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY, and DOT, on a variety of multi-core CPUs without requiring any manual intervention from developers. In particular, drawing on domain-specific knowledge of the algorithms behind these DLA kernels, we use a collection of parameterized code templates to describe instruction sequences that commonly occur in their optimized low-level C code. Our framework then uses a specialized low-level C optimizer to identify instruction sequences that match the predefined code templates and translates them into highly efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach outperform the implementations in the Intel MKL and AMD ACML BLAS libraries on both Intel Sandy Bridge and AMD Piledriver processors.
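To make the idea of "commonly occurring instruction sequences" in low-level C more concrete, here is a minimal sketch of two of the kernels named above, AXPY and DOT, written in an unrolled, three-address style. This is an illustration only, not AUGEM's actual input or output: the function names `axpy_unrolled` and `dot_unrolled`, the unroll factor of 4, and the restriction that `n` be a multiple of 4 are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* AXPY: y[i] = y[i] + alpha * x[i].
 * Hypothetical sketch, not code produced by AUGEM.
 * Assumption: n is a multiple of 4, so no cleanup loop is needed. */
void axpy_unrolled(size_t n, double alpha, const double *x, double *y)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Four independent multiply steps on consecutive elements. */
        double t0 = x[i + 0] * alpha;
        double t1 = x[i + 1] * alpha;
        double t2 = x[i + 2] * alpha;
        double t3 = x[i + 3] * alpha;
        /* Four independent load, add, store steps. A pattern matcher
         * could, in principle, map each group of repeated scalar
         * statements to one packed SIMD instruction. */
        y[i + 0] = y[i + 0] + t0;
        y[i + 1] = y[i + 1] + t1;
        y[i + 2] = y[i + 2] + t2;
        y[i + 3] = y[i + 3] + t3;
    }
}

/* DOT: returns the sum of x[i] * y[i].
 * Hypothetical sketch with the same assumption on n. */
double dot_unrolled(size_t n, const double *x, const double *y)
{
    assert(n % 4 == 0);
    /* Four separate accumulators remove the dependence between
     * consecutive additions inside the loop body. */
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < n; i += 4) {
        s0 = s0 + x[i + 0] * y[i + 0];
        s1 = s1 + x[i + 1] * y[i + 1];
        s2 = s2 + x[i + 2] * y[i + 2];
        s3 = s3 + x[i + 3] * y[i + 3];
    }
    /* Final reduction of the partial sums. */
    return (s0 + s1) + (s2 + s3);
}
```

Each loop body repeats the same short scalar sequence on adjacent elements, which is the kind of regular structure a parameterized template could plausibly recognise. Note that splitting the DOT sum across four accumulators changes the order of floating-point additions, so results may differ in the last bits from a strictly sequential sum.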


Index Terms

AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices

Information & Contributors

Information

Published In

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

November 2013

1123 pages

ISBN: 9781450323789

DOI: 10.1145/2503210

General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois

Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC13

Sponsors: SIGHPC, SIGARCH, IEEE-CS

SC13: International Conference for High Performance Computing, Networking, Storage and Analysis

November 17 - 21, 2013

Denver, Colorado

Acceptance Rates

SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 144
Total Downloads: 536
Downloads (Last 12 months): 48
Downloads (Last 6 weeks): 4

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Garske S, Evans B, Artlett C, Wong K (2025). ERX: A Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line Scanning. IEEE Transactions on Geoscience and Remote Sensing, 63 (1-17). Online publication date: 2025.
https://doi.org/10.1109/TGRS.2025.3532225
León-Vega L, Tosato N, Cozzini S (2025). EfiMon: A Process Analyser for Granular Power Consumption Prediction. High Performance Computing (112-126). Online publication date: 14-Feb-2025.
https://doi.org/10.1007/978-3-031-80084-9_8
Ziebarth M, von Specht S (2024). REHEATFUNQ (REgional HEAT-Flow Uncertainty and aNomaly Quantification) 2.0.1: a model for regional aggregate heat flow distributions and anomaly quantification. Geoscientific Model Development, 17:7 (2783-2828). Online publication date: 15-Apr-2024.
https://doi.org/10.5194/gmd-17-2783-2024
Zimbrod P, Fleck M, Schilp J (2024). An Application-Driven Method for Assembling Numerical Schemes for the Solution of Complex Multiphysics Problems. Applied System Innovation, 7:3 (35). Online publication date: 24-Apr-2024.
https://doi.org/10.3390/asi7030035
Pandey S, Das S, Ganu H, Singh S, Balsamo S, Knottenbelt W, Abad C, Shang W (2024). Rethinking 'Complement' Recommendations at Scale with SIMD. Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (25-36). Online publication date: 7-May-2024.
https://dl.acm.org/doi/10.1145/3629526.3645041
Chen J, Gao C, Lu Y, Zhang Y, Shu J (2024). Ares-Flash: Efficient Parallel Integer Arithmetic Operations Using NAND Flash Memory. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO) (1489-1503). Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00109
Norouzi M, Morew N, Ilias Q, Rothenberger L, Jannesari A, Wolf F (2024). Fast data-dependence profiling through prior static analysis. Parallel Computing, 119 (103063). Online publication date: Feb-2024.
https://doi.org/10.1016/j.parco.2024.103063
Yamaguchi K, Matsukawa Y, Numazawa Y, Aoki H (2024). Modeling of water gas shift reaction using neural network trained on detailed kinetic mechanisms. Chemical Engineering Journal, 491 (151659). Online publication date: Jul-2024.
https://doi.org/10.1016/j.cej.2024.151659
Santamaria-Valenzuela I, Carratalá-Sáez R, Torres Y, Llanos D, Gonzalez-Escribano A (2024). Performance improvement of the triangular matrix product in commodity clusters. The Journal of Supercomputing, 80:11 (16630-16653). Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1007/s11227-024-06097-7
Błażejowski M (2024). Which C compiler and BLAS/LAPACK library should I use: gretl's numerical efficiency in different configurations. Computational Statistics, 39:7 (3497-3522). Online publication date: 1-Dec-2024.
https://dl.acm.org/doi/10.1007/s00180-024-01461-w
