DOI: 10.1145/2503210.2503219
research-article

AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

Published: 17 November 2013

Abstract

Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In this paper, we present a template-based optimization framework, AUGEM, which can automatically generate fully optimized assembly code for several dense linear algebra (DLA) kernels, such as GEMM, GEMV, AXPY and DOT, on a variety of multi-core CPUs without requiring any manual intervention from developers. In particular, based on domain-specific knowledge about the algorithms of the DLA kernels, we use a collection of parameterized code templates to formulate a number of commonly occurring instruction sequences within the optimized low-level C code of these DLA kernels. Then, our framework uses a specialized low-level C optimizer to identify instruction sequences that match the pre-defined code templates and translates them into extremely efficient SSE/AVX instructions. The DLA kernels generated by our template-based approach outperform the implementations in the Intel MKL and AMD ACML BLAS libraries on both Intel Sandy Bridge and AMD Piledriver processors.
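To make the phrase "commonly occurring instruction sequences within the optimized low-level C code" more concrete, the following is a minimal sketch of how AXPY and DOT might look when written one operation per statement. It is an illustration only, not the code AUGEM emits or consumes: the function names `axpy_sketch` and `dot_sketch` and the temporaries `tmp0`..`tmp2` are hypothetical, and the comments marking the load/multiply/add/store pattern are an assumption about the kind of sequence a code template could match.

```c
#include <assert.h>
#include <stddef.h>

/* AXPY: y[i] = y[i] + alpha * x[i].
 * Written in a three-address style so that each statement corresponds
 * roughly to one machine instruction (hypothetical illustration). */
void axpy_sketch(size_t n, double alpha, const double *x, double *y)
{
    assert((x != NULL && y != NULL) || n == 0);
    for (size_t i = 0; i < n; i++) {
        double tmp0 = x[i];          /* load     */
        double tmp1 = alpha * tmp0;  /* multiply */
        double tmp2 = y[i];          /* load     */
        tmp2 = tmp2 + tmp1;          /* add      */
        y[i] = tmp2;                 /* store    */
    }
}

/* DOT: returns the sum over i of x[i] * y[i].
 * The load/load/multiply/accumulate sequence recurs in every iteration. */
double dot_sketch(size_t n, const double *x, const double *y)
{
    assert((x != NULL && y != NULL) || n == 0);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double tmp0 = x[i];          /* load       */
        double tmp1 = y[i];          /* load       */
        double tmp2 = tmp0 * tmp1;   /* multiply   */
        acc = acc + tmp2;            /* accumulate */
    }
    return acc;
}
```

In a form like this, the repeated load, multiply and add statements are easy to recognize mechanically, which is presumably what lets a template-driven optimizer replace each matched sequence with packed SSE/AVX instructions that process several elements at once.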

    Published In

    SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
    November 2013
    1123 pages
    ISBN:9781450323789
    DOI:10.1145/2503210
    General Chair: William Gropp
    Program Chair: Satoshi Matsuoka

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. DLA code optimization
    2. auto-tuning
    3. code generation

    Acceptance Rates

    SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
