DOI: 10.1145/3636480.3637283
research-article

Impact of Write-Allocate Elimination on Fujitsu A64FX

Published: 11 January 2024

Abstract

ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. The deployment of the 48-core, ARM-based Fujitsu A64FX processor in RIKEN's “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM into mainstream HPC. A key design criterion of the Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, which are a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point-intensive science workloads. One of the mechanisms for enhancing throughput concerns write-allocate operations (e.g., those caused by streaming writes), which are quite common in science applications. In particular, eliminating write-allocate operations (allocating a cache line on a write miss) through a special “zero fill” instruction available on ARM CPU architectures can improve the overall memory bandwidth by avoiding the memory read into a cache line, which is unnecessary because the cache line will subsequently be overwritten. While bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider scenarios driven by irregular memory accesses, such as graph analytics, and to analyze the impact of write-allocate elimination on diverse data-driven applications.
In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth First Search, the GAP benchmark suite, and Louvain Graph Clustering) and on five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system at Stony Brook University. Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across compilers and input data, and note end-to-end improvements of 5–20% across the benchmarks and a diverse spectrum of application scenarios owing to “zero fill”-related adaptations.

Cited By

  • (2024) Studying CPU and memory utilization of applications on Fujitsu A64FX and Nvidia Grace Superchip. Proceedings of the International Symposium on Memory Systems, 198–207. DOI: 10.1145/3695794.3695813. Online publication date: 30-Sep-2024

        Published In

        HPCAsia '24 Workshops: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops
        January 2024
        134 pages
        ISBN: 9798400716522
        DOI: 10.1145/3636480
        Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Fujitsu A64FX
        2. Graph analytics
        3. Rodinia benchmark suite
        4. Store Elimination
        5. Write-Allocate Evasion
        6. Zero Fill

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        HPCAsiaWS 2024

        Acceptance Rates

        Overall Acceptance Rate 69 of 143 submissions, 48%
