DOI: 10.1145/3575693.3575727
research-article
Open access

Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications

Published: 30 January 2023

Abstract

While profile guided optimizations (PGO) and link time optimizations (LTO) have been widely adopted, post link optimizations (PLO) have languished until recently, when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To reconcile and enable post link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google with over tens of millions of cores executing Propeller optimized code at any time. An evaluation of internal warehouse-scale applications shows that Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO. Compiler tools such as Clang improve by 7% while MySQL improves by 1%. Compared to the state-of-the-art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30%-70% on large benchmarks.
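
The idea of basic block sections can be pictured with a small toy model. The sketch below is only an illustration under stated assumptions, not Propeller's actual layout algorithm: it assumes each basic block has been placed in its own section and that a profile supplies an execution count per block, then orders the sections so hot blocks come first and cold blocks are pushed to the end. The `BlockSection` type, the `layout_sections` function, the hot threshold, and the `.text.<function>.bb<N>` naming are all hypothetical and chosen for readability.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockSection:
    """Toy model of one basic block placed in its own section (hypothetical)."""
    function: str     # enclosing function name
    block_id: int     # index of the block within the function
    size: int         # machine-code size in bytes
    exec_count: int   # execution count attributed by the profile


def section_name(section: BlockSection) -> str:
    # Illustrative naming only; real section and symbol names may differ.
    return f".text.{section.function}.bb{section.block_id}"


def layout_sections(sections: List[BlockSection], hot_threshold: int = 1) -> List[BlockSection]:
    """Order sections for relinking: hot blocks by descending count, then cold blocks."""
    hot = [s for s in sections if s.exec_count >= hot_threshold]
    cold = [s for s in sections if s.exec_count < hot_threshold]
    hot.sort(key=lambda s: (-s.exec_count, s.function, s.block_id))
    cold.sort(key=lambda s: (s.function, s.block_id))
    return hot + cold


if __name__ == "__main__":
    profile = [
        BlockSection("foo", 0, 16, 100),
        BlockSection("foo", 1, 32, 0),
        BlockSection("bar", 0, 8, 500),
        BlockSection("bar", 1, 24, 0),
    ]
    for s in layout_sections(profile):
        print(section_name(s), s.exec_count)
```

Because the ordering is expressed over sections rather than over rewritten instructions, a linker can apply it while resolving references as usual, which is the property the abstract relies on when it contrasts relinking with binary rewriting.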

      Published In

      ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
      January 2023
      947 pages
      ISBN:9781450399166
      DOI:10.1145/3575693
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Binary Optimization
      2. Datacenters
      3. Distributed Build System
      4. Post-Link Optimization
      5. Profile Guided Optimization
      6. Warehouse-Scale Applications

      Qualifiers

      • Research-article

      Conference

      ASPLOS '23

      Acceptance Rates

      Overall Acceptance Rate 535 of 2,713 submissions, 20%

      Article Metrics

      • Downloads (Last 12 months): 1,358
      • Downloads (Last 6 weeks): 160
      Reflects downloads up to 05 Mar 2025

      Cited By

      • (2025) Post-Link Outlining for Code Size Reduction. Proceedings of the 34th ACM SIGPLAN International Conference on Compiler Construction, 154-166. DOI: 10.1145/3708493.3712692. Online publication date: 25-Feb-2025
      • (2024) Incremental Specialization of Network Programs. Proceedings of the 23rd ACM Workshop on Hot Topics in Networks, 264-272. DOI: 10.1145/3696348.3696870. Online publication date: 18-Nov-2024
      • (2024) The Cost of Profiling in the HotSpot Virtual Machine. Proceedings of the 21st ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, 112-126. DOI: 10.1145/3679007.3685055. Online publication date: 13-Sep-2024
      • (2024) Reordering Functions in Mobiles Apps for Reduced Size and Faster Start-Up. ACM Transactions on Embedded Computing Systems 23:4, 1-54. DOI: 10.1145/3660635. Online publication date: 10-Jun-2024
      • (2024) Research on Link Optimization Technology Based on Basic Block Reordering. 2024 5th International Symposium on Computer Engineering and Intelligent Communications (ISCEIC), 491-495. DOI: 10.1109/ISCEIC63613.2024.10810167. Online publication date: 8-Nov-2024
      • (2024) A Method for Implementing Basic Block Reordering in GNU 1D. 2024 4th International Conference on Electronic Information Engineering and Computer (EIECT), 1017-1021. DOI: 10.1109/EIECT64462.2024.10867069. Online publication date: 15-Nov-2024
      • (2024) Revamping Sampling-Based PGO with Context-Sensitivity and Pseudo-instrumentation. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 322-333. DOI: 10.1109/CGO57630.2024.10444807. Online publication date: 2-Mar-2024
      • (2023) DASS: Dynamic Adaptive Sub-Target Specialization. 2023 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 36-45. DOI: 10.1109/SBAC-PADW60351.2023.00016. Online publication date: 17-Oct-2023
      • (2023) Online Code Layout Optimizations via OCOLOS. IEEE Micro 43:4, 71-79. DOI: 10.1109/MM.2023.3274758. Online publication date: 1-Jul-2023
      • (2023) JACO: JAva Code Layout Optimizer Enabling Continuous Optimization without Pausing Application Services. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 295-306. DOI: 10.1109/CLUSTER52292.2023.00032. Online publication date: 31-Oct-2023
