ABSTRACT
Serverless computing is a popular software deployment model for the cloud in which applications are designed as a collection of stateless tasks. Developers are charged for the CPU time and memory footprint of each serverless function execution, which incentivizes them to reduce both runtime and memory usage. As a result, functions tend to be short (often on the order of a few milliseconds) and compact (128–256 MB). Cloud providers can pack thousands of such functions on a server, resulting in frequent context switches and a tremendous degree of interleaving. Consequently, when a memory-resident function is re-invoked, it commonly finds its on-chip microarchitectural state completely cold due to thrashing by other functions, a phenomenon termed lukewarm invocation.
Our analysis shows that the cold microarchitectural state caused by lukewarm invocations is highly detrimental to performance, which corroborates prior work. The main source of performance degradation is the front-end, which comprises instruction delivery, branch identification via the branch target buffer (BTB), and conditional branch prediction. State-of-the-art front-end prefetchers show only limited effectiveness on lukewarm invocations, falling considerably short of an ideal front-end. We demonstrate that the reason is the cold microarchitectural state of the branch identification and prediction units. In response, we introduce Ignite, a comprehensive restoration mechanism for front-end microarchitectural state that targets instructions, the BTB, and the branch predictor via unified metadata. Ignite records an invocation's control flow graph in a compressed format and uses it to restore the front-end structures the next time the function is invoked. Ignite outperforms state-of-the-art front-end prefetchers, improving performance by an average of 43% by significantly reducing instruction, BTB, and branch predictor misses per kilo-instruction (MPKI).
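The record-and-restore idea can be illustrated with a small toy model. The sketch below is not Ignite's actual design: the abstract does not specify the metadata format, the hardware structures, or the replay policy. All names here (`ToyBTB`, `record`, `restore`, `mpki`) are hypothetical, delta encoding stands in for the unspecified compression, the trace is invented, and only the BTB is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBTB:
    """Hypothetical, unbounded BTB model: maps a branch PC to its target."""
    entries: dict = field(default_factory=dict)
    misses: int = 0

    def access(self, pc, target):
        # A miss is a missing entry or a stale target; fill on miss.
        if self.entries.get(pc) != target:
            self.misses += 1
            self.entries[pc] = target


def record(trace):
    """Record each distinct taken branch once, delta-encoded.

    Each entry is (pc - previous recorded pc, target - pc). This encoding
    is an assumption for illustration, not the paper's format.
    """
    seen = set()
    metadata = []
    prev_pc = 0
    for pc, target in trace:
        if (pc, target) in seen:
            continue
        seen.add((pc, target))
        metadata.append((pc - prev_pc, target - pc))
        prev_pc = pc
    return metadata


def restore(metadata, btb):
    """Decode the recorded metadata and pre-fill the BTB before execution."""
    prev_pc = 0
    for d_pc, d_target in metadata:
        pc = prev_pc + d_pc
        btb.entries[pc] = pc + d_target
        prev_pc = pc


def run(trace, btb):
    """Replay a trace of (branch PC, target) pairs and return BTB misses."""
    for pc, target in trace:
        btb.access(pc, target)
    return btb.misses


def mpki(misses, instructions):
    """Misses per kilo-instruction."""
    return 1000.0 * misses / instructions


# Invented trace: a three-branch loop executed four times.
trace = [(0x1000, 0x1040), (0x1048, 0x1080), (0x1090, 0x1000)] * 4
instructions = 60  # assumed instruction count for this toy invocation

# First invocation: the BTB starts cold while the control flow is recorded.
cold_misses = run(trace, ToyBTB())
metadata = record(trace)

# Next invocation: the BTB is restored from the metadata before running.
warm_btb = ToyBTB()
restore(metadata, warm_btb)
warm_misses = run(trace, warm_btb)

print(f"cold BTB MPKI:     {mpki(cold_misses, instructions):.1f}")
print(f"restored BTB MPKI: {mpki(warm_misses, instructions):.1f}")
```

In this toy setting, every miss of the cold run is a compulsory miss on a branch that the recorded metadata already describes, so pre-filling the structure removes those misses on the next invocation. A real design would also have to bound metadata size and handle control flow that differs between invocations, which this sketch ignores.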
Index Terms
- Warming Up a Cold Front-End with Ignite