DOI: 10.1145/3626246.3653384
research-article
Open access

Bouncer: Admission Control with Response Time Objectives for Low-latency Online Data Systems

Published: 09 June 2024

Abstract

Internet companies rely on low-latency online data systems to provide quick responses to users. These systems employ complementary overload management techniques to offer continued, acceptable service throughout traffic surges, where "acceptable" partly means that serviced queries meet or closely track their response time objectives. In this paper we present Bouncer, an admission control policy that aims to keep admitted queries under or near their service level objectives (SLOs) on percentile response times. Bouncer decides whether to accept or reject incoming queries based on inexpensive estimates of such percentiles. It can assign separate SLOs to different classes of queries in the workload, and it implements early rejections to let clients react promptly and to help data systems avoid doing useless work. We propose two starvation avoidance strategies that supplement Bouncer's basic formulation and prevent query types from receiving no service. Our evaluation, in simulation and on a production-grade distributed graph database, shows that Bouncer and its starvation-avoiding variants 1) let admitted queries meet or stay close to their SLOs when other in-house policies do not, and 2) report fewer overall rejections and a small overhead, while letting the system reach high utilization. We observe that the proposed strategies can prevent query starvation, at the cost of a modest increase in rejections and of SLO violation counts for serviced queries that may be acceptable in practice.
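The abstract gives no pseudocode, so the following is only a minimal sketch of the general idea it describes: reject a query at arrival when a cheap estimate of its class's percentile response time exceeds that class's SLO. It is not the paper's algorithm. The class name, the sliding-window sampling, and the estimator (queue length times mean processing time divided by a hypothetical worker count, plus the class's processing-time percentile) are all assumptions made for illustration.

```python
from collections import deque
import math


class PercentileAdmissionController:
    """Toy admission controller; names and estimator are illustrative assumptions."""

    def __init__(self, slos_ms, percentile=0.9, workers=4, window=1000):
        self.slos_ms = dict(slos_ms)   # per-class SLO on the chosen percentile (ms)
        self.percentile = percentile   # e.g. 0.9 for a p90 objective
        self.workers = workers         # assumed number of parallel workers
        self.queued = 0                # admitted queries that have not finished yet
        # Sliding window of recent processing times, one per query class.
        self.samples = {c: deque(maxlen=window) for c in self.slos_ms}

    def record(self, query_class, processing_ms):
        """Record a completed query's processing time and release its slot."""
        self.samples[query_class].append(processing_ms)
        self.queued = max(0, self.queued - 1)

    def _percentile(self, values):
        # Nearest-rank percentile over the recorded samples.
        ordered = sorted(values)
        rank = max(0, math.ceil(self.percentile * len(ordered)) - 1)
        return ordered[rank]

    def estimate_ms(self, query_class):
        """Estimated percentile response time: expected wait plus processing time."""
        own = self.samples[query_class]
        if not own:
            return 0.0  # assumption: be optimistic until the class has data
        everything = [s for d in self.samples.values() for s in d]
        mean = sum(everything) / len(everything)
        wait = self.queued * mean / self.workers
        return wait + self._percentile(own)

    def admit(self, query_class):
        """Early rejection: decide at arrival, before any work is done."""
        if self.estimate_ms(query_class) > self.slos_ms[query_class]:
            return False
        self.queued += 1
        return True
```

In this sketch a caller would invoke `admit()` when a query arrives and `record()` when it completes. Because classes with no samples are always admitted, the sketch does not address starvation, which the paper handles with separate strategies that are not reproduced here.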

Cited By

  • (2024) Hippo: Accelerating Transaction Processing for Approximate Query Processing Engine with Sampling Semantics. 2024 Twelfth International Symposium on Computing and Networking Workshops (CANDARW), 117-122. DOI: 10.1109/CANDARW64572.2024.00048. Online publication date: 26-Nov-2024

Published In

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data
June 2024
694 pages
ISBN:9798400704222
DOI:10.1145/3626246

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. in-memory distributed graph database
  2. load shedding
  3. overload management
  4. percentile-based response time objectives

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '24

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 295
  • Downloads (Last 6 weeks): 48
Reflects downloads up to 01 Mar 2025
