research-article

Open access

Bouncer: Admission Control with Response Time Objectives for Low-latency Online Data Systems

Authors:

Juan A. Colmenares

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

Pages 400 - 413

https://doi.org/10.1145/3626246.3653384

Published: 09 June 2024

Abstract

Internet companies rely on low-latency online data systems to provide quick responses to users. These systems employ complementary overload management techniques to offer continued, acceptable service throughout traffic surges, where "acceptable" partly means that serviced queries meet or closely track their response time objectives. In this paper we present Bouncer, an admission control policy that aims to keep admitted queries under or near their service level objectives (SLOs) on percentile response times. Bouncer decides whether to accept or reject incoming queries based on inexpensive estimates of such percentiles. It can assign separate SLOs to different classes of queries in the workload, and it implements early rejections to let clients react promptly and to help data systems avoid doing useless work. We propose two starvation avoidance strategies that supplement Bouncer's basic formulation and prevent query types from receiving no service. Our evaluation, in simulation and on a production-grade distributed graph database, shows that Bouncer and its starvation-avoiding variants 1) let admitted queries meet or stay close to their SLOs when other in-house policies do not, and 2) report fewer overall rejections and a small overhead, while letting the system reach high utilization. We observe that the proposed strategies can prevent query starvation, but at the cost of a modest increase in rejections and with SLO violation counts for serviced queries that may be acceptable in practice.
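To make the idea of admitting queries against per-class percentile SLOs concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a sliding window of recent response times per query class and a nearest-rank percentile computed by sorting that window; the abstract only says the estimates are inexpensive and does not specify the estimator. The class name `PercentileAdmissionController` and all of its parameters are hypothetical.

```python
from collections import defaultdict, deque


class PercentileAdmissionController:
    """Hypothetical sketch: admit a query only while the estimated percentile
    response time of its class is at or below that class's SLO."""

    def __init__(self, slo_ms, percentile=99, window=1000, min_samples=20):
        # slo_ms maps a query class name to its SLO in milliseconds.
        self.slo_ms = dict(slo_ms)
        # Integer percentile in (0, 100], e.g. 99 for p99 (an assumption).
        self.percentile = percentile
        self.min_samples = min_samples
        # One bounded window of recent response times per query class.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, query_class, response_ms):
        """Record the observed response time of a serviced query."""
        self._samples[query_class].append(response_ms)

    def estimate(self, query_class):
        """Nearest-rank percentile over the window, or None if too few samples."""
        samples = self._samples.get(query_class)
        if not samples or len(samples) < self.min_samples:
            return None
        ordered = sorted(samples)
        # Integer ceiling of percentile * n / 100 gives the 1-based rank.
        rank = (self.percentile * len(ordered) + 99) // 100
        return ordered[max(rank, 1) - 1]

    def admit(self, query_class):
        """Early accept/reject decision made before any work is done."""
        slo = self.slo_ms.get(query_class)
        if slo is None:
            return True  # classes without an SLO are always admitted here
        est = self.estimate(query_class)
        return est is None or est <= slo
```

In this sketch a rejected class stops producing new samples, so its estimate never recovers and the class is rejected indefinitely. The sketch does not attempt the starvation avoidance strategies the abstract mentions.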

Cited By

Kawashima H. 2024. Hippo: Accelerating Transaction Processing for Approximate Query Processing Engine with Sampling Semantics. In 2024 Twelfth International Symposium on Computing and Networking Workshops (CANDARW). 117-122. Online publication date: 26-Nov-2024. https://doi.org/10.1109/CANDARW64572.2024.00048

Index Terms

Bouncer: Admission Control with Response Time Objectives for Low-latency Online Data Systems
1. General and reference
  1. Cross-computing tools and techniques
    1. Empirical studies
2. Information systems
  1. Data management systems
    1. Database administration
      1. Database utilities and tools
    2. Database management system engines
      1. Main memory engines


Published In

SIGMOD/PODS '24: Companion of the 2024 International Conference on Management of Data

June 2024

694 pages

ISBN:9798400704222

DOI:10.1145/3626246

General Chairs: Pablo Barcelo (Universidad Catolica, Chile), Nayat Sanchez-Pi (INRIA Chile)

Program Chairs: Alexandra Meliou (University of Massachusetts Amherst, USA), S. Sudarshan (Indian Institute of Technology Bombay)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGMOD/PODS '24

Sponsor:

SIGMOD

SIGMOD/PODS '24: International Conference on Management of Data

June 9 - 15, 2024

Santiago AA, Chile

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

Total Citations: 1
Total Downloads: 295
Downloads (Last 12 months): 295
Downloads (Last 6 weeks): 48

Reflects downloads up to 01 Mar 2025
