A scheduling framework for distributed key-value stores and its application to tail latency minimization

Journal of Scheduling

Abstract

Distributed key-value stores employ replication for high availability. Yet, they do not always take efficient advantage of the multiple replicas available for each value, and read operations often exhibit high tail latencies. Various replica selection strategies have been proposed to address this problem, together with local request scheduling policies. It is difficult, however, to determine the absolute performance gain that each of these strategies can achieve. We present a formal framework allowing the systematic study of request scheduling strategies in key-value stores. We contribute a definition of the optimization problem related to reducing tail latency in a replicated key-value store as a minimization problem with respect to the maximum weighted flow criterion. By using scheduling theory, we show the difficulty of this problem and therefore the need to develop performance guarantees. We also study the behavior of heuristic methods using simulations that highlight which properties enable limiting tail latency: for instance, the EarliestFinishTime strategy, which uses the earliest next available time of servers, exhibits a tail latency that is less than half that of state-of-the-art strategies, often matching the lower bound. Our study also emphasizes the importance of metrics such as the stretch to properly evaluate replica selection and local execution policies.
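To make the objective concrete, the following Python sketch evaluates the maximum weighted flow obtained by an earliest-finish-time style replica selection. It is only an illustration under assumptions that are not taken from the article: requests are handled in release order, each server executes its requests in FIFO order, processing times are known and identical on every replica, and the names `Request` and `eft_max_weighted_flow` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    release: float        # arrival time r_i
    duration: float       # processing time p_i (assumed known)
    replicas: tuple       # indices of the servers holding the requested value
    weight: float = 1.0   # w_i; using 1 / duration turns weighted flow into stretch


def eft_max_weighted_flow(requests, n_servers):
    """Send each request to the replica that can finish it earliest and
    return max_i w_i * (C_i - r_i) over the resulting schedule."""
    free_at = [0.0] * n_servers  # next available time of each server
    worst = 0.0
    for req in sorted(requests, key=lambda r: r.release):
        # Same duration on every replica, so earliest start means earliest finish.
        best = min(req.replicas, key=lambda s: max(free_at[s], req.release))
        start = max(free_at[best], req.release)
        free_at[best] = start + req.duration
        worst = max(worst, req.weight * (free_at[best] - req.release))
    return worst
```

Setting `weight = 1 / duration` makes the same function report the maximum stretch, which is one way to compare short and long requests on an equal footing.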

Notes

  1. https://doi.org/10.6084/m9.figshare.21750605.v1.

  2. We express non-migratory preemption as \( pmtn ^*\) in the \(\beta \)-part, not to be confused with the classic \( pmtn \) constraint.

  3. https://www.salabim.org.

  4. An Empirical Cumulative Distribution Function is the distribution function obtained from the empirical measure of a sample. With enough realizations, it converges to the actual, underlying cumulative distribution function.

  5. A boxplot consists of a bold line for the median, a box for the quartiles and whiskers that extend at most to 1.5 times the interquartile range from the box.
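The two statistical summaries described in Notes 4 and 5 can be computed as in the sketch below. It uses only the Python standard library; the function names `ecdf` and `boxplot_stats` are illustrative, and the quartile convention (the `inclusive` method of `statistics.quantiles`) is an assumption, since the notes do not say which convention is used.

```python
import bisect
import statistics


def ecdf(sample):
    """Return F such that F(x) is the fraction of observations <= x."""
    xs = sorted(sample)
    n = len(xs)
    return lambda x: bisect.bisect_right(xs, x) / n


def boxplot_stats(sample):
    """Return (lower whisker, Q1, median, Q3, upper whisker).
    Whiskers stop at the most extreme observation lying within
    1.5 times the interquartile range from the box."""
    q1, median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    low = min(x for x in sample if x >= q1 - 1.5 * iqr)
    high = max(x for x in sample if x <= q3 + 1.5 * iqr)
    return low, q1, median, q3, high
```

With latencies `[1, 2, 3, 4, 100]`, the value 100 falls outside the upper whisker, which is how a boxplot exposes tail observations that the median alone would hide.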

References

  • Ambühl, C., & Mastrolilli, M. (2005). On-line scheduling to minimize max flow time: An optimal preemptive algorithm. Operations Research Letters, 33(6), 597–602.

  • Anand, S., Bringmann, K., Friedrich, T., Garg, N., & Kumar, A. (2017). Minimizing maximum (weighted) flow-time on related and unrelated machines. Algorithmica, 77(2), 515–536.

  • Atikoglu, B., Xu, Y., Frachtenberg, E., Jiang, S., & Paleczny, M. (2012). Workload analysis of a large-scale key-value store. ACM SIGMETRICS Performance Evaluation Review, 40(1), 53–64.

  • Awerbuch, B., Azar, Y., Leonardi, S., & Regev, O. (2002). Minimizing the flow time without migration. SIAM Journal on Computing, 31(5), 1370–1382.

  • Baker, K. R. (1974). Introduction to sequencing and scheduling. Wiley.

  • Balmau, O., Dinu, F., Zwaenepoel, W., Gupta, K., Chandhiramoorthi, R., & Didona, D. (2020). SILK+: Preventing latency spikes in log-structured merge key-value stores running heterogeneous workloads. ACM Transactions on Computer Systems, 36(4), 1–27.

  • Bansal, N. (2005). Minimizing flow time on a constant number of machines with preemption. Operations Research Letters, 33(3), 267–273.

  • Bansal, N., & Cloostermans, B. (2016). Minimizing maximum flow-time on related machines. Theory of Computing, 12(1), 1–14.

  • Bansal, N., & Dhamdhere, K. (2007). Minimizing weighted flow time. ACM Transactions on Algorithms, 3(4), 39.

  • Bansal, N., & Kulkarni, J. (2015). Minimizing flow-time on unrelated machines. In Proceedings of the forty-seventh annual ACM symposium on theory of computing (pp. 851–860).

  • Bansal, N., & Pruhs, K. (2003). Server scheduling in the \(l_p\) norm: a rising tide lifts all boat. In Proceedings of the thirty-fifth annual ACM symposium on theory of computing (pp. 242–250).

  • Baptiste, P., Brucker, P., Chrobak, M., Dürr, C., Kravchenko, S. A., & Sourd, F. (2007). The complexity of mean flow time scheduling problems with release times. Journal of Scheduling, 10(2), 139–146.

  • Becchetti, L., & Leonardi, S. (2004). Nonclairvoyant scheduling to minimize the total flow time on single and parallel machines. Journal of the ACM, 51(4), 517–539.

  • Bender, M.A., Chakrabarti, S., Muthukrishnan, S. (1998). Flow and stretch metrics for scheduling continuous job streams. In ACM-SIAM symposium on discrete algorithms (pp. 270–279).

  • Ben Mokhtar, S., Canon, L. C., Dugois, A., Marchal, L., Rivière, E. (2021). Taming tail latency in key-value stores: a scheduling perspective. In European conference on parallel processing (pp. 136–150).

  • Benoit, A., Elghazi, R., Robert, Y. (2021). Max-stretch minimization on an edge-cloud platform. In 2021 IEEE international parallel and distributed processing symposium (pp. 766–775).

  • Brucker, P., Jurisch, B., & Krämer, A. (1997). Complexity of scheduling problems with multi-purpose machines. Annals of Operations Research, 70, 57–73.

  • Brucker, P., & Kravchenko, S. A. (2008). Scheduling jobs with equal processing times and time windows on identical parallel machines. Journal of Scheduling, 11(4), 229–237.

  • Bruno, J., Coffman, E. G., Jr., & Sethi, R. (1974). Scheduling independent tasks to reduce mean finishing time. Communications of the ACM, 17(7), 382–387.

  • Brutlag, J. (2009). Speed matters for Google web search.

  • Carlson, J. L. (2013). Redis in action. Manning Publications Co.

  • Chekuri, C., Khanna, S., Zhu, A. (2001). Algorithms for minimizing weighted flow time. In Proceedings of the thirty-third annual ACM symposium on theory of computing (pp. 84–93).

  • Chodorow, K. (2013). MongoDB: the definitive guide: Powerful and scalable data storage. O’Reilly.

  • Choudhury, A. R., Das, S., Garg, N., & Kumar, A. (2018). Rejecting jobs to minimize load and maximum flow-time. Journal of Computer and System Sciences, 91, 42–68.

  • Dean, J., & Barroso, L. A. (2013). The tail at scale. Communications of the ACM, 56(2), 74–80.

  • DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., & Vogels, W. (2007). Dynamo: Amazon’s highly available key-value store. ACM SIGOPS Operating Systems Review, 41(6), 205–220.

  • Delgado, P., Didona, D., Dinu, F., Zwaenepoel, W. (2016). Job-aware scheduling in Eagle: Divide and stick to your probes. In Proceedings of the seventh ACM symposium on cloud computing (pp. 497–509).

  • Delgado, P., Dinu, F., Kermarrec, A. M., Zwaenepoel, W. (2015). Hawk: Hybrid datacenter scheduling. In 2015 USENIX annual technical conference (pp. 499–510).

  • Didona, D., & Zwaenepoel, W. (2019). Size-aware sharding for improving tail latencies in in-memory key-value stores. In 16th USENIX symposium on networked systems design and implementation (pp. 79–94).

  • Dutot, P. F., Saule, E., Srivastav, A., Trystram, D. (2016). Online non-preemptive scheduling to optimize max stretch on a single machine. In Computing and combinatorics - 22nd international conference (vol. 9797, pp. 483–495).

  • Feitelson, D. G. (2015). Workload modeling for computer systems performance evaluation. Cambridge University Press.

  • Garg, N., & Kumar, A. (2007). Minimizing average flow-time: Upper and lower bounds. In 48th annual IEEE symposium on foundations of computer science (FOCS’07) (pp. 603–613).

  • Hall, L. A. (1993). A note on generalizing the maximum lateness criterion for scheduling. Discrete Applied Mathematics, 47(2), 129–137.

  • Jaiman, V., Ben Mokhtar, S., Quéma, V., Chen, L.Y., Rivière, E. (2018). Héron: Taming tail latencies in key-value stores under heterogeneous workloads. In 37th symposium on reliable distributed systems (pp. 191–200).

  • Jaiman, V., Mokhtar, S.B., Rivière, E. (2020). TailX: Scheduling heterogeneous multiget queries to improve tail latencies in key-value stores. In IFIP international conference on distributed applications and interoperable systems (pp. 73–92).

  • Jiang, W., Xie, H., Zhou, X., Fang, L., & Wang, J. (2019). Haste makes waste: The on-off algorithm for replica selection in key-value stores. Journal of Parallel and Distributed Computing, 130, 80–90.

  • Jose, J., Subramoni, H., Luo, M., Zhang, M., Huang, J., Wasi-ur Rahman, M. (2011). Memcached design on high performance RDMA capable interconnects. In International conference on parallel processing (pp. 743–752).

  • Kalyanasundaram, B., & Pruhs, K. (2000). Speed is as powerful as clairvoyance. Journal of the ACM, 47(4), 617–643.

  • Kellerer, H., Tautenhahn, T., & Woeginger, G. (1999). Approximability and nonapproximability results for minimizing total flow time on a single machine. SIAM Journal on Computing, 28(4), 1155–1166.

  • Kravchenko, S. A., & Werner, F. (2009). Preemptive scheduling on uniform machines to minimize mean flow time. Computers & Operations Research, 36(10), 2816–2821.

  • Labetoulle, J., Lawler, E. L., Lenstra, J. K., Kan, A. R. (1984). Preemptive scheduling of uniform machines subject to release dates. In Progress in combinatorial optimization (pp. 245–261).

  • Lakshman, A., & Malik, P. (2010). Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2), 35–40.

  • Lawler, E. L., & Labetoulle, J. (1978). On preemptive scheduling of unrelated parallel processors by linear programming. Journal of the ACM, 25(4), 612–619.

  • Lee, K., Leung, J. Y., & Pinedo, M. L. (2013). Makespan minimization in online scheduling with machine eligibility. Annals of Operations Research, 204(1), 189–222.

  • Legrand, A., Su, A., & Vivien, F. (2008). Minimizing the stretch when scheduling flows of divisible requests. Journal of Scheduling, 11(5), 381–404.

  • Lenstra, J. K., Kan, A. R., & Brucker, P. (1977). Complexity of machine scheduling problems. Studies in integer programming, 1, 343–362.

  • Leonardi, S., & Raz, D. (2007). Approximating total flow time on parallel machines. Journal of Computer and System Sciences, 73(6), 875–891.

  • Leung, J. Y. T., & Li, C. L. (2008). Scheduling with processing set restrictions: A survey. International Journal of Production Economics, 116(2), 251–262.

  • Leung, J. Y. T., & Li, C. L. (2016). Scheduling with processing set restrictions: A literature update. International Journal of Production Economics, 175, 1–11.

  • Li, J., Sharma, N. K., Ports, D. R., Gribble, S. D. (2014). Tales of the tail: Hardware, OS, and application-level sources of tail latency. In ACM symposium on cloud computing (pp. 1–14).

  • Lucarelli, G., Moseley, B., Thang, N. K., Srivastav, A., Trystram, D. (2019). Online non-preemptive scheduling to minimize maximum weighted flow-time on related machines. In 39th IARCS annual conference on foundations of software technology and theoretical computer science (Vol. 150, pp. 24:1–24:12).

  • Maheswaran, M., Ali, S., Siegel, H. J., Hensgen, D., & Freund, R. F. (1999). Dynamic mapping of a class of independent tasks onto heterogeneous computing systems. Journal of Parallel and Distributed Computing, 59(2), 107–131.

  • Mastrolilli, M. (2004). Scheduling to minimize max flow time: Off-line and on-line algorithms. International Journal of Foundations of Computer Science, 15(02), 385–401.

  • Moseley, B., Pruhs, K., Stein, C. (2013). The complexity of scheduling for p-norms of flow and stretch. In International conference on integer programming and combinatorial optimization (pp. 278–289).

  • Muthukrishnan, S., Rajaraman, R., Shaheen, A., Gehrke, J. E. (1999). Online scheduling to minimize average stretch. In 40th annual symposium on foundations of computer science (pp. 433–443).

  • Reda, W., Canini, M., Suresh, L., Kostić, D., Braithwaite, S. (2017). Rein: Taming tail latency in key-value stores via multiget scheduling. In 12th European conference on computer systems (pp. 95–110).

  • Saule, E., Bozdağ, D., & Çatalyürek, Ü. V. (2012). Optimizing the stretch of independent tasks on a cluster: From sequential tasks to moldable tasks. Journal of Parallel and Distributed Computing, 72(4), 489–503.

  • Simons, B. (1983). Multiprocessor scheduling of unit-time jobs with arbitrary release times and deadlines. SIAM Journal on Computing, 12(2), 294–299.

  • Sitters, R. (2001). Two NP-hardness results for preemptive minsum scheduling of unrelated parallel machines. In International conference on integer programming and combinatorial optimization (pp. 396–405).

  • Suresh, L., Canini, M., Schmid, S., Feldmann, A. (2015). C3: Cutting tail latency in cloud data stores via adaptive replica selection. In 12th USENIX symposium on networked systems design and implementation (pp. 513–527).

  • Vulimiri, A., Godfrey, P. B., Mittal, R., Sherry, J., Ratnasamy, S., Shenker, S. (2013). Low latency via redundancy. In 9th ACM conference on emerging networking experiments and technologies (pp. 283–294).

  • Wu, Z., Yu, C., Madhyastha, H. V. (2015). CosTLO: Cost-effective redundancy for lower latency variance on cloud storage services. In 12th USENIX symposium on networked systems design and implementation (pp. 543–557).

Author information

Corresponding author

Correspondence to Anthony Dugois.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is an extended version of Ben Mokhtar et al. (2021).

Appendices

Appendix A Proof of Theorem 5

Proof

First, we build an instance designed to reach an arbitrarily large ratio. Then, we determine a lower bound on the objective achieved with Max-Flow, and finally, an upper bound on the optimal one.

Instance characteristics For an arbitrary competitive ratio \(k\ge 1\), we build the following instance with n requests. The first k requests have a weight \(w_i=k\) and release time \(r_i=0\). Then, a new request arrives at each new time step with a weight that is the highest integer lower than or equal to \(1+1/k\) times the weight of the previous request (i.e., \(w_i=\lfloor (1+1/k)w_{i-1}\rfloor \) and \(r_i=i-k\) for \(k<i\le n\)). In total, \(n=k^2+11\) requests are submitted.
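The instance can be generated directly from this description. The sketch below is illustrative only: the helper name `build_instance` is hypothetical, k is assumed to be a positive integer, and the floor is computed with integer arithmetic using \(\lfloor (1+1/k)w\rfloor = w+\lfloor w/k\rfloor \) to avoid rounding errors.

```python
def build_instance(k):
    """Return (weights, releases) for the n = k^2 + 11 requests.
    Lists are 0-based: entry i describes request T_{i+1}."""
    n = k * k + 11
    weights = [k] * k       # the first k requests have weight k ...
    releases = [0] * k      # ... and are all released at time 0
    for i in range(k + 1, n + 1):            # 1-based index of the new request
        previous = weights[-1]
        weights.append(previous + previous // k)  # floor((1 + 1/k) * w_{i-1})
        releases.append(i - k)                    # one arrival per time step
    return weights, releases
```

For \(k=2\), the weights start as 2, 2, 3, 4, 6, 9, 13 and the last of the 15 requests has weight 316.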

Lower bound At time \(t=0\), Max-Flow starts one of the first k requests because they are the only ones that are ready. We now prove that at any time t such that \(1\le t<k\), Max-Flow starts one of the remaining first k requests, which delays all arriving requests (any request \(T_i\) such that \(k<i<2k\)).

On the one hand, \(w_i(t+1-r_i)=k(t+1)\) for any of the first k requests (\(1\le i\le k\)). On the other hand, for \(k<i\le n\), \(w_i\le (1+1/k) w_{i-1}\), and thus, \(w_i\le (1+1/k)^{i-k} k\). Therefore, \(w_i(t+1-r_i)\le (1+1/k)^{i-k} k(t+1-i+k)\).

Let us show that at any time t such that \(1\le t<k\), any of the first k requests has the highest value, that is \(k(t+1)\ge (1+1/k)^{i-k} k(t+1-i+k)\) for all \(k<i\le t+k\). By changing variables (\(j=i-k\) and \(t'=t+1\)), this corresponds to proving \((1+1/k)^j (t'-j)\le t'\) for all \(1\le j<t'\le k\). We show by induction on j that \((1+1/k)^j (t'-j)\le t'\) for all \(0\le j<t'\) and for a given \(t'\) (\(2\le t'\le k\)). The induction basis with \(j=0\) is direct. The induction step assumes \((1+1/k)^j (t'-j)\le t'\) to be true for a given j such that \(0\le j<t'-1\). We have

$$\begin{aligned} (1+1/k)\frac{t'-j-1}{t'-j} &= (1+1/k)\left( 1-\frac{1}{t'-j}\right) \\ &= 1+1/k-\frac{1}{t'-j}-\frac{1}{k(t'-j)}\le 1. \end{aligned}$$

The last line is obtained by remarking that \(t'\le k\) and \(j\ge 0\) (thus, \(1/k\le \frac{1}{t'-j}\)). Therefore,

$$\begin{aligned} (1+1/k)^{j+1} (t'-(j+1)) &= (1+1/k)^j(1+1/k)(t'-j)\frac{t'-j-1}{t'-j} \\ &\le (1+1/k)^j(t'-j)\le t', \end{aligned}$$

which concludes the induction proof.
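As a sanity check that complements the induction (and does not replace it), the inequality \((1+1/k)^j (t'-j)\le t'\) can be verified exactly for small values of k with rational arithmetic. The function name `first_phase_holds` is illustrative.

```python
from fractions import Fraction


def first_phase_holds(k):
    """Check (1 + 1/k)^j * (t' - j) <= t' for all 2 <= t' <= k and 0 <= j < t'."""
    growth = 1 + Fraction(1, k)   # exact value of 1 + 1/k
    return all(
        growth**j * (tp - j) <= tp
        for tp in range(2, k + 1)
        for j in range(0, tp)
    )
```

Using `Fraction` rather than floating-point numbers keeps the comparison exact, which matters because the two sides coincide for \(j=0\).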

At time \(t=k\), all of the first k requests have been completed. We now prove that at any time t such that \(k\le t<n\), Max-Flow starts request \(T_{t+1}\). This would mean that at time t, only requests \(T_i\) such that \(t<i\le t+k\) are ready and not completed. We prove by induction that at time \(k\le t<n\), all requests \(T_i\) with \(i\le t\) are completed. The induction basis with \(t=k\) is already proven above. Assume the hypothesis is true for a given \(k\le t<n\). It remains to prove that at time \(t'=t+1\), \(T_{t+2}\) is started among requests \(T_i\) such that \(t'<i\le t'+k\).

On the one hand, \(w_i(t'+1-r_i)=w_{t+2}k\) for request \(T_{t+2}\). On the other hand, for \(t'+1<i\le t'+k\), \(w_i(t'+1-r_i)\le (1+1/k)^{i-t'-1} w_{t+2}(t'+1-i+k)\). Let us show that \((1+1/k)^{i-t'-1} w_{t+2}(t'+1-i+k)<w_{t+2}k\) for \(t'+1<i\le t'+k\) and for a given \(k\le t<n\). By changing variables (\(j=i-t'-1\)), this corresponds to proving that \((1+1/k)^j (k-j)<k\) for all \(0<j<k\). We show this again by induction on j for a given \(k\ge 1\). For the induction basis, \((1+1/k)(k-1)=k+1-1-1/k<k\). For the induction step, we can show that \((1+1/k)\frac{k-j-1}{k-j}\le 1\) by remarking that \(k>k-j\) (thus, \(1/k<\frac{1}{k-j}\)), which concludes the induction proof.
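The strict inequality \((1+1/k)^j (k-j)<k\) for \(0<j<k\) can likewise be checked exactly for small k. This is only a numerical illustration of the claim proved by induction, and the function name `second_phase_holds` is illustrative.

```python
from fractions import Fraction


def second_phase_holds(k):
    """Check (1 + 1/k)^j * (k - j) < k for every 0 < j < k, exactly."""
    growth = 1 + Fraction(1, k)
    return all(growth**j * (k - j) < k for j in range(1, k))
```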

To conclude on the performance of Max-Flow, request \(T_i\) is started at time \(i-1\) and therefore, the objective value is at least \(w_n F_n=w_n(n-(n-k))=k w_n\).

Upper bound A better objective value can be obtained by starting all requests as soon as they arrive except for the first k ones: request \(T_1\) is started at time \(t=0\); then, request \(T_i\) is started at time \(t=i-k\) for \(k<i\le n\); finally, the remaining requests among the first k ones are started (\(T_i\) is started at time \(t=n-k+i-1\) for \(1<i\le k\)). We analyze the objective value for request \(T_k\) because it is the last one to be executed among the first k requests, and \(T_n\) because it is the one with the highest weight among the last \(n-k\) requests. For \(T_k\), \(w_k F_k=k(C_k-r_k)=k n\). For \(T_n\), \(w_n F_n=w_n\).
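Both schedules can be replayed on the instance to compare their objective values. The sketch below rests on assumptions drawn from how the proof reasons rather than from a stated definition: a single server, unit processing times, and a Max-Flow rule that starts the ready request maximizing \(w_i(t+1-r_i)\), with ties broken toward the lowest index. All function names are illustrative.

```python
def build_instance(k):
    """Weights and release times of the n = k^2 + 11 requests (0-based lists)."""
    n = k * k + 11
    weights, releases = [k] * k, [0] * k
    for i in range(k + 1, n + 1):                        # 1-based request index
        weights.append(weights[-1] + weights[-1] // k)   # floor((1 + 1/k) * w_{i-1})
        releases.append(i - k)
    return weights, releases


def max_flow_objective(weights, releases):
    """Maximum weighted flow when the ready request with the largest
    w_i * (t + 1 - r_i) is started at each time step (assumed rule)."""
    pending = set(range(len(weights)))
    t, objective = 0, 0
    while pending:
        ready = [i for i in pending if releases[i] <= t]
        if not ready:                                    # idle until the next release
            t = min(releases[i] for i in pending)
            continue
        chosen = max(ready, key=lambda i: (weights[i] * (t + 1 - releases[i]), -i))
        pending.remove(chosen)
        t += 1                                           # unit processing time
        objective = max(objective, weights[chosen] * (t - releases[chosen]))
    return objective


def alternative_objective(weights, releases, k):
    """Maximum weighted flow of the schedule described in the upper bound."""
    n = len(weights)
    starts = list(releases)       # T_1 and T_{k+1}, ..., T_n start on arrival
    for i in range(1, k):         # 0-based i stands for request T_{i+1}
        starts[i] = n - k + i     # T_{i+1} starts at time n - k + (i + 1) - 1
    return max(w * (s + 1 - r) for w, s, r in zip(weights, starts, releases))
```

For \(k=2\), this replay gives 632 for Max-Flow against 316 for the alternative schedule, that is \(k w_n\) against \(w_n\).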

We prove that \(w_n\ge k n\) by deriving a lower bound on \(w_n\). The weights increase in multiple stages. At first, each increment is unitary: \(w_{i+1}=w_i+1\) for \(k\le i<2k\). Then, the increment increases at the second stage and \(w_{i+1}=w_i+2\) for \(2k\le i<2k+\lceil k/2\rceil \). At the k-th stage, \(w_{i+1}=w_i+k\) for a single request. At a given stage j, the increment of the weight is j for at most \(\lceil k/j\rceil \) requests. Let \(n_1=\sum _{j=1}^k\lceil k/j\rceil \) be the number of such requests (assuming \(n-k\ge n_1\)). Finally, the remaining \(n_2=n-k-n_1\) requests are incremented by a value that increases by at least 1 for each new request: \(w_{i+1}\ge w_i+(k+i-n+n_2)\) for \(n-n_2<i\le n\).
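The stage structure can be observed by listing the increments between consecutive weights. The following sketch counts how many requests receive each increment so that the counts can be compared with \(\lceil k/j\rceil \). It assumes \(n=k^2+11\) as in the instance, and the helper names `weights_for` and `stage_lengths` are illustrative.

```python
from collections import Counter


def weights_for(k):
    """Weights w_1, ..., w_n of the instance with n = k^2 + 11 (0-based list)."""
    n = k * k + 11
    weights = [k] * k
    while len(weights) < n:
        weights.append(weights[-1] + weights[-1] // k)  # floor((1 + 1/k) * w)
    return weights


def stage_lengths(k):
    """Map each increment value to the number of requests that receive it."""
    w = weights_for(k)
    increments = [b - a for a, b in zip(w[k - 1:], w[k:])]  # w_{i+1} - w_i for i >= k
    return Counter(increments)
```

For \(k=3\), the increments begin 1, 1, 1, 2, 2, 3, 4, so stage 1 lasts three requests, stage 2 lasts two and stage 3 lasts one, in line with \(\lceil 3/j\rceil \).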

The last weight \(w_n\) is at least the sum of the increments of all these stages:

$$\begin{aligned} w_n\ge k+\sum \limits _{j=1}^k j\lceil k/j\rceil +\sum \limits _{j=1}^{n_2}(k+j). \end{aligned}$$

Thus, \(w_n\ge k(k+1)+k n_2+n_2^2 /2\), since \(j\lceil k/j\rceil \ge k\) for every j and \(\sum _{j=1}^{n_2}(k+j)\ge k n_2+n_2^2/2\). We want to show that \(w_n\ge k n\), which holds if

$$\begin{aligned} k(k+1)+k n_2+n_2^2/2\ge kn. \end{aligned}$$

By replacing \(n_2\) and simplifying, the condition becomes

$$\begin{aligned} n\ge k+n_1+\sqrt{2k(n_1-1)}. \end{aligned} \tag{2}$$

We bound \(n_1\) using the asymptotic expansion of the harmonic number \(H_k\):

$$\begin{aligned} n_1=\sum \limits _{j=1}^k\lceil k/j\rceil &< k\sum \limits _{j=1}^k\frac{1}{j}+k \\ &= k(H_k+1) \\ &< k\left( \log (k)+\gamma +\frac{1}{2k}+1\right) , \end{aligned}$$

where \(\gamma \approx 0.577\) is the Euler–Mascheroni constant.
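A short numerical check can confirm, for a range of values of k, both the bound on \(n_1\) and the fact that the instance size \(n=k^2+11\) satisfies Condition (2). It is a sanity check over finitely many values rather than a proof, and the function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def n1(k):
    """n_1 = sum over j = 1..k of ceil(k / j), using integer arithmetic."""
    return sum(-(-k // j) for j in range(1, k + 1))


def harmonic_bound(k):
    """Upper bound k * (log(k) + gamma + 1/(2k) + 1) on n_1."""
    return k * (math.log(k) + EULER_GAMMA + 1 / (2 * k) + 1)


def condition_2_holds(k, n):
    """Condition (2): n >= k + n_1 + sqrt(2k(n_1 - 1))."""
    m = n1(k)
    return n >= k + m + math.sqrt(2 * k * (m - 1))
```

For instance, \(k=5\) gives \(n_1=13\), so Condition (2) requires \(n\ge 18+\sqrt{120}\approx 28.95\), which \(n=36\) satisfies.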

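Plugging this bound into Condition (2) shows that the instance size \(n=k^2+11\) is sufficient, and hence \(w_n\ge k n\). Thus, the optimal objective is at most \(w_n\) and the one achieved with Max-Flow is at least \(k w_n\), which concludes the proof. \(\square \)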

Appendix B Notations

Table 5 summarizes the notations used in this paper.

Appendix C Approximation

We provide a comprehensive summary of results related to approximation and competitive analysis of scheduling problems that address the minimization of flow time. Table 6 presents competitive analysis results that are related to average weighted flow.

Table 5 List of the most used notations
Table 6 Existing results on average (weighted) flow minimization
Table 7 Complexity of sum-flow minimization problems

Appendix D Complexity

To complete our survey on scheduling problems related to our study, we present a summary on complexity results related to sum-flow in Table 7.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ben Mokhtar, S., Canon, LC., Dugois, A. et al. A scheduling framework for distributed key-value stores and its application to tail latency minimization. J Sched 27, 183–202 (2024). https://doi.org/10.1007/s10951-023-00803-8

  • DOI: https://doi.org/10.1007/s10951-023-00803-8
