Stragglers in Distributed Matrix Multiplication

Conference paper in Job Scheduling Strategies for Parallel Processing (JSSPP 2023)

Abstract

A delay in a single processor may affect an entire system, since the slowest processor typically determines the runtime. Such straggler problems are often mitigated with dynamic load balancing or with redundancy solutions such as task replication. Unfortunately, the former incurs high communication cost, and the latter significantly increases the arithmetic cost and memory footprint, making high resource overhead seem inevitable. Matrix multiplication and other numerical linear algebra kernels typically have structures that allow better straggler management. Redundancy-based solutions tailored for such algorithms often combine codes with the algorithm’s structure. These solutions add fixed overhead costs and may perform worse than the original algorithm when few or no delays occur. We propose a new load-balancing solution tailored for distributed matrix multiplication. Our solution reduces latency overhead by a factor of \(O \left( P/\log {P} \right) \) compared to existing dynamic load-balancing solutions, where P is the number of processors. Our solution outperforms redundancy-based solutions in all parameters: arithmetic cost, bandwidth cost, latency cost, memory footprint, and the number of stragglers it can tolerate. Moreover, our overhead costs depend on the severity of delays and are negligible when delays are minor. We compare our solution with previous ones and demonstrate significant improvements in asymptotic analysis and simulations: up to 4.4x and 5.3x compared to general-purpose dynamic load balancing and redundancy-based solutions, respectively.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 818252). This work was supported by the Federmann Cyber Security Center in conjunction with the Israel National Cyber Directorate. This research was supported by a grant from the United States-Israel Binational Science Foundation (BSF), Jerusalem, Israel.


Notes

  1. This may look similar to the standard model for analyzing the arithmetic cost of an algorithm in a heterogeneous environment (cf. [9]). However, while the heterogeneous model assumes different hardware with stable performance, our version assumes similar hardware with varying performance.

  2. We present here the basic MDS code and the associated overhead costs. Several variations of the MDS solution incur lower overhead costs, for example, by using systematic MDS codes or sub-classes with lower decoding complexity.

  3. Near-perfect load balancing is achieved only when \(\frac{\bar{\gamma }_{a}}{\gamma _{1}} > \rho \); otherwise, processors are expected to have idle times.

References

  1. Acar, U.A., Charguéraud, A., Rainey, M.: Scheduling parallel programs by work stealing with private deques. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 219–228 (2013)

  2. Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)

  3. Ananthanarayanan, G., Ghodsi, A., Shenker, S., Stoica, I.: Effective straggler mitigation: attack of the clones. In: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, pp. 185–198. USENIX Association (2013)

  4. Ananthanarayanan, G., et al.: Reining in the outliers in map-reduce clusters using Mantri. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 265–278. USENIX Association (2010)

  5. Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput. 1(1), 11–33 (2004)

  6. Ballard, G., et al.: Communication optimal parallel multiplication of sparse random matrices. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2013, pp. 222–231. Association for Computing Machinery (2013)

  7. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23, 1–155 (2014)

  8. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Communication-optimal parallel and sequential Cholesky decomposition. SIAM J. Sci. Comput. 32(6), 3495–3523 (2010)

  9. Ballard, G., Demmel, J., Gearhart, A.: Brief announcement: communication bounds for heterogeneous architectures. In: Proceedings of the Twenty-third Annual ACM symposium on Parallelism in Algorithms and Architectures, pp. 257–258 (2011)

  10. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM (JACM) 59(6), 1–23 (2013)

  11. Basermann, A., et al.: Dynamic load-balancing of finite element applications with the DRAMA library. Appl. Math. Model. 25(2), 83–98 (2000)

  12. Berenbrink, P., Friedetzky, T., Goldberg, L.A.: The natural work-stealing algorithm is stable. SIAM J. Comput. 32(5), 1260–1279 (2003)

  13. Birnbaum, N., Schwartz, O.: Fault tolerant resource efficient matrix multiplication. In: Proceedings of the Eighth SIAM Workshop on Combinatorial Scientific Computing 2018. SIAM (2018)

  14. Biswas, R., Das, S., Harvey, D., Oliker, L.: Parallel dynamic load balancing strategies for adaptive irregular applications. Appl. Math. Model. 25(2), 109–122 (2000)

  15. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM (JACM) 46(5), 720–748 (1999)

  16. Boneti, C., Gioiosa, R., Cazorla, F.J., Corbalan, J., Labarta, J., Valero, M.: Balancing HPC applications through smart allocation of resources in MT processors. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–12 (2008)

  17. Boneti, C., Gioiosa, R., Cazorla, F.J., Valero, M.: A dynamic scheduler for balancing HPC applications. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–12 (2008)

  18. Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. Montana State University (1969)

  19. Cappello, F., Geist, A., Gropp, W., Kale, S., Kramer, B., Snir, M.: Toward exascale resilience: 2014 update. Supercomput. Front. Innovations 1(1), 5–28 (2014)

  20. Casanova, H.: Benefits and drawbacks of redundant batch requests. J. Grid Comput. 5(2), 235–250 (2007)

  21. Clarke, D., Lastovetsky, A., Rychkov, V.: Dynamic load balancing of parallel computational iterative routines on highly heterogeneous HPC platforms. Parallel Process. Lett. 21(02), 195–217 (2011)

  22. Dean, J., Barroso, L.: The tail at scale. Commun. ACM 56(2), 74–80 (2013)

  23. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  24. Dinan, J., Larkins, D.B., Sadayappan, P., Krishnamoorthy, S., Nieplocha, J.: Scalable work stealing. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1–11 (2009)

  25. Dutta, S., Cadambe, V., Grover, P.: "Short-Dot": computing large linear transforms distributedly using coded short dot products. IEEE Trans. Inf. Theory 65(10), 6171–6193 (2019)

  26. Gardner, K., Zbarsky, S., Doroudi, S., Harchol-Balter, M., Hyytia, E.: Reducing latency via redundant requests: exact analysis. SIGMETRICS Perform. Eval. Rev. 43(1), 347–360 (2015)

  27. Van de Geijn, R.A., Watts, J.: SUMMA: scalable universal matrix multiplication algorithm. Concurr. Pract. Exper. 9(4), 255–274 (1997)

  28. Gupta, A., Sarood, O., Kale, L.V., Milojicic, D.: Improving HPC application performance in cloud through dynamic load balancing. In: 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 402–409 (2013)

  29. Huang, L., Pawar, S., Zhang, H., Ramchandran, K.: Codes can reduce queueing delay in data centers. In: 2012 IEEE International Symposium on Information Theory Proceedings, pp. 2766–2770 (2012)

  30. Joshi, G., Liu, Y., Soljanin, E.: On the delay-storage trade-off in content download from coded distributed storage systems. IEEE J. Sel. Areas Commun. 32(5), 989–997 (2014)

  31. Joshi, G., Soljanin, E., Wornell, G.: Efficient redundancy techniques for latency reduction in cloud systems. ACM Trans. Model. Perform. Eval. Comput. Syst. 2(2) (2017)

  32. Koanantakool, P., et al.: Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 842–853 (2016)

  33. Kumar, V., Grama, A., Vempaty, N.: Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput. 22(1), 60–79 (1994)

  34. Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., Ramchandran, K.: Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory 64(3), 1514–1529 (2018)

  35. Lee, K., Pedarsani, R., Papailiopoulos, D., Ramchandran, K.: Coded computation for multicore setups. In: 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2413–2417 (2017)

  36. Lee, K., Suh, C., Ramchandran, K.: High-dimensional coded matrix multiplication. In: 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2418–2422 (2017)

  37. Li, S., Maddah-Ali, M.A., Avestimehr, A.S.: Coded MapReduce. In: 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 964–971 (2015)

  38. Li, S., Maddah-Ali, M.A., Yu, Q., Avestimehr, A.S.: A fundamental tradeoff between computation and communication in distributed computing. IEEE Trans. Inf. Theory 64(1), 109–128 (2018)

  39. Li, S., Supittayapornpong, S., Maddah-Ali, M.A., Avestimehr, S.: Coded TeraSort. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 389–398 (2017)

  40. Luby, M.: LT codes. In: Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, pp. 271–280. IEEE (2002)

  41. Mallick, A., Chaudhari, M., Sheth, U., Palanikumar, G., Joshi, G.: Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication. Proc. ACM Meas. Anal. Comput. Syst. 3(3) (2019)

  42. Mallick, A., Chaudhari, M., Sheth, U., Palanikumar, G., Joshi, G.: Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication. Commun. ACM 65(5), 111–118 (2022)

  43. Márquez, C., César, E., Sorribes, J.: Graph-based automatic dynamic load balancing for HPC agent-based simulations. In: Hunold, S., et al. (eds.) Euro-Par 2015. LNCS, vol. 9523, pp. 405–416. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27308-2_33

  44. McColl, W.F., Tiskin, A.: Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3), 287–297 (1999)

  45. Menon, H., Acun, B., De Gonzalo, S.G., Sarood, O., Kalé, L.: Thermal aware automated load balancing for HPC applications. In: 2013 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–8 (2013)

  46. Michael, M.M., Vechev, M.T., Saraswat, V.A.: Idempotent work stealing. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 45–54 (2009)

  47. Mitzenmacher, M.: Analyses of load stealing models based on differential equations. In: Proceedings of the Tenth ACM Symposium on Parallel Algorithms and Architectures, pp. 212–221 (1998)

  48. Reed, I.S., Solomon, G.: Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math. 8(2), 300–304 (1960)

  49. Reisizadeh, A., Prakash, S., Pedarsani, R., Avestimehr, A.S.: Coded computation over heterogeneous clusters. IEEE Trans. Inf. Theory 65(7), 4227–4242 (2019)

  50. Said, S.A., Habashy, S.M., Salem, S.A., Saad, E.M.: An optimized straggler mitigation framework for large-scale distributed computing systems. IEEE Access 10 (2022)

  51. Sanders, P., Sibeyn, J.F.: A bandwidth latency tradeoff for broadcast and reduction. Inf. Process. Lett. 86(1), 33–38 (2003)

  52. Severinson, A., Graell i Amat, A., Rosnes, E.: Block-diagonal and LT codes for distributed computing with straggling servers. IEEE Trans. Commun. 67(3), 1739–1753 (2019)

  53. Singleton, R.: Maximum distance q-nary codes. IEEE Trans. Inf. Theory 10(2), 116–118 (1964)

  54. Sinha, A.B., Kale, L.V.: A load balancing strategy for prioritized execution of tasks. In: Proceedings of the Seventh International Parallel Processing Symposium, pp. 230–237 (1993)

  55. Snir, M., et al.: Addressing failures in exascale computing. Int. J. High Perform. Comput. Appl. 28(2), 129–173 (2014)

  56. Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011. LNCS, vol. 6853, pp. 90–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23397-5_10

  57. Son, K., Choi, W.: Distributed matrix multiplication based on frame quantization for straggler mitigation. IEEE Trans. Signal Process. 70 (2022)

  58. Tandon, R., Lei, Q., Dimakis, A., Karampatziakis, N.: Gradient coding: avoiding stragglers in distributed learning. In: Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3368–3376. PMLR (2017)

  59. Tchiboukdjian, M., Gast, N., Trystram, D., Roch, J.L., Bernard, J.: A tighter analysis of work stealing. In: Cheong, O., Chwa, K.Y., Park, K. (eds.) Algorithms and Computation, pp. 291–302. Springer, Berlin (2010). https://doi.org/10.1007/978-3-642-17514-5_25

  60. Tumanov, A., Cipar, J., Ganger, G.R., Kozuch, M.A.: alsched: algebraic scheduling of mixed workloads in heterogeneous clouds. In: Proceedings of the Third ACM Symposium on Cloud Computing, pp. 1–7 (2012)

  61. Van Nieuwpoort, R.V., Kielmann, T., Bal, H.E.: Efficient load balancing for wide-area divide-and-conquer applications. In: Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 34–43 (2001)

  62. Vulimiri, A., Godfrey, P., Mittal, R., Sherry, J., Ratnasamy, S., Shenker, S.: Low latency via redundancy. In: Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT 2013, pp. 283–294. Association for Computing Machinery (2013)

  63. Wang, D., Joshi, G., Wornell, G.: Efficient task replication for fast response times in parallel computation. In: The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 599–600. Association for Computing Machinery (2014)

  64. Wang, D., Joshi, G., Wornell, G.: Using straggler replication to reduce latency in large-scale parallel computing. SIGMETRICS Perform. Eval. Rev. 43(3), 7–11 (2015)

  65. Wang, S., Liu, J., Shroff, N.: Coded sparse matrix multiplication. In: Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 5152–5160. PMLR (2018)

  66. Wimmer, M., Cederman, D., Träff, J.L., Tsigas, P.: Work-stealing with configurable scheduling strategies. ACM SIGPLAN Notices 48(8), 315–316 (2013)

  67. Yang, C., Miller, B.P.: Critical path analysis for the execution of parallel and distributed programs. In: Proceedings of the 8th International Conference on Distributed Computing Systems, pp. 366–373 (1988)

  68. Yang, J., He, Q.: Scheduling parallel computations by work stealing: a survey. Int. J. Parallel Program. 46(2) (2018)

  69. Yu, Q., Maddah-Ali, M., Avestimehr, S.: Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4403–4413. Curran Associates, Inc. (2017)

Author information

Correspondence to Roy Nissim.

A Existing Solutions

In this section, we provide an analysis of existing straggler mitigation solutions.

A.1 Dynamic Load Balancing

We review receiver-initiated load-balancing algorithms (also called work-stealing), which often perform better in decentralized models such as ours. These solutions typically share the following structure: when a target processor receives a work request, it acts in one of two ways. If it has more than \(s\) tasks, it transfers a fraction \(\delta \) of its work to the requesting processor; if it has fewer than \(s\) tasks, it rejects the request, which is then passed on to the next candidate. Choosing targets uniformly at random achieves the optimal asymptotic costs (see [59] for further details).
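
The following sketch illustrates this request-handling rule. It is a minimal illustration, not the paper's implementation; the threshold value, the fraction \(\delta \), and all names are assumptions.

    import random
    from collections import deque

    S_THRESHOLD = 4   # the threshold s (assumed value)
    DELTA = 0.5       # the donated fraction delta (assumed value)

    def handle_work_request(queue: deque):
        """A target with more than s tasks donates a delta-fraction of its queue;
        otherwise it rejects, and the requester moves on to the next candidate."""
        if len(queue) > S_THRESHOLD:
            k = int(len(queue) * DELTA)
            return [queue.pop() for _ in range(k)]  # donated tasks
        return None  # rejection

    def pick_target(self_id: int, num_procs: int) -> int:
        """Random target selection, which achieves the optimal asymptotic costs [59]."""
        target = random.randrange(num_procs)
        while target == self_id:
            target = random.randrange(num_procs)
        return target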

Claim

Let \(\gamma _{P} \cdot F,\ BW,\ L\), and M denote an algorithm’s arithmetic cost, bandwidth cost, latency cost, and memory footprint, respectively. Let \({F}^{'},\ {BW}^{'},\ {L}^{'}\), and \(M^{'}\) denote the costs of the algorithm applying the work-stealing solution on C. Then \({F}^{'} = \bar{\gamma }_{a} \cdot (1+\frac{P}{S}) \cdot F\), \({BW}^{'} = BW + M\), \({L}^{'} = L + O \left( P\log {P} \right) \), and \(M^{'} = M\).

Proof

We follow the proof of Theorem 1 with slight modifications. The main difference between SLB and work-stealing is in the number of task requests performed, which affects the latency cost. According to Theorem 2 in [59], the work-stealing technique is expected to perform \(O \left( P\log {S} \right) \) work requests. Similarly to SLB, we bound S by a polynomial function of P (by grouping tasks together) so that \(\log {S} = O \left( \log {P} \right) \). In total, \({F}^{'} = \bar{\gamma }_{a} \cdot (1+\frac{P}{S}) \cdot F\), \({BW}^{'} = BW + O \left( M \right) \), \({L}^{'} = L + O \left( P\log {P} \right) \), and \(M^{'} = M\).
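
For instance, grouping tasks so that \(S = P^{2}\) (an illustrative choice; any polynomial in P behaves similarly) gives

\[
{F}^{'} = \bar{\gamma }_{a} \cdot \left( 1+\frac{1}{P} \right) \cdot F, \qquad {L}^{'} = L + O \left( P\log {P} \right),
\]

so the arithmetic overhead vanishes as P grows, while the \(O \left( P\log {P} \right) \) latency term remains the dominant overhead, which the solution proposed in this paper reduces by a factor of \(O \left( P/\log {P} \right) \).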

A.2 Redundancy

Here, we compare our solution to three commonly used erasure code based solutions: replication, MDS, and LT.

Replication. In the Replication solution, the algorithm divides the processors into P/r groups of r processors, where processors within the same group perform the exact same computations. The algorithm constructs the final output using the results of the fastest processor in each group.
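
A minimal sketch of the grouping and halting rule; the simulated finish times and all names here are illustrative assumptions.

    import random

    def replication_groups(P: int, r: int):
        """Partition processors 0..P-1 into P/r groups of r processors; every
        member of a group is assigned the same computation (assumes r divides P)."""
        assert P % r == 0
        return [list(range(g * r, (g + 1) * r)) for g in range(P // r)]

    # Toy run with P = 8 and r = 2: the output is assembled from the
    # fastest finisher in each group, masking one straggler per group.
    finish_time = {p: random.random() for p in range(8)}  # simulated completion times
    fastest = [min(group, key=finish_time.get) for group in replication_groups(8, 2)]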

Claim

Let \(\gamma _{P} \cdot F,\ BW,\ L\), and M denote an algorithm’s arithmetic cost, bandwidth cost, latency cost, and memory footprint, respectively. Let \({F}^{'},\ {BW}^{'},\ {L}^{'}\), and \(M^{'}\) denote the costs of the algorithm applying the replication solution on C. Then \({F}^{'} = \gamma _{P-r+1} \cdot r \cdot F\), \({BW}^{'} = BW + 2r\cdot M\), \({L}^{'} = L + O \left( \log {r} \right) \), and \(M^{'} = r\cdot M\).

Proof

Since the workload is shared among P/r processors (rather than P), each processor computes a factor of r more computations and uses a factor of r more memory. The algorithm uses an all-broadcast operator to share the input, and a scatter operator to share the output. This costs \(2\log {r}\) messages and \(2r \cdot M\) words. The algorithm halts when the first processor from each group completes its tasks. In the worst-case scenario, the algorithm halts when \(P-r+1\) processors have completed their tasks. Hence, the arithmetic cost is \(\gamma _{P-r+1} \cdot r \cdot F\). Summing up, we obtain \({F}^{'} = \gamma _{P-r+1} \cdot r \cdot F\), \({BW}^{'} = BW + 2r\cdot M\), \({L}^{'} = L + O \left( \log {r} \right) \), and \(M^{'} = r \cdot M\).
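
For concreteness, substituting \(r = 2\) into the claim (a worked instance, not stated in the paper) gives

\begin{align*}
{F}^{'} &= \gamma _{P-1} \cdot 2F, & {BW}^{'} &= BW + 4M,\\
{L}^{'} &= L + O \left( \log {2} \right) = L + O(1), & M^{'} &= 2M,
\end{align*}

i.e., each processor doubles its work and memory, and each pair tolerates a single straggler.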

Erasure Codes. An erasure code (cf. [48, 53]) is a linear transformation T that takes a vector v of size \(x_{1}\) and outputs a vector w of size \(x_{2}\), where \(x_{1}\), \(x_{2}\), and \(\rho = \frac{x_{2}}{x_{1}}\) are called the rank, length, and rate of the code, respectively. We represent a code T by a generator matrix G of size \(x_{2} \times x_{1}\), where applying the code is equivalent to multiplying the input vector from the left by the matrix G. The generator matrix of a replication code is an identity matrix in which each row is duplicated r times. We say the code has distance d if any two code vectors differ in at least d coordinates. A code with distance d can recover from \(d-1\) erasures. Maximum Distance Separable (MDS) codes are a family of codes with maximal distance.
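
As a toy instance of these definitions (not taken from the paper), the \(r = 2\) replication code with rank \(x_{1} = 2\), length \(x_{2} = 4\), and rate \(\rho = 2\) has the generator matrix

\[
G = \begin{pmatrix} 1 & 0\\ 1 & 0\\ 0 & 1\\ 0 & 1 \end{pmatrix},
\qquad
G \begin{pmatrix} v_{1}\\ v_{2} \end{pmatrix} = \begin{pmatrix} v_{1}\\ v_{1}\\ v_{2}\\ v_{2} \end{pmatrix},
\]

and any two code vectors differ in at least \(d = 2\) coordinates, so a single erasure can be recovered.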

Random codes incorporate randomness into the construction of the generator matrix. Luby Transform (LT) codes [40] are a family of random codes that obey the following conditions: (I) the entries of the generator matrix are either zero or one; (II) the density of each row (its number of non-zero elements) is sampled randomly from some distribution; (III) the locations of the non-zeros are sampled uniformly. A popular choice for the density distribution is the Robust Soliton degree distribution [40].
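
The sketch below samples one LT generator-matrix row under conditions I-III using the Robust Soliton distribution [40]; the parameters c and delta are common tuning choices, not values from the paper.

    import math
    import random

    def robust_soliton(x: int, c: float = 0.1, delta: float = 0.05):
        """Robust Soliton degree distribution over degrees 0..x (weight 0 at degree 0);
        c and delta are assumed tuning parameters."""
        R = c * math.log(x / delta) * math.sqrt(x)
        rho = [0.0, 1.0 / x] + [1.0 / (i * (i - 1)) for i in range(2, x + 1)]
        tau = [0.0] * (x + 1)
        pivot = int(round(x / R))
        for i in range(1, min(pivot, x + 1)):
            tau[i] = R / (i * x)          # extra weight on low degrees
        if 1 <= pivot <= x:
            tau[pivot] = R * math.log(R / delta) / x  # spike at degree x/R
        beta = sum(rho) + sum(tau)        # normalization constant
        return [(rho[i] + tau[i]) / beta for i in range(x + 1)]

    def sample_lt_row(x: int, dist) -> list:
        """One generator-matrix row: a random density d (condition II), then
        d ones at uniformly random positions (conditions I and III)."""
        d = random.choices(range(x + 1), weights=dist)[0]
        row = [0] * x
        for j in random.sample(range(x), d):
            row[j] = 1
        return row

    row = sample_lt_row(100, robust_soliton(100))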

MDS. Given an MDS code with rank K and length P, the algorithm partitions the problem into K tasks and uses the code to construct P new tasks in place of the original ones. The \(i\)-th processor computes the \(i\)-th task, and the final output is produced from the outcomes of the first K processors to finish.
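
A sketch of this encode/decode step over the reals, using a Vandermonde generator matrix (any K of its P rows form an invertible matrix, which is exactly the MDS property); the sizes and names here are illustrative assumptions.

    import numpy as np

    K, P = 4, 6                              # assumed: K source tasks, P encoded tasks
    rng = np.random.default_rng(0)
    tasks = rng.standard_normal((K, 3))      # K task payloads (e.g., flattened blocks)

    # Vandermonde rows at distinct nodes: any K rows are invertible.
    G = np.vander(np.arange(1.0, P + 1), K, increasing=True)
    encoded = G @ tasks                      # processor i works on encoded row i

    finishers = [0, 2, 3, 5]                 # the first K processors to finish
    recovered = np.linalg.solve(G[finishers], encoded[finishers])
    assert np.allclose(recovered, tasks)     # output rebuilt from any K results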

Claim

Let \(\gamma _{P} \cdot F,\ BW,\ L\), and M denote an algorithm’s arithmetic cost, bandwidth cost, latency cost, and memory footprint, respectively. Let \({F}^{'},\ {BW}^{'},\ {L}^{'}\), and \(M^{'}\) denote the costs of the algorithm applying the MDS solution (see Note 2) on C. Then \({F}^{'} = \gamma _{K} \cdot O \left( \frac{P}{K} \cdot F \right) \), \({BW}^{'} = BW + 2P\cdot M\), \({L}^{'} = L + O \left( P \right) \), and \(M^{'} = \frac{P}{K} \cdot M\).

Proof

Input (resp. output) redistribution involves code encoding (resp. decoding) and an all-reduce operation. By Corollary 1, this adds \(2P \cdot M\) bandwidth cost, \(O \left( P \right) \) latency cost, and an arithmetic cost, which is often negligible. Moreover, each processor performs a factor of \(\frac{P}{K}\) more arithmetic computations (since the workload is distributed among K processors instead of among P) and uses a factor of \(\frac{P}{K}\) more memory. The algorithm halts when K processors have completed their tasks. Thus, the arithmetic cost is \(\gamma _{K} \cdot O \left( \frac{P}{K} \cdot F \right) \).
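
As a worked instance (not stated in the paper), taking \(K = P/2\) yields

\[
{F}^{'} = \gamma _{P/2} \cdot O \left( 2F \right), \qquad M^{'} = 2M,
\]

so the algorithm tolerates up to \(P - K = P/2\) stragglers at the price of twice the work and memory per processor.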

LT. Mallick et al. [41] proposed a new variation of the LT coding solution (denoted here as LT+) that utilizes partial computations performed by all processors and attains near-ideal load balancing. In the LT+ solution, each processor broadcasts an update message every time it completes a task. The algorithm generates \(\rho \cdot S\) new tasks (in place of the originals) using the LT code and halts when enough tasks have been completed.
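
A sketch of the LT+ halting rule; recovery_threshold follows the bound of Remark 1 and Claim 1 with the hidden constants assumed to be 1, and the completion stream stands in for the broadcast update messages.

    import itertools
    import math
    from typing import Iterator

    def recovery_threshold(S: int) -> int:
        """Completed-task count expected to suffice for decoding:
        O(S + sqrt(S) * log^2(S)); hidden constants assumed to be 1."""
        return int(S + math.sqrt(S) * math.log(S) ** 2)

    def run_until_decodable(completions: Iterator, S: int) -> int:
        """Count broadcast completion updates (each an O(log P)-latency
        broadcast) until enough coded tasks are done to decode."""
        done = 0
        for _ in completions:
            done += 1
            if done >= recovery_threshold(S):
                break
        return done

    # Toy usage: an unbounded stream of completion events.
    print(run_until_decodable(itertools.count(), S=1024))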

Remark 1

Using the Robust Soliton degree distribution [40], encoding and decoding a code of size x cost \(O(x\log {\frac{x}{\epsilon }})\) and succeed with probability \(1-\epsilon \). Any \(x + \sqrt{x}\cdot \log ^{2}{\frac{x}{\epsilon }}\) symbols are sufficient to construct the output.

Claim

Let \(\gamma _{P} \cdot F,\ BW,\ L\), and M denote an algorithm’s arithmetic cost, bandwidth cost, latency cost, and memory footprint, respectively, and let \({F}^{'},\ {BW}^{'},\ {L}^{'}\), and \(M^{'}\) denote the costs of the algorithm applying the LT+ solution to C. Then \({F}^{'} = \bar{\gamma }_{a} \cdot O \left( 1 + \frac{\log ^{2}{S}}{\sqrt{S}} \right) \cdot F\), \({BW}^{'} = BW + 2P \cdot M\), \({L}^{'} = L + O \left( S\cdot \log {P} \right) \), and \(M^{'} = \rho \cdot M\).

Proof

Input and output distribution phases are similar to the MDS solution but with different weights. Since both use an all-reduce operation, the overhead costs remain the same. The algorithm constructs the final output when the processors have globally completed a sufficient number of tasks (maintaining near-perfect load balancing; see Note 3). The number of completed tasks expected to be sufficient for recovery is \(O \left( S + \sqrt{S}\log ^{2}{S} \right) \) (Claim 1), which means that each processor is expected to perform a factor of \(O \left( 1 + \frac{\log ^{2}{S}}{\sqrt{S}} \right) \) more computations compared to the non-mitigated algorithm. Notifying the processors when each task is completed adds \(S \cdot P\) bandwidth and \(O \left( S \cdot \log {P} \right) \) latency (Corollary 1). The algorithm generates a factor of \(\rho \) more tasks and thus uses a factor of \(\rho \) more memory. Summing up, the additional costs are: \({F}^{'} = \bar{\gamma }_{a} \cdot O \left( 1 + \frac{\log ^{2}{S}}{\sqrt{S}} \right) \cdot F\), \({BW}^{'} = BW + 2P\cdot M\), \({L}^{'} = L + O \left( S\log {P} \right) \), and \(M^{'} = \rho \cdot M\).


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nissim, R., Schwartz, O. (2023). Stragglers in Distributed Matrix Multiplication. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2023. Lecture Notes in Computer Science, vol 14283. Springer, Cham. https://doi.org/10.1007/978-3-031-43943-8_4

  • DOI: https://doi.org/10.1007/978-3-031-43943-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43942-1

  • Online ISBN: 978-3-031-43943-8
