Stragglers in Distributed Matrix Multiplication

Conference paper in Job Scheduling Strategies for Parallel Processing (JSSPP 2023)

Abstract

A delay in a single processor may affect an entire system, since the slowest processor typically determines the runtime. Such straggler problems are often mitigated with dynamic load balancing or with redundancy solutions such as task replication. Unfortunately, the former incurs high communication cost, and the latter significantly increases the arithmetic cost and memory footprint, making high resource overhead seem inevitable. Matrix multiplication and other numerical linear algebra kernels typically have structures that allow better straggler management. Redundancy-based solutions tailored for such algorithms often combine codes with the algorithm’s structure. These solutions add fixed overhead costs and may perform worse than the original algorithm when few or no delays occur. We propose a new load-balancing solution tailored for distributed matrix multiplication. Our solution reduces latency overhead by a factor of \(O \left( P/\log {P} \right) \) compared to existing dynamic load-balancing solutions, where P is the number of processors. Our solution outperforms redundancy-based solutions in all parameters: arithmetic cost, bandwidth cost, latency cost, memory footprint, and the number of stragglers it can tolerate. Moreover, our overhead costs depend on the severity of delays and are negligible when delays are minor. We compare our solution with previous ones and demonstrate significant improvements in asymptotic analysis and simulations: up to 4.4x and 5.3x compared to general-purpose dynamic load balancing and redundancy-based solutions, respectively.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 818252). This work was supported by the Federmann Cyber Security Center in conjunction with the Israel National Cyber Directorate. This research was supported by a grant from the United States-Israel Binational Science Foundation (BSF), Jerusalem, Israel.


Notes

  1. This may look similar to the standard model for analyzing the arithmetic cost of an algorithm in a heterogeneous environment (cf. [9]). However, while the heterogeneous model assumes different hardware with stable performance, our version assumes similar hardware with varying performance.

  2. We present here the basic MDS code and the associated overhead costs. Several variations of the MDS solution incur lower overhead costs, for example, by using systematic MDS codes or sub-classes with lower decoding complexity.

  3. Near-perfect load balancing is achieved only when \(\frac{\bar{\gamma }_{a}}{\gamma _{1}} > \rho \); otherwise, processors are expected to have idle times.

References

  1. Acar, U.A., Charguéraud, A., Rainey, M.: Scheduling parallel programs by work stealing with private deques. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 219–228 (2013)

  2. Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)

  3. Ananthanarayanan, G., Ghodsi, A., Shenker, S., Stoica, I.: Effective straggler mitigation: attack of the clones. In: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, pp. 185–198. USENIX Association (2013)

  4. Ananthanarayanan, G., et al.: Reining in the outliers in map-reduce clusters using Mantri. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, pp. 265–278. USENIX Association (2010)

  5. Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput. 1(1), 11–33 (2004)

  6. Ballard, G., et al.: Communication optimal parallel multiplication of sparse random matrices. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2013, pp. 222–231. Association for Computing Machinery (2013)

  7. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numer. 23, 1–155 (2014)

  8. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Communication-optimal parallel and sequential Cholesky decomposition. SIAM J. Sci. Comput. 32(6), 3495–3523 (2010)

  9. Ballard, G., Demmel, J., Gearhart, A.: Brief announcement: communication bounds for heterogeneous architectures. In: Proceedings of the Twenty-third Annual ACM symposium on Parallelism in Algorithms and Architectures, pp. 257–258 (2011)

  10. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM (JACM) 59(6), 1–23 (2013)

  11. Basermann, A., et al.: Dynamic load-balancing of finite element applications with the DRAMA library. Appl. Math. Model. 25(2), 83–98 (2000)

  12. Berenbrink, P., Friedetzky, T., Goldberg, L.A.: The natural work-stealing algorithm is stable. SIAM J. Comput. 32(5), 1260–1279 (2003)

  13. Birnbaum, N., Schwartz, O.: Fault tolerant resource efficient matrix multiplication. In: Proceedings of the Eighth SIAM Workshop on Combinatorial Scientific Computing 2018. SIAM (2018)

  14. Biswas, R., Das, S., Harvey, D., Oliker, L.: Parallel dynamic load balancing strategies for adaptive irregular applications. Appl. Math. Model. 25(2), 109–122 (2000)

  15. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work stealing. J. ACM (JACM) 46(5), 720–748 (1999)

  16. Boneti, C., Gioiosa, R., Cazorla, F.J., Corbalan, J., Labarta, J., Valero, M.: Balancing HPC applications through smart allocation of resources in MT processors. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–12 (2008)

  17. Boneti, C., Gioiosa, R., Cazorla, F.J., Valero, M.: A dynamic scheduler for balancing HPC applications. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–12 (2008)

  18. Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. Montana State University (1969)

  19. Cappello, F., Geist, A., Gropp, W., Kale, S., Kramer, B., Snir, M.: Toward exascale resilience: 2014 update. Supercomput. Front. Innovations 1(1), 5–28 (2014)

  20. Casanova, H.: Benefits and drawbacks of redundant batch requests. J. Grid Comput. 5(2), 235–250 (2007)

  21. Clarke, D., Lastovetsky, A., Rychkov, V.: Dynamic load balancing of parallel computational iterative routines on highly heterogeneous HPC platforms. Parallel Process. Lett. 21(02), 195–217 (2011)

  22. Dean, J., Barroso, L.: The tail at scale. Commun. ACM 56(2), 74–80 (2013)

  23. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

  24. Dinan, J., Larkins, D.B., Sadayappan, P., Krishnamoorthy, S., Nieplocha, J.: Scalable work stealing. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1–11 (2009)

  25. Dutta, S., Cadambe, V., Grover, P.: "Short-Dot": computing large linear transforms distributedly using coded short dot products. IEEE Trans. Inf. Theory 65(10), 6171–6193 (2019)

  26. Gardner, K., Zbarsky, S., Doroudi, S., Harchol-Balter, M., Hyytia, E.: Reducing latency via redundant requests: exact analysis. SIGMETRICS Perform. Eval. Rev. 43(1), 347–360 (2015)

  27. Van de Geijn, R.A., Watts, J.: SUMMA: scalable universal matrix multiplication algorithm. Concurr. Pract. Exper. 9(4), 255–274 (1997)

  28. Gupta, A., Sarood, O., Kale, L.V., Milojicic, D.: Improving HPC application performance in cloud through dynamic load balancing. In: 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 402–409 (2013)

  29. Huang, L., Pawar, S., Zhang, H., Ramchandran, K.: Codes can reduce queueing delay in data centers. In: 2012 IEEE International Symposium on Information Theory Proceedings, pp. 2766–2770 (2012)

  30. Joshi, G., Liu, Y., Soljanin, E.: On the delay-storage trade-off in content download from coded distributed storage systems. IEEE J. Sel. Areas Commun. 32(5), 989–997 (2014)

  31. Joshi, G., Soljanin, E., Wornell, G.: Efficient redundancy techniques for latency reduction in cloud systems. ACM Trans. Model. Perform. Eval. Comput. Syst. 2(2) (2017)

  32. Koanantakool, P., et al.: Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In: 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 842–853 (2016)

  33. Kumar, V., Grama, A., Vempaty, N.: Scalable load balancing techniques for parallel computers. J. Parallel Distrib. Comput. 22(1), 60–79 (1994)

  34. Lee, K., Lam, M., Pedarsani, R., Papailiopoulos, D., Ramchandran, K.: Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory 64(3), 1514–1529 (2018)

  35. Lee, K., Pedarsani, R., Papailiopoulos, D., Ramchandran, K.: Coded computation for multicore setups. In: 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2413–2417 (2017)

  36. Lee, K., Suh, C., Ramchandran, K.: High-dimensional coded matrix multiplication. In: 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2418–2422 (2017)

  37. Li, S., Maddah-Ali, M.A., Avestimehr, A.S.: Coded MapReduce. In: 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 964–971 (2015)

  38. Li, S., Maddah-Ali, M.A., Yu, Q., Avestimehr, A.S.: A fundamental tradeoff between computation and communication in distributed computing. IEEE Trans. Inf. Theory 64(1), 109–128 (2018)

  39. Li, S., Supittayapornpong, S., Maddah-Ali, M.A., Avestimehr, S.: Coded TeraSort. In: 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 389–398 (2017)

  40. Luby, M.: LT codes. In: Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, pp. 271–280. IEEE (2002)

  41. Mallick, A., Chaudhari, M., Sheth, U., Palanikumar, G., Joshi, G.: Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication. Proc. ACM Meas. Anal. Comput. Syst. 3(3) (2019)

  42. Mallick, A., Chaudhari, M., Sheth, U., Palanikumar, G., Joshi, G.: Rateless codes for near-perfect load balancing in distributed matrix-vector multiplication. Commun. ACM 65(5), 111–118 (2022)

  43. Márquez, C., César, E., Sorribes, J.: Graph-based automatic dynamic load balancing for HPC agent-based simulations. In: Hunold, S., et al. (eds.) Euro-Par 2015. LNCS, vol. 9523, pp. 405–416. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-27308-2_33

  44. McColl, W.F., Tiskin, A.: Memory-efficient matrix multiplication in the BSP model. Algorithmica 24(3), 287–297 (1999)

  45. Menon, H., Acun, B., De Gonzalo, S.G., Sarood, O., Kalé, L.: Thermal aware automated load balancing for HPC applications. In: 2013 IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–8 (2013)

  46. Michael, M.M., Vechev, M.T., Saraswat, V.A.: Idempotent work stealing. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 45–54 (2009)

  47. Mitzenmacher, M.: Analyses of load stealing models based on differential equations. In: Proceedings of the Tenth ACM Symposium on Parallel Algorithms and Architectures, pp. 212–221 (1998)

  48. Reed, I.S., Solomon, G.: Polynomial codes over certain finite fields. J. Soc. Ind. Appl. Math. 8(2), 300–304 (1960)

  49. Reisizadeh, A., Prakash, S., Pedarsani, R., Avestimehr, A.S.: Coded computation over heterogeneous clusters. IEEE Trans. Inf. Theory 65(7), 4227–4242 (2019)

  50. Said, S.A., Habashy, S.M., Salem, S.A., Saad, E.M.: An optimized straggler mitigation framework for large-scale distributed computing systems. IEEE Access 10 (2022)

  51. Sanders, P., Sibeyn, J.F.: A bandwidth latency tradeoff for broadcast and reduction. Inf. Process. Lett. 86(1), 33–38 (2003)

  52. Severinson, A., Graell i Amat, A., Rosnes, E.: Block-diagonal and LT codes for distributed computing with straggling servers. IEEE Trans. Commun. 67(3), 1739–1753 (2019)

  53. Singleton, R.: Maximum distance q-nary codes. IEEE Trans. Inf. Theory 10(2), 116–118 (1964)

  54. Sinha, A.B., Kale, L.V.: A load balancing strategy for prioritized execution of tasks. In: Proceedings of the Seventh International Parallel Processing Symposium, pp. 230–237 (1993)

  55. Snir, M., et al.: Addressing failures in exascale computing. Int. J. High Perform. Comput. Appl. 28(2), 129–173 (2014)

  56. Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms. In: Jeannot, E., Namyst, R., Roman, J. (eds.) Euro-Par 2011. LNCS, vol. 6853, pp. 90–109. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23397-5_10

  57. Son, K., Choi, W.: Distributed matrix multiplication based on frame quantization for straggler mitigation. IEEE Trans. Signal Process. 70 (2022)

  58. Tandon, R., Lei, Q., Dimakis, A., Karampatziakis, N.: Gradient coding: avoiding stragglers in distributed learning. In: Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 3368–3376. PMLR (2017)

  59. Tchiboukdjian, M., Gast, N., Trystram, D., Roch, J.L., Bernard, J.: A tighter analysis of work stealing. In: Cheong, O., Chwa, K.Y., Park, K. (eds.) Algorithms and Computation, pp. 291–302. Springer, Berlin (2010). https://doi.org/10.1007/978-3-642-17514-5_25

  60. Tumanov, A., Cipar, J., Ganger, G.R., Kozuch, M.A.: alsched: algebraic scheduling of mixed workloads in heterogeneous clouds. In: Proceedings of the Third ACM Symposium on Cloud Computing, pp. 1–7 (2012)

  61. Van Nieuwpoort, R.V., Kielmann, T., Bal, H.E.: Efficient load balancing for wide-area divide-and-conquer applications. In: Proceedings of the Eighth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 34–43 (2001)

  62. Vulimiri, A., Godfrey, P., Mittal, R., Sherry, J., Ratnasamy, S., Shenker, S.: Low latency via redundancy. In: Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT 2013, pp. 283–294. Association for Computing Machinery (2013)

  63. Wang, D., Joshi, G., Wornell, G.: Efficient task replication for fast response times in parallel computation. In: The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 2014, pp. 599–600. Association for Computing Machinery (2014)

  64. Wang, D., Joshi, G., Wornell, G.: Using straggler replication to reduce latency in large-scale parallel computing. SIGMETRICS Perform. Eval. Rev. 43(3), 7–11 (2015)

  65. Wang, S., Liu, J., Shroff, N.: Coded sparse matrix multiplication. In: Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 5152–5160. PMLR (2018)

  66. Wimmer, M., Cederman, D., Träff, J.L., Tsigas, P.: Work-stealing with configurable scheduling strategies. ACM SIGPLAN Notices 48(8), 315–316 (2013)

  67. Yang, C., Miller, B.P.: Critical path analysis for the execution of parallel and distributed programs. In: Proceedings of the 8th International Conference on Distributed Computing Systems, pp. 366–373 (1988)

  68. Yang, J., He, Q.: Scheduling parallel computations by work stealing: a survey. Int. J. Parallel Program. 46(2) (2018)

  69. Yu, Q., Maddah-Ali, M., Avestimehr, S.: Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4403–4413. Curran Associates, Inc. (2017)

Author information

Correspondence to Roy Nissim.

A Existing Solutions

In this section, we provide an analysis of existing straggler mitigation solutions.

A.1 Dynamic Load Balancing

We review receiver-initiated load-balancing algorithms (also called work-stealing), which often perform better in decentralized models such as ours. These solutions typically share the following structure: when a target processor receives a work request, it acts in one of two ways. If it has more than \(s\) tasks, it transfers a fraction \(\delta \) of its work to the requesting processor; if it has fewer than \(s\) tasks, it rejects the request, which is then passed on to the next candidate. Choosing targets uniformly at random achieves the optimal asymptotic costs (see [59] for further details).
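
The following sketch illustrates this request-handling rule. It is a minimal illustration, not the paper's implementation; the threshold value, the fraction \(\delta \), and all names are assumptions.

    import random
    from collections import deque

    S_THRESHOLD = 4   # the threshold s (assumed value)
    DELTA = 0.5       # the donated fraction delta (assumed value)

    def handle_work_request(queue: deque):
        """A target with more than s tasks donates a delta-fraction of its queue;
        otherwise it rejects, and the requester moves on to the next candidate."""
        if len(queue) > S_THRESHOLD:
            k = int(len(queue) * DELTA)
            return [queue.pop() for _ in range(k)]  # donated tasks
        return None  # rejection

    def pick_target(self_id: int, num_procs: int) -> int:
        """Random target selection, which achieves the optimal asymptotic costs [59]."""
        target = random.randrange(num_procs)
        while target == self_id:
            target = random.randrange(num_procs)
        return target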

Claim

Let \(\gamma _{P} \cdot F,\ BW,\ L\), and M denote an algorithm’s arithmetic cost, bandwidth cost, latency cost, and memory footprint, respectively. Let \({F}^{'},\ {BW}^{'},\ {L}^{'}\), and \(M^{'}\) denote the costs of the algorithm applying the work-stealing solution on C. Then \({F}^{'} = \bar{\gamma }_{a} \cdot (1+\frac{P}{S}) \cdot F\), \({BW}^{'} = BW + M\), \({L}^{'} = L + O \left( P\log {P} \right) \), and \(M^{'} = M\).

Proof

We follow the proof of Theorem 1 with slight modifications. The main difference between SLB and work-stealing is in the number of task requests performed, which affects the latency cost. According to Theorem 2 in [59], the work-stealing technique is expected to perform \(O \left( P\log {S} \right) \) work requests. Similarly to SLB, we bound S by a polynomial function of P (by grouping tasks together) so that \(\log {S} = O \left( \log {P} \right) \). In total, \({F}^{'} = \bar{\gamma }_{a} \cdot (1+\frac{P}{S}) \cdot F\), \({BW}^{'} = BW + O \left( M \right) \), \({L}^{'} = L + O \left( P\log {P} \right) \), and \(M^{'} = M\).
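
For instance, grouping tasks so that \(S = P^{2}\) (an illustrative choice; any polynomial in P behaves similarly) gives

\[
{F}^{'} = \bar{\gamma }_{a} \cdot \left( 1+\frac{1}{P} \right) \cdot F, \qquad {L}^{'} = L + O \left( P\log {P} \right),
\]

so the arithmetic overhead vanishes as P grows, while the \(O \left( P\log {P} \right) \) latency term remains the dominant overhead, which the solution proposed in this paper reduces by a factor of \(O \left( P/\log {P} \right) \).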

A.2 Redundancy

Here, we compare our solution to three commonly used erasure code based solutions: replication, MDS, and LT.

Replication. In the Replication solution, the algorithm divides the processors into P/r groups of r processors, where processors within the same group perform the exact same computations. The algorithm constructs the final output using the results of the fastest processor in each group.
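
A minimal sketch of the grouping and halting rule; the simulated finish times and all names here are illustrative assumptions.

    import random

    def replication_groups(P: int, r: int):
        """Partition processors 0..P-1 into P/r groups of r processors; every
        member of a group is assigned the same computation (assumes r divides P)."""
        assert P % r == 0
        return [list(range(g * r, (g + 1) * r)) for g in range(P // r)]

    # Toy run with P = 8 and r = 2: the output is assembled from the
    # fastest finisher in each group, masking one straggler per group.
    finish_time = {p: random.random() for p in range(8)}  # simulated completion times
    fastest = [min(group, key=finish_time.get) for group in replication_groups(8, 2)]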

Claim

Let \(\gamma _{P} \cdot F,\ BW,\ L\), and M denote an algorithm’s arithmetic cost, bandwidth cost, latency cost, and memory footprint, respectively. Let \({F}^{'},\ {BW}^{'},\ {L}^{'}\), and \(M^{'}\) denote the costs of the algorithm applying the replication solution on C. Then \({F}^{'} = \gamma _{P-r+1} \cdot r \cdot F\), \({BW}^{'} = BW + 2r\cdot M\), \({L}^{'} = L + O \left( \log {r} \right) \), and \(M^{'} = r\cdot M\).

Proof

Since the workload is shared among P/r processors (rather than P), each processor computes a factor of r more computations and uses a factor of r more memory. The algorithm uses an all-broadcast operator to share the input, and a scatter operator to share the output. This costs \(2\log {r}\) messages and \(2r \cdot M\) words. The algorithm halts when the first processor from each group completes its tasks. In the worst-case scenario, the algorithm halts when \(P-r+1\) processors have completed their tasks. Hence, the arithmetic cost is \(\gamma _{P-r+1} \cdot r \cdot F\). Summing up, we obtain \({F}^{'} = \gamma _{P-r+1} \cdot r \cdot F\), \({BW}^{'} = BW + 2r\cdot M\), \({L}^{'} = L + O \left( \log {r} \right) \), and \(M^{'} = r \cdot M\).
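
For concreteness, substituting \(r = 2\) into the claim (a worked instance, not stated in the paper) gives

\begin{align*}
{F}^{'} &= \gamma _{P-1} \cdot 2F, & {BW}^{'} &= BW + 4M,\\
{L}^{'} &= L + O \left( \log {2} \right) = L + O(1), & M^{'} &= 2M,
\end{align*}

i.e., each processor doubles its work and memory, and each pair tolerates a single straggler.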

Erasure Codes. An erasure code (cf. [48, 53]) is a linear transformation T that takes a vector v of size \(x_{1}\) and outputs a vector w of size \(x_{2}\), where \(x_{1}\), \(x_{2}\), and \(\rho = \frac{x_{2}}{x_{1}}\) are called the rank, length, and rate of the code, respectively. We represent a code T by a generator matrix G of size \(x_{2} \times x_{1}\), where applying the code is equivalent to multiplying the input vector from the left by the matrix G. The generator matrix of a replication code is an identity matrix in which each row is duplicated r times. We say the code has distance d if any two code vectors differ in at least d coordinates. A code with distance d can recover from \(d-1\) erasures. Maximum Distance Separable (MDS) codes are a family of codes with maximal distance.
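
As a toy instance of these definitions (not taken from the paper), the \(r = 2\) replication code with rank \(x_{1} = 2\), length \(x_{2} = 4\), and rate \(\rho = 2\) has the generator matrix

\[
G = \begin{pmatrix} 1 & 0\\ 1 & 0\\ 0 & 1\\ 0 & 1 \end{pmatrix},
\qquad
G \begin{pmatrix} v_{1}\\ v_{2} \end{pmatrix} = \begin{pmatrix} v_{1}\\ v_{1}\\ v_{2}\\ v_{2} \end{pmatrix},
\]

and any two code vectors differ in at least \(d = 2\) coordinates, so a single erasure can be recovered.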

Random codes incorporate randomness into the construction of the generator matrix. Luby Transform (LT) codes [40] are a family of random codes that obey the following conditions: (I) the entries of the generator matrix are either zero or one; (II) the density of each row (its number of non-zero elements) is sampled randomly from some distribution; (III) the locations of the non-zeros are sampled uniformly. A popular choice for the density distribution is the Robust Soliton degree distribution [40].
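
The sketch below samples one LT generator-matrix row under conditions I-III using the Robust Soliton distribution [40]; the parameters c and delta are common tuning choices, not values from the paper.

    import math
    import random

    def robust_soliton(x: int, c: float = 0.1, delta: float = 0.05):
        """Robust Soliton degree distribution over degrees 0..x (weight 0 at degree 0);
        c and delta are assumed tuning parameters."""
        R = c * math.log(x / delta) * math.sqrt(x)
        rho = [0.0, 1.0 / x] + [1.0 / (i * (i - 1)) for i in range(2, x + 1)]
        tau = [0.0] * (x + 1)
        pivot = int(round(x / R))
        for i in range(1, min(pivot, x + 1)):
            tau[i] = R / (i * x)          # extra weight on low degrees
        if 1 <= pivot <= x:
            tau[pivot] = R * math.log(R / delta) / x  # spike at degree x/R
        beta = sum(rho) + sum(tau)        # normalization constant
        return [(rho[i] + tau[i]) / beta for i in range(x + 1)]

    def sample_lt_row(x: int, dist) -> list:
        """One generator-matrix row: a random density d (condition II), then
        d ones at uniformly random positions (conditions I and III)."""
        d = random.choices(range(x + 1), weights=dist)[0]
        row = [0] * x
        for j in random.sample(range(x), d):
            row[j] = 1
        return row

    row = sample_lt_row(100, robust_soliton(100))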

MDS. Given an MDS code with rank K and length P, the algorithm partitions the problem into K tasks and uses the code to construct P new tasks in place of the original ones. The \(i\)-th processor computes the \(i\)-th task, and the final output is produced from the outcomes of the first K processors to finish.
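
A sketch of this encode/decode step over the reals, using a Vandermonde generator matrix (any K of its P rows form an invertible matrix, which is exactly the MDS property); the sizes and names here are illustrative assumptions.

    import numpy as np

    K, P = 4, 6                              # assumed: K source tasks, P encoded tasks
    rng = np.random.default_rng(0)
    tasks = rng.standard_normal((K, 3))      # K task payloads (e.g., flattened blocks)

    # Vandermonde rows at distinct nodes: any K rows are invertible.
    G = np.vander(np.arange(1.0, P + 1), K, increasing=True)
    encoded = G @ tasks                      # processor i works on encoded row i

    finishers = [0, 2, 3, 5]                 # the first K processors to finish
    recovered = np.linalg.solve(G[finishers], encoded[finishers])
    assert np.allclose(recovered, tasks)     # output rebuilt from any K results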

Claim

Let \(\gamma _{P} \cdot F,\ BW,\ L\), and M denote an algorithm’s arithmetic cost, bandwidth cost, latency cost, and memory footprint, respectively. Let \({F}^{'},\ {BW}^{'},\ {L}^{'}\), and \(M^{'}\) denote the costs of the algorithm applying the MDS solution (see Note 2) on C. Then \({F}^{'} = \gamma _{K} \cdot O \left( \frac{P}{K} \cdot F \right) \), \({BW}^{'} = BW + 2P\cdot M\), \({L}^{'} = L + O \left( P \right) \), and \(M^{'} = \frac{P}{K} \cdot M\).

Proof

Input (resp. output) redistribution involves code encoding (resp. decoding) and an all-reduce operation. By Corollary 1, this adds \(2P \cdot M\) bandwidth cost, \(O \left( P \right) \) latency cost, and an arithmetic cost, which is often negligible. Moreover, each processor performs a factor of \(\frac{P}{K}\) more arithmetic computations (since the workload is distributed among K processors instead of among P) and uses a factor of \(\frac{P}{K}\) more memory. The algorithm halts when K processors have completed their tasks. Thus, the arithmetic cost is \(\gamma _{K} \cdot O \left( \frac{P}{K} \cdot F \right) \).
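
As a worked instance (not stated in the paper), taking \(K = P/2\) yields

\[
{F}^{'} = \gamma _{P/2} \cdot O \left( 2F \right), \qquad M^{'} = 2M,
\]

so the algorithm tolerates up to \(P - K = P/2\) stragglers at the price of twice the work and memory per processor.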

LT. Mallick et al. [41] proposed a new variation of the LT coding solution (denoted here as LT+) that utilizes partial computations performed by all processors and attains near-ideal load balancing. In the LT+ solution, each processor broadcasts an update message every time it completes a task. The algorithm generates \(\rho \cdot S\) new tasks (in place of the originals) using the LT code and halts when enough tasks have been completed.
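
A sketch of the LT+ halting rule; recovery_threshold follows the bound of Remark 1 and Claim 1 with the hidden constants assumed to be 1, and the completion stream stands in for the broadcast update messages.

    import itertools
    import math
    from typing import Iterator

    def recovery_threshold(S: int) -> int:
        """Completed-task count expected to suffice for decoding:
        O(S + sqrt(S) * log^2(S)); hidden constants assumed to be 1."""
        return int(S + math.sqrt(S) * math.log(S) ** 2)

    def run_until_decodable(completions: Iterator, S: int) -> int:
        """Count broadcast completion updates (each an O(log P)-latency
        broadcast) until enough coded tasks are done to decode."""
        done = 0
        for _ in completions:
            done += 1
            if done >= recovery_threshold(S):
                break
        return done

    # Toy usage: an unbounded stream of completion events.
    print(run_until_decodable(itertools.count(), S=1024))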

Remark 1

Using the Robust Soliton degree distribution [40], encoding and decoding a code of size x cost \(O(x\log {\frac{x}{\epsilon }})\) and succeed with probability \(1-\epsilon \). Any \(x + \sqrt{x}\cdot \log ^{2}{\frac{x}{\epsilon }}\) symbols are sufficient to construct the output.

Claim

Let \(\gamma _{P} \cdot F,\ BW,\ L\), and M denote an algorithm’s arithmetic cost, bandwidth cost, latency cost, and memory footprint, respectively, and let \({F}^{'},\ {BW}^{'},\ {L}^{'}\), and \(M^{'}\) denote the costs of the algorithm applying the LT+ solution to C. Then \({F}^{'} = \bar{\gamma }_{a} \cdot O \left( 1 + \frac{\log ^{2}{S}}{\sqrt{S}} \right) \cdot F\), \({BW}^{'} = BW + 2P \cdot M\), \({L}^{'} = L + O \left( S\cdot \log {P} \right) \), and \(M^{'} = \rho \cdot M\).

Proof

Input and output distribution phases are similar to the MDS solution but with different weights. Since both use an all-reduce operation, the overhead costs remain the same. The algorithm constructs the final output when the processors have globally completed a sufficient number of tasks (maintaining near-perfect load balancing; see Note 3). The number of completed tasks expected to be sufficient for recovery is \(O \left( S + \sqrt{S}\log ^{2}{S} \right) \) (Claim 1), which means that each processor is expected to perform a factor of \(O \left( 1 + \frac{\log ^{2}{S}}{\sqrt{S}} \right) \) more computations compared to the non-mitigated algorithm. Notifying the processors when each task is completed adds \(S \cdot P\) bandwidth and \(O \left( S \cdot \log {P} \right) \) latency (Corollary 1). The algorithm generates a factor of \(\rho \) more tasks and thus uses a factor of \(\rho \) more memory. Summing up, the additional costs are: \({F}^{'} = \bar{\gamma }_{a} \cdot O \left( 1 + \frac{\log ^{2}{S}}{\sqrt{S}} \right) \cdot F\), \({BW}^{'} = BW + 2P\cdot M\), \({L}^{'} = L + O \left( S\log {P} \right) \), and \(M^{'} = \rho \cdot M\).


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nissim, R., Schwartz, O. (2023). Stragglers in Distributed Matrix Multiplication. In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies for Parallel Processing. JSSPP 2023. Lecture Notes in Computer Science, vol 14283. Springer, Cham. https://doi.org/10.1007/978-3-031-43943-8_4

  • DOI: https://doi.org/10.1007/978-3-031-43943-8_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43942-1

  • Online ISBN: 978-3-031-43943-8
