
Asymmetric Coded Distributed Computation for Resilient Prediction Serving Systems

Conference paper in Euro-Par 2024: Parallel Processing (Euro-Par 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14802)

Abstract

With the surge of AI services, prediction serving systems (PSSes) have been widely deployed. PSSes typically run on many workers and are therefore prone to stragglers (slowdowns or failures), so designing straggler-resilient PSSes is critical for low prediction latency. The traditional approach is replication, which assigns the same prediction job to multiple workers but incurs significant resource overhead due to the redundant jobs. Recently, coded distributed computation (CDC) has emerged as a more resource-efficient alternative: it encodes the prediction job into parity units, from which predictions can be reconstructed via decoding. However, we find that state-of-the-art CDC methods either trade accuracy for low latency (with both the encoder and decoder simple) or trade latency for high accuracy (with both complicated), leading to an imbalance between accuracy and latency caused by this symmetry between the encoder and decoder.
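The simple-encoder/simple-decoder regime can be illustrated with a toy additive parity code (a minimal sketch with hypothetical names, not the paper's method). Reconstruction is exact here only because the served model is linear, which is precisely why simple codes lose accuracy on nonlinear models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "model" served by each worker.
W = rng.standard_normal((4, 3))
def model(x):
    return x @ W

# Two data inputs; a simple additive encoder produces one parity input.
x1, x2 = rng.standard_normal(4), rng.standard_normal(4)
parity = x1 + x2                      # fast, simple encoder

# Each worker runs the model; suppose the worker holding x2 straggles.
p1, p_parity = model(x1), model(parity)

# Simple decoder: subtract the surviving prediction from the parity
# prediction. Exact for a linear model, approximate otherwise.
p2_recovered = p_parity - p1
assert np.allclose(p2_recovered, model(x2))
```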

Our insight is that the encoder is always used in CDC-based prediction, while the decoder is used only when stragglers occur. Based on this insight, we first propose a new asymmetric CDC framework, called AsymCDC, that pairs a simple encoder with a complicated decoder: the encoder's simplicity keeps encoding time low, which largely reduces latency, while the decoder's complexity benefits accuracy. We further design the decoder in two steps: i) an exact decoding method that leverages the invertibility of an invertible neural network (INN) so that decoding incurs no accuracy loss, and ii) a decoder compacting method that reshapes the INN outputs so that knowledge distillation can effectively compact the decoder for low decoding time. We prototype AsymCDC atop Clipper; experiments show that AsymCDC's prediction accuracy is approximately the same as that of state-of-the-art methods with both the encoder and decoder complicated, while its latency exceeds that of state-of-the-art methods with both simple by no more than \(2.6\%\).
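The exact-decoding idea rests on invertibility: if the decoder's network is invertible by construction, decoding can undo it with no accuracy loss. Below is a minimal numpy sketch of an additive coupling layer, the building block of many INNs; it is illustrative only and not the paper's architecture:

```python
import numpy as np

# Additive coupling: split the input, shift one half by a function of
# the other. The forward map is invertible no matter what function is
# used, because the conditioning half passes through unchanged.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 2))

def forward(x):
    x1, x2 = x[:2], x[2:]
    y2 = x2 + np.tanh(x1 @ A)        # any function of x1 works here
    return np.concatenate([x1, y2])

def inverse(y):
    y1, y2 = y[:2], y[2:]
    x2 = y2 - np.tanh(y1 @ A)        # exactly undoes the forward pass
    return np.concatenate([y1, x2])

x = rng.standard_normal(4)
assert np.allclose(inverse(forward(x)), x)   # invertibility => lossless decoding
```

Since the inverse is recomputed analytically rather than learned, no approximation error is introduced at decode time; the cost is a heavier decoder, which AsymCDC tolerates because decoding runs only when stragglers occur.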



Acknowledgments.

This work was supported by the Development Program of China for Young Scholars (No. 2021YFB0301400), and Key Laboratory of Information Storage System Ministry of Education of China.

Corresponding author: Yuchong Hu.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, L., Hu, Y., Liu, Y., Xiao, R., Feng, D. (2024). Asymmetric Coded Distributed Computation for Resilient Prediction Serving Systems. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14802. Springer, Cham. https://doi.org/10.1007/978-3-031-69766-1_33


  • DOI: https://doi.org/10.1007/978-3-031-69766-1_33


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-69765-4

  • Online ISBN: 978-3-031-69766-1

  • eBook Packages: Computer Science; Computer Science (R0)
