
PreFix: Switch Failure Prediction in Datacenter Networks

Published: 03 April 2018

Abstract

In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach: predicting failures so that operators can intervene and "fix" potential failures before they happen. Specifically, our proposed system, PreFix, aims to determine at runtime whether a switch failure will happen in the near future. The prediction is based on measurements of the current switch system status and on historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share common syslog patterns before they occur, and that machine learning methods can extract these patterns to predict switch failures. Our novel set of features (message template sequence, frequency, seasonality, and surge) efficiently deals with the challenges of noise, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected over a 2-year period from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine. PreFix achieved an average recall of 61.81% and a false positive ratio of 1.84 × 10^-5. It outperforms other failure prediction methods designed for computers and ISP devices.
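To make the two reported metrics concrete, the sketch below computes recall and false positive ratio from confusion-matrix counts using their standard definitions. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the PreFix evaluation, and the paper's exact counting procedure may differ.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual failures that were predicted: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def false_positive_ratio(fp: int, tn: int) -> float:
    """Fraction of non-failure cases wrongly flagged: FP / (FP + TN)."""
    total = fp + tn
    return fp / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical counts (assumption): failures are rare, so TN dominates.
    tp, fn = 6, 4          # 10 real failures, 6 predicted in advance
    fp, tn = 2, 99_998     # 100,000 non-failure cases, 2 false alarms
    print(f"recall = {recall(tp, fn):.2%}")
    print(f"false positive ratio = {false_positive_ratio(fp, tn):.2e}")
```

Because failures are rare relative to normal operation, even a very small false positive ratio can correspond to a noticeable number of false alarms, which is why both metrics are reported together.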


Published In

Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 2, Issue 1
March 2018
603 pages
EISSN:2476-1249
DOI:10.1145/3203302
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published in POMACS Volume 2, Issue 1


Author Tags

  1. datacenter
  2. failure prediction
  3. machine learning
  4. operations

Qualifiers

  • Research-article


Cited By

  • (2024) A Survey of the Past, Present, and Future of Erasure Coding for Storage Systems. ACM Transactions on Storage 21:1 (1-39). DOI: 10.1145/3708994. Online publication date: 31-Dec-2024
  • (2024) LogParser-LLM: Advancing Efficient Log Parsing with Large Language Models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (4559-4570). DOI: 10.1145/3637528.3671810. Online publication date: 25-Aug-2024
  • (2024) Network Connectivity Resilience in Next Generation Backhaul Networks: Challenges and Future Opportunities. IEEE Transactions on Network and Service Management 21:5 (5321-5334). DOI: 10.1109/TNSM.2024.3392857. Online publication date: 1-Oct-2024
  • (2024) Toward Resource-Efficient and High-Performance Program Deployment in Programmable Networks. IEEE/ACM Transactions on Networking 32:5 (4270-4285). DOI: 10.1109/TNET.2024.3413388. Online publication date: Oct-2024
  • (2024) Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE) (499-510). DOI: 10.1109/ISSRE62328.2024.00054. Online publication date: 28-Oct-2024
  • (2024) When Green Computing Meets Performance and Resilience SLOs. 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S) (17-22). DOI: 10.1109/DSN-S60304.2024.00015. Online publication date: 24-Jun-2024
  • (2024) Paired 2-disjoint path covers of k-ary n-cubes under the partitioned edge fault model. Journal of Parallel and Distributed Computing (104887). DOI: 10.1016/j.jpdc.2024.104887. Online publication date: Mar-2024
  • (2024) Ensuring reliable network operations and maintenance. Journal of King Saud University - Computer and Information Sciences 35:10. DOI: 10.1016/j.jksuci.2023.101809. Online publication date: 4-Mar-2024
  • (2024) Explainable semantic wireless anomaly characterization for digital twins. Computer Networks 251 (110660). DOI: 10.1016/j.comnet.2024.110660. Online publication date: Sep-2024
  • (2024) Optoelectronic equipment-based fault monitoring with 64QAM-OFDM RoF transmission. Journal of Optics. DOI: 10.1007/s12596-024-02226-w. Online publication date: 18-Sep-2024
