Experience report on applying software analytics in incident management of online service

Automated Software Engineering

Abstract

As online services grow in popularity, incident management has become a critical task that aims to minimize service downtime and ensure the high quality of the services provided. In practice, incident management is conducted by analyzing a huge amount of monitoring data collected at runtime of a service. Such data-driven incident management faces several significant challenges, such as large data scale, a complex problem space, and incomplete knowledge. To address these challenges, we carried out a 2-year software-analytics research project in which we designed a set of novel data-driven techniques and developed an industrial system called the Service Analysis Studio (SAS), targeting real scenarios in a large-scale online service of Microsoft. SAS has been deployed to product datacenters worldwide and is widely used by on-call engineers for incident management. This paper shares our experience of using software analytics to solve engineers' pain points in incident management, the data-analysis techniques we developed, and the lessons learned from the process of research development and technology transfer.

Notes

  1. Incident management, http://en.wikipedia.org/wiki/Incident_management.

  2. Microsoft SCOM, http://www.microsoft.com/en-us/server-cloud/products/system-center-2012-r2/default.aspx.

  3. Formal Concept Analysis, http://en.wikipedia.org/wiki/Formal_concept_analysis.

  4. We chose this dataset because it has quality labels verified by domain experts. Although we do have other real datasets with a larger number of attributes, the quality of their labels may be insufficient to calculate TN and FN because of the difficulty of manual identification mentioned above.

Author information

Corresponding author

Correspondence to Jian-Guang Lou.

About this article

Cite this article

Lou, JG., Lin, Q., Ding, R. et al. Experience report on applying software analytics in incident management of online service. Autom Softw Eng 24, 905–941 (2017). https://doi.org/10.1007/s10515-017-0218-1

  • DOI: https://doi.org/10.1007/s10515-017-0218-1
