Ranking the importance of alerts for problem determination in large computer systems

Jiang, Guofei; Chen, Haifeng; Yoshihira, Kenji; Saxena, Akhilesh

doi:10.1007/s10586-010-0120-0

Published: 20 February 2010

Cluster Computing, volume 14, pages 213–227 (2011)

Abstract

The complexity of large computer systems has raised unprecedented challenges for system management. In practice, operators often collect large volumes of monitoring data from system components and set up many rules to check the data and trigger alerts. However, alerts from different rules usually differ in how accurately they report problems, because their thresholds are often set manually based on operators’ experience and intuition. Meanwhile, due to system dependencies, a single problem may trigger many alerts at the same time in a large system, and the critical question is which alert should be analyzed first in the subsequent problem determination process. In this paper, we propose a novel peer review mechanism to rank the importance of alerts, so that the top-ranked alerts are more likely to be true positives. After comparing a metric value against its threshold to generate an alert, we also compare the value with the equivalent thresholds from many other rules to determine the importance of the alert. Our approach is evaluated on a real test bed system, and experimental results are included to demonstrate its effectiveness.
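
The peer review idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given on this page. It assumes, purely for illustration, that each peer rule's threshold can be translated into an "equivalent threshold" on the alerting metric through a known linear relation, and that an alert's importance is the fraction of those equivalent thresholds that the observed value also exceeds. The names `Rule`, `LinearMap`, `equivalent_thresholds`, and `rank_alerts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A threshold rule: alert when the metric's value exceeds `threshold`."""
    metric: str
    threshold: float


@dataclass(frozen=True)
class LinearMap:
    """Assumed relation between two metrics: target ~= gain * source + offset."""
    source: str
    target: str
    gain: float
    offset: float


def equivalent_thresholds(rule, rules, maps):
    """Translate peer rules' thresholds into the units of `rule.metric`."""
    by_metric = {r.metric: r for r in rules}
    return [
        m.gain * by_metric[m.source].threshold + m.offset
        for m in maps
        if m.target == rule.metric and m.source in by_metric
    ]


def rank_alerts(values, rules, maps):
    """Return (metric, score) pairs for firing rules, highest score first.

    The score is the fraction of peer-equivalent thresholds that the
    observed value exceeds (an assumed scoring scheme for illustration).
    """
    ranked = []
    for rule in rules:
        value = values.get(rule.metric)
        if value is None or value <= rule.threshold:
            continue  # this rule did not raise an alert
        peers = equivalent_thresholds(rule, rules, maps)
        votes = sum(1 for t in peers if value > t)
        score = votes / len(peers) if peers else 0.0
        ranked.append((rule.metric, score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up rules, relations, and readings.
rules = [Rule("cpu", 80.0), Rule("mem", 70.0), Rule("net", 500.0)]
maps = [
    LinearMap("mem", "cpu", 1.0, 20.0),
    LinearMap("net", "cpu", 0.1, 40.0),
    LinearMap("cpu", "mem", 1.0, -5.0),
    LinearMap("net", "mem", 0.1, 30.0),
]
values = {"cpu": 95.0, "mem": 72.0, "net": 400.0}
print(rank_alerts(values, rules, maps))
```

In this made-up example both the `cpu` and `mem` rules fire, but only the `cpu` reading also exceeds the thresholds implied by its peers, so it is ranked first. A rule with a loosely set threshold tends to be outvoted by its peers, which is the intuition the abstract describes.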

Author information

Authors and Affiliations

NEC Laboratories America, Princeton, NJ, 08540, USA
Guofei Jiang, Haifeng Chen, Kenji Yoshihira & Akhilesh Saxena

Corresponding author

Correspondence to Guofei Jiang.

About this article

Cite this article

Jiang, G., Chen, H., Yoshihira, K. et al. Ranking the importance of alerts for problem determination in large computer systems. Cluster Comput 14, 213–227 (2011). https://doi.org/10.1007/s10586-010-0120-0

Received: 03 September 2009
Accepted: 28 January 2010
Published: 20 February 2010
Issue Date: September 2011
DOI: https://doi.org/10.1007/s10586-010-0120-0
