Event storm detection and identification in communication systems

doi:10.1016/j.ress.2005.05.001

Reliability Engineering & System Safety

Volume 91, Issue 5, May 2006, Pages 602-613

https://doi.org/10.1016/j.ress.2005.05.001

Abstract

Event storms are the manifestation of an important class of abnormal behaviors in communication systems. They occur when a large number of nodes throughout the system generate a set of events within a small period of time. It is essential for network management systems to detect every event storm and identify its cause, in order to prevent and repair potential system faults.

This paper presents a set of techniques for the effective detection and identification of event storms in communication systems. First, we introduce a new algorithm to synchronize events to a single node in the system. Second, the system's event log is modeled as a normally distributed random process. This is achieved by using data analysis techniques to explore and then model the statistical behavior of the event log. Third, event storm detection is proposed using a simple test statistic combined with an exponential smoothing technique to overcome the non-stationary behavior of event logs. Fourth, the system is divided into non-overlapping regions to locate the main contributing regions of a storm. We show that this technique provides us with a method for event storm identification. Finally, experimental results from a commercially deployed multimedia communication system that uses these techniques demonstrate their effectiveness.

Introduction

Abnormal behavior in communication systems results in a number of events generated by nodes affected by such behavior. Abnormality is defined as behavior that deviates from expectation. The operator uses these events to detect and identify abnormalities in the system they manage. The problem under investigation is the detection and identification of event storms in communication systems. Event storms are an important class of abnormal behaviors in such systems. In many cases, such storms indicate failures that involve a large number of objects throughout the system. These objects react to such faults by generating a number of events within a short period of time.

The term ‘event storm’ implies that the number of events observed within a time unit is much larger than expected. In the statistical literature, such a concept has been called an ‘outlier’ among other names. An outlier is an observation that appears suspicious in light of some provisional initial assignment of a probability model to explain the data generation process [1].
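The outlier view of an event storm can be illustrated with a minimal sketch. It assumes event timestamps expressed in seconds, a fixed time unit, and a provisional normal model whose mean and standard deviation are estimated from the counts themselves. The function names and the threshold k = 3 are illustrative assumptions, not the test statistic defined later in the paper.

```python
from collections import Counter
from statistics import mean, stdev


def counts_per_unit(timestamps, unit_seconds):
    """Bin event timestamps (in seconds) into consecutive time units.

    Returns one event count per time unit, starting at the earliest event.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    bins = Counter(int((t - start) // unit_seconds) for t in timestamps)
    return [bins.get(i, 0) for i in range(max(bins) + 1)]


def flag_outliers(counts, k=3.0):
    """Return indices of time units whose count is more than k sample
    standard deviations above the sample mean (hypothetical threshold)."""
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # all counts identical: nothing stands out
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > k]


if __name__ == "__main__":
    quiet_then_storm = [10] * 19 + [200]
    print(flag_outliers(quiet_then_storm))
```

Only unusually large counts are flagged, because a storm is an excess of events rather than a deficit.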

Fault management in communication systems has been investigated by many researchers using a variety of techniques. Finite state machines (FSMs) and probabilistic FSMs have been used in Refs. [2], [3], [4], [5]. A Petri net approach to fault detection has been investigated in Refs. [6], [7]. Bayesian networks have been used in Ref. [8] for fault diagnosis in lightwave networks, while the authors of Ref. [9] provided a fuzzy temporal reasoning model for event correlation. A statistical and heuristic approach to modeling error logs captured from a distributed computer network at Carnegie Mellon University was presented in Ref. [10], whose authors proposed a failure prediction heuristic based on the inter-arrival time function of errors. A review of many fault detection and identification models and algorithms is provided in Ref. [11], while the author of Ref. [12] provided an OSI model for the more general problem of network management. The statistical literature has addressed the more general problem of outlier detection [1], [13]. A time series approach is used in Ref. [14] to investigate the detection of computer system performance problems. A method for proactive anomaly detection in networks is introduced in Ref. [15]; its authors process time series data obtained from selected MIB variables to raise proactive alarms that indicate impending network problems. The exponential smoothing technique used in our work has been widely used in time series forecasting [16], [17], [18].

There are many challenges to solving the problem of event storm detection and identification in communication systems. First, in many large communication systems, nodes are not synchronized to a global clock. This is the case in the system under study. As a result, some events cannot be temporally correlated, which makes it impossible to detect and identify the storms they belong to. We solve this problem by synchronizing all nodes in the system relative to a single node, i.e., the OMC.
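The synchronization algorithm itself is not reproduced in this excerpt. The sketch below only illustrates the general idea of expressing every event on the clock of a single reference node. It assumes that a per-node clock offset is available from one paired reading of the reference clock and the node clock, and it ignores clock drift and transmission delay. The Event type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    node_id: str
    local_time: float  # seconds on the generating node's local clock


def estimate_offset(reference_time, node_time):
    """Offset to add to a node's local time to express it on the
    reference clock (assumes both readings were taken together)."""
    return reference_time - node_time


def synchronize(events, offsets):
    """Return (reference_time, event) pairs sorted by reference time.

    Nodes without a known offset are assumed to be already aligned.
    """
    adjusted = [(e.local_time + offsets.get(e.node_id, 0.0), e) for e in events]
    adjusted.sort(key=lambda pair: pair[0])
    return adjusted


if __name__ == "__main__":
    offsets = {"A": estimate_offset(1000.0, 1000.0),
               "B": estimate_offset(1000.0, 940.0)}
    events = [Event("B", 50.0), Event("A", 100.0)]
    for ref_time, event in synchronize(events, offsets):
        print(ref_time, event.node_id)
```

Once all timestamps share one time base, events from different nodes can be binned into the same time units and correlated.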

Second, communication systems are dynamic in nature: their size and complexity change on a daily basis. The result is an ever-increasing volume of events generated during a time unit. In such an environment, a small system's event storm is a large system's normal behavior. We model this behavior as a non-stationary normally distributed random process that consists of small stationary segments. To provide an effective storm detection technique in such an environment, we apply an exponential smoothing technique. The authors of Refs. [19], [20] investigated the problem of anomaly detection in an Ethernet system by monitoring packet traffic and processor loading data, and showed that performance anomalies are indicative of failure. They use exponential smoothing to update their system's daily behavior in a manner similar to ours. However, while analysis of performance data has been shown to be a promising technique for anomaly detection, events are more natural symptoms of abnormality in communication systems (nodes generate events in response to abnormality) and are therefore more effective for the task at hand. Moreover, performance data are system specific and more difficult to generalize.
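The effect of exponential smoothing on a drifting baseline can be sketched as follows. The update used here is the textbook form s_new = α·x + (1 − α)·s_old, applied to the daily mean and daily standard deviation; it may differ in detail from the paper's own equations, which are not shown in this excerpt. The class name, the default α and the threshold multiplier k are assumptions made for illustration.

```python
def smooth(previous, observed, alpha):
    """Textbook exponential smoothing of one value."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * observed + (1.0 - alpha) * previous


class StormDetector:
    """Flags per-time-unit counts that exceed mean + k * std, where mean
    and std are smoothed daily so that the baseline follows the system."""

    def __init__(self, mean, std, alpha=0.3, k=3.0):
        self.mean = mean
        self.std = std
        self.alpha = alpha
        self.k = k

    def is_storm(self, count):
        return count > self.mean + self.k * self.std

    def end_of_day(self, day_mean, day_std):
        """Fold today's observed statistics into the baseline."""
        self.mean = smooth(self.mean, day_mean, self.alpha)
        self.std = smooth(self.std, day_std, self.alpha)


if __name__ == "__main__":
    detector = StormDetector(mean=100.0, std=10.0, alpha=0.5)
    print(detector.is_storm(135))      # above 100 + 3 * 10
    detector.end_of_day(200.0, 20.0)   # the system grew
    print(detector.is_storm(135))      # now below 150 + 3 * 15
```

The same count is a storm against the small-system baseline and normal against the grown one. A larger α makes the baseline follow recent days more closely, while a smaller α gives more weight to history.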

Third, we investigate the problem of identifying event storms. Our approach consists of dividing the system into non-overlapping regions and then using the main contributing regions of a storm as its signature.
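One simple way to turn region contributions into a signature is sketched below. It assumes that every event in a detected storm has already been assigned to exactly one region (the regions do not overlap), and it keeps the largest contributors until they account for a chosen share of the storm. The coverage parameter and the function name are illustrative assumptions, not the two methods discussed later in the paper.

```python
from collections import Counter


def main_contributing_regions(event_regions, coverage=0.8):
    """Return [(region, share), ...] for the largest contributors.

    event_regions: one region label per event in the storm.
    coverage: stop once the selected regions account for at least this
              fraction of the storm's events (hypothetical parameter).
    """
    counts = Counter(event_regions)
    total = sum(counts.values())
    signature = []
    covered = 0
    for region, n in counts.most_common():
        if covered / total >= coverage:
            break
        signature.append((region, n / total))
        covered += n
    return signature


if __name__ == "__main__":
    storm = ["R1"] * 70 + ["R2"] * 20 + ["R3"] * 10
    for region, share in main_contributing_regions(storm):
        print(region, share)
```

Two storms whose signatures name the same regions in similar proportions are candidates for having a common cause.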

Section 2 presents the system under study. In Section 3, we present a time synchronization algorithm. In Section 4, we discuss the data analysis approach to event logs. In Section 5, we present a simple rule for detecting outliers based on the model given in Section 4 and provide a technique for handling the non-stationary behavior of event logs. In Section 6, we outline an approach to correlate event storms in time and space. In Section 7, we discuss the experimental results of applying our techniques to the system under study.

Overview

The system under study is a commercial multimedia wireless communication system, which is being used by hundreds of thousands of mobile users generating both voice and data traffic. Like many other cellular systems, it covers a wide geographic area that spans hundreds or even thousands of square miles. The system is similar in many respects to other commercial cellular systems in its operation, system management, and the type of services provided to the end-user.

Although a typical system

Event log synchronization

Each event contains a time stamp, which identifies the time it was generated. This time is the node's local clock value and in many cases it is not synchronized with other nodes in the system.

Fundamental to the concept of fault is the assumption of the temporal relationship between events, i.e., events that arrive within some time interval are related in some fashion. However, given the synchronization problem with many operational systems, such a temporal relationship is of little use. To

Data analysis

Data analysis methods can be described as a two-phase process: exploratory and confirmatory. In Sections 4.1 (Non-stationary behavior of the event log) and 4.2 (Shape of the stationary segments), we use exploratory data analysis techniques to draw some general conclusions about the data. These conclusions are used to guide the investigation in the modeling process. More specifically, both the behavior and shape of the event log are investigated. In Section 4.3, we use confirmatory data analysis to provide

Event storm detection

We showed that the event log can be modeled as a non-stationary normally distributed random process that is discrete-state and discrete-parameter. This model represents an event log that changes on a daily basis due to changes in the system's size and complexity. More specifically, these daily changes are of two kinds: (1) daily volume, which translates into a move of the mean of the normally distributed random variable along the x-axis (when volume increases, it shifts to the right; when

The main contributing regions of event storms

An event storm may consist of thousands of events all arriving in a very short period of time. In this section, we consider two methods for finding the main contributing regions of event storms and their benefits. Our goal is to provide ways to help the system operator analyze these storms with simple yet powerful visual artifacts that can: (1) identify the nature of a storm and the region that caused it; and (2) correlate storms in time and space, which results in the discovery of potential

Experimental results

We analyzed the wireless communication system described in Section 2 over a period of 7 months. The number of nodes varied during this period: the system had 5331 nodes on 3 December 2000 and 4953 nodes on 30 June 2001. Moreover, the system went through three system-wide software release changes.

Eqs. (4) and (5) were applied to the system's daily mean and standard deviation with different α values, as shown in Fig. 12. As the value of α

Acknowledgements

We would like to thank Motorola Inc. for their support of this research. We are especially thankful to Mark Hamlen for his support in this effort. It is also our pleasure to acknowledge the help of Mohammed Petiwala for his implementation of the ideas presented in this work.

Mouayad Albaghdadi received his BSEE from Gannon University in 1983, and his MSEE and PhD in Computer Science from Illinois Institute of Technology in 1990 and 2001, respectively. He is currently pursuing an MA in Statistics at DePaul University. He joined Motorola in 1990 and holds two US patents. His research interests include fault management and stochastic modeling and simulation of computer and communication systems.

References (27)

V. Barnett et al. Outliers in statistical data (1994)
A. Bouloutas et al. Fault identification using a finite state machine model with unreliable partially observed data sequences. IEEE Trans Commun (1993)
C. Wang et al. Fault detection with multiple observers (1992)
A. Bouloutas et al. Simple finite-state detectors for communication networks. IEEE Trans Commun (1992)
I. Rouvellou et al. Automatic alarm correlation for fault identification (1995)
R. Boubour et al. A Petri net approach to fault detection and diagnosis in distributed systems. Part I. Application to telecommunication networks, motivation, and modeling (1997)
A. Aghasaryan et al. A Petri net approach to fault detection and diagnosis in distributed systems. Part II. Extending Viterbi algorithm and HMM technique to Petri nets (1997)
R. Deng et al. A probabilistic approach to fault diagnosis in linear lightwave networks. IEEE J Sel Areas Commun (1993)
E. Aboelela et al. Fuzzy temporal reasoning model for event correlation in network management (1999)
T. Lin et al. Error log analysis: statistical modeling and heuristic trend analysis. IEEE Trans Reliab (1990)
A. Lazar et al. Models and algorithms for network fault detection and identification: a review (1992)
Y. Yemini. The OSI network management model. IEEE Commun Mag (1993)
P. Sprent. Data driven statistical methods (1998)

Bruce Briley received BSEE, MSEE and PhD, all from the University of Illinois, Champaign, in 1958, 1959 and 1963, respectively. He has been working for Motorola in various capacities since 1996, after 30 years with Bell Labs/Lucent. He has authored two textbooks (Introduction to telephone switching and Introduction to fiber optics system design) and holds 20 US patents. He taught for 35 years at the Illinois Institute of Technology, where he was an Adjunct Full Professor and was the first Alva C. Todd Professor. He is a Senior Member of the IEEE and was nominated for the Alexander Graham Bell medal. He is a past Member of the Governing Board of the IEEE Computer Society.

Martha Evens is a Research Professor of Computer Science at Illinois Institute of Technology. She received an AB in Mathematics from Bryn Mawr, spent a year in Paris on a Fulbright Fellowship, received an AM in Mathematics from Radcliffe, and a PhD in Computer Science from Northwestern University. She has been an Associate Editor of the American Mathematical Monthly and the Journal of Computational Linguistics. In 1984, she was the President of the Association for Computational Linguistics, an international organization for people interested in natural language processing. She has published over 300 papers and received over 40 research grants.
