Elsevier

Information Sciences

Volume 402, September 2017, Pages 1-14
Efficient alarm behavior analytics for telecom networks

https://doi.org/10.1016/j.ins.2017.03.020

Abstract

Locating network fault problems and filtering trivial alarms from important ones are the two main challenges in Network Operation Centers (NOCs). In this paper, we present an alarm behavior analysis and discovery system, AABD, that establishes flapping and parent–child (P–C) rules to reveal the operation patterns from a large number of alarms in telecom networks. These rules can be exploited to filter out unimportant alarms, conduct multi-dimensional analysis of the alarms and identify potential network problems. We propose two novel and effective algorithms to establish the flapping rules and P–C rules. The proposed system is validated using alarm datasets from five Internet service providers. Specifically, we verify the system and methodology in each of the five network domains, i.e., circuit-switched network (CS), packet-switched network (PS), 2G-radio access network (RAN-2G), 3G-radio access network (RAN-3G) and 4G-radio access network (RAN-4G), as these five domains can, to a great extent, form a complete network environment. More importantly, our system can establish a small number of rules, only dozens of flapping rules and P–C rules, and compress the alarms by approximately 84%, i.e., 84% of alarms will not be sent to the network operator. To summarize, the proposed system can help network operators respond to network faults in a timely fashion, locate the faults accurately and significantly reduce the time spent on these tasks.

Introduction

In recent years, we have witnessed the rapid growth of telecom (telecommunication) networks in both scale and topological complexity with the increasing deployment of 4G cellular equipment, e.g., LTE. However, the management of such large and complex networks has become a challenging issue. Network operators are compelled to monitor and process alarms to provide high-quality service to their customers. For a network service provider, the quality of the services offered and the efficiency of identifying and fixing faults in telecom networks are the key criteria for becoming more competitive than other network providers, reducing churn and attracting new customers.

An important way to sustain high-quality services is to address network faults in a timely manner. Every day, tens of thousands of faults are triggered across heterogeneous and interconnected devices in a telecom network. These faults are expressed by the network devices in the form of alarms, which are transmitted to the Network Operation Center (NOC) for further processing by network operators. Additionally, there are thousands of types of alarms. If the network operators handle all alarms sequentially, they will be overloaded and unable to concentrate on finding the underlying reasons for the faults. Generally, approximately 1 million alarms are reported to the NOC every day. If there are five network operators working 8 h per day, each operator must process roughly 7 alarms every second (about 417 per minute) throughout the shift, which is an impossible workload. Therefore, it is necessary to select the important alarms that are useful for identifying network problems. There are two categories of approaches to filtering out trivial alarms. First, some types of alarms occur with high frequency but only last for a few hundred seconds or even shorter periods of time. Network operators do not need to receive all of these alarms; the ones that last for a long duration are sufficient to identify the underlying problems. Therefore, an approach to determining a proper rule by which the network operators can eliminate the trivial alarms is crucial. Second, alarms in different categories may be correlated with each other. For example, in the PS domain, if the alarm M3UA Signaling Link Failed occurs, alarm M3UA Route Unavailable will always occur within a few seconds. These correlations among alarms are called correlation rules, and they can be further exploited to (1) reduce the number of alarms sent to the network operators and (2) establish P–C (parent–child) rules to pinpoint the root causes of network faults.

The network management system of a network provider normally focuses on processing alarms by identifying and compressing flapping alarms, identifying correlated alarms and locating the root cause of the alarms. Previous works [1], [2], [3], [4], [5] have investigated some aspects of these issues. In [1], the TASA system was proposed to determine association and correlation rules, but it does not address flapping. In addition, the association and correlation rules, whose number explodes with the growth of network scale, need to be confirmed by the operation experts one by one. The solution proposed in this paper can filter out many trivial rules and reduce the time spent on rule confirmation. IBM Netcool [4] provides alarm root correlation rules but requires the operation experts to configure the rules for a specific project, whereas in our proposed solution all rules are generated automatically. The root cause inference approach is applied in EMC SMARTS [5], but the system cannot adapt to dynamic networks.
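The workload estimate earlier in this section follows from three figures quoted in the text (1 million alarms per day, five operators, 8 h shifts). The short Python check below reproduces the arithmetic; the function and constant names are illustrative only.

```python
# Back-of-the-envelope check of the NOC workload described in the introduction.
ALARMS_PER_DAY = 1_000_000  # figure quoted in the text
OPERATORS = 5               # figure quoted in the text
SHIFT_HOURS = 8             # figure quoted in the text


def alarms_per_operator_per_second(alarms_per_day: int,
                                   operators: int,
                                   shift_hours: float) -> float:
    """Average alarm rate one operator faces during a shift."""
    shift_seconds = shift_hours * 3600
    return alarms_per_day / operators / shift_seconds


rate = alarms_per_operator_per_second(ALARMS_PER_DAY, OPERATORS, SHIFT_HOURS)
print(f"{rate:.1f} alarms per operator per second")
print(f"{rate * 60:.0f} alarms per operator per minute")
```

Under these figures each operator would face about 7 alarms per second, which is why automatic filtering is needed before alarms reach a human.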

The difficulty of automatic alarm correlation is increased by the huge number of monitoring alarms and/or the time delay issue of probes and database systems. In this paper, we propose an alarm analysis and management system called Automatic Alarm Behavior Discovery (AABD). We also propose methods for establishing flapping rules and parent–child (P–C) rules automatically. The main contribution of this paper is to develop an alarm analysis and management system capable of the following:

  1. Providing a statistical alarm analysis of historical alarms to assist the operators in better understanding network failures;

  2. Establishing flapping rules based on statistical analysis to filter the trivial alarms;

  3. Identifying correlation rules and P–C rules to locate the problems efficiently; and

  4. Conducting extensive experiments on real-world datasets obtained from network service providers.

The remaining sections of the paper are organized as follows: Section 2 reviews related works on alarm analysis and discovery. In Section 3, we briefly introduce the system architecture. We present the approaches to establishing flapping rules and P–C rules in Sections 4 and 5 respectively, and we elaborate upon and analyze the experimental results in Section 6. Finally, we draw our conclusions from this research in Section 7.

Related work

In network management, fault management is a set of functions that detect abnormal behaviors, generate alarms, isolate problems and resolve issues in the network. Fault management systems have existed for decades, but in most cases, these tools have not presented automatic solutions for alarm monitoring by developing correlations as well as root cause inference for automatic trouble ticket issues. IBM Netcool [4] provides an alarm root correlation rule based system to compress alarms generated

Alarms and system architecture

In this section, we will give a brief description of alarms in telecom networks and then introduce the architecture of the AABD system, as shown in Fig. 1.

Flapping analysis

In this section, we study the alarm behaviors and derive a flapping rule for each transient alarm. First, we give a detailed description of a transient (flapping) alarm. Then, an efficient method is presented to identify transient flapping alarms and output a flapping rule for each transient alarm.
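The paper's flapping-rule algorithm is not reproduced in this excerpt, so the following Python sketch only illustrates the general idea of a duration-based rule for transient alarms. Everything in it is an assumption made for illustration: the `AlarmInstance` record, the per-type quantile heuristic in `derive_flapping_rules`, and the default values of `quantile` and `min_instances`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmInstance:
    """Hypothetical alarm record: one raise/clear pair of a given type."""
    alarm_type: str
    raised_at: float   # seconds since an arbitrary epoch
    cleared_at: float

    @property
    def duration(self) -> float:
        return self.cleared_at - self.raised_at


def derive_flapping_rules(history, quantile=0.8, min_instances=5):
    """Return {alarm_type: duration threshold in seconds}.

    Assumed heuristic, not the paper's method: for each alarm type with
    enough history, use the given quantile of observed durations as the
    cut-off below which an instance is treated as transient.
    """
    durations = defaultdict(list)
    for alarm in history:
        durations[alarm.alarm_type].append(alarm.duration)

    rules = {}
    for alarm_type, values in durations.items():
        if len(values) < min_instances:
            continue  # too little evidence to derive a rule
        values.sort()
        index = min(int(quantile * len(values)), len(values) - 1)
        rules[alarm_type] = values[index]
    return rules


def filter_alarms(alarms, rules):
    """Keep alarms with no rule for their type or with a long enough duration."""
    return [a for a in alarms
            if a.alarm_type not in rules or a.duration >= rules[a.alarm_type]]


# Made-up history: seven 2 s instances and three 600 s instances of one type,
# plus a single instance of another type that has too little history for a rule.
history = (
    [AlarmInstance("Example Link Flap", 100.0 * i, 100.0 * i + 2.0) for i in range(7)]
    + [AlarmInstance("Example Link Flap", 5000.0 * i, 5000.0 * i + 600.0) for i in range(1, 4)]
    + [AlarmInstance("Example Power Failure", 0.0, 30.0)]
)
rules = derive_flapping_rules(history)
kept = filter_alarms(history, rules)
print(rules)
print(f"kept {len(kept)} of {len(history)} alarms")
```

A rule of this form is cheap to apply online: the filter needs only the alarm type and how long the instance has been active.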

Correlation analysis

Alarm instances of the same type with repeated occurrence and disappearance behavior are called transient. This type of alarm could be greatly reduced or compressed by the flapping rules. Most of the time, the root cause of the transient alarms can recover automatically without any network operator intervention. Another serious phenomenon is an alarm propagation chain, which means that one alarm triggers other alarms in a short time period because of the device linkage or functional dependency
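As a rough illustration of how an alarm propagation chain might be detected, the Python sketch below counts how often one alarm type is followed by another within a short time window and reports pairs with high confidence as candidate parent–child rules. This is a simplified co-occurrence heuristic assumed for illustration, not the algorithm of the paper; the function name `candidate_pc_rules`, its thresholds and the "Example Unrelated Alarm" events are hypothetical, and the timestamps are made up.

```python
from collections import Counter


def candidate_pc_rules(events, window=5.0, min_support=3, min_confidence=0.9):
    """Return sorted (parent, child, confidence) candidates.

    events: iterable of (timestamp_seconds, alarm_type) pairs.
    confidence = (# parent occurrences followed by child within `window`)
                 / (# parent occurrences).
    Events with identical timestamps are ordered by type name, so the
    direction of such pairs is arbitrary in this sketch.
    """
    events = sorted(events)
    occurrences = Counter(alarm_type for _, alarm_type in events)
    followed = Counter()

    for i, (t_parent, parent) in enumerate(events):
        seen = set()  # count each child type once per parent occurrence
        for t_child, child in events[i + 1:]:
            if t_child - t_parent > window:
                break  # events are sorted, so later ones are outside the window
            if child != parent and child not in seen:
                seen.add(child)
                followed[(parent, child)] += 1

    rules = []
    for (parent, child), count in followed.items():
        confidence = count / occurrences[parent]
        if count >= min_support and confidence >= min_confidence:
            rules.append((parent, child, confidence))
    return sorted(rules)


# Made-up timeline built around the PS-domain example from the introduction.
events = []
for start in (0.0, 100.0, 200.0, 300.0):
    events.append((start, "M3UA Signaling Link Failed"))
    events.append((start + 2.0, "M3UA Route Unavailable"))
events += [(50.0, "Example Unrelated Alarm"), (250.0, "Example Unrelated Alarm")]

pc_rules = candidate_pc_rules(events)
print(pc_rules)
```

Candidates produced this way would still need the kind of expert confirmation the paper describes, since co-occurrence within a window does not by itself establish causation.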

Experiment settings

Our experiments are conducted on a Huawei RH2288 V2 server with 2 × E5-2650 V2 CPUs, 384 GB (24 × 16 GB) of memory, a 2,348 GB (2 TB + 300 GB) hard drive and the SUSE 11 (x86_64) OS. We use datasets from five different Internet service providers, and the data format is the same as shown in Table 1.

Data preprocessing

Data cleaning is a fundamental step prior to data analysis. As the fields NEName, ObjectInstance, NEType, EventDetail, FaultFlag and EventTime are significant during alarm management and root cause analysis, we focus

Conclusions and future work

In this paper, we have presented a new system for automatically discovering flapping rules and P–C rules and for inferring root causes, which we verified using real-world datasets. From the performance study, we can see that our system can compress the alarms by approximately 84%, and over 90% of the P–C rules are shown to be correct after a manual check by the operation experts. Using the generated P–C rules, we can locate the root cause of the network problems with Bayesian network analysis. In future

Acknowledgment

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/FDS11/E06/14), the Start-Up Research Grant (RG 37/2016-2017R) and the Internal Research Grant (RG 66/2016-2017) of The Education University of Hong Kong.

References (35)

  • P. Bahl et al., Towards highly reliable enterprise network services via inference of multi-level dependencies, SIGCOMM (2007)
  • R.N. Mysore et al., Gestalt: fast, unified fault localization for networked systems, USENIX Annual Technical Conference (2014)
  • S. Kandula et al., Shrink: a tool for failure diagnosis in IP networks, ACM SIGCOMM Workshop on Mining Network Data (2005)
  • X. Chen et al., Knowops: Towards an embedded knowledge base for network management and operations, USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (2011)
  • S. Yin et al., A multivariate statistical combination forecasting method for product quality evaluation, Inf. Sci. (2016)
  • S. Yin et al., Intelligent particle filter and its application to fault detection of nonlinear system, IEEE Trans. Ind. Electron. (2015)
  • M. Steinder et al., A survey of fault localization techniques in computer networks, Sci. Comput. Program (2004)

Cited by (30)

    • APGNN: Alarm Propagation Graph Neural Network for fault detection and alarm root cause analysis

      2023, Computer Networks
      Citation Excerpt :

      Compared with the true fault that needs to be repaired, the volume of original alarms is quite large. Due to the existence of associated [4], repeated [5], and transient [7] alarms, it is time-consuming and impractical for operators to figure out the true fault from all the reported alarms in real time. The intuitive way is to find the alarm pattern which is relevant to the true fault.

    • Roots-tracing of communication network alarm: A real-time processing framework

      2021, Computer Networks
      Citation Excerpt :

      Alarm data provide a valuable reference for network operators to identify the potential root causes of network faults and ensure the normal operation of network services. Unfortunately, network faults may lead to the chained activation of alarms since the high interconnection of network elements, which means one alarm will trigger a series of alarms [2]. On daily basis, NOC typically receives millions of network alarms, which can have different significance levels and domains [3].

    • Association rules extraction for the identification of functional dependencies in complex technical infrastructures

      2021, Reliability Engineering and System Safety
      Citation Excerpt :

      In this context, the objective of the present work is the identification of functional dependencies among components of CTIs using the large amount of alarm messages, which are collected thanks to the recent advancement in the sensors, data acquisition, data storage and monitoring technologies. Alarm sequences are currently used for purposes different from those of the present work, such as root causes analyses, CTI management and malfunctions and failures identification by plant personnel [1,22,23,29,34,46] However, the large amount of alarm messages collected in short periods of time, which is referred to as “alarms flood” [10,21], makes the direct use of alarm messages unfeasible. For example, the operation center of a telecommunication network receives approximately 1 million alarms every day [46], whereas the CTI of the particle accelerator of CERN has produced more than 10 million alarms during 2016 [42].
