Efficient alarm behavior analytics for telecom networks
Introduction
In recent years, telecom (telecommunication) networks have grown rapidly in both scale and topological complexity with the increasing deployment of 4G cellular equipment such as LTE. Managing such large and complex networks has become a challenging issue. Network operators are compelled to monitor and process alarms to provide high-quality service to their customers. For a network service provider, the quality of the services offered and the efficiency of identifying and fixing faults in telecom networks are the key criteria for staying ahead of other network providers, reducing churn and attracting new customers.
An important way to sustain high-quality services is to address network faults in a timely manner. Every day, tens of thousands of faults are triggered across heterogeneous and interconnected devices in a telecom network. These faults are expressed by the network devices in the form of alarms, which are transmitted to the Network Operation Centre (NOC) for further processing by network operators. Additionally, there are thousands of types of alarms. If the network operators handle all alarms sequentially, they will be overloaded and unable to concentrate on finding the underlying reasons for the faults. Generally, approximately 1 million alarms are reported to the NOC every day. If there are five network operators working 8 h per day, each operator must process more than 400 alarms every minute throughout the day (1,000,000 / (5 × 480 min) ≈ 417), which is an impossible workload. Therefore, it is necessary to select the important alarms that are useful for identifying network problems. There are two categories of approaches to filtering out trivial alarms. First, some types of alarms occur with high frequency but last for only a few hundred seconds or less. Network operators do not need to receive all of these alarms; the ones that last for a long duration are sufficient to identify the underlying problems. Therefore, an approach to determining a proper rule by which the network operators can eliminate the trivial alarms is crucial. Second, alarms in different categories may be correlated with each other. For example, in the PS (packet-switched) domain, if the alarm M3UA Signaling Link Failed occurs, the alarm M3UA Route Unavailable will always occur within a few seconds. These correlations among alarms are called correlation rules, and they can be further exploited to (1) reduce the number of alarms sent to the network operators and (2) establish P–C (parent–child) rules to pinpoint the root causes of network faults.
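As a rough illustration of how a correlation rule can reduce the alarm volume, the sketch below drops a child alarm when its parent was raised on the same network element shortly before. The `Alarm` record, the `suppress_children` helper, the 5-second window and the same-element matching are assumptions made for this example; only the two M3UA alarm names come from the text, and this is not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    ne_name: str      # network element that raised the alarm (hypothetical field)
    alarm_type: str
    raised_at: float  # seconds since an arbitrary epoch


# Parent -> children correlation rule taken from the example in the text.
PC_RULES = {"M3UA Signaling Link Failed": {"M3UA Route Unavailable"}}


def suppress_children(alarms, rules=PC_RULES, window_s=5.0):
    """Drop child alarms that follow their parent on the same element within window_s."""
    kept = []
    last_seen = {}  # (ne_name, alarm_type) -> most recent raise time
    for alarm in sorted(alarms, key=lambda a: a.raised_at):
        is_child = any(
            alarm.alarm_type in children
            and (alarm.ne_name, parent) in last_seen
            and 0 <= alarm.raised_at - last_seen[(alarm.ne_name, parent)] <= window_s
            for parent, children in rules.items()
        )
        last_seen[(alarm.ne_name, alarm.alarm_type)] = alarm.raised_at
        if not is_child:
            kept.append(alarm)
    return kept
```

Only the parent alarm then reaches the operator, who can treat it as the likely root cause of the suppressed children.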
The network management system of the network provider normally focuses on processing alarms by identifying and compressing flapping alarms, identifying correlated alarms and locating the root cause of the alarms. Previous works [1], [2], [3], [4], [5] have investigated some aspects of these issues. In [1], a TASA system was proposed to determine association and correlation rules, but it does not address the flapping issue. In addition, the number of association and correlation rules, which explodes with the growth of network scale, needs to be confirmed by the operation experts one by one. The solution proposed in this article can filter out many trivial rules and reduce the time spent on rule confirmation. IBM Netcool [4] provides alarm root correlation rules but requires the operation experts to configure the rules for each specific project, whereas in our proposed solution all rules are generated automatically. The root cause inference approach is applied in EMC SMARTS [5], but that system cannot adapt to dynamic networks.
The difficulty of automatic alarm correlation is increased by the huge number of monitoring alarms and by the time delays of probes and database systems. In this paper, we propose an alarm analysis and management system called Automatic Alarm Behavior Discovery (AABD). We also propose methods for establishing flapping rules and parent–child (P–C) rules automatically. The main contribution of this paper is an alarm analysis and management system capable of the following:
1. Providing a statistical alarm analysis of historical alarms to assist the operators in better understanding network failures;
2. Establishing flapping rules based on statistical analysis to filter the trivial alarms;
3. Identifying correlation rules and P–C rules to locate the problems efficiently; and
4. Conducting extensive experiments on real-world datasets obtained from network service providers.
The remaining sections of the paper are organized as follows: Section 2 reviews related works on alarm analysis and discovery. In Section 3, we briefly introduce the system architecture. We present the approaches to establishing flapping rules and P–C rules in Sections 4 and 5 respectively, and we elaborate upon and analyze the experimental results in Section 6. Finally, we draw our conclusions from this research in Section 7.
Related work
In network management, fault management is a set of functions that detect abnormal behaviors, generate alarms, isolate problems and resolve issues in the network. Fault management systems have existed for decades, but in most cases these tools have not offered automatic solutions for alarm monitoring that develop correlations and infer root causes for automatic trouble ticket issuing. IBM Netcool [4] provides an alarm root correlation rule-based system to compress alarms generated
Alarms and system architecture
In this section, we will give a brief description of alarms in telecom networks and then introduce the architecture of the AABD system, as shown in Fig. 1.
Flapping analysis
In this section, we study the alarm behaviors and derive a flapping rule for each transient alarm. First, we give a detailed description of a transient (flapping) alarm. Then, an efficient method is presented to identify transient flapping alarms and output a flapping rule for each transient alarm.
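The snippet above does not spell out how a flapping rule is derived, so the following is only an assumed heuristic, not the authors' method. It treats an alarm type as transient when most of its historical instances clear quickly, and it emits a hold time after which a still-active alarm would be forwarded. The function name `derive_flapping_rules`, the tuple layout and all thresholds are hypothetical.

```python
from collections import defaultdict


def derive_flapping_rules(records, max_duration_s=300.0, min_share=0.9, min_count=10):
    """Derive a hold time per transient alarm type.

    records: iterable of (alarm_type, raised_at, cleared_at), times in seconds.
    Returns {alarm_type: hold_time_s} for types judged transient.
    """
    durations = defaultdict(list)
    for alarm_type, raised_at, cleared_at in records:
        durations[alarm_type].append(cleared_at - raised_at)

    rules = {}
    for alarm_type, ds in durations.items():
        short = [d for d in ds if d <= max_duration_s]
        # Transient: enough history, and nearly all instances are short-lived.
        if len(ds) >= min_count and len(short) / len(ds) >= min_share:
            # Hold the alarm as long as the longest short-lived instance;
            # anything still active after that is forwarded to the operator.
            rules[alarm_type] = max(short)
    return rules
```

A percentile of the observed durations could replace `max(short)` if a few outliers would otherwise make the hold time too long.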
Correlation analysis
Alarm instances of the same type with repeated occurrence and disappearance behavior are called transient. This type of alarm could be greatly reduced or compressed by the flapping rules. Most of the time, the root cause of the transient alarms can recover automatically without any network operator intervention. Another serious phenomenon is an alarm propagation chain, which means that one alarm triggers other alarms in a short time period because of the device linkage or functional dependency
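One simple way to look for candidate rules in such propagation chains is to count how often one alarm type is followed by another within a short window. The sketch below does this with a confidence threshold; the function name `mine_candidate_rules`, the event layout and the thresholds are assumptions for illustration, and the paper's own correlation method may differ.

```python
from collections import Counter


def mine_candidate_rules(events, window_s=5.0, min_confidence=0.9, min_support=3):
    """Find ordered alarm-type pairs that co-occur within a time window.

    events: iterable of (timestamp_s, alarm_type).
    Returns {(parent, child): confidence}, where confidence is the share of
    parent occurrences followed by the child within window_s.
    """
    events = sorted(events)
    occurrences = Counter(alarm_type for _, alarm_type in events)
    followed = Counter()
    for i, (t_parent, parent) in enumerate(events):
        seen = set()  # count each child type at most once per parent occurrence
        for t_child, child in events[i + 1:]:
            if t_child - t_parent > window_s:
                break
            if child != parent and child not in seen:
                seen.add(child)
                followed[(parent, child)] += 1
    return {
        pair: count / occurrences[pair[0]]
        for pair, count in followed.items()
        if occurrences[pair[0]] >= min_support
        and count / occurrences[pair[0]] >= min_confidence
    }
```

Pairs that pass the thresholds are only candidates; they would still need confirmation, for example by operation experts.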
Experiment settings
Our experiments are conducted on a Huawei RH2288 V2 server with two E5-2650 V2 CPUs, 384 GB (24 × 16 GB) of memory, a 2,348 GB (2 TB + 300 GB) hard drive and the SUSE 11 (x86_64) OS. We use datasets from six different Internet service providers, and the data format is the same as shown in Table 1.
Data preprocessing
Data cleaning is a fundamental step prior to data analysis. As the fields NEName, ObjectInstance, NEType, EventDetail, FaultFlag and EventTime are significant during alarm management and root cause analysis, we focus
Conclusions and future work
In this paper, we have presented a new system for automatically discovering flapping rules and P–C rules and for inferring root causes, which we verified using real-world datasets. The performance study shows that our system can compress the alarms by approximately 84%, and over 90% of the P–C rules are shown to be correct after a manual check by the operation experts. Using the generated P–C rules, we can locate the root cause of network problems with Bayesian network analysis. In future
Acknowledgment
The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/FDS11/E06/14), the Start-Up Research Grant (RG 37/2016-2017R) and the Internal Research Grant (RG 66/2016-2017) of The Education University of Hong Kong.
References (35)
- et al. Data-driven process monitoring based on modified orthogonal projections to latent structures. IEEE Trans. Control Syst. Technol. (2015)
- et al. Incremental support vector learning for ordinal regression. IEEE Trans. Neural Netw. Learn. Syst. (2015)
- et al. A robust regularization path algorithm for ν-support vector classification. IEEE Trans. Neural Netw. Learn. Syst. (2016)
- et al. Toward efficient multi-keyword fuzzy search over encrypted outsourced data with accuracy improvement. IEEE Trans. Inf. Forensics Secur. (2016)
- et al. Rule discovery in telecommunication alarm data. J. Netw. Syst. Manage. (1999)
- et al. Mining complex event patterns in computer networks. New Frontiers in Mining Complex Patterns (2013)
- et al. G-RCA: a generic root cause analysis platform for service quality management in large IP networks. IEEE/ACM Trans. Netw. (2012)
- 2017, (http://www-03.ibm.com/software/products/en/netcool-network-management). Online; accessed 17 March...
- 2017, (http://www.emc.com/it-management/smarts/index.htm). Online; accessed 17 March...
- et al. Troubleshooting chronic conditions in large IP networks. ACM CoNEXT Conference (2008)
- Towards highly reliable enterprise network services via inference of multi-level dependencies. SIGCOMM
- Gestalt: fast, unified fault localization for networked systems. USENIX Annual Technical Conference
- Shrink: a tool for failure diagnosis in IP networks. ACM SIGCOMM Workshop on Mining Network Data
- Knowops: Towards an embedded knowledge base for network management and operations. USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services
- A multivariate statistical combination forecasting method for product quality evaluation. Inf. Sci.
- Intelligent particle filter and its application to fault detection of nonlinear system. IEEE Trans. Ind. Electron.
- A survey of fault localization techniques in computer networks. Sci. Comput. Program.