Efficient alarm behavior analytics for telecom networks

doi:10.1016/j.ins.2017.03.020

Information Sciences

Volume 402, September 2017, Pages 1-14

https://doi.org/10.1016/j.ins.2017.03.020

Abstract

Locating network fault problems and filtering trivial alarms from important ones are the two main challenges in Network Operation Centers (NOCs). In this paper, we present an alarm behavior analysis and discovery system, AABD, that establishes flapping and parent–child (P–C) rules to reveal the operation patterns from a large number of alarms in telecom networks. These rules can be exploited to filter out unimportant alarms, conduct multi-dimensional analysis of the alarms and identify potential network problems. We propose two novel and effective algorithms to establish the flapping rules and P–C rules. The proposed system is validated using alarm datasets from five Internet service providers. Specifically, we verify the system and methodology in each of the five network domains, i.e., circuit-switched network (CS), packet-switched network (PS), 2G-radio access network (RAN-2G), 3G-radio access network (RAN-3G) and 4G-radio access network (RAN-4G), as these five domains can, to a great extent, form a complete network environment. More importantly, our system can establish a small number of rules, only dozens of flapping rules and P–C rules, and compress the alarms by approximately 84%, i.e., 84% of alarms will not be sent to the network operator. To summarize, the proposed system can help network operators respond to network faults in a timely fashion, locate the faults accurately and significantly reduce the time spent on these tasks.

Introduction

In recent years, we have witnessed the rapid growth of telecom (telecommunication) networks in both scale and topological complexity with the increasing deployment of 4G cellular (e.g., LTE) equipment. However, the management of such large and complex networks has become a challenging issue. Network operators are compelled to monitor and process alarms to provide high-quality service to their customers. For a network service provider, the quality of the services offered and the efficiency of identifying and fixing faults in telecom networks are the key criteria for becoming more competitive than other network providers, reducing churn and attracting new customers.

An important way to sustain high-quality services is to address network faults in a timely manner. Every day, tens of thousands of faults are triggered across heterogeneous and interconnected devices in a telecom network. These faults are expressed by the network devices in the form of alarms, which are transmitted to the Network Operation Center (NOC) for further processing by network operators. Additionally, there are thousands of types of alarms. If the network operators handle all alarms sequentially, they will be overloaded and unable to concentrate on finding the underlying reasons for the faults. Generally, approximately 1 million alarms are reported to the NOC every day. If there are five network operators working 8 h per day, each operator must process roughly 417 alarms every minute of the shift (1,000,000 alarms ÷ 5 operators ÷ 480 min), which is an impossible workload. Therefore, it is necessary to select the important alarms that are useful for identifying network problems.

There are two categories of approaches to filtering out trivial alarms. First, some types of alarms occur with high frequency but last only a few hundred seconds or even less. Rather than receiving every alarm, network operators need only the long-lasting alarms to identify the underlying problems. Therefore, an approach to determining a proper rule by which the network operators can eliminate the trivial alarms is crucial. Second, alarms in different categories may be correlated with each other. For example, in the PS domain, if the alarm M3UA Signaling Link Failed occurs, the alarm M3UA Route Unavailable will always occur within a few seconds. These correlations among alarms are called correlation rules, and they can be further exploited to (1) reduce the number of alarms sent to the network operators and (2) establish P–C (parent–child) rules to pinpoint the root causes of network faults.
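
As a rough illustration of how such a correlation could be surfaced from an alarm log, the sketch below counts how often one alarm type is followed by another within a short time window. It is not the algorithm proposed in this paper: the function name `follow_confidence`, the 5 s window, the record layout and the sample timestamps are all assumptions made for illustration; only the two M3UA alarm names come from the example above.

```python
from collections import defaultdict

# Hypothetical alarm log: (timestamp in seconds, alarm name).
ALARMS = [
    (0, "M3UA Signaling Link Failed"),
    (3, "M3UA Route Unavailable"),
    (100, "M3UA Signaling Link Failed"),
    (102, "M3UA Route Unavailable"),
    (500, "M3UA Route Unavailable"),
]


def follow_confidence(alarms, window_s=5):
    """Estimate how often alarm B follows alarm A within window_s seconds.

    Returns {(A, B): fraction of A occurrences followed by at least one B}.
    The window length is an assumed parameter, not a value from the paper.
    """
    alarms = sorted(alarms)
    occurrences = defaultdict(int)  # how many times each alarm name occurs
    followed = defaultdict(int)     # how many A occurrences were followed by B
    for i, (t_a, name_a) in enumerate(alarms):
        occurrences[name_a] += 1
        seen = set()  # count each follower name at most once per A occurrence
        for t_b, name_b in alarms[i + 1:]:
            if t_b - t_a > window_s:
                break  # log is sorted, so later alarms are outside the window
            if name_b != name_a and name_b not in seen:
                seen.add(name_b)
                followed[(name_a, name_b)] += 1
    return {pair: n / occurrences[pair[0]] for pair, n in followed.items()}


rules = follow_confidence(ALARMS)
```

On this toy log every M3UA Signaling Link Failed is followed by M3UA Route Unavailable within the window, while the reverse pair never appears, which is the kind of asymmetry a parent–child relationship would be expected to show.
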
The network management system of a network provider normally processes alarms by identifying and compressing flapping alarms, identifying correlated alarms and locating the root cause of the alarms. Previous works [1], [2], [3], [4], [5] have investigated some aspects of these issues. In [1], the TASA system was proposed to determine association and correlation rules, but it does not address the flapping issue. In addition, the association and correlation rules, whose number explodes as the network grows, need to be confirmed by the operation experts one by one. The solution proposed in this paper can filter out many trivial rules and reduce the time spent on rule confirmation. IBM Netcool [4] provides alarm root correlation rules but requires the operation experts to configure the rules for each specific project, whereas in our proposed solution, all rules are generated automatically. The root cause inference approach is applied in EMC SMARTS [5], but the system cannot adapt to dynamic networks.

The difficulty of automatic alarm correlation is increased by the huge number of monitoring alarms and/or the time delay issue of probes and database systems. In this paper, we propose an alarm analysis and management system called Automatic Alarm Behavior Discovery (AABD). We also propose methods for establishing flapping rules and parent–child (P–C) rules automatically. The main contribution of this paper is an alarm analysis and management system capable of the following:

1. Providing a statistical alarm analysis of historical alarms to assist the operators in better understanding network failures;
2. Establishing flapping rules based on statistical analysis to filter the trivial alarms;
3. Identifying correlation rules and P–C rules to locate the problems efficiently; and
4. Conducting extensive experiments on real-world datasets obtained from network service providers.

The remaining sections of the paper are organized as follows: Section 2 reviews related works on alarm analysis and discovery. In Section 3, we briefly introduce the system architecture. We present the approaches to establishing flapping rules and P–C rules in Sections 4 and 5 respectively, and we elaborate upon and analyze the experimental results in Section 6. Finally, we draw our conclusions from this research in Section 7.

Section snippets

Related work

In network management, fault management is a set of functions that detect abnormal behaviors, generate alarms, isolate problems and resolve issues in the network. Fault management systems have existed for decades, but in most cases, these tools have not presented automatic solutions for alarm monitoring by developing correlations as well as root cause inference for automatic trouble ticket issues. IBM Netcool [4] provides an alarm root correlation rule based system to compress alarms generated

Alarms and system architecture

In this section, we will give a brief description of alarms in telecom networks and then introduce the architecture of the AABD system, as shown in Fig. 1.

Flapping analysis

In this section, we study the alarm behaviors and derive a flapping rule for each transient alarm. First, we give a detailed description of a transient (flapping) alarm. Then, an efficient method is presented to identify transient flapping alarms and output a flapping rule for each transient alarm.
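
To make the idea of a flapping rule concrete, the following minimal sketch applies a per-type duration threshold: an alarm whose lifetime is shorter than the threshold for its type is treated as transient and suppressed. This is only an assumed form of rule, not the method derived in this section; the `Alarm` record, the alarm type names, the 300 s threshold and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    alarm_type: str
    raised_at: float   # time the alarm occurred, in seconds
    cleared_at: float  # time the alarm disappeared, in seconds


def apply_flapping_rules(alarms, min_duration_s):
    """Keep alarms that last at least the threshold set for their type.

    min_duration_s maps an alarm type to its (assumed) flapping threshold;
    types without a rule are always kept.
    """
    kept = []
    for alarm in alarms:
        threshold = min_duration_s.get(alarm.alarm_type, 0.0)
        if alarm.cleared_at - alarm.raised_at >= threshold:
            kept.append(alarm)
    return kept


def compression_ratio(total, kept):
    """Fraction of alarms that are not forwarded to the operator."""
    return 1.0 - kept / total if total else 0.0


# Hypothetical sample: three short-lived "Link Down" alarms, one long one,
# and one alarm type that has no flapping rule.
sample = [
    Alarm("Link Down", 0, 2),
    Alarm("Link Down", 10, 13),
    Alarm("Link Down", 20, 24),
    Alarm("Link Down", 30, 630),
    Alarm("Power Failure", 40, 70),
]
rules = {"Link Down": 300.0}

kept = apply_flapping_rules(sample, rules)
ratio = compression_ratio(len(sample), len(kept))
```

Here only the long-lasting Link Down alarm and the Power Failure alarm are kept, so three of the five sample alarms are filtered out.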

Correlation analysis

Alarm instances of the same type with repeated occurrence and disappearance behavior are called transient. This type of alarm could be greatly reduced or compressed by the flapping rules. Most of the time, the root cause of the transient alarms can recover automatically without any network operator intervention. Another serious phenomenon is an alarm propagation chain, which means that one alarm triggers other alarms in a short time period because of the device linkage or functional dependency

Experiment settings

Our experiments are conducted on a Huawei RH2288 V2 server with 2 × E5-2650 V2 CPUs, 384 GB (24 × 16 GB) of memory, a 2,348 GB (2 TB + 300 GB) hard drive and the SUSE 11 (x86_64) OS. We use datasets from six different Internet service providers, and the data format is the same as shown in Table 1.

Data preprocessing

Data cleaning is a fundamental step prior to data analysis. As the fields NEName, ObjectInstance, NEType, EventDetail, FaultFlag and EventTime are significant during alarm management and root cause analysis, we focus

Conclusions and future work

In this paper, we have presented a new system for automatically discovering flapping rules, P–C rules and root cause inference, which we verified using real-world datasets. From the performance study, we can see that our system can compress the alarms by approximately 84%, and over 90% of the P–C rules are shown to be correct after a manual check by the operation experts. Using the generated P–C rules, we can locate the root cause of the network problems with Bayes network analysis. In future

Acknowledgment

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/FDS11/E06/14), the Start-Up Research Grant (RG 37/2016-2017R) and the Internal Research Grant (RG 66/2016-2017) of The Education University of Hong Kong.

References (35)

S. Yin et al., Data-driven process monitoring based on modified orthogonal projections to latent structures, IEEE Trans. Control Syst. Technol. (2015)
B. Gu et al., Incremental support vector learning for ordinal regression, IEEE Trans. Neural Netw. Learn. Syst. (2015)
B. Gu et al., A robust regularization path algorithm for ν-support vector classification, IEEE Trans. Neural Netw. Learn. Syst. (2016)
Z. Fu et al., Toward efficient multi-keyword fuzzy search over encrypted outsourced data with accuracy improvement, IEEE Trans. Inf. Forensics Secur. (2016)
M. Klemettinen et al., Rule discovery in telecommunication alarm data, J. Netw. Syst. Manage. (1999)
D. Seipel et al., Mining complex event patterns in computer networks, New Frontiers in Mining Complex Patterns (2013)
H. Yan et al., G-RCA: A generic root cause analysis platform for service quality management in large IP networks, IEEE/ACM Trans. Netw. (2012)
2017, (http://www-03.ibm.com/software/products/en/netcool-network-management). Online; accessed 17 March...
2017, (http://www.emc.com/it-management/smarts/index.htm). Online; accessed 17 March...
A. Mahimkar et al., Troubleshooting chronic conditions in large IP networks, ACM CoNEXT Conference (2008)
P. Bahl et al., Towards highly reliable enterprise network services via inference of multi-level dependencies, SIGCOMM (2007)
R.N. Mysore et al., Gestalt: fast, unified fault localization for networked systems, USENIX Annual Technical Conference (2014)
S. Kandula et al., Shrink: a tool for failure diagnosis in IP networks, ACM SIGCOMM Workshop on Mining Network Data (2005)
X. Chen et al., KnowOps: Towards an embedded knowledge base for network management and operations, USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (2011)
S. Yin et al., A multivariate statistical combination forecasting method for product quality evaluation, Inf. Sci. (2016)
S. Yin et al., Intelligent particle filter and its application to fault detection of nonlinear system, IEEE Trans. Ind. Electron. (2015)
M. Steinder et al., A survey of fault localization techniques in computer networks, Sci. Comput. Program. (2004)

Cited by (30)

Alarm correlation analysis with applications to industrial alarm management
2024, Control Engineering Practice
Alarm systems are essential for the safe and efficient operation of process industries. However, complex plant connectivity and process interactions could cause many correlated alarms in practice and thus compromise alarm system performance. To address correlated alarms, it is desired that alarm correlations are discovered from historical Alarm and Event (A&E) logs, so the obtained results could help improve alarm configurations or design suppression strategies. Motivated by this problem, a systematic method to extract alarm correlation is proposed in this work and the contributions are: (1) Correlated alarms and their occurrence orders are captured as correlation patterns through pattern mining, and such patterns are characterized by statistical features. (2) Alarm correlations and their statistical features are visualized as network graphs to indicate process interactions and identify alarms for prioritized analysis. To demonstrate the effectiveness of the proposed method, case studies are provided using an industrial simulation benchmark Vinyl Acetate Monomer (VAM) plant model.
APGNN: Alarm Propagation Graph Neural Network for fault detection and alarm root cause analysis
2023, Computer Networks
Citation excerpt:
Compared with the true fault that needs to be repaired, the volume of original alarms is quite large. Due to the existence of associated [4], repeated [5], and transient [7] alarms, it is time-consuming and impractical for operators to figure out the true fault from all the reported alarms in real time. The intuitive way is to find the alarm pattern which is relevant to the true fault.
Telecommunication network plays an important role in our daily life. Fault detection and alarm root cause analysis are the keys to ensure the normal operation of the network. To reduce the burden on operators, numerous methods are employed to analyse root cause of faults. However, there still remain a large amount of non-essential or transient alarms after root cause analysis. A simple Rule-based method may help ease the problems. But it needs prior expert knowledge and the diversity of alarm pattern makes the rules redundant and complicated. Moreover, it cannot accurately cover all true faults and need manual methods as complement. In this work, we propose Alarm Propagation Graph Neural Network (APGNN), a novel data-driven propagation-based root cause analysis and fault detection approach. It first associates alarms and extracts root-derived graph based on Bayesian Network. Then it constructs alarm propagation graphs (APG). We refine the repair orders to obtain actual fault information. At last, Graph Neural Network is used to extract features and learn the mapping from APG to the true fault. Our method not only detects the true fault from large volume of original alarms, but also analyses the root cause alarms. We evaluate our approach both on the offline and online environment of the real-world IP Radio Access Network. Experiments show that our model outperforms the state-of-art approach by 4.6% in F1-score on average.
Roots-tracing of communication network alarm: A real-time processing framework
2021, Computer Networks
Citation excerpt:
Alarm data provide a valuable reference for network operators to identify the potential root causes of network faults and ensure the normal operation of network services. Unfortunately, network faults may lead to the chained activation of alarms since the high interconnection of network elements, which means one alarm will trigger a series of alarms [2]. On daily basis, NOC typically receives millions of network alarms, which can have different significance levels and domains [3].
In the communication network, since the interconnection of a large number of components, mobile network operators run Operations Support Systems that generate vast amounts of alarm events. The harsh challenge for network operators is how to find the potential root causes from massive alarms in real time. In this paper, we propose a novel solution for the root causes analysis. The solution includes a silent gap based approach to resolve the asynchrony of alarms, an algorithm for constructing Bayesian network (BN) based on sequentiality between alarms, and Bayesian inference to identify the root causes. The silent gap-based approach reduces preprocessing time while taking into account the validity. Also, the proposed BN-based mechanism allows the identification of the root causes with a higher accuracy. Experiments conducted on a real alarm dataset are provided to support the proposed methods. In addition, we propose a new algorithm processing framework.
Association rules extraction for the identification of functional dependencies in complex technical infrastructures
2021, Reliability Engineering and System Safety
Citation excerpt:
In this context, the objective of the present work is the identification of functional dependencies among components of CTIs using the large amount of alarm messages, which are collected thanks to the recent advancement in the sensors, data acquisition, data storage and monitoring technologies. Alarm sequences are currently used for purposes different from those of the present work, such as root causes analyses, CTI management and malfunctions and failures identification by plant personnel [1,22,23,29,34,46]. However, the large amount of alarm messages collected in short periods of time, which is referred to as “alarms flood” [10,21], makes the direct use of alarm messages unfeasible. For example, the operation center of a telecommunication network receives approximately 1 million alarms every day [46], whereas the CTI of the particle accelerator of CERN has produced more than 10 million alarms during 2016 [42].
This work proposes a method for identifying functional dependencies among components of complex technical infrastructures using databases of alarm messages. The developed method is based on the representation of the alarm database by a binary matrix, the use of the Apriori algorithm for mining association rules and a new algorithm for identifying groups of functionally dependent components. The effectiveness of the proposed method is shown by means of its application to an artificial case study and a real large-scale database of alarms generated by different supervision systems of the complex technical infrastructure of CERN (European Organization for Nuclear Research).
Mining concise patterns on graph-connected itemsets
2019, Neurocomputing
The itemset is a basic and usual form of data. People can obtain new insights into their business by discovering its implicit regularities through pattern mining. In some real applications, e.g., network alarm association, the itemsets usually have the following two characteristics: (1) the observed samples come from different entities, with inherent structural relationships implied in their static properties; (2) the samples are scarce, which may lead to incomplete pattern extraction. This paper considers how to efficiently find a concise set of patterns on such kind of data. Firstly, we use a graph to express the entities and their interconnections and propagate every sample to every node with a weight, determined by the pre-defined combination of kernel functions based on the similarities of the nodes and patterns. Next, the weight values can be naturally imported into the MDL-based filtering process and bring a differentiated pattern set for each node. Experiments show that the solution can outperform the global solution (trading all nodes as one) and isolated solution (removing all edges) on simulated and real data, and its effectiveness and scalability can be further verified in the application of large-scale network operation and maintenance.
Effective Fault Scenario Identification for Communication Networks via Knowledge-Enhanced Graph Neural Networks
2024, IEEE Transactions on Mobile Computing
