Mining concise patterns on graph-connected itemsets
Introduction
Pattern mining aims to discover potential co-occurrence relationships among the items in a database. Classical frequent pattern methods, such as Apriori [1] and FP-Growth [2], extract qualified patterns by generating, sorting and filtering candidates from the data, merely counting their occurrences. The resulting patterns can be used either as a final result for human analysts or as intermediate features for subsequent data mining tasks, such as classification and clustering. These methods are employed in a wide variety of domains, thanks to their intuitive design and speed.
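The counting-and-filtering loop these methods share can be illustrated with a naive level-wise miner (a sketch only; real Apriori and FP-Growth implementations prune and organize candidates far more cleverly):

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_len=3):
    """Naive level-wise search: count every candidate itemset of size k
    and keep those supported by at least `min_support` transactions."""
    result = {}
    for k in range(1, max_len + 1):
        counts = Counter()
        for t in transactions:
            for cand in combinations(sorted(t), k):
                counts[cand] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        if not level:  # no frequent itemset of size k, so none of size k+1
            break
        result.update(level)
    return result

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
patterns = frequent_itemsets(db, min_support=2)
```

The early stop relies on the anti-monotonicity of support: a superset can never be more frequent than its subsets.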
However, some common defects cannot be overlooked in practice. The main problem is pattern explosion [3]. If a highly frequent sub-itemset passes the threshold examination, it will probably have many similar companions that also satisfy the test, so a large number of redundant results is likely to be produced. For human analysts, it is tedious to check and comprehend them one by one; for data mining tasks, they may make the subsequent model overfit latent noise and deteriorate its predictive accuracy. A straightforward remedy is to set the filtering threshold (i.e., the minimum support) high enough to keep the total number of patterns in a reasonable range. Nevertheless, this may make the final results less informative, for they become so apparent that they can sometimes be spotted with the bare eye.
To solve this problem, attention has shifted from frequent patterns to interesting or useful patterns. The critical issue is to design a more meaningful, and also computable, optimization target. One popular approach is to filter out redundant patterns based on the MDL criterion, as in the Krimp algorithm [3]. It treats the sought set of patterns as a dictionary for encoding the data, comprising the code table itself and the encoded data body, and calculates the corresponding compressed size from the total empirical entropy. Since searching for the best combination of patterns is an NP-hard problem, Krimp employs a heuristic approach to find a sub-optimal solution in polynomial time. Experiments in [3] show that Krimp can generally reduce the number of outputs by at least 2–3 orders of magnitude, and thus uncover rare but helpful patterns.
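To make the MDL target concrete, the following sketch computes only the database-encoding part of Krimp's objective, assuming a cover (which patterns encode each transaction) is already given; the code-table cost and the cover search itself are omitted, and the data layout is a hypothetical simplification:

```python
import math
from collections import Counter

def encoded_db_size(covers):
    """Bits needed to encode the database given a cover of each
    transaction; code lengths are Shannon-optimal, i.e. a pattern used
    more often gets a shorter code (the core of the MDL criterion)."""
    usage = Counter(p for cover in covers for p in cover)
    total = sum(usage.values())
    code_len = {p: -math.log2(u / total) for p, u in usage.items()}
    return sum(code_len[p] for cover in covers for p in cover)

# Two transactions covered by the pattern {a, b}, one by the singleton {c}.
covers = [[("a", "b")], [("a", "b")], [("c",)]]
bits = encoded_db_size(covers)
```

A pattern set that covers the data with few, heavily reused patterns yields short codes and hence a small total size, which is exactly why MDL penalizes redundant pattern sets.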
Here we mainly consider how to apply Krimp to itemsets with structural relationships. In relational databases, historical records are usually not produced anonymously but carry an identity field marking who generated them. Besides, there is typically a property table, where the identities act as a primary key instead of the foreign key in the records, describing the static attributes of each entity, such as a user’s basic information or a networking device’s uplink and downlink. These properties matter in two respects. First, when the sample size is insufficient, it is possible to “reuse” data between similar objects, by analyzing the similarity between entities, to improve the completeness and robustness of the results. Second, in contrast to the traditional global approach, users are sometimes more concerned about each entity’s specific patterns, to see whether it has a unique personality.
Similar to the scenario of multi-task learning, the key question here is how to acquire and exploit the relatedness of multiple tasks [4]. The cross-task structure can be either learned from the data or defined from prior knowledge. Once we have it, the relationship can be used to direct data sharing between multiple tasks; or to provide a regularization for co-training multiple models; or, alternatively, to control a multi-output model’s complexity; in fact, these three statements are equivalent. For the problem discussed here, unfortunately only a few samples can be collected relative to the number of entities, and there are no labels in this unsupervised setting, so we have to rely on domain knowledge to define the relevance and inject it into the model through sensible regularization conditions.
Krimp applies the MDL principle directly to the raw data through a self-defined heuristic search, which bypasses the usual numerical optimization methods. A more practical multi-task solution is not to construct multiple code tables simultaneously by some sophisticated tactic, but to make the records under every node fully visible to each other, while differentiating them by generating distinct weight matrices based on their relative positions and content. This similarity of nodes on a graph can be derived either by a classic random walk or by a maximal-entropy random walk. The weights can then be introduced directly into Krimp’s MDL-based criterion for evaluating pattern sets. Moreover, the whole computation is easy to parallelize, and performance scales almost linearly on a multi-core machine without much effort.
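A classic random-walk weighting of nodes can be sketched as follows; the step count and restart probability are illustrative choices, and the maximal-entropy variant is not shown:

```python
import numpy as np

def walk_weights(adj, steps=3, restart=0.2):
    """Node-to-node weights from a short random walk with restart over
    the adjacency matrix `adj`; each row of the result sums to one."""
    adj = np.asarray(adj, dtype=float)
    P = adj / adj.sum(axis=1, keepdims=True)   # transition probabilities
    n = len(P)
    M = np.eye(n)          # walk distribution after 0 steps
    W = np.eye(n)          # accumulated visiting mass
    for _ in range(steps):
        M = (1 - restart) * M @ P + restart * np.eye(n)
        W += M
    return W / W.sum(axis=1, keepdims=True)    # row-normalised weights

# Star graph: node 0 is connected to nodes 1 and 2.
W = walk_weights([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
```

Row i of `W` can then serve as the weight vector that tells node i how strongly to trust records observed at every other node.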
In the sections below, we first introduce the scenario of network alarm analysis and point out why multi-task pattern mining is important, then summarize the related work we have surveyed in Section 3. Next, the necessary background on compressing patterns, distribution embedding, and graph kernels is briefly given. In Sections 5 and 6, the theoretical model and the algorithm design are presented in detail. A set of experiments is designed and conducted in Section 7. Finally, we summarize the whole article.
Section snippets
Motivated application: network alarm correlation
Nowadays, mobile communication has become a pivotal ingredient of everyone’s life. To provide such a service, wireless carriers deploy a grid of base stations at an appropriate density to achieve ubiquitous coverage of a geographic area and provide adequate bandwidth with reliability in a load-balanced manner. This type of network is often referred to as a cellular network because its stations are laid out in a hive-like layout (Fig. 1(b)), though the coverage area of one tower is of course not
Structural pattern mining
A recent survey of alarm correlation methods can be found in [5]. There are two approaches to this issue, rule-based and data-based: the former mainly relies on experts to define hard-coded rules, while a large part of the latter prefers pattern mining to generate rules on the machine itself. The initial work on mining association rules on alarms can be found in [6], which built a semi-automatic system in service mainly based on rules from humans, and let the machine propose
MDL-based pattern mining
Let Σ be a character set, and let an itemset I be a non-empty subset of Σ. A database D is a collection of transactions, and each transaction t is an itemset. A pattern X is also an itemset, which may appear in multiple transactions. Usually, we say that a transaction t supports X if and only if X ⊆ t. Obviously, all the subsets of a transaction support it. The set of patterns contained by a database is the set of all itemsets supported by at least one of its transactions.
For frequent pattern mining, the problem is to find out all the
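The support test in the definitions above is a simple subset check:

```python
def support(pattern, database):
    """Number of transactions t in the database with pattern ⊆ t."""
    X = set(pattern)
    return sum(1 for t in database if X <= set(t))

db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
support({"a", "b"}, db)  # 2: only the first two transactions contain both
```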
Model
First of all, a global model is required to estimate the distribution density for all nodes from a limited amount of samples. The density, in the form of a weighted average of the mapped existing samples, is estimated with two regularizations: one preventing overfitting on noise, and the other damping the influence of every observed sample. Second, the available kernels, covering graph structure and content, are combined in a separable form to enable the optimization process. A particular
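The separable combination described above can be sketched as a Kronecker product of a kernel over entities (graph) and a kernel over sample contents; the names and shapes here are assumptions for illustration only:

```python
import numpy as np

def separable_kernel(K_graph, K_content):
    """Joint kernel over (entity, sample) pairs:
    K((i, x), (j, y)) = K_graph[i, j] * K_content[x, y],
    which is exactly the Kronecker product of the two Gram matrices."""
    return np.kron(K_graph, K_content)

K_graph = np.array([[1.0, 0.5], [0.5, 1.0]])   # 2 entities
K_content = np.eye(3)                          # 3 samples
K = separable_kernel(K_graph, K_content)       # shape (6, 6)
```

The separable form keeps the joint Gram matrix factored, so optimization routines can exploit the two small matrices instead of the full product.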
Procedure
Let us analyze the complexity of Algorithm 1 step by step. Step 4 needs time to complete, and step 5 needs only and can be omitted later; line 7 calls a matrix exponential function, namely the scipy.linalg.expm routine [42], which uses the Padé approximation [43] and whose complexity can be roughly estimated as [44]; the SMO method in step 10 empirically requires to converge [40]. Summing up, the running time of all the above steps is
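For reference, the matrix exponential called in line 7 can be exercised on a toy graph Laplacian; the diffusion scale 0.5 here is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.linalg import expm  # Pade-approximation based, see [42,43]

A = np.array([[0.0, 1.0], [1.0, 0.0]])   # adjacency of a 2-node graph
L = np.diag(A.sum(axis=1)) - A           # combinatorial Laplacian
H = expm(-0.5 * L)                       # heat-kernel-style weights
```

Each row of `H` sums to one because the Laplacian’s rows sum to zero, so the result can be read directly as a stochastic weight matrix.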
Data
We use two types of data to verify the effectiveness of the solution: a synthetic dataset and a real dataset based on the scenario in Section 2. There are two main reasons for generating a simulated dataset: (1) we can check whether the behavior of the algorithm is consistent with what we expect, and the key properties of the dataset, such as the region of item occurrences and the specific characters it contains, can be freely controlled as needed; (2) because the real data, or even possible clues
Conclusion
In this paper, we have implemented and tested a solution built upon a two-phase framework: (1) a kernel-based multi-task density estimation, representing each target probability as a combination of all existing samples blended by differentiated weights derived from kernels based on business understanding; (2) these mutually compensated samples can be easily imported into the entropy calculation of the Krimp algorithm. This solution builds a bridge between structural collaboration in multi-tasking
Acknowledgments
The work is partially supported by NSF of China under no. 11301420; NSF of Jiangsu Province under nos. BK20150373 and BK20171237; Suzhou Science and Technology Program under no. SZS201613 and the XJTLU Key Programme Special Fund (KSF) under no. KSF-A-01.
Di Zhang is currently a Ph.D. student at the School of Computer Science, Communication University of China, Beijing, and also a researcher in Noah’s Ark Lab, Huawei Corporation since 2011. He received the M.Sc. degree of Computer Science from the Beijing University of Aeronautics and Astronautics, China in 2006, and worked as research engineer in Institute of Software, Chinese Academy of Sciences from 2006 to 2010. His research interests include data mining, machine learning and distributed computing.
References (45)
Learning output kernels for multi-task problems, Neurocomputing (2013).
Efficient alarm behavior analytics for telecom networks, Inf. Sci. (2017).
A method for pattern mining in multiple alarm flood sequences, Chem. Eng. Res. Des. (2017).
Relational mining for discovering changes in evolving networks, Neurocomputing (2015).
Kernel methods and the exponential family, Neurocomputing (2006).
Ridgelet kernel regression, Neurocomputing (2007).
Graph matching and clustering using kernel attributes, Neurocomputing (2013).
Feature selection based on closed frequent itemset mining: a case study on SAGE data classification, Neurocomputing (2015).
On extending extreme learning machine to non-redundant synergy pattern based graph classification, Neurocomputing (2015).
A history of graph entropy measures, Inf. Sci. (2011).
Fast algorithms for mining association rules, Proceedings of the Twentieth International Conference on Very Large Data Bases (VLDB).
Mining frequent patterns without candidate generation, Proceedings of the ACM SIGMOD International Conference on Management of Data.
Krimp: mining itemsets that compress, Data Min. Knowl. Discov.
Alert correlation algorithms: a survey and taxonomy, Cyberspace Safety and Security.
Rule discovery in telecommunication alarm data, J. Netw. Syst. Manag.
Alarm correlation analysis in SDH network failure, Proceedings of the National Conference on Information Technology and Computer Science.
When social influence meets item inference, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
StructInf: mining structural influence from social streams, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.
Influence and correlation in social networks, Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Structural correlation pattern mining for large graphs, Proceedings of the Eighth Workshop on Mining and Learning with Graphs.
Local pattern detection in attributed graphs, Solving Large Scale Learning Tasks: Challenges and Algorithms.
Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models, IEEE Signal Process. Mag.