Abstract
Nowadays, massive graph streams are produced from various real-world applications, such as financial fraud detection, sensor networks, wireless networks. In contrast to the high volume of data, it is usually the case that only a small percentage of nodes within the time-evolving graphs might be of interest to people. Rare category detection (RCD) is an important topic in data mining, focusing on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing techniques for RCD are designed for static data sets, thus not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are constructed upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models enabling a “time-flexible” RCD. Moreover, to deal with the cases where the exact priors of the minority classes are not available, we further propose a modified version named BIRD-LI based on BIRD. Besides, we also identify a critical task in RCD named query distribution, which targets to allocate the limited budget among multiple time steps, such that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.
Similar content being viewed by others
References
Aggarwal CC, Philip SY (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196
Akoglu L, McGlohon M, Faloutsos C (2010) Oddball: spotting anomalies in weighted graphs. In: Pacific-Asia conference on knowledge discovery and data mining, Springer, New York, pp 410–421
Akoglu L, Khandekar R, Kumar V, Parthasarathy S, Rajan D, Wu KL (2014) Fast nearest neighbor search on large time-evolving graphs. In: Joint European conference on machine learning and knowledge discovery in databases, Springer, New York, pp 17–33
Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth, and evolution. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, pp 44–54
Berlingerio M, Koutra D, Eliassi-Rad T, Faloutsos C (2012) Netsimile: a scalable approach to size-independent network similarity. In: arXiv:1209.2684
Bettencourt LM, Hagberg AA, Larkey LB (2007) Separating the wheat from the chaff: practical anomaly detection schemes in ecological applications of distributed sensor networks. In: Distributed computing in sensor systems, Springer, New York, pp 223–239
Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: International conference on machine learning, ACM, New York, pp 208–215
Davis M, Liu W, Miller P, Redpath G (2011) Detecting anomalies in graphs with numeric labels. In: ACM international conference on information and knowledge management, ACM, New York, pp 1197–1202
Eberle W, Graves J, Holder L (2010) Insider threat detection using a graph-based approach. J Appl Secur Res 6(1):32–81
Fan W, Wang X, Wu Y (2013) Incremental graph pattern matching. ACM Trans Database Syst 38(3):18
Franke C, Gertz M (2008) Detection and exploration of outlier regions in sensor data streams. In: IEEE international conference on data mining workshops, IEEE, Los Alamitos, pp 375–384
Gao J, Liang F, Fan W, Wang C, Sun Y, Han J (2010) On community outliers and their efficient detection in information networks. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, pp 813–822
Gupta M, Gao J, Aggarwal C, Han J (2014) Outlier detection for temporal data. Synth Lect Data Min Knowl Discov 5(1):1–129
Gupte M, Eliassi-Rad T (2012) Measuring tie strength in implicit social networks. In: Annual ACM web science conference, ACM, New York, pp 109–118
He J, Carbonell JG (2007) Nearest-neighbor-based active learning for rare category detection. In: Advances in neural information processing systems, pp 633–640
He J, Liu Y, Lawrence R (2008) Graph-based rare category detection. In: IEEE international conference on data mining, IEEE, pp 833–838
He J, Tong H, Carbonell J (2010) Rare category characterization. In: IEEE international conference on data mining, IEEE, pp 226–235
Henderson K, Eliassi-Rad T, Faloutsos C, Akoglu L, Li L, Maruhashi K, Prakash BA, Tong H (2010) Metric forensics: a multi-level approach for mining volatile graphs. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, pp 163–172
Hill DJ, Minsker BS, Amir E (2007) Real-time bayesian anomaly detection for environmental sensor data. In: Congress-international association for hydraulic research, Citeseer, vol 32, p 503
Kang U, McGlohon M, Akoglu L, Faloutsos C (2010) Patterns on the connected components of terabyte-scale graphs. In: IEEE international conference on data mining, IEEE, pp 875–880
Kang U, Tsourakakis CE, Appel AP, Faloutsos C, Leskovec J (2011) Hadi: mining radii of large graphs. ACM Trans Knowl Discov Data 5(2):8
Koutra D, Ke TY, Kang U, Chau DHP, Pao HKK, Faloutsos C (2011) Unifying guilt-by-association approaches: theorems and fast algorithms. In: Joint European conference on machine learning and knowledge discovery in databases, Springer, New York, pp 245–260
Koutra D, Papalexakis EE, Faloutsos C (2012) Tensorsplat: spotting latent anomalies in time. In: Panhellenic conference on informatics, IEEE, pp 144–149
Kumar R, Mahdian M, McGlohon M (2010) Dynamics of conversations. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 553–562
Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 177–187
Liu Z, Chiew K, He Q, Huang H, Huang B (2014) Prior-free rare category detection: more effective and efficient solutions. Expert Syst Appl 41(17):7691–7706
Müller E, Sánchez PI, Mülle Y, Böhm K (2013) Ranking outlier nodes in subspaces of attributed graphs. In: IEEE international conference on data engineering workshops, IEEE, pp 216–222
Pelleg D, Moore AW (2004) Active learning for anomaly and rare-category detection. In: Advances in neural information processing systems, pp 1073–1080
Phua C, Lee V, Smith K, Gayler R (2010) A comprehensive survey of data mining-based fraud detection research. arXiv:hep-th/10096119
Sherman J, Morrison WJ (1950) Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Annals Math Stat 21(1):124–127
Sricharan K, Das K (2014) Localizing anomalous changes in time-evolving graphs. In: ACM SIGMOD international conference on management of data, ACM, pp 1347–1358
Tong H, Papadimitriou S, Philip SY, Faloutsos C (2008) Proximity tracking on time-evolving bipartite graphs. In: SIAM international conference in data mining, pp 704–715
Yamanishi K, Takeuchi Ji (2002) A unifying framework for detecting outliers and change points from non-stationary time series data. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 676–681
Yamanishi K, Takeuchi JI, Williams G, Milne P (2004) On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Min Knowl Discov 8(3):275–300
Zhou D, He J, Candan K, Davulcu H (2015a) Muvir: Multi-view rare category detection. In: International joint conference on artificial intelligence, pp 4098–4104
Zhou D, Wang K, Cao N, He J (2015b) Rare category detection on time-evolving graphs. In: IEEE international conference on data mining, IEEE, pp 1135–1140
Acknowledgments
This work is supported by NSF research Grant IIS-1552654, and an IBM Faculty Award. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the U.S. Government.
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor Jian Pei.
Rights and permissions
About this article
Cite this article
Zhou, D., Karthikeyan, A., Wang, K. et al. Discovering rare categories from graph streams. Data Min Knowl Disc 31, 400–423 (2017). https://doi.org/10.1007/s10618-016-0478-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-016-0478-6