Discovering rare categories from graph streams

Data Mining and Knowledge Discovery

Abstract

Massive graph streams are now produced by many real-world applications, such as financial fraud detection, sensor networks, and wireless networks. Despite the high volume of data, usually only a small percentage of the nodes within the time-evolving graphs are of interest. Rare category detection (RCD) is an important topic in data mining that focuses on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing RCD techniques are designed for static data sets and are therefore not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are built upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models, enabling “time-flexible” RCD. Moreover, to handle cases where the exact priors of the minority classes are not available, we further propose BIRD-LI, a modified version of BIRD. We also identify a critical task in RCD named query distribution, which aims to allocate the limited budget among multiple time steps such that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.

Acknowledgments

This work is supported by NSF research Grant IIS-1552654, and an IBM Faculty Award. The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the U.S. Government.

Author information

Corresponding author

Correspondence to Dawei Zhou.

Additional information

Responsible editor: Jian Pei.

About this article

Cite this article

Zhou, D., Karthikeyan, A., Wang, K. et al. Discovering rare categories from graph streams. Data Min Knowl Disc 31, 400–423 (2017). https://doi.org/10.1007/s10618-016-0478-6

  • DOI: https://doi.org/10.1007/s10618-016-0478-6
