ABSTRACT
As distributed systems become more widespread and more complex, the need for efficient, scalable tools to analyze them grows. Network provenance graphs offer a rich framework for this task: they map dependencies between system states and allow one to explain these states. In this paper, we investigate methods for more efficient substructure mining in network provenance graphs. Specifically, we aim to identify frequent substructures that can serve as a feature set for modeling common execution patterns. Knowing these patterns will help network administrators detect misbehaving nodes in the distributed system. To that end, we apply and scale up substructure mining for network provenance graphs by incorporating a graph database (Neo4j) into the mining process and implementing optimizations that improve its efficiency. Our results show that using the Neo4j graph database together with our algorithmic optimizations greatly reduces the run time of our algorithm without significantly affecting the quality of the substructures returned.