Skip to main content
Log in

Efficient graph similarity join for information integration on graphs

  • Research Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

Graphs have been widely used for complex data representation in many real applications, such as social network, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. The graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate the similarity join based on graph edit distance, in the paper, we make use of a preprocessing strategy to remove the mismatching graph pairs with significant differences. Then a novel method of building indexes for each graph is proposed by grouping the nodes which can be reached in k hops for each key node with structure conservation, which is the k-hop tree based indexing method. As for each candidate pair, we propose a similarity computation algorithm with boundary filtering, which can be applied with good efficiency and effectiveness. Experiments on real and synthetic graph databases also confirm that our method can achieve good join quality in graph similarity join. Besides, the join process can be finished in polynomial time.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Zhao X, Xiao C, Lin X, Wang W. Efficient graph similarity joins with edit distance constraints. In: Proceedings of the 28th IEEE International Conference on Data Engineer. 2012, 834–845

    Google Scholar 

  2. Qin J, Wang W, Lu Y, Xiao C, Lin X. Efficient exact edit similarity query processing with the asymmetric signature schemes. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. 2011, 1033–1044

    Chapter  Google Scholar 

  3. Fan W, Li J, Ma S, Tang N, Wu Y. Graph pattern matching: from intractable to polynomial time. Proceedings of the VLDB Endowment, 2011, 3(1): 264–275

    Google Scholar 

  4. Ma S, Cao Y, FanW, Huai J, Wo T. Capturing topology in graph pattern matching. Proceedings of the VLDB Endowment, 2011, 5(4): 310–321

    Article  Google Scholar 

  5. Sanfeliu A, Fu K S. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 1983, 13(3): 353–362

    Article  MATH  Google Scholar 

  6. Bunke H, Allermann G. Inexact graph matching for structural pattern recognition. Pattern Recognition Letters, 1983, 1(4): 245–253

    Article  MATH  Google Scholar 

  7. Gouda K, Arafa M. An improved global lower bound for graph edit similarity search. Pattern Recognition Letters, 2015 58: 8–14

    Article  Google Scholar 

  8. Ibragimov R. Exact and heuristic algorithms for network alignment using graph edit distance models. Dissertation for the Doctoral Degree. Fachrichtung 6.2 — Informatik, 2015

    Google Scholar 

  9. Baumbach J, Guo J, Ibragimov R. Multiple graph edit distance: simultaneous topological alignment of multiple protein–protein interaction networks with an evolutionary algorithm. In: Proceedings of the 2014 Conference on Genetic and Evolutionary Computation. 2014: 277–284

    Google Scholar 

  10. Justice D, Hero A. A binary linear programming formulation of the graph edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(8): 1200–1214

    Article  Google Scholar 

  11. Fankhauser S, Riesen K, Bunke H. Speeding up graph edit distance computation through fast bipartite matching. In: Proceedings of the 8th International Workshop on Graph–Based Representations in Pattern Recognition. 2011, 102–111

    Chapter  Google Scholar 

  12. Wang G, Wang B, Yang X, G. Yu G. Efficiently indexing large sparse graphs for similarity search. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(3): 440–451

    Article  Google Scholar 

  13. Wang Y, Wang H, Li J, Gao H. Graph similarity join with k–hop tree indexing. In: Proceedings of the International Conference of Young Computer Scientists, Engineers and Educators. 2015, 38–47

    Google Scholar 

  14. Zaki M J. Efficiently mining frequent trees in a forest: algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(8): 1021–1035

    Article  Google Scholar 

  15. Gao X, Xiao B, Tao D, Li X. A survey of graph edit distance. Pattern Analysis and Applications, 2010, 13(1): 113–129

    Article  MathSciNet  Google Scholar 

  16. Conte D, Ramel JY, Sidère N, Luqman MM, Gaüzère B, Gibert J, Brun L, Vento M. A comparison of explicit and implicit graph embedding methods for pattern recognition. In: Proceedings of the 9th International Workshop on Graph–Based Representations in Pattern Recognition. 2013, 81–90

    Chapter  Google Scholar 

  17. Shao Y, Cui B, Chen L, Liu M, Xie X. An efficient similarity search framework for SimRank over large dynamic graphs. Proceedings of the VLDB Endowment, 2015, 8(8): 838–849

    Article  Google Scholar 

  18. Shao Y, CuiM, Ma L. PAGE: a partition aware engine for parallel graph computation. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(2): 518–530

    Article  MathSciNet  Google Scholar 

  19. Xu N, Chen L, Cui B. LogGP: a log–based dynamic graph partitioning method. Proceedings of the VLDB Endowment, 2014, 7(14): 1917–1928

    Article  Google Scholar 

  20. Shao Y, Chen L, Cui B. Efficient cohesive subgraphs detection in parallel. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 2014, 613–624

    Google Scholar 

  21. Shao Y, Cui B, Chen L, Ma L, Yao J, Xu N. Parallel subgraph listing in a large–scale graph. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 2014, 625–636

    Google Scholar 

  22. Shao Y, Yao J, Cui B, Ma L. PAGE: a partition aware graph computation engine. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. 2013, 823–828

    Google Scholar 

  23. Cui B, Mei H, Ooi B C. Big data: the driver for innovation in databases. National Science Review, 2014, 1 (1): 27–30

    Article  Google Scholar 

  24. Shang H, Lin X, Zhang Y, Yu J X, Wang W. Connected substructure similarity search. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010, 903–914

    Chapter  Google Scholar 

  25. Yan X, Yu P S, Han J. Substructure similarity search in graph databases. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. 2005, 766–777

    Chapter  Google Scholar 

  26. Zhu Y, Qin L, Yu J X, Ke Y, Lin X. High efficiency and quality: large graphs matching. The VLDB Journal — The International Journal on Very Large Data Bases, 2013, 22(3): 345–368

    Article  Google Scholar 

  27. Williams D W, Huan J, Wang W. Graph database indexing using structured graph decomposition. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 976–985

    Google Scholar 

  28. Zou L, Chen L, Özsu M T. Distance–join: pattern match query in a large graph databases. Proceedings of the VLDB Endowment, 2009, 2(1): 886–897

    Article  Google Scholar 

  29. Zeng Z, Tung A K, Wang J, Feng J, Zhou L. Comparing stars: on approximating graph edit distance. Proceedings of the VLDB Endowment, 2009, 2(1): 25–36

    Article  Google Scholar 

  30. Zheng W, Zou L, Feng Y, Chen L, Zhao D. Efficient SimRank–based similarity join over large graphs. Proceedings of the VLDB Endowment, 2013, 6(7): 493–504

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hongzhi Wang.

Additional information

Yue Wang is a master student majoring in computer science and technology in Harbin Institute of Technology, China. Her main research area is XML data management.

Hongzhi Wang is an associate professor at Harbin Institute of Technology, China. His research area is data management, including data quality, XML data management and graph management. He is a recipient of the outstanding dissertation award of CCF, Microsoft Fellow and IBM PhD Fellowship.

Jianzhong Li is a professor and doctoral supervisor at Harbin Institute of Technology, China. He is a senior member of CCF. His research interests include database, parallel computing and wireless sensor networks, etc.

Hong Gao is a professor and doctoral supervisor at Harbin Institute of Technology, China. She is a senior member of CCF. Her research interests include data management, wireless sensor networks and graph database, etc.

Electronic supplementary material

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Wang, Y., Wang, H., Li, J. et al. Efficient graph similarity join for information integration on graphs. Front. Comput. Sci. 10, 317–329 (2016). https://doi.org/10.1007/s11704-015-4505-3

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11704-015-4505-3

Keywords

Navigation